Kaggle is a platform where companies and researchers post their data, and statisticians and data miners from all over the world compete to produce the best models. This semester, we're entering two competitions.
Each team member applies feature engineering and statistical techniques, then implements a different ML method on the dataset, such as logistic regression, word2vec, or random forests. Team members then improve their models using optimization techniques. Finally, we use ensemble methods to combine the individual models into a single 'super-algorithm' that we submit to Kaggle.
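As a rough illustration of the final ensembling step, the sketch below blends the predicted probabilities of two models by weighted averaging (often called soft voting). This is only one of several ensemble methods and is not necessarily the one the team uses; the function names and the probability values are hypothetical.

```python
def soft_vote(prob_lists, weights=None):
    """Blend per-model positive-class probabilities by weighted averaging."""
    if weights is None:
        weights = [1.0] * len(prob_lists)
    if len(weights) != len(prob_lists):
        raise ValueError("need exactly one weight per model")
    n_samples = len(prob_lists[0])
    if any(len(probs) != n_samples for probs in prob_lists):
        raise ValueError("every model must score the same samples")
    total = sum(weights)
    return [
        sum(w * probs[i] for w, probs in zip(weights, prob_lists)) / total
        for i in range(n_samples)
    ]


def to_labels(probs, threshold=0.5):
    """Turn blended probabilities into 0/1 predictions."""
    return [1 if p >= threshold else 0 for p in probs]


# Hypothetical outputs of two already-trained models on three samples.
logreg_probs = [0.9, 0.3, 0.2]
forest_probs = [0.7, 0.6, 0.1]

blended = soft_vote([logreg_probs, forest_probs])
labels = to_labels(blended)
print(labels)
```

Passing unequal weights, for example weights=[2.0, 1.0], lets a stronger model count for more in the blend.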
Finding opportunities to capitalize on using data-driven approaches
Using financial data, we hope to formulate accurate predictions for various market sectors. We plan to start from well-known algorithmic trading approaches and supplement them with machine learning techniques. Clustering and other unsupervised learning algorithms will help us find equities to trade, and supervised learning and natural language processing techniques will be used to predict future prices.
Yelp Dataset Challenge
Producing a research paper on bias factors affecting star-ratings in Yelp reviews
The data we're working on includes detailed information on users, businesses, reviews, and data collected through Yelp features like check-ins and tips. The biggest challenge for the team is perhaps the sheer size and complexity of the data, which contains 15 gigabytes of JSON text. The size of the data requires a heavy emphasis on data engineering and parallel computing. Our team consists of students with strong machine learning and statistical backgrounds, and we are excited to explore areas like recommendation systems, time-series analysis, Natural Language Processing, and sentiment analysis.
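Data of this size generally cannot be loaded into memory in one piece, so a common first step is to stream it record by record. The sketch below assumes the JSON is stored as one object per line; the field names review_id and stars and the sample records are hypothetical stand-ins rather than the real schema.

```python
import io
import json
from collections import Counter


def iter_records(lines):
    """Yield one parsed JSON object per non-empty line, without loading the whole file."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_by_stars(lines):
    """Tally records by their star rating in a single streaming pass."""
    counts = Counter()
    for record in iter_records(lines):
        counts[record["stars"]] += 1
    return counts


# A tiny in-memory stand-in for a multi-gigabyte file (hypothetical records).
sample = io.StringIO(
    '{"review_id": "r1", "stars": 5}\n'
    '{"review_id": "r2", "stars": 3}\n'
    '\n'
    '{"review_id": "r3", "stars": 5}\n'
)
star_counts = count_by_stars(sample)
print(dict(star_counts))
```

With a real file, the same functions accept an open file handle in place of the StringIO object, because both iterate line by line. Splitting the input into chunks and merging the per-chunk Counter objects is one way to extend this to parallel workers.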
Formula One Team Performance Prediction
After parsing data from the official Formula One website, this team analyzed it while taking into account how stable the number of Grand Prix has been over the years, the addition of new teams, and changes to the points table over time, with the aim of answering some important questions about the sport.
Reddit Recommendation Engine
Created an engine to generate a set of recommended posts for users of the social website Reddit given prior voting history, previous posts, and comments.
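One simple way to turn voting history into recommendations is user-based collaborative filtering: find users whose votes resemble the target user's, then suggest posts those users upvoted. The sketch below is an illustration under that assumption and not the engine the team built; it uses votes only (ignoring posts and comments), and all user names, post IDs, and votes are hypothetical.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse vote vectors (dicts of post -> vote)."""
    shared = set(a) & set(b)
    dot = sum(a[post] * b[post] for post in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(target, votes, top_n=3):
    """Rank posts the target has not voted on by similarity-weighted votes of other users."""
    seen = votes[target]
    scores = {}
    for user, their_votes in votes.items():
        if user == target:
            continue
        similarity = cosine(seen, their_votes)
        if similarity <= 0:
            continue  # skip users with no overlap or opposite taste
        for post, vote in their_votes.items():
            if post not in seen:
                scores[post] = scores.get(post, 0.0) + similarity * vote
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [post for post, score in ranked[:top_n] if score > 0]


# Hypothetical voting history: +1 is an upvote, -1 a downvote.
votes = {
    "alice": {"p1": 1, "p2": 1},
    "bob": {"p1": 1, "p2": 1, "p3": 1},
    "carol": {"p1": -1, "p4": 1},
}
recs = recommend("alice", votes)
print(recs)
```

Here bob agrees with alice on both shared posts, so his extra upvote is recommended, while carol disagrees with alice and her votes are ignored.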
Cayuga County Health Department Data Science Project
Quantified the change in pollution and phosphorus loading in the Owasco Lake watershed since 2011 to inform future water management practices.
EEG Signal Classification with SVMs
Using data science tools to categorize each stage of the sleep cycle, this team analyzed the brain waves produced at each stage. Signals were captured using electroencephalography (EEG) and passed through machine learning algorithms for classification.
Misinformation Response in Social Networks
Worked with relevant research on social media interactions (particularly on Reddit) and gave members an opportunity to experiment with data and practice scraping, cleaning, and processing techniques using the Reddit API.