Making predictive models for competitions on Kaggle

Kaggle is a platform where companies and researchers post their data, and statisticians and data miners from all over the world compete to produce the best models. This semester, we're entering two competitions.

Through feature engineering and statistical techniques, each team member applies various machine learning techniques to the dataset, including logistic regression, word2vec, and random forests. Team members then tune their models using optimization techniques. Finally, we use ensemble methods to combine the individual models into a single 'super-algorithm' that we submit to Kaggle.
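As a rough illustration of the ensembling step, here is a minimal sketch of soft voting, which averages the probabilities predicted by each model. The function name `soft_vote` and the probability values are made up for illustration and are not the team's actual code or results.

```python
def soft_vote(prob_lists, weights=None):
    """Blend positive-class probabilities from several models, row by row.

    prob_lists: one list of probabilities per model, all the same length.
    weights: optional per-model weights; defaults to equal weighting.
    """
    if weights is None:
        weights = [1.0] * len(prob_lists)
    n_rows = len(prob_lists[0])
    if any(len(probs) != n_rows for probs in prob_lists):
        raise ValueError("every model must score the same number of rows")
    total = sum(weights)
    return [
        sum(w * probs[i] for w, probs in zip(weights, prob_lists)) / total
        for i in range(n_rows)
    ]


# Hypothetical predicted probabilities from two models on three rows.
logreg_probs = [0.9, 0.4, 0.2]
forest_probs = [0.7, 0.7, 0.1]

blended = soft_vote([logreg_probs, forest_probs])
labels = [int(p >= 0.5) for p in blended]  # threshold the blended score
```

Passing `weights` lets a stronger model count for more in the blend; equal weights are only a starting point.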



Algorithmic Trading

Finding opportunities to capitalize using data-driven approaches

Using financial data, we hope to formulate accurate predictions for various market sectors. Our plan is to start from well-known algorithmic trading approaches and supplement them with machine learning techniques. We intend to use clustering, among other unsupervised learning algorithms, to find equities to trade. We also aim to use supervised learning and natural language processing techniques to predict future prices.
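To make the clustering idea concrete, here is a small sketch that groups equities by the shape of their daily returns using a bare-bones k-means. The tickers and prices are invented, and the deterministic initialization (the first `k` points) is a simplification chosen to keep the example reproducible; it is not the team's method.

```python
def daily_returns(prices):
    """Simple returns between consecutive closing prices."""
    return [(b - a) / a for a, b in zip(prices, prices[1:])]


def squared_distance(p, q):
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans(points, k, iterations=20):
    """Minimal k-means; returns the cluster index assigned to each point."""
    # Assumption: seed centroids with the first k points (no random restarts).
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Hypothetical closing prices: two steady tickers and two volatile ones.
prices = {
    "AAA": [100, 101, 102, 103],
    "CCC": [100, 110, 95, 112],
    "BBB": [50, 50.5, 51, 51.5],
    "DDD": [20, 22, 19, 22.5],
}

tickers = list(prices)
features = [daily_returns(prices[t]) for t in tickers]
clusters = dict(zip(tickers, kmeans(features, k=2)))
```

In practice the feature vectors would be much longer, and a library implementation with randomized initialization would be a safer choice than this hand-rolled loop.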



Yelp Dataset Challenge


Producing a research paper on bias factors affecting star-ratings in Yelp reviews

The data we're working with includes detailed information on users, businesses, and reviews, as well as data collected through Yelp features like check-ins and tips. The biggest challenge for the team is the sheer size and complexity of the data: 15 gigabytes of JSON text, which demands a heavy emphasis on data engineering and parallel computing. Our team consists of students with strong machine learning and statistical backgrounds, and we are excited to explore areas like recommendation systems, time-series analysis, natural language processing, and sentiment analysis.
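With this much JSON, loading a whole file into memory is rarely practical. The sketch below streams records one line at a time and tallies star ratings. It assumes the data is stored as one JSON object per line and that each review has a `stars` field; the sample records are invented stand-ins for the real files.

```python
import io
import json
from collections import Counter


def count_stars(lines):
    """Tally star ratings from an iterable of JSON lines without loading them all."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        counts[record["stars"]] += 1
    return counts


# Invented sample; a real run would pass an open file handle instead.
sample = io.StringIO(
    '{"review_id": "a", "stars": 5}\n'
    '{"review_id": "b", "stars": 3}\n'
    '{"review_id": "c", "stars": 5}\n'
)
star_counts = count_stars(sample)
```

Because the function only needs an iterable of lines, the same code can be run over separate chunks of a file in parallel and the resulting counters added together.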




Formula One Team Performance Prediction

After parsing data from the official Formula One website, this team analyzed it while accounting for the stability in the number of Grands Prix over the years, the addition of new teams, and changes to the points table over time, with the aim of answering some important questions about the sport.


Reddit Recommendation Engine

Created an engine to generate a set of recommended posts for users of the social website Reddit given prior voting history, previous posts, and comments.


Cayuga County Health Department Data Science Project

Quantified the change in pollution and phosphorus loading in the Owasco Lake watershed since 2011 to inform future water management practices.


EEG Signal Classification with SVMs

Using data science tools to categorize each stage of the sleep cycle, this team analyzed the brain waves produced at each stage. Signals were captured using electroencephalography (EEG) and passed through machine learning algorithms for classification.


Misinformation Response in Social Networks

Reviewed relevant research on social media interactions (particularly on Reddit) and gave members an opportunity to experiment with data and practice scraping, cleaning, and processing techniques using the Reddit API.