Anthony Selva Jessobalan, Handwritten Digit Recognition, August 2019, (Dungang Liu, Liwei Chen)
Image recognition has been an important part of technological research over the past few decades, and image processing is the main reason computer vision has become possible. Once an image is captured, the computer stores it as a 3D array whose dimensions are height, width, and color channel. It is then compressed and stored in popular formats such as JPEG and PNG. For the computer to understand these numbers, the machine must be trained on tagged images so that it can learn. The main idea of this project is to use image recognition techniques to identify handwritten digits in images. Deep neural networks are well suited to complex problems such as image processing because they break complex problems down into simpler, more understandable forms. The TensorFlow library provides a host of pre-built methods for this purpose that can be used directly. The data for this project come from an online competition hosted by analyticsvidhya.com and are divided into test and train ‘csv’ files. ‘train.csv’ has two columns, ‘filename’ and ‘label’. Filename refers to the 70,000 PNG files of handwritten digits, each 28 x 28 pixels and totaling 31 MB. Label is the tag associated with each image. Using the TensorFlow framework, the digits were predicted with an accuracy of 95.28%. TensorFlow relies heavily on computational power, so higher accuracy could be obtained by tuning the hyperparameters on state-of-the-art systems. Even with limited computational capability, TensorFlow performed well for image recognition.
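As an illustration only (not the author's code), the height x width x channel layout described above can be sketched in plain Python. The blank image, the single bright pixel, and the helper names are hypothetical; a real pipeline would load the PNG files and feed the pixel values to a TensorFlow model.

```python
# Hypothetical 28 x 28 image with one color channel (grayscale),
# stored as nested lists indexed [row][column][channel], values 0-255.
HEIGHT, WIDTH, CHANNELS = 28, 28, 1


def make_blank_image(height, width, channels):
    """Return an all-black image as a height x width x channel nested list."""
    return [[[0] * channels for _ in range(width)] for _ in range(height)]


def flatten_and_scale(image):
    """Flatten the 3D array to a 1D feature vector scaled to [0, 1]."""
    return [value / 255.0 for row in image for pixel in row for value in pixel]


image = make_blank_image(HEIGHT, WIDTH, CHANNELS)
image[14][14][0] = 255  # one bright pixel near the center
features = flatten_and_scale(image)
print(len(features))  # 28 * 28 * 1 values per image
```

Scaling pixel values to the range [0, 1] is a common preprocessing choice for neural networks; the abstract does not say whether it was applied here.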
Pallavi Singh, Anomaly Detection in Revenue Stream, August 2019, (Dungang Liu, Brittany Gearhart)
The client owns and operates parking facilities at multiple airports across the US. Revenue is calculated and collected from cars parked at these locations based on the price, duration, and type of parking, along with any discount coupons used during the transaction. The revenue is collected by the cashiers and managers who staff the booths. The client has observed that at certain times there are discrepancies between the revenue collected and the number of cars that exit the parking facility. In most such cases, the revenue is lower than expected given the number of cars that were parked in the facility. This observation led to ad hoc investigations, which found that some employees managing the booths were not being completely transparent and honest, and that fraud was taking place.
The client wants to identify these frauds in a timely manner as the current process is tedious and ad hoc, and there is a very high possibility of some frauds being missed and overlooked by the very nature of the investigation. The client wants an automated process where such anomalies in the revenue stream can be identified automatically, and timely investigation can be carried out.
It was decided to build the required model for one parking facility, tune the model and validate the results before the same could be scaled to multiple locations.
Vinaya Rao Hejmady, NYC Taxi Rides - Predicting Trip Duration, August 2019, (Peng Wang, Yichen Qin)
To improve the efficiency of electronic taxi dispatching systems, it is important to be able to predict how long a driver will have his taxi occupied. If a dispatcher knew approximately when a taxi driver would be ending their current ride, they would be better able to identify which driver to assign to each pickup request. In this project, I will build a predictive framework that is able to infer the trip time of taxi rides in New York City. The output of such a framework must be the travel time of a particular taxi trip. I will first study and visualize the data, engineer new features, and examine potential outliers. I will then analyze the impact of the features on the target trip_duration values.
Temporal Features Analysis: I will look for time-based trends in the target variable and see if there are patterns that it is following. Finally, I will build a model to make a prediction of the trip duration. I plan to try Regression, Decision Trees, Gradient Boosting and fit the best model to the data.
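A minimal sketch of the kind of temporal feature engineering described above, using only the Python standard library. The timestamp format and the feature names are assumptions made for illustration and are not taken from the project.

```python
from datetime import datetime


def temporal_features(pickup_datetime):
    """Derive simple time-based features from a pickup timestamp string.

    The "%Y-%m-%d %H:%M:%S" format is an assumption about the data.
    """
    ts = datetime.strptime(pickup_datetime, "%Y-%m-%d %H:%M:%S")
    return {
        "hour": ts.hour,                 # 0-23, captures rush-hour effects
        "weekday": ts.weekday(),         # 0 = Monday ... 6 = Sunday
        "month": ts.month,               # captures seasonal effects
        "is_weekend": ts.weekday() >= 5,
    }


features = temporal_features("2016-03-14 17:24:55")
print(features)
```

Features like these can then be compared against trip_duration to look for hourly, weekly, or seasonal patterns.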
Nan Li, Bon Secours Mercy Health Reimbursement Analytics, August 2019, (Michael Fry, Jeremy Phifer)
The Bon Secours Mercy Health home office recognizes the need to project gross revenue and net revenue on a monthly basis. This process allows management to make critical business decisions and plan accordingly. The goal of the Tableau dashboard is to provide an integrated visualization that helps the Chief Financial Officer for each group and each market easily understand the projected gross revenue and net revenue performance in the current month. Forecasting payments for the next month helps leaders make assumptions about future operations. Management of Bon Secours Mercy Health is interested in quantifying cash collections to understand financial performance and evaluate revenue cycle operations. An ARIMA (1,5) model built in SAS on five years of historical payment data predicts monthly payments more closely than the historical average does, and it is being considered as an alternative approach for forecasting payment activity.
Kratika Gupta, Talkingdata Adtracking Fraud Detection Challenge, August 2019, (Peng Wang, Liwei Chen)
We are challenged to build an algorithm that predicts whether a user will download an app after clicking a mobile app ad. To support the modeling, we are provided a generous dataset covering approximately 200 million clicks over 4 days. Evaluation is based on the area under the ROC curve between the predicted probability and the observed target. I started with basic exploratory data analysis to understand the features, plotting every feature separately for clicks where the app was downloaded and clicks where it was not. After getting a basic idea of the distribution of the variables, I performed feature engineering, adding features based on frequency counts to help the models fit better. Modeling started with a basic logistic regression as a simple linear classification model. I then built a random forest to account for non-linearity in the data. For better predictive performance, I tried other advanced models such as gradient boosting, SVM, and a neural network.
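The frequency-count features mentioned above could look something like the following sketch. The click records, the field names `ip` and `app`, and the helper function are hypothetical and are used only to show the idea of counting how often a key combination occurs.

```python
from collections import Counter

# Hypothetical click records; field names are illustrative only.
clicks = [
    {"ip": "10.0.0.1", "app": 3},
    {"ip": "10.0.0.1", "app": 3},
    {"ip": "10.0.0.1", "app": 9},
    {"ip": "10.0.0.2", "app": 3},
]


def add_frequency_feature(rows, keys, name):
    """Attach to each row the number of rows sharing its values for `keys`."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return [{**row, name: counts[tuple(row[k] for k in keys)]} for row in rows]


enriched = add_frequency_feature(clicks, ["ip"], "clicks_per_ip")
enriched = add_frequency_feature(enriched, ["ip", "app"], "clicks_per_ip_app")
for row in enriched:
    print(row)
```

An unusually high count for a single IP address or IP and app pair is the kind of signal such features are meant to expose to the classifier.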
Anupreet Gupta, Strategy and Portfolio Analytics, August 2019, (Mike Fry, Siddharth Krishnamurthi)
Credit cards have become an important source of revenue for the bank, as they carry a higher Annual Percentage Rate (APR) than any other consumer lending product the bank offers. While the sources of revenue include the interchange fee, transaction charges, and finance charges, the industry is very competitive and relies on promotional offers to attract new customers. From a portfolio point of view, it is of great significance to be able to forecast what the book will look like in the future, because this visibility helps different teams and departments prepare a plan of action for the coming years. The credit card business also poses a risk to the bank, since a customer may not be able to fully repay the amount borrowed; this is where collection strategies come in, either to recover the amount or to keep the customer from being charged off. Using a model built on customers already enrolled in the hardship program, we predicted which customers were most likely to enroll so that the program could be pitched to them proactively before they were charged off from the books. Enhancing customer experience is among the core values of Fifth Third Bank. One such attempt is to extend the expiration date of the Rewards Points earned by premier customers by a year without impacting the financials of the bank. Delinquent customers were identified as one area of opportunity to cover the loss due to the extension.
Anirudh Chekuri, Churn Prediction of Telecommunication Customers, August 2019, (Yichen Qin, Peng Wang)
Customer churn occurs when a consumer stops doing business with a company. The cost of retaining a customer is low compared to the cost of acquiring a new one, so for any business, churn prediction is an important investment in terms of customer lifetime value and marketing. In this project we use data from a telecommunication company to determine the reasons for customer churn and to build a predictive model that gives the probability of churn. We used a random forest to check variable importance, using mean decrease in Gini and mean decrease in accuracy, and logistic regression with a logit link to estimate the probability of customer churn. The variables that turned out to be important factors affecting churn can be used to design actionable strategies to reduce it.
Ashish Gyanchandani, Fraud Analytics, August 2019, (Michael Fry, Andy M)
XXX is currently building two types of software products: Integrity Gateway Pre-Approval and Integrity Gateway Monitoring. My work revolved around improving the analytical aspects of Integrity Gateway Monitoring, which helps customers continuously monitor their employee spend data with the XXX risk engine algorithm. The data I worked on was Concur expense data, which contained the expense details entered by employees in the Concur expense tool. The expenses captured in these reports include Transportation, Supplies, Meals, etc. The dataset contained close to 1.5 million records. The topics I worked on fall into three categories: exploratory analysis, criteria setting, and modeling. The exploratory analysis required me to look at round-dollar transactions, the creation of expense categories using expense types, etc. Criteria setting required me to come up with benchmarks that will help XXX compute the risk score; the benchmarking involved computing the cash to non-cash transaction ratio for all countries, etc. Finally, modeling involved the creation of employee clusters along with anomaly and trend detection.
The programming languages and tools used for this analysis are R, Tableau, and Amazon SageMaker.
Harpreet Singh Azrot, An Analytical Study of West Nile Virus (WNV) Mosquitos in Chicago, August 2019, (Peng Wang, Yichen Qin)
The objective of the study is to understand and analyse how WNV found in certain mosquitos has affected the city of Chicago over the last few years. This study can play a crucial role in identifying the important factors and conditions that result in finding these mosquitos. It can also support prediction under specific parameters and conditions, so that, from a community’s point of view, appropriate actions can be taken to mitigate the risks. These predictions will be obtained by implementing multiple machine learning algorithms and comparing them to find the best model for prediction.
Palash Arora, Predicting Revenue and Popularity of Products, August 2019, (Yichen Qin, Nanhua Zhang)
Insurance and healthcare companies can benefit from analyzing customer demographics in order to promote the right type of product. In this project our goal is to understand the relationship between various customer demographic factors and product preference in different regions. We also predict the expected revenue for each zip code across the United States. We analyzed approximately 6,000 zip codes with 25 predictor variables, such as average age, salary, and population, in order to predict 2 dependent variables: preferred product and expected revenue in each zip code. We used two statistical methods in this project: linear regression to predict the expected revenue generated in each zip code, and multiple logistic regression to predict the preferred product type in each zip code.
Apoorva Rautela, Event Extraction from News Articles using NLP, August 2019, (Charles R. Sox, Amit Kumar)
Huge amounts of text data are generated every day, and some of the information contained in these texts needs to be handled and analysed carefully. Natural language processing (NLP) can help organizations build custom tools to process this information and gather valuable insights that drive businesses. One common application of NLP is event extraction, the process of gathering knowledge about periodical incidents found in texts by automatically identifying what happened and when it happened. This ability to contextualize information allows us to connect time-distributed events, assimilate their effects, and see how a set of episodes unfolds through time. These valuable insights drive organizations, which provide the technology to different market sectors. Steel tariffs have a direct impact on the Oil Country Tubular Goods (OCTG) market. This project aims to extract events from news articles related to ‘steel tariffs’ published over the past 18 months.
In this work, news articles related to ‘steel tariffs’ are collected from newsapi.org and then the text information is processed using NLP techniques. This work focuses mainly on extracting events using ‘extractive text summarization’.
Pravallika Kalidindi, Analysis of Balance Transfers and Credit Line Increase Programs for Credit Cards, August 2019, (Michael Fry, Jacob George)
Balance transfer (BT) and credit line increase (CLI) programs are two main profit-generating programs for credit card companies. Throughout this project we tested different marketing channels, customer behavior, and profit and risk drivers for the BT and CLI programs. The balance transfer program involves identifying the right customers and giving them lucrative offers to transfer their credit card debt from another bank to Fifth Third. In a credit line increase program, the credit card company raises the credit limit for selected credit-worthy customers, enabling them to increase their purchases and thereby capitalizing on the incremental interchange revenue and finance charges. The key findings of our analysis are as follows. Customers doing a digital BT tend to do a greater number of BTs, but each BT is of lesser value than a non-digital BT. The percentage of accounts going delinquent and charging off is higher when customers use convenience checks, indicating that convenience checks are riskier. At the portfolio level, customers take on more debt after a BT, but this behavior is highly dependent on the type of customer. To reduce our losses, we analyzed a potential solution: cancelling the promo APR for a customer who goes delinquent. We calculated the estimated finance charge collectible at different stages of the delinquency cycle. We observed that the risk metrics for CLI are close to each other in the test and control groups.
Guru Chetan Nagabandla Chandrashekar, Improving Delivery Services Using Visualizations, August 2019, (Charles R. Sox, Pratibha Sharan)
The Commercial Effectiveness team at Symphony Health provides Consulting and Analytics services to healthcare companies of all sizes around the country. They provide standard solutions to the brand teams of drugs from pre-approval, pre-launch, launch until patent expiry phase. A lot of these solutions can be standardized and automated to eliminate repetitive work, save time, reduce errors and get to insights faster. One of the ways to achieve this is by developing standardized visualizations and dashboards that capture the must-haves in key solutions or key aspects of a project. This report will go through a few visuals I developed during my internship at Symphony Health. This report will cover the need for developing each visual, understanding the data required, design, outputs and the impact.
Mohit Anand, Predicting Customer Churn in a Telecom Industry, August 2019, (Peng Wang, Liwei Chen)
Customer attrition, also known as customer churn, customer turnover, or customer defection, is the loss of clients or customers. Telephone service companies, Internet service providers, pay TV companies, insurance firms, and alarm monitoring services often use customer attrition analysis and customer attrition rates as one of their key business metrics because the cost of retaining an existing customer is far less than that of acquiring a new one. Companies from these sectors often have customer service branches which attempt to win back defecting clients, because recovered long-term customers can be worth much more to a company than newly recruited clients. Companies usually make a distinction between voluntary churn and involuntary churn. Voluntary churn occurs due to a decision by the customer to switch to another company or service provider; involuntary churn occurs due to circumstances such as a customer's relocation to a long-term care facility, death, or relocation to a distant location. In most applications, involuntary reasons for churn are excluded from the analytical models. Analysts tend to concentrate on voluntary churn, because it typically occurs due to factors of the company-customer relationship which companies control, such as how billing interactions are handled or how after-sales help is provided. Predictive analytics uses churn prediction models that predict customer churn by assessing customers' propensity to churn. Since these models generate a small prioritized list of potential defectors, they are effective at focusing customer retention marketing programs on the subset of the customer base most vulnerable to churn.
Kunal Priyadarshi, Microsoft Malware Prediction Challenge, August 2019, (Peng Wang, Liwei Chen)
Malware is software designed to cause damage. We want to help protect more than one billion Windows machines from damage before it happens. The problem is to develop techniques to predict whether a machine will soon be hit with malware. It is a classification problem, and the models were built using decision trees (CART), Random Forest, and Gradient Boosting Machines (GBM). These are current state-of-the-art algorithms. They do not require assumptions about the relationship between the independent and dependent variables, and they work in non-linear settings. Because all of them are based on decision trees, the algorithms used handle missing values on their own. The entire code is also reproducible. While Random Forest and GBM gave comparable area under the curve (AUC) on the test data, the training AUC was significantly larger for Random Forest. GBM is recommended as the final model because its bias was similar while its variance was lower.
Aniket Sunil Mahapure, Quora Question Pairs Data Challenge, August 2019, (Peng Wang, Liwei Chen)
This project is based on a Kaggle competition (https://www.kaggle.com/c/quora-question-pairs/overview). Quora is a platform to ask questions and connect with people who contribute unique insights and quality answers. Over 100 million people visit Quora every month, so it's no surprise that many people ask similarly worded questions. Multiple questions with the same intent can cause seekers to spend more time finding the best answer to their question, and make writers feel they need to answer multiple versions of the same question. Quora is therefore keen to group questions based on their meaning to reduce redundancy and improve overall convenience for users. In this competition, the objective is to tackle this natural language processing problem by applying advanced techniques to classify whether question pairs are duplicates or not.
Supriya Sawant, Prediction of Fraudulent Click for Mobile App Ads, August 2019, (Peng Wang, Liwei Chen)
This project is based on a Kaggle competition (https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection/overview/description) and concerns fraudulent click traffic for mobile app ads. For companies that advertise online, click fraud can happen at an immense volume, resulting in misleading click data and wasted money. TalkingData, China’s largest independent big data service platform, covers over 70% of active mobile devices nationwide. It handles 3 billion clicks per day, of which 90% are potentially fraudulent. China is the largest mobile market in the world and therefore suffers from huge volumes of fraudulent traffic. In this project we are required to build an algorithm that predicts whether a user will download an app after clicking a mobile app ad.
Lixin Wang, Geolocation Optimization for Direct Mail Marketing Campaign, August 2019, (Michael Fry, Shu Chen)
This capstone project is part of an internship project for marketing analytics at Axcess Financial. The goal of the project is to analyze the geo-spatial relationship between stores and customers, identify trade areas, and optimize store assignment in the direct mail marketing campaigns. Historical customer information was extracted from the database using Structured Query Language (SQL) and the PROC SQL procedure in SAS. Spatial analysis was done using pivot tables in Excel, and simple descriptive analysis was done in SAS. An interactive dashboard was created in Tableau to visualize the geo-spatial distribution of stores and customers. The major market for each store was identified, and trade areas were divided. Store analysis shows that the majority of customers in each zip code went to the top 3 stores when there were multiple destination stores. The top 3 destination stores are consistent with the stores assigned to each zip code, suggesting that the current store assignment strategy works well at the zip code level.
Menali Bagga, Analysis Defects and Enhancements Tickets, August 2019, (Michael Fry, Lisa DeFrank)
This capstone project contains two major applications of what I learned during my Master’s in Business Analytics: data visualization (using Power BI) and data mining through clustering analysis (using K-Means). The first half of the project aimed to visualize the number of open critical, high-priority production support tickets, mainly of the defect and enhancement types, prior to the August 2019 release. To achieve this objective, I pulled the relevant data by developing suitable queries in the Azure DevOps service, connected it to Power BI, and filtered the data there to suit my needs. The second half of the project performed a clustering analysis on the tickets to facilitate capacity planning by placing the tickets in suitable brackets.
Syed Imad Husain, Blood Supply Prediction, August 2019, (Chuck Sox, Shawn Gregory)
Hoxworth Blood Center is the primary blood donation center in the Greater Cincinnati area. Uncertainty in blood supply patterns and donor behaviors is one of the greatest challenges faced by Donor Services and Blood Center operations. This project develops analytical methods to support data-driven decision making for them by employing descriptive, predictive, and prescriptive analytics. The main areas of focus of this project are understanding donors’ participation (classification) and predicting donor turnout (regression) for a given drive. Different supervised and unsupervised learning techniques have been employed to uncover trends.
Lakshmi Mounika Cherukuri, Brightree Advanced Analytics Projects, August 2019, (Michael Fry, Fadi Haddad)
The Brightree Advanced Analytics team focuses on providing tailored analytical solutions to internal and external customers through dynamic dashboards that are easy to navigate. The team contributes to the growth of customers by providing clean, consolidated, and consumable data insights. This capstone report outlines two such projects: the AllCall KPI Survey and a Customer Profitability Analysis. The AllCall KPI Survey project involves one of our internal customers, AllCall, a subsidiary of Brightree that works on resupply orders. The project requires us to embed a feedback survey questionnaire into Sisense, enabling the surveyor to input feedback through a dynamic dashboard, and then store the responses in the database. Once the survey results are in, we need to analyze the data and visualize it in a KPI review dashboard, which will identify the most efficient callers, show the number of sales orders taken by each caller, and indicate whether callers practice the necessary etiquette while communicating with patients. The Customer Profitability Analysis involves our external customers. For this project, the team is tasked with identifying the manufacturers and items that yield the most profit, factoring in the Costs, Revenues, Bill Quantities, and Number of Sales Orders for each Item Group. In addition, customers want to detect the main source of new patients (e.g., Referrals, Ordering Doctors, Marketing Representatives). The final product is expected to be a dynamic dashboard that shows the aforementioned KPIs and profitability measures over time.
Maryam Torabi, Improving Patient Flow in an Emergency Department: a Computer Simulation Analysis, August 2019, (Yiwei Chen, Yan Yu)
In this study I used a year of operational time-stamp data from the emergency department (ED) of a regional Level II Trauma Center in Virginia to understand the nature of patient flow in this ED and to build computer simulation models. The emergency department routes patients to 4 different treatment teams based on the severity of their condition. I simulated the current system and an alternative that pools two of the treatment teams and delegates some tasks to the triage nurse. Comparing the average Length of Stay (LOS) of patients in the pooled team (ESI2 and ESI3) and the weighted average of patient wait time after triage to get into a bed shows that pooling resources improves both performance metrics at the 0.05 significance level.
Apoorva Bagwe, Loss Forecast Model, August 2019, (Charles R. Sox, Adam Phillips)
Axcess Financial Inc. offers different types of loan products to its customers. The company currently uses a Retail Choice Loan Product (CLP) loss forecast model to forecast the amount of money it will lose due to Retail CLP customers charging off. These models were developed in SAS and are refreshed every month. Because the processing time in SAS is high, the modeling process needed to be replicated on Snowflake, and my task was to convert the loss forecast models from SAS to Snowflake. This reduced the execution time from 7-8 hours to less than an hour. The project was divided into four phases: creating the base data, forecasting charge-offs using Markov chain modeling, forecasting charge-offs using loss curves, and improving the overall efficiency of both the process and the model. Because two different data integration processes load the company’s data into SAS and Snowflake, many checks were needed at each stage to ensure the accuracy of the results. Due to the functional and coding differences between SAS and Snowflake, different data structuring approaches were needed to replicate the analysis on Snowflake. Another challenge was that the SAS databases were updated daily while the Snowflake databases were updated every few hours. The business models in Ohio, Illinois, and Texas differed from the rest, so loans from these states were analyzed and modeled separately.
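To make the Markov chain phase concrete, the following is a generic sketch of how a balance can be rolled forward through loan states with a transition matrix. The three states and all transition probabilities are hypothetical and are not taken from the company's model.

```python
# Hypothetical loan states and monthly transition probabilities.
# Each row gives the probabilities of moving from one state to every
# state in the next month, so each row sums to 1.
STATES = ["current", "delinquent", "charged_off"]
TRANSITIONS = [
    [0.90, 0.10, 0.00],  # from current
    [0.30, 0.50, 0.20],  # from delinquent
    [0.00, 0.00, 1.00],  # charged_off is absorbing
]


def step(balances, transitions):
    """Roll the balance in each state forward by one month."""
    n = len(balances)
    return [
        sum(balances[i] * transitions[i][j] for i in range(n))
        for j in range(n)
    ]


def forecast(balances, transitions, months):
    """Apply the monthly transition repeatedly for the given horizon."""
    for _ in range(months):
        balances = step(balances, transitions)
    return balances


start = [1000.0, 0.0, 0.0]  # all balance starts in the current state
after_two = forecast(start, TRANSITIONS, 2)
print(dict(zip(STATES, after_two)))
```

The balance that accumulates in the absorbing charged_off state over the horizon is the forecast loss under these assumed probabilities.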
Vishnu Guddanti, Response Model Analysis, August 2019, (Michael Fry, Kaixi Song)
Credit cards have become an important part of everyone’s life. They have emerged as one of the most convenient and easiest ways to transact. The credit card industry is a lucrative business. The major revenue sources include interest on revolving balances, missed payments, late fees, annual fees, merchant fees, etc. Most banks that issue credit cards run acquisition campaigns to acquire customers. The direct mail campaign is one of the acquisition strategies employed by Fifth Third Bank. It involves mailing a prospective customer an offer, such as a balance transfer offer, a spend-to-get-cashback offer, or a zero percent APR offer for a certain period. The project seeks to improve the population selection strategy of the direct mail campaigns for Fifth Third Bank. The population for a direct mail campaign is selected by considering a variety of factors, including marketing costs, mail offer, response score deciles, present value of the prospective customer, FICO score, response rate, and approval rate of the customer. Based on these factors, return on marketing investment (ROMI) is calculated at the FICO group level and the response score decile level. Only the population that meets the ROMI cut-off is selected for the direct mail campaigns. The project seeks to improve campaign efficiency by calculating ROMI at a granular response score level and FICO group level using exponential regression models. Through response model analysis, 10k more customers could be targeted, resulting in 26 more credit card accounts booked and a Net Present Value (NPV) increase of 9,824 USD for the bank.
Douglas Kinney, Emerging Risk: Visualizing, Mining, and Quantifying Wildfire Exposure, August 2019, (Michael Fry, Dan Madsen)
One of the primary challenges insurance and reinsurance companies face today is understanding catastrophe risk in a changing landscape. Population movement, city development, climate change, and recent major events in California are the key factors driving an increased focus on wildfire risk at the natural disaster level. Model vendors have not yet caught up with a succinct, transparent method of quantifying concentrations of risk, aggregating the level of pure hazard, or estimating the damageability of a given location. This paper focuses on leveraging location data to solve the problem of comparing concentrations of wildfire risk in California across varying portfolios of business. The aim is to create a customized view of risk for each data set, using proprietary wildfire hazard grading. The end result is a framework of analysis that produces digestible information for underwriting, executive review, and decision-making purposes.
Nitesh Agarwal, Predicting the Occurrence of Diabetes in PIMA Women, August 2019, (Yan Yu, Akash Jain)
Diabetes data on PIMA Indian females are used for the analysis. The data cover 768 females, of whom 268 were diagnosed with diabetes. The information available includes 8 variables, such as Age, Number of Pregnancies, Glucose, and Insulin. Missing values were present in about 30% of the observations. MICE (Multivariate Imputation via Chained Equations) was performed to impute the missing values in the data set. Correlation analysis showed that two pairs of variables, Insulin and Glucose, and BMI and Skin Thickness, had moderately high linear correlation. Logistic regression, classification tree, random forest, and support vector machine models were fitted, and the support vector machine was chosen as the best model based on out-of-sample AUC. It also had the lowest misclassification rate.
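For readers unfamiliar with the two selection metrics named above, here is a small self-contained sketch of how AUC and misclassification rate can be computed from predicted scores. The labels, scores, and the 0.5 cut-off are made up for illustration and do not come from the study.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def misclassification_rate(labels, scores, threshold=0.5):
    """Share of cases whose thresholded prediction differs from the label."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    errors = sum(1 for y, p in zip(labels, predictions) if y != p)
    return errors / len(labels)


# Hypothetical hold-out labels (1 = diabetic) and predicted probabilities.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.1]
print(auc(labels, scores))
print(misclassification_rate(labels, scores))
```

AUC depends only on how the scores rank the cases, while the misclassification rate also depends on the chosen cut-off, so the two metrics can favor different models.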
Pooja Purohit, Marketing Analytics for JoAnn Stores, August 2019, (Charles R. Sox, Prithvik Kankappa)
In today’s scenario, the retail industry is one of the most volatile industries due to an uncertain economy, digital competition, an increasing number of product launches, shifts in customer interests, tariff pressures, supply-chain constraints, etc. JoAnn has a brick-and-mortar model with a small online presence (~4%) and faces these challenges on a day-to-day basis. It is considering tapping its available data in order to optimize sales and increase margin. Advanced analytics can deliver insights that inform smart decisions, from deciding which promotions should be run for a product to what price should be set to maximize margin. This report is primarily based on the design of a simulation and planning tool by the Impact team for JoAnn stores to get far more from their marketing spending, helping them plan, anticipate, and course-correct their promotional strategies in a very dynamic market. As of now, JoAnn is focusing primarily on two levers, promotion and pricing, to improve its financial performance. This report focuses on the pricing aspect, where the Impact team is using advanced models to help JoAnn develop its pricing strategy. Using historical data, granular-level demand models are created to estimate price elasticity. Based on these models, simulators are designed to evaluate the right price for a given product in order to minimize margin leakage.
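One common way to estimate price elasticity from historical data is a log-log demand model, in which the slope of log quantity on log price is the elasticity. The abstract does not state which model form was used, so the sketch below is an assumption, and the prices and quantities are synthetic.

```python
import math


def price_elasticity(prices, quantities):
    """Least-squares slope of log(quantity) on log(price).

    Under an assumed log-log demand model, this slope is the
    price elasticity of demand.
    """
    xs = [math.log(p) for p in prices]
    ys = [math.log(q) for q in quantities]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


# Synthetic data generated from quantity = 100 * price ** -2.
prices = [1.0, 2.0, 4.0]
quantities = [100.0, 25.0, 6.25]
elasticity = price_elasticity(prices, quantities)
print(elasticity)
```

An elasticity below -1 means demand falls proportionally faster than price rises, which is the kind of information a pricing simulator would need when searching for a margin-maximizing price.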
Akash Dash, Using Data Analysis to Capture Business Value from the Internet of Things (Iot) Data for a Leading Manufacturer, August 2019, (Michael Fry, Sagar Balan)
Few people pay more attention than necessary to the smart dispenser machines (dispensing tissues, bath towels, hand soap, etc.) in restrooms. Whether on vacation, staying in the best hotels or on a business trip traveling through airports, these ‘smart restrooms’ are a part of our experience. Our client collects a huge amount of streaming data from their smart restroom solutions and wants to capture business value from the data. In this paper we describe how, through the use of data analysis, we helped the client with recommendations on two key business problems. The first problem is how to help increase sales. The second problem revolves around creating more time savings for the maintenance staff (who are the end customers). Our methodology includes a consulting approach to first understand the problem from client stakeholders, and then apply data cleaning, wrangling, exploration, and visualization to uncover trends and insights. The tools primarily used throughout the project have been PostgreSQL and RStudio.
Ishali Tiwari, Prediction of Wine Quality by Mining Physiochemical Properties, August 2019, (Yan Yu, Ishan Gupta)
Certification of product quality can be expensive and time consuming, particularly if an assessment by human experts is required. This project examines the use of data mining techniques to facilitate that process. A dataset consisting of physicochemical properties of red wine samples is used to build data mining models that predict wine quality. Machine learning techniques, specifically binary logistic regression, classification trees, neural networks and support vector machines, were explored, and features that perform well on this classification task were engineered. Model performance is evaluated and compared using prediction accuracy and AUC (area under the receiver operating characteristic curve).
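The two evaluation metrics named above can be illustrated with a minimal sketch. The labels and scores below are made-up values, not results from the project, and the function names are hypothetical. AUC is computed here through its pairwise interpretation: the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as one half.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    # A pair counts 1 if the positive outranks the negative, 0.5 on a tie.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, threshold=0.5):
    """Share of samples classified correctly at the given score threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Illustrative data only: 1 = "good" wine, 0 = "not good".
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.4, 0.6, 0.7, 0.8]
print(auc(labels, scores), accuracy(labels, scores))
```

Accuracy depends on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is one reason to report both.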
Hareeshbabu Potheypalli, Labeling School Budget Data Using Machine Learning and NLP, August 2019, (Yan Yu, Yichen Qin)
The objective of the current analysis is to use machine learning methods and NLP techniques to analyze text data. The data is collected from a competition hosted by ‘DrivenData.org’. The chosen data set contains expense information for a school, where each observation is labelled according to department, object bought, functionality, class, user, etc. This is therefore a ‘Multiclass-Multilabel’ classification problem. Various models used by the contestants are analyzed and reviewed. Models and tools such as simple Logistic Regression, OneVsRestClassifier, RandomForest and CountVectorizer are used to classify an observation into its corresponding class for each categorical variable. The models are then tuned further to improve accuracy and the log-loss cost. The future scope and developments of the project are also discussed.
Bolun Zhou, Identify Heart Disease Using Supervised Learning, August 2019, (Yichen Qin, Charles Sox)
In machine learning, logistic regression is used to predict the probability that an event occurs, and that probability can be turned into a classification. Logistic regression is used extensively in the medical and social sciences as well as in marketing applications, and it is applied to a binary response (dependent) variable. CART can be used for classification or regression predictive modeling problems and provides a foundation for important algorithms such as bagged decision trees, random forests and boosted decision trees. Random forest, in particular, is an extension of bagging that makes a significant improvement in terms of prediction. In addition, artificial neural networks (ANNs), or connectionist systems, are computing systems inspired by the biological neural networks of animal brains. A neural network is essentially a framework in which many different machine learning algorithms work together to process complex data inputs. In this project, we used these techniques to try to improve the accuracy of heart disease diagnosis. This study could be useful for predicting the presence of heart disease in a patient or for finding clear indicators of heart health.
Sukanto Roy, FIFA 18 - Playing Position Analysis, July 2019, (Dungang Liu, Peng Wang)
FIFA 18 offers detailed quantitative information on individual players. In modern-day football, specific positions represent a player's primary area of operation on the field. It is extremely important to characterize a player according to their position on the field. Each position requires a different combination of skills and physical attributes. With the rapid increase in the volume of soccer data, data science abilities have attracted the attention of coaches and data scientists alike. As a FIFA video game enthusiast and a soccer player, I took this opportunity to work on this problem using the FIFA18 data, which is originally from sofifa.com; a structured version of the data was posted on the Tableau Public website. The data is unique at the player level, and each player has attribute data (e.g. dribbling, aggression, vision), personal data (e.g. club, wage, value) and playing position data (ratings for various positions). To solve this problem, I have taken a machine learning approach. After data preparation and dimension reduction, 4 supervised learning models were built: KNN, Random Forest, SVM and Neural Network. The 15 playing positions were grouped into 4 positions, and the models were trained with the positions as the response and the attributes as the predictors. The KNN, SVM and Neural Network models had accuracies of 81.81%, 82.27% and 82.26% on the test data. Only the Random Forest model had an accuracy below 80%, at 71.32%. Any of the former 3 models can be used by coaches to support their methods and ideas for a player's playing position.
Joe Ratterman, Predicting Future NCAA Basketball Team Success for Schedule Optimization, July 2019, (Mike Fry, Paul Bessire)
Every year, 353 NCAA Division 1 basketball teams compete for 68 bids to the NCAA Men’s College Basketball Tournament. Of those 68 tournament bids, 32 are reserved for conference tournament champions, leaving 36 at-large bids. These bids are given to the 36 teams that the selection committee deems the best of the rest. While the selection process is not set in stone, at-large teams historically have high Ratings Percentage Index (RPI) rankings. RPI was one of the primary tools used by the selection committee up until the 2018 season. Though the NCAA Evaluation Tool (NET) has replaced RPI as the primary evaluation tool, RPI still provides a quick comparison of teams that played different schedules. The calculation for RPI is as follows: RPI = (Win Percentage*0.25) + (Opponents’ Winning Percentage*0.50) + (Opponents’ Opponents’ Winning Percentage*0.25). This paper aims to develop a method to predict a win probability for each NCAA Division 1 program a year in advance. These probabilities will allow a team to simulate the outcome of all games in a given season and optimize its non-conference schedule.
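The RPI formula quoted above translates directly into code. The sketch below is only an illustration of that weighted sum; the function name and the example team's percentages are hypothetical and are not taken from the project.

```python
def rpi(win_pct: float, opp_win_pct: float, opp_opp_win_pct: float) -> float:
    """RPI as stated in the abstract: 25% own winning percentage,
    50% opponents' winning percentage, 25% opponents' opponents'."""
    return 0.25 * win_pct + 0.50 * opp_win_pct + 0.25 * opp_opp_win_pct


# Hypothetical team: wins 75% of its games, its opponents win 60%,
# and its opponents' opponents win 55%.
example = rpi(0.75, 0.60, 0.55)
print(round(example, 4))
```

Because half of the weight sits on opponents' winning percentage, the choice of non-conference opponents can move a team's RPI noticeably even when its own record is unchanged, which is what makes schedule optimization worthwhile.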
Mahitha Sree Tammineedi, Analysis and Design of Balance Transfer Campaigns, July 2019, (Charles R. Sox, Jacob George)
Every year, banks make billions of dollars on credit cards, so they are always looking to acquire more card balances. A balance transfer moves credit card debt from one credit card to another card issued by a different bank. Put simply, it is a way to win debt from the competition. This project seeks to analyze the performance of past Balance Transfer (BT) campaigns at Fifth Third Bank and to improve future campaigns by building a present value (PV) model that shows which offers are the most profitable for each customer segment, considering factors such as balances during and after the promotional period, revenue from fees, closed accounts, charged-off accounts, finance charges, and other expenses incurred. The insights uncovered in this study will be used to design future BT campaigns.
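The abstract does not describe the form of the PV model, so the following is only a generic sketch of discounting monthly net cash flows to a present value. The function name, the discount rate and the cash-flow figures are all assumptions made for illustration.

```python
def present_value(cash_flows, annual_rate):
    """Discount end-of-month net cash flows at annual_rate / 12 per month.

    cash_flows[0] is received one month from now, cash_flows[1] two
    months from now, and so on.
    """
    monthly = annual_rate / 12
    return sum(cf / (1 + monthly) ** t
               for t, cf in enumerate(cash_flows, start=1))


# Hypothetical offer: low income during a 6-month promotional period,
# then finance charges once the promotional rate ends.
promo_months = [5.0] * 6
post_promo_months = [40.0] * 12
pv = present_value(promo_months + post_promo_months, annual_rate=0.08)
print(round(pv, 2))
```

In a comparison of offers, each segment and offer combination would get its own cash-flow series (fees, finance charges, charge-offs and other expenses netted by month), and the PVs would then be ranked.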
Jeffrey Griffiths, Dashboard for a Monthly Operating Report, July 2019, (Michael Fry, Chris Vogt)
Archer Daniels Midland (ADM) is an agricultural giant headquartered in Decatur, IL. In 2014 the company began a digital transformation of its business called 1ADM and moved its I.T. headquarters to Erlanger, KY. Within the I.T. office the Data and Analytics (D&A) team works on data management and data projects for the business. Each month, Sr. Director of Data and Analytics reports to the CIO about the progress her team has made. Currently, the visualizations used to show that progress require a lot of work from the Sr. Director and do not utilize best practices when it comes to data visualization. This summer I was part of an intern team that redesigned the Sr. Director’s Monthly Operating Report (MoR) with good data visualizations that could communicate the progress D&A has made month-over-month (MoM), while also reducing the amount of work the Sr. Director would need to do each month. The intern team met extensively with the Sr. Director and D&A team leadership to understand the story they were trying to tell with their progress metrics.
Avinash Vashishtha, Identification of Ships: Image Classification using Xception Model Architecture, July 2019, (Dungang Liu, Yiwei Chen)
Computer vision has been a booming field with numerous applications across various sectors. The application that motivated me most to take up a project in computer vision was the Autopilot feature in Tesla vehicles. In this problem statement, a governmental Maritime and Coastguard Agency is planning to deploy a computer-vision-based automated system to identify ship type solely from images taken by survey boats. We create a model to classify images into 5 categories: Cargo, Carrier, Cruise, Military, and Tanker. The data comes from a computer vision competition hosted on the ‘Analytics Vidhya’ website. Link of the problem statement is given below:
To classify the images, we used the Xception model architecture and re-purposed it through transfer learning to solve our problem statement. The final trained model showed an accuracy of 96.2%, with most of the errors occurring between the Cargo and Carrier classes. Our model would help classify ships or vessels into their respective categories and would save the Maritime and Coastguard Agency crucial time when responding to emergencies.
Aabhaas Sethi, Predicting Attrition of Employees, July 2019, (Yan Yu, Yiwei Chen)
Employee attrition can be detrimental to a company's performance in the long term. I have personally observed a negative impact on one of my past employers’ performance because of employee attrition. The objective of this project is to explore the factors related to employee attrition through data wrangling and to build a model that can predict whether an employee will leave the company. I have used different statistical techniques to predict employee attrition and compared the performance of those models. Because only 16% of the response values are positive, I have also explored different sampling techniques, such as over-sampling, under-sampling and SMOTE, in an attempt to manage the imbalance in the data set. Finally, I have compared the predictive performance of the models built with the different sampling techniques against the original model built with random sampling.
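As an illustration of the simplest of the sampling techniques mentioned above, the sketch below performs random over-sampling: minority-class rows are duplicated at random until every class is as large as the largest one. It uses only the standard library; the function name and the toy data (25 rows with 4 positives, i.e. 16%) are assumptions for the example, not the project's data or code.

```python
import random


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until classes balance."""
    rng = random.Random(seed)
    by_class = {}
    for row, y in zip(rows, labels):
        by_class.setdefault(y, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for y, members in by_class.items():
        # Draw with replacement to fill the gap to the majority class size.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(y)
    return out_rows, out_labels


# Toy data: 4 of 25 employees left (16% positive class).
rows = list(range(25))
labels = [1] * 4 + [0] * 21
balanced_rows, balanced_labels = random_oversample(rows, labels)
print(balanced_labels.count(1), balanced_labels.count(0))
```

Resampling of this kind is normally applied to the training data only, so that the test set keeps the original class proportions and the reported performance stays honest.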
Anjali Gunjegai, Cost Analysis of Steel Coil Production, July 2019, (Charles Sox, Sabyasachi Bandyopadhyay)
Since profitability is the backbone of any product, BRS has decided to take a step towards estimating and optimizing the orders it receives by analyzing the cost of producing a coil. This project analyzes the various production units in the mill, estimates the cost associated with each step and, with the help of dashboards, gives the company a view of the major factors contributing to costs and of the potential to optimize the processes. Apart from deriving the cost of a coil, the project also analyzes the grades of steel produced and predicts the scrap mix consumed and the expected costs for the heats. This project provides a roadmap towards faster accounting of the costs incurred in a month and an automatic costing tool that estimates costs in near real time.
Ashwita Saxena, Can Order Win Rate be Predicted Based on Timeliness of Response to Customer Emails, July 2019, (Peng Wang, Michael Fry)
Ryerson is a metal manufacturing company based in Chicago. Their main products include aluminum, stainless steel, carbon and alloys. Most of their customer interactions and transactions happen through emails. Customers request quotes via email and order products via email as well. Ryerson is currently trying to identify ways of increasing their revenue. They believe that an increase in the number of orders they obtain through email interactions could stimulate revenue growth. One critical variable that impacts their orders is the time in which their representatives reply to customer emails. This project aims at identifying the impact of email response time on order win rate, while also identifying other important factors that impact winning orders. The final models presented have been used for interpretation as well as strong predictions while maintaining model accuracy.
Asher Serota, Application of Business Analytics to Quantifying Reporting and Agent Data at American Modern Insurance Group, July 2019, (Michael Fry, Christopher Rice)
This capstone describes two data analytics projects, SharePoint Analytics Open Rate Report and SCRUB, that I performed for the Marketing/Sales Insight & Analytics Team at American Modern Insurance Group. The main goal of the former was automation and visualization of the reporting process. The main goal of the latter was automation and visualization of the agent data. Both required a transition away from Excel-based and often manual manipulation and entry of data. To automate the processes, I developed R code utilizing several packages, such as tidyr and dplyr, and I also used data cleaning and aggregation techniques. Additionally, I developed methods to visually represent the SharePoint Report, including in PDF and URL formats, and to streamline the SCRUB process.
Megan Eckstein, Texas Workers Compensation Analysis, July 2019, (Michael Fry, John Elder)
Great American Insurance Group (GAIG) writes a significant portion of its business in workers compensation. Because of its magnitude within the industry, looking into particular markets to target or avoid is important to help minimize losses paid on workers compensation claims. One of the subsidiaries of GAIG writes the majority of its business in Texas. To help this subsidiary reduce medical loss from claims, I analyze Texas workers compensation industry data to examine medical losses that have occurred within different markets. This data encompasses eleven years of claims. Based on this analysis, I recommend different market segments to target and avoid within the state of Texas.
Akshay Kher, Optimizing Baby Diaper Manufacturing Process, July 2019, (Mike Fry, Jean Seguro)
Currently, the defect rate for diapers manufactured by P&G is larger than desired. As a result, a large number of diapers have to be disposed of, leading to substantial monetary loss. Any solution that can even marginally decrease this defect rate would be extremely useful for P&G. Hence, P&G has very recently started capturing data related to the diaper manufacturing process using plant sensors. Using this data, we aim to do the following: 1. Build a model that can predict whether a batch of baby diapers will be defective. 2. Understand and quantify the impact of the input variables on the output variable, i.e. the defect flag.
Santosh Kumar Biswal, Telco Customer Churn Prediction, July 2019, (Dungang Liu, Yichen Qin)
Customer churn is the loss of customers. The goal of the project is to predict customer churn for a telecommunications client. Knowing how the churn rate varies by time of the week or month, the product line can be modified according to customer response. A methodical approach has been followed. We start with data cleaning and exploratory data analysis, after which various machine learning algorithms, such as logistic regression, decision trees and random forest, are used to find the model giving the best results, i.e. the lowest misclassification rate. Random forest was found to be the best model for predicting churn, and the most important factor contributing to churn is “Monthly Charges”.
Husain Yusuf Radiowala, Commercial Data Warehousing and MDM for an Emerging Pharmaceutical Organization, July 2019, (Michael Fry, Peter Park)
Pharmaceutical companies invest time in research and billions of dollars in launching a promising new drug, only to see unsatisfactory sales numbers. Competing products and generics arrive quickly after launch, reducing the time in which a drug remains on the market. A successful drug launch is therefore critical to the organization’s success, and effective marketing enables this success. Health Care Providers (HCPs) and other decision makers need to be informed about the key clinical and non-clinical benefits of a product. The situation is even more critical for Emerging Pharma companies, which lack the financial resources of “Big Pharma” to absorb an unsuccessful launch. These organizations therefore tend to focus on an effective drug launch strategy and use externally available big data, commercial analytics, third-party healthcare data and patient information to build a blueprint before launch. Using external data poses challenges: data quality, trust in the accuracy of external sources and integration of the data into a single system are key issues that organizations face. The project aims to define and design an analytics platform with Master Data Management (MDM) capabilities for an Emerging Pharma company (Company A) which aims to launch its product (Product B) in Q1 2020 across its health care practice area (Practice C). This platform will generate a “Golden Profile”, a single source of truth across HCPs in Practice C, drawn from various commercially available data sources. It provides the basis for strategic business decisions before and after the launch of Product B.
Prakash Dittakavi, Eye Exams Prediction, July 2019, (Michael Fry, Josh Tracy)
Visionworks’ business is driven mostly by comprehensive eye exams and the exam conversion percentage. Because the number of customers who visit a store varies from day to day, there are days when a store cannot perform exams for all customers because it has too few staff for the high traffic, and days when it performs fewer exams because traffic is low and the store is overstaffed. Predicting the number of exams will therefore help optimize staffing at the stores, leading to an increase in revenues or a decrease in costs. The objective is to predict the exams for every market separately, as each market differs from the others and the business varies across markets. A Random Forest model and Linear Regression are used to predict the number of exams for a week and to find the important features. Adjusted R-squared is used to compare models.
Aniket Mandavkar, Energy Star Score Prediction, July 2019, (Yichen Qin, Edward Winkofsky)
The NYC Benchmarking Law requires owners of large buildings to annually measure their energy and water consumption in a process called benchmarking. The law standardizes this process by requiring building owners to enter their annual energy and water use in the U.S. Environmental Protection Agency's (EPA) online tool, ENERGY STAR Portfolio Manager and use the tool to submit data to the City. This data informs building owners about a building's energy and water consumption compared to similar buildings, and tracks progress year over year to help in energy efficiency planning. Energy Star Score is a percentage measure of a building's energy performance calculated from energy use. The objective of this study is to use the energy data to build a model that can predict the Energy Star Score of a building and interpret the results to find the factors which influence the score. We will use NYC Benchmarking data set that measures 60 energy-related variables for more than 11,000 buildings in New York City.
Sanjana Bhosekar, Sales Prediction for BigMart, July 2019, (Dungang Liu, Edward Winkofsky)
BigMart (name changed) is a supermarket that could benefit from being able to predict which properties of products and stores play a key role in increasing sales. The dataset provided has 8523 records and 11 predictor variables. The target variable is the sales of a particular product at an outlet. This is a typical supervised machine learning task. For this project, a linear model, regression tree, random forest, generalized additive model, and neural network were tried and tested to predict the revenues in dollars. Item_MRP turned out to be the most important variable, followed by Outlet Type. This challenge was hosted on the website “Analytics Vidhya”, and the metric chosen there to assess the best model was RMSE. By that metric, the Random Forest model worked best.
Shashank Bekshe Ravindranath, Exploratory and Sentiment Analysis on the User Review Data, July 2019, (Yan Yu, Edward Winkofsky)
Yelp.com is a crowd-sourced local business review and social networking site. Yelp users can review products or services using a one- to five-star rating system and describe their experience in their own words, which acts as a guide for other users who want to use the specific product or service. Traditionally, product feedback from users has depended heavily on collecting customers’ ratings on a standardized questionnaire, but with the introduction of text-based data there is an opportunity to extract much more specific information, which can be leveraged to make better business decisions. This paper uses the star rating to quantify the overall user experience and user-written text reviews to understand it qualitatively. Sentiment analysis (also known as opinion mining or emotion AI) using text analysis and data mining techniques will be performed. The data is systematically identified, extracted, quantified, and studied to understand subjective information. This enables an understanding of people’s emotional or subjective mindset, which is quite hard to quantify.
Buddha Maharjan, Surgical Discharge Predictive Model, July 2019, (Dungang Liu, Liwei Chen)
A predictive model of surgical discharge helps assess how effectively hospitals coordinate the care of their sickest patients, those leaving the hospital after a stay to treat chronic illness. This is measured through the discharge rate. Without high-quality care coordination, patients can bounce back from home to the hospital and the emergency room, sometimes repeatedly, which increases hospital readmissions. Better care coordination therefore promises to reduce the readmission rate, which minimizes cost and improves patients’ lives. It also helps to determine how many discharged patients are readmitted for re-surgery. The main purpose of this study is to develop a predictive model for surgical discharge. The dataset is taken from The Dartmouth Institute for Health Policy and Clinical Practice and contains 2013 state-level data for male and female Medicare patients older than 65, covering the 50 states, DC and the United States as a whole. The dataset has 52 observations and 19 variables, of which 1 is categorical and 18 are numerical. Exploratory data analysis was used to analyze the data, and statistical modelling with the StepAIC method was used for variable selection and model building. Important variables such as X1 (Abdominal Aortic Aneurysm Repair), X2 (Back Surgery), X345 (Coronary Angiography, Coronary Artery Bypass, Percutaneous Coronary), X7 (Cholecystectomy), X10 (Knee Replacement), X13 (Lower Extremity Revascularization), X14 (Transurethral Prostatectomy) and X16 (Aortic Valve Replacement) are included in the final model. The predictive model, a multiple regression, estimates the number of patients discharged after surgery. It helps a hospital determine how many of the surgically discharged patients are readmitted within 30 days or longer.
Don Rolfes, Stochastic Optimization for Catastrophe Reinsurance Contracts, July 2019, (Yan Yu, Drew Remington)
Reinsurance is essentially insurance for insurance companies. It serves to reduce variation in a Primary Insurance company’s financial statements, to transfer risk to a Reinsurance company and to maintain financial ratios that are either required by law or desired by shareholders. As with any asset that a company purchases, several questions must be answered. What type should be bought? How much is enough? What is the best deal for what I’m willing to pay? Stochastic Optimization is a useful tool for answering these questions. This project uses a Random Search algorithm paired with a Monte Carlo simulation study in order to find “Optimal” Catastrophe Reinsurance structures. The results suggest a few simple calculations based on Benford’s law that can identify a Reinsurance structure that will perform well on an average basis.
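The pairing of random search with Monte Carlo simulation can be sketched as follows. Everything specific here is an assumption made for illustration and is not taken from the project: the lognormal loss model, a single excess-of-loss layer described by a retention and a limit, a flat premium rate, and an objective that trades off mean net cost against its standard deviation.

```python
import random
import statistics


def simulate_losses(n, rng):
    # Hypothetical loss model: one lognormal catastrophe loss per simulated year.
    return [rng.lognormvariate(2.0, 1.0) for _ in range(n)]


def net_cost(loss, retention, limit, rate_on_line=0.3):
    # The reinsurer pays the part of the loss above the retention, up to the limit.
    recovered = min(max(loss - retention, 0.0), limit)
    premium = rate_on_line * limit  # assumed flat price per unit of limit
    return loss - recovered + premium


def objective(retention, limit, losses):
    # Assumed objective: average net cost plus a penalty on volatility.
    costs = [net_cost(x, retention, limit) for x in losses]
    return statistics.mean(costs) + 0.5 * statistics.pstdev(costs)


def random_search(n_candidates=100, n_sims=1000, seed=1):
    rng = random.Random(seed)
    # Every candidate is scored on the same simulated years, so differences
    # in score reflect the structure rather than simulation noise.
    losses = simulate_losses(n_sims, rng)
    best = None
    for _ in range(n_candidates):
        retention = rng.uniform(0.0, 50.0)
        limit = rng.uniform(0.0, 100.0)
        score = objective(retention, limit, losses)
        if best is None or score < best[0]:
            best = (score, retention, limit)
    return best


best_score, best_retention, best_limit = random_search()
print(best_score, best_retention, best_limit)
```

Random search needs no gradients, which suits simulated objectives like this one; a real study would replace the loss model, pricing and objective with calibrated versions.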
Rahul Agrawal, Bringing Customers Home: Customer Purchase Prediction for an Ecommerce - Propensity to Buy (Predictions), July 2019, (Yichen Qin, Liwei Chen)
An e-commerce retailer’s marketing team wants to improve revenue through customized customer marketing. To target and segment customers, we estimate each customer’s propensity to buy a product in the next month. By prioritizing customers based on their purchase scores, the team can reduce marketing expense and achieve a higher conversion rate and therefore a better ROI. We leverage data on customers’ characteristics since enrollment, such as customer type, engagement, website behavior, purchases and satisfaction with the company, to predict their future purchase probability and revenue generation using predictive analytics, and we interpret the factors that influence customer purchases. We take a supervised learning approach with 2 target variables: first, whether the customer purchases in the next 30 days, and second, the total revenue generated in a month from all purchases. We predict the first using supervised binary classification and the second using supervised regression models. The Gradient Boosting model performed best, with an AUC of 0.82 and an accuracy of 90%. Customers who visited the website recently, had more recent orders, had items added to their cart and had higher overall purchases per month are more likely to purchase a product in the next month. Customers who answered that they will purchase 6 or more products in a year are more likely to purchase in the coming month. A marketing team can leverage this model for accurate personalized marketing, effective email campaigns, a clearer view of customer types and the parameters that separate them, and a better customer experience.
Jeevan Sai Reddy Beedareddy, Identification of Features that Drive Customer Ratings in eCommerce Industry, July 2019, (Charles R. Sox, Vinay Mony)
In ecommerce websites, ratings given to a product are one of the most important factors which could drive sales. A higher rating given to a product might increase the trust in the same and can motivate other customers to make a purchase. There could be multiple factors which influence ratings given to a product i.e. delivery times of previous purchases, product description, product photos etc. Ugam, a leading next generation data and analytics company which works with multiple retailers wants to design an analytical solution that helps in identifying drivers of customers ratings. Since Ugam works with multiple retailers, the solution must be designed such that it is reproducible across multiple retailers with little manual intervention. Through this project we designed an analytical framework which takes ratings/reviews datasets as input, performs modeling techniques like regression, decision trees, random forest and gradient boosting machines, identifies the best performing model and outputs features which are important in driving the ratings. Variable selection is performed in linear regression and hyper parameter tuning is done in tree-based models to extract the best performing features. The entire process is automated and would require only datasets as input from the user.
Dharahas Kandikattu, Genre Prediction (Multi-Label Classification) Based on Movie Plots, July 2019, (Yichen Qin, Edward Winkofsky)
A genre is an informal set of conventions and settings that helps in categorizing a movie. Filmmakers now blend traditional genres such as horror and comedy, giving rise to new genres such as horror comedy. As time goes by, it is becoming harder to classify a movie into a single genre, and most movies these days fall under at least 2 genres. Traditional multi-class classification is not very helpful for finding all the genres associated with a movie based on its plot. To solve this problem, I have used a concept called multi-label classification. In my project, I discuss how we can predict all the genres associated with a movie just from its plot, with the help of NLP and multi-label classification using algorithms such as Naïve Bayes and Support Vector Machines.
Nagarjun Sathyanarayana, Portuguese Bank Telemarketing Campaign Analysis, July 2019, (Yan Yu, Peng Wang)
The data is related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit (variable y). The project will enable the bank to identify the factors that determine customers' response to the campaign and to establish a target customer profile for future marketing plans. The data was split into Training (80%) and Test (20%) groups. The following algorithms have been employed: Logistic Regression, Classification Tree, Random Forest, Neural Network, Naïve Bayes.
Traditionally, statistical analysis is performed using SAS or R. In recent years, however, Python has developed into a preferred statistical analysis tool. Using Python as standalone software does not provide operational efficiency, and improving operational efficiency is crucial when handling huge datasets, especially in a bank, which could have billions of records. To achieve this, I have made use of Apache Spark. Apache Spark is an open-source distributed general-purpose cluster-computing framework. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Spark also provides the handy PySpark integration, which lets us run Python code on top of Spark. This parallel processing enables faster handling of large datasets and the implementation of more complex machine learning algorithms for more accurate predictions.
Ann Mohan Kunnath, Predicting the Designer from Fashion Runway Images with Computer Vision / Deep Learning Techniques, July 2019, (Yan Yu, Peng Wang)
From Coco Chanel to Alexander McQueen, every fashion designer brings a unique signature to their outfits. The ability to identify a designer from a fashion accessory was once a skill reserved for the very best fashion connoisseurs. With the advent of computer vision, however, this ability is becoming commonplace. The capability is being piloted by many retailers and fashion magazines to increase their online sales and brand recognition. In this project, I’m building a computer vision model capable of predicting the fashion designer based on runway images. There are 50 classes of designers, and the evaluation metric used is categorical accuracy. Network architecture and optimization algorithm are key to the performance of any neural network, and hence I have focused on finding the optimal combination of the two for this problem. For network architectures, DenseNet and ResNet have been leveraged, as they help in overcoming the vanishing gradient problem that occurs in deep neural networks. For optimization algorithms, Adam, stochastic gradient descent with momentum, and RMSprop have been leveraged. The results for each model on the training, validation and test sets were compared. The ResNet architecture with 18 layers combined with the Adam optimizer worked best for this dataset.
Ashutosh Sharma, Term Deposit Subscription Prediction, July 2019, (Dungang Liu, Yichen Qin)
Services and products are promoted through either mass campaigns or direct marketing. Mass campaigns, which target a large number of people, are usually inefficient and have low response rates. Direct marketing, by contrast, focuses on a small set of people who are believed to be interested in the product, and therefore attracts a higher response rate and makes marketing campaigns more efficient. In this report, we use a Portuguese bank’s telemarketing data. The main idea of the project is to apply different techniques that could accurately predict the outcome of direct marketing and then compare the results. To this end, exploratory data analysis was carried out to understand the data and to determine whether any relationships exist within it. We then compared machine learning algorithms, namely logistic regression, decision trees and random forest, to find out which can most accurately predict the outcome. Random forest gave the most accurate results for predicting whether a customer has subscribed to the term deposit.
Varsha Agarwalla, Measuring Adherence and Persistency of Patients towards a Drug Based on their Journey and Performing Survival Analysis, July 2019, (Michael Fry, Rohan Amin)
Client ABC is a large pharmaceutical company and a client of KMK Consulting Inc. ABC has a diverse portfolio of drugs in various disease areas. XYZ drug is a lifetime medicine prescribed in cases of chronic heart failure. It is priced at around $4,000 annually. There are multiple reasons why patients do not take medication on a timely basis, and non-adherence to prescription medications has therefore received increased attention as a public health problem. The development of adherence-related quality measures is intended to enable quality improvement programs that align patient, provider, and payer incentives toward optimal use of specific prescribed therapies. The project described in this report calculates these measures and uses them as the basis for a survival analysis. The client uses these metrics to track how its drug is performing in the market and to identify patients who are initially consistent and later drop off, and can plan its next steps accordingly.
Keerthi Gopalakrishnan, Sentiment Analysis of Twitter Data: Food Delivery Service Comparison, July 2019, (Yan Yu, Peng Wang)
Natural Language Processing (NLP) is a hotbed of research in data science these days, and one of its most common applications is sentiment analysis. Thousands of text documents can be processed for sentiment (and other features including named entities, topics, themes, etc.) in seconds, compared to the hours it would take a team of people to complete the same task manually. Sentiment analysis is widely applied to voice-of-the-customer materials such as reviews and survey responses. In this project, the sentiment of tweets is identified with respect to food delivery services such as Grubhub, Doordash and Zomato. Food delivery is one of the industries most dependent on customer reviews, and a collection of good reviews can change a company’s standing among its competitors. Three approaches have been chosen for this project. The first calculates the sum of positive and negative scores obtained by comparing each tweet with predefined lists of positive and negative words. The second is Naïve Bayes, using the ‘sentiment’ package in R. The third is the lexicon-based approach of the ‘syuzhet’ package in R. Over the course of this project, we analyze the most frequent words in each food delivery dataset, a 3-level sentiment comparison (positive, neutral, negative), a 6-level emotion comparison (joy, sadness, trust, fear, disgust, surprise), and a word cloud. All three approaches point to one major result: Grubhub has received more positive responses on Twitter than Doordash and Zomato.
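The first approach, scoring a tweet by counting matches against positive and negative word lists, can be sketched in a few lines. This is a minimal Python illustration, not the R code used in the project; the word lists, the function names and the sample tweets are made-up placeholders.

```python
import re

# Hypothetical mini-lexicons; a real analysis would use a full opinion lexicon.
POSITIVE = {"good", "great", "fast", "love", "fresh"}
NEGATIVE = {"bad", "late", "cold", "hate", "wrong"}


def sentiment_score(text):
    """Return (number of positive words) minus (number of negative words)."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return pos - neg


def label(score):
    """Map a numeric score to a 3-level positive/neutral/negative scale."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Made-up example tweets, for illustration only.
tweets = [
    "Love the fast delivery, food was fresh",
    "Order was late and the food was cold",
    "Ordered dinner tonight",
]
labels = [label(sentiment_score(t)) for t in tweets]
```

Averaging such scores per delivery service would give one simple basis for comparing services, under the assumption that the lexicon covers the vocabulary of the tweets.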
Tingyu Zhao, House Price Analysis: Ames Housing Data, July 2019, (Dungang Liu, Chuck Sox)
The real estate industry has been growing rapidly in recent years. It is fascinating to find out which factors most affect the price of a house and whether a model can accurately predict house sale prices. Consequently, I address the following questions in this project: 1. Find the most important variables in predicting house price. 2. Build statistical models to predict house price and try to decrease model MSE. To answer these questions, I chose the Ames Housing Data, which contains the sales records of individual residential properties in Ames, Iowa from 2006 to 2010. There are 80 variables and 2919 observations in the data set. I cleaned the data set, imputed missing values, conducted exploratory analysis and built two nonlinear regression models, Random Forest and Gradient Boosting, to predict housing price. Both models identified “GrLivArea” (above-grade living area in square feet) as the most important variable for predicting house price. The Random Forest model had the lowest out-of-sample MSE (19.22) and the smallest difference between in-sample and out-of-sample MSE. Both models performed very well in predicting house price as measured by MSE. I therefore conclude that sale prices in the real estate industry follow an orderly, predictable trend.
Chase Williams, Examining the Relationship between Internet Speed Tests, Helpdesk Calls and Technician Dispatches, July 2019, (Charles Sox, Joe Fahey)
Customer experience is critical to the success of any company, but especially those that provide an intangible product or a service to their customers. Providers of high-speed internet face the challenge of delivering the internet speeds purchased by customers regardless of the hardware and wireless speed limitations imposed by the customers’ devices. Understanding the highest and lowest performing operating systems and browsers can help providers maximize the customer experience. In addition, by examining internet speed test and helpdesk call data, providers can gain the ability to predict a technician dispatch and possibly solve the issue before the customer requests it. Improving the customer experience in this way could reduce churn and improve profitability.
Zarak Shah, Bank Loan Predictions, July 2019, (Yichen Qin, Edward Winkofsky)
This data set was posted on Kaggle as a competition. It comes in two parts: a training dataset with 100,514 observations and a testing dataset with 10,353 observations. There are 16 variables in the training dataset and 15 in the testing dataset. The task is to predict the Loan Status column. Because this dependent variable is not included in the testing dataset, we use only the training dataset here. For this capstone we used logistic regression and classification trees. We evaluated our results using the ROC curve and its AUC, a cost function, and logistic regression measures.
Pruthvi Ranjan Reddy Pati, Time Series Forecasting of Sales with Multiple Seasonal Periods, July 2019, (Dungang Liu, Liwei Chen)
Companies need to understand fluctuations in demand to keep the right amount of inventory on hand. Underestimating demand can lead to lost sales due to a lack of supply. On the other hand, overestimating demand results in surplus inventory and high carrying costs. Understanding demand makes a company competitive and resilient to market conditions. Appropriate forecasting models enable us to predict future demand accurately. This paper models time series data of daily store sales of an item across 5 years of sales history, from 2013 to 2018. The data is split, with the first 4 years used as the training set and the last year as the test set, to evaluate the performance of time series forecasting techniques such as ARIMA, SARIMA and TBATS. The data exhibits multiple seasonality, with weekly and annual periods. This complexity clearly exposes the limitations of ARIMA and SARIMA. TBATS performed best, with a training mean absolute error of 0.124721 and a test mean absolute error of 7.236229. It was able to model both the weekly and the annual seasonality along with the trend.
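The evaluation scheme described above (a chronological train/test split scored by mean absolute error) can be illustrated with a short Python sketch. The sales figures are invented, and the seasonal-naive forecaster is only a stand-in baseline, not ARIMA, SARIMA or TBATS; all function names are assumptions made for this example.

```python
def chronological_split(series, test_size):
    """Split a time-ordered series without shuffling: earliest part is train."""
    return series[:-test_size], series[-test_size:]


def seasonal_naive(train, horizon, period):
    """Stand-in forecaster: repeat the last observed season of length `period`."""
    last_season = train[-period:]
    return [last_season[i % period] for i in range(horizon)]


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("series must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Invented daily sales: three identical weeks followed by a slightly different one.
week = [10, 12, 11, 13, 15, 20, 22]
sales = week * 3 + [11, 12, 12, 13, 16, 20, 24]

train, test = chronological_split(sales, test_size=7)
forecast = seasonal_naive(train, horizon=len(test), period=7)
test_mae = mean_absolute_error(test, forecast)
```

Keeping the split chronological matters because shuffling would let the model see observations from the period it is later asked to forecast.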
Vaidiyanathan Lalgudi Venkatesan, Marketing Campaign for Financial Institution, July 2019, (Dungang Liu, Liwei Chen)
Marketing campaigns are characterized by a focus on customer needs and overall satisfaction. Several variables determine whether a marketing campaign will be successful: product, price, promotion and place. The data relates to direct marketing campaigns of a Portuguese banking institution. The campaigns were based on phone calls. Often, more than one contact with the same client was required in order to confirm whether the product (a bank term deposit) would be subscribed ('yes') or not ('no'). The goal is to predict whether a client will subscribe to a term deposit and to identify strategies for improving the effectiveness of the bank’s future marketing campaigns. To do so, we analyze the last marketing campaign the bank ran and identify patterns that help us draw conclusions and develop future strategies. The dataset consists of 11162 rows and 17 variables, including the dependent variable, ‘Deposit’. Statistical techniques such as logistic regression, classification tree and random forest were used to classify customers. Random forest performed best, returning the lowest misclassification rate and the highest AUC. The most important variables from these methods, in order of importance, are Month, Balance and Age. From this we understand that the propensity for marketing conversion depends on the month of contact, the balance of the individual and the age of the customer.
Shriya Sunil Kabade, Customer Churn Analysis, July 2019, (Dungang Liu, Liwei Chen)
Customer loyalty is important for every business. Loyal customers help a company grow by engaging more and improving its brand image. Because of intense competition in the telecommunication industry, retaining customers is of utmost importance. Churn occurs when a customer ceases to use the products or services offered by a company. Insights into customer behavior can help a company recognize early indicators of churn and prevent customers from churning in the future. The goal of this project is to identify the key factors that make a customer churn and to predict whether a customer will churn. The data for this project is taken from IBM sample datasets. It describes a telecom company, ‘Telco’, with 7 thousand records and 21 features. Customers who have churned within the last month are flagged. The features include information about the customer account, demographic information and customer behavior in the form of services that the customer has signed up for. Several binary classification models, including logistic regression, random forest and XGBoost, were built and compared on classifier performance and the ability to correctly classify churned customers. The final XGBoost model classifies 88.6% of the churned customers correctly, missing only 58 churned customers. This model can be used by the telecom company to target customers who are likely to churn and retain them.
Gopalakrishnan Kalarikovilagam Subramanian, Analysis of Over-Sampling and Under-Sampling Techniques for an Unbalanced Data Set, July 2019, (Dungang Liu, Yan Yu)
Fraudulent transactions following card hacking are becoming a major concern for the credit card industry and a major deterrent to customers’ use of credit cards. The goal of this project is to build a model for detecting fraudulent transactions using previously documented data on fraudulent transactions. The data set contains 30 variables. Two of them are the ‘Amount’ of the transaction and the ‘Time’ of the transaction, which is the time elapsed since the first transaction. The other 28 variables are named V1-V28 and are the result of a Principal Component Analysis (PCA) transformation. The output variable, ‘Class’, is binary, so this is a classification problem. The following methods are therefore used: Logistic Regression, Decision Trees, Random Forest, Gradient Boosting Machine (GBM) and Support Vector Machine (SVM). Over-sampling and under-sampling schemes are used to overcome the imbalance in the data set. Precision and recall are used as indicators of the efficacy of a scheme. Over-sampling yields a good balance of precision and recall. Under-sampling results in very good recall but very poor precision. Among the modeling algorithms, Random Forest performs best, giving very good precision and recall. V14, V10, V12, V17 and V4 are the five most important variables in determining whether a transaction is fraudulent.
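The two resampling schemes and the two evaluation measures can be sketched as follows. This is a simplified Python illustration using plain random resampling; the project's exact schemes are not specified, so the function names, the seed handling and the toy data are assumptions.

```python
import random


def random_oversample(rows, labels, minority=1, seed=0):
    """Duplicate random minority rows until both classes are the same size."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    n_extra = len(majority_idx) - len(minority_idx)
    extra = [rng.choice(minority_idx) for _ in range(n_extra)]
    idx = majority_idx + minority_idx + extra
    return [rows[i] for i in idx], [labels[i] for i in idx]


def random_undersample(rows, labels, minority=1, seed=0):
    """Keep a random majority subset of the same size as the minority class.

    Assumes the minority class really is the smaller one.
    """
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    idx = rng.sample(majority_idx, len(minority_idx)) + minority_idx
    return [rows[i] for i in idx], [labels[i] for i in idx]


def precision_recall(y_true, y_pred, positive=1):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data: 8 legitimate (0) and 2 fraudulent (1) transactions.
rows = list(range(10))
labels = [0] * 8 + [1] * 2
over_rows, over_labels = random_oversample(rows, labels)
under_rows, under_labels = random_undersample(rows, labels)
```

Resampling should be applied to the training data only, so that precision and recall are measured on data with the original class balance.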
Varsha Shiva Kumar, Recommendation System for Instacart, July 2019, (Dungang Liu, Liwei Chen)
Instacart is an online application through which customers can order groceries and get them delivered on the same day from nearby grocery stores. In an era of big data technologies, recommendation engines play a crucial role in increasing the number of purchases per customer. With this objective, Instacart wants to build a robust and accurate recommendation system that recommends products for customers to reorder.
The goal of this project is to build a recommendation system that identifies the products most likely to be reordered by customers in a given order. The data available is transactional data on customer orders over time. It contains over 3 million orders from more than 200,000 Instacart users. The set of relational datasets holds order-related details such as the order number, the days since the user’s last order, and the products bought in the order. Through feature engineering, new variables such as total orders, total reorders and average basket size per user were created to better understand the relationship between users and the products they order. Using Logistic Regression, Classification Tree, Adaptive Boosting and XGBoost, the reorder flag was predicted for each product in an order. XGBoost was chosen as the final model because of its high recall. A user’s order rate for a product and the number of orders between purchases of the product were identified as the two most significant variables affecting the likelihood of a product purchase.
Shruti Arvind Jibhakate, Analysis and Prediction of Customer Behavior in Direct Marketing Campaigns, July 2019, (Dungang Liu, Yan Yu)
Marketing is aimed at selling products to existing customers and acquiring new customers. Of the various types of marketing, direct marketing is the most effective. It allows individual customers to be targeted but is more expensive than other methods. The goal of this analysis is to predict whether a customer will subscribe to a term deposit, based on data collected from telephonic direct marketing campaigns conducted by a Portuguese banking institution. It will thus help enhance customer acquisition and cross-selling of products, improve targeting and hence increase the return on investment of the marketing campaigns. The variables record the customer’s profile information, marketing campaign information, and social and economic context attributes between May 2008 and November 2010. Exploratory data analysis and various statistical modeling approaches are undertaken to better understand the data and develop a robust prediction algorithm. This analysis uses Logistic Regression, Decision Tree, and Generalized Additive Model. Since this information is used for decision making, methods that differentiate based on controllable parameters are preferred. By this criterion, the Logistic Regression model is the most interpretable and has the highest predictive power. The model separates subscribers from non-subscribers based on client information – default status, contact type; campaign information – month, previous campaign outcome, number of contacts performed in the current campaign and number of days since the customer was last contacted; and socio-economic factors – number of employees and consumer confidence index.
Ruby Yadav, Bike Rental Demand Prediction Using Linear Regression & Advanced Tree Models, July 2019, (Dungang Liu, Liwei Chen)
Bike rental is a popular business in the USA nowadays, used by students and tourists and driven by traffic congestion and health promotion. The automated process of renting a bike on an as-needed basis makes it very convenient for consumers. In this project we predict bike rental demand for a bike rental program in Washington DC. We examine the impact of factors such as season, weather and weekend/weekday in order to predict demand. Since the response variable is continuous, we use a linear regression model to predict demand. Advanced machine learning techniques such as regression trees, random forest and bagging are also used, and from these we chose the best model for rental demand prediction and determined the significant factors that actually affect the number of bikes rented. We chose random forest as the best method for predicting bike demand because it has the lowest mean squared error.
Charles Brendan Cooper, What Matters Most? Analyzing NCAA Men’s Basketball Tournament Games, July 2019, (Dungang Liu, Yan Yu)
Every year since 1939, the National Collegiate Athletic Association (NCAA) has hosted its end-of-season single-elimination tournament for Division 1 Men’s Basketball, commonly known as March Madness, to determine the national champion. In its current form, the tournament consists of 64 teams from conferences across the nation divided into 4 sections. In recent years, spectators have become more and more interested in trying to predict how the tournament will play out. Who will be in the final four, what kinds of upsets will there be (a lower-seeded team beating a higher-seeded one), and who will be the eventual national champion? These questions and more are debated by journalists, sports newscasters, and fans so much that there is a name for it all: Bracketology. Using the data available, my goal is to identify which variables are most impactful for predicting the outcome of games within the tournament. This offers insight into how teams might approach an upcoming game against an opponent based on their attributes. The data set is built from a group of tables provided by a Kaggle competition sponsored by Google, resulting in 99 variables and 1,962 observations. A stepwise variable selection process is then applied, and a final model is built from the selected variables. This results in 17 core variables out of the 99 available and an out-of-sample AUC of 0.787.
Beth Hilbert, Promotions: Impact of Mailer and Display Location on Kroger Purchases, July 2019, (Charles Sox, Yan Yu)
This project examined which promotions are most effective for pastas, sauces, pancake mixes, and syrups in Kroger stores. A dataset provided by 84.51 was used to analyze how weekly mailers and in-store displays correlate with sales (number of baskets) for each product. Random Forest, LASSO Regression, Hierarchical Clustering, and Association Rules were used to answer this question.
The first part of the analysis used Random Forest to determine which promotion types had the greatest influence on the number of baskets. Given these data, in-store end caps and interior page mailer placements had the most influence on purchases/baskets. The second part of the analysis used LASSO Regression and Hierarchical Clustering to identify similar products through product segmentation. Similar products were clustered and then analyzed to determine what differentiated the groups. This analysis showed that sauces appear four times as often in the cluster most responsive to promotions. Given these results, promoting sauces on end caps is recommended. The final part of the analysis used Association Rules to evaluate purchase pairings through market basket analysis. The analysis focused on the Private Label brand because it was available for both pastas and sauces. Given these data, promoting Private Label sauce on the back page of the mailer, instead of the interior page, is recommended to increase the chance of pairing with Private Label spaghetti.
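The market basket step rests on three standard association-rule measures: support, confidence and lift. The Python sketch below computes them for a single rule; the baskets are invented for illustration and the function names are assumptions, not the project's code.

```python
def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)


def confidence(baskets, antecedent, consequent):
    """Estimated P(consequent | antecedent); antecedent must occur at least once."""
    both = set(antecedent) | set(consequent)
    return support(baskets, both) / support(baskets, antecedent)


def lift(baskets, antecedent, consequent):
    """Confidence relative to the consequent's baseline support (>1: positive association)."""
    return confidence(baskets, antecedent, consequent) / support(baskets, consequent)


# Invented baskets, for illustration only.
baskets = [
    {"pasta", "sauce"},
    {"pasta", "sauce", "syrup"},
    {"pasta"},
    {"pancake mix", "syrup"},
    {"sauce"},
]

rule_confidence = confidence(baskets, {"pasta"}, {"sauce"})
rule_lift = lift(baskets, {"pasta"}, {"sauce"})
```

In this toy data, two of the three pasta baskets also contain sauce, so the rule pasta → sauce has confidence 2/3 and a lift slightly above 1.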
Rahul Muttathu Sasikumar, Credit Risk Modelling – Lending Club Loan Data, July 2019, (Yan Yu, Nanhua Zhang)
Credit risk refers to the chance that a borrower will be unable to make their payments on time and will default on their debt. It refers to the risk that a lender may not receive the interest due or the principal lent on time. It is extremely difficult and complex to pinpoint exactly how likely a person is to default on their loan. At the same time, properly assessing credit risk can reduce the likelihood of losses from default and delayed repayment. Credit risk modelling is the best way for lenders to understand how likely a loan is to be repaid. In other words, it is a tool for understanding the credit risk of a borrower. This is especially important because the credit risk profile keeps changing with time and circumstances. As technology has progressed, new ways of modeling credit risk have emerged, including credit risk modelling using R and Python. These include using the latest analytics and big data tools to model credit risk. In this project, I use data from Lending Club, a US peer-to-peer lending company headquartered in San Francisco, California. The past loan data is used to train a machine learning model that classifies a loan applicant as at risk of defaulting on the loan or not.
Jyotsana Shrivastava, Factors Influencing Length of Service of Employees, July 2019, (Yan Yu, Dung Tran)
Macy’s has 587 stores in 42 US states, with 225,656 employees as of December 2017. These stores are divided into 5 geographical regions for retail sales. In a prior analysis with the Macy’s People Analytics team, we had observed that average sales vs. plan across all Macy’s stores has a negative relationship with employees’ length of service. In this paper, length of service (also referred to as ‘LOS’) is analyzed against a variety of employee information available through the HR information base. The analysis is conducted for Macy’s stores overall and at the region level. A deeper analysis is also conducted on furniture and non-furniture stores, along with region-wise variation where applicable. The analysis applies data mining fundamentals in R throughout the project. This paper should help Macy’s make HR-related changes as required at the region or store level. The key findings are that LOS has a positive relationship with the average standard working hours of employees in a store and a negative relationship with the number of full-time employees in the store. The North-East (NE) region, which includes Macy’s flagship Herald Square store, has the highest LOS amongst all stores, and the NE region has a positive relationship with LOS. The key results are summarized in the paper, and detailed results are available upon request.
Xi Ru, Predicting High-Rating Apps, July 2019, (Yan Yu, Yichen Qin)
With so many applications developed each day, it is difficult for software developers to determine which kinds of applications will become popular after release and be rated highly by the public. This project aims to predict highly rated applications on the Google Play Store so that app developers can invest their time and resources wisely. In this project, both regression and classification models are built, and the “best” model is found by comparing prediction accuracy. The criteria used to assess the regression models are AIC, BIC, adjusted R-squared, and MSE. The classification models are assessed by both in-sample and out-of-sample AUC and misclassification rate (MR). To eliminate the effect of different categories, the same data mining techniques were also applied to a single category, “Family”, which is the largest category in the original dataset. The model with the highest predictive accuracy across multiple categories is Random Forest, while the predictive models for the single category “Family” do not differ significantly in performance.
Bharath Vattikuti, IBM HR Analytics: An Analysis on Employee Attrition & Performance, July 2019, (Yan Yu, Liwei Chen)
Attrition is a problem that affects all businesses, irrespective of geography, industry and company size. Employee attrition imposes significant costs on a business, including hiring expenses and the cost of training new employees, along with lost sales and productivity. Hence, there is great business interest in understanding the drivers of employee attrition and minimizing it. If the reasons behind employee attrition are identified, the company can create a better working environment for its employees, and if attrition can be predicted, the company can take the necessary actions to stop valuable employees from leaving. This report therefore explores the HR dataset from IBM Watson Analytics and manipulates it to find meaningful relationships between the response variable (whether an employee left the company or not) and the predictor variables, which provide information about an employee. Multiple predictive statistical models are then built to predict the probability of an employee leaving the firm, and the contributing factors are studied by plotting variable importance. To evaluate model performance, prediction accuracy, sensitivity and AUC are considered. Of the models built, the Support Vector Machine (SVM) was selected because of its higher F1 score and comparable accuracy.
Rashmi Prathigadapa, Movie Recommender Systems, July 2019, (Yan Yu, Edward Winkofsky)
Movie recommender systems have gained popularity and importance in our social life because of their ability to provide enhanced entertainment. Such a system employs a statistical algorithm that seeks to predict users' ratings for a particular movie based on the similarity between movies or the similarity between users who previously rated those movies. This enables the system to suggest movies to its users based on their interests and/or the popularity of the movie. Many recommender systems exist, but most of them struggle to recommend movies efficiently either to existing users or to new users. In this project, we focus not only on a recommender system for existing users, based on their tastes and shared interests, but also on one that can make suggestions to new users based on popularity. The dataset used has 45,000 movies, with information about cast, crew, ratings, keywords, etc. We have built 4 recommender systems: Simple Recommender, Content-Based Recommender, Collaborative Filtering, and Hybrid Engine.
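The idea of recommending by similarity between movies can be sketched with item-based cosine similarity over a ratings table. This is a minimal Python illustration under assumed data; the users, movies, ratings and function names are hypothetical and do not come from the project's dataset or code.

```python
import math

# Hypothetical ratings: user -> {movie: rating on a 1-5 scale}.
ratings = {
    "ann": {"A": 5, "B": 4, "C": 1},
    "bob": {"A": 4, "B": 5},
    "cat": {"A": 1, "C": 5},
}


def item_vector(ratings, movie):
    """Collect the ratings a movie received, keyed by user."""
    return {user: r[movie] for user, r in ratings.items() if movie in r}


def cosine(v1, v2):
    """Cosine similarity of two sparse vectors stored as dicts."""
    common = set(v1) & set(v2)
    if not common:
        return 0.0
    dot = sum(v1[u] * v2[u] for u in common)
    norm1 = math.sqrt(sum(x * x for x in v1.values()))
    norm2 = math.sqrt(sum(x * x for x in v2.values()))
    return dot / (norm1 * norm2)


def most_similar(ratings, movie):
    """Return the other movie whose rating pattern is closest to `movie`."""
    others = sorted({m for r in ratings.values() for m in r} - {movie})
    target = item_vector(ratings, movie)
    return max(others, key=lambda m: cosine(target, item_vector(ratings, m)))


recommendation = most_similar(ratings, "A")
```

A user who liked movie A would be pointed to the movie returned here. A new user with no ratings has no such vector, which is why a popularity-based recommender is a common fallback.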
Paul Boys, Statistical Inference for Predicting Parkinson’s Disease Using Audio Variables on an At-Risk Population, July 2019, (Yan Yu, Edward Winkofsky)
Parkinson’s disease is a degenerative neurological disorder characterized by progressive loss of motor control. Approximately 90% of people diagnosed with Parkinson’s disease (PD) have speech impairments. Development of an audio screening tool can aid in early detection and treatment of PD. This paper re-examines research data of audio speech variables from recordings of three groups: 1) healthy controls, 2) patients newly diagnosed with PD and 3) an at-risk group. The focus of this paper is on accurately predicting Parkinsonism in the at-risk group. The original research reported a 70% accuracy using quadratic discriminant analysis (QDA). This paper examines QDA, linear discriminant analysis (LDA), support vector machines (SVM) and Random Forest, using the least absolute shrinkage and selection operator (LASSO) for feature selection. LASSO selected two variables. Using these two variables, Random Forest had the best out-of-sample accuracy at 72%. SVM and Random Forest resulted in sensitivities superior to those of QDA and LDA. With model accuracy on the control and PD groups as the model selection criterion, the SVM with a Bessel kernel was chosen as the optimal model. This SVM model was 64% accurate when validated on the at-risk group. Human speech screening of the at-risk group correctly identified speech impairments in 2 of the 23 Parkinson’s positive patients. The SVM model improved on the human performance by correctly identifying speech impairment in 8 of these 23.
Harish Nandhaa Morekonda Rajkumar, Improving Predictions of Fraudulent Transactions Using SMOTE, July 2019, (Yan Yu, Edward Winkofsky)
This project aims to predict whether or not a credit card transaction is fraudulent. On the highly imbalanced dataset, logistic regression and random forest models are applied to understand how well the true positives are captured. Two sampling techniques, Random Oversampling and SMOTE, are explored in this project. SMOTE stands for Synthetic Minority Oversampling Technique, a process in which synthetic instances are generated from the minority feature space to offset the imbalance. The models are applied again to the resampled data, and the areas under the ROC and PR curves are observed to increase sharply. With SMOTE data, there is also a sharp drop in false positives, which fall by up to 38%, possibly leading to hundreds of thousands of dollars in cost savings. The threshold range is also found to increase, allowing more room for the model to be flexible.
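The core of SMOTE, generating a synthetic point on the line segment between a minority instance and one of its nearest minority neighbours, can be sketched in plain Python. This is a simplified illustration, not the implementation used in the project; the function name, the parameter defaults and the sample points are assumptions.

```python
import math
import random


def smote(minority, n_synthetic, k=2, seed=0):
    """Create synthetic minority points by interpolating towards near neighbours.

    `minority` is a list of at least two numeric tuples (minority-class rows only).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the chosen point, excluding itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Made-up minority-class points in a 2-D feature space.
fraud_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.2), (1.2, 1.8)]
new_points = smote(fraud_points, n_synthetic=5)
```

Unlike random oversampling, which repeats existing rows, each synthetic point here is a new location inside the region already occupied by the minority class.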
Shagun Narwal, Analysis of e-Commerce Customer Reviews and Predicting Product Recommendation, July 2019, (Dungang Liu, Yichen Qin)
The dataset used for the project belongs to the e-commerce industry, specifically a women’s clothing website. In the last 2 decades, the e-commerce industry has consistently leveraged data to improve sales, advertisement and customer experience. Customers on a women’s e-commerce website have provided reviews and also voted whether they will recommend the product or not. This data was analyzed to generate insights in the e-commerce product reviews and recommendations space. Also, sentiment analysis was performed and used to understand the association between review words and customer sentiments. Further, the reviews were used to develop a model which can predict whether the person will recommend the product or not.
Lissa Amin, Psychographic Segmentation of Consumers in the Banking Industry, July 2019, (Michael Fry, Allison Carissimi)
The competitive landscape of the banking industry has forced traditional retail banks to shift their focus towards becoming more consumer-centric organizations and maintain levels of service and convenience that compete with the experiences consumers have both within and outside the industry. In order to be a truly customer-centric organization, there must first be a true understanding of who the consumer is which includes their needs, attitudes, preferences and behaviors. Market segmentation which aims to identify the unique groups of consumers that exist within the market based on shared characteristics is critical to understanding the consumer. Bank XYZ is one example of a traditional retail bank that has adopted more customer-centric values and is working to redefine the way they build products, services and marketing campaigns in order to drive value for both the customer and the bank. This analysis focuses on a psychographic segmentation of the consumers within Bank XYZ’s geographic footprint and identifies the unique groups that exist based on their attitudes, needs, behaviors and beliefs.
Harsharaj Narsale, Owners’ Repeat Booking Pattern Study and Forecasting, July 2019, (Charles Sox, Andrew Spurling)
NetJets Inc., a subsidiary of Berkshire Hathaway, is an American company that sells part ownership or shares (called fractional ownership) of private business jets. Accurate demand forecasting is essential for operational planning, and fleet management plays an important role in operations. Demand fluctuates with high variance and must be fulfilled daily and on time, without declining any requested flight, to protect the company’s reputation. The company has its own fleet and can also subcontract aircraft from other companies on a temporary basis. Subcontracts are costly and should be avoided where possible, which requires detailed planning based on accurate forecasts. Flight demand is currently forecast from the last 5 years of data using time series models built in SAS enterprise, with an accuracy of ~96% in forecasting the total number of flights booked. As most of the owners are business or sports personnel, flights booked for annual business meetings or sporting events are expected to show certain patterns. Owners’ repeat booking behavior can help in fine-tuning demand forecasts. In this project, booking patterns have been analyzed across the 365/366 days of the year. An autoregressive time series model was built with a mean absolute percentage error (MAPE) of 10.75% for repeat percentage. Temporal aspects of advance flight booking across new and repeat bookers have been analyzed to improve the demand forecast for a flight day well in advance.
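The MAPE figure quoted above follows a standard definition: the average of the absolute forecast errors, each expressed as a percentage of the actual value. The sketch below is only an illustration of that formula; the function name and the sample repeat-percentage values are hypothetical and are not taken from the project.

```python
from typing import Sequence


def mape(actual: Sequence[float], forecast: Sequence[float]) -> float:
    """Mean absolute percentage error, returned in percent."""
    if len(actual) != len(forecast) or len(actual) == 0:
        raise ValueError("actual and forecast must be non-empty and equal in length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    # Average of |actual - forecast| / |actual| over all periods.
    total = sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast))
    return 100.0 * total / len(actual)


# Hypothetical daily repeat-booking shares, for illustration only.
example_actual = [0.40, 0.50, 0.25]
example_forecast = [0.36, 0.55, 0.25]
example_mape = mape(example_actual, example_forecast)
```

A lower value means the forecast tracks the observed repeat percentage more closely. The measure is undefined for days with an actual value of zero, which is why the sketch rejects them.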
Neha Nagpal, Predicting Dengue Disease Spread, July 2019, (Dungang Liu, Chuck Sox)
Dengue fever is a mosquito-borne disease that occurs in tropical and sub-tropical parts of the world. Because it is carried by mosquitoes, the transmission dynamics of dengue are related to climate variables such as temperature and precipitation. In this project, various climatic factors are analyzed to generate a model for predicting the number of cases of dengue in the future, which in turn would help public health workers and people around the world take steps to reduce the impact of these epidemics. The data is provided by U.S. Centers for Disease Control and Prevention for two cities: San Juan, Puerto Rico and Iquitos, Peru. Various statistical methods have been used to find the best model as per the data. Finally, Support Vector Machine provided the lowest Mean Absolute error which makes it a suitable model for predicting the number of dengue cases in the future. From the analysis, we also found that humidity in the air, high temperature and some specific seasons result in a greater number of dengue cases.
Swagatam Das, IMDB Movie Score Prediction: An Analytical Approach, July 2019, (Dungang Liu, Liwei Chen)
Films have always been an integral part of the world of entertainment. They can be used as a medium to convey important messages to the audience and also a creative medium to portray the fictional world. A filmmaker’s aspiration is not only to achieve commercial success but also to gain critical acclaim and content appreciation by the audience. The most commonly used metric for filmmakers, audience, critics, etc. is the IMDB Score. This score out of 10 marks the overall success or failure of a film. In this project, I have studied the factors that affect the final IMDB Score right from popularity of Actors/Directors to commercial aspects like Budgets and Gross Earnings. The intent is to help future filmmakers make educated decisions while creating films. The data has been sourced from Kaggle which was originally pulled from the IMDB website. I have used Data Visualizations and Machine Learning Algorithms to make predictions regarding the response i.e. the IMDB score. I have concluded that total number of users who voted for the movie, duration of the movie, budget and gross earnings of the movie are important factors that determine the IMDB score and would recommend future film makers to look into these factors before producing/directing films. I have used different models to assess the prediction accuracy of the IMDB Score. From the analysis, Random Forest Algorithm had the best accuracy rate of 78.42% compared to 75.65%, 78% and 77.5% for Multinomial Logistic Regression, Decision Tree and Gradient Boosting respectively.
Akshay Singhal, Vehicle Loan Default Prediction, July 2019, (Dungang Liu, Yichen Qin)
A major share of banks’ income comes from the interest earned on loans, but loans can be risky too. Banks must weigh the risk-to-reward ratio for any kind of loan. This is where credit scoring comes in. Loan defaults cause huge losses for banks, so they pay close attention to this issue and apply various methods to detect and predict the default behavior of their customers. In this report, we attempt to predict the risk of a loan defaulting based on past data. The data was obtained from Kaggle and contains information about customers from the Indian subcontinent. The main idea of the project is to identify the factors responsible for loan default. Exploratory data analysis was done to understand the data better and to study the relationships between variables. The performance and accuracy of several machine learning algorithms, namely logistic regression, random forest and gradient boosting, were then compared to find out which technique works best in this scenario. It was found that loan default is highly influenced by the loan amount and the customer’s credit history. Random forest gave the best results for predicting default on this data.
Mrinal Eluvathingal, A Machine Learning Walkthrough, July 2019, (Peng Wang, Ed Winkofsky)
The main goal of this project is to provide a machine learning walkthrough of a dataset and, through the process of data munging, exploration, imputation, engineering and modeling, to show that the stages of preprocessing and feature engineering are the most important and are the foundation upon which a more powerful model can be built. Using the Ames Housing Dataset, we perform exploratory data analysis, feature engineering and feature selection using advanced techniques, including an innovative new method for feature creation, and we compare different machine learning algorithms and analyze their relative performance. This project, titled ‘A Machine Learning Walkthrough’, lays emphasis on the most important part of a data science problem, which is data preparation.
Margaret Ledbetter, The Role of Sugar Propositions in Driving Share in the Food and Beverage Category, April 2019, (Yan Yu, Joelle Gwinner)
Sugar continues to be a “hot topic” for [food and beverage] consumers and a driver of recent buyer and volume declines in the aisle. To date, there has been limited understanding of consumer preferences for specific sugar ingredients – i.e., natural vs. added – and lower sugar propositions. This research seeks to understand the role of sugar ingredient and lower sugar propositions as well as other factors in the [food and beverage] consumer purchase decision, including: brand, variety, all natural claim, and added benefit. The insights uncovered in this research will be used to inform [Client] line optimization and new product development for FY20 + beyond.
Jacob Blizzard, Pricing Elasticity Evaluation Tool, April 2019, (Yan Yu, Chad Baker)
EG America is looking to maximize profits for their beer category through pricing analytics. Pricing analytics is at the core of profitability, but setting the right price to maximize profits is often difficult and extremely complex. This project aims to create a tool that recommends the optimal price to maximize profit by using historic sales data and the price elasticity of demand for top selling items within each state in which EG America operates. The tool compiles sales data queried from internal data systems and market research data from Nielsen and calculates the optimal price to set for each item by using the elasticity coefficient and the cost of each item. Setting the correct price for beer in convenience stores is of utmost importance due to the customer base that is exceedingly price aware and very sensitive to price changes.
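The abstract does not state the demand model behind the tool. As a minimal sketch, assume a constant-elasticity demand curve Q = a·P^e with elasticity e < -1 and a constant unit cost c. Profit (P - c)·Q is then maximized at P* = c·e / (1 + e). The function names, the scale parameter and the sample numbers below are hypothetical.

```python
def optimal_price(unit_cost: float, elasticity: float) -> float:
    """Profit-maximizing price under constant-elasticity demand (assumed model)."""
    if unit_cost <= 0:
        raise ValueError("unit_cost must be positive")
    if elasticity >= -1:
        # With inelastic demand, profit under this model grows without bound.
        raise ValueError("elasticity must be below -1 for a finite optimum")
    return unit_cost * elasticity / (1.0 + elasticity)


def profit(price: float, unit_cost: float, elasticity: float, scale: float = 1000.0) -> float:
    """Profit (P - c) * a * P^e for the assumed demand curve; scale is a."""
    return (price - unit_cost) * scale * price ** elasticity


# Hypothetical item: unit cost of 10 and a price elasticity of -2.
best_price = optimal_price(10.0, -2.0)
```

With these illustrative inputs the markup over cost is e / (1 + e) = 2, so the more price-sensitive the customer base (the more negative e), the closer the recommended price sits to cost.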
Daniel Schwendler, Single Asset Multi-period Schedule Optimization, April 2019, (Mike Fry, Paul Bessire)
In a production environment, the capacity to produce finished materials is the primary focus of operations leadership. Sophisticated systems surround the scheduling of production assets, and resources are dedicated to making the most of the available capacity. The two primary levers for increasing production are efficiency and total asset availability; in other words, increasing production with current assets or increasing the total potential for production through increases in staffing or production equipment. In many environments, the purchase of production equipment is a sizable capital expenditure that is not an option or requires comprehensive justification. In a continuous flow example, the scheduling of a single machine can drive the entire supply chain. The production schedule of this machine is critical. Operations management relies on production and scheduling to steer the business. In these cases, the application of optimization provides objective recommendations and lets skilled resources focus on decision making. In this paper, we explore the application of mixed integer linear optimization in a continuous flow environment as an enterprise resource planning tool. An optimal master production schedule alone adds value in understanding machine capabilities to meet demand, while it also informs many other facets of the business. Material requirement planning, inventory management, sales forecasting, required maintenance and supply chain logistics are all critical considerations.
Khiem Pham, Optimization in Truck-and-Drone Delivery Network, April 2019, (Leonardo Lozano, Yiwei Chen)
With the introduction of unmanned aerial vehicles (UAV), also known as drones, several companies with shipping service promised to greatly cut down the delivery time using these devices. One of the reasonable methods to use the drones is to launch them not directly from the shipping centers, but from the normal delivery trucks themselves. With multiple vehicles delivering at the same time, this can save a lot of time. In this paper, we discuss a general approach to the problem from an optimization point of view. We consider different drones’ specifications as well as the number of drones to deploy. We aim to formulate a model that can return optimal vehicle routes and measure the computational expense of the model.
Lauren Shelton, Black Friday Predictions, April 2019, (Dungang Liu, Liwei Chen)
In the United States, the day after Thanksgiving is known as Black Friday, the biggest shopping day of the year. Stores offer their best sales to kick off the holiday season. Store owners could benefit from being able to predict what customers want to buy, how much they are willing to spend, or the demographic of customers to target. For this project, a linear model, generalized additive model, neural network model, and classification tree model are used to predict purchase prices in dollars. All predictor variables, including gender, age, marital status, city category, years in city, occupation category, and product categories, were important. The final model chosen was the linear model, which performed best.
Niloufar Kioumarsi, Mining Hourly Residential Energy Profiles in order to Extract Family Lifestyle Patterns, April 2019, (Yichen Qin, Peng Wang)
This study presents a new methodology for characterizing domestic hourly energy time series based on hierarchical clustering of the seasonal and trend components of energy load profiles. It decomposed the energy time series into their trend, seasonal and noise components. It segmented households into two groups by clustering their trend components. To interpret the trend clustering results, it looked at the correlation between weather and the energy time series in each cluster. The study also examined the influence of household characteristics on patterns of electricity use. Each trend cluster was linked to household characteristics (house age and size) by applying a decision tree classification algorithm. Finally, the seasonal component of the energy profiles was used to cluster customers based on family lifestyle patterns. This study constructed 6 profile classes/typical load profiles reflecting different family lifestyles, which can be used in various energy saving behavioral studies.
Mark McNall, From Sinatra to Sheeran: Analyzing Trends in Popular Music with Text Mining, April 2019, (Dungang Liu, Edward P. Winkofsky)
Starting in late 1958, Billboard Magazine’s Hot 100 became the single, unified popularity chart for singles in the music industry. Because music is such a universal and beloved form of art and entertainment, exploring how popular music has changed over the years could provide interesting and valuable insight both for consumers and for the music industry (musicians, songwriters, lyricists). One way to explore trends in music is analyzing their lyrics. This project aims to analyze the lyrics of every #1 hit over time by using a variety of text-mining applications such as tokenization, TF-IDF ratio, lexical density and diversity, compression algorithms, and sentiment analysis. After extensive research, results showed #1 hits have steadily gotten more repetitive over time, as popular songs have had a declining lexical density and increasing compression ratios. Sentiment analysis showed that popular music has also gotten more negative, and emotions such as anger and fear are more prevalent in lyrics than positive emotions such as joy compared to the past. Finally, the usage of profanity in popular music has skyrocketed in the last two decades, showing that music has not only got more negative but also more vulgar.
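Two of the repetitiveness measures mentioned above can be sketched with the Python standard library alone. The block below assumes lexical diversity is measured as a type-token ratio and that the compression ratio is raw size divided by zlib-compressed size. The project may have used different definitions or tools, and the lyric snippets are made up for illustration.

```python
import zlib


def lexical_diversity(lyrics: str) -> float:
    """Type-token ratio: distinct words divided by total words."""
    tokens = lyrics.lower().split()
    if not tokens:
        raise ValueError("lyrics must contain at least one word")
    return len(set(tokens)) / len(tokens)


def compression_ratio(lyrics: str) -> float:
    """Raw byte size divided by zlib-compressed size; higher means more repetitive."""
    raw = lyrics.encode("utf-8")
    if not raw:
        raise ValueError("lyrics must not be empty")
    return len(raw) / len(zlib.compress(raw, 9))


# Made-up snippets for illustration only (not from the project's data).
repetitive = "na na na na hey hey hey goodbye " * 20
varied = "every verse wanders somewhere new and never circles back again"
```

A song that repeats the same hook many times scores low on diversity and high on compression ratio, which matches the direction of the trend the abstract reports.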
Laura K. Boyd, Predicting Project Write Downs, May 2019, (Edward Winkofsky, Michael Fry)
Company A is a national engineering consulting firm that provides multi-media services within four distinct disciplines: Environmental, Facilities, Geotechnical, and Materials. The goal of this project is to investigate the amount of project write downs for the Cincinnati office, specifically for the Environmental Department, over the past four years. Preliminary review of Company A’s data indicates that the average monthly write down for the Environmental Department is approximately $17,000. This analysis uses linear regression to explore the relationships in project data for the Cincinnati office between 2015 and 2018. Backward selection was used to select predictors, removing them one at a time when a predictor variable was determined not to contribute to the overall prediction. The following limitations were encountered during this analysis: available data analytics tools and data integrity. As part of the BANA program, I was exposed to multiple data analysis tools including R, SAS, and Microsoft Excel. Company A does not use R or SAS; therefore, Microsoft Excel was used for this analysis. When dealing with data, integrity is always a concern, especially when the data relies heavily on user inputs. The data used in this analysis was exported from a project management database. The data in the database is entered by each project manager and relies on accurate and up-to-date entries.
Joe Reyes, Cincinnati Real Estate – Residential Market Recovery, May 2019, (Shaun Bond, Megan Meyer)
During the Recession of 2007-2009, real estate was affected nationwide. Local homeowners in the Cincinnati tri-state area also felt the impact of the downturn. The Hamilton County Auditor’s Office maintains and publishes real estate sale records. This data is useful in evaluating not only the general market for different neighborhoods but also how much local property values were influenced by the recession. Historical sales volumes and values provide insight into the overall character of real estate values as a function of property sales. Further, consideration of market supply and demand during the same period gives a view into the drivers behind the decline and rebound around the recession. This brief summarizes residential trends using the above data from the years 1998 through 2018 for Hamilton County. Comparing this local information with national and regional information gives an idea of how Cincinnati residents fared compared to the Midwest and the USA.
Poranee Julian, A Simulation Study on Confidence Intervals of Regression Coefficients by Lasso and Linear Estimators, May 2019, (Yichen Qin, Dungang Liu)
We performed a simulation study to compare the confidence intervals of regression coefficients produced by Lasso (a regression analysis method that performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the statistical model it produces) and by linear regression. We studied five cases. The first three cases contain three different numbers of independent regressors. In the fourth case, we studied a data set of correlated regressors with a given Toeplitz correlation matrix. The last case is similar to the fourth, but with the correlation matrix given by an AR(1) structure. The results showed that linear regression performed slightly better. However, Lasso regression gave effective models as well.
David Horton, Predicting Single Game Ticket Holder Interest in Season Plan Upsells, December 2018, (Yan Yu, Joseph Wendt)
Using customer data provided from the San Antonio Spurs, a statistical model was built that predicts the likelihood that an account which only purchased single game tickets in the previous year will upgrade to some sort of plan, either partial or full season, in the current year. The model uses only variables derived from customer purchase and attendance histories (games attended, tickets purchased and attended, money spent) over the years 2013-2016. The algorithm used for training was the Microsoft Azure Machine Learning Studio implementation of a two-class decision jungle. Training data was constructed as customers who had purchased only single game tickets in the previous year. This data was split randomly so that 75% was used for training and 25% for testing. In later runs, all data from 2016 was withheld from training and testing as a validation set, as noted in the results section. The final model (including 2016 data in training) shows a test accuracy of 84.9%, where 50% accuracy is interpreted as statistically random and 100% yields only perfect predictions. This model is likely to see improvements in predictive power as demographic information is added, new variables are derived, the feature selection method becomes more sophisticated, the model choice becomes more sophisticated, model parameters are optimized, and more data becomes available.
Ravi Theja Kandati, Lending Club – Identification of Profitable Customer Segment, August 2018, (Yan Yu, Olivier Parent)
Lending Club issues unsecured loans to different segments of customers. The interest rate for a loan depends on the credit history of the customer and various other factors such as income level, demographics, etc. The borrower data is public. The objective is to analyze the dataset and distinguish the good customers from the bad customers (“charged off”) using machine learning techniques. This is a class-imbalanced dataset, as the good customers far outnumber the bad customers. As this is a typical classification problem, the CatBoost technique is used for modelling.
Pengzu Chen, Churn Prediction of Subscription-based Music Streaming Service, August 2018, (Dungang Liu, Leonardo Lozano)
As a well-known subscription business model, paid music streaming became the largest recorded music market revenue source in 2017. Churn prediction is critical for subscriber retention and profit growth in a subscription business. This project uses a leading Asian music streaming service’s data to identify parameters that have an impact on users’ churn behavior and to predict churn. The data contains user information, transaction records and daily user activity logs. Prediction models are built with logistic regression, classification tree and support vector machines algorithms, and their performances are compared. The results indicate that classification tree model has the best performance among the three in terms of asymmetric misclassification rate. The parameters that have a big impact on churn are whether subscription auto-renew is enabled, payment method, whether the users cancel the membership actively, payment plan length, and user activities 0-2 days before subscription expires. This informs the service provider where customer relationship management should focus.
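The abstract compares models by an asymmetric misclassification rate but does not give the weights. A minimal sketch, assuming that missing a churner (a false negative) costs five times as much as a false alarm, is shown below. The 5:1 weighting, the function name and the sample labels are all hypothetical.

```python
from typing import Sequence


def asymmetric_misclassification_rate(
    y_true: Sequence[int],
    y_pred: Sequence[int],
    fn_weight: float = 5.0,  # assumed cost of predicting "stays" for a churner
    fp_weight: float = 1.0,  # assumed cost of predicting "churns" for a stayer
) -> float:
    """Average weighted misclassification cost per observation (1 = churn)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and equal in length")
    cost = 0.0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 0:
            cost += fn_weight
        elif truth == 0 and pred == 1:
            cost += fp_weight
    return cost / len(y_true)


# Hypothetical labels: one missed churner and one false alarm out of four users.
example_rate = asymmetric_misclassification_rate([1, 1, 0, 0], [0, 1, 1, 0])
```

Under such a measure, a model that flags a few loyal users by mistake can still rank ahead of one that misses churners, which is usually the preferred trade-off for retention campaigns.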
Tongyan Li, Worldpay Finance Analytics Projects, August 2018, (Michael Fry, Tracey Bracke)
Worldpay, Inc. (formerly Vantiv, Inc.) is an American payment processing and technology provider headquartered in the greater Cincinnati area. As a Data Science Analytics Intern, I work directly with the Finance Analytics team on multiple projects. The main purpose of the first project is to automate a process that used to be performed manually across different databases; RStudio was used and substantially reduced the time required to produce flat files for further use. In the second project, the year-over-year (YoY) average ticket price (AVT) growth was analyzed. The Customer Attrition project focuses on the study of customers’ attrition behavior.
Navin Mahto, Generating Text by Training a Recurrent Neural Network on English Literary Experts, August 2018, (Yan Yu, Yichen Qin)
Since the advent of modern computing, we have been trying to make computers learn and respond in a way unique to humans. While we have chatbots that mimic human responses with pre-coded answers, they are not fluid or robust in their responses. In our project, we train a Recurrent Neural Network (RNN) on an English classic, “War and Peace” by Leo Tolstoy, and make it generate sentences similar in nature and structure to the language in the book. The sequential structure of RNNs and their ability to retain previous inputs make them ideal for learning the literary style of the book. On increasing the length of the RNN and the epoch values, the error decreases from a maximum of 2.9 to 2.2, and the generated text more closely resembles the English language.
Non-coherent output: “the soiec and the coned and the coned and the cone”.
Slightly coherent output: “the sage to the countess and the sale to the count”.
Zach Amato, Principal Financial Group: GWIS Portfolio Management Platform, July 2018, (Michael Fry, Jackson Bohn)
The overall goal of the GWIS Portfolio Management Platform project is to help bring together the disparate tools, research, and processes of the PPS boutique into as few locations as possible. Throughout the summer, we have started prototyping the Portfolio Viewer module and putting structure around the Research Module. In doing this, data management and data visualization skills have been used to meet the needs of the project and of the business. Future steps in the project will include completing current work and the modules in process and engaging in the Portfolio Construction and Trading modules. Future work will require data management, data manipulation, statistical testing, and optimization.
Nicholas Charles, Craft Spirits: A Predictive Model, July 2018, (Dungang Liu, Edward Winkofsky)
A new trend driving growth in the spirits industry is craft. Craft spirits are usually produced by small distilleries that use local ingredients. In the US, the spirits industry is structured as a three-tier system with manufacturers, distributors, and retailers. In certain states, the state government controls a portion of the three-tier system. For instance, the State of Iowa controls distribution. The state purchases product from the manufacturers and subsequently sells to private retailers in the state. In doing so, the state tracks all transactions at the store level and makes this data available to the public. This project takes that open data and builds a logistic regression model that can be used to predict the outcome of a transaction as either a craft purchase or a non-craft purchase. This information may prove useful to distilleries and distributors that specialize in craft by helping to pinpoint where their resources should be focused.
Keshav Tyagi, CC Kitchen’s Dashboard, July 2018, (Michael Fry, Harrison Rogers)
I am working as a Business Analyst Co-Op with Project Management Operations Division within the Castellini Group of Companies providing Business solutions to CC Kitchens, which is one of its subsidiaries specializing in Deli and Ready to eat products. The project, which I was assigned to, aims at creating an executive level dashboard for CC Kitchens visualizing important business metrics, which can assist top-level executives in making informed decisions on a day-to-day basis.
My responsibilities include but are not limited to interacting with different sectors within the company to identify the data sources for the above metrics, data cleansing, creating data pipelines, preparing data through SQL stored procedures and creating visuals over them in a tool called DOMO. The data resided in flat files, Excel sheets, emails, ERP and APIs, and I created an automated data flow architecture to collect and load data into a centralized SQL warehouse.
Swidle Remedios, Analysis of Customer Rebook Pattern for Refund Policy Optimization, July 2018, (Michael Fry, Antonio Lannes Filho)
Priceline offers lower rates to its customers on certain deals which are non-cancellable by policy. In order to improve customer satisfaction, certain changes were deployed in June 2015 to make exceptions to these policies and refund the customers. These exceptions are only applied to cancel requests that fall under Priceline’s predefined categories/cancel reasons. In this paper, the orders processed under Priceline’s Cancel and exception policy will be analyzed for two of the top cancel reasons. The goal is to determine if the refunds are successfully driving customer behavior and repurchase habits. The insights obtained from the analysis will help the Customer care team at Priceline redesign and optimize the policies for each of the cancel reasons.
Rashmi Subrahmanya, Analysis of Tracker System Data, July 2018, (Michael Fry, Peter M. Chang)
The Tracker system is an internal Boeing system that records requests from employees working on the floor. Based on the nature of the request, each is directed to the respective department. The Standards organization is responsible for the supply of standards (such as nuts, bolts and rivets) for assembling aircraft. Any request related to standards, such as missing, defective or insufficient standards, is directed to the Standards organization, which then resolves the request. Both the requests themselves and their resolution time directly impact the aircraft assembly process. The project focuses on analysis of tracker request data to identify patterns in the data. Data is analyzed on two key metrics: number of requests and average resolution time of the requests. The top problem type names, areas of the aircraft, and hours and days with the highest number of requests have been identified. This would help in understanding the reasons behind these requests and in taking preventive action, such as increasing staffing at a particular time of day so that requests are resolved more quickly. Two dashboards were developed, one to show the number of active requests and one to show the requests by Integration Center for the 7XX program.
Spandana Pothuri, Data Instrumentation and Significance Testing for a Digital Product in the Entertainment Industry, July 2018, (Dungang Liu, Chin-Yang Hung)
The entertainment technology industry runs on data. Because entertainment is created, unlike the output of a more need-based industry such as agriculture, it is important to see how the receiver uses the end product. Based on the feedback loop, more products are created or existing products are made better. How a user uses an application determines how its next version is built. In this world, clicks translate to dollars, and data is of utmost importance. This paper focuses on the cycle of data in a technological project, starting from instrumentation and tracking to reporting and deriving the business impact of the product. The product featured in this paper is Twitch’s extensions discovery page. The goal of launching this product was to increase the visibility of extensions. This product succeeded, increasing viewership by 37%.
Sourapratim Datta, Product Recommendation: A Hybrid Recommendation Model, July 2018, (Michael Fry, Shreshth Sharma)
This report provides a recommendation of products (movies) to be licensed for an African country channel. The product recommendations are based on its features such as genre, release year, US box office revenue etc. and its performance on other African and worldwide channels. A hybrid recommendation model combining the product features (Content Based Recommendation) and the performance of the products (Collaborative Filtering model) has been developed. For the content-based recommendation, a similarity matrix is calculated based on the user preferences of the market and the most similar products are considered. To calculate the performance of the products that have not been telecast in the African channels, a collaborative filtering model is trained on the known performance indexes. From the predicted performance of the products, the top products are considered. Finally, combining the considered products from both the methods, a final list of products has been recommended by giving equal weightage to both methods.
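The equal-weight combination described above can be sketched as follows. The sketch assumes each method yields one numeric score per title, that scores are min-max normalized before blending, and that a title missing from one method contributes zero for that method. These choices, the function names and the sample scores are hypothetical and are not taken from the report.

```python
from typing import Dict, List


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to the range [0, 1]; constant inputs map to 0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {title: 0.0 for title in scores}
    return {title: (value - lo) / (hi - lo) for title, value in scores.items()}


def hybrid_rank(
    content_scores: Dict[str, float],
    cf_scores: Dict[str, float],
    top_n: int = 3,
    weight: float = 0.5,  # 0.5 gives both methods equal weight
) -> List[str]:
    """Blend content-based and collaborative-filtering scores and rank titles."""
    content = min_max(content_scores)
    collaborative = min_max(cf_scores)
    titles = set(content) | set(collaborative)
    combined = {
        t: weight * content.get(t, 0.0) + (1.0 - weight) * collaborative.get(t, 0.0)
        for t in titles
    }
    # Sort by descending blended score, breaking ties alphabetically.
    return sorted(combined, key=lambda t: (-combined[t], t))[:top_n]


# Hypothetical scores: content similarity and predicted performance index.
example_content = {"A": 0.9, "B": 0.5, "C": 0.1}
example_cf = {"A": 2.0, "B": 4.0, "D": 3.0}
example_top = hybrid_rank(example_content, example_cf)
```

Normalizing first keeps one method from dominating simply because its scores are on a larger scale, so the 0.5 weight reflects an equal say for both methods.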
Akhil Gujja, Hiring the Right Pilots – An Analysis of Candidate Applications, July 2018, (Michael Fry, Steven Dennett)
Employees are an asset to any organization and the key to any firm’s success. For a company to grow, flourish, and succeed, the right people must be hired for the job from the start. Hiring the right personnel is a time-consuming and tedious task for any organization. In the aviation industry especially, where safety and reliability are of utmost importance, hiring the right pilots is critical. Even under ideal conditions, hiring pilots can be an arduous task. It is extremely difficult to predict exactly how well pilots will perform in the cockpit, because a pilot’s future performance cannot be predicted from academic performance or historical flying metrics alone. It depends heavily on non-quantifiable factors. In this project, candidate profiles are analyzed to understand which profiles made it through the selection process and which did not make it through the resume screening process. This analysis can be used by recruiters to rank candidate profiles and expedite the hiring process for top-ranked candidates.
Yang He, Incremental Response Model for Improved Customer Targeting, July 2018, (Dungang Liu, Anisha Banerjee)
Traditional marketing response models score customers based on their likelihood to purchase. However, among potential customers, some would purchase regardless of any marketing incentive, while others would purchase only because of marketing contact. Traditional predictive models can therefore waste money on customers who would shop regardless of marketing offers and on customers who would stop shopping if ‘disturbed’ by marketing offers. The Oriental Trading Company Inc., a company with a catalog heritage, does not want to send catalogs to customers who would purchase naturally, in order to save costs and maximize profit. The main objective of my internship is to distinguish customer groups that need catalogs to shop from customer groups that will shop naturally or will not shop if given a marketing incentive, using an incremental response model in SAS Enterprise Miner. This report presents the basic theory of the incremental response model and shows how the model is applied to an Oriental Trading Company dataset. A combined incremental response model was successfully built using demographic and transactional attributes. The estimated model identified an incremental response of 11.9%, about 1.7 times the baseline incremental response (7.2%). The model was used to predict customers’ purchase behavior in future marketing activity. Additionally, from the modeling outputs, we identified that the overall number of orders, sales of some major products and days since first purchase were the most important factors affecting customers’ response to the mailed catalogs.
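The quantity an incremental response model targets can be illustrated with a simple calculation: the response rate of customers who received the catalog minus the response rate of a comparable group who did not. The sketch below is not the SAS Enterprise Miner procedure used in the report, and the customer counts are hypothetical. Only the 11.9% and 7.2% figures come from the abstract.

```python
def incremental_response(
    treated_responders: int,
    treated_total: int,
    control_responders: int,
    control_total: int,
) -> float:
    """Response rate of the mailed group minus that of the unmailed group."""
    if treated_total <= 0 or control_total <= 0:
        raise ValueError("group sizes must be positive")
    treated_rate = treated_responders / treated_total
    control_rate = control_responders / control_total
    return treated_rate - control_rate


# Hypothetical counts: 300 of 1,000 mailed customers buy, 200 of 1,000 unmailed do.
example_uplift = incremental_response(300, 1000, 200, 1000)

# Ratio of the two incremental responses reported in the abstract.
reported_lift = 0.119 / 0.072
```

The reported ratio works out to roughly 1.65, which rounds to the factor of about 1.7 quoted in the abstract.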
Nandakumar Bakthisaran, Customer Service Analytics – NTT Data, July 2018, (Michael Fry, Praveen Kumar S)
The following work describes the application of data analysis techniques for a healthcare provider. Two tasks are covered. The first is an investigation to identify the root cause of an anomaly in a business process: the average of the scores measuring agent performance on a call exhibited an unusual rise starting in 2018. The chief cause was identified as a deviation from standard procedure by evaluators, and the resulting recommendation was to ensure greater adherence to that procedure. The second task scrutinizes the single scoring benchmark used for different types of incoming calls and how it falls short of measuring performance accurately. An attempt was made to fit probability distributions to the underlying data for each call type to check for any inherent distributions. The conclusion was to employ a type-specific scoring system using point estimates obtained from existing data.
Christopher Uberti, General Motors Energy/Carbon Optimization Group, July 2018, (Michael Fry, Erin Lawrence)
This capstone outlines a strategy for implementing improved statistical metrics for analyzing General Motors (GM) factories. Current GM reporting methods and available data are described, and two methodologies are proposed for improved metrics and dashboards. The first is an analysis of individual HVAC units within a factory (of which each factory has dozens) in order to identify units that may be performing poorly or are not set to the correct modes. The second is a prediction model for overall plant energy usage based on historical data. This would give plant operators a method for comparing current energy usage to past performance while taking into account changes in weather, production, etc. Finally, some potential dashboards are mocked up for use in the energy reporting software.
Anumeha Dwivedi, Sales Segmentation for a New Ameritas Universal Life Insurance Product, July 2018, (Dungang Liu, Trinette James)
The key to great sales for a new product is knowing the right kind of customers (those who are most profitable) for it and deploying your best agents (high-performing salespersons) to them. This project therefore performs customer and agent segmentation for a new Ameritas Universal Life Insurance product that is slated for launch later this year. The segmentation is based on customer, agent, policy and rider data from comparable historical products. It uses demographic, geographic and behavioral attributes that are available directly or could be inferred or sourced externally. The segments developed would not only allow for more effective marketing efforts (better training, better-informed agents and marketing collateral) but also result in better profitability from sales. Sales segmentation was attempted using suitable clustering (unsupervised machine learning) techniques. The results suggest that the clusters of clients in their early sixties and mid-thirties are the most profitable, in that order, and form the major part of the customer base. The 45-55 age band has not been as profitable, with higher coverage amounts for medium premium payments. On the agent side, the most experienced agents (oldest in age and with the longest tenures at Ameritas) have been the most successful at selling UL policies, followed by the youngest group of agents, in their thirties and with the shortest tenures of 2-5 years, while agents with 6-15 years of tenure in the 45-55 age band are more complacent and have had limited sales of these policies.
Kamaldeep Singh, DOTA Player Performance Prediction, July 2018, (Peng Wang, Yichen Qin)
Dota 2 is a free-to-play multiplayer online battle arena (MOBA) video game. Dota 2 is played in matches between two teams of five players, with each team occupying and defending its own separate base on the map. Each of the ten players independently controls a powerful character, known as a "hero" (chosen at the start of the match), with unique abilities and a distinct style of play. During a match, players collect experience points and items for their heroes to successfully battle the opposing team's heroes, who attempt to do the same to them. A team wins by being the first to destroy a large structure located in the opposing team's base, called the "Ancient". The objective of this project is to develop an algorithm that can predict a player’s performance with a specific hero by learning from that player’s performance with other heroes. The response variable used for quantifying performance is the KDA ratio, i.e. the kill-to-death ratio of that user with that hero. The techniques used in this project are Random Forest, Gradient Boosting and the H2O package, which encapsulates various techniques and automates model selection. The data was provided by Analytics Vidhya and is free for anyone to use.
Sarita Maharia, NetJets Dashboard Management, July 2018, (Michael Fry, Stephanie Globus)
Data visualization plays a vital role in exploring data and summarizing analysis results across the organization. The visualizations at NetJets were developed using disparate tools on an as-needed basis without any corporate standards. Once employees began using Tableau as the data visualization tool, it became even more important to have a centralized team to develop the infrastructure, set corporate standards and enforce the required access mechanism. The Center of Excellence for Visual Analytics (CoE-VA) team now serves as the central team that monitors visualization development across the organization. This team requires a one-stop solution to answer queries from the analytics community about dashboard usage and to compare access between users. The NetJets Dashboard Management project was developed as the solution that enables the CoE-VA team to monitor the existing dashboards and access structure. The primary purpose of this project is to present the dump of dashboard usage and access data in a concise and user-friendly framework. Two exploratory dashboards were developed that accept user input and provide the required information visually, with a provision to download the data. The immediate benefit of this project is the time and effort saved for the CoE-VA team: the turnaround time for comparing access between two users is reduced from approximately an hour to a few seconds. The long-term benefit would be to promote a culture of Tableau usage in the organization by tracking dashboard usage and educating end users about under-utilized dashboards.
Mohammed Ajmal, New QCUH Impact Dashboard & Product Performance Dashboard, July 2018, (Michael Fry, Balji Mohanam)
Qubole charges its customers for its services based on their cloud usage. Its current revenue methodology depends on the instance type being used and the Qubole-defined factor (QCUH factor) associated with it. The first project evaluates the impact of a new QCUH factor from a revenue standpoint. A dashboard was also built to enable the sales team to identify customers whose invoices would increase due to the new QCUH factor. The dashboard has functionality that helps the sales team arrive at mutually agreed terms with individual customers with respect to the new QCUH factor. Currently, Qubole does not have a single reporting platform where all the important metrics are tracked. The second project addresses this concern with two dashboards. The first dashboard tracks all critical metrics at the overall, or organization, level. The second dashboard tracks almost all of the same metrics at the individual customer level; the user inputs the customer name to populate the data for that customer. Together, the two dashboards provide a comprehensive overview of the health of Qubole.
Maitrik Sanghavi, Member Churn Prediction & CK Health/Goals Dashboards, July 2018, (Michael Fry, Rucha Fulay)
This document provides information on two key projects executed while interning at Credit Karma. A bootstrapped logistic regression model was created to predict the probability of a Credit Karma member not returning to the platform within 90 days. This model can be used to effectively target active members who are ‘at risk’ of churning and can also serve as a baseline reference for future model improvements. The CK Health and CK Goals dashboards were modified and updated to monitor the company’s key performance indicators and track 2018 goals. These dashboards were created using BigQuery and Looker and are automatically updated daily.
Nitish Ghosal, Producer Behavior Analysis, July 2018, (Dungang Liu, Trinette James)
Ameritas Insurance Corporation Limited works on a B2B marketing model, partnering with agencies, financial advisors, agents and brokers to sell its products and services to the end customer. An agent can choose to sell products for multiple insurance companies, but can be contracted full time with only one insurance company. To incentivize an agent’s affiliation with Ameritas, the company has an Agents Benefits & Rewards Program in place which works on the principle of “greater the agent’s results, greater the rewards”. The aim of my study is to provide a holistic overview of our agents’ behavior and identify their drivers of success through segmentation and agent profiling. To achieve this, data visualizations were created in Tableau to find trends, and clustering was performed in SPSS to segment the agents into groups. The agents could be grouped into five distinct categories: top agents, disability insurance specialists, life insurance specialists, generalists and inactive agents. The analysis revealed that benefits, club qualification, contract type, club level, agency distribution region, persistency rates, home agency state, and AIC (Ameritas Investment Corporation) affiliation are some of the factors that have an impact on an agent’s success and sales revenues.
Krishna Chaitanya Vamaraju, Recommender System on the Movie Lens Data Set, July 2018, (Dungang Liu, Olivier Parent)
Recommendation systems are used in most e-commerce websites to promote products and to up-sell and cross-sell products to new or existing customers based on the historical data available for existing customers. This helps in recommending the right products to customers, thereby increasing sales. The current report summarizes various techniques used for recommendation and compares the models on run time and on the issues concerning each model. The data for the project, the 100k MovieLens ratings data set, was obtained from Kaggle. The goal of the project is to use the MovieLens data in R and build recommendation engines based on different techniques using the recommenderlab package. If deployed into production, this could serve as a system like the one we see on Netflix. For the analysis, cosine similarity is used to compute the similarity between users and items. The recommendation techniques used are user-based collaborative filtering, item-based collaborative filtering, and collaborative techniques based on popularity and randomness. A recommender system using singular value decomposition and K-nearest neighbors is also built. A comparison of the techniques indicates that the popularity-based method gives the highest accuracy as well as good run time; however, this depends on the data set and the stage of recommendation. Finally, the metric on which one wants a recommender system to perform well determines the type of recommender system one should build.
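To make the user-based collaborative filtering step concrete, here is a minimal Python sketch of cosine similarity between users and a similarity-weighted rating prediction. It is an illustration only and not the recommenderlab implementation used in the report (which is in R); the ratings, the function names, and the convention that 0 means "not rated" are assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a user with no ratings is treated as similar to nobody
    return dot / (norm_a * norm_b)


def predict_rating(ratings, user, item):
    """Similarity-weighted average of other users' ratings for `item`.

    Simplifying assumption: 0 means "not rated", and unrated items simply
    contribute zero to the similarity computation.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for other, row in ratings.items():
        if other == user or row[item] == 0:
            continue  # skip the target user and users who have not rated the item
        sim = cosine_similarity(ratings[user], row)
        weighted_sum += sim * row[item]
        weight_total += abs(sim)
    return weighted_sum / weight_total if weight_total else None


# Hypothetical ratings for three users on three movies.
ratings = {
    "u1": [5, 4, 0],
    "u2": [5, 4, 4],
    "u3": [1, 2, 5],
}
pred = predict_rating(ratings, "u1", 2)
print(f"predicted rating of movie 2 for u1: {pred:.2f}")
```

Because u1's tastes align more closely with u2 than with u3, the prediction is pulled toward u2's rating of 4 rather than u3's rating of 5.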
Ananthu Narayan Ambili Jayanandan Nair, Comparing Deep Neural Network and Gradient Boosted Trees for S&P 500 Prediction, July 2018, (Yan Yu, Olivier Parent)
The objective of this project was to build a model to accurately predict the S&P 500 index in the (t+1)st minute using the component values in the tth minute. Two different machine learning techniques, Artificial Neural Networks and Gradient Boosted Trees, were used to build the models. TensorFlow, which makes use of the NVIDIA GPU, was used for training the Neural Network model. H2O, which speeds up the training process through parallelized implementations of algorithms, was used for Gradient Boosted Trees. The models were compared using their Mean Squared Errors, and the Neural Network model was found to be better suited for this application.
Prerit Saxena, Forecasting Demand of Drug XYZ using Advanced Statistical Techniques, July 2018, (Michael Fry, Ning Jia)
Client ABC is a large pharmaceutical company and a client of KMK Consulting Inc. ABC has a diverse portfolio of drugs in various disease areas, and the organization is structured as one division per disease area. The NET team is responsible for ABC’s drugs in the neuroendocrine tumor area, a disease area with a market of about $1.5B globally. ABC’s major drug XYZ has been on the market for a few years and has a major market share. The drug is a “Buy and Bill” drug, which means hospitals buy the drug in advance, stock it, and then bill the payers according to usage. The project shared in this report is the forecasting exercise for drug XYZ. Forecasting was done for three phases: the remainder of 2018, 2019, and 2020-2021. The team uses various forecasting methods such as ARIMA, Holt-Winters and trend analysis in conjunction with business knowledge to prepare forecasts of the number of units of the drug, as well as dollar sales, for the upcoming years.
Manoj Muthamizh Selvan, Donation Prediction Analysis, July 2018, (Andrew Harrison, Rodney Williams)
The UC Foundation Advancement Services team participates in the process of bringing donations to UC in an effective manner. The team has data on all donors collected over the past 12 years and is interested in any findings and insights from the data. The UC Foundation team would like to predict the probability of large future donations and target donors effectively. Hence, the team wants to understand the triggers and factors responsible for donations and the probability that a donor will donate a larger amount (> $10,000). A Random Forest model was used to identify the trigger factors and to predict high donors in the prospect population. The results are being used by the Salesforce team of the UC Foundation to target high donors with better accuracy than heuristic-based models.
Akhilesh Agnihotri, Employee Attrition Analysis, July 2018, (Dungang Liu, Peng Wang)
Human resources plays an important part in developing and building a company or an organization. The HR department is responsible for maintaining and enhancing the organization's human resources by planning, implementing, and evaluating employee relations and human resources policies, programs, and practices. Employers generally consider attrition a loss of valuable employees and talent. However, there is more to attrition than a shrinking workforce. As employees leave an organization, they take with them much-needed skills and qualifications that they developed during their tenure. If the reasons behind employee attrition are identified, the company can create a better working environment for its employees, and if employee attrition can be predicted, the company can take the necessary actions to stop valuable employees from leaving. This report therefore explores the HR data and manipulates it to find meaningful relationships between the response variable (whether an employee left the company or not) and the predictor variables, which provide information about an employee. The report then builds several statistical models that predict the probability of an employee leaving the company given that employee's information, and concludes with the best-performing model.
Amoul Singhi, Identifying Factors that Distinguishes High Growth Customers, July 2018, (Dungang Liu, Lingchan Guo)
If a bank is able to identify customers who have the potential to spend more next year than they spent this year, it can market better products to them and increase customer satisfaction along with its profits. The aim of this analysis is to identify the set of customer features that can distinguish high-growth customers from others. The data collected for the analysis was divided into two categories, transactional and demographic. Data exploration identified some factors that behave differently in the two groups. A linear regression model was then built with the percentage increase in spend from 2016 to 2017 as the target variable to identify the factors that are statistically important in determining the growth of a cardholder. A logistic regression model was built by classifying accounts with more than 25% growth as high-growth customers. This was followed by a tree model and a Random Forest model to improve the model's performance. Although some of the variables are statistically significant, their coefficients are very low, implying that they matter in determining the growth of an account but their impact is small. Two transactional variables identified by the Random Forest can help determine whether a customer is high growth, but the accuracy of the model is quite low. Overall, certain factors are identified as important, but it is very difficult to predict whether a cardholder is going to spend more in upcoming years.
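The target construction described above can be sketched in a few lines of Python. This is an illustration under assumed inputs, not the author's code: the function names and the spend figures are hypothetical, and only the 25% threshold and the 2016-to-2017 comparison come from the abstract.

```python
def pct_growth(spend_prev, spend_next):
    """Percentage change in spend from one year to the next."""
    if spend_prev <= 0:
        raise ValueError("previous-year spend must be positive")
    return (spend_next - spend_prev) * 100 / spend_prev


def is_high_growth(spend_prev, spend_next, threshold_pct=25.0):
    """Binary label for the logistic model: strictly more than 25% growth."""
    return pct_growth(spend_prev, spend_next) > threshold_pct


# Hypothetical (2016 spend, 2017 spend) pairs for three accounts.
accounts = [(1000.0, 1300.0), (1000.0, 1200.0), (800.0, 1000.0)]
labels = [is_high_growth(prev, nxt) for prev, nxt in accounts]
print(labels)
```

The continuous `pct_growth` value corresponds to the linear regression target, while the thresholded label corresponds to the classification target; an account at exactly 25% growth falls on the "not high growth" side under this reading.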
Sudaksh Singh, Path Cost Optimization: Tech Support Process Improvement, July 2018, (Michael Fry, Rahul Sainath)
The objective of this project is to optimize the process of diagnosing and resolving issues with various products faced by the customers of one of the world’s largest technology companies and addressed by its tech support agents. The organization’s tech support agents use multiple answer flow trees, which are tree-structured question-and-answer graphs used for diagnosing issues in customer products. The answer flow trees are optimized by studying the historical performance of the issue diagnosis and resolution activity carried out by various agents using these trees. Performance of the trees is measured across a collection of metrics defined to capture the speed and accuracy of issue diagnosis and resolution. Based on the analysis, recommendations are made to reorder or prune the answer flow trees to achieve better performance across these metrics. These measures and the resulting recommendations for editing the answer flow trees will serve as a starting point for more advanced, holistic techniques to design algorithms that generate new answer flow trees with the best performance across all metrics while respecting the constraints that limit the reordering and pruning of the answer flow graph.
Max Clark, HEAT Group Customer Lead Scoring, July 2018, (Michael Fry, Maylem Gonzalez)
The HEAT Group is responsible for all events taking place at the American Airlines Arena, such as NBA basketball games, concerts and other performances. While offering a winning and popular product will yield high demand, the HEAT Group must employ analytical methods to smartly target customers who otherwise would not attend events. The purpose of this project was to determine the differences between the populations of the HEAT Group’s two main customer groups, premium and standard customers. Furthermore, a machine learning model was implemented to, first, create a scoring method for assessing which customers the HEAT Group has the highest probability of converting from standard to premium and, second, determine which features have the largest impact. The analysis found that age and financial status are the largest and most important differentiators between the two population groups.
Mohit Deshpande, Wine Review Analysis, July 2018, (Yichen Qin, Peng Wang)
Analyzing structured data is simpler than analyzing unstructured data because observations in structured data are arranged in a specific format suitable for applying analytical techniques. The internal structure of unstructured data (audio, video, text etc.), on the other hand, does not adhere to any format. Unstructured data generation is at an all-time high, and understanding methods to analyze such data is the need of the hour. This project aims to study and implement one such technique, Text Analytics, which is used to examine textual data. The initial part of the project performs Exploratory Data Analysis on a dataset containing wine reviews to discover hidden patterns in the data. The latter part focuses on analyzing the text-heavy variables by performing basic text, word frequency and sentiment analysis.
Devanshu Awasthi, Analysis of Key Sales-Related Metrics through Dashboards, July 2018, (Michael Fry, Jayvanth Ishwaran)
Visibility is key to running a business at any scale. Organizations have a constant need to assess where they stand day-in and day-out and where they can improve. With the advancements in tools and technologies that help capture large chunks of operational and business data at even shorter intervals, companies have started to explore methods of using this data in a better way to get insights more frequently. One step in this direction at NetJets is to move away from the traditional methods of using static systems for business reporting. Descriptive analytics using advanced techniques of data management and data visualization is used to create dashboards which can be shared across the organization with different stakeholders. Dashboards prove to be extremely useful in analysis as they show the trends for different metrics over the month, and help us dig down deeper through the multiple layers of information. This project involved transforming the daily reporting mechanism for the sales team through dashboards for three large categories – cards business, gross sales and net sales. Each category has important metrics the business users are concerned with. On any day, these dashboards help analyze the time-series data associated with these categories and assess how the business has fared on certain metrics for the month, identify anomalies and get a comprehensive view of the expected sales for the rest of the month.
Ayank Gupta, Predicting Hospital-Acquired Pressure Injuries, July 2018, (Michael Fry, Susan Kayser)
A pressure injury is defined by NPUAP as "localized damage to the skin and/or underlying soft tissue usually over a bony prominence or related to a medical or another device. The injury can present as intact skin or an open ulcer and may be painful. The injury occurs as a result of intense and/or prolonged pressure or pressure in combination with shear. The tolerance of soft tissue for pressure and shear may also be affected by microclimate, nutrition, perfusion, co-morbidities, and condition of the soft tissue." Identifying the factors responsible for a pressure injury can be very difficult and is of vital importance to hospital bed manufacturers. It is crucial to identify the type of pressure injury a patient might acquire in a hospital and educate the nurses to take proactive measures instead of preventive measures. The objective is to predict whether a patient will have a hospital-acquired pressure injury given various demographic information about the patient, information about the wound, and information about the hospital.
Renganathan Lalgudi Venkatesan, Detection of Data Integrity in Streaming Sensor Data, July 2018, (Michael Fry, Netsanet Michael)
The Advanced Manufacturing Analytics team wants to identify whether there are any data integrity issues in the streaming sensor data gathered from the manufacturing floors. The infrastructure for asset tracking has been set up in several phases over time. Each site has sensors that capture the spatiotemporal information of all tagged assets, thereby giving real-time information on the whereabouts of each asset. Depending on their purpose, several types of these sensors are positioned in different locations within a given plant. This project aims to examine the integrity of the streaming data, monitor the health of the flow, and detect and label the time frames of historical disruptions. Also, since the streaming data infrastructure is in its early stages, a goal of the analysis is to identify shortcomings and explore the improvements that will be required for future production-critical processes. Several methods are proposed to address current and recent streaming data issues by detecting disruptions in the data feed using historical data. This would help in catching disruptions in time and thus support real-time site operation decisions. The infrastructure also experiences periods of data loss as well as high volumes of one-time data (referred to as outliers in the data). The project proposes methods to detect and quantify the data losses and to detect outliers in the sensor data based on the operating characteristics and other factors at the site.
Aksshit Vijayvergia, Predict Default Capability of a Loan Applicant, July 2018, (Yichen Qin, Ed Winkofsky)
Many people struggle to get loans due to insufficient or non-existent credit histories, and, unfortunately, this population is often taken advantage of by untrustworthy lenders. Borrowing from financial institutions is a difficult affair for this segment of people. Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. To make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data (including telco and transactional information) to predict its clients' repayment abilities. For the capstone project, I will be digging deep into the dataset provided by Home Credit on the analytics website Kaggle. To classify whether an applicant will default, I will be analyzing and munging two datasets. The first dataset is extracted from the application filed by a client in Home Credit’s portal. The second dataset contains a client’s historical borrowing information.
Sylvia Lease, Analytics & Communication: Leveraging Data to Make Connections, Summarize Results, & Provide Meaningful Insights, July 2018, (Mike Fry, Steve Rambeck)
When I began my time as Ever-Green Energy’s Business Analyst Intern, a defined project goal was established: to create a variety of reports for both client and internal use. Armed with newly developed skills in coding, data visualization, and data management, I quickly realized that these skills would serve as tools for an overarching, more imperative goal: to communicate effectively. Over several weeks, the opportunity to merge data with communication took a variety of forms. In the beginning, discussions with various leaders and groups within the company led to an understanding of how analytics could further the company’s mission. This in turn led to a recognition of how analytics could help bridge a gap between the IT and Business Development groups to create reports that helped the teams serve clients by answering their key questions and interests. Ultimately, communication was key to the success of each polished and carefully designed report, measured by whether the report provided useful insights and summaries of data in a clear and efficient manner.
Anitha Vallikunnel, Product Reorder Predictions, July 2018, (Dungang Liu, Yichen Qin)
Uncovering shopping patterns helps retail chains design better promotional offers, forecast demand, and optimize brick-and-mortar store aisles; in short, it helps with everything needed to build a better experience for the customer. In this project, using Instacart’s open-sourced transactional data, I have identified and predicted the items that are ordered together. The Apriori algorithm and association rules are used for market basket analysis to achieve this. Using feature engineering and gradient boosted tree models, the reordering of items is also predicted. This will help retailers forecast demand and identify the items that will be ordered more frequently. The F1 score is used as the metric for measuring prediction accuracy for reordering. The F1 score was 0.772 on the training sample and 0.752 out of sample.
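The association rules mentioned above are usually judged by support, confidence and lift. The following Python sketch computes these three measures for a single rule on a toy set of baskets; it is not the Apriori implementation used in the project, and the baskets and function names are invented for illustration.

```python
def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(1 for basket in baskets if items <= basket) / len(baskets)


def rule_metrics(baskets, antecedent, consequent):
    """Support, confidence and lift for the rule antecedent -> consequent."""
    s_a = support(baskets, antecedent)
    s_c = support(baskets, consequent)
    s_both = support(baskets, set(antecedent) | set(consequent))
    confidence = s_both / s_a if s_a else 0.0
    lift = confidence / s_c if s_c else 0.0
    return {"support": s_both, "confidence": confidence, "lift": lift}


# Hypothetical baskets, not taken from the Instacart data.
baskets = [
    {"bananas", "milk"},
    {"bananas", "milk", "bread"},
    {"bananas", "bread"},
    {"milk"},
]
metrics = rule_metrics(baskets, {"bananas"}, {"milk"})
print(metrics)
```

A lift above 1 suggests the two items appear together more often than independence would predict; in this toy data the lift is below 1, so bananas do not make milk more likely.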
Ananya Dutta, Trip Type Classification, July 2018, (Dungang Liu, Peng Wang)
Walmart is the world’s largest retailer, and improving the customer’s shopping experience is a major goal for the company. In this project we try to recreate Walmart's existing trip type classification with a limited number of features. This categorization of trip types will help Walmart improve the customer’s experience through personalized targeting and can also help identify business softness related to specific trip types. The data contains over 600k rows of purchase data at the scan level. After rigorous feature engineering and model comparisons, we found that the results using an Extreme Gradient Boosting model are promising, with an accuracy of ~90% on training data and ~70% on testing data. Looking at variable importance, total number of units, total number of, count of UPCs, coefficient of variation of percentage contributions across departments, and items sold in departments like financial services, produce, menswear and DSD Grocery were found to be important in building this classifier.
Nitha Benny, Recommendation Engine, July 2018, (Dungang Liu, Yichen Qin)
Recommendation engines are widely used today across e-commerce, video streaming, movie recommendations etc., and they are how each of these businesses maintains its edge in the highly competitive online business world. The idea behind recommendation engines is itself intriguing, and this project aims to compare collaborative filtering techniques to better understand how recommendation engines work. The two main types of collaborative filtering, user-based and item-based, are used here. Both models are built, and the MSE and MAE values are calculated for each. The models are then evaluated using ROC curves and precision-recall plots for different numbers of recommendations. We find that the user-based collaborative filtering method using the cosine similarity function works best, giving a lower MSE value of 1.064 as well as a better area under the curve and better precision-recall curves. Hence, the user-based collaborative filtering method will help businesses recommend better products to their customers and thus improve their customer experience.
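For reference, the two error measures used to compare the models can be computed as follows. This is a minimal sketch with made-up ratings; the function names and the sample values are assumptions and do not reproduce the 1.064 MSE reported above.

```python
def mse(actual, predicted):
    """Mean squared error: average of squared prediction errors."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mae(actual, predicted):
    """Mean absolute error: average of absolute prediction errors."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical held-out ratings and model predictions.
actual = [4, 3, 5, 2]
predicted = [3.5, 3, 4, 3]
print("MSE:", mse(actual, predicted))
print("MAE:", mae(actual, predicted))
```

MSE penalizes large misses more heavily than MAE, so two models can rank differently on the two measures when one of them makes a few large errors.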
Nirupam Sharma, UC Clermont Data Analysis and Visualization, July 2018, (Mike Fry, Susan Riley)
For my summer internship, I worked as a Graduate Student at UC Clermont College in Batavia, Ohio, for the Office of Institutional Effectiveness from May 2018 to July 2018. My responsibilities were to build an R analytical engine to perform data analysis and to design Tableau dashboards highlighting key university insights. The data used in the analysis consisted of tables describing enrollments, courses, employees, accounts and sections for different semesters. The analytical engine was written in R to connect to the data, combine data tables, and perform SQL and descriptive analysis to draw inferences about trends across years. The results of the analysis were used to build dashboards in Tableau. My responsibilities for the Tableau work were to create new calculated fields, parameters and dynamic actions, and to use other advanced Tableau features learned during my master's at UC to build charts and dashboards to be uploaded to the UC Clermont website. The analytical engine I built allows the college to perform data pipeline tasks quickly and with little human input, saving the college considerable time and resources. The dashboards help the college better understand trends in the data and make recommendations to management. My internship allowed me to hone my R and Tableau skills. I learned to use many advanced R packages, and my ability to write quality code increased significantly. My experience at UC Clermont College will allow me to work more professionally and effectively in my future job.
Scott Fueston, Preventing and Predicting Crash-related Injuries, July 2018, (Yan Yu, Craig Zielazny)
This study aims to identify influential factors that elevate a motorist’s risk of sustaining a serious or fatal injury during a motor vehicle crash. Addressing these factors could potentially save lives, prevent long-term pain and suffering, and avert liabilities and monetary damages. Using population comparisons through exploratory data analysis and predictive modeling, contributory factors to devastating injuries have been identified. These include: lack of restraint use, deployment of an air bag, crashing into a fixed object, crashing head-on, a roadside collision, nighttime driving, a car as the vehicle type, speeding, a rollover, first impact at the front-left corner, disabling damage to the vehicle, and alcohol involvement. This information could be invaluable to policy designers, regulatory agencies, car manufacturers, and consumers in developing clear communications and advocacy for prevention, proposing and implementing effective policies and laws, guiding the design and manufacture of future automobiles, and raising the general public’s awareness of risk factors.
Amit Kumar Mishra, Customer Churn Prediction, July 2018, (Yan Yu, Yichen Qin)
Churn rate is defined as the number of customers who discontinue their subscription with an organization. It is an important component of an organization's profitability because it indicates the revenue lost. Additionally, an organization that understands the factors responsible for customer churn can allocate its resources to those factors and develop a customer retention program. Given the significance of customer churn, telecommunication customer data was obtained from the IBM repository and explored to find the factors responsible for churn. Various machine learning techniques, including logistic regression (with probit, logit and cloglog link functions and different variable selection procedures), tree, random forest, support vector machine and gradient boosting, were used to predict customer churn, and the best model was identified in terms of in-sample and out-of-sample performance. Tenure, contract, internet service, monthly charges and payment method were found to be the most important variables for predicting customer churn in the telecommunication industry. Among all the classification techniques, the support vector machine with a radial basis function (RBF) kernel performed best in terms of accuracy, with 80.10% of the data classified correctly.
Pranil Deone, Default of Credit Card Clients Dataset, July 2018, (Peng Wang, Liwei Chen)
This project uses the Default of Credit Card Clients data set. The main objective is to build a credit risk model that accurately identifies the customers who will default on their credit card bill payment in the next month. The model is based on the credit history of the customers, which includes their limit balance, previous month’s payment status and previous month’s bill amount. Various demographic factors such as age, sex, education and marital status have also been considered in building the model. The data set contains 30000 observations and 25 variables. Some preprocessing is done on the data to prepare it for analysis and modeling. Quantitative and categorical variables are identified and separated for appropriate exploratory data analysis. Data modeling techniques such as generalized logistic regression, stepwise variable selection, LASSO regression and Gradient Boosting Machine have been used to build different credit risk models. Model performance is evaluated on the training and the testing data, using criteria such as misclassification rate and AUC to compare the models and select the best one.
Hemang Goswami, Ames Housing Dataset, July 2018, (Dungang Liu, Yichen Qin)
Residential real estate prices are fascinating… and frustrating at the same time. None of the parties involved in the house buying process (the homebuyer, the home seller, the real estate agent, the banker) can point out the factors affecting house pricing with total conviction. This project explores the Ames Housing Dataset, which contains information on the residential property sales that occurred in Ames, Iowa from 2006 to 2010. The dataset has 2930 observations with 80 features describing the state of the property, including our variable of interest: Sale Price. After creating 10 statistical models ranging from a basic linear regression model to highly complex models such as Gradient Boosting and Neural Network, we were able to predict house prices with an MSE as low as 0.015. In the process, we found that the overall quality of the house, exterior condition, area of the first floor and neighborhood were some of the key features affecting prices.
Ameya Jamgade, Breast Cancer Wisconsin Prediction, July 2018, (Yan Yu, Yichen Qin)
Breast cancer is a cancer that develops from the breast tissue. Certain changes in the DNA (mutations) result in uncontrolled growth of the cells, eventually leading to cancer. Breast cancer is one of the most common types of cancer in women in the United States, ranking second among cancer deaths. This project analyzes data on women residing in the state of Wisconsin, USA by applying data mining techniques to classify whether a tumor mass is benign or malignant. Data for this project was obtained from the UCI Machine Learning repository and contains information on 569 women across 32 different attributes. Data cleaning and exploratory data analysis procedures were performed to prepare the data set and summarize its main characteristics. The data was partitioned into training and test sets with an 80%/20% split, and data mining algorithms such as k-nearest neighbor, random forest and support vector machine were used to classify the diagnosis (the Y variable) as malignant or benign. The optimal value of K for the k-nearest neighbor classifier is 11, which gives 98.23% accuracy. The tuned random forest model has an error rate of 3.87% and identified the top 5 predictor variables. The tuned SVM model gives an accuracy of 98.68% and 95.58% on the training and test data respectively. The findings of this project can be used by the health-care community to perform additional research on these attributes to help reduce the pervasiveness of breast cancer.
Sai Uday Kumar Appalla, Predicting the Health of Babies Using Cardiotocograms, July 2018, (Yan Yu, Yichen Qin)
The aim of this research is to predict the health of a baby based on different diagnostic features observed in cardiotocograms. The data was collected from the UCI Machine Learning repository. Different machine learning algorithms were built to understand which factors have a significant influence on the baby’s health and to predict the health state of the baby based on these factors with the best possible accuracy. Initially, basic classifiers like K-nearest neighbours and Decision Trees are used to make predictions. These algorithms have higher interpretability, and they help us understand the significance of different variables in the analysis. In the later parts of the analysis, complex classifiers like Random Forest, Gradient Boosting and Neural Networks are used to boost the accuracy of the predictions. Finally, after comparing all the model metrics, the Gradient Boosting tree is selected as the best model because it has better metrics than any of the other models.
Piyush Verma, Building a Music Recommendation System Using Information Retrieval Technique, July 2018, (Peng Wang, Yichen Qin)
Streaming music has become one of the top sources of entertainment for millennials. Because of globalization, people all around the world are now able to access different kinds of music. The global recorded music industry was worth $15.7 billion and growing at 6% as of 2016. Digital music is responsible for driving 50% of those sales. There are 112 million paid subscribers for the streaming business and roughly 250 million users in total, if we include those who don’t pay. Thus, it becomes very important for streaming service providers like YouTube, Spotify and Pandora to continuously improve their service to users. Recommendation systems are an information retrieval technique used to predict the rating or popularity a user would assign to an item. In this project I explore a number of methods to predict users' ratings for different artists using GroupLens’s Last.FM dataset.
Poorvi Deshpande, Sales Excellence, July 2018, (Yichen Qin, Ed Winkofsky)
One of the services that a bank offers is to provide loans. The process by which the bank decides whether an applicant should receive a loan is called underwriting. An effective underwriting and loan approval process is a key precursor to favorable portfolio quality, and a main task of the function is to avoid as many undue risks as possible. When undertaken with well-defined principles, this process enables the lender to ensure good credit quality. This project addresses a problem faced by the digital arm of a bank. The primary aim of this division is to increase customer acquisition through digital channels. The division sources leads through various channels such as search, display, email campaigns and affiliate partners. As expected, it sees differential conversion depending on the sources and the quality of these leads. Consequently, it now wants to identify the lead segments with a higher conversion ratio (lead to buying a product) so that it can specifically target these potential customers through additional channels and re-marketing. The division has provided a partial data set for salaried customers from the last 3 months, which also captures basic details about the customers. We need to identify the segment of customers with a high probability of conversion in the next 30 days.
Jatin Saini, An Analysis of Identifying Diseased Trees in Quickbird Imagery, July 2018, (Yan Yu, Edward P Winkofsky)
Machine learning algorithms are used widely to identify patterns in data. One of their applications is identifying diseased trees from Quickbird imagery. In this project, we apply logistic regression, LASSO and Classification Tree (CART) models to imagery data to identify significant variables. We designed this study to create training and testing datasets and compared the area under the curve (AUC) across models. Logistic regression gave an AUC of 0.97 on both the training and testing datasets, whereas CART gave an AUC of 0.92 on the testing data and 0.91 on the training data. After examining the accuracy of the different algorithms, we conclude that logistic regression gave more accurate results on both the training and testing data.
Raghu Kannuri, Recommender System Using Collaborative Filtering and Matrix Factorization, July 2018, (Peng Wang, Yichen Qin)
This project aims to develop a recommender system using various machine learning techniques. A recommender system helps develop a customized list of recommendations for every user and thus acts as a virtual salesman. It predicts missing user-product ratings by drawing information from the user's past product ratings or buying history and from ratings by similar users. Content-based filtering, knowledge-based filtering, collaborative filtering and hybrid filtering are the widely used recommender system techniques. This project deals with Item-Based (IBCF) and User-Based (UBCF) collaborative filtering with different similarity metrics, and with Matrix Factorization using Alternating Least Squares. Matrix Factorization outperformed UBCF and IBCF on all evaluation metrics, including precision, recall and AUC.
Madhava Chandra, Analysis on Loan Default Prediction, July 2018, (Yichen Qin, Peng Wang)
The purpose of this study was to determine what constitutes risky lending. Each line item in the data corresponded to a loan and had various features relating to the loan amount, employment information of the borrower, payments made, and the classification of the loan as charged off or active, with any delays in payments noted. An exploratory data analysis was performed on the data to look for outliers and examine the individual distributions of the variables. The interactions between these variables were then studied to weed out highly correlated variables. Owing to the low representation of defaults in the sample, this was treated as an imbalanced class problem, for which traditional random sampling would not yield optimal results. To overcome this problem, stratified sampling, random under- and over-sampling, SMOTE and ADASYN methodologies were explored.
Logistic regression models were trained and tested under each of the above sampling methodologies to choose the sampling procedure for this exercise, and SMOTE was found to give the best results. To best classify which loans in the given dataset would likely default, various statistical learning techniques were employed, such as regression, tree-based methods (standalone, boosting and bagging ensemble methods), Support Vector Machines and Neural Networks. Among these classifiers, gradient boosting was observed to have the best performance, although with further fine tuning, Deep Neural Networks could possibly classify better.
Samrudh Keshava Kumar, Analytical Analysis of Marketing Campaign Using Data Mining Techniques, July 2018, (Dungang Liu, Peng Wang)
Marketing products is an expensive investment for a company, and spending money to market to customers who might not be interested in the product is inefficient. This project aims to determine and understand the various factors that influence a customer’s response to a marketing campaign. This will help the company design targeted marketing campaigns to cross-sell products. Predictive models such as logistic regression, Random Forests and Gradient Boosted trees were built to predict the response of each customer to the campaign based on various characteristics of the customer. The factors affecting the response were determined to be employment status, income, type of renew offer, months since policy inception and months since last claim. The models were validated using a test set, and the best accuracy was achieved by the Random Forests model, which has an AUC of 0.995 and a misclassification rate of 1.3%.
Rohit Bhaya, Lending Club Data – Assessing Factors to Determine Loan Defaults, July 2018, (Yan Yu, Peng Wang)
Lending Club is an online peer-to-peer platform that connects loan customers with potential investors. Loan applicants can borrow between $1,000 and $40,000, and investors can choose the loan products they want to invest in and make a profit. The loan data was available on Kaggle and contains applicant information for loans originated between 2007 and 2015. Using the available information for applicants who have already paid off their loans, various machine learning algorithms are built to estimate a customer’s propensity to default. The step AIC approach for logistic regression had the best performance among all the models tested. This final model was then used to build an applicant default scorecard with a range between 300 and 900. A higher score indicates a higher propensity for an applicant to default. The scorecard performed well on both the training data and the test data. It was then applied to the active customer base to score each applicant’s propensity to default. From the distribution of this score, it was observed that most of the active loan customers fall into the low-risk category. For higher-score applicants, management can prepare preventive strategies to avoid future losses.
Nitin Sharma, A Study of Factors Influencing Smoking Behavior, July 2018, (Dungang Liu, Liwei Chen)
In this study, statistical analysis is performed to understand the factors that influence smoking habits. The data used in this study was obtained from a survey conducted in Slovakia on participants aged 15-30 years. The dataset is publicly available on the Kaggle website. The survey data includes information about the “Smoking habits” of the participants. This is the variable of interest, a categorical variable with the values: Never smoked, Tried smoking, Former smoker and Current smoker. The goal of this study is to find out which factors influence the smoking habit. Machine learning techniques (logistic regression, ensemble methods) are used to predict whether an individual is a current/past smoker or someone who has never smoked. The best model selected in this study provides an overall accuracy of 83% on the test sample. The results of this study apply only to the Slovakian population aged 15-30 years and cannot be generalized to a different population.
Yiyin Li, Foreclosure in Cincinnati Neighborhoods, July 2018, (Yan Yu, Charles Sox)
The main purpose of this paper is to analyze which factors are likely to affect foreclosure in Cincinnati neighborhoods and to build a model to predict whether a property will be foreclosed by banks. The dataset analyzed lists all real estate transactions in Cincinnati from 1998 to 2009. After a brief description of the project background and data, an exploratory data analysis is provided, which mainly includes a basic analysis of each individual variable, the correlation statistics between variables and basic information on 47 Cincinnati neighborhoods. Then, 10 different types of models and a model comparison are provided in the modeling section in order to find the best model for predicting foreclosure. In conclusion, a property’s sale price, building and land value, the year of sale and that year’s mortgage rate, and the median family income are the most influential variables, and the gradient boosting model is the best model for predicting foreclosure.
Adrián Vallés Iñarrea, Predicting Customer Satisfaction and Dealing with Imbalanced Data, July 2018, (Dungang Liu, Shaobo Li)
From frontline support teams to C-suites, customer satisfaction is a key measure of success. Unhappy customers don't stick around. In this paper, we compare logistic regression, classification tree, random forest and extreme gradient boosting models for predicting whether a customer is satisfied or dissatisfied with their banking experience. Doing so would allow banks to take proactive steps to improve a customer's happiness before it's too late. The dataset was published on Kaggle by Santander Bank, a Spanish banking group with operations across Europe, South America, North America and Asia. It is composed of 76020 observations and 371 variables that have been semi-anonymized to protect the clients’ information. 96.05% of the customers are satisfied and only 3.95% are dissatisfied, which makes this classification problem highly imbalanced. Since most of the commonly used classification algorithms do not work well for imbalanced problems, we also compare two ways to deal with imbalanced data classification. One is based on cost-sensitive learning, and the other is based on a sampling technique. Both methods are shown to improve the prediction accuracy of the minority class and to perform favorably compared to the existing algorithms.
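The two families of remedies named above can be illustrated with a short sketch. The abstract does not say which cost-sensitive scheme or which sampling technique was used, so the inverse-frequency class weights and the random oversampling shown here are assumptions chosen for simplicity, and the function names and toy data are hypothetical.

```python
import random
from collections import Counter

def class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count).

    A cost-sensitive learner can use these so that errors on the rare
    class are penalized more heavily than errors on the common class.
    """
    counts = Counter(labels)
    n = len(labels)
    k = len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, cnt in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - cnt):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels

# Hypothetical toy data with roughly the imbalance described above (96% vs 4%).
labels = [0] * 96 + [1] * 4
rows = [[i] for i in range(100)]

weights = class_weights(labels)
balanced_rows, balanced_labels = random_oversample(rows, labels)
print(weights)
print(Counter(balanced_labels))
```

Oversampling should be applied to the training split only, so that duplicated minority rows do not leak into the data used for evaluation.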
Guansheng Liu, Development of Statistical Models for Pneumocystis Infection, July 2018, (Peng Wang, Liwei Chen)
The yeast-like fungi Pneumocystis reside in lung alveoli and can cause a lethal infection known as Pneumocystis pneumonia (PCP) in hosts with impaired immune systems. Current therapies for PCP suffer from significant treatment failures and a multitude of serious side effects. Novel therapeutic approaches, such as newly developed drugs are needed to treat this potentially lethal opportunistic infection. In this study, I built a simplified two-stage model for Pneumocystis growth and determined how different parameters control the levels of Trophic and Cyst forms of the organism by employing machine learning methods including multivariate linear regression model, partial least squares, regression tree, random forest and gradient boosting machine. It was discovered that parameters of K_sTro (replication rate of Trophic form), K_dTro (degradation rate of Trophic form) and K_TC (transformation rate from Trophic form to Cyst form) play predominant roles in controlling the growth of Pneumocystis. This study is of great clinical significance, as the extracted statistical trends on the dynamic changes of the Pneumocystis will guide the design of novel and effective treatments for controlling the growth of Pneumocystis and PCP therapy.
Vignesh Arasu, Major League Baseball Analytics: What Contributes Most to Winning, July 2018, (Yan Yu, Matthew Risley)
Big data and analytics have been a growing force in Major League Baseball. The Moneyball principle emphasizes two statistics, on-base percentage and slugging (Total Bases/Number of At Bats), as the core of building winning franchises. This report analyzes data from all teams from 1962 to 2012 using multiple linear regression, logistic regression, regression and classification trees, generalized additive models, linear discriminant analysis, and k-means clustering. Models were built for the number of wins by a team (the linear regression response variable) and for whether or not a team makes the postseason (the logistic regression response variable). The results show that runs scored, runs given up, on-base percentage, and slugging have strong effects on wins and on making the playoffs. The best in-sample supervised classification models all show strong results, with AUC values over 0.90, while the unsupervised k-means clustering technique showed that the data can be effectively grouped into 3 clusters. Together, the supervised and unsupervised results show that a variety of statistical techniques can be used to analyze baseball data.
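As a small worked example of the two statistics named above, the sketch below computes slugging as Total Bases divided by At Bats, as defined in the abstract. The on-base percentage formula is the standard definition and is not given in the abstract, and the function names and season totals are made up for illustration.

```python
def slugging(singles, doubles, triples, home_runs, at_bats):
    """Slugging = Total Bases / At Bats, with bases weighted 1, 2, 3, 4."""
    total_bases = singles + 2 * doubles + 3 * triples + 4 * home_runs
    return total_bases / at_bats

def on_base_percentage(hits, walks, hit_by_pitch, at_bats, sacrifice_flies):
    """Standard OBP: (H + BB + HBP) / (AB + BB + HBP + SF)."""
    reached = hits + walks + hit_by_pitch
    opportunities = at_bats + walks + hit_by_pitch + sacrifice_flies
    return reached / opportunities

# Hypothetical season line: 100 singles, 30 doubles, 5 triples, 20 home runs
# in 500 at bats, plus 60 walks, 5 hit-by-pitch and 5 sacrifice flies.
slg = slugging(100, 30, 5, 20, 500)
obp = on_base_percentage(155, 60, 5, 500, 5)
print(round(slg, 3), round(obp, 3))
```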
Preethi Jayaram Jayaraman, Prediction of Kickstarter Project Success, July 2018, (Yichen Qin, Bradley Boehmke)
Kickstarter is an American public-benefit corporation that uses crowdsourcing to bring creative projects to life. As a crowdfunding platform, Kickstarter promotes projects across multiple categories such as film, music, comics, journalism, games and technology, among others. In this project, the Kickstarter Projects Database was analyzed and explored in detail. The patterns identified in the data exploration stage were used as inputs for predictive modeling. Classification models such as Logistic Regression and Classification Trees were built to classify the Kickstarter projects. Performance across the two models was compared on the validation set (a hold-out set of 20% of the data) using accuracy, sensitivity and AUC as the performance criteria. ROC curves were also plotted for both models. The Logistic Regression model was chosen as the best model for the Kickstarter project classification, with an accuracy of 0.9996 and an AUC of 0.9999. The performance of this model was then evaluated on the test data to conclude the classification problem, where it classified the Kickstarter projects with an accuracy of 99.96%. The analysis was further extended to include projects in the states ‘Suspended’, ‘Live’, ‘Undefined’ and ‘Canceled’, recoded as ‘Failed’. Building Logistic Regression and Classification Tree models on this extended data again resulted in Logistic Regression as the best model, with a classification accuracy of 0.9656 on the test data.
Rohit Pradeep Jain, Image Classification: Identifying the object in a picture, July 2018, (Yichen Qin, Liwei Chen)
The objective of this project was to classify images of fashion objects (like T-shirts, sneakers, etc.) based on the pixel information contained in these pictures. Each image was classified into one of the 10 available classes of fashion objects using different modeling techniques, and a final model was chosen based on its accuracy on the cross-validation dataset. The final model was then tested on the untouched testing dataset to validate its out-of-sample accuracy. The project serves as a benchmark for more advanced studies in the image classification field and supports technologies like stock photography.
Priyanka Singh, Mobile Price Prediction, July 2018, (Peng Wang, Liwei Chen)
The aim of this study is to classify the prices of mobile devices into ranges from 0 to 3, with a higher number denoting higher prices. The dataset has a total of 2000 observations and 21 variables. The response variable, the price range, is to be predicted with the highest accuracy possible. The analysis starts with exploratory data analysis, followed by the construction of machine learning models. The exploratory data analysis revealed that the categorical variables weren’t significant in determining the price of the devices. The numeric variables battery power and RAM had a considerable impact on the prices. Classification tree, random forest, support vector machine and gradient boosting machine models were used to predict the price of the phones. The support vector machine model was chosen as our final model because it gave the lowest misclassification rate of 0.08 and the highest area under the curve (AUC) value of 0.97. The features used in generating the model were: RAM, battery power, pixel width, pixel height, the weight of the mobile, internal memory, mobile depth, clock speed and touchscreen.
Gautami Hegde, HR Analytics: Predicting Employee Attrition, July 2018, (Yan Yu, Yichen Qin)
Employee attrition is a major problem for an organization. One of the goals of an HR Analytics department is to identify the employees who are likely to leave the organization in the future and take actions to retain them before they leave. Thus, the aim of this project is to understand the key factors that influence attrition and to predict the attrition of an employee based on these factors. The dataset used here is the HR analytics dataset by IBM Watson Analytics, which is a sample dataset created by IBM data scientists. In this project, the exploratory data analysis includes feature selection based on distributions, correlation and data visualizations. After eliminating some features, logistic regression, generalized additive model, decision tree and random forest techniques are used for building models. Model performance is evaluated using prediction accuracy and AUC. Of the different classification techniques, the logistic regression model and the generalized additive model were found to be the best at predicting employee attrition.
Venkata Sai Lakshmi Srikanth Popuri, Prediction of Client Subscription from Bank Marketing Data, July 2018, (Peng Wang, Yichen Qin)
Classification is one of the most important and interesting problems in today’s world. It has applications ranging from email spam tagging to fraud detection to predictions in the healthcare industry. The area of interest here is the bank marketing of a Portuguese banking institution. The marketing teams at banks run campaigns to persuade clients to subscribe to a term deposit. The purpose of this paper is to apply various data and statistical techniques to analyze and model the bank marketing data and predict whether a client will subscribe to a term deposit. The analysis addresses this classification problem by performing Exploratory Data Analysis (EDA); building models such as Logistic Regression, Step AIC and Step BIC models, Classification Tree, Linear Discriminant Analysis (LDA), Support Vector Machines (SVM), Random Forest (RF) and Gradient Boosting (GB); and validating these models using the misclassification rate and the area under the ROC curve. SVM performs better than the other models on this dataset, with a low out-of-sample misclassification rate and good AUC values.
Ali Aziz, Financial Coding for School Budgets, July 2018, (Yan Yu, Peng Wang)
School budget items must be labelled according to their description in a difficult task known as financial coding. A predictive model that outputs the probability of each label can help in accomplishing this work. In this project the effectiveness of several data processing techniques and machine learning algorithms was studied. After applying data imputation and natural language processing techniques, a one-vs-rest classifier consisting of L1 regularized logistic regression models performed the best out of all classifiers investigated. This classifier achieved an out-of-sample Log Loss of 0.5739, an improvement of approximately 17% on the baseline predictive model.
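For readers unfamiliar with the Log Loss metric reported above, the sketch below shows the standard multi-class definition: the average negative log of the probability assigned to the true label. The exact variant used in the project is not stated in the abstract, so this is only an illustration; the `log_loss` helper, the label names and the probabilities are all hypothetical.

```python
import math

def log_loss(true_labels, predicted_probs, eps=1e-15):
    """Average negative log-probability assigned to each true label.

    predicted_probs is a list of dicts mapping label -> probability.
    Probabilities are clipped to [eps, 1 - eps] to avoid log(0).
    """
    total = 0.0
    for label, probs in zip(true_labels, predicted_probs):
        p = min(max(probs[label], eps), 1 - eps)
        total -= math.log(p)
    return total / len(true_labels)

# Hypothetical budget-line labels and predicted probabilities.
y_true = ["salary", "supplies"]
y_prob = [
    {"salary": 0.8, "supplies": 0.2},
    {"salary": 0.5, "supplies": 0.5},
]
loss = log_loss(y_true, y_prob)
print(round(loss, 4))
```

Lower values are better: a model that assigns probability 1 to every true label scores close to 0, while confident wrong predictions are penalized heavily.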
Shashank Badre, A Study on Online Retail Data Set to Understand the Characteristics of Customer Segments that Are Associated with the Business, July 2018, (Peng Wang, Yichen Qin)
Small online retailers that are new entrants to the market are keen on using data mining and business intelligence techniques to better understand their existing and potential customer base. However, such small businesses often lack the expertise and technical know-how to perform the requisite analysis. This study will help such online retailers understand the approach and the different ways data can be used to gain insights into their customer base. The study is done on an online retail data set to understand the characteristics of different customer segments. Based on these characteristics, the study explains which customer segments contribute high monetary value and which contribute low monetary value to the business.
Ravish Kalra, Phishing Attack Prediction Engine, July 2018, (Dungang Liu, Edward Winkofsky)
A phishing attack tricks users into entering their credentials on a fake website or opening malware on their system. This, in turn, results in identity theft or financial losses. The aim of this project is to build a prediction engine through which a browser plugin can accurately predict whether a given website is legitimate or fraudulent after capturing certain features from the page. The scope of the project is limited to websites and does not involve any other kind of electronic media. The data set used for the analysis was obtained from the UCI Machine Learning repository. After evaluating a website on 30 documented features, the model predicts a binary response of 0 (legitimate) or 1 (phishing). Methods of analysis include (but are not limited to) visualizing the spread of different features, identifying correlations between the covariates and the dependent variable, and implementing different classification algorithms such as Logistic Regression, Decision Tree, SVM and Random Forest. Because asymmetric weights for false positives and false negatives were unavailable, various other evaluation metrics such as F Score and Log Loss are compared along with out-of-sample AUC. The Random Forest model outperforms the other modelling strategies considerably. Although it is a black-box classifier, the Random Forest model works well for the purpose of a back-end prediction engine that insulates decision making from the users.
Akul Mahajan, TMDB - "May the Force be with you", July 2018, (Yan Yu, Yichen Qin)
Today, we live in an era where almost every important business decision is guided by the application of statistics. One of the most popular areas in this regard is the use of statistical models in machine learning and predictive modelling to forecast outcomes, align them with the goals of the industry, and formulate and improve strategy to meet these goals. TMDB is one of the most popular datasets on Kaggle; it houses data on 5000 movies from different genres, geographies and languages. Predictive modelling can be applied to gain insights into the expected performance of movies before they are released and to formulate marketing strategies and campaigns to further improve their performance. This paper employs predictive algorithms such as linear regression, CART, Random Forests, Generalized Additive Models and Gradient Boosting, tunes these models to achieve optimum performance, and evaluates their potential using appropriate evaluation metrics.
Kevin McManus, Analysis of High School Alumni Giving, July 2018, (Yan Yu, Bradley Boehmke)
Archbishop Moeller High School has an ambitious plan to increase its participation rate (giving + activities), up from 4% a few years ago to 13% last year with a goal of 15%. Donations to the 2017 Unrestricted Fund were made by 9% of the 11,524 alumni base and reflected an increase of 258% vs 2013. The analyses focused on a regression predictive model for donation amount and a classification model to predict which alumni will donate. Both suggest that prior alumni giving, and connections to the school via other affiliations were strong predictors, among several others. The school should focus on creating opportunities for involvement by alumni as well as maintain strong connections to its base who give consistently. Overall, higher wealth levels were not a significant predictor for giving to the Unrestricted Fund. The analyses also performed unsupervised clustering which suggested there were distinct groups of those strongly connected with the school through other affiliations and those who were not. The former group tended to live within 100 miles of Cincinnati and give at a higher rate than the other groups. Even the clustering of giving alumni showed a small consistent group of givers and a second group of occasional donors. The former group also had a higher rate of other connections to the school compared to those who gave only occasionally.
Ritisha Andhrutkar, Sentiment Analysis of Amazon Unlocked Phone Reviews, July 2018, (Yichen Qin, Peng Wang)
Online customer reviews have a powerful effect on the behavior of consumers and, therefore, on the performance of a brand in today’s Internet age. According to a survey, 88% of consumers trust online reviews as much as personal recommendations when purchasing an item on an e-commerce website. Positive reviews boost the confidence of an organization, while negative reviews suggest areas of improvement. It is also certain that having more reviews for a product will result in a higher conversion rate for that product. This report is aimed at analyzing and understanding the trend of human behavior towards unlocked mobile phones sold on Amazon. The dataset utilized has been scraped from the e-commerce website and consists of several listings of phones along with their features such as Brand Name, Price, Rating, Reviews and Review Votes. Text mining techniques have been applied to the dataset to identify the sentiment of each customer review, which would help Amazon and, in turn, the manufacturer to improve their current products and sustain their brand name.
Swapnil Patil, Applications of Unsupervised Machine Learning Algorithms on Retail Data, July 2018, (Peng Wang, Yichen Qin)
Data Science and Analytics are widely used in the retail industry. With the advent of big data tools and higher computing power, sophisticated algorithms can crunch huge volumes of transactional data to extract meaningful insights. Companies such as Kroger invest heavily to transform the more than hundred-year-old retail industry through analytics. This project is an attempt to apply unsupervised learning algorithms to transactional data to formulate strategies to improve the sales of the products. This project deals with online retail store data taken from the UCI Machine Learning Repository. The data pertains to the transactions of a UK-based registered online retail store between 01/12/2010 and 09/12/2011. The retail store mostly sells different gift items to wholesalers around the globe. The objective of the project is to apply statistical techniques such as clustering, association rules and collaborative filtering to come up with different business strategies that may lead to an increase in the sales of the products.
Tathagat Kumar, Market Basket Analysis and Association Rules for Instacart Orders, June 2018, (Yichen Qin, Yan Yu)
For any retailer it is extremely important to identify customer habits: why customers make certain purchases, how merchandise moves, when sales peak, and which sets of products are purchased together. These insights help in structuring the store layout, designing promotions and coupons, and combining all of these with a customer loyalty card, which makes the above strategies even more useful. The first public anonymized dataset from Instacart is selected for this paper, and the goal is to analyze it to find fast-moving items, frequent basket sizes, peak order times, frequently reordered items and high-moving products in aisles. This paper also demonstrates the habit patterns of loyal customers and predicts their future purchases with reasonable accuracy. Market basket analysis with association rules is used to discover the strongest rules of product association based on different association measures, e.g. support, confidence and lift. Analysis has been conducted to uncover the strong rules for both high-frequency and less frequent items. Finally, for top-selling products, left-hand and right-hand association rules are used to show which products are purchased before and after them.
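The three association measures named above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the toy transactions and the function names are assumptions made up for the example and are not Instacart data.

```python
# Toy baskets, invented for illustration only (not Instacart data).
transactions = [
    {"banana", "milk"},
    {"banana", "bread"},
    {"banana", "milk", "bread"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(lhs, rhs, transactions):
    """Estimated P(rhs | lhs): support of lhs and rhs together over support of lhs."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)


def lift(lhs, rhs, transactions):
    """Confidence of lhs -> rhs relative to the baseline frequency of rhs."""
    return confidence(lhs, rhs, transactions) / support(rhs, transactions)


# Rule {banana} -> {milk} on the toy baskets.
print(support({"banana", "milk"}, transactions))
print(confidence({"banana"}, {"milk"}, transactions))
print(lift({"banana"}, {"milk"}, transactions))
```

A lift above 1 suggests the two sides of a rule occur together more often than independence would predict; a lift below 1 suggests the opposite.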
Sayali Dasharath Wavhal, Employee Attrition Prediction on Class-Imbalanced Data using Cost-Sensitive Classification, April 2018, (Yichen Qin, Dungang Liu)
Human resources are the most valuable asset of an organization, and every organization aims at retaining its valuable workforce. The main goal of every HR Analytics department is to identify the employees who are likely to leave the organization in the future and to take action to retain them before they leave. This paper aims at identifying the factors resulting in employee attrition and building a classifier to predict employee attrition. The analysis addresses the class-imbalance classification problem by exploring the performance of various Machine Learning models such as Logistic Regression, Classification Tree using Recursive Partitioning, Generalized Additive Modeling and Gradient Boosting Machine. Because this is a highly imbalanced class problem, with only 15% positives, “Accuracy” is not a suitable indicator of model performance. Thus, to avoid biasing the classifier towards the majority class, Cost-Sensitive classification was adopted to tackle misclassification of the minority class, with False Negatives penalized more heavily than False Positives. The model performance was evaluated based on Sensitivity (Recall), Specificity, Precision, Misclassification Cost and Area under the ROC Curve. The analysis in this paper suggests that although recursive partitioning and ensemble techniques of decision trees have good predictive power for the minority class, more stable prediction performance is observed with the Logistic Regression Model and the Generalized Additive Model.
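One common way to make a classifier cost-sensitive is to choose the probability cutoff that minimizes an asymmetric misclassification cost. The sketch below illustrates that idea only; the 5:1 cost ratio, the cutoff grid and the function names are assumptions for the example, not values taken from the paper.

```python
def misclassification_cost(y_true, y_prob, cutoff, fn_cost=5.0, fp_cost=1.0):
    """Average cost at `cutoff`; false negatives cost more than false positives.

    The default 5:1 ratio is an assumed value for illustration.
    """
    total = 0.0
    for y, p in zip(y_true, y_prob):
        pred = 1 if p >= cutoff else 0
        if y == 1 and pred == 0:      # missed a leaver (false negative)
            total += fn_cost
        elif y == 0 and pred == 1:    # flagged a stayer (false positive)
            total += fp_cost
    return total / len(y_true)


def best_cutoff(y_true, y_prob, fn_cost=5.0, fp_cost=1.0):
    """Grid-search cutoffs 0.01..0.99 and return the one with the lowest cost."""
    grid = [i / 100 for i in range(1, 100)]
    return min(
        grid,
        key=lambda c: misclassification_cost(y_true, y_prob, c, fn_cost, fp_cost),
    )


# Tiny made-up example: one leaver (label 1) among four employees.
y_true = [1, 0, 0, 0]
y_prob = [0.3, 0.2, 0.1, 0.4]
print(misclassification_cost(y_true, y_prob, cutoff=0.5))
print(best_cutoff(y_true, y_prob))
```

With a heavier penalty on false negatives, the selected cutoff typically falls below 0.5, so more employees are flagged as at risk.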
Yong Han, Whose Votes Changed the Presidential Elections?, April 2018, (Dungang Liu, Liwei Chen)
The unique aspect of the YouGov / CCAP data was that it contained the information of 2008 to 2016 elections from the same group of 8000 voters. This might provide information on voting patterns between elections.
The goals of this study were to find: Was any predictor significant to the 2012 and 2016 presidential vote? Was it consistent between elections? Was any predictor significant to the change-vote between two elections? Was it consistent? Based on exploratory data analysis, 70% of voters never changed their votes, and 20% of voters changed at least once in last three elections. Was any predictor significantly associated with this behavior?
Using the VGLM method, this study found that, in single elections, some common predictors were significant, such as Gender, Child, Education, Age, Race and Marital status. At the same time, different elections had different significant predictors. For vote change between two elections, the significant predictors differed between election pairs. Between the 2012 and 2016 elections, the model suggested that Education, Income and Race were significant to vote change, while between 2008 and 2012 the model suggested that Child and Employment status were significant. With the 2016 election data, the never-change-vote model found that Income, Age, Ideology, News and Married status were significant to the never-change-vote behavior. Individual election models could predict ~60% of votes in testing samples. Utilizing a previous vote as a predictor, models could predict ~89% of votes in testing samples. The never-change-vote model predicted well on the 70% never-change-vote voters, but missed almost all of the 20% change-vote voters.
Yanhui Chen, Binning on Continuous Variables and Comparison of Different Credit Scoring Techniques, April 2018, (Peng Wang, Yichen Qin)
Binning is a widely used method to group a continuous variable into a categorical variable. In this project, I binned the continuous variables amount, duration and age in the German credit data and compared five models: a logistic model using binned variables, a logistic model without binned variables, a logistic additive model without binned variables, random forest, and gradient boosting. I found that the logistic model with binning performed the weakest among the five fitted models. I also showed that variable importance varied across models, and that the variable checkingstatus was selected as one of the important variables in most of the models. The binned variables duration and amount were determined to be important variables in the logistic model with binning. Random forest was the only model that selected the variable history as an important variable.
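As a small illustration of binning, a continuous value can be mapped to a labelled interval with a sorted list of cut points. The cut points and labels below are hypothetical and are not the bins used in the project.

```python
from bisect import bisect_right


def bin_value(x, edges, labels):
    """Return the label of the interval containing x.

    `edges` are sorted interior cut points, so len(labels) == len(edges) + 1.
    Intervals are closed on the left: [edge_i, edge_{i+1}).
    """
    return labels[bisect_right(edges, x)]


# Hypothetical cut points for the age variable (assumed for illustration).
age_edges = [25, 40, 60]
age_labels = ["<25", "25-39", "40-59", "60+"]

ages = [19, 25, 39, 47, 60, 73]
print([bin_value(a, age_edges, age_labels) for a in ages])
```

The resulting labels can then be used as a categorical predictor in place of the original continuous variable.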
Jamie H. Wilson, Fine Tuning Neural Networks in R, April 2018, (Yan Yu, Edward Winkofsky)
As artificial neural networks grow in popularity, it is important to understand how they work and the layers of options that go into building a neural network. The fundamental components of a neural network are the activation function, the error measurement and the method of backpropagation. These methods make neural networks good at finding complex nonlinear relationships among predictor and response variables as well as interactions between predictor variables. However, neural networks are difficult to explain, can be computationally expensive and tend to overfit the data. There are two primary R packages for neural networks: nnet and neuralnet. The nnet package has fewer tuning options but can handle unstandardized and standardized data. The neuralnet package has a myriad of options, but only handles standardized data. When building a predictive model using the Boston Housing Data, both packages are capable of producing effective models. Tuning the models is important to get valid and robust results. Given the number of tuning parameters in neuralnet, these models perform better than the models built in nnet.
Kenton Asbrock, The Price to Process: A Study of Recent Trends in Consumer-Based Processing Power and Pricing, April 2018, (Uday Rao, Jordan Crabbe)
This analysis investigates the effects of the deceleration of the observational Moore’s Law on consumer based central processing units. Moore’s Law states that the number of transistors in a densely integrated circuit approximately doubles every two years. The study involved a data-set containing information about 2241 processors released by Intel between 2001 and 2017, which is the approximate time frame associated with the decline of Moore’s Law. Data wrangling and pre-processing was performed in R to clean the data and convert it to a state that was ready for analysis. Data was then aggregated by platform to study the evolution of processing across desktops, servers, embedded devices, and mobile devices. Formal time series procedures were then applied to the entire data set to study how processing speed and price has changed recently and how future forecasts are expected to behave. It was determined that while processing speeds are in a period of stagnation, the price paid for computational power has been decreasing and is expected to decrease in the future. While the negative effects of the decline of Moore’s Law may have an impact on a small fraction of the market through speed stagnation, the overall price decrease of processing performance will benefit the average consumer.
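The doubling rule stated above implies exponential growth of the form N(t) = N0 * 2^(t / 2), with t in years. A minimal sketch of that arithmetic follows; the function name and the starting count are assumptions for illustration.

```python
def projected_transistors(initial_count, years, doubling_period=2.0):
    """Transistor count after `years` if it doubles every `doubling_period` years."""
    return initial_count * 2 ** (years / doubling_period)


# Starting from a hypothetical 1,000,000 transistors:
print(projected_transistors(1_000_000, 4))   # two doublings
print(projected_transistors(1_000_000, 16))  # eight doublings
```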
Hongyan Ma, A Return Analysis for S&P 500, April 2018, (Yan Yu, Liwei Chen)
Time series analysis is commonly used to analyze and forecast economic data. It helps to identify patterns, to understand and model the data as well as to predict short-term trends. The primary purpose of this paper is to study the Moving Window analysis and GARCH Models built through analyzing the monthly return of S&P 500 for recent 50 years from January 1968 to December 2017.
In this paper, we first studied the raw data to check its patterns and distributions, and then analyzed the monthly returns in different time windows (10-year, 20-year, 30-year and 40-year) by Moving Window analysis. We found that over the long horizon, the S&P 500 produced significant returns for investors who stayed invested for a long time. However, for a given 10-year period, the return can even be negative. Finally, we fitted several forms of GARCH models with normal distributions as well as Student’s t distributions and found the GARCH(1,1) Student-t model to be the best model in terms of Akaike’s Information Criterion and log-likelihood values.
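For readers unfamiliar with GARCH(1,1), its conditional variance follows the recursion sigma2[t] = omega + alpha * e[t-1]^2 + beta * sigma2[t-1]. The sketch below only evaluates that recursion for given parameters; it does not estimate them, and the parameter values and the choice to start from the unconditional variance are assumptions for illustration.

```python
def garch11_variance(residuals, omega, alpha, beta):
    """Conditional variances from the GARCH(1,1) recursion.

    Starts at the unconditional variance omega / (1 - alpha - beta),
    which assumes alpha + beta < 1.
    """
    variances = [omega / (1.0 - alpha - beta)]
    for e in residuals[:-1]:
        # Next variance depends on the previous shock and previous variance.
        variances.append(omega + alpha * e ** 2 + beta * variances[-1])
    return variances


# Made-up residuals and parameters: a large shock raises next-period variance.
print(garch11_variance([2.0, 0.0, 0.0], omega=0.1, alpha=0.1, beta=0.8))
```

In practice the parameters would be estimated by maximum likelihood under a normal or Student-t error distribution, as the paper describes.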
Justin Jodrey, Predictive Models for United States County Poverty Rates and Presidential Candidate Winners, April 2018, (Yan Yu, Bradley Boehmke)
The U.S. Census Bureau administers the American Community Survey (ACS), an annual survey that collects data on various demographic factors. Using a Kaggle dataset that aggregates data at the United States county level and joining other ACS tables to it from the U.S. FactFinder website, this paper analyzes two types of predictive models: regression models to predict a county’s poverty rate and classification models to predict a county’s 2016 general election presidential candidate winner. In both the regression and classification settings, a generalized additive model best predicted county poverty rates and county presidential winners.
Trent Thompson, Cincinnati Reds – Concessions and Merchandise Analysis, April 2018, (Yan Yu, Chris Calo)
Concession and Merchandise sales account for a substantial percentage of revenue for the Cincinnati Reds. Thoroughly analyzing the data captured from Concession and Merchandise sales can help the Reds with pricing, inventory management, planning and product bundling. The scope of this Concession and Merchandise analysis includes general exploratory data analysis, identifying key trends in sales, and analyzing common order patterns. One major finding from this analysis was calculating 95% confidence intervals of Concession and Merchandise sales resulting in improved efficiency in inventory management. Another learning is that generally, fans buy their main food items (hot dog, burger, pizza) before the game and then beverages, desserts and snacks during the game. Finally, strong order associations exist among koozies with light beer and bratwursts with beverages and peanuts. I recommend displaying the koozies over the refrigerator with light beers and bundling bratwursts in a similar manner to the current hot dog bundle with hopes of driving a lift in sales.
Xi Chen, Decomposing Residential Monthly Electric Utility into Cooling Energy Use by Different Machine Learning Techniques, April 2018, (Peng Wang, Yan Yu)
Today the residential sector consumes about 38% of energy produced, of which nearly a half is consumed by HVAC systems. One of the main energy-related problems is that most households do not operate in an energy-efficient manner, such as utilizing natural ventilation or adjusting the thermostat according to weather conditions, thus leading to higher usage than necessary. It has been reported that energy-saving behaviors may lead to a 25% energy-use reduction just by giving consumers a more detailed electricity bill with the same building settings. Therefore, the scope of this project is to construct a monthly HVAC energy use predictive model with simple and accessible predictors for homes. The dataset used in this project includes weather, metadata, and electricity-usage-hours data downloaded from the Pecan Street Dataport. The final dataset contains 3698 observations and 11 variables. Multiple linear regression, regression tree, random forest, and gradient boosting are the four machine learning techniques applied to predict monthly HVAC cooling use. Root Mean Squared Error (RMSE) and adjusted R2 are the two criteria adopted to evaluate model fit. All models are highly predictive, with R2 ranging from 0.823 to 0.885. The gradient boosting model has the best overall prediction quality, with an out-of-sample RMSE of 0.57.
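The two evaluation criteria can be computed directly from observed and predicted values. This is a generic sketch of the standard formulas; the sample numbers and function names are invented for illustration and are not from the project's data.

```python
from math import sqrt


def rmse(y, y_hat):
    """Root mean squared error between observed and predicted values."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))


def adjusted_r2(y, y_hat, n_predictors):
    """R-squared penalized for the number of predictors in the model."""
    n = len(y)
    mean_y = sum(y) / n
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)


# Made-up observations and predictions from a one-predictor model.
y = [1.0, 2.0, 3.0, 4.0, 5.0]
y_hat = [1.1, 1.9, 3.2, 3.8, 5.0]
print(rmse(y, y_hat))
print(adjusted_r2(y, y_hat, n_predictors=1))
```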
Fan Yang, Breast Cancer Diagnose Analysis, April 2018, (Yichen Qin, Dungang Liu)
The dataset studied in this paper describes breast cancer tissue from two dimensions. The tissue is either benign or malignant. Our target is to recognize malignant tissue from its dimensions (mean, standard error and worst value). This paper includes a feature selection section based on correlation analysis and data visualization. After eliminating some correlated and visually unclassified features, logistic regression, random forest and XGBoost are fitted on training and validation data. 10-fold cross-validation is also used to estimate the performance of all the models; the prediction accuracy of the different models is then compared, and area under the ROC curve is used to evaluate model performance on validation data.
Sinduja Parthasarathy, Income Level Prediction using Machine Learning Techniques, April 2018, (Yichen Qin, Dungang Liu)
Income is an essential component in determining the economic status and standard of living of an individual. An individual’s income largely influences his nation’s GDP and financial growth. Knowing one’s income can also assist an individual in financial budgeting and tax return calculations. Hence, given the importance of knowing an individual’s income, the US Census data from the UCI Machine Learning Repository was explored in detail to identify the factors that contribute to a person’s income level. Furthermore, machine learning techniques such as Logistic regression, Classification tree, Random forests, and Support Vector Machine were used to predict the income level and subsequently identify the model that most accurately predicted the income level of an individual.
Relationship status, Capital gain and loss, Hours worked per week and Race of an individual were found to be the most important factors in predicting the income level of an individual. Of the different classification techniques that were built and tested for performance, the logistic regression model was found to be the best performing, with the highest accuracy of 84.63% in predicting the income level of an individual.
Jessica Blanchard, Predictive Analysis of Residential Building Heating and Cooling Loads for Energy Efficiency, March 2018, (Peng Wang, Dungang Liu)
This study’s focus is to predict the required heating load and cooling load of a residential building through multiple regression techniques. Prediction accuracy is tested with in-sample, out-of-sample, and cross-validation procedures. A dataset of 768 observations, eight potential predictor variables, and two dependent variables (heating and cooling load) will be explored to help architects and contractors utilize and predict the necessary air supply demand and thus design more energy efficient homes. Exploratory Data Analysis not only uncovered relationships between the explanatory and dependent variables, but relationships amongst explanatory variables as well. To create a model with accurate predictability, the following regression techniques were examined and compared to one another: Multiple Linear Regression, Stepwise, LASSO, Ridge, Elastic-Net, and Gradient Boosting. While each method has its advantages and disadvantages, the models created using LASSO Regression to predict heating and cooling load, balance simplicity and accuracy relatively well. However, when compared against the results from Gradient Boosting, the LASSO models produced greater root mean squared error. Overall, the regression trees created with Gradient Boosting yielded the best predictive results with parameter tuning to regulate “overfitting.” These models meet the purpose of this study to provide residential architects and contractors a straightforward model with greater accuracy than the current “Rules of Thumb” practice.
Zachary P. Malosh, The Impact of Scheduling on NBA Team Performance, November 2017, (Michael Magazine, Tom Zentmeyer)
Every year, the NBA releases their league schedule for the coming year. The construction of the schedule contains many potential schedule-based factors (such as rest, travel, and home court) that can impact each game. Understanding the impact of these factors is possible by creating a regression model that quantifies the team performance in a particular game in terms of final score and fouls committed. Ultimately, rest, distance, attendance, and time in the season had direct impact on the final score of the game while the attendance at a game led to an advantage in fouls called against the home team. The quantification of the impact of these factors can be used to anticipate variations in performance to improve accuracy in a Monte Carlo simulation.
Oscar Rocabado, Multiclass Classification of the Otto Group Products, November 2017, (Yichen Qin, Amitabh Raturi)
Otto Group is a multinational group with thousands of products that need to be classified consistently into nine groups. The better the classification, the more insights they can generate about their product range. However, the data is highly unbalanced among classes, so we investigate whether the balancing technique Synthetic Minority Oversampling Technique (SMOTE) has a notable effect on the accuracy and Area under the Curve of the classifiers. Because the data set is obfuscated, making interpretation impossible, we use black-box methods such as Linear and Gaussian Support Vector Machines and Multilayer Perceptron, and ensembles that combine classifiers, such as Random Forests and Majority Voting.
Shixie Li, Credit Card Fraud Detection Analysis: Over sampling and under sampling of imbalanced data, November 2017, (Yichen Qin, Dungang Liu)
Imbalanced credit fraud data is analyzed by oversampling and undersampling methods. A model is built with logistic regression, and the area under the precision-recall curve (PRROC) is used to show the model performance of each method. The disadvantage of using the area under ROC is that, due to the imbalance of the data, the specificity will always be close to 1. Therefore the area under the ROC curve does not work well on imbalanced data. This disadvantage is shown by comparison in this paper. Instead, a precision-recall plot is used to find a reasonable region for the cutoff point based on the results from the selected model. The cutoff value should be chosen within or around that region, and the choice depends on whether precision or recall is more important to the bank.
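The point about specificity can be illustrated with a small calculation. The counts below are invented for the example (1000 legitimate and 10 fraudulent transactions) and the function name is an assumption; the sketch only shows why specificity can look excellent while precision stays low.

```python
def classification_rates(y_true, y_pred):
    """Specificity, precision and recall from binary labels and predictions."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    return {
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Invented counts: 1000 legitimate (20 wrongly flagged), 10 fraud (5 caught).
y_true = [0] * 1000 + [1] * 10
y_pred = [1] * 20 + [0] * 980 + [1] * 5 + [0] * 5
print(classification_rates(y_true, y_pred))
```

Here specificity is 0.98 even though only one in five flagged transactions is actually fraudulent, which is why precision and recall are more informative on this kind of data.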
Cassie Kramer, Leveraging Student Information System Data for Analytics, November 2017, (Michael Fry, Nicholas Frame)
In 2015, The University of Cincinnati began to transition its Student Information System from a homegrown system to a system created by Oracle PeopleSoft called Campus Solutions and branded by UC as Catalyst. In order to perform reporting and analytics on this data, the data must be extracted from the source system, modeled and loaded into a data warehouse. The data can then be exported to perform analytics. In this project, the process of extract, modeling, loading and analyzing will be covered. The goal will be to predict students’ GPA and retention for a particular college.
Parwinderjit Singh, Alternative Methodologies for Forecasting Commercial Automobile Liability Frequency, October 2017, (Yan Yu, Caolan Kovach-Orr)
Insurance Services Office, Inc. publishes quarterly forecast of Commercial Automobile liability frequency (number of commercial automobile insurance claims reported/paid) to help insurers make better pricing and reserving decisions. This paper proposes forecasting models based on time-series forecasting techniques as an alternative to already existing traditional methods and intends to improve the existing forecasting capabilities. ARIMAX forecasting models have been developed with economic indicators as external regressors. These models resulted in a MAPE (Mean Absolute Percentage Error) ranging from 0.5% to ~9% which is a significant improvement from currently used techniques.
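MAPE is the mean of the absolute forecast errors expressed as a percentage of the actual values. A minimal sketch follows; the function name and the sample figures are made up for illustration, and the formula assumes no actual value is zero.

```python
def mape(actual, forecast):
    """Mean Absolute Percentage Error, in percent (actual values must be nonzero)."""
    errors = [abs((a - f) / a) for a, f in zip(actual, forecast)]
    return 100.0 * sum(errors) / len(errors)


# Made-up quarterly frequencies and forecasts.
print(mape([100.0, 200.0], [110.0, 190.0]))
```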
Anjali Chappidi, Un-Crewed Aircraft Analysis & Maintenance Report Analysis, August 2017, (Michael Fry, Jayvanth Ishwaran)
This internship comprised two projects: an analysis of crew data using SAS and an analysis of aircraft maintenance reports using text mining in R. The first project identifies and analyzes how different factors affected the crew ratio on different fleets. The goal of the second project is to study the maintenance logs, which consist of the work order description and work order action for aircraft that were reported to undergo maintenance.
Vijay Katta, A Study of Convolutional Neural Networks, August 2017, (Yan Yu, Edward Winkofsky)
The advent of Convolutional Neural Networks has drastically improved the accuracy of image processing. Convolutional Neural Networks, CNNs for short, are presently the crux of deep learning applications in computer vision. The purpose of this capstone is to investigate the basic concepts of Convolutional Neural Networks in a stepwise manner and to build a simple CNN model to classify images. The study involves understanding the concepts behind the different layers in a CNN, studying the different CNN architectures, understanding the training algorithms of CNNs, studying the applications of CNNs, and applying a CNN to image classification. A simple image classification model was designed on an ImageNet dataset which contains 70,000 images of digits. The accuracy of the best model was found to be 98.74. From the study, it is concluded that a highly accurate image processing model is achievable in a few minutes given the dataset has fewer than 0.1 million observations.
Yan Jiang, Selection of Genetic Markers to Predict Survival Time of Glioblastoma Patients, August 2017, (Peng Wang, Liwei Chen)
Glioblastoma multiforme (GBM) is the most aggressive primary brain tumor, with a survival time of less than 3 months in >50% of patients. Gene analysis is considered a feasible approach for the prediction of a patient’s survival time. Advanced gene sequencing techniques normally produce a large amount of genetic data that contains important information for the prognosis of GBM. An efficient method is urgently needed to extract key information from these data for clinical decision making. The purpose of this study is to develop a new statistical approach to select genetic markers for the prediction of GBM patients’ survival time. The new method, named the Cluster-LASSO linear regression model, has been developed by combining nonparametric clustering and LASSO linear regression methods. Compared to the original LASSO model, the new Cluster-LASSO model simplifies the model by 67.8%. The Cluster-LASSO model selected 19 predictor variables after clustering instead of the 59 predictor variables in the LASSO model. The predictor genes selected for the Cluster-LASSO model are ZNF208, GPRASP1, CHI3L1, RPL36A, GAP43, CLCN2, SERPINA3, SNX10, REEP2, GUCA1B, PPCS, HCRTR2, BCL2A1, MAGEC1, SIRT3, GPC1, RNASE2, LSR and ZNF135. In addition, the Cluster-LASSO model surpasses the out-of-sample performance of the LASSO model by 1.89%. Among the 19 genes selected in the Cluster-LASSO model, the positively associated HCRTR2 gene and the negatively associated GAP43 gene are especially interesting and worthy of further study. A further study to confirm their relationship to the survival time of GBM and the possible mechanism would contribute tremendously to the understanding of GBM.
Jing Gao, Patient Satisfaction Rating Prediction Based on Multiple Models, August 2017, (Peng Wang, Liwei Chen)
With the development of the economy and technology, online health consultation provides a convenient platform that enables patients to seek suggestions and treatment quickly and efficiently, especially in China. Due to the large population density, physicians may need to see hundreds of patients every day at the hospital, which is really time-consuming for patients. So it is no wonder that online health consultation has grown so rapidly recently. Since healthcare services always relate to issues of mortality and quality of life for patients, online healthcare services and patient satisfaction are always important to keep this industry running safely and efficiently. So in this project, we focus on the patient