David Horton, Predicting Single Game Ticket Holder Interest in Season Plan Upsells, December 2018, (Yan Yu, Joseph Wendt)
Using customer data provided by the San Antonio Spurs, a statistical model was built that predicts the likelihood that an account which purchased only single game tickets in the previous year will upgrade to some sort of plan, either partial or full season, in the current year. The model uses only variables derived from customer purchase and attendance histories (games attended, tickets purchased and attended, money spent) over the years 2013-2016. The algorithm used for training was the Microsoft Azure Machine Learning Studio implementation of a two-class decision jungle. Training data was constructed from customers who had purchased only single game tickets in the previous year. This data was split randomly so that 75% was used for training and 25% for testing. In later runs, all data from 2016 was withheld from training and testing as a validation set, as noted in the results section. The final model (including 2016 data in training) shows a test accuracy of 84.9%, where 50% accuracy is interpreted as statistically random and 100% means every prediction is correct. This model is likely to see improvements in predictive power as demographic information is added, new variables are derived, the feature selection method and model choice become more sophisticated, model parameters are optimized, and more data becomes available.
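The random 75%/25% split and the accuracy measure described above can be sketched in a few lines. This is only an illustration using the Python standard library, not the Azure Machine Learning Studio workflow used in the project; the function names, the seed, and the account IDs are hypothetical.

```python
import random


def train_test_split(rows, train_fraction=0.75, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical account IDs standing in for single game ticket buyers.
accounts = list(range(1000))
train, test = train_test_split(accounts)
```

With 1,000 accounts this yields 750 training rows and 250 test rows, and every account lands in exactly one of the two sets.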
Ravi Theja Kandati, Lending Club – Identification of Profitable Customer Segment, August 2018, (Yan Yu, Olivier Parent)
Lending Club issues unsecured loans to different segments of customers. The interest rate for a loan depends on the credit history of the customer and various other factors such as income level and demographics. The data of the borrowers is public. The objective is to analyze the dataset and distinguish the good customers from the bad customers (“charged off”) using machine learning techniques. The dataset is class imbalanced, as the number of good customers is far greater than the number of bad customers. As this is a typical classification problem, the CatBoost technique is used for modelling.
Pengzu Chen, Churn Prediction of Subscription-based Music Streaming Service, August 2018, (Dungang Liu, Leonardo Lozano)
As a well-known subscription business model, paid music streaming became the largest recorded music market revenue source in 2017. Churn prediction is critical for subscriber retention and profit growth in a subscription business. This project uses a leading Asian music streaming service’s data to identify parameters that have an impact on users’ churn behavior and to predict churn. The data contains user information, transaction records and daily user activity logs. Prediction models are built with logistic regression, classification tree and support vector machine algorithms, and their performances are compared. The results indicate that the classification tree model has the best performance among the three in terms of asymmetric misclassification rate. The parameters that have a big impact on churn are whether subscription auto-renew is enabled, payment method, whether the user actively cancels the membership, payment plan length, and user activity 0-2 days before the subscription expires. This informs the service provider where customer relationship management should focus.
Tongyan Li, Worldpay Finance Analytics Projects, August 2018, (Michael Fry, Tracey Bracke)
Worldpay, Inc. (formerly Vantiv, Inc.) is an American payment processing and technology provider headquartered in the greater Cincinnati area. As a Data Science Analytics Intern, I work directly with the Finance Analytics team on multiple projects. The main purpose of the first project is to automate a process that used to be carried out manually across different databases; RStudio was used and substantially reduced the time required to produce flat files for further usage. In the second project, the year-over-year (YoY) average ticket price (AVT) growth was analyzed. The Customer Attrition project focuses on the study of customers’ attrition behavior.
Navin Mahto, Generating Text by Training a Recurrent Neural Network on English Literary Experts, August 2018, (Yan Yu, Yichen Qin)
Since the advent of modern computing we have been trying to make computers learn and respond in a way unique to humans. While we have chatbots which mimic human response through pre-coded answers, they are not fluid or robust in their response. In our project we train a Recurrent Neural Network on an English classic, “War and Peace” by Leo Tolstoy, and make it generate sentences similar in nature and structure to the language in the book. The sequential structure of an RNN and its ability to retain previous inputs make it ideal for learning the literary style of the book. On increasing the length of the RNN and the number of epochs, the error decreases from a maximum of 2.9 to 2.2, and the generated text more closely resembles the English language.
Non-coherent output: “the soiec and the coned and the coned and the cone”.
Slightly coherent output: “the sage to the countess and the sale to the count”.
Zach Amato, Principal Financial Group: GWIS Portfolio Management Platform, July 2018, (Michael Fry, Jackson Bohn)
The overall goal of the GWIS Portfolio Management Platform project is to help bring together the disparate tools, research, and processes of the PPS boutique into as few locations as possible. Throughout the summer, we have started prototyping the Portfolio Viewer module and putting structure around the Research Module. In doing this, data management and data visualization skills have been used to meet the needs of the project and of the business. Future steps in the project will include completing current work and the modules in process and engaging in the Portfolio Construction and Trading modules. Future work will require data management, data manipulation, statistical testing, and optimization.
Nicholas Charles, Craft Spirits: A Predictive Model, July 2018, (Dungang Liu, Edward Winkofsky)
A new trend driving growth in the spirits industry is craft. Craft spirits are usually produced by small distilleries that use local ingredients. In the US, the spirits industry is structured as a three-tier system with manufacturers, distributors, and retailers. In certain states, the state government controls a portion of the three-tier system. For instance, the State of Iowa controls distribution. The state purchases product from the manufacturers and subsequently sells it to private retailers in the state. In doing so, the state tracks all transactions at the store level and makes this data available to the public. This project takes that open data and builds a logistic regression model that can be used to predict the outcome of a transaction as either a craft purchase or a noncraft purchase. This information may prove useful to distilleries and distributors that specialize in craft by helping to pinpoint where their resources should be focused.
Keshav Tyagi, CC Kitchen’s Dashboard, July 2018, (Michael Fry, Harrison Rogers)
I am working as a Business Analyst Co-Op with Project Management Operations Division within the Castellini Group of Companies providing Business solutions to CC Kitchens, which is one of its subsidiaries specializing in Deli and Ready to eat products. The project, which I was assigned to, aims at creating an executive level dashboard for CC Kitchens visualizing important business metrics, which can assist top-level executives in making informed decisions on a day-to-day basis.
My responsibilities include but are not limited to interacting with different sectors within the company to identify the data sources for the above metrics, data cleansing, creating data pipelines, preparing data through SQL stored procedures and creating visuals over them in a tool called DOMO. The data resided in flat files, Excel sheets, emails, ERP and APIs, and I created an automated data flow architecture to collect and load data into a centralized SQL warehouse.
Swidle Remedios, Analysis of Customer Rebook Pattern for Refund Policy Optimization, July 2018, (Michael Fry, Antonio Lannes Filho)
Priceline offers lower rates to its customers on certain deals which are non-cancellable by policy. In order to improve customer satisfaction, certain changes were deployed in June 2015 to make exceptions to these policies and refund the customers. These exceptions are only applied to cancel requests that fall under Priceline’s predefined categories/cancel reasons. In this paper, the orders processed under Priceline’s Cancel and exception policy will be analyzed for two of the top cancel reasons. The goal is to determine if the refunds are successfully driving customer behavior and repurchase habits. The insights obtained from the analysis will help the Customer care team at Priceline redesign and optimize the policies for each of the cancel reasons.
Rashmi Subrahmanya, Analysis of Tracker System Data, July 2018, (Michael Fry, Peter M. Chang)
The Tracker system is an internal system at Boeing that records requests from employees working on the floor. Based on the nature of the request, each one is directed to the respective department. The Standards organization is responsible for the supply of standards (such as nuts, bolts, rivets) for assembling aircraft. Any request related to standards, such as missing, defective or insufficient standards, is directed to the Standards organization, which then resolves the request. The resolution time, and in fact the requests themselves, directly impact the aircraft assembly process. The project focuses on analysis of tracker request data to identify patterns in the data. Data is analyzed on two key metrics - number of requests and average resolution time of the requests. The top problem type names, areas of the aircraft, and the hours and days with the highest number of requests have been identified. This would help in understanding the reasons behind these requests and in taking preventive action, such as increasing staffing at a particular time of the day so that requests are resolved more quickly. Two dashboards were developed: one to show the number of active requests and one to show the requests by Integration Center for the 7XX program.
Spandana Pothuri, Data Instrumentation and Significance Testing for a Digital Product in the Entertainment Industry, July 2018, (Dungang Liu, Chin-Yang Hung)
The entertainment technology industry runs on data. As entertainment is created, unlike in a more need-based industry like agriculture, it is important to see how the receiver uses the end product. Based on the feedback loop, more products are created or existing products are made better. How a user uses an application determines how its next version is built. In this world, clicks translate to dollars, and data is of utmost importance. This paper focuses on the cycle of data in a technological project, starting from instrumentation and tracking to reporting and deriving the business impact of the product. The product featured in this paper is Twitch’s extensions discovery page. The goal of launching this product was to increase the visibility of extensions. This product succeeded, increasing viewership by 37%.
Sourapratim Datta, Product Recommendation: A Hybrid Recommendation Model, July 2018, (Michael Fry, Shreshth Sharma)
This report provides a recommendation of products (movies) to be licensed for an African country channel. The product recommendations are based on its features such as genre, release year, US box office revenue etc. and its performance on other African and worldwide channels. A hybrid recommendation model combining the product features (Content Based Recommendation) and the performance of the products (Collaborative Filtering model) has been developed. For the content-based recommendation, a similarity matrix is calculated based on the user preferences of the market and the most similar products are considered. To calculate the performance of the products that have not been telecast in the African channels, a collaborative filtering model is trained on the known performance indexes. From the predicted performance of the products, the top products are considered. Finally, combining the considered products from both the methods, a final list of products has been recommended by giving equal weightage to both methods.
Akhil Gujja, Hiring the Right Pilots – An Analysis of Candidate Applications, July 2018, (Michael Fry, Steven Dennett)
Employees are an asset to any organization, and the key to any firm’s success. For a company to grow, flourish, and succeed, the right people must be hired for the job from the start. Hiring the right personnel is a time-consuming and tedious task for any organization. In the aviation industry especially, where safety and reliability are of utmost importance, hiring the right pilots is critical. Even under ideal conditions, hiring pilots can be an arduous task. It is extremely difficult to predict exactly how well pilots will perform in the cockpit, because a pilot’s future performance cannot be predicted from academic performance or historical flying metrics alone. It depends a lot on non-quantifiable metrics. Candidate profiles are analyzed to understand which profiles made it through the selection process and which did not make it through the resume screening process. This analysis can be used by recruiters to rank candidate profiles and expedite the hiring process for top-ranked candidates.
Yang He, Incremental Response Model for Improved Customer Targeting, July 2018, (Dungang Liu, Anisha Banerjee)
Traditional marketing response models score customers based on their likelihood to purchase. However, among potential customers, some would purchase regardless of any marketing incentive while others would purchase only because of marketing contact. Therefore, traditional predictive models sometimes waste money on customers who would shop regardless of marketing offers and on customers who would stop shopping if you ‘disturb’ them with marketing offers. The Oriental Trading Company Inc., a company with a catalog heritage, does not want to send catalogs to customers who would purchase naturally, in order to save cost and maximize profit. For my internship, the main objective is to distinguish customer groups that need catalogs to shop from customer groups that will shop naturally or will not shop if given a marketing incentive, using an incremental response model in SAS Enterprise Miner. This report presents the basic theory of the incremental response model and how the model is applied to an Oriental Trading Company dataset. A combined incremental response model was successfully built using demographic and transactional attributes. The estimated model achieved an incremental response of 11.9%, about 1.7 times the baseline incremental response (7.2%). The model was used to predict customers’ purchase behavior in future marketing activity. Additionally, from the modeling outputs, we identified that the overall number of orders, sales of some major products and days since first purchase were the most important factors affecting customers’ response to the mailed catalogs.
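The arithmetic behind the comparison of 11.9% against 7.2% can be checked with a short sketch. It assumes the common uplift definition of incremental response as the response rate of contacted customers minus that of a non-contacted control group; the abstract does not state the exact formula used in SAS Enterprise Miner, and the function names and the example rates of 20% and 8% are hypothetical.

```python
def incremental_response(treated_rate, control_rate):
    """Assumed definition: response rate with contact minus rate without."""
    return treated_rate - control_rate


def lift_ratio(model_incremental, baseline_incremental):
    """How many times the baseline incremental response the model achieves."""
    return model_incremental / baseline_incremental


# Figures reported in the abstract, expressed as proportions.
ratio = lift_ratio(0.119, 0.072)
print(round(ratio, 2))  # 1.65

# Hypothetical group: 20% respond when mailed, 8% respond without a catalog.
example = incremental_response(0.20, 0.08)
```

The ratio comes out at roughly 1.65, which rounds to the 1.7 quoted in the abstract.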
Nandakumar Bakthisaran, Customer Service Analytics – NTT Data, July 2018, (Michael Fry, Praveen Kumar S)
The following work describes the application of data analysis techniques for a healthcare provider. There are two tasks covered here. The first is an investigation to identify the root cause of an anomalous occurrence in a business process. The average of the scores measuring agent performance on a call exhibited an unusual rise starting in 2018. The chief cause was identified to be a deviation from standard procedure by evaluators. Naturally, the subsequent recommendation was to ensure greater adherence to the procedure. The second task follows with scrutiny of the single scoring benchmark used for different types of incoming calls and how it falls short of measuring performance accurately. An attempt was made to fit probability distributions to the underlying data for each call type to check for any inherent distributions. The conclusion was to employ a type-specific system of scoring using point estimates obtained from existing data.
Christopher Uberti, General Motors Energy/Carbon Optimization Group, July 2018, (Michael Fry, Erin Lawrence)
This capstone outlines a strategy for implementing improved statistical metrics for analyzing General Motors (GM) factories. Current GM reporting methods and available data are described. Two methodologies are presented in this paper for improved metrics and dashboards. The first is an analysis of individual HVAC units within a factory (of which each factory has dozens) in order to identify units that may be performing poorly or are not set to the correct modes. The second methodology is a prediction model for overall plant energy usage based on historical data. This would give plant operators a method for comparing current energy usage to past performance while taking into account changes in weather, production, etc. Finally, some potential dashboards are mocked up for use in the energy reporting software.
Anumeha Dwivedi, Sales Segmentation for a New Ameritas Universal Life Insurance Product, July 2018, (Dungang Liu, Trinette James)
The key to strong sales for a new product is knowing the right kind of customers (those who are most profitable) for it and deploying your best agents (high-performing salespersons) to them. This project therefore aims to perform customer and agent segmentation for a new Ameritas Universal Life Insurance product that is slated for launch later this year. The segmentation is based on customer, agent, policy and rider data from similar historical products. It uses different demographic, geographic and behavioral attributes that are available directly or could be inferred or sourced externally. The segments developed would not only allow for more effective marketing efforts (better training, better-informed agents and marketing collateral) but also result in better profitability from sales. Sales segmentation has been attempted using suitable clustering (unsupervised machine learning) techniques, and the results suggest that the clusters of clients in their early sixties and mid-thirties are the most profitable, in that order, and form the major chunk of the customer base. The age band of 45-55 years has not been as profitable, with higher coverage amounts for medium premium payments. On the agent side, the most experienced agents (oldest in age and with the longest tenures with Ameritas) have been most successful in selling UL policies, followed by the youngest group of agents, in their thirties and with the shortest tenures of 2-5 years, while those with 6-15 years of tenure in the 45-55 years age band are more complacent and limited in their sales of these policies.
Kamaldeep Singh, DOTA Player Performance Prediction, July 2018, (Peng Wang, Yichen Qin)
Dota 2 is a free-to-play multiplayer online battle arena (MOBA) video game. Dota 2 is played in matches between two teams of five players, with each team occupying and defending its own separate base on the map. Each of the ten players independently controls a powerful character, known as a "hero" (which they choose at the start of the match), each with unique abilities and a different style of play. During a match, players collect experience points and items for their heroes to successfully battle the opposing team's heroes, who attempt to do the same to them. A team wins by being the first to destroy a large structure located in the opposing team's base, called the "Ancient". The objective of this project is to come up with an algorithm that can predict a player’s performance with a specific hero by learning from that player’s performance with other heroes. The response variable used for quantifying performance is the KDA ratio, i.e. the kill-to-death ratio of that user with that hero. The techniques used in this project are Random Forest, Gradient Boosting and the H2O package, which encapsulates various techniques and automates model selection. Data was provided by Analytics Vidhya and is free to be used by anyone.
Sarita Maharia, NetJets Dashboard Management, July 2018, (Michael Fry, Stephanie Globus)
Data visualization plays a vital role in exploring data and summarizing analysis results across the organization. The visualizations at NetJets were developed using disparate tools on an as-needed basis without any set of corporate standards. Once employees began using Tableau as the data visualization tool, it became even more important to have a centralized team to develop the infrastructure, set the corporate standards and enforce the required access mechanisms. The Center of Excellence for Visual Analytics (CoE-VA) team now serves as the central team to monitor visualization development across the organization. This team requires a one-stop solution to answer analytics community queries related to dashboard usage and to compare access between users. The NetJets Dashboard Management project was developed as the solution to enable the CoE-VA team to monitor the existing dashboards and access structure. The primary purpose of this project is to present a dump of dashboard usage and access data in a concise and user-friendly framework. Two exploratory dashboards were developed for this project to accept user input and provide the required information visually, with a provision to download the data. The immediate benefit of this project is the time and effort saved for the CoE-VA team. The turnaround time for comparing access between two users is now reduced from approximately an hour to a few seconds. The long-term benefit of this project would be to promote a Tableau usage culture in the organization by tracking dashboard usage and educating end-users about under-utilized dashboards.
Mohammed Ajmal, New QCUH Impact Dashboard & Product Performance Dashboard, July 2018, (Michael Fry, Balji Mohanam)
Qubole charges its customers for its services based on their cloud usage. Its current revenue methodology depends on the instance type being used and the Qubole-defined factor (QCUH factor) associated with it. The first project evaluates the impact of the new QCUH factor from a revenue standpoint. Additionally, a dashboard was built that enables the sales team to identify customers whose invoices would increase due to the new QCUH factor. The dashboard also has functionality that will help the sales team arrive at mutually agreed terms with individual customers with respect to the new QCUH factor. Currently Qubole does not have a single reporting platform where all the important metrics are tracked. The second project attempts to address this concern. Two dashboards were built. The first dashboard tracks all critical metrics at the overall or organization level. The second dashboard tracks almost all of the metrics tracked in the first dashboard at an individual customer level. The user needs to input the customer name to populate the data for the customer concerned. The two dashboards provide a comprehensive overview of the health of Qubole.
Maitrik Sanghavi, Member Churn Prediction & CK Health/Goals Dashboards, July 2018, (Michael Fry, Rucha Fulay)
This document provides information on two key projects executed while interning at Credit Karma. A bootstrapped logistic regression model was created to predict the probability of a Credit Karma member not returning to the platform within 90 days. This model can be used to effectively target active members who are ‘at-risk’ of churning and can also serve as a baseline reference for future model improvements. The CK Health and CK Goals dashboards have been modified and updated to monitor the company’s key performance indicators and track 2018 goals. These dashboards were created using BigQuery and Looker and are automatically updated daily.
Nitish Ghosal, Producer Behavior Analysis, July 2018, (Dungang Liu, Trinette James)
Ameritas Insurance Corporation Limited works on a B2B marketing model, partnering with agencies, financial advisors, agents and brokers to sell its products and services to the end customer. An agent can choose to sell products for multiple insurance companies, but he/she can be contracted full time with only one insurance company. In order to incentivize an agent’s affiliation with Ameritas, it has an Agents Benefits & Rewards Program in place which works on the principle of “greater the agent’s results, greater the rewards”. The aim of my study is to provide a holistic overview of our agents’ behavior and identify their drivers of success through segmentation and agent profiling. To achieve this, data visualizations were created in Tableau to find trends, and clustering was performed in SPSS to segment the agents into groups. The agents could be grouped into five distinct categories - top agents, disability insurance specialists, life insurance specialists, generalists and inactive agents. The analysis revealed that benefits, club qualification, contract type, club level, agency distribution region, persistency rates, home agency state and AIC (Ameritas Investment Corporation) affiliation are some of the factors that have an impact on an agent’s success and sales revenues.
Krishna Chaitanya Vamaraju, Recommender System on the Movie Lens Data Set, July 2018, (Dungang Liu, Olivier Parent)
Recommendation systems are used in most e-commerce websites to promote products and to up-sell and cross-sell products to new or existing customers based on the history of data available for existing customers. This helps in recommending the right products to customers, thereby increasing sales. The current report is a summary of various techniques that are used for recommendation. A comparison of the models by run time and the issues concerning each model are discussed in the report. For the current project, data from Kaggle has been used for the analysis - the 100k MovieLens ratings data set. The goal of the current project is to use the MovieLens data in R and build recommendation engines based on different techniques using the recommenderlab package. This could, if deployed into production, serve as a system like the one we see on Netflix. For the analysis, cosine similarity is used to compute the similarity between users and items. The recommendation techniques that have been used are user-based collaborative filtering, item-based collaborative filtering and collaborative techniques based on popularity and randomness. A recommender system using singular value decomposition and K-Nearest Neighbors is also built. A comparison of the techniques indicates that the popularity-based technique gives the highest accuracy as well as a good run time; however, this depends on the data set and the stage of recommendation we are in. Finally, the metric one wants to optimize with a recommender system determines the type of recommender system one should build.
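The cosine similarity mentioned above can be illustrated with a minimal sketch. The project itself was carried out in R, so this Python version is only illustrative; the rating vectors are made up, and treating unrated movies as 0 is an assumption of this sketch rather than something stated in the abstract.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a user with no ratings is treated as similar to no one
    return dot / (norm_a * norm_b)


# Hypothetical ratings of the same four movies by three users (0 = not rated).
user_a = [5, 3, 0, 1]
user_b = [4, 3, 0, 1]
user_c = [0, 0, 5, 4]

sim_ab = cosine_similarity(user_a, user_b)
sim_ac = cosine_similarity(user_a, user_c)
```

Here users A and B rate the same movies similarly, so their similarity is higher than that between A and C. In user-based collaborative filtering, the most similar users' ratings are then used to predict ratings for movies a user has not yet seen.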
Ananthu Narayan Ambili Jayanandan Nair, Comparing Deep Neural Network and Gradient Boosted Trees for S&P 500 Prediction, July 2018, (Yan Yu, Olivier Parent)
The objective of this project was to build a model to accurately predict the S&P 500 index in the (t+1)st minute using the component values in the tth minute. Two different machine learning techniques, Artificial Neural Networks and Gradient Boosted Trees, were used to build the models. TensorFlow, which makes use of the NVIDIA GPU, was used for training the Neural Network model. H2O, which speeds up the training process through parallelized implementations of algorithms, was used for Gradient Boosted Trees. The models were compared using their Mean Squared Errors, and the Neural Network model was found to be better suited for this application.
Prerit Saxena, Forecasting Demand of Drug XYZ using Advanced Statistical Techniques, July 2018, (Michael Fry, Ning Jia)
Client ABC is a large pharmaceutical company and a client of KMK Consulting Inc. ABC has a diverse portfolio of drugs in various disease areas. The organization is structured with a division for every disease area. The NET team is responsible for ABC’s drugs in the neuro-endocrine tumor area, a disease area with a market of about $1.5B globally. ABC’s major drug XYZ has been on the market for a few years and has a major market share. The drug is a “Buy and Bill” drug, which means hospitals buy the drug in advance, stock it, and then bill the payers according to usage. The project shared in this report is the forecasting exercise for drug XYZ. In this project, forecasting has been done for 3 phases: remaining 2018, 2019 and 2020-2021. The team uses various forecasting methods such as ARIMA, Holt-Winters and trends in conjunction with business knowledge to prepare forecasts of the number of units of the drug, as well as dollar sales, for the upcoming years.
Manoj Muthamizh Selvan, Donation Prediction Analysis, July 2018, (Andrew Harrison, Rodney Williams)
The UC Foundation Advancement Services team participates in the process of bringing donations to UC in an effective manner. The team has data on all donors collected over the past 12 years and is interested in understanding any findings and insights from the data. The UC Foundation team would like to predict the probability of large future donations and target donors effectively. Hence, the team wants to understand the triggers and factors responsible for donations and the probability of a donor donating a larger amount (> $10,000). A Random Forest model was used to identify the trigger factors and also to predict the high donors in the prospect population. The results are being used by the Salesforce team of the UC Foundation to target high donors with better accuracy than heuristic-based models.
Akhilesh Agnihotri, Employee Attrition Analysis, July 2018, (Dungang Liu, Peng Wang)
Human resources plays an important part in developing and building a company or an organization. The HR department is responsible for maintaining and enhancing the organization's human resources by planning, implementing, and evaluating employee relations and human resources policies, programs, and practices. Employers generally consider attrition a loss of valuable employees and talent. However, there is more to attrition than a shrinking workforce. As employees leave an organization, they take with them much-needed skills and qualifications that they developed during their tenure. If the reasons behind employee attrition are identified, the company can create a better working environment for its employees, and if employee attrition can be predicted, the company can take the necessary actions to stop valuable employees from leaving. This report therefore explores the HR data and manipulates it to find meaningful relationships between the response variable (whether an employee left the company or not) and the independent variables, which provide information about an employee. The report then builds several statistical models that predict the probability of an employee leaving the company given his or her information, and identifies the best-performing model.
Amoul Singhi, Identifying Factors that Distinguishes High Growth Customers, July 2018, (Dungang Liu, Lingchan Guo)
If a bank can identify customers who have the potential to spend more next year than they spent this year, it can market better products to them and increase customer satisfaction along with its profits. The aim of this analysis is to identify the set of customer features that can distinguish high growth customers from others. The data collected for the analysis was divided into two categories: transactional and demographic. Data exploration identified some factors that behave differently in the two groups. A linear regression model was then built with the percentage increase from 2016 to 2017 as the target variable to identify the factors that are statistically important in determining the growth of a cardholder. A logistic regression model was built by classifying accounts with more than 25% growth as high growth customers. This was followed by a tree model and a Random Forest model to improve on the model. It was found that although some of the variables are statistically significant, their coefficients are very low, implying that although they matter in determining the growth of an account, their impact is small. Two transactional variables identified by the Random Forest can help determine whether a customer is high growth or not, but the accuracy of the model is quite low. Overall, certain factors are identified as important, but it is very difficult to predict whether a cardholder is going to spend more in upcoming years.
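The two target variables described above, percentage growth for the linear regression and a high growth flag for the logistic regression, can be sketched as follows. This assumes growth is measured on annual spend, which the abstract implies but does not state outright; the function names and the dollar amounts are hypothetical.

```python
def pct_growth(spend_prev, spend_next):
    """Percentage change in spend from one year to the next."""
    if spend_prev <= 0:
        raise ValueError("previous-year spend must be positive")
    return (spend_next - spend_prev) / spend_prev * 100.0


def is_high_growth(spend_2016, spend_2017, threshold_pct=25.0):
    """Binary label: True when growth is strictly more than the threshold."""
    return pct_growth(spend_2016, spend_2017) > threshold_pct


# Hypothetical cardholders: (2016 spend, 2017 spend).
cardholders = [(1000.0, 1300.0), (1000.0, 1250.0), (2000.0, 1800.0)]
labels = [is_high_growth(s16, s17) for s16, s17 in cardholders]
```

Only the first hypothetical cardholder (30% growth) is labeled high growth; exactly 25% growth does not qualify because the abstract specifies "more than 25%".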
Sudaksh Singh, Path Cost Optimization: Tech Support Process Improvement, July 2018, (Michael Fry, Rahul Sainath)
The objective of this project is to optimize the process by which tech support agents diagnose and resolve product issues faced by the customers of one of the world’s largest technology companies. The organization’s tech support agents use multiple answer flow trees, which are tree-structured question-and-answer graphs for diagnosing issues in customer products. The answer flow trees are optimized by studying the historical performance of the issue diagnosis and resolution activity carried out by various agents using these trees. Performance of these trees is measured across a collection of metrics defined to capture the speed and accuracy of issue diagnosis and resolution. Based on the analysis, recommendations are made to reorder or prune the answer flow tree to achieve better performance across these metrics. These measures and the resulting recommendations for editing the answer flow trees will serve as a starting point for more advanced, holistic techniques to design algorithms that generate new answer flow trees with the best performance across all metrics while respecting the constraints that limit the reordering and pruning of the answer flow graph.
Max Clark, HEAT Group Customer Lead Scoring, July 2018, (Michael Fry, Maylem Gonzalez)
The HEAT Group is responsible for all events taking place at the American Airlines Arena, such as NBA basketball games, concerts and other performances. While offering a winning and popular product will yield high demand, the HEAT Group must employ analytical methods to smartly target customers who otherwise would not be attending events. The purpose of this project was to determine the differences between the populations of the HEAT Group’s two main customer groups, premium and standard customers. Furthermore, a machine learning model was implemented to, one, create a scoring method that will be used to assess which customers the HEAT Group would have the highest probability to convert from standard to premium customers, and, two, determine which features have the largest impact. It is discovered that the age and financial status are the largest and most important differentiators for the two population groups.
Mohit Deshpande, Wine Review Analysis, July 2018, (Yichen Qin, Peng Wang)
Analyzing structured data is simpler than analyzing unstructured data because observations in structured data are arranged in a specific format suitable for applying analytical techniques. In contrast, the internal structure of unstructured data (audio, video, text, etc.) does not adhere to any format. Unstructured data generation is at an all-time high, so understanding methods to analyze it is the need of the hour. This project aims to study and implement one such technique, text analytics, which is used to examine textual data. The initial part of the project performs exploratory data analysis on a dataset containing wine reviews to discover hidden patterns in the data. The latter part focuses on analyzing the text-heavy variables by performing basic text, word frequency and sentiment analysis.
Devanshu Awasthi, Analysis of Key Sales-Related Metrics through Dashboards, July 2018, (Michael Fry, Jayvanth Ishwaran)
Visibility is key to running a business at any scale. Organizations have a constant need to assess where they stand day-in and day-out and where they can improve. With the advancements in tools and technologies that help capture large chunks of operational and business data at even shorter intervals, companies have started to explore methods of using this data in a better way to get insights more frequently. One step in this direction at NetJets is to move away from the traditional methods of using static systems for business reporting. Descriptive analytics using advanced techniques of data management and data visualization is used to create dashboards which can be shared across the organization with different stakeholders. Dashboards prove to be extremely useful in analysis as they show the trends for different metrics over the month, and help us dig down deeper through the multiple layers of information. This project involved transforming the daily reporting mechanism for the sales team through dashboards for three large categories – cards business, gross sales and net sales. Each category has important metrics the business users are concerned with. On any day, these dashboards help analyze the time-series data associated with these categories and assess how the business has fared on certain metrics for the month, identify anomalies and get a comprehensive view of the expected sales for the rest of the month.
Ayank Gupta, Predicting Hospital-Acquired Pressure Injuries, July 2018, (Michael Fry, Susan Kayser)
A pressure injury is defined by NPUAP as "localized damage to the skin and/or underlying soft tissue usually over a bony prominence or related to a medical or another device. The injury can present as intact skin or an open ulcer and may be painful. The injury occurs as a result of intense and/or prolonged pressure or pressure in combination with shear. The tolerance of soft tissue for pressure and shear may also be affected by microclimate, nutrition, perfusion, co-morbidities, and condition of the soft tissue." Identifying the factors responsible for a pressure injury can be very difficult and is of vital importance for hospital bed manufacturers. It is crucial to identify the type of pressure injury a patient might acquire in a hospital and educate the nurses to take proactive measures instead of preventive measures. The objective is to predict whether a patient will have a hospital-acquired pressure injury given various demographic information about the patient, information about the wound, and the hospital.
Renganathan Lalgudi Venkatesan, Detection of Data Integrity in Streaming Sensor Data, July 2018, (Michael Fry, Netsanet Michael)
The Advanced Manufacturing Analytics team wants to identify whether there are any data integrity issues in the streaming sensor data gathered from the manufacturing floors. The infrastructure for asset tracking has been set up in several phases over time. Each site consists of sensors that capture the spatiotemporal information of all the tagged assets, thereby giving real-time information regarding the whereabouts of each asset. Depending on their purpose, there are several types of these sensors positioned in different locations within a given plant. This project aims to examine the integrity of the streaming data, monitor the health of the flow, and detect and label the time frames of historical disruptions. Also, since the streaming data infrastructure is in its early stages, the goal of the analysis is to identify the shortcomings and explore the improvements that will be required for future production-critical processes. Several methods are proposed to address current and recent streaming data issues by using historical data to capture disruptions in the data feed. This would help in capturing disruptions on time and thus in making real-time site operation decisions. The infrastructure also has periods of data loss as well as periods with a high volume of one-time data (referred to as outliers in the data). The project proposes methods to detect and quantify the data losses and to detect outliers in the sensor data based on the operating characteristics and other factors at the site.
Aksshit Vijayvergia, Predict Default Capability of a Loan Applicant, July 2018, (Yichen Qin, Ed Winkofsky)
Many people struggle to get loans due to insufficient or non-existent credit histories. Unfortunately, this population is often taken advantage of by untrustworthy lenders. Borrowing from financial institutions is a difficult affair for this segment of people. Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data, including telco and transactional information, to predict their clients' repayment abilities. For the capstone project, I will be digging deep into the dataset provided by Home Credit on the analytics website Kaggle. In order to classify whether an applicant will default, I will be analyzing and munging two datasets. The first dataset is extracted from the application filed by a client in Home Loan’s portal. The second dataset contains a client’s historical borrowing information.
Sylvia Lease, Analytics & Communication: Leveraging Data to Make Connections, Summarize Results, & Provide Meaningful Insights, July 2018, (Mike Fry, Steve Rambeck)
When I began my time as Ever-Green Energy’s Business Analyst Intern, a defined project goal was established: to create a variety of reports for client and internal use alike. Armed with newly developed skills in coding, data visualization, and data management, I quickly realized that these skills would serve as tools for an overarching, more imperative goal: to communicate effectively. Over several weeks, the opportunity to merge data with communication took a variety of forms. In the beginning, discussions with various leaders and groups within the company led to an understanding of how analytics could further the company’s mission. This led to a recognition of how analytics could help bridge a gap between the IT and Business Development groups to create reports that helped the teams serve clients by answering key questions and interests. Ultimately, communication was key to the success of each report, which was judged by whether the report provided useful insights and summaries of data in a clear and efficient manner.
Anitha Vallikunnel, Product Reorder Predictions, July 2018, (Dungang Liu, Yichen Qin)
Uncovering shopping patterns helps retail chains design better promotional offers, forecast demand, and optimize brick-and-mortar store aisles: in short, everything needed to build a better experience for the customer. In this project, using Instacart’s open-sourced transactional data, I have identified and predicted the items that are ordered together. The Apriori algorithm and association rules are used for market basket analysis to achieve this. Using feature engineering and gradient boosted tree models, reordering of items is also predicted. This will help retailers in demand forecasting and in identifying the items that will be ordered more frequently. The F1 score is used as the metric for measuring prediction accuracy for reordering. On the training sample, the F1 score is 0.772, and the out-of-sample F1 score is 0.752.
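For reference, the F1 score named above is the harmonic mean of precision and recall. The sketch below computes it from confusion-matrix counts; the function name and the example counts are illustrative assumptions and are not taken from the project.

```python
def f1_score(true_pos, false_pos, false_neg):
    """F1 = 2 * precision * recall / (precision + recall).

    Returns 0.0 when there are no true positives, since precision
    and recall are then zero or undefined.
    """
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for reordered-item predictions.
print(round(f1_score(true_pos=8, false_pos=2, false_neg=2), 3))
```

Because F1 ignores true negatives, it suits reorder prediction, where most products are not reordered in any given basket.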
Ananya Dutta, Trip Type Classification, July 2018, (Dungang Liu, Peng Wang)
Walmart is the world’s largest retailer, and improving customers’ shopping experience is a major goal for the company. In this project we try to recreate Walmart’s existing trip type classification with a limited number of features. This categorization of trip types will help Walmart improve the customer’s experience through personalized targeting and can also help identify business softness related to specific trip types. The data contains over 600k rows of purchase data at the scan level. After rigorous feature engineering and model comparisons, we found that the results using an extreme gradient boosting model are promising, with an accuracy of ~90% on the training data and ~70% on the testing data. After looking at the importance of variables - total number of units, total number of, count of UPCs, coefficient of variation of percentage contributions across departments and items sold in departments like financial services, produce, menswear and DSD Grocery were found important in building this classifier.
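One of the engineered features listed above, the coefficient of variation of percentage contributions across departments, could be computed roughly as below. This is a sketch under assumptions: the input layout (units per department for one trip), the function name, and the use of the population standard deviation are not specified in the abstract.

```python
from statistics import mean, pstdev


def department_share_cv(units_by_department):
    """Coefficient of variation of each department's percentage share.

    units_by_department maps department name -> units bought in one trip.
    A value near 0 means purchases are spread evenly across departments;
    larger values mean the trip is concentrated in a few departments.
    """
    total = sum(units_by_department.values())
    if total <= 0:
        raise ValueError("trip must contain at least one unit")
    shares = [100.0 * units / total for units in units_by_department.values()]
    return pstdev(shares) / mean(shares)


# Hypothetical trip: 30 units of produce, 10 units of menswear.
print(department_share_cv({"PRODUCE": 30, "MENSWEAR": 10}))
```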
Nitha Benny, Recommendation Engine, July 2018, (Dungang Liu, Yichen Qin)
Recommendation engines are widely used today across e-commerce, video streaming, movie recommendations etc. and this is how each of these businesses maintains their edge in the highly competitive online business world. The idea behind using recommendation engines itself is intriguing and this project aims to compare collaborative filtering techniques to better understand how recommendation engines work. Two main types of collaborative filtering i.e., user based, and item-based methods are used here. The two models are built, and we calculate the MSE and MAE values for each. The models are then evaluated using ROC curve and precision-recall plots for a different number of recommendations. We find that the user based collaborative filtering method using the cosine similarity function works best giving a lower MSE value of 1.064 as well as the better area under the curve and precision-recall curves. Hence, the User-based collaborative filtering method will help businesses recommend better products to their customers and thus improve their customer experience.
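The user-based approach with cosine similarity described above can be illustrated with a small sketch. The data layout (one dict of item ratings per user), the function names, and the toy ratings are assumptions for illustration only; the project's actual implementation is not shown in the abstract.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two users' rating dicts (unrated items count as 0)."""
    dot = sum(u[item] * v[item] for item in set(u) & set(v))
    norm_u = math.sqrt(sum(r * r for r in u.values()))
    norm_v = math.sqrt(sum(r * r for r in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def predict_rating(target, neighbours, item):
    """Similarity-weighted average of the neighbours' ratings for item."""
    weighted_sum = total_weight = 0.0
    for other in neighbours:
        if item in other:
            sim = cosine_similarity(target, other)
            weighted_sum += sim * other[item]
            total_weight += abs(sim)
    return weighted_sum / total_weight if total_weight else None


def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Toy ratings: predict how the target user would rate item "c".
target = {"a": 5, "b": 3}
neighbours = [{"a": 5, "b": 3, "c": 4}, {"a": 1, "b": 5, "c": 2}]
print(predict_rating(target, neighbours, "c"))
```

The first neighbour rates items much like the target user, so its rating of "c" receives the larger weight in the prediction.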
Nirupam Sharma, UC Clermont Data Analysis and Visualization, July 2018, (Mike Fry, Susan Riley)
For my summer internship, I worked as a graduate student at UC Clermont College in Batavia, Ohio, for the Office of Institutional Effectiveness from May 2018 to July 2018. My responsibilities were to build an analytical engine in R to perform data analysis and to design Tableau dashboards highlighting key university insights. The data used in the analysis consisted of tables describing enrollments, courses, employees, accounts and sections for different semesters. The analytical engine was written in R to connect to the data, combine data tables, and perform SQL and descriptive analysis to draw inferences about trends across years. The results of the analysis were used to build dashboards in Tableau. My responsibilities for the Tableau work were to create new calculated fields, parameters and dynamic actions and to use other advanced Tableau features learned during my master's at UC to build charts and dashboards to be uploaded to the UC Clermont website. The analytical engine I built allowed the college to perform data pipeline tasks quickly and with little human input, saving the college considerable time and resources. The dashboards help the college better understand trends in the data and make recommendations to management. My internship allowed me to hone my R and Tableau skills. I learned to use many advanced R packages, and my ability to write quality code increased significantly. My experience at UC Clermont College will allow me to work more professionally and effectively in my future job.
Scott Fueston, Preventing and Predicting Crash-related Injuries, July 2018, (Yan Yu, Craig Zielazny)
This study aims to identify influential factors that elevate a motorist’s risk of sustaining a serious or fatal injury during a motor vehicle crash. Addressing these factors could then potentially save lives, prevent long-term pain and suffering, and avert liabilities and monetary damages. Using population comparisons through exploratory data analysis and model creation for prediction, contributory factors to devastating injuries have been identified. These include: lack of restraint use, deployment of an air bag, crashing into a fixed object, crashing head-on, a roadside collision, time of day is night, the vehicle type is a car, speeding, a rollover occurred, impact of first collision occurred in the front-left corner, disabling damage occurred to the vehicle, and alcohol involved. This information could be invaluable to key members in areas such as policy design, regulatory agencies, car manufacturers, and consumers in: developing clear communications and advocacy for ways to aid in prevention, proposing and implementing effective policy and laws, aiding in the approach taken in manufacturing and designing future automobiles, and elevating the general public’s awareness in terms of risk factors.
Amit Kumar Mishra, Customer Churn Prediction, July 2018, (Yan Yu, Yichen Qin)
Churn rate is defined as the number of customers who moved out of the subscription of an organization. It is an important component in the profitability of an organization and gives an indication of the revenue lost. Additionally, an organization can identify the factors responsible for customer churn and allocate its resources to those factors. A customer retention program can then be developed so that customer retention is maintained. Given the significance of customer churn, telecommunication customer data was obtained from the IBM repository and explored to find the factors responsible for customer churn. Various machine learning techniques, including logistic regression (with probit, logit and cloglog link functions and different variable selection procedures), tree, random forest, support vector machine and gradient boosting, were used to predict customer churn, and the best model was identified in terms of in-sample and out-of-sample performance. Tenure, contract, internet service, monthly charges and payment method were found to be the most important variables for predicting customer churn in the telecommunication industry. Among all the classification techniques, the support vector machine with a radial basis function (RBF) kernel performed the best in terms of accuracy, with 80.10% of the data classified correctly.
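The three link functions mentioned above differ only in how a linear predictor is mapped to a churn probability. The sketch below shows the standard inverse links using only the Python standard library; the function names are illustrative, and the project itself may have used a statistical package rather than code like this.

```python
import math


def inv_logit(eta):
    """Logit link: probability = 1 / (1 + exp(-eta))."""
    return 1.0 / (1.0 + math.exp(-eta))


def inv_probit(eta):
    """Probit link: probability = standard normal CDF evaluated at eta."""
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))


def inv_cloglog(eta):
    """Complementary log-log link: probability = 1 - exp(-exp(eta))."""
    return 1.0 - math.exp(-math.exp(eta))


# Compare the three mappings on a few values of the linear predictor.
for eta in (-2.0, 0.0, 2.0):
    print(eta, round(inv_logit(eta), 3), round(inv_probit(eta), 3), round(inv_cloglog(eta), 3))
```

Logit and probit are symmetric around a probability of 0.5, while cloglog is asymmetric, which can matter when one class is much rarer than the other.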
Pranil Deone, Default of Credit Card Clients Dataset, July 2018, (Peng Wang, Liwei Chen)
The Default of Credit Card Clients data set is used for this project. The main objective is to build a credit risk model that accurately identifies the customers who will default on their credit card bill payment in the next month. The model is based on the credit history of the customers, which includes information regarding their limit balance, previous months' payment status and previous months' bill amounts. Various demographic factors such as age, sex, education and marital status have also been considered in building the model. The data set contains 30000 observations and 25 variables. Some preprocessing is done on the data to prepare it for analysis and modeling. Quantitative and categorical variables are identified and separated for appropriate exploratory data analysis. Data modeling techniques such as generalized logistic regression, stepwise variable selection, LASSO regression and Gradient Boosting Machine have been used to build different credit risk models. Model performance is evaluated on the training and the testing data. Performance criteria such as misclassification rate and AUC have been used to evaluate the models and select the best one.
Hemang Goswami, Ames Housing Dataset, July 2018, (Dungang Liu, Yichen Qin)
Residential real estate prices are fascinating… and frustrating at the same time. None of the parties involved in the house buying process (the homebuyer, the home-seller, the real estate agent, the banker) can point out the factors affecting house pricing with total conviction. This project explores the Ames Housing Dataset, which contains information on the residential property sales that occurred in Ames, Iowa from 2006 to 2010. The dataset has 2930 observations with 80 features describing the state of the property, including our variable of interest: Sale Price. After creating 10 statistical models ranging from a basic linear regression model to highly complex models such as Gradient Boosting and Neural Network, we were able to predict house prices with an MSE as low as 0.015. In the process, we found that the overall quality of the house, exterior condition, area of the first floor and neighborhood were some of the key features affecting prices.
Ameya Jamgade, Breast Cancer Wisconsin Prediction, July 2018, (Yan Yu, Yichen Qin)
Breast cancer is a cancer that develops from the breast tissue. Certain changes in the DNA (mutations) result in uncontrolled growth of the cells, eventually leading to cancer. Breast cancer is one of the most common types of cancer in women in the United States, ranking second among cancer deaths. This project aims to analyze data on women residing in the state of Wisconsin, USA by applying data mining techniques to classify whether the tumor mass is benign or malignant. Data for this project was obtained from the UCI Machine Learning repository and contains information on 569 women across 32 different attributes. Data cleaning and exploratory data analysis procedures were performed to prepare the data set and summarize its main characteristics. The data was partitioned into training and test sets with an 80%/20% split, and data mining algorithms such as k-nearest neighbor, random forest and support vector machine were used to classify the diagnosis Y-variable as malignant or benign. The optimal value of k for the k-nearest neighbor classifier is 11, which gives 98.23% accuracy. The tuned random forest model has an error rate of 3.87% and identified the top 5 predictor variables. The tuned SVM model gives an accuracy of 98.68% and 95.58% on training and test data respectively. The findings of this project can be used by the health-care community to perform additional research on these attributes to help prevent the pervasiveness of breast cancer.
Sai Uday Kumar Appalla, Predicting the Health of Babies Using Cardiotocograms, July 2018, (Yan Yu, Yichen Qin)
The aim behind doing this research is to predict the health of a baby based on different diagnostic features observed in the cardiotocograms. The data was collected from the UCI Machine Learning repository. Different Machine Learning algorithms were built to try and understand what are the factors that have a significant influence on the baby’s health and predict the health state of the baby based on these factors with the best possible accuracy. Initially, basic classifiers like K-nearest neighbours and Decision Trees are used to make predictions. These algorithms have higher interpretability and they help us understand the significance of different variables in the analysis. During the later parts of the analysis, complex classifiers like Random Forest, Gradient Boosting and Neural Networks are used to boost the accuracy of the predictions. Finally, after looking at all the different model metrics, Gradient Boosting tree is selected as the best model as it has better model metrics than any of the other models.
Piyush Verma, Building a Music Recommendation System Using Information Retrieval Technique, July 2018, (Peng Wang, Yichen Qin)
Streaming music has become one of the top sources of entertainment for millennials. Because of globalization, people all around the world are now able to access different kinds of music. The global recorded music industry is worth $15.7 billion and was growing at 6% as of 2016. Digital music is responsible for driving 50% of those sales. There are 112 million paid subscribers to streaming services and roughly 250 million users in total, if we include those who don’t pay. Thus, it is very important for streaming service providers like YouTube, Spotify and Pandora to continuously improve their service to users. Recommendation systems are one such information retrieval technique, used to predict the rating or popularity a user would give an item. In this project I explore a number of methods to predict users' ratings for different artists using GroupLens’ Last.FM dataset.
Poorvi Deshpande, Sales Excellence, July 2018, (Yichen Qin, Ed Winkofsky)
One of the services that a bank offers is to provide loans. The process by which the bank decides whether an applicant should receive a loan is called underwriting. An effective underwriting and loan approval process is a key predecessor to favorable portfolio quality, and a main task of the function is to avoid as many undue risks as possible. When this process is undertaken with well-defined principles, the lender is able to ensure good credit quality. This is a problem faced by the digital arm of a bank. The primary aim of this division is to increase customer acquisition through digital channels. The division sources leads through various channels like search, display, email campaigns and affiliate partners. As expected, they see differential conversion depending on the sources and the quality of these leads. Consequently, they now want to identify the lead segments with a higher conversion ratio (lead to buying a product) so that they can specifically target these potential customers through additional channels and re-marketing. They have provided a partial data set for salaried customers from the last 3 months. They also capture basic details about customers. We need to identify the segment of customers with a high probability of conversion in the next 30 days.
Jatin Saini, An Analysis of Identifying Diseased Trees in Quickbird Imagery, July 2018, (Yan Yu, Edward P Winkofsky)
Machine learning algorithms are widely used to identify patterns in data. One application is identifying diseased trees from Quickbird imagery. In this project, we apply logistic regression, LASSO and classification tree (CART) models to imagery data to identify significant variables. We designed this study to create training and testing datasets and compared the Area Under the Curve (AUC). Logistic regression gave an AUC of 0.97 on both the training and testing datasets, whereas CART gave an AUC of 0.92 on testing data and 0.91 on training data. After examining the accuracy of the different algorithms, we conclude that logistic regression gave more accurate results on both training and testing data.
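The AUC used for the comparison above can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it directly from that definition; the function name and the toy labels and scores are illustrative assumptions, not data from the study.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for the positive class (e.g. diseased tree), 0 otherwise.
    scores: predicted probabilities or scores, higher = more likely positive.
    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

This pairwise form is quadratic in the number of observations, so a rank-based library routine is the practical choice for large imagery datasets.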
Raghu Kannuri, Recommender System Using Collaborative Filtering and Matrix Factorization, July 2018, (Peng Wang, Yichen Qin)
This project aims to develop a recommender system using various machine learning techniques. A recommender system develops a customized list of recommendations for every user and thus acts as a virtual salesman. It predicts missing user-product ratings by drawing information from the user's past product ratings or buying history and from ratings by similar users. Content-based filtering, knowledge-based filtering, collaborative filtering and hybrid filtering are the widely used recommender system techniques. This project deals with Item-Based (IBCF) and User-Based (UBCF) collaborative filtering with different similarity metrics and with Matrix Factorization using Alternating Least Squares. Matrix Factorization outperformed UBCF and IBCF on all evaluation metrics, including precision, recall and AUC.
Madhava Chandra, Analysis on Loan Default Prediction, July 2018, (Yichen Qin, Peng Wang)
The purpose of this study was to determine what constitutes risky lending. Each line item in the data corresponded to a loan and had various features relating to loan amount, employment information of the borrower, payments made, and the classification of the loan as charged off or active, with any delays in payments noted. An exploratory data analysis was performed on the data to look for outliers and to examine the individual distributions of the variables. The interactions between these variables were then studied to weed out highly correlated variables. Owing to the low representation of defaults in the sample, this was treated as an imbalanced class problem, in which traditional random sampling would not yield optimal results. To overcome this problem, stratified sampling, random under- and over-sampling, SMOTE and ADASYN methodologies were explored.
Each of the above sampling methodologies was used to train and test a logistic regression model in order to pick the sampling procedure to follow for this exercise. SMOTE was found to give the best results. To best classify which loans would likely default, various statistical learning techniques were employed, including regression, tree-based methods (standalone, boosting and bagging ensembles), Support Vector Machines and Neural Networks. Among these classifiers, gradient boosting was observed to have the best performance, although with further fine tuning, Deep Neural Networks could possibly classify better.
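The core idea of SMOTE, the sampling method selected above, is to create synthetic minority-class points by interpolating between a minority point and one of its nearest minority neighbours. The following is a simplified sketch of that idea using only the standard library; the function name, parameters and toy points are assumptions, and a real analysis would more likely rely on an established implementation.

```python
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points from a list of minority-class tuples.

    For each synthetic point: pick a random minority point, pick one of its
    k nearest minority neighbours, and interpolate a random fraction of the
    way between the two. Requires at least two minority points.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority points")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base_idx = rng.randrange(len(minority))
        base = minority[base_idx]
        others = [p for i, p in enumerate(minority) if i != base_idx]
        # Sort the remaining minority points by squared Euclidean distance.
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation fraction in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Hypothetical charged-off loans described by two scaled features.
defaults = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
print(smote_sketch(defaults, n_new=3))
```

Oversampling of this kind should be applied to the training split only, so that the test data keeps the original class balance.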
Samrudh Keshava Kumar, Analytical Analysis of Marketing Campaign Using Data Mining Techniques, July 2018, (Dungang Liu, Peng Wang)
Marketing products is an expensive investment for a company, and spending money to market to customers who might not be interested in the product is inefficient. This project aims to determine and understand the various factors that influence a customer’s response to a marketing campaign. This will help the company design targeted marketing campaigns to cross-sell products. Predictive models, including logistic regression, Random Forests and Gradient Boosted trees, were built to predict the response of each customer to the campaign based on various characteristics of the customer. The factors affecting the response were determined to be employment status, income, type of renew offer, months since policy inception and months since last claim. The models were validated using a test set, and the best accuracy was achieved by the Random Forests model, which has an AUC of 0.995 and a misclassification rate of 1.3%.
Rohit Bhaya, Lending Club Data – Assessing Factors to Determine Loan Defaults, July 2018, (Yan Yu, Peng Wang)
Lending Club is an online peer-to-peer platform that connects loan customers with potential investors. Loan applicants can borrow money in the range of $1,000 to $40,000, and investors can choose the loan products they want to invest in and make a profit. The loan data was available on Kaggle and contains applicant information about loans that were originated between 2007 and 2015. Using the available information for applicants who have already paid off the loan, various machine learning algorithms are built to estimate the propensity of a customer to default. The stepwise AIC approach for logistic regression had the best performance among all the models tested. This final model was then used to build an applicant default scorecard with a range between 300 and 900. A higher score indicates a higher propensity for an applicant to default. The scorecard performed well on both the training data and the test data. It was then applied to the active customer base to score each applicant’s propensity to default. From the distribution of this score, it was observed that most of the active loan customers fall into the low-risk category. For higher-score applicants, management can prepare preventive strategies to avoid future losses.
Nitin Sharma, A Study of Factors Influencing Smoking Behavior, July 2018, (Dungang Liu, Liwei Chen)
In this study, statistical analysis is performed to understand the factors that influence smoking habits. The data used in this study was obtained from a survey conducted in Slovakia on participants aged 15-30 years. This dataset is publicly available on the Kaggle website. Data collected in this survey includes information about the “Smoking habits” of the participants. This is the variable of interest, a categorical variable with the values Never smoked, Tried smoking, Former smoker and Current smoker. The goal of this study is to find out which factors influence the smoking habit. Machine learning techniques (logistic regression, ensemble methods) are used to predict whether an individual is a current or past smoker or someone who has never smoked. The best model selected in this study provides an overall accuracy of 83% on the test sample. The results of this study apply only to the Slovak population aged 15-30 and cannot be generalized to a different population.
Yiyin Li, Foreclosure in Cincinnati Neighborhoods, July 2018, (Yan Yu, Charles Sox)
The main purpose of this paper is to analyze what factors would likely affect foreclosure in Cincinnati neighborhoods and build a model to predict whether the property will be foreclosed by banks. The dataset that is analyzed lists all real estate transactions in Cincinnati from 1998 to 2009. In this paper, after a brief description of the project background and data, exploratory data analysis will be provided, which mainly includes a basic analysis of each individual variable, the correlation statistics between variables and the basic information of 47 Cincinnati neighborhoods. Then, 10 different types of models and a model comparison are provided in the modeling section in order to find the best model to predict the foreclosure. In conclusion, properties’ sales price, building and land value, selling year and that year’s properties mortgage rate, and the median family income are the most influential variables, and the gradient boosting model is the best model for predicting foreclosure.
Adrián Vallés Iñarrea, Predicting Customer Satisfaction and Dealing with Imbalanced Data, July 2018, (Dungang Liu, Shaobo Li)
From frontline support teams to C-suites, customer satisfaction is a key measure of success. Unhappy customers don't stick around. In this paper, we compare logistic regression, classification tree, random forest and extreme gradient boosting models to predict whether a customer is satisfied or dissatisfied with their banking experience. Doing so would allow banks to take proactive steps to improve a customer's happiness before it's too late. The dataset was published on Kaggle by Santander Bank, a Spanish banking group with operations across Europe, South America, North America and Asia. It is composed of 76020 observations and 371 variables that have been semi-anonymized to protect the clients’ information. 96.05% of the customers are satisfied and only 3.95% are dissatisfied, making this classification problem highly imbalanced. Since most of the commonly used classification algorithms do not work well for imbalanced problems, we also compare two ways to deal with the imbalanced data classification issue. One is based on cost-sensitive learning, and the other is based on a sampling technique. Both methods are shown to improve the prediction accuracy of the minority class and to perform favorably compared to the existing algorithms.
Guansheng Liu, Development of Statistical Models for Pneumocystis Infection, July 2018, (Peng Wang, Liwei Chen)
The yeast-like fungi Pneumocystis reside in lung alveoli and can cause a lethal infection known as Pneumocystis pneumonia (PCP) in hosts with impaired immune systems. Current therapies for PCP suffer from significant treatment failures and a multitude of serious side effects. Novel therapeutic approaches, such as newly developed drugs are needed to treat this potentially lethal opportunistic infection. In this study, I built a simplified two-stage model for Pneumocystis growth and determined how different parameters control the levels of Trophic and Cyst forms of the organism by employing machine learning methods including multivariate linear regression model, partial least squares, regression tree, random forest and gradient boosting machine. It was discovered that parameters of K_sTro (replication rate of Trophic form), K_dTro (degradation rate of Trophic form) and K_TC (transformation rate from Trophic form to Cyst form) play predominant roles in controlling the growth of Pneumocystis. This study is of great clinical significance, as the extracted statistical trends on the dynamic changes of the Pneumocystis will guide the design of novel and effective treatments for controlling the growth of Pneumocystis and PCP therapy.
Vignesh Arasu, Major League Baseball Analytics: What Contributes Most to Winning, July 2018, (Yan Yu, Matthew Risley)
Big data and analytics have been a growing force in Major League Baseball. The Moneyball principle emphasizes two statistics, on-base percentage and slugging (Total Bases/Number of At Bats), as the core of building winning franchises. This report analyzes data from all teams from 1962-2012 using multiple linear regression, logistic regression, regression and classification trees, generalized additive models, linear discriminant analysis, and k-means clustering. The best models were built for the number of wins by a team (linear regression response variable) and for whether or not a team makes the postseason (logistic regression response variable). They show that runs scored, runs given up, on-base percentage, and slugging have strong effects on wins and on making the playoffs. The best in-sample models from the supervised classification techniques all show AUC values over 0.90, while the unsupervised k-means clustering technique showed that the data can be effectively grouped into 3 clusters. The mix of supervised and unsupervised techniques shows that a variety of statistical methods can be used to analyze baseball data.
Preethi Jayaram Jayaraman, Prediction of Kickstarter Project Success, July 2018, (Yichen Qin, Bradley Boehmke)
Kickstarter is an American public-benefit corporation that uses crowdsourcing to bring creative projects to life. As a crowdfunding platform, Kickstarter promotes projects across multiple categories such as film, music, comics, journalism, games and technology, among others. In this project, the Kickstarter Projects Database was analyzed and explored in detail. The patterns identified in the data exploration stage were used as inputs for predictive modeling. Classification models such as Logistic Regression and Classification Trees were built to classify the Kickstarter projects. Performance across the two models was compared on the validation set (hold-out set – 20% of the data) using accuracy, sensitivity and AUC as the performance criteria. ROC curves were also plotted for both models. The Logistic Regression model was chosen as the best model for the Kickstarter project classification with an accuracy of 0.9996 and AUC of 0.9999. The performance of the Logistic Regression model (best performing model) was evaluated on the test data to conclude the classification problem. The Logistic Regression model classified the Kickstarter projects with an accuracy of 99.96% on the test data. The analysis of the Kickstarter Projects was further extended to include projects in the states ‘Suspended’, ‘Live’, ‘Undefined’ and ‘Canceled’, all recoded as ‘Failed’. Building Logistic Regression and Classification Tree models on this extended data again resulted in Logistic Regression as the best model, with a classification accuracy of 0.9656 on the test data.
Rohit Pradeep Jain, Image Classification: Identifying the object in a picture, July 2018, (Yichen Qin, Liwei Chen)
The objective of this project was to classify images of fashion objects (like T-shirts, sneakers, etc.) based on the pixel information contained in these pictures. The image was classified into one of the 10 available classes of fashion objects using different modeling techniques and a final model was chosen based on the accuracy on the cross-validation dataset. The final model was then tested on the untouched testing dataset to validate the out of bag accuracy. The project serves as a benchmark for more advanced studies in the image classification field and helps in technologies like stock photography.
Priyanka Singh, Mobile Price Prediction, July 2018, (Peng Wang, Liwei Chen)
The aim of this study is to classify the prices of mobile devices from 0 to 3 with the higher number denoting higher prices. The dataset has a total of 2000 observations and 21 variables. The response variable, the price range is to be predicted with the highest accuracy possible. The analysis starts with performing the exploratory data analysis followed by the construction of machine learning models. The exploratory data analysis revealed that the categorical variables weren’t significant enough in determining the price of the devices. The numeric variables, battery power and ram of the phones, had a considerable impact on the prices. Classification tree, random forest, support vector machines and gradient boosting machines were used to predict the price of the phones. Support vector machine model was chosen as our final model as it gave the lowest misclassification rate of 0.08 and highest area under curve (AUC) value of 0.97. The features used in generating the model were: ram, battery power, pixel width, pixel height, the weight of the mobile, internal memory, mobile depth, clock speed and touchscreen.
Gautami Hegde, HR Analytics: Predicting Employee Attrition, July 2018, (Yan Yu, Yichen Qin)
Employee attrition is a major problem to an organization. One of the goals of the HR Analytics department is to identify the employees that are likely to leave the organization in the future and take actions to retain them before they leave. Thus, the aim of this project is to understand the key factors that influence this attrition and predict the attrition of an employee based on these factors. The dataset used here is the HR analytics dataset by IBM Watson Analytics which is a sample dataset created by IBM data scientists. In this project, the exploratory data analysis includes feature selection based on distributions, correlation and data visualizations. After eliminating some features, logistic regression, generalized additive model, decision tree and random forest techniques are used for building models. In order to evaluate the model performance, the prediction accuracy and AUC are considered. Of the different classification techniques, logistic regression model and generalized additive model were found to be the best in predicting the employee attrition.
Venkata Sai Lakshmi Srikanth Popuri, Prediction of Client Subscription from Bank Marketing Data, July 2018, (Peng Wang, Yichen Qin)
Classification is one of the most important and interesting problems in today’s world. It has applications ranging from email spam tagging to fraud detection to predictions in the healthcare industry. The area of interest here is the bank marketing of a Portuguese banking institution. The marketing teams at banks run campaigns to persuade clients to subscribe to a term deposit. The purpose of this paper is to apply various data and statistical techniques to analyze and model the bank marketing data and predict whether a client will subscribe to a term deposit. The analysis addresses this classification problem by performing Exploratory Data Analysis (EDA), building models such as Logistic Regression, Step AIC and Step BIC models, Classification Tree, Linear Discriminant Analysis (LDA), Support Vector Machines (SVM), Random Forest (RF) and Gradient Boosting (GB), and validating these models using the misclassification rate and the area under the ROC curve. SVM performs better than the other models for this dataset, with a low out-of-sample misclassification rate and good AUC values.
Ali Aziz, Financial Coding for School Budgets, July 2018, (Yan Yu, Peng Wang)
School budget items must be labelled according to their description in a difficult task known as financial coding. A predictive model that outputs the probability of each label can help in accomplishing this work. In this project the effectiveness of several data processing techniques and machine learning algorithms was studied. After applying data imputation and natural language processing techniques, a one-vs-rest classifier consisting of L1 regularized logistic regression models performed the best out of all classifiers investigated. This classifier achieved an out-of-sample Log Loss of 0.5739, an improvement of approximately 17% on the baseline predictive model.
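The Log Loss metric mentioned above can be illustrated with a minimal sketch. It assumes the common multiclass definition (the mean negative log of the probability assigned to the true label); the exact variant used in the project, for example averaging across several label categories, is not stated in the abstract. The function name and toy values are hypothetical.

```python
import math

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_true: list of integer class indices.
    probs:  list of per-class probability rows, one row per observation.
    eps:    clipping constant so that log(0) is never evaluated.
    """
    total = 0.0
    for label, row in zip(y_true, probs):
        p = min(max(row[label], eps), 1 - eps)  # clip to (eps, 1 - eps)
        total -= math.log(p)
    return total / len(y_true)

# Hypothetical toy predictions for three budget items and three labels.
y_true = [0, 2, 1]
probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
]
loss = multiclass_log_loss(y_true, probs)
```

Lower values are better: a classifier that puts probability 1 on every true label would score 0.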
Shashank Badre, A Study on Online Retail Data Set to Understand the Characteristics of Customer Segments that Are Associated with the Business, July 2018, (Peng Wang, Yichen Qin)
Online retailers in the world who happen to have a small business and are new entrants in the market are keen on using data mining and business intelligence techniques to better understand existing and potential customer base. However, such small businesses often lack expertise and technical know-how to perform requisite analysis. This study will help such online retailers to understand the approach and different ways the data can be utilized to gain insights into its customer base. This study is done on an online retail data set to understand characteristics of different segments of customers. Based on these characteristics the study will explain which customers segments contribute high monetary value and which customer segments contribute low monetary value to the business.
Ravish Kalra, Phishing Attack Prediction Engine, July 2018, (Dungang Liu, Edward Winkofsky)
A phishing attack tricks users into entering their credentials on a fake website or into opening malware on their system. This, in turn, results in identity theft or financial losses. The aim of this project is to build a prediction engine through which a browser plugin can accurately predict whether a given website is legitimate or fraudulent after capturing certain features from the page. The scope of the project is limited to websites and does not involve any other kind of electronic media. The data set used for the analysis has been obtained from the UCI Machine Learning Repository. After evaluating a website through 30 documented features, the model predicts a binary response of 0 (legitimate) or 1 (phishing). Methods of analysis include (but are not limited to) visualizing the spread of different features, identifying correlation between covariates and the dependent variable, and implementing different classification algorithms such as Logistic Regression, Decision Tree, SVM and Random Forest. Because asymmetric weights for false positives and false negatives were unavailable, various other evaluation metrics such as F Score and Log Loss are compared along with out-of-sample AUC. The Random Forest model outperforms the other modelling strategies considerably. Although it is a black-box classifier, the Random Forest model works well for the purpose of a back-end prediction engine that insulates decision making from the users.
Akul Mahajan, TMDB - "May the Force be with you", July 2018, (Yan Yu, Yichen Qin)
Today, we live in an era where almost every important business decision is guided by the application of statistics. One of the most popular areas in this regard is the use of statistical models in machine learning and predictive modelling to obtain outcomes, align them with the goals of the industry, and formulate and improve strategy to meet these goals. TMDB is one of the most popular datasets on Kaggle and houses the data of 5000 movies from different genres, geographies and languages. Predictive modelling can be applied to gain insights about the expected performance of movies before they are released and to formulate proper marketing strategies and campaigns in order to further improve their performance. This paper employs predictive algorithms such as linear regression, CART, Random Forests, Generalized Additive Models and Gradient Boosting, tunes these models to achieve optimum performance, and evaluates their potential using proper evaluation metrics.
Kevin McManus, Analysis of High School Alumni Giving, July 2018, (Yan Yu, Bradley Boehmke)
Archbishop Moeller High School has an ambitious plan to increase its participation rate (giving + activities), up from 4% a few years ago to 13% last year with a goal of 15%. Donations to the 2017 Unrestricted Fund were made by 9% of the 11,524 alumni base and reflected an increase of 258% vs 2013. The analyses focused on a regression predictive model for donation amount and a classification model to predict which alumni will donate. Both suggest that prior alumni giving, and connections to the school via other affiliations were strong predictors, among several others. The school should focus on creating opportunities for involvement by alumni as well as maintain strong connections to its base who give consistently. Overall, higher wealth levels were not a significant predictor for giving to the Unrestricted Fund. The analyses also performed unsupervised clustering which suggested there were distinct groups of those strongly connected with the school through other affiliations and those who were not. The former group tended to live within 100 miles of Cincinnati and give at a higher rate than the other groups. Even the clustering of giving alumni showed a small consistent group of givers and a second group of occasional donors. The former group also had a higher rate of other connections to the school compared to those who gave only occasionally.
Ritisha Andhrutkar, Sentiment Analysis of Amazon Unlocked Phone Reviews, July 2018, (Yichen Qin, Peng Wang)
Online customer reviews hold a powerful effect on the behavior of consumers and, therefore, the performance of a brand in the Age of Internet today. According to a survey, 88% of consumers trust online reviews as much as personal recommendations for purchasing any item on an e-commerce website. Positive reviews boost the confidence of an organization while negative reviews suggest areas of improvement. It is also certain that having more reviews for a product will result in a high conversion rate for that product. This report is aimed at analyzing and understanding the trend of human behavior towards unlocked mobile phones sold on Amazon. The dataset utilized has been scraped from the e-commerce website and consists of several listings of phones along with their features such as Brand Name, Price, Rating, Reviews and Review Votes. Text Mining techniques have been leveraged on the dataset to identify the sentiment of each customer review which would help Amazon and, in turn, the manufacturer to improve their current products and sustain their brand name.
Swapnil Patil, Applications of Unsupervised Machine Learning Algorithms on Retail Data, July 2018, (Peng Wang, Yichen Qin)
Data Science and Analytics is widely used in the retail industry. With the advent of big data tools and higher computing power, sophisticated algorithms can crunch huge volumes of transactional data to extract meaningful insights. Companies such as Kroger invest heavily to transform the more than hundred-year-old retail industry through analytics. This project is an attempt to apply unsupervised learning algorithms on the transactional data to formulate strategies to improve the sales of the products. This project deals with online retail store data taken from the UCI Machine Learning Repository. The data pertains to a UK-based registered online retail store’s transactions between 01/12/2010 and 09/12/2011. The retail store mostly sells different gift items to wholesalers around the globe. The objective of the project is to apply statistical techniques such as clustering, association rules and collaborative filtering to come up with different business strategies that may lead to an increase in the sales of the products.
Tathagat Kumar, Market Basket Analysis and Association Rules for Instacart Orders, June 2018, (Yichen Qin, Yan Yu)
For any retailer it is extremely important to identify customer habits, understand why customers make certain purchases, and gain insight into merchandise, movement of goods, peak sales times and sets of products that are purchased together. This helps in structuring the store layout and designing promotions and coupons, and combining these with a customer loyalty card makes the above strategies even more useful. The first public anonymized dataset from Instacart is selected for this paper, and the goal is to analyze this data set to find fast-moving items, frequent basket sizes, peak order times, frequently reordered items and high-moving products in aisles. This paper also demonstrates the habit patterns of loyal customers and predicts their future purchases with reasonable accuracy. Market basket analysis with association rules is used to discover the strongest rules of product association based on different association measures, e.g. support, confidence and lift. Analysis has been conducted to uncover the strong rules for both highly frequent and less frequent items. Also, for top-selling products, left-hand and right-hand association rules are used to show which products tend to be purchased before and after them.
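The three association measures named above can be sketched in a few lines. This is a minimal illustration using the standard definitions of support, confidence and lift, not the paper's implementation; the function name and the toy baskets are hypothetical and are not Instacart data.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent.

    baskets: list of sets, one set of item names per order.
    """
    n = len(baskets)
    a, c = set(antecedent), set(consequent)
    count_a = sum(1 for b in baskets if a <= b)           # orders containing the antecedent
    count_c = sum(1 for b in baskets if c <= b)           # orders containing the consequent
    count_both = sum(1 for b in baskets if (a | c) <= b)  # orders containing both
    support = count_both / n
    confidence = count_both / count_a if count_a else 0.0
    lift = confidence / (count_c / n) if count_c else 0.0
    return support, confidence, lift

# Hypothetical toy orders.
baskets = [
    {"banana", "milk"},
    {"banana", "milk", "bread"},
    {"banana"},
    {"milk"},
]
support, confidence, lift = rule_metrics(baskets, {"banana"}, {"milk"})
```

A lift above 1 suggests the two item sets appear together more often than independence would predict; in this toy example the lift is below 1, so the rule would not count as a strong one.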
Sayali Dasharath Wavhal, Employee Attrition Prediction on Class-Imbalanced Data using Cost-Sensitive Classification, April 2018, (Yichen Qin, Dungang Liu)
Human resources are the most valuable asset of an organization, and every organization aims at retaining its valuable workforce. The main goal of every HR Analytics department is to identify the employees that are likely to leave the organization in the future and take actions to retain them before they leave. This paper aims at identifying the factors resulting in employee attrition and building a classifier to predict employee attrition. The analysis addresses the class-imbalance classification problem by exploring the performance of various Machine Learning models such as Logistic Regression, Classification Tree using Recursive Partitioning, Generalized Additive Modeling and Gradient Boosting Machine. Because this is a highly imbalanced class problem, with only 15% Positives, “Accuracy” is not a suitable indicator of model performance. Thus, to avoid biasing the classifier towards the majority class, Cost-Sensitive classification was adopted to tackle misclassification of the minority class, where False Negatives carry a higher penalty than False Positives. The model performance was evaluated based on Sensitivity (Recall), Specificity, Precision, Misclassification Cost and Area under the ROC Curve. The analysis in this paper suggests that although the recursive partitioning and ensemble techniques of decision trees have good predictive power for the minority class, more stable prediction performance is observed with the Logistic Regression Model and Generalized Additive Model.
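The cost-sensitive idea described above can be sketched as an asymmetric misclassification cost used to pick a classification cutoff. This is a minimal sketch only: the 5:1 penalty ratio, the function names and the toy data are assumptions for illustration, since the abstract states only that False Negatives are penalized more than False Positives.

```python
def misclassification_cost(y_true, y_prob, threshold, fn_cost=5.0, fp_cost=1.0):
    """Average asymmetric cost of predictions made at the given cutoff.

    fn_cost and fp_cost are assumed illustrative weights, not the paper's values.
    """
    total = 0.0
    for y, p in zip(y_true, y_prob):
        pred = 1 if p >= threshold else 0
        if y == 1 and pred == 0:
            total += fn_cost   # missed attrition case (False Negative)
        elif y == 0 and pred == 1:
            total += fp_cost   # false alarm (False Positive)
    return total / len(y_true)

def best_threshold(y_true, y_prob, candidates, **costs):
    """Pick the candidate cutoff with the lowest average cost."""
    return min(candidates,
               key=lambda t: misclassification_cost(y_true, y_prob, t, **costs))

# Hypothetical toy labels (1 = employee left) and predicted probabilities.
y_true = [1, 0, 0, 1, 0, 0]
y_prob = [0.40, 0.30, 0.10, 0.80, 0.45, 0.20]
chosen = best_threshold(y_true, y_prob, [0.25, 0.35, 0.50])
```

Because a missed positive costs more than a false alarm, the cost-minimizing cutoff in this toy example falls below the default 0.5, which is the usual effect of this kind of weighting on an imbalanced problem.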
Yong Han, Whose Votes Changed the Presidential Elections?, April 2018, (Dungang Liu, Liwei Chen)
The unique aspect of the YouGov / CCAP data was that it contained the information of 2008 to 2016 elections from the same group of 8000 voters. This might provide information on voting patterns between elections.
The goals of this study were to find: Was any predictor significant to the 2012 and 2016 presidential vote? Was it consistent between elections? Was any predictor significant to the change-vote between two elections? Was it consistent? Based on exploratory data analysis, 70% of voters never changed their votes, and 20% of voters changed at least once in the last three elections. Was any predictor significantly associated with this behavior?
Using the VGLM method, this study found that in single elections, some common predictors were significant, such as Gender, Child, Education, Age, Race and Marital status. Meanwhile, different elections had different significant predictors. For vote-change between two elections, the significant predictors differed between election pairs. Between the 2012 and 2016 elections, the model suggested that Education, Income and Race were significant to vote-change, while between 2008 and 2012, the model suggested that Child and Employment status were significant to vote-change. With 2016 election data, the never-change-vote model found that Income, Age, Ideology, News and Married status were significant to this never-change-vote behavior. Individual election models could predict ~60% of votes in testing samples. Utilizing a previous vote as a predictor, models could predict ~89% of votes in testing samples. The never-change-vote model predicted well on the 70% never-change-vote voters, but missed almost all of the 20% change-vote voters.
Yanhui Chen, Binning on Continuous Variables and Comparison of Different Credit Scoring Techniques, April 2018, (Peng Wang, Yichen Qin)
Binning is a widely used method to group a continuous variable into a categorical variable. In this project, I binned the continuous variables amount, duration and age in the German credit data, and compared the logistic model using binned variables with four models that do not use binned variables: a logistic model, a logistic additive model, random forest, and gradient boosting. I found that the logistic model with binning performed the weakest among the five fitted models. I also showed that variable importance varied across models, and that the variable checkingstatus was selected as one of the important variables in most of the models. The binned variables duration and amount were determined to be important variables in the logistic model with binning. Random forest was the only model that selected the variable history as an important variable.
Jamie H. Wilson, Fine Tuning Neural Networks in R, April 2018, (Yan Yu, Edward Winkofsky)
As artificial neural networks grow in popularity, it is important to understand how they work and the layers of options that go into building a neural network. The fundamental components of a neural network are the activation function, the error measurement and the method of backpropagation. These methods make neural networks good at finding complex nonlinear relationships among predictor and response variables as well as interactions between predictor variables. However, neural networks are difficult to explain, can be computationally expensive and tend to overfit the data. There are two primary R packages for neural networks: nnet and neuralnet. The nnet package has fewer tuning options but can handle unstandardized and standardized data. The neuralnet package has a myriad of options, but only handles standardized data. When building a predictive model using the Boston Housing Data, both packages are capable of producing effective models. Tuning the models is important to get valid and robust results. Given the number of tuning parameters in neuralnet, these models perform better than the models built in nnet.
Kenton Asbrock, The Price to Process: A Study of Recent Trends in Consumer-Based Processing Power and Pricing, April 2018, (Uday Rao, Jordan Crabbe)
This analysis investigates the effects of the deceleration of the observational Moore’s Law on consumer based central processing units. Moore’s Law states that the number of transistors in a densely integrated circuit approximately doubles every two years. The study involved a data-set containing information about 2241 processors released by Intel between 2001 and 2017, which is the approximate time frame associated with the decline of Moore’s Law. Data wrangling and pre-processing was performed in R to clean the data and convert it to a state that was ready for analysis. Data was then aggregated by platform to study the evolution of processing across desktops, servers, embedded devices, and mobile devices. Formal time series procedures were then applied to the entire data set to study how processing speed and price has changed recently and how future forecasts are expected to behave. It was determined that while processing speeds are in a period of stagnation, the price paid for computational power has been decreasing and is expected to decrease in the future. While the negative effects of the decline of Moore’s Law may have an impact on a small fraction of the market through speed stagnation, the overall price decrease of processing performance will benefit the average consumer.
Hongyan Ma, A Return Analysis for S&P 500, April 2018, (Yan Yu, Liwei Chen)
Time series analysis is commonly used to analyze and forecast economic data. It helps to identify patterns, to understand and model the data as well as to predict short-term trends. The primary purpose of this paper is to study Moving Window analysis and GARCH models built by analyzing the monthly return of the S&P 500 over the most recent 50 years, from January 1968 to December 2017.
In this paper, we first studied the raw data to check its patterns and distributions, and then analyzed the monthly returns in different time windows (10-year, 20-year, 30-year and 40-year) by Moving Window analysis. We found that over the long horizon, the S&P 500 produced significant returns for investors who stayed invested for a long time. However, for a given 10-year period, the return can even be negative. Finally, we fitted several forms of GARCH models with normal distributions as well as with Student-t distributions and found the GARCH(1,1) Student-t model to be the best model in terms of the Akaike Information Criterion and log-likelihood values.
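For reference, a GARCH(1,1) model with Student-t innovations is commonly written as below. The symbols (μ, ω, α, β, ν, k) follow standard textbook notation and are assumptions here, as is the constant-mean specification; the abstract does not give the paper's exact parameterization.

```latex
\begin{aligned}
r_t &= \mu + \varepsilon_t, \qquad \varepsilon_t = \sigma_t z_t, \qquad z_t \sim t_\nu \ \text{(standardized to unit variance)},\\
\sigma_t^2 &= \omega + \alpha\,\varepsilon_{t-1}^2 + \beta\,\sigma_{t-1}^2, \qquad \omega > 0,\ \alpha \ge 0,\ \beta \ge 0,\\
\mathrm{AIC} &= 2k - 2\ln\hat{L}.
\end{aligned}
```

Here r_t is the monthly return, σ_t² the conditional variance, k the number of estimated parameters and L̂ the maximized likelihood; among candidate models, a lower AIC is preferred.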
Justin Jodrey, Predictive Models for United States County Poverty Rates and Presidential Candidate Winners, April 2018, (Yan Yu, Bradley Boehmke)
The U.S. Census Bureau administers the American Community Survey (ACS), an annual survey that collects data on various demographic factors. Using a Kaggle dataset that aggregates data at the United States county level and joining other ACS tables to it from the U.S. FactFinder website, this paper analyzes two types of predictive models: regression models to predict a county’s poverty rate and classification models to predict a county’s 2016 general election presidential candidate winner. In both the regression and classification settings, a generalized additive model best predicted county poverty rates and county presidential winners.
Trent Thompson, Cincinnati Reds – Concessions and Merchandise Analysis, April 2018, (Yan Yu, Chris Calo)
Concession and Merchandise sales account for a substantial percentage of revenue for the Cincinnati Reds. Thoroughly analyzing the data captured from Concession and Merchandise sales can help the Reds with pricing, inventory management, planning and product bundling. The scope of this Concession and Merchandise analysis includes general exploratory data analysis, identifying key trends in sales, and analyzing common order patterns. One major result of this analysis was the calculation of 95% confidence intervals for Concession and Merchandise sales, which can improve the efficiency of inventory management. Another finding is that, in general, fans buy their main food items (hot dog, burger, pizza) before the game and then beverages, desserts and snacks during the game. Finally, strong order associations exist between koozies and light beer, and between bratwursts and beverages and peanuts. I recommend displaying the koozies over the refrigerator with light beers and bundling bratwursts in a similar manner to the current hot dog bundle, with hopes of driving a lift in sales.
Xi Chen, Decomposing Residential Monthly Electric Utility into Cooling Energy Use by Different Machine Learning Techniques, April 2018, (Peng Wang, Yan Yu)
Today the residential sector consumes about 38% of energy produced, of which nearly half is consumed by HVAC systems. One of the main energy-related problems is that most households do not operate in an energy-efficient manner, such as by utilizing natural ventilation or adjusting the thermostat according to weather conditions, which leads to higher usage than necessary. It has been reported that energy-saving behaviors may lead to a 25% energy-use reduction, with the same building settings, just by giving consumers a more detailed electricity bill. Therefore, the scope of this project is to construct a monthly HVAC energy use predictive model for homes with simple and accessible predictors. The dataset used in this project includes weather, metadata and electricity-usage-hours data downloaded from the Pecan Street Dataport. The final dataset contains 3698 observations and 11 variables. Multiple linear regression, regression tree, random forest, and gradient boosting are the four machine learning techniques applied to predict the monthly HVAC cooling use. Root Mean Squared Error (RMSE) and adjusted R2 are the two criteria adopted to evaluate model fit. All models are highly predictive, with R2 ranging from 0.823 to 0.885. The gradient boosting model has the best overall prediction quality, with an out-of-sample RMSE of 0.57.
Fan Yang, Breast Cancer Diagnose Analysis, April 2018, (Yichen Qin, Dungang Liu)
The dataset studied in this paper describes breast cancer tissue, which is classified as either benign or malignant. Our target is to recognize malignant tissue from its measured dimensions (mean, standard error, and worst value). This paper includes a feature selection section based on correlation analysis and data visualization. After eliminating correlated features and features that do not visually separate the classes, logistic regression, random forest, and XGBoost are fitted on training and validation data. 10-fold cross-validation is also used to estimate the performance of all the models; prediction accuracy is then compared across models, and area under the ROC curve is used to evaluate model performance on the validation data.
Sinduja Parthasarathy, Income Level Prediction using Machine Learning Techniques, April 2018, (Yichen Qin, Dungang Liu)
Income is an essential component in determining the economic status and standard of living of an individual. An individual’s income largely influences his nation’s GDP and financial growth. Knowing one’s income can also assist an individual in financial budgeting and tax return calculations. Hence, given the importance of knowing an individual’s income, the US Census data from the UCI Machine Learning Repository was explored in detail to identify the factors that contribute to a person’s income level. Furthermore, machine learning techniques such as Logistic regression, Classification tree, Random forests, and Support Vector Machine were used to predict the income level and subsequently identify the model that most accurately predicted the income level of an individual.
Relationship status, Capital gain and loss, Hours worked per week and Race of an individual were found to be the most important factors in predicting the income level of an individual. Of the different classification techniques that were built and tested for performance, the logistic regression model was found to be the best performing, with the highest accuracy of 84.63% in predicting the income level of an individual.
Jessica Blanchard, Predictive Analysis of Residential Building Heating and Cooling Loads for Energy Efficiency, March 2018, (Peng Wang, Dungang Liu)
This study’s focus is to predict the required heating load and cooling load of a residential building through multiple regression techniques. Prediction accuracy is tested with in-sample, out-of-sample, and cross-validation procedures. A dataset of 768 observations, eight potential predictor variables, and two dependent variables (heating and cooling load) is explored to help architects and contractors predict the necessary air supply demand and thus design more energy-efficient homes. Exploratory Data Analysis uncovered not only relationships between the explanatory and dependent variables, but also relationships among the explanatory variables. To create a model with accurate predictions, the following regression techniques were examined and compared to one another: Multiple Linear Regression, Stepwise, LASSO, Ridge, Elastic-Net, and Gradient Boosting. While each method has its advantages and disadvantages, the models created using LASSO Regression to predict heating and cooling load balance simplicity and accuracy relatively well. However, when compared against the results from Gradient Boosting, the LASSO models produced greater root mean squared error. Overall, the regression trees created with Gradient Boosting yielded the best predictive results, with parameter tuning to regulate “overfitting.” These models meet the purpose of this study: to provide residential architects and contractors with a straightforward model that is more accurate than the current “Rules of Thumb” practice.
Zachary P. Malosh, The Impact of Scheduling on NBA Team Performance, November 2017, (Michael Magazine, Tom Zentmeyer)
Every year, the NBA releases their league schedule for the coming year. The construction of the schedule contains many potential schedule-based factors (such as rest, travel, and home court) that can impact each game. Understanding the impact of these factors is possible by creating a regression model that quantifies the team performance in a particular game in terms of final score and fouls committed. Ultimately, rest, distance, attendance, and time in the season had direct impact on the final score of the game while the attendance at a game led to an advantage in fouls called against the home team. The quantification of the impact of these factors can be used to anticipate variations in performance to improve accuracy in a Monte Carlo simulation.
Oscar Rocabado, Multiclass Classification of the Otto Group Products, November 2017, (Yichen Qin, Amitabh Raturi)
Otto Group is a multinational group with thousands of products that need to be classified consistently into nine groups. The better the classification, the more insights they can generate about their product range. However, the data is highly unbalanced among classes, so we investigate whether the balancing technique Synthetic Minority Oversampling Technique (SMOTE) has a notable effect on the accuracy and Area under the Curve of the classifiers. Because the data set is obfuscated, which makes interpretation impossible, we use black-box methods such as Linear and Gaussian Support Vector Machines and Multilayer Perceptron, as well as ensembles that combine classifiers, such as Random Forests and Majority Voting.
Shixie Li, Credit Card Fraud Detection Analysis: Over sampling and under sampling of imbalanced data, November 2017, (Yichen Qin, Dungang Liu)
Imbalanced credit fraud data is analyzed by over sampling and under sampling methods. A model is built with logistic regression, and area under PRROC (Precision-Recall curve) is used to show the model performance of each method. The disadvantage of using area under ROC is that, due to the imbalance of the data, the specificity will always be close to 1. Therefore the area under the ROC curve does not work well on imbalanced data. This disadvantage is shown by comparison in this paper. Instead, a precision-recall plot is used to find a reasonable region for the cutoff point based on the results from the selected model. The cutoff value should be chosen within or around that region, and the choice depends on whether precision or recall is more important to the bank.
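The point about specificity staying close to 1 on imbalanced data can be illustrated with a minimal Python sketch. The toy labels, scores, and the 0.5 cutoff below are invented for illustration and are not taken from the paper.

```python
def rates_at_cutoff(labels, scores, cutoff):
    """Return specificity, recall and precision for a given score cutoff."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        predicted = 1 if s >= cutoff else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return {
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
    }

# Hypothetical toy data: 990 legitimate transactions (0) and 10 frauds (1).
labels = [0] * 990 + [1] * 10
# 20 legitimate transactions get a high score (false alarms);
# 8 of the 10 frauds get a high score, 2 are missed.
scores = [0.1] * 970 + [0.9] * 20 + [0.9] * 8 + [0.2] * 2

result = rates_at_cutoff(labels, scores, cutoff=0.5)
print(result)
```

In this toy case specificity is about 0.98 even though fewer than a third of the flagged transactions are actually fraud, which is why precision and recall are more informative for choosing the cutoff.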
Cassie Kramer, Leveraging Student Information System Data for Analytics, November 2017, (Michael Fry, Nicholas Frame)
In 2015, the University of Cincinnati began to transition its Student Information System from a homegrown system to a system created by Oracle PeopleSoft called Campus Solutions and branded by UC as Catalyst. In order to perform reporting and analytics on this data, the data must be extracted from the source system, modeled and loaded into a data warehouse. The data can then be exported to perform analytics. This project covers the process of extracting, modeling, loading and analyzing the data. The goal is to predict students’ GPA and retention for a particular college.
Parwinderjit Singh, Alternative Methodologies for Forecasting Commercial Automobile Liability Frequency, October 2017, (Yan Yu, Caolan Kovach-Orr)
Insurance Services Office, Inc. publishes a quarterly forecast of commercial automobile liability frequency (the number of commercial automobile insurance claims reported/paid) to help insurers make better pricing and reserving decisions. This paper proposes forecasting models based on time-series techniques as an alternative to the existing traditional methods, with the aim of improving current forecasting capabilities. ARIMAX forecasting models have been developed with economic indicators as external regressors. These models resulted in a MAPE (Mean Absolute Percentage Error) ranging from 0.5% to ~9%, which is a significant improvement over currently used techniques.
Anjali Chappidi, Un-Crewed Aircraft Analysis & Maintenance Report Analysis, August 2017, (Michael Fry, Jayvanth Ishwaran)
This internship comprised two projects: analysis of crew data using SAS, and analysis of aircraft maintenance reports using text mining in R. The first project identifies and analyzes how different factors affected the crew ratio on different fleets. The goal of the second project is to study the maintenance logs, which consist of the work order descriptions and work order actions for aircraft that were reported to undergo maintenance.
Vijay Katta, A Study of Convolutional Neural Networks, August 2017, (Yan Yu, Edward Winkofsky)
The advent of Convolutional Neural Networks has drastically improved the accuracy of image processing. Convolutional Neural Networks, or CNNs for short, are presently the crux of deep learning applications in computer vision. The purpose of this capstone is to investigate the basic concepts of Convolutional Neural Networks in a stepwise manner and to build a simple CNN model to classify images. The study involves understanding the concepts behind the different layers in a CNN, studying the different CNN architectures, understanding the training algorithms of CNNs, studying the applications of CNNs, and applying a CNN to image classification. A simple image classification model was designed on an ImageNet dataset which contains 70,000 images of digits. The accuracy of the best model was found to be 98.74%. From the study, it is concluded that a highly accurate image processing model is achievable in a few minutes, given that the dataset has fewer than 0.1 million observations.
Yan Jiang, Selection of Genetic Markers to Predict Survival Time of Glioblastoma Patients, August 2017, (Peng Wang, Liwei Chen)
Glioblastoma multiforme (GBM) is the most aggressive primary brain tumor, with a survival time of less than 3 months in >50% of patients. Gene analysis is considered a feasible approach for the prediction of a patient’s survival time. Advanced gene sequencing techniques normally produce a large amount of genetic data which contain important information for the prognosis of GBM. An efficient method is urgently needed to extract key information from these data for clinical decision making. The purpose of this study is to develop a new statistical approach to select genetic markers for the prediction of GBM patients’ survival time. The new method, named the Cluster-LASSO linear regression model, has been developed by combining nonparametric clustering and LASSO linear regression methods. Compared to the original LASSO model, the new Cluster-LASSO model simplifies the model by 67.8%. The Cluster-LASSO model selected 19 predictor variables after clustering instead of the 59 predictor variables in the LASSO model. The predictor genes selected for the Cluster-LASSO model are ZNF208, GPRASP1, CHI3L1, RPL36A, GAP43, CLCN2, SERPINA3, SNX10, REEP2, GUCA1B, PPCS, HCRTR2, BCL2A1, MAGEC1, SIRT3, GPC1, RNASE2, LSR and ZNF135. In addition, the Cluster-LASSO model surpasses the out-of-sample performance of the LASSO model by 1.89%. Among the 19 genes selected in the Cluster-LASSO model, the positively associated HCRTR2 gene and the negatively associated GAP43 gene are especially interesting and worthy of further study. A further study to confirm their relationship to the survival time of GBM patients and the possible mechanism would contribute tremendously to the understanding of GBM.
Jing Gao, Patient Satisfaction Rating Prediction Based on Multiple Models, August 2017, (Peng Wang, Liwei Chen)
With the development of the economy and technology, online health consultation provides a convenient platform that enables patients to seek advice and treatment quickly and efficiently, especially in China. Due to the high population density, physicians may need to see hundreds of patients every day at the hospital, which is time-consuming for patients. It is therefore no wonder that online health consultation has grown so rapidly in recent years. Since healthcare services are always related to issues of mortality and quality of life for patients, online healthcare services and patient satisfaction are important to keep this industry running safely and efficiently. In this project, we focus on patient satisfaction. We integrate three levels of data (physician level, hospital level and patient level) into one dataset. We then build multiple predictive models to determine which independent variables have significant effects on the patient satisfaction rate and to compare the precision of the models. This paper verifies that physicians’ degree of participation in the online healthcare consultation system, as well as the hospital’s support, significantly affects patient satisfaction, especially interactive activity such as total web visits, thank-you letters, etc.
Jasmine Ding, Comparison Study of Common Methods in Credit Data Analysis, August 2017, (Peng Wang, Dungang Liu)
Default risk is an integral part of risk management at financial institutions. Banks allocate a significant amount of resources to developing and maintaining credit risk models. Binning is a method commonly used in banking to analyze consumer data to determine whether a borrower would qualify for a bankcard or a loan. The practice requires that numeric variables be categorized into discrete bins for further analysis based on certain cutoff values. The approach for grouping observations could vary from equal bin size to equal range depending on the situation. Binning is popular because of its ability to identify outliers and handle missing values. This project explores the basic methods that are commonly used for credit risk modelling, including simple logistic regression, logistic regression with binned variable transformation, and generalized additive models. After developing each model, a misclassification rate is calculated to compare model performance. In this study, the credit model based on binned variables did not produce the best results; both the generalized additive model and the random forest performed better. In addition, the project proposes other methods that can be used to improve credit model performance when working with similar datasets.
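The two grouping approaches mentioned above (equal range versus equal bin size) can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and the toy income values are assumptions, not the project's actual code or data.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of equal range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0] * len(values)
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def equal_frequency_bins(values, n_bins):
    """Assign values to n_bins groups of (nearly) equal size by rank.

    Tied values may be split across neighbouring bins in this simple version.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

# Hypothetical incomes (in thousands) with one outlier.
incomes = [10, 12, 15, 18, 20, 25, 30, 100]
width_bins = equal_width_bins(incomes, 4)
freq_bins = equal_frequency_bins(incomes, 4)
print(width_bins)
print(freq_bins)
```

With equal-width bins the outlier sits alone in the top bin and the remaining values share the bottom bin, while equal-frequency bins spread the observations evenly; this is one way binning makes outliers visible.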
Sneha Peddireddy, Opportunity Sizing of Final Value Fee Credits, August 2017, (Michael Fry, Varun Vashishtha)
The e-commerce company allows customers to “commit to purchase” an item, and it charges the seller a fee (a commission for the sale) when this happens. If the actual purchase does not happen for any reason, the seller has to be refunded the fee amount as a credit. There are multiple reasons why a transaction might not be completed after “commit to purchase”. There are also cases where a transaction is taken off the website by mutual agreement between buyer and seller. This results in a loss of revenue for the company. The current project involves identifying the key reasons for incomplete transactions and sizing the opportunity to minimize credit payments for off-platform transactions.
Krishna Teja Jagarlapudi, Solar Cell Power Prediction, August 2017, (Michael Fry, Augusto Sellhorn)
The rated power output from a solar cell is estimated through experimental measurements and theoretical calculations. However, it is difficult to obtain reliable prediction of the power output for varying weather conditions. With the advent of Internet of Things, it is possible to record exact power output from a solar cell over time. This data along with weather information can be used to build predictive models. In this project, a neural network model and a random forest model are built. The performance of the two models is compared using 10-fold cross validation, based on mean absolute error, and adjusted r-squared. It is seen that Random Forest performs better than neural network.
Mansi Verma, 84.51° Capstone Project, August 2017, (Michael Fry, Mayuresh Gaikwad)
84.51° is an analytics wing of Kroger which aims to make people’s lives easier by achieving real customer understanding. It brings together customer data, analytics, business and marketing strategies for more than 15 million loyal Kroger customers. It also collaborates with 300 CPG (consumer packaged goods) clients by driving awareness, trial, sales uplift, earned media impressions and ultimately customer loyalty. Using the latest tools, technology and statistical techniques, 84.51° works towards producing insight on customer behavior from their spend data at the stores to support business decisions. The company’s goals center on customers rather than on profits alone. Targeting the right customers is not an easy job. The objective of customer targeting is to reach the right customer base and to know when to target them and with what. The right kind of targeting not only drives sales but also saves business resources and maximizes profit. Kroger provides coupons through many channels: at the till at the time of billing, by email, on the website, in the mobile app, and in direct mail that it sends to its best customers. This project discusses a model for best customer targeting for a direct mail campaign for a beauty CPG client’s newly launched product.
Shengfang Sun, Human Activity Classification Using Machine Learning Techniques, August 2017, (Yichen Qin, Liwei Chen)
In this work, machine-learning algorithms are developed to classify human activities from wearable sensor data. The sensor data was collected from 10 subjects with diverse profiles while they performed a predefined set of physical activities. Three activity classifiers using the sensor metrics were trained and tested: random forest, Naïve Bayes and neural network. The performance of these classifiers was scored by leave-one-subject-out cross validation. The results show that the neural network performs best, with an accuracy rate of 85%. A closer look at the aggregated confusion matrix suggests that most activities of a new subject can be predicted well by the pre-trained neural network classifier, although some activities appear to be very subject-sensitive and may require subject-specific training.
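Leave-one-subject-out cross validation holds out all records of one subject per fold, so each fold tests on a person the classifier has never seen. A minimal Python sketch of the split logic follows; the function name and the toy subject labels are hypothetical and not taken from the project.

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) for each subject."""
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test

# Hypothetical example: six sensor records from three subjects.
subject_ids = ["s1", "s1", "s2", "s3", "s3", "s3"]
splits = list(leave_one_subject_out(subject_ids))
for held_out, train, test in splits:
    print(held_out, train, test)
```

Scoring each fold on the held-out subject and aggregating the confusion matrices gives an estimate of performance on new subjects, which is the setting the abstract describes.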
Sakshi Lohana, Market Basket Analysis of Instacart Buyers, August 2017, (Peng Wang, Uday Rao)
Market Basket Analysis is a modelling technique used to determine the unique buying behavior of customers. It can be used to formulate strategies to increase sales by suggesting to customers what to buy next and providing promotions on relevant products of their choice. Through this project, Market Basket Analysis and Association Rules are explored using a dataset available on Kaggle.com. This data set contains transactions by various users on an ecommerce website known as Instacart. After careful analysis, it is found that items of daily use such as fruits, milk, and sparkling water are ordered the most. Also, the proportion of reordered products is as high as 46%, and hence customers can be encouraged to buy the same product again if they are satisfied with the buying experience the first time. There is a high level of association between different yoghurts, pet foods and organic items. A person buying organic cilantro is most likely to buy organic limes.
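Association rules such as "organic cilantro implies organic limes" are usually summarized by support, confidence and lift. The Python sketch below computes these standard measures on a handful of invented baskets; the baskets and function names are assumptions for illustration, not the Instacart data or the project's code.

```python
def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(1 for basket in baskets if items <= basket) / len(baskets)

def rule_metrics(baskets, antecedent, consequent):
    """Support, confidence and lift for the rule antecedent -> consequent."""
    both = support(baskets, set(antecedent) | set(consequent))
    ante = support(baskets, antecedent)
    cons = support(baskets, consequent)
    confidence = both / ante if ante else 0.0
    lift = confidence / cons if cons else 0.0
    return {"support": both, "confidence": confidence, "lift": lift}

# Hypothetical toy baskets.
baskets = [
    {"organic cilantro", "organic limes"},
    {"organic cilantro", "organic limes", "milk"},
    {"milk", "bananas"},
    {"organic limes", "bananas"},
    {"milk"},
]
metrics = rule_metrics(baskets, {"organic cilantro"}, {"organic limes"})
print(metrics)
```

A lift above 1 indicates that the two items appear together more often than would be expected if they were bought independently.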
Sahil Thapar, Predicting House Sale Price, August 2017, (Dungang Liu, Liwei Chen)
Over the recent years we have seen that house prices can be an important indicator of the state of the country's economy. In this project, we will employ machine learning techniques to predict the final sales price for a house based on a range of features of the house. We know that houses can be the single biggest investment an individual makes in his lifetime. A sound statistical model can help the customer get a fair valuation of the house - both at the time of purchase and sale. The final house prices are a continuous variable and are predicted using linear regression. As a part of this project, regularization was performed to achieve simpler predictive models.
Pradeep Mathiyazagan, Website Duration Model, August 2017, (Yichen Qin, Yan Yu)
This capstone project is a natural extension of the Graduate Case Study that I worked on in the Spring Semester of 2017 as part of the Business Analytics program at the University of Cincinnati. It explores a bag-of-words model using user browsing data from the website of a local TV news station in Las Vegas owned by EW Scripps. The original Graduate Case Study did not afford us the time to explore a bag-of-words model, as it involved a fairly large amount of web scraping. Another worthwhile piece of information I hope to include in this model is the number of media elements present on a webpage in the form of tweets, pictures and videos, in order to analyze their impact on user engagement. Through this, we hope to identify pertinent information that results in better user engagement, which would ultimately result in increased advertising revenue.
Rajul Chhajer, Forecasting Stock Reorder Point for Smart Bins, July 2017, (Michael Fry, John E. Laws)
Forecasting the reorder point plays an important role in managing inventories efficiently. The reorder point is essentially the right time to order stock, considering the lead time to get the stock from the supplier and the safety stock available. It is difficult to determine the replenishment point if the sales information and lead time are unknown. In this study, historical reorder trends were observed at the product level for forecasting. Apex Actylus™ smart bins can reorder stock automatically based on the inventory level, and they store the information on all past orders. The past reorders helped in understanding the velocity of a product in a bin, and a moving average technique was then used at the product level to predict the next replenishment. The reorder point prediction would reduce the frequency of ordering and would help floor managers make better reorder plans.
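One simple way to apply a moving average to past reorders is to average the most recent gaps between reorder dates and add that gap to the last reorder date. The Python sketch below follows that interpretation; the window size, function name, and dates are assumptions for illustration and may differ from the method actually used in the study.

```python
from datetime import date, timedelta

def predict_next_reorder(reorder_dates, window=3):
    """Predict the next reorder date from a moving average of recent gaps."""
    if len(reorder_dates) < 2:
        raise ValueError("need at least two past reorders")
    dates = sorted(reorder_dates)
    # Days between consecutive reorders for this product.
    gaps = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]
    recent = gaps[-window:]
    average_gap = sum(recent) / len(recent)
    return dates[-1] + timedelta(days=round(average_gap))

# Hypothetical reorder history for one product in one bin.
history = [
    date(2017, 1, 2),
    date(2017, 1, 12),
    date(2017, 1, 20),
    date(2017, 2, 1),
    date(2017, 2, 10),
]
next_reorder = predict_next_reorder(history, window=3)
print(next_reorder)
```

A shorter window reacts faster to changes in product velocity, while a longer window smooths out irregular orders.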
Wei Yue, Analysis of Students’ Dining Survey, July 2017, (Peng Wang, Yinghao Zhang)
The goal of this project is to explore the factors that influence the customer experience the most under the designed circumstance. To achieve this objective, regression models were built to represent the relationship between customer experience and their basic information. The results of model building showed that the customer experience is not directly related to all the information provided by the survey. The survey results were supplied by randomly selected students from a University in Guangdong Province, China. The purpose of the survey is to help the restaurant management to better understand which dishes are more popular among students, and more importantly, if there are connections between dish ordering patterns and different students.
First, the students’ basic information was collected and categorized, such as, gender, major, frequency of dining out, etc. Then participants were asked to pick 5 dishes from eight categories of dishes on the menu with two in each category (16 in total) as they were dining in and then one out of the five dishes was randomly selected to be out of stock. Under this circumstance, participants need to pick one other dish to replace it. Then the customer experience was surveyed for analysis. The total number of participants is 98.
Akash Gupta, Customer Segmentation and Post Campaign Analysis, July 2017, (Michael Fry, Naga Ramachandran)
A marketing campaign is a focused, tactical initiative to achieve a specific marketing goal.
Marketing activities require careful planning so that every step of the process is understood before you launch. Because a marketing campaign is tactical and project based, you need to map out the process from the initial promotional intent to the ultimate outcome. Based on that purpose, you need to set specific goals and metrics or key performance indicators (KPIs) that will help you determine how your campaign is performing against that goal and are helpful when creating or refining marketing strategies. It is important to track our marketing activities to results. Results will be determined by what our goals were for the campaign. But in most cases, results are usually in terms of sales or qualified leads and eventually applications.
Palash Siddamsettiwar, Internship at Tredence Analytics, July 2017, (Michael Fry, Sumit Mehra)
During my internship at Tredence Analytics, I worked as an analytics consultant to one of the biggest plumbing, HVAC&R and fire protection distributors in the United States, with more than $13 billion in yearly revenue. I was involved in building analytics capabilities in various divisions including supply chain, operations and products. My primary project involved working with warehouse managers and the head of data to understand how to cut shipping costs to customers by optimizing modes of shipment and timing of delivery, thus cutting fixed and variable costs. By providing cost estimates for the available options, sales representatives and dispatchers would be able to make data-driven decisions rather than instinct-based ones.
My secondary project involved working with the products team and the ecommerce team to help them categorize their products using machine learning techniques. With more than 3 million SKUs involved and more than 2 million of these still unclassified, the current pace and accuracy of classifying these products was not sustainable. Using machine learning would help these two teams significantly reduce the effort, time and money needed to classify the products and check the classifications. Both projects involve the creation of a long-term, automated, real-time solution that will be integrated into their IT systems to help people make quicker and more efficient decisions.
Jordan Adams, Forecasting Process for the U.S. Medical Device Markets, July 2017, (Yan Yu, Chris Dickerson)
The goal of this capstone is to build a forecasting process and model for Company X to forecast the US medical device market sales and share for Company X and all competitors. The forecasting process will be built using two data analytics tools to handle data management, data modeling, data visualization, and statistical analysis. The forecast process for the medical device market will involve conducting a baseline forecast using an array of time series forecasting methodologies, and adjusting the forecasts based on economic trends, competitive intelligence, market insights, and organizational strategies. The forecaster will have the flexibility to choose among many differing forecasts to select the model that they feel has the best predictability power, and the ability to cleanly visualize and explore each forecast in depth.
Aditya Singh, Churn Model, July 2017, (Michael Fry, Evan Cox)
The client is a cosmetics company based in New York City. The company has close to 9000 members globally, both men and women, from over 2250 companies in the beauty related industries. The primary reasons for becoming a member are as follows:
- Networking with other people in the beauty industry
- Find a career in the beauty industry
- Learn more about the latest trends in the beauty industry
- Get your product/company recognized at an awards event hosted by the company
A large percentage of the members churn after just one year of subscription. The goal is to identify patterns among the members who are likely to churn and eventually predict when a member is going to churn. A significant amount of time was spent setting up the dataset before the modeling process. After data cleaning and manipulation, I built a Logistic Regression model which predicts whether or not a member is going to churn.
Catherine Cronk, A Simulation Study of the City of Cincinnati’s Emergency Call-Center Data: Reducing Emergency Call Wait Times, July 2017, (David Kelton, Jennifer Bohl)
Emergency-response call centers are arguably one of the most important services a city can provide for its constituents. When a person calls 911, there is an expectation that the call will be answered and dispatched to the nearby emergency response department within seconds. In recent years, the total number of calls to 911 has increased, causing wait times of up to 30 minutes for people contacting emergency services. The purpose of this simulation study is to analyze the current emergency call-center system and data for the City of Cincinnati and to simulate alternate systems. The goal is to identify a better system that can achieve the City Administration’s goal of call takers answering 90% of 911 phone calls in under 10 seconds.
Michael Ponti-Zins, Inpatient Readmissions Reduction and MicroStrategy Dashboard Implementation, July 2017, (Michael Fry, Denise White)
Inpatient hospital readmission rates have been considered a major indicator of quality of care for several decades and have been shown to have a highly negative correlation with patient satisfaction. In 2017, the Ohio Department of Medicaid announced a 1% reduction in Medicaid reimbursement for all hospitals that are deemed to have excessive readmissions. In order to improve care and avoid potential payment reductions, Cincinnati Children’s Hospital created an internal quality improvement team focusing on readmissions reduction. In order to better understand the millions of data points related to readmissions, a dynamic dashboard was created using MicroStrategy, a business intelligence and data visualization tool. This dashboard was used to track the percentage of patients readmitted within 7 and 30 days of discharge, track why patients were returning, the percentage of readmissions that were potentially preventable, and other related aspects of each inpatient encounter. This information was used to identify targeted interventions to decrease future readmissions. These interventions included improved discharge and home medication instructions, automated email notification of providers, and data exports to assist in ad hoc analysis.
Ajish Cherian, Predicting Income Level using US 1994-95 Census Data, July 2017, (Peng Wang, David F. Rogers)
The objective of the project was to predict whether income exceeds $50,000 per year based on US 1994-1995 census data, using different predictive models and comparing their performance. Since the prediction to be made is a categorical value (income <=50K or >50K), the predictive models built were for classification. The models designed for the dataset were Logistic Regression, Lasso Regression, K-Nearest Neighbor, Support Vector Machine, Naive Bayes, Classification Tree, Random Forest and Gradient Boosting. The performance and effectiveness of all the models were evaluated using Area Under the Curve (AUC) and Misclassification Rate. AUC and misclassification rate were calculated on the training and test datasets. However, only metrics from the test dataset were used to finalize a model. Gradient Boosting performed best out of the selected models.
Rui Ding, Analysis of Price Premium for Online Health Consultations by Statistical Modeling, July 2017, (Peng Wang, Liwei Chen)
In this project, we focus on the mechanism of how the descriptive information of physicians and information of interactive reviews from patients will affect the price premium of online health consultation. Section 1 briefly introduces the definition of online health consultation and the techniques to be used in the project. Section 2 concentrates on the exploratory data analysis of the data set to obtain the overview of distribution of price premium and physicians. Section 3 discusses the analysis process of the data set by different modeling methods. The performance of each method is evaluated by in-sample, out-of-sample mean squared error and prediction error. Generalized linear modeling and mixed effect modeling demonstrate similar performance without obvious over fitting. Regression tree shows better prediction performance. However, tree-based bagging and random forest methods provide excellent performance with potential over fitting problem. Section 4 concludes the finding from the modeling and interprets the importance of the variables in the finalized models.
S.V.G. Sriharsha, Analysis of Grocery Orders Data, July 2017, (Yichen Qin, Jeffery Mills)
The objective of this analysis is to study the ordering patterns of users of Instacart, a grocery delivery company, and to provide key insights about customer behavior. There are 206209 users in the database and 49687 different products available to order through Instacart, which are categorized into 21 departments. The current database consists of the details of 3421083 orders placed by the users over a certain period of time. This analysis starts with exploration of the variables, then moves on to i) association rule mining using the Apriori algorithm, ii) unsupervised grouping of customers based on their buying behavior using the K-means clustering algorithm, and iii) product embedding using Word2Vec, and concludes with a summary of the results.
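Association rule mining ranks rules by support, confidence, and lift. The sketch below computes those three measures for a single rule on a handful of invented baskets; it does not implement Apriori's candidate generation, and the basket contents and the function name are hypothetical.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    a, c = set(antecedent), set(consequent)
    n = len(baskets)
    count_a = sum(a <= b for b in baskets)          # baskets containing the antecedent
    count_c = sum(c <= b for b in baskets)          # baskets containing the consequent
    count_ac = sum((a | c) <= b for b in baskets)   # baskets containing both
    support = count_ac / n
    confidence = count_ac / count_a
    lift = confidence / (count_c / n)
    return support, confidence, lift


# Five toy orders (hypothetical data, not drawn from the Instacart database).
baskets = [
    {"banana", "milk"},
    {"banana", "milk", "bread"},
    {"banana", "eggs"},
    {"milk", "bread"},
    {"banana", "milk", "eggs"},
]

support, confidence, lift = rule_metrics(baskets, {"banana"}, {"milk"})
print(support, confidence, lift)
```

A lift below 1, as in this toy data, indicates that the two items appear together less often than independence would predict, so the rule would not be a useful cross-selling suggestion.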
Linxi Yang, Analysis of Feedback from Online Healthcare Consultation with Text Mining, July 2017, (Peng Wang, Liwei Chen)
China has experienced rapid economic growth, which has benefited many industries but not the healthcare system. Because of uneven economic development in China, not all residents can receive appropriate medical care. With an immature healthcare system and scarce medical resources for 1.3 billion people, online healthcare consultation communities in China have become as popular as they are in developed countries. The data was collected from an online healthcare consultation community, Good Doctor Community (www.haod.com), which is the earliest and largest online healthcare consultation community in China and has been growing rapidly over the past 10 years. This research project uses text mining to examine how to improve the quality of service in the healthcare industry and to provide insightful analysis for the future development of Good Doctor Community. Results show that the main purposes of visits are treatment and diagnosis, and the main reasons for choosing a physician are online reviews and recommendations from friends, relatives, etc. Of 22,625 respondents, 11,671 registered at the counter before seeing a physician, and 9,290 registered via an online system. The most frequent word in the dataset is "patient", and the most frequent word in dissatisfied reviews is "impatient". Sentiment analysis of the text shows that most patients have very positive sentiment and only 1 in 48 has negative sentiment.
S. Zeeshan Ali, Image Classification with Transfer Learning, July 2017, (Peng Wang, Liwei Chen)
Correctly classifying an image is a problem that has existed since the advent of modern computers. Recently, techniques such as deep learning have led to breakthroughs in this field. In this project, we explore techniques such as transfer learning to classify images. We also touch upon image feature extraction and modelling with image arrays. For simplicity, we demonstrate these with a digit image dataset.
Apoorv Joshi, Predicting Realty Prices Using Sberbank Russian Housing Data, July 2017, (Dungang Liu, Liwei Chen)
Sberbank is Russia’s oldest and largest bank. It utilizes historical property sale data to create predictive models for realty price and assists customers in making better decisions while renting or purchasing a building. The Sberbank Housing Dataset describes the property and the sub area to which it belongs in Moscow. The dataset contains 30471 observations and 292 variables. The variables are analyzed using Exploratory Data Analysis to see how they individually affect the price of a house. Further, the available data is cleaned, manipulated, and is used to fit models that can predict the house prices. Linear Regression, LASSO, Random Forest, and Gradient Boosting models were fit on the data, and we could make the predictions with sufficient accuracy.
Aishwarya Nalluri, Multiple Projects with Sevan Multi Site Solutions, July 2017, (Michael J. Fry, Doug Gafney)
Client Company A is a well-known fast food restaurant chain with locations across the world. Their business model in the USA is divided into major FETs. In this project, an attempt has been made to map employees (who support Company A but are employed by Sevan Multi Site Solutions) working at different levels in a single dashboard. The tool used is Power BI. The main challenges were collecting the data and preparing it for use in Power BI: the data had to be valid for display in a dashboard, and the headshots had to be embedded in the dashboard rather than referenced through external hyperlinks.
QBR is a Quarterly Business Report which is presented to board members of the company. Every quarter a meeting is held and each department is given an opportunity to present where it stands and what challenges it is facing. QBR is mainly focused on 4 aspects: people, clients, operations and finance. This methodology was introduced when the company started acquiring more projects from a variety of clients. As quarters passed, many modifications were made to the process of collecting the required data and presenting it. The main challenge the company faced was that there was no standard framework for QBR release reporting. The Finance team had issues collecting, cleansing and presenting data. As part of the solution to this challenge, a standard approach was developed using Excel. The only effort needed from the Finance team now is loading a QuickBooks report in Excel, which automatically updates all the reports. This solution has reduced the time required by 50%.
Siva Ramakrishnan, The Insurance Company Benchmark (CoIL 2000), July 2017, (Yan Yu, Edward Winkofsky)
This project focuses on predicting potential customers for the Caravan Insurance Company. The dataset was used in the Computational Intelligence and Learning (CoIL) 2000 challenge. It consists of 86 variables and includes product ownership data and socio-demographic data. The aim of the project is to classify customers as either buyers or non-buyers of the insurance policy. Six different models were developed: Logistic Regression, Classification Tree, Naïve Bayes, Support Vector Machine, Random Forest and Gradient Boosted Trees. These models were evaluated based on the competition rules, under which contestants had to select a set of 800 observations from the test set of 4000. The logistic regression model performed better than all the other models.
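The competition-style evaluation described above can be sketched as a top-k selection: rank the test customers by predicted score, keep the highest-scoring k, and count how many of them are actual buyers. This is a minimal sketch under the assumption that the selection is scored by the number of true buyers it contains; the toy scores, labels, and function name are hypothetical, and k defaults to the 800 stated in the abstract.

```python
def buyers_in_top_k(scores, is_buyer, k=800):
    """Count actual buyers among the k customers with the highest predicted score."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(is_buyer[i] for i in ranked[:k])


# Toy example with six customers and k=3 (hypothetical values).
scores = [0.9, 0.1, 0.8, 0.3, 0.7, 0.2]
is_buyer = [1, 0, 0, 0, 1, 1]

hits = buyers_in_top_k(scores, is_buyer, k=3)
print(hits)
```

Because only the ranking matters for this criterion, a model with well-ordered scores can do well here even if its probabilities are poorly calibrated.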
Nitisha Adhikari, PD and LGD Modelling Methodology for CCAR, July 2017, (Michael J. Fry, Maduka S. Rupasinghe)
With the acquisition of First Niagara Bank in 2016, Key Bank acquired a $2.6b Indirect Auto Portfolio. This was a new addition to the list of existing portfolios at Key, and a loss estimation model is being built to generate stressed loss forecasts for the Comprehensive Capital Analysis and Review (CCAR) and Dodd-Frank Act Stress Tests (DFAST). This document describes the data preparation and modeling methodology for the Probability of Default (PD) and Loss Given Default (LGD) models. The PD and LGD, along with the Exposure at Default (EAD), are used to generate stressed loss forecasts for the CCAR and DFAST.
Venkat Kanishka Boppidi, Lending Club – Identification of Profitable Customer Segment, July 2017, (Dungang Liu, Liwei Chen)
Lending club issues unsecured loans to different segments of customers. The interest rate for the loan is dependent on the credit history of the customer and various other factors like income levels, demographics etc. The data of the borrowers is public. The current analysis has several objectives:
- To review the lending club dataset and summarize thoughts on LC risk profiles by loan type, grade, sub grade, loan amount, etc. using loan status of ‘Charged Off’ and ‘Default’, as indicators of a ‘bad loan’.
- To identify fraudulent customers (customers with no payment) in Lending Club data and the key characteristics of these fraudulent applications.
- To identify the best and worst categories by purpose (a category provided by the borrower for the loan request) in terms of risk.
- To build a statistical model using classification techniques and identify the less risky customer segment. These recommendations can be used to cross sell the loans for a customer segment which has low default rate and high profit.
Xiaojun Wang, Co-Clustering Algorithm in Business Data Analysis, July 2017, (Yichen Qin, Michael Fry)
In this project, we investigate a two-way clustering method and apply it to a business data set.
The classical clustering method is one-way. Given a data matrix, it is performed either on whole rows (observation-wise) or on whole columns (variable-wise). For example, in the well-known K-means method, the distance measure is computed either between records or between variables, but not both. Co-clustering, also called bi-clustering or block clustering, is a two-way clustering method. It clusters the rows and columns of a data matrix simultaneously and turns the data into blocks. Our data set comes from a retail company that has hundreds of stores, each of which contains hundreds of business departments. Co-clustering analysis helps to group the data into blocks based on similarity in productivity. Each block consists of a group of departments and the corresponding group of stores they belong to. Our goal is to study these blocks so that business decisions can be made based on the information they carry. Our results show that co-clustering serves this purpose well.
Manisha Arora, Marketing Mix (Promotional Spend Optimization) for a Healthcare Drug, July 2017, (Michael Fry, Juhi Parikh)
The healthcare industry is one of the world’s largest and fastest-growing industries, consuming over 10% of GDP in most developed nations. Data and analytics are playing a major role in healthcare, enabling organizations to make smart, impactful, data-driven decisions to mitigate risk, improve employee welfare and capitalize on opportunities. This capstone project focuses on evaluating the effectiveness of the professional tactics used for a particular drug and on optimizing its promotional spend based on channel effectiveness. The project analyzes each of the channels and tries to answer the following questions:
- What is the impact of each channel on the promotion of the drug?
- What is the average and marginal ROI for each channel?
- What would be the ideal spend level per tactic, optimized for a given brand budget?
Jayaram Kishore Tangellamudi, Predicting Housing Prices for ‘Sberbank’, July 2017, (Yan Yu, David F. Rogers)
Sberbank, Russia’s oldest and largest bank, helps their customers by making predictions about realty prices so renters, developers, and lenders are more confident when they sign a lease or purchase a building. Although the housing market is relatively stable in Russia, the country’s volatile economy makes forecasting prices as a function of apartment characteristics a unique challenge. Complex interactions between housing features such as number of bedrooms and location are enough to make pricing predictions complicated. Adding an unstable economy further complicates the predictions. Several regression models, namely Linear Regression, Generalized Additive Models (GAM), Decision Trees, Random Forest (RF), Support Vector Regression (SVR), and Extreme Gradient Boosting (XGB), were built on the housing features alone to predict the housing prices. Additionally, economic indicator data was merged with the housing features data to check whether these indicators can further explain the variance in the housing prices. The predictive model performances were compared using the Mean Square Error (MSE) of the logarithm of the housing prices.
Ramya Kollipara, Analysis of Income Influencing Factors in Different Professions, July 2017, (Dungang Liu, Liwei Chen)
Knowing the characteristics of a high- or low-income individual can be useful in marketing a new service targeted at potential customers within a salary range. There is always a cost involved in attracting the right customers, which the organization would want to minimize. If a model were designed to accurately identify the right people in an income range, the cost could be significantly decreased with a higher rate of return. The objective of this project is to explore and analyze the variables associated with an individual that might prove useful in understanding whether his/her income exceeds $50K/year, focusing specifically on 3 different professions: Sales, Executive Managers, and Professional Specialties. Various modelling techniques are explored, and the different models are compared to see how some characteristics have a greater influence on certain professions than on others. The most effective model is then selected to accurately predict whether an individual’s income exceeds $50K/year based on the census data.
Shalvi Shrivastava, Black Friday Data Analysis, July 2017, (Yan Yu, Yichen Qin)
Billions of dollars are spent on Black Friday and the holiday shopping season. ‘ABC Private Limited’ has shared data on various customers for high-volume products from the Black Friday month and wants to understand customer purchase behavior (specifically, purchase amount) against various products of different categories. The challenge is to predict purchase amounts of various products purchased by customers based on the given historical purchase patterns. The data contained features such as age, gender, marital status, categories of products purchased, city demographics, etc. Models were built on the training and validation data. The evaluation metric was RMSE, which also seemed a very appropriate choice for this problem.
Junbo Liu, Predicting Movie Ratings with Collaborative Filtering, July 2017, (Peng Wang, Zhe Shan)
Collaborative filtering, the most popular recommendation approach, has been widely applied to virtually every aspect of people’s lives and has generated remarkable success in e-commerce. To make a relevant recommendation to an active user, a recommendation system must be able to accurately predict the utility of items for this user, because items with the highest utility (ratings in the movie case) are recommended. Therefore, prediction accuracy is the key to the success of a recommendation system. In this report, we compared three representative types of collaborative filtering approaches derived from three distinct rationales using movie ratings data. The three types are user-based collaborative filtering (UBCF), singular value decomposition (SVD) and group-specific recommendation system (GSRS). The minimum root mean square error (RMSE) for UBCF is 0.9432 when the number of neighbors is set to 28. For SVD, the minimum RMSE is 0.9240 when the tuning parameter λ is 0.17 and the number of latent factors is 19. For GSRS, the same number of latent factors (19) is used and the cluster number of both users and items is set to the default value of 10. When the λ value is 65, the RMSE for GSRS is 0.9007. Therefore, our results show that GSRS has the highest prediction accuracy, SVD next and UBCF last. They are consistent with the conclusions of published work.
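All three approaches above are compared by RMSE on held-out ratings. As a minimal sketch of that metric only (the ratings below are invented, and none of the three recommenders is implemented here), RMSE is the square root of the mean squared difference between actual and predicted ratings:

```python
import math


def rmse(actual, predicted):
    """Root mean square error between actual and predicted ratings."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical held-out ratings and model predictions on a 1-5 scale.
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]

error = rmse(actual, predicted)
print(error)
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones, which is why small differences such as 0.9432 versus 0.9007 can still reflect a meaningful change in prediction quality.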
Aditya Nakate, Talmetrix Inc, Cincinnati, July 2017, (Michael J. Fry, Ayusman Vikramjeet)
I am working as a Production Support Analyst Intern at Talmetrix in downtown Cincinnati. This company helps organizations capture feedback regarding the employee experience and analyzes that data to help organizations attract desired talent and improve employee retention, performance and productivity. It helps organizations make more informed decisions about their employees. The analyses and reports created by the company are mainly consumed by the human resource heads of the client company. During my internship, I am working on various projects including report generation and ad-hoc analysis. While working here, I have used technologies such as SQL, R, and Tableau, and have also applied various statistical techniques such as classification algorithms and regression. This report contains a summary of the work I have done during my internship at Talmetrix. My first project at the company was a driver analysis, intended to find the categories which are critical for employee satisfaction and on which the clients need to focus. Later, I also worked on a report generation process for one client, using employee feedback data. Employees were asked to take surveys which contained both types of questions: Likert-scale and open-ended. Reports were created in Tableau. Different views were created at levels such as overall, region, age level, tenure level, department, operating unit and sub-operating unit. Currently, we are generating more reports based on this one as clients are doing some deep dives.
Suchith Rajasekharan, Allstate Insurance Claim Severity Analysis, July 2017, (Yichen Qin, Michael Magazine)
In the insurance industry, the ability to accurately predict the loss amount of a claim is of paramount importance. Companies build predictive models based on different features of a claim and use the predictions from these models to apply proper claims practices, business rules and experienced resources to manage the claims. In this paper, we explore the different steps involved in building a model to predict the loss amount of a claim. A Kaggle dataset provided by Allstate Insurance is used for this study. Various machine learning techniques, viz. Multiple Linear Regression, Generalized Linear Model, Generalized Additive Model, Extreme Gradient Boosting, and Neural Networks, are used to build different models. The models are implemented using various packages available in the open source software ‘R’. Models built using different techniques are compared based on their performance on a validation set. The XGBoost model gave the best performance of all the models and was therefore chosen as the final model.
Mahesh Balan, Cash as a Product, July 2017, (Michael Fry, Fan Yang)
The project analyzed the potential of adding cash as an additional payment feature to more markets. The analysis quantified the pros and cons of cash. The economics of a cash vs. non-cash trip on Uber were analyzed. The cash trip was economically beneficial to Uber compared to a non-cash trip. The project also took a deep dive into aspects such as driver and rider experience in a cash vs. a non-cash trip. The experience for a non-cash trip appears to be seamless compared to that of a cash trip. The project also tried to quantify the risk and safety issues in a cash vs. a non-cash trip. The non-cash trip appears to be safer and more trustworthy than a cash trip. The project also looked at various ways to improve the existing economics and the current rider/driver experience in a cash trip. The recommendations from the analysis were presented to the Growth and Product Team to improve the overall cash experience for riders and drivers.
Anitha Sreedhar Babu, eCommerce Marketing Analytics, July 2017, (Michael Fry, Maria Topken)
The client, a well-known online food delivery service in Cincinnati, is looking to engage its existing customers and increase the size and frequency of purchases. In order to understand the customer behavior and to drive revenue and engagement, customers were segmented based on frequency of orders and average lag between the orders. Customers were grouped into frequent shoppers, yearly shoppers, and one-time buyers. The data was also used to perform a market basket analysis to understand their purchase patterns. This information was used to drive recommendation engines as well as for effective cross selling of products to existing customers, by designing suitable combos. Targeted marketing strategies were developed based on the insights derived from the analysis.
Dhivya Rajprasad, Prediction of 30-Day Readmission Rate for Congestive Heart Failure Patients, July 2017, (Michael Fry, Scott Brown)
Prediction of readmission rates for patients has gained importance in the present healthcare environment for two major reasons. First, transitional care interventions have a role in reducing the readmissions among chronically ill patients. Second, there is an increased interest in using readmission rates as a quality metric with the Centers for Medicare and Medicaid Services (CMS) using the readmission rate as a publicly reported metric aimed at lower reimbursements for hospitals which have excess readmission rates according to reported risk standards. The objective of this project is to understand the factors which contribute to high readmission rates and predict the probability of a patient being readmitted. With a prediction model in place, hospitals will be able to better understand patient dynamics and provide better care while avoiding penalties for higher readmission rates. In this report, several different data mining, advanced statistical and machine learning techniques are explored and used to predict readmission rates. A comparison of the different techniques is also provided.
Prarthana Rajendra, Cincinnati Children’s Hospital and Medical Center, July 2017, (Jason Tillman, Michael Fry)
The scope of the project is to compute the metric usage of reports of various learning networks. This is done through the Extraction, Transformation, and Load processes. Data is extracted from various tables, transformed according to the requirements, and loaded into a single reporting database. The measures computed are then visualized in the form of graphs. SSIS, SSRS and SQL Server are the main technologies used in accomplishing this task. Packages were built in SSIS to automate these tasks and the resultant data sets can be viewed and analyzed through reports built using SSRS.
Matthew Murphy, Optimization of Bariatric Rooms and Beds within a Hospital, July 2017, (Michael Magazine, Neal Wiggermann)
Currently, hospitals do not have the ability to predict the quantity and type of specialty resources needed to care for specialty patients. This inability is especially problematic given the explicit and implicit costs of under- or overestimating the need. Two such specialty resources are bariatric beds and bariatric rooms. According to the Centers for Disease Control and Prevention, the obesity rate within the United States adult population has risen to 36%. The increase in the obese population of the United States, along with high costs for bariatric beds and dedicated bariatric rooms, has necessitated investigating a better way to determine the proper number of bariatric rooms to construct, bariatric beds to own, and bariatric beds to rent. In this paper, we use simulation and probabilistic techniques along with queueing theory models to investigate the relationship between the service level for severely obese patients and the number of bariatric rooms needed to reach a designated service level for such patients. Furthermore, we investigate and build a model that can be used to determine the optimal mix of beds to buy versus rent to minimize the overall cost of bariatric equipment for the entire hospital.
Soumya Gupta, Employee Attrition Prediction, July 2017, (Yan Yu, Peng Wang)
Every company wants to make sure that its employees, especially the good ones, continue to work for it. Losing valuable employees is very expensive for a company, both monetarily and non-monetarily. In this project, we aim to predict whether an employee will leave the company. Three classification techniques (logistic regression, decision trees, and random forest) have been used for building the predictive models, and their results have been compared. Valuable employees have also been identified by making a few assumptions, and separate models have been built for this set of employees since the cost of losing a valuable employee is much higher. The prediction accuracy of the random forest is quite high in this case.
Apurva Bhoite, Predicting Success of Students at Medical School, July 2017, (Peng Wang, Liwei Chen)
The University of Cincinnati’s College of Medicine wanted to conduct a study of the students enrolled at the College of Medicine by exploring their MCAT scores, MMI scores, academic background, race, and overall background. The College of Medicine also wants to identify the most influential predictors of student success at the medical school, and finally to build a predictive model of that success. The main aim of this project was inference. Thus, extensive graphical exploratory analysis, mainly box plots and bar plots faceted over variables, was carried out to get an overall picture. Due to the high dimensionality of the data and the small number of observations, I used lasso variable selection with cross-validation to reduce the number of predictors. Logistic Regression, Classification Trees and Random Forests were used to build predictive models, and their performance was compared to select the best model. The College of Medicine can employ this model when admitting students to the college.
Swapnil Sharma, Application of Market Basket Analysis to Instacart Transaction Data, July 2017, (Yichen Qin, Edward P. Winkofsky)
With the rise in online transactions, companies are trying to transform the large volumes of data generated by transaction activities into meaningful insights. Data mining techniques can be used to develop a cross-selling strategy for products. Data scientists use predictive analytics to improve the customer experience of shopping online by developing models that predict which products a user will buy again, try for the first time, or buy together. In this paper, we analyze the trends in customer shopping behavior on the Instacart website for buying groceries. The data set was made public by Instacart, a same-day grocery delivery service, and covers 3 million transactions from over 200,000 users. The data set is explored using the open source statistical learning tool R. Market Basket Analysis is done using the Apriori algorithm for various levels of support, confidence and lift to suggest combinations of products to be included in a basket to cross-sell the products on the platform. A model is developed to predict which previously purchased products will be in a user’s next order. The F-score is used to measure the model’s performance.
Jasmine Sachdeva, Malware Analysis & Campaign Tracking, July 2017, (Michael J. Fry, Dungang Liu)
Any software that causes harm to a user, computer, or network can be considered malware. Malware analysis is about examining malware to understand how it may harm the device, what its source is, how it works, and how to destroy it. As the number of malware attacks hitting an organization is increasing every day, it is crucial to analyze and mitigate them to ensure the security of the sensitive data residing on the devices. This project also covers IT security awareness programs that were conducted and analyzed, enabling employees to become more vigilant and ensuring that data and security are not breached within the organization.
William Newton, Concrete Compressive Strength Analysis, July 2017, (Yan Yu, Edward Winkofsky)
Concrete is an indispensable material in modern society. From roadways to buildings, humankind is literally surrounded and supported by this chemical bond comprised of relatively basic ingredients. Concrete is so ubiquitous today that it is often taken for granted. Many never question how concrete got here or how it can be trusted. Concrete is responsible for buildings thousands of years old, some of which still stand, and the analyses in this report seek to explain what concrete is and how its strength can be evaluated. Using principal components analysis and linear regression techniques, a dataset comprised of different concrete mixtures was analyzed. The analyses provide bases for reasonable inferences about compressive strength and how different elements behave in the presence of others. But they also indicate that this particular dataset is not comprehensive enough to make reliable predictions regarding the compressive strength of concrete.
Eric Nelson, Enhancing Staffing Tactics for Retailer Credit Card Customer Acquisition, July 2017, (Peng Wang, Justin Arnold)
Credit card companies across the country spend millions of dollars promoting their credit cards to consumers. Obtaining the attention and interest of a shopper can be extremely difficult, even more so when a credit card is being promoted by a retailer whose marketing budget does not stand up to those of larger banks. To attract customers, many retailers set up in-store promotional activities to give customers a chance to learn more about the card. One such retailer invested in this strategy but required assistance determining which stores should receive marketing at which times, as the required materials and staff are limited and expensive. To answer the question, this project looks at existing credit consumers through the lens of their shopping history. A model is used to determine which potential acquisition customers (“lookalikes”) are most similar to customers who already possess one of the retailer’s credit cards. A final tool shows which stores have the largest number of lookalike households and when those shoppers are in store and likely to notice the credit card promotional materials.
Nitish Puri, A Study of Market Segmentation and Application with Cincinnati Zoo Data, July 2017, (Yan Yu, Yichen Qin)
The process of dividing a market into homogeneous groups of customers is known as market segmentation. Customers can be grouped together based on where they live, other demographic factors or even their behavioral patterns. This project explains these along with other ways of grouping customers in the market, purpose of doing segmentation and the general process followed. Clustering is the main statistical technique used for performing segmentation. The two most commonly used algorithms are K-Means and Hierarchical Clustering, and they are explained in detail in this project. The final part of the project describes the membership data from Cincinnati Zoo, and segments the Cincinnati Zoo customers by performing K-Means clustering on this data.
Tauseef Alam, Internship with JP Morgan Chase Bank, July 2017, (Michael Fry, Yuntao Zhu)
The Chase Consumer & Community Banking (CCB) Fraud Modeling team at JPMorgan Chase & Co. is an analytical center of excellence for all fraud risk managers and operations across the bank. The CCB Fraud Modeling team is responsible for building predictive models for managing fraud risk at the transaction, account, customer and application levels. As part of the CCB Fraud Modeling team, my role is to build a machine learning model for predicting credit card bust-out account fraud. "Bust-out" fraud, also known as sleeper fraud, is primarily a first-party fraud scheme. It occurs when a consumer applies for and uses credit under his or her own name, or uses a synthetic identity, to make transactions. The fraudster makes on-time payments to maintain a good account standing, with the intent of bouncing a final payment and abandoning the account. ("Bust-out fraud white paper" 2009 Experian Information Solutions, Inc.) I used GBM as my modelling technique for predicting fraud accounts. As part of the process, I created some independent variables and tuned model parameters to build the model. As our next steps, we will enhance model performance by including more features. Once the model is finalized, it will be implemented, and the scores generated from this model will be used in deciding whether a credit card account is fraudulent.
Xiaoming Lu, Investigating the Information Loss of Binning Variables for Financial Risk Management, July 2017, (Peng Wang, Dungang Liu)
In financial risk management, the binning technique is widely used in the credit scoring field, especially in scorecard development. Binning is defined as the process of transforming numeric variables into categorical variables and regrouping categorical variables into new categorical variables. This technique is usually employed at the early stage of model development to coarsely select important variables for further evaluation. One potential problem of binning is the information loss due to the transformation. To address this problem, we performed automatic binning on the German Credit dataset using the “woeBinning” package. Then, we explored the potential information loss of binning in the development of several models using R, including logistic regression, classification tree and random forest. We employed residual mean deviance, OOB estimate of error rate, ROC curve, and symmetric and asymmetric misclassification rates (MR & AMR) to compare model performance. In general, there is little difference in model performance between the original data and the binned data, which means there is little information loss after binning the data.
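Binning for scorecards is commonly summarized with weight of evidence (WoE) per bin and an overall information value (IV). The sketch below illustrates those two quantities on invented bin counts. It is an assumption that this is the relevant calculation; the abstract only names the “woeBinning” R package, and this Python code does not reproduce that package's binning algorithm. The sign convention used here (good share over bad share) is also an assumption, since some tools use the inverse.

```python
import math


def woe_and_iv(bins):
    """bins: list of (good_count, bad_count) per bin. Returns (woe_list, iv).

    Assumes every bin has at least one good and one bad case; a zero count
    would require smoothing before taking the logarithm.
    """
    total_good = sum(good for good, _ in bins)
    total_bad = sum(bad for _, bad in bins)
    woes, iv = [], 0.0
    for good, bad in bins:
        good_share = good / total_good
        bad_share = bad / total_bad
        woe = math.log(good_share / bad_share)
        woes.append(woe)
        iv += (good_share - bad_share) * woe
    return woes, iv


# Hypothetical counts of good and bad loans in three bins of one numeric variable.
bins = [(50, 10), (30, 20), (20, 20)]

woes, iv = woe_and_iv(bins)
print([round(w, 4) for w in woes], round(iv, 4))
```

Each term of the IV sum is non-negative, so IV grows as the bins separate good from bad cases more sharply; comparing a variable's predictive strength before and after binning is one way to think about the information loss the project investigates.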
Sushmita Sen, Digit Recognition with Machine Learning, July 2017, (Yan Yu, Liwei Chen)
Computer vision is a subject that piques everyone’s interest. As humans, we learn to see and identify objects very early and as such don’t give much thought to the process. But in the background, an immensely complex architecture of neurons carries out this task. In pursuit of replicating this process, many fields of study have emerged, machine learning and pattern recognition among them. In this project, I have attempted to identify images of handwritten digits from the very popular MNIST dataset. I have used two popular classification algorithms, Support Vector Machines and Neural Networks, and compared their results in this document.
Abhishek Rao, NBA’s Most Valuable Player of 2017, July 2017, (Yan Yu, Liwei Chen)
Analytics and sports have gone together for a while now, and with advancements in sports technology, the application of analytics to sports increases with every passing day. Much of the decision making, scouting, recruiting, and coaching in this age depends on how teams crunch their numbers. While it’s true that uncertainty is the best thing about sports, an increasing number of people are becoming proponents of analytics applied to basketball. The nature of the sport makes it very suitable for statistical analysis. The plethora of variables and their interrelationships reveal some of the important facets of the game. Although it’s difficult to evaluate an individual’s ability through analysis of a team game, such analysis reveals things that wouldn’t have been noticed by plain sight.
Gautam Girish, Predicting Wine Quality, July 2017, (Peng Wang, Yan Yu)
Wines have been produced across the world for hundreds of years. However, there are significant differences in the quality of wine, which may be due to several factors, including alcohol content, pH, and fixed and volatile acidity. In this paper, I try to predict the quality of wine based on several of these factors. Different modeling techniques are used to determine the best model for predicting wine quality. The prediction techniques used are linear regression, generalized additive models, and regression trees. Ensemble methods such as boosting and random forest have also been used. Principal component analysis is also applied to try to improve model performance. The dataset has been obtained from Kaggle. All the ensuing analysis and model building has been done in R with the necessary packages. The R-squared values obtained from the test dataset are used as the metric for model comparison.
Keerthana Regulagedda, Diabetes Prediction In Pima Indian Women, July 2017, (Yan Yu, Michael Magazine)
The objective of the project is to predict diabetes in Pima Indian women based on different diagnostic measures. As the data set under consideration is small and has missing values in some of the variables, models are built using algorithms that are robust to missing data. To achieve this, data exploration is performed first: all the predictor variables are analyzed, and correlations and patterns in the data are noted. Based on the preliminary analysis, variable selection is done and an initial prediction model is built using logistic regression after removing the records with missing data. Removing the missing data discards half of the information, and as a result the logistic regression model gave poor prediction results, with an AUC of 0.6 and a misclassification rate of 0.54. In the model building process, CART and gradient boosting classification algorithms, which handle missing data well, are implemented and performance metrics are calculated. Missing data imputation is also done, and the effects of imputation on variable distributions are studied. Finally, models are built on the complete data to see if prediction accuracy improves after imputation of missing values.
Rajarajan Subramanian, Predicting Employee Attrition Using Data Mining Techniques, July 2017, (Yan Yu, Edward Winkofsky)
For any organization, human resources form one of the many pillars that ensure its sustainability in the market. Employee satisfaction and attrition are two critical factors that impact its growth in the near future. They tend to have positive and negative influences, respectively, not only on an organization’s success but also on its existing workforce. Using statistical modeling, it is possible to determine the various factors that lead to employee attrition and predict whether an employee will leave the organization. The objective of this study is to compare various predictive models and identify the best among them for predicting employee attrition. A fictional dataset from Kaggle, created by IBM data scientists, is used for the study. The models built include logistic regression, classification tree, generalized additive model, random forest, and support vector machines. All the models are evaluated with respect to their out-of-sample prediction performance. Misclassification rate, cost, and area under the curve (AUC) are considered as metrics for comparison.
Rohit Khandelwal, Comparison of Movie Recommendation Systems, July 2017, (Yan Yu, Peng Wang)
Recommendation systems have become a key tool in marketing and CRM strategies of companies in all spheres of life. This project aims to build a movie recommendation system that will use users’ ratings of movies to recommend movies they are likely to watch. Various models have been built to predict ratings and recommend movies accordingly and the results from these models are compared. A good model is one that not only predicts the ratings right but also has high precision and recall and makes recommendations in the right order.
Andrew Garner, Coffee Brand Positioning from Amazon Reviews, July 2017, (Yan Yu, Roger Chiang)
Online reviews contain rich information about how customers perceive brands in a product category, but the information can be difficult to extract and summarize from unstructured text data. Text mining and machine learning are applied to Amazon reviews to map the brand positioning of coffee companies. Specifically, a two-dimensional map places companies with similar Amazon reviews close together. This was accomplished by cleaning the text data, training a word2vec model to create a numeric representation of the review text, and applying t-SNE to reduce the high dimensional data to a two-dimensional map. Hierarchical clustering was used to label brands with distinct clusters.
Nitin Abraham Mathew, Lending Club Loan Default Analysis, July 2017, (Yan Yu, Liwei Chen)
Peer to peer lending platforms have become increasingly popular over the past decade. With relaxed rules and less oversight, the possibility of an investor losing money has greatly increased. This calls for the need to build risk profiles for each and every loan disbursed on these peer to peer platforms. The objective of this project is to explore the application of different risk modeling techniques along with techniques to tackle class imbalance on financial lending data in order to maximize expected returns while minimizing expected variance or risk.
Dhanashree Pokale, Image Classification using Convolutional Neural Network and TensorFlow, July 2017, (Yan Yu, Dungang Liu)
The inspiration behind the image classification problem considered for this project is to employ deep learning techniques and advanced data processing libraries available in Python, such as TensorFlow, to classify data. The focus of this project is a Convolutional Neural Network, which learns features from the images layer by layer and takes into account the fact that nearby pixels in an image are more correlated than those far apart. A feed-forward neural network is fully connected and hence fails to exploit this spatial correlation when classifying. With 2 convolutional layers, I could achieve a classification accuracy of 89% on the Street View House Numbers dataset. With a deeper architecture, a maximum accuracy of 97% can be achieved.
Matthew Wesselink, Analysis of NBA Draft Selections, July 2017, (Yichen Qin, Edward Winkofsky)
The NBA offseason is a short period from June to October each year when players and teams have the opportunity to regroup and improve their prospects for the coming season. Teams can do this in a number of ways: through free agency, the NBA draft, or outright trades. The focus of the following analysis is the NBA draft and the future performance of the drafted players. By analyzing win shares, one measure of an individual player’s offensive efficiency, we can better project the value each draft selection will provide to their team. Multiple forms of regression were used to predict player win shares based on draft position. To evaluate the models, we analyzed AIC and BIC values, residuals, Cook’s distance, and leave-one-out cross validation. Consistently, the logarithmic model performed better than other forms of regression. Logarithmic regression does well at modeling the average, but fails to predict an individual player’s success.
Angie Chen, Textual Analysis of Quora Question Pairs, July 2017, (Peng Wang, Dungang Liu)
Quora is an online platform that allows people to ask questions and connect with those who can share unique insights. The site’s mission is to distribute knowledge so that people can better understand the world. However, with the platform’s ever-growing popularity, many users submit similar questions. At the same time, there is a limited number of experts, who do not have time to answer multiple variations of the same question. Quora aims to allow experts to share knowledge in a scalable fashion: writing an answer once and disseminating the information to a wide audience. As a result, Quora wishes to focus on the canonical form of a question, the phrasing that is the most explicit, least ambiguous construct of that question. To address this problem, we used data analysis and modeling techniques to identify duplicate question pairs. Exploratory data analysis and text mining procedures were performed to develop a predictive model that classifies duplicate question pairs. Two types of ensemble learning procedures, random forest and gradient-boosted trees, were attempted. Based on the research, an effective model was ultimately developed through sentiment analysis (positive or negative valence), evaluation of key question pair characteristics (number of common words, difference in character length, similarity ratio), and gradient-boosted trees, which yielded an accuracy rate of 70% on the testing data. As such, this solution can be used to efficiently focus on the canonical form of a question. This will facilitate high quality answers and provide a better user experience on the platform.
Jainendra Upreti, Rossmann Store Sales Forecasting, July 2017, (Peng Wang, Dungang Liu)
For retail stores, sales are affected by a combination of several factors, such as promotional offers, presence of competitors, assortment levels, and store types. It is very important for stores to understand how these parameters affect sales and to use the analysis to predict future sales. Predictive models based on these characteristics are used to forecast sales efficiently and accurately. These predictions help store managers assess store performance against performance indicators and prepare in advance the measures that should be taken to improve sales, for example, introducing promotional offers or understanding the competitor market.
In this paper, we cover the processes involved in building a model to forecast store sales over a given period based on certain attributes. A store sales dataset from Kaggle is used for this purpose. Different modeling techniques are explored: random forest, gradient boosting, and a time series linear model. All the models are built using R. The models are trained on the training dataset using root mean square percentage error (RMSPE) as the metric and then, based on their predictive power, used to forecast store sales. Since our test dataset does not hold any values for sales, the prediction error is tested by submitting the outputs to Kaggle.
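For readers unfamiliar with the metric, the following is a minimal Python sketch of root mean square percentage error. It is not the project's R code. Skipping observations whose actual sales are zero is an assumption made here to avoid division by zero, and the function name `rmspe` and the sample numbers are illustrative only.

```python
import math

def rmspe(actual, predicted):
    """Root mean square percentage error over observations with non-zero actuals."""
    # Assumption: rows with zero actual sales are ignored, since the
    # percentage error is undefined for them.
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    if not pairs:
        raise ValueError("no non-zero actual values to score")
    mean_sq = sum(((a - p) / a) ** 2 for a, p in pairs) / len(pairs)
    return math.sqrt(mean_sq)

# Illustrative values only: each forecast is off by 10% of the actual.
print(rmspe([100, 200], [110, 180]))
```

Because each error is scaled by the actual value, a miss of 10 units on a store selling 100 counts the same as a miss of 20 units on a store selling 200.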
Matt Policastro, District Configuration Analysis through Evolutionary Simulation, July 2017, (Peng Wang, Michael Magazine)
This capstone replicates a methodology for identifying biased redistricting plans in a new context. Rather than electoral districts, twenty-four of the City of Cincinnati’s fifty neighbourhoods across three of the five Cincinnati Police Department districts were chosen as the units of analysis. It should be noted that this project did not constitute a rigorous analysis of potentially-biased districting practices; instead, this project identified advantages, trade-offs, and other challenges related to implementation and analysis. While the results of the evolutionary algorithm-driven simulation suggested deficiencies in the current implementation, the underlying methodology is sound and provides a basis for future improvements in evaluation criteria, computational efficiency, and evolutionary operators.
Ritesh Gandhi, Gender Classification by Acoustic Analysis, July 2017, (Dungang Liu, Liwei Chen)
With the advent of machine learning techniques and human-machine interaction, automatic speech recognition is finding practical uses in today’s world. As a result, gender classification based on the acoustic properties of a speaker’s voice is useful in a range of applications across different fields. It starts with extracting voice characteristics from huge databases containing human voice samples, then processing and analyzing those features to propose the best model for implementation in the respective systems. The purpose of this paper is to perform a comparative study of gender classification algorithms applied to voice samples. Extreme Gradient Boosting (XGBoost), Random Forest, Support Vector Machine (SVM), and Neural Network are employed to first train our model and then compare the results to determine the best classifier for gender. It has been shown for all the models that average model performance exceeds 95% and the misclassification rate stays below 5%. Final results suggest that Random Forest is the best classifier among all the six techniques that are used for gender recognition.
Anurag Maji, Analysis of the Global Terrorism Data, July 2017, (Yichen Qin, Edward Winkofsky)
The problem we are trying to address is to predict whether an attack will result in casualties given the nature and characteristics of the attack. The dataset was obtained from Kaggle. The motivation behind this project was to gain an understanding of how terror attacks have spread over time and across different regions, and what the key drivers of such incidents have been. A detailed analysis has been done of the spatial and temporal characteristics observed in the majority of the attacks. The target variable has been converted from continuous to categorical, as it was deemed more important to know whether there will be civilian casualties than to know the magnitude of damage to life.
Krishnan Janardhanan, Win Probability Model for Cricket, July 2017, (Peng Wang, Ed Winkofsky)
Cricket is a popular team sport played around the world with batters and bowlers. Each team has a limited number of resources available in the form of wickets and balls. In order to understand the impact of these resources and the current situation of the game, a win probability model was created that estimates the probability of a team winning. The model was created using logistic regression, boosted classification trees, and local regression. The models were compared on performance measures such as area under the curve and misclassification rate. The most suitable win probability model was chosen and applied to a game to examine its predictive ability. Win probability models may be used to evaluate player value and contribution, and by betting sites to calculate the odds of a team winning.
Kuldip Dulay, Fraud Detection, July 2017, (Yan Yu, Edward Winkofsky)
Credit cards have become an integral part of our financial system, and most people use them for their daily transactions. Given the huge volume of transactions that occur, it is very important to ensure that these transactions are valid and have been performed by the cardholder.
With the advancement in machine learning algorithms, it has become possible to narrow down the search for fraudulent transactions to a very limited number of records, which can later be manually verified. For this project, I implemented 4 such machine learning techniques to identify fraudulent transactions. Predictive models have been built using these 4 techniques, and the best model has been identified based on a few selected performance criteria.
Nidhi Mavani, How Can We Make Restaurants Successful Using Topic Modeling and Regression Techniques, July 2017, (Dungang Liu, Liwei Chen)
Yelp is an online platform, available as both a website and an app, where people write about their experiences at places they have visited. Yelp published data for a competition in which information about businesses across the US, Canada, Germany, and the U.K., along with their check-ins and reviews, was made available. The objective of the project is to identify the factors that affect the business of restaurants across the mid-west region of the US, in the states of Ohio, Wisconsin, Illinois, and Pennsylvania. About 6.2K restaurants with 50 attributes and 2M reviews are analyzed. The analysis has two parts: the first analyzes the text and identifies the topics that customers are most concerned about, using a topic modeling technique called Latent Dirichlet Allocation (LDA); the second finds the features of highly rated restaurants using logistic regression. Thus, both qualitative and quantitative data are analyzed to understand customers’ preferences.
Aditya Bhushan Singh, Price Prediction for Used Cars on eBay, July 2017, (Dungang Liu, Liwei Chen)
With the advent of e-commerce, everything from household items to cars is being made available online. Changing market trends indicate that various demographics now prefer to shop for almost everything from the comfort of their own homes. In this analysis we predict the price of second-hand cars whose ads were posted on eBay Kleinanzeigen, based on various attributes of the car made available by the sellers. This will help prospective buyers obtain an accurate price estimate for the cars while also helping sellers price their cars at an optimum level. In order to accomplish this, different data mining algorithms are applied and evaluated to identify the best solution for this problem. Once the best solution has been established for this use case, it can then be easily transferred to other products sold across the e-commerce spectrum.
Ishant Nayer, Airbnb Open Data for Boston, July 2017, (Yan Yu, Yichen Qin)
The purpose of the analysis was to show people how Airbnb is really being used and how it is affecting their neighbourhood. By analysing reviews from Airbnb’s own data, we can judge which areas are most popular, which apartment types are most commonly used, and how the listings are reviewed. Airbnb started an open data initiative in which it disclosed some data by location. The data was analysed using R, and visualizations such as Google Static Maps and word clouds were integrated into the sentiment analysis to highlight sentiments by location for the Boston area. This kind of analysis gives a holistic view in which the vibe of a neighbourhood can be picked up using analytics. Recommendations can be made to Airbnb or to the people who list their places on its website. By acting on these recommendations, Airbnb can improve its service, which will lead to happier customers and thus better business. Sentiment analysis results were presented using faceted vertical bar graphs, word clouds, and horizontal bar charts. The analysis shows that overall there is a positive vibe from the listings in Boston, MA.
Yash Sharma, Image Recognition, July 2017, (Bradley Boehmke, Liwei Chen)
Computer vision is a field which deals with automated extraction, analysis, and understanding of information from images. This field has enormous use cases, and numerous organizations like Google, Tesla, Baidu, and Honeywell have invested significant resources into research and development of computer vision technologies. Computer vision can be utilized in autonomous vehicles, language translators, wildlife conservation, medical solutions, forensics, census, and many more fields. Character recognition could be taken as the first step into computer vision. This project leverages data from a Kaggle competition in which 42,000 labeled samples of numbers were given and participants had to build models that could accurately recognize numbers from 0 to 9. Machine learning techniques such as Principal Component Analysis, Random Forest, and Artificial Neural Network were used to build models which were trained to identify handwritten numbers. Predictions from each of these models were compared, and metrics such as precision, recall, and F1 score were used to judge the accuracy of the model predictions.
Aditya Kuckian, Loan Default Prediction, June 2017, (Dungang Liu, Liwei Chen)
Loan default is one of the most common problems that banks face today across all their assets. The problem is aggravated during economic downturns. This project is from an online competition on HackerEarth in which a bank wants to control its non-performing assets by identifying, in a timely manner, the propensity of loan default among applicants. The data provided related to loan applications, customers’ engagement and demographics, and credit information. It had ~532K records and 45 features. The scope of the project was to identify the characteristics of loan defaulters for credit card and house purchase loans. This information was extracted from the ‘purpose’ column in the data. Machine learning classification models such as logistic regression and gradient boosting machine were used for this purpose. The predictions from each of the models were compared on concordance and area under the curve (AUC) metrics. Both techniques identified similar characteristics that distinguished loan defaulters, drawn from factors such as number of inquiries and credit lines, delinquency metric, verification status, and grade. The variable ‘total interest received till data’ showed contrasting behaviour. The gradient boosting machine could significantly improve the predictions for credit card defaults.
Sanket Purbey, Spotify Tracks Clustering and Visualization: Creating playlists by preferentially ordering audio features, April 2017, (Jeff Shaffer, Michael Fry)
Pradyut Pratik, Movie Recommendation Engine: Building A Movie Recommendation Engine to Suggest Movies to Users, April 2017, (Peng Wang, Ed Winkofsky)
A recommendation engine is a tool or algorithm that makes suggestions to the user. Recommendation engines based on statistical techniques are widespread in e-commerce nowadays. They solve the problem of connecting existing users with the right items in a massive inventory of products or content. Amazon reported that 29% of its sales in 2016 came from cross-selling, and Netflix offered $1 million to the team that could improve its recommendation accuracy by 10%. The idea behind this project is to study the different statistical methods used in recommendation engines and to better understand collaborative filtering and content-based filtering.
Zhengquan Wang, Modeling Bankruptcy Probability: A Study of Seven Major Banks, April 2017, (Dungang Liu, Liwei Chen)
The goal of this study is to model the bankruptcy probabilities of seven major banks under different economic scenarios. The balance sheet, income, and bankruptcy risk are analyzed so that the banks' capital status, operational risks, and default probabilities can be evaluated. The seven banks investigated here are: 1) JPMorgan Chase (JPM), 2) Bank of America (BAC), 3) Citigroup (CITI), 4) Wells Fargo (WFC), 5) Goldman Sachs (GS), 6) Morgan Stanley (MS), and 7) U.S. Bank (USB). The balance sheet and earnings data of the banks are collected from the 10-K documents published by these banks.
The balance sheet analysis is performed in terms of the ratios of total bank liabilities to gross domestic product (GDP) and total bank assets to GDP. Results indicate that these ratios have declined since the 2008 financial crisis. The revenue and net income analysis reveals that both revenue and income for most banks declined significantly during the crisis, then recovered and have stayed stable since. A logistic regression model is built to predict the bankruptcy probabilities of the banks. The approach is to borrow information from an existing dataset consisting of corporate bankruptcy data from other industries. This approach is based on the assumption that bank failures may follow a mechanism similar to that of corporations in other industries. Results of this study show that the liability ratio (liability/asset) of the banks has a significant impact on bankruptcy.
Shreya Ghelani, Product Recommendation for Customers of a Bank, April 2017, (Yan Yu, Edward Winkofsky)
Recommendation systems can enhance customer engagement by not only providing selective offers which can be highly appealing to the customer but also by adopting targeted marketing and advertising efforts towards potential customer segments and thereby achieving cost efficiency. The objective of this analysis is to look at customer purchasing behavior of financial products at a bank and predict the new products that customers are likely to purchase thereby recommending those products to the customers. With a more effective recommendation system in place, the bank can better meet the individual needs of all customers and ensure their satisfaction. Different data mining classification algorithms are tried and compared to identify the best model for such a problem.
Arun Yadav, Application of Predictive Modeling in Loan Portfolio Underwriting, April 2017, (Yan Yu, Edward P. Winkofsky)
In financial institutions, a decision on an online loan application is made within a matter of a few seconds. Predictive models based on a large volume of consumer characteristics are used to make the decision efficiently and accurately. This prevents loans from being issued to people susceptible to bankruptcy, thus avoiding bad debt. On the other hand, an accurate model also ensures that customers with good credit risk are not erroneously denied a loan, leading to increased profits. In this paper, we cover the processes involved in building a model to predict the probability of customers’ loan default in financial institutions. A financial transaction dataset from Kaggle is used for this purpose. Different machine learning techniques are explored: logistic regression, random forest, gradient boosting, and neural networks. All the models are built using the ‘sklearn’ package in Python.
The different modeling techniques are validated for out-of-sample data using Area Under ROC (Receiver Operating Characteristic) Curve (AUC) and Kolmogorov–Smirnov (KS) statistics as performance metrics. Logistic regression and neural network have comparable performance; however, logistic regression is chosen as the final model considering model complexity.
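To make the two validation metrics concrete, here is a minimal sketch in plain Python that computes both from a list of labels and model scores. It is written from the standard definitions rather than taken from the project, which used ‘sklearn’. The function name `auc_and_ks`, the label coding (1 = default, 0 = non-default), and the sample values are assumptions for illustration. The pairwise AUC calculation is quadratic in the sample size, so it suits small examples only.

```python
def auc_and_ks(labels, scores):
    """Return (AUC, KS) for binary labels and scores where higher means riskier."""
    pos = [s for y, s in zip(labels, scores) if y == 1]  # defaulters
    neg = [s for y, s in zip(labels, scores) if y == 0]  # non-defaulters
    if not pos or not neg:
        raise ValueError("both classes must be present")

    # AUC: share of (defaulter, non-defaulter) pairs ranked correctly,
    # counting ties as half a correct ranking.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    auc = wins / (len(pos) * len(neg))

    # KS: largest gap between the cumulative score distributions
    # of the two classes, checked at every observed score.
    ks = 0.0
    for threshold in sorted(set(scores)):
        cdf_pos = sum(s <= threshold for s in pos) / len(pos)
        cdf_neg = sum(s <= threshold for s in neg) / len(neg)
        ks = max(ks, abs(cdf_pos - cdf_neg))
    return auc, ks

# Illustrative labels and scores only.
auc, ks = auc_and_ks([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(f"AUC = {auc:.2f}, KS = {ks:.2f}")
```

Both metrics depend only on how the model ranks customers, not on the scale of the scores, which is one reason they are convenient for comparing models of different types.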
Mitchell Garner, Analysis of McDonald’s Made for You Production System: Current Barriers, April 2017, (Amitabh Raturi, Katie Blankenship)
With the introduction of ADB 2.0, the second phase of All Day Breakfast at McDonald’s, interesting challenges have been introduced into the kitchen systems of the iconic QSR. Here the barriers will be analyzed using experimental design, statistical inference, queueing analysis, and simulation modeling. These methods are meant to not only search for and analyze the potential barriers but prescribe fixes both temporary and permanent that Owner/Operators may adopt to improve operations and sales. Data was collected for this study in a manner which will generalize these results to as wide an audience as possible. Prior beliefs in assembly barriers will be tested using these data. Models will be constructed integrating the data with knowledge of the system. Recommendations include planning lower volume kitchen positioning around the product mix of the hour, utilizing positioning standards in higher volume stores to their fullest, and using the acquired statistical evidence to necessitate a group collaboration with McDonald’s for solution development.
Hui Wang, Spline Regression Modeling and Optimization Analysis of Floor Space, April 2017, (Yichen Qin, Michael Fry)
The purpose of this project is to maximize total net sales by modifying the floor space allocated to different product types for a large retail chain of stores. In order to achieve this objective, spline regression models were built to represent the relationship between floor area and productivity for each of the retailer's products. The results of model building showed that the behavior of most of the products sold by the retailer follows the anticipated “U” shape curve; that is, as selling area increases, the productivity of the product initially decreases and then increases; at a certain threshold, the increase in productivity stops, and productivity then either plateaus or starts to decrease again. A non-linear optimization model was then built on the spline regression models with the objective of maximizing the total net sales of a store. The optimization procedure was designed at three different levels: optimizing the selling area of products across a whole store; optimizing the floor area of products within each floor of a store; and optimizing the selling space within each product category in a store. The outputs of the optimization analysis are the optimal floor areas assigned to specific product groups. These analyses and recommendations are important because they serve as a valuable reference for the retailer on whether and how to adjust the floor space of its products in order to maximize total net sales.
Todd Eric Hammer, Simulation Study: Waiting Times During Checkout at Non-Profit Tag Sale, March 2017, (W. David Kelton, Edward Winkofsky)
The purpose of this simulation is to study the waiting times of buyers during a public consignment sale held by a non-profit organization. During a substantial portion of the 4 hour public sale, the lines become very long and the non-profit group worries that some people will abandon their purchases and they will lose the commission from those sales. A simulation model was created in Rockwell’s Arena software to simulate the sale and modifications to the sale to determine if different configurations would reduce the amount of waiting time for the buyers. The study primarily focused on adding more helpers and tables during the checkout process. The study found that adding more tables was the most beneficial way of shortening lines and reducing the amount of time the buyer was in the checkout process.
Dan Shah, Developing a Health Score for Network Traffic, March 2017, (Yan Yu, Mukta Phatak)
GE has over 30 locations globally within its IT infrastructure deemed “mission-critical” to ongoing operations. Many of these locations are noting unacceptably high levels of variability in network latency, but limited information makes it difficult for the network engineering team to determine root cause. A “health score” implemented to a visual dashboard will help GE Digital understand current and past network health and prioritize improvements. In this paper, the network traffic across sites is analyzed to generate a framework for assigning a score to a location’s network traffic. The parameters for the framework are generated through simulation and input from the client and, subsequently, a prototype implementation is established. Additionally, based upon further research and simulation, alternative methods for outlier detection including Cumulative-sum (CUSUM) and Exponentially-weighted-moving-average (EWMA) control charts are compared and a path forward for the diagnostic tool is proposed.
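As an illustration of one of the outlier detection methods mentioned above, the following is a minimal Python sketch of an EWMA control chart using the textbook time-varying control limits. It is not the project's implementation. The function name `ewma_alarms`, the smoothing weight of 0.2, the 3-sigma width, and the latency values are all assumptions chosen for the example, and the in-control mean and standard deviation are taken as known.

```python
import math

def ewma_alarms(values, target, sigma, lam=0.2, width=3.0):
    """Return indices where the EWMA statistic falls outside its control limits.

    target and sigma are the assumed in-control mean and standard deviation;
    lam is the smoothing weight and width the limit width in sigma units.
    """
    alarms = []
    z = target  # the EWMA statistic starts at the in-control mean
    for i, x in enumerate(values, start=1):
        z = lam * x + (1 - lam) * z
        # Time-varying half-width of the control band after i observations.
        half_width = width * sigma * math.sqrt(
            lam / (2 - lam) * (1 - (1 - lam) ** (2 * i))
        )
        if abs(z - target) > half_width:
            alarms.append(i - 1)  # report zero-based position in values
    return alarms

# Hypothetical latency readings (ms) with an upward shift partway through.
latency = [50, 51, 49, 50, 60, 61, 62, 60]
print(ewma_alarms(latency, target=50, sigma=2))
```

A smaller smoothing weight makes the chart slower to react but more sensitive to small sustained shifts, which is the usual trade-off when comparing EWMA against a CUSUM chart.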
Uma Lalitha Chockalingam, Customer Churn Propensity Modelling, August 2016, (Dungang Liu, Edward Winkofsky)
Churn is a measure of subscription termination by customers. Churn incurs a loss to the company when investments are made in customers with a high propensity to churn. Churn propensity models can help improve the customer retention rate and hence increase revenue. This paper focuses on the churn problem faced by companies and on predicting customer churn by building churn propensity models. Data for this project is taken from the IBM Watson Analytics Sample Datasets, which contain 7,043 instances of telecommunication customers’ churn data. In this paper, churn propensity models are built using logistic regression, support vector machines, neural networks, random forests, and decision trees. Comparing the models shows that neural networks, logistic regression, and random forests perform better for out-of-sample prediction. While neural networks and random forests are black-box algorithms, logistic regression gives good insight into the predictor variables that are effective in modelling churn. The in-sample prediction measures of the random forest show an ideal misclassification rate, which indicates overfitting to the training data. Hence logistic regression is recommended, owing to its good out-of-sample prediction performance and its insight into the predictor variables that are significant to the model.
Mohan Sun, Customer Analytics for Financial Lending Industry, November 2016, (Zhe Shan, Peng Wang)
This research examines customers’ experience, attributes, and performance to help the company make better decisions that increase profit across service, origination, and collection. Different datasets, such as customers’ attributes and performance data, were examined using tools such as SAS and GIS, and various analyses were performed. The examination makes clear that answering customers’ phone calls promptly, targeting customers, locating stores in areas with a high demand index, and tracking store and customer performance over time will help the company understand its operations and, in turn, increase profit.
Wenwen Yang, P&G Stock Price Forecasting using the ARIMA Models in R and SAS, December 2016, (Yichen Qin, Dungang Liu)
Time series analysis is commonly used in economic forecasting as well as in analyzing climate data over long periods of time. It helps identify patterns in correlated data, understand and model the data, and predict short-term trends from previous patterns. The aim of this paper was to present a concise demonstration of one of the most common time series forecasting models, the ARIMA model, in both R and SAS. The daily stock prices of Procter & Gamble from January 1, 2013 to September 30, 2016 (693 points) were used as an example. Autocorrelation function and partial autocorrelation function plots, along with the Akaike Information Criterion (AIC), were used to examine the adequacy of the model. The daily stock prices from October 1, 2016 to November 4, 2016 (25 points) were used to test the model’s performance by calculating the accuracy of the forecasts. First, the time series modeling was conducted in R, and then it was validated using SAS. The final model was identified as a moving average model with a first-order difference. The AIC was 1494 and the average accuracy was 97%, which suggests that the ARIMA model performs well for short-term prediction. In addition, a log transformation, which is preferred in real economic prediction analysis, was applied; the same modeling results were obtained. To conclude, this paper demonstrated a comprehensive time series analysis in R and SAS, which could serve as useful documentation for beginners.
Scott Woodham, Time Series Analysis Using Seasonal ARIMAX Methods, December 2016, (Yan Yu, Martin Levy)
The goal of this analysis is to develop a model that forecasts sales using time series methodology. First, the ARIMA and SARIMA models are developed in their polynomial forms. Second, the process of developing a model from start to finish is carried out, addressing issues such as stationarity of the data, interpreting the ACF and PACF plots to infer the model parameters, and then estimating the parameters. After forecasts are made with the multiplicative seasonal model, the model is extended to include an exogenous variable (SARIMAX) to enhance performance. The current heuristic used to predict sales is the value of sales from a week prior, or a lag 6 value. The final model selected has both seasonal and non-seasonal AR and MA components as well as a binary indicator variable. Using such an indicator is sometimes referred to as intervention analysis, though that term is not used here because it usually implies a large, sudden shift from which the system recovers, whereas in this analysis the data are more sinusoidal as sales shift from weekdays to weekends.
Jin Sun, Internship at West Chester Protective Gear, December 2016, (Yichen Qin, Yan Yu)
West Chester Protective Gear (WCPG), founded in 1978, is a known leader in the marketplace for high-performance protective gear for industrial, retail, and welding customers. From gloves to rainwear to disposable clothing, WCPG offers a wide range of quality products, including core, seasonal, and promotional products, and is one of the largest glove importers in the United States. This capstone is composed of five projects, most of which are interactive reports made with Microsoft Power BI, a cloud-based business analytics tool. The Order Picking and Fill Rate reports greatly increase the work efficiency of the Warehouse Department, and the reports for the Purchasing Department give staff another view of the sales data, which will help the company make a better inventory plan. The last project analyzes the relationship between Average Sales Price and Sales Units. A linear regression model is built to explain how a change in price affects sales units. Model diagnostics are conducted, and model performance is evaluated in terms of hold-out sample prediction. Throughout this internship, I practiced applying my knowledge from the MSBA program to real-world applications.
Ally Taye, Predicting Hospital Readmissions of Diabetes Patients, December 2016, (Yichen Qin, Yan Yu)
Diabetes is an increasingly common disease among the U.S. population. According to the CDC, the number of people diagnosed with diabetes increased fourfold from 1980 to 2014. In addition, if not well controlled, diabetes can lead to serious complications such as cardiovascular disease, kidney disease, peripheral artery disease, and many others that can result in hospitalization or even death. In light of the seriousness of this condition, it is worth looking into the causes of hospitalization of diabetes patients and what factors influence whether they stay healthy enough to avoid future hospitalizations. This paper looks at a de-identified dataset with information about diabetes patients admitted to hospitals across the U.S. over an extended period of time, and analyzes multiple variables from their hospital records to see if there was a statistically significant relationship between any combination of these factors and whether or not they were readmitted to the hospital any time during the observation period after their initial visit.
Hamed Namavari, Disney Princess: Strong and Happy or Weak and Sad, A Sentiment Analysis of Seven Disney Princess Films, December 2016, (Michael Fry, Jeffrey Shaffer)
In a world that is predominantly run by men, several researchers have suggested that entertainment content is affected more by male influence than by female influence (see Friedman et al.). But what if the general male dominance coming from the study context is eliminated in the research process? In this capstone, the significance of the main female characters in a select list of Disney Princess title movies is explored by comparing only their scripts in those movies to that of the other main character in each title, who is never female. The research completed here supports the idea that Disney princess characters are the most positive and most spoken characters in their movies.
Pramit Singh, Sentiment Analysis of First Presidential Debate of 2016, December 2016, (Amitabh Raturi, Aman Tsegai)
Datazar is a platform that uses open data to generate meaningful insights. The sentiment analysis was therefore performed as part of a scalable plan that allows analysts to reuse the analysis to calculate sentiment scores based on Twitter feeds. It was performed after the first presidential debate to capture the mood of people on social media: the tweets were classified as positive, negative, or neutral, and a sentiment score was calculated for each of the presidential candidates.
In addition, logistic regression and random forest techniques were used to predict negative sentiment. While this was implemented for the presidential debate, the functions are reusable and can therefore be used to compute the score for any other brand, which keeps the process scalable.
Huangyu Ju, Regression Analysis for Exploring Contributing Factors Leading to Decrease of Cincinnati Opera Attendance, December 2016, (Dungang Liu, Tong Yu)
In recent years, there has been growing concern about the diminishing audience for opera nationwide. Cincinnati Opera has faced a decline in total audience attendance over the past decade. The audience of Cincinnati Opera includes subscribers and single ticket buyers. While the numbers of both subscribers and single ticket buyers are decreasing year by year, the number of single ticket buyers is not decreasing as rapidly as the number of subscribers. The goal of this project is to explore and identify the variables that may influence attendance, especially the number of subscribers. In this project, regression analysis is adopted to explore the contributing factors that impact audience attendance. The analysis identified four categories of variables: variables related to the origin of the opera piece, variables regarding the show time, variables related to popularity, and the theatre capacity. To boost audience attendance, it is recommended that opera pieces with a European background and a good reputation be included in each season’s performances. More performances should be scheduled on weekends so that they can attract a larger audience.
Darryl Dcosta, Analysis of Industry Performance for Credit Card Issuing Banks, August 2016, (Dungang Liu, Ryan Flynn)
Argus Information and Advisory Services, LLC, is a financial services company that uses credit card level transaction data collected from different banks and credit bureaus to offer various analytical services to credit card issuers. Argus possesses transaction, risk, behavioral, and bureau-sourced data that covers around 85-95% of all the banks in the US and Canada. The dataset contains transaction level data provided by nearly 30 banks across 24 months, with an approximate size of 3+ million records. This study looks at how Argus can offer an early bird analysis of the variance in industry performance while abiding by legal regulations that prevent the company from revealing more than a certain level of data, since doing so would pose a risk of price fixing by the client bank. Data is pulled from tables containing different dimensions of data in the SQL Server database and aggregated to produce client level reports. After the transaction level bank data was loaded, validated, normalized, and queried from the database, the analysis showed that the projections were fairly consistent with the observed industry trend, which is a good indication of the accuracy of the projections. Client banks use the flash report to tailor their revenue model and customer acquisition strategy. The spike in Total New Accounts in the industry for March 2016 was not captured by the projection, which would need to be revisited from a business point of view.
Shivaram Prakash, Predicting online Purchases Navistone®, August 2016, (Efrain Torres, Dungang Liu)
Ecommerce, the newly emerged platform for online retail sales, has seen a burgeoning increase in usage since its inception. Although the field is dominated by giants like Amazon and eBay, almost all businesses have their own online store or website, which contributes a sizable share of total revenue. Navistone® collects visitor browsing behavior data and analyzes patterns to predict prospective buyers for its clients. The objective of this exercise is to analyze the browsing behavior data of online visitors in order to predict whether each visitor will make a purchase. To achieve this goal, visitor browsing data is collected from various client websites, checked for erroneous entries, cleaned, and analyzed. Binary response models are then built on a reduced, choice-based dataset (to enable better prediction). The first model, a classification tree, is built to help management understand the importance of the different features of the dataset, while the second, a logistic regression model, is built to predict the response better than the classification tree. The logistic regression model produces better predictions on both the training and testing datasets, and the classification tree provides evidence that the number of carts opened is the most statistically significant variable, prompting management to focus marketing efforts on visitors who put items in the cart and then abandon them.
Lavneet Sidhu, Predicting YELP Business Rating, August 2016, (Yan Yu, Glenn Wegryn)
Sentiment analysis, or opinion mining, is the computational study of people’s opinions, sentiments, attitudes, and emotions expressed in written language. It has been one of the most active research areas in natural language processing and text mining in recent years. Its popularity is due to its wide range of applications: opinions are central to almost all human activities and are key influencers of our behaviors. Whenever we need to make a decision, we want to hear others’ opinions. The focus of this study is to quantify people’s opinions on a numerical scale of 1 to 5. Various predictive models were explored, and their performance was evaluated to determine the best model. Attempts were made to extract the semantic space from all the reviews using latent semantic indexing (LSI). LSI finds ‘topics’ in reviews, which are words having similar meanings or words occurring in a similar context. Similar reviews were clustered into different categories using the semantic space.
Ashok Maganti, Internship with Argus Information and Advisory Services, August 2016, (Harsha Narain, Michael Magazine)
Argus Information and Advisory Services is a leading provider of benchmarking, scoring solutions, and analytics for financial institutions. Argus helps its clients maximize the value of data and analytics to allocate and align resources to strategic objectives, manage and mitigate risk (default, fraud, funding, and compliance), and optimize financial objectives. One of the core competencies of Argus is the ability to link the different accounts of a customer across financial institutions and obtain a complete view of the customer’s wallet. The deposits, transfers, and spending of the customer can be linked, and the complete profile and spending behavior can be studied. The Wallet Analysis Team is responsible for the linkage and validation of the data. As part of the Data and Applications Vertical and the Wallet Analysis Team, my primary objective was to study the concepts of record linkage and identity resolution, and to develop an algorithm that identifies unique customers from different data sources and populates a single normalized flat database using a deterministic record linkage process for the UK market. The records contain credit card account and customer details from different banks. These records are linked and integrated so as to identify the same customer across banks and remove duplication. Apart from identifying the accounts of customers across banks, changes in customer details have to be captured and maintained in the integrated flat database using a type-2 slowly changing dimension.
Lian Duan, Fair Lending Analysis, August 2016, (Julius Heim, Dungang Liu)
The Consumer Financial Protection Bureau (CFPB) requires lenders to comply with fair lending laws, which prohibit unfair and discriminatory practices when providing customer loans. Collecting applicants’ demographic information is usually prohibited, but this information is needed to perform fair lending analysis. The objective of this project is to show that the race distribution is similar across the three bins of a predictor from our scorecard. In this project, the customers’ race categories were predicted using last names and residential locations according to the Bayesian Improved Surname Geocoding (BISG) proxy method published by CFPB. Modifications in our analysis include using customers’ Core Based Statistical Area (CBSA) information instead of home address, and using R software instead of STATA for data preparation and analysis. The predictor was evaluated based on the race distribution in each bin, and our results suggest that the race distributions across the three bins of this predictor are similar.
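The core of the BISG proxy is a Bayes update that combines a surname-based probability with a geography-based probability. The sketch below shows only that update step, under stated assumptions: the function name, the dictionary inputs, and the category labels are hypothetical, and the real method draws both inputs from Census tables that are not reproduced here.

```python
def bisg_posterior(p_race_given_surname, p_geo_given_race):
    """Combine surname and geography information with Bayes' rule.

    p_race_given_surname: dict mapping category -> P(category | surname)
    p_geo_given_race:     dict mapping category -> P(area | category)
    Returns a dict mapping category -> P(category | surname, area),
    assuming surname and area are independent given the category.
    """
    joint = {
        category: p_race_given_surname[category] * p_geo_given_race[category]
        for category in p_race_given_surname
    }
    total = sum(joint.values())
    if total == 0:
        raise ValueError("no category has positive probability")
    # Normalize so the posterior probabilities sum to one.
    return {category: value / total for category, value in joint.items()}
```

In the project described above, the geography input would come from the customer's CBSA rather than a home address; the update itself is unchanged by that substitution.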
Juvin Thomas George, Automation of Customer-Centric Retail Banking Dashboards, August 2016, (Andrew Harrison, David Bolocan)
Retail banking is a competitive arena focused on customer-centric service. Customers interacting with banks through multiple channels have created an explosion of data that banks use to generate insights into their behaviors. Understanding customer data is crucial to developing better products and services. Performing analytics on transactional data and utilizing benchmarking studies require the creation of standard dashboards on a regular basis. Automating data input processes and dashboard updates is critical to on-time service. This capstone project was completed at Argus Information & Advisory Services, part of Verisk Analytics, located in White Plains, NY.
Sudarshan K Satishchandra, Prediction of Credit Defaults by Customers Using Learning Outcomes, August 2016, (Peng Wang, Yichen Qin)
Most financial services firms have realized the importance of analyzing credit risk. Predicting credit defaults with higher accuracy can save these firms a considerable amount of capital. Many machine learning algorithms can be leveraged to increase the accuracy of prediction. Popular and effective algorithms such as logistic regression, generalized additive models, classification trees, support vector machines, random forests, extreme gradient boosting, neural networks, and lasso are apt for predicting credit defaults. These algorithms have been compared using the asymmetric misclassification rate and AUC for out-of-sample prediction. Data from the UCI Machine Learning Repository, donated by I-Cheng Yeh of Chung Hua University, Taiwan, has been used.
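An asymmetric misclassification rate penalizes a missed default more heavily than a false alarm. The following is a minimal sketch of one common form of that metric; the 5:1 weighting, the cutoff argument, and the function name are assumptions for illustration, since the abstract does not state the weights that were used.

```python
def asymmetric_cost(actual, predicted_prob, cutoff, fn_weight=5.0, fp_weight=1.0):
    """Average weighted misclassification cost at a probability cutoff.

    actual:         sequence of 0/1 labels (1 = default)
    predicted_prob: sequence of predicted default probabilities
    fn_weight:      cost of predicting 0 when the truth is 1 (assumed 5)
    fp_weight:      cost of predicting 1 when the truth is 0 (assumed 1)
    """
    total = 0.0
    for y, p in zip(actual, predicted_prob):
        predicted = 1 if p >= cutoff else 0
        if y == 1 and predicted == 0:
            total += fn_weight  # missed default
        elif y == 0 and predicted == 1:
            total += fp_weight  # false alarm
    return total / len(actual)
```

Because the two error types carry different costs, the cutoff that minimizes this quantity is generally lower than 0.5 when missed defaults are the more expensive error.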
Rishabh Virmani, Kobe Bryant Shot Selection, August 2016, (Michael Magazine, Yichen Qin)
The report presents insights about Kobe Bryant’s shot selection throughout his career. The data consist of all of his career shots and whether or not each one went in, which is the response variable. In addition, we try to predict Kobe’s performance in the last two seasons of his career, that is, whether or not he actually sank each shot. For this purpose we employ three algorithms: Random Forest (a bootstrap aggregating technique), Support Vector Machine (a non-probabilistic technique), and XGBoost (a boosting technique).
Alicia Liermann, The Analytics of Consumer Behavior: Customer Demographics, August 2016, (Jeffrey Shaffer, Uday Rao)
This project focuses on consumer buying behavior in retail grocery stores across the United States. The data was obtained through historic Dunnhumby data that was generated by shopping cards and recorded coupon codes, accompanied by transaction information. The project was approached from a business sales and marketing orientation as a means to target customers and increase sales.
Sarthak Saini, Predicting Caravan Insurance Policy Buyers, August 2016, (Peng Wang, Glenn Wegryn)
The project involves analyzing customer data for an insurance company. The aim is to predict whether a customer will buy caravan insurance based on demographic data and data on ownership of other insurance policies. The data consists of 86 variables and includes product usage data and socio-demographic data derived from zip codes. There are 5822 observations in the training data set and 4000 observations in the testing data set. The models used for the project classify customers as potential buyers or non-buyers. Predictive models were built to describe customer behavior and predict potential buyers. Given that this is a classification problem, Lasso logistic regression, classification tree, random forest, support vector machine (SVM) after dimension reduction by principal component analysis (PCA), linear discriminant analysis (LDA) (after PCA), and quadratic discriminant analysis (QDA) (after PCA) were used to predict the potential customers. Dimension reduction was employed because there are many predictor variables. The best results were obtained using LDA and SVM, with a misclassification rate as low as 7% on the testing data. Dimension reduction significantly improved the performance of the models. PCA was used for reducing dimensions, and the first twenty components were used to build the model.
Abhishek Chaurasiya, Tracking Web Traffic Data Using Adobe Analytics, August 2016, (Dan Klco, Dungang Liu)
A website is the major source of information and interaction between the consumer and producer in any kind of organization or environment. It can be accessed by hundreds to millions of users, which generates huge volumes of data. This data contains important information about customer profiles, demographics, technology used, user patterns, consumer trends, etc. Tracking this data and reporting it in the desired format is therefore a large and important task. This project uses Adobe Analytics, along with Dynamic Tag Manager (DTM), to track and effectively report this data. The reports are then analyzed with their business value in mind. The analysis concludes that the author ‘Ryan McCollough’ garners the most views through his posts, around 90% of the total. It also concludes that Twitter is the most preferred social media channel, driving around 80% of the traffic that follows the blog.
Rutuja Gangane, Customer Targeting for Paper Towels – Trial Campaign, August 2016, (Sajjit Thampy, Yichen Qin)
Customer targeting has been a marketing challenge for many years. The idea behind customer targeting is to target the right kind of customer at the right time with the right kind of product in order to maximize sales, save business resources, and maximize profit. Quotient Technology Inc.’s website Coupons.com delivers personalized digital offers in accordance with users’ purchasing behavior data. A customer is shown many different combinations of coupons based on their buying patterns and segments, which are created using data-driven techniques. This project is an ad hoc predictive analysis to determine the target customers for a paper towel producing CPG brand (say, YZ) that is targeting its customers with personalized coupon offers for various retailers. The main idea behind the campaign is to generate trial. It is easy to determine users’ behavior based on previous trial campaigns, but as this is the first campaign of its sort for the brand, we will make use of multiple machine learning techniques, heuristics, and business knowledge to make the best predictions about which customers are likely to try the product.
This project will make use of extensive SQL queries (Hadoop Impala) and R software to perform data analyses, market basket analysis, logistic regression, random forest, SVM models, and similar machine learning techniques to find which customers are more likely to buy YZ brand’s paper towels and should be targeted for this trial campaign.
Hardik Vyas, Analysis of Kobe Bryant Shot Selection, August 2016, (Michael Magazine, Peng Wang)
The key objective of this project is to explore the data pertaining to all of the 30,697 shots taken by Kobe Bryant during his entire NBA career. We also develop various models to predict which of these shots would have gone in had the outcome been unknown. The problem is based on a now-closed Kaggle competition, which was introduced after Kobe Bryant’s retirement from professional basketball on April 12, 2016. Kobe played his entire 20-year NBA career with the Los Angeles Lakers. He had an illustrious career, holds numerous records, and is regarded as one of the most celebrated players ever to grace the game.
Nikita Mokhariwale, Reporting Analyst Internship at BlackbookHR, Cincinnati, August 2016, (Marc Aiello, Peng Wang)
The importance of data interpretability is often overlooked in executive reporting. The customer experience can be improved manifold if executive reports are made user-friendly, in such a manner that executives are encouraged to see patterns and trends in the data and even to question the data. I transformed the traditional reports that BlackbookHR used to create for all its clients in the Talent Analytics space. The traditional reports consisted of numbers and tables that were tedious to read and provided little insight beyond the results of surveys taken by the client’s employees. I introduced innovative visualizations and charts in the reports and minimized the use of numbers in depicting the data. The visualizations helped executives view their organization in one snapshot without having to perform any mental calculations, as there were no numbers involved. This received very positive feedback from clients because such charts helped them find patterns even in areas where they weren’t expecting them. For example, one of the clients was able to identify a possible negative correlation between team size and level of employee engagement. My work was primarily based on Excel and Tableau. I later created Excel and Tableau templates that could be used for all future reporting purposes. I ran scalability tests so that the reporting templates could be used for larger clients and stay robust when varied data is introduced.
Joshua Roche, Market Analysis Framework for Mobile Technology Startups, August 2016, (Amit Raturi, Michael Magazine)
The current technological revolution has created a veritable modern-day “gold rush” due to an ever-growing market and much lower barriers to entry than in traditional industries. Many startups do not pursue an analytical study of the market they seek to enter before development begins. This can lead to a tremendous undertaking that is in effect useless because no market analysis was carried out before work began. This paper seeks to establish an initial analytical framework for testing market potential assumptions before work begins, so that entities with limited resources, including a lack of analytical prowess and information asymmetry, can make more informed decisions.
Shashank Pawar, Hybrid Movie Recommender System Using Probabilistic Inference over a Bayesian Network, August 2016, (Peng Wang, Edward Winkofsky)
Recommender systems are widely used to help users accessing the Internet by suggesting products or services they would be interested in, based on their historical behavior as well as the behavior of other users similar to them. Two types of approaches are usually adopted when developing a recommender system: content-based and collaborative filtering. This project studies the application of a hybrid approach, combining the content-based and collaborative filtering techniques, in developing a recommender system for movies. The data set used is the MovieLens 100K data set, consisting of 100,000 ratings by 943 users of 1682 movies, where a movie is described using one or more of 19 features or genres. The objective is to predict how a given user would rate a movie that the user has not yet rated. A Bayesian network is used to represent the interactions and dependencies among the movies, users, and movie features, which in turn are represented as nodes in the graph. To find the users similar to a given user for the collaborative filtering part, two measures of similarity are used. First, the ratings by the two users on common items are considered, and the Pearson correlation coefficient between the two sets of ratings is used as the measure of similarity. Second, considering the same sets of ratings by the two users, the count of instances where both users rated a movie low or both rated it high is used as a measure of similarity.
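The two user-similarity measures described above can be sketched as follows. This is an illustrative sketch rather than the author's code: the dictionary representation of ratings, the function names, and the thresholds that define a "low" (2 or below) and "high" (4 or above) rating on a 1 to 5 scale are all assumptions.

```python
from math import sqrt


def pearson_similarity(ratings_a, ratings_b):
    """Pearson correlation over the movies both users have rated.

    ratings_a, ratings_b: dicts mapping movie id -> rating.
    Returns 0.0 when the correlation is undefined (assumed convention).
    """
    common = sorted(set(ratings_a) & set(ratings_b))
    n = len(common)
    if n < 2:
        return 0.0
    xs = [ratings_a[m] for m in common]
    ys = [ratings_b[m] for m in common]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def agreement_count(ratings_a, ratings_b, low=2, high=4):
    """Count co-rated movies that both users rated low or both rated high."""
    count = 0
    for movie in set(ratings_a) & set(ratings_b):
        a, b = ratings_a[movie], ratings_b[movie]
        if (a <= low and b <= low) or (a >= high and b >= high):
            count += 1
    return count
```

The correlation captures whether two users' ratings move together, while the agreement count rewards overlap in clear likes and dislikes, so the two measures can rank neighbors differently when few movies are shared.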
Raunak Bose, Machine Learning - Comparison Matrix, August 2016, (Uday Rao, Michael Magazine)
With several options available, selecting machine learning tools for machine learning algorithms has become cumbersome. Each algorithm brings its own pros and cons to the machine learning community, and many have similar uses. The collection of huge volumes of data is already a reality, and current machine learning tools need real-time processing abilities to meet the requirements of their users. Through this paper, I wish to give researchers the ability to use machine learning with Python. To evaluate tools, one should have a thorough understanding of what to look for. This paper uses the Python platform to evaluate machine learning algorithms on a confusion and hardware matrix. We will look at libraries such as Python SCIKIT and study their use in processing data meant for supervised learning algorithms.
Ryan Stadtmiller, Predicting Season Football Ticket Renewals for the University of Cincinnati Using Logistic Regression and Classification Trees, July 2016, (Michael Magazine, Brandon Sosna)
Season ticket holders (STHs) are important for both collegiate and professional sports teams. Season tickets allow fans to take ownership in the team and also provide a significant share of overall revenue for the team’s ownership. For these reasons, maintaining a high STH renewal rate is important to the team’s performance on and off the field. I will focus on analyzing STH renewals for the University of Cincinnati’s football team. I will use statistics and data mining techniques to predict whether an STH is likely to renew their seats based on predictor variables such as quantity of tickets, section, percentage of tickets used throughout the year, and percentage of games attended, among many others. If a customer is not likely to renew their tickets, the athletic department can take preemptive measures to retain the customer.
Nicholas Imholte, Optimizing a baseball lineup: Getting the most bang for your buck, July 2016, (Michael Magazine, Yichen Qin)
Given a fixed payroll, and focusing purely on the offensive side of the ball, how should a baseball team assign its funds to give itself the highest average number of runs possible? In this essay, I will attempt to answer this question using regression, clustering, optimization, and simulation. First, I will use regression to model baseball scores, with the goal being to determine how each event in a baseball game impacts how many runs a team scores. Second, I will use clustering to determine what kinds of hitters there are, and how much each type of hitter costs. Third, I will use optimization to determine the optimal arrangement of hitter clusters for a variety of payrolls. Finally, I will complement this analysis with a simulation, and see how the results from the two approaches compare.
Nidhi Shah, Revenue Optimization through Merchant-Centric Pricing, July 2016, (Jay Shan, Madan Dharmana)
A payment processor that handles credit and debit card transactions wanted a strategy to maximize the revenue it makes from merchant transactions by periodically re-pricing the processing rates of its merchants. The biggest challenge with increasing a merchant’s rate, as with any customer of a business, is the very fine line between driving the customer away because of price sensitivity and finding an optimum price point that yields the most revenue while retaining them as a customer.
To address this challenge, we implemented a dynamic, merchant-centric pricing strategy where each merchant is treated individually - based on their profile - while determining the pricing action to be taken. In order to achieve this, we designed an automated solution in SAS that came up with a unique pricing recommendation for each merchant based on certain decision rules. The strategy to maximize revenue was implemented by increasing processing rates up to the merchant’s segment (industry and volume tier) benchmark along with certain other constraints. This automated solution allowed re-pricing to be done more frequently (monthly) which resulted in an annual incremental revenue of ~$500,000 for the payment processor.
Kristofer R. Still, Forecasting Commercial Loan Charge-Offs Using Shumway’s Hazard Model for Predicting Bankruptcy, July 2016, (Yan Yu, Jeffrey Shaffer)
In the course of lending money, a certain percentage of a bank’s outstanding loans will be deemed uncollectible and charged off. Because charge-offs can lead to significant losses, commercial banks try to minimize these losses by closely monitoring borrowers for signs of default or worse. Commercial banks maintain detailed financial records for their customers, which include numerous accounting ratios. This analysis seeks to leverage this accounting data to predict corporate charge-offs using a sample of firms from January 1, 2000 through the present. A simple hazard model is used and compared to older discriminant analysis methods based on out-of-sample classification accuracy.
Sahithi Reddy Pottim, Building a Probability of Default Model for Personal Loans, July 2016, (Dungang Liu, Yichen Qin)
The consumer lending industry is growing rapidly with a wide spread of loan types, and lending personal loans over the internet is gaining importance. The main goal of the project is to determine which customers should be offered a loan in order to maximize the profit of a small finance company that issues loans to customers over the internet. The data set has information on past loan performance and contains about 26,194 loans with 70 variables. The variables can be categorized as application data, credit data, loan information, and loan performance. The crux of the project is the selection of variables using the weight of evidence and information value concepts, which measure a variable's predictive power with respect to the response. Weight of evidence was observed to be high for variables where the percentages of good and bad loans change significantly from bin to bin. Variables with information value between 0.02 and 0.26, which can be classified as weak, average, and strong predictors, were considered for building a logistic regression model, which resulted in an AUC of 0.67. However, information value does not take into account correlation or multicollinearity among the variables. A further check on correlation and multicollinearity using the variance inflation factor (VIF) reduced the number of variables. A stepwise logistic regression model built on the variables selected by information value reduced the number of variables further and resulted in an AUC of 0.69 and a lower misclassification rate for good and bad risk loans. The results indicated that information value is one of the best variable selection procedures and that the stepwise logistic regression model was best suited to predicting the probability of default on this dataset.
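As a rough illustration of the weight of evidence (WoE) and information value (IV) measures described above, the following minimal sketch uses the standard textbook definitions. It is not the project's code: the function name `woe_iv` and the bin counts are hypothetical, and it assumes every bin contains at least one good and one bad loan.

```python
import math

def woe_iv(bins):
    """Compute per-bin weight of evidence and the total information value.

    bins: list of (good_count, bad_count) pairs, one pair per bin of a
    single variable. All counts are assumed to be positive.
    """
    total_good = sum(good for good, _ in bins)
    total_bad = sum(bad for _, bad in bins)
    woe_values = []
    info_value = 0.0
    for good, bad in bins:
        pct_good = good / total_good   # share of all good loans in this bin
        pct_bad = bad / total_bad      # share of all bad loans in this bin
        woe = math.log(pct_good / pct_bad)
        woe_values.append(woe)
        info_value += (pct_good - pct_bad) * woe
    return woe_values, info_value

# Hypothetical counts of (good, bad) loans in three bins of one variable.
example_bins = [(400, 100), (300, 100), (100, 100)]
woe_values, info_value = woe_iv(example_bins)
print(woe_values, info_value)
```

A bin whose share of good loans equals its share of bad loans has a WoE of zero and contributes nothing to the IV, which is why variables whose good/bad mix shifts across bins score higher.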
Joseph Chris Adrian Regis, Human Activity Recognition using Machine Learning, July 2016, (Yichen Qin, Dungang Liu)
The Weight Lifting dataset is investigated in terms of "how (well)" an activity is performed. This can have real-life applications in the sports and healthcare space. In this capstone, machine learning algorithms are applied with the intention of checking the feasibility of their application in terms of accuracy. The data was collected from wearable accelerometers and consists of 39,242 observations with 159 variables. Features were calculated on the Euler angles (roll, pitch and yaw), as well as the raw accelerometer, gyroscope and magnetometer readings from the wearable devices. We applied algorithms in order of increasing complexity to probe how accuracy varies with the algorithm used: Decision Trees, Random Forests, Stochastic Gradient Boosting and Adaptive Boosting. There is not much difference between the latter three (less than 0.25 percentage points apart in accuracy), but they were much better than decision trees, as expected. Because we have to choose among the three, we choose adaptive boosting as the final algorithm. Adaptive boosting achieves an accuracy of 99.95% on the scoring dataset, and this is the expected accuracy in a general application using the same setup.
Jigisha Mohanty, Analyzing the Relationship between Customers for the Commercial Business of the Bank to Identify the Nature of Dependency and to Predict the Direction of Risk in Cases of Possible Adverse Effects, July 2016, (Kristofer Still, Michael Magazine)
The Commercial banking business deals with many customers that buy various products from the bank. There are scenarios where a company and its parent company are both customers of the bank. Further, a bigger company can guarantee the loan requested by another company. Each loan or credit service established carries a certain amount of risk for the bank. Each relationship is rated based on such factors of risk. The direct risk to the bank is established by the direct exposure amount assigned to the company. When a different company owns or guarantees for a company, the latter’s direct exposure also shows up as indirect exposure for the former. This implies that if the smaller company defaults in paying back its loan, the company owning or guaranteeing for it is responsible for the entire loan taken by the smaller company.
The objective of this study is to create a network map to identify such connections. The network map will provide a visual description of the relationship between two customers and show the dependencies between customers. The second objective of the study is to identify the direction of risk in terms of direct and indirect exposure for primary companies, secondary companies, and so on. This will help the bank establish a line of action and quantify the exposure amount attributed to each customer. Knowing the direction of risk will open up the analysis of the impact of an adverse event.
Minaz Josan, Sentiment Analysis for the Verbatim Response Provided by Clients for Satisfaction Survey for Fifth-Third Bank, July 2016, (Kristofer Still, Yan Yu)
The financial services industry is still struggling with high churn rates because customers have numerous options for where they can bank. This creates a need to understand hidden customer sentiments. The industry has realized the need to strengthen the relationship with its customers. One measure taken is to monitor the performance of the representatives and the satisfaction of the clientele with both the institution and the representative. An overall satisfaction score is assigned to every representative based on the survey completed by clients on the performance of the bank and the representative. This survey also includes verbatim responses. In this project, an attempt will be made to identify the sentiment behind these verbatim responses and its correlation with the overall satisfaction score. The responses will be classified on a three-category scale of positive, negative or neutral using the supervised learning models of SVM (support vector machines) and logistic regression.
Adam Sullivan, Predicting the Rookie Season of 2016 NFL Wide Receivers, July 2016, (Yichen Qin, Mike Magazine)
The NFL has never been more popular than it is today. Part of the reason is the expansion and rapid growth of fantasy football. According to American Express, nearly 75 million people will play fantasy football and spend nearly $5 billion to play over the course of the 2015 season. The leagues people play in range from daily fantasy football, where different players can be selected each week, to dynasty fantasy football, where players can be kept for their whole career. This analysis will be focused through the lens of dynasty fantasy football, which is seeing its own explosion of participants. In dynasty fantasy football the wide receiver is king, with 16 of the top 20 ranked players being wide receivers. The purpose of this analysis is to give insight into which 2016 rookie wide receivers are in the best position to succeed in their rookie season and would justify being selected early in dynasty football drafts.
Eulji Lim, Cincinnati Crime Classification, July 2016, (Yan Yu, Dungang Liu)
Every citizen expects prompt service from police, and the police department wants to earn citizens' satisfaction through resource management and other tools. This study aims to build “Cincinnati crime category prediction models” and to gain insight into the crime data through appropriate data visualization. The Cincinnati Police Crime Incident dataset is provided by the City of Cincinnati Open Data Portal. It contains the time and location of crimes in the six districts of Cincinnati from 1991 to present and is updated daily. Specifically, there are over one hundred eighty thousand incidents from Jan 2011 to May 2016, which is the sub-dataset chosen for the analysis. The crime classification idea and the model evaluation method are inspired by one of the Kaggle competitions: “San Francisco Crime Classification”. In the data exploration, it is found that month and season affect the number of crimes rather than the types of crime. Logistic regression models are built using R with different time and geographical attributes. The hour, year and neighborhood factors are found to be more effective than other factors, such as latitude and longitude, for building the model with the lowest log-loss (2.133). In addition, random forest and tree models are built in SAS Enterprise Miner, and the random forest model with hour and neighborhood factors shows the best performance, with the lowest misclassification rate (0.67).
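For readers unfamiliar with the log-loss metric cited above, the following minimal sketch shows the usual multi-class definition: the average negative log of the probability assigned to the true category. It is an illustration only, not the study's R or SAS code; the function name, the crime categories, and the clipping constant `eps` are assumptions made for the example.

```python
import math

def multiclass_log_loss(true_labels, predicted_probs, eps=1e-15):
    """Average negative log-probability assigned to the true class.

    true_labels: list of class labels, one per observation.
    predicted_probs: list of dicts mapping each class label to its
    predicted probability for that observation.
    eps: clipping bound (an assumed convention) that avoids log(0).
    """
    total = 0.0
    for label, probs in zip(true_labels, predicted_probs):
        p = min(max(probs[label], eps), 1 - eps)
        total -= math.log(p)
    return total / len(true_labels)

# Hypothetical example: a model that guesses uniformly over four categories.
labels = ["theft", "assault"]
uniform = {"theft": 0.25, "assault": 0.25, "burglary": 0.25, "other": 0.25}
uniform_loss = multiclass_log_loss(labels, [uniform, uniform])
print(uniform_loss)
```

A uniform guess over K categories scores ln(K), so lower values indicate that the model concentrates probability on the correct category more often than chance.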
Joshua Horn, Analysis and Identification of Training Impulses on Long-Distance Running Performance, July 2016, (David Rogers, Brian Alessandro)
Long-distance running is one of the most popular participatory sports in the United States; in 2015 there were 17.1 million road race finishers and over 500,000 marathon finishers, each collecting a trove of untapped data. The subject of this analysis has been a semi-competitive runner since 2000 and began collecting personal running data in 2004, with an increase in detail in 2007 while competing collegiately and again with the inclusion of GPS data in 2014. Using these data and background knowledge of training theory and exercise physiology, a variety of new variables were defined and explored for their ability to explain changes in athlete fitness, defined by VDOT, a pseudo form of V̇O2max, the maximal oxygen consumption rate. The main objective was to identify the primary drivers of VDOT to inform future training decisions. Based on a combination of heuristic, ensemble, and complete search methods across linear, additive, and tree regressions, the 48-week measure of training impulse, measured in intensity points, was identified as the primary driver of changes in VDOT. From these results, future training for the athlete should focus on maintaining long-term consistency, with the 48-week training impulse between 3,500 and 4,500 points, a zone that produces VDOT outcomes in the 66th to 86th percentiles without inducing the substantial physiological (muscular degradation due to insufficient recovery) and psychological (mental strain accompanying the 10 to 16 hours required for weekly training) stress associated with higher training loads.
Linlu Sun, Analysis and Forecast of Istanbul Stock Data, July 2016, (Yichen Qin, Peng Wang)
The Istanbul Stock Exchange data set is collected from imkb.gov.tr and finance.yahoo.com. Data is organized with regard to working days of the Istanbul Stock Exchange. The objective of this exercise is to forecast the response variable ISE. First, we will use the mean to forecast ISE. If the mean forecast fails, we will build the best model using linear regression on an 80% sample of the actual data. The initial approach involves performing exploratory data analysis to understand the variables and designing the best model with the most appropriate variables using linear regression. Based on the best model, we will forecast each predictor variable, then use the best model formula to forecast the value of the ISE for the next 10 days.
Subhashish Sarkar, Sentiment Analysis of Windows 10 – Through Tweets, July 2016, (Dungang Liu, Peng Wang)
With the advent of mobile operating systems (OS), Microsoft revamped its value offering and in July 2015 launched the Windows 10 OS, which works across devices (laptops, desktops, tablets and mobile phones). To gain market share and attract existing users to install the new OS, Microsoft offered a free upgrade that is expected to end on July 29th, 2016. However, Microsoft has not been able to generate the targeted traction for its new OS amongst its user base. The purpose of this project is to explore the sentiments of the user base and thereby explore the reasons why Windows 10 is not getting the traction targeted by Microsoft. Sentiment analysis helps brands determine the wider public perception of a product on social media. Results from the analysis can be used as direct feedback that can result in altering product strategy or pruning and adding features to the product. In this case, lexicon-based sentiment analysis of tweets on Windows 10 revealed that only 24% of the users had a positive opinion. The analysis using ordinal regression further highlighted some specific issues that contributed to the negative opinions. For example, the negative emotions were due to bugs, crashes, installation errors and the aggressive promotion adopted by Microsoft. The positive opinions about Windows 10 were centered on the host of features available in the OS. The report also goes on to identify frequently used contextual words that can be added to the lexicon to improve the parsing of emotions.
Zhiyao Zhang, Methodology on Term Frequency to Define Relationship between Public Media Articles and British Premier League Game Results, July 2016, (Yichen Qin, Michael Magazine)
This project is intended to determine whether a relationship exists between the public media and the results of Premier League games. The Premier League is a soccer league filled with rumors, sources, news, and critics. Players in the league suffer from the pressure and anxiety caused by the critics, which may potentially affect their game performance; however, we do not know whether the relationship actually exists, and even if it does, we do not know how the public media affect the games. In this research, I will investigate these two questions with Term Frequency.
Emily Meyer, Demonstration of Interactive Data Visualization Capability for Enhancement of Air Force Science and Technology Management, July 2016, (Yan Yu, Jeff Haines)
Data must often pass through certain people and channels before it becomes information and reaches someone who makes a decision. In order to make the data-to-decision-maker pipeline more expedient, an effort is being undertaken to set up Tableau and Tableau Server within an organization. This effort is a long-term project whose initial stage focuses on setting up Tableau Server and on demonstrating data blending and interactive dashboard visualizations to personnel in order to create early adopters within the organization. The following report is an intern’s contribution toward demonstrating the capabilities of Tableau by creating PowerPoint presentations, Tableau Story Points, and Tableau Dashboards, identifying principles for structuring data, cleaning up datasets, and refining already created dashboards.
Zhaoyan Li, Identifying Outliers for the TAT Analysis, July 2016, (Michael Magazine, Ron Moore)
The goal of our company is to provide the best healthcare services (cleaning, equipment delivery, etc.) to our client, Cincinnati Children’s Hospital Medical Center (CCHMC), and of course to patients who visit the hospital. Data analysis plays a role in improving the quality and efficiency of our services. My data analysis work can be put into four categories: Turnaround Time Analysis, Supervisor Inspection Analysis, Patient Survey Analysis, and Full-time Employee Analysis. Since October, 2016, the profits our company receives from CCHMC are all based on metrics. For example, the hospital requires that 90% of all cleaning requests be completed within 60 minutes. If we hit this goal, we receive 100% of the money. If we hit 90% within 65 minutes, we get 75% of the profit, and so on. The goal from the perspective of a data analyst is to generate graphs that show how we performed in the past as we