MS Business Analytics Capstone Projects
Rishi Ambani, Appropriate Rent Prices in Cincinnati, Spring 2022 (Michael Morelli, Dylan Accorti)
The coronavirus pandemic changed many aspects of society, including where people chose to live. Shifting populations brought shifting rent prices, putting property managers in a difficult position. In a volatile rental market, a property priced too high can sit on the market earning nothing, while one priced too low leaves revenue on the table. To address this problem, a calculator was created that lets property managers estimate what a property in the 45202 zip code may be worth based on its characteristics. Because the market moves quickly, calculating an appropriate rent price in Cincinnati (45202 zip code) at any given time requires pulling current market data every month and refitting an analytical model. The model's coefficients can then be used to project prices for properties not included in the data (properties without a designated price yet).
Nina Brillhart, Descriptive & Predictive Analytics: Crossroads Church Attendance Analysis, Spring 2022 (Brent Suer, Leonardo Lozano)
The COVID-19 pandemic is unprecedented, causing major financial, social, employment, academic and religious disruptions. Churches and religious communities have been greatly impacted, and likely permanently altered by the pandemic. The new way of life demands adaptation, including new ways for churches to engage with communities. The closure of churches forced Christian communities to pivot to online church, creating a “new normal” for church. This project aims to (1) explore the impact of COVID-19 on Crossroads Church — one of the largest megachurches in Cincinnati — and (2) identify factors that have the greatest influence on post-COVID church attendance.
Crossroads Church physical attendance dropped 25-35% between 2019 and 2021. Each of Crossroads’ 12 campuses is now fighting to grow physical attendance back to what it was prior to the outbreak in 2020. The mission behind increased attendance is to prevent isolation, encourage community building, and make church feel more like a family than a podcast. Research shows that people who regularly attend church report stronger social support networks and less depression.
This project reviews three years of Crossroads Church physical attendance (2019-2021) to visualize the year-over-year change in attendance and identify which factors best predict physical church attendance. The data was used to build a Power BI dashboard and to fit a multiple linear regression quantifying how each of eight church engagement variables contributes to physical attendance.
The findings are two-fold and show that: 1) The larger churches were the least effective in retaining attendees likely due to a lack of connection even in a physical building and 2) People are still engaging with Crossroads’ social platforms even though attendance has dropped, meaning there exists a “new normal” for church.
Max Chenoweth, Ames Housing Market Analysis and Modeling, Spring 2022, (Brandon Greenwell, Edward Winkofsky)
This project explores data on the housing market in Ames, Iowa from 2006-2010, applying various machine learning and regression algorithms to predict housing prices from typical house listing information. While the focus is on the interpretation and implications of these models, their accuracy is also evaluated and compared to find the approach that best explains the dataset. Analyzing variable importance and partial dependence plots gives a better understanding of what contributes to the value of a house and the nature of each variable's relationship to the sale price.
Colton Dolezal, Pizza Place Recommendation, Spring 2022, (Yan Yu, Michael Morelli)
Pizza is consistently one of the most popular foods in America, and with so many places to choose from, finding the right one can be difficult and time-consuming. The goal of this project is to combine three available pizza datasets into a system that helps consumers make better-informed decisions about which pizza restaurants to visit, letting users supply their preferences and receive results they would find favorable. The analysis illustrates that New York is a pizza hotspot in America, so the recommendation system is tailored to this area. The system gives the user a chance to find a pizza restaurant by rating, location, or price level, and from there consumers can determine which restaurant to visit.
Alex Goellner, Using Machine Learning to Predict Winners of Mixed Martial Arts Fights, Spring 2022, (Yichen Qin, Michael Fry)
In recent years the popularity of Mixed Martial Arts (MMA) has grown to the point where ESPN struck an agreement with the Ultimate Fighting Championship (UFC) for the rights to broadcast the promotion's fights for five years at $150 million per year (ESPN, 2018). Predicting the winner of a fight is difficult due to the many variables contributing to each fighter’s fitness and capabilities at the time of the fight. This project explores whether machine learning algorithms can provide insights into and predictions of a fight’s winner. The outcome of this thesis is that more and better data are necessary to improve algorithm accuracy enough to predict the winner of a Mixed Martial Arts fight.
Andrew Greiner, Deserved? Or Snubbed? Predicting NBA All-Stars through Logistic Regression, Spring 2022 (Michael Morelli, Spencer Niehaus)
Every year when All-Star rosters are announced, social media erupts with debates about who does or does not deserve to be there. These rosters are selected by voting from fans, media, and players, as well as head coaches around the league. With the subjectivity inherent in voting, especially in sports, ideas for a better selection process are floated year after year. To help with this annual debate, this project focuses on different logistic regression models that attempt to accurately predict the likelihood that different players make the NBA All-Star Game. The results allow us to investigate who has the highest chances of making the rosters, as well as who may or may not deserve a selection. Finally, the model output gives a better idea of the statistics and characteristics that matter most for NBA All-Stars.
Anthony Hale, Dynamic Aggregation and Selection of Data with Power BI, Spring 2022 (Denise L. White, David Curtin)
For my capstone project, I built upon a project that I recently proposed to the management team at my current place of work. I work as a Business Intelligence Specialist for a service management organization that runs a group of ophthalmology and optometry practices in the United States. After taking some time to learn the company’s current analytics structure, I was excited to learn that the data management team had recently begun developing a data warehouse using Snowflake, a web-based DBMS known for its speed and scalability. One of the goals of the data warehouse was to create an environment for modeling data in a long-tabular format, ideal for working with in Power BI. After learning about the data setup, I realized that the reports currently developed in Power BI would benefit from some added functionality. During my capstone project, I focused on making the report more robust using a Power BI tool known as Tabular Editor. The functionality I added includes the ability to toggle between metrics, time-intelligence comparisons, and a way to see the averages of each metric by store, district manager, hub, regional manager, and the company as a whole.
Matthew Karnes, Supervised Classification of Exoplanets into Confirmed or False Positive, Spring 2022 (Dungang Liu, Liwei Chen)
We propose a method of determining whether an object of interest identified by telescope observation can be confirmed as an exoplanet or is a false positive. An open source dataset from the NASA Kepler Mission, which over its lifetime discovered 2,706 exoplanets, is used. Three models were developed in R using this dataset, and each was evaluated on the same criteria of accuracy and error. Of the models, logistic regression and random forest achieved promising results, while the k-nearest neighbors model did not. The logistic regression and random forest models were then used to predict the status of as-yet unclassified objects of interest. These models are useful for supporting scientific inquiries into exoplanet discovery and can help reduce the cost and man-hours necessary to validate exoplanet confirmation.
Daniel Keating, Dissecting the PGA Tour’s Player Impact Program, Spring 2022 (Paul Bessire, Michael Morelli)
My capstone centers on the Player Impact Program (PIP) set up by the PGA Tour. In keeping with the times and due to outside pressure from competing leagues, the PGA Tour announced last year that it would roll out the PIP to reward players for their contributions to growing the game of golf. The top 10 players on the PIP scoring list split a $44 million pot, with the top golfer getting $8 million of that. The PIP is scored via five factors (per Golf Digest): Google searches, Meltwater mentions, MVP Index, Nielsen score, and Q-Score.
The PGA Tour, however, gives no insight into or breakdown of the actual data, how it is weighted, or how it is collected, which has been a controversial point in the golf community. My capstone reverse engineered and recreated the PIP to gain insight into how the final list is produced. The bulk of the work was researching the best way to collect the data and how to structure it for a regression-based approach. The R-squared for the model was 0.87, which suggests that the PGA Tour is letting the data speak for itself and letting the players who bring the most attention to the game benefit from their work.
Robert Koenig, Leveraging Data: Combining data sources to improve organizational performance, Spring 2022 (Leonardo Lozano, Dungang Liu)
In this project, a local business was consulted to uncover an interesting and valuable research question. The goal of the project was to establish relevant costs associated with each delivery made by the company’s vehicles, with the ultimate goal of providing a profit estimate for each delivery. The vehicle locations and timing data were provided by GPS tracking sensors located in each vehicle. Delivery activities then needed to be associated with specific orders and this was done with the help of the Mapbox API and various techniques to link the location data to the accounting system data. Using the cost figures, a profit was calculated and validated through additional analysis of existing data. The output of this analysis provided the business with order-by-order detailed vehicle delivery information as well as the associated delivery cost and profit estimates.
Charlie MacKenzie, National Hockey League Exploratory Analysis – Modeling Game Statistics to Best Predict Wins, Spring 2022 (Michael Morelli, Jeffrey Bogenschutz)
Hockey is a very statistically driven sport. It has been suggested that the best models can only predict the winner 62% of the time due to variances in talent and “puck luck,” which refers to an idea that hockey can boil down to how lucky the puck bounces for a team.
After finding a data set of over 30,000 observations across 17 variables capturing statistics on the past six NHL seasons, I wanted to answer this fundamental question: what game statistics best predict a win in the NHL? Additionally, I wanted to do a general exploratory analysis with the purpose of gaining a better understanding of the league’s landscape and how these variables are related.
To compare different logistic regressions, I fit several models: one including all of the variables, and others dropping variables that did not appear significant. Outside of modeling, this project also includes insightful correlation information, summary statistics, and visualizations.
Samuel Martin, Strat-O-Matic Baseball — The Godfather of Statistics in Sports Video Games, Spring 2022 (Alex Wolfe, Denise White, Calvin Catania)
Strat-O-Matic Baseball is widely known as the godfather of statistics in sports board and video games. Although it was created before the video games of today, it set the standard for what statistics and probability could do in providing a realistic simulation of sports games. After drafting eight teams and playing two full 162-game baseball seasons with each of these teams, I believe I have compiled enough data to evaluate the statistical accuracy of the board game through analysis of my own compiled data and the data provided by the game. I will be comparing the stats of two individual players: Babe Ruth, whose stats from my leagues were fairly accurate to the game's stats, and Barry Larkin, whose stats from my leagues seemed inaccurate compared to the game's stats. After determining the accuracy of the game through the analysis of these players, I will discuss reasons for any potential inaccuracies between my data and the game's provided data.
Delores Mincarelli, Can we predict heart disease? Spring 2022 (Dungang Liu, Xiaorui Zhu)
According to the Centers for Disease Control and Prevention, heart disease is the leading cause of death in the United States: one person dies from it every 36 seconds. Accurately diagnosing heart disease based on clinical tests is critical to making appropriate treatment interventions. This type of analysis could be used in an emergency room (ER) setting to triage a patient to the appropriate next step in their care. For instance, upon entry to the ER, clinical tests would be performed for symptomatic patients, and the test results could trigger a classification model that assesses the patient’s risk of heart disease.
I'll be using a heart disease dataset from the University of California, Irvine. I'll build progressive logistic models that factor in the cost of each diagnostic test so that, at each stage of modeling, we can quantify the model improvement. This will reveal the cost and benefit of these tests and potentially help streamline procedures to save money and time while offering strong predictive value.
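The cost-aware staging idea behind progressive modeling can be sketched as follows. The stage names, accuracies, and test costs below are hypothetical illustrations, not results from the project:

```python
# Each stage adds one more diagnostic test to the logistic model.
# (stage name, cumulative test cost in $, model accuracy at that stage)
stages = [
    ("demographics only",        0,  0.70),
    ("+ blood pressure",        20,  0.74),
    ("+ cholesterol panel",     60,  0.80),
    ("+ exercise stress test", 260,  0.82),
]

def incremental_value(stages):
    """For each added test, report the accuracy gained per extra dollar,
    so cheap tests that buy little and costly tests that buy a lot are
    directly comparable."""
    rows = []
    for (_, c0, a0), (name, c1, a1) in zip(stages, stages[1:]):
        rows.append((name, (a1 - a0) / (c1 - c0)))
    return rows

for name, gain_per_dollar in incremental_value(stages):
    print(f"{name}: {gain_per_dollar:.4f} accuracy points per $")
```

In this toy table the cholesterol panel buys less accuracy per dollar than blood pressure, and the stress test far less still, which is the kind of trade-off each modeling stage would quantify.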
Vanessa Murillo, Emergency Department Simulation, Spring 2022 (Denise L. White, William Bresler)
The following capstone project reviews the completion of a discrete event simulation case study for the University of Cincinnati Health System Emergency Department. The intent of this study was to translate the current processes, protocols, patterns, distributions, and layouts within the ED and design a system model based on data analysis and iterative discussions with the business side. The process included aligning with the business on what the major focus processes were, understanding how these processes were reflected as data attributes within the data repository, exploring the relationships and distributions uncovered from these attributes, designing a discrete event simulation model, and validating the model outputs against business-side feedback and the data. Although the milestone process was iterated only once in its entirety, the overall conclusion was that further data is needed before this model can be used for business applications.
Lucas Schirr, Best States to Live In: What the Data Tells Us, Spring 2022 (Asawari Deshmukh, Michael Morelli)
This project was undertaken to create a data-based ranking of all 50 U.S. states across different categories. The states were ranked in the categories of “Health and Safety,” “Financial Stability,” “Environment,” “Family,” and “Social.” Datasets for 15 different variables that relate to these categories were collected from a variety of online sources. R Studio was then used to conduct an analysis of the variables and the impact they had on each state’s ranking. Visualizations were then created for each state’s ranking in each category.
All variables in this study were weighted equally against one another, with scaling and reverse scoring methods used to put each variable on a common scale. The results revealed that some categories were heavily geographically dependent, while others were not. The Northern Midwest, Northwest, and Northeast were the regions that fared the best across all categories. With each variable weighted equally, a significant outlier in one variable was shown to have a large impact on a state’s score in the related category. The results of this study can be used to inform decisions on which states are best to live in based on an individual’s preferences and priorities.
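The scaling and reverse-scoring approach described above can be sketched as follows; the variable names and values are hypothetical, not the project's actual data:

```python
def min_max_scale(values):
    """Scale a list of raw values onto the 0-1 range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def reverse_score(scaled):
    """Flip a 0-1 scaled variable so that lower raw values score higher
    (e.g. crime rate, where less is better)."""
    return [1 - s for s in scaled]

# Hypothetical raw data for three states: median income (higher is better)
# and crime rate per 1,000 residents (lower is better).
income = [52000, 61000, 70000]
crime = [4.2, 2.1, 6.3]

income_score = min_max_scale(income)
crime_score = reverse_score(min_max_scale(crime))

# Equal weighting: each state's category score is the mean of its
# variable scores.
state_scores = [(i + c) / 2 for i, c in zip(income_score, crime_score)]
print(state_scores)
```

Because every variable is min-max scaled before averaging, a single extreme outlier stretches the whole 0-1 range for that variable, which is why one outlier can dominate a category score.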
Trisha Shekhawat, Data Science Project with P&G, Spring 2022, (Denise L. White, Rakesh Gummalla)
P&G has a presence across various sectors, including fabric and home care, feminine care, hair care, and oral care. This capstone project was with the FemCare P&E Digital Innovation Team. Data from various manufacturing locations and production sites is available but needs to be cleaned, pre-processed, and normalized before it can be used for any kind of analytical reporting. The focus of the capstone project was to understand the overall structure and vastness of the data, filter the necessary features, and then clean and transform the filtered data to make it fit for further use.
This data science project had multiple phases; the first step was to clean the data residing in MS Azure ML Studio. An elaborate graphical user interface was created in Python to let the user filter the tags, select specific features, and then check for missing data and the corresponding sampling frequency. After filtering the relevant tags into a dataframe, the next step was an extensive exploratory data analysis covering numerical and graphical data visualization, outlier detection, and a correlation study across features. The dataset was then finalized by handling missing data (deleting or imputing the missing features), dealing with outliers, and applying any necessary data transformations. The final step was to export the final dataset from MS Azure ML Studio to the local hard disk for further analysis in JMP.
James Stratton, Agglomerative Clustering Business Customer Segmentation Study, Spring 2022 (Edward Winkofsky, William Bresler)
Business customer segmentation was performed using hierarchical clustering. Feature engineering was performed to prepare the client data. Hierarchical clustering was carried out on a subset of the data using the R cluster package implementations of the AGNES and DIANA algorithms for agglomerative and divisive clustering, respectively. Cluster counts ranging from 2 to 10 were considered. AGNES clustering outperformed DIANA on connectivity, Dunn Index, and Silhouette Coefficient at the client’s cluster granularity.
AGNES clustering of the nearly 600,000 observations was not possible due to memory requirements. Multinomial, random forest, and boosting tree classifiers were trained to extend the 6,000-observation training dataset clustered by AGNES to the entire dataset. All three achieved test dataset error rates of 1 out of 1,200. Random forest was chosen as the final classifier.
Cluster stability over time was evaluated. 3.80% to 5.38% of business customers changed clusters each month. The median and 75th percentile were 0 and 2 cluster changes, respectively. Most clusters experienced linear growth which was expected. Client found these stability results acceptable.
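As a toy stand-in for the AGNES procedure used above (the project used the R cluster package; this sketch uses single linkage on hypothetical 2D points), agglomerative clustering works bottom-up:

```python
import math

def agnes(points, n_clusters):
    """Minimal bottom-up (agglomerative) clustering: start with each
    point as its own cluster and repeatedly merge the two closest
    clusters until n_clusters remain. Single linkage = distance between
    the closest pair of members."""
    clusters = [[i] for i in range(len(points))]

    def cluster_dist(c1, c2):
        return min(math.dist(points[a], points[b]) for a in c1 for b in c2)

    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_dist(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]
    return clusters

# Two well-separated toy groups of three points each.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(agnes(pts, 2))
```

The O(n^3) pairwise search here also makes the memory and runtime limits mentioned above concrete: full hierarchical clustering on 600,000 observations is infeasible, hence the classifier-based extension.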
Challa, Shravya; Leonardo Lozano; Neil D’Souza; Domino and its Role in Day-to-Day Machine Learning
The agriculture industry is ever-growing and has great potential for technological advancement. By 2050, farmers will need to feed 9.8 billion people while facing numerous challenges such as increased calorie intake, weather, reduced farmland, soil erosion, and pollution. The Climate Corporation provides a leading digital platform to help farmers overcome these challenges with machine learning and deep learning models, increase their productivity, turn insights into action, and optimize and flawlessly execute decisions on the farm. Data is an integral part of this mission and is used and stored in various forms suitable for modeling and analysis. As different models are developed to predict the best course of action for each stage of a crop lifecycle, there is a need to automate and scale these models to suit real-world cases. Big data tools like Spark and Domino come in handy for this purpose. Domino is a data science platform that enables fast, reproducible, and collaborative work on data products like models, dashboards, and data pipelines. Users can run regular jobs, launch interactive notebook sessions, view vital metrics, share work with collaborators, and communicate with their colleagues in the Domino web application. This project aligns primarily with the Big Data Integration with Spark and Databricks course from the MS BANA program, while allowing me to take a shot at transformation and modeling of crop data, applying the knowledge gained from the Statistical Methods and Statistical Models courses.
Bindal, Shivam; Denise White; Daniel Martin Argyasmai; Assessing the effectiveness of Electronic Health Records (EHR) to improve underwriting process in Life Insurance
The aim of this project is to assess the effectiveness of a specific data source (electronic health records) for expediting the underwriting process of a life insurance company. Effectiveness will be evaluated by extracting features from the data source using data extraction, manipulation, and natural language processing (text mining) techniques. The extracted features will be fed into a mortality risk prediction model, and model lift will be calculated to assess whether the extracted features were important. Based on this assessment, a recommendation will be provided to the client on whether to incorporate this data source into their current workflow. Once the data is parsed and the data pipelining is completed, the extracted features will be used to provide insights via Tableau dashboards. These dashboards will reside on the client’s system; hence, data visualization integration is also required as a future step in this project.
Gupta, Shaoni; Yichen Qin; Dave Gerson; Linear Pacing Tool
NBCUniversal (NBCU) is one of the biggest media houses, broadcasting numerous shows across different dayparts. The Corporate Decision Sciences team at NBCU handles the ratings and impressions of the many shows broadcast across the network. A Linear Pacing Tool has been developed to track show performance at different levels in a timely manner and to forecast future performance. The tool provides a comprehensive understanding of how different NBCU shows perform against the estimated targets obtained from the research teams. Models in the back end deliver forecasts of the financial ratings of NBCU program impressions for the coming quarters. When performed for new and upcoming shows, the analysis can be used to determine optimal show times across different dayparts and how they will impact viewership. The forecasts are also used to check whether target ratings will be met in future quarters.
Krishnan, Ravi; Yichen Qin; Tathagat Kumar; Multi-Text Classification of Consumer Complaints
Resolving consumer complaints in real time grows more important every day with the increasing popularity of social media, high computing power, and flexible storage. Customers today expect their issues to be resolved ASAP. In this context, automatically classifying consumer complaints into the correct products based on the complaint text becomes important for enabling early resolution. This paper classifies consumer complaints into predefined products using NLP-enabled supervised learning methods for ease of handling. The classifier assumes that each complaint concerns only one product.
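The complaint-to-product classification idea can be sketched with a minimal TF-IDF, nearest-centroid classifier. The complaints, product labels, and weighting scheme below are illustrative assumptions, not the paper's actual data or model:

```python
import math
from collections import Counter, defaultdict

def tfidf(docs):
    """Turn tokenized documents into TF-IDF weighted bag-of-words dicts,
    using a smoothed IDF so no term gets zero weight."""
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))
    return [{w: c * (math.log((1 + n) / (1 + df[w])) + 1)
             for w, c in Counter(doc).items()} for doc in docs]

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def train_centroids(docs, labels):
    """Sum the TF-IDF vectors of each product's complaints into one
    centroid per product label."""
    cents = defaultdict(Counter)
    for vec, lab in zip(tfidf(docs), labels):
        cents[lab].update(vec)
    return cents

def predict(centroids, tokens):
    # Score a new complaint's term counts against each product centroid.
    query = Counter(tokens)
    return max(centroids, key=lambda lab: cosine(query, centroids[lab]))

# Hypothetical complaints, each tagged with exactly one product.
train = [("late fee charged on my credit card".split(), "credit_card"),
         ("credit card interest rate increased".split(), "credit_card"),
         ("mortgage payment not applied to my loan".split(), "mortgage"),
         ("loan servicer lost my mortgage documents".split(), "mortgage")]
cents = train_centroids([d for d, _ in train], [l for _, l in train])
print(predict(cents, "wrong fee on my card statement".split()))
```

Returning a single best-matching product mirrors the stated assumption that each complaint is meant for only one product.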
Mishra, Durgesh; Yichen Qin; Tanu George; Design & Development of data pipelines and database for Campaign Attribution Models in Account Based Marketing
Companies like Adobe, Apple, Netflix, and other big tech giants spent a whopping average of $3.6 billion on sales and marketing in 2020 alone. This spending frenzy is hard to justify when 40% of businesses identify marketing ROI as their top challenge.
Marketing ROI is the return on investment from marketing initiatives. It gauges how well a marketing technique succeeds by calculating the money collected less the initial outlay. Using marketing ROI, the performance of marketing initiatives is assessed against overall goals, whether sales, lead generation, engagement, or any other growth-oriented marketing approach.
This project marks phase 1 of the design and development work toward determining marketing ROI. As with any analytical solution, emphasis is placed on the velocity, variety, volume, and veracity of the big data in question. Developing reliable data pipelines from source systems to fuel campaign attribution models is the most effective strategy for realizing ground-breaking insights, so the focus of this project is the design and development of that essential infrastructure, set in the Account Based Marketing (B2B) line of business. Global Marketing Sources of Record refer to the campaign metadata used in the project to link to anonymous visitors' clickstream data. Data was fetched from two data warehouses to create a data lake on Hadoop that feeds a content dashboard and calculates ROI metrics. Phase 2 of the project will target stitching spend data together with ads, targeting, web, and campaign data.
Mohapatra, Sameer Kumar; Yichen Qin; Carrie Johnson; Food Demand and Lead Prediction
Food Demand Dashboard: Demand prediction and visualization are important parts of business analysis. In this project, I analyzed the business of a meal provider that operates multiple centers and serves a variety of meals across them. I created a dashboard using Power BI to understand the current business scenario and predict food demand across the various centers. The business objective is to understand which centers perform best, which meals are ordered most, and how promotions impact sales. A visualization of the forecast is also prepared to understand demand scenarios.
Lead Prediction: The objective of this project is to identify current credit card customers who can be targeted for cross-selling a financial product. The dataset has historical details on customers’ demographics and whether cross-selling to them succeeded. Using this data, I identified features that mark a customer as a potential lead. I used feature engineering and machine learning models such as logistic regression, decision trees, and random forests to obtain the best-fit model, measuring performance with the F1 score and ROC AUC.
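The two evaluation metrics named above can be computed from scratch; the labels, predictions, and scores below are hypothetical examples, not the project's results:

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 = harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def roc_auc(y_true, scores):
    """Probability that a randomly chosen positive outranks a randomly
    chosen negative (the rank interpretation of ROC AUC)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = customer became a lead), hard predictions,
# and model scores for eight customers.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
scores = [0.9, 0.2, 0.8, 0.4, 0.3, 0.6, 0.7, 0.1]
print(f1_score(y_true, y_pred), roc_auc(y_true, scores))
```

F1 scores the hard yes/no lead predictions, while ROC AUC scores the ranking quality of the raw model scores, which is why lead-scoring projects typically report both.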
Saxena, Swasti; Denise White; Rahul Sattar; Demand Forecasting for Dropship Vendors
Client information: One of our major CPG clients is looking to work with an American global courier delivery service to ship items from dropship vendors to customers' addresses. Dropship vendors are vendors that do not keep physical inventory (a warehouse) with the CPG. Drop shipping is a business model in which retailers use suppliers (our CPG client) to ship products directly to the end customer rather than managing stocking and shipping themselves. Our CPG client is therefore using courier services to ship products from the dropship vendor's physical inventory to the address of the customer who purchased the product.
Problem Statement: For our CPG client to work with the courier service, the latter has asked for a demand forecast for every dropship vendor, per location, per day, for the next fiscal year. This forecast will help the courier service plan and prepare the capacity it will need in the future, especially during festive/peak seasons.
Approach: We will utilize the CPG's retail data to build time-series forecasting models that predict demand.
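A common starting baseline for per-vendor, per-location daily demand with a weekly cycle is a seasonal-naive forecast. The vendor names and order counts below are hypothetical, and a real model would build on this baseline:

```python
def seasonal_naive_forecast(history, season_length=7):
    """Forecast the next season_length periods by repeating the most
    recent full season -- a standard baseline for daily demand that
    follows a weekly cycle."""
    return list(history[-season_length:])

def forecast_by_vendor(daily_demand, season_length=7):
    """daily_demand: {(vendor, location): [day1, day2, ...]} ->
    one forecast list per (vendor, location) key."""
    return {key: seasonal_naive_forecast(series, season_length)
            for key, series in daily_demand.items()}

# Hypothetical two weeks of daily order counts for one vendor/location
# pair, with a clear weekend peak.
demand = {("vendorA", "OH"): [12, 15, 14, 20, 25, 40, 38,
                              13, 16, 15, 22, 27, 42, 39]}
print(forecast_by_vendor(demand))
```

More sophisticated time-series models (trend, holiday, and peak-season effects) would be judged against this baseline before being handed to the courier service for capacity planning.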
Sajimon, Haritha Thaloorayyathu; Yichen Qin; Carrie Johnson; Credit Card Lead Prediction
As a Decision Scientist at a multinational artificial intelligence company, I aim to power every human decision by applying AI and analytics to the decision-making process and hence directly provide analytical solutions to global clients. In this project, I provided analytical services to a mid-sized private bank to help them cross sell their credit cards to existing customers. The bank had identified a set of customers that are eligible for taking these credit cards and wanted help to identify customers that could show higher intent towards a recommended credit card. The main task was to analyze the given data of customers which included a set of customer details as well as details regarding the customer’s relationship with the bank. The ultimate goal was to make predictions/models to identify whether the customer would be a potential lead or not.
Vooturi, Santosh Vaibhav; Leonardo Lozano; B K Vinayak; Territory Planning for New and Existing Leads
As the Sales Analytics team at PayPal, we help the Sales leadership at PayPal make informed, data-driven decisions that enable them to effectively manage the sales teams.
One of our Sales Directors wanted to understand the distribution of the new FY2022 leads by various cuts so that these leads could be better mapped to our sales reps. They also wanted a smarter way to identify existing opportunities that are not being given enough attention, so those could be moved to another sales rep. As a solution, we shared multiple Tableau dashboards with the Director capturing the list of existing and new opportunities by various cuts, with the flexibility to drill down to a particular region or account size for better account-rep mapping.
Akbar, Aalim; Leonardo Lozano; Nitin Jain; Pre-Post Analysis of Member feedback themes for Digital Renewal of Membership
No business lasts without constant change. My role, working as a consultant for one of the largest retail companies in the world, is to drive changes in the system by analyzing member feedback data. In this project, I am conducting a pre/post analysis of member feedback following a platform upgrade. I will clean the feedback data, cluster the feedback into predefined theme buckets, and then analyze the percentage volume of each theme, sentiment scores, and average ratings to identify improvement areas. I will deep dive into areas that have not shown significant improvement in order to improve the member experience during digital renewal and thereby increase the Net Promoter Score.
Amaravadhi, Sharanya; Leonardo Lozano; Aarti Vaswani; IIS Logs Analysis
Cognizant’s TriZetto Healthcare Products are software solutions that help organizations enhance revenue growth, drive administrative efficiency, improve cost and quality of care, and improve the member and patient experience.
IIS logs (server-side logging enabled on a URL group) tracked on one of the product's URLs were analyzed to understand product usage and to suggest further improvements to the product and its URLs based on previous users' experience.
IIS logs were received in the log file format and were analyzed after converting them with an external tool called Log Parser. The team at Cognizant was also looking for a way to convert the files without relying on this external software.
These problems were solved using concepts learned throughout the BANA coursework, chiefly exploratory data analysis, data visualization, and Python programming.
Anurag, Kumar; Yichen Qin; Tyrone Smith; Refunds and Returns Analysis on Zulily website
Zulily is an e-commerce platform that sells a variety of products ranging from apparel to everyday-use items. Many orders are returned, refunded, or both, so it is vital to understand the impact this has on the company's net profits. We can leverage analytics to find the detailed reasons for these returns and to identify the channels to focus on in order to make better business decisions going forward. With this analysis, we were able to identify the various customer- and product-related factors and create a hierarchy for stakeholders to see how these factors changed over time.
The findings of this exercise also aided business development by focusing on vendors and accounts with lower return rates. The future scope of this project will focus on building a predictive model to identify the customers or orders likely to be returned or refunded.
Bahuguna, Avijit; Leonardo Lozano, Alex Wolfe; Customer Segmentation for an ecommerce portal
In this project, I performed data cleaning and exploratory data analysis, and then segmented the customers of an ecommerce company to support targeted marketing. The downloaded data contains information about customer profiles and their purchasing habits with the company.
Kala, Vedangini; Charles Sox; Awanindra Singh; CLTV (Customer Lifetime Value) for Retail Industry
In this project we built a CLTV solution for the retail industry. This metric represents an estimate of customer value based on their interactions with discounts and other promotional offers. Essential information about the customers, such as gender, income, and the offers they used, was extracted and analyzed. I also created custom metrics such as view rate and conversion rate for each customer to support business decisions, and finally applied the K-Means clustering algorithm to cluster the customers to whom the offers were sent.
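This clustering step can be sketched as follows; the data and column names (view rate, conversion rate, income) are synthetic stand-ins, not the actual project data:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic stand-ins for the custom customer metrics described above
customers = pd.DataFrame({
    "view_rate": rng.uniform(0, 1, 200),
    "conversion_rate": rng.uniform(0, 1, 200),
    "income": rng.normal(60000, 15000, 200),
})

# Scale features so income does not dominate the Euclidean distances
X = StandardScaler().fit_transform(customers)

# Cluster customers into k groups for offer targeting
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
customers["segment"] = kmeans.labels_
print(customers["segment"].value_counts())
```

Scaling before K-Means matters here: without it, the income column, measured in tens of thousands, would dominate the distance metric and the rates would be ignored.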
Belyazid, Youssef; Leonardo Lozano, Hsiang-Li Roger Chaing; Prediction of short-term deposit subscription
The goal of this project is to predict whether a customer will subscribe to a short-term bank deposit or not based on several features: age, job category and bank balance. To that end, the following steps were carried out: exploratory data analysis, feature engineering and modeling, then model evaluation using recall. Note that the data instances were collected by a Portuguese bank in a short-term deposits marketing campaign. In terms of implementation, we took advantage of the library scikit-learn to train several classification algorithms in addition to pandas for data manipulation and seaborn for data visualization.
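A minimal sketch of this pipeline follows, with synthetic data standing in for the Portuguese bank's records (the feature encodings and target rule below are illustrative assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)
n = 500
# Synthetic stand-ins for the features named above: age, job category, balance
df = pd.DataFrame({
    "age": rng.integers(18, 90, n),
    "job_category": rng.integers(0, 5, n),
    "balance": rng.normal(1000, 500, n),
})
# Toy target: older, higher-balance clients subscribe more often
df["subscribed"] = ((df["age"] > 50) & (df["balance"] > 900)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="subscribed"), df["subscribed"], random_state=0)

# One of several classifiers one could train with scikit-learn
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
rec = recall_score(y_test, model.predict(X_test))
print("recall:", rec)
```

Recall is a natural evaluation choice here, since missing a likely subscriber (a false negative) costs more than one extra call.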
Boldrick, Allison; Yichen Qin; Uday Rao; What Makes a Champion? Analyzing Postseason NCAA Basketball
Every year millions of brackets are filled out by people trying to guess the next NCAA men’s basketball champion. This project is focused on using multiple metrics to try and predict which teams will be successful in the end-of-year tournament. Success is measured by making it past the first weekend of games. This means a successful team will play in the Sweet 16, Elite 8, Final 4, and/or the championship game. The analytic methods used are Generalized Linear Models with both logit and probit links, Stepwise Variable Selection, Random Forest, and Boosting Tree. While the conclusion may not lead to picking a guaranteed winner, it will hopefully provide answers as to what the good teams do best.
Das, Pratyutpanna; Leonardo Lozano; Headly Appollis; Forecasting Stock Prices using ML Techniques
Forecasting uses historical data to make informed estimates of the direction of future trends. Forecasting future stock prices is an important and very difficult task, primarily because many internal and external factors affect stock prices. In this project, the future stock prices of Microsoft are predicted using ML techniques such as moving averages, linear regression, and gradient boosting; the models and code can be reused for any other company. To improve performance, hyperparameter tuning is performed to choose an optimal set of parameters for each learning model. The dataset is split into train, validation, and test sets, and model performance is evaluated using RMSE and MAPE.
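The workflow can be sketched roughly as below, with a synthetic random-walk price series standing in for the Microsoft data and lagged closes as features (both are illustrative assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(7)
# Synthetic price series standing in for Microsoft closing prices
price = pd.Series(100 + np.cumsum(rng.normal(0.1, 1.0, 400)))

# Lag features: the previous 5 closes predict the next close
df = pd.DataFrame({f"lag_{k}": price.shift(k) for k in range(1, 6)})
df["target"] = price
df = df.dropna()

# Chronological split: train on the past, evaluate on the most recent days
train, test = df.iloc[:300], df.iloc[300:]
model = GradientBoostingRegressor(random_state=0).fit(
    train.drop(columns="target"), train["target"])
pred = model.predict(test.drop(columns="target"))

# MAPE: mean absolute percentage error on the held-out period
mape = np.mean(np.abs((test["target"] - pred) / test["target"])) * 100
print(f"MAPE: {mape:.2f}%")
```

The chronological split matters: shuffling a time series before splitting would leak future information into training.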
Gradient Boosting was the best model: it captured historical trends and combined weak learners into a more robust model that predicts stock prices accurately, with the lowest MAPE of 1.25%. In the future, ARIMA, LSTM, and sentiment analysis could be used to further improve accuracy.
CLTR is an estimate or projection of all revenues during the customer's lifetime, not the actual historical reportable revenue. CLTV predicts metrics such as average order value, number of orders, and churn propensity to model customer behavior, which is in turn used to estimate customer value. This number can be used to engage existing customers and acquire similar customers from the market.
In this project we designed and developed a suite of models to estimate Customer Lifetime Value for one of the largest retailers in the world. We used individual value-estimation models that can be applied to short- and long-term cohorts. The impact of this project is that the client can use the estimates for channel, acquisition, and engagement strategies, as well as for developing their monthly target lists.
Khan, Saad Ali; Leonardo Lozano; Tim Schroeder; Claims Staffing Plan
A staffing plan is a strategic planning process by which a company (typically led by the HR team) assesses and identifies the personnel needs of the organization. In other words, a good staffing plan helps you understand the number and types of employees your organization needs to accomplish its goals.
A staffing plan answers the questions:
- What work needs to be done?
- How many people do we need to employ?
- What skills and experience are necessary to do this work?
- What skills gaps need to be filled (and are there any areas of redundancies)?
In this project I built a staffing plan for the Claims Department that drew on many of the techniques and technologies I learned during my Master's. I started by building views and tables in SQL, consolidating metrics from various schemas and tables into a single source of truth using techniques such as recursive CTEs, CASE WHEN statements, and window functions. I then moved to R, where I performed data cleaning and manipulation to derive the desired features, and then forecast the features most important for the staffing model.
Instead of a time series model, I used linear regression, since there was more than one dependent variable, and the linear regression model gave solid results with an adjusted R-squared of around 0.85. The next step was to apply an arithmetic formula that accounted for work flowing in and out and its average handling times. After deriving the required staff for each team within the Claims Department, I built an R Shiny app for interactivity and ease of use.
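The staffing arithmetic described above might look like the following sketch (in Python rather than R, for illustration); the team names, volumes, handling times, and utilization figures are all hypothetical:

```python
import math

teams = {
    # team: (forecast monthly claim volume, avg handling time in minutes)
    "intake":       (12000, 6.5),
    "adjudication": (8000, 14.0),
    "appeals":      (1500, 35.0),
}
# Productive minutes per FTE per month: 21 days x 7.5 h x 60 min x 85% utilization
MINUTES_PER_FTE = 21 * 7.5 * 60 * 0.85

# Required headcount = workload minutes / available minutes, rounded up
staff_needed = {
    team: math.ceil(volume * aht / MINUTES_PER_FTE)
    for team, (volume, aht) in teams.items()
}
print(staff_needed)
```

The same formula, fed with the regression's forecast volumes instead of fixed numbers, is what an interactive app can recompute on the fly.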
More, Prafull; Yichen Qin; Michael Platt; Bank Marketing for Term Deposit Enrollment
We propose a data mining (DM) approach to predict the success of telemarketing calls for selling bank long-term deposits. The data come from a Portuguese retail bank and were collected from 2008 to 2013, thus including the effects of the 2008 financial crisis. We analyzed a large set of 20 features related to bank client, product, and social-economic attributes. A semi-automatic feature selection was explored in the modeling phase, performed with the data prior to July 2012, which allowed us to select a reduced set of 10 features. We also compared six DM models: logistic regression (LR), decision trees (DTs), random forest (RF), gradient boosting (GBM), neural networks (NNs), and support vector machines (SVMs). Using two metrics, area under the receiver operating characteristic curve (AUC) and misclassification error, the six models were tested on an evaluation set of the most recent data (after July 2012). The NN model did not converge for more than one hidden layer and achieved its best fit with AUC = 0.92 and misclassification error = 0.09 using zero hidden layers, but the boosting tree model produced the best results on the test data (AUC = 0.9445, misclassification error = 0.08), giving telemarketing campaign managers a credible and valuable prediction model.
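A simplified version of this model comparison (three of the six models, a synthetic dataset standing in for the bank's 20-feature telemarketing data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the bank telemarketing data (20 features)
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(random_state=0),
    "GBM": GradientBoostingClassifier(random_state=0),
}
# Fit each model and score it on the held-out set by AUC
aucs = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    aucs[name] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(aucs)
```

In the real study the evaluation set was a chronological holdout (post-July 2012), which is stricter than the random split used in this sketch.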
Nookala, Ratan Vaibhav; Leonardo Lozano; Andrew Stripling; Direct Mail Campaign – Selection Strategy
The credit card industry is a lucrative business. Most banks that issue credit cards run acquisition campaigns to acquire customers, and direct mail is one of the acquisition strategies employed by Fifth Third Bank. It involves sending a physical mail document to a prospective customer with an offer, such as a balance transfer, cashback on spend, or zero percent APR for a certain period.
The project seeks to improve the population selection strategy of Fifth Third Bank's direct mail campaigns. The population for a campaign is selected by considering a variety of factors, including marketing costs, the mail offer, response score deciles, the present value of the prospective customer, FICO score, and the customer's response rate. Based on these factors, a metric, Return on Marketing Investment (ROMI), is calculated at the FICO group and response score decile level, and only the population that meets the ROMI cut-off is selected for the direct mail campaigns.
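The ROMI selection logic can be illustrated as below; the campaign cells, dollar figures, and the 1.5 cut-off are hypothetical, not Fifth Third's actual values:

```python
import pandas as pd

# Hypothetical campaign cells by FICO group and response-score decile
cells = pd.DataFrame({
    "fico_group":  ["720+", "720+", "660-719", "660-719"],
    "decile":      [1, 2, 1, 2],
    "mail_cost":   [5000, 5000, 4000, 4000],
    "expected_pv": [14000, 6000, 9000, 3500],  # present value of responders
})

# ROMI: expected present value returned per marketing dollar spent
cells["romi"] = cells["expected_pv"] / cells["mail_cost"]

# Only cells clearing the cut-off are mailed
ROMI_CUTOFF = 1.5
selected = cells[cells["romi"] >= ROMI_CUTOFF]
print(selected[["fico_group", "decile", "romi"]])
```

Computing ROMI at the cell level rather than for the campaign as a whole is what lets unprofitable FICO-decile combinations be dropped while the rest of the mailing proceeds.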
Patil, Pradnya; Yichen Qin; Sylvester Ashok; Document Translation & Orchestration with Azure
Global non-profit organizations interact with residents from all around the world. Most of these interactions happen in local languages and are recorded in multiple formats like Word, Excel, CSV, etc. This data can't be used directly in data processing or text analytics, as handling it requires a lot of manual effort. During this project, we worked with the non-profit organization to bring this multi-lingual, multi-format data into mainstream analytics using cloud and orchestration tools.
Problem Description:
- Research text translation tools available in the market
- Create a multi-purpose translation module
- Integrate translation module in existing Azure Data Factory pipeline
Challenges:
- Azure Cognitive Services does not allow any other method of folder identification, and automatic SaS token generation is not well documented by Microsoft. Resolution: created SaS tokens with longer expiry dates.
- There was a lag between the REST API callback reporting success and the actual availability of the translated file in the output folder. Resolution: introduced a while loop that waited until the file was detected in the output folder.
Output:
- Identified translation tool which can integrate well with the existing solution and can be reused with future solutions
- Orchestrated the ADF pipeline to translate files of any format (.pdf, .csv, Excel)
- Integrated the translation pipeline with the existing solution
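The while-loop workaround for the REST callback lag can be sketched as a simple polling helper; the file name and delay below are purely illustrative, and the demo simulates the translation service with a timer rather than calling Azure:

```python
import tempfile
import threading
import time
from pathlib import Path

def wait_for_file(path: Path, timeout_s: float = 30, poll_s: float = 0.1) -> bool:
    """Poll until the translated file lands in the output folder, or time out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if path.exists():
            return True
        time.sleep(poll_s)
    return False

# Demo: simulate the translation service writing its output after a delay
out_dir = Path(tempfile.mkdtemp())
translated = out_dir / "document_en.pdf"
threading.Timer(0.3, translated.touch).start()

print("file ready:", wait_for_file(translated))
```

A timeout is worth adding to any such loop so that a failed translation job does not block the pipeline indefinitely.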
Rajput, Dwarkesh; Leonardo Lozano; Pava Kunchala; Analytics-driven Customer Journey Enhancement
As part of an ongoing analytics engagement between Tiger Analytics and their client, the team developed analytics-driven solutions to measure improvements in customer experience. The key focus was understanding the impact of relaunched features intended to boost new customer sign-ups, achieved by comparing the sign-up rate in the first month after the original feature launch with the relaunch sign-up rate. As more customers sign up, it becomes important to better understand the growing user base, which is accomplished using demographic, behavioral, and business-metric data generated by user interactions with the system.
Shruthi Saxena, Automotive Service Analytics, December 2021 (Yichen Qin, Anand Natarjan)
Lucid's Service Operations team wanted to leverage advanced analytics to gain insights from their data and support better business decisions. Having been hired as an Analytics Intern, my work involved defining measures, tracking KPIs, conducting in-depth research on automobile service data, and delivering insights to the service network by developing data pipelines and BI reports using SQL, Tableau, and Salesforce Lightning. I also drew on the tools and techniques from my master's courses (Data Visualization, Data Wrangling, Statistical Methods, Statistical Computing, and Finance) to understand the data and deliver my tasks with the utmost quality. My contribution helped Lucid's Service Operations understand what works well for the business and what does not.
Apporv Shrivastava, Volume and Model Performance Analysis, December 2021 (Michael Fry, Anuj Khandelwal)
For the internship capstone period, I have been employed full-time as a Data Scientist with Fractal Analytics, an analytics consulting firm. I work on the Routing Program, a project with one of Fractal's biggest technology clients that has been running for about three years.
A large number of businesses use the client's product. To help these businesses fully utilize the product, the client organization has a team of representatives. Because the number of representatives is small compared to the number of businesses, it is vital to make the best use of this resource. The client therefore uses a model to find the assignment (routing) of businesses to representatives that maximizes revenue, i.e., where representative intervention leads to revenue growth. The model score, along with external logic, is used to categorize these customers into three segments: Large, Medium, and Small.
The client has multiple versions of the model pipelines that have evolved over the years. Because of these different versions, the client wanted to understand how well the current version of the model performs (i.e., categorizes customers into the right segment) compared to the previous one. My task was to perform a volume and model performance analysis between the current and old production pipelines. Once the volume analysis was done, we checked how well our model predicts customer N-day spend on historical data.
Deepesh Singh, Analysis for credit card lead prediction, December 2021 (Leonardo Lozano, Carrie Johnson)
Leads are a driving force for many businesses, representing a starting point for reaching out to potential customers. In simple terms, a "lead" is an observation about a potential customer that typically includes information such as age, gender, occupation, region, vintage, and average account balance, and possibly additional attributes (e.g., product preferences and other demographic data). Marketing and sales departments spend a significant amount of time, money, and effort on lead management, a concept we take to encompass the three key phases of lead generation, qualification, and monetization. This project classifies bank customers into potential credit card leads using supervised learning methods; the classifier assumes that each customer either is or is not a potential lead.
Shivangi Tyagi, COVID 19 Data Visualization, December 2021 (Yichen Qin, Headley Appollis)
It has been more than a year since the first case of coronavirus was reported. It caused huge disruption across countries but also gave way to multiple changes. As time progressed, there were multiple claims about which factors contributed to the spread of the virus in developed as well as developing nations. Thus, it became more necessary than ever to track the progress and effects of this virus worldwide.
This capstone project for the Business Analytics course consists of building an interactive Tableau dashboard to visualize the COVID-19 dataset based on the factors below:
- Impact: COVID-19 impact on the world in terms of total cases and deaths across the countries and continents
- Key factors: The impact of factors such as GDP per capita and median age on the COVID-19 death rate
- Measures taken: Vaccination and testing rate in different countries and continents
Tavisha Mehta, Decision Making and Experimentation, December 2021 (Yichen Qin, Navin Natarajan)
Working with one of the leading social media platforms, a key question Company A wanted to answer was: how can they be confident that their decisions are delivering a better product experience for current customers and helping grow the business with new customers? As a partner, we suggested they use A/B testing and experimentation to make systematic and effective product decisions. An A/B test is a simple controlled experiment that starts with an idea: some change to the UI, or any other part of the customer experience, that we believe will produce a positive result for engagement. We shortlisted key metrics across multiple metric suites to track product performance and built a one-stop tool to track and evaluate all live experiments. Reusable modules enabled rapid scaling and quick insights, helping the client make their product a better experience for their customers.
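A minimal example of the statistics behind such an experiment is a two-proportion z-test on conversion counts; the user counts and conversions below are made up, and the project's own metric suites and tooling are not shown:

```python
from math import sqrt, erf

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test comparing conversion rates of variants A and B."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that both arms convert equally
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical experiment: 12,000 users per arm
z, p = two_proportion_ztest(conv_a=480, n_a=12000, conv_b=552, n_b=12000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value here would suggest the variant's lift is unlikely to be noise, which is exactly the confidence question the client was asking.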
The aim of the project is to capture impressions on different pages of a retail store's online platform. Advanced time series models such as ARIMA (Auto-Regressive Integrated Moving Average), exponential smoothing, and moving averages will be used to forecast impressions, and backend calculations will be done to optimally price ad units on the website using the forecasted impressions. Along with model building, my role also encompassed communicating with clients and being involved in requirements gathering, decision analytics, and business development.
Adithya Venugopal, Deciling and Targeting in Pharmaceutical Industry, December 2021 (Yichen Qin, Wanxin Lyu)
Pharmaceutical companies are adopting new targeting strategies to increase profitability and return on investment. With limited resources, companies look to optimize their marketing call plans and find cost-effective ways to reach high-value customers. The important customers in the industry are physicians, physician assistants, and nurse practitioners; physicians have the authority to prescribe prescription drugs, and companies send sales representatives to visit their offices to promote their business. Marketing researchers need to decide the appropriate size of the sales force needed to sell a particular brand or portfolio of drugs to the target market. Factors influencing this decision include the optimal reach and frequency for each individual physician, how many patients suffer from the disease state, how many sales representatives to devote to office and group practices, and how many to devote to hospital accounts if needed. To aid this decision, physicians are broken into deciles according to their prescription behavior, patient population, and, of course, their business potential under various marketing strategies.
Deciling is a very common methodology to rank and group certain members together based on a certain metric. Specifically in the pharmaceutical industry, deciling is often used to rank and segment physicians to indicate their value in terms of prescription volume. Based on the deciles, a group of physicians will be given priority levels. Top tier deciles will be given priority by the company’s sales representatives, whereas remaining deciles may be targeted using less expensive methods like telemarketing or mailing campaigns. With successful targeting strategy and implementation, sales forces are utilized most effectively and hence profitability can be maximized. This project provides business insights on segmentation that can lead to effective targeting strategies for KMK’s pharmaceutical clients.
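Decile assignment itself is straightforward to sketch; the prescription volumes below are simulated, and the rule that deciles 8-10 get sales-rep visits is a hypothetical cut-off for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
# Hypothetical physician-level prescription volumes
physicians = pd.DataFrame({
    "physician_id": range(1, 501),
    "rx_volume": rng.gamma(shape=2.0, scale=50.0, size=500),
})

# Decile 10 = highest prescribers; qcut ranks into 10 equal-size groups
physicians["decile"] = pd.qcut(
    physicians["rx_volume"], q=10, labels=range(1, 11)).astype(int)

# Top-tier deciles get sales-rep visits; the rest get cheaper channels
physicians["channel"] = np.where(
    physicians["decile"] >= 8, "sales_rep", "telemarketing_or_mail")
print(physicians["channel"].value_counts())
```

Because `qcut` splits on quantiles rather than fixed thresholds, each decile holds the same number of physicians regardless of how skewed the volume distribution is.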
Niteen Kalyan, Loan Strategy Optimization, December 2021 (Yichen Qin, Michael Turrie)
Elective healthcare is one of the categories where credit has historically not been accessible. Lending in such categories comes with unique challenges and opportunities; the size of these loans falls between a credit card loan and a mortgage. GreenSky Patient Solutions offers credit in this space. As part of the credit strategy team, the objective is to choose the most profitable customers and serve them. Using past loan data, we picked the variables responsible for defaults on these loans, usually related to an individual's creditworthiness, and incorporated them into a model. The model does a good job of predicting default after a specific period of time, performing well on out-of-sample data. We then select the right cutoff point for the model response in order to maximize the future value of the portfolio. This gives us the optimal model for current economic conditions; it can be used until there is a significant change in macro factors, as these would affect key factors such as default rates.
Prateek Kumar, Housing Market Prediction, December 2021 (Yichen Qin, Hsiang-Li Roger Chang)
The data we analyze, and the problem statement itself, are inspired by a Kaggle competition for data science students. A home buyer asked to describe their dream house probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad, but this dataset proves that much more influences price negotiations than the number of bedrooms or a white picket fence. With 81 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, we take up the challenge of predicting the final price of each home.
Our exploration of the housing data investigates the factors that influence price negotiations. With the more traditional approach, based on newspapers and local agents, we usually end up judging a house's value by comparing no more than a handful of variables, like the number of bedrooms or the neighborhood. Based on this extensive dataset of ~1600 records capturing almost every aspect of a house, our exploration centers on the response variable "SalePrice" and the major factors influencing it that may not be obvious at first glance.
Our proposed approach covers the major nuances and insights of the dataset. Since this is a massive dataset with 81 variables, we aim to highlight the most contrasting variables. The selected model for this analysis may not be the most accurate one, as the goal here is to validate a model that supports the insights from our analysis.
Rahul Narendra Goswami, Customer Segmentation of an online retail company, December 2021 (Leonardo Lozano, Alex Wolfe)
The project aims to establish performance metrics, segment customers using the RFM (Recency, Frequency, Monetary Value) method, and predict lifetime value for an online retail company in the UK. The dataset consists of 541,909 observations and 8 variables. For a business to grow, there must be clearly defined metrics which it would need to maximize. In our case, the most important metric, the North Star Metric, will be monthly revenue.
Customer segmentation can be described as the art and practice of dividing customers into clusters that capture how similar the customers in each group are. With this, we can implement strategies to maximize our performance metrics by addressing each cluster. Here, we divide customers into segments according to the value they bring to the company (low value, mid value, high value, etc.) using unsupervised machine learning clustering algorithms. A considerable amount of investment goes into acquiring and retaining customers so that they generate revenue and are profitable, which leads us to calculate their lifetime value by identifying behavior patterns and customer segments and acting accordingly. Lifetime value can be calculated by subtracting the total cost from the total gross revenue for a customer, and it can be modeled using various machine learning algorithms.
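A rough sketch of RFM scoring on synthetic customer summaries; the quartile scoring and the tier boundaries are illustrative choices, not the project's exact rules:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(11)
# Hypothetical per-customer summaries derived from a transactions table
customers = pd.DataFrame({
    "customer_id": range(1, 301),
    "recency_days": rng.integers(1, 365, 300),
    "frequency": rng.integers(1, 50, 300),
    "monetary": rng.gamma(2.0, 200.0, 300),
})

# Score each dimension 1-4 by quartile; low recency (recent buyer) scores high
customers["r"] = pd.qcut(customers["recency_days"], 4, labels=[4, 3, 2, 1]).astype(int)
customers["f"] = pd.qcut(customers["frequency"], 4, labels=[1, 2, 3, 4]).astype(int)
customers["m"] = pd.qcut(customers["monetary"], 4, labels=[1, 2, 3, 4]).astype(int)

# Simple value tiers from the combined RFM score (range 3..12)
total = customers["r"] + customers["f"] + customers["m"]
customers["segment"] = pd.cut(
    total, bins=[2, 6, 9, 12], labels=["low", "mid", "high"])
print(customers["segment"].value_counts())
```

In practice these score-based tiers are often replaced or refined by a clustering algorithm such as K-Means, as the project describes.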
Anshul Jain, Competitive Analysis using Incorta (ETL/Analytics/Visualization), December 2021 (Charles Sox, Pradeep Macharla)
An important aspect of deploying sales strategies is understanding competitors' markets. This work uses data manipulation in SQL and PySpark, statistics, and visualization to draw business insights by analyzing competitors' markets in real time. "Incorta," a unified data platform that supports big data computing along with analytics and visualization, was used in this work. Insights such as the number of weekly digital coupon offerings at different retailers, the manufacturers and brands offering digital coupons at these retailers, and the number of retailers carrying the same digital coupons were drawn to help the sales team benchmark its own performance metrics and analyze different players in the market.
Sai Mounica Gudimella, Direct Auto Load Forecast Model, December 2021 (Michael Fry, Michael Kraeutle)
Direct auto lending is a popular product: consumer credit financing for the purchase or refinance of an automobile, with loan amounts ranging from $2,000 to $80,000. The objective of the project is to create a forecasting model for the direct auto loans the regional bank offers, based on historical data from January 2019 to date, which will be used to forecast Direct Auto trends through December 2022.
The forecasting curve is determined by the period between the application, approval, and booked dates. The forecast is further segmented by credit score and loan type (new or refinance).
The forecast generated helps the bank take strategic decisions:
- To deliver accurate forecasts on auto loans
- To approximate what the business is currently doing and inform strategies and tactics to improve auto loan financing
- Forecasting helps finance and the leadership to understand direct auto loan’s contribution to the overall revenue and profitability of the bank
Yamini Tawde, Energy Service Analytics, December 2021 (Leonardo Lozano, Ted Stephens)
The Energy Service Analytics team at Tesla focuses on wrangling data and building visualizations to draw meaningful insights from service data related to Tesla Energy products. This helps the Technical Support team make data-driven decisions that can reduce resolution times for service issues. My primary responsibilities as a Technical Support Analyst Intern at Tesla were building SQL queries, scripting in Python to add features to the ETL, and building dashboards in Tableau. The work covered advanced data wrangling, ETL building, descriptive statistics, and data visualization, along with a thorough understanding of how the team at Tesla will use the results to make impactful business decisions.
Amish Kapoor, Blueprint Takeoff Analysis, December 2021 (Charles Sox, Vaibhav Bhat)
The Home Depot (THD) is looking to understand the performance-related benefits of blueprint takeoff usage at both the store and customer level. The project involved identifying the right product- and sale-level datasets for analysis, querying the data from Google Cloud Platform, and creating a weekly refreshable Tableau dashboard as well as a strategic PowerPoint presentation for higher management at THD. We had to make sure the refresh was fast, and spent a good amount of time optimizing the query accordingly.
In short following were the key steps in the project:
- Understanding and extracting relevant datasets from GCP: Product, SKU description, calendar, and sales tables were identified, and data was extracted from BigQuery by writing relevant SQL queries (using advanced commands including joins and subqueries)
- Data wrangling in MS Excel: After creating a master dataset from GCP, used pivot tables to create relevant views and understand sales- and order-related information
- Creating the PowerPoint and Tableau dashboard: Conducted deep-dive analysis to understand the relationship between blueprint takeoffs and sales/orders, and presented the results to higher management at The Home Depot
Lohit Borah, Analyzing & Optimizing Marketing Strategies and Spending, December 2021 (Denise White, Rav Estrada)
The projects which I have highlighted as a part of my capstone project submission are:
- Cohort Analysis of customers who bought Toy category products in the Q4-2020 vs. other customers: In this project, my role was to join various data sources in the Big Query SQL to get the required data and perform quantitative analysis to identify if the cohort of customers who bought Toy category products in the Q4-2020 are higher-engaging and higher-value customers compared to others. As a part of the final deliverable along with an analysis summary, an Excel Dashboard was provided to the Merc Team to optimize marketing and pricing strategy for Toy category products in the 2021 Holiday season period.
- Create Google Studio & SIMBA Big Query Dashboards: In this project, my initial task was to create an automated flow for both the Google Studio & SIMBA Big Query Dashboards by joining multiple sources/tables in the Big Query and setting up the automated refresh process. The next step was to create the dashboards with required views/graphs and filters so that the Paid Ads Team can analyze and optimize spending across different programs and campaigns on the fly.
Ankush Morey, Advertisement Prediction and Recommendation Engine, December 2021 (Michael Fry, Dave Gearson)
The Olympics and the Super Bowl have long been ideal platforms for companies looking to promote themselves. Even with the challenge of a global pandemic, NBC averaged 15.1 million TV-only viewers and the Super Bowl averaged 100 million viewers. As part of the Corporate Decision Sciences team, our objective is to evaluate commercials on different KPIs (attention, search, and tweets) and provide recommendations for improvement.
The pipeline starts with PCA for dimensionality reduction and hierarchical clustering for campaign segmentation. We then have a two-stage architecture that trains two Random Forest models, one on operational features and one on creative features, and combines their outputs into a final prediction. Finally, a script toggles each feature one by one and calculates the predicted score for each KPI of each commercial; the feature that yields the highest increase in the prediction is the top recommendation.
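A condensed sketch of this pipeline on simulated data; the feature sets, cluster count, and the simple averaging of the two forests' outputs are assumptions for illustration, not the team's exact design:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(8)
n = 200
# Hypothetical commercial features: operational (e.g. slot, reach) and creative
operational = rng.normal(size=(n, 6))
creative = rng.normal(size=(n, 10))
# Toy KPI driven by one feature from each family
attention_kpi = operational[:, 0] + 0.5 * creative[:, 0] + rng.normal(0, 0.1, n)

# Stage 0: reduce creative features with PCA, then segment hierarchically
creative_pcs = PCA(n_components=3).fit_transform(creative)
segments = AgglomerativeClustering(n_clusters=4).fit_predict(creative_pcs)

# Two-stage prediction: one forest per feature family, outputs combined
rf_op = RandomForestRegressor(random_state=0).fit(operational, attention_kpi)
rf_cr = RandomForestRegressor(random_state=0).fit(creative, attention_kpi)
final_pred = (rf_op.predict(operational) + rf_cr.predict(creative)) / 2
print("segment sizes:", np.bincount(segments))
```

The feature-toggling recommendation step would then re-run `final_pred` with each feature perturbed and keep the change that raises the predicted KPI the most.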
This data is used by Advertisers, the internal marketing team, and the Ad sales team to optimize the performance of commercials and in turn, increase the ad revenue.
Lavanya Tharanipathy, Transition Modeling, December 2021 (Denise White, Shantanu Seth)
Pharmaceutical companies today are constantly innovating to cure diseases as efficiently as possible. One such pharma company manufactures a drug that treats arthritis. While effective, the drug causes considerable discomfort and pain during administration. Drawing on its pool of innovation and expertise, the pharma client has released a variant of the same drug that eliminates the discomfort while remaining effective. This project estimates the likelihood that a patient will transition to the new variant of the drug by computing a transition propensity based on patient and engagement attributes.
Arun Pallath, Quarterly Campaign Performance Through Click Stream Data, December 2021 (Leonardo Lozano, Prasun Velayudhan)
Understanding the performance of campaigns plays a vital role in designing better, more focused campaigns. My role, as a Marketing Analyst for a Fortune 500 company, was to analyze the quarterly performance of all campaigns launched in the third quarter. The goal was to identify differences in campaign impact across demographics through click-through probability and conversion rate. Gen Z, defined as users in the 18-24 age bucket, is the target audience for most of the campaigns, so another goal was to identify the impact of these campaigns among the Gen Z population. All campaign performances were compared against benchmarks, and most were found to be performing better than the set benchmarks. Finally, a dashboard was created with all the significant metrics that can be easily understood by technical as well as non-technical audiences.
Reema Manda, Opportunity Store Analysis, December 2021 (Yichen Qin, Shaun Mckim)
The objective of the project is to find the opportunity/outlier stores that are at higher risk so that sales leaders can target those stores proactively and prevent losses. The methodology for finding outlier stores relies on key metrics to assess a store's risk factor. We used a store's prescription (script) count as an indicator of the store's importance, since stopping losses at stores with high script volumes is more profitable. To calculate the store risk factor, we took into account key metrics such as balance-on-hand adjustments and the store's recent shrink rate, analyzed the relationships among these metrics, and ranked the pharmacy stores by the resulting risk factor. We also explored clustering techniques in the process of finding outlier stores. Sales leaders would target these stores based on their risk rankings and prevent losses in advance.
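The scoring-and-ranking logic can be sketched as follows; the metric weights and store figures are invented for illustration, not the team's actual values.

```python
# Hypothetical sketch of a risk-factor ranking: combine balance-on-hand
# adjustments and recent shrink rate, scaled by script volume so that
# high-volume stores rank higher. Weights and data are illustrative.

def risk_factor(store, w_boh=0.6, w_shrink=0.4):
    """Weighted combination of BOH adjustments and shrink rate,
    scaled by script count."""
    raw = w_boh * store["boh_adjustments"] + w_shrink * store["shrink_rate"]
    return raw * store["script_count"]

stores = [
    {"id": "A", "boh_adjustments": 0.02, "shrink_rate": 0.01, "script_count": 900},
    {"id": "B", "boh_adjustments": 0.05, "shrink_rate": 0.03, "script_count": 400},
    {"id": "C", "boh_adjustments": 0.01, "shrink_rate": 0.03, "script_count": 1200},
]

# Highest risk first: these become the stores sales leaders target.
ranked = sorted(stores, key=risk_factor, reverse=True)
print([s["id"] for s in ranked])  # ['C', 'B', 'A']
```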
Hriday Anand Nissankara, Recovery Settlement Model, December 2021 (Leonardo Lozano, Siddharth Krishnamurthi)
The Fifth Third Bank collections team is seeking to develop machine learning models to predict and optimize collection methods. The project consists of exploratory data analysis and classification modeling. Within the scope of the capstone, I built a prediction model for collections and recovery of auto loans and credit cards, identifying which attributes are most important in the model and how they relate to collection effort. It includes building a logistic regression model to predict which accounts will accept a direct mail settlement.
Vedant Sunil Deshpande, Customer Acquisition and Retention Analysis, December 2021 (Yichen Qin, Raunak Gulshan)
Working for a retail giant with both online and physical stores, my project involved providing customer shopping behavior insights for one of the brands selling products through the platform. The main focus was on acquisition and retention of customers. Customer acquisition refers to the process of onboarding new customers onto the brand. Here we look at where the customers are coming from, segmenting them into the following categories: repeat customers, acquired from a competitor, acquired from a different category, and customers new to the retail platform. Retention analysis, also known as churn or survival analysis, looks at the behavior of past customers in the current timeframe. These customers are categorized as follows: repeat customers, customers who churned to competitor brands, and customers who churned away from the category. The analysis helps brands devise their media strategy at different marketing funnel stages.
Aditya Nitin Ketkar, Product Category Analysis, December 2021 (Yichen Qin, Raunak Gulshan)
This project involves driving business decisions for the marketing division of a top company in the retail space. The marketplace has various product divisions and categories, and each performs and responds differently to marketing campaigns. The category analytics we do helps answer questions about category performance through the tracking of KPI metrics falling into four major types: Website Traffic (product visits, number of visitors, page views, product page views), Conversion (revenue, number of orders, orders per visit, average order value), Customer (number of new customers, repeat customers, reactivated customers), and Marketing Funnel (bounce rate, add-to-cart ratio, cart abandonment ratio). These KPIs are used to monitor the health of the business as well as to measure the effectiveness of various marketing campaigns. They also help detect patterns and anomalies that could entail action from departments such as supply chain, inventory, marketing, and product management.
One such project was to understand whether certain cohorts of customers are more valuable than others. These cohorts were defined by their buying patterns, specifically by analyzing whether they buy a certain set of products. We then analyzed their characteristics through the KPIs mentioned above, along with their repeat purchase patterns over time, to establish the long-term value of these cohorts. The cohorts under consideration were compared on the basis of these KPIs and their purchase patterns, and the cohort with the highest long-term value was identified. The analysis not only identified the most valuable cohort but also provided guidance on curating marketing strategies for each cohort to extract the maximum value from them over the medium to long term.
Sai Satyajit Suravarapu, Home Equity Line of Credit Customer Portfolio Analysis, December 2021 (Yichen Qin, Jacob George)
The capstone project analyzes Home Equity Line of Credit (HELOC) balances, payments, and draws for Fifth Third Bank over 2019 to 2021. We observe that balances and balance per active account have consistently declined since COVID hit in March 2020. Two products, a HELOC with a fixed-term balloon and a term product (with different time periods for draw and amortization), have similar balances as of October 2021. Balances by customer risk level and combined loan-to-value (CLTV) have also been analyzed, and risk levels 2-3 constitute the majority of the balances. The number of customers reaching time to maturity has doubled over 2019-2021. For the term product, draws and draws per active account have seen a spike due to the re-introduction of a promo offer.
Naga Lohit Kilaru, NPS for Delivery Experience at an E-Commerce Firm, December 2021 (Yichen Qin, Saurabh Malhotra)
Post-delivery NPS, which measures the customer experience after retail goods are delivered, is critical to gauging customer needs for many companies. Companies calculate NPS to check how their customers feel about the various services provided as part of the business; it is one of the key metrics that can be used to predict the growth of a business, and it is easy for customers to understand. This project details the influence of operational metrics, such as cycle time, on NPS, and analyzes the dependency of NPS on vendor delays in cycle times by vendor. Trends in consumer patterns and in NPS across weeks and months also help in understanding consumer behavior.
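The NPS metric itself is straightforward to compute from survey scores; a minimal sketch on the standard 0-10 scale:

```python
def nps(scores):
    """Net Promoter Score: % promoters (scores 9-10) minus
    % detractors (scores 0-6); passives (7-8) count only in the total."""
    promoters = sum(s >= 9 for s in scores)
    detractors = sum(s <= 6 for s in scores)
    return 100 * (promoters - detractors) / len(scores)

# 2 promoters and 2 detractors out of 6 responses cancel out.
print(nps([10, 9, 8, 7, 6, 3]))  # 0.0
```

The analysis described above then tracks this score over weeks and months and relates it to operational metrics like cycle time.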
Nikhil Agarwal, Flagging Incorrect Shipments, December 2021 (Leonardo Lozano, Nag Reddy)
In global trade, tariffs are usually declared based on the country the goods come from. Some shippers avoid paying tariffs on restricted goods by falsifying the country of origin. If the data science team can predict the country of origin with high accuracy, it becomes possible to flag shipments with unlikely declared origins.
The steps for predicting the country of origin include:
- Performing exploratory data analysis and preparing the data for modeling. This step includes performing ETL operations to create a structured dataset, removing or replacing unavailable values, and identifying gaps and trends in the data
- Developing ML models to predict the ‘Country of Origin’ class variable using the other covariates. The ML models include KNN and Naïve Bayes
KNN (k-nearest neighbors) works on the principle that the class of a new observation is decided by the majority class among its nearest neighbors. Another popular classification algorithm is Naïve Bayes, which applies Bayes’ theorem for conditional probability under the assumption that features are conditionally independent. Since industry data does not follow a particular distribution, such distribution-free classification algorithms often perform well in terms of prediction accuracy.
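A minimal pure-Python illustration of the KNN idea described above (the project itself worked on real shipment data; the toy training points and the "CN"/"VN" origin labels here are invented):

```python
# Toy k-nearest-neighbors classifier: majority vote among the
# k training points closest to the query in feature space.

import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among the k closest
    training points. `train` is a list of (features, label) pairs."""
    neighbors = sorted(train, key=lambda t: math.dist(t[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical 2-D shipment features with invented origin labels.
train = [((1, 1), "CN"), ((1, 2), "CN"), ((2, 1), "CN"),
         ((8, 8), "VN"), ((9, 8), "VN")]
print(knn_predict(train, (2, 2)))  # CN
```

A shipment whose declared origin disagrees with the predicted class would then be flagged for review.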
Divya Bhadauria, Market Segmentation Analysis, December 2021 (Yichen Qin, Jim Kilgore)
Knowing the size of your market is crucial for estimating revenue, whether you're entering a new market, planning an expansion, or getting ready to introduce a new product. Total sales revenue, possible number of clients, and sales volume are all examples of market sizing analyses. In this project, I helped a major pharma client answer a crucial question: whether they could achieve the company's financial goals with the products they currently sell. I prepared a market size analysis report using Python and Power BI. This report helped the client conclude that they need to invest in enhancing their products and expanding their target market to meet their financial goals.
Chris Cardone, Predicting the Outcome for NFL Games Using a Team Rating System, December 2021 (Leonardo Lozano, Yan Yu)
The goal of this project was to predict the outcome of NFL games. To do this, I reviewed and compared various rating methods before deciding to use an Elo rating method. The dataset consisted of box scores for the 2000 to 2020 NFL seasons and was scraped online using Python and Beautiful Soup. To begin, the most basic form of the Elo model was implemented before adding complexities such as partial mean reversion of team ratings from season to season, home-field advantage adjustments, and a point-spread multiplier accounting for the quality of victories, each with different levels of improvement to the predictive accuracy of the base model. The final Elo model was used in conjunction with a betting strategy based on consensus odds from Vegas bookmakers in a simulation where wagers were placed for the 2015 through 2017 seasons, yielding a 14% return on all money wagered. Overall, the final model predicted the outcome of games 65.6% of the time over 20 seasons.
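The core Elo update with the extensions described (a home-field rating bonus and a margin-of-victory multiplier) can be sketched as below. The constants are illustrative, not the values tuned in the project.

```python
# Sketch of a single-game Elo update with home-field advantage and a
# margin-of-victory multiplier. K-factor, home bonus, and the log-based
# multiplier are assumed illustrative values.

import math

def elo_update(r_home, r_away, home_points, away_points, k=20, home_adv=55):
    """Return updated (home, away) ratings after one game."""
    # Expected home win probability, with a home-field rating bonus.
    exp_home = 1 / (1 + 10 ** (-(r_home + home_adv - r_away) / 400))
    outcome = 1.0 if home_points > away_points else (
        0.5 if home_points == away_points else 0.0)
    # Margin-of-victory multiplier: bigger blowouts shift ratings more.
    mov = math.log(abs(home_points - away_points) + 1)
    delta = k * mov * (outcome - exp_home)
    return r_home + delta, r_away - delta

# Underdog home team wins 27-13: it gains what the favorite loses.
new_home, new_away = elo_update(1500, 1550, 27, 13)
print(round(new_home), round(new_away))
```

Season-to-season partial mean reversion would simply pull each team's rating part of the way back toward the league average before the next season starts.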
Yash Modi, Covid 19 Daily Cases Prediction using Machine Learning, December 2021 (Leonardo Lozano, Vivek Soundarapandian)
The second wave of the coronavirus pandemic in India caused unprecedented loss, killing thousands of people. Experts believed that many lives could have been saved had the government anticipated the rise in COVID-19 cases and, subsequently, the need for hospital beds and oxygen cylinders to treat patients. We developed an algorithm to predict COVID-19 cases using historical data through May 2021. The model emphasized periodicity to deal with the time-series component present in COVID data, along with vaccination variables. We tried three models to predict daily COVID cases: simple linear regression, a generalized linear model, and XGBoost. All three models were trained on training data and validated on test data. In the end, XGBoost, with its objective set to Poisson, performed best on test data and was chosen. We found the algorithm to be reliable: our predictions were not far off from the actual numbers reported in the months following the second wave.
Akshay Nitturkar, Multicultural Test Campaigns, December 2021 (Yichen Qin, Sandy Latushkina)
Demographically, the LatinX and African American communities have substantial populations in the USA, which makes them important communities to focus on from a marketing standpoint. Recently, Ancestry.com acquired several record collections specific to these ethnic groups, and the marketing team decided to test targeted campaigns whose creatives are specifically designed to appeal to LatinX and Black Americans. The purpose of this pilot test was to assess whether such tailored campaigns help increase brand awareness and adoption among specific ethnic groups. Social media campaigns were launched on Facebook in three phases, of which the first was designed to attract while the next phases focused on acquiring more customers. Several KPIs, such as traffic, click rate, conversion rate, and lifetime revenue, were designed to assess the performance of the test in each phase. PostgreSQL on AWS Redshift, Tableau dashboards, and Adobe Analytics were used to process the response data for these campaigns, website traffic, and attributed sales records. Comparison with relevant historical campaign benchmarks suggests that phase 1 succeeded in attracting the targeted customers, while the next two phases had strong conversion rates but brought in less lucrative customers with lower lifetime revenues. This test provides strong insight into needed improvements in the conversion strategy for similar campaigns in the future.
Ashish Saxena, Patient Risk Scoring, December 2021 (Denise White, Shantanu Seth)
Drug manufacturers launch patient support programs to help patients with drug consumption and access at each step of acquiring a therapy. To ensure a successful support program, it is imperative for manufacturers to identify patients at risk of going off therapy. Assuming that adherence to therapy and engagement with the program are highly correlated, patients’ preferred response patterns are identified.
This is done by obtaining an in-depth understanding of behavioral segments and analyzing historic response patterns of patients enrolled in the patient support program. The risk scoring model built as the solution identifies a patient’s propensity to disengage from the support program and thus fall off therapy.
The following outcomes were achieved after successful completion and impact analysis:
- Improved patient engagement on the platform
- Increased length of activity on the program with improved adherence to therapy
Kushal Gupta, Investment Framework Dashboard, December 2021 (Denise White, Siya Gupte)
This project is based on a marketing performance tracking dashboard. The purpose of the dashboard is to measure the impact of different marketing channels (by channel, by region) and understand the key attributions/drivers of performance: for example, understanding the global change in lifetime value and how much of it is driven by margin changes in the EMEA (Europe, Middle East, and Africa) region.
The Investment Framework dashboard is a growth and revenue management dashboard for a leading music and video service provider, which wants a one-stop solution for monitoring the impact of different marketing channels on overall revenue. This need arose as the company shifted its primary focus towards paid members and their retention.
As a solution, I used in-house platforms such as Plx scripts and dashboards to build the dashboard. The project required:
- Strong implementation of data pipelines, automation, and scale of delivery
- Building a driver analysis to understand the key drivers of performance
- Integrating results from a marketing attribution model
- Communicating insights to cross-functional stakeholders
Lakshmi Dedeepya Boddu, Retail Banking Price Optimization, December 2021 (Yichen Qin, Andrew Cheesman)
Citibank’s retail banking division is seeking to develop machine learning models to predict prices of various banking products in the near future; the model coefficients are given as input to an optimizer tool. The project consists of exploratory data analysis to understand which factors determine price elasticity. Within the scope of the capstone, the EDA and segmentation components of the Deposit Pricing Optimization tool are covered. This includes building a LASSO regression model to predict fund flows between various banking products. For the Deposits module, 38 products are modeled, covering 40 flows per product. Four variants are built for each flow, and the best model is selected as input to the optimizer to determine the product’s price elasticity.
Neelabh Sengar, Supply Chain Visibility at Sam’s Club, December 2021 (Yichen Qin, Praveen Dodda)
Sam’s Club’s digital product suite for associates has had a heavy focus on solutions for inventory processes within clubs. Associates have little to no visibility into when and where a product is within the supply chain. Being on the front lines of customer engagement, they must be equipped with this information to enhance the member experience.
In addition to the member experience, Sam’s Club will leverage this new product for cost optimization and to drive operational behavior. Better labor planning in the clubs and distribution/fulfilment centers, identification of transportation lanes with higher accessorial cost, higher trailer ratios, out-of-alignment shipping, supplier delivery compliance etc., are some of the other use cases that the new product will address. Sam’s Club is expected to save up to $3.4 million just by providing the visibility to the general merchandise within the outbound network.
One of the features currently in development is Trailer Unload Prioritization, which aims to identify trailers with high-value items for restocking in the clubs and fulfillment centers. The prioritization looks at parameters like item value, in-stock status, and special offers to determine which items should be unloaded first to minimize lost sales and improve the member experience.
Mohit Sharma, Transformation of Client’s E-commerce Portal into a Unified Service Portal, December 2021 (Yichen Qin, Ankit Pathak)
This project is about fulfilling the product and process requirements for transforming the client’s e-commerce portal into a unified service platform. I worked on three main sub-projects to ensure a smooth migration. First, a data discrepancy A/A test was run to verify there was no discrepancy between different data sources, reducing the delta across operational and business metrics to <1%. To be able to rely on the results of A/B experiments, it was important for the client to remove any differences between data sources, i.e., between the data source feeding the A/B experiments and the cloud data. Second, a basket analysis measured the impact of persistent filters on the apps by drawing comparisons across seasons to minimize null searches; given the drastic shifts in customer buying behavior across seasons, certain product-level changes were needed to avoid compromising the customer experience. Last, an Item Page (IP) module A/B test evaluated the hypothesis that an alternative layout would improve overall Item Page performance by increasing the visibility of product-related content while lowering the placement of item suggestions.
Kumar Anurag, Refunds and Returns Analysis on Zulily Website, December 2021 (Yichen Qin, Denise White)
Zulily is an e-commerce platform that sells a variety of products ranging from apparel to everyday items. Many orders are returned or refunded or both, so it is vital to understand the impact this has on the company’s net profits. We can leverage analytics to find the detailed reasons for these returns and identify channels to focus on to make better business decisions. With this analysis, we were able to identify the various customer- and product-related factors and create a hierarchy for stakeholders to see how these factors changed over time. The findings also aided business development by focusing on vendors and accounts with lower return rates. The future scope of this project is focused on building a predictive model to identify the customers or orders likely to be returned/refunded.
Nan Yie, Breast Cancer Risk Factor Analysis, December 2021 (Yichen Qin, Edward P. Winkofsky)
In this capstone project, a dataset recording possible risk factors for breast cancer was explored. These risk factors include blood glucose level, insulin, and MCP-1, among others. The response variable was either 1 (healthy individuals) or 2 (breast cancer patients). Multiple modeling methods, including logistic regression and classification trees, were employed to investigate the relationships between the explanatory variables and the response variable.
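As an illustration of the logistic-regression approach, here is a minimal from-scratch fit on synthetic one-feature data (the actual analysis used the real clinical dataset and standard modeling tools; the "glucose" values and labels below are made up):

```python
# Toy one-feature logistic regression fit by batch gradient descent.

import math

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit weight w and intercept b for p(y=1|x) = sigmoid(w*x + b)."""
    w, b, n = 0.0, 0.0, len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1 / (1 + math.exp(-(w * x + b)))  # predicted probability
            grad_w += (p - y) * x
            grad_b += (p - y)
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

# Synthetic "glucose level" feature: higher values labelled 1 (patient).
xs = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(xs, ys)
predict = lambda x: 1 / (1 + math.exp(-(w * x + b))) > 0.5
print(predict(0.8), predict(3.8))  # False True
```

A classification tree would instead partition the same feature space by threshold splits; comparing the two, as the project did, shows how the methods trade interpretability forms.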
Nitin Mittapally, Truck Capacity Optimization for FMCG Client, December 2021 (Yichen Qin, Jordan Durham)
The goal is to solve a truck capacity optimization problem using current and future orders data to reduce the variance in truck capacity, maximize capacity utilization, and maintain delivery times. We have past orders and deliveries data along with market reports on shelf availability that identify critical products needing restocking. We built Python scripts to prototype a model for this problem and verified it against past data, put the model on a pilot run for a single location to verify results in real time, and scaled the model to all locations on Azure.
My scope includes migrating the existing Excel models to Python, verifying them, and then building them on Azure. This involves connecting to various databases and file systems, aggregating the data in a single place, and then performing the required manipulations and model building.
Christopher A. Calhoun, First-Year Engineering Student Performance in Calculus and Physics, December 2021 (PK Imbrie, Yan Yu)
Calculus and Physics are foundational courses for university engineering students. These courses are taken during or before freshman year and serve as a common gateway to success in an engineering education. Many factors influence a student’s success in Physics and Calculus at the university. In this work, the Calculus and Physics performance (grades) of a collection of engineering freshmen at the University of Cincinnati is analyzed. Data was collected about each student’s high school experience, including the classes they took, subject-specific grade point averages, and standardized test scores, which in turn were used to predict their grades in Calculus and Physics at the university. A variety of models were considered, including linear regression, classifiers (multinomial, classification tree, random forest, and neural network), and an ordinal model (proportional odds). The models all worked similarly well and provided insights into the odds of success. This report discusses the different factors predicting success.
Nishitha Alapati, Grocery Data Analysis, August 2021 (Dungang Liu, Yan Yu)
The purpose of this study is to analyze which factors play a significant role in increasing customer engagement at grocery stores. Strong customer engagement is important in the retail industry, as it fosters customer loyalty and growth in sales. First, a demographic analysis was done to examine whether there is any association between demographic factors and customer spending. Next, Market Basket Analysis was performed and association rules were created using the Apriori algorithm to uncover relationships between products that are often purchased together. The demographic analysis revealed a positive association between customer income and spending as well as household size and spending, and a negative association between customer age and spending. Helpful recommendations regarding product cross-selling, promotional offers, and store layout were derived from the Market Basket Analysis. Understanding how demographic factors affect customer spending and implementing recommendations based on association rules will lead to increased customer engagement at grocery stores.
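The association-rule metrics behind Apriori (support, confidence, lift) are easy to illustrate on toy baskets; the groceries below are made up, not the study's data:

```python
# Support, confidence, and lift for an association rule lhs -> rhs,
# computed directly over a list of transaction baskets (sets of items).

def rule_metrics(baskets, lhs, rhs):
    """Return (support, confidence, lift) for the rule lhs -> rhs."""
    n = len(baskets)
    lhs, rhs = set(lhs), set(rhs)
    n_lhs = sum(lhs <= b for b in baskets)          # baskets containing lhs
    n_rhs = sum(rhs <= b for b in baskets)          # baskets containing rhs
    n_both = sum((lhs | rhs) <= b for b in baskets) # baskets containing both
    support = n_both / n
    confidence = n_both / n_lhs
    lift = confidence / (n_rhs / n)                 # >1 means positive association
    return support, confidence, lift

baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
]
s, c, l = rule_metrics(baskets, {"bread"}, {"butter"})
print(s, round(c, 3), round(l, 3))  # support 0.5, confidence ~0.667, lift ~1.333
```

Apriori itself just enumerates frequent itemsets efficiently before these metrics are applied; rules with high confidence and lift above 1 drive the cross-selling and layout recommendations.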
Matt Anthony, Predicting the Winner of the World Series, August 2021 (Dungang Liu, Liwei Chen)
Baseball is a sport known to be very random: any team can win any game on any given day. This drives an incredible amount of analysis within the sport as each team tries to find an advantage, especially in the playoffs, where every team is good and any team could win each series. I decided to model this randomness to predict the team most likely to go all the way in the playoffs and win the World Series. To do this, I built three different models to predict the number of playoff wins a team will get, using data based on the Lahman database: a standard Linear Regression model, a Random Forest model, and an XGBoost model. According to the Linear Regression model, the Houston Astros are most likely to win the World Series with a predicted playoff win total of 6.840. According to the Random Forest model, the Chicago White Sox are most likely, with a playoff win total of 7.820. The XGBoost model predicted that the New York Mets are most likely to win, with a playoff win total of 4.077. Overall, the Chicago White Sox have the highest likelihood across all three models, finishing 2nd, 1st, and 2nd in each model, respectively.
Amanda Cabezas, Identifying Key E-commerce Clickstream Performance Metrics that Impact Revenue Generation, August 2021 (Dungang Liu, Jeffrey Shaffer)
Throughout the Covid-19 global pandemic, many businesses have suffered financially or closed due to reduced business and lost revenue. In response to mandatory quarantine enforcement, some businesses have been able to adapt through increased investment in e-commerce offerings and the online user experience. These investments are often expensive, and it can be difficult to pinpoint which changes would lead to increased revenue. By performing logistic regression on e-commerce clickstream data with thorough variable selection, key metrics can be identified that significantly impact the potential revenue generation of each online shopping session. The analysis identifies six clickstream metrics that have a statistically significant impact on whether revenue is generated from an online shopping session. The resulting analysis provides a good basis for further exploration into key performance indicator tracking for businesses hoping to become more competitive with their e-commerce offerings.
Cynthia Corby, Calculating the Career Reentry Population and Simplifying the Process to Update, August 2021 (Leonardo Lozano, Denise L. White)
This project assists iRelaunch, a pioneering company in the career reentry space, in updating its data to better create and improve pathways for people to return to work after a career break. Using Bureau of Labor Statistics (BLS) microdata, the calculation of the relauncher population was updated for the last three years as well as the last 12 months. Relaunchers are defined as women ages 25-54 with children under 18 and a college degree or higher who are not in the workforce, with the broad assumption that they are not in the workforce because of childcare. This project additionally included finding the population of men who have the same profile. There are currently about 2.9 million total relaunchers in America as of May 2021. Once the relauncher population calculations were updated, a modeling tool was built in R that iRelaunch can continue to utilize in future years by plugging in future Bureau of Labor Statistics data, extracted from the IPUMS-CPS database, to update the relauncher population calculation.
Justin L. Ditty, Utilizing Obituaries to Predict Casket Sales, August 2021 (Michael J. Fry, William M. Bresler)
Third-party vendors in the funeral product industry rely on annually published government mortality statistics to analyze market share, predict sales and manufacturing quantities, and inform other business decisions. These mortality statistics are finalized up to six months after the end of the calendar year. As a result, business decisions are often made with outdated information, which may lead to suboptimal outcomes. Obituary information from local newspapers and funeral home websites can be used as an alternative source of mortality statistics. Obituaries can be gathered monthly, reducing lag times for sales forecasting and churn analysis from six months to one month. Random Forests, Decision Trees, Zero-Inflated Poisson Regression, and Multiple Linear Regression were used to predict the number of funeral products sold using available obituary data and government death statistics. The machine learning models trained on the monthly published obituary information significantly outperformed models trained on government mortality statistics at predicting the number of funeral products sold. This analysis shows that, where obituary information is available, third-party vendors can rely on it for forecasting and churn analysis instead of waiting for government death statistics.
Joseph Froehle, Mercy Health: An HR Survey Study, August 2021 (Bryson Purcell, Michael J. Fry)
Once a year, almost all employees at Mercy Health are eligible to fill out a survey containing questions about their opinion of the job, the actions that could be taken to make their job better, and the factors that play a role in this. Different data analysis techniques are used to paint a picture of the different groups that work within the health organization, with the goal of specifically targeting these groups to increase employee retention and morale. The Workforce HR team looks at large datasets and employs techniques such as k-means and hierarchical clustering to group different types of employees together and to find the biggest influences on their overall work experience.
Anthony Goebel, Receiving Dock Sizing for Retail Fulfillment Centers Using Stochastic Simulation, August 2021 (Michael Fry, Eric Worley)
This paper details a model built for optimizing peak-volume sizing of inbound processing space for a retail fulfillment center using discrete-event simulation with stochastic production inputs. The model is run for a 72-hour peak period using historically fitted arrivals, production rates, operational inputs, and resource parameters such as staff size, travel time, and travel distance. A full 5-door case-unload receiving operation is simulated. Three simulations are run using slight adjustments to labor allocation, and backlog appears to be highly sensitive to resource changes at the end of the process, as product moves out of the dock space. Pallet backlog volumes could soar close to 1,000 pallets in dock space, but a 2-unit change in staff allocation can decrease this amount to below 100. The max scenario would require a footprint of roughly 70,000 sq. ft., which would take up almost 6% of the average 1.2M sq. ft. facility, but the second would require only 6,000 sq. ft. Next steps are to begin consulting engineering and operations teams on the model and its results. As of right now, pallet sizing requirements are inconclusive, given that small resource changes result in a 65,000 sq. ft. incremental need.
Megan Hueker, Interpreting Gradient Boost Models for Property Casualty Insurance Models, August 2021 (Jared T. McKinney, Michael Fry)
A key challenge that has prevented more widespread adoption and understanding of Gradient Boosting Models in the insurance industry is their complexity and lack of interpretability. In this paper, we propose that Gradient Boosting Models have great potential for positively impacting the insurance industry and how it builds its prediction models. We explore techniques for improving the interpretability of Gradient Boosting Models and show that these models can be made interpretable for actuaries, analysts, and policyholders alike.
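One common route to interpretability is to pair a gradient boosting model's feature importances with a partial dependence curve. The sketch below uses synthetic data and invented feature names (not the authors' insurance data or model) to show both ideas:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
# Synthetic "policy" data: two informative features plus one pure-noise feature
X = rng.normal(size=(500, 3))
y = 3 * X[:, 0] + np.sin(2 * X[:, 1]) + rng.normal(scale=0.1, size=500)

model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Global interpretability: impurity-based feature importances (sum to 1)
print(dict(zip(["driver_age", "vehicle_value", "noise"],
               model.feature_importances_.round(3))))

# Structural interpretability: a one-way partial dependence curve, computed
# by sweeping one feature over a grid while averaging over the others
grid = np.linspace(-2, 2, 5)
pd_curve = [model.predict(
                np.column_stack([np.full(len(X), g), X[:, 1], X[:, 2]])).mean()
            for g in grid]
```

Actuaries can read the importance table as "which rating variables matter," and the partial dependence curve as "how the predicted loss moves as one variable changes," which is the flavor of explanation the paper argues makes these models usable.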
Charles Kishman, Forecasting Spot Market Truckload Rates, August 2021 (Michael Fry, Chris Painter)
Spot market truckload pricing is a major area of concern for all parties in the supply chain. The current transportation market has been thrown into chaos by DOT regulations and the ongoing effects of the Covid-19 pandemic. The high demand for consumer goods and tightening truck capacity due to driver shortages make it imperative to have a strategy for setting benchmarks for truckload rates. As a Brokerage and Managed Transportation Solutions provider, Stridas is susceptible to these issues.
This paper will seek to develop a forecast for dry truckload rates based on historical values. In addition, it will explore the impact of market factors such as carrier capacity and tender rejection data on linehaul rates. Autoregressive time series methods and the Facebook Prophet model will be evaluated.
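A minimal autoregressive baseline of the kind to be evaluated can be sketched as an AR(1) model fit by least squares; the monthly rate series below is hypothetical, not Stridas data:

```python
import numpy as np

def ar1_forecast(series, steps=4):
    """Fit y_t = c + phi * y_{t-1} by least squares and iterate forward.
    A minimal stand-in for the autoregressive methods evaluated in the paper."""
    y = np.asarray(series, dtype=float)
    X = np.column_stack([np.ones(len(y) - 1), y[:-1]])  # lagged design matrix
    c, phi = np.linalg.lstsq(X, y[1:], rcond=None)[0]
    preds, last = [], y[-1]
    for _ in range(steps):
        last = c + phi * last
        preds.append(last)
    return preds

# Hypothetical monthly dry-van linehaul rates ($/mile)
rates = [2.10, 2.15, 2.22, 2.30, 2.41, 2.48, 2.55]
print(ar1_forecast(rates, steps=3))
```

Richer models (higher-order AR terms, exogenous regressors like tender rejection rates, or Prophet's trend/seasonality decomposition) extend this same fit-then-extrapolate pattern.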
Seth Kursel, Time-Series Prediction of Stablecoin Volatility, August 2021 (Dungang Liu, Michael Morelli)
Cryptocurrencies experience extreme volatility, often growing 100x or going to zero. Stablecoins are intended to function as a medium of exchange, holding a set value while retaining the decentralized benefits of crypto. Sometimes these mechanisms fail and a coin 'de-pegs' from its intended value. To capture possibilities for arbitrage, we look for a model to explain and predict these stablecoin-wide increases in volatility using five popular $1-pegged coins. Optimal timeframes for employing market indices are investigated. Using time-series analysis and GAMs with price range and volatility, we conclude that more nuanced time-series techniques are necessary to extract any signal from the noise of crypto volatility.
Jonathan LaForge, Seoul Bike Rental Analysis, August 2021 (Dungang Liu, Yinghao Zhang)
With the increased emphasis on reducing fossil fuel use in transportation, reliable and readily available alternatives are essential for a transition to happen. Year over year there has been a sharp increase in the usage of forms of micromobility, like bicycles and scooters. As this usage continues to grow, it becomes essential for those supplying these lightweight vehicles to be able to fill the demand of these new consumers.
In this project, an analysis was done of bike rental data collected from one of the largest cities in the world, Seoul, South Korea. Several models were fitted to the data, with the random forest model being chosen as the optimal model.
Mijia Li, Garments Worker Productivity Analysis, August 2021 (Michael Fry, Rishika Kondaveeti)
This Garments Worker Productivity Analysis Project applies data mining to the Productivity Prediction of Garment Employees dataset. The objective is to identify critical factors of actual productivity and predict the range of productivity to help decision-makers understand their factories better and improve decision-making. Exploratory data analysis, linear regression models (full model, stepwise selection, and LASSO regression), advanced tree models (regression tree and random forest), and neural network models are used in this study. By comparing out-of-sample performance (mean squared prediction error, MSPE) and in-sample performance (mean squared error, MSE) and weighing interpretability, the random forest model was selected as the best model for this dataset. In addition, based on the random forest model, "targeted_productivity," "incentive," and "no_of_workers" are identified as the three critical factors affecting actual productivity.
Wei Li, Movie Popularity and Ratings Analysis, August 2021 (Leonardo Lozano, Peng Wang)
The movie data set is from Kaggle and is organized by movie title with related information listed. This data set includes 15,480 rows with 29 variables, drawn from different sources including Netflix, IMDb, Rotten Tomatoes, YouTube, and box office records. The data set includes movie ratings from several critics, along with each movie's genre, release time and release markets, and gross proceeds, among other data. The objective of this research is to find the most popular movies from the last 100 years as judged by ratings, to determine the top 10 most popular movie genres, and to assess the correlation between ratings and the financial performance of a movie.
Yao Lin, Default of Credit Card Clients, August 2021 (Dungang Liu, Mike Morelli)
Objective: (1) to develop a financial model to predict a customer's default probability before he/she actually defaults on a credit card payment; (2) to assess performance on real-world banking credit card customers across different models by comparing their AUCs; and (3) to identify directions for future real-world studies in predicting customers' default probability.
Data and methods: I used a dataset from UCI containing credit data, historical payments, bill payments, and default payments of customers at a bank in Taiwan from April 2005 to September 2005. It has 25 variables and 30,000 rows. Leveraging statistical methodology, I built a logistic regression model to predict the response for individual customers and identify their default probability. Performance was validated against actual customer default outcomes using the ROC curve (AUC).
Results: A logistic regression model was built to simulate customer performance as a benchmark, using cluster analysis, information value, and variable selection to pick the variables that are significant in the logistic regression.
Discussion and conclusion: By exploiting predictive variables from the data source, I demonstrated that statistical models improve the accuracy of predicting potential default customers. In this study, the AUCs of the logistic regression and decision tree models were 0.7215 and 0.754, respectively.
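The AUC used to compare the models has a simple rank-based definition that can be computed directly; the labels and scores below are illustrative, not the Taiwan data:

```python
def auc(labels, scores):
    """AUC as the probability that a randomly chosen defaulter is scored
    higher than a randomly chosen non-defaulter (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical predicted default probabilities vs. observed defaults
y_true  = [1, 1, 0, 0, 0]
y_score = [0.80, 0.55, 0.60, 0.30, 0.20]
print(auc(y_true, y_score))  # one mis-ranked pair out of six, so 5/6 ≈ 0.833
```

An AUC of 0.72-0.75, as reported above, therefore means the model ranks a random defaulter above a random non-defaulter roughly three times out of four.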
Andrew McCloskey, Conjoint Analysis: Exploring a Sweet Technique, August 2021 (Leonardo Lozano, David Curry)
In this project, the marketing research technique of conjoint analysis is explored and implemented. A brief explanation of what utility the analysis provides and how it can be applied comes first. Next, survey data collected in 2000 by W. Nowak is introduced and analyzed. This data comes from 87 respondents who rated 16 different chocolate bar profiles with varying attributes and levels. The preferences of these respondents are analyzed to determine which chocolate bar would be most advantageous for a manufacturer to produce when introducing a new product to the market.
The results of this analysis show that the aspects of a chocolate bar that consumers care about the most are the kind, or flavor of the chocolate, the price of the bar, and how many calories the candy has. Milk chocolate bars with a low price and low caloric content are the most desirable. A light weight is slightly favorable as well, though less important, and the packaging of the bar is fairly unimportant to consumers.
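The mechanics behind these conclusions — regressing ratings on dummy-coded attribute levels to recover part-worth utilities — can be sketched on a toy design. The attributes, levels, and ratings below are invented for illustration, not Nowak's 16 profiles:

```python
import numpy as np

# Each profile is dummy-coded; regressing ratings on the dummies makes the
# coefficients part-worth utilities for each attribute level.
# Columns: intercept, milk (vs. dark), low_price (vs. high), low_cal (vs. high)
profiles = np.array([
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
])
ratings = np.array([8, 7, 6, 5, 2, 5, 4, 3])  # fabricated respondent ratings

partworths, *_ = np.linalg.lstsq(profiles, ratings, rcond=None)
print(dict(zip(["base", "milk", "low_price", "low_cal"], partworths.round(2))))
```

In this toy data the "milk" level carries the largest part-worth, mirroring the paper's finding that chocolate kind dominates price and calories; an attribute's importance is the spread between its best and worst level part-worths.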
Paul Messerly, Natural Language Processing for Bill of Material Compliance Prediction, August 2021 (Leonardo Lozano, Marcel Oliveria)
In recent years, compliance regulations set forth by the European Union, the State of California, and many other governing entities have become increasingly important and more seriously enforced. This has left organizations scrambling to obtain bill of material compliance documents for assembly parts from their suppliers. Over the past year, I have been involved in the development of an application named Rumzer, which helps these organizations manage this tedious communication process. Through data collected from application users, Rumzer has built a database of customer assemblies. In an effort to automate the compliance document search for a newly entered part, this project evaluates the feasibility of using previous part descriptions and judgments to predict the compliance judgment classification of the new part. It describes the process of training a natural language processing algorithm to test this feasibility.
Shiv Patel, Crime in America, August 2021 (Dungang Liu, Peng Wang)
American crime is something that everyone talks about and something that happens every day. Knowing which crimes affect each state the most, and how, would be beneficial knowledge for society to have. Rates for three main crimes, and which states rank highest in those crimes, would be useful both for businesses expanding into different states and for society in general. Analytical models are used to evaluate what makes a state dangerous and to what degree. With a binary response variable, we are able to classify states and quantify how dangerous they are. Our model correctly predicts whether a state is dangerous 70.4% of the time based on our variables. These results can serve as a baseline and road sign for future models with more in-depth data and a bigger time range, and they can be customized to a specific need or area of interest for a client.
Massey Pierce, Predicting Steam Video Game Ratings, August 2021 (Leonardo Lozano, Michael Fry)
This project looks at over 27,000 video games sold through the video game digital distribution service Steam. A linear regression model, along with multiple tree models, is used to try to predict which variables could make a video game popular and increase sales. Steam is currently the most popular platform for buying computer games and makes it easy for indie developers, as well as established developers, to sell their games. As of 2019 there are over 95 million monthly active users on Steam, and in 2013 Steam held around 75% of the PC gaming market share.
Colleen Sterner, Outlier Detection for Client Shipping Costs, August 2021 (Michael Fry, Abe Adams)
In the transportation industry, shipping costs can quickly become out of control and identifying the underlying issues can save a company a good deal of money. The first step in identifying the underlying issues is to identify the outliers. This paper uses the freight data from a client of a transportation company to identify shipped loads with higher-than-expected shipping costs.
The data was extracted from a Transportation Management system used by the client encompassing the prior three years. The analysis attempts to use linear regression to identify outliers based on the total shipping cost. The dataset is split by mode and then a linear regression is developed for each mode. For full truckloads, the linear regression results in a model that will identify outliers. The outliers will then be reviewed by the business team to determine the reason for the higher costs. A linear regression model does not provide sufficient fit for the smaller shipments that are sent less-than-truckload (LTL).
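A minimal version of the residual-based flagging described here: fit cost against miles, standardize the residuals, and flag loads beyond a z-score cutoff. The mileage and cost figures are fabricated for illustration, not the client's freight data:

```python
import numpy as np

def residual_outliers(miles, cost, z=3.0):
    """Flag shipments whose cost is far from the fitted line: fit
    cost ~ a + b * miles, standardize the residuals, flag |z-score| > z."""
    X = np.column_stack([np.ones(len(miles)), miles])
    coef, *_ = np.linalg.lstsq(X, cost, rcond=None)
    resid = cost - X @ coef
    zscores = (resid - resid.mean()) / resid.std()
    return np.where(np.abs(zscores) > z)[0]

# Hypothetical full-truckload loads: miles vs. total cost, one inflated load
miles = np.array([100, 200, 300, 400, 500, 600, 700, 800], float)
cost  = np.array([300, 500, 700, 900, 1100, 1300, 4000, 1700], float)
print(residual_outliers(miles, cost, z=2))  # flags index 6, the inflated load
```

Flagged indices would then go to the business team for review, as the paper describes; for LTL shipments, where the linear fit is poor, a different baseline model would be needed before residuals are meaningful.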
Lepa Juju Stajanovic, Determining Characteristics Influential in Mathematics Performance of Portuguese Students, August 2021 (Dungang Liu, Peng Wang)
There are many factors beyond what happens in the classroom that can influence a student's performance in school. Figuring out which of these external factors have the most impact can help educators pinpoint students who may benefit from extra attention and assistance. Mathematics, in particular, is a subject that many students are intimidated by and struggle to keep up with if not provided the proper environment. In the following project, we examine this idea through data collected on the demographics of Portuguese students and their performance in math class.
We accomplish this through linear regression with variable selection methods, implemented in R. These methods help us determine which factors may be most influential in a student's performance in math class. The sample size for this study is small, so for more conclusive results a larger sample would be preferable. Nevertheless, we find that students tend to score consistently throughout the school year, and their grades in earlier terms are a strong indication of their performance at the end of the year. Additionally, some of the characteristics with the strongest influence on the final grade are a student's sex, their mother's occupation, where they live, study time, previously failed classes, going out with friends, romantic relationship status, family support, and absences.
Emily Thie, Only You Can Prevent Forest Fires (with Data Mining Techniques), August 2021 (Dungang Liu, Peng Wang)
Quick detection is a key factor in fighting forest fires, and data mining techniques can aid in more accurate prediction of the size of potential fires. This analysis looks at Montesinho Natural Park in Portugal and uses meteorological data and fire prediction indices to predict the area a forest fire will cover. Due to some large outliers in the burned area, a log transformation was applied to the response variable. We then used statistical models including linear regression and random forest, and applied multiple variable selection methods to the regression. Models with just the meteorological data and just the fire index data were compared, but a combination of the two outperformed the other linear models. However, the overall performance of the linear models did not suggest they would be useful in prediction. Because the distributions of the testing and training data differed, the more complex random forest model did not perform better than the simpler linear model. These results are specific to a northeastern region of Portugal, and the sample size was relatively small; even so, the results did not suggest an advantage of these techniques over traditional means of detecting forest fires.
Jing Wang, Sentiment Analysis of Tweets About COVID-19 Vaccine Incentives, August 2021 (Dungang Liu, Liwei Chen)
Data from the Centers for Disease Control and Prevention show the pace of COVID-19 vaccinations has slowed nationwide since April 2021. To overcome vaccine hesitancy, states have introduced lottery-based incentives to increase vaccine uptake. Will those incentives accelerate the vaccination process immediately, or hurt people's willingness to get vaccinated in the long term? Public sentiment analysis contributes valuable information toward making appropriate responses. This project aims to develop a model that predicts sentiment trends in public opinion on COVID-19 vaccine incentives. In this study, the Sentiment140 dataset was used to build machine learning models; text data preprocessing and vectorization methods were applied, and after tuning hyperparameters with cross-validation, the logistic regression model achieved 80% accuracy. Predicting the sentiment of vaccine incentive tweets showed there were 2.4% more negative tweets than positive tweets from May 23 to June 24. The results indicate that public opinion did not support lottery-based incentives.
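The modeling pipeline — vectorize the text, then fit logistic regression — can be sketched with scikit-learn on a tiny invented corpus; the real project trained on the Sentiment140 data with much fuller preprocessing (hashtags, mentions, stopwords, etc.):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative corpus, not Sentiment140
tweets = [
    "love the vaccine lottery great idea", "free shot and a prize awesome",
    "lottery incentives are a bribe terrible", "hate this gimmick waste of money",
    "great incentive love it", "terrible idea hate the bribe",
]
labels = [1, 1, 0, 0, 1, 0]  # 1 = positive, 0 = negative

# TF-IDF vectorization feeding a logistic regression classifier
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(tweets, labels)
print(model.predict(["love this great lottery"]))
```

Once trained on labeled data, the same pipeline scores each collected incentive tweet, and the positive/negative tallies over the study window yield the 2.4% gap reported above.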
Kaiwen Xu, Contributing Factors for the Outcome of League of Legends Ranked Games, August 2021 (Dungang Liu, Leonardo Lozano)
As the game of League of Legends becomes more popular, strategy analysts of professional teams around the world have been researching the deterministic factors behind victory. Using the first 10 minutes of real high-rank game data from the League of Legends API, we fit models to generate insights about which early-game features, such as getting the first kill or taking down the dragon, contribute the most to a blue-side victory. We approach the problem first by using the correlation matrix to find the features that correlate most with the response variable, blue-side wins. Then we fit the training data with a logistic regression model and a random forest model and compare their predictive power. After comparing out-of-sample AUC, we find that the logistic regression model has more predictive power. The conclusions from the logistic regression model follow. Taking first blood increases the log-odds of winning by 0.188. The dragon is a larger factor in a blue-side victory: taking the dragon increases the log-odds by 0.4118, while losing it to the red side decreases them by 0.1651. A 1,000-point experience advantage also increases the log-odds by 0.4368. Using these results, teams could form early-game strategies that gain a competitive edge from the beginning.
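Because the model reports effects on the log-odds scale, translating them into win probabilities takes one extra step, shown here using the coefficients quoted above (the 50% baseline for an even game is an illustrative assumption):

```python
import math

def win_prob(base_prob, *log_odds_effects):
    """Convert log-odds effects into a win probability: start from a
    baseline probability, add the effects on the logit scale, and map
    back through the inverse logit (sigmoid)."""
    logit = math.log(base_prob / (1 - base_prob)) + sum(log_odds_effects)
    return 1 / (1 + math.exp(-logit))

# Effects reported in the paper (blue side): first blood +0.188,
# dragon +0.4118, 1000-XP lead +0.4368
even_game = 0.5
print(round(win_prob(even_game, 0.188), 3))                  # 0.547: first blood only
print(round(win_prob(even_game, 0.188, 0.4118, 0.4368), 3))  # 0.738: all three
```

So an otherwise even game tilts to roughly 55% on first blood alone, and to about 74% when first blood, dragon, and an experience lead are all secured.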
Brandon Baghdadchi, Music Genre Popularity: 1970-2020, April 2021 (Dungang Liu, Jaime Windeler)
Music is one of humanity's oldest pursuits and achievements. Almost everyone is passionate about a particular music genre and values music as art, therapy, or entertainment, which makes it an interesting topic to explore. In this project I had two objectives: (1) to explore public musical trends during the last five decades; and (2) to build a model identifying which musical features might drive a track's popularity. The dataset used for this project was provided by Spotify and includes information about music tracks from 1957 to 2020, but in line with the objectives I performed all analysis on a sample of 32,636 observations based on tracks released in 1970 or later.
Exploratory data analysis indicated that during the last 50 years, rock has been the most consistent genre in terms of popularity, with the most hit songs in each decade. Electronic dance music (EDM) and rap had the most variance in popularity; both were trendier in 2020 than in 1970, but their golden era was around 1980. Pop and Latin both dropped in popularity over the 1970-2020 range, though not steadily. For the second objective, various linear and tree modeling approaches were implemented. Across all approaches, the random forest model yielded the most accurate results, with the lowest mean squared prediction error (MSPE). Based on this model, the top five predictors of a track's popularity were identified as energy, loudness, instrumentalness, duration, and danceability. Although the analysis covered 1970 onward, the largest proportion of the data was from 2010 to 2020, especially 2019. This may introduce an important time-range bias that must be considered when referring to the results.
Megan Barton, Customer Loyalty through Variable Creation and Clustering Applications, April 2021 (Dungang Liu, Luke Fuller)
Loyalty rewards cards have become a staple in retail, enabling customers to receive rewards and special offers based on their shopping behavior while allowing retailers to retain purchase-specific details for each individual customer for downstream analysis. Using logistic regression, we can predict loyalty, and applications of K-means clustering help us better understand and predict customer behavior. Typically, the more loyal the customer, the richer the rewards, which serves a two-fold purpose: incentivizing increased customer engagement, which in turn gives the retailer richer customer data and more sales. In the world of machine learning, given unlimited data, there is a seemingly unlimited number of ways data scientists could create customer groups, explore data, and predict a multitude of customer behaviors, loyalty included. Using a sample of transactional data from Kroger, one of the nation's largest retailers, logistic regression combined with differing applications of K-means clustering shows how we can fine-tune a model for predictive accuracy while condensing the variables used in the model, improving misclassification rates in predicted results to under 20%. This suggests that variable reduction and exploration can be a creative process built on manipulating existing data, avoiding more complex techniques such as PCA.
Matthew Baryluk, Neural Networks: Model Description, Creation, Tuning, and Evaluation in Python, April 2021 (Yichen Qin, Michael Platt)
In this paper, I explore the subject of neural networks and their implementation in Python. Neural networks are powerful black-box classification algorithms at the cutting edge of the machine learning field, and they broadly comprise the field of deep learning. They take their name from their resemblance to the structure of the human brain, and the internal nodes of a network are called "neurons".
I identify five component processes in a deep neural network: forward propagation, the activation function, the cost function, gradient descent, and backward propagation. I then construct such a network in Python, training it on a dataset with classification outcomes wherein banknote forgeries are identified by certain numeric features. I then evaluate it on this dataset, tune its parameters, and compare its performance to other classification algorithms: logistic regression, k nearest neighbors, decision tree with bootstrap aggregation, naïve Bayes, and support vector machine. I find that its performance is equal to or better than every other model, at the expense of a much longer execution time.
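The five component processes can be seen working together in a compact NumPy network. The data here is a small synthetic stand-in for the banknote dataset, and the architecture (one hidden layer of 8 tanh units) is an illustrative choice, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-feature binary classification task
X = rng.normal(size=(200, 2))
y = ((X[:, 0] * X[:, 1]) > 0).astype(float).reshape(-1, 1)

def sigmoid(z):                                  # activation function
    return 1 / (1 + np.exp(-z))

# One hidden layer of 8 tanh units, small random weights
W1, b1 = rng.normal(scale=0.5, size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(scale=0.5, size=(8, 1)), np.zeros(1)

costs, lr = [], 1.0
for _ in range(2000):
    H = np.tanh(X @ W1 + b1)                     # forward propagation
    P = sigmoid(H @ W2 + b2)
    Pc = np.clip(P, 1e-9, 1 - 1e-9)              # cost function: cross-entropy
    costs.append(-np.mean(y * np.log(Pc) + (1 - y) * np.log(1 - Pc)))
    dZ2 = (P - y) / len(X)                       # backward propagation
    dW2, db2 = H.T @ dZ2, dZ2.sum(0)
    dZ1 = (dZ2 @ W2.T) * (1 - H ** 2)            # tanh derivative
    dW1, db1 = X.T @ dZ1, dZ1.sum(0)
    W1 -= lr * dW1; b1 -= lr * db1               # gradient-descent step
    W2 -= lr * dW2; b2 -= lr * db2

accuracy = ((P > 0.5) == y).mean()
print(round(costs[-1], 4), accuracy)
```

Each loop iteration touches all five components in order: forward propagation, the activation functions, the cost, backpropagation of its gradient, and a gradient-descent update.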
Peter Bekins, Modeling Career Trajectories in Baseball - A Bayesian Multilevel Approach, April 2021 (Michael Fry, Michael L. Thompson)
This project demonstrates the benefits of a multilevel Bayesian linear model applied to the problem of estimating age effects on batting performance for career Major League Baseball players. A multilevel model was chosen to compensate for a high degree of noise when estimating career trajectories for individual players due to the relatively small sample sizes. Sharing data between levels allows for shrinkage toward the population-level estimates, while still allowing the intercept and slope terms to vary by player. The Bayesian approach was chosen because Bayesian inference allows for the quantification of uncertainty in the performance estimates. Applications of the model include identifying and ranking unexpectedly high performances relative to a player’s career trajectory and out-of-sample prediction of future performance for an individual player based on the current trajectory.
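The shrinkage behavior described here has a closed form in the simplest normal-normal case, sketched below with invented variance and batting numbers (not the fitted multilevel model):

```python
import statistics

def shrunk_estimate(player_obs, pop_mean, pop_var, noise_var):
    """Partial pooling in closed form (normal-normal model): a player's
    estimate is a precision-weighted blend of his own average and the
    population mean, so small samples are pulled harder toward the league."""
    n = len(player_obs)
    w = (n / noise_var) / (n / noise_var + 1 / pop_var)  # weight on the player
    return w * statistics.mean(player_obs) + (1 - w) * pop_mean

# Hypothetical numbers: the same .400 average, at different sample sizes
league_mean, between_var, within_var = 0.320, 0.001, 0.010
small_sample = shrunk_estimate([0.400] * 5,   league_mean, between_var, within_var)
big_sample   = shrunk_estimate([0.400] * 100, league_mean, between_var, within_var)
print(round(small_sample, 3), round(big_sample, 3))  # 0.347 0.393
```

A hot stretch over 5 observations is shrunk most of the way back to the league mean, while the same average over 100 observations is mostly believed — the same pooling-by-precision logic the full Bayesian model applies to intercepts and age slopes per player.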
Alexandria Bianco, Hospital Patient Flow Simulation, April 2021 (Michael Fry, Denise L. White)
Decisions taken without recognizing system-level patient traffic cause or worsen hospital congestion and the many associated healthcare delivery costs and quality problems. Without patient flow models, decisions that impact the entire system are taken with only a small subset of the system in mind.
A Cincinnati-based hospital would like to better understand its existing patient flow process to determine where the largest bottlenecks and system limitations occur within the hospital. The deliverable for this project was a model that accurately reflects the hospital's current flow process, built using discrete-event simulation in the Arena simulation software.
The model described herein reflects historical data obtained from the hospital ranging from the years 2017-2021. A multi-step approach was taken to deliver on the expected requirements from the hospital personnel. The first step in the process was to explore, analyze, and manipulate the data in preparation to be inserted into our Arena model. The second step in the process was to build out the model framework in Arena and insert the corresponding data from step 1.
Jeffrey Bogenschutz, Predicting March Madness with Regression and Tree Methodologies, April 2021 (Michael Fry, Paul Bessire)
The NCAA March Madness basketball tournament is notoriously difficult to predict. However, this does not stop millions of people from filling out brackets predicting the winners of each game of the tournament. This analysis uses quantitative approaches in the forms of linear regression, bagging trees, random forest, and boosted trees, as well as qualitative information, to make predictions for each game of the March Madness tournament. Regarding out-of-sample MSPE, the models based on regression and boosted trees perform best. Furthermore, the predicted brackets using regression and boosted trees were the top two performing brackets for the 2021 NCAA March Madness tournament.
Adam Deuber, A Statistical Analysis of the Game of Soccer, April 2021 (Alex Wolfe, Edward Winkofsky)
This capstone project takes a deeper look at the sport of soccer on an analytical basis. Utilizing data I collected myself, with many variables related to soccer statistics, the goal of this analysis is to see whether certain trends or understandings can be deduced from the game. Specifically, my study looks into the importance of the formations that teams run. Using regression analysis and visualizations, a better comprehension is given of what should factor into a coach's decision about which formation to run. Additionally, the variables in the dataset are examined for their respective effects on formational success and winning in general. My research also covers the best and worst teams in the data collected and the tendencies that define their successes or failures. Ultimately, this project seeks to give readers a better interpretation of the analytics movement in soccer and how formations, along with the other variables collected, affect the potential outcomes of games.
Seth Draper, Predicting MLB Offensive Player Salaries Based on Demographic and Previous Year Performance, April 2021 (Dungang Liu, Michael Platt)
The purpose of this research is to develop a model that can predict the salary of a Major League Baseball player for the following season. This research paper takes a deep dive into players' hitting statistics and performance, as well as their tenure and age. Using data on starting position players from 2000 to 2015, a final model was generated that can accurately predict future salaries based on a set of variables described in detail later in the paper. This model is only for position players, not pitchers. In addition, it is composed of purely hitting statistics, with WAR being the only variable that includes defensive ability. The variables in the final model include service, RBI, yearID, OBP, R, IBB, G, GIDP, AVG, SH, 1B, WAR, 3B, age, SO, and SB. The final model is a strong linear regression model with an R2 value of 0.55. This paper describes in detail the process of obtaining the data, cleaning, examining, modeling, and testing different models to ultimately produce the final model.
Katie Fasola, More than Touchdowns: An NFL Analysis, April 2021 (Alex Wolfe, Ed Winkofsky)
The National Football League, better known as the NFL, is a multi-billion-dollar industry. Millions of fans across the world cheer for these 32 teams every year, and people are continuously looking for ways to understand the game better. With that said, my capstone project is an extension of my final project from Data Wrangling in R (BANA 7025) with Professor Tianhai Zu. This project submission focuses on the various data points publicly available from the NFL and works to answer the question, “What all goes into winning an NFL game, and what teams are historically successful in the final standings?” Using the past 20 years’ worth of data, I seek to investigate this problem. My original project looked at various situations, including the importance of fan attendance, standings over the years, offense vs. defense, and individual game observations. In my extended research, I add other points of reference including the impact of weather, player statistics, coach information, passing yards, rushing yards, penalty yards per game, and rivalries. I hope to demonstrate proficiency in R through displaying my findings using R Markdown as well as flexdashboard with Shiny components. The main goal of my analysis is to inform my readers on what all goes into winning an NFL game. My hope is that the audience will better understand historic trends and performance from teams, players, and coaches alike.
Daniel Morgan, Analysis of Local Aircraft Using ADS-B, April 2021 (Dungang Liu, Jeff Shaffer)
The aim of this project is to explore local aircraft via Automatic Dependent Surveillance-Broadcast (ADS-B), a system by which aircraft broadcast details about their location and movement to ground stations and other aircraft. As of 2020, all commercial flights in the US are required to use this technology. Information is sent out several times a second on the 1090 MHz frequency and is available to anyone who wishes to decode it.
To gather the data, I used a Raspberry Pi computer with a Nooelec software-defined radio antenna. To decode and report the information, I made use of the open-source software dump1090. This application automatically decodes incoming transmissions and allows you to share information with FlightAware, a live flight-tracking website. Flight stats are aggregated across users, which allows for a higher degree of accuracy as well as a more robust infrastructure.
For the purposes of this report, I take a deeper look at data I logged during one week in March, because the sheer volume of data for any substantial length of time can become difficult and time-consuming to work with. One 24-hour period can produce over 3 million observations, and a week's worth of records resulted in 2 GB of data. To keep things simple, I logged information directly to a CSV file saved on a flash drive.
Christian A. Steege, Spotify Data Analysis, April 2021 (Yichen Qin, Michael Fry)
In this publication, multiple visualizations are used to discover insights on the surface of the Spotify dataset. These visualizations gave insights into variables like genre, popularity, and mode; they also showed that track popularity is skewed and that some variables, like energy, loudness, and acousticness, are strongly correlated. Multiple machine learning techniques were applied to the dataset to predict track popularity. Using our EDA to inform decisions about undersampling and variable selection, we found that boosted trees and SVR were our most effective algorithms at generating profit according to our profit function, drawing about 2.4 million in profit. Undersampling did not help out-of-sample MSE, but it did at times yield a higher profit margin on some algorithms by skewing the predictions upward and taking on more songs to 'produce' and 'distribute'. The clustering analysis gave us the insight that genre is not the most reduced form of song prototypes: there are three song prototypes characterized by clusters of attributes such as popularity, instrumentalness, tempo, energy, loudness, liveness, and acousticness.
Kelsey Sucher, Integrating Business Intelligence and Simulation in a Production Environment, April 2021 (Leonardo Lozano, Eric Webb)
My capstone project is an exercise in both simulation modeling and data visualization. I created a simulation model in Excel to simulate the production of a small, fictional company, including their demand, production, staffing levels, and inventory control strategies. I pulled the resulting data into Microsoft Power BI and created visualizations – one dashboard to be used by a production manager and another to be used by a member of the executive team. These dashboards displayed metrics critical to each of their roles and showed the importance of data visualization and business intelligence to truly understand what is happening in a business.
The value-add of this project to the simulated company is the ability to alter inventory, demand, production, and staffing, and explore different assumptions about the operation to observe their effect on the performance of the business. This enables the company to ensure that their policies are robust enough to handle uncertainty in their assumptions, and it can also be used for scenario planning to be ready for situations such as heightened demand. With the Excel and Power BI models natively linked, users need only edit the assumption input page in Excel and refresh Power BI to see the results. This is a user-friendly way to drive increasingly data-driven decisions in any business.
Waqas Tanveer, Patient Flow Data Restructure and Simulation Analysis, April 2021 (Michael Fry, Denise L. White)
A Cincinnati-based hospital wants to build a simulation model to understand the gaps and problems with the current hospital operations. The simulation model will be used to predict which departments need more space and analyze how growth will affect current capacity. Arena simulation software will be used to build the simulation model for the hospital. Before building the simulation model, historical data needs to be transformed into an analysis-friendly format, and a simulation analysis needs to be built to examine the patient flow in the historical data, summarizing the analysis into an input for the Arena simulation model. The following analysis outlines the steps taken from raw data to finalized analysis model with cumulative probability distributions for patient flow and data inputs for the Arena simulation model.
Linh Vu, Predicting Human Taste Preferences with Wine, April 2021 (Dungang Liu, Edward Winkofsky)
A large dataset of white wine from Portugal was collected. R, Python, and other analysis techniques are used in this paper to understand the distribution of the dataset and to predict human taste preferences based on variables derived from analytical tests at the certification step. Multiple regression methods were trialed on this dataset for model selection, and logistic regression outperformed the other methods. Among the variables given, some appear to have a positive impact on wine quality, e.g., pH and residual sugar, while others, such as volatile acidity and chlorides, have a negative impact. Some of these influential variables can be controlled during the preparation process of wine making; pH, for example, can be adjusted during grape concentration. This finding can help wine makers use the results of data analysis to improve wine quality during the production process.
Abhijith Antony, Kroger Supply Chain Data Analytics, August 2020, (Denise L. White, Stobart Wesley)
Kroger generates huge amounts of data in its supply chain (inventory, ordering, operations, and logistics information), sales (retail transaction data through every sales channel), items (data pertaining to Kroger items), customers, and promotions. Many of the problems Kroger faces, such as better store fulfillment, optimized ordering that reduces waste, and better customer satisfaction, can be addressed by leveraging these vast amounts of data. I was part of an analytics team at Kroger responsible for doing just that. We worked closely with store associates and used data to help them run their stores efficiently. I built customized data solutions that facilitate decision making for senior management and for store and division leaders, directly resulting in better store fulfillment and thus a better customer experience. I am part of two big transformations happening at Kroger, one in the infrastructure space and the other in process optimization, working toward the larger organizational goal of centralized automated ordering. I used analytical tools such as Power BI and the R programming language to access Kroger's enterprise data warehouse to realize these transformations.
Ashley Colbert, ‘Untappd’ Potential: Predicting Beer Preferences, August 2020 (Xiaorui Zhu, Yan Yu)
The craft beer market is worth $29.3 billion and continues to grow every year (National, 2020). Using data from the beer rating app Untappd, this project aims to find models that can predict beer ratings on both a consumer and a brewery level. The goal of the consumer model was to predict a consumer's beer rating given their past rating history, so that they can gauge their preference before making a purchase. This would result in more satisfied consumers: after seeing their predicted ratings for the beers on a tap list, they could order the one with the highest rating, which they are most likely to enjoy. A linear regression model resulting from stepwise variable selection had the lowest 10-fold cross-validation error of the three tested models and was the best model for this problem. The goal of the brewery model was to classify popular beers based on their average global rating from past consumers. This would enable breweries to test new beer concepts and predict whether they would be popular with their consumer base before going into production. A classification tree with an asymmetric cost function aimed at reducing false positives had the highest precision and lowest false positive rate on the test data set, making it the most effective of the five models tested for this problem. Being able to classify popular beers would enable breweries to reduce costs by understanding how much of each beer to produce, thereby cutting inventory and waste.
Robert Doering, Opioid Use Disorder in the Age of COVID-19, August 2020 (Caroline Freiermuth, Denise L. White)
Background: As communities across the world continue to struggle with the COVID-19 pandemic, patients with opioid use disorder (OUD) are facing increasing barriers to care. This study examined Emergency Department (ED) visits at a single urban academic medical center and Emergency Medical Services (EMS) calls in a major metropolitan area for opioid use disorder patients.
Methods: In a retrospective design, electronic health record (EHR) data was utilized for OUD patient encounters from 11/01/2019 to 09/30/2020.
Results: While EMS OUD-related calls were decreasing in the pre-pandemic period and increasing in the intra-pandemic period, ED visits were trending upward in both periods. ED OUD patient demographics were not significantly different between the pre- and intra-pandemic periods, and the OUD population was not admitted at higher rates during the pandemic (p > 0.05). Forecasting models show seasonal variation in ED OUD visits peaking in the summer, and logistic regression and classification tree modeling were used to predict ED OUD admissions.
Conclusions: The OUD patient population frequently utilizes ED and EMS resources. While early data has shown that the pandemic has not significantly changed the OUD patient population, better understanding of the needs and trends of these patient encounters can hopefully lead to better resource allocation especially during a global pandemic.
Vamsi Chand Emani, Claims Loss Run Dashboards, August 2020 (Siaw Teckmon, Denise L. White)
Claims are the most important aspect of the insurance business. An insurance claim is a formal request by a policyholder to an insurance company for coverage or compensation for a covered loss. The insurance company validates the claim (or denies it); if approved, it issues payment to the insured. Handling claims and their financials has become all the more important during the current COVID era and going forward. Over the last several months, Starr has been capturing claims data daily. However, this data is still not available to end users such as claims managers, claims adjusters, and underwriters. The purpose of the Claims Loss Run Dashboards is to bridge this gap and support data-driven decision making. The Loss Run Report, integrated within the Starr360° platform, will serve all business units across the enterprise. As part of the analysis, both policy details and claims details are studied and integrated to give a 360° view of the business process from policy to claim. Data models are built after understanding the data capture process and merging the data sources. Loss calculation metrics are developed following industry and enterprise standards. A high-level summary across lines of business (LOBs) and locations is developed, and policy-level summary and detail tabs give a complete breakdown of the financials. The future scope is to integrate the dashboard into Starr360°, the enterprise document management system.
Sidharth Gaur, A Report on Credit Model Monitoring, August 2020 (Yan Yu, Mark Peters)
Axcess Financial has recently deployed a new credit risk model (in use since July 2020) to screen loan applications. This model (also referred to as the Gen2 model) calculates an applicant's score and Risk Grade, based on which the applicant is either approved or denied a loan.
The Gen2 model is used for screening both retail and online submitted applications, but only for Direct Mail Pre-Approved (DMPA) applicants. DMPA applicants are those who are selected based on “pre-screening” by another risk model and pre-approved to receive direct mail from Axcess Financial as part of its marketing initiatives.
The objective of this project is to develop a reporting mechanism for ongoing model monitoring. Additional tasks included in the reporting are the validation of certain system routing logic, such as applicant attribution to direct mail vs. organic and the underwriting model used for screening.
Pratik Gundawade, Using Big Data for Freight Management, August 2020, (Vishnu Kodukulla, Peng Wang)
Freight shipping companies transport a large volume of cargo every day, and the many operations in the shipping process generate a large amount of data, including details of the freight, shipping times, shipping costs, RFID data, IoT data, etc. These data yield valuable insights for effective route and cost management, thereby increasing profit. Such organizations therefore tend to focus on an effective data management strategy using big data, commercial analytics, and third-party data. Integrating the data poses challenges in data quality, trust in its accuracy, and integration into a data lake. The project aims to design an analytics platform providing a Big Data Management capability for a freight shipping company. This platform will support the integration of disparate data sources into a Data Lake by building data pipelines. An important aspect of the project is reporting on the data objects migrated and made available in the big data lake. My project focuses on building data visualization reports that communicate data migration progress to the data solutions team week over week and month over month.
Ethan Hoyds, NLP for Code Generation, August 2020 (Michael Fry, Dungang Liu)
One of the greatest ironies in the modern field of software development is the vast number of software developers still needed to implement software projects. While a key paradigm in software engineering has been “reusable code”, there have been few attempts to abstract business logic into universally reusable code modules. A large number of initiatives have led to the creation of frameworks that make certain patterns easier to implement. Yet the total number of software development jobs has not decreased, and to this day a vast number of developers each repeat the implementation of similar logic specific to their own organizations.
A second important component of every software implementation is the translation of customer needs into actionable functional requirements. Until the late 1990s, the dominant approach to software implementation was the “Waterfall Method,” which required a large effort to collect all the requirements from the customer before any code could be written. While this method worked well enough in the 1970s and 1980s, it became increasingly costly as competition in the technology and software industries increased drastically throughout the 1990s. This led to the development of the “agile” method of software development, which emphasized small iterative cycles and let customers express their needs in the same form as they would write an email, rather than being trained to express them as formal functional requirements.
With the advent of the “Era of Machine Learning” the functional capabilities exist for an operational, enterprise-level platform for automated software development.
Ali Ismail Jumani, Data Analytics Project at Corporation for Findlay Market, August 2020 (Dungang Liu, Sue Wernke)
This report focuses on the data analytics project (MS capstone) completed at the Corporation for Findlay Market. The Corporation for Findlay Market is a private non-profit organization that has recently decided to delve into analytics, mainly to monitor the daily affairs of the marketplace through the creation of internal portals. Tableau was used as the analytical tool to convert the data into visualizations that track organizational affairs and support informed decisions. Five dashboards were created in total to present the data for Parking, Traffic, Complaints & Solutions, Findlay Shopping App, Market Sales and Biergarten Sales.
Mohammad Zain Khaishagi, Unsupervised Segmentation of Users, August 2020 (Yan Yu, Christian L’Heureux)
Unsupervised learning has been the go-to choice for segmenting data when labels are not readily available or are expensive to obtain. It uses machine learning algorithms to divide the data into separate clusters. A number of methods can accomplish this, such as Hierarchical Clustering, K-Means, DBSCAN, and Expectation Maximization Clustering. While a large number of methods are available, not all of them are applicable to categorical data. Further, some methods require the number of clusters as an input, while others find the optimal number automatically. In the field of marketing, finding the right target audience is a crucial step because of cost constraints and the efficacy of marketing campaigns. If the right message is sent to the right group, it can increase customer engagement and help generate higher profits at a lower cost.
In this project, the goal is to find a segmentation of customers such that they have distinct qualities. Various unsupervised clustering algorithms are applied to the categorical data. A comparison is made using visual techniques such as Silhouette plots, Elbow plots, Scree plots and PCA Plots.
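As a minimal illustration of the elbow diagnostic mentioned above, the sketch below runs a toy k-means on numeric 2-D points and computes the within-cluster sum of squared errors (SSE) for several values of k; the project itself works with categorical data and richer methods, so this is purely conceptual:

```python
def kmeans(points, k, iters=20):
    """Plain k-means on 2-D points; returns (centroids, total within-cluster SSE)."""
    centroids = list(points[:k])  # simple deterministic initialization
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: (p[0] - centroids[i][0]) ** 2
                                      + (p[1] - centroids[i][1]) ** 2)
            clusters[nearest].append(p)
        centroids = [(sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
                     if c else centroids[i]
                     for i, c in enumerate(clusters)]
    sse = sum(min((p[0] - cx) ** 2 + (p[1] - cy) ** 2 for cx, cy in centroids)
              for p in points)
    return centroids, sse

# Two well-separated blobs: SSE drops sharply from k=1 to k=2, then flattens;
# the "elbow" in that curve suggests k=2.
pts = [(0.0, 0.0), (0.1, 0.2), (-0.1, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
sse_by_k = {k: kmeans(pts, k)[1] for k in (1, 2, 3)}
```

An elbow plot is simply `sse_by_k` plotted against k; the chosen k is where additional clusters stop buying much SSE reduction.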
Abhiteja Achanta, Multi-Class Text Sentiment Analysis, August 2020 (Yichen Qin, Liwei Chen)
The goal of sentiment analysis is to extract human emotions from text. This project applies various machine learning algorithms to predict a reviewer's sentiment from their textual review of Amazon food products. Metrics such as prediction accuracy and precision/recall are presented to gauge the success of these different algorithms. The main purpose of this project is to introduce and apply different feature engineering techniques that convert text to numeric data, and to see how different machine learning and deep learning algorithms perform on this data.
Jeevisha Anandani, Recommender System, August 2020 (Peng Wang, Michael Thompson)
This project compares two approaches to building a movie recommender system. The first implements a Bayesian Network by learning conditional probability distributions from the data; Bayesian Networks belong to a class of algorithms known as probabilistic graphical models. The network is then used to predict ratings, with Maximum Likelihood Estimation used to estimate the conditional probability distributions. The second approach, Collaborative Filtering (Matrix Factorization), is applied to produce recommendations, and its results are compared with the first approach. The Alternating Least Squares algorithm was implemented in PySpark to predict user ratings.
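The alternating least squares idea can be illustrated with a rank-1 sketch on a small dense matrix; PySpark's ALS additionally handles sparse ratings, regularization, and higher ranks, so this is only the core alternation, with invented data:

```python
def als_rank1(R, iters=50):
    """Rank-1 alternating least squares on a dense ratings matrix R (list of lists).
    Fits R[i][j] ~ u[i] * v[j]; each update is the closed-form least-squares
    solution for one factor while the other is held fixed."""
    m, n = len(R), len(R[0])
    u = [1.0] * m
    v = [1.0] * n
    for _ in range(iters):
        # Fix v, solve for each u[i] (minimizes sum_j (R[i][j] - u[i]*v[j])^2).
        denom_v = sum(x * x for x in v)
        u = [sum(R[i][j] * v[j] for j in range(n)) / denom_v for i in range(m)]
        # Fix u, solve for each v[j].
        denom_u = sum(x * x for x in u)
        v = [sum(R[i][j] * u[i] for i in range(m)) / denom_u for j in range(n)]
    return u, v

# A perfectly rank-1 ratings matrix: user factors (1, 2, 3) x item factors (1, 2).
R = [[1, 2], [2, 4], [3, 6]]
u, v = als_rank1(R)
pred = u[1] * v[1]  # predicted rating for user 1, item 1 (true value is 4)
```

Because each alternating step has a closed-form solution, the factors converge quickly on rank-1 data, recovering every rating exactly up to a scaling shared between u and v.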
Vidhi Bansal, Bike Rental Prediction Analysis, August 2020 (Yan Yu, Dungang Liu)
Bike sharing systems are a means of renting bicycles where the processes of obtaining membership, rental, and bike return are automated via a network of kiosk locations throughout a city. Currently, there are over 500 bike-sharing programs around the world. Such systems usually aim to reduce congestion, noise, and air pollution by providing free or affordable access to bicycles for short-distance trips in an urban area as an alternative to motorized vehicles. The data generated by these systems is attractive to researchers because the duration of travel, departure location, arrival location, and time elapsed are explicitly recorded. Bike sharing systems therefore function as a sensor network that can be used to study mobility in a city. The ability to predict the number of hourly users allows the entities (businesses/governments) that oversee these systems to manage them more efficiently and cost-effectively. Our goal is to use and optimize machine learning models that effectively predict the number of ride-sharing bikes using available information about that time/day.
Rahul Bhasin, An Analysis of UC Center for Business Analytics Project Feedback using NLP, August 2020, (Mike Fry, Andrew Harrison)
The University of Cincinnati Center for Business Analytics helps its corporate clients solve business problems by developing analytical solutions using data mining techniques and popular analytics tools. It supports its member firms through a variety of options, such as contracted projects, capstone projects, and case studies. For case studies, the client firms outline the business problem and provide datasets for graduate students to analyze; upon completion of a case study project, both clients and students submit feedback forms. These forms contain opinions about the BANA 7095 course (Graduate Case Studies in Business Analytics): aspects they liked about the project and aspects that could be improved. The scope of our analysis is to delve into this feedback to analyze student satisfaction and gauge emotions using a lexicon approach. Additionally, we identify key topics and critical issues using the Latent Dirichlet Allocation (LDA) and Non-Negative Matrix Factorization (NMF) algorithms, capture context using Word2Vec word embeddings, and examine associations between review words using n-grams. This analysis reveals important insights about case study projects which can be used to improve students' and member firms' future project experiences.
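The n-gram step can be sketched as follows; the feedback strings below are invented for illustration, and the real analysis runs over the collected feedback forms:

```python
import re
from collections import Counter

def ngrams(text, n=2):
    """Lowercase word n-grams extracted from raw feedback text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

feedback = [
    "great project scope and great mentor support",
    "project scope was unclear at the start",
]
bigram_counts = Counter(g for f in feedback for g in ngrams(f, 2))
top = bigram_counts.most_common(1)[0]  # ("project scope", 2)
```

Counting which bigrams or trigrams recur across many feedback forms is what surfaces the word associations mentioned above.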
Puneet Bhatia, Refreshing Student Transfer and DFW Dashboards, August 2020 (Michael Fry, Brad Miller)
In my summer internship at the Office of Institutional Research at the University of Cincinnati my project is to refresh the Student Transfer Dashboard with 2018 and 2019 student data and DFW Dashboard with the Spring 2020 cohort data. The process involves fetching data from the Catalyst Reporting Tool (CaRT) using queries and then creating the required input file for Tableau dashboard through data manipulation using SAS. The final step is to make the required reporting changes in Tableau to make it consistent with the new data.
Nikhila Nayana Bobba, News classification into fake or real news, August 2020 (Yichen Qin, Liwei Chen)
Text classification is a supervised machine learning task used to automatically assign text documents to one or more predefined categories, training a classifier on a labelled dataset. Fake news is not only used to influence politics and promote advertising; it has also become a method to stir up and intensify social conflict. Intentionally untrue stories mislead readers and cause growing mistrust among people. I used text classification to classify news into two bins: real and fake.
Vallabh Reddy Burla, Apparel Image Classification Using Convolutional Neural Networks, August 2020, (Dungang Liu, Yan Yu)
Motivation: The internet has only been getting more expansive, and this expanse brings unstructured data, like images, which are difficult to organize because the subjects in an image are not readily interpretable by machines. But deep learning innovation has picked up momentum, and it is now much easier to build models that can make sense of image data and classify its contents.
Problem Statement: Classify images from 10 apparel categories. The dataset used is the Fashion MNIST dataset by Zalando Research which consists of 28x28 grayscale images of apparel. The train set contains 60,000 labeled images and the test set contains 10,000.
Approach: I used Convolutional Neural Networks (CNNs). They are effective because convolutions work on images directly and learn to look for patterns in the input; deeper convolutions look for more complex patterns. My final model has 3 convolutional layers, 3 dropout layers, 1 max pooling layer, 1 flatten layer, and 1 output layer. There are 100 neurons in each convolutional layer and 10 neurons in the output layer, one for each output category. I trained the network for 20 epochs with a learning rate of 0.0001.
Results: The final training accuracy is 94.53% and testing accuracy is 92.52%.
Conclusions: CNN can make highly accurate predictions for this dataset with a small network. The deeper networks I built with a much higher number of neurons or layers fared worse than the smaller network.
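The convolution operation at the heart of the approach above can be illustrated with a plain "valid" 2-D cross-correlation; this is a conceptual sketch with a hand-picked kernel, not the trained network itself, whose kernels are learned from the data:

```python
def conv2d(image, kernel):
    """'Valid' 2-D cross-correlation: the core operation a CNN layer applies."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(ih - kh + 1):
        row = []
        for c in range(iw - kw + 1):
            row.append(sum(image[r + i][c + j] * kernel[i][j]
                           for i in range(kh) for j in range(kw)))
        out.append(row)
    return out

# A vertical-edge detector responds strongly where dark meets light.
image = [[0, 0, 1, 1]] * 4    # left half dark, right half light
kernel = [[-1, 1], [-1, 1]]   # vertical-edge filter
fmap = conv2d(image, kernel)  # 3x3 feature map, peaks at the edge column
```

Training a CNN amounts to learning kernel values like these so that the resulting feature maps highlight whatever patterns distinguish the apparel categories.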
Xingchen Chen, Direct Mail Campaign Geography Optimization, August 2020, (Peng Wang, Farhad Rahbardar)
This capstone is part of a work project I participated in during my internship at Axcess Financial.
The main goals of this project are (1) to find insights into the relationship between the current direct mail geographic distribution, the trade area, and the direct-mail-driven customer concentration, checking how the current direct mail distribution works, and (2) to make recommendations to optimize mail distribution.
The major steps were to (1) identify three concentration areas: the trade area (where 75% of new customers come from), the direct-mail-driven customer concentration area (where 75% of new direct-mail-driven customers come from), and the mail concentration area (where 75% of direct mail is dropped); (2) check how those three concentration areas relate, compare campaign performance in overlapping vs. non-overlapping areas, and identify opportunity areas to improve direct mail efficiency; and (3) make recommendations based on the comparison of performance.
Customer data and direct mail data were pulled from an internal Snowflake database using SQL. Summary statistics and interactive dashboards were created in Tableau to gain insights into the characteristics of the three concentration areas and to see how overlapping and mismatched areas are distributed. The results show that the three concentration areas largely overlap and the current direct mail distribution works well, but room for improvement exists. Recommendations are made based on these findings.
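The 75%-concentration idea can be sketched as follows; the ZIP codes and counts below are invented, and the actual analysis was performed in Tableau over the Snowflake data:

```python
def concentration_area(counts, share=0.75):
    """Smallest set of ZIP codes that together account for `share` of all
    customers: sort by count descending and take the shortest such prefix."""
    total = sum(counts.values())
    picked, covered = [], 0
    for zip_code, n in sorted(counts.items(), key=lambda kv: -kv[1]):
        if covered >= share * total:
            break
        picked.append(zip_code)
        covered += n
    return set(picked)

customers = {"45202": 50, "45203": 25, "45219": 15, "45220": 10}
trade_area = concentration_area(customers)  # ZIPs covering 75% of customers
mail_area = {"45202", "45219", "45220"}     # hypothetical mail-drop area
overlap = trade_area & mail_area            # where mail already hits the trade area
```

Comparing `trade_area`, the direct-mail-driven area, and `mail_area` as sets makes the overlap and mismatch regions explicit, which is exactly the comparison the dashboards visualize.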
Zhuo Chen, Insurance Wholesalers Activities Analysis, August 2020, (Michael Fry, Harlan Wahrman)
The objective of this project is to enhance future sales of a life insurance product which is sold through banks to individual customers. The models use the past records of wholesalers’ activities and bank representatives’ sales performance related to this product. The distribution and wholesaling team will use the development of this project to optimize the wholesaling strategy and guide wholesalers’ daily activities. Moreover, due to the COVID-19 pandemic, bank representatives’ meeting preference with wholesalers has changed from in-person to online media. This change challenges the data scientist to find the most effective activities during pandemic and to provide recommendations to wholesalers. Market-mix modeling (MMM) and unsupervised learning are used to evaluate the different activities’ impact on sales performance, especially in 2Q 2020 due to challenges caused by the pandemic. Business recommendations are provided to optimize the wholesalers’ activities and the life insurance company’s business strategy.
Himanshu Chhabra, Uber & Lyft Price Prediction, August 2020, (Yan Yu, Dungang Liu)
Ridesharing services are companies that match drivers of private vehicles with those seeking local taxicab-like transportation. Ridesharing services are available mostly in large cities in many countries. Some of the biggest names in the industry are Uber, which exists in 58 countries and whose name is almost synonymous with ridesharing services, and Lyft, which covers many American cities. Uber and Lyft are both American multinational ride-hailing companies offering services that include peer-to-peer ridesharing, ride service hailing, and a micro-mobility system with electric bikes and scooters. Their platforms can be accessed via websites and mobile apps. In California, Uber is so dominant that it operates under the jurisdiction of the California Public Utilities Commission. Ridesharing systems generate a lot of data that can be used to study mobility in a city. The ability to predict the peak time of day and day of the week allows the businesses to manage these systems more efficiently and cost-effectively. Our goal is to use and optimize machine learning models that effectively predict the price using the available information about the time/day and the weather conditions.
Vincent Chiang, Heart Disease Prediction using Various Factors and Determination of Key Factors Leading to Heart Disease, August 2020 (Leonardo Lozano, Yan Yu)
Heart disease is a major issue in the United States, affecting millions of people, and it is the number one cause of death in the country. As such, there is demand for predictive models that take health data and determine whether a person has a high probability of heart disease. In this paper, attempts were made to formulate high-accuracy predictive models using classification trees, random forests, and boosting, and to improve upon the models and their default parameters.
Kevin Dalton, A modern Bayesian workflow approach to actuarial non-life rate-making in R with the brms package, August 2020, (Michael Thompson, Peng Wang)
Bayesian Generalized Linear Models are currently not used extensively in the insurance industry despite their discussion in the actuarial literature. We model a typical automobile claims dataset using a modern Bayesian data analysis workflow to illustrate and develop this workflow in an actuarial setting. Using the brms package we generate the parameters of a theoretical policy premium based on policyholder characteristics and explore the use of zero-inflated count distributions in a generalized linear multilevel model context.
Manoj Kumar Eega, Identifying Credit Card Fraud using Machine Learning, August 2020, (Yan Yu, Dungang Liu)
It is important that credit card companies can recognize fraudulent credit card transactions so that customers are not charged for items they did not purchase. Even though fraudulent transactions are a very small proportion of total transactions, they can hamper consumer sentiment. This is damaging to the company, as the customer might stop using the card, and to the country, because widespread negative sentiment puts consumption-based economies at risk. So I decided to build a model that can accurately classify fraudulent transactions. I found this problem particularly interesting because it involves a huge class imbalance; it is a challenge to address this imbalance while simultaneously building models with good recall and accuracy. I want to use all available techniques to address the class imbalance and build the model with the highest predictive capability. This is a real-world problem with real-world data, and all transaction monitoring companies, such as credit card companies, banks, and insurance companies, can make use of this project.
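The point about needing recall as well as accuracy under class imbalance can be made concrete with a small sketch; the transaction labels below are invented:

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for the positive (fraud = 1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# 98 legitimate + 2 fraudulent transactions: predicting "all legitimate"
# scores 98% accuracy but 0% recall, which is why accuracy alone misleads here.
y_true = [0] * 98 + [1] * 2
y_all_zero = [0] * 100
y_model = [0] * 98 + [1, 0]  # catches one of the two frauds, no false alarms

print(precision_recall(y_true, y_all_zero))  # (0.0, 0.0)
print(precision_recall(y_true, y_model))     # (1.0, 0.5)
```

The second model is clearly more useful to a fraud team despite having slightly lower raw accuracy than it could reach by never flagging anything, which is the trade-off the abstract describes.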
Emily Fischer, COVID-19 Impact on Strength Training, August 2020 (Leonardo Lozano, Alex Wolfe)
The purpose of this observational study is to determine the impacts of the SARS-CoV-2 pandemic on a weightlifter's strength. The data covered the period from 1/1/2020 to 6/22/2020 and included 60 workouts before gyms closed, 50 at-home workouts, and 3 workouts after gyms reopened (the last set was excluded because there were not enough data points to form a trend). Without appropriate equipment available for at-home workouts, the user saw decreases in strength in the bench press (-4.35%), bent-over row (-5.71%), deadlift (-11.21%), overhead press (-2.78%), and squat (-10%), which appear to be due to a reduction in overall sets per week. The user saw increases in strength in the bicep curl (+3.17%), chest dips (+5.4%), and incline bench press (+4.36%), likely because these exercises differed little between the pre-COVID and gym-closure periods. With the SARS-CoV-2 virus continuing to spread in the United States, the likelihood of gyms closing, and the risk of attending them, will remain high. The solution to this ongoing challenge was to create a simulation that quickly generates effective workouts. The final simulation allows the user to select their preferred exercises, goal (power/strength/endurance), reps, and sets, and then validates whether they would likely complete the workout within the selected time constraint. This provides an easier way to create workouts and keep the user motivated through these uncertain times.
Himaja Gaddam, Analysis of Household Electric Power Consumption, August 2020, (Yichen Qin, Dungang Liu)
The energy sector has been an important driver of industrial growth over the past century, providing fuel to power the rest of the economy. Many things in our lives, from lighting our rooms to operating heavy machinery, run on electricity. Nowadays, countries are concerned both with providing sufficient energy to consumers and with optimizing total energy demand. Since a significant part of that energy is consumed by the household sector, optimal consumption of energy at home is of great importance. To better regulate energy production, it is important to understand this sector's energy needs, which can be done by analyzing past consumption data. This report presents an analysis of a household's electric power consumption collected every minute for 4 years. The data is analyzed after aggregating it by hour, and different models are trained on the aggregated training datasets using VAR and LSTM approaches.
Rasesh Garg, Movie Recommender System, August 2020 (Michael Fry, Peng Wang)
Recommender systems are a crucial aspect of digital businesses today. Whether one is shopping on Amazon, watching a movie on Netflix, or listening to songs on Spotify, one wants to have personalized recommendations that can save them the hassle of searching through an overwhelming number of choices. A sound recommendation system helps companies enhance the user experience and engage with their customers, resulting in higher revenues for the company. This project explores building a Movie Recommender system using the historical data of the user movie ratings. It uses two popular approaches for this – content-based filtering and collaborative filtering. While content-based filtering is agnostic to users and based only on the content of the movie, collaborative filtering is more personalized and considers the historic user ratings. This project report describes these approaches, the mathematics behind them, and the results obtained using a real-life public data set.
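Content-based filtering, as described above, can be sketched with cosine similarity over item feature vectors; the movie titles and binary genre vectors below are hypothetical, and a real system would use far richer content features:

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical binary genre vectors: [action, comedy, drama, sci-fi].
movies = {
    "Movie A": [1, 0, 0, 1],
    "Movie B": [1, 0, 0, 1],
    "Movie C": [0, 1, 1, 0],
}
liked = "Movie A"
scores = {title: cosine(movies[liked], vec)
          for title, vec in movies.items() if title != liked}
best = max(scores, key=scores.get)  # "Movie B": identical genre profile
```

Because the score depends only on the movies' content, this approach works for brand-new users, whereas collaborative filtering needs their rating history; that is the trade-off the abstract contrasts.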
Saket Kumar Garodia, Sentiment Trend Analysis on Twitter data to analyze Insurance Risk, August 2020, (Michael Fry, Naziya Rehman)
Any insurance company must answer a few questions before insuring a product. Which products should it insure? What premiums should it charge for the products it insures? What risk does it take on in insuring them? These questions are important because companies in the insurance industry face many uncertainties, so a detailed analysis is useful and can save considerable money.
In the era of artificial intelligence, social media can be leveraged to answer a lot of questions for insurance companies. People show their appreciation as well as frustration towards a product through social media. If there is a way to understand the sentiments expressed about a product, it can be very useful.
This project analyzes tweets to understand how their sentiment changes over time. This analysis will help Great American Insurance in its underwriting decisions.
Anjali Gautam, Prioritization of Calls, August 2020, (Michael Fry, Siddharth Krishnamurthi)
For a Collections Team in a bank, contacting customers via calls is one of the key methods of collecting a debt. The team must also consider various factors associated with this method, such as the amount of debt, the number of missed payments, the available calling agents, and more. Since not all customers are equally risky in terms of the potential loss of debt, calling all of them with equal priority is not effective; prioritizing customers is therefore important. To make calls efficiently, it is important to assign each customer a priority by leveraging the information the bank holds about them. In this project, I have explored account balances, information on missed payments, and probability scores (based on the probability of a customer missing more payments in the future) to define priorities using different approaches. These priorities will be used by the bank to focus its collection calls efficiently. Additionally, I have built a framework for KPI tracking that will be used to assess how well the defined priorities perform in the future.
Harshal Goswami, Movie Recommender System, August 2020, (Yan Yu, Liwei Chen)
As the volume of data in every business grows, the ability to extract meaningful information from it is both a time saver and a requirement for better decision making. The same is true for online entertainment platforms such as Netflix, Amazon Prime, Spotify, and many others. Recommender systems are at the forefront of solving this problem. These systems collect information from users to improve future suggestions. This paper describes the implementation of a movie recommender system via content-based, collaborative filtering, and hybrid algorithms using Python.
Prakhar Goyal, Digit Classification Model, August 2020, (Dungang Liu, Yan Yu)
The main objective of this capstone project is to classify a given image of a handwritten digit into one of 10 classes representing the integer values 0 to 9, inclusive.
This model is part of a collaborative project, the other part being an object detection model. Together, the two parts are designed to detect and recognize vehicle number plates. Once a license plate is detected, it undergoes processing, and the text data can easily be edited, searched, indexed, and retrieved.
I have tested 2 different machine learning models (SVM and CNN) to find the best model for the classification task. The best model for handwritten digit classification is SVM, with an accuracy of 99.35% on the test dataset while taking one-third the time of CNN to train and tune its hyperparameters.
Srujana Guduru, Netflix Movie Recommendation System using Collaborative Filtering, August 2020 (Yichen Qin, Dungang Liu)
We love Netflix for the movie recommendations it provides. Movie and content recommendation is very important for Netflix, since engaging users more deeply brings in more revenue. But dealing with human preferences and interests is extremely challenging. In many cases a subscriber may visit Netflix without knowing exactly what to watch, and if they do not find an interesting movie among their recommendations, there is a high chance they will leave the site. Recommendations are therefore heavily used to increase customer engagement on Netflix. Each subscriber is nuanced in what brings them joy and how that varies with context. Moreover, the tastes and preferences of customers may change over time, which further complicates the recommendation process. In this project, we focused on collaborative filtering, where the behavior of a group of users is used to make recommendations to other users: recommendations are based on the preferences of similar users. We used Surprise, a Python scikit for analyzing recommender systems that deal with explicit rating data. Recommendation models were built on the Netflix data using the ready-to-use prediction algorithms of the Surprise library: BaselineOnly, KNNBaseline with user-user similarity, KNNBaseline with movie-movie similarity, Singular Value Decomposition (SVD), and SVD++. The best recommendation model obtained was SVD using SGD and 5 factors, with a test RMSE of 1.131933.
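The winning model, SVD-style matrix factorization trained with SGD and 5 factors, can be sketched from first principles. This is a toy pure-Python version for illustration, not the Surprise implementation; the ratings and hyperparameters below are made up:

```python
import math
import random

def fit_mf(ratings, n_users, n_items, k=5, lr=0.05, reg=0.02, epochs=300, seed=0):
    """Matrix factorization fit by SGD: rating(u, i) ~ mu + p_u . q_i."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean baseline
    P = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - (mu + sum(P[u][f] * Q[i][f] for f in range(k)))
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (err * qi - reg * pu)  # gradient step on user factor
                Q[i][f] += lr * (err * pu - reg * qi)  # gradient step on item factor
    return mu, P, Q

def predict(model, u, i):
    mu, P, Q = model
    return mu + sum(pf * qf for pf, qf in zip(P[u], Q[i]))

# toy (user, item, rating) triples standing in for the Netflix data
ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 4), (2, 2, 2)]
model = fit_mf(ratings, n_users=3, n_items=3)
rmse = math.sqrt(sum((r - predict(model, u, i)) ** 2
                     for u, i, r in ratings) / len(ratings))
```

The regularization term keeps the factors small so the model does not simply memorize the training ratings; Surprise's SVD adds per-user and per-item bias terms on top of this.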
Praveen Guntaka, Sentiment Analysis, August 2020, (Dungang Liu, Liwei Chen)
In this digital era, understanding the polarity of a text statement has emerged as an important capability for businesses, since it can reveal the extent of customer satisfaction or even suggestions about their products. Doing this manually is impossible at scale. Sentiment analysis is the classification of text by the human emotion it expresses, such as positive, negative, or neutral. It is a text analysis method for determining the polarity within a text, whether a whole document, a paragraph, or a sentence.
Our end goal is to build a model that predicts the polarity, or sentiment, of a review. We start with text cleaning and exploratory data analysis to understand the data better, then proceed to topic modeling, where we cluster the reviews into their potential topics. Once the topics are clustered, we move on to the key segment, sentiment analysis, where we determine the polarity of each review. The final model is built using a machine learning algorithm and evaluated with measures such as accuracy, precision, and recall.
At the end of this project we should be in a position to predict the polarity of a review for a company. Since we are building a reusable process, the same application can be applied to different datasets (with minor changes) to extract sentiment from them.
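As one illustration of the final polarity step, a minimal lexicon-based scorer might look like the following; the word lists are toy examples, not the project's actual model:

```python
POSITIVE = {"good", "great", "excellent", "love", "amazing"}
NEGATIVE = {"bad", "poor", "terrible", "hate", "awful"}

def polarity(review):
    """Return 'positive', 'negative', or 'neutral' by counting lexicon hits."""
    words = review.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

A trained classifier replaces the hand-picked lexicon with weights learned from labeled reviews, but the input and output have the same shape.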
Hardik Gupta, Music Genre Classification (Spotify Dataset), August 2020, (Dungang Liu, Liwei Chen)
A dataset of 32,833 songs, each with 12 audio features provided by Spotify, was analyzed to determine whether these audio features could be used to classify songs into 6 different genres. Genre classification is an important task for any online music streaming service. Among the 3 data mining techniques used (Linear Discriminant Analysis, Decision Trees, and Random Forest), Random Forest gave the best classification rate, 49.34%, improving on random chance by more than 3-fold.
Results indicate that genres like Rap and EDM are the easiest to classify as Rap songs are high on speechiness and EDM tracks are high on tempo and energy.
Shubham Gupta, Text Analytics: Predicting product recommendation by customers based on the reviews, August 2020, (Leonardo Lozano, Peng Wang)
People usually purchase products online after looking at their star ratings; after shortlisting a product, they often read several text reviews written by other customers who have purchased it. E-commerce companies build recommendation engines to market to a customer the specific products that have been purchased (and liked) by similar customers. When deciding whether a customer liked a product (to decide whether to recommend it to similar customers), the star rating given by the customer can easily be utilized. But the star rating may not capture the customer's entire sentiment about the product. Also, studies have shown that customers trust the content of written reviews over a potentially fake 5-star rating (which might be written by the product seller).
In this paper, we use text mining techniques on the reviews written by customers to predict whether a customer will recommend a product or not.
Soumya Halder, IBM HR Analytics Attrition Analysis, August 2020, (Yan Yu, Liwei Chen)
IBM is a multinational technology company whose products range from hardware and software to consulting services, alongside innovations through research. As a key player in the analytics industry, it has developed multiple game-changing products that drive down costs and build up accuracy. These products have been significant for the organization as well as for various clients across domains. However, for continued growth, it is essential to retain the workforce and make employees feel valued. According to the data, the attrition rate stands at 17%, which may not sound alarming by industry standards, but unwanted employee attrition can harm the company in the long run.
The problem is approached through extensive data analysis and predictive modeling to understand the key factors behind employee burnout, fatigue, and eventual departure. Based on the results, some of the changes management should deploy are providing a proper career path for younger employees, monitoring working hours, and incentivizing overtime, which will improve current employee satisfaction levels.
Xiaojing He, Carbo-loading Sales Analysis, August 2020, (Yichen Qin, David Curry)
Global sales of pasta, pasta sauce, pancake mix, and syrup have been growing fast in recent years and are forecasted to grow even faster. The U.S. is one of the most important markets for these items, especially pancake mix and syrup. This thesis uses advanced statistical procedures – Time Series Analysis, K-means Clustering, and Association Rules Analysis – to detect sales trends, forecast future sales, and improve promotional strategies for these items. Data are from the open-source Carbo-Loading database available from 84.51°, which includes records from more than 5 million transactions for these items from two large U.S. geographical regions in which a large retailer operates. ARIMA models and VAR models were compared in the Time Series Analysis section. ARIMA models are adopted as the final models for sales forecasting. Results suggest that zip-code clusters that complement those in 84.51°’s original geo-based management system will provide a more precise solution to regional segmentation. Finally, results from Association Rules Analysis suggest important cross-selling opportunities for Private Label Fettuccini and Ragu Cheese Creations Alfredo Sauce.
Sanjay Jayakumar, Using ensemble methods to estimate the unit sales of Walmart retail goods, August 2020, (Peng Wang, Yichen Qin)
Business forecasts help organizations prepare and align their objectives by providing a big picture of the future. Improving the accuracy of forecasts is therefore one of the integral factors in the business planning of organizations. This project explores different ways to forecast unit sales of products, with the objective of zeroing in on the model with the least error. A special focus is given to ensemble models. Ensembles are recognized as one of the most successful approaches to prediction tasks, and previous theoretical studies have shown that a key reason for this performance is diversity among ensemble members. This project compares the performance of two key ensembles, XGBoost and LightGBM, with a baseline seasonal ARIMA model.
Krithika Jayaraman, Image Caption Generation using Deep Neural Networks, August 2020 (Yan Yu, Peng Wang)
Automatic image captioning is an interesting application that combines computer vision and natural language processing. It involves recognizing the contents of an input image and using a language model to turn that understanding into a meaningful sentence describing the image. Image captioning has various applications, such as image indexing for content-based image retrieval (CBIR), which in turn has applications in e-commerce, education, advertising, and social media. Deep learning methods have demonstrated state-of-the-art results on this type of task. In this project, I present a comprehensive model that utilizes a pre-trained model such as ResNet for image inference and a Long Short-Term Memory (LSTM) based Recurrent Neural Network (RNN) to generate the text sequence. I use the Bilingual Evaluation Understudy (BLEU) score to quantify the accuracy of the captions and the performance of the model. This model achieved a BLEU score between 0.5 and 0.6 on the captions generated for the validation set.
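A simplified unigram version of the BLEU score used above can be computed as follows; full BLEU combines clipped n-gram precisions up to 4-grams, while this sketch uses unigrams only:

```python
import math
from collections import Counter

def bleu1(candidate, reference):
    """Unigram BLEU: clipped word precision times the brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    ref_counts = Counter(ref)
    # clip each candidate word's count by its count in the reference
    clipped = sum(min(n, ref_counts[w]) for w, n in Counter(cand).items())
    precision = clipped / len(cand)
    # penalize candidates shorter than the reference
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

score = bleu1("a dog sits on the grass", "a dog is sitting on the grass")
```

Clipping prevents a caption from scoring well by repeating one correct word, and the brevity penalty discourages trivially short captions.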
Pushpa Jha, Natural Language Processing: Topic Modeling, August 2020, (Peng Wang, Yan Yu)
Natural language processing (NLP) is one of the most prominent techniques for dealing with unstructured data effectively and quickly. The domain has applications in tasks such as machine translation, speech-to-text and text-to-speech conversion, sentiment analysis, chatbots, and text classification. For this project we explore one of its most useful applications, topic modeling. Topic modeling is an NLP application that helps identify the main content of a document, which can be used to filter out the important sections quickly and effectively. It is an unsupervised algorithm that uses the document-term matrix to identify the topics most relevant to each document. Extracting these document topics can be very helpful for automatic labeling and clustering of documents into major categories, on which further analysis can be performed to generate more insights about their contents. It is quite different from topic classification, which is based on supervised learning.
Jagruti Joshi, Google QUEST Q&A Labeling, August 2020, (Yan Yu, Peng Wang)
Computers are good at answering questions with single, verifiable answers. But humans are often still better at answering questions about opinions, recommendations, or personal experiences: subjective questions that require a deeper, multidimensional understanding of context, something computers are not yet trained to do well. Questions can take many forms: some have multi-sentence elaborations, while others may be simple curiosity or a fully developed problem. They can have multiple intents or seek advice and opinions. Some may be helpful and others interesting. Some are simply right or wrong.
Unfortunately, it is hard to build better subjective question-answering algorithms because of a lack of data and predictive models. That is why the CrowdSource team at Google Research, a group dedicated to advancing Natural Language Processing (NLP) and other types of Machine Learning (ML) science via crowdsourcing, has collected data on a number of these quality scoring aspects.
This project aims to use the new dataset to build predictive algorithms for different subjective aspects of question-answering and improve automated understanding of complex question-answer content. We will be focusing on Bidirectional Encoder Representations from Transformers (BERT) and DistilBERT, a distilled version of BERT that is smaller, faster, cheaper, and lighter.
Sankirna Joshi, Multilingual Toxic Comment Classification, August 2020, (Yan Yu, Peng Wang)
Conversational toxicity is defined as anything rude, disrespectful, or otherwise likely to make someone leave a discussion. Even a single toxic comment can derail an entire conversation, and the fear of such comments often deters people from sharing their opinions, reducing the quality of online discourse. The Conversation AI team[1], a research group founded by Jigsaw[2] and Google, builds technology to protect such voices. The goal of this project is to study and apply machine learning techniques to identify whether a comment is toxic or not. Given the multilingual nature of the internet, we also evaluate our model's performance on text from a different language. By identifying toxicity in conversations, we can deter users from posting such messages, encourage healthier conversations, and have a safer, more collaborative internet across the globe.
Sahil Kala, Office of Quality and Patient Safety Division of Information and Statistics Original Source Data Submitter Project (OSDS), August 2020, (Liwei Chen, Kenneth Goolsby)
New York State enacted legislation in 2011 that allowed for the creation of an All Payer Database (APD). The complexities of the health care system and the lack of comparative information about how services are accessed, provided, and paid for were the driving force behind this legislation. The goal of the APD is to serve as a key data and analytical resource for supporting policy makers and researchers.
The current APD-OSDS project I am working on is directed toward building a new system planned to replace the current legacy system, which was built on an IBM data integration tool. The project focuses on using technologies such as Informatica PowerCenter for data wrangling and SQL and Tableau for data analysis. The project is planned to enter phase 2 in mid-2021, when the data will be considered for statistical analysis using machine learning for prediction.
The APD is creating new capability within the Department, including more advanced and comprehensive analytics to support decision making, policy development, and research, while enhancing data security by protecting patient privacy through encryption and de-identification of potentially identifying information.
With the APD, the Department will have a comprehensive picture of the health care being provided to New Yorkers by supporting consumer transparency needs on quality, safety, and costs of care. The systematic integration of data technology and weaving of the previously fragmented sources of data will create a key resource to support data analyses that address health care trends, needs, improvements, and opportunities.
Surabhi Srinivasa Kamath, HR Analytics – Employee Attrition Prediction, August 2020, (Peng Wang, Dungang Liu)
Predicting attrition, whether an employee will leave the job or not, has become an important concern for institutions in recent years, for several reasons. In this project, we work with a Kaggle dataset in R to explore the factors related to employee attrition through exploratory data analysis, and build statistical models that can be used to predict whether an employee will leave the company. Finally, we explore different sampling techniques and dimension reduction techniques to find the important factors.
Sandeep Kavadi, Analytical Approach to designing Financial Hardship Programs for Consumer Loan Products, August 2020, (Michael Fry, Siddharth Krishnamurthi)
In this paper, we look at the design of financial hardship offers for various consumer loan products. Designing a financial hardship offer involves changing certain terms of an existing loan contract to make debt payments more affordable for borrowers in financial distress. There is a delicate balance of risk and reward involved in changing the terms of a consumer loan, so we use an analytical approach to balance these two quantities. We first quantify the risk as the expected loss. The reward is quantified as the net present value of the expected income cashflows. The probability of default is modelled as a logistic curve whose parameters are determined from historical data. The resulting objective function is a non-linear function of the decision variables. 'Near optimal' solutions are obtained using the Solver function in MS Excel, which uses the evolutionary solving method for non-smooth optimization problems.
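The risk-reward trade-off described above can be sketched numerically; all parameter values below are illustrative placeholders, not the paper's calibrated figures:

```python
import math

def default_probability(payment, a=2.0, b=0.004):
    """Logistic curve for probability of default: PD rises with the required
    monthly payment. a and b are placeholders; the paper fits its parameters
    to historical data."""
    return 1.0 / (1.0 + math.exp(a - b * payment))

def expected_npv(payment, n_months, balance, monthly_rate=0.005):
    """Reward minus risk: discounted payment stream weighted by survival,
    less the expected loss (loss-given-default assumed 100% for simplicity)."""
    pd = default_probability(payment)
    pv_income = sum(payment / (1 + monthly_rate) ** t for t in range(1, n_months + 1))
    return (1 - pd) * pv_income - pd * balance

# a hardship offer: lower payment, longer term, same outstanding balance
base = expected_npv(payment=600, n_months=24, balance=12000)
offer = expected_npv(payment=450, n_months=36, balance=12000)
```

Because the default probability enters multiplicatively, the objective is non-linear in the payment and term, which is why a non-smooth optimizer such as Excel's evolutionary Solver is needed for the real problem.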
Digvijay A Kawale, Detection of Disease Severity in Breast Cancer Cells, August 2020 (Yichen Qin, Dungang Liu)
Goal and Background: The Breast Cancer Wisconsin (Diagnostic) data set contains the information about the features that are computed from a digitized image of a fine needle aspirate of a breast mass. Feature variables describe characteristics of the cell nuclei present in the image. The data set contains 31 variables, one of them is Diagnosis type of the breast mass classifying them into Benign and Malignant type.
The aim of the study is to build the best model, using machine learning techniques like Logistic Regression, Decision Trees, and Random Forest, that uses the feature variables to predict the diagnosis type, which would help breast cancer patients identify malignancy in the early stages of a tumor.
Approach: We will be randomly selecting 80% of the data points from our data set as in-sample data for the modeling purpose and the remaining 20% will be used as out of sample for model evaluation and performance. The seed set for the random sampling of data is 13437586.
We will start by fitting the best logistic regression model using exploratory data analysis and variable selection techniques such as stepwise selection. After fitting the logistic regression model, we will move to the tree-based approach, starting from classification trees and proceeding to more complex techniques like random forests. Model performance will be evaluated based on out-of-sample predictions, and a final best model will be selected.
Major findings: It was found that as we move from simple models like Logistic Regression and Decision Trees to more complex models like Random Forests, predictive performance keeps improving but we lose interpretability. Depending on the goal of the study, i.e., interpretation or prediction, we should choose the right model among these. As prediction is our major concern, the random forest model was chosen because it had the best predictive performance for the diagnosis of breast cancer cells.
Mohammad Zain Khaishagi, Unsupervised Segmentation of Video Gamers, August 2020 (Yan Yu, Edward Winkofsky)
Unsupervised learning has been the go-to choice for segmenting data when labels are not readily available or are expensive to obtain. Unsupervised learning uses machine learning algorithms to divide the data into separate clusters. A number of methods can accomplish this, such as hierarchical clustering, k-means, DBSCAN, and expectation-maximization clustering. While many methods are available, not all of them are applicable to categorical data. Further, some methods require the number of clusters as an input, while others find the optimal number of clusters automatically. In the field of marketing, finding the right target audience is a crucial step because of cost constraints and the efficacy of marketing campaigns. If the right message is sent to the right group, it can increase customer engagement and help generate higher profits at a lower cost.
In this project, the goal is to find a segmentation of video gamers such that they have distinct qualities. Various unsupervised clustering algorithms are applied to the categorical data. A comparison is made using visual techniques such as Silhouette plots, Elbow plots, Scree plots and PCA Plots.
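A bare-bones k-modes-style procedure, one common way to cluster categorical data, can be sketched as follows; the gamer records are hypothetical, and the project relied on established implementations rather than this toy version:

```python
import random
from collections import Counter

def hamming(a, b):
    """Dissimilarity for categorical records: number of mismatched attributes."""
    return sum(x != y for x, y in zip(a, b))

def k_modes(records, k, iters=10, seed=0):
    """Toy k-modes: assign records to the nearest mode, then update each
    mode to the per-attribute majority value of its cluster."""
    rng = random.Random(seed)
    modes = rng.sample(records, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for rec in records:
            nearest = min(range(k), key=lambda j: hamming(rec, modes[j]))
            clusters[nearest].append(rec)
        for j, members in enumerate(clusters):
            if members:
                modes[j] = tuple(
                    Counter(column).most_common(1)[0][0] for column in zip(*members)
                )
    return modes, clusters

# hypothetical survey records: (platform, favorite genre, play frequency)
gamers = [
    ("console", "rpg", "daily"), ("console", "rpg", "weekly"),
    ("pc", "fps", "daily"), ("pc", "fps", "weekly"),
]
modes, clusters = k_modes(gamers, k=2)
```

Replacing means with modes and Euclidean distance with simple matching is what makes the k-means idea applicable to categorical attributes, which is exactly the difficulty the paragraph above raises.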
Sahit Koganti, A Study of Phase Change Materials Using Statistical Inference and Machine Learning, August 2020, (Dungang Liu, Gayatri Perlin)
This project is aimed at applying data science and machine learning methods to study the effects of elemental composition on the performance of phase change materials (PCM). This special class of materials is actively being pursued in electronics and optoelectronics research for the realization of cutting-edge data storage and information processing technologies. We have built a statistical inference method to identify highly desirable combinations of primary elements and dopants, an active area of research. In our analysis, we chose the primary elements to be Antimony (Sb), Germanium (Ge), and Tellurium (Te); the dopants for these combinations of primary elements are Ti, Bi, BiN, Mo, N, Sc, Al, AlSc, SiC, In, C, Si, SiN, O, W, Se, Er, Gd, and Sn. We extracted the independent and dependent properties of these alloys from various published papers and built a SQL database. Using these parameters, data pre-processing (such as outlier analysis and multi-correlation analysis) and exploratory data analysis were done to better understand the data distribution. Finally, we performed a clustering analysis based on the dependent parameters to understand the influence of the primary elements and dopants. Implementations of both the K-means and DBSCAN methods are demonstrated and well-defined clusters explained. These clusters show how the primary elements and dopants influence the activation energy and other dependent parameters of the crystalline, amorphous, and transition phases of these PCM alloys.
Ankit Kumar, Understanding Non Personal Promotions activity during COVID-19 and evaluating ThoughtSpot vs Qlik Sense, August 2020 (Michael Fry, Inder Rishi Kochar)
Pharmaceutical companies market their products to physicians through detailing, wherein a sales representative visits physicians to talk about a drug and provide free samples for trial purposes. During the COVID-19 pandemic, sales representatives cannot physically go to physicians' offices for detailing, so Non Personal Promotions (NPP) become even more important.
In this project, we evaluate the trend of Non Personal Promotion activities (clicks and impressions) during COVID-19. We examine which brands, vendors, vendor products, and DMAs were the key drivers for increases and decreases of NPP during COVID-19 pandemic by building four dashboards in ThoughtSpot.
We have also done an in-depth comparison of ThoughtSpot (an AI-driven search based visualization platform) and Qlik Sense (Novartis Oncology’s current software for reporting). Based on this comparison, we have provided a suggestion about whether or not to invest in ThoughtSpot for the future.
An important insight from our dashboards was that the Paid Social channel contributed the most to the increase in clicks and impressions from Feb-Mar'20. This may be because more ads were shown in the Paid Social channel, or because physicians preferred the Social channel over Search and Display.
The team decided not to invest in ThoughtSpot this year because Qlik is more suited to its needs based on the existing dashboards and anticipated future needs.
Priya Kumari, Consumer Complaints Classification using Traditional Machine Learning and Deep Learning Models, August 2020, (Peng Wang, Yan Yu)
Unstructured text data is everywhere on the internet, in the form of emails, chats, social media posts, complaint logs, and surveys. Extracting and classifying this text can generate many useful insights that businesses can use to enhance decision-making. Text classification is the process of categorizing text into predefined classes. Using Natural Language Processing (NLP), text classifiers can automatically analyze text and assign a set of predefined tags or categories based on its content. Lately, deep learning approaches have achieved better results than earlier machine learning algorithms on tasks like image classification, natural language processing, and face recognition. The success of these deep learning algorithms relies on their capacity to model complex, non-linear relationships within the data. This study covers supervised learning models and deep learning models for multi-class text classification and investigates which methods are best suited to the task. The classifier assumes that each new complaint is assigned to one and only one category.
Heng Li, Forecasting Stock Returns Using Machine Learning Methods, August 2020 (Yan Yu, Denise White)
Forecasting stock returns is an important topic in the finance industry. However, the stock market has high volatility, which makes price movements hard to predict. Eugene Fama and Kenneth French introduced the Fama-French three-factor model in their research paper Common Risk Factors in the Returns on Stocks and Bonds (1993). The traditional Fama-French three-factor model uses conventional multiple linear regression, which remains powerful for evaluating stocks and comparing investment results when stocks are held for different periods. In recent years, however, machine learning methods have offered advantages in computing speed and forecast accuracy. Therefore, in this project, we evaluate the performance of both traditional linear models and machine learning models.
In this project, we applied multiple linear regression, univariate linear regression, random forest, XGBoost, and Artificial Neural Network models. All models selected Market Excess Return (Mkt.RF) as the most important factor, followed by SMB and then HML. However, the machine learning methods were not able to outperform the linear models in terms of accuracy. At the end, we briefly discuss the possible reasons and the project's limitations.
Komal Mahajan, Airline Portfolio Analysis: Selecting a profitable airline amidst the COVID-19 pandemic, August 2020, (Yan Yu, Andrew Harrison)
The COVID-19 pandemic is unprecedented and catastrophic, and since its inception in China last year, its impact and the prospects for a cure have remained difficult to anticipate. The virus has put many things on hold, leading to lockdowns and complete shutdowns in most countries for at least a month. Beyond human casualties, its ramifications include a slowdown of the world economy, affecting every industry's operations and stock prices. Time-series forecasting is one of the most frequently encountered applications in the data world. A company's financial data (stock prices, revenues, etc.) are collected at regular intervals and different scales, such as daily, weekly, seasonally, and yearly, along with an overall trend.
Modeling a time series and predicting future values is an important skill. One simple but powerful method for analyzing and predicting a time series is the additive model. For our study, we considered airline industry performance amidst the pandemic and built different additive models for the time-series data using the Prophet package developed by Facebook for time series forecasting. Because our study captures the impact of COVID-19 on the airline industry, we selected the model containing data points from February to May 2020, without splitting further into training and test records. It gave the lowest MAPE and MAE among the additive models we built.
Tanya Malaiya, Improving Affirmative Action, August 2020, (Peng Wang, Liwei Chen)
The data for this project has been provided by the Office of Equal Opportunity and Access (OEOA) at the University of Cincinnati. The OEOA is responsible for monitoring and auditing the workforce activities taking place across the university to ensure that they are compliant with the Affirmative Action Plan.
This project aims to analyze 5 years of Applicant Flow Logs describing the candidates for every position the University hired for in this time. The data includes information about each candidate's status in 2 protected classes:
- Protected Veterans
- People with Disabilities
Currently, the proportions of employees belonging to these protected classes at the University of Cincinnati are significantly lower across most job groups and business units than the proportions in the relevant pools provided by the U.S. Department of Labor. By analyzing the given data, this project aims to develop a strategic action plan for improving the representation of these protected groups by understanding where in the hiring process (recruitment, interviews, selection) barriers exist for them.
Vipul Mayank, Instacart Market Basket Analysis, August 2020, (Yan Yu, Liwei Chen)
Whether you shop from meticulously planned grocery lists or let whimsy guide your grazing, our unique food rituals define who we are. Instacart, a grocery ordering and delivery app, aims to make it easy to fill your refrigerator and pantry with your personal favorites and staples when you need them. After selecting products through the Instacart app, personal shoppers review your order and do the in-store shopping and delivery for you. Instacart's data science team plays a big part in providing this delightful shopping experience. Currently they use transactional data to develop models that predict which products a user will buy again, try for the first time, or add to their cart next during a session. Recently, Instacart open-sourced this data - see their blog post on 3 Million Instacart Orders, Open Sourced.
I will use association rule mining techniques to obtain the following results for the business:
- Cross Selling: Offer the associated item when the customer buys any item from the store.
- Product Placement: Items that are associated (bread and butter, tissues and cold medicine, potato chips and beer) can be placed next to each other. When customers see them together, they are more likely to purchase them together.
- Affinity Promotion: Design the promotional events based on associated products to enhance the business.
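Association rule mining in practice would use a dedicated library, but the arithmetic behind a rule such as {bread} → {butter} is simple. A minimal sketch on a toy transaction list (the item names are illustrative, not drawn from the Instacart data):

```python
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"beer", "chips"},
    {"bread", "milk"},
    {"beer", "chips", "bread"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """P(consequent | antecedent): how often the rule holds when it applies."""
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    """How much more often the rule fires than if the items were independent."""
    return confidence(antecedent, consequent) / support(consequent)

print(support({"bread", "butter"}))       # joint support of the rule
print(confidence({"bread"}, {"butter"}))  # strength of {bread} -> {butter}
print(lift({"bread"}, {"butter"}))        # > 1 suggests a genuine association
```

Rules with high confidence and lift above 1 are the candidates for the cross-selling, product-placement, and affinity-promotion ideas listed above.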
Hridhay Mehta, Improving User Experience for IT Services, August 2020, (Michael Fry, Vikas Babbar)
A major responsibility of IT teams is not only to provide users with faster service but also to provide a positive experience throughout the engagement. The objective of this project is to improve the experience of employees across all of the company's locations when they use IT services. We aim to reduce ticket time-to-resolution and increase customer satisfaction.
The first part of the project focuses on predicting ticket types for users who engage IT support via email. The email text is analyzed using natural language processing techniques and then classified into the correct ticket type using machine learning algorithms. This automatic classification eliminates the manual effort of examining each ticket and routing it correctly, thus avoiding more than 75% of conversions between ticket types and reducing the resolution time for each ticket.
The second part is targeted at creating a personalized experience by building personas for all IT service users, to better understand their behaviors and address their needs based on the insights discovered. The users are clustered together based on dimensions such as channel of approach, ticket categories, and ticket impact using the K-Prototypes algorithm. This segmentation will provide insights into what to target first and what kinds of recommendations can help improve the experience for certain users.
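K-Prototypes handles mixed data by combining squared Euclidean distance on numeric fields with a weighted count of categorical mismatches. A minimal sketch of that dissimilarity measure (the feature names and gamma value below are illustrative, not the project's actual configuration):

```python
def kproto_distance(x, prototype, num_idx, cat_idx, gamma=1.0):
    """K-Prototypes dissimilarity: squared Euclidean distance on the numeric
    fields plus gamma times the number of categorical mismatches."""
    numeric = sum((x[i] - prototype[i]) ** 2 for i in num_idx)
    categorical = sum(x[i] != prototype[i] for i in cat_idx)
    return numeric + gamma * categorical

# Hypothetical user records: [monthly ticket count, channel, ticket category]
user      = [12, "email", "hardware"]
prototype = [10, "email", "software"]

# (12 - 10)^2 on the numeric field, plus one categorical mismatch times gamma
print(kproto_distance(user, prototype, num_idx=[0], cat_idx=[1, 2], gamma=2.0))
```

The gamma weight balances the numeric and categorical contributions; in practice it is tuned (or estimated from the data's numeric variance) so neither type of feature dominates the clustering.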
Plash Monga, An Analysis of the Customer Churn in Telecom Industry, August 2020, (Dungang Liu, Charles Sox)
One of the biggest problems a company faces is churned customers. In the telecom industry, churn describes whether a consumer will continue using the company's services. By monitoring the churn rate, companies can measure their customer-retention success and identify strategies for improvement. In this project, we work with a customer churn dataset from the telecom industry and use different machine learning models to understand the customer behaviors and attributes that signal the risk and timing of churn. After collecting 7,000 data points, a rule-based quality-control method was designed to decrease human error in predicting customer churn. After examining the results from different machine learning models, we conclude that the XGBoost model is promising: it achieves an accuracy rate of 81.5%.
Anudeep Mukka, Prediction of Adult Income Class, August 2020 (Peng Wang, Dungang Liu)
Growing economic inequality among individuals across the world has been a concern for governments. Many people consider their income information private and are hesitant to share it. Predicting an individual's income from demographic attributes would help in planning and allocating resources for the upliftment of the poor and in mitigating the economic gap. It would also help validate the income individuals declare to governments and identify individuals evading taxes.
Income prediction has also been an area of interest for many companies, as this information enables a greater understanding of consumer and market behavior. Companies can create targeted programs and apps for specific income groups. This also helps companies price products accurately and drive sales.
This report, “Prediction of Adult Census Income Class,” analyzes Census Income data containing attributes of various individuals to predict whether an individual earns more than $50k per year. Using these attributes, a comprehensive exploratory data analysis is done to understand which features drive an individual's income level. Feature selection is done using AIC, BIC, and LASSO. The report also highlights the feature engineering and data cleaning performed to improve model performance. Finally, it compares the performance of various machine learning models in classifying income level, along with the impact of each predictor on the response variable. Metrics such as accuracy, AUC, recall, precision, and F1 score are used to evaluate model performance.
Pravallika Mulukuri, Plant Pathology: Identification of category of foliar disease in Apple trees, August 2020 (Yan Yu, Peng Wang)
Agriculture is the main economic resource in most developing and under-developed countries, so disease-resistant crops are crucial to economic progress there. Traditional disease-identification methods such as human vision are time-consuming and require substantial human resources. Computer vision offers more time-efficient methods for identifying diseased crops.
In this project, the power of deep learning is used to build a predictive model that identifies foliar disease in apple trees. The dataset used in the analysis consists of images of apple tree leaves in various sizes, shapes, and colors. A very deep model is developed to classify whether a leaf is healthy, rusted, scabbed, or has multiple diseases. The developed model had an in-sample prediction accuracy of 100% and an out-of-sample prediction accuracy of 91.5%.
Meenal Narsignhani, Customer Retention Analysis, August 2020, (Peng Wang, Rossana Bandyopadhyay)
At least 40% of U.S. adults do not have the financial resources to cover a $400 emergency. That eye-opening statistic helps explain why so many American households count on Axcess Financial for financial solutions. Headquartered in Cincinnati, Axcess has, in 20 years, grown to nearly 1,000 retail stores across 24 states and has serviced over 50 million loans. It provides customers with a variety of loan products that can be acquired via two primary channels: retail stores and online.
In this extraordinary crisis, Axcess has witnessed a substantial drop in the overall number of loan transactions. Acquiring new customers as well as retaining existing ones is quite challenging in the current scenario. Since retaining existing customers is cheaper than acquiring new ones, the team wants to use this opportunity to engage with existing customers in order to improve retention rates. Since retail installment loan products are the major revenue-generating source, the team wants to focus in particular on customers who have opted for a retail installment loan at least once during their tenure at Axcess.
In this project, an extended list of factors influencing a customer to re-loan or refinance was identified from a comprehensive data mart, and a feature engineering process produced a consolidated analytical dataset for use by the extended analytical team. Longitudinal analysis of loan purchase history, demographics, delinquency rates, and credit scores helped create insights and actionable recommendations. Retention models will be implemented to identify customers most likely to re-loan or refinance with Axcess, based on which customized targeting strategies will be devised.
Leila Ouellaj, Google Store Customer Revenue Prediction, August 2020, (Yichen Qin, Edward Winkofsky)
This capstone addresses how companies can better allocate their marketing budgets. The 80/20 rule states that a small percentage of customers produce the largest share of a business's revenue, which pushes marketing teams to think carefully before allocating budgets. This project uses a Kaggle dataset from the Google Merchandise Store to predict revenue per customer in order to help companies make better use of their marketing budgets. The final product is a prediction of the natural log of the sum of all transactions per user. Exploratory data analysis is performed first, followed by several models to predict revenue: GLMNET, ARIMA, linear mixed models, and XGBoost.
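The target variable can be built by summing each user's transactions and taking the natural log; a log1p form is a common choice so that zero-revenue users map to zero. A minimal sketch with a hypothetical transaction log (the visitor IDs and amounts are made up):

```python
import math
from collections import defaultdict

# Hypothetical transaction log: (visitor_id, transaction_revenue)
transactions = [
    ("u1", 120.0), ("u1", 80.0),
    ("u2", 0.0),
    ("u3", 45.5),
]

def log_revenue_target(rows):
    """ln(1 + total revenue per user): the regression target."""
    totals = defaultdict(float)
    for user, revenue in rows:
        totals[user] += revenue
    return {user: math.log1p(total) for user, total in totals.items()}

targets = log_revenue_target(transactions)
print(targets)  # most users contribute zero, per the 80/20 pattern
```

Working on the log scale keeps the heavy right tail of the revenue distribution from dominating the squared-error loss.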
Priyanka Pavithran, Blindness Detection in Diabetic patients – using Deep Learning, August 2020, (Yan Yu, Liwei Chen)
Millions of people suffer from diabetic retinopathy, the leading cause of blindness among working-age adults. It is a diabetes complication that affects the eyes. It is caused by damage to the blood vessels of the light-sensitive tissue at the back of the eye (retina).
Many organizations in India hope to detect and prevent this disease among people living in rural areas where medical screening is difficult to conduct. Diabetic retinopathy affects up to 80 percent of those who have had diabetes for 20 years or more. Diabetic retinopathy often has no early warning signs. Retinal (fundus) photography with manual interpretation is a widely accepted screening tool for diabetic retinopathy, with performance that can exceed that of in-person dilated eye examinations.
Currently, technicians travel to these rural areas to capture images and then rely on highly trained doctors to review the images and provide diagnosis. Their goal is to scale their efforts through technology; to gain the ability to automatically screen images for disease and provide information on how severe the condition may be.
Shrinidhi Purohit, Forecasting Sales with Machine Learning, August 2020 (Peng Wang, Dungang Liu)
Forecasting sales has become one of the major applications of machine learning, helping businesses make informed decisions. This project forecasts sales for Walmart using various machine learning algorithms.
The project aims to forecast the sales of products sold by Walmart in its stores in three states: California (CA), Texas (TX), and Wisconsin (WI). The dataset, provided by Walmart, contains the unit sales of various products organized as grouped time series. The products fall into three categories (Hobbies, Food, and Household) and are sold across ten stores in CA, TX, and WI. The data ranges from 2011-01-29 to 2016-06-19.
Sumukh Purohit, Predict housing sales prices using Advanced Regression Techniques, August 2020, (Peng Wang, Dungang Liu)
The project aims to predict the final price of houses in residential areas in Ames, Iowa. The dataset consists of 2,930 observations and 79 explanatory variables (23 nominal, 23 ordinal, 14 discrete, and 20 continuous). It can be considered a modernized and expanded version of the Boston Housing dataset, which was used to build several regression models during the Data Mining class. The main goals of the project are to study various methods of feature engineering to determine the factors that affect house prices, and to implement and study several advanced regression algorithms to predict housing prices. The project will also serve as part of a Kaggle competition, in which feature engineering and model building are done on training data of 1465 observations and model performance is evaluated on testing data of the remaining 1465 observations. The root-mean-squared error between the logarithm of the predicted value and the logarithm of the observed housing price will be used to evaluate the models. The logarithm is used so that errors in predicting expensive houses and cheap houses are weighted equally.
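The evaluation metric described above can be sketched directly (the sale prices below are hypothetical, not from the Ames data):

```python
import math

def rmse_log(actual, predicted):
    """RMSE between the logarithms of predicted and observed sale prices."""
    squared = [(math.log(p) - math.log(a)) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical observed vs. predicted sale prices, in dollars
actual    = [200_000, 105_000, 350_000]
predicted = [190_000, 110_000, 400_000]

# Equal percentage errors on a cheap and an expensive house contribute equally
print(rmse_log(actual, predicted))
```

Because log differences are approximately relative errors, a $10k miss on a $100k house is penalized as heavily as a $50k miss on a $500k house, which is the equal weighting the abstract describes.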
Aninthan Ramaswamy, Trade Architecture Transformation, August 2020 (Dungang Liu, Chetan Dolzake)
Tiger Analytics is a consulting firm that provides analytics services to major businesses. During the internship, the task was to assist a major global manufacturer of consumer goods with re-designing its trade investment architecture. The manufacturer partners with multiple retailers, negotiating a pricing and promotional plan for each product; the result is a Joint Business Plan (JBP) that lists the agreed-upon decisions on pricing, promotion, and distribution of the SKUs. To facilitate these promotional activities, the manufacturer invests millions of dollars, driven by strategic negotiations with the retailers. Lately, this trade investment allocation has been driven more by a retailer's market power and less by the decisions agreed upon in the JBP. The manufacturer wanted to create a process that tracks performance KPIs, models the outcomes, and achieves an optimized trade rate for each retailer based on its decisions. Each KPI's contribution to the trade rate was calculated using a sigmoid transformation model operating under pre-determined business constraints. The optimized trade payout generated in the process ensures that the investment rewards retailer decisions that favor the manufacturer, while not penalizing poor outcomes beyond what is necessary. The framework also generates a potential $Y million in investment savings over the next 5 years, freeing up capital for other marketing activities. The solution includes tracking data from multiple external sources, creating and modeling KPIs using a data harmonization framework, designing an Excel tool that generates an optimized investment scenario, and designing a Power BI dashboard that provides quick insights into historical and future trade performance. This framework has been implemented in the client's environment, where the team can use it to improve its trade investment architecture.
This would ensure that the trade investment planning is shifted from a strategic process to a more data-backed and performance-based process.
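The sigmoid transformation mentioned above can be sketched as a logistic curve that maps a KPI score into a trade rate bounded by business-set floor and ceiling rates. Every parameter below is illustrative, not the client's actual constraint:

```python
import math

def kpi_trade_rate(score, floor=0.02, ceiling=0.08, midpoint=50.0, steepness=0.1):
    """Map a 0-100 KPI score to a trade rate between floor and ceiling via a
    logistic (sigmoid) curve: strong KPI performance earns a higher payout,
    but the rate saturates instead of growing without bound."""
    s = 1.0 / (1.0 + math.exp(-steepness * (score - midpoint)))
    return floor + (ceiling - floor) * s

print(kpi_trade_rate(50))  # midpoint score lands halfway between floor and ceiling
print(kpi_trade_rate(90))  # strong score approaches, but never exceeds, the ceiling
print(kpi_trade_rate(10))  # weak score approaches, but never falls below, the floor
```

The floor keeps poor outcomes from being penalized beyond what is necessary, while the ceiling caps the payout, which matches the behavior the abstract describes.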
Vidhi Rathod, Sentiment Analysis & Recommender system for Amazon’s Utility Product line, August 2020 (Yan Yu, Peng Wang)
How important are customer reviews to shoppers? Very important, as it turns out. The fact is, 90% of consumers read online reviews before visiting a business. And 88% of consumers trust online reviews as much as personal recommendations.
This is even more prominent in the e-commerce industry. Online stores have millions of products available in their catalogs. Sentiment analysis is the interpretation and classification of emotions (positive, negative and neutral) within text data using text analysis techniques. Sentiment analysis tools allow businesses to identify customer sentiment toward products, brands or services in online feedback.
On the contrary, finding the right product is difficult for customers because of information overload, which puts a cognitive burden on the user when choosing a product. Recommender systems help customers by suggesting a list of likely products from which they can easily select the right one. They make customers aware of new and/or similar products available for purchase by providing comparable costs, features, delivery times, etc.
Recommender systems have become an integral part of e-commerce sites and other businesses such as social networking and movie/music streaming sites. They have a huge impact on the revenue earned by these businesses and also benefit users by reducing the cognitive load of searching and sifting through an overload of data. Recommender systems personalize the customer experience by understanding usage of the system and recommending items users would find useful.
This project explores the reviews written by customers on the Utility product line of the e-commerce titan Amazon.
Here, Natural Language Processing is used to analyze the polarity of reviews and a Recommender system for products is built using the past reviews.
Kamaleshwar Ravichandran, DC-Taxi Driver Schedule Optimization, August 2020, (Leonardo Lozano, Eric Webb)
Revenue earned by taxi drivers is highly dependent on the route, day, and time they drive. Traffic on a route varies during different times of day and by day of the week: some routes have high traffic on weekdays, others on weekends. Thus, a driver who chooses an optimal route, day, and time to drive can earn more revenue. To address this problem, we formulate an optimization model using mixed integer programming. The driver feeds in the region, day, and time they are available to drive each week, and the model uses past taxi trip records to arrive at an optimal driving schedule for the week. Further, we perform a Monte Carlo simulation on the revenue calculation by varying the probability of getting a new trip, which gives the optimal total revenue a driver can make in a week.
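The Monte Carlo step can be sketched as follows; the slot count, trip probability, and fare are hypothetical stand-ins for values that would be estimated from the trip records:

```python
import random

def simulate_week(slots, p_trip, avg_fare, n_sims=10_000, seed=42):
    """Monte Carlo estimate of expected weekly revenue: in each scheduled
    slot a trip occurs with probability p_trip and pays avg_fare."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sims):
        total += sum(avg_fare for _ in range(slots) if rng.random() < p_trip)
    return total / n_sims

# Hypothetical schedule: 40 hourly slots, 70% trip probability, $18 average fare
expected_revenue = simulate_week(slots=40, p_trip=0.7, avg_fare=18.0)
print(expected_revenue)  # should be close to 40 * 0.7 * 18 = 504
```

Re-running the simulation across a grid of trip probabilities, as the abstract describes, shows how sensitive the optimized schedule's revenue is to demand assumptions.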
Eliza Redwine, Modeling voter turnout and party preference in Ohio’s 1st congressional district, August 2020, (Yan Yu, Dungang Liu)
Identifying who will vote and the party preference of those voters is key to winning political races, particularly close ones. In this project, models were developed to determine both a voter's likelihood of voting in a general election and that voter's likelihood of voting for the Democratic party, for registered voters in Hamilton and Warren counties in Ohio. Models were built using voter registration and history along with demographic information from the U.S. Census and Social Security Administration. The performance of CART, Elastic Net, and Random Forest models was assessed; Random Forest models proved the most accurate. Accuracy of the voter turnout models was greatly increased by incorporating voter turnout history into the Random Forest model via an iterative approach, successively fitting models to subsets of voters with different lengths of voting history.
Pooja Sahare, COVID-19 Twitter Sentiment Analysis, August 2020, (Yan Yu, Dungang Liu)
The World Health Organization declared the COVID-19 outbreak a pandemic on 11 March 2020. The world saw 2020 in a different light, shrouded by COVID-19. It managed not only to grab the headlines every day but also to dictate and disrupt our daily lives until this very day. Owing to its deep-rooted and widespread impact, the response around the world has been diverse. Twitter is a real-time social media platform where people can voice their opinions freely, and it is one of the most popular media worldwide. The aim of the Twitter analysis is to capture the pulse of the people during this quarantine, chiefly whether sentiment is positive or negative. Helplessness, anger, fear, sadness, anxiety, gratitude, and hope are some of the expected reactions. The study also aims to detect these emotions in the gathered tweets using machine learning techniques.
Sagar Sahoo, Natural Language Processing in Free-Text Notes, August 2020, (Michael Fry, Denise White)
50,000 employees of Boehringer Ingelheim create value through innovation with a clear goal: to provide better health outcomes and improve the lives of both humans and animals.
Currently, next-best-action planning is driven by analyzing free text across multiple departments such as Oncology & Respiratory. With thousands of insights in Oncology alone, it is tough for business teams to analyze all physician feedback regarding the drugs. Use cases such as auto-tagging compound and tumor types in free text, summarizing insights, and identifying hot/trending/emerging topics would help business teams derive hidden insights.
This project uses natural language processing to design a multi-label classification predictive model with deep learning (Keras/TensorFlow) and uses this model to predict tumor type from free text. The results showed that recurrent neural networks with pre-trained word embeddings (GloVe) can learn more effectively than the traditional bag-of-words approach, given enough data. Furthermore, this project involved aggregating monthly free text and generating insights (based on text similarity) using a Bidirectional Encoder Representations from Transformers (BERT) summarizer, and developing a Tableau platform that presents business teams with a summary of the insights, significant bi-grams/tri-grams (based on count), and unsupervised sentiment analysis using TextBlob. This should allow faster decision-making, leading to more dynamic and customer-specific processes.
Vivek Sahoo, Dashboards as a Product Offering & for tracking Product Usage, August 2020, (Michael Fry, Christy Foxbower)
VNDLY is a leading provider of cloud-based contingent workforce-management systems. Launched in 2017, it has grown quickly with multiple Fortune 500 clients and is backed by investments of over $57 million. In its efforts to disrupt the vendor-management space, VNDLY plans to provide dashboards as a product offering to its clients to aid their decision making and better manage their non-employee workforce needs. Also, being a product-oriented organization, VNDLY strives to leverage analytical dashboards to make key product related decisions. This capstone involves dashboard development to refine requirements, build prototypes using Tableau and ensure data visualization best practices are followed in the product offering for our clients. Analytical dashboards are created using Google Data Studio by sourcing product usage data from Google Analytics. These dashboards follow design principles for effective data visualizations and hence aid the decision making around product priorities for internal stakeholders at VNDLY.
Mohammed Nifaullah Sailappai, Sentiment Classification & NLP Analysis on Amazon Fine Food Reviews Dataset, August 2020, (Yichen Qin, Gowtham Atluri)
This project is a Natural Language Processing (NLP) analysis of the Amazon Fine Food Reviews dataset. The dataset contains 568,454 reviews of fine foods from Amazon spanning a period of more than 10 years. It includes product and user information, ratings, and a plain-text review in tabular format, in which each row is a single review. Initially we preprocessed the data to convert the ratings into positive and negative sentiments. Then we structured the data into a format feedable to Keras neural networks and generated the embedding layer using the Gensim library in Python. After a process of iterative improvement, we finalized a sentiment classification model. Finally, we analyzed the model's performance on incorrectly classified reviews to understand the nature and patterns of such reviews. The final model was a recurrent neural network with LSTM cells as the main building block. It gave a training and test accuracy of 90%; deeper analysis suggested the effective accuracy is greater than 90%.
Matteo Salerno, Analysis of different approaches to treat time series instability, August 2020, (Jeffrey Mills, Yan Yu)
This work compares the frequentist and Bayesian approaches to the treatment of time-series instabilities such as structural breaks and unit roots, using AR(1) (autoregressive of order 1) modeling. The two approaches were evaluated on ad hoc generated time series that had breaks and regime switching between I(1) and I(0) states. The breaks in these time series were treated as having unknown positions; once the break positions were identified, the time series were modeled and the results compared. The structural breaks were searched for with a practitioner approach based on minimizing the regression RSS (residual sum of squares) of the time-series model, as described in this paper (hereinafter referred to as the “minimum RSS search”). The detected break positions were validated with the QLR (Quandt Likelihood Ratio) test. The BIC (Bayesian Information Criterion) was used to compare fit among models. The conclusion of this work is that the Bayesian approach provided a better fit than the frequentist one in the cases analyzed.
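A minimal sketch of the minimum RSS search: fit an AR(1) by OLS on each side of every candidate break and keep the split that minimizes the summed residual sum of squares. The simulated series and its break location below are illustrative, not the series used in the study:

```python
import random

def ar1_rss(y):
    """Residual sum of squares from an OLS fit of y[t] = a + b*y[t-1]."""
    x, z = y[:-1], y[1:]
    n = len(x)
    mx, mz = sum(x) / n, sum(z) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxz = sum((xi - mx) * (zi - mz) for xi, zi in zip(x, z))
    b = sxz / sxx
    a = mz - b * mx
    return sum((zi - a - b * xi) ** 2 for xi, zi in zip(x, z))

def min_rss_break(y, trim=10):
    """Candidate break position minimizing the two-segment total RSS."""
    candidates = range(trim, len(y) - trim)
    return min(candidates, key=lambda k: ar1_rss(y[:k]) + ar1_rss(y[k:]))

# Simulated AR(1) series whose mean shifts from 0 to 5 at t = 60
rng = random.Random(0)
y = [0.0]
for t in range(1, 120):
    mean = 0.0 if t < 60 else 5.0
    y.append(mean + 0.5 * (y[-1] - mean) + rng.gauss(0, 0.3))

print(min_rss_break(y))  # detected break should land near t = 60
```

In the study, detected positions like this one would then be validated with the QLR test rather than accepted at face value.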
Onkar Samant, Duplicate Questions Identification on Quora, August 2020, (Peng Wang, Denise White)
The project explores automated prediction of duplicate questions, given two questions from the Quora platform, through predictive models. Different natural language processing techniques are explored extensively to extract features from the raw text and build models for prediction. The modeling work mainly involves the following two types of approaches:
- Classical Text Mining Approach: Character counts and TF-IDF vectors along with cosine similarity are used as features extracted from the raw text. Furthermore, Logistic Regression, Random Forest, and XGBoost are tried for prediction using these features.
- Advanced Techniques: Various neural network architectures are tried with word embeddings created from the questions' text.
The dataset contains 404,290 records and 95% of it is used for training and 5% for testing. Results from the approaches defined above are tabulated and compared based on log loss and accuracy. The results from the word embedding approaches are found to be very promising and can provide a scalable solution for this task.
Nikita Sanke, Sentiment Analysis on Women’s Customer Reviews in Ecommerce Industry, August 2020, (Peng Wang, Liwei Chen)
61% of customers read online reviews before making a purchase decision, and reviews are now essential for e-commerce sites. Understanding the sentiments of customers is of utmost importance today. By analyzing user reviews, a company can be aware of how its users feel. For brands trying to actively engage with their users, it is important to detect any negative sentiment. Hence, natural language processing applications like sentiment analysis help companies improve the online e-commerce experience for their users and extract insights from unstructured information such as customer reviews.
Sentiment analysis, which includes text-polarity detection in addition to feature extraction, provides valuable insights that help firms understand how buyers actually feel about the products they bought and whether they would recommend them to others. It is particularly useful for datasets such as this one.
Kaustubh Saraf, Churn Analysis: Identifying the quotes that are likely to convert into sales, August 2020 (Michael Fry, Andrew Harrison)
With the growth of analytics, it has become more important for businesses to use analytical tools to gain an upper hand over the competitors. Analytics can be used across different verticals of a business such as manufacturing, financial planning, and marketing. Marketing is an important aspect of any business with many ways of marketing a product. Therefore, it is important to select the right way of marketing a product. One of the important factors of any marketing campaign is the selection of the target group. Using analytics to select the right set of customers within a target group helps businesses optimize costs and focus on the right set of customers. Another important business area to explore is analyzing the shifts in the buying patterns of the customers.
In this study, we will be analyzing the transactional data of a steel manufacturing company to predict the conversion of a quote that is sent to its clients. Some important aspects of the study include the frequency of conversations, the geographical area of the clients, the functional status of the clients, the response time for the first conversation and the average response time for conversations. We also try to bucket customers based on the transactional data and predict the number of quotes that get converted into sales based on factors such as the customer group, month of the conversation and geography.
Somil Saxena, Telecom Company Customer Churn, August 2020, (Yan Yu, Dungang Liu)
With the rapid development of telecommunication industry, the service providers are inclined more towards expansion of the subscriber base. To meet the need of surviving in the competitive environment, the retention of existing customers has become a huge challenge. Firms are directing more effort into retaining existing customers than to attracting new ones. To achieve this, customers likely to defect need to be identified so that they can be approached with tailored incentives or other bespoke retention offers. Such strategies call for predictive models capable of identifying customers with higher probabilities of defecting in the relatively near future.
In this study, an in-depth customer attrition analysis was conducted, followed by building a model to predict whether a customer would churn. Different supervised learning techniques, such as Decision Tree, Support Vector Machines, Logistic Regression, and Random Forest, were used to predict the categorical target variable 'churn label'. Since the distribution of churning and non-churning customers was not balanced, accuracy was not used as a performance metric. Instead, model comparison was done on ROC AUC score. Random Forest gave the best performance, with an AUC score of 86%.
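Since accuracy is misleading under class imbalance, the models were compared on ROC AUC instead. A minimal sketch of the metric via its rank interpretation (the churn labels and model scores below are hypothetical):

```python
def roc_auc(labels, scores):
    """ROC AUC via the rank-sum (Mann-Whitney) formulation: the probability
    that a randomly chosen churner is scored higher than a randomly chosen
    non-churner, counting ties as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical churn labels (1 = churned) and model scores
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.3, 0.8, 0.5, 0.4, 0.45]

print(roc_auc(labels, scores))
```

Because AUC depends only on how churners rank relative to non-churners, it is unaffected by the class imbalance that makes raw accuracy misleading here.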
Tanu Seth, American Sign Language Hand Gesture Recognition: Application of different Machine Learning Algorithms for Image Classification, August 2020, (Yan Yu, Liwei Chen)
American Sign Language (ASL) is a visual language that serves as the principal sign language for deaf communities in the United States and many parts of Canada. Sign language gesture recognition is an open problem in the field of machine vision, with many applications that could improve human-computer interaction. The project aims to create a translator that applies and compares various machine learning algorithms, based on model accuracy, to predict the alphabet letters corresponding to static ASL hand gestures. The data for this project was collected from Kaggle. It consists of 44,200 images (28 x 28-pixel data) of ASL hand gestures corresponding to the letters A through Z. Each row has 785 dimensions: the first is the label, which corresponds to the index of the letter, and the remaining 784 values are grayscale pixel values (numbers from 0 to 255). Among the traditional classifiers, the Random Forest classifier performed best, recognizing hand gestures with 93.37% accuracy. A convolutional neural network developed with the Keras library in Python recognized the letters with 98.3% accuracy. The accuracy achieved could change significantly if new objects appear in the background. With higher computational power and hyperparameter tuning, the prediction performance of the algorithms could be improved further.
Sagar Shah, Plastic Cutlery Product Development, August 2020, (Michael Fry, Rakesh Rathore)
TrueChoicePack (TCP) is a private-label company and an expert in supplying disposable products across the United States. Our team at TCP is always striving to develop new products for private-labelling customers as well as our own e-commerce businesses. This report presents my “Plastic Cutlery Product Development” project, in which I analyzed the market for plastic cutlery; evaluated supplier bids to determine the best supplier for the product development; developed the stock keeping units for the company's retail and e-commerce markets; determined appropriate pricing for those markets by performing competitive analysis and implementing price-based costing in line with the company's objectives; developed the content on the product dielines for marketing purposes; and designed the product packaging for the retail and e-commerce markets.
This project is divided into several subparts, each demanding exploratory data analysis, statistical computing, data munging, analytical dataset creation, and data visualization on multiple datasets to develop recommendations and communicate results to senior management. We expect to launch this product by August on different e-commerce platforms.
Anjali Shalimar, Movie Recommendation Engine, August 2020, (Peng Wang, Liwei Chen)
Recommendation engines are integral to the ever-increasing need for businesses to further personalize the customer experience. From recommending clothing size to product substitutes, content recommendation plays a vital role in customer engagement. The objective of the analysis is to learn and build a basic movie recommendation engine.
The analysis first investigates a content-based filtering algorithm. The key idea is to recommend movies that best suit a customer's prior movie selections. For instance, if you have watched ‘The Avengers’, you would receive recommendations for movies with similar plots or characters. In this analysis, a content-based recommender has been built from the plot description of each movie. The technique of term frequency–inverse document frequency (TF-IDF) is used to weight the keywords in each description and build a TF-IDF matrix. Cosine similarity scores across the pool of movie keyword vectors are then used to recommend the most similar movies.
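A minimal sketch of the TF-IDF-plus-cosine-similarity step, with toy plot descriptions and a simple smoothed IDF (the project's exact preprocessing may differ):

```python
# Toy content-based similarity: TF-IDF vectors from plot descriptions,
# compared with cosine similarity. Not the project's data or pipeline.
import math

docs = ["heroes assemble to save the world",
        "a hero fights to save the city",
        "a romantic dinner in paris"]

def tfidf_vectors(docs):
    tokenized = [d.split() for d in docs]
    vocab = sorted({w for doc in tokenized for w in doc})
    n = len(tokenized)
    # smoothed idf: log(n / df) + 1 keeps shared words from zeroing out
    idf = {w: math.log(n / sum(w in doc for doc in tokenized)) + 1
           for w in vocab}
    return [[doc.count(w) * idf[w] for w in vocab] for doc in tokenized]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

vecs = tfidf_vectors(docs)
# the two action plots share vocabulary; the romance plot shares none
sim_action = cosine(vecs[0], vecs[1])
sim_cross = cosine(vecs[0], vecs[2])
```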
A collaborative filtering algorithm is an improvement on the content-based approach, as it produces recommendations based on the similarity between users and the movies they have rated. A collaborative filtering algorithm can, however, suffer from data sparsity: the challenge of sparsity occurs when we do not have sufficient data points to identify similar users or items. Hence the analysis investigates a singular value decomposition (SVD) technique to recommend movies. A singular value decomposition decomposes a matrix into a product of three component matrices. The key idea is to predict the user rating for a movie while minimizing the RMSE of the rating predictions.
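The decomposition itself can be sketched on a tiny dense toy matrix with numpy; real recommender pipelines handle sparsity with regularized matrix factorization rather than a plain SVD of the full rating matrix:

```python
# Low-rank SVD approximation of a toy user-movie rating matrix.
# Matrix values are made up for illustration.
import numpy as np

R = np.array([[5.0, 4.0, 1.0],
              [4.0, 5.0, 1.0],
              [1.0, 1.0, 5.0]])      # users x movies

U, s, Vt = np.linalg.svd(R, full_matrices=False)   # R = U diag(s) Vt
k = 2                                              # keep the top-k factors
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]      # rank-k approximation

# RMSE of the reconstruction against the observed ratings
rmse = np.sqrt(np.mean((R - R_hat) ** 2))
```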
Anupam Shukla, US Airline Tweet Topic Modeling and Sentiment Analysis, August 2020, (Peng Wang, Liwei Chen)
Twitter is one of the most popular social networking sites, where people express their sentiments about different companies and their products and services. According to Brandwatch stats, 65.8% of US companies with 100+ employees use Twitter for marketing, 80% of Twitter users have mentioned a brand in a tweet, and the last two years have seen a 2.5x increase in customer service conversations. Companies can analyze these tweets to understand where they need to improve, but analyzing this large number of tweets manually can be a time-consuming process. This project addresses this issue by employing Natural Language Processing tools like topic modeling and sentiment analysis. A dataset consisting of customer tweets about each major US airline is used for the study. It contains almost 14,000 tweets and comes with a pre-labeled sentiment (positive, negative, or neutral) for each tweet. The topic model will help airlines identify the frequent topics flyers tweet about and address the areas where service is not satisfactory. With the classification model, airlines can predict the sentiment of future tweets and analyze whether service improvements are actually working. Topic modeling is performed using the Gensim library in Python. To build the classification model, machine learning algorithms such as Logistic Regression with Lasso regularization, Random Forest, Boosting, and naïve Bayes classifiers, as well as a deep learning approach (a recurrent neural network using pretrained GloVe word embeddings), are used.
Harsh Singal, Python Notebooks for Data Mining Course, August 2020, (Yan Yu, Peng Wang)
In today’s data-driven environment, big data analytics is a powerful aid to decision making. Anyone new to data science, or any organization that has started using data science in its day-to-day operations, must first pick a language in which to analyze the data, and needs a thoughtful way to make that decision. Although many languages exist, two are commonly used for data science: Python and R. Both have been used successfully in teaching as well as in the professional world and are currently used to make decisions involving big data. Both comprise a large collection of packages for specific tasks and have a growing community that offers support and tutorials online. Since the Data Mining course at the University of Cincinnati is heavily R-oriented, through this project I have converted the labs and homework from R to Python notebooks for students interested in doing the coursework in Python as well.
Aditi Singh, Application of Convolution Neural Networks (CNN) in detecting Malaria, August 2020, (Peng Wang, Liwei Chen)
The applications of deep learning have achieved great success in the healthcare industry in recent years. Deep convolution neural networks (CNN) are now widely used in medical imaging diagnosis for various diseases such as pneumonia, Alzheimer’s, cancer, diabetic blindness, etc. One such important application of image-based classification is the diagnosis of malaria.
The project exhaustively explores the CNN architecture, shedding light on the unique feature processing of images within convolutions that results in a significant improvement in predictive power, especially compared to traditional classification algorithms like random forest. Using Giemsa-stained colored cell images, the analysis attempts to fit the best CNN architecture to classify red blood cells as infected or not. Different models were developed using both VGG blocks and residual modules. Overfitting was addressed using regularization techniques such as dropout, l1 and l2 regularization terms, early stopping, and data augmentation. The choice of optimization algorithm was also evaluated. Based on the analysis, it was concluded that the model using 3 VGG-block-based layers, along with dropout, early stopping, an l2 regularization term, and the RMSprop optimization algorithm, yielded the best results in terms of model performance on test data.
Ashutosh Singh, Predicting Click-Through in Online Hotel Ranking, August 2020, (Yan Yu, Dungang Liu)
Click-through rate prediction is an essential task in industrial applications such as online advertising and online bookings. Over the past decade, many traditional machine learning models have been used for this purpose, ranging from Logistic Regression and Decision Trees to more advanced ensemble models like Boosting and Bagging. Recently, deep-learning-based models have been proposed and have proven impressive on classification problems involving large data. In this study, an online hotel rankings dataset is used to test the efficiency of these traditional, ensemble, and deep learning methods in predicting click-through. The performance of all these models is evaluated on both in-sample and out-of-sample data. The methods used were broadly classified into four major categories – Traditional, Ensemble, Naïve Bayes, and Deep Learning – and the experiments aimed to find the best performing model in each category. In the end, Boosting algorithms, specifically Gradient Boosting, performed best on out-of-sample data. Overall, the majority of the methods gave good results in terms of accuracy, with some exceptions such as Bernoulli NB and the Sequential model. Two features, the position of the hotel in the search-result rankings and the property location score, were identified as the most important based on the feature importance plots of the Random Forest and Gradient Boosting models.
Utkarsh K. Singh, Customer future value modeling for non-contractual businesses – Bayesian approach, August 2020, (Yichen Qin, Dungang Liu)
Estimating the future value of a customer is one of the core pillars of marketing strategy. Value can be perceived either as the net profit or the revenue earned from a customer during a period of defined length in the future. A number of econometric, probabilistic, and machine learning models utilize customer-level transactions for this purpose.
The modeling approach explored in this project is one amongst a suite of Bayesian probabilistic models popularly known as the Buy till you die models for estimating customer value. The report discusses the theory and application of a combination of Beta Geometric Negative Binomial Distribution (BG/NBD) model and the Gamma-Gamma submodel for estimating the expected future value of customers for an e-commerce retail business. The BG/NBD model was first introduced by Fader, Hardie and Lee in 2004 for predicting expected future transactions and survival probability for customers in a non-contractual setup.
The model is trained over a calibration period of 9 months and the predictions are tested for a holdout period of 4 months. The model performance is evaluated against a simplistic baseline model based on the observed average behavior of an individual. Finally, the model is used to predict “High Future Value” customers and the lift obtained in capturing target customers is reported.
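As a sketch of the calibration inputs such a model consumes, the per-customer (frequency, recency, T) summary can be computed from a transaction log; the customers and dates below are made up:

```python
# RFM-style summary per customer, the standard input to BG/NBD-type
# models: frequency = repeat purchases, recency = customer "age" at the
# last purchase, T = age at the end of the calibration period.
from datetime import date

transactions = {
    "C1": [date(2020, 1, 5), date(2020, 3, 1), date(2020, 8, 20)],
    "C2": [date(2020, 2, 10)],
}
calibration_end = date(2020, 9, 30)

def rfm_summary(purchases, end):
    first, last = min(purchases), max(purchases)
    frequency = len(purchases) - 1          # repeat purchases only
    recency = (last - first).days           # age at last purchase
    T = (end - first).days                  # age at end of calibration
    return frequency, recency, T

summary = {c: rfm_summary(p, calibration_end)
           for c, p in transactions.items()}
```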
Apoorva Milind Sonavani, Image Caption Generation Using Computer Vision & NLP, August 2020, (Yan Yu, Peng Wang)
Visuals and imagery continue to dominate social and professional interactions globally. At this growing scale, manual efforts fall short in tracking, identifying, and annotating the prodigious amounts of visual data. With the advent of artificial intelligence, multimedia businesses are able to accelerate the process of image captioning. An AI-powered image caption generator employs various artificial intelligence services and technologies, such as deep neural networks, to automate image captioning processes. The image captioning model is an automated tool that efficiently generates concise and meaningful captions for large volumes of images. The dataset used is the COCO 2014 dataset (Common Objects in Context). COCO is a large-scale object detection, segmentation, and captioning dataset; this version contains images, bounding boxes, and labels. The model employs techniques from computer vision and Natural Language Processing (NLP) to extract comprehensive textual information about the given images. The image caption generator consists of Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN), wherein:
- CNNs are deployed for extracting spatial information from the images
- RNNs are harnessed for generating sequential data of words
Bahdanau Attention is used within the encoder-decoder structure of the model, to preserve sequence-to-sequence efficiency.
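A numpy sketch of the additive (Bahdanau) attention step, with random toy shapes and weights standing in for the trained model:

```python
# Bahdanau (additive) attention: score each encoder feature against the
# decoder state, softmax into weights, take a weighted context vector.
# All shapes and weights below are arbitrary toy values.
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(64, 256))   # 64 image regions x 256 CNN features
state = rng.normal(size=(512,))         # current RNN decoder state
W1 = rng.normal(size=(256, 128)) * 0.1  # projects features
W2 = rng.normal(size=(512, 128)) * 0.1  # projects decoder state
v = rng.normal(size=(128,)) * 0.1       # scoring vector

scores = np.tanh(features @ W1 + state @ W2) @ v   # additive score per region
weights = np.exp(scores - scores.max())
weights /= weights.sum()                           # softmax attention weights
context = weights @ features                       # (256,) context vector
```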
Ayshwarya Srinivasan, Movie Recommendation Engine, August 2020, (Peng Wang, Yichen Qin)
Streaming services such as Netflix, Prime, and Hulu are garnering a larger audience each day. With more streaming services appearing than one can keep count of, families must decide which service to subscribe to. Increasingly, the quality of a service’s recommendations has become pivotal to this decision. The objective of this project is to recommend movies for a subscriber to watch.
While there are several factors one could consider when creating a recommendation engine, in this project we focus on a few aspects of the user and the movie: the movie genre, the tags associated with it, and the director and cast of the movie. We also focus on recommending movies to a user based on how closely their taste matches another user’s. The goal of this project is to provide a holistic experience by offering recommendations based on various criteria such as popularity, user-user collaborative filtering, and content filtering.
Rajat Srivastava, Feature Based Music Recommendation System, August 2020, (Peng Wang, Liwei Chen)
Music service providers rely mostly, for personalization of content, on the preferences users set at account creation. Those with user data make use of collaborative filtering and listening patterns. However, a user rarely listens to the same kind of music all the time: song preferences depend largely on the mood of the listener. The mood can be derived, abstractly, from the audio features of the song the user is currently listening to. Over the years, many supervised metric learning methods have been applied to this problem in conjunction with collaborative filtering and user preferences. Viewed independently, however, the problem is one of unsupervised distance metric learning. Moreover, the number of features generally ranges from 10 to 20, so a linear distance metric often does not give good results. In this paper, we demonstrate a non-linear distance metric derived from the idea of the Locally Linear Embedding (LLE) method of dimensionality reduction. The results are evaluated based on user feedback collected using a web application.
Venkat Sureddi, Movie Recommendation System, August 2020, (Yichen Qin, Liwei Chen)
One key reason recommendation systems have become ubiquitous in the modern world is the enormous number of options people have on the internet. From fashion to items for daily use to movies, there is a plethora of options for a user surfing the web. It is impossible for any user to choose from such an exhaustive list of options. This is where recommendation systems come into play. A recommendation system employs a statistical algorithm to predict users’ preferences and make suggestions based on those preferences.
Recommendation systems are playing a key role particularly in the entertainment space, where there are thousands of movies and series to choose from. To provide the best viewing experience and retain users, it is important for OTT platforms to seamlessly suggest movies to both existing and new users. We built different recommendation systems using the TMDB Movies dataset from the MovieLens website.
The three recommendation systems we have built in this project are:
- Simple Recommendation System
- Content Based Recommendation System
- Collaborative Filtering
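One common way to implement the simple (popularity-based) recommender is the IMDB weighted-rating formula WR = v/(v+m)·R + m/(v+m)·C; whether the project used exactly this formula is an assumption, and the movies below are toy examples:

```python
# Weighted-rating popularity recommender (IMDB-style formula). It shrinks
# sparsely voted movies toward the global mean rating C, so a 9.0 movie
# with 20 votes cannot outrank a well-established 8.5 movie.
# Movies are (title, mean rating R, vote count v); values are made up.

movies = [("A", 9.0, 20), ("B", 8.5, 2000), ("C", 7.0, 5000)]
C = sum(r * v for _, r, v in movies) / sum(v for _, _, v in movies)
m = 500   # minimum-votes threshold (a tunable assumption)

def weighted_rating(r, v, m=m, C=C):
    return v / (v + m) * r + m / (v + m) * C

ranked = sorted(movies, key=lambda t: weighted_rating(t[1], t[2]),
                reverse=True)
```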
Ambita Surlekar, Book Recommender System, August 2020, (Peng Wang, Yan Yu)
Recommender systems have become an integral part of many e-commerce companies. From Amazon to Netflix, recommender systems help users explore items, songs, or movies similar to their tastes. They have also significantly impacted businesses by increasing purchases, resulting in increased revenue. This project builds a recommender system for books whose details are stored in the Goodreads database. Many advances have been made in identifying the best algorithms for making these recommendations; this project explores three popular ones. The content-based filtering approach uses book titles and ratings to suggest books to users. In the collaborative filtering (CF) approach, both user-based and item-based variants have been explored. Finally, I have applied Singular Value Decomposition (SVD)/matrix completion, one of the most popular approaches. I have listed the advantages and disadvantages of each of the recommenders.
Hasnat Shad Tahir, Movie Recommendation Systems, August 2020, (Peng Wang, Dungang Liu)
Today, while surfing or purchasing on the internet, we are presented with many choices, which can be time-consuming and sometimes frustrating. There is a need to filter, prioritize, and efficiently deliver relevant information to alleviate the problem of information overload. With the rapid growth of data collection, we can build more efficient systems by effectively using the collected data. Recommendation systems are information filtering systems that improve the quality of a search result by increasing its relevancy to the user’s search history or preferences. At present, almost every big company uses these systems: Amazon uses them to suggest products to customers based on their own and similar customers’ purchasing habits, and YouTube uses them to decide which video to play next. Some music application companies, like Spotify, depend heavily on the effectiveness of their recommender systems for their success.
Juan Tan, DrugBank Data Mining – Wrangling & Network Analysis, August 2020, (Leonardo Lozano, Denise White)
In this project, I parse and explore the database downloaded from the DrugBank website (https://www.drugbank.ca/), which contains a wealth of information, including targets, manufacturer, price, monoisotopic mass, metabolism, toxicity, etc., for over 13,000 drugs. Drug type, state, price, target action, manufacturer, group, and average mass are mined from this database. Then, drugs for Alzheimer’s Disease (AD) are identified. Furthermore, using the Cytoscape platform, the drug-target, drug-enzyme, and drug-transporter networks are built and analyzed. Insights into the current drug market, new drug discovery, and drug repurposing for AD are obtained.
Varun Varma, Detecting Fake News and Real News Articles, August 2020, (Yan Yu, Dungang Liu)
Fake news and lies have existed since before the coming of the Internet. The generally acknowledged definition of Internet fake news is: fabricated articles intentionally manufactured to deceive readers. Social media and media outlets publish fake news to build readership or as a feature of psychological warfare. In general, the objective is profiting through clickbait.
Clickbait sources draw users and entice curiosity with flashy headlines, designed to get clicks on links and increase advertising revenue. This article examines the prevalence of fake news in light of the advances in communication made possible by the rise of social networking sites.
The motivation behind the work is to find a solution that users can employ to detect and filter out sites containing false and misleading information, helping readers avoid being lured by clickbait. It is imperative that such solutions be identified, as they will prove useful both to readers and to the tech companies involved in the issue. I have built models with different machine learning algorithms to recognize fake posts. The results show 99.8% accuracy utilizing a logistic classifier.
Lekshmi Venugopal, Generation of Music using Deep Learning and Recurrent Neural Networks, August 2020, (Peng Wang, Edward Winkofsky)
Traditionally, music was generated by talented artists and was deemed a skill owned by a select few who had the requisite sense of creativity. Nowadays, with the help of advanced technologies, the same can be achieved without human interference. The applications of machine learning and artificial intelligence are diversifying into fields that were considered difficult for machines to conquer. Creative fields requiring a certain skill set and sense of innovation, such as art and literature, were thought to be the last that machines would dominate. In this project, we train generative models and use them to create music. Recurrent Neural Network (RNN) models based on the Long Short-Term Memory (LSTM) architecture are used. Data in ABC notation is preprocessed and vectorized into integer format before training the model. A description of the LSTM model architecture and its connections within the neural network is also presented in this work. When trained with enough samples, the model produced impressive results in generating new samples.
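The character-to-integer vectorization step described above can be sketched in a few lines; the ABC fragment is a toy example, not the project's training data:

```python
# Vectorize an ABC-notation tune into integers for an LSTM: build a
# character vocabulary, map characters to indices, and verify the
# mapping round-trips back to the original text.

abc_tune = "X:1\nT:Example\nK:C\nCDEF GABc|"
vocab = sorted(set(abc_tune))
char2idx = {ch: i for i, ch in enumerate(vocab)}
idx2char = {i: ch for ch, i in char2idx.items()}

encoded = [char2idx[ch] for ch in abc_tune]          # integer sequence
decoded = "".join(idx2char[i] for i in encoded)      # reverses the encoding
```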
Mudit Verma, Customer Churn Modeling for Term Policies at Ameritas, August 2020, (Dungang Liu, Jennifer Kelly)
Customer churn is a focal concern for most companies in industries with low switching costs. Among the industries that suffer from this issue, insurance is significantly hit, with approximate annual churn rates of 16%. Different approaches exist to tackle this problem by developing predictive models for customer churn. In this project, we applied different classification models to policy data at Ameritas to accurately predict a customer’s propensity to churn. Multiple datasets, including information on policies, customers, and transactions, were analysed and combined for the study. In the model building phase, logistic regression, a decision tree classifier, and random forest were utilized to generate predictions. All the models were evaluated on out-of-sample accuracy and performed exceptionally well, giving accuracy between 80% and 90% on the test dataset. The random forest model gave the best results, with ~89% accuracy. Feature importance is also provided to yield actionable insights for marketing.
Vishnu Vijayakumar, News Category Classification using Deep Learning, August 2020, (Yan Yu, Dungang Liu)
Text Classification is a classical problem in Natural Language Processing (NLP) where certain sentences, paragraphs or documents need to be assigned to one or more predefined categories.
Deep learning models based on recurrent structures have been able to surpass issues faced by conventional machine learning models and achieve satisfactory results in classifying text data by utilizing semantic information. While Deep Learning (DL) models have achieved state-of-the-art results on many NLP tasks, these models are trained from scratch, requiring large datasets and days to converge. These major drawbacks of DL models have been addressed through inductive transfer learning. Transfer Learning (TL) has thus changed the face of DL in NLP in recent years by allowing us to take pretrained state-of-the-art models and fine-tune them to suit the task at hand, obviating the need to train language models from scratch. This study, apart from exploring some DL models, focuses on two of the most popular TL models, Universal Language Model Fine-Tuning (ULMFiT) and Bidirectional Encoder Representations from Transformers (BERT), which employ transfer learning to classify news articles into predefined categories.
Jayanth Sekhar Viswambhara, Amazon Fine Food Reviews Sentiment Analysis, August 2020, (Dungang Liu, Peng Wang)
Emotions are essential for effective communication between humans, so if we want machines to handle texts the same way, we need to teach them how to detect the semantic orientation of a text as positive, neutral, or negative. That is where sentiment analysis comes in; it is widely applied to voice-of-the-customer materials such as reviews, survey responses, and online and social media, for applications ranging from marketing to customer service, helping organizations make data-driven decisions. The focus of this project is the application of sentiment analysis to fine-food review data collected from amazon.com. In this study, we used different feature generation techniques, such as vectorization and word-embedding-based approaches, and machine learning classifiers, including Logistic Regression, Multinomial Naïve Bayes, Random Forest, XGBoost, and Support Vector Machine, for text classification. Comparing each model’s performance in classifying the sentiments, the Support Vector Machine with a linear kernel and the bag-of-words bigram approach achieved the best F1 score of 0.904 and accuracy of ~95%. This provides a bird’s-eye view of user perceptions of the fine foods and also acts as a powerful feedback mechanism for amazon.com and retailers, which they could use to make immediate corrections and improve their products and services.
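The F1 score cited above is the harmonic mean of precision and recall; a quick sketch with hypothetical confusion-matrix counts (not the study's):

```python
# F1 from a confusion matrix: harmonic mean of precision and recall.
# The counts below are made-up illustrative values.

def f1_score(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# e.g. 90 true positives, 10 false positives, 10 false negatives
# gives precision = recall = 0.9, hence F1 = 0.9
score = f1_score(tp=90, fp=10, fn=10)
```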
Saurabh Wani, Movie Score Prediction, August 2020, (Yichen Qin, Liwei Chen)
The IMDB score for a movie, on a scale of 0-10, is a popular metric conveying its success or failure. The dataset for the project was sourced from Kaggle and was originally scraped from the IMDB website. The project first explores, via data visualization, the factors impacting the final IMDB score, viz., popularity of actors/directors, budgets, gross earnings, etc., and then applies machine learning algorithms to predict the final score. It was inferred that the total number of users who voted for the movie, the duration of the movie, and the budget and gross earnings for the movie are key factors that impact the final IMDB score. Four models, viz., Multinomial Logistic Regression, Decision Trees, Random Forest, and Gradient Boosting, were used to assess prediction accuracy for the final IMDB score. The best accuracy rate of 78% was obtained for Random Forest, followed by Gradient Boosting, Decision Trees, and Multinomial Logistic Regression.
Yashwanth Kumar Yelamanchili, Identifying COVID Hotspots, August 2020, (Yan Yu, Denise White)
In this project, COVID hotspots (counties) were identified to understand the risk of the spread of disease when making important business decisions involving geographic location. Multiple factors summarizing population estimates, demographics, income, and the healthcare system were considered for this project. The final feature set, including both base variables and derived variables, was used to shortlist the factors affecting the rise in cases and the associated weight for each of the factors. These factors are then used to calculate a score at the county level to classify counties as hotspots. The number of cases was also forecasted for the 10th day from July 3rd, 2020, and the score was recalculated. This was done to check whether the hotspots identified were stable or whether COVID hotspots are highly volatile and tend to shift within a 10-day timeframe. Only the ‘cases’ lever was changed because case density was a major factor in the calculation of a county’s score. The COVID hotspots and the cases by county were then visualized on a Tableau dashboard to make consumption easier and more intuitive for the business. A total of 315 hotspots were identified, with Los Angeles county in California having the highest score and thus being the most susceptible to the spread of the novel COVID-19 virus.
Laith Barakat, Candidate of Change? Did Unemployment Trends across America Contribute to the Rise of Donald Trump?, May 2020, (Dungang Liu, Edward Winkofsky)
In the aftermath of a particularly polarizing United States presidential election cycle of 2016, political pundits and commentators have been resolutely torn on the factors leading to the outcome. While many theories have been tested, one compelling potential reason for Donald Trump’s win in 2016 piqued interest for the research of this paper: that household economic performance and work force indicators significantly swung previously Democratic-leaning counties to vote Republican. This paper attempts to build simplistic logistic regression models around some such economic indicators at a county level. Once a complete model is achieved, the paper displays techniques that could be used as a foundation for more extensive and complex models, as well as an interesting application of basic statistical modeling within the political sphere.
Kevin Gilmore, Deep “Dish” Data Dive, May 2020, (Peng Wang, Liwei Chen)
This project uses three datasets addressing the average rating, location, and pricing information of various pizza establishments around the US. The goal is to assist the average American consumer in making informed decisions about which pizza restaurants to patronize. The data is first introduced, a data dictionary is created to help the reader understand the data, the data is then cleaned, and some initial exploratory analysis is done. After the exploratory analysis, statistical modeling techniques are used to examine the relationship and significance of different factors with respect to price level and the Barstool data owner’s (Dave’s) ratings. Model summaries are then broken down in more quantitative detail. A comparison of buying pizza at pizza establishments versus at a place like a sandwich shop or bar/pub is also taken into consideration.
Leigha Kraemer, The Evolution of the Wage Gap in America, May 2020, (Leonardo Lozano, Ruth Seiple)
My capstone project is an extension of a project I completed in the Business Analytics Data Wrangling with R course in the fall semester. The final project I delivered for the course was about the evolution of the wage gap in America. I utilized several datasets to provide an in-depth analysis of the presence of the wage gap in different industries and within different age groups. There is incontrovertible evidence that a wage gap exists today and has existed for many years, which I support throughout my capstone analysis. A wage gap does not exist for all females across all industries, so in my extended research I narrowed the focus to the industries, age groups, and locations with the most prevalent wage gaps and the possible reasons for them. By displaying the datasets and pulling in outside research, I provide the consumer of the data with enough information to determine whether they believe the gender wage gap is improving and, if it is not, the factors that may be causing it and the populations it is affecting the most. The main goal of my capstone project is to inform the reader about all aspects of the wage gap and encourage future generations to make a difference and reduce this plaguing issue.
Matt Lekowski, Beer Sales Based on Geography and Population Demographics, May 2020, (Michael Fry, Steve McGlone)
MadTree Brewing is one of the largest craft beer breweries in the state of Ohio. Based in the Oakley neighborhood of Cincinnati, MadTree uses several distributors to sell its beer in hundreds of zip codes across Ohio, Kentucky, and Tennessee. The populations of these zip codes are composed of a variety of different demographic types and geographic factors. Population age, ethnicity, median income, and gender breakdown, as well as geographic factors such as distance from a city center, all are potential factors that may influence craft-beer sales volumes. MadTree is seeking to increase its market penetration in the areas it currently sells its craft beer, so I have developed linear regression models to illustrate the relative significance of different predictor variables. I have also developed dynamic graphs and dashboards using Tableau’s geographic mapping feature as a means to visually compare different zip codes in various sales metrics including year-over-year increases in sales volume, sales volume per capita, and sales velocity. MadTree’s sales team will be able to use these dashboards to filter for certain package types, retailers, flavors of beer, and territories of their different sales representatives to further drill down their analysis and determine where improvements can be made with targeted sales and marketing efforts.
Jiaoyao Liu, Exploration of Map/Vision Control of Different Roles in League of Legends, May 2020, (Dungang Liu, Liwei Chen)
In the multiplayer online video game League of Legends, map/vision control is one of the key factors in winning games strategically. Although other important elements can influence the final result of a win or loss, such as player items, skills, and champion selections, this project focuses on analyzing quantified ward-related variables for all five player roles: Top, Mid, Jungle, Duo Carry, and Duo Support. The goal is to understand players' warding behaviors and provide insights that help them increase their win rate while climbing the ranked ladder, or simply become better players in general, by managing the number of vision wards on the game map.
Spencer Niehaus, Application of Simulation and Optimization to Women's College Basketball Scheduling, May 2020 (Michael Fry, David Rapien)
The NCAA Women's Basketball Tournament has been around since the 1980s and has been a staple of college athletics since its inception. Over the years, the tournament has become a large source of revenue for the NCAA and the teams selected to participate. The Women's Tournament takes a field of 64 teams composed of 32 conference tournament champions and 32 at-large bids selected by the NCAA Selection Committee. One of the biggest drivers of at-large selection is the Ratings Percentage Index (RPI), which is used to rate teams and their performances. The RPI is calculated as 0.25 × (the given team's winning percentage) + 0.50 × (its opponents' winning percentage) + 0.25 × (its opponents' opponents' winning percentage). This number also affects other important NCAA selection criteria such as strength of schedule and quality wins. The objective of this paper is to build a model that can predict teams' win probabilities for the next season, which will allow non-conference schedules to be optimized to increase a team's RPI and its likelihood of making the NCAA Tournament.
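The RPI formula described above is simple enough to sketch directly. The following helper is illustrative only (not the project's actual code), and the input values are hypothetical:

```python
def rpi(wp: float, owp: float, oowp: float) -> float:
    """Ratings Percentage Index: a weighted blend of a team's winning
    percentage (wp), its opponents' winning percentage (owp), and its
    opponents' opponents' winning percentage (oowp)."""
    return 0.25 * wp + 0.50 * owp + 0.25 * oowp

# Hypothetical team: wins 80% of its games against opponents who win 60%,
# whose own opponents win 50% of theirs.
print(round(rpi(0.80, 0.60, 0.50), 3))  # 0.625
```

Because the opponents' winning percentage carries twice the weight of the team's own record, scheduling strong non-conference opponents can raise RPI even without a perfect win-loss record, which is what the optimization exploits.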
Cassidy Peebles, Exploratory Analysis and Reporting Tool Use Case on Credit Card Approval Rate, May 2020 (Yan Yu, Kyle Swingle)
Utilizing transaction level data, the following analysis serves to first, provide a summary of the variables that impact credit card approval rate, and secondarily, use that analysis as a foundation to design a reporting tool that will allow a user to track the approval rate over time and address the specific risks that arise. This work was done for Worldpay, a financial technology company in the payments sector, on a Revenue Assurance team.
Samantha Riser Rickett, Home City Ice: Supply Chain Management, May 2020, (Charles Sox, Uday Rao)
Supply chain management has long been an area of interest among businesses seeking efficiencies and cost-saving opportunities. Research shows that businesses providing functional products value the physical costs of production, transportation, and inventory storage over market mediation costs. Therefore, proper production planning and inventory management are key components of a successful functional business. Optimal production levels minimize labor and raw material costs, and proper inventory management minimizes material handling and stockout costs. In combination, these lead to decreased costs, increased customer satisfaction, and increased profits. These principles of supply chain management were applied to a Cincinnati ice manufacturing company called Home City Ice. Our team was deployed to analyze the company's current supply chain and build an automated production and distribution schedule that balances production levels with stockout costs. Utilizing 10 years of historical demand, as well as production capacity, storage capacity, and the truck fleet across all 49 manufacturing facilities and 55 storage facilities, we delivered a model that yields an optimal profit-maximizing production and transportation schedule and provides other important operational measures at any given point in time. Additionally, a more accurate forecasting model was developed to further support our optimization model. Both have outperformed the company's current processes in both speed and accuracy.
Chi Zhang, Telco Customer Churn Prediction, May 2020, (Peng Wang, Yan Yu)
Customer churn, also known as customer attrition, is the loss of clients or customers. The dataset is from a telecom company that wants to understand its customers' churn behavior. Based on a predictive model, the company can estimate each customer's propensity to churn; because the cost of retaining a customer is much lower than the cost of acquiring a new one, the company can then focus its retention programs on likely defectors. In this project, I use several methods to determine the significant variables that affect customer churn and use those variables to build a logistic regression model to predict churn probabilities on future data. To determine the significant factors affecting churn, I use AIC stepwise selection and LASSO for variable selection. In addition, I build decision tree, random forest, and XGBoost models and evaluate their performance with confusion matrices to compare against logistic regression. I was able to identify the best model for predicting customer behavior and helping the company reduce losses.
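As an illustration of the final scoring step described above, a fitted logistic regression converts a customer's features into a churn probability through the logistic function. This is a minimal sketch with made-up coefficients; the real coefficients would come from the fitting and variable-selection procedure:

```python
import math

def churn_probability(features, coefs, intercept):
    """Apply a fitted logistic regression: p = 1 / (1 + exp(-(b0 + b.x)))."""
    linear = intercept + sum(b * x for b, x in zip(coefs, features))
    return 1.0 / (1.0 + math.exp(-linear))

# Hypothetical coefficients for (tenure_months, monthly_charges_scaled);
# these numbers are invented for illustration.
coefs = [-0.05, 1.2]
intercept = -0.5
p = churn_probability([24, 0.8], coefs, intercept)
print(round(p, 3))  # 0.323
```

Ranking customers by this probability produces exactly the "small prioritized list of potential defectors" that retention programs act on.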
Tyler J. Creel, Supply Chain Optimization: The Final Mile, May 2020 (Yinghao Zhang, Leonardo Lozano)
Kroger is headquartered in Cincinnati, OH and is the United States' largest supermarket retailer, with nearly 2,700 stores nationwide. Kroger's supply chain network is very complex, consisting of 38 manufacturing facilities and 42 distribution centers strategically placed at local, regional, and national levels. As e-commerce continues to grow and customer shopping trends evolve, Kroger is adapting and growing its supply chain to meet these needs. In 2018, Kroger formed a partnership with Ocado, a European online grocery retailer, that will allow customers to order online and have their groceries delivered to their homes. The task I set out to solve was finding the best location for the first Ocado/Kroger distribution center, which I did by creating an in-depth optimization model. The biggest difficulty I faced when building this model was accurately representing unknown customer demand for different urban areas in the Midwest. I was able to accomplish my task through the use of multiple simulations and the optimization model itself.
Anirudh Bhanu Teja Addala, Demand Forecasting for Dymatize, December 2019, (Yichen Qin, Michelle Xu)
Dymatize is a provider of premier nutritional and bodybuilding supplements with operations in Europe, Asia, and the United States. Dymatize uses third-party vendors to produce and sell its products across the globe; its major clients include Amazon, BB.com, and Walmart. Because production is handled by vendors, Dymatize needs upfront predictions of demand from different customers for different products for proper inventory planning and business expansion. Dymatize requires a 12-month forecast for its products at different levels, i.e., at the size level (1 lb., 2 lb., 5 lb., etc.) and at the flavor level (Vanilla, Chocolate, etc.). The challenges in the project include inconsistent demand for less popular products and the shorter life span of products due to the increasing need for innovative products in the market. Demand forecasting is done using different time series models, such as Exponential Smoothing, Croston's Method for intermittent demand, and ARIMA; WMAPE and RMSE are used as error measures to compare the models.
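Croston's method, mentioned above, is worth sketching because it differs from ordinary exponential smoothing: it smooths nonzero demand sizes and the intervals between them separately, then forecasts their ratio. This is a simplified illustrative sketch, not the project's production code:

```python
def croston(demand, alpha=0.1):
    """Croston's method for intermittent demand.

    Smooths nonzero demand sizes (z) and inter-demand intervals (p)
    separately; the per-period forecast is z / p.
    """
    # Initialize at the first nonzero demand observation.
    first = next(i for i, d in enumerate(demand) if d > 0)
    z = float(demand[first])  # smoothed demand size
    p = float(first + 1)      # smoothed inter-demand interval
    q = 1                     # periods since the last demand
    for d in demand[first + 1:]:
        if d > 0:
            z += alpha * (d - z)
            p += alpha * (q - p)
            q = 1
        else:
            q += 1
    return z / p

# Demand of 3 units arriving every third period -> ~1 unit per period.
print(croston([0, 0, 3, 0, 0, 3, 0, 0, 3]))  # 1.0
```

For SKUs with many zero-demand periods (e.g., niche flavors), this typically gives a steadier forecast than smoothing the raw series, which is why it is a standard choice for intermittent demand.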
Ramana Kumar Varma Nadimpalli, Data Analytics on Project Durations, December 2019, (Yichen Qin, Yatin Bhatia)
Incedo is a Bay Area-headquartered digital and analytics company that enables sustainable business advantage for its clients by bringing together capabilities across consulting, data science, and engineering to solve high-impact problems. Verizon is one of its clients. The completion and service level agreement (SLA) compliance rates for IEN (Intelligence Engineering Network) projects at Verizon are lower than desired, and the root cause is not known. Verizon wants to leverage data insights and analytics to address these issues.
We analyzed the data using statistics-based approaches to gain insights into different project types (projects containing At Your Service (AYS) tasks and projects containing non-AYS tasks). We extracted themes from the AYS tickets (a text corpus) using topic modeling and analyzed the resolution times and task durations for these themes. We also built a scoring model to score each assignee on a 100-point scale based on task completion times.
Amir Babar, Analysis of Physician Sentiments on the Allocation of Work Hours, December 2019, (Michael Fry, Ed Winkofsky)
Physicians at a university hospital can work at up to three different locations due to requirements to staff two community hospitals in addition to the university hospital. In response to reported dissatisfaction with the process for allocating physicians' hours at the three locations, physicians were surveyed regarding their sentiments toward various aspects of the department's process. Additionally, respondents ranked the importance of various potential “contribution metrics” to be used to weight physician preferences in a new process for assigning hours at different locations. Significant differences in sentiments were found based on respondent demographics, with younger, less senior faculty members reporting greater dissatisfaction with the overall process. These differences were further reinforced by clustering respondents based on their sentiments. Two clusters were identified: one generally very satisfied with the process and one dissatisfied. The former cluster consisted of older, more senior faculty with a larger proportion of males. Examination of rankings for potential contribution metrics in this cluster revealed high ratings for factors such as years practicing medicine and years as a faculty member, as well as research involvement and team fit. The latter cluster, while ranking certain metrics similarly, ranked seniority and team fit as far less important while putting much greater emphasis on a physician's total clinical hours when prioritizing preferences for hours assigned to each facility.
Brandon Lester, Predicting NFL Point Spread, December 2019, (Charles Sox, Paul Bessire)
Predicting the outcome of sporting events has been around for years. Whether it's ESPN analysts discussing matchups or individuals betting on the Vegas spread, there is great interest in improved methods for predicting game outcomes. This project used the random forest algorithm to create a model to predict future NFL games. Using R as the primary tool, game-level and play-by-play data were collected from nfl.com using the nflscrapR package, transformed into the final data set, and fit with a random forest model using the ranger package. The best root mean squared error achieved was 12.99.
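The error metric quoted above, root mean squared error, measures the typical gap between predicted and actual point spreads. A small sketch of the computation, with hypothetical spreads for four games (the project itself used R):

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

# Hypothetical actual and predicted point spreads for four games.
actual = [-3, 7, 10, -14]
predicted = [0, 3, 13, -10]
print(round(rmse(actual, predicted), 2))  # 3.54
```

An RMSE of 12.99 therefore means the model's spread predictions were off by roughly 13 points in a typical game, which gives context for how hard NFL outcomes are to predict.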
Chunyan Su, Are Lego sets priced too high?, November 2019, (Peng Wang, Yichen Qin)
How is a Lego set priced? You can actually find the answer on the official Lego site, which states: “Finding the right price for a set isn’t easy and depends on a lot of factors. To name a few, the number of new and unique molds required, and the cost of licensing characters from other companies and brands.” In this project I will use a Lego data set from Kaggle.com and explore the relationship between price and other factors. First, I will conduct exploratory data analysis to get a general idea of the data set, which contains a variety of data types: categorical, ordinal, and numeric. Second, I will study the relationships between price and the other variables for variable-selection purposes, trying the best subset approach, random forest, and gradient boosting. I will then split the data into training and testing samples and use MSE to measure the performance of each method. Third, a regression model will be built, with R-squared and total variance as metrics for evaluating the model's prediction accuracy. Last but not least, I will share some insights from my analysis; hopefully I will be a Lego expert shopper by the end of the study.
Kaijun Sheng, RDC Mobile Model, November 2019, (Yichen Qin, Joseph Burke)
This paper explains the process of building an RDC Mobile Model for accounts less than 90 days old to prevent fraud related to mobile deposits. To select the final model, we perform exploratory data analysis (including statistical summaries and histograms of numeric variables), collinearity checks to avoid overfitting, variable selection using stepwise and best subset selection, model selection based on performance on the training dataset, and model validation on the testing dataset. The final model will help us understand which variables are significant for preventing mobile deposit fraud while being used daily.
Shashank Kumar, Improving Post Launch New Product Performance Tracking, November 2019, (Amitabh Raturi, Sara Palavicini)
The four business segments of the Biosciences Division (Cell Biology, Molecular Biology, Protein and Cell Analysis, and Sample Preparation) thrive on their ability to introduce new products in the market with the following objectives: to keep the portfolio fresh, to drive market penetration, and to stay ahead of the competition. Tracking the progress of new products after their launch is a process that requires inputs from multiple teams and consumes a lot of time. There is a need for automated business intelligence reports on the performance of the various SKUs across geographical markets, built by extracting data from the dynamically updated database. This project focuses on improving the post-launch New Product Introduction (NPI) tracking process, which tracks the performance of a new product launched by a business segment for six months against several metrics. These metrics evaluate the product from the perspectives of the teams involved, using performance data from marketing, operations, and finance. Currently, the tracking and reporting process involves data input from the three teams into a flat file, and the combined report at times has issues with reporting and interpretation. Therefore, the project aims to improve the metrics and automate the visualization of the data in Power BI. This will improve understanding of the report and enable better use of the product details for further analysis.
Trishul Gowda Ashok, Market Basket Analysis & Inventory Management Model, November 2019, (Yan Yu, Eric Walters)
Milacron, a global leader in the manufacturing, distribution, and service of highly engineered and customized systems in the plastics technology and processing industry, is moving toward data-driven strategies. The Aftermarket Analytics and Business Intelligence team is building multiple in-house data products. One of these products is the Shopping Cart Tool, which provides recommendations to customers based on what they have quoted for. I was assigned the task of using transaction data to generate association rules that can be used to provide recommendations to customers. Spare-part recommendations help with cross-selling, thereby increasing revenue for Milacron, and help reduce shipping and servicing costs for the customer. As part of the Graduate Case Study course, we developed a time series forecasting model and an inventory optimization model for Milacron. I also worked on testing the models built during the course, identifying challenges to implementing them, and making the necessary changes to suit the replenishment process at Milacron.
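Association rules like those behind the Shopping Cart Tool are scored by support, confidence, and lift. A toy sketch over hypothetical spare-part transactions (real rule mining would use a dedicated library such as arules in R; the part names below are invented):

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence, and lift for the rule antecedent -> consequent."""
    n = len(transactions)
    a = sum(1 for t in transactions if antecedent <= t)
    c = sum(1 for t in transactions if consequent <= t)
    both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = both / n            # how often the rule's items co-occur
    confidence = both / a         # P(consequent | antecedent)
    lift = confidence / (c / n)   # strength vs. chance co-occurrence
    return support, confidence, lift

# Hypothetical spare-part orders.
orders = [
    {"nozzle", "heater_band"},
    {"nozzle", "heater_band", "screw_tip"},
    {"nozzle", "screw_tip"},
    {"heater_band"},
]
s, c, l = rule_metrics(orders, {"nozzle"}, {"heater_band"})
print(s, round(c, 3), round(l, 3))  # 0.5 0.667 0.889
```

Rules with lift above 1 indicate items bought together more often than chance; such rules are the ones worth surfacing as quote recommendations.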
Anusha Chintakunta Manjunatha, B2B Customer Profiling with External Data, September 2019, (Yiwei Chen, Rajat Swaroop)
CDW, a leading provider of technology products and services in the business-to-business (B2B) space, is moving toward a data-driven marketing strategy. The Enterprise Data Science and Advanced Marketing Analytics team is building multiple in-house data products. One of these products is Customer 360, which provides a unified customer profile drawing on internal and external data sources. I was assigned the task of exploring external data sources and providing CDW with useful sample data. Signals gleaned from external data, including news and social media, job listings, turnover, and more, can indicate when a lead might be ready for the sales team to approach. Events like growth in R&D, company restructuring, and hiring acceleration can all be indicators of a firm's readiness to buy from CDW. I used R and Python to build Application Programming Interface (API) clients and web scrapers for data collection and transformation.
Tess Newkold, Trump: Ten Years of Tweeting, November 2019, (Dungang Liu, Liwei Chen)
The President of the United States of America holds one of the most important positions in the world; everything said, written, or tweeted is of great importance. The Department of Justice has said it treats Donald Trump's tweets as official presidential statements. To better understand what Trump has tweeted in the last ten years, I use natural language processing techniques on all 39,000+ tweets from the online “Trump Twitter Archive.” This analysis looks at what Trump tweeted most, when he tweeted, and his overall sentiments, as well as how each of these has changed over the last ten years. The results show that the general sentiment of Trump's tweets is overwhelmingly negative and that the tweets grew more negative over the span of years analyzed. Lastly, the more frequently Trump expresses emotions like anger, fear, disgust, and sadness, the more retweets the President receives from his followers.
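Sentiment analysis of this kind is typically lexicon-based: each word in a tweet is matched against lists of emotion-tagged words. The tiny lexicon and tweet below are made up for illustration; a real analysis would use a full lexicon such as NRC:

```python
from collections import Counter
import re

# Toy lexicon; a real analysis would use a full emotion lexicon.
LEXICON = {
    "great": "positive", "win": "positive", "tremendous": "positive",
    "fake": "negative", "sad": "negative", "disaster": "negative",
}

def sentiment_counts(text):
    """Count positive and negative lexicon words in a piece of text."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(LEXICON[w] for w in words if w in LEXICON)

counts = sentiment_counts("Fake news is a sad disaster, but we will win!")
print(counts["positive"], counts["negative"])  # 1 3
```

Aggregating these counts by year, or correlating them with retweet counts, is how trends like "tweets grew more negative over time" are quantified.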
Anthony Selva Jessobalan, Handwritten Digit Recognition, August 2019, (Dungang Liu, Liwei Chen)
Image recognition has been an important part of technological research in the past few decades, and image processing is the main reason computer vision has been made possible. Once an image is captured, the computer stores it as a 3D array, with the dimensions referring to height, width, and the color channels; it is then compressed and stored in popular formats such as JPEG, PNG, etc. For the computer to understand these numbers, it is important to train the machine by tagging the images to enable learning. The main idea of this project is to use image recognition techniques to identify handwritten digits in images. Deep neural networks are well suited to complex problems such as image processing because they break complex problems down into simpler, understandable forms. The TensorFlow library has a host of pre-built methods that can be used directly to achieve this. The data for this project is from an online competition hosted by analyticsvidhya.com and is divided into test and train CSV files. The file 'train.csv' has two columns, 'filename' and 'label': filename refers to the 70,000 28 × 28 PNG files of handwritten digits (31 MB in total), and label is the tag associated with each image. Using the TensorFlow framework, the images were classified with an accuracy of 95.28%. TensorFlow relies heavily on computational power, so higher accuracy could be obtained by tweaking the hyperparameters on state-of-the-art systems; even with limited computational capability, TensorFlow performed well for image recognition.
Pallavi Singh, Anomaly Detection in Revenue Stream, August 2019, (Dungang Liu, Brittany Gearhart)
The client owns and operates parking facilities at multiple airports across the US. Revenue is calculated and collected for cars parked at these locations based on the price, duration, and type of parking, along with any discount coupons used during the transaction. The revenue is collected by the cashiers and managers staffing the booths. The client has observed that at certain times there have been discrepancies between the revenue collected and the number of cars exiting the parking facility. In most such cases the revenue was lower than expected based on the number of cars parked in the facility. This observation led to ad hoc investigations, which found that some employees managing the booths were not being completely transparent and honest, and that fraud was taking place.
The client wants to identify these frauds in a timely manner, as the current process is tedious and ad hoc, and by its very nature there is a high possibility of some frauds being missed or overlooked. The client wants an automated process in which such anomalies in the revenue stream are identified automatically so that timely investigations can be carried out.
It was decided to build the required model for one parking facility, tune the model, and validate the results before scaling the approach to multiple locations.
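A simple version of the automated check described above compares each shift's collected revenue per exiting car against the historical pattern and flags large deviations. A minimal z-score sketch with made-up numbers; the project's actual model would be tuned per facility:

```python
import statistics

def flag_anomalies(revenue_per_car, threshold=2.0):
    """Flag shifts whose revenue-per-car z-score is unusually low or high."""
    mean = statistics.mean(revenue_per_car)
    sd = statistics.stdev(revenue_per_car)
    return [i for i, r in enumerate(revenue_per_car)
            if abs((r - mean) / sd) > threshold]

# Hypothetical revenue per exiting car over ten shifts; shift 7 looks short.
shifts = [12.1, 11.8, 12.3, 12.0, 11.9, 12.2, 12.1, 7.5, 12.0, 12.2]
print(flag_anomalies(shifts))  # [7]
```

Flagged shifts would then be routed to investigators, replacing the ad hoc reviews with a systematic trigger.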
Vinaya Rao Hejmady, NYC Taxi Rides - Predicting Trip Duration, August 2019, (Peng Wang, Yichen Qin)
To improve the efficiency of electronic taxi dispatching systems, it is important to be able to predict how long a driver will have their taxi occupied. If a dispatcher knew approximately when a taxi driver would be ending their current ride, they would be better able to identify which driver to assign to each pickup request. In this project, I will build a predictive framework that infers the trip time of taxi rides in New York City; its output is the travel time of a particular taxi trip. I will first study and visualize the data, engineer new features, and examine potential outliers. I will then analyze the impact of the features on the target trip_duration values.
For temporal feature analysis, I will look for time-based trends in the target variable and check whether it follows any patterns. Finally, I will build a model to predict the trip duration; I plan to try regression, decision trees, and gradient boosting, and fit the best model to the data.
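One commonly engineered feature for trip-duration prediction is the great-circle (haversine) distance between pickup and dropoff coordinates. The sketch below is illustrative; the actual feature set is the author's:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# Midtown Manhattan to JFK Airport (approximate coordinates).
print(round(haversine_km(40.7580, -73.9855, 40.6413, -73.7781), 1))
```

Distance alone is a strong predictor of trip duration, and combining it with the temporal features (hour of day, day of week) captures traffic effects on top of it.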
Nan Li, Bon Secours Mercy Health Reimbursement Analytics, August 2019, (Michael Fry, Jeremy Phifer)
Bon Secours Mercy Health's home office recognizes the need to project gross revenue and net revenue on a monthly basis. This process allows management to make critical business decisions and plan accordingly. The goal of the Tableau dashboard is to provide an integrated visualization that helps the Chief Financial Officer for each group and each market easily understand projected gross and net revenue performance in the current month. Forecasting payments for the next month helps leaders make assumptions about future operations. Management of Bon Secours Mercy Health is interested in quantifying cash collections to understand financial performance and evaluate revenue cycle operations. Using five years of historical payment data, the ARIMA (1,5) model in SAS predicts monthly payments more closely than the historical payment average and is being considered as an alternative approach for payment activity forecasting in the future.
Kratika Gupta, Talkingdata Adtracking Fraud Detection Challenge, August 2019, (Peng Wang, Liwei Chen)
We are challenged to build an algorithm that predicts whether a user will download an app after clicking a mobile app ad. To support the modeling, we are provided a generous dataset covering approximately 200 million clicks over four days. Evaluation is based on the area under the ROC curve between the predicted probability and the observed target. I first started with basic exploratory data analysis to understand the features, plotting graphs of each feature for clicks where the app was and was not downloaded. After getting a basic idea of the distribution of the variables, I did feature engineering as required, adding features based on frequency counts to fit the model better. Modeling started with basic logistic regression as a simple linear classifier. I then built a random forest to capture the non-linearity in the data. For better performance and predictive power, I tried other advanced models such as gradient boosting, SVM, and neural networks.
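The competition metric, area under the ROC curve, can be computed directly from predicted scores as the probability that a randomly chosen download outranks a randomly chosen non-download. A small pure-Python sketch with hypothetical scores (real-scale evaluation would use a library implementation):

```python
def auc(labels, scores):
    """AUC via the rank (Mann-Whitney) formulation: the fraction of
    positive/negative pairs where the positive gets the higher score."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical predicted download probabilities for six clicks.
labels = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.2, 0.6, 0.4]
print(round(auc(labels, scores), 4))  # 0.7778
```

Because AUC depends only on the ranking of scores, it is well suited to this heavily imbalanced dataset, where very few clicks end in a download.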
Anupreet Gupta, Strategy and Portfolio Analytics, August 2019, (Mike Fry, Siddharth Krishnamurthi)
Credit cards have become an important source of revenue for the bank, as they charge a higher Annual Percentage Rate (APR) than any other consumer lending product the bank offers. While the sources of revenue include interchange fees, transaction charges, and finance charges, the industry is very competitive and introduces promotional offers to lure customers on board. From a portfolio point of view, it is of great significance to be able to forecast how the book will look in the future; having this visibility helps different teams and departments prepare a plan of action for the coming years. The credit card business also poses a risk to the bank if customers cannot fully repay the amounts borrowed, which is where collection strategies come in: recovering the amount or preventing the customer from being charged off. Using a model built on customers already enrolled in hardship programs, we predicted the customers most likely to enroll and proactively pitched the program to them before they were charged off the books. Enhancing customer experience is among the core values of Fifth Third Bank. One such attempt is to extend the expiration date of premier customers' Rewards Points by a year without impacting the bank's financials; delinquent customers were identified as one area of opportunity to cover the loss due to the extension.
Anirudh Chekuri, Churn Prediction of Telecommunication Customers, August 2019, (Yichen Qin, Peng Wang)
Customer churn occurs when a consumer stops doing business with a company. The cost of retaining a customer is low compared to acquiring a new one, so for any business, churn prediction is an important investment in terms of customer lifetime value and marketing. In this project we have data from a telecommunications company, and we try to determine the reasons for customer churn and build a predictive model that gives the probability of churn. We used a random forest to check variable importance using mean decrease in Gini and mean decrease in accuracy, and logistic regression with a logit link to estimate churn probabilities. The variables that turned out to be important factors affecting churn can be used to design actionable strategies to reduce churn.
Ashish Gyanchandani, Fraud Analytics, August 2019, (Michael Fry, Andy M)
XXX is currently building two types of software products: Integrity Gateway Pre-Approval and Integrity Gateway Monitoring. My work revolved around improving the analytical aspects of Integrity Gateway Monitoring, which helps customers continuously monitor their employee spend data with XXX's risk engine algorithm. The data I worked on was Concur expense data, containing the expense details entered by employees in the Concur expense tool, such as transportation, supplies, and meals. The dataset contained close to 1.5 million records. My work fell into three categories: exploratory analysis, criteria setting, and modeling. The exploratory analysis involved examining round-dollar transactions, creating expense categories from expense types, and similar tasks. Criteria setting required me to come up with benchmarks to help XXX compute the risk score, such as the cash to non-cash transaction ratio for each country. Finally, modeling involved creating employee clusters, anomaly detection, and trend detection.
The programming languages and tools used for this analysis were R, Tableau, and Amazon AWS SageMaker.
Harpreet Singh Azrot, An Analytical Study of West Nile Virus (WNV) Mosquitos in Chicago, August 2019, (Peng Wang, Yichen Qin)
The objective of this study is to understand and analyze how WNV-carrying mosquitos have affected the city of Chicago over the last few years. The study can play a crucial role in identifying the important factors and conditions associated with finding these mosquitos, and in predicting their presence given specific parameters and conditions, so that communities can take appropriate actions to mitigate the risks. These predictions are achieved by implementing multiple machine learning algorithms and comparing them to find the best model for prediction.
Palash Arora, Predicting Revenue and Popularity of Products, August 2019, (Yichen Qin, Nanhua Zhang)
Insurance and healthcare companies can benefit from analyzing customer demographics in order to promote the right type of product. In this project our goal is to understand the relationship between various customer demographic factors and product preference in different regions, and to predict the expected revenue for each zip code across the United States. We analyzed approximately 6,000 zip codes with 25 predictor variables, such as average age, salary, and population, in order to predict two dependent variables: the preferred product and the expected revenue in each zip code. We used two statistical methods in this project: linear regression to predict the expected revenue generated in each zip code, and multiple logistic regression to predict the preferred product type in each zip code.
Apoorva Rautela, Event Extraction from News Articles using NLP, August 2019, (Charles R. Sox, Amit Kumar)
Huge amounts of text data are generated every day, and some of the information contained in these texts needs to be handled and analyzed carefully. Natural language processing can help organizations build custom tools to process this information and gather valuable insights that drive businesses. One common application of NLP is event extraction: the process of gathering knowledge about periodical incidents found in texts, automatically identifying information about what happened and when. This ability to contextualize information allows us to connect time-distributed events, assimilate their effects, and see how a set of episodes unfolds through time. These insights drive organizations that provide technology to different market sectors. Steel tariffs have a direct impact on the Oil Country Tubular Goods (OCTG) market, and this project aims to extract events from news articles related to ‘steel tariffs’ from the past 18 months.
In this work, news articles related to ‘steel tariffs’ are collected from newsapi.org, and the text is then processed using NLP techniques, focusing mainly on extracting events via extractive text summarization.
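Extractive summarization, as used here, scores each sentence by the frequency of its words across the document and keeps the top-scoring ones. A minimal frequency-based sketch on a made-up snippet; real pipelines add stopword removal, stemming, and better tokenization:

```python
import re
from collections import Counter

def top_sentence(text):
    """Return the sentence whose words are most frequent across the text."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = re.findall(r"[a-z]+", text.lower())
    freq = Counter(words)

    def score(sentence):
        toks = re.findall(r"[a-z]+", sentence.lower())
        return sum(freq[t] for t in toks) / max(len(toks), 1)

    return max(sentences, key=score)

# Made-up article snippet for illustration.
article = ("Steel tariffs rose again this quarter. "
           "Analysts expect tariffs on steel to lift OCTG prices. "
           "Unrelated news also appeared.")
print(top_sentence(article))
```

Sentences dominated by the document's most frequent content words ("steel", "tariffs") score highest, which is how the summary surfaces event-bearing sentences from a long article.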
Pravallika Kalidindi, Analysis of Balance Transfers and Credit Line Increase Programs for Credit Cards, August 2019, (Michael Fry, Jacob George)
Balance transfer (BT) and credit line increase (CLI) programs are two main profit-generating programs for credit card companies. Throughout this project we tested different marketing channels, customer behaviors, and profit and risk drivers for both programs. The balance transfer program involves identifying the right customers and giving them lucrative offers to transfer their credit card debt from another bank to Fifth Third. In credit line increase programs, the credit card company increases the credit limit for selected credit-worthy customers, enabling them to increase their purchases and thereby capitalizing on the incremental interchange revenue and finance charges. The key findings of our analysis are as follows. Customers doing a digital BT tend to do a greater number of BTs, but with each BT being of lesser value compared to non-digital BTs. We observed that the percentages of accounts going delinquent and charging off are higher when customers use convenience checks, showing that convenience checks are riskier. On a portfolio level, customers take on more debt after a BT, but this behavior is highly dependent on the type of customer. To reduce losses, we analyzed a potential solution: canceling the promotional APR when a customer goes delinquent, and we calculated the estimated finance charge collectible at different stages of the delinquency cycle. We also observed that risk metrics for CLI are close to each other in the test and control groups.
Guru Chetan Nagabandla Chandrashekar, Improving Delivery Services Using Visualizations, August 2019, (Charles R. Sox, Pratibha Sharan)
The Commercial Effectiveness team at Symphony Health provides Consulting and Analytics services to healthcare companies of all sizes around the country. They provide standard solutions to the brand teams of drugs from pre-approval, pre-launch, and launch through the patent expiry phase. Many of these solutions can be standardized and automated to eliminate repetitive work, save time, reduce errors, and get to insights faster. One way to achieve this is by developing standardized visualizations and dashboards that capture the must-haves in key solutions or key aspects of a project. This report goes through a few visuals I developed during my internship at Symphony Health, covering the need for each visual, the data required, the design, the outputs, and the impact.
Mohit Anand, Predicting Customer Churn in a Telecom Industry, August 2019, (Peng Wang, Liwei Chen)
Customer attrition, also known as customer churn, customer turnover, or customer defection, is the loss of clients or customers. Telephone service companies, Internet service providers, pay TV companies, insurance firms, and alarm monitoring services often use customer attrition analysis and customer attrition rates as one of their key business metrics because the cost of retaining an existing customer is far less than that of acquiring a new one. Companies in these sectors often have customer service branches that attempt to win back defecting clients, because recovered long-term customers can be worth much more to a company than newly recruited clients. Companies usually distinguish between voluntary and involuntary churn. Voluntary churn occurs when a customer decides to switch to another company or service provider; involuntary churn occurs due to circumstances such as a customer's relocation to a long-term care facility, death, or a move to a distant location. In most applications, involuntary reasons for churn are excluded from the analytical models. Analysts tend to concentrate on voluntary churn because it typically results from factors of the company-customer relationship that companies control, such as how billing interactions are handled or how after-sales help is provided. Predictive analytics uses churn prediction models that assess each customer's propensity to churn. Since these models generate a small prioritized list of potential defectors, they are effective at focusing customer retention marketing programs on the subset of the customer base most vulnerable to churn.
Kunal Priyadarshi, Microsoft Malware Prediction Challenge, August 2019, (Peng Wang, Liwei Chen)
Malware is software designed to cause damage. We want to help protect more than one billion Windows machines from damage before it happens. The problem is to develop techniques to predict whether a machine will soon be hit with malware. It is a classification problem, and the models were built using decision trees (CART), random forest, and gradient boosting machines. These are current state-of-the-art algorithms: they require no assumptions about the relationship between the independent and dependent variables, work well in non-linear settings, and, being tree-based, handle missing values on their own. The entire code is also reproducible. While random forest and gradient boosting machines gave comparable area under the curve (AUC) on the test data, the training AUC was significantly larger for random forest. Gradient boosting machines are recommended as the final model, as the bias was similar while the variance was lower for GBM.
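The train-versus-test AUC comparison described above can be illustrated on synthetic data; this is a generic scikit-learn sketch, not the project's actual features or tuning:

```python
# Compare train and test AUC for random forest vs. gradient boosting on
# synthetic data; a large train-test gap signals higher variance (overfitting).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for model in (RandomForestClassifier(random_state=0),
              GradientBoostingClassifier(random_state=0)):
    model.fit(X_tr, y_tr)
    auc_tr = roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1])
    auc_te = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{type(model).__name__}: train AUC={auc_tr:.3f}, test AUC={auc_te:.3f}")
```

A random forest with fully grown trees typically shows a near-perfect training AUC, which is the gap the abstract uses to argue for GBM.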
Aniket Sunil Mahapure, Quora Question Pairs Data Challenge, August 2019, (Peng Wang, Liwei Chen)
This project is based on a Kaggle competition (https://www.kaggle.com/c/quora-question-pairs/overview). Quora is a platform to ask questions and connect with people who contribute unique insights and quality answers. Over 100 million people visit Quora every month, so it's no surprise that many people ask similarly worded questions. Multiple questions with the same intent can cause seekers to spend more time finding the best answer to their question, and make writers feel they need to answer multiple versions of the same question. So Quora is keen to group questions by meaning to reduce redundancy and improve overall convenience for users. In this competition, the objective is to tackle this natural language processing problem by applying advanced techniques to classify whether question pairs are duplicates or not.
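One simple baseline for duplicate-question detection is TF-IDF cosine similarity between the two questions; fitting the vectorizer on just the pair (rather than the full corpus) and the example questions are simplifications for illustration:

```python
# Baseline duplicate-question signal: TF-IDF cosine similarity between the
# pair. A real pipeline would fit the vectorizer on the whole corpus and
# feed this score, with other features, into a classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def similarity(q1: str, q2: str) -> float:
    vecs = TfidfVectorizer().fit_transform([q1, q2])
    return float(cosine_similarity(vecs[0], vecs[1])[0, 0])

print(similarity("How can I learn Python quickly?",
                 "What is the fastest way to learn Python?"))
print(similarity("How can I learn Python quickly?",
                 "What is the capital of France?"))
```

The first pair shares vocabulary and scores higher than the second, unrelated pair, which scores zero.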
Supriya Sawant, Prediction of Fraudulent Click for Mobile App Ads, August 2019, (Peng Wang, Liwei Chen)
This project is based on a Kaggle competition (https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection/overview/description) about fraudulent click traffic for mobile app ads. For companies that advertise online, click fraud can happen at an immense volume, resulting in misleading click data and wasted money. TalkingData, China’s largest independent big data service platform, covers over 70% of active mobile devices nationwide and handles 3 billion clicks per day, of which 90% are potentially fraudulent. China is the largest mobile market in the world and therefore suffers from huge volumes of fraudulent traffic. In this project, we are required to build an algorithm that predicts whether a user will download an app after clicking a mobile app ad.
Lixin Wang, Geolocation Optimization for Direct Mail Marketing Campaign, August 2019, (Michael Fry, Shu Chen)
This capstone project is part of an internship project for marketing analytics at Axcess Financial. The goal of the project is to analyze the geo-spatial relationship between stores and customers, identify trade areas, and optimize store assignment in direct mail marketing campaigns. Historical customer information was extracted from the database using Structured Query Language (SQL) and the PROC SQL procedure in SAS. Spatial analysis was done using pivot tables in Excel, and simple descriptive analysis was done in SAS. An interactive dashboard was created in Tableau to visualize the geo-spatial distribution of stores and customers. The major market for each store was identified, and trade areas were delineated. Store analysis shows that the majority of customers in each zip code went to the top 3 stores when there were multiple destination stores. The top 3 destination stores are consistent with the stores assigned to each zip code, suggesting that the current store assignment strategy works well at the zip code level.
Menali Bagga, Analysis of Defects and Enhancements Tickets, August 2019, (Michael Fry, Lisa DeFrank)
This capstone project contains two major applications of what I learned during my Master's in Business Analytics: Data Visualization (using Power BI) and Data Mining using Clustering Analysis (K-Means). The first half of the project aimed to visualize the number of open critical, high-priority production support tickets, mainly defect and enhancement types, prior to the August 2019 release. To achieve this, I pulled the relevant data by developing suitable queries in the Azure DevOps service, connected it to Power BI, and filtered the data directly there to suit my needs. The second half of the project performed a clustering analysis on the tickets to facilitate capacity planning by putting the tickets into suitable brackets.
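A K-Means bucketing like the one described might look as follows; the two numeric features (estimated effort, priority score) and the synthetic ticket data are assumptions for illustration, not the project's actual fields:

```python
# Illustrative K-Means bucketing of tickets by two hypothetical numeric
# features (estimated effort in hours, priority score) for capacity planning.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic tickets: quick fixes, medium defects, large enhancements.
tickets = np.vstack([
    rng.normal([2, 1], 0.5, (20, 2)),    # low effort, low priority
    rng.normal([8, 5], 1.0, (20, 2)),    # medium
    rng.normal([20, 9], 2.0, (20, 2)),   # high effort, high priority
])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(tickets)
for label, center in enumerate(km.cluster_centers_):
    size = int((km.labels_ == label).sum())
    print(f"bucket {label}: {size} tickets, center ~ {center.round(1)}")
```

Each bucket's center then suggests how much capacity to reserve for tickets of that type.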
Syed Imad Husain, Blood Supply Prediction, August 2019, (Chuck Sox, Shawn Gregory)
Hoxworth Blood Center is the primary blood donation center in the Greater Cincinnati area. Uncertainty in blood supply patterns and donor behaviors is one of the greatest challenges faced by Donor Services and Blood Center operations. This project develops analytical methods to support data-driven decision making for them by employing descriptive, predictive, and prescriptive analytics. The main areas of focus are understanding donors' participation (classification) and predicting donor turnout (regression) for a given drive. Different supervised and unsupervised learning techniques have been employed to uncover trends.
Lakshmi Mounika Cherukuri, Brightree Advanced Analytics Projects, August 2019, (Michael Fry, Fadi Haddad)
The Brightree Advanced Analytics team focuses on providing tailored analytical solutions to internal and external customers through dynamic dashboards that are easy to navigate. The team contributes to the growth of customers by providing clean, consolidated, and consumable data insights. This capstone report outlines two such projects: the AllCall KPI Survey and a Customer Profitability Analysis. The AllCall KPI Survey project involves one of our internal customers, AllCall, a subsidiary of Brightree that works on resupply orders. The project requires us to embed a feedback survey questionnaire into Sisense, enabling the surveyor to input feedback through a dynamic dashboard and store the responses in the database. Once the survey results are in, we analyze the data and visualize it in a KPI review dashboard, which identifies the most efficient callers, the number of sales orders taken by each caller, and whether proper etiquette is practiced by callers while communicating with patients. The Customer Profitability Analysis is a project involving our external customers. For this project, the team is tasked with identifying the manufacturers and items that yield the most profit, factoring in the costs, revenues, bill quantities, and number of sales orders for each item group. In addition, customers also want to detect the main sources of new patients (e.g., referrals, ordering doctors, marketing representatives). The final product is expected to be a dynamic dashboard showing the aforementioned KPIs and profitability measures over time.
Maryam Torabi, Improving Patient Flow in an Emergency Department: a Computer Simulation Analysis, August 2019, (Yiwei Chen, Yan Yu)
In this study I use a year of operational time-stamp data from a regional Level II Trauma Center emergency department in Virginia to understand the nature of patient flow in this ED and to build computer simulation models. The emergency department routes patients to 4 different treatment teams based on the severity of their condition. I simulated the current system and an alternative that pools two of the treatment teams (ESI2 and ESI3) and delegates some tasks to the triage nurse. Comparing the average length of stay (LOS) of patients in the pooled teams and the weighted average of patient wait time after triage to get into a bed shows that pooling resources improves both performance metrics at the 0.05 significance level.
Apoorva Bagwe, Loss Forecast Model, August 2019, (Charles R. Sox, Adam Phillips)
Axcess Financial Inc. offers different types of loan products to its customers. A Retail Choice Loan Product (CLP) loss forecast model is currently used by the company to forecast the amount of money it will lose due to Retail CLP customers charging off. These models were developed in SAS and are refreshed every month. Since their processing time is high, the modeling process needed to be replicated on Snowflake. My task was to convert the loss forecast models from SAS to Snowflake, which reduced execution time from 7-8 hours to less than an hour. The project was divided into four phases: creating the base data, forecasting charge-offs using Markov chain modeling, forecasting charge-offs using loss curves, and improving the overall efficiency of both the process and the model. Because two different data integration processes load the company's data into SAS and Snowflake, many checks were needed at each stage to ensure the accuracy of the results. Due to functional and coding differences between SAS and Snowflake, different data structuring approaches were needed to replicate the analysis on Snowflake. Another challenge was that the SAS databases were updated daily while the Snowflake databases were updated every few hours. The business models in Ohio, Illinois, and Texas differed from the rest, so loans from these states were analyzed and modeled separately.
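A Markov chain charge-off forecast of the kind mentioned above can be sketched as follows; the delinquency states, transition probabilities, and balances are illustrative, not the company's actual figures:

```python
# Sketch of Markov-chain loss forecasting: monthly transition probabilities
# between delinquency states are applied repeatedly to project how today's
# balances roll toward charge-off. All matrix values are illustrative.
import numpy as np

# States: current, 30 days past due, 60 DPD, charged off (absorbing).
states = ["current", "30dpd", "60dpd", "chargeoff"]
P = np.array([
    [0.95, 0.05, 0.00, 0.00],
    [0.40, 0.30, 0.30, 0.00],
    [0.15, 0.10, 0.35, 0.40],
    [0.00, 0.00, 0.00, 1.00],
])
assert np.allclose(P.sum(axis=1), 1.0)  # each row is a probability distribution

balance = np.array([100.0, 8.0, 3.0, 0.0])  # $MM in each state today
projected = balance @ np.linalg.matrix_power(P, 12)  # 12 months ahead
print(dict(zip(states, projected.round(2))))
```

Because the rows are stochastic, total balance is conserved; the mass that accumulates in the absorbing charge-off state is the forecast loss.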
Vishnu Guddanti, Response Model Analysis, August 2019, (Michael Fry, Kaixi Song)
Credit cards have become an important part of everyday life and one of the most convenient ways to transact. The credit card industry is a lucrative business, with major revenues including interest on revolving balances, missed payment fees, late fees, annual fees, merchant fees, etc. Most banks that issue credit cards run acquisition campaigns, and direct mail is one of the acquisition strategies employed by Fifth Third Bank: a mail piece is sent to a prospective customer with an offer such as a balance transfer, a spend-to-get-cashback offer, or a zero percent APR offer for a certain period. The project seeks to improve the population selection strategy of Fifth Third Bank's direct mail campaigns. The population for a campaign is selected by considering a variety of factors, including marketing costs, mail offer, response score deciles, present value of the prospective customer, FICO score, response rate, and approval rate. Based on these factors, return on marketing investment (ROMI) is calculated at the FICO group and response score decile level, and only the population that meets the ROMI cut-off is selected. The project seeks to improve campaign efficiency by calculating ROMI at a more granular response score and FICO group level using exponential regression models. Through this response model analysis, 10k more customers could be targeted, resulting in 26 more credit card accounts booked and a Net Present Value (NPV) increase of 9,824 USD for the bank.
Douglas Kinney, Emerging Risk: Visualizing, Mining, and Quantifying Wildfire Exposure, August 2019, (Michael Fry, Dan Madsen)
One of the primary challenges insurance and reinsurance companies face today is understanding catastrophe risk in a changing landscape. Population movement, city development, climate change, and recent major events in California are the key factors driving an increased focus on wildfire risk at the natural-disaster level. Model vendors have not yet produced a succinct, transparent method of quantifying concentrations of risk, aggregating the level of pure hazard, or estimating the damageability of a given location. This paper focuses on leveraging location data to compare concentrations of wildfire risk in California across varying portfolios of business. The aim is to create a customized view of risk for each data set using proprietary wildfire hazard grading. The end result is a framework of analysis that produces digestible information for underwriting, executive review, and decision-making purposes.
Nitesh Agarwal, Predicting the Occurrence of Diabetes in PIMA Women, August 2019, (Yan Yu, Akash Jain)
The diabetes data set, containing information about PIMA Indian females, is used for the analysis. It covers 768 females, of whom 268 were diagnosed with diabetes. Eight variables are available, such as age, number of pregnancies, glucose, and insulin. Missing values constituted about 30% of the observations, so MICE (Multivariate Imputation via Chained Equations) was performed to impute them. Correlation analysis showed that insulin and glucose, and BMI and skin thickness, had moderately high linear correlations. Logistic regression, classification tree, random forest, and support vector machine models were deployed; the support vector machine was chosen as the best model based on out-of-sample AUC, and it also has the minimum misclassification rate.
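Chained-equations imputation in the spirit of MICE can be approximated with scikit-learn's IterativeImputer; the column choices and toy values below are illustrative, not the real PIMA data:

```python
# MICE-style imputation sketch: each column with missing values is modeled
# from the other columns in round-robin fashion until estimates converge.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy rows: [Glucose, Insulin, BMI] with missing values encoded as NaN.
X = np.array([
    [148.0, np.nan, 33.6],
    [85.0, 94.0, 26.6],
    [183.0, 168.0, np.nan],
    [89.0, 94.0, 28.1],
    [137.0, 168.0, 43.1],
])
X_imp = IterativeImputer(random_state=0).fit_transform(X)
print(np.isnan(X_imp).any())  # the imputed matrix has no missing values
```

Note that classic MICE produces multiple imputed data sets to reflect imputation uncertainty, whereas this sketch produces a single completed matrix.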
Pooja Purohit, Marketing Analytics for JoAnn Stores, August 2019, (Charles R. Sox, Prithvik Kankappa)
Today, retail is one of the most volatile industries due to an uncertain economy, digital competition, an increasing number of product launches, shifting customer interests, tariff pressures, supply-chain constraints, and more. Jo-Ann has a brick-and-mortar model with a small online presence (~4%) and faces these challenges on a day-to-day basis. It is considering tapping its available data to optimize sales and increase margin. Advanced analytics can deliver insights that inform smart decisions, from deciding what promotions should be run for a product to what price should be set to maximize margin. This report is primarily based on a simulation and planning tool designed by the Impact team for Jo-Ann stores to get far more from their marketing spending, helping them plan, anticipate, and course-correct their promotional strategies in a very dynamic market. At present, Jo-Ann is focusing primarily on two levers, promotion and pricing, to improve its financial performance. This report focuses on the pricing aspect, where the Impact team is using advanced models to help Jo-Ann develop its pricing strategy. Using historical data, granular demand models are created to estimate price elasticity. Based on these models, simulators are designed to evaluate the right price for a given product in order to minimize margin leakage.
Akash Dash, Using Data Analysis to Capture Business Value from the Internet of Things (IoT) Data for a Leading Manufacturer, August 2019, (Michael Fry, Sagar Balan)
Few people pay more attention than necessary to the smart dispenser machines (dispensing tissues, bath towels, hand soap, etc.) in restrooms. Whether on vacation, staying in the best hotels, or traveling through airports on a business trip, these ‘smart restrooms’ are part of our experience. Our client collects a huge amount of streaming data from their smart restroom solutions and wants to capture business value from it. In this paper we describe how, through data analysis, we helped the client with recommendations on two key business problems. The first is how to help increase sales; the second revolves around creating more time savings for the maintenance staff (who are the end customers). Our methodology follows a consulting approach: first understand the problem from client stakeholders, then apply data cleaning, wrangling, exploration, and visualization to uncover trends and insights. The tools primarily used throughout the project were PostgreSQL and RStudio.
Ishali Tiwari, Prediction of Wine Quality by Mining Physiochemical Properties, August 2019, (Yan Yu, Ishan Gupta)
Certification of product quality is expensive and at times time-consuming, particularly if assessment by human experts is required. This project examines how data mining techniques can facilitate that process. A dataset of physicochemical properties of red wine samples is used to build data mining models that predict wine quality. Machine learning techniques, specifically binary logistic regression, classification trees, neural networks, and support vector machines, were explored, and features that perform well on this classification task were engineered. The performance of the models is evaluated and compared using prediction accuracy and AUC (area under the receiver operating characteristic curve).
Hareeshbabu Potheypalli, Labeling School Budget Data Using Machine Learning and NLP, August 2019, (Yan Yu, Yichen Qin)
The objective of the current analysis is to use machine learning methods and NLP techniques to analyze text data. The data comes from a competition hosted by ‘DrivenData.org’ and contains expense information for schools, where each observation is labelled according to department, object bought, functionality, class, user, etc. This is therefore a multiclass, multilabel classification problem. Various models used by the contestants are analyzed and reviewed: simple logistic regression, OneVsRestClassifier, random forest, CountVectorizer-based pipelines, etc. are used to classify an observation into its corresponding class for each categorical variable. The models are then tuned further to improve accuracy and log-loss. The future scope and development of the project are also discussed.
Bolun Zhou, Identify Heart Disease Using Supervised Learning, August 2019, (Yichen Qin, Charles Sox)
In machine learning, logistic regression is used to predict the probability of occurrence of an event, and that probability can be turned into a classification. Logistic regression is used extensively in the medical and social sciences as well as in marketing applications, and it operates on a binary response (dependent) variable. CART can be used for classification or regression predictive modeling and provides a foundation for important algorithms like bagged decision trees, random forest, and boosted decision trees. Random forest, in particular, is an extension of bagging that significantly improves prediction. In addition, artificial neural networks (ANNs), or connectionist systems, are computing systems inspired by the biological neural networks found in animal brains; a neural network is essentially a framework for many different machine learning algorithms to work together and process complex data inputs. In this project, we used these techniques to improve the accuracy of heart disease diagnosis. This study could be useful for predicting the presence of heart disease in a patient or finding clear indications of heart health.
Sukanto Roy, FIFA 18 - Playing Position Analysis, July 2019, (Dungang Liu, Peng Wang)
FIFA 18 offers detailed quantitative information on individual players. In modern-day football, specific positions represent a player's primary area of operation on the field, and it is extremely important to characterize a player according to their position, since each position requires a different combination of skills and physical attributes. With the rapid increase in the volume of soccer data, data science abilities have attracted the attention of coaches and data scientists alike. As a FIFA video game enthusiast and a soccer player, I took this opportunity to work on this problem using the FIFA 18 data, which originally comes from sofifa.com; a structured version was posted on the Tableau Public website. The data is unique at the player level, and each player has attribute (e.g., dribbling, aggression, vision), personal (e.g., club, wage, value), and playing position data (ratings for various positions). I took a machine learning approach: after data preparation and dimension reduction, 4 supervised learning models were built: KNN, random forest, SVM, and neural network. We grouped the 15 playing positions into 4 positions and trained the models with position as the response and attributes as the predictors. The KNN, SVM, and neural network models had accuracies of 81.81%, 82.27%, and 82.26% on the test data; only the random forest model had an accuracy below 80%, at 71.32%. Any of the former 3 models can be used by coaches to support their methods and ideas for a player's playing position.
Joe Ratterman, Predicting Future NCAA Basketball Team Success for Schedule Optimization, July 2019, (Mike Fry, Paul Bessire)
Every year, 353 NCAA Division 1 basketball teams compete for 68 bids to the NCAA Men’s College Basketball Tournament. Of those 68 tournament bids, 32 are reserved for conference tournament champions, leaving 36 at-large bids. These bids are given to the 36 teams that the selection committee deems the best of the rest. While the selection process is not set in stone, at-large teams historically have high Ratings Percentage Index (RPI) rankings. RPI was one of the primary tools used by the selection committee up until the 2018 season. Though the NCAA Evaluation Tool (NET) has replaced RPI as the primary evaluation tool, RPI still provides a quick comparison of teams that played different schedules. The calculation for RPI is as follows: RPI = (Win Percentage*0.25) + (Opponents’ Winning Percentage*0.50) + (Opponents’ Opponents’ Winning Percentage*0.25). This paper aims to develop a method to predict a win probability for each NCAA Division 1 program a year in advance. These probabilities will allow a team to simulate the outcome of all games in a given season and optimize their non-conference schedule.
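The RPI formula quoted above translates directly into code; the example inputs are hypothetical:

```python
# The RPI weighted average: 25% own win %, 50% opponents' win %,
# 25% opponents' opponents' win %.
def rpi(wp: float, owp: float, oowp: float) -> float:
    """wp: win %, owp: opponents' win %, oowp: opponents' opponents' win %."""
    return 0.25 * wp + 0.50 * owp + 0.25 * oowp

# Hypothetical team: wins 80% of its games against a slightly
# above-average schedule.
print(round(rpi(0.80, 0.55, 0.50), 3))  # → 0.6
```

The heavy 50% weight on opponents' winning percentage is what makes non-conference schedule choice such a strong lever on a team's RPI.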
Mahitha Sree Tammineedi, Analysis and Design of Balance Transfer Campaigns, July 2019, (Charles R. Sox, Jacob George)
Every year, banks make billions of dollars on credit cards, so they are always looking to acquire more debt. A balance transfer is a way of transferring credit card debt from one credit card to another belonging to a different bank; simply put, it is a way to gain debt from the competition. This project seeks to analyze the performance of past balance transfer (BT) campaigns at Fifth Third Bank and improve future campaigns by building a present value (PV) model that provides insights on which offers are the most profitable for each segment of customers, considering factors such as balance during and after the promotional period, revenue from fees, closed accounts, charged-off accounts, finance charges, and other expenses incurred. The insights uncovered in this study will be used to design future BT campaigns.
Jeffrey Griffiths, Dashboard for a Monthly Operating Report, July 2019, (Michael Fry, Chris Vogt)
Archer Daniels Midland (ADM) is an agricultural giant headquartered in Decatur, IL. In 2014 the company began a digital transformation of its business called 1ADM and moved its I.T. headquarters to Erlanger, KY. Within the I.T. office, the Data and Analytics (D&A) team works on data management and data projects for the business. Each month, the Sr. Director of Data and Analytics reports to the CIO about the progress her team has made. Currently, the visualizations used to show that progress require a lot of work from the Sr. Director and do not follow best practices in data visualization. This summer I was part of an intern team that redesigned the Sr. Director's Monthly Operating Report (MoR) with good data visualizations that communicate the progress D&A has made month-over-month (MoM), while also reducing the amount of work the Sr. Director needs to do each month. The intern team met extensively with the Sr. Director and D&A team leadership to understand the story they were trying to tell with their progress metrics.
Avinash Vashishtha, Identification of Ships: Image Classification using Xception Model Architecture, July 2019, (Dungang Liu, Yiwei Chen)
Computer vision is a booming field with numerous applications across various sectors, but the application that motivated me most to take up a project in computer vision was the Autopilot feature in Tesla vehicles. In this problem statement, a governmental Maritime and Coastguard Agency is planning to deploy a computer-vision-based automated system to identify ship type using only images taken by survey boats. We create a model to classify images into 5 categories: Cargo, Carrier, Cruise, Military, and Tanker. The data comes from a computer vision competition hosted on the ‘Analytics Vidhya’ website; the link to the problem statement is given below:
https://datahack.analyticsvidhya.com/contest/game-of-deep-learning/#problem_statement
To classify images, we used the Xception model architecture and, through transfer learning, re-purposed it to solve our problem statement. The final trained model achieved an accuracy of 96.2%, with most of the errors occurring between the Cargo and Carrier classes. Our model would help classify ships or vessels into their respective categories and would save the Maritime and Coastguard Agency crucial time in responding to emergencies.
Aabhaas Sethi, Predicting Attrition of Employees, July 2019, (Yan Yu, Yiwei Chen)
Employee attrition can be detrimental to a company's performance in the long term; I have personally observed its negative impact on one of my past employers. The objective of this project is to explore the factors related to employee attrition through data wrangling and to build a model that predicts whether an employee will leave the company. I used different statistical techniques to predict employee attrition and compared the performance of those models. Further, I explored different sampling techniques, such as oversampling, undersampling, and SMOTE, to manage the imbalance in the data set, which contains only 16% positive responses. Finally, I compared the predictive performance of the models built with the different sampling techniques against the original model with random sampling.
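As one illustration of the sampling techniques mentioned, here is simple random oversampling of the minority class (a simpler cousin of SMOTE, which interpolates synthetic minority points rather than resampling existing ones); the toy data mimics the 16% positive rate but is not the project's data:

```python
# Random oversampling: resample every class up to the size of the largest
# class so a downstream classifier sees a balanced training set.
import numpy as np

def oversample(X, y, random_state=0):
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_max, replace=True)
        for c in classes
    ])
    return X[idx], y[idx]

# Imbalanced toy data: 16% positives, as in the attrition data set.
X = np.arange(100).reshape(-1, 1)
y = np.array([1] * 16 + [0] * 84)
X_bal, y_bal = oversample(X, y)
print(np.bincount(y_bal))  # both classes now have 84 samples
```

Oversampling is applied only to the training split; evaluating on resampled data would overstate performance.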
Anjali Gunjegai, Cost Analysis of Steel Coil Production, July 2019, (Charles Sox, Sabyasachi Bandyopadhyay)
Since profitability is the backbone of any product, BRS has decided to take a step toward estimating and optimizing the orders received by analyzing the cost of producing a coil. This project analyzes the various production units in the mill, estimates the cost associated with each step, and, with the help of dashboards, gives the company a look into the major factors contributing to costs and the potential to optimize processes. Apart from deriving the cost of a coil, the project also analyzes the grades of steel produced and predicts the scrap mix consumed and the expected costs for the heats. The project provides a good roadmap toward faster accounting for the costs incurred each month and an automatic costing tool that estimates cost in near real time.
Ashwita Saxena, Can Order Win Rate be Predicted Based on Timeliness of Response to Customer Emails, July 2019, (Peng Wang, Michael Fry)
Ryerson is a metal manufacturing company based in Chicago. Their main products include aluminum, stainless steel, carbon, and alloys. Most of their customer interactions and transactions happen through email: customers request quotes and order products via email. Ryerson is currently trying to identify ways of increasing revenue and believes that an increase in the number of orders obtained through email interactions could stimulate revenue growth. One critical variable that impacts their orders is the time in which their representatives reply to customer emails. This project aims to identify the impact of email response time on order win rate, while also identifying other important factors that impact winning orders. The final models support both interpretation and prediction while maintaining model accuracy.
Asher Serota, Application of Business Analytics to Quantifying Reporting and Agent Data at American Modern Insurance Group, July 2019, (Michael Fry, Christopher Rice)
This capstone describes two data analytics projects, the SharePoint Analytics Open Rate Report and SCRUB, that I performed for the Marketing/Sales Insight & Analytics Team at American Modern Insurance Group. The main goal of the former was automation and visualization of the reporting process; the main goal of the latter was automation and visualization of the agent data. Both required a transition away from Excel-based, often manual, manipulation and entry of data. To automate the processes, I developed R code utilizing several packages, such as tidyr and dplyr, and I also used data cleaning and aggregation techniques. Additionally, I developed methods to visually represent the SharePoint Report, including in PDF and URL formats, and to streamline the SCRUB process.
Megan Eckstein, Texas Workers Compensation Analysis, July 2019, (Michael Fry, John Elder)
Great American Insurance Group (GAIG) writes a significant portion of its business in workers compensation. Because of its magnitude within the industry, looking into particular markets to target or avoid is important to help minimize losses paid on workers compensation claims. One of the subsidiaries of GAIG writes the majority of its business in Texas. To help this subsidiary reduce medical loss from claims, I analyze Texas workers compensation industry data to examine medical losses that have occurred within different markets. This data encompasses eleven years of claims. Based on this analysis, I recommend different market segments to target and avoid within the state of Texas.
Akshay Kher, Optimizing Baby Diaper Manufacturing Process, July 2019, (Mike Fry, Jean Seguro)
Currently, the defect rate for diapers manufactured by P&G is higher than desired. As a result, a large number of diapers must be disposed of, leading to substantial monetary loss. Any solution that can even marginally decrease this defect rate would be extremely useful for P&G. Hence, P&G has recently started capturing data on the diaper manufacturing process using plant sensors. Using these data, we aim to do the following: 1. Build a model that can predict whether a batch of baby diapers will be defective. 2. Understand and quantify the impact of the input variables on the output variable, i.e., the defect flag.
Santosh Kumar Biswal, Telco Customer Churn Prediction, July 2019, (Dungang Liu, Yichen Qin)
Customer churn is the loss of customers. The goal of this project is to predict customer churn for a telecommunications client. Knowing how the churn rate varies by time of the week or month, the product line can be modified according to customer response. A methodical approach has been followed: we start with data cleaning and exploratory data analysis, after which various machine learning algorithms such as logistic regression, decision trees and random forests are used to find the model with the lowest misclassification rate. Random forest was found to be the best model for predicting churn, and the most important factor contributing to churn is “Monthly Charges”.
Husain Yusuf Radiowala, Commercial Data Warehousing and MDM for an Emerging Pharmaceutical Organization, July 2019, (Michael Fry, Peter Park)
Pharmaceutical companies invest time in research and billions of dollars in launching a promising new drug, only to see unsatisfactory sales numbers. Competing products and generics arrive quickly after launch, reducing the time a drug remains on the market. A successful drug launch is therefore critical to the organization’s success, and effective marketing enables this success. Health Care Providers (HCPs) and other decision makers need to be informed of the key clinical and non-clinical benefits of a product. The situation is even more vital for emerging pharma companies that lack the pecuniary resources of “Big Pharma” to absorb an unsuccessful launch. These organizations therefore tend to focus on an effective drug-launch strategy, using externally available big data, commercial analytics, third-party healthcare data and patient information to build a blueprint before launch. Using external data poses challenges: data quality, trust in the accuracy of external sources and integration of the data into a single system are key issues that organizations face. The project aims to define and design an analytics platform with Master Data Management (MDM) [6] capabilities for an emerging pharma company (Company A) that aims to launch its product (Product B) in Q1 2020 across its health care practice area (Practice C). This platform will generate a “Golden Profile”, a single source of truth for HCPs in Practice C drawn from various commercially available data sources. It provides the basis for strategic business decisions before and after the launch of Product B.
Prakash Dittakavi, Eye Exams Prediction, July 2019, (Michael Fry, Josh Tracy)
Visionworks’ business is driven largely by comprehensive eye exams and the exam conversion percentage. Because the number of customers who visit a store varies, there are days when stores cannot perform exams for all customers because staffing is insufficient for high traffic, and days when they perform fewer exams because traffic is low and the store is overstaffed. Predicting the number of exams therefore helps optimize store staffing, leading to increased revenues or decreased costs. The objective is to predict exams for each market separately, since the business differs from market to market. Random forest and linear regression models are used to predict the number of exams for a week and to find the important features, with adjusted R-square used to compare models.
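The adjusted R-square comparison mentioned above can be sketched as follows; this is a minimal pure-Python illustration with invented weekly exam counts, not Visionworks data.

```python
def adjusted_r_squared(y_true, y_pred, n_predictors):
    """Adjusted R^2 penalizes R^2 for the number of predictors used,
    so models with more variables are not automatically favored."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    r2 = 1 - ss_res / ss_tot
    return 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)

# Hypothetical weekly exam counts vs. one model's predictions
actual    = [120, 135, 150, 110, 160, 145]
predicted = [118, 140, 148, 115, 155, 150]
print(round(adjusted_r_squared(actual, predicted, n_predictors=2), 3))  # → 0.899
```

Comparing this value across the random forest and linear regression fits picks the model that explains more variance per predictor used.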
Aniket Mandavkar, Energy Star Score Prediction, July 2019, (Yichen Qin, Edward Winkofsky)
The NYC Benchmarking Law requires owners of large buildings to annually measure their energy and water consumption in a process called benchmarking. The law standardizes this process by requiring building owners to enter their annual energy and water use in the U.S. Environmental Protection Agency's (EPA) online tool, ENERGY STAR Portfolio Manager and use the tool to submit data to the City. This data informs building owners about a building's energy and water consumption compared to similar buildings, and tracks progress year over year to help in energy efficiency planning. Energy Star Score is a percentage measure of a building's energy performance calculated from energy use. The objective of this study is to use the energy data to build a model that can predict the Energy Star Score of a building and interpret the results to find the factors which influence the score. We will use NYC Benchmarking data set that measures 60 energy-related variables for more than 11,000 buildings in New York City.
Sanjana Bhosekar, Sales Prediction for BigMart, July 2019, (Dungang Liu, Edward Winkofsky)
BigMart (name changed) is a supermarket chain that could benefit from identifying the properties of products and stores that play a key role in increasing sales. The dataset provided has 8,523 records and 11 predictor variables; the target variable is the sales of a particular product at an outlet. This is a typical supervised machine learning task. For this project, a linear model, a regression tree, a random forest, a generalized additive model and a neural network were tried and tested to predict revenues in dollars. Item_MRP turned out to be the most important variable, followed by Outlet Type. This challenge was hosted on the website “Analytics Vidhya”, and the metric chosen to assess the best model was RMSE; by that metric, the random forest model performed best.
Shashank Bekshe Ravindranath, Exploratory and Sentiment Analysis on the User Review Data, July 2019, (Yan Yu, Edward Winkofsky)
Yelp.com is a crowd-sourced local business review and social networking site. Yelp users can review products or services using a one-to-five-star rating system and describe their experience in their own words, which acts as a guide for other users considering the specific product or service. Traditionally, product feedback has depended heavily on customers’ ratings on a standardized questionnaire, but with the introduction of text-based data there is an opportunity to extract much more specific information that can be leveraged to make better business decisions. This paper uses the star rating to quantify the overall user experience and the user-written text reviews to understand it qualitatively. Sentiment analysis (also known as opinion mining or emotion AI) using text analysis and data mining techniques is performed: the data are systematically identified, extracted, quantified and studied to understand subjective information. This enables understanding of people’s emotional or subjective mindset, which is quite hard to quantify.
Buddha Maharjan, Surgical Discharge Predictive Model, July 2019, (Dungang Liu, Liwei Chen)
A predictive model of surgical discharge helps hospitals coordinate care more effectively for their sickest patients, those leaving the hospital after a stay to treat chronic illness; this is measured through the discharge rate. Without high-quality care coordination, patients can bounce back from home to the hospital and the emergency room, sometimes repeatedly, increasing hospital readmissions. Better care coordination therefore promises to reduce the readmission rate, which minimizes cost and improves patients’ lives. It also helps determine how many discharged patients are readmitted for re-surgery. The main purpose of this study is to develop a predictive model for surgical discharge. The dataset is taken from The Dartmouth Institute for Health Policy and Clinical Practice and contains 2013 state-level data for male and female Medicare patients older than 65 across the 50 states, DC and the United States as a whole. The dataset has 52 observations and 19 variables: 1 categorical and 18 numerical. Exploratory data analysis was used to examine the data, and a predictive model was developed using the stepAIC method for variable selection during model building. The important variables X1 (Abdominal Aortic Aneurysm Repair), X2 (Back Surgery), X345 (Coronary Angiography, Coronary Artery Bypass, Percutaneous Coronary), X7 (Cholecystectomy), X10 (Knee Replacement), X13 (Lower Extremity Revascularization), X14 (Transurethral Prostatectomy) and X16 (Aortic Valve Replacement) are included in the final model. The predictive model, in the form of a multiple regression, estimates the number of patients discharged after surgery and helps a hospital figure out how many surgically discharged patients are readmitted within 30 days or longer.
Don Rolfes, Stochastic Optimization for Catastrophe Reinsurance Contracts, July 2019, (Yan Yu, Drew Remington)
Reinsurance is essentially insurance for insurance companies. It serves to reduce variation in a primary insurance company’s financial statements, to transfer risk to a reinsurance company, and to maintain financial ratios that are either required by law or desired by shareholders. As with any asset a company purchases, several questions must be answered: What type should be bought? How much is enough? What is the best deal for what I’m willing to pay? Stochastic optimization is a useful tool to answer these questions. This project uses a random search algorithm paired with a Monte Carlo simulation study to find “optimal” catastrophe reinsurance structures. The results suggest a few simple calculations based on Benford’s law that can identify a reinsurance structure that will perform well on average.
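A random search paired with Monte Carlo simulation, as described above, can be sketched roughly as follows. The loss distribution, the rate-on-line pricing, and the excess-of-loss objective are all simplified assumptions for illustration, not the project’s actual model.

```python
import random

def simulate_annual_losses(n_years, rng):
    """Simplified Monte Carlo loss model: 0-4 catastrophe events per year,
    each with an exponentially distributed severity (hypothetical numbers)."""
    years = []
    for _ in range(n_years):
        n_events = rng.randint(0, 4)
        years.append(sum(rng.expovariate(1 / 20.0) for _ in range(n_events)))
    return years

def net_retained(loss, retention, limit):
    """Excess-of-loss contract: the reinsurer pays losses above the
    retention, up to the limit; the cedent keeps the rest."""
    recovered = min(max(loss - retention, 0.0), limit)
    return loss - recovered

def evaluate(retention, limit, losses, rate_on_line=0.15):
    """Objective: premium paid plus mean retained loss (lower is better)."""
    premium = rate_on_line * limit
    mean_retained = sum(net_retained(x, retention, limit) for x in losses) / len(losses)
    return premium + mean_retained

rng = random.Random(7)
losses = simulate_annual_losses(5000, rng)

# Random search over (retention, limit) pairs
best = None
for _ in range(200):
    retention = rng.uniform(0, 50)
    limit = rng.uniform(0, 100)
    cost = evaluate(retention, limit, losses)
    if best is None or cost < best[0]:
        best = (cost, retention, limit)
print(best)
```

A real study would replace the toy loss model with calibrated catastrophe distributions and add constraints such as a maximum tolerable premium.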
Rahul Agrawal, Bringing Customers Home: Customer Purchase Prediction for an Ecommerce - Propensity to Buy (Predictions), July 2019, (Yichen Qin, Liwei Chen)
An e-commerce retailer’s marketing team wants to improve revenue through customized customer marketing. For targeting and segmentation, we estimate each customer’s propensity to buy a product in the next month. By prioritizing customers based on their purchase scores, the team can reduce marketing expense and achieve a higher conversion rate and therefore better ROI. We leverage customers’ lifetime characteristics since enrollment, such as customer type, engagement, website behavior, purchases and satisfaction with the company, to predict future purchase probability and revenue and to interpret the factors that influence customer purchases. We take a supervised learning approach with two target variables: whether the customer purchases in the next 30 days, and the total revenue generated in a month from all purchases. The first is predicted with supervised binary classification and the second with supervised regression models. A gradient boosting model performed best, with an AUC of 0.82 and accuracy of 90%. Customers who visited the website recently, had more recent orders, added items to their cart and made more purchases per month overall are more likely to purchase a product in the next month; customers who answered that they will purchase six or more products in a year are also more likely to purchase in the coming month. A marketing team can leverage this model for accurate personalized marketing, effective email campaigns, a clearer view of customer types and their separating parameters, and a better customer experience.
Jeevan Sai Reddy Beedareddy, Identification of Features that Drive Customer Ratings in eCommerce Industry, July 2019, (Charles R. Sox, Vinay Mony)
On ecommerce websites, the ratings given to a product are one of the most important factors driving sales. A higher rating might increase trust in the product and motivate other customers to make a purchase. Multiple factors could influence the ratings given to a product, e.g., delivery times of previous purchases, the product description, product photos, etc. Ugam, a leading next-generation data and analytics company that works with multiple retailers, wants to design an analytical solution for identifying the drivers of customer ratings. Since Ugam works with multiple retailers, the solution must be reproducible across retailers with little manual intervention. Through this project we designed an analytical framework that takes ratings/reviews datasets as input; applies modeling techniques such as regression, decision trees, random forests and gradient boosting machines; identifies the best-performing model; and outputs the features that are important in driving ratings. Variable selection is performed for linear regression and hyperparameter tuning for the tree-based models to extract the best-performing features. The entire process is automated and requires only datasets as input from the user.
Dharahas Kandikattu, Genre Prediction (Multi-Label Classification) Based on Movie Plots, July 2019, (Yichen Qin, Edward Winkofsky)
A genre is an informal set of conventions and settings that helps in categorizing a movie. Filmmakers have begun blending traditional genres like horror and comedy, giving birth to new genres such as horror comedies, and it is becoming harder to classify a movie into a single genre; most movies these days fall under at least two genres. To find all the genres associated with a movie based on its plot, traditional multi-class classification is not very helpful. To solve this problem, I used multi-label classification. In this project, I discuss how we can predict all the genres associated with a movie just by looking at its plot, with the help of NLP and multi-label classification using algorithms such as Naïve Bayes and support vector machines.
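As a hedged sketch of the multi-label approach described above, the snippet below trains a one-vs-rest Naïve Bayes classifier over TF-IDF plot features, assuming scikit-learn is available; the toy plots and genre tags are invented, not the project’s data.

```python
# One-vs-rest multi-label classification: one binary classifier per genre,
# so a movie can receive several genre labels at once.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

plots = [
    "a haunted house terrifies a family",
    "a stand-up comedian bumbles through awkward dates",
    "a ghost pranks roommates with hilarious scares",
    "detectives chase a serial killer through the city",
]
genres = [["horror"], ["comedy"], ["horror", "comedy"], ["thriller"]]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(genres)  # one binary indicator column per genre

model = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(MultinomialNB()))
model.fit(plots, y)

pred = model.predict(["a ghost haunts a comedian"])
print(mlb.inverse_transform(pred))
```

Unlike multi-class classification, the predicted label set here may contain zero, one, or several genres for the same plot.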
Nagarjun Sathyanarayana, Portuguese Bank Telemarketing Campaign Analysis, July 2019, (Yan Yu, Peng Wang)
The data are related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit (variable y). The project will enable the bank to determine the factors that drive customers’ responses to the campaign and to establish a target customer profile for future marketing plans. The data were split into training (80%) and test (20%) groups. The following algorithms were employed: logistic regression, classification tree, random forest, neural network and Naïve Bayes.
Traditionally, statistical analysis is performed using SAS or R. In recent years, however, Python has developed into a preferred statistical analysis tool. But using Python as standalone software does not provide operational efficiency, which is crucial when handling huge datasets, especially at a bank that may have billions of records. To achieve this, I used Apache Spark, an open-source distributed general-purpose cluster-computing framework. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance, and its PySpark integration lets us run Python code on top of the Spark engine. This parallel processing enables faster handling of large datasets and implementation of more complex machine learning algorithms for more accurate predictions.
Ann Mohan Kunnath, Predicting the Designer from Fashion Runway Images with Computer Vision / Deep Learning Techniques, July 2019, (Yan Yu, Peng Wang)
From Coco Chanel to Alexander McQueen, every fashion designer has a unique signature in their outfits. The ability to identify a designer from a fashion accessory was a skill reserved for the best fashion connoisseurs; however, that power is now becoming commonplace with the advent of computer vision, and the capability is being piloted by many retailers and fashion magazines to increase online sales and brand recognition. In this project, I build a computer vision model capable of predicting the fashion designer from runway images. There are 50 classes of designers, and the evaluation metric used is categorical accuracy. Network architecture and optimization algorithms are key to the performance of any neural network, so I focused on finding the optimal combination of these two parameters for this problem. For network architectures, DenseNet and ResNet were leveraged, as they help overcome the vanishing gradient issue that occurs in deep neural networks. For optimization algorithms, Adam, stochastic gradient descent with momentum, and RMSprop were leveraged. The results of each model on the training, validation and test sets were compared; the ResNet architecture with 18 layers combined with the Adam optimizer worked best for this dataset.
Ashutosh Sharma, Term Deposit Subscription Prediction, July 2019, (Dungang Liu, Yichen Qin)
Promotion of services or products is done either through mass campaigns or through direct marketing. Mass campaigns, which focus on a large number of people, are usually inefficient and have low response rates. Direct marketing, on the contrary, focuses on a small set of people who are believed to be interested in the product, attracting a higher response rate and bringing efficiency to marketing campaigns. In this report, we use a Portuguese bank’s telemarketing data. The main idea of the project is to apply different techniques that could accurately predict the outcome of direct marketing and then compare the results. Exploratory data analysis was done to understand the data and identify any relationships within it. We then compared machine learning algorithms such as logistic regression, decision trees and random forests to find the one that most accurately predicts the outcome. Random forest gave the most accurate predictions of whether a customer subscribed to the term deposit.
Varsha Agarwalla, Measuring Adherence and Persistency of Patients towards a Drug Based on their Journey and Performing Survival Analysis, July 2019, (Michael Fry, Rohan Amin)
Client ABC is a large pharmaceutical company and a client of KMK Consulting Inc. ABC has a diverse set of drugs across various disease areas. Drug XYZ is a lifetime medicine prescribed in cases of chronic heart failure, priced at around $4,000 annually. There are multiple reasons why patients do not take medication on a timely basis; hence, non-adherence to prescription medications has received increased attention as a public health problem. The development of adherence-related quality measures is intended to enable quality improvement programs that align patient, provider and payer incentives toward optimal use of specific prescribed therapies. The project described in this report calculates these measures and performs survival analysis based on them. The client uses these metrics to track how the drug is performing in the market and to identify potential patients who are consistent and later drop off, and can plan next steps accordingly.
Keerthi Gopalakrishnan, Sentiment Analysis of Twitter Data: Food Delivery Service Comparison, July 2019, (Yan Yu, Peng Wang)
Natural Language Processing (NLP) is a hotbed of research in data science these days, and one of its most common applications is sentiment analysis. Thousands of text documents can be processed for sentiment (and other features, including named entities, topics, themes, etc.) in seconds, compared to the hours it would take a team of people to complete the same task manually. Sentiment analysis is widely applied to voice-of-the-customer materials such as reviews and survey responses. In this project, the sentiment of tweets is identified with respect to food delivery services such as Grubhub, DoorDash and Zomato. Food delivery is one of the most customer-review-dependent industries: a collection of good reviews can change a company’s standing in the league. Three approaches were chosen for this project: first, summing positive and negative scores by comparison with predefined positive and negative word lists; second, Naïve Bayes using the sentiment package in R; third, the lexicon approach of the ‘syuzhet’ package in R. Over the course of this project, we analyze the most frequent words in each food delivery dataset, a three-level emotion comparison (positive, neutral, negative), a six-level emotion comparison (joy, sadness, trust, fear, disgust, surprise) and a word cloud. All three algorithms point to one major result: Grubhub has received more positive responses on Twitter than DoorDash and Zomato.
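The first approach above, summing matches against positive and negative word lists, can be sketched in a few lines. The word lists and tweets below are invented placeholders, not the R opinion lexicons used in the project.

```python
# Lexicon-based sentiment: score = (# positive words) - (# negative words)
POSITIVE = {"great", "fast", "delicious", "love", "fresh"}
NEGATIVE = {"late", "cold", "terrible", "hate", "soggy"}

def sentiment_score(tweet):
    words = tweet.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

tweets = [
    "love the fast delivery and fresh food",
    "order was late and the fries were cold",
]
print([sentiment_score(t) for t in tweets])  # → [3, -2]
```

A production lexicon would also handle negation ("not great") and word stems, which is part of what packages like ‘syuzhet’ provide.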
Tingyu Zhao, House Price Analysis: Ames Housing Data, July 2019, (Dungang Liu, Chuck Sox)
The real estate industry has grown rapidly in recent years. It is fascinating to find out which factors impact the price of a house most and whether a model can accurately predict house sale prices. Consequently, I address the following questions in this project: 1. Find the most important variables in predicting house price. 2. Build statistical models to predict house price and try to decrease model MSE. To address these questions, I chose the Ames Housing Data, which contains sales records of individual residential properties in Ames, Iowa from 2006 to 2010; there are 80 variables and 2,919 observations in the data set. I cleaned the data set, imputed missing values, conducted exploratory analysis and built two nonlinear regression models, random forest and gradient boosting, to predict housing price. Both models found the variable “GrLivArea” (above-ground living area in square feet) to be the most important predictor of house price. The random forest model had the lowest out-of-sample MSE (19.22) and the smallest difference between in-sample and out-of-sample MSE. Both models predicted house price well in terms of MSE, suggesting that sale price trends in the real estate industry are systematic and predictable.
Chase Williams, Examining the Relationship between Internet Speed Tests, Helpdesk Calls and Technician Dispatches, July 2019, (Charles Sox, Joe Fahey)
Customer experience is critical to the success of any company, but especially those that provide an intangible product or a service to their customers. Providers of high-speed internet face the challenge of providing the internet speeds purchased by customers regardless of the hardware and wireless speed limitations in place by the customers’ devices. Understanding the highest and lowest performing operating systems and browsers can help providers to maximize the customers’ experience. In addition, by examining the internet speed test and helpdesk call data, providers can gain the ability to predict a technician dispatch and possibly solve the issue prior to the customer request. Improving the customer experience by solving technical issues prior to the customer request could reduce churn and improve profitability.
Zarak Shah, Bank Loan Predictions, July 2019, (Yichen Qin, Edward Winkofsky)
This data set was posted on Kaggle as a competition. Kaggle provided two data sets: a training set with 100,514 observations and 16 variables, and a test set with 10,353 observations and 15 variables. The task is to predict the Loan Status column; we use only the training data set here, since the dependent variable is not included in the test data set. For this capstone we used logistic regression and classification trees, and analyzed our results using ROC curves, AUC, a cost function and logistic regression measures.
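The AUC reported by such models has a simple rank-based interpretation: the probability that a randomly chosen positive case is scored above a randomly chosen negative one. A minimal sketch, with hypothetical loan outcomes and predicted scores:

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs where the positive
    case receives the higher score; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical true defaults (1) and model scores
labels = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.2, 0.7, 0.4, 0.3, 0.1]
print(round(auc(labels, scores), 3))  # → 0.889
```

This pairwise definition is equivalent to the area under the ROC curve, which is why AUC is threshold-free while a cost function depends on the chosen cutoff.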
Pruthvi Ranjan Reddy Pati, Time Series Forecasting of Sales with Multiple Seasonal Periods, July 2019, (Dungang Liu, Liwei Chen)
Companies need to understand fluctuations in demand to keep the right amount of inventory on hand. Underestimating demand can lead to lost sales due to a lack of supply; overestimating demand results in surplus inventory and high carrying costs. Anticipating demand makes a company competitive and resilient to market conditions, and appropriate forecasting models enable us to predict future demand aptly. This paper models a time series of daily store sales of one item across five years of sales history, from 2013 to 2018. The data are split with the first four years as a training set and the last year as a test set to evaluate the performance of time series forecasting techniques: ARIMA, SARIMA and TBATS. The data exhibit multiple seasonality, with weekly and annual periods; this complexity clearly shows the limitations of ARIMA and SARIMA. TBATS performed best, with a training mean absolute error of 0.124721 and a test mean absolute error of 7.236229, and was able to model both the weekly and annual seasonality along with the trend.
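TBATS itself requires a dedicated forecasting library, but the value of capturing seasonality can be illustrated with a seasonal-naive benchmark that forecasts each day from the same weekday one week earlier; the daily sales numbers below are invented.

```python
# Seasonal-naive baseline for daily sales with weekly seasonality:
# repeat the last observed week forward. This is a benchmark to beat,
# not the TBATS model used in the project.
def seasonal_naive_forecast(history, horizon, period=7):
    """Repeat the last full seasonal cycle forward for `horizon` steps."""
    last_cycle = history[-period:]
    return [last_cycle[i % period] for i in range(horizon)]

def mean_absolute_error(actual, forecast):
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

# Hypothetical two weeks of training sales (weekend peaks), one held-out week
train = [10, 12, 11, 13, 15, 30, 28, 11, 12, 12, 14, 16, 31, 27]
test  = [10, 13, 11, 14, 15, 29, 28]

forecast = seasonal_naive_forecast(train, horizon=7)
print(forecast)                                  # → [11, 12, 12, 14, 16, 31, 27]
print(mean_absolute_error(test, forecast))       # → 1.0
```

A model like TBATS earns its complexity only if it beats simple seasonal baselines like this one on the held-out year.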
Vaidiyanathan Lalgudi Venkatesan, Marketing Campaign for Financial Institution, July 2019, (Dungang Liu, Liwei Chen)
Marketing campaigns are characterized by a focus on customer needs and overall satisfaction. Several variables determine whether a campaign will be successful: product, price, promotion and place. The data are related to direct marketing campaigns of a Portuguese banking institution, conducted by phone; often, more than one contact with the same client was required to confirm whether the product (a bank term deposit) would be subscribed (‘yes’) or not (‘no’). The goal is to predict whether the client will subscribe to a term deposit, in order to identify strategies to improve the effectiveness of the bank’s future marketing campaigns. To answer this, we analyze the bank’s last marketing campaign and identify patterns that help us draw conclusions for developing future strategies. The dataset consists of 11,162 rows and 17 variables, including the dependent variable, ‘Deposit’. Statistical techniques such as logistic regression, classification trees and random forests were used to classify customers. Random forest performed best, returning the lowest misclassification rate and highest AUC. The most important variables, in order of importance, are month, balance and age; the propensity of marketing conversion thus depends on the month of contact, the individual’s balance and the customer’s age.
Shriya Sunil Kabade, Customer Churn Analysis, July 2019, (Dungang Liu, Liwei Chen)
Customer loyalty is important for every business: loyal customers help a company grow by engaging more and improving brand image. Due to intense competition in the telecommunications industry, retaining customers is of utmost importance. Churn occurs when a customer ceases to use the products or services offered by a company, and insights into customer behavior can help a company recognize early indicators of churn and avoid future churn. The goal of this project is to identify the key factors that make a customer churn and to predict whether a customer will churn. The data are taken from IBM sample datasets for a telecom company, ‘Telco’, with 7 thousand records and 21 features. Customers who churned within the last month are flagged, and the features include customer account information, demographic information and customer behavior in the form of the services the customer has signed up for. Various binary classification models, including logistic regression, random forest and XGBoost, were built and compared based on classifier performance and the ability to correctly classify churned customers. The final XGBoost model classifies 88.6% of the churned customers correctly, failing to capture only 58 instances of churn. The telecom company can use this model to target customers with a potential to churn and retain them.
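The ability to correctly classify churned customers, the 88.6% figure above, is the model’s recall on the churn class. A minimal sketch with hypothetical labels:

```python
def recall(y_true, y_pred, positive=1):
    """Recall = true positives / (true positives + false negatives):
    the fraction of actual churners the model catches."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn)

# Hypothetical labels: 1 = churned, 0 = retained
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0]
print(recall(y_true, y_pred))  # → 0.75
```

Recall is the right headline metric here because missing a churner (a false negative) costs the company a customer, while a false alarm only costs a retention offer.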
Gopalakrishnan Kalarikovilagam Subramanian, Analysis of Over-Sampling and Under-Sampling Techniques for an Unbalanced Data Set, July 2019, (Dungang Liu, Yan Yu)
Fraudulent transactions after card hacking are becoming a major concern for the credit card industry and a major deterrent to customer usage of credit cards. The goal of this project is to build a model for detecting fraudulent transactions using previously documented fraud data. The data set contains 30 variables. Two of them are the ‘Amount’ of the transaction and the ‘Time’ of the transaction (the time elapsed since the first transaction); the other 28, named V1-V28, are the result of a Principal Component Analysis (PCA) transformation. The output variable, ‘Class’, is binary, making this a classification problem. The following methods are used: logistic regression, decision trees, random forest, gradient boosting machines (GBM) and support vector machines (SVM). Over-sampling and under-sampling schemes are used to overcome the imbalance in the data set, with precision and recall values as indicators of a scheme’s efficacy. Over-sampling leads to a good mixture of precision and recall; under-sampling results in very good recall but very poor precision. Among the modeling algorithms, random forest performs best, giving very good precision and recall values. V14, V10, V12, V17 and V4 are the top five variables for determining whether a transaction is fraudulent.
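The over- and under-sampling schemes compared above can be sketched as simple random duplication and random dropping of rows; the toy data below stand in for the credit card data.

```python
import random

def oversample(data, labels, minority=1, rng=None):
    """Duplicate minority-class rows until the classes are balanced."""
    rng = rng or random.Random(0)
    majority_rows = [(d, y) for d, y in zip(data, labels) if y != minority]
    minority_rows = [(d, y) for d, y in zip(data, labels) if y == minority]
    extra = [rng.choice(minority_rows)
             for _ in range(len(majority_rows) - len(minority_rows))]
    combined = majority_rows + minority_rows + extra
    rng.shuffle(combined)
    return [d for d, _ in combined], [y for _, y in combined]

def undersample(data, labels, minority=1, rng=None):
    """Drop majority-class rows until the classes are balanced."""
    rng = rng or random.Random(0)
    majority_rows = [(d, y) for d, y in zip(data, labels) if y != minority]
    minority_rows = [(d, y) for d, y in zip(data, labels) if y == minority]
    kept = rng.sample(majority_rows, len(minority_rows))
    combined = kept + minority_rows
    rng.shuffle(combined)
    return [d for d, _ in combined], [y for _, y in combined]

X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2                 # 8 legitimate, 2 fraudulent
_, y_over = oversample(X, y)
_, y_under = undersample(X, y)
print(sum(y_over), len(y_over))       # → 8 16  (balanced by duplicating fraud rows)
print(sum(y_under), len(y_under))     # → 2 4   (balanced by dropping legitimate rows)
```

The trade-off reported above follows from these mechanics: under-sampling discards most of the majority class, so the model flags fraud aggressively (high recall, poor precision), while over-sampling preserves all majority-class information.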
Varsha Shiva Kumar, Recommendation System for Instacart, July 2019, (Dungang Liu, Liwei Chen)
Instacart is an online application through which customers can order groceries and have them delivered the same day from nearby grocery stores. In an era of big data technologies, recommendation engines play a crucial role in increasing the number of purchases per customer. With that objective, Instacart wants to build a robust and accurate recommendation system that recommends products for customers to reorder.
The goal of this project is to build a recommendation system that identifies the products most likely to be reordered by customers in a given order. The available data is transactional order data over time, containing over 3 million orders from more than 200,000 Instacart users. The set of relational datasets holds order-related details such as the order number, days since the user's last order, the products bought in the order, etc. Through feature engineering, new variables such as total orders, total reorders, and average basket size of each user were created to better understand the relationship between users and the products they order. Using Logistic Regression, Classification Tree, Adaptive Boosting and XGBoost, the reorder flag was predicted for each product in an order. XGBoost was chosen as the final model because of its high recall value. A user's order rate for a product and the number of orders between purchases of a product were identified as the two most significant variables impacting the likelihood of a product purchase.
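The most significant engineered feature named above, a user's order rate for a product, is a simple aggregation over the transaction history. The sketch below illustrates the idea with an invented `(user_id, order_id, products)` tuple layout, not the project's actual relational schema:

```python
from collections import defaultdict

def product_order_rate(orders):
    """For each (user, product) pair, the share of that user's orders
    that contain the product."""
    user_orders = defaultdict(set)
    pair_counts = defaultdict(int)
    for user, order, products in orders:
        user_orders[user].add(order)
        for p in set(products):       # de-duplicate within one order
            pair_counts[(user, p)] += 1
    return {pair: n / len(user_orders[pair[0]])
            for pair, n in pair_counts.items()}

orders = [
    ("u1", 1, ["milk", "eggs"]),
    ("u1", 2, ["milk"]),
    ("u1", 3, ["bread"]),
]
rates = product_order_rate(orders)
print(round(rates[("u1", "milk")], 3))  # milk in 2 of 3 orders -> 0.667
```

Features of this kind become columns in the training table on which the boosted models predict the reorder flag.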
Shruti Arvind Jibhakate, Analysis and Prediction of Customer Behavior in Direct Marketing Campaigns, July 2019, (Dungang Liu, Yan Yu)
Marketing is aimed at selling products to existing customers and acquiring new ones. Among the various types of marketing, direct marketing is one of the most effective: it allows targeting of individual customers but is more expensive than other methods. The goal of this analysis is to predict whether a customer will subscribe to a term deposit, based on data collected from telephonic direct marketing campaigns conducted by a Portuguese banking institution. This will help enhance customer acquisition, support cross-selling of products, improve targeting and thereby increase the return on investment for marketing campaigns. The variables record customers' profile information, marketing campaign information, and social and economic context attributes between May 2008 and November 2010. Exploratory data analysis and various statistical modeling approaches are undertaken to better understand the data and develop a robust prediction algorithm. This analysis uses Logistic Regression, Decision Tree, and Generalized Additive Model. Since this information is used for decision making, methods which differentiate based on controllable parameters are preferred. Based on this criterion, the Logistic Regression model is the most interpretable and has the highest predictive power. The model classifies subscribers from non-subscribers based on client information (default status, contact type); campaign information (month, previous campaign outcome, number of contacts performed in the current campaign and number of days since the customer was last contacted); and socio-economic factors (number of employees and consumer confidence index).
Ruby Yadav, Bike Rental Demand Prediction Using Linear Regression & Advanced Tree Models, July 2019, (Dungang Liu, Liwei Chen)
Bike rental is a popular business in the USA, serving students, tourists, and commuters, and benefiting from traffic concerns and health promotion. The automated rental process makes it very convenient for consumers to rent a bike on an as-needed basis. In this project we predict bike rental demand for a bike rental program in Washington, DC, examining the impact of factors such as season, weather, and weekend/weekday. Since the response variable is continuous, a linear regression model is used to predict demand. Advanced machine learning techniques such as regression trees, random forest, and bagging are also used; from these, the best model for rental demand prediction was chosen and the significant factors that actually impact the number of bikes rented were determined. Random forest was chosen as the best method to predict bike demand, as its mean squared error is the lowest.
Charles Brendan Cooper, What Matters Most? Analyzing NCAA Men’s Basketball Tournament Games, July 2019, (Dungang Liu, Yan Yu)
Every year since 1939, the National Collegiate Athletic Association (NCAA) has hosted its end-of-season single-elimination tournament for Division 1 Men’s Basketball, commonly known as March Madness, to determine the national champion. In its current form, the tournament consists of 64 teams from conferences across the nation divided into 4 regions. In recent history, spectators of the sport have become more and more interested in predicting how the tournament will play out: who will reach the Final Four, what upsets will occur (a lower-seeded team beating a higher one), and who will be the eventual national champion? These questions and more are debated by journalists, sports newscasters, and fans, so much so that the practice has its own name: Bracketology. Using the available data, the goal is to determine which variables are most impactful for predicting the outcome of games within the tournament. This offers insight into how teams might approach an upcoming game against an opponent based on their attributes. Once the data set has been built from a group of tables provided by a Kaggle competition sponsored by Google (resulting in 99 variables and 1,962 observations), a stepwise variable selection process is applied and a final model built from the critical variables. This results in 17 core variables out of the 99 available and an out-of-sample AUC of 0.787.
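The out-of-sample AUC reported above has a useful rank-based interpretation: it is the probability that a randomly chosen won game receives a higher model score than a randomly chosen lost one. A minimal computation (the labels and scores below are made up for illustration):

```python
def auc(labels, scores):
    """Rank-based AUC: probability a random positive outscores a random
    negative, counting ties as half a win."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.2]))  # 3 of 4 pairs -> 0.75
```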
Beth Hilbert, Promotions: Impact of Mailer and Display Location on Kroger Purchases, July 2019, (Charles Sox, Yan Yu)
This project examined which promotions are most effective for pastas, sauces, pancake mixes, and syrups in Kroger stores. A dataset provided by 84.51 was used to analyze how weekly mailers and in-store displays correlate with sales (number of baskets) for each product. Random Forest, LASSO Regression, Hierarchical Clustering, and Association Rules were used to answer this question.
The first part of the analysis used Random Forest to determine which promotion types had the greatest influence on the number of baskets. Given these data, in-store end caps and interior-page mailer placements had the most influence on purchases/baskets. The second part of the analysis used LASSO Regression and Hierarchical Clustering to identify similar products through product segmentation. Similar products were clustered and then analyzed to determine what differentiated the groups. This analysis showed that sauces appear four times as often in the cluster most responsive to promotions; given these results, promoting sauces on end caps is recommended. The final part of the analysis used Association Rules to evaluate purchase pairings using market basket analysis, focusing on the Private Label brand because it was available for both pastas and sauces. Given these data, promoting Private Label sauce on the back page of the mailer, instead of the interior page, is recommended to increase the chance of pairing with Private Label spaghetti.
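The market basket metrics behind the Association Rules step (support, confidence and lift for a rule such as "spaghetti implies sauce") can be computed directly. The baskets below are invented for illustration, not drawn from the 84.51 data:

```python
def rule_metrics(baskets, antecedent, consequent):
    """Support, confidence and lift for the rule antecedent -> consequent,
    where each basket is a set of items."""
    n = len(baskets)
    a = sum(1 for b in baskets if antecedent.issubset(b))
    c = sum(1 for b in baskets if consequent.issubset(b))
    both = sum(1 for b in baskets if (antecedent | consequent).issubset(b))
    support = both / n
    confidence = both / a if a else 0.0
    lift = confidence / (c / n) if c else 0.0
    return support, confidence, lift

baskets = [
    {"PL spaghetti", "PL sauce"},
    {"PL spaghetti", "PL sauce", "syrup"},
    {"PL spaghetti"},
    {"pancake mix"},
]
s, conf, lift = rule_metrics(baskets, {"PL spaghetti"}, {"PL sauce"})
print(round(s, 2), round(conf, 2), round(lift, 2))  # 0.5 0.67 1.33
```

A lift above 1 indicates the two products co-occur more often than chance, which is the signal used to recommend promotion pairings.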
Rahul Muttathu Sasikumar, Credit Risk Modelling – Lending Club Loan Data, July 2019, (Yan Yu, Nanhua Zhang)
Credit risk refers to the chance that a borrower will be unable to make their payments on time and will default on their debt; that is, the risk that a lender may not receive the interest due or the principal lent on time. It is extremely difficult and complex to pinpoint exactly how likely a person is to default on their loan. At the same time, properly assessing credit risk can reduce the likelihood of losses from default and delayed repayment. Credit risk modelling is the best way for lenders to understand how likely a loan is to be repaid; in other words, it is a tool to understand the credit risk of a borrower. This is especially important because a borrower's credit risk profile keeps changing with time and circumstances. As technology has progressed, new ways of modeling credit risk have emerged, including credit risk modelling in R and Python using the latest analytics and big data tools. In this project, I use data from Lending Club, a US peer-to-peer lending company headquartered in San Francisco, California. The past loan data is used to train a machine learning model that identifies whether a loan applicant is at risk of defaulting.
Jyotsana Shrivastava, Factors Influencing Length of Service of Employees, July 2019, (Yan Yu, Dung Tran)
Macy’s has 587 stores in 42 US states with 225,656 employees as of December 2017. These stores are divided into 5 geographical regions for retail sales. Prior analysis with Macy’s People Analytics team observed that average sales vs. plan for all Macy’s stores has a negative relation with the length of service of employees. In this paper, length of service of employees, referred to as ‘LOS’, is analyzed against the employee information available through the HR information base. The analysis is conducted for Macy’s stores overall and at the region level; deeper analysis is also conducted on furniture and non-furniture stores and on region-wise variation where applicable. The analysis applies data mining fundamentals using R throughout the project. This paper would aid Macy’s in incorporating HR-related changes as required at a region or store level. Key findings are that LOS has a positive relation with the average standard working hours of employees in the store and a negative relation with the number of full-time employees in the store. The North-East region, which includes Macy’s flagship Herald Square store, has the highest LOS among all stores, and this region has a positive relation with LOS. The key results are summarized in the paper and detailed results are available upon request.
Xi Ru, Predicting High-Rating Apps, July 2019, (Yan Yu, Yichen Qin)
With so many applications developed each day, it is difficult for software developers to determine which applications will become popular after release and receive high ratings from the public. This project aims to predict high-rating applications on the Google Play Store so that app developers can invest their time and resources properly to gain profit. Both regression and classification models are built to find the “best” model by comparing prediction accuracy. The criteria used to assess the regression models are AIC, BIC, adjusted R-squared, and MSE; the classification models are assessed by both in-sample and out-of-sample AUC and misclassification rate (MR). To eliminate the impact of differing categories, the same data mining techniques were also applied to a single category, the largest category “Family” in the original dataset. The model with the highest predictive accuracy across multiple categories is Random Forest, while the predictive models for the single category “Family” do not show significant performance differences.
Bharath Vattikuti, IBM HR Analytics: An Analysis on Employee Attrition & Performance, July 2019, (Yan Yu, Liwei Chen)
Attrition is a problem that impacts all businesses, irrespective of geography, industry and company size. Employee attrition leads to significant business costs, including hiring expenses and training of new employees, along with lost sales and productivity. Hence, there is great business interest in understanding the drivers of, and minimizing, employee attrition. If the reasons behind attrition are identified, the company can create a better working environment for its employees, and if attrition can be predicted, the company can take action to stop valuable employees from leaving. This report explores the HR dataset published by IBM Watson Analytics and examines the relationships between the response variable (whether an employee left the company or not) and the explanatory variables describing each employee. Multiple predictive statistical models are built to predict the possibility of an employee leaving the firm, and key factors are studied by plotting variable importance. Prediction accuracy, sensitivity and AUC are used to evaluate model performance. Of the different models built, the Support Vector Machine (SVM) was picked due to its higher F1 score and comparable accuracy.
Rashmi Prathigadapa, Movie Recommender Systems, July 2019, (Yan Yu, Edward Winkofsky)
Movie recommender systems have gained popularity and importance in social life due to their ability to provide enhanced entertainment. Such a system employs a statistical algorithm that predicts users' ratings for a particular movie based on the similarity between movies, or the similarity between users who previously rated those movies. This enables the system to suggest movies to its users based on their interests and/or a movie's popularity. Many recommender systems exist, but most cannot efficiently recommend movies either to existing users or to new users. In this project, we focus not only on recommenders for existing users based on their tastes and shared interests, but also on a recommender that can make suggestions to new users based on popularity. The dataset used has 45,000 movies with information about cast, crew, ratings, keywords, etc. We have built 4 recommender systems: Simple Recommender, Content Based Recommender, Collaborative Filtering, and a Hybrid Engine.
Paul Boys, Statistical Inference for Predicting Parkinson’s Disease Using Audio Variables on an At-Risk Population, July 2019, (Yan Yu, Edward Winkofsky)
Parkinson’s disease is a degenerative neurological disorder characterized by progressive loss of motor control. Approximately 90% of people diagnosed with Parkinson’s disease (PD) have speech impairments. Development of an audio screening tool can aid in early detection and treatment of PD. This paper re-examines research data of audio speech variables from recordings of three groups: 1) healthy controls, 2) patients newly diagnosed with PD and 3) an at-risk group. The focus of this paper is on accurately predicting Parkinsonism in the at-risk group. The original research reported 70% accuracy using quadratic discriminant analysis (QDA). This paper examines QDA, linear discriminant analysis (LDA), support vector machines (SVM) and Random Forest, using the least absolute shrinkage and selection operator (LASSO) for feature selection. LASSO selected two variables; utilizing these, Random Forest had the best out-of-sample accuracy at 72%, and SVM and Random Forest achieved sensitivities superior to QDA and LDA. Using model accuracy on the control and PD groups as the selection criterion, the SVM with a Bessel kernel was chosen as the optimal model. This SVM model was 64% accurate when validated on the at-risk group. Human speech screening of the at-risk group correctly identified speech impairments in 2 of the 23 Parkinson’s-positive patients; this SVM model improved on the human performance by correctly identifying speech impairment in 8 of the 23.
Harish Nandhaa Morekonda Rajkumar, Improving Predictions of Fraudulent Transactions Using SMOTE, July 2019, (Yan Yu, Edward Winkofsky)
This project aims to predict whether or not a credit card transaction is fraudulent. On the highly imbalanced dataset, logistic regression and random forest models are applied to understand how well the true positives are captured. Two sampling techniques - Random Oversampling and SMOTE are explored in this project. SMOTE stands for Synthetic Minority Oversampling Technique, a process in which synthetic instances are generated from the minority feature space to offset the imbalance. The models are applied again on the resampled data, and the area under the ROC and PR curves are observed to increase sharply. With SMOTE data, it is also observed that there is a sharp drop in the false positives, reducing by up to 38% and possibly leading to hundreds of thousands of dollars in cost savings. The threshold range is also found to increase, allowing more room for the model to be flexible.
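SMOTE's core idea, generating a synthetic point by interpolating between a minority-class observation and one of its nearest minority neighbours, can be sketched in a few lines. This is a simplified toy version of the technique, not the implementation used in the project, and the 2-D points are invented:

```python
import random

def smote(minority, k=2, n_synthetic=4, seed=0):
    """Generate synthetic minority points by interpolating a random
    fraction of the way toward one of the k nearest neighbours."""
    rng = random.Random(seed)

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: dist2(base, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # how far along the segment to place the point
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, nb)))
    return synthetic

minority = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0)]
new_points = smote(minority)
print(len(new_points))  # 4 synthetic fraud-like points
```

Because each synthetic point lies on a segment between two real minority points, it stays inside the minority feature space, which is what distinguishes SMOTE from simple random oversampling by duplication.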
Shagun Narwal, Analysis of e-Commerce Customer Reviews and Predicting Product Recommendation, July 2019, (Dungang Liu, Yichen Qin)
The dataset used for the project belongs to the e-commerce industry, specifically a women’s clothing website. In the last 2 decades, the e-commerce industry has consistently leveraged data to improve sales, advertisement and customer experience. Customers on a women’s e-commerce website have provided reviews and also voted whether they will recommend the product or not. This data was analyzed to generate insights in the e-commerce product reviews and recommendations space. Also, sentiment analysis was performed and used to understand the association between review words and customer sentiments. Further, the reviews were used to develop a model which can predict whether the person will recommend the product or not.
Lissa Amin, Psychographic Segmentation of Consumers in the Banking Industry, July 2019, (Michael Fry, Allison Carissimi)
The competitive landscape of the banking industry has forced traditional retail banks to shift their focus towards becoming more consumer-centric organizations and to maintain levels of service and convenience that compete with the experiences consumers have both within and outside the industry. To be a truly customer-centric organization, there must first be a true understanding of who the consumer is, including their needs, attitudes, preferences and behaviors. Market segmentation, which aims to identify the unique groups of consumers that exist within the market based on shared characteristics, is critical to understanding the consumer. Bank XYZ is one example of a traditional retail bank that has adopted more customer-centric values and is working to redefine the way it builds products, services and marketing campaigns in order to drive value for both the customer and the bank. This analysis focuses on a psychographic segmentation of the consumers within Bank XYZ’s geographic footprint and identifies the unique groups that exist based on their attitudes, needs, behaviors and beliefs.
Harsharaj Narsale, Owners’ Repeat Booking Pattern Study and Forecasting, July 2019, (Charles Sox, Andrew Spurling)
NetJets Inc., a subsidiary of Berkshire Hathaway, is an American company that sells part ownership, or shares (called fractional ownership), of private business jets. Accurate demand forecasting is essential for operational planning, and fleet management plays an important role in operations. Highly variable demand must be fulfilled on time each day, without declining any flight request, to maintain the company's high reputation. The company has its own fleet and can also subcontract aircraft from other companies on a temporary basis. Subcontracts are costly and should be avoided if possible; they can be avoided through detailed planning based on accurate forecasts. Flight demand is currently forecasted using time series models built in SAS Enterprise on the last 5 years of data, with an accuracy of ~96% for the total number of flights booked. As most owners are business or sports personnel, flights booked for annual business meetings or sporting events are expected to show certain patterns, so owners' repeat booking behavior can help fine-tune demand forecasts. In this project, booking patterns have been analyzed across the 365/366 days of the year. An autoregressive time series model is built with a MAPE of 10.75% for the repeat percentage. Temporal aspects of advance flight booking across new and repeat bookers have been analyzed to improve the demand forecast for a flight day well in advance.
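As an illustration of the autoregressive model and the MAPE metric mentioned above, a minimal AR(1) fit by ordinary least squares on an invented demand series might look like the following (the numbers are hypothetical, not NetJets data):

```python
def fit_ar1(series):
    """Least-squares fit of y_t = a + b * y_{t-1} (a minimal AR(1) sketch)."""
    x = series[:-1]
    y = series[1:]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx
    return a, b

def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    return 100 * sum(abs((av - f) / av)
                     for av, f in zip(actual, forecast)) / len(actual)

history = [100, 104, 103, 107, 106, 110, 109, 113]  # invented bookings
a, b = fit_ar1(history)
preds = [a + b * y for y in history[:-1]]
print(round(mape(history[1:], preds), 2))
```

The project's reported 10.75% MAPE for the repeat percentage is this same error measure applied to its (richer) autoregressive model.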
Neha Nagpal, Predicting Dengue Disease Spread, July 2019, (Dungang Liu, Chuck Sox)
Dengue fever is a mosquito-borne disease that occurs in tropical and sub-tropical parts of the world. Because it is carried by mosquitoes, the transmission dynamics of dengue are related to climate variables such as temperature and precipitation. In this project, various climatic factors are analyzed to generate a model for predicting the number of future dengue cases, which in turn would help public health workers and people around the world take steps to reduce the impact of these epidemics. The data is provided by the U.S. Centers for Disease Control and Prevention for two cities: San Juan, Puerto Rico and Iquitos, Peru. Various statistical methods have been used to find the model best suited to the data. The Support Vector Machine provided the lowest Mean Absolute Error, which makes it a suitable model for predicting the number of dengue cases in the future. From the analysis, we also found that high humidity, high temperature and certain seasons result in a greater number of dengue cases.
Swagatam Das, IMDB Movie Score Prediction: An Analytical Approach, July 2019, (Dungang Liu, Liwei Chen)
Films have always been an integral part of the world of entertainment. They can be used as a medium to convey important messages to the audience and as a creative medium to portray fictional worlds. A filmmaker’s aspiration is not only to achieve commercial success but also to gain critical acclaim and content appreciation from the audience. The most commonly used metric for filmmakers, audiences, critics, etc. is the IMDB score: a score out of 10 that marks the overall success or failure of a film. In this project, I have studied the factors that affect the final IMDB score, from the popularity of actors/directors to commercial aspects such as budgets and gross earnings, with the intent of helping future filmmakers make educated decisions while creating films. The data has been sourced from Kaggle and was originally pulled from the IMDB website. I have used data visualizations and machine learning algorithms to make predictions of the response, i.e., the IMDB score. I have concluded that the total number of users who voted for the movie, the duration of the movie, and the movie's budget and gross earnings are important factors in determining the IMDB score, and I would recommend that future filmmakers look into these factors before producing/directing films. Different models were used to assess the prediction accuracy of the IMDB score: the Random Forest algorithm had the best accuracy rate of 78.42%, compared to 75.65%, 78% and 77.5% for Multinomial Logistic Regression, Decision Tree and Gradient Boosting respectively.
Akshay Singhal, Vehicle Loan Default Prediction, July 2019, (Dungang Liu, Yichen Qin)
A major share of banks' income comes from the interest earned on loans, but loans can be risky too: banks must weigh the risk-to-reward ratio for any kind of loan, which is where credit scoring comes in. Loan defaults cause huge losses for banks, so they pay close attention to this issue and apply various methods to detect and predict default behavior among their customers. In this report, we attempt to predict the risk of a loan defaulting based on past data. The data, obtained from Kaggle, contains information about customers from the Indian subcontinent. The main idea of the project is to find the factors responsible for loan default. Exploratory data analysis was done to understand the data and study the relationships between variables, and then the performance and accuracy of various machine learning algorithms (logistic regression, random forest and gradient boosting) were compared to find which technique works best in this scenario. It was found that loan default is highly influenced by the loan amount and the customer's credit history, and random forest gave the best results for predicting default.
Mrinal Eluvathingal, A Machine Learning Walkthrough, July 2019, (Peng Wang, Ed Winkofsky)
The main goal of this project is to provide a machine learning walkthrough of a dataset and, through the process of data munging, exploration, imputation, engineering and modeling, show that the preprocessing and feature engineering stages are the most important and form the foundation upon which a model can be made more powerful. Using the Ames Housing dataset, we perform exploratory data analysis, and feature engineering and selection using advanced techniques, including an innovative new method for feature creation, and compare different machine learning algorithms and their relative performance. This project, titled ‘A Machine Learning Walkthrough’, emphasizes the most important part of the data science problem: data preparation.
Margaret Ledbetter, The Role of Sugar Propositions in Driving Share in the Food and Beverage Category, April 2019, (Yan Yu, Joelle Gwinner)
Sugar continues to be a “hot topic” for [food and beverage] consumers and a driver of recent buyer and volume declines in the aisle. To date, there has been limited understanding of consumer preferences for specific sugar ingredients – i.e., natural vs. added – and lower sugar propositions. This research seeks to understand the role of sugar ingredient and lower sugar propositions as well as other factors in the [food and beverage] consumer purchase decision, including: brand, variety, all natural claim, and added benefit. The insights uncovered in this research will be used to inform [Client] line optimization and new product development for FY20 + beyond.
Jacob Blizzard, Pricing Elasticity Evaluation Tool, April 2019, (Yan Yu, Chad Baker)
EG America is looking to maximize profits for their beer category through pricing analytics. Pricing analytics is at the core of profitability, but setting the right price to maximize profits is often difficult and extremely complex. This project aims to create a tool that recommends the optimal price to maximize profit by using historic sales data and the price elasticity of demand for top selling items within each state in which EG America operates. The tool compiles sales data queried from internal data systems and market research data from Nielsen and calculates the optimal price to set for each item by using the elasticity coefficient and the cost of each item. Setting the correct price for beer in convenience stores is of utmost importance due to the customer base that is exceedingly price aware and very sensitive to price changes.
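Under a constant-elasticity demand assumption, the optimal-price calculation such a tool performs reduces to a closed-form markup over cost: if demand is q = A * p**e with elasticity e < -1, the profit-maximizing price is p* = c * e / (1 + e). This is a sketch of that rule, not the tool's actual logic, and the cost and elasticity figures below are invented:

```python
def optimal_price(cost, elasticity):
    """Profit-maximizing price under constant-elasticity demand
    q = A * p**e. Requires elasticity < -1 (elastic demand), otherwise
    no finite optimum exists."""
    if elasticity >= -1:
        raise ValueError("elasticity must be below -1")
    return cost * elasticity / (1 + elasticity)

# Hypothetical beer item: $1.20 unit cost, elasticity -3
print(round(optimal_price(1.20, -3.0), 2))  # 1.8
```

The markup factor e / (1 + e) shrinks toward 1 as demand becomes more elastic, which matches the abstract's point that a highly price-aware customer base leaves little room above cost.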
Daniel Schwendler, Single Asset Multi-period Schedule Optimization, April 2019, (Mike Fry, Paul Bessire)
In a production environment, the capacity to produce finished materials is the primary focus of operations leadership. Sophisticated systems surround the scheduling of production assets, and resources are dedicated to making the most of the available capacity. The two primary levers for increasing production are efficiency and total asset availability; in other words, increasing production with current assets, or increasing the total potential for production through additional staffing or production equipment. In many environments, purchasing production equipment is a sizable capital expenditure that is either not an option or requires comprehensive justification. In a continuous flow setting, the scheduling of a single machine can drive the entire supply chain, and the production schedule of this machine is critical: operations management relies on production and scheduling to steer the business. In these cases, the application of optimization provides objective recommendations and focuses skilled resources on decision making. In this paper, we explore the application of mixed integer linear optimization in a continuous flow environment as an enterprise resource planning tool. An optimal master production schedule alone adds value in understanding the machine's capability to meet demand, while also informing many other facets of the business: material requirements planning, inventory management, sales forecasting, required maintenance and supply chain logistics are all critical considerations.
Khiem Pham, Optimization in Truck-and-Drone Delivery Network, April 2019, (Leonardo Lozano, Yiwei Chen)
With the introduction of unmanned aerial vehicles (UAV), also known as drones, several companies with shipping service promised to greatly cut down the delivery time using these devices. One of the reasonable methods to use the drones is to launch them not directly from the shipping centers, but from the normal delivery trucks themselves. With multiple vehicles delivering at the same time, this can save a lot of time. In this paper, we discuss a general approach to the problem from an optimization point of view. We consider different drones’ specifications as well as the number of drones to deploy. We aim to formulate a model that can return optimal vehicle routes and measure the computational expense of the model.
Lauren Shelton, Black Friday Predictions, April 2019, (Dungang Liu, Liwei Chen)
In the United States, the day after Thanksgiving is known as Black Friday, the biggest shopping day of the year, when stores offer their best sales to kick off the holiday season. Store owners could benefit from being able to predict what customers want to buy, how much they are willing to spend, or the demographic of customers to target. For this project, a linear model, generalized additive model, neural network model, and classification tree model are used to predict purchase prices in dollars. All predictor variables, including gender, age, marital status, city category, years in city, occupation category, and product categories, were important. The final model chosen was the linear model, which performed best.
Niloufar Kioumarsi, Mining Hourly Residential Energy Profiles in order to Extract Family Lifestyle Patterns, April 2019, (Yichen Qin, Peng Wang)
This study presents a new methodology for characterizing domestic hourly energy time series based on hierarchical clustering of the seasonal and trend components of energy load profiles. It decomposed the energy time series into their trend, seasonal and noise components, and segmented households into two groups by clustering their trend components. To interpret the trend clustering results, it examined the correlation of the energy time series in each cluster with weather. The study also examined the influence of household characteristics on patterns of electricity use: each trend cluster was linked to household characteristics (house age and size) by applying a decision tree classification algorithm. Finally, the seasonal component of the energy profiles was used to cluster customers based on family lifestyle patterns. The study constructed 6 profile classes (typical load profiles) reflecting different family lifestyles, which can be used in various energy-saving behavioral studies.
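The decomposition step can be sketched as a minimal additive decomposition: a centred moving-average trend plus a phase-averaged seasonal component, with the remainder treated as noise. This is a simplified stand-in for the study's method, and the hourly series below is invented:

```python
def decompose(series, period):
    """Minimal additive decomposition into trend, seasonal and noise.
    Trend: centred moving average (window shrinks at the edges).
    Seasonal: mean of the detrended values at each phase of the cycle."""
    n = len(series)
    half = period // 2
    trend = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        trend.append(sum(series[lo:hi]) / (hi - lo))
    detrended = [s - t for s, t in zip(series, trend)]
    seasonal_means = []
    for phase in range(period):
        vals = detrended[phase::period]
        seasonal_means.append(sum(vals) / len(vals))
    seasonal = [seasonal_means[i % period] for i in range(n)]
    noise = [d - s for d, s in zip(detrended, seasonal)]
    return trend, seasonal, noise

hourly = [10, 12, 18, 14] * 6  # 24 hours with a 4-hour cycle
trend, seasonal, noise = decompose(hourly, period=4)
print(len(trend), len(seasonal), len(noise))  # 24 24 24
```

By construction the three components sum back to the original series, so households can then be clustered separately on their trend vectors (long-run level) and seasonal vectors (daily lifestyle rhythm), as the study does.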
Mark McNall, From Sinatra to Sheeran: Analyzing Trends in Popular Music with Text Mining, April 2019, (Dungang Liu, Edward P. Winkofsky)
Starting in late 1958, Billboard Magazine’s Hot 100 became the single, unified popularity chart for singles in the music industry. Because music is such a universal and beloved form of art and entertainment, exploring how popular music has changed over the years could provide interesting and valuable insight both for consumers and for the music industry (musicians, songwriters, lyricists). One way to explore trends in music is to analyze lyrics. This project analyzes the lyrics of every #1 hit over time using a variety of text-mining applications such as tokenization, the TF-IDF ratio, lexical density and diversity, compression algorithms, and sentiment analysis. Results showed that #1 hits have steadily become more repetitive over time, as popular songs have exhibited declining lexical density and increasing compression ratios. Sentiment analysis showed that popular music has also become more negative, and emotions such as anger and fear are more prevalent in lyrics than positive emotions such as joy compared to the past. Finally, the usage of profanity in popular music has skyrocketed in the last two decades, showing that music has not only become more negative but also more vulgar.
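The compression-ratio measure of repetitiveness works because general-purpose compressors exploit repeated substrings: the more a lyric repeats itself, the smaller it compresses. A sketch using the standard library's `zlib` (the two snippets below are stand-ins, not chart data):

```python
import zlib

def compression_ratio(text):
    """Raw size divided by compressed size; higher means more repetitive."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))

repetitive = "na na na na na na na na " * 20
varied = "Fly me to the moon and let me play among the stars"
print(compression_ratio(repetitive) > compression_ratio(varied))  # True
```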
Laura K. Boyd, Predicting Project Write Downs, May 2019, (Edward Winkofsky, Michael Fry)
Company A is a national engineering consulting firm that provides multi-media services within four distinct disciplines: Environmental, Facilities, Geotechnical, and Materials. The goal of this project is to investigate the amount of project write downs for the Cincinnati office, specifically for the Environmental Department over the past four years. Preliminary review of Company A’s data indicates that average monthly write downs for the Environmental Department are approximately $17,000. This analysis uses linear regression to explore relationships in project data for the Cincinnati office between 2015 and 2018. Backward selection was used to choose predictors: variables were removed one at a time when they did not contribute to the overall prediction. Two limitations were encountered during this analysis: the available data analytics tools and data integrity. As part of the BANA program I was exposed to multiple data analytics tools, including R, SAS, and Microsoft Excel; Company A does not use R or SAS, so Microsoft Excel was used for this analysis. Data integrity is always a concern, especially when the data relies heavily on user inputs. The data used in this analysis was exported from a project management database, where entries are made by each project manager and rely on being accurate and up to date.
Joe Reyes, Cincinnati Real Estate – Residential Market Recovery, May 2019, (Shaun Bond, Megan Meyer)
During the Recession of 2007-2009, real estate was affected nationwide, and homeowners in the Cincinnati tri-state also felt the impact of the downturn. The Hamilton County Auditor’s Office maintains and publishes real estate sale records. This data is useful in evaluating not only the general market for different neighborhoods but also how much local property values were influenced by the recession. Historical sales volumes and values provide insight into the overall character of real estate values as a function of property sales. Further, consideration of market supply and demand during the same period gives a view into the drivers behind the decline and rebound around the recession. This brief summarizes residential trends using the above data for Hamilton County from 1998 through 2018. Comparing this local information to national and regional data gives an idea of how Cincinnati residents fared relative to the Midwest and the USA.
Poranee Julian, A Simulation Study on Confidence Intervals of Regression Coefficients by Lasso and Linear Estimators, May 2019, (Yichen Qin, Dungang Liu)
We performed a simulation study to compare the confidence intervals of regression coefficients produced by Lasso (a regression method that performs both variable selection and regularization to enhance the prediction accuracy and interpretability of the resulting model) and by linear regression. We studied five cases. The first three cases contain different numbers of independent regressors. In the fourth case, we studied a data set of correlated regressors with a Toeplitz correlation matrix. The last case is similar to the fourth, but with an AR(1) correlation matrix. The results showed that linear regression performed slightly better; however, Lasso regression produced effective models as well.
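The linear-regression side of such a simulation can be sketched directly. Below is an illustrative Monte Carlo coverage check for normal-theory OLS intervals with independent regressors (the study's first three cases); it is not the study's code, and valid Lasso intervals, which require specialized post-selection methods, are omitted.

```python
import numpy as np

def ols_ci_coverage(n=100, p=3, beta=None, reps=500, z=1.96, seed=0):
    """Monte Carlo coverage of nominal-95% normal-theory OLS confidence intervals."""
    rng = np.random.default_rng(seed)
    beta = np.ones(p) if beta is None else beta
    hits = np.zeros(p)
    for _ in range(reps):
        X = rng.standard_normal((n, p))            # independent regressors
        y = X @ beta + rng.standard_normal(n)      # unit-variance noise
        XtX_inv = np.linalg.inv(X.T @ X)
        b_hat = XtX_inv @ X.T @ y
        resid = y - X @ b_hat
        sigma2 = resid @ resid / (n - p)           # unbiased noise estimate
        se = np.sqrt(sigma2 * np.diag(XtX_inv))
        hits += (b_hat - z * se <= beta) & (beta <= b_hat + z * se)
    return hits / reps                              # per-coefficient coverage
```

Swapping in a Toeplitz or AR(1) covariance when generating `X` reproduces the correlated-regressor cases.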
David Horton, Predicting Single Game Ticket Holder Interest in Season Plan Upsells, December 2018, (Yan Yu, Joseph Wendt)
Using customer data provided by the San Antonio Spurs, a statistical model was built that predicts the likelihood that an account that purchased only single game tickets in the previous year will upgrade to some sort of plan, either partial or full season, in the current year. The model uses only variables derived from customer purchase and attendance histories (games attended, tickets purchased and attended, money spent) over the years 2013-2016. The algorithm used for training was the Microsoft Azure Machine Learning Studio implementation of a two-class decision jungle. Training data was constructed from customers who had purchased only single game tickets in the previous year and was split randomly, 75% for training and 25% for testing. In later runs, all data from 2016 was withheld from training and testing as a validation set, as noted in the results section. The final model (including 2016 data in training) shows a test accuracy of 84.9%, where 50% is chance level and 100% indicates perfect prediction. This model is likely to see improvements in predictive power as demographic information is added, new variables are derived, feature selection and model choice become more sophisticated, model parameters are optimized, and more data becomes available.
Ravi Theja Kandati, Lending Club – Identification of Profitable Customer Segment, August 2018, (Yan Yu, Olivier Parent)
Lending Club issues unsecured loans to different segments of customers. The interest rate for a loan depends on the customer’s credit history and various other factors such as income level and demographics. The borrower data is public. The objective is to analyze the dataset and separate the good customers from the bad (“charged off”) customers using machine learning techniques. This is a class-imbalanced dataset, as good customers far outnumber bad customers. As this is a typical classification problem, the CatBoost technique is used for modelling.
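Boosting libraries such as CatBoost expose class-weighting options for exactly this kind of imbalance. The underlying idea, weighting each example inversely to its class frequency so the rare "charged off" class is not drowned out, can be sketched with a plain weighted logistic regression on synthetic data (illustrative only, not the project's model):

```python
import numpy as np

def weighted_logreg(X, y, lr=0.5, epochs=3000):
    """Gradient-descent logistic regression with inverse-frequency class weights."""
    n, p = X.shape
    # each class contributes equal total weight despite unequal counts
    w_pos = n / (2 * y.sum())
    w_neg = n / (2 * (n - y.sum()))
    sample_w = np.where(y == 1, w_pos, w_neg)
    beta = np.zeros(p)
    for _ in range(epochs):
        pred = 1 / (1 + np.exp(-X @ beta))
        grad = X.T @ (sample_w * (pred - y)) / n   # weighted log-loss gradient
        beta -= lr * grad
    return beta
```

Without the weights, a model trained on 95% good loans can score high accuracy while flagging almost no bad loans; the weighting pushes the decision boundary back between the two classes.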
Pengzu Chen, Churn Prediction of Subscription-based Music Streaming Service, August 2018, (Dungang Liu, Leonardo Lozano)
As a well-known subscription business model, paid music streaming became the largest source of recorded-music market revenue in 2017. Churn prediction is critical for subscriber retention and profit growth in a subscription business. This project uses data from a leading Asian music streaming service to identify parameters that affect users’ churn behavior and to predict churn. The data contains user information, transaction records, and daily user activity logs. Prediction models are built with logistic regression, classification tree, and support vector machine algorithms, and their performance is compared. The results indicate that the classification tree model performs best of the three in terms of asymmetric misclassification rate. The parameters with the biggest impact on churn are whether subscription auto-renew is enabled, payment method, whether users cancel their membership actively, payment plan length, and user activity 0-2 days before the subscription expires. This informs the service provider where customer relationship management should focus.
Tongyan Li, Worldpay Finance Analytics Projects, August 2018, (Michael Fry, Tracey Bracke)
Worldpay, Inc. (formerly Vantiv, Inc.) is an American payment processing and technology provider headquartered in the greater Cincinnati area. As a Data Science Analytics Intern, I work directly with the Finance Analytics team on multiple projects. The main purpose of the first project was to automate a process that used to be accomplished manually across different databases; RStudio was used and substantially reduced the time required to produce flat files for further use. In the second project, year-over-year (YoY) average ticket price (AVT) growth was analyzed. The Customer Attrition project focuses on the study of customers’ attrition behavior.
Navin Mahto, Generating Text by Training a Recurrent Neural Network on English Literary Experts, August 2018, (Yan Yu, Yichen Qin)
Since the advent of modern computing we have been trying to make computers learn and respond in ways unique to humans. While we have chatbots that mimic human responses with pre-coded answers, they are not fluid or robust. In this project we train a Recurrent Neural Network on the English classic War and Peace by Leo Tolstoy and make it generate sentences similar in nature and structure to the language of the book. The sequential structure of an RNN and its ability to retain previous inputs make it well suited to learning the literary style of a book. On increasing the RNN length and the number of epochs, the error decreases from a maximum of 2.9 to 2.2, and the generated text comes to resemble English more closely.
Non-coherent output: “the soiec and the coned and the coned and the cone”.
Slightly coherent output: “the sage to the countess and the sale to the count”.
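The recurrence at the heart of such a character-level RNN can be sketched in plain NumPy. This is an untrained forward pass and sampler only (training would add backpropagation through time), with the hidden size and random initialization chosen arbitrarily for illustration; it is not the project's implementation.

```python
import numpy as np

class CharRNN:
    """Minimal character-level RNN cell: one-hot input, tanh hidden state,
    softmax over the next character."""
    def __init__(self, vocab, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        self.chars = sorted(set(vocab))
        v, h = len(self.chars), hidden
        self.Wxh = rng.standard_normal((h, v)) * 0.01
        self.Whh = rng.standard_normal((h, h)) * 0.01
        self.Why = rng.standard_normal((v, h)) * 0.01
        self.bh = np.zeros(h)
        self.by = np.zeros(v)

    def step(self, idx, h):
        x = np.zeros(len(self.chars))
        x[idx] = 1.0
        h = np.tanh(self.Wxh @ x + self.Whh @ h + self.bh)  # carry context forward
        logits = self.Why @ h + self.by
        p = np.exp(logits - logits.max())
        p /= p.sum()                                        # softmax over next char
        return p, h

    def sample(self, start, length, seed=0):
        rng = np.random.default_rng(seed)
        h = np.zeros(self.Whh.shape[0])
        idx = self.chars.index(start)
        out = [start]
        for _ in range(length):
            p, h = self.step(idx, h)
            idx = rng.choice(len(self.chars), p=p)
            out.append(self.chars[idx])
        return "".join(out)
```

With random weights the samples look like the non-coherent output above; training the weights on the book's text is what pulls the samples toward English-like structure.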
Zach Amato, Principal Financial Group: GWIS Portfolio Management Platform, July 2018, (Michael Fry, Jackson Bohn)
The overall goal of the GWIS Portfolio Management Platform project is to help bring together the disparate tools, research, and processes of the PPS boutique into as few locations as possible. Throughout the summer, we have started prototyping the Portfolio Viewer module and putting structure around the Research Module. In doing this, data management and data visualization skills have been used to meet the needs of the project and of the business. Future steps in the project will include completing current work and the modules in process and engaging in the Portfolio Construction and Trading modules. Future work will require data management, data manipulation, statistical testing, and optimization.
Nicholas Charles, Craft Spirits: A Predictive Model, July 2018, (Dungang Liu, Edward Winkofsky)
A new trend driving growth in the spirits industry is craft. Craft spirits are usually produced by small distilleries that use local ingredients. In the US, the spirits industry is structured as a three-tier system of manufacturers, distributors, and retailers. In certain states, the state government controls a portion of the three-tier system; the State of Iowa, for instance, controls distribution. The state purchases product from manufacturers and subsequently sells to private retailers in the state. In doing so, the state tracks all transactions at the store level and makes this data available to the public. This project takes that open data and builds a logistic regression model that can be used to predict the outcome of a transaction as either a craft or non-craft purchase. This information may prove useful to distilleries and distributors that specialize in craft by helping to pinpoint where their resources should be focused.
Keshav Tyagi, CC Kitchen’s Dashboard, July 2018, (Michael Fry, Harrison Rogers)
I am working as a Business Analyst Co-op with the Project Management Operations Division within the Castellini Group of Companies, providing business solutions to CC Kitchens, a subsidiary specializing in deli and ready-to-eat products. The project I was assigned to aims at creating an executive-level dashboard for CC Kitchens visualizing important business metrics, which can assist top-level executives in making informed decisions on a day-to-day basis.
My responsibilities include, but are not limited to, interacting with different sectors within the company to identify the data sources for the above metrics, data cleansing, creating data pipelines, preparing data through SQL stored procedures, and creating visuals over them in a tool called DOMO. The data resided in flat files, Excel sheets, emails, an ERP, and APIs, and I created an automated data-flow architecture to collect the data in a centralized SQL warehouse.
Swidle Remedios, Analysis of Customer Rebook Pattern for Refund Policy Optimization, July 2018, (Michael Fry, Antonio Lannes Filho)
Priceline offers lower rates to its customers on certain deals that are non-cancellable by policy. To improve customer satisfaction, changes were deployed in June 2015 to make exceptions to these policies and refund customers. These exceptions apply only to cancel requests that fall under Priceline’s predefined categories/cancel reasons. In this paper, the orders processed under Priceline’s cancel-and-exception policy are analyzed for two of the top cancel reasons. The goal is to determine whether the refunds are successfully driving customer behavior and repurchase habits. The insights obtained from the analysis will help the customer care team at Priceline redesign and optimize the policies for each cancel reason.
Rashmi Subrahmanya, Analysis of Tracker System Data, July 2018, (Michael Fry, Peter M. Chang)
The Tracker system is an internal system at Boeing that records requests from employees working on the floor. Based on the nature of a request, it is directed to the appropriate department. The Standards organization is responsible for the supply of standards (such as nuts, bolts, and rivets) for assembling aircraft. Any request related to standards, such as missing, defective, or insufficient standards, is directed to the Standards organization, which then resolves the request. Resolution times, and indeed the requests themselves, directly impact the aircraft assembly process. The project focuses on analysis of Tracker request data to identify patterns. Data is analyzed on two key metrics: the number of requests and the average resolution time. The top problem types, aircraft areas, and the hours and days with the most requests have been identified. This helps in understanding the reasons behind these requests and supports preventive action, such as increased staffing at particular times of day so that requests are resolved more quickly. Two dashboards were developed: one to show the number of active requests and one to show requests by Integration Center for the 7XX program.
Spandana Pothuri, Data Instrumentation and Significance Testing for a Digital Product in the Entertainment Industry, July 2018, (Dungang Liu, Chin-Yang Hung)
The entertainment technology industry runs on data. Because entertainment is created, unlike the products of a more need-based industry like agriculture, it is important to see how the receiver uses the end product. Based on this feedback loop, more products are created or existing products are improved. How users use an application determines how its next version is built. In this world, clicks translate to dollars, and data is of utmost importance. This paper follows the cycle of data in a technology project, from instrumentation and tracking to reporting and deriving the business impact of the product. The product featured in this paper is Twitch’s extensions discovery page. The goal of launching this product was to increase the visibility of extensions. The product succeeded, increasing viewership by 37%.
Sourapratim Datta, Product Recommendation: A Hybrid Recommendation Model, July 2018, (Michael Fry, Shreshth Sharma)
This report provides a recommendation of products (movies) to be licensed for a channel in an African country. The product recommendations are based on product features such as genre, release year, and US box office revenue, and on performance on other African and worldwide channels. A hybrid recommendation model was developed, combining product features (content-based recommendation) and product performance (a collaborative filtering model). For the content-based recommendation, a similarity matrix is calculated based on the user preferences of the market, and the most similar products are considered. To estimate the performance of products that have not been telecast on the African channels, a collaborative filtering model is trained on the known performance indexes, and the top predicted products are considered. Finally, combining the products from both methods with equal weight, a final list of products is recommended.
Akhil Gujja, Hiring the Right Pilots – An Analysis of Candidate Applications, July 2018, (Michael Fry, Steven Dennett)
Employees are an asset to any organization and the key to any firm’s success. For a company to grow, flourish, and succeed, the right people must be hired from the start. Hiring the right personnel is a time-consuming and tedious task for any organization. Especially in the aviation industry, where safety and reliability are of utmost importance, hiring the right pilots is critical. Even under ideal conditions, hiring pilots can be an arduous task. It is extremely difficult to predict exactly how well pilots will perform in the cockpit, because a pilot’s future performance cannot be predicted from academic performance or historical flying metrics alone; it depends largely on non-quantifiable factors. Candidate profiles are analyzed to understand which profiles made it through the selection process and which did not survive resume screening. Recruiters can use this analysis to rank candidate profiles and expedite the hiring of top-ranked candidates.
Yang He, Incremental Response Model for Improved Customer Targeting, July 2018, (Dungang Liu, Anisha Banerjee)
Traditional marketing response models score customers based on their likelihood to purchase. However, among potential customers, some would purchase regardless of any marketing incentive, while others would purchase only because of marketing contact. Traditional predictive models therefore sometimes waste money on customers who would shop regardless of marketing offers, and on customers who would stop shopping if ‘disturbed’ with marketing offers. Oriental Trading Company Inc., a company with a catalog heritage, does not want to send catalogs to customers who would purchase naturally, for cost saving and profit maximization. The main objective of my internship was to distinguish customer groups that need catalogs to shop from groups that will shop naturally or will stop shopping if given a marketing incentive, using the incremental response model in SAS Enterprise Miner. This report presents the basic theory of the incremental response model and shows how the model is applied to an Oriental Trading Company dataset. A combined incremental response model was successfully built using demographic and transactional attributes. The estimated model identified an incremental response of 11.9%, 1.7 times higher than the baseline incremental response (7.2%). The model was used to predict customers’ purchase behavior in future marketing activity. Additionally, from the modeling outputs, we identified that the overall number of orders, sales of some major products, and days since first purchase were the most important factors affecting customers’ response to the mailed catalogs.
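The core quantity here, incremental response, is simply the gap between treated (mailed) and control (held-out) response rates, estimated per customer segment. A toy sketch of that computation (an illustrative data layout, not SAS Enterprise Miner's implementation):

```python
import numpy as np

def incremental_response(treated, control):
    """Uplift estimate: treated response rate minus control response rate.
    `treated`/`control` are 0/1 purchase indicators for each group."""
    treated, control = np.asarray(treated), np.asarray(control)
    return treated.mean() - control.mean()

def segment_uplift(segments, treated_flag, purchased):
    """Per-segment uplift; positive segments are the ones worth mailing."""
    segments = np.asarray(segments)
    treated_flag = np.asarray(treated_flag, bool)
    purchased = np.asarray(purchased)
    out = {}
    for s in np.unique(segments):
        m = segments == s
        out[s] = incremental_response(purchased[m & treated_flag],
                                      purchased[m & ~treated_flag])
    return out
```

A segment whose uplift is near zero (or negative) is one that shops naturally or is "disturbed" by contact, and is dropped from the mailing list.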
Nandakumar Bakthisaran, Customer Service Analytics – NTT Data, July 2018, (Michael Fry, Praveen Kumar S)
The following work describes the application of data analysis techniques for a healthcare provider. Two tasks are covered. The first is an investigation to identify the root cause of an anomaly in a business process: the average of the scores measuring agent performance on calls rose unusually starting in 2018. The chief cause was identified as a deviation from standard procedure by evaluators, and the recommendation was to ensure greater adherence to that procedure. The second task scrutinizes the single scoring benchmark used for different types of incoming calls and how it falls short of measuring performance accurately. Attempts were made to fit probability distributions to the underlying data for each call type. The conclusion was to employ a type-specific scoring system using point estimates obtained from existing data.
Christopher Uberti, General Motors Energy/Carbon Optimization Group, July 2018, (Michael Fry, Erin Lawrence)
This capstone outlines a strategy for implementing improved statistical metrics for analyzing General Motors (GM) factories. Current GM reporting methods and available data are outlined, and two methodologies for improved metrics and dashboards are proposed. The first is an analysis of the individual HVAC units within a factory (each factory has dozens) to identify units that may be performing poorly or are not set to the correct modes. The second is a prediction model for overall plant energy usage based on historical data, giving plant operators a way to compare current energy usage to past performance while accounting for changes in weather, production, etc. Finally, some potential dashboards are mocked up for use in the energy reporting software.
Anumeha Dwivedi, Sales Segmentation for a New Ameritas Universal Life Insurance Product, July 2018, (Dungang Liu, Trinette James)
The key to great sales for a new product is knowing the right kind of customers (those who are most profitable) and deploying your best agents (high-performing salespeople) to reach them. This project is therefore aimed at customer and agent segmentation for a new Ameritas Universal Life Insurance product slated for launch later this year. The segmentation is based on customer, agent, policy, and rider data from comparable historical products, and it uses demographic, geographic, and behavioral attributes that are available directly or could be inferred or sourced externally. The segments developed would not only allow more effective marketing efforts (better training, better-informed agents, and better marketing collateral) but also improve the profitability of sales. Segmentation was performed using suitable clustering (unsupervised machine learning) techniques. The results suggest that clients in their early sixties and mid-thirties are the most profitable, in that order, and form the major portion of the customer base; the 45-55 age band has been less profitable, with higher coverage amounts for medium premium payments. On the agent side, the most experienced agents (oldest in age and longest tenured with Ameritas) have been most successful selling UL policies, followed by the youngest agents, in their thirties with tenures of 2-5 years; agents with 6-15 years of tenure in the 45-55 age band have been more complacent and limited in their sales of these policies.
Kamaldeep Singh, DOTA Player Performance Prediction, July 2018, (Peng Wang, Yichen Qin)
Dota 2 is a free-to-play multiplayer online battle arena (MOBA) video game, played in matches between two teams of five players, with each team occupying and defending its own base on the map. Each of the ten players independently controls a powerful character, known as a "hero" (chosen at the start of the match), with unique abilities and a distinct style of play. During a match, players collect experience points and items for their heroes in order to successfully battle the opposing team's heroes, who attempt to do the same to them. A team wins by being the first to destroy a large structure located in the opposing team's base, called the "Ancient." The objective of this project is to develop an algorithm that predicts a player’s performance with a specific hero by learning from that player’s performance with other heroes. The response variable used to quantify performance is the KDA (kill-to-death) ratio of the player with that hero. The techniques used are Random Forest, Gradient Boosting, and the H2O package, which encapsulates various techniques and automates model selection. Data was provided by Analytics Vidhya and is free for anyone to use.
Sarita Maharia, NetJets Dashboard Management, July 2018, (Michael Fry, Stephanie Globus)
Data visualization plays a vital role in exploring data and summarizing analysis results across an organization. The visualizations at NetJets were developed using disparate tools on an as-needed basis, without corporate standards. Once employees began using Tableau as the data visualization tool, it became even more important to have a centralized team to develop the infrastructure, set corporate standards, and enforce the required access mechanisms. The Center of Excellence for Visual Analytics (CoE-VA) team now serves as the central team monitoring visualization development across the organization, and it needs a one-stop solution for answering analytics-community queries about dashboard usage and for comparing access between users. The NetJets Dashboard Management project was developed as that solution, enabling the CoE-VA team to monitor existing dashboards and the access structure. The primary purpose of this project is to present dashboard usage and access data in a concise, user-friendly framework. Two exploratory dashboards accept user input and provide the required information visually, with a provision to download the data. The immediate benefit is time and effort savings for the CoE-VA team: the turnaround time for comparing access between two users is reduced from approximately an hour to a few seconds. The long-term benefit is to promote a Tableau usage culture in the organization by tracking dashboard usage and educating end users about under-utilized dashboards.
Mohammed Ajmal, New QCUH Impact Dashboard & Product Performance Dashboard, July 2018, (Michael Fry, Balji Mohanam)
Qubole charges its customers for its services based on their cloud usage. The current revenue methodology depends on the instance type being used and the Qubole-defined factor (QCUH factor) associated with it. The first project evaluates the impact of the new QCUH factor from a revenue standpoint. Additionally, a dashboard was built to enable the sales team to identify customers whose invoices would increase under the new QCUH factor, with functionality to help the sales team arrive at mutually agreed terms with individual customers. Qubole currently lacks a single reporting platform where all important metrics are tracked; the second project addresses this concern with two dashboards. The first tracks all critical metrics at the overall organization level; the second tracks almost all of the same metrics at the individual-customer level, where the user inputs a customer name to populate that customer’s data. Together, the two dashboards provide a comprehensive overview of the health of Qubole.
Maitrik Sanghavi, Member Churn Prediction & CK Health/Goals Dashboards, July 2018, (Michael Fry, Rucha Fulay)
This document describes two key projects executed while interning at Credit Karma. A bootstrapped logistic regression model was created to predict the probability that a Credit Karma member will not return to the platform within 90 days. The model can be used to target active members who are at risk of churning and serves as a baseline for future model improvements. The CK Health and CK Goals dashboards were modified and updated to monitor the company’s key performance indicators and track 2018 goals; they were built using BigQuery and Looker and update automatically each day.
Nitish Ghosal, Producer Behavior Analysis, July 2018, (Dungang Liu, Trinette James)
Ameritas Insurance Corporation Limited works on a B2B marketing model, partnering with agencies, financial advisors, agents, and brokers to sell its products and services to the end customer. An agent can choose to sell products for multiple insurance companies but can be contracted full time with only one. To incentivize an agent’s affiliation with Ameritas, it has an Agents Benefits & Rewards Program in place that works on the principle of “greater the agent’s results, greater the rewards.” The aim of my study is to provide a holistic overview of our agents’ behavior and identify their drivers of success through segmentation and agent profiling. To achieve this, data visualizations were created in Tableau to find trends, and clustering was performed in SPSS to segment the agents into groups. The agents could be grouped into five distinct categories: top agents, disability insurance specialists, life insurance specialists, generalists, and inactive agents. The analysis revealed that factors such as benefits, club qualification, contract type, club level, agency distribution region, persistency rates, home agency state, and AIC (Ameritas Investment Corporation) affiliation have an impact on an agent’s success and sales revenues.
Krishna Chaitanya Vamaraju, Recommender System on the Movie Lens Data Set, July 2018, (Dungang Liu, Olivier Parent)
Recommendation systems are used on most e-commerce websites to promote, up-sell, and cross-sell products to new or existing customers based on the history of existing customers. This helps recommend the right products to customers, thereby increasing sales. This report summarizes various recommendation techniques, comparing the models on run time and discussing the issues with each. The 100k MovieLens ratings data set from Kaggle was used for the analysis. The goal of the project is to build recommendation engines in R based on different techniques using the recommenderlab package; deployed into production, such a system would resemble what we see on Netflix. Cosine similarity is used to compute the similarity between users and items. The techniques used are user-based collaborative filtering, item-based collaborative filtering, and collaborative techniques based on popularity and randomness; a recommender system using singular value decomposition and k-nearest neighbors is also built. A comparison indicates that the popularity-based technique gives the highest accuracy as well as good run time, though this depends on the data set and the stage of recommendation. Ultimately, the metric one wants a recommender system to optimize determines the type of system one should build.
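A single user-based collaborative-filtering prediction with cosine similarity, the idea behind recommenderlab's UBCF method, can be sketched as follows (toy ratings matrix and helper names chosen for illustration, not the project's code):

```python
import numpy as np

def cosine_sim(R):
    """Row-wise cosine similarity of a user x item ratings matrix (0 = unrated)."""
    norms = np.linalg.norm(R, axis=1, keepdims=True)
    return (R @ R.T) / (norms * norms.T)

def predict_ubcf(R, user, item, k=2):
    """Predict a rating as the similarity-weighted mean of the ratings
    given by the k most similar users who rated the item."""
    sims = cosine_sim(R)[user]
    raters = [u for u in range(R.shape[0]) if u != user and R[u, item] > 0]
    top = sorted(raters, key=lambda u: -sims[u])[:k]
    num = sum(sims[u] * R[u, item] for u in top)
    den = sum(sims[u] for u in top)
    return num / den if den else 0.0
```

Item-based filtering transposes the same idea (similar items instead of similar users), which is why the two methods share most of their machinery.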
Ananthu Narayan Ambili Jayanandan Nair, Comparing Deep Neural Network and Gradient Boosted Trees for S&P 500 Prediction, July 2018, (Yan Yu, Olivier Parent)
The objective of this project was to build a model that accurately predicts the S&P 500 index in the (t+1)st minute using the component values in the tth minute. Two machine learning techniques, artificial neural networks and gradient boosted trees, were used to build the models. TensorFlow, which makes use of an NVIDIA GPU, was used for training the neural network model; H2O, which speeds up training through parallelized implementations of algorithms, was used for the gradient boosted trees. The models were compared on mean squared error, and the neural network model was found to be better suited for this application.
Prerit Saxena, Forecasting Demand of Drug XYZ using Advanced Statistical Techniques, July 2018, (Michael Fry, Ning Jia)
Client ABC is a large pharmaceutical company and a client of KMK Consulting Inc. ABC has a diverse portfolio of drugs across disease areas, and the organization is structured as a division for each area. The NET team is responsible for ABC’s drugs in the neuroendocrine tumor area, a disease area with a global market of about $1.5B. ABC’s major drug XYZ has been on the market for a few years and holds a major market share. XYZ is a “buy and bill” drug, meaning hospitals buy and stock the drug in advance and then bill payers according to usage. This report covers the forecasting exercise for drug XYZ, carried out for three phases: the remainder of 2018, 2019, and 2020-2021. The team uses forecasting methods such as ARIMA and Holt-Winters, together with trend analysis and business knowledge, to forecast both the number of drug units and dollar sales for the upcoming years.
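Holt's linear-trend smoothing, the non-seasonal core of the Holt-Winters family mentioned above, is compact enough to sketch directly. The smoothing parameters and horizon below are illustrative defaults, not the team's calibrated values:

```python
def holt_forecast(y, alpha=0.5, beta=0.3, horizon=4):
    """Holt's linear-trend exponential smoothing: update a level and a trend
    at each observation, then extrapolate linearly h steps ahead."""
    level, trend = y[0], y[1] - y[0]
    for t in range(1, len(y)):
        prev = level
        level = alpha * y[t] + (1 - alpha) * (level + trend)  # smoothed level
        trend = beta * (level - prev) + (1 - beta) * trend    # smoothed slope
    return [level + h * trend for h in range(1, horizon + 1)]
```

Adding a third seasonal equation turns this into full Holt-Winters, which suits monthly unit-sales series with recurring within-year patterns.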
Manoj Muthamizh Selvan, Donation Prediction Analysis, July 2018, (Andrew Harrison, Rodney Williams)
The UC Foundation Advancement Services team participates in the process of bringing donations to UC in an effective manner. The team has data on all donors collected over the past 12 years and is interested in any findings and insights from the data. The UC Foundation team would like to predict the probability of large future donations and target donors effectively. Hence, the team wants to understand the triggers and factors responsible for donations, and the probability of a donor donating a larger amount (> $10,000). A random forest model was used to identify the trigger factors and to predict high donors in the prospect population. The results are being used by the UC Foundation's Salesforce team to target high donors with better accuracy than heuristic-based models.
Akhilesh Agnihotri, Employee Attrition Analysis, July 2018, (Dungang Liu, Peng Wang)
Human resources plays an important part in developing and sustaining a company or an organization. The HR department is responsible for maintaining and enhancing the organization's human resources by planning, implementing, and evaluating employee relations and human resources policies, programs, and practices. Employers generally consider attrition a loss of valuable employees and talent. However, there is more to attrition than a shrinking workforce. As employees leave an organization, they take with them much-needed skills and qualifications that they developed during their tenure. If the reasons behind employee attrition are identified, the company can create a better working environment for its employees, and if employee attrition can be predicted, the company can take the necessary actions to stop valuable employees from leaving. This report explores the HR data and manipulates it to find meaningful relations between the response variable (whether an employee left the company or not) and the dependent variables that provide information about an employee. The report then builds several statistical models that can predict the probability of an employee leaving the company given his or her information, and selects the best-performing model.
Amoul Singhi, Identifying Factors that Distinguish High Growth Customers, July 2018, (Dungang Liu, Lingchan Guo)
If a bank can identify customers who have the potential to spend more next year than they spent this year, it can market better products to them and increase customer satisfaction along with its profits. The aim of this analysis is to identify the set of customer features that distinguish high-growth customers from others. The data collected for the analysis was divided into two categories: transactional and demographic. Data exploration identified some factors that behave differently in the two groups. A linear regression model was then fit with the percentage increase from 2016 to 2017 as the target variable to identify the factors that are statistically important in determining a cardholder's growth. A logistic regression model was built classifying accounts with more than 25% growth as high-growth customers, followed by a tree model and a random forest model to improve performance. It was found that although some of the variables are statistically significant, their coefficients are very low, implying that while they are important in determining an account's growth, their impact is small. Two transactional variables identified by the random forest can help determine whether a customer is high growth, but the accuracy of the model is quite low. Overall, certain factors were identified as important, but it is very difficult to predict whether a cardholder is going to spend more in upcoming years.
Sudaksh Singh, Path Cost Optimization: Tech Support Process Improvement, July 2018, (Michael Fry, Rahul Sainath)
The objective of this project is to optimize the process by which tech support agents diagnose and resolve the issues that customers of one of the world's largest technology companies face with various products. The organization's tech support agents use multiple answer flow trees: tree-structured, question-and-answer-based graphs used for diagnosing issues in customer products. The answer flow trees are optimized by studying the historical performance of the issue diagnosis and resolution activity carried out by agents using these trees. Performance is measured across a collection of metrics defined to capture the speed and accuracy of issue diagnosis and resolution. Based on the analysis, recommendations are made to reorder or prune the answer flow trees to achieve better performance across these metrics. These measures and recommendations will serve as a starting point for more advanced, holistic techniques that design algorithms to generate new answer flow trees with the best performance across all metrics, while considering the constraints that limit the reordering and pruning of the answer flow graph.
Max Clark, HEAT Group Customer Lead Scoring, July 2018, (Michael Fry, Maylem Gonzalez)
The HEAT Group is responsible for all events taking place at the American Airlines Arena, such as NBA basketball games, concerts and other performances. While offering a winning and popular product will yield high demand, the HEAT Group must employ analytical methods to smartly target customers who otherwise would not be attending events. The purpose of this project was to determine the differences between the populations of the HEAT Group's two main customer groups, premium and standard customers. Furthermore, a machine learning model was implemented to, first, create a scoring method to assess which customers the HEAT Group has the highest probability of converting from standard to premium and, second, determine which features have the largest impact. It was discovered that age and financial status are the most important differentiators between the two population groups.
Mohit Deshpande, Wine Review Analysis, July 2018, (Yichen Qin, Peng Wang)
Analyzing structured data is simpler than analyzing unstructured data because observations in structured data are arranged in a specific format suitable for applying analytical techniques. The internal structure of unstructured data (audio, video, text, etc.), on the other hand, does not adhere to any format. Unstructured data generation is at an all-time high, so methods to analyze it are the need of the hour. This project studies and implements one such technique, text analytics, which is used to examine textual data. The initial part of the project revolves around performing exploratory data analysis on a dataset of wine reviews to discover hidden patterns in the data. The latter part focuses on analyzing the text-heavy variables by performing basic text, word frequency, and sentiment analysis.
Devanshu Awasthi, Analysis of Key Sales-Related Metrics through Dashboards, July 2018, (Michael Fry, Jayvanth Ishwaran)
Visibility is key to running a business at any scale. Organizations have a constant need to assess where they stand day-in and day-out and where they can improve. With the advancements in tools and technologies that help capture large chunks of operational and business data at even shorter intervals, companies have started to explore methods of using this data in a better way to get insights more frequently. One step in this direction at NetJets is to move away from the traditional methods of using static systems for business reporting. Descriptive analytics using advanced techniques of data management and data visualization is used to create dashboards which can be shared across the organization with different stakeholders. Dashboards prove to be extremely useful in analysis as they show the trends for different metrics over the month, and help us dig down deeper through the multiple layers of information. This project involved transforming the daily reporting mechanism for the sales team through dashboards for three large categories – cards business, gross sales and net sales. Each category has important metrics the business users are concerned with. On any day, these dashboards help analyze the time-series data associated with these categories and assess how the business has fared on certain metrics for the month, identify anomalies and get a comprehensive view of the expected sales for the rest of the month.
Ayank Gupta, Predicting Hospital-Acquired Pressure Injuries, July 2018, (Michael Fry, Susan Kayser)
A pressure injury is defined by the NPUAP as "localized damage to the skin and/or underlying soft tissue usually over a bony prominence or related to a medical or another device. The injury can present as intact skin or an open ulcer and may be painful. The injury occurs as a result of intense and/or prolonged pressure or pressure in combination with shear. The tolerance of soft tissue for pressure and shear may also be affected by microclimate, nutrition, perfusion, co-morbidities, and condition of the soft tissue." Identifying the factors responsible for pressure injury can be very difficult and is of vital importance to hospital bed manufacturers. It is crucial to identify the type of pressure injury a patient might acquire in a hospital and to educate nurses to take proactive rather than reactive measures. The objective is to predict whether a patient will have a hospital-acquired pressure injury given demographic information about the patient, information about the wound, and the hospital.
Renganathan Lalgudi Venkatesan, Detection of Data Integrity in Streaming Sensor Data, July 2018, (Michael Fry, Netsanet Michael)
The Advanced Manufacturing Analytics team wants to identify whether there are any data integrity issues in the streaming sensor data gathered from the manufacturing floors. The infrastructure for asset tracking has been set up in several phases. Each site consists of sensors that capture the spatiotemporal information of all tagged assets, giving real-time information on the whereabouts of each asset. Depending on their purpose, several types of sensors are positioned in different locations within a given plant. This project aims to examine the integrity of the streaming data, monitor the health of the flow, and detect and label the time frames of historical disruptions. Since the streaming infrastructure is in its early stages, the analysis also aims to identify shortcomings and explore the improvements that will be required for future production-critical processes. Several methods are proposed to address current and recent streaming data issues by checking for disruptions in the data feed using historical data, which would help capture disruptions on time and thus support real-time site operation decisions. The infrastructure also experiences periods of data loss as well as high volumes of one-time data (referred to as outliers). The project proposes methods to detect and quantify the data losses and to detect outliers in the sensor data based on the operating characteristics and other factors at each site.
Aksshit Vijayvergia, Predict Default Capability of a Loan Applicant, July 2018, (Yichen Qin, Ed Winkofsky)
Many people struggle to get loans due to insufficient or non-existent credit histories, and, unfortunately, this population is often taken advantage of by untrustworthy lenders. Borrowing from financial institutions is a difficult affair for this segment of the population. Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data, including telco and transactional information, to predict its clients' repayment abilities. For this capstone project I dig deep into the dataset provided by Home Credit on the analytics website Kaggle. In order to classify whether an applicant will default, I analyze and munge two datasets: the first is extracted from the application filed by a client in Home Credit's loan portal, and the second contains a client's historical borrowing information.
Sylvia Lease, Analytics & Communication: Leveraging Data to Make Connections, Summarize Results, & Provide Meaningful Insights, July 2018, (Mike Fry, Steve Rambeck)
Entering my time as Ever-Green Energy's Business Analyst Intern, a defined project goal was established: to create a variety of reports for client and internal use alike. Armed with newly developed skills in coding, data visualization, and managing data, I quickly realized that these skills would serve as tools for an overarching, more imperative goal: to communicate effectively. Over several weeks, the opportunity to merge data with communication took a variety of forms. In the beginning, discussions with various leaders and groups within the company translated to an understanding of how analytics could further the company's mission. This led to a recognition of how analytics could help bridge a gap between the IT and Business Development groups to create reports that helped the teams serve clients by answering key questions and interests. Ultimately, the success of each polished and carefully designed report came down to communication: whether the report provided useful insights and summaries of data in a clear and efficient manner.
Anitha Vallikunnel, Product Reorder Predictions, July 2018, (Dungang Liu, Yichen Qin)
Uncovering shopping patterns helps retail chains design better promotional offers, forecast demand, and optimize brick-and-mortar store aisles; in short, everything needed to build a better experience for the customer. In this project, using Instacart's open-sourced transactional data, I identified and predicted the items that are ordered together. The Apriori algorithm and association rules are used for market basket analysis to achieve this. Using feature engineering and gradient boosted tree models, reordering of items is also predicted, which will help retailers in demand forecasting and in identifying the items that will be ordered more frequently. The F1 score is used as the metric for measuring prediction accuracy for reordering: 0.772 on the training sample and 0.752 out of sample.
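The support and confidence measures behind Apriori-generated association rules can be illustrated on a handful of toy baskets (the Instacart data is vastly larger; the item names here are made up):

```python
def rule_metrics(baskets, antecedent, consequent):
    """Support and confidence for the association rule
    antecedent -> consequent over a list of transactions."""
    n = len(baskets)
    a = frozenset(antecedent)
    both = a | frozenset(consequent)
    support_a = sum(a <= b for b in baskets) / n       # P(antecedent)
    support_both = sum(both <= b for b in baskets) / n  # P(both)
    confidence = support_both / support_a if support_a else 0.0
    return support_both, confidence

# Four toy baskets, not the Instacart data
baskets = [frozenset(b) for b in (
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"bread", "eggs"},
    {"milk", "eggs"},
)]
support, confidence = rule_metrics(baskets, {"milk"}, {"bread"})
# {milk} -> {bread}: support 0.5, confidence 2/3
```

Apriori's contribution is efficiently enumerating the frequent itemsets whose support clears a threshold; the rule metrics themselves are just the ratios above.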
Ananya Dutta, Trip Type Classification, July 2018, (Dungang Liu, Peng Wang)
Walmart is the world's largest retailer, and improving the customer's shopping experience is a major goal. In this project we try to recreate Walmart's existing trip type classification with a limited number of features. This categorization of trip types will help Walmart improve the customer experience through personalized targeting and can also help identify business softness related to specific trip types. The data contains over 600k rows of purchase data at the scan level. After rigorous feature engineering and model comparisons, we found that an extreme gradient boosting model is promising, with an accuracy of ~90% on training data and ~70% on testing data. Looking at variable importance, the total number of units, the count of UPCs, the coefficient of variation of percentage contributions across departments, and items sold in departments like financial services, produce, menswear, and DSD grocery were found to be important in building this classifier.
Nitha Benny, Recommendation Engine, July 2018, (Dungang Liu, Yichen Qin)
Recommendation engines are widely used today across e-commerce, video streaming, movie recommendations, etc., and they are how these businesses maintain their edge in the highly competitive online business world. This project compares collaborative filtering techniques to better understand how recommendation engines work. The two main types of collaborative filtering, user-based and item-based, are used here. The two models are built, and the MSE and MAE values are calculated for each. The models are then evaluated using ROC curves and precision-recall plots for different numbers of recommendations. We find that user-based collaborative filtering with the cosine similarity function works best, giving a lower MSE value of 1.064 as well as better areas under the ROC and precision-recall curves. The user-based collaborative filtering method can thus help businesses recommend better products to their customers and improve the customer experience.
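A user-based collaborative filtering prediction of the kind compared above reduces to a similarity-weighted average of other users' ratings for the target item. A minimal sketch with toy ratings and made-up similarity values (the project works in R on a real rating matrix):

```python
import numpy as np

def predict_rating(ratings, sims, user, item):
    """User-based CF: similarity-weighted average of other users'
    ratings for `item` (0 = unrated). `sims[u]` is the similarity
    of user u to `user` (e.g. from cosine similarity)."""
    neighbors = [u for u in range(len(ratings))
                 if u != user and ratings[u, item] != 0]
    if not neighbors:
        return 0.0
    weights = np.array([sims[u] for u in neighbors])
    scores = np.array([ratings[u, item] for u in neighbors])
    return float(weights @ scores / weights.sum())

# Toy data: predict user 0's missing rating for item 1
ratings = np.array([
    [5, 0, 4],
    [4, 2, 4],
    [1, 5, 2],
])
sims = np.array([1.0, 0.9, 0.1])  # made-up similarities to user 0
pred = predict_rating(ratings, sims, user=0, item=1)  # 2.3
```

Because user 1 (similarity 0.9) rated the item 2 and user 2 (similarity 0.1) rated it 5, the prediction lands close to the similar user's rating.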
Nirupam Sharma, UC Clermont Data Analysis and Visualization, July 2018, (Mike Fry, Susan Riley)
For my summer internship, I worked as a graduate student at UC Clermont College in Batavia, Ohio, for the office of Institutional Effectiveness from May 2018 to July 2018. My responsibilities were to build an R analytical engine to perform data analysis and to design Tableau dashboards highlighting key university insights. The data used in the analysis consisted of tables describing the number of enrollments, courses, employees, accounts, and sections for different semesters. The analytical engine was written in R to connect to the data, combine data tables, and perform SQL and descriptive analysis to draw inferences about trends across years. The results of the analysis were used to build dashboards in Tableau. My responsibilities in Tableau were to create new calculated fields, parameters, and dynamic actions, and to use other advanced Tableau features learned during my master's at UC to build charts and dashboards to be uploaded to the UC Clermont website. The analytical engine I built allowed the college to perform data pipeline tasks quickly and with little human input, saving the college a lot of time and effort. The dashboards help the college better understand trends in the data and make recommendations to management. My internship allowed me to hone my R and Tableau skills: I learned to use many advanced R packages, and my ability to write quality code increased significantly. My experience at UC Clermont College will allow me to work more professionally and effectively in my future job.
Scott Fueston, Preventing and Predicting Crash-related Injuries, July 2018, (Yan Yu, Craig Zielazny)
This study aims to identify influential factors that elevate a motorist's risk of sustaining a serious or fatal injury during a motor vehicle crash. Addressing these factors could potentially save lives, prevent long-term pain and suffering, and avert liabilities and monetary damages. Using population comparisons through exploratory data analysis and model creation for prediction, contributory factors to devastating injuries have been identified. These include: lack of restraint use, deployment of an air bag, crashing into a fixed object, crashing head-on, a roadside collision, nighttime driving, the vehicle type being a car, speeding, a rollover, a first impact at the front-left corner, disabling damage to the vehicle, and alcohol involvement. This information could be invaluable to policy designers, regulatory agencies, car manufacturers, and consumers in developing clear communications and advocacy for prevention, proposing and implementing effective policy and laws, informing the manufacture and design of future automobiles, and elevating the general public's awareness of risk factors.
Amit Kumar Mishra, Customer Churn Prediction, July 2018, (Yan Yu, Yichen Qin)
Churn rate is defined as the proportion of customers who drop their subscriptions with an organization. It is an important component of an organization's profitability, giving an indication of the revenue lost. Additionally, an organization can identify the factors responsible for customer churn and allocate its resources to those factors, and a customer retention program can be developed to maintain retention. Given the significance of customer churn, telecommunication customer data was obtained from the IBM repository and explored to find the factors responsible for churn. Various machine learning techniques, including logistic regression (with the probit, logit, and cloglog link functions, and using different variable selection procedures), trees, random forests, support vector machines, and gradient boosting, were used to predict customer churn, and the best model was identified in terms of in-sample and out-of-sample performance. Tenure, contract, internet service, monthly charges, and payment method were found to be the most important variables for predicting customer churn in the telecommunication industry. Among the classification techniques, the support vector machine with a radial basis function (RBF) kernel performed best in terms of accuracy, with 80.10% of the data classified correctly.
Pranil Deone, Default of Credit Card Clients Dataset, July 2018, (Peng Wang, Liwei Chen)
The Default of Credit Card Clients data set is used for this project. The main objective is to build a credit risk model that accurately identifies the customers who will default on their credit card bill payment in the next month. The model is based on the credit history of the customers, which includes their limit balance, previous months' payment status, and previous months' bill amounts. Various demographic factors like age, sex, education, and marital status have also been considered. The data set contains 30,000 observations and 25 variables. Some preprocessing was done to prepare the data for analysis and modeling, and quantitative and categorical variables were identified and separated for appropriate exploratory data analysis. Modeling techniques including generalized logistic regression, stepwise variable selection, LASSO regression, and gradient boosting machines were used to build different credit risk models. Model performance criteria such as the misclassification rate and AUC, evaluated on the training and testing data, were used to compare the models and select the best one.
Hemang Goswami, Ames Housing Dataset, July 2018, (Dungang Liu, Yichen Qin)
Residential real estate prices are fascinating… and frustrating at the same time. None of the parties involved in the house-buying process (the homebuyer, the home-seller, the real estate agent, the banker) can point out the factors affecting house pricing with total conviction. This project explores the Ames Housing Dataset, which contains information on residential property sales that occurred in Ames, Iowa from 2006 to 2010. The dataset has 2,930 observations with 80 features describing the state of each property, including our variable of interest: sale price. After creating 10 statistical models, ranging from basic linear regression to highly complex gradient boosting and neural network models, we were able to predict the house prices with an MSE as low as 0.015. In the process, we found that the overall quality of the house, exterior condition, area of the first floor, and neighborhood were some of the key features affecting prices.
Ameya Jamgade, Breast Cancer Wisconsin Prediction, July 2018, (Yan Yu, Yichen Qin)
Breast cancer is a cancer that develops from breast tissue. Certain changes in the DNA (mutations) result in uncontrolled growth of cells, eventually leading to cancer. Breast cancer is one of the most common types of cancer in women in the United States, ranking second among cancer deaths. This project analyzes data on women residing in the state of Wisconsin, USA, applying data mining techniques to classify whether a tumor mass is benign or malignant. Data for this project was obtained from the UCI Machine Learning repository and contains information on 569 women across 32 different attributes. Data cleaning and exploratory data analysis were performed to prepare and summarize the main characteristics of the data set. The data was partitioned into training and test sets with an 80%/20% split, and data mining algorithms such as k-nearest neighbors, random forest, and support vector machines were used for classification of the diagnosis as malignant or benign. The optimal value of k is 11 for the k-nearest neighbor classifier, which gives 98.23% accuracy. The tuned random forest model has an error rate of 3.87% and identified the top 5 predictor variables. The tuned SVM model gives accuracies of 98.68% and 95.58% on the training and test data respectively. The findings in this project can be used by the health-care community to perform additional research on these attributes to help reduce the prevalence of breast cancer.
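The k = 11 nearest-neighbor classifier reported above works by majority vote among the 11 closest training points. A minimal sketch on synthetic two-cluster data standing in for the benign/malignant classes (the real data set has many more predictors and is not reproduced here):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=11):
    """Classify x by majority vote among its k nearest neighbors
    (Euclidean distance), mirroring the abstract's k = 11."""
    d = np.linalg.norm(X_train - x, axis=1)
    nearest = y_train[np.argsort(d)[:k]]
    values, counts = np.unique(nearest, return_counts=True)
    return values[np.argmax(counts)]

rng = np.random.default_rng(0)
# Two synthetic clusters standing in for benign (0) / malignant (1)
benign = rng.normal(0.0, 1.0, size=(40, 2))
malignant = rng.normal(4.0, 1.0, size=(40, 2))
X = np.vstack([benign, malignant])
y = np.array([0] * 40 + [1] * 40)
pred = knn_predict(X, y, np.array([4.0, 4.0]), k=11)  # lands in cluster 1
```

In practice one would standardize the features first, since Euclidean distance is scale-sensitive, and tune k on held-out data as the project did.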
Sai Uday Kumar Appalla, Predicting the Health of Babies Using Cardiotocograms, July 2018, (Yan Yu, Yichen Qin)
The aim of this research is to predict the health of a baby based on different diagnostic features observed in cardiotocograms. The data was collected from the UCI Machine Learning repository. Different machine learning algorithms were built to understand which factors have a significant influence on the baby's health and to predict the health state of the baby with the best possible accuracy. Initially, basic classifiers like k-nearest neighbors and decision trees were used to make predictions; these algorithms have high interpretability and help us understand the significance of different variables. Later in the analysis, complex classifiers like random forests, gradient boosting, and neural networks were used to boost the accuracy of the predictions. Finally, after comparing all the model metrics, the gradient boosted tree was selected as the best model, as it has better metrics than any of the other models.
Piyush Verma, Building a Music Recommendation System Using Information Retrieval Technique, July 2018, (Peng Wang, Yichen Qin)
Streaming music has become one of the top sources of entertainment for millennials, and because of globalization, people all around the world can now access different kinds of music. As of 2016, the global recorded music industry is worth $15.7 billion and is growing at 6%, with digital music driving 50% of those sales. There are 112 million paid subscribers to streaming services, and roughly 250 million users in total if we include those who don't pay. It is therefore very important for streaming service providers like YouTube, Spotify, and Pandora to continuously improve their service. Recommendation systems are an information retrieval technique for predicting the rating or preference a user would give to an item. In this project I explore several methods to predict users' ratings for different artists using GroupLens' Last.FM dataset.
Poorvi Deshpande, Sales Excellence, July 2018, (Yichen Qin, Ed Winkofsky)
One of the services a bank offers is loans. The process by which the bank decides whether an applicant should receive a loan is called underwriting. An effective underwriting and loan approval process is a key predecessor to favorable portfolio quality, and a main task of the function is to avoid as many undue risks as possible. When undertaken with well-defined principles, this process allows the lender to ensure good credit quality. This is a problem faced by the digital arm of a bank whose primary aim is to increase customer acquisition through digital channels. This division sources leads through channels like search, display, email campaigns, and affiliate partners. As expected, it sees differential conversion depending on the sources and the quality of these leads. Consequently, it wants to identify the lead segments with a higher conversion ratio (lead to buying a product) so that it can specifically target these potential customers through additional channels and re-marketing. The division has provided a partial data set for salaried customers from the last 3 months, along with basic details about customers. We need to identify the segment of customers with a high probability of conversion in the next 30 days.
Jatin Saini, An Analysis of Identifying Diseased Trees in Quickbird Imagery, July 2018, (Yan Yu, Edward P Winkofsky)
Machine learning algorithms are widely used to identify patterns in data. One of their applications is identifying diseased trees in Quickbird imagery. In this project, we apply logistic regression, LASSO, and classification tree (CART) models to the imagery data to identify significant variables. We designed the study with separate training and testing datasets and compared the area under the curve (AUC). Logistic regression gave an AUC of 0.97 for both the training and testing datasets, while CART gave an AUC of 0.92 on testing data and 0.91 on training data. After examining the accuracy of the different algorithms, we conclude that logistic regression gave more accurate results on the training and testing data.
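The AUC used to compare the models above can be computed directly as the probability that a randomly chosen positive case outscores a randomly chosen negative one (the Mann-Whitney interpretation, with ties counting half). A small sketch with made-up scores, not the project's data:

```python
def auc(scores_pos, scores_neg):
    """AUC as the fraction of (positive, negative) pairs in which
    the positive case gets the higher score; ties count 0.5."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            wins += (p > n) + 0.5 * (p == n)
    return wins / (len(scores_pos) * len(scores_neg))

# Toy predicted probabilities for diseased (positive) vs healthy trees
a = auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])  # 8 of 9 pairs ordered correctly
```

The O(n*m) double loop is fine for illustration; production implementations sort the scores once and compute the same quantity from ranks.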
Raghu Kannuri, Recommender System Using Collaborative Filtering and Matrix Factorization, July 2018, (Peng Wang, Yichen Qin)
This project aims to develop a recommender system using various machine learning techniques. A recommender system helps develop a customized list of recommendations for every user, acting as a virtual salesman. It predicts missing user-product ratings by drawing information from the user's past product ratings or buying history and from ratings by similar users. Content-based filtering, knowledge-based filtering, collaborative filtering, and hybrid filtering are the widely used recommender system techniques. This project deals with item-based (IBCF) and user-based (UBCF) collaborative filtering with different similarity metrics, and matrix factorization with Alternating Least Squares. Matrix factorization outperformed UBCF and IBCF on all evaluation metrics, including precision, recall, and AUC.
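Matrix factorization with Alternating Least Squares, the winning technique above, alternates closed-form least-squares solves for the user and item factor matrices. A simplified sketch that treats a toy rating matrix as fully observed (real ALS fits only the observed entries and typically weights them):

```python
import numpy as np

def als(R, rank=2, reg=0.1, iters=20, seed=0):
    """Alternating Least Squares on a fully observed rating matrix.
    Each step fixes one factor matrix and solves the regularized
    normal equations for the other."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, rank))
    V = rng.normal(scale=0.1, size=(n_items, rank))
    I = reg * np.eye(rank)
    for _ in range(iters):
        U = R @ V @ np.linalg.inv(V.T @ V + I)    # fix V, solve for U
        V = R.T @ U @ np.linalg.inv(U.T @ U + I)  # fix U, solve for V
    return U, V

# Toy ratings: users 0 and 1 share taste, user 2 differs
R = np.array([[5., 4., 1.], [4., 5., 1.], [1., 1., 5.]])
U, V = als(R)
R_hat = U @ V.T  # low-rank reconstruction used for recommendations
```

The reconstruction `R_hat` fills the same role as predicted ratings: items with the highest predicted values for a user become that user's recommendations.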
Madhava Chandra, Analysis on Loan Default Prediction, July 2018, (Yichen Qin, Peng Wang)
The purpose of this study was to determine what constitutes risky lending. Each line item in the data corresponded to a loan and had various features relating to the loan amount, the borrower's employment information, payments made, and the classification of the loan as charged off or active, with any delays in payments noted. An exploratory data analysis was performed to look for outliers and examine the individual distributions of the variables. The interactions between these variables were then studied to weed out highly correlated variables. Owing to the low representation of defaults in the sample, this was treated as an imbalanced-class problem, for which traditional random sampling would not yield optimal results. To overcome this, stratified sampling, random under- and over-sampling, SMOTE, and ADASYN methodologies were explored.
All of the above sampling methodologies were trained and tested with logistic regression to pick which sampling procedure to follow, and SMOTE gave the best results. To best classify which loans would likely default, various statistical learning techniques were employed: regression, tree-based methods (standalone, boosting, and bagging ensembles), support vector machines, and neural networks. Among these classifiers, gradient boosting had the best performance, although with further fine-tuning, deep neural networks could possibly classify better.
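SMOTE, the sampling method that won the comparison above, generates synthetic minority examples by interpolating between a minority point and one of its nearest minority neighbors. A minimal sketch on toy data (in practice one would use an established implementation such as imbalanced-learn's):

```python
import numpy as np

def smote(X_minority, n_new, k=5, seed=0):
    """Minimal SMOTE sketch: each synthetic point sits at a random
    position on the segment between a minority sample and one of
    its k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    n = len(X_minority)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)
        d = np.linalg.norm(X_minority - X_minority[i], axis=1)
        neighbors = np.argsort(d)[1:k + 1]  # skip the point itself
        j = rng.choice(neighbors)
        gap = rng.random()                  # interpolation fraction in [0, 1)
        synthetic.append(X_minority[i] + gap * (X_minority[j] - X_minority[i]))
    return np.array(synthetic)

rng = np.random.default_rng(1)
minority = rng.normal(0, 1, size=(10, 2))   # toy minority class (defaults)
new_points = smote(minority, n_new=20)
```

Because every synthetic point is a convex combination of two minority samples, the oversampled class stays inside the region the minority data already occupies, unlike naive duplication, which only repeats existing points.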
Samrudh Keshava Kumar, Analytical Analysis of Marketing Campaign Using Data Mining Techniques, July 2018, (Dungang Liu, Peng Wang)
Marketing products is an expensive investment for a company, and spending money to market to customers who might not be interested in the product is inefficient. This project aims to determine and understand the various factors that influence a customer’s response to a marketing campaign, helping the company design targeted campaigns to cross-sell products. Predictive models such as logistic regression, random forests and gradient boosted trees were built to predict each customer's response to the campaign based on customer characteristics. The factors affecting the response were determined to be employment status, income, type of renewal offer, months since policy inception and months since last claim. The models were validated on a test set, and the best accuracy was achieved by the random forest model, with an AUC of 0.995 and a misclassification rate of 1.3%.
Rohit Bhaya, Lending Club Data – Assessing Factors to Determine Loan Defaults, July 2018, (Yan Yu, Peng Wang)
Lending Club is an online peer-to-peer platform that connects loan customers with potential investors. Applicants can borrow between $1,000 and $40,000, and investors choose the loan products they want to invest in for profit. The loan data, available on Kaggle, contains applicant information for loans originated between 2007 and 2015. Using the available information for applicants who had already paid off their loans, various machine learning algorithms were built to estimate a customer's propensity to default. The stepwise AIC approach for logistic regression had the best performance among all models tested. This final model was then used to build an applicant default scorecard with a range of 300 to 900, where a higher score indicates a higher propensity to default. The scorecard performed well on both the training and test data and was then used to score the active customer base. The score distribution showed that most active loan customers fall into the low-risk category; for higher-scoring applicants, management can prepare preventive strategies to avoid future losses.
Nitin Sharma, A Study of Factors Influencing Smoking Behavior, July 2018. (Dungang Liu, Liwei Chen)
In this study, statistical analysis is performed to understand the factors that influence smoking habits. The data come from a survey conducted in Slovakia on participants aged 15-30 and are publicly available on the Kaggle website. The survey captures each participant's smoking habit, the variable of interest, which is categorical with four levels: never smoked, tried smoking, former smoker and current smoker. The goal of this study is to find out which factors influence smoking habit. Machine learning techniques (logistic regression and ensemble methods) are used to predict whether an individual is a current or past smoker or someone who has never smoked. The best model selected in this study achieves an overall accuracy of 83% on the test sample. The results apply only to the 15-30-year-old Slovak population and cannot be generalized to a different population.
Yiyin Li, Foreclosure in Cincinnati Neighborhoods, July 2018, (Yan Yu, Charles Sox)
The main purpose of this paper is to analyze which factors affect foreclosure in Cincinnati neighborhoods and to build a model to predict whether a property will be foreclosed on by a bank. The dataset analyzed lists all real estate transactions in Cincinnati from 1998 to 2009. After a brief description of the project background and data, exploratory data analysis is provided, including a basic analysis of each individual variable, correlation statistics between variables and basic information on 47 Cincinnati neighborhoods. Then, 10 different types of models and a model comparison are presented in the modeling section to find the best model for predicting foreclosure. In conclusion, a property's sale price, building and land value, year of sale, that year's mortgage rate, and median family income are the most influential variables, and the gradient boosting model is the best model for predicting foreclosure.
Adrián Vallés Iñarrea, Predicting Customer Satisfaction and Dealing with Imbalanced Data, July 2018, (Dungang Liu, Shaobo Li)
From frontline support teams to C-suites, customer satisfaction is a key measure of success. Unhappy customers don't stick around. In this paper, we compare logistic regression, classification tree, random forest and extreme gradient boosting models to predict whether a customer is satisfied or dissatisfied with their banking experience. Doing so would allow banks to take proactive steps to improve a customer's happiness before it's too late. The dataset was published on Kaggle by Santander Bank, a Spanish banking group with operations across Europe, South America, North America and Asia. It is composed of 76,020 observations and 371 variables that have been semi-anonymized to protect clients' information. 96.05% of the customers are satisfied and only 3.95% are dissatisfied, making this a highly imbalanced classification problem. Since most commonly used classification algorithms do not work well on imbalanced problems, we also compare two ways to deal with imbalanced classification: one based on cost-sensitive learning, and the other based on a sampling technique. Both methods are shown to improve prediction accuracy on the minority class and perform favorably compared to the unmodified algorithms.
Guansheng Liu, Development of Statistical Models for Pneumocystis Infection, July 2018, (Peng Wang, Liwei Chen)
The yeast-like fungi Pneumocystis reside in lung alveoli and can cause a lethal infection known as Pneumocystis pneumonia (PCP) in hosts with impaired immune systems. Current therapies for PCP suffer from significant treatment failures and a multitude of serious side effects. Novel therapeutic approaches, such as newly developed drugs, are needed to treat this potentially lethal opportunistic infection. In this study, I built a simplified two-stage model for Pneumocystis growth and determined how different parameters control the levels of the Trophic and Cyst forms of the organism, employing machine learning methods including multivariate linear regression, partial least squares, regression trees, random forests and gradient boosting machines. It was discovered that the parameters K_sTro (replication rate of the Trophic form), K_dTro (degradation rate of the Trophic form) and K_TC (transformation rate from the Trophic form to the Cyst form) play predominant roles in controlling the growth of Pneumocystis. This study is of clinical significance, as the extracted statistical trends in the dynamics of Pneumocystis can guide the design of novel and effective treatments for controlling its growth and for PCP therapy.
Vignesh Arasu, Major League Baseball Analytics: What Contributes Most to Winning, July 2018, (Yan Yu, Matthew Risley)
Big data and analytics have been a growing force in Major League Baseball. The Moneyball principle emphasizes two statistics, on-base percentage and slugging (total bases divided by at-bats), as core measures for building winning franchises. This report analyzes data from all teams from 1962-2012 using multiple linear regression, logistic regression, regression and classification trees, generalized additive models, linear discriminant analysis, and k-means clustering to build the best models for a team's number of wins (linear regression response variable) and whether a team makes the postseason (logistic regression response variable). The results show that runs scored, runs given up, on-base percentage, and slugging have strong effects on team success, both in wins and in making the playoffs. The best in-sample supervised models all perform well, with AUC values above 0.90, while the unsupervised k-means clustering showed that the data can be effectively grouped into 3 clusters. Together, this mix of supervised and unsupervised techniques shows that a variety of statistical methods can be used to analyze baseball data.
Preethi Jayaram Jayaraman, Prediction of Kickstarter Project Success, July 2018, (Yichen Qin, Bradley Boehmke)
Kickstarter is an American public-benefit corporation that uses crowdfunding to bring creative projects to life. As a crowdfunding platform, Kickstarter promotes projects across multiple categories such as film, music, comics, journalism, games and technology, among others. In this project, the Kickstarter Projects database was analyzed and explored in detail, and the patterns identified during data exploration were used as inputs for predictive modeling. Classification models, namely logistic regression and classification trees, were built to classify the Kickstarter projects. Performance was compared on a validation (hold-out) set of 20% of the data using accuracy, sensitivity and AUC, and ROC curves were plotted for both models. Logistic regression was chosen as the best model, with an accuracy of 0.9996 and AUC of 0.9999 on the validation set, and it classified the Kickstarter projects with an accuracy of 99.96% on the test data. The analysis was then extended to include projects in the states 'Suspended', 'Live', 'Undefined' and 'Canceled', recoded as 'Failed'; rebuilding both models again yielded logistic regression as the best, with a classification accuracy of 0.9656 on the test data.
Rohit Pradeep Jain, Image Classification: Identifying the object in a picture, July 2018, (Yichen Qin, Liwei Chen)
The objective of this project was to classify images of fashion objects (like T-shirts, sneakers, etc.) based on the pixel information contained in the pictures. Each image was classified into one of 10 available classes of fashion objects using different modeling techniques, and a final model was chosen based on accuracy on the cross-validation dataset. The final model was then tested on the untouched testing dataset to validate out-of-sample accuracy. The project serves as a benchmark for more advanced studies in image classification and supports applications such as stock photography.
Priyanka Singh, Mobile Price Prediction, July 2018, (Peng Wang, Liwei Chen)
The aim of this study is to classify the prices of mobile devices into ranges from 0 to 3, with higher numbers denoting higher prices. The dataset has 2,000 observations and 21 variables. The response variable, price range, is to be predicted with the highest accuracy possible. The analysis starts with exploratory data analysis, followed by the construction of machine learning models. The exploratory data analysis revealed that the categorical variables were not significant in determining device prices, while the numeric variables battery power and RAM had a considerable impact. Classification trees, random forests, support vector machines and gradient boosting machines were used to predict phone prices. The support vector machine was chosen as the final model, as it gave the lowest misclassification rate of 0.08 and the highest area under the curve (AUC) of 0.97. The features used in the model were RAM, battery power, pixel width, pixel height, weight, internal memory, mobile depth, clock speed and touchscreen.
Gautami Hegde, HR Analytics: Predicting Employee Attrition, July 2018, (Yan Yu, Yichen Qin)
Employee attrition is a major problem for an organization. One of the goals of an HR analytics department is to identify the employees who are likely to leave the organization in the future and take action to retain them before they leave. Thus, the aim of this project is to understand the key factors that influence attrition and to predict an employee's attrition based on these factors. The dataset used here is the HR Analytics dataset by IBM Watson Analytics, a sample dataset created by IBM data scientists. The exploratory data analysis includes feature selection based on distributions, correlation and data visualizations. After eliminating some features, logistic regression, generalized additive model, decision tree and random forest techniques were used to build models. Model performance was evaluated on prediction accuracy and AUC. Of the different classification techniques, the logistic regression and generalized additive models were found to be the best at predicting employee attrition.
Venkata Sai Lakshmi Srikanth Popuri, Prediction of Client Subscription from Bank Marketing Data, July 2018, (Peng Wang, Yichen Qin)
Classification is one of the most important and interesting problems in today’s world, with applications ranging from email spam tagging to fraud detection to predictions in the healthcare industry. The area of interest here is bank marketing at a Portuguese banking institution, where marketing teams run campaigns to persuade clients to subscribe to a term deposit. The purpose of this paper is to apply various data and statistical techniques to analyze and model the bank marketing data and predict whether a client will subscribe to a term deposit. The analysis addresses this classification problem by performing exploratory data analysis (EDA); building models including logistic regression, step-AIC and step-BIC models, classification trees, linear discriminant analysis (LDA), support vector machines (SVM), random forests (RF) and gradient boosting (GB); and validating these models using the misclassification rate and the area under the ROC curve. SVM performed better than the other models on this dataset, with a low out-of-sample misclassification rate and good AUC values.
Ali Aziz, Financial Coding for School Budgets, July 2018, (Yan Yu, Peng Wang)
School budget items must be labelled according to their descriptions in a difficult task known as financial coding. A predictive model that outputs the probability of each label can help accomplish this work. In this project, the effectiveness of several data-processing techniques and machine learning algorithms was studied. After applying data imputation and natural language processing techniques, a one-vs-rest classifier consisting of L1-regularized logistic regression models performed best of all classifiers investigated, achieving an out-of-sample log loss of 0.5739, an improvement of approximately 17% over the baseline predictive model.
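Log loss, the evaluation metric quoted above, is the mean negative log of the probability the model assigned to the true label, so it rewards well-calibrated probabilities rather than just correct labels. A minimal sketch of the computation (the example probabilities are invented for illustration):

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Multiclass log loss: mean negative log-probability of the true
    label. y_true holds class indices; y_prob holds one probability
    vector per observation."""
    total = 0.0
    for label, probs in zip(y_true, y_prob):
        p = min(max(probs[label], eps), 1 - eps)  # clip to avoid log(0)
        total -= math.log(p)
    return total / len(y_true)

# Two observations, three hypothetical budget labels:
score = log_loss([0, 2], [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
```

A confident wrong prediction is punished heavily, which is why probability-outputting classifiers such as regularized logistic regression suit this metric well.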
Shashank Badre, A Study on Online Retail Data Set to Understand the Characteristics of Customer Segments that Are Associated with the Business, July 2018, (Peng Wang, Yichen Qin)
Small online retailers that are new entrants to the market are keen to use data mining and business intelligence techniques to better understand their existing and potential customer bases, but such businesses often lack the expertise and technical know-how to perform the requisite analysis. This study helps such retailers understand the approach and the different ways data can be used to gain insight into their customers. It examines an online retail data set to understand the characteristics of different customer segments and, based on those characteristics, explains which segments contribute high monetary value to the business and which contribute low monetary value.
Ravish Kalra, Phishing Attack Prediction Engine, July 2018, (Dungang Liu, Edward Winkofsky)
A phishing attack tricks users into entering their credentials on a fake website or into opening malware on their system, resulting in identity theft or financial loss. The aim of this project is to build a prediction engine through which a browser plugin can accurately predict whether a given website is legitimate or fraudulent after capturing certain features from the page. The scope of the project is limited to websites and does not involve other kinds of electronic media. The data set was obtained from the UCI Machine Learning Repository. After evaluating a website on 30 documented features, the model predicts a binary response of 0 (legitimate) or 1 (phishing). Methods of analysis include (but are not limited to) visualizing the spread of different features, identifying correlations between the covariates and the dependent variable, and implementing classification algorithms such as logistic regression, decision trees, SVM and random forests. Because asymmetric weights for false positives and false negatives were unavailable, evaluation metrics such as F-score and log loss are compared alongside out-of-sample AUC. The random forest model considerably outperforms the other modelling strategies. Although a black-box classifier, the random forest works well as a back-end prediction engine that insulates decision making from the users.
Akul Mahajan, TMDB - "May the Force be with you", July 2018, (Yan Yu, Yichen Qin)
Today, almost every important business decision is guided by statistics; one of the most popular applications is statistical modeling for machine learning and prediction, used to forecast outcomes, align them with industry goals, and formulate and improve strategy. TMDB is one of the most popular datasets on Kaggle, housing data on 5,000 movies across different genres, geographies and languages. Predictive modeling can provide insight into a movie's expected performance before release, informing marketing strategies and campaigns to further improve that performance. This paper employs advanced predictive algorithms, including linear regression, CART, random forests, generalized additive models and gradient boosting, tunes these models for optimum performance, and evaluates their potential using appropriate evaluation metrics.
Kevin McManus, Analysis of High School Alumni Giving, July 2018, (Yan Yu, Bradley Boehmke)
Archbishop Moeller High School has an ambitious plan to increase its participation rate (giving + activities), up from 4% a few years ago to 13% last year with a goal of 15%. Donations to the 2017 Unrestricted Fund were made by 9% of the 11,524 alumni base and reflected an increase of 258% vs 2013. The analyses focused on a regression predictive model for donation amount and a classification model to predict which alumni will donate. Both suggest that prior alumni giving, and connections to the school via other affiliations were strong predictors, among several others. The school should focus on creating opportunities for involvement by alumni as well as maintain strong connections to its base who give consistently. Overall, higher wealth levels were not a significant predictor for giving to the Unrestricted Fund. The analyses also performed unsupervised clustering which suggested there were distinct groups of those strongly connected with the school through other affiliations and those who were not. The former group tended to live within 100 miles of Cincinnati and give at a higher rate than the other groups. Even the clustering of giving alumni showed a small consistent group of givers and a second group of occasional donors. The former group also had a higher rate of other connections to the school compared to those who gave only occasionally.
Ritisha Andhrutkar, Sentiment Analysis of Amazon Unlocked Phone Reviews, July 2018, (Yichen Qin, Peng Wang)
Online customer reviews hold a powerful sway over consumer behavior and, therefore, over a brand's performance in today's Internet age. According to one survey, 88% of consumers trust online reviews as much as personal recommendations when purchasing an item on an e-commerce website. Positive reviews boost an organization's confidence, while negative reviews suggest areas for improvement. More reviews for a product also tend to yield a higher conversion rate for that product. This report analyzes the trend of consumer behavior toward unlocked mobile phones sold on Amazon. The dataset was scraped from the e-commerce website and consists of listings of phones along with features such as brand name, price, rating, reviews and review votes. Text mining techniques were applied to identify the sentiment of each customer review, which could help Amazon and, in turn, the manufacturers improve their current products and sustain their brand names.
Swapnil Patil, Applications of Unsupervised Machine Learning Algorithms on Retail Data, July 2018, (Peng Wang, Yichen Qin)
Data science and analytics are widely used in the retail industry. With the advent of big data tools and greater computing power, sophisticated algorithms can crunch huge volumes of transactional data to extract meaningful insights, and companies such as Kroger invest heavily in transforming the more-than-century-old retail industry through analytics. This project applies unsupervised learning algorithms to transactional data to formulate strategies for improving product sales. It uses online retail store data from the UCI Machine Learning Repository, covering a UK-based registered online retail store's transactions between 01/12/2010 and 09/12/2011; the store mostly sells gift items to wholesalers around the globe. The objective is to apply statistical techniques such as clustering, association rules and collaborative filtering to develop business strategies that may increase product sales.
Tathagat Kumar, Market Basket Analysis and Association Rules for Instacart Orders, June 2018, (Yichen Qin, Yan Yu)
For any retailer it is extremely important to identify customer habits: why customers make certain purchases, and insight into merchandise, the movement of goods, peak sales times and the sets of products purchased together. These insights help retailers structure store layouts, design promotions and coupons, and combine it all with a customer loyalty card, which makes the above strategies even more useful. The first public anonymized dataset from Instacart was selected for this paper, with the goal of identifying fast-moving items, frequent basket sizes, peak order times, frequently reordered items and high-moving products by aisle. The paper also examines loyal customers' habit patterns and predicts their future purchases with reasonable accuracy. Market basket analysis with association rules is used to discover the strongest rules of product association based on different association measures, e.g. support, confidence and lift. The analysis uncovers strong rules for both high-frequency and low-frequency items, and an example using top-selling products demonstrates which products tend to precede and follow a given purchase, using left-hand and right-hand association rules.
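The three association measures named above have simple definitions: support is the fraction of baskets containing both sides of a rule, confidence is support divided by the support of the left-hand side, and lift is confidence divided by the support of the right-hand side. A minimal sketch with invented grocery baskets (not the Instacart data):

```python
# Toy transaction data; each basket is a set of items (illustrative only).
transactions = [
    {"bananas", "milk", "bread"},
    {"bananas", "milk"},
    {"milk", "bread"},
    {"bananas", "bread"},
    {"bananas", "milk", "eggs"},
]

def support(itemset):
    """Fraction of baskets containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def rule_metrics(lhs, rhs):
    """Support, confidence and lift for the rule lhs -> rhs."""
    supp = support(lhs | rhs)
    conf = supp / support(lhs)
    lift = conf / support(rhs)
    return supp, conf, lift

supp, conf, lift = rule_metrics({"bananas"}, {"milk"})
```

A lift above 1 means the two sides co-occur more often than chance; here lift is below 1, so bananas and milk in this toy data are bought together slightly less often than independence would predict. Mining software such as the apriori algorithm simply enumerates frequent itemsets before computing these metrics.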
Sayali Dasharath Wavhal, Employee Attrition Prediction on Class-Imbalanced Data using Cost-Sensitive Classification, April 2018, (Yichen Qin, Dungang Liu)
Human resources are the most valuable asset of an organization, and every organization aims to retain its valuable workforce. The main goal of an HR analytics department is to identify the employees who are likely to leave in the future and take action to retain them before they do. This paper aims to identify the factors behind employee attrition and to build a classifier to predict it. The analysis addresses the class-imbalance problem by exploring the performance of various machine learning models: logistic regression, classification trees using recursive partitioning, generalized additive models and gradient boosting machines. With only 15% positives, this is a highly imbalanced problem, and accuracy alone is not a suitable indicator of model performance. To avoid biasing the classifier toward the majority class, cost-sensitive classification was adopted, in which false negatives carry a higher penalty than false positives. Model performance was evaluated on sensitivity (recall), specificity, precision, misclassification cost and area under the ROC curve. The analysis suggests that although the recursive partitioning and ensemble decision-tree techniques have good predictive power on the minority class, the logistic regression and generalized additive models deliver more stable prediction performance.
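One common way to make a probability-outputting classifier cost-sensitive, as described above, is to move the decision cutoff so that total expected cost is minimized when a missed leaver (false negative) costs more than a false alarm (false positive). The sketch below illustrates that cutoff search with invented probabilities and a hypothetical 5:1 cost ratio, not the paper's actual data or costs.

```python
def best_cutoff(probs, labels, cost_fn=5.0, cost_fp=1.0):
    """Pick the probability cutoff minimizing total misclassification
    cost, with false negatives penalized more than false positives.
    Illustrative sketch only."""
    candidates = sorted(set(probs)) + [1.01]  # 1.01 = predict all negative
    best = None
    for c in candidates:
        cost = sum(cost_fn if (p < c and y == 1) else      # missed leaver
                   cost_fp if (p >= c and y == 0) else     # false alarm
                   0.0
                   for p, y in zip(probs, labels))
        if best is None or cost < best[1]:
            best = (c, cost)
    return best

# Hypothetical predicted attrition probabilities and true labels:
probs  = [0.9, 0.4, 0.35, 0.8, 0.2, 0.6]
labels = [1,   1,   0,    1,   0,   0  ]
cutoff, cost = best_cutoff(probs, labels)
```

With symmetric costs the optimal cutoff would drift toward 0.5; raising the false-negative cost pushes it lower, flagging more employees as at-risk.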
Yong Han, Whose Votes Changed the Presidential Elections?, April 2018, (Dungang Liu, Liwei Chen)
The unique aspect of the YouGov / CCAP data was that it contained the information of 2008 to 2016 elections from the same group of 8000 voters. This might provide information on voting patterns between elections.
The goals of this study were to find: Was any predictor significant to the 2012 and 2016 presidential vote? Was it consistent between elections? Was any predictor significant to the change-vote between two elections? Was it consistent? Based on exploratory data analysis, 70% of voters never changed their votes, and 20% of voters changed at least once in last three elections. Was any predictor significantly associated with this behavior?
Using the VGLM method, this study found the following. In single elections, common predictors such as gender, child status, education, age, race and marital status were significant, though different elections had different significant predictors. For vote changes between two elections, the significant predictors also differed: between the 2012 and 2016 elections, the model suggested that education, income and race were significant, while between 2008 and 2012, child status and employment status were significant. With the 2016 election data, the never-change-vote model found that income, age, ideology, news consumption and marital status were significant to never-change-vote behavior. Individual election models could predict ~60% of votes in the testing samples; utilizing a previous vote as a predictor, models could predict ~89%. The never-change-vote model predicted the 70% of never-change voters well but missed almost all of the 20% of change voters.
Yanhui Chen, Binning on Continuous Variables and Comparison of Different Credit Scoring Techniques, April 2018, (Peng Wang, Yichen Qin)
Binning is a widely used method for grouping a continuous variable into a categorical variable. In this project, I binned the continuous variables amount, duration and age in the German credit data and performed a comparative analysis of five fitted models: a logistic model using the binned variables, a logistic model without binning, a logistic additive model without binning, random forest, and gradient boosting. I found that the logistic model with binning was the weakest of the five. I also showed that variable importance varied across models: the variable checkingstatus was selected as important in most of the models, the binned variables duration and amount were important in the logistic model with binning, and random forest was the only model that selected the variable history as important.
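A common way to bin a continuous credit variable such as amount is equal-frequency (quantile) binning, so each category holds roughly the same number of applicants. The snippet below is a simplified illustration of that idea (the loan amounts are invented; a real scorecard workflow would typically also check monotonicity of default rates across bins).

```python
def quantile_bins(values, n_bins=4):
    """Bin a continuous variable into roughly equal-frequency groups.
    Returns the bin index (0..n_bins-1) for each value. Simplified
    sketch: cut points are taken directly from the sorted values."""
    cuts = sorted(values)
    edges = [cuts[int(len(cuts) * i / n_bins)] for i in range(1, n_bins)]
    return [sum(v >= e for e in edges) for v in values]

# Hypothetical loan amounts:
amounts = [1000, 2500, 700, 4300, 1200, 9800, 560, 3100]
bins = quantile_bins(amounts, n_bins=4)
```

The binned variable can then enter a logistic model as a factor, which captures non-linear effects at the cost of discarding within-bin variation, one reason the binned logistic model can underperform the additive and tree-based models.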
Jamie H. Wilson, Fine Tuning Neural Networks in R, April 2018, (Yan Yu, Edward Winkofsky)
As artificial neural networks grow in popularity, it is important to understand how they work and the layers of options that go into building one. The fundamental components of a neural network are the activation function, the error measurement and the method of backpropagation. These make neural networks good at finding complex nonlinear relationships between predictor and response variables, as well as interactions between predictors. However, neural networks are difficult to explain, can be computationally expensive and tend to overfit the data. There are two primary R packages for neural networks: nnet and neuralnet. The nnet package has fewer tuning options but can handle both unstandardized and standardized data; the neuralnet package has a myriad of options but only handles standardized data. When building a predictive model on the Boston Housing data, both packages are capable of producing effective models. Tuning the models is important for valid and robust results, and given the number of tuning parameters in neuralnet, its models perform better than those built with nnet.
Kenton Asbrock, The Price to Process: A Study of Recent Trends in Consumer-Based Processing Power and Pricing, April 2018, (Uday Rao, Jordan Crabbe)
This analysis investigates the effects of the deceleration of the observational Moore’s Law on consumer-based central processing units. Moore’s Law states that the number of transistors in a densely integrated circuit approximately doubles every two years. The study involved a dataset containing information about 2,241 processors released by Intel between 2001 and 2017, the approximate time frame associated with the decline of Moore’s Law. Data wrangling and pre-processing were performed in R to clean the data and convert it to a state ready for analysis. The data were then aggregated by platform to study the evolution of processing across desktops, servers, embedded devices, and mobile devices. Formal time series procedures were applied to the entire dataset to study how processing speed and price have changed recently and how future forecasts are expected to behave. It was determined that while processing speeds are in a period of stagnation, the price paid for computational power has been decreasing and is expected to keep decreasing. While the decline of Moore’s Law may affect a small fraction of the market through speed stagnation, the overall price decrease of processing performance will benefit the average consumer.
Hongyan Ma, A Return Analysis for S&P 500, April 2018, (Yan Yu, Liwei Chen)
Time series analysis is commonly used to analyze and forecast economic data. It helps to identify patterns, to understand and model the data, and to predict short-term trends. The primary purpose of this paper is to study Moving Window analysis and GARCH models built on the monthly returns of the S&P 500 over the 50 years from January 1968 to December 2017.
In this paper, we first examined the raw data to check its patterns and distributions, then analyzed the monthly returns over different time windows (10-, 20-, 30- and 40-year) using Moving Window analysis. We found that over the long horizon, the S&P 500 has produced significant returns for investors who stayed invested; however, over a given 10-year period, the return could even be negative. Finally, we fitted several forms of GARCH models with normal as well as Student-t distributions and found the GARCH(1,1) Student-t model to be the best in terms of Akaike's Information Criterion and log-likelihood.
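The GARCH(1,1) model above lets today's return variance depend on yesterday's squared return and yesterday's variance: sigma2_t = omega + alpha * r_{t-1}^2 + beta * sigma2_{t-1}. The pure-Python simulation below illustrates that recursion and the volatility clustering it produces; the parameter values are invented for illustration, and normal innovations are used for simplicity even though the paper's preferred model used Student-t innovations.

```python
import math
import random

def simulate_garch11(n, omega=1e-5, alpha=0.1, beta=0.85, seed=42):
    """Simulate n returns from a GARCH(1,1) process:
    sigma2_t = omega + alpha * r_{t-1}^2 + beta * sigma2_{t-1}.
    Parameters here are hypothetical, chosen so alpha + beta < 1
    (stationarity). Returns (returns, conditional variances)."""
    rng = random.Random(seed)
    sigma2 = omega / (1 - alpha - beta)   # start at unconditional variance
    returns, variances = [], []
    for _ in range(n):
        r = math.sqrt(sigma2) * rng.gauss(0, 1)   # normal innovation
        returns.append(r)
        variances.append(sigma2)
        sigma2 = omega + alpha * r ** 2 + beta * sigma2  # variance recursion
    return returns, variances

rets, vols = simulate_garch11(1000)
```

Fitting rather than simulating reverses the direction: the same recursion defines the likelihood, which is maximized over (omega, alpha, beta), and AIC or log-likelihood then compares the normal and Student-t variants as in the paper.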
Justin Jodrey, Predictive Models for United States County Poverty Rates and Presidential Candidate Winners, April 2018, (Yan Yu, Bradley Boehmke)
The U.S. Census Bureau administers the American Community Survey (ACS), an annual survey that collects data on various demographic factors. Using a Kaggle dataset that aggregates data at the United States county level and joining other ACS tables to it from the U.S. FactFinder website, this paper analyzes two types of predictive models: regression models to predict a county’s poverty rate and classification models to predict a county’s 2016 general election presidential candidate winner. In both the regression and classification settings, a generalized additive model best predicted county poverty rates and county presidential winners.
Trent Thompson, Cincinnati Reds – Concessions and Merchandise Analysis, April 2018, (Yan Yu, Chris Calo)
Concession and Merchandise sales account for a substantial percentage of revenue for the Cincinnati Reds. Thoroughly analyzing the data captured from Concession and Merchandise sales can help the Reds with pricing, inventory management, planning, and product bundling. The scope of this Concession and Merchandise analysis includes general exploratory data analysis, identifying key trends in sales, and analyzing common order patterns. One major finding from this analysis was that 95% confidence intervals of Concession and Merchandise sales can improve the efficiency of inventory management. Another finding is that fans generally buy their main food items (hot dog, burger, pizza) before the game and then beverages, desserts, and snacks during the game. Finally, strong order associations exist between koozies and light beer, and between bratwursts and beverages and peanuts. I recommend displaying the koozies above the refrigerator with light beers and bundling bratwursts in a similar manner to the current hot dog bundle, with the hope of driving a lift in sales.
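A normal-approximation 95% confidence interval of the kind used above for inventory planning can be computed as follows; the per-game sales figures are hypothetical, not actual Reds data:

```python
import math

def ci_95(sales):
    """Normal-approximation 95% confidence interval for mean sales."""
    n = len(sales)
    mean = sum(sales) / n
    # sample standard deviation (n - 1 in the denominator)
    sd = math.sqrt(sum((x - mean) ** 2 for x in sales) / (n - 1))
    half = 1.96 * sd / math.sqrt(n)
    return mean - half, mean + half

# hypothetical per-game hot dog sales
lo, hi = ci_95([5200, 4800, 5100, 4950, 5300, 5050])
```

Stocking toward the upper bound guards against stockouts, while the lower bound limits over-ordering of perishables.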
Xi Chen, Decomposing Residential Monthly Electric Utility into Cooling Energy Use by Different Machine Learning Techniques, April 2018, (Peng Wang, Yan Yu)
Today the residential sector consumes about 38% of the energy produced, of which nearly half is consumed by HVAC systems. One of the main energy-related problems is that most households do not operate in an energy-efficient manner, such as utilizing natural ventilation or adjusting the thermostat according to weather conditions, leading to higher usage than necessary. It has been reported that energy-saving behaviors may lead to a 25% energy-use reduction simply by giving consumers a more detailed electricity bill with the same building settings. Therefore, the scope of this project is to construct a monthly HVAC energy-use predictive model with simple and accessible predictors for homes. The dataset used in this project includes weather, metadata, and electricity-usage-hours data downloaded from the Pecan Street Dataport. The final dataset contains 3,698 observations and 11 variables. Multiple linear regression, regression tree, random forest, and gradient boosting are the four machine learning techniques applied to predict monthly HVAC cooling use. Root Mean Squared Error (RMSE) and adjusted R2 are the two criteria adopted to evaluate model fitness. All models are highly predictive, with R2 ranging from 0.823 to 0.885. The gradient boosting model has the best overall prediction quality, with an out-of-sample RMSE of 0.57.
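The two evaluation criteria named above, RMSE and adjusted R2, can be written out directly; this is a generic sketch of the standard formulas, not the project's own code:

```python
import math

def rmse(actual, predicted):
    """Root mean squared error."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def adjusted_r2(actual, predicted, n_predictors):
    """R-squared penalized for the number of predictors used."""
    n = len(actual)
    mean_a = sum(actual) / n
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    r2 = 1 - ss_res / ss_tot
    return 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)
```

Adjusted R2 rewards fit but charges for model complexity, which makes it a fairer yardstick than plain R2 when comparing models with different numbers of predictors.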
Fan Yang, Breast Cancer Diagnosis Analysis, April 2018, (Yichen Qin, Dungang Liu)
The dataset studied in this paper describes breast cancer tissue, which is either benign or malignant. Our target is to recognize malignant tissue from measurements (mean, standard error, and worst value) of its features. This paper includes a feature-selection step based on correlation analysis and data visualization. After eliminating some correlated and visually indistinguishable features, logistic regression, random forest, and XGBoost models are fitted on training and validation data. Ten-fold cross-validation is also used to estimate the performance of all the models; prediction accuracy from the different models is then compared, and the area under the ROC curve is used to evaluate model performance on validation data.
Sinduja Parthasarathy, Income Level Prediction using Machine Learning Techniques, April 2018, (Yichen Qin, Dungang Liu)
Income is an essential component in determining the economic status and standard of living of an individual. An individual’s income largely influences their nation’s GDP and financial growth. Knowing one’s income can also assist in financial budgeting and tax return calculations. Hence, given the importance of knowing an individual’s income, the US Census data from the UCI Machine Learning Repository was explored in detail to identify the factors that contribute to a person’s income level. Furthermore, machine learning techniques such as logistic regression, classification tree, random forest, and support vector machine were used to predict the income level and subsequently identify the model that most accurately predicted the income level of an individual.
Relationship status, Capital gain and loss, Hours worked per week and Race of an individual were found to be the most important factors in predicting the income level of an individual. Of the different classification techniques that were built and tested for performance, the logistic regression model was found to be the best performing, with the highest accuracy of 84.63% in predicting the income level of an individual.
Jessica Blanchard, Predictive Analysis of Residential Building Heating and Cooling Loads for Energy Efficiency, March 2018, (Peng Wang, Dungang Liu)
This study’s focus is to predict the required heating load and cooling load of a residential building through multiple regression techniques. Prediction accuracy is tested with in-sample, out-of-sample, and cross-validation procedures. A dataset of 768 observations, eight potential predictor variables, and two dependent variables (heating and cooling load) is explored to help architects and contractors predict the necessary air supply demand and thus design more energy-efficient homes. Exploratory Data Analysis uncovered not only relationships between the explanatory and dependent variables, but relationships among explanatory variables as well. To create a model with accurate predictability, the following regression techniques were examined and compared: Multiple Linear Regression, Stepwise, LASSO, Ridge, Elastic-Net, and Gradient Boosting. While each method has its advantages and disadvantages, the models created using LASSO Regression to predict heating and cooling load balance simplicity and accuracy relatively well. However, the LASSO models produced greater root mean squared error than Gradient Boosting. Overall, the regression trees created with Gradient Boosting yielded the best predictive results, with parameter tuning to regulate overfitting. These models meet the purpose of this study: to provide residential architects and contractors a straightforward model with greater accuracy than the current “rules of thumb” practice.
Zachary P. Malosh, The Impact of Scheduling on NBA Team Performance, November 2017, (Michael Magazine, Tom Zentmeyer)
Every year, the NBA releases its league schedule for the coming season. The construction of the schedule contains many potential schedule-based factors (such as rest, travel, and home court) that can impact each game. Understanding the impact of these factors is possible by creating a regression model that quantifies team performance in a particular game in terms of final score and fouls committed. Ultimately, rest, distance, attendance, and time in the season had a direct impact on the final score of the game, while attendance led to an advantage in fouls called against the home team. Quantifying the impact of these factors can help anticipate variations in performance and improve accuracy in a Monte Carlo simulation.
Oscar Rocabado, Multiclass Classification of the Otto Group Products, November 2017, (Yichen Qin, Amitabh Raturi)
The Otto Group is a multinational group with thousands of products that need to be classified consistently into nine groups. The better the classification, the more insights they can generate about their product range. However, the data are highly unbalanced among classes, so we investigate whether the Synthetic Minority Oversampling Technique (SMOTE) has notable effects on the accuracy and area under the curve of the classifiers. Because the dataset is obfuscated, making interpretation impossible, we use black-box methods: linear and Gaussian support vector machines, multilayer perceptrons, and ensembles that combine classifiers, such as random forests and majority voting.
Shixie Li, Credit Card Fraud Detection Analysis: Over sampling and under sampling of imbalanced data, November 2017, (Yichen Qin, Dungang Liu)
Imbalanced credit fraud data are analyzed with oversampling and undersampling methods. A model is built with logistic regression, and the area under the PRROC (precision-recall curve) is used to show the model performance of each method. The disadvantage of using the area under the ROC curve is that, due to the imbalance of the data, the specificity will always be close to 1, so the area under the curve does not work well on imbalanced data. This disadvantage is shown by comparison in this paper. Instead, a precision-recall plot is used to find a reasonable region for the cutoff point based on the result from the selected model. The cutoff value should be chosen within or around that region, and the choice depends on whether precision or recall is more important to the bank.
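Precision and recall at a candidate cutoff, the quantities behind the precision-recall plot, can be computed as in this sketch; the labels and scores are toy values, not the credit data:

```python
def precision_recall(labels, scores, cutoff):
    """Precision and recall when flagging scores >= cutoff as fraud (label 1)."""
    tp = sum(1 for y, s in zip(labels, scores) if s >= cutoff and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= cutoff and y == 0)
    fn = sum(1 for y, s in zip(labels, scores) if s < cutoff and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# toy data: 2 frauds among 5 transactions
labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.8, 0.2, 0.1]
p, r = precision_recall(labels, scores, 0.5)
```

Sweeping the cutoff over many values traces out the precision-recall curve; unlike specificity, neither metric is inflated by the large number of true negatives.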
Cassie Kramer, Leveraging Student Information System Data for Analytics, November 2017, (Michael Fry, Nicholas Frame)
In 2015, the University of Cincinnati began to transition its Student Information System from a homegrown system to Oracle PeopleSoft's Campus Solutions, branded by UC as Catalyst. In order to perform reporting and analytics on these data, the data must be extracted from the source system, modeled, and loaded into a data warehouse. The data can then be exported for analytics. This project covers the processes of extracting, modeling, loading, and analyzing the data. The goal is to predict students’ GPA and retention for a particular college.
Parwinderjit Singh, Alternative Methodologies for Forecasting Commercial Automobile Liability Frequency, October 2017, (Yan Yu, Caolan Kovach-Orr)
Insurance Services Office, Inc. publishes a quarterly forecast of commercial automobile liability frequency (the number of commercial automobile insurance claims reported/paid) to help insurers make better pricing and reserving decisions. This paper proposes forecasting models based on time-series techniques as an alternative to existing traditional methods and intends to improve existing forecasting capabilities. ARIMAX forecasting models have been developed with economic indicators as external regressors. These models resulted in a MAPE (Mean Absolute Percentage Error) ranging from 0.5% to ~9%, a significant improvement over currently used techniques.
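MAPE, the error metric reported above, is straightforward to compute; the actual/forecast values here are illustrative only, not ISO data:

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    n = len(actual)
    return 100.0 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / n

# illustrative quarterly claim frequencies and their forecasts
observed = [100.0, 200.0]
predicted = [90.0, 210.0]
error = mape(observed, predicted)  # average of a 10% and a 5% error
```

Because MAPE is scale-free, it allows forecasts of series with very different claim volumes to be compared on one footing, though it is undefined when an actual value is zero.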
Anjali Chappidi, Un-Crewed Aircraft Analysis & Maintenance Report Analysis, August 2017, (Michael Fry, Jayvanth Ishwaran)
This internship comprised two projects: an analysis of crew data using SAS and an analysis of aircraft maintenance reports using text mining in R. The first project identifies and analyzes how different factors affected the crew ratio on different fleets. The goal of the second project is to study the maintenance logs, which consist of the work order descriptions and work order actions for aircraft reported to undergo maintenance.
Vijay Katta, A Study of Convolutional Neural Networks, August 2017, (Yan Yu, Edward Winkofsky)
The advent of Convolutional Neural Networks has drastically improved the accuracy of image processing. Convolutional Neural Networks (CNNs) are presently the crux of deep learning applications in computer vision. The purpose of this capstone is to investigate the basic concepts of CNNs in a stepwise manner and to build a simple CNN model to classify images. The study involves understanding the concepts behind the different layers in a CNN, studying different CNN architectures, understanding the training algorithms of CNNs, studying the applications of CNNs, and applying a CNN to image classification. A simple image classification model was designed on an ImageNet dataset containing 70,000 images of digits. The accuracy of the best model was found to be 98.74%. From the study, it is concluded that a highly accurate image processing model is achievable in a few minutes when the dataset has fewer than 0.1 million observations.
Yan Jiang, Selection of Genetic Markers to Predict Survival Time of Glioblastoma Patients, August 2017, (Peng Wang, Liwei Chen)
Glioblastoma multiforme (GBM) is the most aggressive primary brain tumor, with survival time of less than 3 months in >50% of patients. Gene analysis is considered a feasible approach for the prediction of a patient’s survival time. Advanced gene sequencing techniques normally produce large amounts of genetic data which contain important information for the prognosis of GBM. An efficient method is urgently needed to extract key information from these data for clinical decision making. The purpose of this study is to develop a new statistical approach to select genetic markers for the prediction of GBM patients’ survival time. The new method, named the Cluster-LASSO linear regression model, was developed by combining nonparametric clustering and LASSO linear regression. Compared to the original LASSO model, the new Cluster-LASSO model simplifies the model by 67.8%, selecting 19 predictor variables after clustering instead of the 59 predictor variables in the LASSO model. The predictor genes selected for the Cluster-LASSO model are ZNF208, GPRASP1, CHI3L1, RPL36A, GAP43, CLCN2, SERPINA3, SNX10, REEP2, GUCA1B, PPCS, HCRTR2, BCL2A1, MAGEC1, SIRT3, GPC1, RNASE2, LSR and ZNF135. In addition, the Cluster-LASSO model surpasses the out-of-sample performance of the LASSO model by 1.89%. Among the 19 genes selected in the Cluster-LASSO model, the positively associated HCRTR2 gene and the negatively associated GAP43 are especially interesting and worthy of further study. A further study confirming their relationship to the survival time of GBM and the possible mechanism would contribute tremendously to the understanding of GBM.
Jing Gao, Patient Satisfaction Rating Prediction Based on Multiple Models, August 2017, (Peng Wang, Liwei Chen)
With the development of the economy and technology, online health consultation provides a convenient platform that enables patients to seek advice and treatment quickly and efficiently, especially in China. Due to the large population density, physicians may need to see hundreds of patients every day at the hospital, which is very time-consuming for patients, so it is no wonder that online health consultation has grown so rapidly. Since healthcare services always involve issues of mortality and quality of life, online healthcare services and patient satisfaction are important to keeping this industry running safely and efficiently. In this project, we focus on patient satisfaction. We integrate three levels of data (physician level, hospital level, and patient level) into one dataset, and we build multiple predictive models to learn which independent variables have significant effects on the patient satisfaction rate and to check the precision of the models by comparison. This paper verifies that physicians’ degree of participation in the online healthcare consultation system, as well as the hospital’s support, significantly affects patient satisfaction, especially interactive activity such as total web visits, thank-you letters, etc.
Jasmine Ding, Comparison Study of Common Methods in Credit Data Analysis, August 2017, (Peng Wang, Dungang Liu)
Default risk is an integral part of risk management at financial institutions. Banks allocate a significant amount of resources to developing and maintaining credit risk models. Binning is a method commonly used in banking to analyze consumer data to determine whether a borrower would qualify for a bankcard or a loan. The practice requires that numeric variables be categorized into discrete bins for further analysis based on certain cutoff values. The approach for grouping observations can vary from equal bin size to equal range depending on the situation. Binning is popular because of its ability to identify outliers and handle missing values. This project explores the basic methods commonly used for credit risk modeling, including simple logistic regression, logistic regression with binned variable transformation, and generalized additive models. After developing each model, a misclassification rate is calculated to compare model performance. In this study, the credit model based on binned variables did not produce the best results; both the generalized additive model and random forest performed better. In addition, the project proposes other methods that can be used to improve credit model performance when working with similar datasets.
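Equal-width binning, one of the grouping approaches mentioned, can be sketched as follows; this is a minimal version that assumes non-constant numeric input and leaves missing-value handling (a key selling point of binning in practice) to a separate step:

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins equal-width bins over its range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins

    def bin_of(x):
        b = int((x - lo) / width)
        return min(b, n_bins - 1)  # clamp the maximum value into the last bin

    return [bin_of(x) for x in values]

# e.g. bin applicant incomes into 5 coarse groups before fitting the scorecard
income_bins = equal_width_bins([21000, 35000, 48000, 90000, 150000], 5)
```

Equal-size (quantile) binning is the common alternative: instead of equal ranges, cutoffs are placed so each bin holds roughly the same number of observations.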
Sneha Peddireddy, Opportunity Sizing of Final Value Fee Credits, August 2017, (Michael Fry, Varun Vashishtha)
The e-commerce company allows customers to “commit to purchase” an item and charges the seller a fee (commission for sale) when this happens. If the actual purchase does not happen for any reason, the seller must be refunded the fee amount as a credit. There are multiple reasons why a transaction might not be completed after “commit to purchase”. There are also cases where a transaction is taken off the website by mutual agreement between buyer and seller, which results in lost revenue for the company. The current project involves identifying the key reasons for incomplete transactions and sizing the opportunity to minimize credit payments for off-platform transactions.
Krishna Teja Jagarlapudi, Solar Cell Power Prediction, August 2017, (Michael Fry, Augusto Sellhorn)
The rated power output of a solar cell is estimated through experimental measurements and theoretical calculations. However, it is difficult to obtain reliable predictions of the power output under varying weather conditions. With the advent of the Internet of Things, it is possible to record the exact power output of a solar cell over time. These data, along with weather information, can be used to build predictive models. In this project, a neural network model and a random forest model are built. The performance of the two models is compared using 10-fold cross-validation, based on mean absolute error and adjusted R-squared. The random forest is seen to perform better than the neural network.
Mansi Verma, 84.51° Capstone Project, August 2017, (Michael Fry, Mayuresh Gaikwad)
84.51° is the analytics wing of Kroger, which aims to make people’s lives easier by achieving real customer understanding. It brings together customer data, analytics, and business and marketing strategies for more than 15 million loyal Kroger customers. It also collaborates with 300 CPG (consumer packaged goods) clients by driving awareness, trial, sales uplift, earned media impressions, and ultimately customer loyalty. Using the latest tools, technology, and statistical techniques, 84.51° works to produce insight on customer behavior from in-store spend data to support business decisions. All of the company’s goals are centered on customers, not only on profits. Targeting the right customers is not an easy job: the objective is to target the right customer base and to know when to target them and with what. The right kind of targeting not only drives sales but also saves business resources and maximizes profit. Kroger provides coupons through many channels: tills at the time of billing, emails, the website, the mobile app, and direct mails sent to the best customers. This project discusses a model for best-customer targeting for a direct mail campaign for a beauty CPG client for a newly launched product.
Shengfang Sun, Human Activity Classification Using Machine Learning Techniques, August 2017, (Yichen Qin, Liwei Chen)
In this work, machine-learning algorithms are developed to classify human activities from wearable sensor data. The sensor data were collected from 10 subjects of diverse profiles while performing a predefined set of physical activities. Three activity classifiers using the sensor metrics were trained and tested: random forest, Naïve Bayes, and neural network. The performance of these classifiers was scored by leave-one-subject-out cross-validation. The results show that the neural network performs best, with an accuracy rate of 85%. A closer look at the aggregated confusion matrix suggests that most activities of a new subject can be predicted well by the pre-trained neural network classifier, although some activities appear to be very subject-sensitive and may require subject-specific training.
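Leave-one-subject-out cross-validation, the scoring scheme used above, simply holds out all records of one subject per fold so the classifier is always tested on an unseen person; a minimal sketch with a hypothetical record layout:

```python
def leave_one_subject_out(records):
    """Yield (train, test) splits; each test set holds all records of one subject."""
    subjects = sorted({r["subject"] for r in records})
    for held_out in subjects:
        train = [r for r in records if r["subject"] != held_out]
        test = [r for r in records if r["subject"] == held_out]
        yield train, test

# hypothetical layout: one row per windowed sensor reading
records = [
    {"subject": "s01", "features": [0.1, 0.7], "activity": "walking"},
    {"subject": "s01", "features": [0.2, 0.6], "activity": "sitting"},
    {"subject": "s02", "features": [0.9, 0.1], "activity": "running"},
]
splits = list(leave_one_subject_out(records))  # one fold per subject
```

Splitting by subject rather than by row prevents readings from the same person leaking into both train and test, which would overstate accuracy for subject-sensitive activities.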
Sakshi Lohana, Market Basket Analysis of Instacart Buyers, August 2017, (Peng Wang, Uday Rao)
Market Basket Analysis is a modeling technique used to determine the unique buying behavior of customers. It can be used to formulate strategies to increase sales by suggesting to customers what to buy next and providing promotions on relevant products of their choice. In this project, Market Basket Analysis and association rules are explored using a dataset available on Kaggle.com containing transactions by various users of the e-commerce website Instacart. After careful analysis, it is found that items of daily use such as fruits, milk, and sparkling water are ordered the most. Also, the proportion of reordered products is as high as 46%, and hence customers can be encouraged to buy the same product again if they are satisfied with the first buying experience. There are high levels of association among different yogurts, pet foods, and organic items: a person buying organic cilantro is very likely to also buy organic limes.
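The association-rule metrics behind findings like the cilantro-limes rule are support, confidence, and lift; a small sketch with made-up baskets, not the Instacart data:

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence, and lift for the rule antecedent -> consequent."""
    n = len(transactions)
    n_ante = sum(1 for t in transactions if antecedent <= t)   # <= is subset test
    n_cons = sum(1 for t in transactions if consequent <= t)
    n_both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
    support = n_both / n
    confidence = n_both / n_ante if n_ante else 0.0
    lift = confidence / (n_cons / n) if n_cons else 0.0
    return support, confidence, lift

# made-up baskets for illustration
baskets = [
    {"organic cilantro", "organic limes", "milk"},
    {"organic cilantro", "organic limes"},
    {"milk", "sparkling water"},
    {"organic limes", "yogurt"},
]
support, confidence, lift = rule_metrics(
    baskets, {"organic cilantro"}, {"organic limes"}
)
```

A lift above 1 means the consequent is bought more often alongside the antecedent than its overall popularity would predict, which is the signal used to justify cross-promotions.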
Sahil Thapar, Predicting House Sale Price, August 2017, (Dungang Liu, Liwei Chen)
Over recent years we have seen that house prices can be an important indicator of the state of a country's economy. In this project, we employ machine learning techniques to predict the final sale price of a house based on a range of its features. A house can be the single biggest investment an individual makes in their lifetime, and a sound statistical model can help the customer get a fair valuation of the house, both at the time of purchase and of sale. The final house prices are a continuous variable and are predicted using linear regression. As part of this project, regularization was performed to achieve simpler predictive models.
Pradeep Mathiyazagan, Website Duration Model, August 2017, (Yichen Qin, Yan Yu)
This capstone project is a natural extension of the Graduate Case Study that I worked on in the Spring Semester of 2017 as part of the Business Analytics program at the University of Cincinnati. It explores a bag-of-words model with user browsing data from the website of a local TV news station in Las Vegas owned by EW Scripps. The original Graduate Case Study did not afford us the time to explore a bag-of-words model, as it involved a fairly large amount of web scraping. Another worthwhile piece of information I hope to include in this model is the number of media elements present on a webpage in the form of tweets, pictures, and videos, to analyze their impact on user engagement. Through this, we hope to identify pertinent information that results in better user engagement, which would ultimately result in increased advertising revenue.
Rajul Chhajer, Forecasting Stock Reorder Point for Smart Bins, July 2017, (Michael Fry, John E. Laws)
Forecasting the reorder point plays an important role in efficiently managing inventories. The reorder point is essentially the right time to order stock, considering the lead time to get the stock from the supplier and the safety stock available. It is difficult to determine the replenishment point if the sales information and lead time are unknown. In this study, historical reorder trends were observed at the product level for forecasting. Apex ActylusTM smart bins can reorder stock automatically based on the inventory level, and they store information on all past orders. The past reorders helped in understanding the velocity of a product present in a bin, and then a moving-average technique was used at the product level to predict the next replenishment. The reorder point prediction would reduce the frequency of ordering and help floor managers make better reorder plans.
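The moving-average replenishment estimate, together with the classic reorder-point formula it feeds into, can be sketched as follows; the quantities are illustrative, not Apex bin data:

```python
def next_reorder_estimate(reorder_quantities, window=3):
    """Simple moving average over the most recent reorder quantities."""
    recent = reorder_quantities[-window:]
    return sum(recent) / len(recent)

def reorder_point(daily_usage, lead_time_days, safety_stock):
    """Classic reorder point: expected demand over the lead time plus safety stock."""
    return daily_usage * lead_time_days + safety_stock

# illustrative product-level reorder history for one bin
history = [10, 12, 14, 16]
estimate = next_reorder_estimate(history)  # average of the last 3 reorders
trigger = reorder_point(5, 4, 10)          # 5 units/day, 4-day lead time, 10 safety
```

The moving window lets the estimate track a product's recent velocity while smoothing out one-off spikes in individual orders.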
Wei Yue, Analysis of Students’ Dining Survey, July 2017, (Peng Wang, Yinghao Zhang)
The goal of this project is to explore the factors that most influence the customer experience under the designed circumstance. To achieve this objective, regression models were built to represent the relationship between customer experience and customers' basic information. The model-building results showed that customer experience is not directly related to all the information provided by the survey. The survey responses came from randomly selected students at a university in Guangdong Province, China. The purpose of the survey is to help restaurant management better understand which dishes are more popular among students and, more importantly, whether there are connections between dish-ordering patterns and different students.
First, the students’ basic information was collected and categorized: gender, major, frequency of dining out, etc. Participants were then asked to pick 5 dishes, as if dining in, from eight categories of dishes on the menu with two in each category (16 in total); one of the five chosen dishes was then randomly selected to be out of stock, and participants had to pick another dish to replace it. The customer experience under this circumstance was then surveyed for analysis. The total number of participants was 98.
Akash Gupta, Customer Segmentation and Post Campaign Analysis, July 2017, (Michael Fry, Naga Ramachandran)
A marketing campaign is a focused, tactical initiative to achieve a specific marketing goal.
Marketing activities require careful planning so that every step of the process is understood before launch. Because a marketing campaign is tactical and project-based, you need to map out the process from the initial promotional intent to the ultimate outcome. Based on that purpose, you need to set specific goals and metrics, or key performance indicators (KPIs), that help you determine how your campaign is performing against that goal and that are helpful when creating or refining marketing strategies. It is important to track marketing activities through to results. Results are determined by the goals set for the campaign, but in most cases they are measured in terms of sales or qualified leads, and eventually applications.
Palash Siddamsettiwar, Internship at Tredence Analytics, July 2017, (Michael Fry, Sumit Mehra)
During my internship at Tredence Analytics, I worked as an analytics consultant to one of the biggest plumbing, HVAC&R, and fire protection distributors in the United States, with more than $13 billion in yearly revenue. I was involved in building analytics capabilities across divisions including supply chain, operations, and products. My primary project involved working with warehouse managers and the head of data to understand how to cut down shipping costs to customers by optimizing modes of shipment and timing of delivery, thus cutting down fixed and variable costs. By providing cost estimates for the available options, sales representatives and dispatchers would be able to make data-driven decisions rather than instinct-based ones.
My secondary project involved working with the products team and the e-commerce team to help them categorize their products using machine learning techniques. With more than 3 million SKUs involved, and more than 2 million of them still unclassified, the current pace and accuracy of classifying these products was not sustainable. Using machine learning would help these two teams significantly reduce the effort, time, and money needed to classify the products and check the classifications. Both projects involve the creation of a long-term, automated, real-time solution integrated into the client's IT systems, to help people make quicker and more efficient decisions.
Jordan Adams, Forecasting Process for the U.S. Medical Device Markets, July 2017, (Yan Yu, Chris Dickerson)
The goal of this capstone is to build a forecasting process and model for Company X to forecast US medical device market sales and share for Company X and all competitors. The forecasting process will be built using two data analytics tools to handle data management, data modeling, data visualization, and statistical analysis. The forecast process for the medical device market will involve conducting a baseline forecast using an array of time series forecasting methodologies, and adjusting the forecasts based on economic trends, competitive intelligence, market insights, and organizational strategies. The forecaster will have the flexibility to choose among many differing forecasts, selecting the model they feel has the best predictive power, and the ability to cleanly visualize and explore each forecast in depth.
Aditya Singh, Churn Model, July 2017, (Michael Fry, Evan Cox)
The client is a cosmetics company based in New York City. The company has close to 9,000 members globally, both men and women, from over 2,250 companies in beauty-related industries. The primary reasons for becoming a member are as follows:
- Networking with other people in the beauty industry
- Find a career in the beauty industry
- Learn more about the latest trends in the beauty industry
- Get your product/company recognized at an awards event hosted by the company
A big percentage of the members churn after just one year of subscription. The goal is to identify patterns among members who are likely to churn and eventually predict when a member is going to churn. A significant amount of time was spent setting up the dataset before the modeling process. After data cleaning and manipulation, I built a logistic regression model that predicts whether a member is going to churn.
Catherine Cronk, A Simulation Study of the City of Cincinnati’s Emergency Call-Center Data: Reducing Emergency Call Wait Times, July 2017, (David Kelton, Jennifer Bohl)
Emergency-response call centers are arguably one of the most important services a city can provide for its constituents. When a person calls 911, there is an expectation that the call will be answered and dispatched to the nearby emergency response department within seconds. In recent years, the total number of calls to 911 has increased, causing wait times of up to 30 minutes for people contacting emergency services. The purpose of this simulation study is to analyze the current emergency call-center system and data for the City of Cincinnati and simulate alternative systems. The goal is to identify a better system that can achieve the City Administration’s target of call takers answering 90% of 911 phone calls in under 10 seconds.
Michael Ponti-Zins, Inpatient Readmissions Reduction and MicroStrategy Dashboard Implementation, July 2017, (Michael Fry, Denise White)
Inpatient hospital readmission rates have been considered a major indicator of quality of care for several decades and have been shown to have a highly negative correlation with patient satisfaction. In 2017, the Ohio Department of Medicaid announced a 1% reduction in Medicaid reimbursement for all hospitals that are deemed to have excessive readmissions. In order to improve care and avoid potential payment reductions, Cincinnati Children’s Hospital created an internal quality improvement team focusing on readmissions reduction. To better understand the millions of data points related to readmissions, a dynamic dashboard was created using MicroStrategy, a business intelligence and data visualization tool. This dashboard was used to track the percentage of patients readmitted within 7 and 30 days of discharge, why patients were returning, the percentage of readmissions that were potentially preventable, and other related aspects of each inpatient encounter. This information was used to identify targeted interventions to decrease future readmissions, including improved discharge and home medication instructions, automated email notification of providers, and data exports to assist in ad hoc analysis.
Ajish Cherian, Predicting Income Level using US 1994-95 Census Data, July 2017, (Peng Wang, David F. Rogers)
The objective of the project was to predict whether income exceeds $50,000 per year based on US 1994-1995 census data, using different predictive models and comparing their performance. Since the prediction to be made is a categorical value (income <=50K or >50K), the predictive models built were for classification. Models designed for the dataset were Logistic Regression, Lasso Regression, K-Nearest Neighbor, Support Vector Machine, Naive Bayes, Classification Tree, Random Forest and Gradient Boosting. Performance and effectiveness of all the models were evaluated using Area Under the Curve (AUC) and misclassification rate, calculated on both the training and test datasets; however, only metrics from the test dataset were used to finalize a model. Gradient Boosting performed best out of the selected models.
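Both evaluation metrics can be computed directly from labels and predicted scores. A minimal sketch on toy data (not the census data) showing the Mann-Whitney formulation of AUC and the misclassification rate at a 0.5 cutoff:

```python
def auc(labels, scores):
    """AUC via the Mann-Whitney formulation: the fraction of positive-negative
    pairs in which the positive example is scored higher (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def misclassification_rate(labels, scores, cutoff=0.5):
    """Share of examples landing on the wrong side of the probability cutoff."""
    return sum(int(s >= cutoff) != y for y, s in zip(labels, scores)) / len(labels)

y = [1, 1, 0, 0, 1]          # toy true classes
s = [0.9, 0.6, 0.4, 0.7, 0.8]  # toy predicted probabilities
```

On these five examples, `auc(y, s)` is 5/6 and `misclassification_rate(y, s)` is 0.2.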
Rui Ding, Analysis of Price Premium for Online Health Consultations by Statistical Modeling, July 2017, (Peng Wang, Liwei Chen)
In this project, we focus on the mechanism of how the descriptive information of physicians and information of interactive reviews from patients affect the price premium of online health consultation. Section 1 briefly introduces the definition of online health consultation and the techniques to be used in the project. Section 2 concentrates on the exploratory data analysis of the data set to obtain an overview of the distribution of price premium and physicians. Section 3 discusses the analysis process of the data set by different modeling methods. The performance of each method is evaluated by in-sample and out-of-sample mean squared error and prediction error. Generalized linear modeling and mixed effect modeling demonstrate similar performance without obvious overfitting. Regression tree shows better prediction performance. However, tree-based bagging and random forest methods provide excellent performance with a potential overfitting problem. Section 4 concludes the findings from the modeling and interprets the importance of the variables in the finalized models.
S.V.G. Sriharsha, Analysis of Grocery Orders Data, July 2017, (Yichen Qin, Jeffery Mills)
The objective of this analysis is to study the ordering patterns of users of Instacart, a grocery delivery company, and provide key insights about customer behavior. There are 206,209 users in the database and 49,687 different products available to order through Instacart, which can be grouped into 21 different departments. The current database consists of details about 3,421,083 orders placed by users over a certain period of time. The analysis starts with exploration of the variables, then moves on to i) association rule mining using the apriori algorithm, ii) unsupervised classification of customers based on their buying behavior using the K-means clustering algorithm, and iii) product embedding using Word2Vec analysis, and concludes with a summary of the results.
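The core quantities behind apriori-style association rules are support, confidence, and lift, all computed from basket co-occurrence counts. A small sketch on made-up baskets (not the Instacart data):

```python
# Toy transaction data; each basket is the set of items in one order
baskets = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "eggs"},
    {"milk", "eggs"},
    {"milk", "bread", "banana"},
]

def support(itemset):
    """Fraction of baskets containing every item in the itemset."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def rule_metrics(antecedent, consequent):
    """Confidence and lift for the rule antecedent -> consequent."""
    confidence = support(antecedent | consequent) / support(antecedent)
    lift = confidence / support(consequent)
    return confidence, lift

conf, lift = rule_metrics({"milk"}, {"bread"})
```

Here `{milk} -> {bread}` has confidence 0.75 but lift below 1 (0.9375), since bread is already in 80% of baskets; apriori prunes the candidate itemsets so these metrics need only be computed for frequent sets.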
Linxi Yang, Analysis of Feedback from Online Healthcare Consultation with Text Mining, July 2017, (Peng Wang, Liwei Chen)
China has experienced rapid economic growth that has benefited many industries, but not the healthcare system. Because of uneven economic development in China, not all residents can receive appropriate medical care. With an immature healthcare system and scarce medical resources for 1.3 billion people, the online healthcare consultation community in China has become as popular as in other developed countries. The data was collected from an online healthcare consultation community, Good Doctor Community (www.haod.com), the earliest and largest online healthcare consultation community in China, which has grown rapidly over the past 10 years. This research project focuses on how to improve the quality of service in the healthcare industry and provides insightful analysis for Good Doctor Community's future development using text mining. Results show that the main purpose of visits is treatment and diagnosis, and the main reasons for choosing a physician are online reviews and recommendations from friends, relatives, etc. Of 22,625 respondents, 11,671 registered at the counter before seeing a physician, and 9,290 registered via an online system. The most frequent word in the dataset is "patient", and the most frequent word in dissatisfied reviews is "impatient". Sentiment analysis of the text shows that most patients have very positive sentiment; only about 1 in 48 has negative sentiment.
S. Zeeshan Ali, Image Classification with Transfer Learning, July 2017, (Peng Wang, Liwei Chen)
Correctly classifying an image is a problem that has existed since the advent of modern computers. Techniques like deep learning have recently produced breakthroughs in this field. In this project we explore techniques such as transfer learning to classify images. We also touch upon image feature extraction and modeling with image arrays, using a digit image dataset for simplicity.
Apoorv Joshi, Predicting Realty Prices Using Sberbank Russian Housing Data, July 2017, (Dungang Liu, Liwei Chen)
Sberbank is Russia’s oldest and largest bank. It utilizes historical property sale data to create predictive models for realty prices and assists customers in making better decisions when renting or purchasing a building. The Sberbank Housing Dataset describes properties and the sub-areas of Moscow to which they belong. The dataset contains 30,471 observations and 292 variables. The variables are analyzed using exploratory data analysis to see how they individually affect the price of a house. Further, the available data is cleaned, manipulated, and used to fit models that can predict house prices. Linear Regression, LASSO, Random Forest, and Gradient Boosting models were fit on the data, and predictions were made with sufficient accuracy.
Aishwarya Nalluri, Multiple Projects with Sevan Multi Site Solutions, July 2017, (Michael J. Fry, Doug Gafney)
Client Company A is a well-known fast food restaurant chain spread across the world. Its business model in the USA is divided into major FETs. In this project, an attempt has been made to map employees (supporting Company A but employed by Sevan Multi Site Solutions) working at different levels in a single dashboard. The tool used is Power BI. The main challenge was collecting the data and preparing it for use in Power BI: the data had to be valid for presentation in a dashboard, and employee headshots had to be embedded in the dashboard rather than simply linked as external hyperlinks.
QBR is a Quarterly Business Report presented to the board members of the company. Every quarter a meeting is held and each department is given an opportunity to present where it stands and the challenges it is facing. QBR focuses mainly on four aspects: people, clients, operations, and finance. This methodology was introduced when the company started acquiring more projects from a variety of clients. As quarters passed, many modifications were made to the process of collecting and presenting the required data. The main challenge the company faced was that there was no standard framework for QBR reporting; the Finance team had issues collecting, cleansing, and presenting the data. As part of the solution, a standard approach was built in Excel. The only effort now needed from the Finance team is loading a report from QuickBooks into Excel, which automatically updates all the reports. This solution has reduced their time by 50%.
Siva Ramakrishnan, The Insurance Company Benchmark (CoIL 2000), July 2017, (Yan Yu, Edward Winkofsky)
This project focuses on predicting potential customers for the Caravan Insurance Company. The dataset was used in the Computational Intelligence and Learning (CoIL) 2000 challenge. It consists of 86 variables and includes product ownership data and socio-demographic data. The aim of the project is to classify customers as either buyers or non-buyers of the insurance policy. Six different models were developed, including Logistic Regression, Classification Tree, Naïve Bayes, Support Vector Machine, Random Forest and Gradient Boosted Trees. These models were evaluated based on the competition rules, where contestants had to select a set of 800 observations from the test set of 4,000. The logistic regression model performed better than all the other models.
Nitisha Adhikari, PD and LGD Modelling Methodology for CCAR, July 2017, (Michael J. Fry, Maduka S. Rupasinghe)
With the acquisition of First Niagara Bank in 2016, Key Bank acquired a $2.6B indirect auto portfolio. This was a new addition to the list of existing portfolios at Key, and a loss estimation model is being built to generate stressed loss forecasts for the Comprehensive Capital Analysis and Review (CCAR) and Dodd-Frank Act Stress Tests (DFAST). This document describes the data preparation and modeling methodology for the Probability of Default (PD) and Loss Given Default (LGD) models. The PD and LGD, along with the Exposure at Default (EAD), are used to generate stressed loss forecasts for CCAR and DFAST.
Venkat Kanishka Boppidi, Lending Club – Identification of Profitable Customer Segment, July 2017, (Dungang Liu, Liwei Chen)
Lending Club issues unsecured loans to different segments of customers. The interest rate for a loan depends on the credit history of the customer and various other factors such as income level and demographics. The borrower data is public. The current analysis has several objectives:
- To review the Lending Club dataset and summarize thoughts on LC risk profiles by loan type, grade, sub-grade, loan amount, etc., using loan statuses of ‘Charged Off’ and ‘Default’ as indicators of a ‘bad loan’.
- To identify fraudulent customers (customers with no payments) in the Lending Club data and determine the key characteristics of these fraudulent applications.
- To identify the best and worst categories by purpose (a category provided by the borrower for the loan request) in terms of risk.
- To build a statistical model using classification techniques and identify the less risky customer segment. These recommendations can be used to cross-sell loans to a customer segment with a low default rate and high profit.
Xiaojun Wang, Co-Clustering Algorithm in Business Data Analysis, July 2017, (Yichen Qin, Michael Fry)
In this project, we investigate a two-way clustering method and apply it to a business data set.
The classical clustering method is one-way: given a data matrix, it is performed either on whole rows (observation-wise) or on whole columns (variable-wise). For example, in the well-known K-means method, all the variables involved in the distance measure come either from variables or from records, but not both. Co-clustering, also called bi-clustering or block clustering, is a two-way clustering method: it clusters the rows and columns of a data matrix simultaneously, turning the data into blocks. Our data set comes from a retail company that has hundreds of stores, each of which contains hundreds of business departments. Co-clustering analysis helps group the data into blocks based on similarity in productivity; each block consists of a group of departments and the corresponding group of stores to which they belong. Our goal is to study these blocks so that business decisions can be made based on the information they carry. Our results show that co-clustering serves this purpose well.
Manisha Arora, Marketing Mix (Promotional Spend Optimization) for a Healthcare Drug, July 2017, (Michael Fry, Juhi Parikh)
The healthcare industry is one of the world’s largest and fastest-growing industries, consuming over 10% of GDP in most developed nations. Data and analytics are playing a major role in healthcare, giving organizations the ability to make smart, impactful, data-driven decisions to mitigate risk, improve employee welfare, and capitalize on opportunities. This capstone project focuses on evaluating the effectiveness of professional tactics for a particular drug and optimizing its promotional spend based on channel effectiveness. The project analyzes each channel and tries to answer the following questions:
- What is the impact of each channel on the promotion of the drug?
- What is the average and marginal ROI for each channel?
- What are the ideal spend levels per tactic, optimized against a brand budget number?
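The distinction between average and marginal ROI in the questions above can be sketched with a saturating response curve: average ROI is total return per dollar spent, while marginal ROI is the return from the next dollar. The curve shape and parameter values below are illustrative assumptions, not the project's fitted values:

```python
import math

def response(spend, a=500.0, b=0.002):
    """Hypothetical diminishing-returns curve: incremental sales from one channel."""
    return a * (1.0 - math.exp(-b * spend))

def average_roi(spend):
    """Total return per dollar over the whole spend."""
    return response(spend) / spend

def marginal_roi(spend, a=500.0, b=0.002):
    """Return from the next dollar: the derivative of the response curve."""
    return a * b * math.exp(-b * spend)
```

Because the curve is concave, marginal ROI falls below average ROI as spend rises; spend optimization against a budget typically equalizes marginal ROI across channels.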
Jayaram Kishore Tangellamudi, Predicting Housing Prices for ‘Sberbank’, July 2017, (Yan Yu, David F. Rogers)
Sberbank, Russia’s oldest and largest bank, helps their customers by making predictions about realty prices so renters, developers, and lenders are more confident when they sign a lease or purchase a building. Although the housing market is relatively stable in Russia, the country’s volatile economy makes forecasting prices as a function of apartment characteristics a unique challenge. Complex interactions between housing features such as number of bedrooms and location are enough to make pricing predictions complicated. Adding an unstable economy further complicates the predictions. Several regression models such as Linear Regression, General Additive Models (GAM), Decision Trees, Random Forest (RF), Support Vector Regression (SVR), Extreme Gradient Boosting (XGB) were built on the housing features alone to predict the housing prices. Additionally, economic indicator data was merged with Housing features data to check if these indicators can further explain the variance in the housing prices. The predictive model performances were compared using the Mean Square Error (MSE) of the logarithmic value of the housing prices.
Ramya Kollipara, Analysis of Income Influencing Factors in Different Professions, July 2017, (Dungang Liu, Liwei Chen)
Knowing the characteristics of a high- or low-income individual can be useful in marketing a new service targeted at potential customers within a salary range. There is always a cost involved in attracting the right customers, which an organization wants to minimize. If a model were designed to accurately identify the right people in an income range, the cost could be significantly decreased with a higher rate of return. The objective of this project is to explore and analyze the variables associated with an individual that might prove useful in understanding whether his/her income exceeds $50K/year, focusing especially on three professions: sales, executive managers, and professional specialties. Various modeling techniques are explored and the resulting models compared to see how some characteristics have greater influence on certain professions than others, and the most effective model is selected to accurately predict whether an individual’s income exceeds $50K/year based on the census data.
Shalvi Shrivastava, Black Friday Data Analysis, July 2017, (Yan Yu, Yichen Qin)
Billions of dollars are spent on Black Friday and the holiday shopping season. ‘ABC Private Limited’ has shared data on various customers for high-volume products from the Black Friday month and wants to understand customer purchase behavior (specifically, purchase amount) across products of different categories. The challenge is to predict the purchase amounts of various products purchased by customers based on the given historical purchase patterns. The data contained features such as age, gender, marital status, categories of products purchased, and city demographics. Models were built on the training data and assessed on validation data. The evaluation metric was RMSE, an appropriate choice for this problem.
Junbo Liu, Predicting Movie Ratings with Collaborative Filtering, July 2017, (Peng Wang, Zhe Shan)
Collaborative filtering, the most popular approach to recommendation systems, has been widely applied to virtually every aspect of people’s lives and has generated remarkable success in e-commerce. To make a relevant recommendation to an active user, a recommendation system must be able to accurately predict the utility of items for that user, because the items with the highest utility (ratings, in the movie case) are the ones recommended. Therefore, prediction accuracy is the key to the success of a recommendation system. In this report, we compared three representative types of collaborative filtering approaches, derived from three distinct rationales, using movie ratings data: user-based collaborative filtering (UBCF), singular value decomposition (SVD), and group-specific recommendation systems (GSRS). The minimum root mean square error (RMSE) for UBCF is 0.9432 when the number of neighbors is set to 28. For SVD, the minimum RMSE is 0.9240 when the tuning parameter λ is 0.17 and the number of latent factors is 19. For GSRS, the same number of latent factors (19) is used and the cluster numbers for both users and items are set to the default value of 10; when λ is 65, the RMSE for GSRS is 0.9007. Therefore, our results show that GSRS has the highest prediction accuracy, SVD next, and UBCF last, consistent with conclusions from the literature.
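The UBCF idea can be sketched in a few lines: predict an unseen rating as a similarity-weighted average of neighbors' ratings for that item. The toy ratings below are invented for illustration (the study itself used the movie ratings data with 28 neighbors):

```python
import math

# Toy user-movie ratings; missing keys mean the user has not rated the movie
ratings = {
    "u1": {"m1": 5, "m2": 3, "m3": 4},
    "u2": {"m1": 4, "m2": 3, "m3": 5},
    "u3": {"m1": 1, "m2": 5},
}

def cosine(a, b):
    """Cosine similarity over the items two users have both rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    na = math.sqrt(sum(a[i] ** 2 for i in common))
    nb = math.sqrt(sum(b[i] ** 2 for i in common))
    return dot / (na * nb)

def predict(user, item):
    """Similarity-weighted average of the neighbors' ratings for the item."""
    neighbors = [(cosine(ratings[user], r), r[item])
                 for u, r in ratings.items() if u != user and item in r]
    den = sum(sim for sim, _ in neighbors)
    return sum(sim * val for sim, val in neighbors) / den if den else None

pred = predict("u3", "m3")   # u3 has not rated m3
```

RMSE is then computed between such predictions and held-out ratings; SVD and GSRS replace the neighborhood average with latent-factor reconstructions.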
Aditya Nakate, Talmetrix Inc, Cincinnati, July 2017, (Michael J. Fry, Ayusman Vikramjeet)
I am working as a Production Support Analyst Intern at Talmetrix in downtown Cincinnati. The company helps organizations capture feedback about the employee experience and analyzes that data to help organizations attract desired talent and improve employee retention, performance, and productivity. It helps organizations make more informed decisions about their employees. The analyses and reports created by the company are consumed mainly by the human resource heads of client companies. During my internship, I have worked on various projects including report generation and ad hoc analysis, using technologies such as SQL, R, and Tableau, along with statistical skills such as classification algorithms and regression. This report summarizes the work I have done during my internship at Talmetrix. My first project was a driver analysis intended to find the categories that are critical for employee satisfaction and that clients need to focus on. Later, I worked on a report generation process for one client using employee feedback data: employees were asked to take surveys containing both Likert-scale and open-ended questions. Reports were created in Tableau, with views at the overall, region, age-level, tenure-level, department, operating-unit, and sub-operating-unit levels. Currently, we are generating more reports based on this one as clients do deeper dives.
Suchith Rajasekharan, Allstate Insurance Claim Severity Analysis, July 2017, (Yichen Qin, Michael Magazine)
In the insurance industry, having the ability to accurately predict the loss amount of a claim is of paramount importance. Companies build predictive models based on different features of a claim and use the predictions from these models to apply proper claims practices, business rules and experienced resources to manage the claims. In this paper, we explore the different steps involved in building a model to predict the loss amount of a claim. A Kaggle dataset provided by Allstate Insurance is used for this study. Various machine learning techniques, viz, Multiple Linear Regression, Generalized Linear Model, Generalized Additive Model, Extreme Gradient Boosting, and Neural Networks are used to build different models. The models are implemented using various packages available in the open source software ‘R’. Models built using different techniques are compared based on their performance on a validation set and the best model is chosen. XGBoost model gave the best performance out of all the models. Therefore, it is chosen as the final model.
Mahesh Balan, Cash as a Product, July 2017, (Michael Fry, Fan Yang)
The project analyzed the potential of adding cash as an additional payment option in more markets, quantifying the pros and cons of cash. The economics of a cash versus a non-cash trip on Uber were analyzed; a cash trip was economically beneficial to Uber compared to a non-cash trip. The project also examined aspects such as the driver and rider experience in cash versus non-cash trips: the experience of a non-cash trip appears seamless compared to that of a cash trip. The project also sought to quantify the risk and safety issues in cash versus non-cash trips; non-cash trips appear safer and more trustworthy than cash trips. Finally, the project looked at various ways to improve the existing economics and the current rider/driver experience of cash trips. The recommendations from the analysis were presented to the Growth and Product teams to improve the overall cash experience for riders and drivers.
Anitha Sreedhar Babu, eCommerce Marketing Analytics, July 2017, (Michael Fry, Maria Topken)
The client, a well-known online food delivery service in Cincinnati, is looking to engage its existing customers and increase the size and frequency of purchases. In order to understand the customer behavior and to drive revenue and engagement, customers were segmented based on frequency of orders and average lag between the orders. Customers were grouped into frequent shoppers, yearly shoppers, and one-time buyers. The data was also used to perform a market basket analysis to understand their purchase patterns. This information was used to drive recommendation engines as well as for effective cross selling of products to existing customers, by designing suitable combos. Targeted marketing strategies were developed based on the insights derived from the analysis.
Dhivya Rajprasad, Prediction of 30-Day Readmission Rate for Congestive Heart Failure Patients, July 2017, (Michael Fry, Scott Brown)
Prediction of readmission rates for patients has gained importance in the present healthcare environment for two major reasons. First, transitional care interventions have a role in reducing the readmissions among chronically ill patients. Second, there is an increased interest in using readmission rates as a quality metric with the Centers for Medicare and Medicaid Services (CMS) using the readmission rate as a publicly reported metric aimed at lower reimbursements for hospitals which have excess readmission rates according to reported risk standards. The objective of this project is to understand the factors which contribute to high readmission rates and predict the probability of a patient being readmitted. With a prediction model in place, hospitals will be able to better understand patient dynamics and provide better care while avoiding penalties for higher readmission rates. In this report, several different data mining, advanced statistical and machine learning techniques are explored and used to predict readmission rates. A comparison of the different techniques is also provided.
Prarthana Rajendra, Cincinnati Children’s Hospital and Medical Center, July 2017, (Jason Tillman, Michael Fry)
The scope of the project is to compute the metric usage of reports of various learning networks. This is done through the Extraction, Transformation, and Load processes. Data is extracted from various tables, transformed according to the requirements, and loaded into a single reporting database. The measures computed are then visualized in the form of graphs. SSIS, SSRS and SQL Server are the main technologies used in accomplishing this task. Packages were built in SSIS to automate these tasks and the resultant data sets can be viewed and analyzed through reports built using SSRS.
Matthew Murphy, Optimization of Bariatric Rooms and Beds within a Hospital, July 2017, (Michael Magazine, Neal Wiggermann)
Currently, hospitals do not have the ability to predict the quantity and type of specialty resources needed to care for specialty patients. This inability is especially problematic given the explicit and implicit costs of under- or overestimating the need. Two such specialty resources are bariatric beds and bariatric rooms. According to the Centers for Disease Control and Prevention, the obesity rate within the United States adult population has risen to 36%. The increase in the obese population, along with the high costs of bariatric beds and dedicated bariatric rooms, has necessitated investigating a better way to determine the proper number of bariatric rooms to construct, bariatric beds to own, and bariatric beds to rent. In this paper, we use simulation and probabilistic techniques along with queueing theory models to investigate the relationship between the service level for severely obese patients and the number of bariatric rooms needed to reach a designated service level. Furthermore, we build a model that can be used to determine the optimal mix of beds to buy versus rent to minimize the overall cost of bariatric equipment for the entire hospital.
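The room-count versus service-level relationship can be illustrated with the standard Erlang C formula for an M/M/c queue, treating rooms as servers. The arrival and service rates below are made-up inputs, not hospital data, and the actual study used simulation alongside such queueing models:

```python
import math

def erlang_c(arrival_rate, service_rate, servers):
    """Probability an arriving patient must wait for a room (M/M/c queue)."""
    a = arrival_rate / service_rate        # offered load in Erlangs
    rho = a / servers                      # utilization
    if rho >= 1.0:
        return 1.0                         # unstable system: everyone waits
    idle_terms = sum(a ** k / math.factorial(k) for k in range(servers))
    wait_term = a ** servers / math.factorial(servers) / (1.0 - rho)
    return wait_term / (idle_terms + wait_term)

def rooms_needed(arrival_rate, service_rate, max_wait_prob):
    """Smallest room count keeping the wait probability at or below target."""
    c = 1
    while erlang_c(arrival_rate, service_rate, c) > max_wait_prob:
        c += 1
    return c

# Hypothetical inputs: 2 bariatric arrivals/day, average stay of 2 days
rooms = rooms_needed(2.0, 0.5, 0.10)
```

Sweeping the target wait probability traces out the cost/service-level trade-off that drives the buy-versus-rent decision.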
Soumya Gupta, Employee Attrition Prediction, July 2017, (Yan Yu, Peng Wang)
Every company wants to make sure that its employees, especially the good ones, continue to work for it. Losing valuable employees is very expensive for a company, both monetarily and non-monetarily. In this project, we aim to predict whether an employee will leave the company. Three classification techniques (logistic regression, decision trees, and random forest) were used to build the predictive models, and their results were compared. Valuable employees were also identified by making a few assumptions, and separate models were built for this set of employees, since the cost of losing a valuable employee is much higher. The prediction accuracy of the random forest is quite high in this case.
Apurva Bhoite, Predicting Success of Students at Medical School, July 2017, (Peng Wang, Liwei Chen)
The University of Cincinnati College of Medicine wanted to conduct a study of the students enrolled at the college, exploring their MCAT scores, MMI scores, academic background, race, and overall background. The College of Medicine also wants to identify the most influential predictors of student success at medical school and, finally, to build a predictive model to do so. The main aim of this project was inference, so a large amount of graphical exploratory analysis (mainly box plots and bar plots faceted over variables) was performed to get an overall picture. Due to the high dimensionality of the data and the small number of observations, lasso subset selection with cross-validation was used to reduce the number of predictors. Logistic regression, classification trees, and random forests were used to build predictive models, and their performance was compared to select the best model. The College of Medicine can employ this model when admitting students to the college.
Swapnil Sharma, Application of Market Basket Analysis to Instacart Transaction Data, July 2017, (Yichen Qin, Edward P. Winkofsky)
With the rise in online transactions, companies are trying to leverage the enormous data generated by transaction activity and transform it into meaningful insights. Data mining techniques can be used to develop a cross-selling strategy for products. Data scientists use predictive analytics to improve the online shopping experience by developing models that predict which products a user will buy again, try for the first time, or buy together. In this paper, we analyze trends in customer shopping behavior on the Instacart website for buying groceries. The data set, made public by Instacart (a same-day grocery delivery service), covers 3 million transactions from over 200,000 users. The data set is explored using the open-source statistical learning tool R. Market basket analysis is performed using the Apriori algorithm at various levels of support, confidence, and lift to suggest combinations of products to include in a basket and cross-sell on the platform. A model is developed to predict which previously purchased products will be in a user’s next order, with the F-score measuring the model’s performance.
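The per-order F-score compares the predicted product set against the products the user actually reordered, balancing precision and recall. A small sketch with an invented order (the project itself worked in R):

```python
def f_score(actual, predicted, beta=1.0):
    """F-measure between the actual reorder set and the predicted set."""
    tp = len(actual & predicted)       # products correctly predicted
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(actual)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Order where products {1, 2, 3, 4} were reordered; the model predicted {2, 3, 5}
score = f_score({1, 2, 3, 4}, {2, 3, 5})
```

With precision 2/3 and recall 1/2, the F1 score here is 4/7; averaging this over all orders gives the overall model performance.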
Jasmine Sachdeva, Malware Analysis & Campaign Tracking, July 2017, (Michael J. Fry, Dungang Liu)
Any software that does something harmful to a user, computer, or network can be considered malware. Malware analysis is the practice of examining malware to understand how it may harm a device, where it came from, how it works, and how to destroy it. As the number of malware attacks hitting an organization increases every day, it is crucial to analyze and mitigate them to ensure the security of the sensitive data residing on devices. This project also covers IT security awareness programs that were conducted and analyzed, enabling employees to become more vigilant and ensuring that data and security are not breached within the organization.
William Newton, Concrete Compressive Strength Analysis, July 2017, (Yan Yu, Edward Winkofsky)
Concrete is an indispensable material in modern society. From roadways to buildings, humankind is literally surrounded and supported by this chemical bond of relatively basic ingredients. Concrete is so ubiquitous today that it is often taken for granted; many never question how concrete got here or why it can be trusted. Responsible for buildings thousands of years old, some of which still stand, concrete and its strength are the subject of the contained analyses. Using principal components analysis and linear regression techniques, a dataset of different concrete mixtures was analyzed. The analyses provide bases for reasonable inferences about compressive strength and how different elements behave in the presence of others. But they also indicate that this particular dataset is not comprehensive enough to make reliable predictions regarding the compressive strength of concrete.
Eric Nelson, Enhancing Staffing Tactics for Retailer Credit Card Customer Acquisition, July 2017, (Peng Wang, Justin Arnold)
Credit card companies across the country spend millions of dollars promoting their credit cards to consumers. Obtaining the attention and interest of a shopper can be extremely difficult, all the more so when a credit card is being promoted by a retailer whose marketing budget does not stand up to those of larger banks. To attract customers, many retailers set up in-store promotional activities to give customers a chance to learn more about the card. One such retailer invested in this strategy but required assistance determining which stores should receive marketing at which times, as the required materials and staff are limited and expensive. To answer this question, the project looks at existing credit consumers through the lens of their shopping history. A model determines which potential acquisition customers (“lookalikes”) are most similar to customers who already hold one of the retailer’s credit cards. A final tool shows which stores have the largest number of lookalike households and when those shoppers are in store and likely to notice the credit card promotional materials.
Nitish Puri, A Study of Market Segmentation and Application with Cincinnati Zoo Data, July 2017, (Yan Yu, Yichen Qin)
The process of dividing a market into homogeneous groups of customers is known as market segmentation. Customers can be grouped by where they live, by other demographic factors, or by their behavioral patterns. This project explains these and other ways of grouping customers, the purpose of segmentation, and the general process followed. Clustering is the main statistical technique used for segmentation; the two most commonly used algorithms, K-Means and hierarchical clustering, are explained in detail. The final part of the project describes the membership data from the Cincinnati Zoo and segments the Zoo’s customers by applying K-Means clustering to these data.
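The K-Means step described above can be sketched in a few lines. This is a minimal illustration on made-up membership-style features (visit frequency and spend), not the actual Zoo data; scaling before clustering is shown because K-Means is distance-based.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic stand-in for membership data: annual visits and annual spend.
X = np.vstack([
    rng.normal([4, 120], [1, 20], size=(50, 2)),    # occasional visitors
    rng.normal([15, 400], [2, 40], size=(50, 2)),   # frequent visitors
])

X_scaled = StandardScaler().fit_transform(X)        # put features on one scale
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)
print(np.bincount(km.labels_))                      # size of each segment
```

In practice the number of clusters would be chosen with an elbow plot or silhouette scores rather than fixed in advance.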
Tauseef Alam, Internship with JP Morgan Chase Bank, July 2017, (Michael Fry, Yuntao Zhu)
The Chase Consumer & Community Banking (CCB) Fraud Modeling team at JPMorgan Chase & Co. is an analytical center of excellence for fraud risk managers and operations across the bank. The team is responsible for building predictive models for managing fraud risk at the transaction, account, customer, and application level. As part of the team, my role was to build a machine learning model for predicting credit card bust-out fraud. “Bust-out” fraud, also known as sleeper fraud, is primarily a first-party fraud scheme. It occurs when a consumer applies for and uses credit under his or her own name, or uses a synthetic identity, to make transactions. The fraudster makes on-time payments to maintain good account standing, with the intent of bouncing a final payment and abandoning the account (“Bust-out Fraud White Paper,” 2009, Experian Information Solutions, Inc.). I used a gradient boosting machine (GBM) as the modeling technique for predicting fraudulent accounts, creating several independent variables and tuning model parameters along the way. As a next step, we will enhance model performance by including more features. Once the model is finalized, it will be implemented, and the scores it generates will be used to decide whether a credit card account is fraudulent.
Xiaoming Lu, Investigating the Information Loss of Binning Variables for Financial Risk Management, July 2017, (Peng Wang, Dungang Liu)
In financial risk management, the binning technique is widely used in credit scoring, especially in scorecard development. Binning is the process of transforming numeric variables into categorical variables and regrouping categorical variables into new categorical variables. This technique is usually employed at an early stage of model development to coarsely select important variables for further evaluation. One potential problem of binning is information loss due to the transformation. To investigate this question, we performed automatic binning on the German Credit dataset using the “woeBinning” package. Then we explored the potential information loss in the development of several models in R, including logistic regression, a classification tree, and a random forest. We employed residual mean deviance, the OOB estimate of the error rate, ROC curves, and symmetric and asymmetric misclassification rates (MR & AMR) to compare model performance. In general, there is little difference in performance between models fit to the original data and to the binned data, suggesting little information loss after binning.
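The weight-of-evidence (WOE) computation behind packages like “woeBinning” is simple to sketch. The project used the R package; below is a hedged Python analogue on invented loan data, using quartile bins in place of the package’s automatic bin boundaries.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({"duration": rng.integers(4, 72, 500)})  # hypothetical loan term
# Toy default flag whose probability rises with duration.
df["bad"] = (rng.random(500) < 0.1 + 0.005 * df["duration"]).astype(int)

df["bin"] = pd.qcut(df["duration"], q=4)                   # coarse quartile bins
grp = df.groupby("bin", observed=True)["bad"].agg(["sum", "count"])
bad = grp["sum"]
good = grp["count"] - grp["sum"]
# WOE per bin: log of (share of goods) over (share of bads).
woe = np.log((good / good.sum()) / (bad / bad.sum()))
print(woe)
```

Replacing the raw variable with its per-bin WOE value is what the scorecard models then consume; the information-loss question above is whether this coarsening hurts downstream fit.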
Sushmita Sen, Digit Recognition with Machine Learning, July 2017, (Yan Yu, Liwei Chen)
Computer vision is a subject that piques everyone’s interest. As humans, we learn to see and identify objects very early and give little thought to the process, yet in the background an immensely complex architecture of neurons carries out this task. Many fields of study, machine learning and pattern recognition among them, have emerged in pursuit of replicating it. In this project, I identify images of handwritten digits from the very popular MNIST dataset using two widely used classifiers, support vector machines and neural networks, and compare their results in this document.
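A minimal version of the SVM experiment can be run on scikit-learn’s bundled 8x8 digits set, which stands in here for the larger MNIST data; the hyperparameters below are illustrative, not the project’s tuned values.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)          # 1797 images of digits 0-9
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# RBF-kernel SVM; gamma and C would normally be tuned by cross-validation.
clf = SVC(kernel="rbf", gamma=0.001, C=10).fit(X_train, y_train)
print(round(clf.score(X_test, y_test), 3))   # held-out accuracy
```

The same train/test split can then be reused with `sklearn.neural_network.MLPClassifier` to compare the two families on equal footing.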
Abhishek Rao, NBA’s Most Valuable Player of 2017, July 2017, (Yan Yu, Liwei Chen)
Analytics and sports have gone together for a while now, and with advancements in sports technology, the application of analytics to sports grows with every passing day. Much of the decision making in scouting, recruiting, and coaching in this age depends on how teams crunch their numbers. While uncertainty is part of what makes sports compelling, an increasing number of people have become proponents of analytics applied to basketball. The nature of the sport makes it very suitable for statistical analysis: the plethora of variables and their interrelationships reveal some important facets of the game. Although it is difficult to evaluate an individual’s ability through analysis of a team game, such analysis reveals things that would not be noticed by plain sight.
Gautam Girish, Predicting Wine Quality, July 2017, (Peng Wang, Yan Yu)
Wines have been produced across the world for hundreds of years, yet there are significant differences in quality that may be due to several factors, ranging from alcohol content and pH to fixed and volatile acidity. In this paper, I predict the quality of wine from several of these factors. Different modeling techniques are compared to find the best predictor of quality: linear regression, generalized additive models, and regression trees, along with ensemble methods such as boosting and random forests. Principal component analysis is also performed to try to improve model performance. The dataset was obtained from Kaggle, and all analysis and model building were done in R with the necessary packages. The R-squared values obtained on the test dataset are used as the metric for model comparison.
Keerthana Regulagedda, Diabetes Prediction In Pima Indian Women, July 2017, (Yan Yu, Michael Magazine)
The objective of the project is to predict diabetes in Pima Indian women based on different diagnostic measures. Because the dataset is small and has missing values in some variables, models are built using algorithms that are robust to missing data. First, data exploration is performed: all predictor variables are analyzed, and correlations and patterns in the data are noted. Based on this preliminary analysis, variables are selected and an initial prediction model is built with logistic regression after removing the records with missing data. Removing those records discards half of the information, and as a result the logistic regression model predicted poorly, with an AUC of 0.6 and a misclassification rate of 0.54. CART and gradient boosting, classification algorithms that handle missing data well, are then implemented and their performance metrics calculated. Missing-data imputation is also performed, and the effects of imputation on variable distributions are studied. Finally, models are built on the completed data to see whether prediction accuracy improves after imputing the missing values.
Rajarajan Subramanian, Predicting Employee Attrition Using Data Mining Techniques, July 2017, (Yan Yu, Edward Winkofsky)
For any organization, human resources form one of the pillars that ensure its sustainability in the market. Employee satisfaction and attrition are two critical factors that impact its growth in the near future, with positive and negative influences respectively on both the organization’s success and its existing workforce. Using statistical modeling, it is possible to determine the various causes and factors that lead to employee attrition and to predict whether an employee will leave the organization. The objective of this study is to compare various predictive models and identify the best among them for predicting employee attrition. A fictional dataset from Kaggle, created by IBM data scientists, is used for the study. The models built include logistic regression, a classification tree, a generalized additive model, random forest, and support vector machines. All models are evaluated on their out-of-sample prediction performance, with misclassification rate, cost, and area under the ROC curve (AUC) as the metrics for comparison.
Rohit Khandelwal, Comparison of Movie Recommendation Systems, July 2017, (Yan Yu, Peng Wang)
Recommendation systems have become a key tool in marketing and CRM strategies of companies in all spheres of life. This project aims to build a movie recommendation system that will use users’ ratings of movies to recommend movies they are likely to watch. Various models have been built to predict ratings and recommend movies accordingly and the results from these models are compared. A good model is one that not only predicts the ratings right but also has high precision and recall and makes recommendations in the right order.
Andrew Garner, Coffee Brand Positioning from Amazon Reviews, July 2017, (Yan Yu, Roger Chiang)
Online reviews contain rich information about how customers perceive brands in a product category, but the information can be difficult to extract and summarize from unstructured text data. Text mining and machine learning are applied to Amazon reviews to map the brand positioning of coffee companies. Specifically, a two-dimensional map places companies with similar Amazon reviews close together. This was accomplished by cleaning the text data, training a word2vec model to create a numeric representation of the review text, and applying t-SNE to reduce the high dimensional data to a two-dimensional map. Hierarchical clustering was used to label brands with distinct clusters.
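The pipeline above (numeric text representation, 2-D projection, clustering) can be sketched compactly. The project used word2vec; the toy version below substitutes TF-IDF so the sketch stays self-contained, and the review snippets are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE
from sklearn.cluster import AgglomerativeClustering

reviews = [
    "smooth dark roast rich flavor", "rich dark bold roast",
    "mild light roast smooth", "light mild gentle flavor",
    "bitter burnt taste stale", "stale burnt bitter beans",
]
# Numeric representation of each review (TF-IDF standing in for word2vec).
X = TfidfVectorizer().fit_transform(reviews).toarray()
# t-SNE reduces the high-dimensional vectors to a 2-D "brand map".
coords = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(X)
# Hierarchical clustering labels similar reviews/brands.
labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)
print(coords.shape, labels)
```

In the actual analysis each point would be a brand (an aggregate over its reviews) rather than a single review, and perplexity would be far larger.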
Nitin Abraham Mathew, Lending Club Loan Default Analysis, July 2017, (Yan Yu, Liwei Chen)
Peer-to-peer lending platforms have become increasingly popular over the past decade. With relaxed rules and less oversight, the possibility of an investor losing money has greatly increased, creating a need to build a risk profile for each loan disbursed on these platforms. The objective of this project is to explore the application of different risk modeling techniques, along with techniques to tackle class imbalance, on financial lending data in order to maximize expected returns while minimizing expected variance, or risk.
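Class imbalance, defaults being rare relative to repaid loans, is central here. One common tactic, shown below on synthetic data as an illustration (the abstract does not specify which technique the project used), is to re-weight the minority class in the loss:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: ~5% "default" class.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)

# Recall on the rare class typically improves with re-weighting.
print(recall_score(y_te, plain.predict(X_te)),
      recall_score(y_te, weighted.predict(X_te)))
```

Other standard options include over-sampling the minority class (e.g., SMOTE) or adjusting the decision threshold on predicted probabilities.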
Dhanashree Pokale, Image Classification using Convolutional Neural Network and TensorFlow, July 2017, (Yan Yu, Dungang Liu)
The inspiration behind the image classification problem considered for this project is to employ deep learning techniques and advanced Python libraries such as TensorFlow to classify image data. The focus is on convolutional neural networks, which learn features from images layer by layer and exploit the fact that nearby pixels are more correlated than distant ones. A feed-forward neural network is fully connected and therefore fails to use this spatial correlation when classifying. With two convolutional layers, I achieved a classification accuracy of 89% on the Street View House Numbers dataset; with deeper architectures, accuracies of up to 97% can be achieved.
Matthew Wesselink, Analysis of NBA Draft Selections, July 2017, (Yichen Qin, Edward Winkofsky)
The NBA offseason is a short period from June to October each year when players and teams have the opportunity to regroup and improve their prospects for the coming season. Teams can do this in a number of ways: through free agency, the NBA draft, or outright trades. The focus of the following analysis is the NBA draft and the future performance of drafted players. By analyzing win shares, one measure of an individual player’s offensive efficiency, we can better project the value each draft selection will provide to his team. Multiple forms of regression were used to predict a player’s win shares from draft position. To evaluate the models, we analyzed AIC and BIC values, residuals, Cook’s distance, and leave-one-out cross-validation. Consistently, the logarithmic model outperformed the other forms of regression. Logarithmic regression models the average well but fails to predict an individual player’s success.
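The logarithmic model amounts to fitting win shares as a linear function of the log of draft position. A minimal sketch on made-up numbers (the true coefficients from the analysis are not reproduced here):

```python
import numpy as np

picks = np.arange(1, 31)                 # first-round draft positions
rng = np.random.default_rng(2)
# Hypothetical win-shares data that decay logarithmically with pick number.
win_shares = 25 - 7 * np.log(picks) + rng.normal(0, 2, picks.size)

# Fit win_shares = a + b * log(pick) by ordinary least squares.
b, a = np.polyfit(np.log(picks), win_shares, 1)
pred = a + b * np.log(picks)
print(round(a, 2), round(b, 2))          # slope b should be negative
```

The fitted curve is steep among the top picks and flat later, which matches the intuition that the value gap between picks 1 and 5 far exceeds that between picks 25 and 30.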
Angie Chen, Textual Analysis of Quora Question Pairs, July 2017, (Peng Wang, Dungang Liu)
Quora is an online platform that allows people to ask questions and connect with those who can share unique insights. The site’s mission is to distribute knowledge so that people can better understand the world. However, with the platform’s ever-growing popularity, many users submit similar questions, while a limited number of experts do not have time to answer multiple variations of the same question. Quora aims to allow experts to share knowledge in a scalable fashion, writing an answer once and disseminating the information to a wide audience. As a result, Quora wishes to focus on the canonical form of a question: the most explicit, least ambiguous phrasing. To address this dilemma, we used data analysis and modeling techniques to identify duplicate question pairs. Exploratory data analysis and text mining procedures were performed to develop a predictive model that classifies duplicate question pairs. Two types of ensemble learning procedures, the random forest algorithm and gradient-boosted trees, were attempted. Based on this research, an effective model was ultimately developed through sentiment analysis (positive or negative valence), evaluation of key question-pair characteristics (number of common words, difference in character length, similarity ratio), and gradient-boosted trees, yielding an accuracy of 70% on the test data. This solution can be used to efficiently focus on the canonical form of a question, facilitating high-quality answers and a better user experience on the platform.
Jainendra Upreti, Rossmann Store Sales Forecasting, July 2017, (Peng Wang, Dungang Liu)
For retail stores, sales are affected by a combination of factors such as promotional offers, the presence of competitors, assortment levels, and store types. It is very important for stores to understand how these parameters affect sales and to use that analysis to predict future sales. Predictive models based on these characteristics can forecast sales efficiently and accurately. These predictions help store managers measure store performance against key indicators and prepare, in advance, the measures needed to improve sales, for example introducing promotional offers or studying the competitive market.
In this paper, we cover the processes involved in building a model to forecast store sales over a given period based on certain attributes, using a store sales dataset from Kaggle. Different modeling techniques are explored: random forest, gradient boosting, and a time series linear model, all built in R. The techniques are trained on the training dataset using root mean square percentage error (RMSPE) as the evaluation metric and then used to forecast store sales. Since the test dataset does not contain sales values, prediction error is assessed by submitting the outputs to Kaggle.
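The RMSPE metric used for training and evaluation is easy to state precisely. A small Python sketch (the project itself was in R), following the common Rossmann-competition convention of skipping zero-sales rows to avoid dividing by zero:

```python
import numpy as np

def rmspe(actual, pred):
    """Root mean square percentage error over rows with nonzero actual sales."""
    actual = np.asarray(actual, dtype=float)
    pred = np.asarray(pred, dtype=float)
    mask = actual != 0                     # skip closed-store / zero-sales rows
    pct_err = (actual[mask] - pred[mask]) / actual[mask]
    return np.sqrt(np.mean(pct_err ** 2))

print(rmspe([100, 200, 0], [110, 180, 5]))  # -> 0.1 (10% average error)
```

Because errors are relative, a 10-unit miss on a small store counts for more than the same miss on a large one, which suits a chain with stores of very different sizes.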
Matt Policastro, District Configuration Analysis through Evolutionary Simulation, July 2017, (Peng Wang, Michael Magazine)
This capstone replicates a methodology for identifying biased redistricting plans in a new context. Rather than electoral districts, twenty-four of the City of Cincinnati’s fifty neighbourhoods across three of the five Cincinnati Police Department districts were chosen as the units of analysis. It should be noted that this project did not constitute a rigorous analysis of potentially-biased districting practices; instead, this project identified advantages, trade-offs, and other challenges related to implementation and analysis. While the results of the evolutionary algorithm-driven simulation suggested deficiencies in the current implementation, the underlying methodology is sound and provides a basis for future improvements in evaluation criteria, computational efficiency, and evolutionary operators.
Ritesh Gandhi, Gender Classification by Acoustic Analysis, July 2017, (Dungang Liu, Liwei Chen)
With the advent of machine learning techniques and human-machine interaction, automatic speech recognition is finding practical uses in today’s world. As a result, gender classification based on the acoustic properties of a speaker’s voice has applications in a range of fields. The process starts with extracting voice characteristics from large databases of human voice samples, then processing and analyzing those features to propose the best model for implementation in the respective systems. The purpose of this paper is a comparative study of gender classification algorithms applied to voice samples. Extreme gradient boosting (XGBoost), random forest, support vector machine (SVM), and neural network models are trained and their results compared to determine the best classifier for gender. For all models, average performance exceeds 95% and the misclassification rate stays below 5%. The final results suggest that random forest is the best of the classifiers evaluated for gender recognition.
Anurag Maji, Analysis of the Global Terrorism Data, July 2017, (Yichen Qin, Edward Winkofsky)
The problem we address is predicting whether an attack will result in casualties given the nature and characteristics of the attack. The dataset was obtained from Kaggle [10]. The motivation behind this project was to understand how terror attacks have spread over time and across regions, and which features have been key drivers of such incidents. A detailed analysis was done of the spatial and temporal characteristics observed in the majority of attacks. The target variable was converted from continuous to categorical, as it was deemed more important to know whether there would be civilian casualties than to know the magnitude of the damage to life.
Krishnan Janardhanan, Win Probability Model for Cricket, July 2017, (Peng Wang, Ed Winkofsky)
Cricket is a popular team sport played around the world between batters and bowlers. Each team has a limited number of resources in the form of wickets and balls. To understand the impact of those resources and the current situation of a game, a win probability model was created that estimates the probability of a team winning. Models were built using logistic regression, a classification tree with boosting, and local regression, and were compared on performance metrics such as area under the curve and misclassification rate. The most suitable win probability model was chosen and applied to a game to examine its predictive power. Win probability models may be used to evaluate player value and contribution, or by betting sites to calculate the odds of a team winning.
Kuldip Dulay, Fraud Detection, July 2017, (Yan Yu, Edward Winkofsky)
Credit cards have become an integral part of our financial system, and most people use them for their daily transactions. Given the huge volume of transactions that occur, it is very important to ensure that these transactions are valid and were performed by the cardholder.
With advances in machine learning algorithms, it has become possible to narrow the search for fraudulent transactions down to a very limited number of records that can later be verified manually. For this project, I implemented four such machine learning techniques to identify fraudulent transactions. Predictive models were built with each technique, and the best model was identified based on a few selected performance criteria.
Nidhi Mavani, How Can We Make Restaurants Successful Using Topic Modeling and Regression Techniques, July 2017, (Dungang Liu, Liwei Chen)
Yelp is an online platform, both website and app, where people write about their experiences at places they have visited. Yelp published data for a competition, covering businesses across the US, Canada, Germany, and the UK along with their check-ins and reviews. The objective of the project is to identify the factors that affect the business of restaurants across the midwestern US: the states of Ohio, Wisconsin, Illinois, and Pennsylvania. About 6.2K restaurants with 50 attributes and 2M reviews are analyzed. The analysis spans two spectrums: first, analyzing review text to identify the topics customers are most concerned about, using the topic modeling technique Latent Dirichlet Allocation (LDA); and second, finding the features of highly appreciated restaurants using logistic regression. Thus, both qualitative and quantitative data are analyzed to understand customers’ preferences.
Aditya Bhushan Singh, Price Prediction for Used Cars on eBay, July 2017, (Dungang Liu, Liwei Chen)
With the advent of e-commerce, everything from household items to cars is being made available online, and changing market trends indicate that many demographics now prefer to shop for almost everything from the comfort of their own homes. In this analysis, we predict the prices of second-hand cars advertised on eBay Kleinanzeigen based on various attributes of the cars made available by the sellers. This will help prospective buyers obtain an accurate estimate of a car’s value while also helping sellers price their cars at an optimal level. To accomplish this, different data mining algorithms are applied and evaluated to identify the best solution for this problem. Once the best solution is established for this use case, it can easily be transferred to other products sold across the e-commerce spectrum.
Ishant Nayer, Airbnb Open Data for Boston, July 2017, (Yan Yu, Yichen Qin)
The purpose of the analysis was to show how Airbnb is really being used and how it affects neighbourhoods. By analysing reviews from Airbnb’s own data, we can judge which areas are most popular, which apartment types are most commonly used, and how the listings are reviewed. Airbnb started an open data initiative in which it disclosed data by location. The data were analysed in R, and visualizations such as Google Static Maps and word clouds were integrated into a sentiment analysis to highlight sentiment by location across the Boston area. This kind of analysis gives a holistic view in which the vibe of a neighbourhood can be picked up using analytics, and recommendations can be made to Airbnb or to the people who list their places on its website. Acting on these recommendations, Airbnb can improve its service, leading to happier customers and thus better business. The sentiment analysis was presented using faceted vertical bar graphs, word clouds, horizontal bar charts, and similar visuals, and shows that overall there is a positive vibe to the listings in Boston, MA.
Yash Sharma, Image Recognition, July 2017, (Bradley Boehmke, Liwei Chen)
Computer vision deals with the automated extraction, analysis, and understanding of information from images. The field has enormous use cases, and numerous organizations such as Google, Tesla, Baidu, and Honeywell have invested significant resources into research and development of computer vision technologies, which can be applied to autonomous vehicles, language translation, wildlife conservation, medical solutions, forensics, census work, and many other fields. Character recognition can be taken as a first step into computer vision. This project leverages data from a Kaggle competition in which 42,000 labeled samples of handwritten digits were provided and participants built models to accurately recognize the digits 0 through 9. Machine learning techniques including principal component analysis, random forests, and artificial neural networks were used to build models trained to identify handwritten numbers. Predictions from each of these models were compared, and metrics such as precision, recall, and F1 score were used to judge the accuracy of the predictions.
Aditya Kuckian, Loan Default Prediction, June 2017, (Dungang Liu, Liwei Chen)
Loan default is one of the most common problems banks face today across their assets, and it worsens during economic downturns. This project comes from an online competition on HackerEarth in which a bank wants to control its non-performing assets by identifying, in time, the propensity of loan default among applicants. The data provided related to loan applications, customer engagement and demographics, and credit information, with roughly 532K records and 45 features. The scope of the project was to identify the characteristics of loan defaulters for credit card and house purchases, information extracted from the ‘purpose’ column in the data. Machine learning classification models, logistic regression and a gradient boosting machine, were used for this purpose, and their predictions were compared on concordance and area under the curve (AUC). Both techniques identified similar distinguishing characteristics among factors such as number of inquiries and credit lines, a delinquency metric, verification status, and grade. The variable ‘total interest received till date’ showed contrasting behaviour. The gradient boosting machine significantly improved the predictions for credit card defaults.
Sanket Purbey, Spotify Tracks Clustering and Visualization: Creating playlists by preferentially ordering audio features, April 2017, (Jeff Shaffer, Michael Fry)
Spotify is a leading music streaming service with the majority of its revenue coming from paid subscriptions. As of 2017, Spotify still isn’t profitable. Its revenue model adopts Open Music Model with $9 monthly unlimited subscription fees. To enhance subscription, Spotify offers several recommendation services and pre-built playlists for various moods and occasions. This study focuses on providing subscribers a more customized way to explore their music and easily create their own playlists through clustering preferentially on chosen audio features reflective of their mood. The study details the data sources and storage to facilitate the analysis. Exploratory Data Analysis is performed to identify suitable audio features for this clustering study. It then compares different clustering methods elaborating on BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) as a preferred choice. The study then emphasizes on the user experience and visualization aspects of the hierarchical clustering using D3.js JavaScript library offering a potentially new way to navigate music choices.
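The BIRCH step can be sketched with scikit-learn (the study used D3.js for the visualization layer; this Python fragment covers only the clustering). The feature names valence and energy are real Spotify audio features, but the values below are made up.

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(3)
# Stand-in audio features per track: [valence, energy] in [0, 1].
features = np.vstack([
    rng.normal([0.8, 0.9], 0.05, (40, 2)),   # upbeat tracks
    rng.normal([0.2, 0.2], 0.05, (40, 2)),   # mellow tracks
])

# BIRCH builds a clustering-feature tree incrementally, then agglomerates
# the subclusters into the requested number of final clusters.
birch = Birch(n_clusters=2, threshold=0.2).fit(features)
print(np.bincount(birch.labels_))            # two playlist-sized groups
```

BIRCH’s incremental tree makes it attractive here: a subscriber’s library can be clustered without holding all pairwise distances in memory, and `threshold` controls how fine-grained the subclusters are before the final merge.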
Pradyut Pratik, Movie Recommendation Engine: Building A Movie Recommendation Engine to Suggest Movies to Users, April 2017, (Peng Wang, Ed Winkofsky)
A recommendation engine is a tool or algorithm that makes suggestions to the user. Recommendation engines built on statistical techniques are widespread in e-commerce: they solve the problem of connecting existing users with the right items in a massive inventory of products or content. Amazon reported that 29% of its 2016 sales came from cross-selling, and Netflix paid $1 million to the team that improved its recommendation accuracy by 10%. The idea behind this project is to study the different statistical methods used in recommendation engines and to better understand the collaborative filtering and content-based filtering techniques they employ.
Zhengquan Wang, Modeling Bankruptcy Probability: A Study of Seven Major Banks, April 2017, (Dungang Liu, Liwei Chen)
The goal of this study is to model the bankruptcy probabilities of seven major banks under different economic scenarios. The balance sheet, income, and bankruptcy risk are analyzed so that the banks’ capital status, operational risks, and default probabilities can be evaluated. The seven banks investigated are: 1) JPMorgan Chase (JPM), 2) Bank of America (BAC), 3) Citigroup (Citi), 4) Wells Fargo (WFC), 5) Goldman Sachs (GS), 6) Morgan Stanley (MS), and 7) U.S. Bank (USB). The balance sheet and earnings data were collected from the 10-K documents published by these banks.
The balance sheet analysis is performed in terms of the ratios of total bank liabilities to gross domestic product (GDP) and of total bank assets to GDP. Results indicate that these ratios declined after the 2008 financial crisis. The revenue and net income analysis reveals that both declined significantly for most banks during the crisis, recovered afterward, and have remained stable since. A logistic regression model is built with the goal of predicting the banks’ bankruptcy probabilities. The approach borrows information from an existing dataset of corporate bankruptcies in other industries, based on the assumption that bank failures may follow a mechanism similar to that of corporations in other industries. Results of this study show that a bank’s liability ratio (liability/asset) has a significant impact on bankruptcy.
Shreya Ghelani, Product Recommendation for Customers of a Bank, April 2017, (Yan Yu, Edward Winkofsky)
Recommendation systems can enhance customer engagement by not only providing selective offers which can be highly appealing to the customer but also by adopting targeted marketing and advertising efforts towards potential customer segments and thereby achieving cost efficiency. The objective of this analysis is to look at customer purchasing behavior of financial products at a bank and predict the new products that customers are likely to purchase thereby recommending those products to the customers. With a more effective recommendation system in place, the bank can better meet the individual needs of all customers and ensure their satisfaction. Different data mining classification algorithms are tried and compared to identify the best model for such a problem.
Arun Yadav, Application of Predictive Modeling in Loan Portfolio Underwriting, April 2017, (Yan Yu, Edward P. Winkofsky)
In financial institutions, a decision on an online loan application is made within a matter of seconds. Predictive models based on a large volume of consumer characteristics are used to make the decision efficiently and accurately. This prevents loans to people susceptible to bankruptcy, avoiding bad debt; on the other hand, an accurate model also ensures that good credit-risk customers are not erroneously rejected, leading to increased profits. In this paper, we cover the processes involved in building a model to predict the probability of customer loan default in financial institutions, using a financial transaction dataset from Kaggle. Different machine learning techniques are explored: logistic regression, random forest, gradient boosting, and neural networks. All models are built using the ‘sklearn’ package in Python.
The different modeling techniques are validated on out-of-sample data using the Area Under the ROC (Receiver Operating Characteristic) Curve (AUC) and the Kolmogorov–Smirnov (KS) statistic as performance metrics. Logistic regression and the neural network have comparable performance; however, logistic regression is chosen as the final model considering model complexity.
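Both validation metrics can be computed directly from scores and labels. A small self-contained illustration (the scores and labels below are synthetic, not the paper's data):

```python
import numpy as np

def auc_score(y, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = (y == 1).sum()
    n_neg = len(y) - n_pos
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def ks_stat(y, scores):
    """Maximum vertical gap between the score CDFs of the two classes."""
    thresholds = np.unique(scores)
    cdf_pos = [(scores[y == 1] <= t).mean() for t in thresholds]
    cdf_neg = [(scores[y == 0] <= t).mean() for t in thresholds]
    return max(abs(p - n) for p, n in zip(cdf_pos, cdf_neg))

y = np.array([0, 0, 0, 1, 0, 1, 1, 1])
s = np.array([0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.9])
print(auc_score(y, s), ks_stat(y, s))  # 0.9375 0.75
```

The rank-based AUC avoids enumerating all positive-negative pairs; the KS statistic is the usual scorecard separation measure between goods and bads.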
Mitchell Garner, Analysis of McDonald’s Made for You Production System: Current Barriers, April 2017, (Amitabh Raturi, Katie Blankenship)
With the introduction of ADB 2.0, the second phase of All Day Breakfast at McDonald’s, interesting challenges have been introduced into the kitchen systems of the iconic QSR. Here the barriers are analyzed using experimental design, statistical inference, queueing analysis, and simulation modeling. These methods not only search for and analyze the potential barriers but also prescribe temporary and permanent fixes that Owner/Operators may adopt to improve operations and sales. Data for this study were collected in a manner that generalizes the results to as wide an audience as possible. Prior beliefs about assembly barriers are tested against these data, and models are constructed that integrate the data with knowledge of the system. Recommendations include planning lower-volume kitchen positioning around the product mix of the hour, utilizing positioning standards in higher-volume stores to their fullest, and using the statistical evidence to motivate collaboration with McDonald’s on solution development.
Hui Wang, Spline Regression Modeling and Optimization Analysis of Floor Space, April 2017, (Yichen Qin, Michael Fry)
The purpose of this project is to maximize total net sales by modifying the floor space allocated to different product types for a large retail chain. To achieve this objective, spline regression models were built to represent the relationship between floor area and productivity for each product. The model-building results showed that most of the retailer’s products follow the anticipated “U”-shaped curve: as selling area increases, productivity initially decreases and then increases; beyond a certain threshold, the increase stops and productivity either plateaus or begins to decrease again. A non-linear optimization model was then built on top of the spline regression models with the objective of maximizing a store’s total net sales. The optimization was designed at three levels: optimizing the selling area of products across a whole store; optimizing the floor area of products within each floor of a store; and optimizing the selling space within each product category in a store. The output of the optimization analysis is the optimal floor area assigned to each product group. These analyses and recommendations serve as a valuable reference for the retailer on whether and how to adjust the floor space of its products to maximize total net sales.
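A minimal sketch of the spline-regression idea, using a piecewise-linear (hinge) basis and least squares on synthetic data; the knot locations and the shape of the curve are illustrative, not the retailer's:

```python
import numpy as np

def hinge_basis(x, knots):
    """Piecewise-linear spline basis: intercept, x, and (x - k)+ terms."""
    cols = [np.ones_like(x), x] + [np.maximum(x - k, 0.0) for k in knots]
    return np.column_stack(cols)

# Illustrative data: productivity first falls, then rises, then plateaus,
# mimicking the "U"-shaped relationship described in the abstract.
rng = np.random.default_rng(1)
area = rng.uniform(0, 100, 300)
prod = np.where(area < 30, 50 - area, 20 + (np.minimum(area, 80) - 30) * 0.8)
prod += rng.normal(0, 1.0, 300)

knots = [30, 80]
B = hinge_basis(area, knots)
coef, *_ = np.linalg.lstsq(B, prod, rcond=None)

# Evaluate the fitted curve before, at, and after the first knot.
fitted = hinge_basis(np.array([10.0, 50.0, 90.0]), knots) @ coef
print(fitted.round(1))
```

The fitted coefficients on the hinge terms let the slope change at each knot, which is what lets the subsequent optimization find the area where productivity turns around.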
Todd Eric Hammer, Simulation Study: Waiting Times During Checkout at Non-Profit Tag Sale, March 2017, (W. David Kelton, Edward Winkofsky)
The purpose of this simulation is to study the waiting times of buyers during a public consignment sale held by a non-profit organization. During a substantial portion of the 4-hour public sale, the lines become very long, and the non-profit worries that some people will abandon their purchases, costing it the commission from those sales. A simulation model was created in Rockwell’s Arena software to model the sale and modifications to it, to determine whether different configurations would reduce buyers’ waiting time. The study focused primarily on adding more helpers and tables during the checkout process, and found that adding more tables was the most effective way to shorten lines and reduce the time a buyer spends in checkout.
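The core comparison can be sketched without Arena as a simple discrete-event loop; the arrival and service rates below are invented for illustration, not measured at the sale:

```python
import random

def mean_wait(n_customers, arrival_rate, service_rate, n_tables, seed=42):
    """FIFO checkout with parallel tables; returns the average queue wait."""
    rng = random.Random(seed)
    free_at = [0.0] * n_tables   # time at which each table next becomes free
    t, waits = 0.0, []
    for _ in range(n_customers):
        t += rng.expovariate(arrival_rate)           # next buyer arrives
        i = min(range(n_tables), key=free_at.__getitem__)
        start = max(t, free_at[i])                   # wait if all tables busy
        waits.append(start - t)
        free_at[i] = start + rng.expovariate(service_rate)
    return sum(waits) / len(waits)

# Same random draws, one extra table: average waiting time drops.
print(mean_wait(2000, 1.0, 0.4, 3), mean_wait(2000, 1.0, 0.4, 4))
```

Using a common random seed for both configurations mirrors the common-random-numbers technique used to sharpen simulation comparisons.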
Dan Shah, Developing a Health Score for Network Traffic, March 2017, (Yan Yu, Mukta Phatak)
GE has over 30 locations globally within its IT infrastructure deemed “mission-critical” to ongoing operations. Many of these locations are seeing unacceptably high variability in network latency, but limited information makes it difficult for the network engineering team to determine root cause. A “health score” implemented in a visual dashboard will help GE Digital understand current and past network health and prioritize improvements. In this paper, network traffic across sites is analyzed to generate a framework for assigning a score to a location’s network traffic. The parameters of the framework are generated through simulation and client input, and a prototype implementation is established. Additionally, based on further research and simulation, alternative methods for outlier detection, including cumulative-sum (CUSUM) and exponentially-weighted-moving-average (EWMA) control charts, are compared, and a path forward for the diagnostic tool is proposed.
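An EWMA control chart of the kind compared above can be sketched in a few lines; the latency series, baseline, and shift below are invented for illustration:

```python
import numpy as np

def ewma_alarms(x, mu, sigma, lam=0.2, L=3.0):
    """Flag indices where the EWMA of x drifts beyond +/- L-sigma limits.

    mu and sigma describe the in-control baseline; lam is the smoothing
    weight and L the control-limit width (common textbook defaults).
    """
    z, alarms = mu, []
    for i, xi in enumerate(x):
        z = lam * xi + (1 - lam) * z
        # Exact EWMA variance after i+1 observations.
        var = sigma**2 * (lam / (2 - lam)) * (1 - (1 - lam) ** (2 * (i + 1)))
        if abs(z - mu) > L * np.sqrt(var):
            alarms.append(i)
    return alarms

# Illustrative latency series: stable near 20 ms, sustained shift at index 30.
rng = np.random.default_rng(7)
latency = np.concatenate([rng.normal(20, 2, 30), rng.normal(26, 2, 20)])
baseline = latency[:30]
print(ewma_alarms(latency, baseline.mean(), baseline.std(ddof=1)))
```

Because the EWMA pools recent observations, it catches a sustained latency shift that individual points within normal range would not trigger.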
Uma Lalitha Chockalingam, Customer Churn Propensity Modelling, August 2016, (Dungang Liu, Edward Winkofsky)
Churn is a measure of subscription termination by customers. Churn incurs a loss when investments are made on customers with a high propensity to leave; churn propensity models can help improve the customer retention rate and hence increase revenue. This paper focuses on the churn problem faced by companies and on predicting customer churn by building churn propensity models. Data for this project are taken from the IBM Watson Analytics Sample Datasets, which contain around 7,043 instances of telecommunication customers’ churn data. Churn propensity models are built using techniques such as logistic regression, support vector machines, neural networks, random forests, and decision trees. Comparing model performance shows that for out-of-sample prediction, neural networks, logistic regression, and random forests perform best. While neural networks and random forests are black-box algorithms, logistic regression gives good insight into which predictor variables are effective in modelling churn. The random forest's in-sample misclassification rate is nearly zero, indicating overfitting to the training data. Hence logistic regression is recommended, owing to good out-of-sample prediction performance along with insight into which predictor variables are significant to the model.
Mohan Sun, Customer Analytics for Financial Lending Industry, November 2016, (Zhe Shan, Peng Wang)
This research examines customers’ experience, attributes, and performance to help the company make better decisions for increasing profit across service, origination, and collection. This was done by examining different datasets, such as customer attribute and performance data, using tools such as SAS and GIS, and performing various analyses. Upon examination of these datasets, it becomes clear that answering customers’ phone calls promptly, targeting customers and locating stores in areas with a high demand index, and tracking store and customer performance over time will help the company understand its operation and increase profit.
Wenwen Yang, P&G Stock Price Forecasting using the ARIMA Models in R and SAS, December 2016, (Yichen Qin, Dungang Liu)
Time series analysis is commonly used in economic forecasting as well as for analyzing climate data over long periods of time. It helps identify patterns in correlated data, understand and model the data, and predict short-term trends from previous patterns. The aim of this paper is to present a concise demonstration of one of the most common time series forecasting approaches, the ARIMA model, in both R and SAS. The daily stock prices of Procter & Gamble from January 1, 2013 to September 30, 2016 (693 points) are used as an example. Autocorrelation and partial autocorrelation function plots, along with the Akaike Information Criterion (AIC), were used to examine the adequacy of the model. The daily stock prices from October 1, 2016 to November 4, 2016 (25 points) were used to test the model’s performance by calculating forecast accuracy. The time series modeling was first conducted in R and then validated using SAS. The final model was identified as a moving average model on the first-order difference. The AIC was 1494 and the average accuracy was 97%, suggesting that the ARIMA model performs well for short-term prediction. In addition, a log transformation, often preferred in economic prediction analysis, was performed and yielded the same modeling results. In conclusion, this paper demonstrates a comprehensive time series analysis in R and SAS that could serve as useful documentation for beginners.
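The ARIMA mechanics of difference-then-model-then-integrate can be sketched on a toy series. The paper's final model was an MA(1) on first differences; the sketch below uses an AR(1) on differences instead, purely because it fits in one least-squares line, and the series is constructed, not P&G data:

```python
import numpy as np

# Toy series whose first differences follow d_t = 0.5 * d_{t-1} exactly,
# standing in for a price series after one order of differencing.
d = 0.5 ** np.arange(40)
price = 100 + np.cumsum(d)

# Fit the AR(1) coefficient on the differenced series by least squares.
diff = np.diff(price)
phi, *_ = np.linalg.lstsq(diff[:-1, None], diff[1:], rcond=None)

# One-step-ahead forecast: predict the next difference, then integrate.
next_diff = phi[0] * diff[-1]
forecast = price[-1] + next_diff
print(round(phi[0], 4))
```

In practice the fitting, AIC comparison, and forecasting would be done with R's `arima` or SAS PROC ARIMA as in the paper; the toy only shows why the first-order difference is taken and undone around the model.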
Scott Woodham, Time Series Analysis Using Seasonal ARIMAX Methods, December 2016, (Yan Yu, Martin Levy)
The goal of this analysis is to develop a model that forecasts sales using time series methodology. First, the ARIMA and SARIMA models are developed in their polynomial forms. Second, the process of developing a model from start to finish is performed, addressing issues such as stationarity of the data and interpreting the ACF and PACF plots to infer the model orders before estimating the parameters. After forecasts are made with the multiplicative seasonal model, the model is extended with an exogenous variable (SARIMAX) to enhance performance. The current heuristic used to predict sales is the value of sales from a week prior (a lag-6 value). The final model selected has both AR and MA seasonal and non-seasonal components as well as a binary indicator variable. This is sometimes referred to as intervention analysis, although that term is not used here: it usually implies a large sudden shift from which the system recovers, whereas in this analysis the data are more sinusoidal as sales shift from weekdays to weekends.
Jin Sun, Internship at West Chester Protective Gear, December 2016, (Yichen Qin, Yan Yu)
West Chester Protective Gear (WCPG), founded in 1978, is a known leader in the marketplace for high-performance protective gear for industrial, retail, and welding customers. From gloves to rainwear to disposable clothing, WCPG offers a wide range of quality products, including core, seasonal, and promotional products, and is one of the largest glove importers in the United States. This capstone is composed of five projects, most of which are interactive reports made with Microsoft Power BI, a cloud-based business analytics tool. The Order Picking and Fill Rate reports greatly increase the work efficiency of the Warehouse Department, and the reports for the Purchasing Department provide another view of the sales data, which will help the company make a better inventory plan. The last project analyzes the relationship between average sales price and sales units. A linear regression model is built to explain how a change in price affects sales units; model diagnostics are conducted, and model performance is evaluated on a hold-out sample. Throughout this internship, I applied the knowledge from the MSBA program to real-world problems.
Ally Taye, Predicting Hospital Readmissions of Diabetes Patients, December 2016, (Yichen Qin, Yan Yu)
Diabetes is an increasingly common disease among the U.S. population. According to the CDC, the number of people diagnosed with diabetes increased fourfold from 1980 to 2014. In addition, if not well controlled, diabetes can lead to serious complications such as cardiovascular disease, kidney disease, peripheral artery disease, and many others that can result in hospitalization or even death. In light of the seriousness of this condition, it is worth looking into the causes of hospitalization of diabetes patients and what factors influence whether they stay healthy enough to avoid future hospitalizations. This paper examines a de-identified dataset of diabetes patients admitted to hospitals across the U.S. over an extended period, analyzing multiple variables from their hospital records to determine whether there is a statistically significant relationship between any combination of these factors and readmission during the observation period after the initial visit.
Hamed Namavari, Disney Princess: Strong and Happy or Weak and Sad, A Sentiment Analysis of Seven Disney Princess Films, December 2016, (Michael Fry, Jeffrey Shaffer)
In a world predominantly run by men, several researchers have suggested that entertainment content is shaped more by male influence than female influence (see Friedman et al.). But what if the general male dominance of the study context is eliminated from the research process? In this capstone, the significance of the main female characters in a selection of Disney Princess films is explored by comparing their scripts to those of the other main character in each film, who is never female. The research completed here supports the idea that the Disney princess characters are the most positive and most frequently speaking characters in their films.
Pramit Singh, Sentiment Analysis of First Presidential Debate of 2016, December 2016, (Amitabh Raturi, Aman Tsegai)
Datazar is a platform that uses open data to generate meaningful insights. The sentiment analysis was performed as part of a scalable plan that would allow analysts to reuse the analysis to calculate sentiment scores from Twitter feeds. It was performed after the first presidential debate to capture the public mood on social media: the tweets were classified as positive, negative, or neutral, and a sentiment score was calculated for each of the presidential candidates.
In addition, logistic regression and random forest techniques were used to predict negative sentiment. While this was implemented for the presidential debate, the functions are reusable and can be used to score any other brand, ensuring the process is scalable and reusable.
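The scoring step of such a pipeline can be sketched with a lexicon-based scorer; the word lists below are tiny illustrative stand-ins for a real sentiment dictionary (such as AFINN or Bing Liu's lexicon), not the project's actual resources:

```python
# Minimal lexicon-based sentiment scoring for short texts such as tweets.
POSITIVE = {"great", "win", "strong", "good", "best"}
NEGATIVE = {"bad", "lose", "weak", "worst", "wrong"}

def sentiment(tweet):
    """Return (score, label): score is +1 per positive word, -1 per negative."""
    words = tweet.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    return score, label

tweets = ["Great debate a strong win", "Worst answers all wrong", "They spoke"]
for t in tweets:
    print(sentiment(t))
```

A candidate-level score is then just an aggregate of these per-tweet scores; the supervised models mentioned above would replace the hand-built lexicon with learned weights.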
Huangyu Ju, Regression Analysis for Exploring Contributing Factors Leading to Decrease of Cincinnati Opera Attendance, December 2016, (Dungang Liu, Tong Yu)
In recent years, there has been growing concern about the diminishing audience for opera nationwide. Cincinnati Opera has faced a steady decline in total audience attendance over the past decade. Its audience comprises subscribers and single-ticket buyers; while both groups are shrinking year by year, the number of single-ticket buyers is not decreasing as rapidly as the number of subscribers. The goal of this project is to explore and identify the variables that may influence, in particular, the number of subscribers. Regression analysis is adopted to explore the contributing factors that impact audience attendance. The analysis identified four categories of variables: variables related to the origin of the opera piece, variables regarding the show time, variables related to popularity, and the theatre capacity. To boost attendance, it is recommended that opera pieces with a European background and a good reputation be included in each season’s program, and that more performances be scheduled on weekends, when they can attract larger audiences.
Darryl Dcosta, Analysis of Industry Performance for Credit Card Issuing Banks, August 2016, (Dungang Liu, Ryan Flynn)
Argus Information and Advisory Services, LLC, is a financial services company that uses credit-card-level transaction data collected from different banks and credit bureaus to offer analytical services to credit card issuers. Argus possesses transaction, risk, behavioral, and bureau-sourced data covering roughly 85-95% of the banks in the US and Canada. The dataset contains transaction-level data provided by nearly 30 banks across 24 months, totaling more than 3 million records. This study looks at how Argus can offer an early-bird analysis of variance in industry performance while abiding by legal regulations that prevent the company from revealing more than a certain level of data, which would pose a threat of price fixing by the client bank. Data are pulled from tables containing different dimensions in the SQL Server database and aggregated to produce client-level reports. After the transaction-level bank data were loaded, validated, normalized, and queried, the analysis showed that the projections were fairly consistent with the observed industry trend, a good indication of their accuracy. Client banks use the flash report to tailor their revenue model and customer acquisition strategy. The spike in total new accounts in the industry in March 2016 was not captured by the projection and would need to be revisited from a business point of view.
Shivaram Prakash, Predicting Online Purchases at Navistone®, August 2016, (Efrain Torres, Dungang Liu)
E-commerce, the relatively new platform for online retail sales, has seen burgeoning usage since its inception. Although dominated by giants like Amazon and eBay, almost all businesses have their own online store or website, which contributes a sizable share of total revenue. Navistone® collects visitor browsing behavior data and analyzes patterns to predict prospective buyers for its clients. The objective of this exercise is to analyze the browsing behavior of online visitors in order to predict whether each visitor will make a purchase. To achieve this goal, visitor browsing data are collected from various client websites, checked for erroneous entries, cleaned, and analyzed. Binary response models are then fit on a reduced, choice-based sample (to enable better prediction). The first model, a classification tree, helps management understand the importance of the different features in the dataset, while the second, a logistic regression model, predicts the response better than the classification tree. The logistic regression model produces better predictions in both the training and testing datasets, and the classification tree provides evidence that the number of carts opened is the most important variable, prompting management to focus marketing efforts on visitors who put items in a cart and then abandon it.
Lavneet Sidhu, Predicting YELP Business Rating, August 2016, (Yan Yu, Glenn Wegryn)
Sentiment analysis, or opinion mining, is the computational study of people’s opinions, sentiments, attitudes, and emotions expressed in written language. It has been one of the most active research areas in natural language processing and text mining in recent years. Its popularity is due to its wide range of applications, because opinions are central to almost all human activities and are key influencers of our behaviors: whenever we need to make a decision, we want to hear others’ opinions. The focus of this study is to quantify people’s opinions on a numerical scale of 1 to 5. Various predictive models were explored and their performance evaluated to determine the best model. Attempts were made to extract the semantic space of the reviews using latent semantic indexing (LSI). LSI finds “topics” in reviews, that is, groups of words having similar meanings or occurring in similar contexts. Similar reviews were then clustered into categories using this semantic space.
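The LSI step can be sketched as a truncated SVD of a term-document matrix; the tiny vocabulary and counts below are invented for illustration, not Yelp data:

```python
import numpy as np

# Toy term-document count matrix (rows = terms, columns = reviews).
# Reviews 0 and 1 share food-related terms; review 2 uses service terms.
terms = ["pizza", "tasty", "crust", "waiter", "rude"]
X = np.array([[2, 1, 0],
              [1, 2, 0],
              [1, 1, 0],
              [0, 0, 2],
              [0, 0, 1]], dtype=float)

# Truncated SVD: keep k latent "topics"; document vectors come from V.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
docs = (np.diag(s[:k]) @ Vt[:k]).T   # one k-dimensional vector per review

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# The two food reviews land close together in the latent semantic space.
print(cos(docs[0], docs[1]), cos(docs[0], docs[2]))
```

Clustering the `docs` vectors (for example with k-means) then groups reviews by topic rather than by exact word overlap, which is the property LSI is used for here.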
Ashok Maganti, Internship with Argus Information and Advisory Services, August 2016, (Harsha Narain, Michael Magazine)
Argus Information and Advisory Services is a leading provider of benchmarking, scoring, and analytics solutions for financial institutions. Argus helps its clients maximize the value of data and analytics to allocate and align resources to strategic objectives, manage and mitigate risk (default, fraud, funding, and compliance), and optimize financial objectives. One of Argus’s core competencies is linking a customer’s different accounts across financial institutions to form a complete view of the customer’s wallet: deposits, transfers, and spending can be linked so that the complete profile and spending behavior can be studied. The Wallet Analysis Team is responsible for the linkage and validation of the data. As part of the Data and Applications vertical and the Wallet Analysis Team, my primary objective was to study the concepts of record linkage and identity resolution, and to develop an algorithm that identifies unique customers from different data sources and populates them into a single normalized flat database using a deterministic record linkage process for the UK market. The records contain credit card account and customer details from different banks; they are linked and integrated to identify the same customer across banks and remove duplication. Beyond identifying customers’ accounts across banks, changes in customer details are captured and maintained in the integrated flat database using Type 2 slowly changing dimensions.
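The deterministic-linkage idea can be sketched as key normalization and exact matching; the field names and records below are invented for illustration and are not Argus's actual schema or rules:

```python
# Simplified deterministic record linkage: normalize identifying fields
# into a match key, then treat records sharing a key as one customer.
def match_key(record):
    name = "".join(record["name"].lower().split())
    dob = record["dob"]                           # e.g. "1984-07-02"
    postcode = record["postcode"].replace(" ", "").upper()
    return (name, dob, postcode)

bank_a = [{"name": "J. Smith", "dob": "1984-07-02", "postcode": "sw1a 1aa"}]
bank_b = [{"name": "j.smith", "dob": "1984-07-02", "postcode": "SW1A1AA"}]

linked = {}
for rec in bank_a + bank_b:
    linked.setdefault(match_key(rec), []).append(rec)

# Both source records collapse to a single customer entry.
print(len(linked))
```

A Type 2 slowly changing dimension would then store each historical version of a customer's details as a separate row with validity dates, rather than overwriting the linked record.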
Lian Duan, Fair Lending Analysis, August 2016, (Julius Heim, Dungang Liu)
The Consumer Financial Protection Bureau (CFPB) requires lenders to comply with fair lending laws, which prohibit unfair and discriminatory practices when providing consumer loans. Applicants’ demographic information usually may not be collected, yet it is needed to perform fair lending analysis. The objective of this project is to show that the race distribution is similar across the three bins of a predictor from our scorecard. Customers’ race categories were predicted from last names and residential locations according to the Bayesian Improved Surname Geocoding (BISG) proxy method published by the CFPB. Modifications in our analysis included using customers’ Core Based Statistical Area (CBSA) information instead of home addresses, and using R instead of STATA for data preparation and analysis. The predictor was evaluated based on the race distribution in each bin, and our results suggest that the race distributions across the three bins are indeed similar.
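The core of the BISG proxy is a Bayes-rule update of a surname-based prior by geographic composition; the probabilities below are made-up illustrations, not Census figures or the project's data:

```python
# Sketch of the BISG update: P(race | surname, area) is proportional to
# P(race | surname) * P(area | race). Numbers are illustrative only.
surname_prior = {"white": 0.60, "black": 0.25, "hispanic": 0.15}
area_share    = {"white": 0.20, "black": 0.70, "hispanic": 0.10}

posterior = {r: surname_prior[r] * area_share[r] for r in surname_prior}
total = sum(posterior.values())
posterior = {r: p / total for r, p in posterior.items()}

print(posterior)
```

With these illustrative numbers the geography dominates the surname prior; aggregating such posteriors over customers in each scorecard bin yields the race distributions compared in the project.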
Juvin Thomas George, Automation of Customer-Centric Retail Banking Dashboards, August 2016, (Andrew Harrison, David Bolocan)
Retail banking is a competitive arena focused on customer-centric service. Customers interacting with banks through multiple channels have created an explosion of data that banks use to generate insights into customer behavior. Understanding customer data is crucial to developing better products and services. Performing analytics on transactional data and running benchmarking studies require standard dashboards to be produced on a regular basis, so automating data input processes and dashboard updates is critical to on-time service. This capstone project was completed at Argus Information & Advisory Services, part of Verisk Analytics, located in White Plains, NY.
Sudarshan K Satishchandra, Prediction of Credit Defaults by Customers Using Learning Outcomes, August 2016, (Peng Wang, Yichen Qin)
Most financial services firms have realized the importance of analyzing credit risk. Predicting credit defaults with higher accuracy can save financial services firms a considerable amount of capital, and many machine learning algorithms can be leveraged to increase prediction accuracy. Popular and effective algorithms such as logistic regression, generalized additive models, classification trees, support vector machines, random forests, extreme gradient boosting, neural networks, and the lasso are apt for predicting credit defaults. These algorithms are compared using an asymmetric misclassification rate and AUC for out-of-sample prediction. Data from the UCI Machine Learning Repository, donated by I-Cheng Yeh of Chung Hua University, Taiwan, are used.
Rishabh Virmani, Kobe Bryant Shot Selection, August 2016, (Michael Magazine, Yichen Qin)
This report presents insights into Kobe Bryant’s shot selection throughout his career. The data comprise all of his career shots, with whether each shot went in as the response variable. We predict Kobe’s performance in the last two seasons of his career, that is, whether he actually sank each shot. For this purpose we employ three algorithms: Random Forest (a bootstrap aggregating technique), Support Vector Machine (a non-probabilistic technique), and XGBoost (a boosting technique).
Alicia Liermann, The Analytics of Consumer Behavior: Customer Demographics, August 2016, (Jeffrey Shaffer, Uday Rao)
This project focuses on consumer buying behavior in retail grocery stores across the United States. The data were obtained from historic Dunnhumby records generated by shopping cards and recorded coupon codes, accompanied by transaction information. The project was approached from a business sales and marketing orientation as a means to target customers and increase sales.
Sarthak Saini, Predicting Caravan Insurance Policy Buyers, August 2016, (Peng Wang, Glenn Wegryn)
This project analyzes customer data for an insurance company, aiming to predict whether a customer will buy caravan insurance based on demographic data and data on ownership of other insurance policies. The data consist of 86 variables, including product usage and socio-demographic data derived from zip codes, with 5,822 observations in the training set and 4,000 in the testing set. Predictive models were built to describe customer behavior and classify customers as potential buyers or non-buyers. Given that this is a classification problem, lasso logistic regression, classification trees, random forests, support vector machines (SVM), linear discriminant analysis (LDA), and quadratic discriminant analysis (QDA) were used, the latter three after dimension reduction by principal component analysis (PCA). Dimension reduction was employed because of the large number of predictor variables; the first twenty principal components were retained to build those models. The best results were obtained using LDA and SVM, with a misclassification rate as low as 7% on the testing data; dimension reduction significantly improved model performance.
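The PCA step used before SVM, LDA, and QDA can be sketched via the SVD of the centered data matrix; the synthetic data below stand in for the 86 customer variables and are not the caravan dataset:

```python
import numpy as np

def pca(X, k):
    """Project centered data onto its top-k principal components."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

# 200 "customers" x 10 correlated features driven by 2 latent factors.
rng = np.random.default_rng(3)
base = rng.normal(size=(200, 2))
X = base @ rng.normal(size=(2, 10)) + 0.05 * rng.normal(size=(200, 10))

Z = pca(X, 2)
explained = Z.var(axis=0).sum() / (X - X.mean(axis=0)).var(axis=0).sum()
print(Z.shape, round(explained, 3))
```

Here two components capture nearly all the variance because the data were built from two factors; in the project, twenty components were needed to summarize the 86 original variables.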
Abhishek Chaurasiya, Tracking Web Traffic Data Using Adobe Analytics, August 2016, (Dan Klco, Dungang Liu)
A website is the major channel of information and interaction between consumer and producer in any organization. It can be accessed by hundreds to millions of users, generating huge volumes of data containing important information about customer profiles, demographics, technology used, usage patterns, consumer trends, and so on. Tracking these data and reporting them in the desired format is therefore a substantial and important task. This project uses Adobe Analytics, along with Dynamic Tag Manager (DTM), to track and effectively report these data; the reports are then analyzed with their business value in mind. The analysis concludes that the author ‘Ryan McCollough’ garners the most views, around 90% of the total, through his posts. It also concludes that Twitter is the most effective social media channel, driving around 80% of the traffic that follows the blog.
Rutuja Gangane, Customer Targeting for Paper Towels – Trial Campaign, August 2016, (Sajjit Thampy, Yichen Qin)
Customer targeting has been a marketing challenge for many years. The idea is to target the right kind of customer at the right time with the right kind of product in order to maximize sales, save business resources, and maximize profit. Quotient Technology Inc.’s website Coupons.com delivers personalized digital offers in accordance with users’ purchasing behavior data: a customer is shown different combinations of coupons based on buying patterns and segments created using data-driven techniques. This project is an ad-hoc predictive analysis to determine the target customers for a paper-towel-producing CPG brand (call it YZ) that targets its customers with personalized coupon offers for various retailers. The main idea behind the campaign is to generate trial. User behavior is easy to determine from previous trial campaigns, but as this is the brand’s first campaign of its sort, we use multiple machine learning techniques, heuristics, and business knowledge to make the best predictions about which customers are likely to try the product.
This project makes extensive use of SQL queries (Hadoop Impala) and R to perform data analyses, market basket analysis, logistic regression, random forests, SVM models, and similar machine learning techniques to find which customers are most likely to buy YZ Brand’s paper towels and should be targeted in this trial campaign.
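The market-basket piece reduces to counting itemset frequencies; a minimal sketch on invented transactions (real work would run over the Hadoop/Impala extracts mentioned above):

```python
# Toy market-basket computation: support and confidence for a single rule.
baskets = [
    {"paper_towels", "detergent", "sponges"},
    {"paper_towels", "detergent"},
    {"detergent", "sponges"},
    {"paper_towels", "sponges"},
    {"detergent"},
]

def rule_stats(antecedent, consequent, baskets):
    """Support and confidence for the association rule antecedent -> consequent."""
    n_ante = sum(antecedent <= b for b in baskets)
    n_both = sum((antecedent | consequent) <= b for b in baskets)
    return n_both / len(baskets), n_both / n_ante

support, confidence = rule_stats({"detergent"}, {"paper_towels"}, baskets)
print(support, confidence)
```

Rules whose antecedents describe a shopper's current basket history, filtered by support and confidence thresholds, give one heuristic signal for who is likely to try the new product.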
Hardik Vyas, Analysis of Kobe Bryant Shot Selection, August 2016, (Michael Magazine, Peng Wang)
The key objective of this project is to explore the data pertaining to all 30,697 shots taken by Kobe Bryant during his NBA career, and to develop models predicting which of these shots would have made the basket had the outcome been unknown. The problem is based on a now-closed Kaggle competition introduced after Kobe Bryant’s retirement from professional basketball on April 12, 2016. Kobe played his entire 20-year NBA career with the Los Angeles Lakers; he had an illustrious career, holds numerous records, and is regarded as one of the most celebrated players ever to grace the game.
Nikita Mokhariwale, Reporting Analyst Internship at BlackbookHR, Cincinnati, August 2016, (Marc Aiello, Peng Wang)
The importance of data interpretability is often overlooked in executive reporting. The customer experience can be improved manifold if executive reports are user-friendly and encourage executives to see patterns and trends in the data, and even to question the data. I transformed the traditional reports that BlackbookHR created for its clients in the talent analytics space. The traditional reports consisted of numbers and tables that were tedious to read and provided little insight beyond the raw results of surveys taken by the client’s employees. I introduced innovative visualizations and charts and minimized the use of numbers in depicting the data. The visualizations helped executives view their organization in one snapshot without performing any mental calculations. This received very positive feedback from clients because the charts helped them find patterns even in areas where they weren’t expecting them; for example, one client identified a possible negative correlation between team size and employee engagement. My work was primarily based on Excel and Tableau. I later created Excel and Tableau templates for all future reporting and ran scalability tests so the templates would remain robust for larger clients and varied data.
Joshua Roche, Market Analysis Framework for Mobile Technology Startups, August 2016, (Amit Raturi, Michael Magazine)
The current technological revolution has created a veritable modern-day "gold rush" due to an ever-growing market and much lower barriers to entry than in traditional industries. Many startups do not pursue an analytical study of the market they seek to enter before development begins, potentially leading to a tremendous undertaking that is in effect useless. This paper seeks to establish an initial analytical framework for testing market-potential assumptions before work begins, so that entities with limited resources, limited analytical prowess, and information asymmetry can make more informed decisions.
Shashank Pawar, Hybrid Movie Recommender System Using Probabilistic Inference over a Bayesian Network, August 2016, (Peng Wang, Edward Winkofsky)
Recommender systems are widely used to help users on the Internet by suggesting products or services they would be interested in, based on their historical behavior as well as the behavior of similar users. Two types of approaches are usually adopted in developing a recommender system: content-based and collaborative filtering. This project studies a hybrid approach, combining content-based and collaborative filtering techniques, to develop a recommender system for movies. The data set used is the MovieLens 100K data set, consisting of 100,000 ratings by 943 users of 1,682 movies, where each movie is described by one or more of 19 features or genres. The objective is to predict how a given user would rate a movie that user has not yet rated. A Bayesian network is used to represent the interactions and dependencies among the movies, users, and movie features, which are represented as nodes in the graph. To find users similar to the given user for the collaborative filtering part, two similarity measures are computed over the ratings the two users have in common: first, the Pearson correlation coefficient between the two sets of ratings, and second, the count of instances where both users rated a movie low or high.
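As a rough illustration of the first similarity measure (not taken from the project's code), a Pearson correlation over the movies two users have rated in common can be sketched as follows; the dictionary-based rating format is an assumption for the example:

```python
from math import sqrt

def pearson_similarity(ratings_a, ratings_b):
    """Pearson correlation over items rated by both users.

    ratings_a / ratings_b: dicts mapping movie id -> rating (e.g. 1-5).
    Returns 0.0 when fewer than two common items exist or a user's
    common ratings are constant (undefined correlation).
    """
    common = set(ratings_a) & set(ratings_b)
    n = len(common)
    if n < 2:
        return 0.0
    xs = [ratings_a[m] for m in common]
    ys = [ratings_b[m] for m in common]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sy = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)
```

Users whose common ratings move together score near +1, opposed tastes score near -1, and the similarity weights that neighbor's influence in the collaborative filtering step.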
Raunak Bose, Machine Learning - Comparison Matrix, August 2016, (Uday Rao, Michael Magazine)
With so many options available, selecting tools for machine learning algorithms has become cumbersome. Each algorithm brings its own pros and cons to the machine learning community, and many have similar uses. The phenomenon of huge data collection is already here, and current machine learning tools need real-time processing abilities to meet users' requirements. Through this paper, I wish to give researchers the ability to utilize machine learning with Python. To evaluate tools, one should have a thorough understanding of what to look for. This paper uses the Python platform to evaluate machine learning algorithms on confusion-matrix and hardware metrics. We look at libraries such as Python's scikit-learn and study their usage in processing data for supervised learning algorithms.
Ryan Stadtmiller, Predicting Season Football Ticket Renewals for the University of Cincinnati Using Logistic Regression and Classification Trees, July 2016, (Michael Magazine, Brandon Sosna)
Season ticket holders (STHs) are important for both collegiate and professional sports teams. Season tickets allow fans to take ownership in the team and also provide a significant share of overall revenue for the team's ownership. For these reasons, maintaining a high STH renewal rate is important to a team's on- and off-field performance. I will focus on analyzing STH renewals for the University of Cincinnati's football team. I will use statistics and data mining techniques to predict whether an STH is likely to renew their seats based on predictor variables such as quantity of tickets, section, percentage of tickets used throughout the year, and percentage of games attended, among many others. If a customer is not likely to renew their tickets, the athletic department can take preemptive measures to retain the customer.
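As a minimal sketch of how a fitted logistic regression of the kind described would score a renewal probability (the feature names and coefficient values below are hypothetical, not the project's fitted model):

```python
from math import exp

def renewal_probability(features, coefficients, intercept):
    """Logistic-regression score: P(renew) = 1 / (1 + exp(-z)),
    where z = intercept + sum of coefficient * feature value.

    `features` and `coefficients` are dicts keyed by predictor name.
    """
    z = intercept + sum(coefficients[name] * value
                        for name, value in features.items())
    return 1.0 / (1.0 + exp(-z))

# Hypothetical example: heavy attendance pushes the probability up.
holder = {"pct_games_attended": 0.9, "ticket_quantity": 2}
coefs = {"pct_games_attended": 2.0, "ticket_quantity": 0.1}
p = renewal_probability(holder, coefs, intercept=-1.0)
```

Holders whose score falls below a chosen threshold would be flagged for the athletic department's retention outreach.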
Nicholas Imholte, Optimizing a Baseball Lineup: Getting the Most Bang for Your Buck, July 2016, (Michael Magazine, Yichen Qin)
Given a fixed payroll, and focusing purely on the offensive side of the ball, how should a baseball team assign its funds to give itself the highest average number of runs possible? In this essay, I will attempt to answer this question using regression, clustering, optimization, and simulation. First, I will use regression to model baseball scores, with the goal being to determine how each event in a baseball game impacts how many runs a team scores. Second, I will use clustering to determine what kinds of hitters there are, and how much each type of hitter costs. Third, I will use optimization to determine the optimal arrangement of hitter clusters for a variety of payrolls. Finally, I will complement this analysis with a simulation, and see how the results from the two approaches compare.
Nidhi Shah, Revenue Optimization through Merchant-Centric Pricing, July 2016, (Jay Shan, Madan Dharmana)
A payment processor that processes credit and debit card transactions wanted a strategy to maximize the revenue it makes from merchant transactions by periodically re-pricing its merchants' processing rates. The biggest challenge with increasing a merchant's rate, as with any customer of a business, is that there is a very fine line between driving the customer away due to price sensitivity and finding the optimum price point that extracts the most revenue while retaining the customer.
To address this challenge, we implemented a dynamic, merchant-centric pricing strategy in which each merchant is treated individually, based on their profile, when determining the pricing action to be taken. To achieve this, we designed an automated solution in SAS that produces a unique pricing recommendation for each merchant based on decision rules. Revenue was maximized by increasing processing rates up to the merchant's segment (industry and volume tier) benchmark, subject to certain other constraints. This automated solution allowed re-pricing to be done more frequently (monthly), which resulted in annual incremental revenue of ~$500,000 for the payment processor.
Kristofer R. Still, Forecasting Commercial Loan Charge-Offs Using Shumway’s Hazard Model for Predicting Bankruptcy, July 2016, (Yan Yu, Jeffrey Shaffer)
In the course of lending money, a certain percentage of a bank's outstanding loans will be deemed uncollectible and charged off. Because charge-offs can lead to significant losses, commercial banks try to minimize them by closely monitoring borrowers for signs of default or worse. Commercial banks maintain detailed financial records for their customers, including numerous accounting ratios. This analysis seeks to leverage this accounting data to predict corporate charge-offs using a sample of firms from January 1, 2000 through the present. A simple hazard model is used and compared to older discriminant-analysis methods based on out-of-sample classification accuracy.
Sahithi Reddy Pottim, Building a Probability of Default Model for Personal Loans, July 2016, (Dungang Liu, Yichen Qin)
The consumer lending industry is growing rapidly, with a wide spread of loan types, and issuing personal loans over the internet is gaining importance. The main goal of the project is to determine which customers should be offered a loan in order to maximize the profit of a small finance company that issues loans over the internet. The data set contains information on past loan performance: about 26,194 loans with 70 variables, which can be categorized as application data, credit data, loan information, and loan performance. The crux of the project is variable selection using weight of evidence (WOE) and information value (IV), which measure a variable's predictive power with respect to the response. Weight of evidence was high for variables where the percentage of good and bad loans changed significantly across bins. Variables with information value between 0.02 and 0.26, classified as weak, average, or strong predictors, were considered for a logistic regression model, which produced an AUC of 0.67. However, information value does not account for correlation or multicollinearity among the variables; a further check using the variance inflation factor (VIF) reduced the variable set. A step-wise logistic regression model built on the variables selected using information value further reduced the variable count, produced an AUC of 0.69, and lowered the misclassification rate for good and bad risk loans. The results suggest that information value is one of the best variable selection procedures and that step-wise logistic regression suited this data set best for predicting the probability of default.
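The WOE and IV calculation described above can be sketched directly from its definition (a generic illustration over simple good/bad counts per bin, not the project's code or data):

```python
from math import log

def woe_iv(bins):
    """Weight of evidence per bin and total information value.

    `bins` is a list of (n_good, n_bad) counts for each bin of a
    candidate predictor; all counts are assumed positive.
    """
    total_good = sum(g for g, b in bins)
    total_bad = sum(b for g, b in bins)
    woes, iv = [], 0.0
    for g, b in bins:
        dist_good = g / total_good          # share of all good loans in this bin
        dist_bad = b / total_bad            # share of all bad loans in this bin
        w = log(dist_good / dist_bad)       # weight of evidence for the bin
        woes.append(w)
        iv += (dist_good - dist_bad) * w    # bin's contribution to IV
    return woes, iv
```

A predictor whose good/bad mix barely changes across bins yields WOE near zero in every bin and a negligible IV; large shifts in the mix drive the IV up, which is exactly the behavior the abstract notes.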
Joseph Chris Adrian Regis, Human Activity Recognition using Machine Learning, July 2016, (Yichen Qin, Dungang Liu)
The Weight Lifting dataset is investigated in terms of "how (well)" an activity is performed, which has real-life applications in sports and healthcare. In this capstone, machine learning algorithms are applied with the intention of checking the feasibility of such an application in terms of accuracy. The data, collected from wearable accelerometers, consist of 39,242 observations with 159 variables. Features were calculated on the Euler angles (roll, pitch, and yaw), as well as the raw accelerometer, gyroscope, and magnetometer readings from the wearable devices. We chose algorithms in order of increasing complexity to probe accuracy with respect to the algorithm used: decision trees, random forests, stochastic gradient boosting, and adaptive boosting were applied. There was little difference among the latter three (less than 0.25% apart in accuracy), but all were much better than decision trees, as expected. Having to choose among the three, we selected adaptive boosting as the final algorithm. It achieved an accuracy of 99.95% on the scoring dataset, which is the expected accuracy for a general application using the same setup.
Jigisha Mohanty, Analyzing the Relationship between Customers for the Commercial Business of the Bank to Identify the Nature of Dependency and to Predict the Direction of Risk in Cases of Possible Adverse Effects, July 2016, (Kristofer Still, Michael Magazine)
The commercial banking business deals with many customers that buy various products from the bank. There are scenarios where a company and its parent company are both customers of the bank; further, a bigger company can guarantee a loan requested by another company. Each loan or credit service established carries a certain amount of risk for the bank, and each relationship is rated on such risk factors. The direct risk to the bank is established by the direct exposure amount assigned to the company. When a different company owns or guarantees for a company, the latter's direct exposure also shows up as indirect exposure for the former. This implies that if the smaller company defaults on its loan, the company owning or guaranteeing it is responsible for the entire loan taken by the smaller company.
The objective of this study is to create a network map to identify such connections. The network map will provide a visual description of the relationship between two customers and show the dependencies between them. The second objective is to identify the direction of risk in terms of direct and indirect exposure for primary and secondary companies, and so on. This will help the bank establish a line of action and quantify the exposure amount attributed to each customer. Establishing the direction of risk will also open up analysis of how an adverse event would propagate.
Minaz Josan, Sentiment Analysis for the Verbatim Response Provided by Clients for Satisfaction Survey for Fifth-Third Bank, July 2016, (Kristofer Still, Yan Yu)
The financial-services industry is still struggling with high churn rates, as customers have numerous options for where to bank. This creates a need to understand hidden customer sentiments, and the industry has realized it must strengthen relationships with its customers. One measure taken is to monitor the performance of representatives and the satisfaction of the clientele with the institution as well as the representative. An overall satisfaction score is assigned to every representative based on a survey in which clients rate the performance of the bank and the representative; the survey also includes verbatim responses. In this project, an attempt will be made to identify the sentiment behind these verbatim responses and its correlation with the overall satisfaction score. The responses will be classified on a three-category scale of positive, negative, or neutral using supervised learning models: support vector machines (SVM) and logistic regression.
Adam Sullivan, Predicting the Rookie Season of 2016 NFL Wide Receivers, July 2016, (Yichen Qin, Mike Magazine)
The NFL has never been more popular than it is today; part of the reason is the expansion and exponential growth of fantasy football. According to American Express, nearly 75 million people played fantasy football in the 2015 season, spending nearly $5 billion. The leagues range from daily fantasy football, where different players can be selected each week, to dynasty fantasy football, where players can be kept for their whole careers. This analysis is focused through the lens of dynasty fantasy football, which is seeing its own explosion of participants. In dynasty fantasy football the wide receiver is king, with 16 of the top 20 ranked players being wide receivers. The purpose of this analysis is to give insight into which 2016 rookie wide receivers are in the best position to succeed in their rookie seasons and would justify being selected early in dynasty football drafts.
Eulji Lim, Cincinnati Crime Classification, July 2016, (Yan Yu, Dungang Liu)
Every citizen expects prompt service from police, and the police department wants to improve citizen satisfaction through resource management and other tools. This study aims to build Cincinnati crime-category prediction models and to gain insight into the crime data through appropriate visualization. The Cincinnati Police Crime Incident dataset is provided by the City of Cincinnati Open Data Portal. It contains the time and location of crimes in the six districts of Cincinnati from 1991 to the present and is updated daily. The sub-dataset chosen for the analysis covers over 180,000 incidents from January 2011 to May 2016. The crime-classification idea and the model evaluation method are inspired by the Kaggle competition "San Francisco Crime Classification". Data exploration shows that month and season affect the number of crimes rather than the types of crime. Logistic regression models are built in R with different time and geographical attributes; the hour, year, and neighborhood factors prove more effective than factors such as latitude and longitude, producing the model with the lowest log-loss (2.133). In addition, random forest and tree models are built in SAS Enterprise Miner, and the random forest model with hour and neighborhood factors shows the best performance with the lowest misclassification rate (0.67).
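The multiclass log-loss used to rank models here, as in the Kaggle competition it follows, can be sketched generically (this is the standard metric, not the project's code):

```python
from math import log

def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """Multiclass logarithmic loss.

    y_true: list of true class indices, one per observation.
    y_prob: list of per-observation probability lists over the classes.
    Probabilities are clipped away from 0 and 1 for numerical safety.
    """
    total = 0.0
    for cls, probs in zip(y_true, y_prob):
        p = min(max(probs[cls], eps), 1.0 - eps)
        total += log(p)
    return -total / len(y_true)
```

A model that spreads probability uniformly over k classes scores log(k), so a loss of 2.133 over the many Cincinnati crime categories indicates meaningfully better-than-uniform predictions.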
Joshua Horn, Analysis and Identification of Training Impulses on Long-Distance Running Performance, July 2016, (David Rogers, Brian Alessandro)
Long-distance running is one of the most popular participatory sports in the United States; in 2015 there were 17.1 million road race finishers and over 500,000 marathon finishers, each collecting a trove of untapped data. The subject of this analysis has been a semi-competitive runner since 2000 and began collecting personal running data in 2004, with an increase in detail in 2007 while competing collegiately, and again with the inclusion of GPS data in 2014. Using these data and background knowledge of training theory and exercise physiology, a variety of new variables were defined and explored for their ability to explain changes in athlete fitness, defined by VDOT, a pseudo form of VO2max (maximal oxygen consumption rate). The primary objective was to identify the main drivers of VDOT to inform future training decisions. Based on a combination of heuristic, ensemble, and complete search methods across linear, additive, and tree regressions, the 48-week measure of training impulse, measured in intensity points, was identified as the primary driver of changes in VDOT. Based on these results, future training for the athlete should focus on maintaining long-term consistency, with the 48-week training impulse between 3,500 and 4,500 points: a zone that produces VDOT outcomes in the 66th to 86th percentiles without inducing the substantial physiological stress (muscular degradation due to insufficient recovery) and psychological stress (the mental strain of the 10 to 16 hours required for weekly training) associated with higher training loads.
Linlu Sun, Analysis and Forecast of Istanbul Stock Data, July 2016, (Yichen Qin, Peng Wang)
The Istanbul Stock Exchange data set was collected from imkb.gov.tr and finance.yahoo.com and is organized by working days of the Istanbul Stock Exchange. The objective of this exercise is to forecast the response variable, ISE. First, we test a simple mean forecast of ISE; when that proves inadequate, we build a linear regression model on an 80% sample of the data. The approach begins with exploratory data analysis to understand the variables, followed by designing the best linear regression model with the most appropriate variables. Based on the best model, we forecast each predictor variable and then use the model formula to forecast the ISE value for the next 10 days.
Subhashish Sarkar, Sentiment Analysis of Windows 10 – Through Tweets, July 2016, (Dungang Liu, Peng Wang)
With the advent of mobile operating systems (OS), Microsoft revamped its value offering and launched the Windows 10 OS in July 2015, which works across devices (laptops, desktops, tablets, and mobile phones). To gain market share and attract existing users to the new OS, Microsoft offered a free upgrade, expected to end on July 29, 2016. However, Microsoft has not been able to generate the targeted traction for its new OS among its user base. The purpose of this project is to explore user sentiment and thereby the reasons why Windows 10 is not gaining the traction Microsoft targeted. Sentiment analysis helps brands determine the wider public perception of a product on social media; results can serve as direct feedback for altering product strategy or pruning and adding features. In this case, lexicon-based sentiment analysis of tweets on Windows 10 revealed that only 24% of users had a positive opinion. Further analysis using ordinal regression highlighted specific issues contributing to the negative opinions: for example, bugs, crashes, installation errors, and the aggressive promotion adopted by Microsoft. Positive opinions about Windows 10 centered on the host of features available in the OS. The report also identifies frequently used contextual words that could be added to the lexicon to improve the parsing of emotions.
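A lexicon-based tweet scorer of the kind described can be sketched in a few lines; the word sets passed in below are toy stand-ins for a real sentiment lexicon, not the one used in the project:

```python
def lexicon_sentiment(text, positive_words, negative_words):
    """Classify a tweet as positive/negative/neutral by counting
    how many tokens hit the positive vs. negative lexicon."""
    tokens = text.lower().split()
    score = (sum(t in positive_words for t in tokens)
             - sum(t in negative_words for t in tokens))
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Extending the lexicon with the frequently used contextual words the report identifies would amount to adding entries to these word sets.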
Zhiyao Zhang, Methodology on Term Frequency to Define Relationship between Public Media Articles and British Premier League Game Results, July 2016, (Yichen Qin, Michael Magazine)
This project aims to determine whether a relationship exists between public media coverage and the results of Premier League games. The Premier League is a soccer league filled with rumors, sources, news, and critics. Players in the league suffer the pressure and anxiety of criticism, which may affect their game performance; however, we do not know whether such a relationship actually exists, and even if it does, we do not know how public media affect the games. In this research, I investigate these two questions using term frequency.
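Term frequency itself is straightforward to compute over a tokenized article (a generic sketch, not the project's code):

```python
from collections import Counter

def term_frequency(tokens):
    """Relative term frequency for one tokenized article:
    tf(t) = count(t) / total tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}
```

The resulting per-article frequencies of, say, criticism-related terms could then be compared against subsequent match results.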
Emily Meyer, Demonstration of Interactive Data Visualization Capability for Enhancement of Air Force Science and Technology Management, July 2016, (Yan Yu, Jeff Haines)
Data must often pass through certain people and channels before it becomes information and reaches a decision maker. To make the data-to-decision-maker pipeline more expedient, an effort is under way to set up Tableau and Tableau Server within the organization. This long-term project is currently in its initial stage, focusing on setting up Tableau Server and demonstrating the capabilities of data blending and interactive dashboard visualizations to create early adopters within the organization. The following report is an intern's contribution toward demonstrating Tableau's capabilities: creating PowerPoint presentations, Tableau Story Points, and Tableau dashboards; identifying principles for structuring data; cleaning up datasets; and refining existing dashboards.
Zhaoyan Li, Identifying Outliers for the TAT Analysis, July 2016, (Michael Magazine, Ron Moore)
The goal of our company is to provide the best healthcare support services (cleaning, equipment delivery, etc.) to our client, Cincinnati Children's Hospital Medical Center (CCHMC), and of course to the patients who visit the hospital. Data analysis plays a central role in improving the quality and efficiency of these services. My data analysis work falls into four categories: turnaround time (TAT) analysis, supervisor inspection analysis, patient survey analysis, and full-time employee analysis. Since October 2016, the payments our company receives from CCHMC have been based on metrics. For example, the hospital requires that 90% of all cleaning requests be completed within 60 minutes: if we hit this goal, we receive 100% of the payment; if we hit 90% within 65 minutes, we receive 75%, and so on. From a data analyst's perspective, the goal is to generate graphs that show how we have performed in the past and to figure out how to meet the metrics set by CCHMC.
Shivanand Yashasvi Meka, Predicting Customer Response to Bank Marketing Campaigns, July 2016, (Peng Wang, Dungang Liu)
Banks often market their credit cards, term deposits, and other products by cold calling their customers. Each call has a cost, and calling the entire customer base is not prudent for a bank, as only a small percentage of those customers actually convert. Therefore, to reduce costs and improve efficiency, it is important to have a good prediction model for identifying the customers who are likely to respond positively to the marketing campaign. The dataset used for this project contains information on 41,188 customers who were approached about subscribing to a term deposit offered by a Portuguese bank. The objective of this project is to develop a model that accurately predicts whether a contacted customer will subscribe to the term deposit.
Six different modeling techniques were used in this project. The models use 19 variables spanning a customer's demographic information, credit and previous-campaign information available to the bank, and macro-economic variables. Four of the six models perform very similarly, and they improve the net profit generated from the marketing campaign by 50%.
Subhasree Chatterjee, Movie Recommendation System Using Collaborative Filtering, July 2016, (Yan Yu, Peng Wang)
A movie recommendation system suggests movies to users based on their ratings of movies they have already watched. Collaborative filtering is used in this project to achieve that goal. This report is a comparative study of the different collaborative filtering methods used in industry, aiming to find the best method for the data in hand. The project also predicts the top 5 recommended movies per user based on their historical ratings from the MovieLens database.
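One common user-based collaborative filtering step, predicting an unseen rating as a similarity-weighted average of neighbors' ratings, can be sketched as follows (illustrative only; the report compares several industry methods rather than prescribing this one):

```python
def predict_rating(similarities, neighbor_ratings):
    """Similarity-weighted average of neighbors' ratings for one movie.

    similarities: dict neighbor -> similarity to the target user.
    neighbor_ratings: dict neighbor -> that neighbor's rating of the movie.
    Returns None when no positively similar neighbor rated the movie.
    """
    num = den = 0.0
    for neighbor, sim in similarities.items():
        if sim > 0 and neighbor in neighbor_ratings:
            num += sim * neighbor_ratings[neighbor]
            den += sim
    return num / den if den else None
```

Ranking a user's unwatched movies by this predicted rating and keeping the five highest yields a top-5 recommendation list of the kind the project produces.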
Sanjita Jain, Incident Rate Analysis, July 2016, (Dungang Liu, Michael Magazine)
XYZ, a major wireless network operator in the United States, offers its subscribers a couple of handset protection (insurance) programs, which cover all mobile devices, including kit accessories (wall charger, battery, SIM card), against total-loss claims such as loss/theft or physical/liquid damage for as long as the feature is paid for. Understanding the subscriber features that affect claim propensity would help in better budgeting for the future. Features of the subscriber base (approximately 10 million per month over a period of 29 months), such as credit class, region, and device, and features of the fulfilled claims base (approximately 115 thousand per month over the same period), such as day of week and tenure, were analyzed to understand the relationship between incident rate and subscriber features. The analysis shows that the P credit class, despite making the maximum number of claims, has the lowest incident rate; March and August have the maximum incident rates; region has no significant effect on claim-filing propensity; and the incident rate declines as tenure increases. These findings will help improve forecast accuracy and inform recommendations for reverse-logistics activities.
Sai Shashanka Suryadevara, Image Classification: Classifying images containing Dogs and Cats, July 2016, (Yan Yu, Peng Wang)
On many websites it is common practice to require a HIP (Human Interactive Proof) or CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) when users register for Web services. This verification serves many purposes, such as reducing email and blog spam and preventing brute-force attacks on website passwords. Users are given a challenge that is supposed to be easy for people but difficult for computers, commonly identifying letters or numbers in a distorted image. Solving such a CAPTCHA is not always easy, and users sometimes get frustrated in the process. Another practice for human verification shows users pictures containing both dogs and cats and asks them to identify the images that contain cats (or dogs). Studies show users can accomplish this task quickly and accurately; many even find it fun. For computers, however, the task is not so easy because of the many similarities that exist between these animals. In this project, supervised learning models are implemented to classify the images. The images are processed to obtain pixel information by standardizing and rescaling; features are extracted from the pixel information, and models are trained using methods such as k-nearest neighbors, neural networks, and support vector machines.
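Of the methods listed, k-nearest neighbors is the simplest to sketch; the toy classifier below works on small feature vectors standing in for extracted pixel features, and is illustrative rather than the project's implementation:

```python
from collections import Counter
from math import sqrt

def knn_classify(train, query, k=3):
    """Majority vote among the k training vectors closest to `query`.

    train: list of (feature_vector, label) pairs, e.g. pixel features
    labeled "cat" or "dog".
    """
    by_distance = sorted(
        train,
        key=lambda pair: sqrt(sum((a - b) ** 2 for a, b in zip(pair[0], query))),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]
```

Real image features are far higher-dimensional, which is why the project also evaluates neural networks and support vector machines.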
Santosh Kumar Molgu, SMS Spam Classification using Machine Learning Approach, July 2016, (Dungang Liu, Peng Wang)
Spam messages are identical messages sent to numerous recipients by email or text for purposes such as mass marketing, driving website clicks, scamming users, and stealing data. Many carriers have started acting on SMS spam by allowing subscribers to report it and taking action after appropriate investigation. In some places they have imposed limits on text length and on the number of messages per hour and per day to crack down on spam, but SMS spam has still grown steadily over the past decade. A great deal of research on email spam has been done and implemented by mail service providers; SMS spam is relatively new and differs from email in the nature of communication and the availability of features. This paper applies data mining techniques to build multiple SMS spam classifiers and compare their performance. The results show that naive Bayes has high precision and a low blocked-ham rate, and that using term frequency with inverse document frequency (TF-IDF) increases the spam-caught rate.
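A minimal multinomial naive Bayes classifier with add-one (Laplace) smoothing, of the kind compared in the paper, might look like the sketch below; it uses raw token counts only, omitting the TF-IDF weighting step, and the toy messages are invented for illustration:

```python
from collections import Counter, defaultdict
from math import log

def train_naive_bayes(docs):
    """docs: list of (token_list, label) pairs, labels e.g. 'spam'/'ham'."""
    label_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab, len(docs)

def classify(model, tokens):
    """Pick the label maximizing log P(label) + sum log P(token | label),
    with add-one smoothing so unseen tokens get nonzero probability."""
    label_counts, word_counts, vocab, n_docs = model
    best_label, best_score = None, float("-inf")
    for label, n_label in label_counts.items():
        n_words = sum(word_counts[label].values())
        score = log(n_label / n_docs)
        for t in tokens:
            score += log((word_counts[label][t] + 1) / (n_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Working in log space avoids underflow when multiplying many small per-token probabilities.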
Rujuta Kulkarni, Income Capacity Prediction, July 2016, (Peng Wang, Dungang Liu)
This analysis aims to segment customers into income groups above and below 50K, a segmentation that can be used for targeted marketing. The data were collected from the 1994 census database by selecting records for ages 16 to 100 and applying a few more filters. An initial exploratory analysis was performed, and the data were modified as required for simplification. Different modeling techniques were then fitted and tested for their in-sample and out-of-sample performance on the basis of cost and AUC. Newer techniques, including QDA, SVM, and an ensemble of classifiers, were also tried.
Geran Zhao, Call Center Problem Category Prediction and Call Volume Forecasting, July 2016, (Yichen Qin, Yan Yu)
This paper is about the call center of United Way 211 Greater Cincinnati. There are two objectives. The first is to use logistic regression to predict whether a call falls into the basic-needs category, based on caller information; the model finds that Hamilton, Kenton, Campbell, Montgomery, Clark, and Warren, along with the Individual, Self Referred, Referral, Information Only, Black/African American, Family, and Agency attributes, are associated with basic-needs calls. The second is to build an ARIMA model to forecast call volume and identify its trend; the forecast shows call volume decreasing.
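A full ARIMA fit needs a statistics package; as a reduced sketch of the idea, an AR(1) model (the autoregressive core of ARIMA) can be fitted by least squares and iterated forward. This is an illustration of the forecasting mechanism, not the project's actual model:

```python
def ar1_forecast(series, steps):
    """Fit y_t = c + phi * y_{t-1} by ordinary least squares on the
    lagged series, then iterate the fitted equation forward `steps`
    periods from the last observed value."""
    x, y = series[:-1], series[1:]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    phi = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
           / sum((a - mean_x) ** 2 for a in x))
    c = mean_y - phi * mean_x
    forecasts, last = [], series[-1]
    for _ in range(steps):
        last = c + phi * last
        forecasts.append(last)
    return forecasts
```

A fitted phi below 1 with a declining recent level would produce the decreasing call-volume trend the paper reports; in practice an ARIMA implementation also handles differencing and moving-average terms.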
Shikha Shukla, Grocery Analytics: Analysis of Consumer Buying Behavior, July 2016, (Christopher Leary, Dungang Liu)
The client is an American grocery retailer in Cincinnati, Ohio. The focus of this research is to help the client's marketing department make data-driven decisions and identify prospects for sales growth in the coming year. The study helps the client identify and target their customer base more effectively through market basket analysis. The data provided by the client comprised household-level demographic data, campaign data, and item-level point-of-sale data for the past two years. Market basket analysis identified the most frequently purchased product and a number of other products that are always purchased with it; many products co-occurred in all the relevant transactions, as their confidence was 1. This information should help the client segment customers based on buying behavior so that segments can be micro-targeted more effectively through specialized marketing campaigns. Using statistical techniques such as the t-test, we showed that the impact of marketing campaigns on sales is significant and that campaigns lead to a substantial increase in sales; however, neither the number of campaigns active in a given period nor the campaign type significantly affects sales. Based on this information, the client can plan marketing campaigns more cost-effectively.
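Support and confidence, the two market basket metrics behind observations like "confidence was 1", can be sketched directly from their definitions (a generic illustration with invented baskets; the project's analysis would use a dedicated association-rule tool):

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent.

    transactions: list of item sets (one set per basket).
    support    = fraction of baskets containing both sides.
    confidence = fraction of antecedent baskets that also contain
                 the consequent.
    """
    ant, con = set(antecedent), set(consequent)
    n = len(transactions)
    n_ant = sum(ant <= t for t in transactions)
    n_both = sum((ant | con) <= t for t in transactions)
    support = n_both / n
    confidence = n_both / n_ant if n_ant else 0.0
    return support, confidence
```

A confidence of 1 means every basket containing the antecedent also contained the consequent, which is exactly the co-occurrence pattern the study observed.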
Radostin Tanov, NFL Player Grading with Predictive Modeling, July 2016, (Michael Magazine, Ed Winkofsky)
The goal of this analysis is to use predictive modeling software to determine the most important variables in player grading for three types of reporting: contribution, durability, and performance. Understanding which variables matter can inform changes to the weights of the previously developed model or a restructuring of how the grades are calculated, and the analysis also identifies alternative models that could be used to calculate the grade.
Ninad Shreekant Dike, Forecasting Bike Sharing Demand, July 2016, (Peng Wang, Michael Magazine)
The objective of this analysis is to predict the number of bikes rented per hour in the Capital Bikeshare system in Washington, D.C. The Bike Sharing Dataset, taken from the UCI Machine Learning Repository, comprises 14 predictor attributes and 17,379 instances. An initial exploratory analysis was performed to understand the data and the variables. Some inconsistent or inaccurate data, involving 3 variables, were removed or modified to ensure the cleanest possible data, and two additional predictor variables for bikes rented in the past two hours were added. The data were randomly sampled (stratified) into 80% training and 20% testing sets, 5 separate times, as a substitute for cross-validation. The programming language R was used for the analysis throughout, with Microsoft Excel for some graphs. Data modeling began with linear regression; then, noting that the assumptions of linear regression were violated, a log transformation and non-parametric methods (generalized additive model, regression tree, and random forest) were employed. Finally, a time series model was fitted to the data. The models were evaluated on their R-squared value (training data) and their mean absolute error and mean squared error (testing data), with each statistic reported as the mean over the 5 iterations of training and testing sets. The time series model produced the best forecast, accurate to within 18 units of the actual values on average for a 5-hour forecast.
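The splitting-and-scoring protocol described above can be sketched roughly as follows. The data, the strata, and the trivial train-mean "model" are all invented for illustration; the real project fitted regression and tree models in R.

```python
# Sketch of the evaluation protocol: five repeated stratified 80/20 splits,
# with MAE averaged over the repeats. The "model" here is a trivial
# train-mean predictor; records and strata are hypothetical.

import random
from collections import defaultdict

def stratified_split(rows, key, test_frac=0.2, seed=0):
    """Split rows 80/20 while preserving the proportion of each stratum."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for row in rows:
        by_stratum[key(row)].append(row)
    train, test = [], []
    for group in by_stratum.values():
        rng.shuffle(group)
        cut = int(round(len(group) * test_frac))
        test.extend(group[:cut])
        train.extend(group[cut:])
    return train, test

def mae(actual, pred):
    return sum(abs(a - p) for a, p in zip(actual, pred)) / len(actual)

# Hypothetical (hour-type, rentals) records; stratify on hour type.
data = [("peak", 300 + i) for i in range(50)] + [("offpeak", 40 + i) for i in range(50)]

maes = []
for seed in range(5):                        # five repeats instead of k-fold CV
    train, test = stratified_split(data, key=lambda r: r[0], seed=seed)
    mean_pred = sum(y for _, y in train) / len(train)
    maes.append(mae([y for _, y in test], [mean_pred] * len(test)))
print(sum(maes) / len(maes))                 # report the mean over the 5 repeats
```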
Abhilash Mittapalli, Framework for Measuring Quality of User Experience on Cerkl and Analyzing Factors of High Impact, July 2016, (Dungang Liu, Tarek Kamil)
Measuring the quality of user experience is of paramount importance to web platforms like Cerkl, since both the platform's survival and its revenue potential depend on user experience levels. The major challenge is building a score for evaluating user experience. Available scoring models are highly specific to particular web platforms, so a novel model had to be built from scratch in Cerkl's context. The scope of this project is confined to building a set of metrics (e.g., open rate, click rate, email bounce rate) that are aggregated into a score, and to analyzing the key factors driving this score. Formulae were developed for evaluating the user-experience score at different levels, along with interpretations of the scores as good, bad, worse, etc. Based on data from 200 organizations and 15 variables, a decision tree model with a misclassification rate of 0.30 and an AUC of 0.74 on a 20% holdout set beat logistic regression, which had a misclassification rate of 0.38 and an AUC of 0.56. 'Number of topics' stood out as the most important factor driving the user-experience score, with a correlation of 0.41 at the 95% confidence level. Audience size is a negative contributing factor, whereas the number of emails delivered on weekdays is a positive one. These results can help improve processes, for example by encouraging clients to publish content from more topics or by making weekday email delivery the default option.
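As a purely hypothetical illustration of rolling engagement metrics up into a single score, a weighted aggregation might look like the sketch below. The metric names, weights, and score bands are all invented; the project's actual formulae are not public.

```python
# Hypothetical user-experience score: a weighted sum of engagement rates,
# with bounce rate counting against. Weights and band thresholds are assumed.

WEIGHTS = {"open_rate": 0.5, "click_rate": 0.4, "bounce_rate": -0.3}  # invented

def ux_score(metrics):
    """Weighted sum of rates, clipped to [0, 1]."""
    raw = sum(WEIGHTS[name] * metrics[name] for name in WEIGHTS)
    return max(0.0, min(1.0, raw))

def label(score):
    """Map the numeric score onto coarse bands (good / bad / worse)."""
    return "good" if score >= 0.35 else "bad" if score >= 0.15 else "worse"

org = {"open_rate": 0.62, "click_rate": 0.18, "bounce_rate": 0.05}
s = ux_score(org)
print(s, label(s))
```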
Amol Dilip Jadhav, Machine Learning Approaches for Classification and Prediction of Time Series Activity Data Obtained from Smartphone Sensors, July 2016, (Peng Wang, Dungang Liu)
Smartphone devices have become increasingly popular as technology has made them cheaper, more energy efficient, and multifunctional. Data from the sensors embedded in these devices can provide accurate and timely information on users' activities and behavior, a primary area of research in pervasive computing. Numerous applications can be envisioned for such activity analysis, for instance in patient management, rehabilitation, personnel security, and preemptive scenarios. Human activity recognition (HAR) has been an active area of research over the last decade, but key aspects, such as developing new techniques to improve accuracy under more realistic conditions, still need to be addressed. The ultimate goal of such an endeavor is to understand how people interact with mobile devices, make recognition inherent, and provide personalized as well as collective information. The goal of this project is to recognize patterns in raw data obtained from users wearing a smartphone (Samsung Galaxy S II) on the waist, extract useful information by classifying the signals, and predict the measured activities. A simple feature extraction technique was employed to process the raw data, and various machine learning algorithms were then applied for multi-class classification. The results surpass the prediction performance of previously published work: a prediction accuracy of 97.49% was achieved, higher than that reported in the literature.
Mayank Gehani, Customer Churn Prediction in Telecom Industry, July 2016, (Dungang Liu, Peng Wang)
This project addresses customer churn in the telecom industry. Acquiring a new customer requires a large investment, so reducing customer attrition is very important, and churn has a correspondingly large impact on the business. The goal of the project is to predict whether a customer will churn in the future based on a data set describing customers' phone usage, and to identify parameters that can help reduce churn. The project includes building and identifying the best predictive model for this goal, along with other findings that support better recommendations. We compared several classification models, and a Support Vector Machine performed best, with an accuracy of 92%. New Jersey, California, and Texas showed the most churn, while Hawaii, Arkansas, and Arizona had the lowest churn rates. Based on the model, different per-minute rates for daytime calls were recommended.
Nitin Nigam, Understanding Online Customer Behaviour Using Adobe Analytics, July 2016, (Dan Klco, Dungang Liu)
All major businesses today strive for a strong online presence, since it makes it much easier to connect with consumers, understand their browsing patterns, and draw actionable insights from this data to intelligently expand market penetration. Surprisingly, a large number of firms struggle to do so, owing to factors such as a poor understanding of consumer buying needs and interests, a lack of technical expertise in analyzing customers' digital interactions, and inexperience in correctly identifying key performance indicators. This project aims to address these problems by tracking customers' online browsing patterns and generating reports from this data using Adobe Analytics, Adobe Tag Manager, and JavaScript tools. Metrics such as author name, page tags, page category, and post dates were tracked for 16 different pages on Perficient Inc.'s blog site. It was found that readers prefer certain authors and topics over others, and that readership peaks on certain days. This kind of analysis is extremely useful for companies, which can tailor and time their content based on the analyzed data to maximize readership on their sites, thereby generating more revenue.
Pratap Krishnan, Prediction of Used Car Prices, July 2016, (Peng Wang, Michael Magazine)
The dataset consists of several hundred 2005 used GM cars. The aim is to build a predictive model for used-car prices based on important factors such as mileage, make, model, engine size, interior style, and cruise control. A multivariate regression model is developed using this set of predictors, with price as the dependent variable. While model accuracy is important, good model interpretability also matters, so that the factors that most affect the price of used cars can be identified. Other models tried included regression trees, random forests, and LASSO regularization.
Rajiv Nandan Sagi, Prediction of Phishing Websites, July 2016, (Dungang Liu, Peng Wang)
Phishing is a security attack that involves obtaining sensitive and private data by presenting oneself as a trustworthy entity. Phishers exploit users' trust in the appearance of a site by using webpages that are visually similar to an authentic site. Few articles discuss the methods or features by which phishing websites can be identified. To address this problem and support internet users in identifying such malpractice, Prof. Rami Mohammad of the University of Huddersfield and Prof. Fadi Thabtah of the Canadian University of Dubai identified a set of important features, collected information from a number of websites, and published the dataset on the UCI Machine Learning Repository. This paper builds prediction models on this publicly available dataset using various machine learning techniques to recognize phishing websites. All the models are compared against each other to identify the best one, and the importance of each feature in the dataset for predicting phishing websites is studied. From the analysis, the model built with the random forest technique gives the best results, with a prediction accuracy of 95.07%; from this we conclude that the data collection was accurate and that the variables designed to identify phishing websites are relevant.
Grant Lents, Non Linear Modeling and Clustering for Rate Pricing, July 2016, (Yan Yu, Edward Winkofsky)
The insurance industry tends to be stagnant and resistant to change and new ideas. Statistical analysis is not new to the industry, as actuaries have always been part of it, but predictive modeling is not yet commonplace in insurance. In recent years the business world has embraced analytics, and the insurance industry is finally starting to catch up. In this new environment, methods that have been in place for decades are being reevaluated and improved. A closer look at rate pricing makes clear that basic methods are simply not sufficient given the statistical methods now available to analysts. With predictive modeling and clustering analysis, the industry can replace old methods that rely on executives' intuition and instead base rates on data. While this process has started, it can always be improved.
Nikita Bali, LOGIC and CloudCoder, July 2016, (Peng Wang, Michael Magazine)
There is a substantial rise in the number of people engaging in learning activities, either through a learning management system or through in-class learning technologies, leading to large collections of user data. Analytics tools can be used to make online education adaptive and personalized based on a student's past performance trends. Using clustering analysis, the problems were categorized into three sets, using several factors as proxies for difficulty and complexity. The three sets group the problems by difficulty level: Easy, Medium, and Hard. The goal of this exercise is to implement a recommendation system that automatically assigns problems to users based on their performance trends, so that they can ultimately improve their learning curve.
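The difficulty clustering could be sketched with a toy one-dimensional k-means over a single difficulty proxy, as below. The failure rates are invented, and the actual project used several proxy factors rather than one.

```python
# Toy 1-D k-means (Lloyd's algorithm) over a single "difficulty proxy"
# such as failure rate; the values below are hypothetical.

def kmeans_1d(values, k=3, iters=50):
    """Cluster scalars into k groups; returns sorted centroids."""
    centroids = sorted(values)[:: max(1, len(values) // k)][:k]  # spread initial seeds
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            clusters[min(range(k), key=lambda i: abs(v - centroids[i]))].append(v)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

failure_rates = [0.05, 0.08, 0.1, 0.35, 0.4, 0.42, 0.8, 0.85, 0.9]
easy, medium, hard = kmeans_1d(failure_rates)   # three difficulty tiers
print(easy, medium, hard)
```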
Ishan Singh, Campaign Analytics for Customer Retention-Brillio, July 2016, (Kunal Agrawal, Michael Fry)
The capstone project involves setting up campaigns and defining audience sizes, test goals, KPIs, and measurement methodologies for the consumers of a large multinational client of Brillio. The objective is to increase the renewal rate for a product subscription by offering a certain segment of customers promotional offers for renewing. This entails setting up A/B or multivariate tests to understand consumer behavior and obtain directional learnings for the business; each test has a hypothesis about promotional discounts offered to a treatment group asked to renew the service before it expires. The project involves defining the audience sample size, creating relevant metrics to measure the outcome of the experiment, and building frameworks to calculate the statistical significance of the results. A second mini-project involves finding the correlation between a prepaid customer's usage pattern and the customer's propensity to respond to a campaign, by mining service usage data and mapping it to the historical response rate of a given customer profile. The project also includes creating various visualizations with graphs and reports to generate business insights and lay out the steps in the campaign lifecycle.
Anvesh Kollu Reddi Gari, Using Unstructured Data to Predict Attrition, July 2016, (Eric Hickman, Dungang Liu)
Companies that serve customers or businesses at scale inevitably have customer service departments where valuable information is stored as free-form text. Unlocking value from these data sources presents a huge opportunity for companies to stay responsive and address concerns proactively. In this report we look at one instance in which customers' call notes are used as an indicator of possible attrition. We build a binary-outcome prediction model with comments as the sole predictor and attrition within a defined time period as the response variable. We used a Support Vector Machine (SVM) with a linear kernel, which is well suited to text mining. We show that tuning the penalty and class-weight parameters is important for arriving at the best model, especially in a class-imbalance problem like this one. The results show that comments alone can be a good predictor of attrition and can serve as a valuable predictor alongside demographic variables.
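Why the penalty and weight parameters matter in a class-imbalance setting can be seen with a small numeric sketch. The cost function and all numbers below are hypothetical, not the project's actual SVM configuration.

```python
# With 95/5 imbalance, "predict the majority class" looks 95% accurate but
# misses every attrition case. A weighted cost that penalizes minority-class
# errors more exposes this; all numbers are hypothetical.

def weighted_cost(actual, pred, w_pos=10.0, w_neg=1.0):
    """Sum of per-error costs; missing a rare positive costs w_pos."""
    cost = 0.0
    for a, p in zip(actual, pred):
        if a == 1 and p == 0:
            cost += w_pos        # false negative on the rare class
        elif a == 0 and p == 1:
            cost += w_neg        # false alarm
    return cost

actual = [1] * 5 + [0] * 95                 # 5% attrition
always_stay = [0] * 100                     # majority-class baseline
flags_some = [1] * 4 + [0] * 6 + [1] * 10 + [0] * 80  # catches 4/5, 10 false alarms

print(weighted_cost(actual, always_stay))   # 5 misses * 10 = 50.0
print(weighted_cost(actual, flags_some))    # 1 miss * 10 + 10 false alarms = 20.0
```

Under this weighting the "95% accurate" baseline is the worse model, which is why class weights are tuned rather than raw accuracy maximized.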
Saketh Jellella, Forecasting Exchange Rates, July 2016, (Yichen Qin, Eric Rademacher)
In the modern world, investing abroad is very common. For firms trading with other countries, trends in exchange rates can be very important: a firm that can predict exchange-rate movements can plan ahead. Exchange-rate forecasts are drawn from the computation of one currency's value with respect to another's over a period of time. Many theories and models can be used for prediction, but all have limitations. In this project, a time-series modeling approach was used to fit a model to historical data, and the predicted values were plotted to observe the trend; this helps firms judge whether it is a good time to invest and thus avoid losses. The capstone describes in detail the data extraction, model building, model selection, and model diagnostics. The end result is an AR-ARCH model fit to the time series, from which the next month's exchange-rate trend was observed: the trend is decreasing, i.e., exchange rates are expected to fall over the next month.
Sagar Umesh, PPNR: PCL (HELOC) Balance Forecast, July 2016, (Yan Yu, Omkar Saha)
As part of the requirements for Huntington Bancshares Inc.’s (HBI’s) Annual Capital Plan (ACP) and its participation in the Federal Reserve’s (Fed’s) Comprehensive Capital Analysis and Review (CCAR), the Personal Credit Line (PCL) balance models were developed to provide forecasts of the balances on HBI’s book for different economic scenarios. The PCL portfolio consists of 1st and 2nd Lien Home Equity Lines of Credit (HELOC) products. The primary macroeconomic drivers for 1st Lien balance are All Transactions Home Price Index in Huntington footprint, and Rate Spread between Freddie Mac 30Yr and Prime Rate. The primary macroeconomic drivers for 2nd Lien balance are Prime Rate, and All Transactions Home Price Index in Huntington footprint.
Sagar Vinaykumar Tupkar, Predicting Credit Card Defaults, July 2016, (Yichen Qin, Peng Wang)
Credit card defaults pose a major problem for financial service providers, who must invest heavily in collection strategies whose outcomes are uncertain. Analysts in the financial industry have had considerable success in developing methods to predict credit card holders' defaults from various factors. This study uses the previous six months of customer data to predict whether a customer will default in the next month, applying various statistical and data mining techniques and building several models. Exploratory data analysis is also important for checking the distributions and patterns of customer behavior that eventually lead to default. Of the four models built, logistic regression after principal component analysis and an adaptive boosting classifier performed best, predicting defaults with around 83% accuracy while minimizing the penalty to the company. The study also produced a list of important variables that affect the model and should be considered when predicting defaults. Although the prediction accuracy is good, further research and more powerful techniques could enhance the results and transform practice in the credit card industry.
Leon Corriea, An Analysis of European Soccer Finances and Their Impact on On-field Success, July 2016, (Michael Magazine, Edward Winkofsky)
Analytics is revolutionizing every industry it touches. From banking to manufacturing to healthcare, every industry has been made better and more successful by the use of analytics. The sports industry has been the latest to embrace it; the use of analytics in sports is often referred to as the "Moneyball revolution", alluding to the famous book and movie of the same name. This report takes a closer look at the business of soccer and how analytics can be used to improve financial decisions that affect on-field performance. It identifies the essential levers in the financial decision-making process at the top European soccer clubs and, through analytics, assigns an importance to each of them. By recognizing the most important factors, soccer clubs can prioritize their efforts on the areas with the maximum impact on on-field success.
Haribabu Inuganti, Predicting Default of Credit Card Customers, July 2016, (Dungang Liu, Edward Winkofsky)
It is important for banks and credit card companies to know whether a customer is going to default, as this helps them assess cash flows and the total risk at hand. For a credit card customer, attributes such as income range, education, marital status, and payment history affect this outcome. This project builds a predictive model that estimates a credit card customer's probability of default from such attributes. The data, taken from the UCI Machine Learning Repository, contain records for 30,000 customers with 24 attributes, including limit balance, sex, education, marital status, age, and past repayment status. Exploratory data analysis was first performed to understand the distributions of the variables and to check for outliers and missing values. The data set was divided into training and testing sets by random sampling. Logistic regression, lasso, support vector machine, and random forest models were then built on the training data, with AUC on the testing data used as the performance criterion. The best model was a logistic regression built on a stratified sample, with an out-of-sample AUC of 0.74. This model can be used to predict the probability of default for new customers.
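The AUC criterion used above can be computed directly from model scores: it is the probability that a randomly chosen defaulter is scored higher than a randomly chosen non-defaulter, with ties counting half. The labels and scores below are hypothetical model outputs.

```python
# AUC via the rank (Mann-Whitney) interpretation: compare every
# positive score against every negative score. Scores are hypothetical.

def auc(labels, scores):
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y = [1, 1, 1, 0, 0, 0, 0]
p = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]   # one non-defaulter outscores a defaulter
print(auc(y, p))
```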
Wenwen Yang, Omni-channel Fulfillment Path Costing, July 2016, (Yichen Qin, Erick Wikum)
Omni-channel retailing, an innovative, integrated, and customer-oriented retail business model, is flourishing with the advent of online and digital channels. A key challenge for omni-channel retailers is fulfilling each customer order in a cost-efficient and timely manner. There can be many possible fulfillment paths for an order, and understanding the cost of each alternative is complex. A 2015 JDA study found that only 16% of retailers surveyed could fulfill omni-channel demand profitably, so a general approach to estimating the cost of alternative fulfillment paths would benefit retailers. The ultimate purpose of this project is to construct a practical approach that allows retailers to optimize order fulfillment. The focus was on four types of fulfillment paths and on the net cost allocated by the activity-based costing (ABC) method. My internship at Tata Consultancy Services (TCS) focused on three components: first, defining four common fulfillment paths and a method to compute the cost to serve each; second, grouping similar shopping items based on their attributes using a clustering algorithm; and third, predicting the type of packing carton to be used for an order using a classification model.
Bhrigu Shree, Speed Dating Analysis, July 2016, (Peng Wang, Yichen Qin)
Understanding how women and men choose partners has baffled mankind since the beginning, and an immense number of books have been written on the subject. In today's data-driven world, it makes sense to take a data-driven approach to analyzing dating preferences as well. In this capstone project, I analyze the 'Speed Dating Experiment' dataset compiled and released by Columbia Business School professors Ray Fisman and Sheena Iyengar for their paper "Gender Differences in Mate Selection: Evidence from a Speed Dating Experiment". Data were gathered from participants in experimental speed dating events from 2002 to 2004. During the events, attendees had a four-minute "first date" with every other participant of the opposite sex; at the end of the four minutes, participants were asked whether they would like to see their date again and to rate their date on certain attributes. The capstone project consists of two parts: first, exploratory data analysis, in which the data are examined along various dimensions to find interesting patterns; and second, a predictive model that estimates the probability of a successful date based on participants' characteristics.
Alok Agarwal, A Forecasting Model to Predict Monthly Spend Percentages and an Approach to Assess Analytic Tools, July 2016, (Michael Magazine, Tim Hopmann)
The objective of this report is two-fold. The primary objective is to forecast the monthly spend percentages of different internal projects managed by the PPS team; the secondary objective is to design a framework for assessing different analytical tools for the team. For the forecasting model, the percentages were calculated based on the amount reserved for each project. The techniques tested were multiple regression, regression trees, and additive models. The final model was selected based on out-of-sample mean squared error and assessed on decision-level predictions in the 2015 portfolio. For the assessment of analytic tools, I gathered the team's analytical requirements and identified four evaluation phases. This assessment was local to the team and formed part of the enterprise-wide data governance strategy.
Karthekeyan Anumanpalli Kuppuraj, An Analysis on Personal Loans Offered by Lending Club, July 2016, (Dungang Liu, Michael Magazine)
Lending Club offers personal loans of $1,000 to $35,000 to applicants from various categories. The interest rate is decided based on the applicant's grade, income level, lien status, and other information provided in the application. After the loan is issued, repayment is tracked to maintain a loan status. Building models on the available data to set an applicant's interest rate and to predict the status of a loan once issued can minimize the risk of lending to applicants with low repayment capability. A linear regression model and a regression tree were constructed and compared for predicting the interest rate, using mean squared error as the criterion; the linear model gave the better results. A logistic regression model and a classification tree were constructed to predict loan status; the logistic regression model had the lower misclassification rate, and after cross-validation on various samples it was concluded that logistic regression predicts loan status more accurately than the classification tree. These two models will enable Lending Club to make more accurate, data-driven decisions and reduce the risk of lending to applicants with low repayment capability.
Udayan Maurya, Telecom Customer Churn Analysis, June 2016, (Peng Wang, Yan Yu)
This study analyzes customer churn data for a telecom service provider. The objective was to develop predictive models to identify the variables responsible for customer churn and to predict which customers may churn out of telecom services. Homogeneous clusters of similar states were identified, and a predictive model was developed for each. The in-sample and out-of-sample performance of all the models was analyzed and found satisfactory by industry standards.
August Spenlau Powers, Exploring the Use of and Attitudes Towards Drugs and Alcohol in the Kenton County Public School System, June 2016, (Yan Yu, Edward Winkofsky)
The illicit use of substances, both legal and illegal, is a pervasive and potentially very dangerous problem among teenagers. As such, identifying the influential factors that lead to substance use is critical and has been the subject of much research over the past few decades. Using several years of survey data provided by the Kenton County Alliance, the main factors influencing the use of alcohol, cigarettes, marijuana, and inhalants were determined using several techniques: classification trees, chi-squared automatic interaction detection (CHAID), and logistic regression. Unsurprisingly, the two variables that were highly significant throughout every analysis of all four substances were the student's perception of the substance and ease of access to it. Another important factor was the parent (or parents) with whom the student primarily lived, especially for inhalant use. While many of these factors are fairly intuitive, these insights should allow the KCA to better develop and target its community programs to ensure higher effectiveness.
Nitish Kumar Singh, Predict Online News Popularity, June 2016, (Dungang Liu, Edward Winkofsky)
With the increased use of the internet for information sharing, many news articles are published online and subsequently shared on social media. The popularity of an article can be measured by the number of views or shares it receives. An algorithm that could determine whether a news article will be popular would greatly help online publishers: it could assist editors by filtering out weak articles and inform the positioning of articles on the website. In this project, five learning algorithms, including logistic regression, decision trees, and boosted trees, are implemented to classify news articles as popular or non-popular based on different features. The best model is also checked to see whether it can distinguish very popular (viral) articles from the rest.
Justin Blanford, Analyzing Consumer Rating Data on Beer Products to Build a Recommendation System Using Collaborative Filtering, April 2016, (Yan Yu, Michael Seitz)
The Information Age has made accessible a growing amount of data quantifying our world and human behavior. At times, however, it has been unclear how we can benefit from this data and gather insight. A recommendation system is one tool that addresses this problem. Recommendation systems have been used for many years by Amazon, Netflix, Pandora, and other platforms to guide customers to products or content they would enjoy. Using collaborative filtering and a consumer review dataset, I was able to recommend products to a given user based on their preferences and interests. Specifically, the recommended products are beers from a consumer review aggregator named BeerAdvocate. The dataset, collected from January 1998 to November 2011, includes over 1.5 million reviews, more than 30 thousand users, and more than 65 thousand beer products.
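A minimal user-based collaborative filtering sketch, in the spirit of the approach described above: unrated beers are scored by similarity-weighted ratings from other users, with cosine similarity computed over co-rated items. The users, beers, and ratings are invented; at BeerAdvocate scale one would use sparse matrices and neighborhood pruning.

```python
# Toy user-based collaborative filtering with cosine similarity.
# All ratings are hypothetical.

from math import sqrt

ratings = {
    "alice": {"ipa": 5, "stout": 4, "lager": 1},
    "bob":   {"ipa": 5, "stout": 5, "porter": 4},
    "carol": {"lager": 5, "pilsner": 4, "ipa": 1},
}

def cosine(u, v):
    """Cosine similarity over the items both users have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    num = sum(u[i] * v[i] for i in common)
    return num / (sqrt(sum(u[i] ** 2 for i in common)) *
                  sqrt(sum(v[i] ** 2 for i in common)))

def recommend(user, k=1):
    """Rank unrated items by similarity-weighted ratings from other users."""
    scores = {}
    for other, their in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], their)
        for item, r in their.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * r
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend("alice"))  # bob's taste matches alice's, so his "porter" ranks first
```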
Alex Michael Wolfe, An Evaluation of Tax-Loss Harvesting Trading Strategies, April 2016, (Yan Yu, David Kelton)
Investment managers and financial advisors actively seek investment strategies that generate positive excess returns over time. The majority of academic research indicates that trading strategies do not consistently generate excess returns over time. However, the strategy of tax-loss harvesting (TLH) has become popular and is claimed to generate excess returns; this strategy seeks to take advantage of the U.S. tax code as it pertains to the realization of capital losses and the difference between short-term and long-term marginal tax rates on capital gains.
While TLH has been used by traditional financial advisors for years, the advent of robo-advisors has amplified its importance. Using algorithmic trading, robo-advisors such as Betterment and Wealthfront claim to generate substantial positive excess returns for investors through daily TLH. This paper tests a single version of TLH using backtesting and simulation based on the daily stock price returns of 147 individual stocks from 1985 through 2014. The backtesting models indicate that both annual and daily TLH strategies generate positive excess returns, with the daily strategy's excess returns being the largest. The annual simulation model shows a negative mean excess return from TLH, while the daily simulation model shows no statistically significant effect of TLH on mean returns, likely because the heavy computational requirements limited the number of simulation replications.
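A single tax-loss harvesting step, of the kind the strategy above repeats, can be stylized as follows. The prices and tax rate are hypothetical, and wash-sale rules are ignored for simplicity.

```python
# Stylized TLH step: when a position trades below its cost basis, sell to
# realize the loss and immediately rebuy (ignoring wash-sale rules), banking
# loss * short-term rate as a tax offset. Numbers are hypothetical.

def harvest(basis, price, shares, st_rate=0.35):
    """Return (new_basis, tax_benefit). Harvest only if the position is down."""
    if price >= basis:
        return basis, 0.0                    # nothing to harvest
    realized_loss = (basis - price) * shares
    return price, realized_loss * st_rate    # basis resets to the repurchase price

basis, benefit = harvest(basis=100.0, price=80.0, shares=50)
print(basis, benefit)   # the $1,000 realized loss yields roughly a $350 offset
```

The excess return comes from deferring tax and from the spread between short-term and long-term rates, which is what the backtests above quantify.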
Shivang Desai, Heart Disease Prediction, April 2016, (Dungang Liu, Edward Winkofsky)
Healthcare management is of great importance today, and governments, corporations, and individuals have much at stake when it comes to public health. This research project takes a data-driven approach to the problem of heart disease in individuals. Analytical tools are used to perform a thorough data exploration, yielding key insights that inform the modeling process. A logistic regression model is built to estimate each individual's risk of heart disease, and a classification tree is built to classify individuals into two groups according to whether or not they have heart disease. The project can help insurance companies set an individual's insurance premium based on personal characteristics such as age, sex, and cholesterol level, and can help health organizations screen individuals for heart tests based on the model results.
Yogesh Kauntia, Identifying Bad Car Purchases at Auctions, April 2016, (Dungang Liu, Edward Winkofsky)
Used car dealers often buy cars from auctions, which sometimes do not allow a thorough inspection of the vehicle before purchase. However, the auction provides a list of metrics (model, sub-model, odometer reading, etc.) to help dealers make a decision. The objective of this analysis is to build a model that helps dealers decide whether an automobile on auction is a good purchase, which essentially means whether the dealer can resell it for a profit. Different data mining classification algorithms are tried and compared to identify the best model for this problem.
Naga-Venkata-Bhargav Gannavarapu, Gentrification of Hamilton County, April 2016, (Peng Wang, Olivier Parent)
Gentrification can be defined as the process of renewal and rebuilding accompanying the influx of middle-class or affluent people into deteriorating areas, which often displaces poorer residents. Hamilton County is the third most populous county in Ohio, and the University of Cincinnati is located there. This analysis was performed to identify the census tracts in Hamilton County that might have gentrified between 2000 and 2010. Census tracts are small, relatively permanent statistical subdivisions of a county or equivalent entity that are updated by local participants prior to each decennial census as part of the Census Bureau's Participant Statistical Areas Program.
Satwinderpal Makkad, Functional Data Analysis of Plant Closings, April 2016, (Amitabh Raturi, Peng Wang)
This study examines the performance of firms that announced plant closings during a period of 17 quarters (8 quarters before and 8 quarters after the announcement) using functional data analysis (FDA) methodology. An analysis of individual firms’ financial dynamics is performed using their return on assets (ROA), operational dynamics using a sales-by-inventory ratio (SalesByInv), and resource-deployment or capacity-utilization dynamics using a sales-by-property-plant-and-equipment ratio (SalesByPpe), treating each as continuous functional data during these periods. The aim is to answer questions such as: did firms benefit from closing a plant in terms of ROA, SalesByInv, and SalesByPpe; were any firms negatively impacted by closing a plant; or did plant closing have no impact on firm performance? Hierarchical clustering analysis is used to segment the dataset and find classes of firms that demonstrate similar performance patterns. An analysis of 3-cluster, 4-cluster, and 5-cluster segmentations indicates that the 3-cluster analysis provides adequate sample sizes as well as logical differences for our tests. Three sets of hypotheses are developed that assess the antecedent and ex post financial metrics of firms that made such announcements. We relate these hypotheses to our cluster analysis results.
Calvin Kang, Analyzing NBA Rookie Data with Predictive Analytics, April 2016, (Michael Magazine, Peng Wang)
Since the inception of the NBA, professional basketball teams have relied on the yearly draft to bring in new, talented players. More often than not, these players played one or more years at the college level before going professional. Those that joined the NBA without college were exceptional high school players or players who played professionally overseas. Teams get a limited number of picks each year, so it is a huge benefit to choose an exceptional player to play for them or to trade for one as an asset. It is never possible to completely gauge how a new player will adjust to NBA play. However, I believe looking at past performance in college or other leagues can give some insight into what kind of splash a player can make in the NBA. In this report, I use statistics and data mining techniques to analyze and predict the performances of players, using data on players from the past 10 years.
Apoorv Agarwal, Privacy Preserving Data Mining: Comparing Classification Techniques while Maintaining Data Anonymity, April 2016, (Dungang Liu, Peng Wang)
The objective of this study is to compare different methods for binary prediction and highlight the accuracy of each. Using real-life data from the banking industry, classification techniques will be run and the results from each model compared. The analysis will be done while maintaining data privacy: the data have de-identified variables and observations, meaning no information about the variables and their values is known, so decisions must be purely statistics-based. Such data mining techniques have recently seen growing significance. Finally, a conclusion will be made on the logic for selecting the final models based on the costs of Type I and Type II errors and the application of such methods to real-life problems.
Abhas Bhargava, Wine Analysis, April 2016, (Dungang Liu, Edward Winkofsky)
The wine industry is on the rise, which can be attributed to both social and casual drinking. The big players in the industry are getting bigger, and the small wineries are getting swallowed up. As of April 2016, the global wine trade turned in a healthy performance as wine lovers drove increased consumption. The competition is immense, and different players in the industry implement different strategies to keep pace. Some believe in spending on acres of desirable vineyard land, while some believe in spending on nothing more than brand names. However, what matters most is how wine tasters perceive a particular wine. Wine tasting is the sensory examination and evaluation of wine.
Another notable feature is that all wines possess a set of physicochemical properties that may or may not affect their quality. The idea of this paper is to analyze wine ratings with respect to these physicochemical properties. The question everybody wants to answer is whether a wine's rating is related to its physicochemical properties. Using supervised learning techniques, we will identify a set of physicochemical properties that lead to high ratings for a particular wine. As secondary research, we will also try to differentiate between red and white wines based on their physicochemical properties; the question here is whether we can differentiate between the two types of wine without looking at their appearance.
Rahul Chaudhary, EPL Transfers Analysis, April 2016, (Michael Magazine, Peng Wang)
The English Premier League, soccer’s richest league, is used to multi-million-pound deals to bring in the best soccer talent from around the world. The focus of this study is to predict the transfer values of players coming into the English Premier League while identifying the most important components that go into formulating a player’s transfer amount. This study also elaborates on the data collection process and why particular variables were selected for the model. The study was based on linear regression and regression trees; since interpretability of the model is of the utmost importance, more complex models were not explored. It was vital to thoroughly test the linear model for normality, homoscedasticity, and independence to ensure that predictive power was not affected.
Jeremy Santiago, Predicting Success of Future Rookie NFL Running Backs, April 2016, (Yan Yu, Martin Levy)
Sports as a business is big money. Each year, athletes’ contracts reach new highs; superstars and rookies alike see larger contracts. However, the money put into these rookie contracts doesn’t always yield returns on the playing field, due to the unmeasurable uncertainty of how rookies will handle the transition to the highest competitive level. In this capstone, I focus specifically on analyzing and predicting the performance of NFL rookie running backs. For decades, the analyses done to predict rookies’ abilities in the NFL have been very subjective; the evaluations rely on game-tape analysis, scouts’ intuitions about players’ abilities, and comparisons of athletes to those they compete against. This is limiting for analysis, as the majority of data gathered is subjective or circumstantial. This capstone focuses on using completely quantitative data to predict the performance level of future rookie running backs in the NFL.
Fadiran Oluwafemi, An Investigation on Factors affecting Internet Banking in Nigeria, April 2016, (Yan Yu, Edward Winkofsky)
This study investigates the factors that influence and discourage the adoption of electronic banking in Nigeria. Various studies show that electronic banking systems bring about cost reduction, time efficiency, ease of access, and improved customer relationship, to both the financial systems and customers. Conversely, research shows that a huge percentage of the Nigerian population still adopts the traditional methods of banking, thus internet banking facilities are largely underutilized.
This research uses the Theory of Reasoned Action (TRA), the Technology Acceptance Model (TAM), and the Theory of Planned Behavior to categorize the factors that influence and discourage the adoption of internet banking. The factors are demographic factors, perceived risk factors, and limitations. Results from this study suggest that well-educated young adults with average income levels are the most significant group of people in Nigeria who currently utilize internet banking systems. Findings also show that perceived risks, including financial, performance, and social risks, discourage use of internet banking due to a history of high crime rates and corruption that negatively affects how consumers perceive internet banking and its usefulness. Lastly, limitations in the form of poor infrastructure, such as unreliable power supply and poor internet service, were found to also discourage the use of internet banking.
Joseph Charles Frost, Optimizing Course Schedules Using Integer Programming: Minimizing Conflicts in the Assignment of Required Courses Across Majors, November 2015, (Michael Magazine, David Rogers)
Universities commonly strive to develop schedules containing enough course options for students of any major to advance in their programs each semester. Organizing these courses into available periods can be as important to the scheduling process as deciding which courses to offer, and often proves itself the more challenging task. Offering two courses that are both required for the same major during overlapping periods can often create conflicts that constrain students’ schedules, forcing them to delay enrolling in certain requirements until they are offered in future semesters or even years. Some universities spend hours generating schedules with nothing more than intuition. Using integer programming, I was able to reorganize the schedule of one of these universities, Cincinnati Christian University, significantly decreasing the number of required courses for each major offered in overlapping periods.
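The scheduling objective described above can be made concrete with a toy example. The project itself used integer programming; the sketch below (entirely made-up majors, courses, and periods) instead brute-forces the tiny search space, just to show the conflict-counting objective being minimized:

```python
# Toy illustration of the course-scheduling objective: assign required
# courses to periods so that courses required by the same major overlap as
# little as possible. The majors/courses/periods are hypothetical, and
# brute force stands in for the integer program used in the real project.
from itertools import product

majors = {                        # major -> its required courses (made up)
    "Bible":    ["BI101", "TH201"],
    "Music":    ["MU110", "TH201"],
    "Ministry": ["BI101", "MU110"],
}
courses = ["BI101", "TH201", "MU110"]
periods = [1, 2, 3]

def conflicts(assignment):
    """Count pairs of same-major required courses placed in the same period."""
    total = 0
    for reqs in majors.values():
        for i in range(len(reqs)):
            for j in range(i + 1, len(reqs)):
                if assignment[reqs[i]] == assignment[reqs[j]]:
                    total += 1
    return total

# Enumerate every course-to-period assignment and keep the least conflicted.
best = min((dict(zip(courses, assign))
            for assign in product(periods, repeat=len(courses))),
           key=conflicts)
```

With three courses and three periods, a conflict-free schedule exists; at realistic scale the search space explodes, which is exactly why the project turned to integer programming.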
Debjit Nayak, Operations Improvement using Data Analytics in the Cincinnati Police Department, November 2015, (Amitabh Raturi, Michael Magazine)
I took the data visualization class at the University of Cincinnati while pursuing my Master's in Business Analytics. There I was introduced to visualization with Tableau and how it can provide quick insight into data. The internship at the City of Cincinnati gave me an opportunity to use Excel, Tableau, ArcMap, and Visio for process improvement and data visualization. These tools became part of ongoing processes used by the City; in fact, the city thereafter specifically used dashboards that I created to discuss issues in their “stat meeting”. In Sections 2-4, I provide several examples of visualizations I enabled, including those in the areas of police overtime, homicides, fatal and nonfatal shootings, and response sheets. I conclude with observations about what I learned and some implementation hurdles.
Yunzheng He, ChoreMonster Data Mining with R, December 2015, (Michael Fry, Uday Rao)
Studies have shown that giving children household chores at an early age helps to build a lasting sense of mastery, responsibility, and self-reliance. ChoreMonster provides an easy solution to replace chore charts and make chores more enjoyable for children and parents. In-app user information and reward data were collected to improve app functionality and provide a better user experience. In this project, we use R to perform data mining on the ChoreMonster dataset. Data were converted, imported into the R workspace, and cleaned. Summary statistics are extracted for numeric variables. Linear and logistic regression models were built to reveal relationships between reward type and a child's gender. Text mining was applied to a character variable in order to investigate the forms of reward given to kids.
Thanish Alex Varkey, SPLUNK Dashboard Proof of Concept - Charles Schwab, December 2015, (Peng Wang, Edward Winkofsky)
Compliance is very expensive for companies, but the cost of noncompliance is shutting down shop. Having seen the likes of Enron shut down, corporate America treats compliance as a very important matter.
Schwab Compliance Technology (CST) had recently implemented SPLUNK for its network monitoring and log analysis. The project was to study SPLUNK's search processing language in order to implement dashboards that would help CST not only troubleshoot problems but also get to know its customers better. Patterns of usage, duration of usage, and capacity needs were not known. Some of the benefits of this system, if implemented, were that Schwab would be:
- Able to identify their top customers
- Able to detect customers who sent incorrect files
- Able to understand what devices the customers use for logging in
- Able to monitor capacity management for the server
- Able to monitor performance of the application by studying response time for each screen
The desired system will help not only the monitoring team but also the development team, in understanding the performance of the application, and the infrastructure management team, in understanding server usage for capacity planning. The end deliverable was a dashboard of the use cases developed by CST. This report contains the queries used as the building blocks of the dashboard and the XML version of the dashboard itself. The project also included an analysis of CST's market segments, using the data to predict server capacity and customer usage.
David Rodriguez, Cincinnati Bell, August 2015, (Yan Yu, Michael Magazine)
Cincinnati Bell, like many telecommunications companies, faces the problem of customer churn. There are two types of churn: voluntary and involuntary. Voluntary churn occurs when a customer chooses to opt out of their Cincinnati Bell service. Involuntary churn occurs when a customer fails to pay their bill for four consecutive months and Cincinnati Bell terminates the customer’s service. This is an issue from a business perspective because each new customer comes with fixed costs; for Cincinnati Bell to see a positive ROI, customers must continue their service for a minimum of several months. Cincinnati Bell’s business plan is to build out to areas that would statistically present a low risk of involuntary churn. This paper explores different types of models to help predict involuntary churn.
Vijendra Krishnamurthy, Breast Cancer Diagnosis by Machine Learning Algorithms, August 2015, (Dungang Liu, Mike Magazine)
This dataset (the Wisconsin Breast Cancer Data) was obtained from the University of Wisconsin Hospitals, Madison, from Dr. William H. Wolberg. This project uses different machine learning algorithms to diagnose cancer as benign or malignant. The study also compares the results from the various algorithms and combines their predictions using ensemble methods to try to obtain better predictive performance than is possible with the individual methods. Classification algorithms used in this project include decision trees, support vector machines, logistic regression, the naïve Bayes classifier, and k-nearest neighbours. Each algorithm is trained on 80% of the dataset and then tested on the remaining 20%. Finally, ensemble methods are used to combine the results from the various algorithms.
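The final ensemble step the abstract mentions can be sketched in a few lines: combine the class predictions of several trained classifiers by majority vote. The "classifiers" below are hard-coded prediction lists standing in for trained models, since the real dataset and fitted models are not reproduced here:

```python
# Minimal sketch of a voting ensemble: each observation's final label is the
# majority vote across the individual models' predictions. The prediction
# lists (benign = 0, malignant = 1) are hypothetical stand-ins for models
# such as a decision tree, an SVM, and k-NN.
from collections import Counter

def majority_vote(*prediction_lists):
    """Combine per-observation predictions from several models by majority vote."""
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*prediction_lists)]

tree_preds = [1, 0, 1, 0, 1]
svm_preds  = [1, 0, 0, 0, 1]
knn_preds  = [1, 1, 1, 0, 0]
ensemble   = majority_vote(tree_preds, svm_preds, knn_preds)
# ensemble disagrees with each individual model on at least one case
```

Because each model errs on different observations, the vote can be more accurate than any single model, which is the motivation for the ensemble step.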
Shashi Kanth Kandula, Internship with Ethicon (Johnson & Johnson), August 2015, (Andrew Harrison, Yichen Qin)
Ethicon Endo-Surgery, Inc., is a subsidiary of Johnson & Johnson that focuses primarily on surgical instrument manufacturing. My association with Ethicon as an intern was quite eventful, with direct involvement in projects affecting sales strategy. The internship provided me an opportunity to get an inside view of the health care industry and to learn new tools such as IBM Cognos and concepts of data warehousing. During my tenure at Ethicon I worked on multiple projects using tools such as SQL, R, and advanced Excel for data mining and data analysis. I was also involved in a sales forecasting project using time series analysis, wherein we built a model to forecast sales for the coming 12 months for a given product line using historical data mined from the data warehouse.
Kala Krishna Kama, Network Intrusion Detection System Using Supervised Learning Algorithms, August 2015, (Dungang Liu, Yichen Qin)
Intrusion detection is a term we come across fairly regularly these days. With the explosion of the World Wide Web and massive growth in computer networks, network security is becoming a key issue to tackle. This has resulted in enormous research toward building intrusion detection systems capable of monitoring network or system activities for malicious behavior.
The aim of this capstone project is to build a classifier capable of distinguishing between "bad" connections, called intrusions or attacks, and "good" normal connections. In this project, a comprehensive set of classifier algorithms is evaluated on the KDD dataset. Since there are four different attack types, different algorithms are likely to exhibit different performance for a given attack category. The aim is to verify the effectiveness of the different classifier algorithms and build the final model that does the best job of predicting intrusions. The supervised learning algorithms multinomial regression, decision trees, naïve Bayes classifier, k-NN classifier, and random forests are used to build the classifier models. Measures like classification percentage, misclassification rate, and misclassification cost are used to evaluate the models.
Xiaoyu Zhu, The Comprehensive Capital Analysis and Review, August 2015, (Peng Wang, Yichen Qin)
To promote a safe and stable banking and financial system, the Federal Reserve needs to regulate and supervise financial institutions, including bank holding companies, savings and loan holding companies, state member banks, and systemically important nonbank financial institutions.
One of these supervisory programs is the Comprehensive Capital Analysis and Review (CCAR), which evaluates a bank holding company’s capital and its planned capital distributions to keep its risk within an acceptable range. For bank holding companies with assets of $50 billion or above, this program ensures they have effective capital planning processes and sufficient capital to absorb losses during stressful conditions, such as the Great Recession of 2008. If the Federal Reserve objects to a bank holding company’s capital plan, the company may not be able to make any capital distribution. In general, the modeling team's job is to build various models to forecast the bank's risk under different economic scenarios.
Charu Kumar, Forecasting Crude Oil Prices, August 2015, (Dungang Liu, Brian Kluger)
Oil prices can depend on many factors, leading to a volatile market and creating a need for forecasting. Monthly data from 1986 to 2015 are collected and divided into training and testing sets in order to validate the model. Thereafter, ARIMA modeling is used to derive the model. The data are first transformed to account for non-stationarity in variance, and a step-wise iterative procedure is outlined for model diagnostics and model fitting. The forecast for 2015 is compared with the testing data, and the mean squared error (MSE) is reported. The model is also applied to 2014, 2013, and 2012 and compared to the testing data for each of those years.
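The train/fit/forecast/MSE workflow the abstract describes can be illustrated with the simplest member of the ARIMA family, an AR(1) model, fit by least squares on synthetic data (the study's actual oil-price series and model order are not reproduced here):

```python
# Illustration of the ARIMA-style workflow on synthetic data: simulate an
# AR(1) series, split into training and testing periods, fit the AR(1)
# coefficients by least squares, then report one-step-ahead forecast MSE.
import numpy as np

rng = np.random.default_rng(0)

# Simulate an AR(1) series: x_t = 5 + 0.7 * x_{t-1} + noise
x = np.zeros(300)
for t in range(1, 300):
    x[t] = 5 + 0.7 * x[t - 1] + rng.normal(scale=0.5)

train, test = x[:250], x[250:]

# Least-squares regression of x_t on x_{t-1} recovers intercept and phi
A = np.column_stack([np.ones(len(train) - 1), train[:-1]])
(c, phi), *_ = np.linalg.lstsq(A, train[1:], rcond=None)

# One-step-ahead forecasts over the test period, and their MSE
preds = c + phi * x[249:-1]
mse = np.mean((preds - test) ** 2)
```

The estimated autoregressive coefficient lands near the true 0.7, and the one-step MSE approaches the noise variance, mirroring the validation-against-held-out-years step in the study.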
Vasudev Daruvuri, Commercial General Liability Forms: Text Mining for Key Words and Scalable ML & Movie Recommendations over Apache Spark, August 2015, (David J. Curry, Jaime Windeler)
Commercial General Liability Forms (Text Mining for Key Words):
- Project 1: The objective of this project is to identify and extract the important keywords for each document from a large pool of commercial-insurance-related documents (1,500+ so far) and to associate the keywords with the existing search engine for those documents, enabling faster querying and more accurate document search.
- Project 2: This is a proof-of-concept (POC) project to perform movie rating analysis and predict personalized user ratings for movies. The objective is to implement collaborative filtering on a large dataset of 500,000 ratings in a distributed environment on Apache Spark.
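The collaborative filtering idea behind Project 2 can be sketched in memory at toy scale: predict a user's rating for an unseen movie from the ratings of similar users. The real project ran distributed filtering on Apache Spark over 500,000 ratings; the version below uses a tiny made-up rating matrix (0 = unrated) purely to show the mechanic:

```python
# Toy user-based collaborative filtering: predict user u's rating for movie m
# as a cosine-similarity-weighted average of other users' ratings for m.
# The 4x4 rating matrix is hypothetical; 0 means "not rated".
import numpy as np

ratings = np.array([      # rows: users, columns: movies
    [5, 4, 0, 1],
    [4, 5, 0, 1],
    [1, 1, 0, 5],
    [5, 4, 4, 1],
])

def predict_rating(ratings, user, movie):
    """Similarity-weighted average of other users' ratings for `movie`."""
    raters = [u for u in range(len(ratings))
              if u != user and ratings[u, movie] > 0]
    num = den = 0.0
    for u in raters:
        # cosine similarity between the two users' full rating vectors
        sim = (ratings[user] @ ratings[u]) / (
            np.linalg.norm(ratings[user]) * np.linalg.norm(ratings[u]))
        num += sim * ratings[u, movie]
        den += sim
    return num / den

# User 0 has not rated movie 2; only the very similar user 3 has (rating 4),
# so the prediction comes out at 4.
pred = predict_rating(ratings, user=0, movie=2)
```

At Spark scale this neighborhood approach is typically replaced by matrix factorization (e.g. ALS), but the underlying prediction idea is the same.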
Nikhil Shaganti, RiQ Precision Targeting- Coupons.com, August 2015, (Yan Yu, Peng Wang)
Retailers and brands alike are facing many challenges today due to changes in the marketplace, e.g. shopper expectations, channel proliferation, and trip erosion. Coupons.com recognizes the power of context and has built the Retailer iQ (RiQ) platform, which engages consumers with clients' content in ways that are most relevant to them, when they are most apt to receive it. The RiQ platform benefits all parties involved. It offers omnichannel engagement, meaning shoppers can engage however and whenever they want, whether through web, mobile, or social. It utilizes shopping behavior data to personalize and target coupons and media to shoppers, delivering an experience that is most relevant to them. This report gives a brief overview of two retailer-specific targeting campaigns designed for the biggest yogurt brand and the biggest personal care products brand in its vertical. The main objectives of these campaigns are to drive trial, capture incremental new buyers, drive brand loyalty and migration up to premium brands, and regain lost or lapsed customers. Point-of-sale (POS) data shared by the retailer make it easy to analyze a consumer's shopping behavior, in turn helping with behavioral targeting. The targeting campaign was a huge win for both the CPGs and the retailer in driving trial, with 69% incremental buyers and 220% larger basket rings, respectively.
Ama Singh Pawar, A Study of Approaches To Enhance The Comprehensibility of Both Opaque & Transparent Data Mining Techniques, August 2015, (James Evans, Jaime Windeler)
Recent advances in computing technology in terms of speed and cost, along with access to tremendous amounts of computing power and the ability to process huge amounts of data in reasonable time, have spurred increased interest in data mining applications. In this capstone project, a census dataset (Adult) is selected from the UCI Machine Learning Repository, with the objective of predicting whether a selected individual's income exceeds $50K/yr. Across several data mining techniques for classification, the high-accuracy predictive models tend to be opaque ("black box" techniques), i.e. techniques that can generate valid, highly accurate predictions but are not capable of identifying the specific nature of the interrelations between the variables on which the predictions are based. Moreover, even transparent models become very difficult to interpret when they have too many variables. In many real-world business scenarios this is unacceptable, because models need to be comprehensible. To obtain comprehensibility, accuracy is often sacrificed by using simpler but transparent models. This project demonstrates the trade-off between accuracy and comprehensibility. Moving forward, I discuss rule extraction, genetic ensembles, and some visualization techniques to extract accurate but comprehensible rules from opaque predictive models. I also cover concepts such as feature selection, dimensionality reduction, lasso, and ridge regression to make transparent models with many parameters easier to comprehend.
Xinqi Lin, A Comparison of Regression Tree Techniques for Customer Retention Rate Prediction, August 2015, (David Rogers, Peng Wang)
American Modern Insurance Group is a widely recognized national leading small niche company in the specialty insurance business. As for all other business, retaining customers is very important to American Modern Insurance Group. They assigned a project to find out the driving force of their customer retention rate. In a previous group case study, we developed a best model we believed contains the critical predictors for the policy renewal variable, and also generated a pricing elasticity curve. This project is an extension of the previous work. The objective is to determine the best suited regression algorithm to predict the customer retention from a list of predictors that are believed to influence it. I explore the different regression modelling techniques such as classification tree, random forest, and bagging and boosting, that can be built for predicting and attempt to find the best fit model to predict the response variable. This paper can serve to be valuable for the study of the comparison of different regression modelling techniques in both R and SAS EM and to find the best one.
Hannah Cunningham, Incorporating Motor Vehicle Records into Trucking Pricing Model for Great American Insurance Group, August 2015, (David Rogers, Edward Winkofsky)
The objective of this project was to incorporate driving record data into Great American Insurance Group’s current trucking physical damage pricing model. For the case study, we intended to use linear regression to determine which categories of driving violation, as specified by Great American’s underwriters, had the most impact on the loss ratio, and to use those findings to create a scoring system for drivers or policies. Linear regression did not yield statistically significant categories, so we could not use those results to calculate scores. The client then provided weights and scoring methods for us to test, and we found that an unweighted arithmetic average was the scoring method with the most statistically significant relationship to the loss ratio. For my extension of the case study, I used a generalized linear model with a log link function to determine the relationship between violation categories and the loss ratio. This technique showed that Minor and Major Accident violations are significantly related to the loss ratio. Using the coefficient estimates from the generalized linear model as weights, I calculated new scores and compared them to the previously calculated scores. Again, the unweighted arithmetic average was the most statistically significant.
Yang Yang, Determining the Primary Factor in Kids’ Chore Rewarding Strategy, August 2015, (Michael Fry, Edward Winkofsky)
ChoreMonster is an app developed to help parents create and track chores for their kids, who are rewarded in various ways for finished chores. The reward data are recorded along with the child’s age, gender, parent’s gender, and other geographic and time information. The objective of this project is to test the hypothesis that age is the primary factor in predicting suggested rewards, by analyzing a fragment of data provided by the developer, in order to build up our understanding of the chore rewarding mechanism. The insights will also allow the developer to create new areas of segmentation for family-suggested rewards and enrich the user experience. We utilize grouping schemes and logistic regression to analyze the data, and also deploy Tableau for data processing and visualization.
Ambrose Wong, Sensitivity of Client Retention Modeling with Respect to Changes in the Costs of Misclassification, August 2015, (David Rogers, Peng Wang)
This paper is an extension of the project produced for the American Modern Insurance Group as part of the case studies course in BANA 7095. The previous project focused on exploratory data analysis and modeling, while this paper focuses on adding a cost function to two different models: logistic regression and classification trees. The cost function mitigates the imbalanced nature of the dataset, which stems from approximately 90% of the customers staying with the firm and the remaining 10% leaving. The cost function was used to adjust the cost of false positives and to choose the optimal cut-off probability for the models. Naturally, this led to changes in the confusion matrix and the misclassification rate for both models.
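The cut-off selection step described above can be sketched directly: with unequal costs for false positives and false negatives, sweep candidate probability cut-offs and keep the one with the lowest total misclassification cost. The probabilities, labels, and cost ratio below are hypothetical, not the project's actual figures:

```python
# Cost-sensitive cut-off selection: on imbalanced data where missing a
# churner (false negative) is far costlier than a false alarm, the optimal
# cut-off ends up well below the default 0.5. Costs and data are made up.

def best_cutoff(probs, labels, cost_fp=1.0, cost_fn=9.0):
    """Return the cut-off (from a 0.01 grid) minimizing total cost."""
    def total_cost(cut):
        cost = 0.0
        for p, y in zip(probs, labels):
            pred = 1 if p >= cut else 0
            if pred == 1 and y == 0:
                cost += cost_fp      # false positive
            elif pred == 0 and y == 1:
                cost += cost_fn      # false negative (much costlier here)
        return cost
    grid = [i / 100 for i in range(1, 100)]
    return min(grid, key=total_cost)

# Toy imbalanced scores: churners (label 1) are rare but expensive to miss.
probs  = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.60, 0.80]
labels = [0,    0,    0,    0,    1,    0,    1,    1]
cut = best_cutoff(probs, labels)   # lands well below 0.5
```

Lowering the cut-off trades extra false positives for fewer costly false negatives, which is exactly the adjustment the cost function makes to the confusion matrix.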
Luv Sharma, Forecasting Commercial Mortgages under Various Macroeconomic Scenarios, August 2015, (David Rogers, Michael Magazine)
The objective of the case study is to forecast aggregate commercial mortgages of four kinds (non-farm and non-residential, multifamily residential real estate, construction and development, and farmland) for all FDIC-insured institutions for the next 12 quarters. Each mortgage type is forecast under three macroeconomic scenarios: Baseline, Adverse, and Severely Adverse. These scenarios depend on broad macroeconomic indicators that are included in the model to reflect the effects of the macro economy on mortgages. The case study also includes an analysis of risk with respect to mortgage aggregates in recessionary environments and a deeper look at how these variables behave and relate to each other. The forecasting is carried out using dynamic regression models that combine the inherent time series nature of the variables with a regression component integrating the relationship between the mortgages and the macroeconomic indicators. The statistical software R was used to analyze the time series, build the models, and forecast. The historical and forecast datasets were provided by US Bank and other publicly available sources.
Kathleen Teresa Towers, Endangered or Threatened? Manatee Mortalities 2000-2014, August 2015, (Roger Chiang, Yan Yu)
In 2007, the U.S. Fish and Wildlife Service recommended downlisting the Florida manatee (Trichechus manatus latirostris) from endangered to threatened status. Since then, record levels of mortality due to severe winter weather and recurrent toxic red tide algal blooms indicate that the proposal may have been premature. A shortage of resources has delayed the U.S. Fish and Wildlife Service's most recent scheduled review of the species' status, and therefore available datasets detailing the latest mortality trends have not been sufficiently analyzed. In order to provide an accessible, intuitive interface for non-experts concerned with the manatee population's status, this project compiled recent mortality data from two different datasets, modeled updated estimates for causes of manatee mortality, and created a series of interactive visualizations. It is essential that policymakers consider recent years' trends in causes of manatee mortality before deciding whether it is time to reclassify the Florida manatee's conservation status.
Jhinak Sen, Analytical Approach Used for Business Strategy by Retail Industry, August 2015, (Peng Wang, Michael Magazine)
With ever-increasing point-of-sale information at retail stores, it is important to utilize these data to understand customer behavior and the factors that drive customers to purchase items. To understand customer behavior, a coupon-redemption study was carried out by the company. The aim was to identify and categorize brands according to coupon redemption relative to total sales volume. Coupon redemption indicates consumer response to an advertisement, which in turn indicates the effectiveness of the advertisement and the opportunity for a brand to make innovative products. The study comprised designing the experiment, gathering datasets, analyzing the data, and developing new business insights. The project considered a particular retail chain with stores throughout the country. Sales and discount information was collected for selected products, and performance scores were calculated based on relative sales and discount ratios. Using these performance scores, products were categorized as good, neutral, or bad performers. The outcome of the study helped companies understand their brand placement and optimize their targeting strategy for the retail store.
Ankit Jha, Analyzing a Churn Propensity for an Insurance Data Set Using CHAID Models, August 2015, (Amitabh Raturi, Mike Magazine)
This paper reviews the insurance industry solutions offered by IBM as part of its PCI (Predictive Customer Intelligence) stack. One of the product features provides insights into churn propensity, which forms the key focus of this paper. However, as this product is the first of its kind in the market, actual customer data were not available, so our team leveraged a sample dataset created in consultation with industry experts. Academic instincts directed us toward models such as logistic regression and classification trees. The IBM SPSS Insurance Customer Retention and Growth Blueprint was followed as the definitive guideline for this modeling exercise. Finally, a CHAID model was used to build the classification tree, with misclassification rate as the model performance metric.
Sruthi Susan Thomas, Analyzing Yelp Reviews using Text Mining and Sentiment Analysis, August 2015, (Zhe Shan, Peng Wang)
The objective of this project is to analyze Yelp restaurant reviews with text mining and sentiment analysis in Python, to understand how many reviews expressed positive or negative sentiments about the food and the service, and what they had to say about each. This will help a restaurant owner understand the areas in which they are doing well and the areas they need to improve with regard to their food and service.
Ravi Shankar Sumanth Sarma Peri, A Study on Prediction of Client Subscription to a Term Deposit, August 2015, (Yan Yu, Peng Wang)
As the number of marketing campaigns increases, it is getting harder to target the appropriate segment of the general public. Firms use data mining techniques to gain competitive advantage: through these techniques they can identify valuable customers and thereby increase the efficiency of their marketing campaigns. In this project, bank marketing data are used that provide information about each client, such as age, marital status, and education level. The classification goal is to predict whether the client will subscribe to a term deposit. Different models are built and then compared to assess their accuracy.
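As an illustration of the classification step described above, the sketch below fits a logistic regression model, a standard choice for this kind of binary subscribe/not-subscribe prediction, by plain batch gradient descent. The features and data are invented for illustration and are not the project's bank marketing dataset.

```python
import math
import random

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit logistic regression by batch gradient descent.
    X: rows of features (an intercept is added internally); y: 0/1 labels."""
    rows = [[1.0] + list(r) for r in X]   # prepend intercept term
    w = [0.0] * len(rows[0])
    n = len(rows)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(rows, y):
            z = sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))     # predicted P(subscribe)
            for j, xj in enumerate(xi):
                grad[j] += (p - yi) * xj       # gradient of the log-loss
        w = [wj - lr * g / n for wj, g in zip(w, grad)]
    return w

def predict(w, row):
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], row))
    return 1 if z >= 0 else 0                  # threshold at P = 0.5

# Invented clients: [scaled balance, scaled last-call duration]; 1 = subscribed
random.seed(1)
X = ([[random.gauss(1.0, 0.3), random.gauss(2.0, 0.5)] for _ in range(50)]
     + [[random.gauss(-1.0, 0.3), random.gauss(-1.0, 0.5)] for _ in range(50)])
y = [1] * 50 + [0] * 50
w = fit_logistic(X, y)
accuracy = sum(predict(w, xi) == yi for xi, yi in zip(X, y)) / len(y)
```

In practice one would compare this against other classifiers on a held-out test set, as the abstract describes.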
Rajesh Doma, Creating Business Intelligence Dashboards for Education Services Organization, August 2015, (Jeffrey Shaffer, Michael Magazine)
The objective of this report is to give details of the work performed as part of my internship at Education Services Group, Cincinnati, Ohio. Education Services Group (ESG) is a small company (< 50 employees) located in the eastern part of Cincinnati. It partners with technology companies to create best-in-class education and training businesses for those companies. It also provides a full range of professional services, support, and tools specific to an education business, helping clients increase education revenue and lower operational costs. Educational products such as on-demand training, classroom training, labs, and materials are hosted, managed, and delivered using Learning Management Systems (LMS). An LMS can be in-house or driven through third-party software products, and it contains training data on the individuals who have purchased the education products. Sales and marketing of these products are managed using Customer Relationship Management (CRM) products such as Salesforce.com and Marketo, respectively.
As a Business Intelligence Analyst, I worked on the extraction, analysis, and visualization of training and sales data mined from these systems. The primary focus of the work was to develop descriptive analytics: reports and dashboards that give actionable insights to senior management.
Sean Ashton, Cincinnati Bell Internship Capstone: Survival Analysis on Customer Tenure, August 2015, (David Kelton, Edward Winkofsky)
In the telecommunications industry, revenue is almost entirely generated through customer subscriptions: the customers pay a monthly bill and the company provides them with services. The more monthly bills a customer pays, the more valuable they are to the company, so the length of a customer’s tenure is an essential factor in determining that customer’s value. This research paper focuses on the survival analysis that was performed to understand how long certain types of customers will stay with the company. Customers were classified into different groups based on their demographic information, and significant differences in tenure length were found across demographic groups.
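The core of a tenure survival analysis like this is an estimator that can handle customers who are still active (censored observations), most commonly the Kaplan-Meier product-limit estimator. A minimal sketch on invented tenure data (not the Cincinnati Bell data):

```python
def kaplan_meier(durations, observed):
    """Kaplan-Meier estimate of the survival curve. `observed` is 1 if the
    customer churned at that duration, 0 if censored (still subscribed).
    Returns (time, survival probability) pairs at each churn time."""
    at_risk = len(durations)
    surv = 1.0
    curve = []
    for t in sorted(set(durations)):
        churned = sum(1 for d, o in zip(durations, observed) if d == t and o == 1)
        if churned:
            surv *= 1.0 - churned / at_risk             # product-limit step
            curve.append((t, surv))
        at_risk -= sum(1 for d in durations if d == t)  # drop churned + censored
    return curve

# Ten invented customers: tenure in months, churn flag (0 = still active)
durations = [3, 5, 5, 8, 10, 12, 12, 15, 18, 24]
observed  = [1, 1, 0, 1, 1,  0,  1,  1,  0,  0]
curve = kaplan_meier(durations, observed)
```

Fitting a curve like this per demographic group and comparing them (e.g., with a log-rank test) is the usual way differences in tenure are demonstrated.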
Navin Kumar, An Investigation of the Factors Affecting Passenger Air Fare Using Historical Data, August 2015, (Jeffrey Camm, Martin Levy)
Air fare and the cost of flying have always been as much a matter of discussion as of speculation. While dedicated advisory services exist that advise on the best time to buy tickets and on whether the current price is high or low based on historical data, much less has been said about the various factors affecting the cost of flying from one destination to another. In this project the author has tried to understand the components influencing the “average” price a customer pays for flying. Analyzing factors such as distance, passenger traffic, competition among airlines, and the market dominance of a particular airline, an attempt has been made to arrive at statistically significant results using linear regression.
George Albert Hoersting, Predicting Impact on Trucking Physical Damage Policies Using Historical Driver Data for Personal Vehicles, August 2015, (David Rogers, Edward Winkofsky)
The goal of this capstone is to determine the usefulness of driver motor vehicle records in predicting a loss ratio, defined as the total dollar amount of losses due to claims divided by the summed dollar amount of monthly premiums on a trucking insurance policy. Three data sets were provided by the Great American Insurance Group. In the data set with driver motor vehicle records, each observation has several potential violation types aggregated to a unique company policy number and renewal year. The sum of the weighted counts of these violations provides a driver’s score. New weights for the score variable are developed and inspected to see if the score is significant in predicting the loss ratio. The score variables were originally aggregated in three different ways suggested by Great American. New weights derived from the parameter coefficients of regression models run in the original case study are evaluated, and models from this capstone are compared with those of the original case study to see whether the new weights significantly affected the ability of a person’s driving record to predict the loss ratio.
Ankur Jain, Credit Scoring, August 2015, (Peng Wang, Amitabh Raturi)
Credit scoring can be defined as a technique that helps credit providers decide whether to grant credit to consumers. The goal of this project is to identify bad or risky customers by observing trends in spending and payment. Logistic regression is used to identify the customers likely to default. The data were collected from the Barclays data warehouse, with 139 variables over 12 months. The steps of the analysis are exploratory analysis, data preparation, missing-value imputation, model building, diagnostics, and validation. Sixteen of the 138 independent variables were found significant and were used to fit the multiple logistic regression model. The maximum likelihood parameter estimates from the model are significant at the 5% level. A higher concordance, lower misclassification rate, and higher AUC indicate that the fitted model has good discriminatory power and clearly separates events from nonevents. Leverage is used to identify outliers. The higher AUC on the testing dataset indicates that the model is not overfitting the data and thus has good predictive power.
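The concordance and AUC metrics cited above coincide for a binary outcome: both equal the probability that a randomly chosen defaulter receives a higher predicted score than a randomly chosen non-defaulter. A small sketch with invented scores (not the Barclays data):

```python
def auc(scores, labels):
    """Area under the ROC curve, computed as the probability that a random
    defaulter (label 1) scores above a random non-defaulter (label 0);
    for a binary outcome this equals the concordance statistic."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0  # ties count half
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Invented predicted default probabilities and actual default flags
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.2, 0.1]
labels = [1,   1,   0,   1,   0,   1,    0,   0]
value = auc(scores, labels)   # 13 of 16 defaulter/non-defaulter pairs concordant
```

An AUC of 0.5 means the scores are no better than chance; values near 1 indicate the separation of events from nonevents the abstract describes.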
Xiaoka Xiang, Revenue Forecasting of Medical Devices, August 2015, (Uday Rao, Yichen Qin)
Ethicon is a top medical device company that recently purchased a forecasting tool in an effort to improve the accuracy of its account-level forecasting. Account forecasting is a particular challenge for medical device companies because the industry is highly influenced by contractual changes, unlike other retail industries. Efforts to improve forecast accuracy included interviewing account managers and comparing the output against another forecasting model built by internal analysts. Another factor to be considered is revenue erosion due to rebates and selling through distribution channels.
Ravishankar Rajasubramanian, Text Classification: Identifying if a Passage of Text Is Humorous or Not, August 2015, (Peng Wang, Jay Shan, Michael Magazine)
Sentiment analysis in particular has been very useful for retail companies to understand how their products are being perceived by the customers and for ordinary people to see how a particular topic is being received on microblogging sites etc. Even though a lot of research has already been done in this field, there seems to be a lot of scope for implementing better models with increased accuracy in the classification task. Identifying positive or negative emotion in a passage of text is a problem that has been extensively researched in the recent past. In this study, a variant of the problem is chosen wherein the identification of whether a passage of text is humorous or not is the main goal. The study is based on a competition conducted by “Yelp” as part of a yearly Data Challenge. This particular variant of sentiment analysis is very useful for Yelp because it would help them identify which of the newly written reviews are most likely to be humorous and display those at the top of the web page. Also from an academic standpoint, this problem is slightly more challenging to solve compared to the positive and negative sentiment problem because of the different flavors of humor such as sarcasm, irony, hyperbole etc. Solving this problem could be the first step in trying to identify those sub-classes within the humor category.
Pavan Teja Machavarapu, Industry Analytical Solutions – IBM, August 2015, (Amit Raturi, Peng Wang)
This project report demonstrates my contribution to the Analytics Division at IBM. Our team was responsible for developing analytical software solutions for industries such as banking, insurance, and wealth management. I primarily worked on the Banking solution during the summer and describe my contributions to it in this report.
The IBM Behavior Based Customer Insight for Banking solution gives the information and insight needed to provide proactive service to a client's customers. It works with IBM Predictive Customer Insight and includes reporting and dashboard templates, sample predictive models, and application interfaces for integration with operational systems. It uses banking data related to transactions, accounts, customer information, and location to divide customers into segments based on their spending and saving habits, and it predicts the probability of various life events. By anticipating customer needs, the solution enables banks to deliver personalized, timely, and relevant offers. For example, it can send alerts and targeted offerings and provide insights that help banks develop direct marketing campaigns. It also helps the bank's customers manage their finances.
Sai Teja Rayala, Credit Scoring of Australian Data, August 2015, (Dungang Liu, Yichen Qin)
In this project, a credit scoring analysis was performed on an Australian credit scoring dataset. Credit scoring is the set of decision models and underlying techniques that aid lenders in granting consumer credit. The data were partitioned into training and testing sets using simple random sampling, and models were built on the training set using three data mining techniques: logistic regression, decision trees, and linear discriminant analysis. The models were then validated on the testing set and evaluated on the basis of the area under the ROC curve and the misclassification rate.
Pushkar Shanker, Brand Actualization Study, August 2015, (Uday Rao, Edward Winkofsky)
The Brand Actualization study at FRCH | Design Worldwide was intended to evaluate ratings of various brands and build a model for brand assessment. The study comprised designing the survey, gathering response data, analyzing the responses, developing new insights for brand strategy, and building a Brand Actualization Score model. The Brand Actualization Model is built on four key Brand Power Utilities, viz. ‘Recognize’, ‘Evaluate’, ‘Experience’, and ‘Communicate’, each of which comprises various attributes. A survey was designed and rolled out, and respondents rated each brand on those attributes. To arrive at a Brand Utility Score and gain insight into the key attributes, an analysis was performed on the attributes constituting each Brand Power Utility. Further, the variance in each Brand Power Utility was analyzed and a model was developed to determine the Brand Actualization Score.
Zhengrui Yang, Cincinnati Reds Baseball Game Tickets Sold Prediction Model, August 2015, (Yichen Qin, Michael Fry)
The Cincinnati Reds are a professional baseball team based in Cincinnati. The number of tickets sold for each game can be influenced by different factors, such as the weather, the opponent, and the month. In this paper, a dataset containing Cincinnati Reds game records from 2014 is analyzed using several data mining techniques, including linear models, leave-one-out cross-validation, and the lasso. The goal of this study is to build models that predict the number of tickets sold from these factors. Based on the results, the lasso model with λ equal to 403.429 is preferred.
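The lasso mentioned above adds an L1 penalty that shrinks weak predictors' coefficients exactly to zero; the reported λ of 403.429 is on the scale of the Reds data, so the toy example below uses a much smaller λ on synthetic data. A minimal coordinate-descent sketch (not the Reds dataset or the project's implementation):

```python
import random

def soft_threshold(rho, lam):
    if rho < -lam:
        return rho + lam
    if rho > lam:
        return rho - lam
    return 0.0

def lasso(X, y, lam, iters=200):
    """Coordinate-descent lasso: minimize (1/2n)*||y - Xb||^2 + lam*||b||_1.
    A large enough lam drives weak predictors' coefficients exactly to zero."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    for _ in range(iters):
        for j in range(p):
            # residual with feature j's own contribution removed
            r = [y[i] - sum(b[k] * X[i][k] for k in range(p) if k != j)
                 for i in range(n)]
            rho = sum(X[i][j] * r[i] for i in range(n)) / n
            zj = sum(X[i][j] ** 2 for i in range(n)) / n
            b[j] = soft_threshold(rho, lam) / zj
    return b

# Synthetic data: the response depends on the first feature only
random.seed(0)
X = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(200)]
y = [2.0 * x1 + random.gauss(0, 0.1) for x1, _ in X]
b = lasso(X, y, lam=0.5)   # b[1] is driven exactly to zero
```

In practice λ is chosen by cross-validation, which is presumably how the 403.429 value in the study was selected.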
Joel Andrew Schickel, A Probit Classification Model for Credit Scoring Using Bayesian Analysis with MCMC Gibbs Sampling, August 2015, (Jeffrey Mills, Martin Levy)
Classification models are widely used tools in credit scoring. Furthermore, Bayesian approaches are growing in popularity in a variety of fields. The aim of this study is to show some of the advantages and disadvantages of Bayesian models in application to a particular data set used for classification in credit scoring. First, Bayesian methods are used to attempt to improve on William H. Greene’s credit scoring model, which predicts cardholder status and risk of default. Second, Bayesian methods are used to improve on the author’s own predictive models. Although the benefits of a Bayesian approach are not clearly seen in comparison with Greene’s models, Bayesian methods provide a modest advantage over the author’s frequentist model for predicting cardholder status. Finally, it is seen that a Bayesian model based on a very small subset of the credit scoring data set can be significantly improved through the use of prior information about model parameters. This result reinforces the claim that Bayesian analysis can be especially helpful when decisions are made on the basis of a small data set or when the decision maker already possesses some knowledge about the factors that predict the relevant response variable.
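Gibbs sampling for a probit model is commonly done with the Albert-Chib data-augmentation scheme: latent utilities are drawn from truncated normals given the coefficients, then the coefficients are drawn from a normal given the latents. The sketch below is a minimal illustration of that scheme with a flat prior, one predictor, and synthetic data; it is not Greene's dataset or the study's actual model.

```python
import random
from statistics import NormalDist

N01 = NormalDist()

def trunc_normal(mu, lower=None, upper=None):
    """Draw from N(mu, 1) truncated to (lower, upper) via CDF inversion."""
    lo = 0.0 if lower is None else N01.cdf(lower - mu)
    hi = 1.0 if upper is None else N01.cdf(upper - mu)
    u = min(max(random.uniform(lo, hi), 1e-12), 1.0 - 1e-12)
    return mu + N01.inv_cdf(u)

def probit_gibbs(x, y, draws=2000, burn=500):
    """Albert-Chib Gibbs sampler for a probit model with an intercept,
    one predictor, and a flat prior on the coefficients (a, b)."""
    n = len(x)
    sx, sxx = sum(x), sum(v * v for v in x)
    det = n * sxx - sx * sx
    inv = [[sxx / det, -sx / det], [-sx / det, n / det]]  # (X'X)^-1
    l11 = inv[0][0] ** 0.5                # Cholesky factor of (X'X)^-1
    l21 = inv[1][0] / l11
    l22 = (inv[1][1] - l21 * l21) ** 0.5
    a = b = 0.0
    slopes = []
    for it in range(draws):
        # 1) latent z_i ~ N(a + b*x_i, 1), truncated by the observed label
        z = [trunc_normal(a + b * xi, lower=0.0) if yi == 1
             else trunc_normal(a + b * xi, upper=0.0)
             for xi, yi in zip(x, y)]
        # 2) (a, b) | z ~ N((X'X)^-1 X'z, (X'X)^-1)
        sz = sum(z)
        sxz = sum(xi * zi for xi, zi in zip(x, z))
        ma = inv[0][0] * sz + inv[0][1] * sxz
        mb = inv[1][0] * sz + inv[1][1] * sxz
        e1, e2 = random.gauss(0, 1), random.gauss(0, 1)
        a = ma + l11 * e1
        b = mb + l21 * e1 + l22 * e2
        if it >= burn:
            slopes.append(b)
    return sum(slopes) / len(slopes)      # posterior mean of the slope

random.seed(42)
x = [i / 10 - 2 for i in range(40)]                        # predictor grid
y = [1 if xi + random.gauss(0, 1) > 0 else 0 for xi in x]  # true slope = 1
slope = probit_gibbs(x, y)
```

The flat prior here is the simplest case; the study's point about informative priors corresponds to replacing the conditional for (a, b) with a normal-prior update.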
Ruiyi Sun, Simulation of the Collection Center Operation System and the CACM Score Model, August 2015, (Dungang Liu, David Kelton)
This essay describes the author’s contributions to two projects during an internship from January 2015 to April 2015: the “CACM Score Model” project and the “Collection Center Operation System Simulation using Arena” project. The CACM (“Collection Analytics Contact Model”) Score Model groups accounts with similar characteristics into one of several score bands and predicts whether a defaulted account in a given score band will pay back its loan. The CACM Score Model project is a team project to build models that group defaulted accounts into three score bands and to design experiments for the placement strategy of accounts in different score bands. The simulation project contains Arena models built for different collection-operation scenarios. Two main models were built: a random model, which assigns account calls randomly to a human or an automated computer system, and a non-random model, which uses account attributes for this assignment. For the random model, the Process Analyzer and Output Analyzer were used to analyze results; for the non-random model, three sub-scenarios were built. An overall comparison of the results for the two models was conducted via the Output Analyzer, and implementation of the results was discussed at the end.
Liberty Holt, Robot Viability and Optimization Study, July 2015, (Craig Froehle, Michael Magazine)
Surgeries are performed in many settings throughout healthcare organizations. In this study, the possibility of moving a DaVinci Robot, used for robotically assisted minimally invasive surgery, from its primary location in a full-service, acute-care hospital to a new minimally invasive surgery center is explored and analyzed for viability. Many factors influence the final decision of the Operations Committee; the purpose of this analysis is to determine viability and, if found viable, an optimal allocation of cases for the specialties of General Surgery and Gynecology. An underlying assumption and expectation of opening a surgery center is that the cases moved there are at lower risk for inpatient admission and patient complications and will be performed more efficiently. This analysis uses case times and physician behaviors in the main operating room, on the premise that if current case times and turnover would fit in the surgery center, then efficiency gains would be very attainable by the very nature of a surgery center compared with a larger hospital.
Sravya Kasinadhuni, Email Fraud Detection- Spam and Ham Classification for Enron Email Dataset, July 2015, (Andrew Harrison, Edward Winkofsky)
Enron was a Texas-based energy trading giant and America’s seventh-largest company before it declared bankruptcy. In this project, we look at Enron email data and classify the emails as spam or ham. The data are first divided into training and testing sets; the data are then cleaned and a classifier is trained on a subset. The classifier is a Naïve Bayes classifier based on calculating the probability of each term in an email appearing in a particular class. The emails in the testing set are then classified, and the accuracy of the classifier on the testing data is obtained. Based on this accuracy, it can be determined whether Naïve Bayes classification is suitable for classifying these emails.
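A multinomial Naïve Bayes spam/ham classifier of the kind described can be sketched in a few lines: per-class term counts with add-one (Laplace) smoothing, and classification by the highest log-posterior. The tiny token lists below are invented stand-ins, not Enron emails.

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Train a multinomial Naive Bayes spam/ham model on tokenized emails."""
    counts = {"spam": Counter(), "ham": Counter()}
    prior = Counter(labels)
    for doc, label in zip(docs, labels):
        counts[label].update(doc)
    vocab = {w for c in counts.values() for w in c}
    return counts, prior, vocab

def classify(doc, counts, prior, vocab):
    """Pick the class with the highest log-posterior, add-one smoothed."""
    total_docs = sum(prior.values())
    best_label, best_lp = None, None
    for label in counts:
        lp = math.log(prior[label] / total_docs)
        denom = sum(counts[label].values()) + len(vocab)
        for w in doc:
            lp += math.log((counts[label][w] + 1) / denom)  # Counter -> 0 if unseen
        if best_lp is None or lp > best_lp:
            best_label, best_lp = label, lp
    return best_label

# Invented four-message training corpus
docs = [["win", "money", "now"], ["free", "money", "offer"],
        ["meeting", "agenda", "monday"], ["project", "report", "attached"]]
labels = ["spam", "spam", "ham", "ham"]
model = train_nb(docs, labels)
```

Accuracy on a held-out test set, as in the abstract, is then just the fraction of emails whose predicted class matches the true label.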
Anila Chebrolu, Zoo Visitor Prediction, July 2015, (Craig Froehle, Zhe Shan)
The Cincinnati Zoo serves over a million visitors each year. Understanding visitor patterns would allow zoo authorities to optimize the visitor experience by enhancing their services, and to make better operational decisions such as hiring the appropriate number of support staff for various seasons or stocking food adequately. Better planning will ultimately help drive up revenue for the Cincinnati Zoo. The goal is to identify the model that best predicts the number of zoo visitors. The approach uses neural networks and random forests in R to predict the number of visitors on a particular day, and builds a time series model on the daily visitor data using SAS.
Charith Acharya, Forecasting Daily Attendance at the Cincinnati Zoo and Botanical Gardens, July 2015, (Craig Froehle, Yichen Qin)
The Cincinnati Zoo and Botanical Gardens needed a forecast of the number of people arriving at the Zoo at a daily and weekly level, rolled up to monthly and yearly levels. Methods such as the simple average or the simple moving average were not very effective. This paper uses the ARIMA time series forecasting procedure to arrive at weekly, and subsequently daily, forecasts for the Cincinnati Zoo and Botanical Gardens for 2015.
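As a simplified illustration of the autoregressive idea behind ARIMA, the sketch below fits an AR(1) model, the p=1 autoregressive component of an ARIMA(1,0,0), by closed-form least squares and iterates it forward. The attendance-like series is invented, not the Zoo's data.

```python
def fit_ar1(series):
    """Least-squares fit of x_t = c + phi * x_{t-1} + e_t, the
    autoregressive core of an ARIMA(1,0,0) model."""
    x, y = series[:-1], series[1:]          # lagged pairs (x_{t-1}, x_t)
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    phi = (sum((a - mx) * (b - my) for a, b in zip(x, y))
           / sum((a - mx) ** 2 for a in x))
    c = my - phi * mx
    return c, phi

def forecast(last, steps, c, phi):
    """Iterate the fitted recurrence forward from the last observation."""
    out = []
    for _ in range(steps):
        last = c + phi * last
        out.append(last)
    return out

# Invented weekly attendance decaying toward a long-run mean near 1000
series = [1400, 1320, 1256, 1205, 1164, 1131, 1105]
c, phi = fit_ar1(series)
preds = forecast(series[-1], 3, c, phi)
```

A full ARIMA adds differencing and moving-average terms (and, for the Zoo's data, seasonal terms), but the fit-then-iterate structure is the same.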
Elise Mariner, Analysis of American Modern Insurance Group Mobile Home Policies, July 2015, (David Rogers, David Kelton)
American Modern Insurance Group is a leading national niche company in the specialty insurance business. Located in Amelia, Ohio, the company has close to 50 years of experience in residential and recreational policies. Here the main focus is on residential policies, specifically mobile homes. The original group-project objective for the Case Studies course was to determine whether a predictive model could identify customers who will renew their policy, and whether there is a correlation between the predictor variables and the binary response variable.
The most significant factors affecting retention were identified through logistic regression, decision tree, and random forest modeling. To extend the original class group project into this individual capstone, the model was refined through further regression analysis to develop a better model than the one originally chosen.
Ahffan Mohamed Ali Kondeth, Cincinnati Zoo – Daily and Monthly Forecasting Number of Visitors, July 2015, (Craig Froehle, Dungang Liu)
The aim of this project is to forecast daily and monthly attendance for a year; this forecast will help the zoo with staffing decisions. The project uses various analytic techniques, such as regression and time series forecasting methods. The attendance dataset provided by the Cincinnati Zoo contains daily attendance from 1996 through 2014. Further features were added to the dataset to build a regression model, including temperature, weather events, weekend indicators (Saturday/Sunday), and special-event flags such as Christmas and Halloween.
Anvita Shashidhar, Data Warehouse and Dashboard Design for a Hospital Asset Management System, July 2015, (Andrew Harrison, Brett Harnett)
Analytics is increasingly finding application in the field of healthcare and medicine, not only in dealing with clinical data but also in the efficient running of hospitals. Hospitals are a vast source of data and are now beginning to rely on analytics to make effective use of those data in improving overall efficiency. The purpose of this project is to build a data warehouse that creates a master list of the hospital’s assets. In addition, the project aims to create an interactive dashboard to present these data and publish it on the hospital’s intranet, to help hospital staff track hardware across different departments and locations.
This project is the first undertaking of its kind at the hospital and will prove to be invaluable in helping manage the hospital’s assets efficiently.
Rajat Garewal, Simulation of Orders & Forecast to Align with the Demand, July 2015, (Michael Fry, Michael Magazine)
A multinational consumer goods company’s supply chain is managed by SAP’s planning method known as Distribution Requirements Planning (DRP). DRP’s built-in parameters control how the raw forecast gets modified to account for recent demand and supply. The modified forecast is then used in demand planning to request replenishments from the plants. This modified forecast is known as DRP forecast. The objective of this simulation is to mimic SAP’s logic to use the order, shipment and forecast data to generate DRP forecast and determine parameter values that would minimize the forecast error. Producing more units than the orders would lead to additional costs to store the items in the distribution centers; on the other hand, producing fewer units would result in not being able to fulfill customer orders. Forecast accuracy depends upon the consistency of DRP forecast with the shipments. Simulation is a feasible way to experiment with different settings with minimal additional costs to the company, and without disrupting the current planning method.
Pratish Nair, Developing Spotfire Tools to Perform Descriptive Analytics of Manufacturing Data, July 2015, (Uday Rao, Michael Magazine)
As an Analyst on the Business Intelligence team at Interstates Control Systems, West Chester, OH, I worked on data extraction, data cleaning, and data visualization, with a prime focus on descriptive analytics. Data from the production lines of various P&G plants are aggregated and pushed to a dashboard for reporting; we aim to process the data into a presentable form. Over a period of 10 weeks, I worked on multiple projects, gaining knowledge of manufacturing analytics and how processes are defined for the analyses relating to them. Our BI team is currently extracting data from P&G production lines at several sites; the plants with the systems in place are located in Winton Hills, OH; Mehoopany, PA; Euskirchen, Germany; and Targowek, Poland. I worked on two experimental studies. The first compared the latency in the tool’s start time when set up with different servers, a Historian server and an MS SQL Server. The second was a proof-of-concept study of integrating the R language into the Spotfire dashboarding environment; in addition, I explored opportunities where it could benefit existing projects.
Nanditha Narayanan, Finding the Sweet Spot, June 2015, (Maria Palmieri, Yan Yu)
The School of Human Services in the College of Education, Criminal Justice, and Human Services offers a Bachelor’s Degree in Athletic Training. Every year only a select number of students are accepted into the Athletic Training cohort, and as a niche program it aims to keep student attrition minimal. This study uses SAS and R to develop a heuristic prediction of student performance and the ability to graduate from the cohort, as a determinant for offering admission. High school GPA, ACT/SAT scores, UC GPA at the time of application to the Athletic Training cohort, and demographic information are scrutinized as measures of student success. This project will enable the admissions office to make informed decisions and offer admission to promising applicants who are most likely to succeed in the program.
Ramkumar Selvarathinam, Identification of Child Predators using Naïve Bayes Classifier, June 2015, (Yan Yu, Michael Magazine)
The objective of this document is to detail the steps followed and the results obtained in a project that aims to identify child predators using text-mining methodology. A cyber predator is a person who uses the Internet to hunt for victims to exploit in any way, including sexually, emotionally, psychologically, or financially. Child predators are individuals who engage in sexually explicit conversation with children: some offenders primarily collect and trade child-pornographic images, while others seek face-to-face meetings with children via online contacts. Child predators know how to manipulate kids, creating trust and friendship where none should exist. In the fight against online pedophiles and predators, a non-profit organization named Perverted-Justice has pioneered an innovative program to identify child predators by pretending to be victims. More than 587 conversations between child predators and pseudo-victims are available, and this information base can be used to train a model that identifies a person as a child predator by analyzing the conversation. This is a classification problem: the primary objective is to identify whether or not a person is a predator based on his chat logs, and the Naïve Bayes classifier algorithm is leveraged to solve it.
Praveen Kumar Selvaraj, Data Mining Study in Power Consumption & Renewable Energy Production, June 2015, (Yan Yu, Michael Magazine)
The supply and consumption of renewable energy resources are expected to increase significantly over the next couple of decades. According to the US Energy Information Administration, the share of renewable energy could rise from 13% in 2011 to up to 31% in 2040, with most of it coming from wind and solar sources. Understanding the dynamics of energy consumption and renewable energy production is important for effective load balancing in the grid: because the energy cannot be stored, optimizing grid load is crucial for energy management. In this study, prediction models for wind power, solar power, and power consumption are built using weather data and applied to scenario data to arrive at the power shortfall that needs to be met.
Lee Saeugling, Assortment Planning and Optimization Based on Localized Demand, June 2015, (Jeffrey Camm, Michael Fry)
We consider the problem of optimizing assortments for a single product category across a large national retail chain. The objective of this paper is to develop a methodology for picking from 1 to N (complete localization) assortments in order to maximize revenue. We first show how to estimate demand based on product attributes. SKUs are broken into a set of attribute levels and the fractional demand for these levels is estimated along with an overall demand for the attribute space. We use the estimated fractional demand and overall demand to create from 1 to N assortments and assign each store a single assortment in order to maximize revenue. We found a significant increase in revenue when going from one national assortment to complete localization. Further we found that most of the revenue increase of complete localization can be accomplished with far fewer than N assortments. The advantage to our approach is it only requires transactional data that all retailers have.
Chaitanya V. Jammalamadaka, Analysis of Twitter Data Using Different Text Mining Techniques, June 2015, (Jeffrey Camm, Peng Wang)
For organizations and companies, managing public perception is very important. Increasing usage of the internet means that it is important for companies to maintain an online presence. It is very important for these companies to monitor the public sentiment so that they are able to react to changing sentiments of potential and existing customers. Considering the instant communications that take place on Twitter, it can be a very useful tool to achieve this goal. The aim of this project is to identify and compare different methods of visualizing Twitter data related to the company Apple. These methods give useful insights which could help any business. The scope of the project is limited to exploring ways to get these insights.
Abhinav Abhinav, Forecasting the Hourly Arrivals at the Cincinnati Zoo Based on Historical Data, April 2015, (Craig Froehle, Yichen Qin)
The Cincinnati Zoo is a popular destination with a high frequency of visitors all year round. Maintaining optimal levels of supplies and staff can become problematic without proper planning, and in its absence zoo management can face issues with customer satisfaction and revenue management. A suitable way to plan is to forecast future hourly arrivals based on historical hourly arrival data. The primary obstacle in this exercise is the presence of multiple levels of seasonality: the hourly data exhibit daily, weekly, and yearly seasonality. To tackle this, a combination of two time series models is selected as the final solution. The first model is a combination of individual Fourier time series models for each hour from 9 AM to 5 PM of a regular day; its controlling factor is the hour of the day, and it is more effective at capturing the yearly seasonality. The second model is a seasonal ARIMA at the monthly level; its controlling factor is the month of the year, and it is more effective at capturing the daily seasonality. Day-related variables are used as external regressors across both models to capture the weekly trend, along with early-entry-period and promotion indicators. Forecasted arrival figures for May 2015 are generated using the combination of both models.
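Fourier terms of the kind used in the hourly models can be generated directly: for each harmonic k up to order K, a sine and cosine pair of the chosen seasonal period, fed to a regression as extra columns. A minimal sketch (the period and order below are illustrative, not the project's settings):

```python
import math

def fourier_terms(t, period, K):
    """Order-K Fourier seasonal regressors for time index t: the pair
    sin(2*pi*k*t/period), cos(2*pi*k*t/period) for each harmonic k = 1..K.
    Fed to a linear model, these let it fit a smooth seasonal pattern
    of the given period."""
    terms = []
    for k in range(1, K + 1):
        terms.append(math.sin(2 * math.pi * k * t / period))
        terms.append(math.cos(2 * math.pi * k * t / period))
    return terms

# Order-2 yearly-seasonality regressors for day 100 at a daily grain
row = fourier_terms(t=100, period=365.25, K=2)
```

Raising K lets the fitted seasonal shape become less smooth, at the cost of more coefficients to estimate.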
Swati Adhikarla, Store Clustering & Assortment Optimization, April 2015 (Jeffrey Camm, Michael Fry)
Gauging a product’s demand and performance has always been a difficult task for retailers. To maintain a competitive mix and achieve targeted profits in today’s retail chains, store clustering and localization are important, and they present an opportunity to gain a competitive edge. The goal of this project is to develop a prototype solution that allows retailers to provide a better product assortment to meet the unique preferences of their customers. Clustering is crucial as retailers focus increasingly on customer-centric retailing. Hence, the first step of the project is to cluster stores on customer behavioral attributes. The second step is to localize the stores within each cluster; the essence of store localization is finding the right mix of products to carry.
Sally Amkoa, An Exploratory Analysis of Real Estate Data, April, 2015 (Shaun Bond, Yan Yu)
The vast amount of data available in real estate is still an underutilized resource in the industry. This study explores a real estate dataset on the acquisition of properties by real estate investment trusts (REITs) in various geographic locations. The goal is to use statistical software (R under RStudio) to manipulate the dataset into a format that allows the creation of statistical models that can provide actionable information.
Eric Anderson, Forecasting Company Financials for FootLocker, Inc., April, 2015 (Jeffrey Camm, Yichen Qin)
Forecasting company financials is useful for predicting future revenue and earnings per share. The ability to understand a company's growth drivers can inform investment decisions. Predicting future performance requires examining past observations of the dependent variable and of the explanatory variables, which of course assumes that company performance will follow a similar pattern in the future. Time series models cannot always produce an exact forecast because unexpected shocks can hit the economy. However, in industries like consumer retail, certain statistics appear to drive company performance and can give a close estimate of future results. In this capstone paper, we use data visualization to show the impact of the economy on FootLocker's performance. Once visualization has clarified the drivers of performance, we use an ARIMA model that accounts for seasonality and incorporates past revenues to forecast FootLocker revenue.
Ginger Simone Castle, Using Simulation and Optimization to Inform Hiring Decisions, April, 2015 (W. David Kelton, Amit Raturi)
Historical project management data is used to create a simulation model that closely mirrors the day-to-day work of a team of employees at a marketing firm on client and internal projects. Output from the simulation is used to compare scenarios, evaluate queuing rules, and run optimization scenarios related to the future hiring of both permanent and contract employees. In the end, this analysis makes recommendations regarding future hiring during a period of rapid sales growth, answering key questions the executives have about the resources that will be required to meet current and upcoming growth challenges. The final hiring recommendation for the largest growth scenario is 7 new permanent employees (2 C’s, 4 D’s, and 1 who can do both WPR and D) and the utilization of 9 contractors (2 C’s, 4 D’s, and 3 PRO’s).
Damon Chengelis, A Study of Simulation in Baseball: How the Tomahawk Better Fits and Better Predicts MLB Win Records, April, 2015, (Jeffrey Camm, Yan Yu)
This study introduces the Tomahawk simulation method as a novel win estimator similar to the established Pythagorean expectation. I theorized that by simulating a season from a team's given record, the resulting win total would be a closer fit for the team's particular run distribution, more robust against outliers, and more readily adaptable outside the MLB environment in which Pythagorean expectation was developed. Furthermore, the simulation provides a more relevant number for understanding how an MLB team will be affected by roster changes. I tested Tomahawk simulation in R using MLB game data from 2010 to 2014. The first test used the first half of a season to predict the second half; the second used one season to predict the next. Tomahawk simulation was compared to Pythagorean win expectancy, Pythagenpat win expectancy, and naive interpolation. Naive interpolation simply assumes a team will repeat the same number of wins, which provides a relevant baseline for any win estimator claiming insight beyond the box score. Since I am interested in robustness against outliers, I measured fit using mean squared error. Tomahawk simulation outperformed the Pythagorean and Pythagenpat expectations when comparing one season to the next. Using one half of a season to predict the other showed Tomahawk to be on par with Pythagenpat. However, Tomahawk simulation provides a confidence interval and does not require finding an ideal exponent. This suggests Tomahawk simulation is an appropriate alternative to the Pythagorean and Pythagenpat win expectancies.
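For readers unfamiliar with the baselines being compared against, the classic Pythagorean expectation and a season simulation in its simplest form can be sketched as follows. This is a generic illustration of the two families of estimators, not the Tomahawk method itself; the function names and the margin-resampling scheme are invented for the example.

```python
import random

def pythagorean_wins(runs_scored, runs_allowed, games=162, exponent=2.0):
    """Classic Pythagorean expectation: expected wins from run totals.
    Pythagenpat differs mainly in how the exponent is chosen."""
    rs, ra = runs_scored ** exponent, runs_allowed ** exponent
    return games * rs / (rs + ra)

def simulate_season(game_margins, games=162, trials=500, seed=7):
    """Toy simulation-based estimator: resample observed per-game run
    margins to play out many full seasons, return mean simulated wins.
    A distribution over trials also yields a confidence interval."""
    rng = random.Random(seed)
    totals = []
    for _ in range(trials):
        wins = sum(1 for _ in range(games) if rng.choice(game_margins) > 0)
        totals.append(wins)
    return sum(totals) / trials
```

With equal runs scored and allowed, the Pythagorean estimate is exactly half the schedule (81 wins over 162 games), which is a quick sanity check for either estimator.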
Auroshis Das, Assortment Planning Strategy for a Retailer, April, 2015 (Jeffrey Camm, Michael Fry)
Today, in the fast-moving consumer goods space, retailers face many challenges in providing the best offerings in terms of variety and affordability, further fueled by the growing number of retailers and the resulting competition. Under such a scenario it becomes imperative to leverage the power of consumer data to make smart decisions that translate to customer loyalty and satisfaction. This project was undertaken to optimize the assortments that go into a retailer's stores, balancing the need to meet the variety in customer demand against the cost involved in customizing the assortment. The aim was to find the optimal number and product mix of assortments, lying somewhere between ideal store-level customization (one assortment for each store) and the naive single assortment (one assortment for all stores). The approach was first to identify similarly behaving and performing stores, and then to roll out assortments optimized collectively for those stores. The first part of the approach involved clustering; the latter involved optimizing the share of different kinds of products that go into the assortments. The results could serve as a guideline for retailers in assortment planning strategy, because the approach captures the true demand of their customers from historical sales data while also finding the product mix that serves that demand at minimum cost.
Preetham Datla, Predicting the Winners of Men’s Singles Championship Australian Open 2015, April, 2015 (Jeffrey Camm, Dungang Liu)
The role of analytics in the sports industry has grown considerably in recent times. It helps different stakeholders analyze the various factors associated with the outcomes of matches. In this project, tennis data for all men’s singles matches of the major tournaments was analyzed to predict the winners of the 2015 Australian Open matches. Classification techniques such as logistic regression and support vector machines were used and their performance compared; the support vector machine algorithm gave better prediction results. This study was initially performed for the Sports Analytics competition organized by UC INFORMS and the Center for Business Analytics at the University of Cincinnati.
Nitish Deshpande, Forecasting based impact analysis for internal events at the Cincinnati Zoo, April, 2015 (Craig Froehle, Jay Shan)
The Cincinnati Zoo wanted to forecast visitor attendance for the next six months and quantify the impact of internal events on attendance. We analyzed the seasonality in the visitor-count series and found it to be doubly seasonal, with weekly and yearly components. To forecast such a series we built an autoregressive moving average (ARMA) model with factors for seasonality, internal events, external events, and bad weather. Data for 2013 and 2014 was used to build and test the model. The model gave several interesting results: hosting an internal event increases the visitor count by 59% on average, while on bad-weather days attendance drops by 25% on average. The 'PNC Festival of Lights' was identified as the most popular and successful event at the zoo. Based on the findings and forecasts, we recommended that the zoo introduce more events along the lines of the Festival of Lights and provide incentives such as discounts or indoor events to mitigate the impact of bad weather.
Krishna Kiran Duvvuri, Predicting the subscription of term deposit product of Portuguese bank, April 2015 (Jeffrey Camm, Peng Wang)
Direct marketing is a technique many institutions employ to reach potential customers who would otherwise not know about the product being marketed. It is especially used for specialized products that are not essential on a day-to-day basis for the general public, such as insurance products, mutual funds, and special bank schemes. As these products have fewer takers than mainstream products such as household goods, targeting and marketing to a specific group of people (the potential clientele) becomes crucial for their successful sale.
In this project, we summarize models that predict whether a client of a Portuguese financial institution will subscribe to its term deposit product or not. These models will help the bank in identifying clients to whom calls can be made directly to successfully market the product. The project also helps in identifying key variables that influence the decision of subscription.
Brian Floyd, Stochastic Simulation of Perishable Inventory, April 2015, (Amitabh Raturi, David Kelton)
Antibodies are a critical and perishable inventory component for the operations of the Diagnostic Immunology Laboratory (DIL) and require time-intensive quality validations upon receipt. In this study, current ordering practices were evaluated to identify opportunities to reduce the DIL's workload surrounding antibody ordering and validations. Stochastic simulation, via Arena software, was used to estimate the order-size adjustments needed to reduce the frequency of quality validations and to assess the influence of expiration dates on waste levels over the course of a year. Simulations showed that order sizes sufficient for one year of testing can lead to a 28% decrease in quality-validation work, and that the expiration of antibodies does not create an appreciable level of waste. Given that expiration of stock is not a substantial influence, static simulation in a spreadsheet environment can be used in the future to estimate order sizes within the confines of a year.
Mrinmayi Gadre, Cincinnati Zoo: Predicting the number of visitors (zoo members and non-members) using forecasting methods, April 2015, (Craig Froehle, Yichen Qin)
This project aims to forecast the number of visitors, and the proportion of members among them, likely to visit the Cincinnati Zoo each day in 2015. Historical visitor and member data was used, and forecasts were calculated after taking into account relationships with other variables that affect the results. This was done using ARIMA modeling in R; because the data showed a weekly seasonal component, a seasonal ARIMA model was used. The accuracy of the fitted model was checked with several accuracy tests before the model was used for forecasting. Knowing an estimate of visitor numbers ahead of time will aid the zoo in decisions such as setting the annual budget, the number of memberships to offer, and the offers or events to organize. It should help the zoo develop strategies that increase the number of visitors on days when fewer are expected, and thus increase revenue.
Chandrashekar Gopalakrishnan, Analysis of visitor patterns at the Cincinnati Zoo and the effect of weather on the visitor counts, April, 2015, (Craig Froehle, Yichen Qin)
The Cincinnati Zoo wants to use the visitor data it has collected over the past few years to produce better estimates of the number of visitors expected in the next few months. The zoo hopes to use these estimates to make its planning process more efficient.
In this project, I attempt to understand the influence of weather-related external factors on the number of visitors to the zoo on any given day.
The focus of the analysis is on how the actual amount of precipitation on a day affects the visitor count, and whether the effect carries over one or two days afterward. I also try to understand the effect of temperature on the visitor count, using each day's maximum and minimum temperatures as variables.
Sai Avinash Gundapaneni, Forecast use of a city bike-share system, April 2015, (Jeffrey Camm, Yichen Qin)
Bike-sharing systems provide a service that enables people to rent bicycles. The system is set up so that the whole process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city. These systems provide a convenient way to rent a bike at one location and return it at another based on the user's needs. The use of such systems is on the rise; currently there are about 500 bike-sharing programs around the world.
The goal of this project is to forecast bike rental demand in the Capital Bikeshare program in Washington, D.C. by combining historical usage patterns with weather data. These systems generate rich data on trip duration, departure location, arrival location, and time elapsed; a bike-sharing system can thus function as a sensor network for studying mobility in a city.
In this project, I build statistical models using linear regression and CART to forecast demand as well as possible. The data is split into training and test samples, allowing out-of-sample testing, which is the best way to assess the performance of a model fit.
Manu Hegde, CrossRoads Church Survey, April 2015, (Jeffrey Camm, Peng Wang)
In this project, we seek to understand the characteristics of church-goers who report that attending has been a spiritually enriching experience. By understanding these characteristics, the church can identify its most satisfied parishioners and also understand the nature of dissatisfied ones. With this segmentation, the church can choose different courses of action for its incoming population.
A Likert Scale Survey has been circulated and the results are to be analyzed.
Anthony Frank Igel, Analysis and Predicted Attendance of the Cincinnati Zoo: Special Event Influence, April 2015, (Craig Froehle, Yan Yu)
The Cincinnati Zoo opens its doors to the public 365 days a year, with nearly half of those days featuring special events; 33 different special events have been hosted over the past five years. This investigation documents the approach, numerical analysis, and business implications of prediction models that include special events in unique combinations. Autoregressive time series forecasting and generalized linear regression were used to predict future attendance based on three different regression models. The results show that not grouping the special events into a single entity yields a more reliable model, and those predicted results were given to the Zoo for additional analysis.
Adhokshaj Shrikant Katarni, Forecasting number of visitors to Cincinnati Zoo using ARIMA and TBATS, April 2015, (Craig Froehle, Yichen Qin)
In the zoo business it is very important to know customer arrival patterns across the months in advance. Forecasting has long been one of the biggest factors in deciding the success or failure of an organization; it becomes the basis for planning events, promotions, capacity, and staffing at the zoo. Forecasting techniques also help us better understand the seasonalities and trends in the data.
The objective of the project was to produce forecasts six months ahead at both daily and weekly levels, and then examine trends in the forecasts to identify potential outliers in the predicted data. Two techniques were used: seasonal ARIMA to predict the number of visitors at the weekly level, and TBATS to predict visitors at the daily level.
Vikas Konanki, Credit Scoring of Australian Data Using Logistic Regression, April 2015, (Jeffrey Camm, Yichen Qin)
In this project we perform a credit scoring analysis on Australian credit scoring data.
Credit scoring can be done with various statistical techniques; here we use logistic regression. The approach is to first divide the data into training and test sets, then decide which independent variables are significant enough to predict the dependent variable. Once the variables are selected, we build a model with them to predict the dependent variable. This model is then validated on the test data set, and its performance is checked using the misclassification rate, the ROC curve, the area under the ROC curve, and the KS statistic.
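Of the performance measures listed, the KS statistic is the least standard outside credit scoring: it is the maximum gap between the cumulative score distributions of the two classes. A minimal sketch of its computation from model scores follows; the function name and the convention that a higher score means higher default risk are assumptions for the example, not details from the project.

```python
def ks_statistic(scores, labels):
    """Kolmogorov-Smirnov statistic for a binary classifier.
    scores: model scores (higher = more likely bad, by assumption here).
    labels: 1 for bad (default), 0 for good.
    Returns the max gap between the two cumulative distributions."""
    pairs = sorted(zip(scores, labels))
    n_bad = sum(labels)
    n_good = len(labels) - n_bad
    cum_bad = cum_good = 0
    ks = 0.0
    for _, y in pairs:
        if y == 1:
            cum_bad += 1
        else:
            cum_good += 1
        ks = max(ks, abs(cum_bad / n_bad - cum_good / n_good))
    return ks
```

A KS of 1.0 means the scores separate the classes perfectly; 0 means the score carries no class information.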
Regina Krahenbuhl, Predicting the Outcome of School Levies, April 2015, (David Brasington, Yichen Qin)
In the state of Ohio, a public K-12 school district goes to the public to ask for money to cover what remains after state and federal dollars. The district asks for the money in the form of a tax levy on the ballot during a local election. With over 600 school districts in 88 counties, Ohio has a very diverse populace. In this study, regression analysis was used to identify the factors that affect the outcome of the levy: pass or fail. A substantial discussion is given of how these taxes are assessed and what other studies have determined to be predictors. Data from 2013-2014 was gathered for the analysis. The dependent variable (the outcome) is dichotomous; therefore, a binary regression with a probit link was chosen. In addition, the following statistical procedures were examined: bivariate probit, interaction terms, exclusion of classes of variables, and several robustness checks. The final model found these variables to have a statistically significant impact on the outcome of a levy: percent of the population with a bachelor’s degree or higher, district state revenues per pupil, month the election was held, millage amount, type of levy, percent of minority students in the district, and district salaries as a percent of operating expenditures.
Karthik Reddy Mogulla, A Trend Analysis and Forecasting of Cincinnati Zoo Membership and Member Arrivals, April 2015, (Craig Froehle, Yichen Qin)
The Cincinnati Zoo & Botanical Garden is the second-oldest zoo in the United States, visited by over 1.5 million people annually. As part of its services, it offers Gold, Premium, and Basic annual membership plans for Family, Single Parent, and Individual members. The goal of this project is twofold: (1) build interactive dashboards in the QlikView visualization software to capture trends over time in member arrivals and Zoo membership and identify the underlying patterns; (2) forecast member arrivals at the daily level using an autoregressive integrated moving average (ARIMA) time series model. Internal and external factors affecting arrival patterns are identified and used as external regressors in the ARIMA model. The impact on member arrivals of factors such as temperature, precipitation, internal Zoo events, corporate and educational events, weekday, and month is analyzed.
Swaraj Mohapatra, Process mining, missing link between model-based process analysis and data-oriented analysis techniques, April 2015, (Yan Yu, Yichen Qin)
Process mining is used to extract process-related information from event data. The goal of this report is to introduce process mining not only as a technique but also as a method, one that makes it possible to automatically discover a process model from the events recorded by any enterprise system. The importance of process mining is underscored by the astounding growth of event data. Process mining addresses the limitations of both traditional approaches to business process management and classical data mining techniques. The report explains the three types of process mining, namely process discovery, conformance checking, and enhancement, and shows how to apply process mining to event data.
Chaitra Nayini, Using Visual Analytics and Dynamic Regression Modeling to Forecast Trends and Optimize Station Capacity for a Bike Share Service, April 2015, (Yichen Qin, Jeffrey Camm)
Forecasting trends for data that exhibits time series behavior at multiple levels can be very complex. However, having these forecasts aids an organization's planning and decision making. It also enhances customer service, since the organization can anticipate future trends and be prepared to meet customers' expectations. This project builds a dynamic regression model that quantifies the influence of predictor variables on the dependent variable while taking time series behavior into consideration.
The dataset for this project contains information released by Capital Bikeshare, a short-term bike rental service operating in the Washington, D.C. metro area, covering every trip taken since September 2010. A supplementary dataset with hourly weather information and trip volume for each hour is also used. This paper discusses the application of negative binomial regression and ARIMA across three different models. A set of geographical and temporal visualizations was also built for exploratory data analysis and to enhance understanding of the statistical models. These methods can be extended to predict trip volume for each individual station.
Matthew Martin Norris, Predicting Remission in Bipolar Disorder: An Exploratory Study, April 2015, (Yan Yu, James Eliassen)
Predicting treatment outcomes in psychiatric illnesses remains understudied, and analytic approaches from existing studies can be applied effectively to this problem. Fewer than 50% of first-line treatments work in bipolar disorder, so identifying which individuals will respond to a specific course of treatment is the ultimate goal. In this study we aim to predict who will respond to one of two medications. To predict remission in bipolar disorder in drug-naive subjects, several models were tested, and one performed well enough to report results: the neural network model performed well on both the hypothesis-driven and the data-driven model. In the data-driven model, however, its accuracy was somewhat reduced, while the other modeling techniques, except for classification trees, showed improvement.
Chaitanya Peri, Predicting “No Real Spiritual Growth”, April 2015, (Jeffrey Camm, Peng Wang)
Eleven friends started the Crossroads church community in 1995 to create an authentic community for people seeking truth about spiritual growth. In this project, models are built to predict whether a person attending services at Crossroads experiences any spiritual growth. These models will help Crossroads recognize the key variables responsible for people not experiencing spiritual growth. The original dataset consisted of 10,222 observations and 35 survey questions. All data cleaning, data manipulation, and statistical analysis were performed using the open-source statistical software R.
Logistic regression was the approach used in this project, given the dichotomous nature of the dependent variable. Variable reduction for the first logistic model was done using stepwise logistic regression based on the AIC criterion; of the 35 variables, 10 proved significant. R treats all categorical variables as factors without requiring dummy variables, which makes model building easier, but in hindsight the model then includes all levels of a categorical variable regardless of their individual significance. Hence, an alternative model was built using only the significant levels of the categorical variables found significant in the first model. The second model turned out to be more robust, and it is simple and easy to replicate in other statistical software.
Apurv Singh, To Predict Whether a Woman Makes Use of Contraceptives or not Based on Her Demographics and Socio-economic Characteristics, April 2015, (Jeffrey Camm, Michael Fry)
The world population is rising rapidly. This is a problem for citizens across the globe because the availability of resources per individual decreases as the population grows. There is a need to create awareness among people from all sections of society about the use of contraceptives to tackle this problem, and it has been observed that contraceptive use is much lower in developing countries than in developed ones. In this project, data on married Indonesian women was obtained, containing demographic and socio-economic information such as age, religion, number of children, and media exposure. Based on these characteristics, the goal is to predict whether or not a woman uses contraceptives. Logistic regression and classification and regression tree (CART) approaches from the Data Mining I course were used to build several models, which were then compared to see which gives the best prediction. This information provides a better understanding of the factors that influence contraceptive use among such women and can help in targeting certain sections of society globally to educate them about the long-term benefits of contraceptive use.
Dan Soltys, Confidence Sets for Variable Selection via Bag of Little Bootstraps, April 2015, (Yichen Qin, Yan Yu)
Using the sample mean as our estimator, we replicate the functionality of the Bag of Little Bootstraps (BLB). Then, utilizing the BLB's improved efficiency on complex tasks, we apply the algorithm to the highly complex task of automated variable selection through stepwise regression and Lasso regression. The goal of this analysis is to obtain a confidence interval for the true model such that the LCL is a subset of the true model, which is in turn a subset of the UCL; that is, we want P(LCL ⊆ True Model ⊆ UCL) = 1 − α. To test our ability to generate this interval, we use Monte Carlo simulations for a variety of selection criteria and determine how close the proportion of true models captured by our interval is to 1 − α.
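A minimal sketch of the BLB idea for the sample mean, the estimator used as the starting point here, is shown below: draw small subsets of size n^gamma, resample each subset back up to the full size n, take a percentile interval per subset, and average the interval endpoints across subsets. Parameter names and defaults are illustrative, not the paper's settings.

```python
import random
import statistics

def blb_ci_mean(data, s=5, gamma=0.6, r=200, alpha=0.05, seed=1):
    """Bag of Little Bootstraps confidence interval for the mean.
    s: number of small subsets; gamma: subset size exponent (b = n^gamma);
    r: bootstrap resamples per subset. Each resample has full size n,
    so the per-subset intervals have the right scale despite the small
    subsets; the final CI averages the subset interval endpoints."""
    rng = random.Random(seed)
    n = len(data)
    b = max(2, int(n ** gamma))
    los, his = [], []
    for _ in range(s):
        subset = rng.sample(data, b)          # subset without replacement
        means = []
        for _ in range(r):
            resample = rng.choices(subset, k=n)  # full-size resample
            means.append(statistics.fmean(resample))
        means.sort()
        los.append(means[int((alpha / 2) * r)])
        his.append(means[int((1 - alpha / 2) * r) - 1])
    return statistics.fmean(los), statistics.fmean(his)
```

The computational win is that each resample touches at most b distinct points, so the full dataset never needs to fit in one worker's memory.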
Richard Tanner, Capital Bike-Share Rebalancing Optimization, April 2015, (Jeffrey Camm, David Rogers)
Bicycle sharing systems are increasingly common fixtures in urbanized areas throughout the world. In large metropolitan areas, thousands of commuters use these systems every weekday to travel to and from work. As the systems grow larger and more complex, it becomes necessary to employ station rebalancers, who each day transport bicycles in a large van from stations with too many bikes to stations with none. Because rebalancing operations are expensive, it is advantageous to perform them in a cost-minimizing manner. Using publicly available data, I attempt to compute the cost-minimizing rebalancing plan for the bicycle sharing system that serves the Washington DC area.
Bijo Thomas, Daily Customer Arrival Forecasting at Cincinnati Zoo Using ARIMA Errors from Historic Customer Arrival Regressed with Fourier Terms, Promotion and Extreme Weather, April 2015, (Craig Froehle, Yichen Qin)
Presently, the Cincinnati Zoo estimates customer arrivals on a particular day entirely from historic arrivals. In this article we discuss a methodology for forecasting daily customer arrivals at the Cincinnati Zoo using the annually seasonal time series, while also accounting for promotions and weather conditions. Accounting for annual seasonality directly in an ARIMA model is challenging, since R runs out of memory for seasonal periods longer than about 200. To handle the long seasonality, we first applied a Fourier transformation to the time series to obtain Fourier terms that capture its seasonality. We then used the Fourier terms, mean temperature, and a promotion dummy variable as regressors in a linear regression model, and fit an ARIMA model to the error terms of that regression. The ARIMA forecast of the errors, combined with the linear regression equation, gave the final forecast of daily customer arrivals.
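The Fourier-terms device for long seasonality can be sketched in a few lines: instead of a seasonal ARIMA term with period 365, each day gets K sine/cosine pairs as ordinary regressors. The helper below is an illustration of that construction only (the project itself worked in R, and the names and defaults here are assumptions).

```python
import math

def fourier_terms(t, period=365.25, K=3):
    """K sine/cosine Fourier pairs for day index t, encoding annual
    seasonality as plain regression columns rather than a long
    seasonal ARIMA period. Returns a list of length 2*K."""
    row = []
    for k in range(1, K + 1):
        angle = 2 * math.pi * k * t / period
        row.extend([math.sin(angle), math.cos(angle)])
    return row
```

Stacking these rows for every date in the sample gives the design-matrix columns that sit alongside temperature and promotion dummies in the regression, with ARIMA then applied to the regression residuals.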
Aditya Utpat, Study of Campaign Oriented Enrollment Performance for Mailing Solutions, April 2015, (Uday Rao, Edward Winkofsky)
Mailing-solution businesses that run promotional campaigns using mailing lists seek to improve the success of their promotions, where success is measured by the campaign's enrollment rate. Mailing lists typically come with additional data, and a structured analysis of it can shed light on the suitability of prospective targets for a specific campaign. This project seeks to quantify and better understand the impact of these additional factors on the success/enrollment rate of a campaign. The additional data includes parameters such as Prizm codes and census information that are widely accepted and used in the industry. Statistical models help uncover relationships between the factors, identify the significant ones, and give a quantitative measure of their impact on the outcome. The results and findings provide management with a data-driven approach to making policies and decisions about the fate of the campaign or product.
Karthick Vaidyanathan, Predictive Analytics in Sports Events, April 2015, (Jeffrey Camm, Michael Magazine)
Machine learning techniques are often used to predict outcomes in areas like retail, banking, insurance, and defense. The realization of their benefits has also led to their use in sports: tennis, baseball, and football have started to tap the potential of predictive analytics to convert vast amounts of player data into winning strategies.
In this project, I use logistic regression to predict the outcomes of a famous world event, the 2015 Australian Open. I use past data on all 128 players participating in the tournament. The tennis data covers all matches played from 2000 through December 2014 and includes information on the winner, loser, rankings, points of both players in each set, venue, court surface, and tournament details. Logistic regression is a supervised learning technique: I provide input data with past results and let the system predict the dichotomous result of future outcomes. I used the matches of 2014 for the system to learn from, and then predicted the results of all 127 matches of the 2015 Australian Open.
Zachary A. Finke, Determining the Effect of Traveling Across Time Zones on Major League Baseball Teams in the 2013 Regular Season, August 5, 2014 (Michael Magazine, David Rogers)
Major League Baseball (MLB) is composed of thirty teams spread out across the United States, plus one team in Toronto, Canada, so travel is a large aspect of professional baseball. The goal of this project was to analyze the 2013 regular season to determine the effect of traveling across time zones on a baseball team's success in games. The hypothesis is that (1) traveling to another time zone significantly correlates with a team's success rate, and (2) when a team is playing away from home in another time zone, its chance of winning the game decreases, specifically in the first game after travel. The number of days of rest between games, the number of time zones from home, the number of time zones from the previous game, being home or away, and traveling after a day or night game were tested as independent variables against the dependent variable of winning or losing a game. Linear regression models were built and compared to test for statistically significant correlations. The analysis found no statistical significance in these models, suggesting that traveling across time zones did not affect a team's success in games during the 2013 season.
Bryce A. Alurovic, Emergency-Department Overcrowding: A Patient-Flow Simulation Model, August 4, 2014 (David Kelton, Uday Rao)
In the United States, annual visits to emergency departments increased from 90.3 million in 1996 to 119.2 million in 2006 (Saghafian et al. 2012, p. 1080). The continuation of these trends has made emergency-department overcrowding a very serious problem. This project models the path of a patient through two systems: a base-case scenario models the current system, while an alternative scenario re-directs non-critical patients to a local urgent-care center. Length of stay for re-directed and urgent-care patients, along with emergency-department and urgent-care-center utilization, are compared across models. Patients classified as non-critical (triage level 4 or 5) who are re-directed see a significant decrease in length of stay, while urgent-care patients see an increase in length of stay. While re-directing patients has a positive impact on emergency-department utilization, the magnitude is not as pronounced as expected. Patient re-direction helps to improve the emergency department and the healthcare system, although additional re-direction policies may be preferable. All conclusions drawn from this research are justified by proper statistical analysis.
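The base-case-versus-redirect comparison can be sketched as a simple queueing simulation; the arrival and service rates below are illustrative assumptions (with a deliberately overloaded ED to mimic overcrowding), not the project's simulation model or its calibrated data.

```python
import random

def simulate(redirect, n_patients=500, seed=42):
    """Single-server ED (plus optional urgent-care server) FIFO queue.
    Returns the mean length of stay for non-critical patients."""
    rng = random.Random(seed)
    arrivals, t = [], 0.0
    for _ in range(n_patients):
        t += rng.expovariate(1 / 10.0)       # mean 10 min between arrivals
        level = rng.choice([1, 2, 3, 4, 5])  # triage acuity
        arrivals.append((t, level))
    ed_free = uc_free = 0.0
    los = []
    for arr, level in arrivals:
        noncrit = level >= 4                 # triage level 4 or 5
        if redirect and noncrit:
            start = max(arr, uc_free)
            uc_free = start + rng.expovariate(1 / 20.0)  # urgent care: mean 20 min
            los.append(uc_free - arr)
        else:
            start = max(arr, ed_free)
            ed_free = start + rng.expovariate(1 / 45.0)  # congested ED: mean 45 min
            if noncrit:
                los.append(ed_free - arr)
    return sum(los) / len(los)

base = simulate(redirect=False)  # everyone waits in the overcrowded ED
alt = simulate(redirect=True)    # non-critical patients go to urgent care
```

Under these made-up rates the re-directed patients' mean length of stay drops sharply, echoing the direction of the project's finding.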
Mingye Su, Retail Store Sales Forecasting: A Time-Series Analysis, July 25, 2014 (Yan Yu, David Rogers)
In this project, common time-series forecasting techniques are implemented to determine the best forecasting models for retail-store sales prediction. ARIMA and exponential-smoothing methods are utilized and combined with regression analysis and seasonal-trend decomposition with Loess (STL). Forecasting models are iteratively built on store sales at the department level. The study found that ARIMA and exponential-smoothing models with STL decomposition generally perform well across all departments, and that averaging the forecasts of all models is superior to any single forecasting model used in the study.
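The forecast-averaging idea can be sketched by combining two simple forecasters with equal weights; the quarterly sales figures and smoothing constant below are made up for illustration, not the study's store data or its ARIMA/STL models.

```python
def ses_forecast(series, alpha=0.3):
    """Simple exponential smoothing: one-step-ahead forecast."""
    level = series[0]
    for x in series[1:]:
        level = alpha * x + (1 - alpha) * level
    return level

def seasonal_naive_forecast(series, period=4):
    """Repeat the value observed one season ago."""
    return series[-period]

def averaged_forecast(series, period=4):
    """Equal-weight combination of the two individual forecasts."""
    return 0.5 * ses_forecast(series) + 0.5 * seasonal_naive_forecast(series, period)

sales = [100, 120, 90, 110, 105, 125, 95, 115]  # two "years" of quarterly sales
f = averaged_forecast(sales, period=4)
```

In the study the ensemble members are full ARIMA and exponential-smoothing models on STL-decomposed series, but the combination step is the same equal-weight average.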
Sumit Makashir, Statistical Meta-Analysis of Differential Gene Co-Expression in Lupus, July 25, 2014 (Yan Yu, Yichen Qin)
In this study, we developed a statistical framework for meta-analysis of differential gene co-expression. We then applied this framework to systemic lupus erythematosus (SLE) disease. To perform meta-analysis of differential gene co-expression in SLE, we used data from five microarray gene expression studies. Several interesting results were observed. Gene networks built from the top differentially co-expressed gene pairs showed a consistent enrichment for established SLE-associated genes. Analysis of the network consisting of the top 1500 differentially co-expressed gene pairs showed that ELF1, an established SLE-associated gene, was differentially co-expressed with the largest number of other genes. Several results from the analysis of this network are consistent with well-established facts about SLE. The enrichments of gene modules for viral-defense response, bacterial response, and other immune-response-related terms are the key consistent findings. Many other results have very interesting biological implications.
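One common building block for differential co-expression, shown here as a generic illustration rather than the study's exact framework, is a Fisher z-test for a change in a gene pair's correlation between two conditions; the expression vectors below are invented.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length expression vectors."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def diff_coexpression_z(expr_a, expr_b, n_a, n_b):
    """Fisher z statistic for a change in correlation between two
    conditions (e.g. disease vs. control) for one gene pair."""
    r_a, r_b = pearson(*expr_a), pearson(*expr_b)
    z_a = 0.5 * math.log((1 + r_a) / (1 - r_a))  # Fisher transform
    z_b = 0.5 * math.log((1 + r_b) / (1 - r_b))
    se = math.sqrt(1 / (n_a - 3) + 1 / (n_b - 3))
    return (z_a - z_b) / se

# A gene pair strongly correlated in condition A, weakly in condition B.
gene1_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
gene2_a = [1.1, 2.0, 2.9, 4.2, 5.1, 5.9, 7.2, 8.1]
gene1_b = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
gene2_b = [5.0, 1.0, 6.0, 2.0, 7.0, 3.0, 8.0, 4.0]
z = diff_coexpression_z((gene1_a, gene2_a), (gene1_b, gene2_b), 8, 8)
```

A meta-analysis would combine such statistics for each gene pair across the five studies before ranking pairs.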
Lingchong Mai, Supply-Chain Modeling and Analysis: Activity-Costs Comparison Between the Current Supply Chain and the "Ideal-State" Supply Chain, July 24, 2014 (Jeffrey Camm, Michael Fry)
This report examines the salon-products supply-chain network of a local manufacturing company. The current supply-chain network is a multi-node supply chain with production sites, the company's mixing centers, customer distribution centers, and the retail stores. Products are delivered between sites by trucks. In order to reduce the number of nodes, activity costs, and lead-time durations, an "ideal-state" supply-chain model is developed. In this "ideal-state" supply chain, mixing centers and customer distribution centers are proposed to be eliminated. Production sites perform partial functions of mixing and distribution centers, and products are shipped directly to retail stores via third-party service vendors. Activity-costs analysis and sensitivity analysis are conducted on both the current supply-chain model and the "ideal-state" supply-chain model under different scenarios. The project is part of a supply-chain research project undertaken by the University of Cincinnati Simulation Center (http://www.min.uc.edu/ucsc) and the company. Due to the confidentiality agreements with the company, the company's name is not mentioned directly in this report, and all data mentioned in this report are disguised forms of the real data.
Chris Fant, American Athletic Conference Football Division Alignment: Minimizing the Travel Distance and Maintaining Division Balance, July 24, 2014 (David Rogers, Michael Fry)
Starting in 2015, the American Athletic Conference (AAC) will add universities and form two divisions for football, with a conference championship game between the division winners. The AAC will consist of twelve teams with varying rankings and locations throughout the country. A mixed-integer linear-programming model was solved to find the divisions that minimize average travel distance while keeping a balance of ranking and travel between the divisions. Hierarchical and K-means clustering analyses were also used for comparison with the optimization results. Travel distance and opponent ranking were used to evaluate the realignment scenarios. The optimization analysis showed that two teams were not placed in the appropriate division. The proposed optimal alignment would save the AAC 13,299 round-trip miles every four years.
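For intuition, the alignment problem can be sketched by brute-force enumeration of balanced partitions, which is feasible at this scale (a 12-team league has only 462 distinct 6/6 splits); the team coordinates below are hypothetical, and the sketch ignores the ranking-balance constraints of the actual mixed-integer model.

```python
from itertools import combinations

def best_split(coords):
    """Enumerate all equal-size partitions of the teams and return the one
    minimizing total intra-division pairwise distance (coords: name -> (x, y))."""
    names = sorted(coords)
    half = len(names) // 2

    def d(a, b):
        ax, ay = coords[a]
        bx, by = coords[b]
        return ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5

    anchor, rest = names[0], names[1:]  # fix one team to avoid mirror splits
    best = None
    for combo in combinations(rest, half - 1):
        div_a = set(combo) | {anchor}
        div_b = set(names) - div_a
        cost = sum(d(a, b) for div in (div_a, div_b)
                   for a, b in combinations(sorted(div), 2))
        if best is None or cost < best[0]:
            best = (cost, div_a, div_b)
    return best

# Hypothetical coordinates forming a western and an eastern cluster.
coords = {"Cincinnati": (0, 0), "Memphis": (1, 0), "Tulsa": (2, 0),
          "UConn": (10, 0), "Temple": (11, 0), "UCF": (12, 0)}
cost, div_a, div_b = best_split(coords)
```

The real model additionally balances team rankings between divisions, which is why a mixed-integer formulation (rather than pure enumeration or clustering) was used.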
Ole Jacobsen, NFL Decision Making: Evaluating Game Situations with a Markov Chain, July 24, 2014 (Michael Fry, David Rogers)
This research analyzes and quantifies all of the scenarios or starting plays that take place in a National Football League game. A Markov chain is used to create an expected point value for 700 situations based on possession, down, distance, and field position, using 150,000 plays from five seasons of National Football League play-by-play data. The variability and mean of each upcoming play can then be weighed and considered from this model. This expected value is then modeled to test for linearity, as well as to find coefficients to value each starting situation. Also discovered within the model were the most volatile situations on the football field: the situations where, after one play is run, the expected point value consistently changes the most. From these Markov-chain models, it is determined that home-field advantage on any given play is worth 0.52 points, a same-field-position first-down turnover is worth 5.01 points, and the successful yardage required to make up for the loss of a down is 6.2112 yards.
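The expected-point calculation can be sketched as value iteration on a small Markov chain; the states, transition probabilities, and point values below are invented for illustration and are far coarser than the 700-state model built from the play-by-play data.

```python
def expected_points(transitions, rewards, n_iter=200):
    """Iteratively compute each state's expected point value:
    v(s) = r(s) + sum over s' of P(s, s') * v(s')."""
    v = {s: 0.0 for s in transitions}
    for _ in range(n_iter):
        v = {s: rewards.get(s, 0.0) +
                sum(p * v[t] for t, p in transitions[s].items())
             for s in transitions}
    return v

# Toy chain: three field-position states plus absorbing scoring outcomes.
transitions = {
    "own20":     {"midfield": 0.5, "turnover": 0.5},
    "midfield":  {"opp20": 0.4, "own20": 0.3, "turnover": 0.3},
    "opp20":     {"touchdown": 0.5, "field_goal": 0.3, "turnover": 0.2},
    "touchdown": {}, "field_goal": {}, "turnover": {},
}
rewards = {"touchdown": 7.0, "field_goal": 3.0, "turnover": 0.0}
v = expected_points(transitions, rewards)
```

In the full model the states encode down, distance, and field position, and the one-play change in v identifies the most volatile situations.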
Xuejiao Diao, Sentiment Analysis on Amazon Tablet Computer Reviews, July 22, 2014 (Roger Chiang, Yan Yu)
The project uses multinomial naive Bayes classification to classify Amazon.com tablet-computer reviews into helpful and unhelpful reviews, and into ratings 1 through 5. The top 50 features for helpful and unhelpful reviews were selected by ranking Chi-square scores from highest to lowest; for ratings 1 through 5, the top 20 features were selected for each class. The optimal numbers of features for classifying reviews were obtained by comparing accuracy, precision, and recall for 1000, 3000, 10000, 30000, and 100000 features. It was found that the more positive the rating, the more likely a peer customer is to vote the review helpful, and that polarized reviews attract a higher number of total helpful/unhelpful votes than do neutral reviews.
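A minimal multinomial naive Bayes classifier with Laplace smoothing, run on made-up review snippets rather than the Amazon data, can be sketched as:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Count word frequencies per class for multinomial naive Bayes."""
    counts = defaultdict(Counter)  # label -> word counts
    label_counts = Counter(labels)
    vocab = set()
    for doc, lab in zip(docs, labels):
        words = doc.lower().split()
        counts[lab].update(words)
        vocab.update(words)
    return counts, label_counts, vocab

def classify(doc, counts, label_counts, vocab):
    """Pick the class with the highest log posterior (Laplace-smoothed)."""
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for lab, n_docs in label_counts.items():
        total_words = sum(counts[lab].values())
        score = math.log(n_docs / total_docs)  # class prior
        for w in doc.lower().split():
            score += math.log((counts[lab][w] + 1) /
                              (total_words + len(vocab)))  # smoothed likelihood
        if score > best_score:
            best_label, best_score = lab, score
    return best_label

docs = ["great tablet fast screen", "love the battery great value",
        "terrible screen broke fast", "waste of money terrible"]
labels = ["pos", "pos", "neg", "neg"]
model = train_nb(docs, labels)
pred = classify("great battery", *model)
```

In the project the same classifier is trained on Chi-square-selected features rather than the full vocabulary.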
Paul Reuscher, The Affordable Care Act: A Meta-Analysis of the 18-34 Demographic Through Market Surveys, Price Elasticities, and Variable Dominance Analysis, July 21, 2014 (Jeffrey Camm, Jeffrey Mills [Department of Economics])
The purpose of this research is to better understand the issues concerning utilization of the Affordable Care Act program by the 18-34 demographic, which is not participating in the program at the intended rate. This utilization by the 18-34 demographic is not just desired by the government; it is necessary to keep the system operable and affordable in the long run. Through market-survey research, price-elasticity analysis, and variable-dominance (hierarchical-partitioning) analysis, I provide empirical information to further insight into these issues for the healthcare and public-policy audience.
Ya Meng, An Analysis of Predictive Modeling and Optimization for Student Recruitment (Jeffery Camm, B.J. Zirger [Department of Management])
The Department of Admissions at the University of Cincinnati receives four to five thousand applications every year. After evaluating each application, the Department of Admissions sends offers, with or without financial aid, to qualified applicants. With one or multiple admission offers in hand, applicants decide whether to accept or decline. This project provides a detailed analysis of the factors that influence applicants' decisions, based on applicant characteristics from records from 2002 to 2012. The objective is to find the influential factors, especially financial aid, that applicants take into consideration when making their decisions. This paper introduces two predictive models, a logistic regression model and a classification-tree model, to unveil the association between offer acceptance and applicants' personal information, application characteristics, and financial aid. The results suggest that the logistic regression model has better predictive performance than the tree model, and that application preference, age, test score, home distance, amount of financial aid, and high-school type are crucial factors in applicants' choices. As the final step of the project, an optimization analysis of financial aid is conducted for applicants in 2012. Since financial aid is the only factor that the school can control in admissions, allocating the limited financial aid well is necessary to attract ideal applicants. Based on the logistic regression model, different optimization models are discussed and a heuristic method of financial-aid allocation is developed.
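The spirit of the heuristic allocation step can be sketched as a greedy procedure that repeatedly gives the next increment of aid to the applicant whose acceptance probability rises the most; the logistic coefficients, applicants, and budget below are hypothetical, not the fitted admissions model.

```python
import math

def p_accept(aid, base_logit, aid_coef=0.0002):
    """Hypothetical logistic acceptance probability as a function of aid dollars."""
    z = base_logit + aid_coef * aid
    return 1 / (1 + math.exp(-z))

def greedy_allocate(applicants, budget, step=1000):
    """Allocate aid in $1000 increments, each time to the applicant with the
    largest marginal gain in acceptance probability."""
    aid = {name: 0 for name in applicants}
    for _ in range(budget // step):
        gains = {name: p_accept(aid[name] + step, logit) - p_accept(aid[name], logit)
                 for name, logit in applicants.items()}
        winner = max(gains, key=gains.get)
        aid[winner] += step
    return aid

# Baseline log-odds of accepting without aid; C is already very likely to accept.
applicants = {"A": -1.0, "B": 0.0, "C": 1.5}
allocation = greedy_allocate(applicants, budget=10000)
```

Note the greedy rule naturally gives nothing to applicants who are near-certain to accept anyway, which is the intuition behind targeting aid at marginal admits.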
Vignesh Rajendran, Recursive Clustering on Non-Profit Contributions: An Application of Hierarchical Clustering to Non-Profit CRM Data, July 11, 2014 (Michael Fry, Jeffrey Camm)
Customer segmentation is an important part of strategic decision making in customer relationship management. In this project, segmentation is applied to contributors of United Way of Greater Cincinnati (UWGC), a non-profit organization that receives contributions from around 300,000 employees of more than 2500 companies every year. By recursively applying hierarchical clustering with Ward's linkage, this project proposes a novel method to cluster the contributing companies for UWGC. Using this method on metrics that closely depict growth potential and customer behavior, the contributing companies are segmented into clusters that help UWGC better understand its contributors and better plan its contributor campaigns.
Aashin Singla, Modeling the Preference of Wine Quality Using Logistic-Regression Techniques, July 9, 2014 (Yichen Qin, Yan Yu)
Vinho Verde wine is a unique product whose blend of aroma and petillance makes it one of the most delicious natural beverages. Quality is assessed in several ways, including physicochemical properties and sensory tests. The Viticulture Commission of the Vinho Verde Region rated the wine quality using physicochemical properties, and these properties can be used to model wine quality. This analysis extends a group report done in a Data-Mining course project using decision trees, support vector machines, and neural-network methods. I analyze the same data set to predict wine taste preferences using two logistic-regression approaches. As the output (dependent) variable is an ordered categorical set, I consider ordinal and multinomial logistic regression. For ordinal logistic regression, the dependent variable must be an ordered response variable, and the independent variables may be categorical, interval, or continuous, and must not be collinear. To address multicollinearity, linear regression is applied to determine the best model. After applying this technique, we find that sulphates, an anti-oxidant and aroma agent, and volatile acidity, responsible for the acidic taste, are significant factors for grading the wine. When some of the assumptions of the ordinal logistic model were violated, multinomial logistic regression was applied; it is used when the dependent variable has a set of categories that cannot be ordered. Using this technique, residual sugar and alcohol show statistically significant negative effects throughout the various quality levels.
Ashmita Bora, Analyzing Student Behavior to Understand Their Rate of Success in STEM Programs, June 2, 2014 (Yichen Qin, Michael Fry)
The objective of this project is to analyze University of Cincinnati students and their rate of success in STEM (Science, Technology, Engineering, and Mathematics) programs. We are interested in identifying patterns of student enrollment in STEM programs at UC and in understanding whether factors such as gender and race have an impact on students' success in a program. Although there are many interesting questions that could be answered, we focus our study on three analyses: (1) an exploratory analysis of the covariates for successful and unsuccessful students at UC, where unsuccessful students are defined as students enrolled in a program for more than 6 years who did not graduate; (2) predictive modeling to identify statistically significant covariates that predict the probability of switching from a STEM program to a non-STEM program; and (3) predictive modeling to identify variables that determine the amount of time a student needs to graduate from a STEM program. Modeling techniques such as logistic regression and accelerated failure time (AFT) models are used to model the probability of switching from a STEM to a non-STEM program and the time to graduate, respectively.
Subramanian Narayanaswamy, A Descriptive and Predictive Modeling Approach to Understand the Success Factors of STEM and Non-STEM Students at the University of Cincinnati, May 16, 2014 (Yichen Qin, Michael Fry)
The objective of this project is twofold: (1) to understand and identify the demographic and academic factors that differentiate the performance of STEM and non-STEM students at the University of Cincinnati (UC); (2) to perform an in-depth descriptive analysis and build preliminary predictive models to identify the predictors of successful students at UC. Successful students are defined as those who graduate within six years of enrollment. In addition to descriptive statistics analyzing student performance, predictive models are presented that use logistic regression to estimate student success based on a variety of potential predictor variables. This work uncovers interesting comparisons between STEM and non-STEM students based on demographics and student background, while also identifying important characteristics of successful students at UC.
Vishal Ugle, Developing a Score Card for Selecting Fund Managers, May 7, 2014 (David Rogers, George Polak [Wright State University])
Absolute Return Strategies (ARS) analyzes directional investment-performance patterns in the global foreign-exchange market using a proprietary data set provided through a strategic relationship with Citibank FX. ARS has data for about 40 currency hedge-fund managers across the globe. The objective of the research is to develop a model that identifies the managers who can deliver better returns over the next six-month period and to develop a score card to rate the managers. ARS currently employs a model, the Alpha consistency model, that rates managers on a five-point scale based on their returns in every six-month period since they have been active in the market. The new approach developed here, called Alpha consistency 2.0, uses performance parameters such as returns, volatilities, and drawdowns for each manager over his or her active history. Since the data have a high degree of multicollinearity, I used exploratory factor analysis to identify the underlying correlation structure among these variables and obtain loadings for each variable. I generated five factors, of which three are used to represent volatility and downside, returns, and drawdowns. After generating the factors, I calculated the factor scores for each manager and used these scores to rate each manager.
Kiran Krishnakumar, Sales Prediction Using Public Data: An Emerging-Markets Perspective, April 18, 2014 (Yichen Qin, Edward Winkofsky, co-chairs)
In emerging markets, the availability of marketing data is limited by poor quality and reliability. This project describes ways in which companies can leverage the unexploited pool of publicly available data to deliver analytical insights that support marketing initiatives. The project studies the case of a Spanish manufacturer in the bathroom-spaces industry that is interested in pursuing marketing opportunities in emerging markets in Asia. It uses various regression-modeling and analytical techniques to build a statistical model to help predict product sales. The dataset is custom-sourced, combining the company's internal point-of-sale data with more than 50 public datasets that include financial indicators, demographic indicators, and risk factors. The project uses SAS to conduct the analysis and leverages concepts including multivariate regression, generalized linear models, and logistic regression. It provides an overview of issues related to data quality in emerging markets and shows how companies can leverage public data to develop analytical insights when constrained by the availability of reliable data.
Yichen Liu, Group LASSO in Logistic Regression, April 16, 2014 (Yichen Qin, Yan Yu)
When building a regression model, it is important to select the relevant variables from a large pool that contains both continuous and categorical variables. Group LASSO (Least Absolute Shrinkage and Selection Operator) is an advanced method for variable selection in regression modeling. After groups of variables are predefined using a group index, it minimizes the corresponding negative log-likelihood function subject to the constraint that the sum of the Euclidean norms of the groups of regression coefficients is less than a tuning constant. Due to this constraint, the regression coefficients are shrunk toward zero and some groups are reduced exactly to zero. Thus, Group LASSO can improve the interpretability of the regression model via variable selection and stabilize the regression by shrinking the coefficients. The predefined group index ensures that once one variable in a group is shrunk to zero, all the other variables in the same group are also shrunk to zero. Therefore, Group LASSO can select variables for models containing both continuous and categorical variables. This research project studies the performance of Group LASSO in logistic regression. To show its strengths, Group LASSO is applied both to simulated data and to real data from the Worcester Heart Attack Study (WHAS). Ten-fold cross-validation was performed to choose the tuning constant.
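The mechanism by which whole groups vanish is the group soft-thresholding (proximal) operator, sketched below on hand-picked coefficients; this is a generic illustration of the penalty's effect, not the study's fitting algorithm.

```python
import math

def group_soft_threshold(beta, groups, lam):
    """Proximal step for the group-LASSO penalty: shrink each group of
    coefficients toward zero by its Euclidean norm; weak groups vanish."""
    out = list(beta)
    for idx in groups:
        norm = math.sqrt(sum(beta[j] ** 2 for j in idx))
        scale = max(0.0, 1 - lam / norm) if norm > 0 else 0.0
        for j in idx:
            out[j] = scale * beta[j]
    return out

# Two groups (e.g. dummy columns of two categorical variables):
# a strong group survives (merely shrunk), a weak group is zeroed entirely.
beta = [3.0, 4.0, 0.1, -0.2]
groups = [[0, 1], [2, 3]]
shrunk = group_soft_threshold(beta, groups, lam=1.0)
```

Because the whole group shares one scaling factor, either every dummy column of a categorical variable stays in the model or all of them leave together.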
Abhimanyu Kumbara, Hospital Performance Rating: A Super-Efficiency Data-Envelopment Analysis Model, April 15, 2014 (Craig Froehle, Michael Fry)
The United States spends approximately 18% of its GDP on healthcare. Nearly $650 billion of that spending is due to inefficiencies in the healthcare market. Cost-containment proposals have focused primarily on payment reforms, with approaches such as pay for performance and bundled payments. The non-emergency nature of elective procedures provides a great opportunity for reducing costs. Percutaneous cardiovascular surgery is a popular elective procedure, with more than half a million Americans undergoing it every year. The average procedure charge ranges from around $27,000 to nearly $100,000. This, along with the shift towards outcome-based care models, motivates hospitals to become more efficient and provide high-quality, cost-effective care. This project provides a measure of efficiency by performing a super-efficiency data-envelopment analysis (SEDEA) on hospitals from Ohio and Kentucky that perform percutaneous cardiovascular surgery with drug-eluting stents. The data used in this study contain the procedure charges and quality information for approximately 86 hospitals from Ohio and Kentucky, obtained from healthdata.gov, a healthcare-related data repository. The SEDEA model, implemented in R, uses cost and quality measures for each hospital to calculate hospital efficiency scores and ranks the hospitals accordingly. Hospitals can use the ratings to assess their current market standings. Other healthcare-market participants, such as insurers, could use the ratings as a comparison tool for making cost-effective decisions about elective procedures.
Joshua Phipps, Exploratory Data Analysis for United Way, April 11, 2014 (Michael Fry, Jeffrey Camm)
United Way, a non-profit organization that collects donations and provides opportunities for volunteers to help the community, has recently digitized the past seven years of its donation and volunteering records. The open-ended question of "what do these data tell us?" was posed, and through exploratory data analysis we show what the data hold, what they lack, and possible deficiencies, and we produce some insights into the segmentation of donations. Further exploratory data analysis was conducted to determine the effect of volunteering on donation amount. Grouping volunteers by their donation behavior allowed United Way to better evaluate the interaction between volunteering and donating. We are able to show that it is worthwhile to push for more volunteers, and we give recommendations for future analyses to better tailor United Way's efforts.
Mark E. Nichols, Evaluation and Improvement of New-Patient Appointment Scheduling, April 11, 2014 (David Kelton, Jeffrey Camm)
New Patient Lag (NP Lag) is the primary metric used to determine the efficiency of initial patient scheduling at University of Cincinnati Medical Clinics. It is defined as the Julian date of a new patient's appointment minus the Julian date when the new patient scheduled the appointment. NP Lag is the clinic's first impression on the public, and it varies across the thirty-plus clinics from approximately 1 week to over 7 weeks. Open-Access is a system of prescribed changes in clinic scheduling and operating procedures designed to bring NP Lag down to near-same-day access, or an NP Lag of nearly zero. In an effort to examine some of the Open-Access methods prior to implementation, the scheduling at a particular clinic was studied. The neurology clinic, with its average 36 work-day NP Lag, was modeled in Arena with current scheduling logic and rules. After testing and validation, the model was altered based on selected aspects of Open-Access to evaluate the effects on NP Lag: removal of new/existing-patient restrictions on open appointment slots, reduction of standard appointment slots from 30 to 20 minutes, and reduction in appointment cancellations near the appointment date. The simulation showed that removing the new/existing-patient restriction from open appointments reduced NP Lag by approximately 36 work-days, to an NP Lag of less than one work-day. The 36 work-day improvement was based on a 99% confidence interval and was the best NP Lag improvement of the three trials.
Matthew Schmucki, Visualizing Distribution Optimization, March 11, 2014 (Jeffrey Camm, Michael Fry)
Manufacturers strive to operate as efficiently as possible. Moving product incurs handling and shipping costs, so ensuring that product from manufacturing plants is optimally shipped to the distribution centers that supply customers is a potential cost savings. Currently, linear programmers can model these distribution networks in programs such as AMPL and CPLEX, but the solution is shown in a crosstab-table format. These tables make it difficult to explore the solution. If linear programmers could visualize the solution, it would be easier to share the data and discuss it with non-programmers. Ideally, the visualization software would also solve the problem, creating a one-stop shop. This would eliminate the need to enter information into AMPL or CPLEX, export the numerical solution, and import it into an additional piece of software. This paper discusses a no-cost Visual Basic for Applications (VBA) program that can be executed in Excel, solve a distribution problem, and then display the solution visually via Google Chrome. This method saves time, does not require programming knowledge, and creates a visualization of the solution with no additional effort. The visualization helps users see numerical solutions and better understand what a solution is indicating.
Joel C. Weaver, Enhancing Classroom Instruction by Finding Optimal Student Groups, March 3, 2014 (David Rogers, Jeffrey Camm)
Classrooms today are structured around a philosophy of teaching and learning that deviates from the traditional direct-instruction model of the past, in which students sit in rows and listen to lectures for the duration of the class period. Conversely, students in classrooms today are spending the majority of their time working together in one of many different types of student grouping arrangements. The assignment of students to different groups is typically driven by data representing students' ability levels, with the exact method of data utilization varying depending on the type of grouping arrangement that is desired by the educator (i.e., ranked ability, mixed ability, etc.). Determining the grouping assignments based on those ability levels, which should be derived from multiple types of student data, can be a time-consuming task. Additionally, when combining that data analysis with the multitude of potential classroom grouping constraints (separating students, limiting groups by gender, special needs, etc.), the student grouping task can become exceptionally arduous. This research project provides a solution to the difficulties that educators face when trying to create optimal student groups for specific grouping arrangements. A working optimization model was developed to provide educators with a useful tool for addressing their changing needs. In order to maximize accessibility and minimize cost to educators, the model was created in Microsoft Excel with a free open-source add-in called OpenSolver. By implementing this model, educators can enhance the classroom learning experience more easily through accurate and timely determination of optimal student grouping assignments.
Wenjing Song, Methods for Solving Vehicle-Routing Problems in a Supply Chain, December 6, 2013 (Michael Fry, Jeffrey Camm)
The capacitated vehicle-routing problem (CVRP) determines the optimal set of routes for a fleet of vehicles to meet the demand of a given set of customers or suppliers, subject to vehicle-capacity restrictions. The goal here is to determine optimal or near-optimal routes to get all the necessary materials from the suppliers to the manufacturing plant at the lowest cost. Input data include the volume of materials that must be picked up from each supplier and a distance matrix providing distances between all possible pairs of suppliers. Each vehicle is assumed to have a capacity of 52 units. Optimal and heuristic solution methods are explored in this project. Optimal models are solved using AMPL/CPLEX, and a genetic algorithm coded in C represents a meta-heuristic solution method. We compare the solutions from the optimal and heuristic approaches; the difficulties and challenges of each method are also discussed.
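As a baseline heuristic for comparison (simpler than the genetic algorithm used in the project), a capacity-aware nearest-neighbor construction can be sketched as follows; the depot, stops, and demands are made up.

```python
def nearest_neighbor_routes(depot, stops, demand, capacity=52):
    """Greedy CVRP heuristic: repeatedly extend the current route to the
    nearest unvisited supplier that still fits the vehicle's capacity."""
    def dist(a, b):
        return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

    unvisited = set(stops)
    routes = []
    while unvisited:
        route, load, pos = [], 0, depot
        while True:
            feasible = [s for s in unvisited if load + demand[s] <= capacity]
            if not feasible:
                break  # start a new vehicle
            nxt = min(feasible, key=lambda s: dist(pos, stops[s]))
            route.append(nxt)
            load += demand[nxt]
            pos = stops[nxt]
            unvisited.discard(nxt)
        routes.append(route)
    return routes

# Hypothetical suppliers: coordinates and pickup volumes.
depot = (0, 0)
stops = {"s1": (1, 0), "s2": (2, 0), "s3": (0, 5)}
demand = {"s1": 30, "s2": 30, "s3": 10}
routes = nearest_neighbor_routes(depot, stops, demand)
```

Such a constructive heuristic often seeds a meta-heuristic's initial population, after which crossover and mutation search for better routes.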
Pramod Badri, Identifying Cross-Sell Opportunities Using Association Rules, December 6, 2013 (Jeffrey Camm, David Rogers)
Retailers process huge amounts of data on a daily basis. Each transaction contains details about customer behavior and purchase patterns. The objective of this study is to analyze prior point-of-sale (POS) data and identify groups of products that have an affinity to be purchased together. In this paper, I detail the steps involved in developing associated product sets using association rules. These rules can be used to perform market-basket analysis, which can help retailers understand the purchase behavior of customers. With a given set of rules, the retailer is better able to cross-sell, up-sell, and improve store design for higher sales.
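The support/confidence computation behind such rules can be sketched on toy baskets; the thresholds and transactions below are illustrative, not the retailer's POS data.

```python
from itertools import combinations

def mine_rules(transactions, min_support=0.3, min_confidence=0.6):
    """Find single-item rules A -> B meeting support and confidence thresholds."""
    n = len(transactions)
    items = {i for t in transactions for i in t}

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    rules = []
    for a, b in combinations(sorted(items), 2):
        for ante, cons in [({a}, {b}), ({b}, {a})]:
            sup = support(ante | cons)       # how often both appear together
            if sup >= min_support:
                conf = sup / support(ante)   # P(cons | ante)
                if conf >= min_confidence:
                    rules.append((tuple(ante), tuple(cons), sup, conf))
    return rules

baskets = [{"tablet", "case"}, {"tablet", "case", "stylus"},
           {"tablet", "stylus"}, {"case"}, {"tablet", "case"}]
rules = mine_rules(baskets)
```

Real market-basket mining (e.g. Apriori) extends the same support/confidence logic to multi-item antecedents while pruning infrequent itemsets.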
Hang Cheng, An Analysis of the Cintas Corporation's Uniform Service, December 5, 2013 (David Rogers, Yichen Qin)
The Cintas Uniform Service provides services such as renting, designing, and manufacturing customized uniforms for employees in various companies. The objective of this paper is to identify influential elements for the Cintas Uniform Service, and to predict future performance and tendencies. The business data used in this study are from Cintas, and geographic and demographic data are from the US Government Census. We first explore the relationship between the Cintas uniform service and influential factors such as location, industry, and employment in a linear regression. Time-series analysis is used to predict the service usage based on the previous five years' data. This research provides a regression model that shows the influential factors for the Cintas uniform service and a statistical prediction of the future number of wearers, which guides future manufacturing and marketing planning. By implementing the findings from this study, Cintas could optimize its uniform-service management and form a comprehensive sales strategy across various regions and industries.
Ting Li, Estimation and Prediction on the Term Structure of the Bond Yield Curve, December 5, 2013 (Yan Yu, Hui Guo [Department of Finance])
Estimation and prediction of the term structure of the bond yield curve have been studied for decades. In this work, the Treasury bond yield data from 1985 to 2000 are studied. Both in-sample fitting and out-of-sample forecasting performance of the yield curve are evaluated. The Nelson-Siegel model is applied to the data for estimation. Then the time-varying regression coefficients are further studied based on time-series analysis with a Box-Jenkins model. We first fit the linear model with a fixed shape parameter and three parameters, which can be interpreted as level, slope, and curvature. Different scenarios are investigated further to search for the optimal shape parameter and improve overall performance. We consider the dynamic Nelson-Siegel model for improvement as well. Nonlinear regression is conducted where we treat the shape parameter as the fourth coefficient. Alternatively, a grid-search method is studied as a simplified dynamic model, where we grid-search on the shape parameter and find the optimal value within the range that achieves the minimum root mean squared error. Different models are compared and discussed based on in-sample fitting and out-of-sample forecasting at the end of the study. The results show that the model with a fixed shape parameter fits the short-term and long-term yields better, while the nonlinear method performs better in fitting the medium-term yields and forecasting at longer horizons. The statistical software SAS is used for implementation.
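The grid-search idea is simple to sketch: for each candidate shape parameter, the three Nelson-Siegel betas (level, slope, curvature) follow from ordinary least squares, and the candidate with the minimum RMSE wins. The synthetic yields and parameter values below are illustrative stand-ins for the Treasury data; the study itself used SAS rather than the Python shown here.

```python
import numpy as np

def loadings(tau, lam):
    """Nelson-Siegel factor loadings: level, slope, curvature."""
    x = lam * tau
    slope = (1 - np.exp(-x)) / x
    curv = slope - np.exp(-x)
    return np.column_stack([np.ones_like(tau), slope, curv])

maturities = np.array([0.25, 0.5, 1, 2, 3, 5, 7, 10, 20, 30])  # years

# Synthetic yields from assumed parameters (beta = 6, -2, 1; lam = 0.6);
# real yield-curve data would replace this.
true_lam = 0.6
y = loadings(maturities, true_lam) @ np.array([6.0, -2.0, 1.0])

# Grid-search the shape parameter; betas come from OLS at each step.
best = None
for lam in np.arange(0.1, 2.01, 0.01):
    X = loadings(maturities, lam)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rmse = np.sqrt(np.mean((X @ beta - y) ** 2))
    if best is None or rmse < best[0]:
        best = (rmse, lam, beta)

rmse, lam, beta = best
print(f"lam={lam:.2f} rmse={rmse:.6f} betas={np.round(beta, 3)}")
```

On this synthetic curve the search recovers the generating shape parameter, which is exactly the behavior the simplified dynamic model relies on.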
Fei Xu, Multi-period Corporate Bankruptcy Forecasts with Discrete-Time Hazard Models, December 5, 2013 (Yan Yu, Hui Guo [Department of Finance])
Bankruptcy prediction is of great interest to regulators, practitioners, and researchers. In this study, I employ the discrete-time hazard model to predict corporate bankruptcy, using manufacturing-sector data covering the period 1980-2008. The model has high in-sample and out-of-sample prediction accuracy, with all AUCs higher than 0.85. The study of distance to default reveals that it adds little prediction power to the model. Compared to the model with variables selected by the least absolute shrinkage and selection operator and to the Campbell, Hilscher, and Szilagyi (2008) model, models built by stepwise selection have the same or better accuracy. In addition, stepwise selection gives robust models across different training/testing periods. Within the data scope of this study, including macroeconomic variables (three-month Treasury bill rate and trailing one-year S&P index) provides little improvement in prediction accuracy.
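The mechanics of a discrete-time hazard forecast are worth making concrete: each firm-year is one observation, a logistic link gives the conditional default probability, and the multi-period default probability is one minus the product of the per-period survival probabilities. The coefficients and firm-year covariates below are invented for illustration, not estimates from the study.

```python
import math

# Illustrative coefficients for a logistic discrete-time hazard model.
coef = {"intercept": -6.0, "leverage": 4.0, "profitability": -3.0}

def hazard(x):
    """P(default in year t | survived to year t) under the logistic link."""
    z = (coef["intercept"]
         + coef["leverage"] * x["leverage"]
         + coef["profitability"] * x["profitability"])
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical covariate path for one firm over three years.
firm_years = [
    {"leverage": 0.4, "profitability": 0.10},
    {"leverage": 0.6, "profitability": 0.02},
    {"leverage": 0.8, "profitability": -0.05},
]

# Multi-period forecast: survival multiplies (1 - hazard) across years.
survival = 1.0
for x in firm_years:
    survival *= 1.0 - hazard(x)
cum_default = 1.0 - survival
print(f"3-year cumulative default probability: {cum_default:.3f}")
```

Fitting the coefficients is an ordinary logistic regression on the stacked firm-year panel, which is what makes the approach practical on a 1980-2008 sample.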
Mengxia Wang, Credit-Risk Assessment Based on Different Data-Mining Approaches, December 5, 2013 (David Rogers, Yan Yu)
Data mining is a computational process to discover patterns in large data sets. Credit scoring is one of the data-mining research areas, and is commonly used by banks and credit-card companies. A dataset that comes from a private-label credit-card operation of a major Brazilian retail chain was analyzed. The dataset contains 50,000 records of the application information from the credit card applicants. Six data-mining approaches, including the generalized linear model, classification and regression trees, the generalized additive model, linear discriminant analysis, neural networks, and support vector machines were examined to help identify unqualified applicants based on the given explanatory information. For each approach, at least one model was built using the R software. The performance of each model was evaluated by the area under the receiver operating characteristic curve.
Qi Sun, Application of Data-Mining Methods in Bank Marketing Campaigns, December 5, 2013 (Jeffrey Camm, Yichen Qin)
Direct marketing is widely used among retailers and financial companies due to the competitive market environment. The increasing cost of marketing campaigns, coupled with declining response rates, has encouraged marketers to search for more sophisticated techniques. In today's global marketplace, organizations can monetize their data through the use of data-mining methods to select those customers who are most likely to be responsive and suggest targeted creative messages. This project will present the application of data-mining approaches in direct marketing in the banking industry. The objective of this project is to identify the variables that can increase the predictive outcomes in terms of response/subscription rates. A decision-tree model, chi-square automatic interaction detection (CHAID), is built to determine and interpret the variables. A logistic regression model is also built in this study for comparison with the decision-tree model. By applying both methodologies to the direct-marketing campaign data of a Portuguese banking institution, which has 45,211 records and 16 fields, we concluded that the CHAID decision-tree model performs better than the logistic regression model in terms of predictive power and stability. From the results, we illustrate the strengths of both non-parametric and parametric methods.
Arathi Nair, Demand Forecasting for Low-Volume, High-Variability Industrial Safety Products under Seasonality and Trend, December 4, 2013 (Uday Rao, David Kelton)
This project studies demand-forecasting methods for different items that are sold by West Chester Holdings Inc. In this study, based on data from West Chester Holdings, we apply moving averages (which represent the current approach by the company), exponential smoothing, and Winters' method to predict future demand. We also provide a brief overview of ARIMA-based forecasting in SAS. Using forecast-quality metrics such as bias, root mean squared error (RMSE), mean absolute deviation (MAD), and tracking signal, we identify the best forecasting methods and their parameter settings. These forecasts will form the basis for setting target inventory levels of items at West Chester Holdings, which will drive their procurement and sales to maintain acceptable inventory turns.
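A minimal sketch of the Winters-style approach and the quality metrics named above: an additive Holt-Winters recursion produces one-step-ahead forecasts, and bias, RMSE, MAD, and tracking signal are computed from the forecast errors. The demand series, season length, and smoothing constants are invented; the initialization is deliberately simple.

```python
import math

def holt_winters_additive(y, m, alpha=0.3, beta=0.1, gamma=0.2):
    """One-step-ahead additive Holt-Winters forecasts for season length m.

    Initialization uses first-season averages; production code would
    initialize (and tune alpha/beta/gamma) more carefully.
    """
    level = sum(y[:m]) / m
    trend = 0.0
    season = [y[i] - level for i in range(m)]
    forecasts = []
    for t in range(m, len(y)):
        forecasts.append(level + trend + season[t % m])
        prev_level = level
        level = alpha * (y[t] - season[t % m]) + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        season[t % m] = gamma * (y[t] - level) + (1 - gamma) * season[t % m]
    return forecasts

# Hypothetical monthly demand with mild trend and quarterly seasonality.
demand = [10, 14, 8, 12, 12, 16, 10, 14, 14, 18, 12, 16]
fc = holt_winters_additive(demand, m=4)
errors = [a - f for a, f in zip(demand[4:], fc)]

# Forecast-quality metrics used to compare methods.
bias = sum(errors) / len(errors)
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
mad = sum(abs(e) for e in errors) / len(errors)
tracking_signal = sum(errors) / mad if mad else 0.0
print(f"bias={bias:.2f} rmse={rmse:.2f} mad={mad:.2f} ts={tracking_signal:.2f}")
```

A tracking signal drifting outside roughly plus or minus 4 is the usual cue that a method's forecasts have become biased and its parameters need revisiting.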
Yi Tan, Development of Growth Curves for Children with End-Stage Renal Disease, December 4, 2013 (Christina Kelton [Department of Finance], Yichen Qin, Teresa Cavanaugh [College of Pharmacy])
Growth retardation is one of the greatest problems in children with end-stage renal disease (ESRD) on dialysis. Growth failure results from multiple causes including poor nutritional status; comorbidities, such as anemia, bone and mineral disorders, and changes in hormonal responses; and the use of steroids in treatment. Although research has documented differences in growth rates between dialysis patients and healthy children, no large-scale effort has been devoted to the development of growth charts specifically for children with ESRD. The primary objectives of this study were to develop and validate height, weight, and body-mass-index growth curves for children with ESRD. Using data from the United States Renal Data System (USRDS), all patients aged 20 or younger without previous transplantation, and undergoing dialysis, were initially selected for study. They were stratified into age groups, with finer (6-month) categories for the younger children, and coarser (1-year) categories for the older children. Then, children with height, weight, or BMI greater (less) than the mean plus (minus) 3 standard deviations (outliers) were excluded. The standard lambda-mu-sigma (LMS) methodology for developing growth charts for healthy children was applied to height, weight, and BMI data from USRDS. Growth-curve percentile values (3rd, 5th, 10th, 25th, 50th, 75th, 90th, 95th, and 97th percentiles) for each age group by gender were calculated. The performance of the LMS model was evaluated using three different criteria, and results were compared with those previously published for healthy children. Advantages and disadvantages of parametric versus nonparametric (spline) growth-curve estimation were explored.
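The LMS machinery behind such charts can be illustrated with a toy computation: given an age group's lambda (skewness), mu (median), and sigma (coefficient of variation), a measurement converts to a z-score and percentile, and a target percentile converts back to the value plotted on the chart. The parameters below are invented, not USRDS estimates.

```python
import math
from statistics import NormalDist

def lms_zscore(x, L, M, S):
    """Z-score of measurement x given an age group's LMS parameters."""
    if abs(L) < 1e-9:                        # L -> 0 limit is lognormal
        return math.log(x / M) / S
    return ((x / M) ** L - 1.0) / (L * S)

def lms_centile(p, L, M, S):
    """Measurement at percentile p (0-100): the value plotted on the chart."""
    z = NormalDist().inv_cdf(p / 100.0)
    return M * (1.0 + L * S * z) ** (1.0 / L)

# Hypothetical parameters for one age-gender group (height in cm).
L, M, S = -1.2, 135.0, 0.05
z = lms_zscore(128.0, L, M, S)
pct = NormalDist().cdf(z) * 100
p50 = lms_centile(50, L, M, S)
print(f"z={z:.2f} percentile={pct:.1f} 50th-centile value={p50:.1f}")
```

Smoothing L, M, and S across age groups is what turns these pointwise values into the continuous percentile curves on a growth chart.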
Yuan Zhu, Multivariate Methods for Survival Analysis, December 4, 2013 (David Rogers, Jeffrey Camm)
Multivariate analysis refers to a set of statistical methods for simultaneously analyzing multiple measurements on each individual or object under investigation. Multivariate methods like principal component analysis (PCA) are widely used for variable reduction and for eliminating high correlations. Survival analysis, also called time-to-event analysis, was primarily used in the biomedical sciences to study time to death, and is now also widely applied elsewhere, such as to the working life of machines. Survival data are multivariate and often contain significantly correlated variables, which inflate the standard errors of parameter estimates and decrease the power to detect true effects in multiple regression analysis. In this research project, the data are first transformed by PCA on the correlation matrix of the original variables. The transformed data are then input into Cox survival models, which are compared across different PCA methods. Results suggest that PCA with a combination of different correlation coefficients has the best performance, and that it can reduce redundant variables without losing much accuracy, since a large portion of the covariance in the raw data set is retained.
Anni Zhou, Insurance Customer Retention Improvement Analysis, December 4, 2013 (David Rogers, Jeffrey Camm)
In saturated markets, customer retention has become more important and brings substantial benefits to companies. In such a situation, it is advantageous to study how to improve customer retention using statistical and business-analytics methods. The objective of this project is to use different statistical and analytical methods, applied to site-built-dwelling data, to identify how to design an insurance product and improve customer retention. Suggestions for insurance companies regarding how to build such products for improving customer retention are presented. In this project, principal component analysis is utilized to identify which variables are important. How to use non-parametric survival analysis and parametric survival analysis with different distributions to model the survival object is discussed. A comparison of results from survival and event-history analysis with different distributions is also shown.
Jing Sang, A Statistical Analysis of the Workers' Compensation Claims System, December 3, 2013 (Uday Rao, Hui Guo [Department of Finance])
The workers' compensation claims system is used to provide injured workers with coverage of medical costs and income replacement. The frequency of paid workers' compensation claims advises employers and insurance providers where prevention activities are most needed, and which firms are likely to have good safety programs. From the perspectives of employers and insurance providers, understanding the type of people who potentially file a claim and the factors that lead to a high-cost claim are important. This study aims to identify variables that drive higher/lower costs in workers' compensation claims by analyzing past claims and payment records. This paper first describes the workers' compensation claims system, and then uses real-world data to explore the attributes and variables that may be important to a claim. It then determines the significant factors that influence the cost. Microsoft Excel, SQL Server Management Studio, and SAS are the main tools used in this study.
Yi Ying, Application of Data-Mining Methods in Credit Risk Analysis and Modeling, December 2, 2013 (David Rogers, Yichen Qin)
The financial crisis has led to an increasingly important role of risk management for financial-services institutions providing products such as loans, mortgages, and financing. Specifically, risk departments utilize data-mining tools to monitor, analyze, and predict risks of various kinds in business. One of the key risks a financial institution has to deal with on a daily basis is the credit risk: whether a borrower will default or make the payments on the debt. The goal of the project is to study different data-mining models in predicting the potential of loan delinquency based on borrowers' demographic information and payment history, and to identify the most important factors (variables) for risk assessment. Various modeling techniques will be investigated to understand the credit risk: generalized linear models (McCullagh and Nelder, 1989), classification and regression tree models (Breiman et al., 1984), and chi-squared automatic interaction detector (Kass, 1980). A combination of R, SAS, and SPSS Modeler were used to conduct the analysis. Ultimately, the four models developed predicted the "Default"/"No default" correctly over 75% of the time in the training sample and over 60% in the testing sample. In terms of prediction precision, the generalized linear model outperformed the others. In terms of stability, the chi-squared automatic interaction detector model was the most robust model to use. The classification and regression tree model was the least stable.
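Model comparisons like these typically rest on the area under the ROC curve, which has a convenient rank-based (Mann-Whitney) form: the probability that a randomly chosen defaulter receives a higher score than a randomly chosen non-defaulter, with ties counted as one half. The labels and scores below are made up to show the computation.

```python
def auc(labels, scores):
    """AUC via the rank formulation; labels are 1 (default) or 0."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0, 0]                 # 1 = default
scores = [0.9, 0.7, 0.4, 0.5, 0.3, 0.2, 0.1]   # model's default probability
print(f"AUC = {auc(labels, scores):.3f}")
```

An AUC of 0.5 is no better than chance and 1.0 is perfect ranking, which is why it works as a common yardstick across otherwise dissimilar models.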
Yucong Huang, An Analysis of the Twitter Sentiment System in the Financial-Services Industry, November 15, 2013 (Yan Yu, Yichen Qin)
Twitter has gained increasing worldwide popularity since its launch in 2006. As of 2013, there are about 58 million tweets per day, with 135,000 new users signing up every day (http://www.statisticbrain.com/twitter-statistics/). The large volume of tweets can be of substantial business value and insight in making decisions. In this project, we identify the sentiment of tweets in the financial-services industry using a supervised machine-learning approach. After collecting 852 tweets that are related to this industry, a rule-based quality-control method is designed to decrease human error in Twitter sentiment ratings. The results using support vector machines (SVM) are promising: we achieve an accuracy of around 70% on a three-category scale (negative/neutral/positive) and 60% on a five-category scale (strongly negative/negative/neutral/positive/strongly positive). After examining the impact of different features, we conclude that unigram features contribute the most to the overall accuracy, followed by lexicon features, and then encoding features. Finally, we study different combinations of features applied to different sentiment scales.
Suxing Zeng, Cluster-Based Predictive Models in Online Education Management, November 4, 2013 (David Rogers, Yichen Qin)
As online education becomes a popular learning approach, the large amount of data generated by learning activities can be effectively utilized for evaluation and assessment purposes. Traditional analytical tools and techniques are being adopted by the online-education industry to improve services in critical areas such as student retention, grades, and graduation. To evaluate an online-learning environment for students at the University of Phoenix, cluster-analysis and regression-analysis techniques were implemented to develop a cluster-specific predictive model and a simple direct regression model for student service management. In the cluster-specific predictive model, finite mixture models are used to classify students based on their learning attributes. Then, based on the learner-segmentation framework, predictive models were developed to predict target scores for given new-learner attributes. Numerical results show that the cluster-specific model performs better for model fitting. The advantages and limitations of each method are discussed and recommendations are provided for management to drive academic excellence.
Chiao-Ying Chang, Cintas Website Visitors Profile Analysis, October 11, 2013 (David Rogers, Jeffrey Camm)
This study is aimed at analyzing the profile of online mobile visitors, and recommending divisions with high mobile traffic for future web enhancements. The study also explores attributes of visitors requesting more information through the website. Google Analytics is the main tool used for the visitors' profile analysis. In the analysis period, the Fire Protection, Hospitality, and Uniforms divisions have witnessed an increase in the percentage of mobile visitors. There is a very high possibility of increasing mobile users in the future in these divisions. Over 50% of those visitors requesting more information online use the Internet Explorer browser with the Windows operating system. Certain non-branded keywords such as "shred" and "extinguish" are highly used in the web search, and Tuesday, Wednesday, and Thursday are the days with the most visits. This profile analysis helps Cintas better understand its visitors' backgrounds, and also helps it create a better website that is easier to navigate and more effective to use.
Partha Tripathy, Analysis and Implementation of Allocation of Papers to Conference Sessions using a K-Means Clustering Algorithm, August 7, 2013 (Jeffrey Camm, B.J. Zirger [Department of Management])
This paper looks at a problem for a professional association of management science that organizes an annual symposium and invites papers in multiple branches of management. Due to constraints of resources, the designs of sessions in this conference are restricted by size and duration. Also, the association promotes discussions among participants from different institutions. Hence, only one author from an institution can present a paper in a single session. The sessions are defined by a common topic, which is identified by keywords that are associated with each paper. The keywords also have a priority in allocating the papers into similar groups. Optimization algorithms have been designed to assign resources to tasks based on priority, availability, and business constraints. These algorithms are closely associated with classification problems where categories have to be identified among observations. Cluster analysis is one such automatic-classification algorithm that can be formulated as a multi-objective optimization problem. By using a combination of efficient data pre-processing, ideal modeling of input parameters, and iterations of clustering analysis, the study attempts to discover the best solution with the desired properties. This study uses a naturally occurring heuristic and a k-means clustering algorithm to identify natural clusters among papers in specific divisions. It involves analysis of the data and subsequent preprocessing to provide acceptable session sizes using the SAS software. The outputs of the heuristic are analyzed and the results are reported to the user.
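The clustering step described above can be sketched by representing each paper as a binary keyword vector and running plain Lloyd's k-means to form candidate sessions. The papers, keywords, and deterministic initialization are invented for illustration (the study used SAS, and a real run would also enforce the session-size and institution constraints afterward).

```python
# Papers as binary keyword vectors (1 = keyword tagged on the paper).
papers = {
    "p1": [1, 1, 0, 0], "p2": [1, 1, 0, 0], "p3": [1, 0, 0, 0],
    "p4": [0, 0, 1, 1], "p5": [0, 0, 1, 1], "p6": [0, 1, 1, 1],
}

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(vectors, k, iters=20):
    """Lloyd's algorithm with a deterministic spread-out start
    (production code would use k-means++ or random restarts)."""
    step = max(1, len(vectors) // k)
    cents = [vectors[i * step] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            d = [sq_dist(v, c) for c in cents]
            clusters[d.index(min(d))].append(v)
        cents = [[sum(col) / len(cl) for col in zip(*cl)] if cl else cents[i]
                 for i, cl in enumerate(clusters)]
    return cents

cents = kmeans(list(papers.values()), k=2)
sessions = {name: min(range(2), key=lambda i: sq_dist(v, cents[i]))
            for name, v in papers.items()}
print(sessions)
```

Keyword priorities could be folded in by weighting the vector components before computing distances, so high-priority keywords dominate the grouping.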
Regina Akrong, Kentucky High-School Athletic Association Ninth Region Realignment: Minimizing the Traveling Miles between Schools, August 7, 2013 (David Rogers, Jeffrey Camm)
As with all costs for high-school education, travel costs for sports teams should be closely monitored. Here, the 20 high schools in the Ninth Region located in Northern Kentucky are examined and placed into four Districts to minimize the overall travel distances for the entire region. A mixed integer linear programming model was adapted and solved to provide the optimal Districts. It was found that indeed two schools were not currently placed in the most appropriate Districts. The savings achieved by redistricting will be considered with respect to reduction in total miles, fuel costs, student time spent, and safety.
Shilesh Karunakaran, Predictive Modeling for Student Recruitment, August 6, 2013 (Jeffrey Camm, B.J. Zirger [Department of Management])
The objective of this study is to analyze past student enrollment behavior and build a statistical model to predict the prospect of a future applicant's enrolling at the university. This paper details the steps involved in developing a predictive model by a data-driven analysis of past enrollment behavior of students, and fine-tuning the prediction accuracy by training the model with in-sample data and testing it on an out-of-sample data set. This model will help rank the incoming applicants according to their likelihood of enrolling at the university. This ranking, in conjunction with other qualitative factors, can be used by the admissions department to make better decisions on admission offers, financial aid, and other matters such as program-specific campaigning.
Junyi Li, Cintas Customer-Preference Analysis Using Data-Mining Methods, August 6, 2013 (David Rogers, Jeffrey Camm)
The Cintas Corporation provides highly specialized services to approximately 900,000 businesses of all types, mainly throughout North America. In this project, we target only four services: mats, first aid and safety, hygiene, and document shredding. To understand its customers better and serve them better, the Cintas Corporation collects large amounts of data from customers. This study is aimed at analyzing customer preferences for these four products/services and predicting customer loss. I use Excel for the basic customer analysis and R for the customer-loss prediction. The two prediction procedures are a logistic regression model and a classification-tree model. The Akaike information criterion (AIC), Bayesian information criterion (BIC), and area under the ROC curve (AUC) are the criteria used to determine the best model. The results show that, for all the data, the classification-tree model is better than the logistic regression model.
Jerry Moody, A Simulation-Based Data Analysis of Production Lines at OMI, Inc., August 6, 2013 (David Rogers, David Kelton)
A local manufacturer is planning to expand its facility in 2014. A simulation study using Arena is employed to determine whether the company's two production lines used for the majority of its products will meet potential demand through 2022. Scenarios are examined to determine production amounts for various configurations they may wish to employ. Potential labor costs are also examined to give the managers further information when making expansion determinations. Results are presented in interactive-dashboard format using Tableau.
Adebukola Faforiji, Investigating Factors Associated with High-School Dropout Tendency via Logistic Regression and Classification Trees, August 5, 2013 (David Kelton, Edward Winkofsky, co-chairs)
An average of nearly 7,000 students become dropouts each day. This adds up annually to about 1.2 million students who will not graduate from high school with their peers as scheduled. Lacking a high-school diploma, these individuals will be far more likely than graduates to spend their lives periodically unemployed, on government assistance, or cycling in and out of the prison system (Alliance for Excellent Education, November 2011). Most high-school dropouts see the result of their decision to leave school very clearly in their earning potential. The average annual income for a high-school dropout in 2009 was $19,540, compared to $27,380 for a high-school graduate. The impact on the country's economy is less visible, but its cumulative effect is staggering (Alliance for Excellent Education, November 2011). Although it cannot establish causality, statistical analysis will help to reveal association of the factors under consideration that influence a student's decision to graduate or drop out of high school. This research was approached using two separate methodologies so as to compare results and also determine which one provides clearer results. The methods considered are logistic regression and classification trees. Results from the analyses reveal that factors such as English-speaking proficiency, self-determination to succeed and finish high school, performance of grades of B and above in science and English, discipline and safety within the school environment, and race and perception (self and external) have statistically significant relationships with dropout tendency.
Chandhrika Venkataraman, A Simulation Study of Manufacturing Lead Time: The Case of Tire-Curing Presses, August 5, 2013 (David Kelton, Uday Rao)
Non-assembly-line manufacturing systems are not easily streamlined using off-the-shelf solutions provided by standard operations-improvement methods such as just-in-time, KANBAN, lean manufacturing, etc. In this paper, we introduce a non-assembly-line manufacturing system that produces a custom-made finished product. Manufacturing lead time is extremely long, ranging from four to six months, and profit margins are razor-thin, about 8%. With material costs forming about 70% of the finished-product sale price, unpredictable manufacturing lead times eat away what is left of the profit margin because there is no visibility into final costs incurred in manufacture at the time of providing a quote to the customer. We study a simulation of the factory floor under different real-time scenarios to generate a range of finished-product lead times, both stage-wise and overall, as output measures of interest. The aim of the study is not so much to prescribe improved ways of working as it is to provide an understanding of how changes in raw-material rejections and in machine scheduling affect lead times. It is expected that with this information, the factory can arrive at better estimates of manufacturing costs and hence provide more realistic quotes to its customers. The study finds that even though individual sub-assemblies can have mismatches due to batching at the machine shop, final assembly times are lower than individual sub-assembly lead times, probably because the system corrects its mismatches before entering final assembly, thus showing that orderly planning might be valuable ultimately, even if not immediately.
Vince Baldasare, Inventory Analysis of Restaurant Products, August 5, 2013 (David Kelton, David Rogers)
Restaurants have to control many factors in their daily operations. The processes they choose to use for managing their inventory can have major implications on their bottom lines. Having too much product in inventory can be costly due to space and waste. Not having enough product in inventory can decrease revenue and customer satisfaction. A method of inventory analysis is explored for a specific restaurant that uses fresh products. The logic for determining appropriate inventory levels is built into a custom tool that will allow the restaurant to make adjustments based on its business needs. Statistical process control charts are then created as a means for monitoring future results for the restaurant.
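The monitoring step can be sketched with an individuals control chart: estimate sigma from the average moving range (divided by the d2 constant 1.128 for subgroups of size two) and flag any observation outside the mean plus or minus three sigma. The daily usage counts below are invented, not the restaurant's data.

```python
# Hypothetical daily usage of one fresh product.
usage = [42, 45, 41, 44, 43, 40, 46, 44, 42, 58]

mean = sum(usage) / len(usage)
mr = [abs(b - a) for a, b in zip(usage, usage[1:])]     # moving ranges
sigma = (sum(mr) / len(mr)) / 1.128                     # d2 for n = 2
ucl, lcl = mean + 3 * sigma, mean - 3 * sigma

# Points beyond the control limits signal a special cause worth investigating.
out_of_control = [x for x in usage if x > ucl or x < lcl]
print(f"UCL={ucl:.1f} LCL={lcl:.1f} signals={out_of_control}")
```

In practice the limits would be frozen from a stable baseline period and new daily observations compared against them, rather than recomputed from data that include the suspect point.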
Ryan Prasser, A Regression Model Relating the Pass/Run Ratio to Score Differential and Elapsed Time in the NFL, August 3, 2013 (Michael Fry, Jeffrey Camm, Paul Bessire [PredictionMachine.com])
This research examines data from the 2012 NFL season to determine how a team's decision making in terms of calling running and passing plays changes as the game progresses. We generate several different multiple regression models relating a team's pass/run play-calling ratio on a particular drive to the predictor variables of elapsed time and score differential. While both of these variables and their interaction terms are statistically significant, the regression models explain only a small amount of the observed variance.
Jingfan Yu, An Application of Data Mining in House-Price Analysis, August 2, 2013 (Jeffrey Camm, Craig Froehle)
House-price prediction has always been an active field of study. Real-estate developers and consumers are interested in this problem. This project describes the application of different statistical models to a house-price data set and tests which model has the most predictive power. Classification methods have been widely used in the real-estate industry. They can help real-estate developers better target their potential buyers and better plan for new construction. Consumers can also use such models to choose house locations wisely within their budgets. We present the application of data-mining approaches to house-price analysis in the real-estate industry. The objective of this project is to compare the performance of two predictive methodologies: multiple linear regression and regression trees. We also consider k-means clustering models. These approaches were applied to Boston house-price data from the UC Irvine Repository of Machine Learning. The results suggest that the regression-tree model has the best predictive performance but the least model stability. The multiple linear regression model has the best stability and acceptable predictive power. Clustering is not recommended for our data.
Brian Arno, Predicting Graduation Success of Student Athletes, August 2, 2013 (Jeffrey Camm, David Kelton)
Athletics are an important part of a university -- they provide community and pride in the school, a source of revenue, and indirectly serve as a recruitment tool. The success of a university athletic department can be attributed directly to the success of the athletes themselves. The purpose of this study is to analyze data to identify what variables, if any, can be used to predict the success, in terms of graduation, of student athletes. This paper discusses the two methods employed in the study -- exploratory data analysis utilizing visualization techniques, followed by logistic regression for development of an analytical model. The data analyzed consisted of real-life information on student athletes from the University of Cincinnati.
Matthew Sonnycalb, Simulating Correlated Random Variates Using Reverse Principal Component Analysis, August 2, 2013 (David Kelton, David Rogers)
Failing to model correlated input variables appropriately is one of the most common inadequacies in dynamic simulation software and can lead to significant errors in simulation results. Principal component analysis is a multivariate technique that forms new uncorrelated random variates as linear combinations of the original correlated variates. This study evaluates whether independently sampling from those new variates and reversing the principal-component-analysis transformation can efficiently match the correlations, means, and variances of an original sample. A sampling algorithm is developed in R using bootstrapping and accept-reject criteria. The method is then evaluated using samples of correlated Weibull variables. The method performs well when the correlated variates have reasonably symmetric distributions, with no observable differences in correlations, means, and variances. The method becomes inefficient and introduces significant bias when the variates become highly positively skewed.
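The round trip described above can be sketched end to end: rotate a correlated sample onto its principal components, resample each (uncorrelated) score column independently, then reverse the rotation and check that the original correlation reappears. The data are synthetic bivariate normals, and the plain bootstrap here omits the study's accept-reject refinement.

```python
import numpy as np

rng = np.random.default_rng(7)

# Correlated "input" sample standing in for real simulation input data.
n = 2000
z = rng.normal(size=(n, 2))
X = np.column_stack([z[:, 0], 0.8 * z[:, 0] + 0.6 * z[:, 1]])  # corr ~ 0.8

# Forward PCA: center, then rotate onto eigenvectors of the covariance
# matrix; the resulting scores are uncorrelated by construction.
mu = X.mean(axis=0)
cov = np.cov(X - mu, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
scores = (X - mu) @ eigvecs

# Independently bootstrap each score column, then reverse the rotation
# to recover variates with the original correlation structure.
resampled = np.column_stack(
    [rng.choice(scores[:, j], size=n) for j in range(2)])
X_new = resampled @ eigvecs.T + mu

orig_corr = np.corrcoef(X, rowvar=False)[0, 1]
new_corr = np.corrcoef(X_new, rowvar=False)[0, 1]
print(f"original corr={orig_corr:.3f} reconstructed corr={new_corr:.3f}")
```

With symmetric inputs like these the reconstructed correlation tracks the original closely; the abstract's caveat is that strongly skewed variates break this, which is where the bias it reports comes from.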
Meghan Moore, Workflow Simulation of the Emergency/Radiology Department Handoff at UC Medical Center, August 1, 2013 (Craig Froehle, David Kelton)
University Hospital's Radiology Department in Cincinnati is responsible for capturing body-tissue images to help diagnose a patient's ailment, which subsequently dictates the patient's course of treatment. It is of particular importance that patients coming in from the Emergency Department (ED) have an accurate and efficient visit, as many of these patients are in serious condition and require treatment as soon as possible. The efficiency of this process hinges on the availability of radiology physicians, as the attending physician is present only during day shifts while the resident physicians are available around the clock. The goal of this study is to simulate the current workflow handoff between the ED and the Radiology Department. After development of a valid model, the value of adding an additional attending is explored, considering four different scheduled shifts for the extra physician. For system improvement, the simulation results suggest the implementation of another attending physician during an evening shift (3pm to 3am) or overnight shift (7pm to 7am). With a baseline of around seven hours in the initial system, these two scenarios reduce an ED patient's average time in the radiology department to less than two hours.
Shannon Downs, Optimal Bivariate Clustering of Binary Data Matrices, August 1, 2013 (David Rogers, George Polak [Wright State University])
Bivariate clustering can be applied to data in a matrix to optimize similarity or dissimilarity among elements by rows and by columns simultaneously. Areas of relevance include cellular manufacturing. In this project, a series of programs were coded in GAMS to perform bivariate clustering of a binary dataset of any dimension into any given number of clusters. Two models from the literature and two new models were explored, where each model makes use of a distance measure between the elements of the dataset. Seven methods of calculating the distance measure were used to evaluate the effectiveness of each model. Because the various distance measures and objective-function equations made objective-function values not directly comparable across the four models, the models were instead evaluated using popular cellular-manufacturing clustering quality metrics such as the proportion of exceptional elements, machine utilization, and grouping efficacy. The best approach came from clustering by rows and by columns with the addition of an interaction term, a linear indicator of whether an element (a point in the dataset with a value of one) had its row and its column assigned to the same cluster. The performance of this best linear model was comparable to that of an equivalent nonlinear model, but the linear version ran orders of magnitude faster, making it the more desirable model.
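One of the quality metrics named above, grouping efficacy, has a compact definition: the ones that fall inside the diagonal blocks, divided by the total number of ones plus the in-block zeros ("voids"). A minimal sketch on a toy binary matrix (the cluster assignments here are illustrative, not from the project's dataset):

```python
# row_cluster[i] and col_cluster[j] assign rows and columns to cells; a "1"
# whose row cluster matches its column cluster lies inside a diagonal block.
def grouping_efficacy(matrix, row_cluster, col_cluster):
    e = e_out = voids = 0
    for i, row in enumerate(matrix):
        for j, val in enumerate(row):
            inside = row_cluster[i] == col_cluster[j]
            if val == 1:
                e += 1
                if not inside:
                    e_out += 1       # exceptional element
            elif inside:
                voids += 1           # zero inside a block
    return (e - e_out) / (e + voids)

# Perfect block-diagonal structure scores 1.0.
m = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]
score = grouping_efficacy(m, [0, 0, 1, 1], [0, 0, 1, 1])  # → 1.0
```

Exceptional elements and voids both pull the score below one, which is why the metric can rank clusterings produced by otherwise incomparable objective functions.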
Matthew Skantz, Optimization of Airline Fleet Assignment and Ticket Distribution, July 26, 2013 (Jeffrey Camm, David Rogers)
Among the decisions with the greatest implications for airlines' profitability are fleet assignment and the methods used for passenger ticket distribution. Improper fleet assignment can result in lost revenue and unit costs too high to allow profitability, while distribution costs for tickets distributed through third-party agencies, paid by the airline, can amount to a significant portion of the ticket's value and must therefore be thoroughly understood. Two mixed integer linear programming models, each of which incorporates integer relaxation to lessen computational requirements, are developed and tested using an airline's ticket-purchase records over two days with very different demand profiles in order to recommend changes in these areas. Multiple runs of the models using different segments of passenger data in combination with demand unconstraining estimates and known distribution agency rates are used to find the most profitable combination of fleet assignment and distribution outlet retention. Results show that, while most aircraft are assigned to the best routes given fleet constraints on the days under review, there are areas of significant opportunity to increase or decrease capacity on a handful of routes. Moreover, given current demand, the results suggest that it is not in the airline's interest to limit the number of distribution channels, though the relative strength of the distribution agencies is determined and one is targeted as an opportunity for possible future disengagement.
Benjamin Milroy, A Comparison of Agent-Based Modeling, Ordinary Least Squares Regression, and Linear-Programming Optimization for Forecasting Sales, July 24, 2013 (Jeffrey Camm, Edward Winkofsky)
Due to advances in business intelligence and more widely available data, accurate sales forecasting and an understanding of media effectiveness have become increasingly important in today's business environment. This wealth of data has led companies to new, sophisticated modeling approaches. This study examines three such methodologies: agent-based modeling, ordinary least squares regression, and linear-programming optimization. Using a data set from the consumer packaged goods industry, all three models are fit to three years of historical data. The forecasting ability of each methodology is then tested over a holdout period of one additional year. Finally, using the findings from each approach, I hope to gain some understanding of the effectiveness of media and trade promotion for the brand.
Jana Sudnick, Effective Automotive Issue Prioritization with Neural Network Pattern Recognition, May 30, 2013 (Uday Rao, David Rogers)
Automotive quality engineers process large numbers of "found issue" reports weekly and decide how to prioritize new issues. It is important that engineers not miss potentially urgent or high-customer-impact items, because excellent customer service is expected. Engineers have many sources of customer data available to them and it is imperative that they utilize as many relevant sources as possible to properly rank issues. The purpose of this project is to develop a tool that utilizes data from existing customer data sources to aid quality engineers in proper prioritization of issues to address. A neural network pattern recognition model was trained to help engineers effectively prioritize issues with quicker response times by emulating past issue-ranking decisions made by a panel of highly knowledgeable subject experts. Effective issue prioritization could result in more issues investigated, quality improvements, improved early detection, and potentially a reduction in warranty claims. The project resulted in two neural network models that help engineers identify and address new customer issues.
Zhen Guo, Decision Making in a Random-Yield Supply Chain, April 19, 2013 (Uday Rao, Michael Fry)
Supply uncertainty is widespread and has a significant impact on business operations, so it is receiving increased attention from both industry and academia. This project studies a two-echelon single-supplier single-retailer random-yield supply chain. We determine how suppliers and retailers make operations decisions geared toward optimizing their profits when supply is uncertain. Equilibrium decisions include suppliers' wholesale prices and planned production quantities, retailers' order quantities, and retail prices. We study how supply uncertainty and the salvage value of leftover products affect these decisions. We show that the optimal production inflation rate, defined as the ratio of the supplier's planned production quantity to the retailer's order quantity, depends only on the wholesale price and is independent of the retailer's order quantity. We also find that, ceteris paribus, the optimal production inflation rate increases with the salvage value. Numerical examples for uniformly distributed supply uncertainty are provided to illustrate our findings.
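Both structural findings can be checked numerically in a stylized version of such a model. The formulation below is an assumption for illustration (Uniform(0,1) yield fraction, lost sales beyond the order, per-unit salvage), not the project's exact model:

```python
# Stylized random-yield model: the supplier plans z*q units, a Uniform(0,1)
# yield fraction y survives, sales are capped at the order quantity q, and
# leftovers earn salvage s; w is the wholesale price, c the unit cost.
def expected_profit(z, q, w=10.0, c=4.0, s=1.0, steps=2000):
    total = 0.0
    for k in range(steps):          # midpoint rule over the uniform yield
        y = (k + 0.5) / steps
        delivered = min(q, y * z * q)
        leftover = max(0.0, y * z * q - q)
        total += w * delivered + s * leftover - c * z * q
    return total / steps

def best_inflation(q, s=1.0):
    grid = [1.0 + 0.01 * k for k in range(101)]   # z in [1.0, 2.0]
    return max(grid, key=lambda z: expected_profit(z, q, s=s))

z_small = best_inflation(10.0)                 # same optimal z for any q ...
z_large = best_inflation(1000.0)
z_high_salvage = best_inflation(10.0, s=3.0)   # ... and z rises with salvage
```

Because every profit term scales linearly in q, the grid search returns an identical inflation rate for small and large orders, mirroring the paper's independence result, while raising the salvage value pushes the optimal rate upward.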
Sha Fan, An Application of Markov Chain Model for Ohio's Unemployment Rate, April 19, 2013 (Uday Rao, David Rogers)
Stochastic processes have been applied in many fields, for example, marketing, gambling, inventory control, biology, and healthcare. The labor market is no exception. Labor is usually classified as employed, unemployed, or out of the labor force. This project applies Markov-chain modeling to Ohio's unemployment rate. Public data from 1990 to 2011, collected by the Bureau of Labor Statistics and the U.S. Census Bureau, are used. A Markov-chain model proposed by Rothman (2008) is then applied. There are four sub-models in this project: r = 1 (monthly) & 1st order; r = 3 (quarterly) & 1st order; r = 1 & 2nd order; and r = 3 & 2nd order, where the 2nd-order Markov chain assumes that transition probabilities depend on both the current state and the previous state. In the long run, we conclude that the unemployment rate is more likely to decrease than to increase. There also exists a business-cycle fluctuation. When considering geographic influence, the unemployment rates among Ohio's 88 counties are statistically different. Job hunters' attributes such as race and age have a significant influence on the employment rate, while gender does not play a notable role.
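The mechanics of such a first-order model are easy to sketch: pick a transition matrix over the three labor states, iterate it to the stationary distribution, and read off the long-run unemployment rate. The transition probabilities below are illustrative, not BLS estimates:

```python
# Hypothetical monthly transitions among Employed (E), Unemployed (U),
# and Not in the labor force (N); each row sums to 1.
P = {
    "E": {"E": 0.95, "U": 0.03, "N": 0.02},
    "U": {"E": 0.25, "U": 0.60, "N": 0.15},
    "N": {"E": 0.05, "U": 0.03, "N": 0.92},
}
states = ["E", "U", "N"]

def step(dist):
    return {j: sum(dist[i] * P[i][j] for i in states) for j in states}

# Power iteration converges to the stationary (long-run) distribution.
dist = {"E": 1.0, "U": 0.0, "N": 0.0}
for _ in range(500):
    dist = step(dist)

# Long-run unemployment rate among labor-force participants only:
u_rate = dist["U"] / (dist["E"] + dist["U"])
```

A second-order version of the same idea enlarges the state space to pairs of (previous, current) states, so transition probabilities can depend on one extra month of history.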
Herbert Ahting, Regression Modeling of Multivariate Process Systems Data, April 17, 2013 (Martin Levy, Jeffrey Camm)
Modern industrial process-control systems archive vast quantities of data pertaining to flow, temperature, pressure, level, and other parameters. Valuable information regarding process performance is contained in these data histories, but they are seldom tapped to their fullest potential. Most data archives are contaminated by extreme values that are caused by measurement errors or are the result of transient or persistent disruptions to the process. For this study, a data set consisting of energy-related process variables was obtained from an industrial process. These variables were analyzed to determine which ones have the greatest impact on overall steam consumption. Three regression approaches were evaluated: (1) linear regression using stepwise variable selection, (2) autoregressive modeling, and (3) principal-components regression. Stepwise-variable-selection regression identified several key variables as energy drivers. The autoregressive model provided better results than stepwise regression alone by eliminating the autocorrelation in the residuals. While principal-components regression showed promise by reducing multicollinearity, the model results were difficult to interpret because the original variables had been transformed. Principal-components analysis does, however, provide a useful set of tools for identifying extreme observations.
Saurabh Jain, Support Vector Regression vs. Neural Networks in Stock Pricing, April 17, 2013 (Uday Rao, Amitabh Raturi)
Asset pricing is one of the most researched areas in investment management. While the CAPM provides the basic framework for understanding stock returns, it also brings the assumption that stock return is linearly dependent upon the market return. This relation fails to hold in various cases, and many improvements of the CAPM model seem to explain stock returns better. In our analysis, we explore various factors that can be included in the asset-pricing model. We also assume a non-linear relationship between the independent variables and the dependent variable, stock return. While neural networks are the most popular and widely used technique for such cases, support vector regression is adopted for our analysis with the aim of avoiding overfitting the data. The historical data for training and testing the model were obtained from vendors of stock data (Bloomberg) and other resources. With the help of this modified mathematical model we find the value of alpha, an indicator of superior stock performance and a good criterion for picking stocks that can be used to outperform market returns. We compare the results of neural networks with support vector regression and establish the superiority of one method over the other.
Aashish Reddy Takkala, Next-Purchase Propensity of a Customer, April 15, 2013 (David Rogers, Jeffrey Camm)
This study aims to predict a customer's next purchase given the current purchase. The buying pattern is modeled as a Markov chain, and transition-probability matrices are calculated for several product categories. A stationary Markov equilibrium vector is obtained by solving the system of equations in Matlab and by iterative matrix multiplication in SAS. Further, the mean first-passage times are calculated for each of the transitions. These matrices help the marketing team streamline their campaigns. The new campaigns developed using this model also make customers feel more connected, because they are targeted with the product categories they seek.
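Both quantities mentioned, the equilibrium vector and the mean first-passage times, can be computed directly from a transition matrix. A sketch with an illustrative three-category chain (not the project's data), using the identity that a state's mean recurrence time is the reciprocal of its stationary probability:

```python
# Illustrative 3-category purchase-transition matrix (rows sum to 1).
P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.3, 0.3, 0.4]]
n = len(P)

# Stationary (equilibrium) vector via iterative matrix multiplication.
pi = [1.0, 0.0, 0.0]
for _ in range(500):
    pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]

# Mean first-passage times satisfy
#   m[i][j] = 1 + sum over k != j of P[i][k] * m[k][j];
# for an ergodic chain, fixed-point iteration on this system converges.
m = [[1.0] * n for _ in range(n)]
for _ in range(2000):
    m = [[1.0 + sum(P[i][k] * m[k][j] for k in range(n) if k != j)
          for j in range(n)]
         for i in range(n)]
```

Row i of m then tells the marketing team, on average, how many purchases it takes a customer currently in category i to reach each other category.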
Adebola Abanikanda, Regression with Aggregated Crime Data - A Study Using Poisson and Negative Binomial Regression, April 15, 2013 (David Rogers, Yan Yu)
Several researchers have argued that parameter estimates from a disaggregated model may differ significantly from those of the aggregated model. This research investigates this issue using two regression methods: the Poisson regression approach and the negative-binomial regression method. These models are applied to crime data at different hierarchical levels, including the national, state, and county levels. The crime index was also disaggregated into violent crime and property crime, and regression models were built with both approaches to explore how different levels of aggregation affect the results.
Kevin Michael Roa, Scheduling NFL Games to Maximize Viewers, April 12, 2013 (Michael Magazine, Michael Fry, co-chairs)
The total number of television viewers for NFL games has declined over the past season. These games remain among the most-viewed shows in the United States, but have seen a slight decrease in popularity. The games are incredibly popular with advertisers, as people are likely to watch them live, making it much more likely that they will view commercials. This makes it important for both the league and the networks to keep the number of viewers high throughout the season. Currently, no method is employed to optimize the schedule to maximize the number of viewers. The matchups to be played are pre-determined, but open questions remain: in which weeks should these games be played, and which should be nationally televised during prime time? This paper presents a model created with the intent of maximizing viewership of regular-season NFL games. The model can be used to distribute excitement appropriately throughout the season.
Sarbani Mishra, Bayesian Forecasting of Utilization of Antidepressant Drugs in U.S. Medicaid, March 29, 2013 (Martin Levy, Jeffrey Mills [Department of Economics])
Mental illness is among the most prevalent health disorders, and the statistics are trending upward at an alarming rate. Over recent years, the utilization of antidepressant drugs, in the form of doctors' prescriptions and Medicaid reimbursements, has been rising steadily. Medicaid antidepressant prescriptions grew over 40% from 1995 to 1998. In the present study, we forecast the utilization of antidepressant drugs using applied Bayesian methods. These methods can be of great aid in statistical models used for forecasting. Because they dynamically update new information along with information accrued in the past, applied Bayesian methods provide greater confidence in the reliability of forecasts than does the frequentist approach. We used the software BATS (Bayesian Analysis of Time Series) to determine the forecasts and to compute the MAD, MSE, and log-likelihood values. The forecasts were found to emulate the actual values fairly well, with the exception of a few drugs exhibiting a number of outlying points.
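The dynamic-updating idea behind such models can be sketched with the simplest member of the dynamic linear model family, a local-level model: a random-walk state observed with noise, updated one observation at a time. The variances and the series below are illustrative assumptions, not BATS output:

```python
# Minimal local-level DLM filter: forecast from the prior mean, then fold
# in each new observation to get the posterior (Kalman-style update).
def dlm_filter(ys, m0=0.0, c0=1e6, v_obs=1.0, w_state=0.1):
    m, c = m0, c0
    forecasts = []
    for y in ys:
        r = c + w_state               # prior variance after state evolution
        forecasts.append(m)           # one-step-ahead forecast = prior mean
        q = r + v_obs                 # one-step forecast variance
        gain = r / q
        m = m + gain * (y - m)        # posterior mean after seeing y
        c = (1.0 - gain) * r          # posterior variance
    return m, c, forecasts

series = [5.0] * 50
level, var, fcst = dlm_filter(series)

# MAD of the one-step forecasts (skipping the diffuse first step):
mad = sum(abs(y - f) for y, f in zip(series[1:], fcst[1:])) / (len(series) - 1)
```

The large initial variance c0 plays the role of a vague prior: the first observation dominates immediately, after which each update blends past information with the newest data point, which is the "dynamic updating" the abstract credits for Bayesian forecasting's reliability.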
Peipei Yuan, Stratified Random Sampling Design for Capital Expenditure Survey, March 15, 2013 (Martin Levy, Yan Yu)
The aim of this project is to produce a set of estimates (forecasts) for the year 2013 of the spending intentions of the 60,000+ plants that engage in the manufacture of machine tools. Stratified random sampling design and Neyman allocation are used to design the capital-expenditure survey. 10,000 plants with known plant size are solicited in this study. There are a total of 5 strata, of which 4 are statistical strata and 1 is non-statistical. SAS PROC SURVEYMEANS and SURVEYFREQ, and SPSS Complex Samples are used to estimate means, proportions, and totals, and to produce standard deviations and confidence intervals. Of the 10,000 plants selected in this study, 711 completed the survey and returned it to Gardner Publications. The response rates are 16.88%, 10.57%, 6.92%, 2.30%, and 5.85% for strata 1 to 5, respectively. The range of the planned capital expenditures is $9,975,000, with a minimum of $25,000 and a maximum of $10,000,000. The 95% confidence interval for the mean is $359,728.19 to $497,168.63. The 95% confidence interval for the sum is 1.75E+10 to 2.42E+10. Statistics for subpopulations, such as states and machinery categories, are also calculated using domain analysis.
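Neyman allocation assigns each stratum a sample size proportional to its population size times its standard deviation, n_h = n · N_h · S_h / Σ N_k · S_k. A sketch with hypothetical stratum figures (the survey's actual strata and variances are not reproduced here):

```python
# Hypothetical stratum sizes and spending standard deviations for four
# statistical strata (illustrative numbers only).
N = [30000, 15000, 10000, 5000]      # plants per stratum
S = [50000, 120000, 300000, 900000]  # std. dev. of capital spending ($)
n_total = 10000                      # plants to solicit

weights = [Nh * Sh for Nh, Sh in zip(N, S)]
alloc = [round(n_total * w / sum(weights)) for w in weights]
```

Note how the smallest but most variable stratum receives the largest allocation, which is exactly the variance-reduction logic behind Neyman allocation; rounding can make the allocations sum to slightly more or less than n.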
Hsin-Yi Wang, Bayesian Decisive Prediction Using BLINEX Loss and BLINEX Parameter Choice, March 7, 2013 (Martin Levy, Jeffrey Camm)
Direct marketing is a popular and successful way for businesses to reach prospective customers. It starts with the compilation of a name list that contains the set of targets together with their associated attributes. A scoring model, such as logistic regression, is used to compute activation scores from the set of attributes and these are ranked in descending order of activation likelihood. Names of the best prospects must be selected via an intelligent cutoff. Selecting too many names results in more revenue but perhaps less profit, i.e., profits diminish when costs outweigh gains from activation among bad prospects. The decision problem is to predict an optimal mailing size or cutoff in a future mailing to maximize profit. From a decision-theoretic point of view, a realistic loss function should be asymmetric (failure to choose good prospects carries a higher penalty than including too many bad prospects). The BLINEX loss function is a parsimonious loss function with three parameters: a bounding parameter, a scaling parameter, and an asymmetry parameter. Ideally, the data should be the collection of optimal scores gleaned from past direct-mail campaigns. Since we have information from only one campaign, we illustrate using a bootstrapping-like strategy to generate "historical" trials. In addition to the loss function, the elements of the Bayesian decisive prediction setup include the likelihood function of the optimal activation scores, assumed to be normal-gamma with unknown mean and variance, and conjugate priors with four hyperparameters. These are obtained using the empirical Bayes technique called the ML-II method. We show that in terms of both frequency and profit domination, in a set of 200 simulations the BLINEX loss outperforms the naive squared-error loss approach handily. Fractional-factorial analysis and half-normal plotting applied to more simulated data show that of the six parameters, the BLINEX asymmetry and scale parameters, together with the prior mean, are the most influential factors leading to BLINEX domination of squared-error loss in terms of profit.
Sandeep J. Patkar, Examination of Capacity and Delay at Airports and in the U.S. National Airspace System, January 14, 2013 (David Rogers, Edward Winkofsky)
Space-time diagrams are employed in airport master planning to enable visualization of aircraft flow along the approach pathway by projection of velocity trajectories onto a 2-dimensional surface. We present the opening and closing case to examine capacity and delay at the Cincinnati airport, which may serve as a basis for advanced dynamic stochastic programming models for air traffic-flow management. The arrival and departure pathways are examined independently, and then in mixed operations, which combine simultaneous pathways to mimic reality. We find that airline schedules, which are out of the control of local airport board authorities, influence capacity and delay fluctuations faced at major airports nationwide. In response, local air-traffic control towers may elect to use mitigation strategies to minimize aircraft space and time separation with the addition of a space-time buffer for each aircraft pairing on approach. We also consider random shocks to the airspace, which occur under the rubric of irregular operations due to weather fluctuations. Irregular operations are presented as Markov-chain scenarios where we argue that, from a practical standpoint, the total event space can be decomposed to circumvent the complication of dependencies connected to previous event nodes in our sequence. We focus on observable, partially connected Markov chains that are normalized to become row stochastic. The transition matrices form a basis to consider more advanced questions for strategic planning.
Liang Xia, Application of Decision Trees in Credit-Score Analysis, December 7, 2012 (Martin Levy, Yan Yu)
Excessive abuse of credit cards has contributed to increasing credit risks, which have become a heavy burden for credit-card companies. In this environment, it is important to build and use models that estimate the potential risk and try to maximize profits from credit-card use. Classification methods have been widely used in the credit-banking industry. They can help lenders decide whether an applicant is a good candidate for a loan. This project presents the application of data-mining approaches to credit-scoring analysis in the financial industry. The objective is to compare the performance of three predictive methodologies: chi-square automatic interaction detection (CHAID) decision trees, classification and regression trees (CART), and logistic regression models. The three approaches were applied to German credit data from the UC Irvine Machine Learning Repository. The results suggest that the CART decision-tree model has the best predictive performance but the least model stability. The logistic regression model has the best model stability and acceptable predictive ability. The CHAID decision tree is more robust than CART in model building and interpretation. From the results, we illustrate the strengths of both non-parametric and parametric methods.
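The core move in the CART approach can be illustrated by a single split: choose the threshold on a numeric feature that minimizes the weighted Gini impurity of the two resulting groups. A toy sketch with an invented one-feature "credit score" sample (real credit scoring uses many attributes and recursive partitioning):

```python
# Gini impurity of a set of binary labels (1 = default, 0 = repaid).
def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)

# Try every observed value as a "<= t" split and keep the purest one.
def best_split(xs, ys):
    best = (None, float("inf"))
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best[1]:
            best = (t, score)
    return best

# Toy sample: low scores default, high scores repay.
xs = [300, 350, 400, 450, 600, 650, 700, 750]
ys = [1, 1, 1, 1, 0, 0, 0, 0]
threshold, impurity = best_split(xs, ys)
```

CART applies this search recursively within each child node; CHAID instead grows multi-way splits chosen by chi-square tests, which is one source of the robustness difference the abstract reports.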
Xiaoning Guo, A Comparison of Data-Mining Methods in Direct Marketing, December 7, 2012 (Martin Levy, Yan Yu)
Direct marketing is used to target the group of consumers who are most likely to respond to marketing campaigns. Companies typically send promotional materials to about 20% of the potential buyers on their lists. However, how to select the best customers is an open question. The purpose of this study is to compare different data-mining methods for selecting the best customers to receive the promotional catalog. The data-mining methods include generalized-linear, generalized-additive, classification-and-regression-tree, neural-network, and support-vector-machine models. Based on misclassification rates and areas under ROC curves, the logistic regression model is best for predicting consumers' response behavior.
Jun Sun, A Simulation Model for Evaluating the Performance of Fire Departments in Hamilton County, Ohio, December 6, 2012 (Uday Rao, Jeffrey Camm)
This research project focuses on (1) evaluating current Hamilton County fire-department performance, including response time (most importantly dispatch-to-arrival time) and vehicle utilization, and (2) forecasting performance under different situations (increasing incident frequency and a reduced number of vehicles). Based on the needs of the local fire departments, this project answers two questions: whether the fire departments did a good job in 2010-2011, and what the performance level will be if the fire departments' budget is cut. To achieve this goal, statistical-analysis methods are used to evaluate current fire-department performance and to help build a simulation model through input analysis of the raw data. Simulation modeling is then used to forecast different scenarios based on varying inputs. Four input variables were used in the simulation model: dispatch-to-arrival time, arrival-to-closed time, frequency of incidents, and number of vehicles responding to an emergency. All these variables come directly from the 2010 and 2011 records that the Hamilton County Communication Center provided. Results from this study indicate that the overall performance of the Hamilton County fire departments, judged against the six-minute criterion, is good, and that the frequency of incidents is growing year by year. Based on this growth rate, the simulation results show that within five years the dispatch-to-arrival time will increase by 20 seconds (5.6% of the criterion), and in ten years the increase will be 62 seconds (17% of the criterion).
Ece Ceren Izgi, A Model for Financial Risk Analyses of Mass Customization, December 5, 2012 (Jeffrey Camm, Amitabh Raturi)
Mass customization is gaining popularity as a viable business model. The value proposition of a customized product is very different from that of a commodity product. Customized products are more valuable than non-customized products, but to be sustainable the manufacturing process must be robust, with system efficiencies close to those of mass production. In this research project, Procter and Gamble's mass-customization efforts on a product are evaluated with a financial model via key financial performance metrics under three process-design scenarios: an in-house semi-manual process, an in-house automated process, and an outsourced semi-manual process. First, a deterministic model was designed; afterwards, probabilistic components were introduced. The @RISK software is used to execute sensitivity and scenario analyses. The main goal of this project is the design of a decision-support system to illustrate and measure the risk, flexibility, and potential impact of decisions involving mass customization.
Rebekah Wilson, Comparing Data-Mining Techniques to Build a Predictive Model to Understand Customer Risk, December 4, 2012 (Martin Levy, Yan Yu)
Businesses use data-mining techniques to evaluate and manage large amounts of data. Specifically, risk departments use data mining to develop rules and models to rate or score new and existing customers for numerous reasons. In this project, we look at multi-divisional credit-card risk-performance data and develop rules that target specific cardholders. The goal is to find cardholders whose accounts are frozen due to a returned payment and classify them as "good" or "bad" as defined by the company. The dataset contains 10,337 accounts, each with 370 fields such as risk score, history code, last payment amount, etc. The project uses CHAID and CART classification trees to create decision rules that most accurately predict which frozen accounts would be "good" enough to unfreeze 60 days after the returned payment had been made. The good/bad flag is defined as a frozen account that, 6 months after having a returned payment, is either current or 30 days past due on a payment. The two decision trees are compared to determine which method allows for the most accurate and stable rules. Ultimately, both models correctly predicted the "good" cardholders over 60% of the time (67.71% for CART and 63.27% for CHAID). In terms of stability, CART outperforms CHAID due to the distribution of a key variable that the CHAID process used. However, CHAID did much better at separating the "good" and "bad" cardholders, with a more consistent and higher KS statistic. It was therefore decided to look more closely into the business criteria of each decision tree and determine which tree, paired with a cutoff, would allow for the most profit.
Torrie A. Wilson, The Effects of Cessation of Assessing Credit Card Late Fees, December 3, 2012 (Martin Levy, Edward Winkofsky)
ABC Financial (name changed to hide the identity of the actual credit-card company) currently charges accounts late fees when they reach 30, 60, 90, 120, 150, and 180 days past due. Given current economic conditions, and with ABC Financial's mission statement declaring the company customer-friendly, a recommendation has been made to stop charging late fees after 90 days past due. If an account charges off, the late fees associated with that account are written off along with the unpaid balance; this is assessed as a loss for the company, since the money will not be collected. If an account does not charge off, the late fees associated with that account are assessed as a profit, since the money will be collected. The analytics department was asked to produce preliminary results that would inform the decision of whether to implement a test. The findings show that for accounts with a delinquency status of 90 days past due, ABC Financial is more likely to incur a loss than a profit based solely on late fees. Therefore, the recommendation to stop assessing late fees after 90 days past due could actually be more beneficial to the company than its current late-fee structure.
Valerie Lynn Schneider, A Simulation Study of Traffic-Intersection Signalization, December 3, 2012 (David Kelton, David Rogers)
The intersection of North Bend Road and Edger Drive is under constant scrutiny by drivers who get stuck in long queues while waiting on Edger to turn onto North Bend. The purpose of this study is to simulate the intersection and determine whether a traffic light should be installed. An analysis of several key output performance metrics, including average queue length and average time spent in the intersection per car, was completed. Two models were constructed: one modeling the intersection as it is now, and a second modeling the intersection as if a traffic light were installed. Several scenarios were examined to analyze the potential effects of increased traffic through the intersection. Results of the traffic study show that installing a traffic light in the intersection now would not be beneficial. However, if the amount of traffic through the intersection were to increase by upwards of 20 percent, then a traffic light could help decrease average queue lengths and the overall time spent in the intersection by all cars throughout the day.
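The scenario comparison can be sketched with a minimal single-queue simulation: Poisson arrivals on the minor street, exponential times to find a gap and turn, first-come first-served. All rates are illustrative (not measured at the real intersection), and reusing the same random seed gives both scenarios common random numbers so they are directly comparable:

```python
import random

def avg_time_in_system(arrival_rate, service_rate, n_cars=20000, seed=42):
    rng = random.Random(seed)   # common random numbers across scenarios
    t = 0.0                     # arrival clock (seconds)
    free_at = 0.0               # time the car at the head of the queue clears
    total = 0.0
    for _ in range(n_cars):
        t += rng.expovariate(arrival_rate)          # next arrival
        start = max(t, free_at)                     # wait for cars ahead
        free_at = start + rng.expovariate(service_rate)  # gap + turn time
        total += free_at - t                        # delay + turning time
    return total / n_cars

w_base = avg_time_in_system(arrival_rate=0.10, service_rate=0.15)
w_plus20 = avg_time_in_system(arrival_rate=0.12, service_rate=0.15)  # +20%
```

Because the two runs share random draws, the 20-percent-heavier traffic scenario is guaranteed to show the longer average time in the intersection, mirroring the study's finding that congestion, not the current layout, drives the case for a signal.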
Akshay Mahesh Jain, Customer Segmentation and Profiling, November 30, 2012 (David Rogers, Edward Winkofsky)
Companies invest a lot of resources in developing database systems that store voluminous information on their customers. Appropriate data-mining and multivariate techniques are used to leverage the data to identify customers who are valuable to the organization, as this can help companies generate maximum return on their marketing dollars. In this project, a two-stage study of customer behavior, segmentation and profiling, was done on a customer database of a retail company. Segmentation is the process of dividing the database into distinct customer segments such that each customer belongs to one segment, and this process helps in identifying the most valuable customer segment. Profiling is the process of describing the demographic and socioeconomic profile of the segments. The main goals of the project are to identify at most ten customer segments, each at least 5% in size, and to profile these segments. For this purpose, a sample of 300,000 records was used and randomly split into two datasets: training (60%) and validation (40%). The K-means algorithm was used on the training dataset to identify groups of customers based on recency, frequency, monetary, and duration variables. The segments generated were validated using the validation dataset, and the results suggested that the cluster solution obtained is not sample-specific and is representative of the population. After customer segmentation, profiling of the customer segments was done using demographic and socioeconomic variables. Based on the segments and profiles obtained, appropriate marketing strategies were devised.
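The K-means step can be sketched in miniature. The points below are invented (recency, monetary) pairs for two obvious segments; the project's actual run used four behavioral variables, standardization, and a far larger sample:

```python
def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(cluster):
    return tuple(sum(p[d] for p in cluster) / len(cluster)
                 for d in range(len(cluster[0])))

def kmeans(points, k, iters=100):
    centroids = list(points[:k])   # deterministic start for reproducibility
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:           # assign each point to nearest centroid
            nearest = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [mean(cl) if cl else centroids[i]   # recompute centers
                     for i, cl in enumerate(clusters)]
    return centroids, clusters

# Invented (recency in weeks, monetary in $) customer points.
points = [(2, 900), (3, 1100), (1, 950),    # recent, high-spend segment
          (40, 60), (45, 80), (50, 40)]     # lapsed, low-spend segment
centroids, clusters = kmeans(points, 2)
```

Re-running the fitted assignment rule on a held-out validation sample, as the project did, checks that the recovered segments are a property of the population rather than of one sample.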
Jie He, Evaluation of Response Time and Service Performance of the Fire Departments in Hamilton County, Ohio, November 30, 2012 (Uday Rao, Jeffrey Camm)
We provide a framework for evaluating the existing resources and service performance of the fire departments of Hamilton County. The primary focus is on the efficiency of current fire-station resources. Specifically, this study is designed to answer whether the personnel and emergency-equipment resources currently assigned to fire stations are able to meet the increasing demand for fire and medical-emergency response service. Geographic information system (GIS) network analysis is used to calculate optimal routes and theoretical driving times between the responding fire stations and incident locations. In practice, the driving route is chosen by the driver based on experience or an evaluation of route length, speed limits, and number of turns. The theoretical driving time is calculated from street-level conditions, including factors such as street types, slopes, and speed limits, with a view to minimizing turns. The theoretical driving time is compared with the actual response time (between "Dispatch Time" and "Arrival Time") in the 2010 and 2011 records from the Hamilton County Communication Center (911 Dispatch Center) to determine emergency-response efficiency. Results indicate that the majority of fire stations are able to provide timely emergency response to their neighborhoods. The results also reveal possible resource and personnel shortages in the future.
John Lawrence Ewing, Advanced Forecasting Using ARIMA Modeling: Sales Forecasting for OMI Industries, November 15, 2012 (David Rogers, Martin Levy)
Sales are the lifeblood of a company, and accurate sales forecasting helps management make key business decisions. This study forecasts sales for OMI Industries. The Box-Jenkins methodology of model identification, estimation, and validation is applied to generate autoregressive integrated moving average (ARIMA) models. An outline of the steps needed to use ARIMA time-series models to forecast sales is presented. The results produced by the model indicate that ARIMA forecasting is efficient at generating short-term forecasts.
Shashi Sharma, Empirical Assessment of the Ohlson (1995) Equity Valuation Model using Dynamic Linear Modeling Methodology within a Bayesian Inference Framework, October 5, 2012 (Jeffrey Mills [Department of Economics], Martin Levy)
The purpose of estimating the fundamental (intrinsic) value of an asset is to take advantage of mispriced assets. The guiding principle of all savvy investors, "Buy Low and Sell High," basically means that if the market price of an asset is below its fundamental value then the investor may/should consider purchasing the asset, and if the market price is above its fundamental value then the investor may consider selling the asset. This paper provides an empirical assessment of Ohlson's equity valuation equation proposed in Ohlson (1995) to estimate a firm's fundamental value of equity using the statistical approach of dynamic linear modeling. The results of this assessment can help an investor build a profitable trading strategy, for example by investing in stocks that are found to have a significant difference between their market price and their fundamental value.
Dan Larsen, A Mixed-Integer Programming Approach to a Profitable Airline Route Network Design, July 30, 2012 (Jeffrey Camm, Uday Rao)
In 2007, Delta Airlines and Northwest Airlines announced merger plans. Airline executives want regulatory agencies and the general public to believe that a merger will have positive impacts for the consumer. While there are some back-office functionalities that become redundant (and therefore eliminated) in a merger, most cost savings will come from better realigning airframes to markets. This project formulates a mixed integer program and uses publicly available fare and passenger data from 2007 to determine what a profitable route network would look like at some point in the unknown future. As this analysis was conducted in 2008, we also have the ability to take a retrospective look and see how our analysis and assumptions played out five years later.
Benjamin Noah Grant, The Benefit of Reallocation Using Scenario-Based Robust Optimization and Conditional Value-at-Risk in a Long-Only Equities Portfolio, July 30, 2012 (Jeffrey Camm, David Rogers)
One of the most important aspects of investment in volatile assets is risk control. Many mathematical models have been developed to try to control risk in an investment portfolio, with one of the most widely used models being the value-at-risk (VaR) model. Conditional-value-at-risk (CVaR) was developed as a model to address expected tail loss to help mitigate catastrophic losses in portfolios. This project examines five different reallocation time periods applied to a long-only equities portfolio, under the assumption that assets can only be bought and no short positions are allowed. The goal is to quantify the benefit to risk control of actively reallocating using a CVaR optimization model.
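The standard Rockafellar-Uryasev linear-programming formulation on which such a CVaR reallocation step can be built is sketched below (the scenario returns are simulated, and the asset count, confidence level, and mean-return floor are illustrative assumptions, not the project's portfolio):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
# Hypothetical scenario returns for 5 assets over 250 historical scenarios
S, n = 250, 5
mu = np.array([0.0004, 0.0006, 0.0005, 0.0007, 0.0003])
R = rng.normal(mu, 0.01, size=(S, n))

beta = 0.95                          # CVaR confidence level
target = R.mean(axis=0).mean()       # mean-return floor (equal weights attain it)

# Variables x = [w_1..w_n, alpha, u_1..u_S]; minimize alpha + mean(u)/(1-beta)
c = np.r_[np.zeros(n), 1.0, np.ones(S) / ((1 - beta) * S)]
# u_s >= -R_s.w - alpha  rewritten as  -R_s.w - alpha - u_s <= 0
A_ub = np.hstack([-R, -np.ones((S, 1)), -np.eye(S)])
b_ub = np.zeros(S)
# Mean-return floor: -mu_hat.w <= -target
A_ub = np.vstack([A_ub, np.r_[-R.mean(axis=0), 0.0, np.zeros(S)]])
b_ub = np.r_[b_ub, -target]
A_eq = np.r_[np.ones(n), 0.0, np.zeros(S)].reshape(1, -1)   # weights sum to 1
bounds = [(0, 1)] * n + [(None, None)] + [(0, None)] * S    # long-only weights

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
              bounds=bounds, method="highs")
print(res.status, res.x[:n].round(3), res.fun)   # 0 = optimal; res.fun is the CVaR
```

At each reallocation date the same program would be re-solved on the most recent scenario set; the long-only requirement is exactly the non-negative bounds on the weights.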
Yuanyuan Niu, A Study on Bond Yield Curve Forecasting, July 26, 2012 (Yan Yu, Uday Rao)
There is considerable research on in-sample fitting and out-of-sample forecasting performance of yield curves. This project first studies the model and methodology of forecasting the term structure of government bond yields (Diebold and Li 2006). The Nelson-Siegel factor model is used to fit the Treasury bond yield data from 1985 through 2000. Three time-varying regression coefficients are interpreted as level, slope, and curvature-factor loadings of the yield curve. Various scenarios are constructed to find the optimal shape parameter. In addition, unlike a constant shape parameter in the previous literature, we develop both linear and non-linear least-squares grid search algorithms to find an optimal time-varying shape parameter. Autoregressive models and recursive forecasting are also involved in predicting future yields. Finally, we compare the in-sample and out-of-sample model fits with different choices of shape parameters. The statistical software package SAS is used for implementation.
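The fixed-lambda fit plus grid search described above can be sketched in Python (the project used SAS on Treasury data from 1985 through 2000; the single synthetic curve below only illustrates the mechanics, and the grid range is an assumption):

```python
import numpy as np

def ns_loadings(tau, lam):
    """Nelson-Siegel level, slope, and curvature loadings at maturities tau."""
    x = lam * tau
    f1 = (1 - np.exp(-x)) / x
    return np.column_stack([np.ones_like(tau), f1, f1 - np.exp(-x)])

# Hypothetical yield curve observed at maturities 0.25..30 years
tau = np.array([0.25, 0.5, 1, 2, 3, 5, 7, 10, 20, 30], dtype=float)
true = ns_loadings(tau, 0.6) @ np.array([0.06, -0.02, 0.01])
rng = np.random.default_rng(1)
y = true + rng.normal(0, 0.0005, tau.size)

# Grid search over the shape parameter; betas by least squares at each grid point
best = min(
    (np.linalg.lstsq(ns_loadings(tau, lam), y, rcond=None)[1].sum(), lam)
    for lam in np.linspace(0.1, 2.0, 96)
)
sse, lam_hat = best
beta_hat = np.linalg.lstsq(ns_loadings(tau, lam_hat), y, rcond=None)[0]
print(lam_hat.round(2), beta_hat.round(4))  # level, slope, curvature factors
```

Refitting this at every date yields the three time-varying factor series that the autoregressive step then forecasts; the project's time-varying-lambda extension repeats the search (or a nonlinear least-squares refinement) per date.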
Alexander Muff, Bottleneck Analysis via Simulation of a Steel Barrel Manufacturing Mill, July 26, 2012 (David Kelton, Uday Rao)
The purpose of this study was to build a simulation model of a steel-barrel mill for analysis. This analysis would include identifying bottlenecks via cycle times and machine breakdowns. It would also be used to quantify the return on investment of process and capital improvements. Since this was the first time the company had used simulation modeling, it was also a proof-of-concept that simulation modeling was the correct approach to identify these problems. The project resulted in verifying the plant manager's intuition about bottlenecks and provided valuable data about scheduled capital improvements. The company has also rolled out simulation modeling to its other facilities across North America.
Andrew Dempsey, Using Markov Chains to Analyze a Football Drive, July 24, 2012 (David Rogers, Jeffrey Camm)
The 2007 Highlands team was facing a critical drive that threatened their season. The drive, starting from the twenty-six-yard line, was the highlight of the season. Fans of past years' teams questioned whether the 2006 team would have been able to pull off such a drive. The probabilities of scoring and of turning the ball over are analyzed using matrix calculations. Markov models and matrix manipulation allow expected points and expected downs to be calculated from a spreadsheet of the 2006 season data. The calculations are used to gain situational awareness, which coaches use to judge the degree of success of plays and to inform future play calling. The analysis of down, yard line, and distance to the goal line allows the drive situation to be calculated and analyzed.
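The matrix machinery here is that of an absorbing Markov chain. A sketch with a deliberately tiny state space, downs only, using hypothetical transition probabilities rather than the 2006 spreadsheet data:

```python
import numpy as np

# Transient states: 1st, 2nd, 3rd, 4th down; absorbing states: touchdown, turnover.
# All probabilities below are made up for illustration.
Q = np.array([            # transient -> transient
    [0.30, 0.55, 0.00, 0.00],   # 1st down: convert to a fresh 1st, or fall to 2nd
    [0.25, 0.00, 0.60, 0.00],
    [0.20, 0.00, 0.00, 0.65],
    [0.15, 0.00, 0.00, 0.00],
])
R = np.array([            # transient -> absorbing [TD, turnover]
    [0.10, 0.05],
    [0.10, 0.05],
    [0.10, 0.05],
    [0.10, 0.75],
])

# Fundamental matrix: expected visits to each down before the drive ends
N = np.linalg.inv(np.eye(4) - Q)
absorb = N @ R                        # absorption probabilities from each down
exp_points = absorb @ np.array([7.0, 0.0])
print(absorb[0].round(3), exp_points[0].round(2))  # from a fresh 1st down
```

The full model indexes states by down, distance, and yard line rather than down alone, but the fundamental-matrix calculation of expected points and expected downs is identical.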
Silky Abbott, Simulation of the Convergys Contact Center in Erlanger, KY, July 24, 2012 (David Kelton, Jaime Newell)
The Convergys (CVG) Contact Center currently operates at three locations, one onshore in Erlanger, KY and two offshore in the Philippines and in India. The contact center located in Erlanger, KY serves different types of processes with inbound as well as outbound service. Clients from different sectors such as retail banking, credit cards, cell-phone providers, satellite/TV providers, healthcare, and insurance outsource their processes to the CVG inbound and outbound contact center. With business development among their clients, more processes are getting outsourced to CVG by the clients and there has been a subsequent increase in the call volume to which the agents respond every day. The most immediate concern for the corporation is the presence of unsatisfied customers due to lack of operational facilities for them to speak with a customer-service representative (CSR). In this model, we look at one of the processes served by the Convergys contact center; however, the client's name is not disclosed due to confidentiality concerns. Several factors such as call arrival rates, total time spent on the phone, total time spent in the queue, holds, and transfers are looked at in this report. Several scenarios such as increasing the number of CSRs and trunk lines are simulated to decrease the percent of customers who are dropped from the contact-center queue due to system overflow, while at the same time keeping a check on the total cost of the system. Using simulation modeling, these different options are explored to determine which option is best for the Convergys Contact Center.
Shangpeng Pan, An Analysis of Running Records Using Frontier Analysis, July 20, 2012 (Jeffrey Camm, Martin Levy)
The objective of this study is to develop a handicapping technique by analyzing world running records, using frontier analysis. A handicapping technique is useful in comparing performances of different ages and genders in long-distance running. Using the world records for 5K, 10K, and Marathon races, the frontier function can be calculated based on linear programming. The estimated frontier function is discussed for different ages and genders. The handicapping technique is developed based on the frontier function and applied to an example. Such a handicapping technique using frontier analysis can be extended to other individual sports with precise measures.
Madan Mohan Dharmana, Customer-Centric Pricing Analysis, June 1, 2012 (Amitabh Raturi, David Rogers)
Price is an important driver for profitability. Even though price has a higher impact on increasing profits than other levers of operations management, companies often do not focus on pricing appropriately for fear of losing customers to competition. One thing is clear: the higher the value of the product perceived by the customer, the less price-sensitive the customer is. In an ideal world, the right value of the product, as perceived by each customer, should be evaluated and the product can be priced accordingly. Having a customized pricing policy based on the characteristics of each segment can potentially enhance sales and thus maximize profits by extracting the complete value created by the product for the different segments. In this project, a customer-centric pricing strategy is illustrated. The customers are classified into different segments and current pricing benchmarks are obtained for each segment. The potential for price increases to each customer is identified based on what the current price is as compared to the segment benchmarks. Probability of attrition is used to identify how sensitive the customer will be to a price increase. A logistic regression model is built to obtain the probability of attrition. Based on the upside potential for price increase and price sensitivity of the customer, a strategy for revenue increase is identified for each customer. We develop a flowchart of the methodology for customer-centric pricing, illustrate this methodology using several examples, as well as show the magnitude of differences in the overall profitability of a firm with customized pricing policies in different scenarios. Several avenues for future research are also identified.
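The attrition-probability step can be sketched with a logistic regression in Python (the customers, the two predictors, and the coefficients below are simulated stand-ins for the project's proprietary inputs):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
# Hypothetical history: proposed price increase (%) and relationship tenure (years)
n = 2000
price_up = rng.uniform(0, 20, n)
tenure = rng.uniform(0, 15, n)

# Simulated ground truth: attrition odds rise with the increase, fall with tenure
logit = -2.0 + 0.25 * price_up - 0.15 * tenure
churned = rng.random(n) < 1 / (1 + np.exp(-logit))

X = np.column_stack([price_up, tenure])
clf = LogisticRegression().fit(X, churned)

# Score one customer's candidate 10% increase before committing to it
p = clf.predict_proba([[10.0, 2.0]])[0, 1]
print(round(p, 2))   # estimated attrition probability
```

In the project's flow, this probability is combined with the upside versus the segment benchmark: large upside with low attrition risk marks a customer as a candidate for a price increase.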
Ou Liu, College Athletic Conference Realignment: Minimize the Traveling Miles, May 31, 2012 (Michael Magazine, Jeffrey Camm)
Unnecessary flying distance causes major financial losses for college athletic teams and leaves players exhausted. Realigning the NCAA athletic conferences to reduce flying miles is therefore of practical importance. This project formulates a problem to minimize the total distance NCAA athletic teams travel by finding a realignment based on a distance metric. The project selects 66 teams from six conferences (Big Ten, S.E.C., A.C.C., Big Twelve, Pacific Twelve, and Big East) plus four independent college athletic teams (Notre Dame, BYU, Navy, and Army). The optimization model is based on seven conferences in total and ten teams in each conference.
David Teng, Simulation Analysis of the UC Bearcats Transportation System, May 31, 2012 (David Kelton, Jeffrey Camm)
The objective of the study is to simulate the University of Cincinnati Bearcats Transportation System (BTS). Many analyses can be applied to other transportation systems as tools of cost reduction and effectiveness improvement. This study contains three major parts. The first part describes the development of the model. Based on the available data, five assumptions were identified. We describe the limitations of Arena when building a shuttle-bus simulation model and provide resolutions to those limitations. In the second part, the model goes through the validation process and its accuracy is confirmed. Finally, we conduct experimentation in the third part of the study. The experimental outputs suggest recommendations, such as reducing the total number of trips or seats, leading to potential cost cuts.
Richard Walker III, Asia-to-US Supply Network Simulation and Analysis, May 23, 2012 (David Kelton, Uday Rao)
Container freight is a key component of imported merchandise, which has shown robust growth since its introduction in the 1950s. Within this segment, the China-to-US trade flow is the largest inter-nation trade route by volume. To manage supply chains that avail themselves of this tremendous trade flow, major uncertainties in lead-time and demand forecasts work against the need to provide high service reliability with minimum costs. A dynamic simulation model was developed using a trans-Pacific, intermodal supply chain from Guangzhou, China to the southwestern quadrant of Ohio. The model used a combination of primary shipper data and literature values for transit-time probability distributions and freight cost variables to investigate the impact of shipper reliability and mean delivery times on logistics costs and service levels. Using regression analysis of the simulation data, it was determined that for selected situations, the cost advantage and competitive service rates attainable by direct train delivery of trans-Pacific, intermodal container merchandise can make train delivery an economically preferred choice for supply chains delivering above 95% fill rates. Additionally, we found that faster, more reliable shipment methods can significantly increase inventory levels and inventory holding costs for supply chains that are operating between 73% and 99% fill rates.
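The interplay of mean transit time, transit-time reliability, and service level can be conveyed with a small Monte Carlo sketch in Python (the project itself was a full dynamic simulation; the distributions and every parameter below are illustrative assumptions, not the shipper's data):

```python
import numpy as np

rng = np.random.default_rng(11)
# Hypothetical door-to-door transit-time distributions (days)
modes = {
    "ocean+truck": (28.0, 6.0),   # slower and more variable
    "ocean+train": (24.0, 3.0),   # direct train: faster, more reliable
}
daily_demand, demand_sd = 100.0, 20.0

safety = {}
for name, (mu, sd) in modes.items():
    # Monte Carlo of demand over the (random) replenishment lead time
    L = np.maximum(rng.normal(mu, sd, 100_000), 1.0)
    D = rng.normal(daily_demand * L, demand_sd * np.sqrt(L))
    # Reorder point for a 95% cycle-service level; excess over mean demand
    # during the lead time is the safety stock that must be held
    rop = np.quantile(D, 0.95)
    safety[name] = rop - daily_demand * mu
    print(name, round(safety[name]))
```

Lead-time variability dominates the safety-stock requirement here, which is the mechanism behind the fill-rate and holding-cost tradeoffs the regression analysis quantifies.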
Lianlin Chi, Nurse Schedule Optimization at a Children's Hospital Emergency Department: A Linear-Programming Approach, May 2, 2012 (Jeffrey Camm, Yan Yu)
A hospital provides patient treatment by specialized staff and equipment. It usually has an emergency department, which is a medical-treatment facility specializing in acute care of patients who arrive without prior appointment. Because of the variation and uncertainty in demand, emergency-department staffing is particularly challenging. In terms of planning for this demand, the hospital needs to produce duty schedules for its emergency-department nursing staff. The schedule has impact on budget, nursing functions, and health-care quality. Nurses at the Cincinnati Children's Hospital's emergency department are given a lot of flexibility to serve their patients best according to their own working preferences. In this study, a computerized nurse-scheduling model was developed to adapt to Cincinnati Children's Hospital's emergency department. We solved the problem of minimizing staffing shortages using a binary linear goal-programming approach with OpenSolver.
Sashi Kommineni, Nurse Scheduling at a Children's Hospital Emergency Department, January 6, 2012 (Jeffrey Camm, Uday Rao)
Pediatric nursing is a specialty encompassing the care of children, adolescents, and their families in a variety of settings. Handling patients and their families in an emergency situation is a challenging task. Nurses at the emergency department of a children's hospital in Cincinnati are given a lot of flexibility in drawing up their schedule in order to allow them to serve patients in the best possible way. This is because the scheduling quality directly influences the nursing quality and working morale. As it exists today, a full-time scheduler works with nurses and draws schedules for a period of six weeks while trying to accommodate individual preferences and change requests. The underlying goal is to minimize overtime and staffing shortages while utilizing existing resources. In this project, we attempt to automate and solve a scaled-down nurse-scheduling scenario to cover minimum staffing levels at the emergency department using a multi-objective linear-programming approach with OpenSolver.
Yanyun (Lance) Wang, Non-parametric Density Estimation in VLSI - Statistical Static Timing Analysis Boosting, December 2, 2011 (Yan Yu, Uday Rao)
In Very Large Scale Integrated (VLSI) circuit design, it is important to investigate the longest path delay from inputs to outputs, which is also termed Static Timing Analysis (STA). As the VLSI industry steps into the deep sub-micron era, process variation becomes more and more important, especially in STA. Statistical Static Timing Analysis (SSTA) thus helps to deal with the variation in critical path delays. This project studies how to employ non-parametric density estimation methods in SSTA to help the hardware-design industry further understand their device-manufacturing variability. The simulation results using kernel density estimation show significant reduction of the necessary MC simulation iterations in SSTA, which may potentially shed light on future design improvements. This project takes in the extracted parameters from a manufacturing chip-testing process and inputs them into a test bench of 12 serially connected inverters. Outputs are analyzed and summarized with Perl and ported to SAS and R. Non-parametric (NP) Kernel Density Estimation (KDE) is implemented, investigated, and compared with parametric modeling methods such as Gaussian distribution fitting. The result indicates that NP KDE demonstrates strong predictive ability with a high level of accuracy and lower cost in time and memory usage than does brute-force Monte Carlo simulation. The NP KDE method outperforms all the other statistical curve-fitting methods.
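The core comparison, kernel density estimation versus a parametric Gaussian fit on a skewed delay distribution, can be sketched in Python (the project's pipeline used Perl, SAS, and R on extracted chip parameters; the gamma-distributed delays below are a synthetic stand-in):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
# Hypothetical critical-path delays (ps): right-skewed, as process variation often is
delays = rng.gamma(4.0, 25.0, 2000)

# Parametric alternative: a single Gaussian fit to the sample
mu, sd = delays.mean(), delays.std()
# Non-parametric kernel density estimate over the same sample
kde = stats.gaussian_kde(delays)

# Compare the estimated 99th-percentile delay, the point that governs timing yield
grid = np.linspace(0, delays.max() * 1.5, 4000)
cdf_kde = np.cumsum(kde(grid)) * (grid[1] - grid[0])
q99_kde = grid[np.searchsorted(cdf_kde, 0.99)]
q99_norm = stats.norm.ppf(0.99, mu, sd)
print(round(q99_kde), round(q99_norm), round(np.quantile(delays, 0.99)))
```

On right-skewed delays the Gaussian fit understates the upper tail, exactly where timing yield is decided, while the KDE quantile tracks the empirical one far more closely.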
Hongbing Chen, Forecasting Loan Loss Rates Using Multivariate Time-Series Models, November 28, 2011 (David Rogers, Hui Guo)
The ability to forecast credit loss accurately is of vital importance to every financial institution for both decision support and regulatory compliance. This study proposes a Vector AutoRegressive and Moving-Average processes with eXogenous regressors (VARMAX) model to overcome constraints and limitations imposed by commonly used roll-rate models. The VARMAX model allows for multivariate forecasting and takes advantage of information contained in the time-series of forecasting variables. In particular, the VARMAX technique allows for joint forecast of the loan loss rate (the forecasted variable) and the delinquent rates (one of the forecasting variables), which substantially enhances forecasting performance, especially when the forecast window is lengthened. Using historical performance data of various auto loan portfolios at a regional commercial bank, the paper demonstrates that the proposed VARMAX model consistently outperforms roll-rate models across various loan portfolios.
Lakshmi Palaparambil Dinesh, Robust Optimization for Resource Allocation in the Energy Sector, November 22, 2011 (Jeffrey Camm, Uday Rao)
The highly volatile nature of energy prices makes it important for the utility companies to have a plan in place to buy, sell, and store energy in high-capacity batteries. The hourly Location Marginal Prices (LMP) change as a function of power consumption and the companies need to manage the batteries accordingly so that objectives of profit maximization and optimal power allocation are met. One of the ways to do this is to use scenario-based optimization using the best-worst case profits model, which is tested in this project. The best-worst case profit points to maximizing the profit that is lower than each of the individual scenario profits. This is an extremely conservative approach. In addition to the above-mentioned approach, other buy, sell, and store plans based on the mean prices and their confidence intervals can be used. The objective of this project is to show empirically why the robust optimization model performs better than the other models in the face of uncertainty. The results show that the robust optimization model performs the best in terms of stability.
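The best-worst-case logic can be sketched on a deliberately tiny instance, small enough to enumerate every feasible buy/hold/sell plan (the prices, horizon, and unit battery below are illustrative assumptions):

```python
import itertools
import numpy as np

# Hypothetical hourly LMP scenarios ($/MWh) over a 4-hour horizon
scenarios = np.array([
    [30, 45, 80, 60],
    [35, 40, 55, 90],
    [25, 60, 50, 70],
])
capacity = 1.0   # 1 MWh battery, starts empty; one full charge or discharge per hour

best_worst, best_plan = -np.inf, None
# Each hour: buy/charge (-1), hold (0), or sell/discharge (+1)
for plan in itertools.product((-1, 0, 1), repeat=4):
    level, feasible = 0.0, True
    for a in plan:                       # battery level must stay in [0, capacity]
        level -= a
        if not 0.0 <= level <= capacity:
            feasible = False
            break
    if not feasible:
        continue
    profits = scenarios @ np.array(plan)        # profit under each scenario
    if profits.min() > best_worst:              # robust: maximize the WORST case
        best_worst, best_plan = profits.min(), plan

print(best_plan, best_worst)
```

Swapping `profits.min()` for `profits.mean()` gives the risk-neutral counterpart; the two can disagree whenever an hour is lucrative only in some scenarios, and the gap between them is the price paid for the stability the project measures. Realistic horizons require a linear-programming formulation rather than enumeration.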
Hexi Gu, Regularization Methods for Ill-Posed Inverse Problems: Empirical Research on Realistic Macroeconomic Data, November 17, 2011 (Uday Rao, Yan Yu)
Benefiting from research on many practical problems in the natural-sciences and engineering-technology areas, the inverse problem has received much attention since the 1960s. Widespread application of the inverse problem in medicine, mathematical physics, meteorology, and economics has attracted much research. This project briefly introduces the theory of the inverse problem and the regularization method used to solve ill-posed inverse problems. Several well-known regularization methods, such as the Tikhonov regularization method, the Landweber regularization method, and the conjugate gradient method, are discussed and analyzed. Regression models for parameter estimation based on these methods are developed and applied through a case study using China's real economic data from 1990 to 2008. In order to test the effects of the regularization regression model, ordinary least squares (OLS) and EViews-based methods are also applied and compared with the regularization methods, and the results show that regularization is better than OLS when dealing with ill-posed inverse problems. It is also suggested by the case study that, in order to have sustainable and healthy growth in China's economy, it is important that the government take measures to promote domestic consumption (final consumption), which forms a crucial part of GDP.
David A. Pasquel, Operational Cost-Curve Analysis for Supply-Chain Systems, November 2, 2011 (David Rogers, Amitabh Raturi)
Inventory managers of multi-level supply-chain environments experience continuous pressure, especially from website competitors, to minimize total inventory costs. A general, non-linear mathematical formulation with the objective of minimizing total system costs (the summation of backorder penalty costs and inventory holding costs) while constraining the proportion of backorder costs to total system costs is initially considered. This analysis is then expanded to compare this constraint to a backorder rate constraint. Finally, the model is modified to consider a mixture of back orders and lost sales in the retail inventory system. Cost curves are developed for these various scenarios to provide graphical support of the effects and tradeoffs inventory system decisions can have on total costs.
Ben Cofie, A Data-Mining Approach to Understanding Cincinnati Zoo Customer Behavior, October 18, 2011 (Yan Yu, Jeffrey Camm)
The Cincinnati Zoo is one of the top zoos in the nation. They serve approximately 1,000,000 visitors every year and are looking forward to serving even more visitors in the future. To understand the needs of their customers and provide better services, the zoo collects and manages large amounts of data on their customers through surveys and membership applications. In this project an attempt is made to study and identify groups, structures, or patterns that exist in the Cincinnati Zoo data using cluster analysis and association-rule mining. Clustering is used to identify useful groups that exist in the data sets. Both the K-means and Ward's clustering methods are implemented in R using the kmeans() and hclust(data, method="ward") functions, respectively. Association-rule mining (ARM) is used to identify food/retail purchasing patterns of members by studying the correspondence/associations between items purchased together. Association-rule mining is implemented in R using the Apriori algorithm. Results showed that certain food/retail items are almost always purchased together even though the Zoo sells them separately, and also Zoo members with fewer children are more likely to drop membership than those with more children.
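The first two Apriori passes, frequent single items and then frequent pairs with confidence, fit in a few lines (the project used R's apriori on real transactions; the baskets and thresholds below are made up):

```python
from collections import Counter
from itertools import combinations

# Hypothetical member transactions at Zoo food/retail stands
baskets = [
    {"icee", "pretzel"}, {"icee", "pretzel", "plush"}, {"icee", "pretzel"},
    {"coffee", "plush"}, {"icee", "pretzel", "coffee"}, {"plush"},
    {"icee", "pretzel"}, {"coffee"}, {"icee", "pretzel", "plush"}, {"icee"},
]
min_support = 0.3

# Pass 1: frequent single items (Apriori prunes everything containing a rare item)
item_count = Counter(i for b in baskets for i in b)
frequent = {i for i, c in item_count.items() if c / len(baskets) >= min_support}

# Pass 2: count pairs built only from frequent items
pair_count = Counter(
    p for b in baskets for p in combinations(sorted(b & frequent), 2)
)
for pair, c in pair_count.items():
    if c / len(baskets) >= min_support:
        conf = c / item_count[pair[0]]     # confidence of pair[0] -> pair[1]
        print(pair, round(c / len(baskets), 2), round(conf, 2))
```

A high-support, high-confidence pair like the one printed here is exactly the kind of "sold separately but bought together" finding the project reports, suggesting a bundling opportunity.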
Andy Craig Starling, Modeling a Small Wireless Network for the Telecom Industry, September 16, 2011 (David Kelton, Jeffrey Camm)
When deploying a wireless network in the telecom industry, it is important to develop a proper sales strategy that will maximize revenue while filling the network to capacity with sales to both residential and business customers. The wireless network described within has three towers from which calls or internet connections originate and then are relayed to a building in downtown Cincinnati where the calls are routed for termination, or internet connections hop onto the internet backbone. This thesis will study, via computer simulation, all the numerous variables that are a part of a wireless network and then conclusions will be drawn regarding the best sales strategy from which to begin.
Lily Elizabeth John, A Maximal-Set-Covering Model to Determine the Allocation of Police Vehicles in Response to 911 Calls, August 24, 2011 (Jeffrey Camm, Michael Magazine)
Police departments have in the past made a significant effort in ensuring immediate response to 911 calls. One of the many factors that directly influence response is the travel distance to the location where response is required. The closer a patrol car is to the location when a request is made, the shorter the time required to respond. However, due to resource limitations in terms of the number of patrol cars and officers on duty at any given time, it becomes important that patrol cars be strategically located to meet the demand. This project attempts to solve this problem by using a strategic maximal set covering location model to allocate patrol cars optimally. An application using real call data from the City of Cincinnati is presented.
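The exact model is a binary integer program, but the flavor of maximal covering can be conveyed with the classic greedy heuristic (the demand volumes, candidate posts, and coverage sets below are hypothetical, not the Cincinnati call data):

```python
# Hypothetical demand nodes with call volumes, and which nodes each candidate
# patrol post can reach within the response-time standard
calls = {"A": 40, "B": 25, "C": 35, "D": 10, "E": 30}
covers = {
    "post1": {"A", "B"},
    "post2": {"B", "C", "D"},
    "post3": {"C", "E"},
    "post4": {"D", "E"},
}
p = 2   # patrol cars available

# Greedy heuristic for maximal covering: repeatedly take the post that adds
# the most not-yet-covered call volume
covered, chosen = set(), []
for _ in range(p):
    post = max(covers, key=lambda s: sum(calls[i] for i in covers[s] - covered))
    chosen.append(post)
    covered |= covers[post]

print(chosen, sum(calls[i] for i in covered))
```

Greedy carries the usual (1 - 1/e) worst-case guarantee for maximal covering and, as in this instance, can miss the optimum (post1 plus post3 would cover 130 calls); the project's integer-programming formulation closes that gap.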
Kenneth Darrell, An Investigation of Classification, August 24, 2011 (Jeffrey Camm, Raj Bhatnagar)
Classifying a response variable based on predictor variables is now a common task. Methods of classifying data and the models they produce can vary widely. Classification methods can have different predictive capabilities, stemming from model assumptions and underlying theory. Disparate classification methods will be compared on their predictive capabilities as well as on the steps required to construct each model. A collection of data sets with varying numbers and types of predictor variables will be used to train and test various classification methods. The data sets under consideration will all have dichotomous response variables. The following methods will be evaluated: logistic regression, generalized additive models, decision trees, naive Bayes classification, linear discriminant analysis, and neural networks. These methods will be gauged to see whether one model will rise to the top and always outperform other methods or if each type of model is applicable to a certain range of problems. The evaluation methods will be based on common binary evaluation parameters. These parameters consist of accuracy, precision, recall, specificity, F-measure, receiver operating characteristic, and AUC.
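A miniature version of the comparison harness, with three of the listed methods scored on the same held-out split (the data are synthetic via make_classification, and accuracy plus AUC stand in for the full set of binary evaluation parameters):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic dichotomous-response data set standing in for the study's collection
X, y = make_classification(n_samples=1500, n_features=10, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4, random_state=0)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "naive_bayes": GaussianNB(),
}
scores = {}
for name, m in models.items():
    m.fit(X_tr, y_tr)
    # Identical binary evaluation parameters for every method
    scores[name] = (accuracy_score(y_te, m.predict(X_te)),
                    roc_auc_score(y_te, m.predict_proba(X_te)[:, 1]))

for name, (acc, auc) in scores.items():
    print(f"{name:12s} acc={acc:.3f} auc={auc:.3f}")
```

Repeating this over many data sets with differing predictor types is what lets the study ask whether any one method dominates or each suits a certain range of problems.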
Vijay Bharadwaj Chakilam, Bayesian Decisive Prediction Approach to Optimal Mailing Size Using BLINEX Loss, August 22, 2011 (Martin Levy, Jeffrey Camm)
Direct-mail marketing is a form of advertising that reaches its audience directly and is among the most rapidly growing forms of major marketing campaigns. A direct-mail marketing campaign starts by obtaining a name list through in-house data warehouses or external providers. A logistic scoring model is built to create response or activation scores from the characteristic attributes that describe the name list. The activation scores are then listed in descending order to rank and select names for direct mail solicitation purposes. More selected names result in more revenues but not necessarily more profits. The problem of interest is to predict an optimal mailing size to use for mailing future marketing catalogs that maximizes the profits. A collection of trials of direct-mail solicitation campaigns is made available by using a bootstrapping-like strategy to generate a set of historical trials. The likelihood function of the optimal activation scores is assumed to follow a normal distribution with unknown mean and unknown variance. Bayesian decisive prediction is then applied by using conjugate priors and a BLINEX loss function to predict the optimal activation score for a future trial.
Mark Richard Boone, Minimization of Deadhead Travel and Vehicle Allocation in a Transportation Network via Integer Programming, August 19, 2011 (Jeffrey Camm, Michael Magazine)
This paper examines an everyday issue in logistics: minimizing cost as well as unused resources. In order to find a solution for a specific case of 25 loads that needed to be transported from one city to another, data were acquired from a local firm and cities within Texas (or close proximity) were selected. After calculating deadhead distances between all possible city combinations, an integer-programming model was crafted that would take the data and minimize the deadhead travel distance in the system, given the number of trucks available for use and the maximum mileage allowed per truck. The findings, using AMPL and CPLEX as a solver, showed that the fewer resources employed, the longer it took for a solution to be found. Ultimately, for the 25 city-pairs selected, with 700 miles per truck set as a constraint, the fewest trucks that could transport all loads was 15, with a total deadhead distance of 1,373 miles (the trucks in the system were loaded nearly 82% of the time they were on the road).
Aswinraj Govindaraj, Nurse Scheduling Using a Column-Generation Heuristic, August 10, 2011 (Jeffrey Camm, Michael Magazine)
This paper describes a mathematical approach to solving a nurse-scheduling problem (NSP) arising at a hospital in Cincinnati. The hospital management finds difficulty in manually deriving a nurse roster for a six-week period while trying to place an adequate number of nurses in the emergency-care unit of the hospital. The aim of this project is to provide proof of concept that binary integer programming can be used effectively to address the NSP. This model employs a two-stage approach where multiple schedules are generated for all nurses in phase I based on the organizational and personal constraints, and the best-fit schedule for populating the roster is selected in phase II so as to effectively satisfy the demand. The study also evaluates the effectiveness of schedules thus generated to help the hospital management judiciously decide on the number of full-time and part-time nurses to be employed at the emergency-care unit.
Andrew Nguyen, Optimizing Clinic Resource Scheduling Using Mixed-Integer and Scenario-Based Stochastic Linear Programming, August 5, 2011 (Craig Froehle, Jeffrey Camm)
Efficient clinic operations are vital to ensuring that patients receive care in a timely manner. This becomes paramount when care is urgent, patients are abundant, and resources are limited. Clinic operations are more efficient when patients wait less, staff members are less idle, and total clinic duration is shorter. The goal of this paper is to develop a valid deterministic mixed-integer linear-programming (MILP) model from which a valid stochastic model can be derived, and to explore how such models can potentially be utilized in an actual clinical setting. An approach to minimizing patient waiting, staff idle time, and total duration at a clinic is to develop a MILP model that optimally schedules tasks of the clinic's staff members. This can be accomplished with a deterministic model. However, processing times of patients for each type of staff member vary in an actual clinical setting, so a stochastic model may be more appropriate. Also, a valid scheduling model is less valuable to efficient clinic operations if the model cannot be readily implemented for routine use by staff members. This paper describes a deterministic MILP model that simultaneously minimizes patient waiting, staff idle time, and total operating time. Then, from the deterministic model, a scenario-based stochastic model that assumes varying processing times is developed. Finally, prototype software solutions emphasizing clinic staff usability are discussed.
Logan Anne Kant, Robust Optimization Based Decision-Making Methodology for Improved Management of High-Capacity Battery Storage, August 4, 2011 (Jeffrey Camm, David Rogers)
Utility companies have the option of buying, selling, or storing power in high-capacity batteries to maximize profits in the face of fluctuating energy prices. The problem companies confront is managing the batteries optimally when hourly LMPs (Locational Marginal Prices) vary with changes in power consumption and power availability over the course of a day. The goal of this project is to identify and compare scenario-based robust-optimization planning models for maximizing profits from battery management. The following models are considered and empirically tested: simple simulation/optimization, best worst-case, value at risk (VaR), conditional value at risk (CVaR), minimum expected downside risk, and maximum expected profit. The project aims to give decision makers a toolbox: a decision-making methodology containing robust-optimization models that improve battery management in an uncertain market. This toolbox facilitates and informs the process, but it does not aim to solve the battery-management problem by replacing the educated judgment of the decision maker.
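As a minimal sketch of how two of these scenario-based criteria can disagree, the toy example below compares the maximum-expected-profit rule with the best worst-case rule over a few charge/discharge plans and equally likely daily LMP scenarios; all plan names and profit figures are hypothetical.

```python
# Hypothetical profits ($) for three charge/discharge plans across four
# equally likely daily LMP scenarios.
profits = {
    "plan_a": [120, -40, 200, 60],
    "plan_b": [80, 50, 90, 70],
    "plan_c": [150, -90, 260, 40],
}

# Maximum expected profit: pick the plan with the best scenario average.
max_expected = max(profits, key=lambda p: sum(profits[p]) / len(profits[p]))

# Best worst-case: pick the plan whose worst scenario is least bad.
best_worst_case = max(profits, key=lambda p: min(profits[p]))

print(max_expected, best_worst_case)
```

Here the expected-profit rule picks the aggressive plan while the worst-case rule picks the conservative one, which is exactly the trade-off the toolbox is meant to expose to the decision maker.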
Wei Zhang, A Study of Count Data by Poisson Regression and Negative Binomial Regression, July 28, 2011 (Martin Levy, Jeffrey Camm)
Count data are one of the most common data types, and many statistical models have been developed for their analysis. In this work two regression methods are investigated for applications in count-data analysis: Poisson regression and negative binomial regression. Following a brief introduction to the Poisson and negative binomial distributions, regression models were developed based on these two distributions. The models were then applied in a case study in which an insurance company was trying to model the number of emergency visits due to ischemic heart disease among 778 subscribers. The problem was approached with both regression methods and the performance of each was evaluated. The negative-binomial regression model outperformed the Poisson regression model in this case. Linear regression was also attempted but failed for the data.
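The practical motivation for preferring the negative binomial shows up in a quick overdispersion check: a Poisson model forces the variance to equal the mean, while the negative binomial allows extra variance. The counts below are illustrative, not the study's data.

```python
import statistics

# Illustrative emergency-visit counts per subscriber (hypothetical data).
visits = [0, 0, 1, 0, 2, 0, 5, 1, 0, 3, 0, 7, 1, 0, 2]

mean = statistics.mean(visits)
var = statistics.variance(visits)  # sample variance

# Under a Poisson model, mean == variance. Sample variance well above the
# mean signals overdispersion, which the negative binomial accommodates
# through an extra dispersion parameter (Var = mu + mu^2/k).
overdispersed = var > mean
print(round(mean, 3), round(var, 3), overdispersed)
```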
Lili Wang, Donation Prediction Using Logistic Regression, June 27, 2011 (Martin Levy, Jeffrey Camm)
Increasing the accuracy of predicting potential responders can save a charitable organization a great deal of money. By soliciting only the most likely donors, the organization would spend less on solicitation efforts and have more to spend on charitable concerns. This project aims to predict who would be interested in donating and to explain why those people would make a donation. The dataset contains individuals' information from a national veterans' organization, including demographic information and past donation-related behavior. The response variable of interest is binary, indicating whether the recipient will respond or not. This paper discusses the application of the logistic regression model and compares three models based on different variable-selection methods. The methods can also be applied by different companies to market their products or services.
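A minimal sketch of the underlying idea: fit a logistic model of response probability by stochastic gradient ascent on toy donor records. The two features and all data values are invented for illustration; the project itself works with a much richer set of demographic and behavioral predictors.

```python
from math import exp

# Toy donor records (hypothetical):
# ((past_gifts, years_since_last_gift), responded?)
data = [((5, 1), 1), ((3, 2), 1), ((0, 6), 0),
        ((1, 5), 0), ((4, 1), 1), ((0, 4), 0)]

def sigmoid(z):
    return 1 / (1 + exp(-z))

w = [0.0, 0.0]  # feature coefficients
b = 0.0         # intercept
lr = 0.1        # learning rate

# Stochastic gradient ascent on the logistic log-likelihood
for _ in range(1000):
    for (x1, x2), y in data:
        p = sigmoid(w[0] * x1 + w[1] * x2 + b)
        err = y - p
        w[0] += lr * err * x1
        w[1] += lr * err * x2
        b += lr * err

# Predicted response probability for a frequent, recent giver
pred = sigmoid(w[0] * 4 + w[1] * 2 + b)
print(round(pred, 3))
```

On this separable toy set the fitted model assigns a high response probability to the frequent, recent giver, which is the ranking behavior a solicitation campaign would exploit.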
Kristen Bell, Effects of Bus Arrivals on Emergency-Department Patients, June 1, 2011 (David Kelton, Craig Froehle)
University Hospital's Emergency Department (ED) treats nearly 100,000 patients annually. Patients arrive by air care (helicopter), ambulance, personal vehicle, and bus. The bus schedule limits arrival times, but may result in multiple patients arriving at one time. Using data from the hospital's record-keeping system, a simulation model was developed to examine the effects of transportation mode on wait times in the ED. Model modifications allowed for analysis of sensitivity to bus schedules and delays, determination of peak-load handling capabilities, arrival-time-of-day impacts, and comparison of scheduled arrival to a smoothed, continuous arrival function. Modeled results aim to help ED planners account for and adapt to changes in arrival-mode patterns.
Mahadevan Sambasivam, Factors Affecting Inpatient Reimbursements Using the Medicare Hospital Cost Reports -- A Case Study, June 1, 2011 (James Evans, Uday Rao)
Total Medicare payments provide the largest single source of a hospital's revenues. The CMS (Centers for Medicare and Medicaid Services) has a system, called the Inpatient Prospective Payment System, which pre-determines how much a hospital should be paid for a particular service based on the Diagnosis Related Group (DRG) codes for which the patient qualifies, depending on the patient's condition and diagnosis. These standardized payments are recalibrated every year based on the wage index, the cost of the procedure, inflation, the cost of technology, and other such criteria. This project is a case study in identifying which factors really affect the reimbursement rates of medical procedures, based on the Medicare cost-report data submitted to CMS for 2006-2009. The statistical analysis uses two techniques, principal component analysis and multiple linear regression, to interpret which factors affect the reimbursement rates the most and how they do so. The analysis found that the revenue generated from total inpatient services was negatively correlated with net inpatient income but positively correlated with the overall net income of the hospitals.
Avinash Parthasarathy, Campaign-Coupon Analysis Using Integer Programming, May 26, 2011 (Jeffrey Camm, David Rogers)
In an effort to strike a balance between retaining a set of loyal customers and attracting new customers, a retailer is considering reshuffling and reducing the current number of coupons under each campaign. The main goal of this project is to explore optimization techniques, driven by binary integer programming, to analyze the campaign-coupon structure of a grocery store. The model helps the retailer understand the coupon-redemption behavior of its customers, reduces the number of campaigns and coupons, and maximizes the number of households redeeming a set of coupons. The importance and usefulness of optimization techniques applied directly to the data are illustrated, as is the process of preparing, from a collection of datasets or tables, the right kind of data required to apply these techniques. The project uses SAS -- PROC SQL and PROC TRANSPOSE -- to extract and prepare the data required to feed the optimization process.
David Burgstrom, Foreclosures in Cincinnati: An Analysis of Associated Factors, May 25, 2011 (Yan Yu, Martin Levy)
The wave of recent home foreclosures across the nation was a hallmark of the financial crisis, lowering property values and in some cases leading to neighborhood blight. This project looks at a record of over 50,000 home sales in the city of Cincinnati from the past eleven years in order to identify predictors of increased odds of foreclosure. Part of this project is the creation of a dataset using ArcGIS software to geocode addresses from the Hamilton County Auditor and identify their respective neighborhoods, which enables demographic information from census data to be joined as possible predictors. The analysis is performed in the R computing environment using a generalized linear mixed model. The year of sale and the neighborhood are entered as random effects and all other predictors are evaluated as fixed effects. AIC, BIC, AUC, and mean residual deviance are criteria used to determine the optimal collection of predictor variables. The results show that while some predictors are similar to the findings from other foreclosure studies, other variables show that Cincinnati's experience with the foreclosure crisis was unique.
Lei Xia, A Study of Panel Data Analysis, May 23, 2011 (Martin Levy, Yan Yu)
Panel data refer to multi-dimensional data which contain observations on multiple phenomena observed over time for the same objects. Results from panel data analysis are more informative, and estimation based on panel data can be more efficient, than with time-series data only or cross-sectional data only. The analysis of panel data has been widely applied in the social- and behavioral-science fields. In the first part of this project, a thorough review of "Analysis of Panel Data" by Cheng Hsiao (2003), "Econometric Analysis of Panel Data" by Badi H. Baltagi (2008), and published papers on this subject is presented to give an overall introduction to panel data analysis and its methodology. Two major types of panel data models are discussed in the second part of this project: the simple regression model with variable intercept, and the dynamic model with variable intercept. For each type, fixed-effects models and random-effects models, along with the corresponding methodology, are discussed. A small case study of the cost of six U.S. airlines, conducted by researchers at Indiana University, is revisited at the end to demonstrate the implementation of panel data analysis in SAS using the PANEL procedure.
Brian Sacash, Data Envelopment Analysis in the Application of Bank Acquisitions, May 23, 2011 (Jeffrey Camm, David Rogers)
Data Envelopment Analysis (DEA) is an application of linear programming that helps determine and measure the efficiency of a particular type of system with multiple operating units that have behaviors based on the same principles. Quantifiable parameters are determined to be inputs and outputs, which creates a data set that allows comparison of efficiency across similar units. To determine an efficiency rating, these defined inputs and outputs for each decision-making unit (DMU) are parameters in a linear optimization model, which can be solved with off-the-shelf optimization tools. In this work, we use DEA to determine the efficiency of bank branches. The motivation was that the bank in the study took on a new set of branches through an acquisition and wished to determine the relative efficiency of the merged set of branches.
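In the special case of a single input and a single output, the DEA efficiency score reduces to a simple ratio comparison against the best-performing unit; the general multi-input, multi-output CCR model instead requires solving one linear program per DMU. The branch names and figures below are hypothetical.

```python
# Hypothetical branch data: one input (staff hours) and one output
# (transactions). With one input and one output, each branch's CCR
# efficiency is its output/input ratio scaled by the best ratio.
branches = {
    "A": {"staff_hours": 100, "transactions": 500},
    "B": {"staff_hours": 80,  "transactions": 480},
    "C": {"staff_hours": 120, "transactions": 480},
}

ratios = {b: d["transactions"] / d["staff_hours"] for b, d in branches.items()}
best = max(ratios.values())
efficiency = {b: r / best for b, r in ratios.items()}
print(efficiency)  # branch B defines the efficient frontier here
```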
Chaojiang Wu, Partially Linear Modeling for Conditional Quantiles, May 23, 2011 (Yan Yu, Martin Levy)
We consider the estimation problem of conditional quantiles when high-dimensional covariates are involved. To overcome the "curse of dimensionality" yet retain model flexibility, we propose two partially linear models for conditional quantiles: partially linear single-index models (QPLSIM) and partially linear additive models (QPLAM). The unknown univariate functions are estimated by penalized splines. An approximate iteratively reweighted least-squares algorithm is developed. To facilitate model comparisons, we develop effective model degrees of freedom for penalized-spline conditional quantiles. Two smoothing-parameter selection criteria, Generalized Approximate Cross-Validation (GACV) and a Schwarz-type Information Criterion (SIC), are studied. Some asymptotic properties are established. Finite-sample properties are studied by simulation. A real-data application demonstrates the success of the proposed approach. Both the simulations and the real application show encouraging results for the proposed estimators.
Ying Wang, The Use of Stated-Preference Techniques to Model Mode Choices for Container Shipping in China, May 13, 2011 (Uday Rao, Yan Yu)
This project presents a case study on the possibility of shifting containers off the road and onto intermodal coastal shipping services in China by analyzing the main determinants of mode choice. The data were collected through a mix of revealed- and stated-preference questionnaire surveys and then analyzed using the logit model; the case study was carried out on routes from Wenzhou to Ningbo. The results show that, in the decision-making process of choosing a mode for container distribution, Coastal Shipping Cost, Coastal Shipping Time Reliability, Slot Availability for High Cube Container, Road Cost, and Road Time Reliability are significant determinants.
Neelima Kodumuri, Ratings and Rankings - A Comparative Study Based on Application of the Bradley-Terry Method to Real-World Survey Data, March 11, 2011 (Norman Bruvold, Martin Levy)
Understanding individual choice behavior is of utmost importance to organizations competing in today's marketplace of fickle customer preferences. This is most evident in the crowded fast-food industry, where restaurants compete at every breakfast, lunch, and dinner to capture customers and keep them coming back for more. In this study, we look at the choices and preferences of individuals using two separate rating and ranking scales across ten fast-food restaurants on twelve dimensions such as cleanliness and quality of food. Individual choice is estimated from primary research data based on the responses of about 5000 people to two market surveys, one conducted in August 2009 and another in November 2009. The respondents were randomly asked either to rate or to rank restaurants based on their past experience. The experiment was set up as an incomplete block design. To compare how the restaurants perform against each other on each dimension based on ratings and rankings, we employed the extended Bradley-Terry method, a paired-comparison approach with an underlying logistic regression model that accommodates ties.
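A minimal sketch of the basic (tie-free) Bradley-Terry model, fitted with Zermelo's iterative algorithm; the extended variant used in the study adds a tie parameter on top of this. The restaurants and pairwise win counts below are invented for illustration.

```python
# Hypothetical pairwise-win counts among three restaurants:
# wins[i][j] = times i was preferred over j in a paired comparison.
wins = {
    "A": {"B": 7, "C": 8},
    "B": {"A": 3, "C": 6},
    "C": {"A": 2, "B": 4},
}
items = list(wins)
p = {i: 1.0 for i in items}  # initial worth parameters

# Zermelo's iterative (MM) update for the Bradley-Terry model:
# p_i <- W_i / sum_j n_ij / (p_i + p_j), then normalize.
for _ in range(100):
    for i in items:
        total_wins = sum(wins[i].values())
        denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                    for j in items if j != i)
        p[i] = total_wins / denom
    s = sum(p.values())
    p = {i: v / s for i, v in p.items()}

ranking = sorted(items, key=p.get, reverse=True)
print(ranking)  # A preferred most often in this toy data
```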
Ketan Kollipara, A Study of Scenario-Based Portfolio Optimization using Conditional Value-at-Risk, March 9, 2011 (Jeffrey Camm, Kipp Martin [Professor, Booth School of Business, University of Chicago])
Risk management is an essential part of portfolio management. After the financial crisis of the past three years, there is a need for more stringent measures to control exposure to market risk. Value at risk (VaR) has been used extensively in the financial world as a measure for quantifying risk, but it has been criticized widely in connection with the financial debacle of the past few years. Conditional value-at-risk (CVaR) can help overcome some of the serious limitations of VaR. One way to construct a portfolio is to use historical stock prices and take a scenario-based optimization approach to minimize or contain risk; including CVaR in the objective or the constraints of a portfolio-optimization problem is one such approach. The goals of this project are to understand CVaR and to observe its behavior as the target threshold for the portfolio changes. Finally, I show how the portfolio built using scenario-based CVaR optimization performed during the two years 2008-2009.
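The distinction between VaR and CVaR on a set of historical scenarios can be sketched in a few lines; the loss figures and confidence level below are hypothetical.

```python
# Hypothetical scenario losses (positive = loss), e.g. one per
# historical day; alpha is the confidence level.
losses = sorted([-2.0, -1.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0, 9.0])
alpha = 0.8

# Empirical VaR: the loss at the alpha-quantile of the scenario set.
idx = int(alpha * len(losses))
var = losses[idx]

# CVaR: the average loss in the tail at or beyond VaR. Unlike VaR, it
# accounts for how bad the losses beyond the threshold actually are,
# which is why it addresses some of VaR's limitations.
tail = losses[idx:]
cvar = sum(tail) / len(tail)
print(var, cvar)
```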
Zibo Wang, Building Predictive Models on Cleveland Clinic Foundation Data on the Diagnosis of Heart Disease by Data-Mining Techniques, March 7, 2011 (Martin Levy, Yan Yu)
Mining clinical data sets is challenging. The main objective of this report is to develop and propose data-mining techniques useful in diagnosing the presence of heart disease. Techniques such as the generalized linear model (GLM) have been widely used for quantitative analysis of clinical-trial data. In this report, we examine heart-disease data provided by the Medical Center of Long Beach and the Cleveland Clinic Foundation. In order to extract various features, we compare the performance of models built by logistic regression, a special case of the GLM in which the response variable is the presence of heart disease. Classification and regression trees (CART), an alternative methodology, are also applied. We select models using AUC (area under the ROC curve) and the misclassification rate. To assess the effect of random sampling error, we fit each model on a 90% training sample, test it on the remaining 10%, and also conduct 10-fold cross-validation.
Zhufeng Zhao, Application of Quantitative Analysis to Solving Some Real-World Business Problems, February 3, 2011 (Martin Levy, Yan Yu)
The thesis focuses on the application of quantitative analysis to solving some real-world business problems. It is composed of two projects. The first project uses some linear regression models to estimate the property tax imposed on some houses by a local government and to investigate whether the governmental property taxation is appropriate. After exploring some multiple linear regression models as well as some simple linear regression models, we have decided to develop our analysis further using simple linear regression because we desire simplicity and because we want to avoid regression coefficients with the wrong algebraic signs given by the multiple linear regression models. Residual analysis has been conducted to identify the outlying and influential observations for each simple linear regression model. On the basis of the estimated property tax given by the models, we have found that the governmental property taxation needs some correction. We also categorize the houses into three ranges according to their governmental valuation or sale prices. The comparison between the low-sale-price range and the other ranges in terms of the property tax underpaid/overpaid clearly indicates that the home owners of the low-sale-price houses are heavily taxed by the local government in an inappropriate manner. The second project uses SAS programming to manipulate the performance data of a call center that has operations in multiple sites and business areas, and to help analyze its improvement in terms of AHT (average handling time, a metric to measure the time a representative spends handling an inbound call). The AHT analysis has broken down the overall AHT improvement by each site and by each business area and thus identified the drivers of the AHT improvement at the different levels of the performance metrics.
Lili Wang, Determining Sample Size in Design of Experiments, December 1, 2010 (Martin Levy, Yan Yu)
This Research Project is a summary of the book HOW MANY SUBJECTS? -- Statistical Power Analysis in Research by Helena Chmura Kraemer and Sue Thiemann. Sample size refers to the number of subjects or participants planned to be included in an experiment or study. Sample size is not decided arbitrarily; it is usually determined using a statistical power analysis. Generally, with a larger sample size, our decision will be more accurate and there will be less error in the parameter estimate. This report includes methods of calculating sample sizes for different statistical tests; the definition, calculation methods, and illustrative examples are provided for each test, and SAS IML programs are listed for each example. The Master Tables for sample-size calculation are not provided in this Project because the SAS programs can easily be used to calculate sample size and power. This Project could be used as a reference to obtain the sample size and power for several different hypothesis tests.
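As a small illustration of the kind of calculation such tables encode, the normal-approximation sample size for a two-sample comparison of means can be computed directly. The function name and defaults are an illustrative sketch, not the book's tables.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided, two-sample
    z-test detecting a standardized mean difference (Cohen's d)."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)  # critical value for the two-sided test
    z_beta = z(power)           # quantile corresponding to the power
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

print(n_per_group(0.5))  # medium effect: 63 per group
print(n_per_group(1.0))  # large effect: 16 per group
```

The larger the effect size one wants to detect, the smaller the required sample, which is the trade-off the power tables make explicit.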
Amin Khatami, A Study of Mixed-Effects Regression Models Using the SAS and R Software, November 29, 2010 (Yan Yu, David Kelton)
In standard linear regression, we deal with models in which the residuals are normally distributed and independent across observations. In linear mixed models (LMMs), by contrast, the residuals are still normally distributed but are not assumed to be independent of each other. LMMs suit two common settings of dependent or correlated data. First, units on which the response is measured may be clustered into different groups. Second, repeated measures may be taken on the same unit over time or at different spatial points. In this project, we present a description of LMMs by building a model in the field of marketing, using a step-up model-building strategy with which we can illustrate the hierarchical structure of LMMs. We summarize published papers and textbooks on this subject to discuss mathematical notation, covariance structures, and model-building and model-selection methods in LMMs. We revisit an application of LMMs in the field of dentistry, conducted by researchers at The University of Michigan, and perform the same analysis on the data using SAS and R.
Carlos Alberto Isla-Hernández, Simulating a Retail Pharmacy: Modeling Considerations and Options, November 24, 2010 (David Kelton, Alex Lin [UC College of Pharmacy])
In a sector as competitive and operationally complex as chain drug stores, simulation modeling has become a valuable and powerful tool for making better management decisions. The literature on simulation of retail pharmacies is scarce, though. We recently conducted an extensive research project for an important chain drug store in which discrete-event simulation was one of the key tools. The objective of this Research Project is to provide future researchers with knowledge and suggestions for building reliable and useful simulation models of a retail pharmacy. Many of the topics analyzed will also be useful to researchers modeling other healthcare settings such as hospital facilities or emergency rooms. The first part of this study compiles the main challenges we encountered in building an accurate simulation model of a retail pharmacy, analyzing important aspects such as deciding what the entities should be, identifying the resources in the model, and managing different levels of priority throughout the model. The second part identifies some useful output statistics and describes how to produce them in the simulation model. Finally, the third part defines and analyzes three ways in which staff behavior can affect the main performance statistics at the pharmacy: personal time, workload effects, and fatigue effects. We describe how to include each of these in the simulation model.
Whitney Brooke Gaskins, Simulating Meal Pattern Behavior in the Visual Burrow System, November 22, 2010 (David Kelton, Jeffrey Johnson [UC Department of Biomedical Engineering])
America is home to some of the most obese people in the world: a staggering 33% of American adults are obese, and obesity-related deaths have climbed to more than 300,000 a year, second only to tobacco-related deaths. High-fat diets are a contributor to these conditions, and, like humans, rodents show a preference for them. To examine the eating patterns and behavior of rats and help identify the causes of obesity, a study was conducted at the University of Cincinnati (Melhorn et al. 2010b). The time has now come to expand on this research with simulation. Simulation has been used extensively in manufacturing; however, there are many untapped opportunities in health care. This project simulates the Visual Burrow System, a controlled habitat used to monitor the behavior of rats, to help with future experimentation, examining parameters such as feeding frequency and number of meals. With the model, researchers can examine the effects of physiological and environmental factors on the test subjects without physically changing a subject's environment, saving much time and money. Our Virtual Visual Burrow System (VVBS) model's results agree with those from the Visual Burrow System, demonstrating the validity of our simulation and suggesting that simulation modeling can provide an important adjunct to traditional physical experimentation: a much faster and cheaper way to explore initially a wide variety of experimental scenarios and to suggest the best candidates for more intensive follow-up physical experiments.
Michael D. Bernstein, A Transportation-Model Approach to Empty-Rail-Car-Scheduling Optimization, August 26, 2010 (Jeffrey Camm, Michael Fry)
The daily operational task of assigning empty freight cars to serve customer demand is a complex, detailed problem that affects customer service levels, transportation costs, and the operations of a railroad. As railroads grow larger and traffic levels increase, the problem becomes ever more difficult and ever more important. This paper presents an optimization model developed with the open-source COIN-OR OSI project. Matrix generation is done using VBA algorithms with a Microsoft Access interface. Data sources and outputs are designed to connect easily with a "real-world" platform. By calculating costs and feasible arcs outside of the optimization stage, a simple transportation problem is created with fast run times and implementable results. Car availability, transportation costs, off-schedule delivery penalties, and customer priorities are all taken into consideration.
Valentina Pilipenko, Ph.D., Improving the Calling Genotyping Algorithm Using Support Vector Machines, August 20, 2010 (Martin Levy, Lisa Martin [UC College of Medicine])
Genome-wide single nucleotide polymorphism (SNP) chips provide researchers the opportunity to examine the effects of hundreds of thousands of SNPs in a single experiment. To process this information, software packages have been designed to convert laboratory results, fluorescent signal intensities, into genotype calls. While this process works well for most SNPs, a small portion, equating to thousands of SNPs, is problematic. This is a problem because these SNPs are removed from analysis and thus their disease-causing role cannot be explored. Therefore, the objective of this study was to determine whether statistical methods, namely support vector machines (SVMs) and regression trees, could be used to improve genotyping. To accomplish this objective, we used 664 individuals from the Cincinnati Children's Medical Center Genotyping Data Repository who had Affymetrix 6.0 genotyping data available. We then evaluated the performance of SVM and regression-tree analysis with respect to reduction of missingness and correction of the batch effect. We found that neither method improved missingness. However, SVM consistently resolved the batch effect. As batch effects are a serious issue for genome-wide studies, the ability to resolve this issue could have substantial impact on future gene-discovery studies.
Vikram Kirikera, Quantitative Analysis of Highway Crack Treatment - A Case Study, August 20, 2010 (Martin Levy, Uday Rao)
This project is a case study of analytical approaches to assess the effectiveness of crack sealing on highway pavements for the Ohio Department of Transportation. The study determines the viability of crack sealing under different pavement conditions and quantifies the improvement in pavement service life resulting from the crack-seal process. The primary focus is on analyzing the effectiveness of crack sealing on two types of pavements (flexible and composite) and two types of surface layers (gravel and limestone) based on a Pavement Condition Rating (PCR) measure. We conduct a trendline analysis of different pavement-performance measures to determine the types of pavements that are receptive to crack sealing and to find the optimum PCR range where crack sealing is effective. We find that crack sealing increased the service life of pavements by 1.5 to 1.7 years, that the maximum percentage improvement in PCR occurred in the prior-PCR range of 66% to 80%, and that crack sealing was most effective in composite pavements.
Ying Yuan, A Comparison of the Naive Bayesian Classifier and Logistic Regression in Predictive Modeling, August 13, 2010 (Yan Yu, Uday Rao)
The naive Bayesian classifier (NBC) is based on Bayes' theorem and the attribute-conditional-independence assumption. Despite its simple structure and unrealistic assumptions, the NBC competes well with more sophisticated methods in terms of classification performance, and is remarkably successful in practice. This project studies the algorithm of the NBC and investigates how it can be applied successfully in practice to predictive modeling. We discuss and compare several different approaches and techniques in building an NBC. By fitting home-equity-loan data, we demonstrate how to develop the NBC in SAS and propose ways that can improve its performance. By comparing the model results from the NBC with those from a logistic regression, we conclude that the NBC performs as well as logistic regression in correctly classifying targeted objects, and its performance is slightly more robust than logistic regression. Finally, we discuss the limitations of the NBC in application.
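A minimal pure-Python sketch of the NBC's scoring rule on toy loan-style records; the features and data are invented, and the project itself builds the classifier in SAS on home-equity-loan data.

```python
from math import log

# Toy training data (hypothetical): each record is (features, label).
train = [
    ({"owns_home": 1, "delinquent": 0}, "good"),
    ({"owns_home": 1, "delinquent": 0}, "good"),
    ({"owns_home": 0, "delinquent": 1}, "bad"),
    ({"owns_home": 0, "delinquent": 0}, "good"),
    ({"owns_home": 1, "delinquent": 1}, "bad"),
]

def nbc_predict(x):
    labels = {c for _, c in train}
    scores = {}
    for c in labels:
        rows = [f for f, lab in train if lab == c]
        # Log prior plus the sum of log conditional likelihoods, with
        # Laplace (add-one) smoothing to avoid zero probabilities; the
        # sum reflects the attribute-conditional-independence assumption.
        score = log(len(rows) / len(train))
        for feat, val in x.items():
            match = sum(1 for f in rows if f[feat] == val)
            score += log((match + 1) / (len(rows) + 2))
        scores[c] = score
    return max(scores, key=scores.get)

print(nbc_predict({"owns_home": 0, "delinquent": 1}))
```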
Mingying Lu, Modeling the Probability of Default for Home-Equity Lines of Credit, August 11, 2010 (Martin Levy, Jeffrey Camm)
Credit-risk predictive modeling, as a tool to evaluate the level of risk associated with applicants or customers, has gained increasing popularity in financial firms such as banks, insurance companies, and credit-card companies. The new Basel Capital Accord provides several alternatives for banks to calculate economic capital. The advanced internal-ratings method allows a bank to develop its own models to quantify the three components of expected loss: probability of default (PD), exposure at default, and loss given default. In this research, a model of PD for home-equity lines of credit is developed. The PD model estimates the likelihood that a loan will not be repaid and will therefore fall into default within 12 months. More than 50 predictors were considered, including account-application data, performance data, credit-bureau data, and economic data. The response variable is binary (good vs. bad). A weight-of-evidence transformation was introduced for the numerical variables, and logistic regression was applied to formulate the model. The model was validated on both holdout data and out-of-time data. Kolmogorov-Smirnov statistics and the receiver operating characteristic (ROC) curve show the strong predictive power of the model, while the system-stability index and attribute profiling demonstrate its stability.
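The weight-of-evidence transformation maps each bin of a numeric predictor to the log ratio of its share of goods to its share of bads; a hypothetical sketch with invented bin counts:

```python
from math import log

# Hypothetical binning of a numeric predictor with good/bad counts per bin.
bins = {
    "low":  {"good": 100, "bad": 40},
    "mid":  {"good": 300, "bad": 30},
    "high": {"good": 100, "bad": 10},
}

total_good = sum(b["good"] for b in bins.values())
total_bad = sum(b["bad"] for b in bins.values())

# WoE = ln( %good in bin / %bad in bin ): positive WoE means the bin is
# skewed toward good accounts, negative toward defaults. The transformed
# variable then enters the logistic regression in place of the raw value.
woe = {
    name: log((b["good"] / total_good) / (b["bad"] / total_bad))
    for name, b in bins.items()
}
print({k: round(v, 3) for k, v in woe.items()})
```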
Guo Jiang, A Test of Stock Portfolio Selection using Scenario-Based Mixed Integer Programming, July 14, 2010 (Jeffrey Camm, Martin Levy)
Generally, in the portfolio-selection problem, the decision maker simultaneously considers conflicting objectives such as rate of return, liquidity, and risk. The main goal of this project is to introduce a decision-support method for identifying a robust stock portfolio that will meet the differing requirements of investor types (risk-neutral, risk-preferring, risk-averse), using the AMPL optimization software. Specifically, the method helps the decision maker narrow the choices by providing the variance corresponding to the Markowitz model, downside risk, probability of being within k% of optimal, the return and cost values corresponding to the best worst-case scenario, and value at risk; the decision maker can then choose the appropriate model for a specific case. The project aims to develop a concept that can aid the fund manager in making decisions when parameters in the optimization model are stochastic. The final call in such cases is subjective, and a "good" decision depends on the choices of the decision maker.
Surya Ghimire, Forest-Cover Change: An Application of Logistic Regression and Multi-Layer Neural Networks to a Case Study from the Terai Region of Nepal, July 2, 2010 (Yan Yu, Uday Rao)
Forest-cover change has been taking place in the Terai Region of Nepal for the past two decades due to accelerated urbanization and population growth. This project investigates forest-cover change in the region during the period 1989-2005, with the combined use of remote-sensing satellite images, geographic information systems (GIS), and data-mining techniques. The results indicate a tremendous loss in forest cover: almost 10,438 hectares (10.32% of the region) turned into agricultural land and built-up area in the last 16 years. The study finds eight explanatory variables to be important for deforestation in the region: distance to the settlement, topography with different slopes, land tenure with no established title, collective control of land, household size, livestock unit, community forestry, and presence of reforestation programs. The study further demonstrates that the integration of GIS, remote sensing, and mathematical-modeling approaches is beneficial in analyzing and predicting forest-cover change in the region.
Timothy Murphy, Investigating Break-Point Analysis as a Predictor of Bankruptcy Risk, June 30, 2010 (Yan Yu, Martin Levy)
Previous studies in bankruptcy prediction have used models such as discriminant analysis and logistic regression to estimate a firm's risk of bankruptcy, using selected financial ratios from discrete moments in the past as predictor variables. These methods yield fairly effective predictions of bankruptcy risk, but ample misclassification still exists. One possible way to improve the misclassification rate is to use the rate at which a firm's financial ratios are changing as an additional predictor in these models. This paper summarizes investigations into this hypothesis using Bayesian change-point (or "breakpoint") analysis, and discusses future paths of study. The subject is of interest both for refining bankruptcy-prediction methods and for lending support to, or withdrawing it from, the efficient-market hypothesis.
Varun Mangla, Bankruptcy Prediction Models: A Comparison of North America and Japan, May 24, 2010 (Yan Yu, Uday Rao)
This study formulates and compares North American and Japanese bankruptcy prediction models using logistic regression, linear discriminant analysis, and quadratic discriminant analysis. These models are used to find similarities between North America and Japan with regard to bankruptcy prediction. Model comparisons are made on the basis of the cost of misclassification, with a case study of two scenarios. The scenarios differ in the cost of misclassifying a bankrupt company as non-bankrupt, with the chosen cost values guided by relevant previous results; they help illustrate the importance of the cost function in the prediction models. Variable-selection techniques are further used to identify important variables for bankruptcy prediction within the chosen countries. Data for this project are obtained from the COMPUSTAT North America and Global databases, and ten financial variables are adopted to build the models. This work may help investors untangle the intricacies behind corporate investments and better prepare them to make judicious decisions when investing in North American and Japanese firms across national boundaries.
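The role of the misclassification-cost function in such comparisons can be sketched in a few lines: pick the classification cut-off that minimizes total cost under a given cost ratio. The scores and labels below are invented stand-ins for model output (1 = bankrupt), not COMPUSTAT data.

```python
# Cost-sensitive choice of a classification cut-off, as in the abstract's
# two cost scenarios. Scores/labels are hypothetical model output.
scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.55, 0.70, 0.85, 0.90, 0.95]
labels = [0,    0,    1,    0,    0,    1,    0,    1,    1,    1]

def total_cost(cutoff, c_fn, c_fp):
    # c_fn: cost of calling a bankrupt firm healthy (usually the larger)
    # c_fp: cost of calling a healthy firm bankrupt
    cost = 0
    for s, y in zip(scores, labels):
        pred = 1 if s >= cutoff else 0
        if y == 1 and pred == 0:
            cost += c_fn
        elif y == 0 and pred == 1:
            cost += c_fp
    return cost

def best_cutoff(c_fn, c_fp):
    candidates = [i / 100 for i in range(1, 100)]
    return min(candidates, key=lambda c: (total_cost(c, c_fn, c_fp), c))

# When missing a bankruptcy is 10x as costly, the optimal cut-off drops.
print(best_cutoff(10, 1), best_cutoff(1, 1))
```

The same logic explains why the two scenarios in the study can favor different models: the cost ratio moves the operating point.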
Deepsikha Saha, Comparison between Stepwise Logistic Regression on the Predictor Variables and Logistic Regression using the Factors from Factor Analysis, May 21, 2010 (Martin Levy, Norman Bruvold)
Identifying customers who are more likely to respond to a product offering is an important issue in direct marketing, where data mining has been used extensively to identify potential customers for a new product (target selection). The purpose of this research project is to identify the customers most likely to respond to a catalog mailing offer. Using historical customer purchase data, a predictive response model is built to estimate the probability that a customer will respond. Two data-mining approaches are used: (1) stepwise logistic regression on the predictor variables, and (2) factor analysis to generate factors from the predictor variables, followed by logistic regression on those factors. The two logistic regression models are then compared.
Yuhui Qiu, Payback Study for Residential Air Conditioning Load Reduction Program, April 23, 2010 (Martin Levy, Yan Yu, Don Durack [Duke Energy])
Many utility companies implement direct load control programs to actively shift load from peak to non-peak periods and to add a level of grid security through the ability to reduce the distribution grid's load in case of equipment failures or excessive electrical usage. A large portion of energy is used by HVAC systems. One of the direct load control programs at Duke Energy cycles residential customers' air conditioners during peak electric load demand (normally on a hot day) for a certain period (2-6 hours) to reduce demand during these peak hours. Previous studies show that A/C usage will often increase after the control hours, compared to what would have occurred without the control event (frequently termed payback or snapback). This project used A/C duty-cycle data collected from a randomly selected research group in 2007. A Tobit duty-cycle model and a fixed-effects panel-data model were developed to quantify the payback effects for the event days across two cycling strategies, and the net energy benefits from the residential air-conditioning cycling program, including both the initial load reductions and the rebound amount, were calculated. The results from the two regression models were compared to investigate whether the cycling program reduces the overall daily kWh consumed by customers or simply shifts usage to non-cycling hours after the cycling interruptions are released.
Hasnaa Agouzoul, Green Driveway Survey: A Consumer Research Study Based on Discrete Choice Modeling, March 11, 2010 (David Curry, Yan Yu)
The objective of this project is to evaluate the willingness of homeowners to choose an environmentally friendly surface for paving their driveways at home. This is a full research study with survey development, administration, data collection, and analysis. The new pavement alternative is a permeable surface that allows rainwater and snowmelt to seep through, thus reducing the amount of water, called storm-water runoff, flowing into a city's sewer system. This is particularly important during heavy storms, when sewers tend to overflow and cause floods. Additionally, storm-water runoff often carries pollutants and sediments that may end up in rivers and streams, polluting local water supplies. The purpose of this research is: (1) to evaluate the importance of selected pavement attributes in shaping a homeowner's decision to buy (or not buy) a new driveway surface; the chosen attributes are impact on water quality, impact on the environment, installed cost, and possible financial incentives from the government; (2) to determine the relationship, if any, between homeowner demographics and choice of driveway surface; and (3) to develop a model to predict the driveway-surface choice as a function of homeowner demographics. Phase one of the project involved designing and administering a survey to homeowners in Ohio. Phase two analyzed the collected data using a latent-class approach to discrete-choice modeling.
Yin Li, Using Data-Mining Techniques to Build Predictive Models and to Gain Understanding of Current Medical Health Insurance Status, March 5, 2010 (Martin Levy, Yan Yu)
Generalized linear models (GLMs) have been widely used for the quantitative analysis of social-science data. In this report, we examine adults' choice of medical health insurance based on the China Health and Nutrition Survey (CHNS). In many data-mining applications, the most important variables must be selected before predictive models are used. In SAS/STAT, variable-selection methods are provided by the PROC LOGISTIC procedure, among them backward, forward, and stepwise selection. We review the stepwise method and compare it with a rank-of-predictors method based on the idea of bootstrapping. We compare the performance of the model built by binary logistic regression with the classification-tree methodology, using ROC curves and the area under the curve (AUC) to identify the better-fitting model. We also introduce an alternative method, a penalized-likelihood approach, to deal with the challenge of complete separation. Finally, splitting the data into 90% training and 10% testing sets, we conduct 10-fold cross-validation to test the effect of sampling error on each model.
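The ROC/AUC comparison used to pick between the logistic model and the classification tree can be reproduced with the rank form of the AUC: the probability that a randomly chosen positive case outscores a randomly chosen negative one. The scores below are invented, not CHNS model output.

```python
# AUC via the rank (Mann-Whitney) identity: count pairwise "wins" of
# positive-case scores over negative-case scores; ties count as half.
def auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical scores from two competing models on the same eight cases.
logit_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
tree_scores  = [0.9, 0.3, 0.7, 0.6, 0.8, 0.4, 0.2, 0.1]
labels       = [1,   1,   1,   0,   1,   0,   0,   0]

print(auc(logit_scores, labels), auc(tree_scores, labels))
```

The model with the larger AUC separates the classes better across all cut-offs, which is why the study uses it as the comparison criterion.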
Huiqing Li, Statistical Analysis of Knee Bracing Efficacy in Off-road Motorcycling Knee Injuries, March 5, 2010 (Martin Levy, Yan Yu)
The use of prophylactic knee bracing (PKB) in off-road motorcycling is common, yet the effectiveness of PKB in preventing injuries remains controversial. An internet-based survey was conducted to pursue this issue further. The purpose of this paper is to explore and quantify the effectiveness of wearing a knee brace versus not wearing one in preventing motorcycling knee injuries, by providing statistical analysis of the questionnaire data. Four logistic analyses are presented that statistically characterize the association of a number of risk factors with the odds of a motorcycle rider suffering a knee injury. All four models in some way involve the factors AGE, BRACE ("Do you wear a knee brace?"), and a particular brand of knee brace, Air Townsend. Standard statistical diagnostic measures deem all models acceptable. Logistic regression for each of the four types of injuries (ACL, MCL, meniscus, and tibia fracture) was conducted to explore the association between covariates and the likelihood of a specific type of knee injury, and significant factors for each type were identified. A rider with 5-10 years of riding experience has a greater chance of incurring an ACL injury. In general, wearing a knee brace is helpful in preventing an MCL injury. Riders in AGE3 (25-30) have a higher chance of a meniscus injury, and riders with less than 5 years of experience have a larger likelihood of a tibia fracture. In addition, several brace brands were compared regarding their degree of protection against particular injury types: wearing the Air Townsend brace increases the likelihood of an ACL injury and is associated with a higher chance of a meniscus injury than the EVS and Asterisk brands, while the Asterisk brand has, to some degree, a negative effect on the likelihood of a tibia-fracture injury.
Yi-Chin Huang, Optimal Vehicle Routing for a Pharmacy Prescription-Delivery Service, March 5, 2010 (Michael Fry, Alex Lin)
The objective of this project is to determine optimal vehicle routing for prescription home delivery for a local chain of pharmacy stores, Clark's Pharmacy, in order to reduce its delivery costs. Currently, the chain delivers prescriptions from seven sites using seven vehicles. Preliminary analysis indicates that the prescription home-delivery cost depends mainly on the number of vehicles needed for delivery and the aggregate distance traveled by the vehicles. Three scenarios are examined to determine the best delivery policy: (1) Decentralized Solution, representing the current delivery system using seven vehicles; (2) Hybrid Solution, three separate service areas each served by one vehicle; and (3) Centralized Solution, one service area served by three vehicles. A cost analysis is conducted to determine the number of vehicles and the routing options that provide the lowest costs. Historical delivery-point locations and vehicle-related costs are collected from the company. A Travelling Salesperson Problem (TSP) model is solved to determine the shortest delivery tour, and a max-flow model is used to identify violated subtour-elimination constraints. A combination of SAS and AMPL is employed to manipulate the data and solve the models. The results indicate that the Hybrid Solution is the most effective strategy for Clark's Pharmacy.
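At toy scale the shortest-tour computation can simply be brute-forced, which shows what the TSP model is computing; the project's actual AMPL approach, with max-flow detection of violated subtour-elimination constraints, matters once enumeration becomes infeasible. The depot and delivery coordinates below are invented, not Clark's Pharmacy data.

```python
from itertools import permutations
from math import dist  # Python 3.8+

# Brute-force shortest delivery tour from a depot through four stops.
depot = (0, 0)
stops = [(2, 3), (5, 1), (6, 4), (1, 5)]

def tour_length(order):
    # Total Euclidean length of depot -> stops in 'order' -> depot.
    path = [depot] + list(order) + [depot]
    return sum(dist(path[i], path[i + 1]) for i in range(len(path) - 1))

best_order = min(permutations(stops), key=tour_length)
print(best_order, round(tour_length(best_order), 3))
```

With n stops this enumerates (n-1)! tours, which is exactly why real instances need an integer-programming formulation with subtour elimination instead.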
Parama Nandi, Response Modeling in Direct Marketing, February 25, 2010 (Martin Levy, Jeffrey Camm)
In direct marketing, identifying customers who are more likely to respond to a product offering is a central problem, and predictive modeling has been used extensively for it. Using historical purchase data, a predictive response model is developed with data-mining techniques to estimate the probability that a customer will respond to a promotion or offer. The purpose of this thesis is to build a model for identifying targets for a future mailing campaign. Logistic regression, a predictive modeling technique, is used to build a response model for targeting the right group of members. The dataset consists of information on donors to the Paralyzed Veterans of America Fund from past fund-raising mailing campaigns; the data (10,000 donors) were obtained from the KDD Cup competition database. We first build the predictive model using donors' historical donation data (behavioral variables) together with demographic and census data. The response-modeling procedure consists of several steps, and one must deal with issues such as determining the inputs to the model (feature selection) and handling missing values. The project addresses these issues and proceeds to the final model-building and model-evaluation phases. Response modeling has become a key factor in direct marketing. In general, it has two stages: the first identifies respondents from a customer database, while the second estimates the purchase amounts of the respondents. This paper focuses on the first stage, where a classification problem is solved. As is common, there are a large number of predictors, since companies and other organizations are able to collect a large amount of information about customers.
However, many of these predictors contain little or no useful information, so the ability to exclude redundant variables from the analysis is important. Many of the predictors have missing values; some are continuous and some are categorical. Of the categorical predictors, some have a large number of levels with small exposure, that is, a small number of observations at a given level. For the continuous variables, the distribution of the observations can have extreme values or may take only a small number of unique values. Further, there is potential for significant interaction between different predictors. Finally, the responses are often highly unbalanced; for instance, only 5% of the observations were positive, and this low response rate is typical of direct-marketing datasets. All these factors need to be considered in order to produce a satisfactory model. Since irrelevant or redundant features result in poor model performance, feature selection was performed to determine the inputs to the model, in two steps: exploratory data analysis and stepwise selection.
Omkar Saha, Design and Develop Cincinnati Children's Scheduling System for use in the Optimization of Hospital Scheduling, October 19, 2009 (Kipp Martin, Michael Magazine, Craig Froehle)
Escalating health care costs continue to increase the demand on hospital administrators for greater efficiency, creating tighter constraints on doctors and human resources. Cincinnati Children's Hospital Medical Center (CCHMC) is seeking to address its existing scheduling inefficiencies while also addressing the additional resource demands of a new satellite location. CCHMC currently uses a manual scheduling method based on legacy schedules, and each specialty maintains its own schedule. As part of an ongoing project with UC MS - Business Analytics faculty, several attempts have been made to optimize the scheduling process across all locations and specialties. This project attempts to consolidate the entire request-making process for the different specialties irrespective of location, generate an optimal schedule satisfying all the requests, and report the approved schedule by different criteria in a common format. This would help achieve greater efficiency in the use of administrators, doctors, and clinical and surgical spaces.
Edmund A. Berry, National Estimates of the Inpatient Burden of Pediatric Bipolar Disorder in an Inpatient Setting. An Analysis of the 2003 and 2006 Kids Inpatient Databases (KID) Data, September 25, 2009 (Martin Levy, Pamela Heaton)
Bipolar disorder (BPD) is a debilitating, recurrent, chronic mental illness characterized by cycling states of depression, mania, hypomania, and mixed episodes. The disease, generating tremendous societal and economic impact, is associated with a high degree of morbidity and mortality and is particularly costly and debilitating in pediatric patients. The objectives of this study were (1) to calculate national estimates of the annual burden of inpatient hospitalizations of children and adolescents with BPD, where burden is measured specifically in terms of charges, cost, and length of stay; (2) to describe and compare the burden across various demographic characteristics, hospital characteristics, and key comorbidities associated with BPD; and (3) to determine the independent effects of these demographic, hospital-type, and comorbidity factors on hospitalization costs. To accomplish these objectives, we examined data from the 2003 and 2006 Kids' Inpatient Database (KID). National estimates of the means and standard errors of cost, charges, and length of stay for inpatient pediatric BPD were computed using the complex sample design of the 2003 and 2006 KID data, which contain weighting, stratification, and clustering variables. Two ordinary least squares regression models, using the 2003 and 2006 KID data, were used to determine key predictors of cost along demographic characteristics, hospital characteristics, and comorbidities. Finally, the Chow test was used to determine whether the underlying regression models estimated for 2003 and 2006 were the same.
Deepankar Arora, A Decision Support Methodology for Distribution Networks in a Stochastic Environment using Mixed Integer Programming in Spreadsheets, September 11, 2009 (Jeffrey Camm, Kipp Martin)
In an effort to reduce distribution costs from distribution centers to customer locations, a company is considering opening a set of five distribution centers to serve all of its customer locations. The main problem the company faces is demand uncertainty at the customer locations, which can have an adverse effect on its transportation costs. The main goal of this project is to introduce a decision-support methodology, implemented with VBA (Visual Basic for Applications) in Excel, for identifying a robust distribution network that minimizes transportation and handling costs under stochastic demand. Specifically, the methodology helps the decision maker narrow the choices by providing cost distributions corresponding to a candidate solution; an efficient frontier; the cost value that corresponds to the best worst-case scenario; value at risk (VaR); and the expected loss below a certain value. The project aims to develop a concept that can aid the decision maker when parameters in an optimization model are stochastic. The final call in such cases is subjective, and a "good" decision depends on the decision maker's choice, but this methodology gives the decision maker tools to facilitate and inform the decision-making process.
Bethany Harding, Safety Stock Level Analysis for Replenishment Planning using Actual-to-Forecast Demand Ratios, September 4, 2009 (Uday Rao, Amitabh Raturi)
Senco Brands, Inc., currently stocks approximately 10,000 items at one or all of its domestic distribution locations. Planning and replenishment for these items is performed using a basic MRP planning system. Forecasts are created for each item and prorated to each distribution center based on its historic share of total corporate demand. Desired ending inventory levels are set using a safety-time factor measured in weeks: in each weekly period of the planning horizon, desired ending inventory is calculated by accumulating the demand forecasts over the number of contiguous future weeks specified by the safety time, and the current model requires a safety time defined for each item at each distribution center. Using the company's data, an Excel-based tool was developed to: (1) recreate the MRP planning system's approach to setting planned order releases using the desired-ending-inventory approach and the input safety time; (2) apply an actual-to-forecast demand ("A/F") ratio approach to determine the probability distribution of demand over the planning horizon; (3) simulate various scenarios for future demand; (4) use the simulated demand scenarios to determine the performance of a chosen safety time (or desired ending inventory) using key performance indicators such as expected customer fill rate, inventory investment, and working capital; and (5) calculate an optimized safety time that achieves satisfactory performance (e.g., a 95% fill rate), as determined by the company. Various applications of the Excel-based tool are illustrated.
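The A/F ratio idea in steps (2)-(4) can be sketched as follows: historical actual-to-forecast ratios form an empirical distribution, which is resampled to simulate future demand and estimate the fill rate delivered by a candidate stock level. All numbers below are illustrative, not Senco data.

```python
import random

# Hypothetical demand history against a flat forecast of 100 units/week.
history_actual   = [95, 120, 80, 110, 105, 70, 130, 100]
history_forecast = [100] * 8
af_ratios = [a / f for a, f in zip(history_actual, history_forecast)]

def simulate_fill_rate(forecast, stock, n=10000, seed=7):
    # Resample A/F ratios to generate demand scenarios, then measure the
    # volume fill rate achieved by the given stock level.
    rng = random.Random(seed)
    filled = 0.0
    demand_total = 0.0
    for _ in range(n):
        demand = forecast * rng.choice(af_ratios)
        demand_total += demand
        filled += min(demand, stock)
    return filled / demand_total

# More stock (i.e., more safety time) buys a higher expected fill rate.
print(round(simulate_fill_rate(100, 100), 3),
      round(simulate_fill_rate(100, 140), 3))
```

Step (5) then amounts to searching over stock levels (or safety times) for the cheapest one whose simulated fill rate clears the 95% target.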
Andreas Kuncoro, Empirical Study of Supply Chain Disruptions' Impact on the Financial and Inventory Performance of Manufacturing and Non-manufacturing Firms, August 28, 2009 (Amitabh Raturi, Uday Rao)
Supply chain disruptions are unanticipated events in the supply chain, caused by internal or external factors, that make a firm deviate significantly from its original plans and consequently affect its performance. This work assesses the relationship between supply chain disruptions and overall firm performance as measured by financial (return on assets and leverage) and operational (inventory turnover) metrics. We first chronicle 75 supply disruptions in 47 firms as reported in the business press over a three-year period (2005-2007). We then categorize these disruptions by causal factor as internally versus externally caused, and across several origin sources. The performance metrics are then observed from Compustat quarterly data from one year before through one year after the disruption announcements. The impact of such disruptions is first analyzed by firm size, firm type (manufacturing versus non-manufacturing), reason, and responsibility. In multivariate analysis of covariance tests, firm size showed a significant positive association with overall firm performance, while a disruption event announcement showed a significant negative association with overall performance. Consistent with previous studies, our findings indicate that supply chain disruptions negatively impact both financial and operational performance, and that firm size significantly moderates this impact. One year after the event announcement, the firms are able to recover their performance.
James Andrew Kirtland III, Simulation Efficiency of the Finitized Logarithmic Power Series, August 27, 2009 (Martin Levy, David Kelton)
It is oftentimes appropriate or desirable to limit a distribution's support, either because of the actual environment an analyst is trying to model or to increase the efficiency of simulating random variates from a model. This can be done using traditional truncation; however, truncation has undesired and often unpredictable effects on the moments of the parent distribution. Finitization is a method of limiting a power-series distribution's support while preserving its moments up to the order of finitization, n. The logarithmic power-series distribution is used to discuss properties of theoretical, truncated, and finitized distributions. Four algorithms designed to generate random variates from a theoretical logarithmic power-series distribution are compared to an alias method designed to generate random variates from a finitized logarithmic power-series distribution. The variates created by these four algorithms, as well as by the alias method, are tested against the theoretical logarithmic power series to check whether the moments hold. Finally, a horse race is used to test whether the finitized logarithmic distribution with an alias method generates random variates more efficiently than the four other algorithms based on an infinitely supported logarithmic distribution.
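The alias method at the heart of the comparison can be shown in a few lines: Vose's construction gives O(1) sampling from any finite pmf. As a self-contained stand-in, the sketch below truncates a logarithmic pmf at n = 5 and renormalizes; note that this simple renormalization is not finitization (it does not preserve the parent moments), so it only illustrates the alias sampling machinery.

```python
import random
from math import log

# Logarithmic pmf p(k) = -theta^k / (k * ln(1 - theta)), truncated to
# k = 1..5 and renormalized (a stand-in for a finitized pmf).
theta, n = 0.5, 5
raw = [-theta ** k / (k * log(1 - theta)) for k in range(1, n + 1)]
pmf = [p / sum(raw) for p in raw]

def build_alias(pmf):
    # Vose's alias table: each cell holds a probability and an alias index.
    m = len(pmf)
    prob, alias = [0.0] * m, [0] * m
    scaled = [p * m for p in pmf]
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l
        scaled[l] -= 1.0 - scaled[s]
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:
        prob[i] = 1.0
    return prob, alias

def sample(prob, alias, rng):
    # O(1) per draw: pick a cell, then keep it or jump to its alias.
    i = rng.randrange(len(prob))
    return i + 1 if rng.random() < prob[i] else alias[i] + 1  # support 1..n

rng = random.Random(42)
prob, alias = build_alias(pmf)
draws = [sample(prob, alias, rng) for _ in range(20000)]
print(draws.count(1) / 20000, pmf[0])
```

The constant per-draw cost is what makes the alias method a natural competitor to the four infinite-support algorithms in the horse race.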
Shannon Peterson, Development of a Long Range Capacity and Purchasing Plan for a Manufacturing Environment, August 24, 2009 (Jeffrey Camm, Uday Rao)
Long-range capacity planning is an essential part of business planning. It can be complicated by seasonality of products, varying material pricing plans and supplier capacities, criticality and substitution of raw materials, and multiple production sites and bills of materials. This project develops a flexible tool that produces an optimal, high-level, long-range production schedule and purchasing plan to satisfy customer demand and identify potential outages.
Ndanatsiwa Anne Chambati, Locating an Optimal Site for a New Natorp's Garden Center, August 21, 2009 (Michael Magazine, Uday Rao)
A well-known aphorism states that "the most important attributes of stores are location, location, and location." Research on optimal store location has grown rapidly in the last decade. Most of it has been undertaken by marketing researchers, urban geographers, and economists, with applied mathematicians recently entering the field through the development of algorithms and mathematical models applicable to location problems. At the mathematical level the problem is abstract and exact, removed from the practical concerns of the real estate developer or marketing expert. Natorp is a family-owned business that has been around since 1916. It currently has two Garden Center locations, a nursery, and landscaping services, and would like to open an additional Garden Center in the Ohio-Kentucky-Indiana (OKI) region; the question is where the optimal location would be. First, we review the current literature on optimal store location and identify the factors most important for Natorp to consider in the expansion. Next, we evaluate each of the eight counties in the OKI region using a multi-factor site-rating system and identify potential sites for the new Garden Center. These potential sites are evaluated on population projections over the next 30 years, median household income, median home value, and proximity to competitors.
Ashutosh Mhasekar, Application of Statistical Procedures to Target Specific Segments for Upgrading Marginally Sub-par Members to Rewards-eligible Level in a Retail Loyalty Environment, August 19, 2009 (Michael Magazine, Uday Rao, Marc Schulkers)
The retail industry has become extremely competitive, with loyalty programs constantly used to monitor customer behavior and engage customers for incremental sales and revenue. Retailer R runs a points-based loyalty program in which members can earn reward certificates that are good toward future purchases. Given the current economy and stiff competition, the retailer sends targeted bonus offers to members who need additional points to earn a reward certificate. In this project we use various statistical tools to efficiently target those members and to maximize certificate redemption, which results in incremental sales for the company. A test-and-control-group approach is employed to monitor and measure the incremental behavior of this “bonused” group during both the promotional period and the post-period. Using the targeted segmentation approach, an increase in redemption rate was noted, and there was a significant increase in revenue during the promotional period without impacting post-period sales.
Shaonan Tian, Data Sample Selection Issues for Bankruptcy Prediction, August 12, 2009 (Yan Yu, Martin Levy)
Bankruptcy prediction is of paramount interest to both academics and practitioners. This paper devotes special care to an important aspect of bankruptcy prediction modeling: the data-sample-selection issue. We first explore the effect of different sample-selection methods by comparing out-of-sample predictive performance in a Monte Carlo simulation study under the logit regression model. The simulation study suggests that if forecasting the probability of bankruptcy is of interest, the complete-data sampling technique provides more accurate results. However, if a binary bankruptcy decision or a corporate rating is desired, the choice-based sampling technique may still be suitable. In particular, within the logit regression context, a simple remedy can be applied to adjust the cut-off probability so that the choice-based and complete-data sampling techniques display the same explanatory power in forecasting the bankruptcy classification. We also find that appropriate adjustment of the cut-off probability is complementary when different misclassification costs are taken into account. Finally, we contextualize the proposed recommendations by applying them to an updated bankruptcy database. We further investigate the effect of the different sample-selection methods on this corporate bankruptcy database with a non-linear classification method, Support Vector Machines (SVM), which has recently gained popularity in applications.
Xinhao Yao, Option Pricing: A Comparison Between Black-Scholes-Merton Model and Monte Carlo Simulation, August 7, 2009 (Martin Levy, Uday Rao)
An option, a kind of financial derivative, is a contractual arrangement giving the owner the right to buy or sell an asset at a fixed price on a given date. In this project, we focus on a comparison between two option-pricing methods: the Black-Scholes-Merton model and Monte Carlo simulation. The results from the two methods can be considered equivalent, and an equivalence test is applied to determine the number of iterations required of the Monte Carlo simulation. We also try some modifications of the Monte Carlo simulation to see how the pricing method can be improved when rare events occur.
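The two pricing routes can be compared directly in code. The sketch below prices a European call both ways under the same risk-neutral lognormal dynamics; the contract parameters are illustrative, and a full study would also report the Monte Carlo standard error used in the equivalence test.

```python
import random
from math import log, sqrt, exp, erf

def norm_cdf(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bsm_call(S, K, T, r, sigma):
    # Black-Scholes-Merton closed-form price of a European call.
    d1 = (log(S / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return S * norm_cdf(d1) - K * exp(-r * T) * norm_cdf(d2)

def mc_call(S, K, T, r, sigma, n=100000, seed=1):
    # Monte Carlo price under the same risk-neutral GBM dynamics.
    rng = random.Random(seed)
    payoff = 0.0
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        ST = S * exp((r - 0.5 * sigma ** 2) * T + sigma * sqrt(T) * z)
        payoff += max(ST - K, 0.0)
    return exp(-r * T) * payoff / n

S, K, T, r, sigma = 100.0, 100.0, 1.0, 0.05, 0.2
print(round(bsm_call(S, K, T, r, sigma), 3), round(mc_call(S, K, T, r, sigma), 3))
```

As the number of iterations grows, the simulation estimate converges to the closed-form value, which is the basis of the equivalence test mentioned above.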
Wei Huai, Bankruptcy Prediction: A Comparison between Simple Hazard Model and Logistic Regression Model, July 27, 2009 (Yan Yu, Uday Rao)
As a serious issue for both firms and individuals, bankruptcy has drawn increased attention from society, making its prediction an important topic. In this research project, two popular bankruptcy forecasting models, the simple hazard model of Shumway (2001) and the logistic regression model, are studied and compared. Three different measures, decile ranking, area under the ROC curve, and the Hosmer-Lemeshow goodness-of-fit test, are used to evaluate and compare the bankruptcy forecasts. The conclusion that the simple hazard model is superior to the logistic regression model in the accuracy of bankruptcy forecasting is reconfirmed.
Mayur Bhat, Study of Uplift Modeling and Logistic Regression to increase ROI of Marketing Campaigns, June 5, 2009 (Uday Rao, Amitabh Raturi)
In this research project, we study a technique known as Uplift Modeling, which uses control groups judiciously to measure the true lift in sales that a marketing campaign generates. In addition, Uplift Modeling proposes customer segmentation to achieve better campaign results by way of selective targeting. The results show how using test versus control groups helps measure true lift. We also demonstrate that selective targeting of customers using Uplift Modeling increases incremental revenue compared to the existing alternative, Traditional Response Modeling. Logistic regression, using categorical attitudinal data, is also used to further strengthen and complement the results of Uplift Modeling.
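The core uplift computation, the per-segment treated response rate minus the control response rate, takes only a few lines; segments with positive uplift are the ones worth targeting. The segment names and counts below are invented for illustration.

```python
# Per-segment uplift: treated response rate minus control response rate.
# segment: (treated_n, treated_responders, control_n, control_responders)
segments = {
    "persuadables":   (1000, 150, 1000, 50),
    "sure_things":    (1000, 300, 1000, 290),
    "lost_causes":    (1000, 20,  1000, 18),
    "do_not_disturb": (1000, 80,  1000, 120),
}

def uplift(tn, tr, cn, cr):
    return tr / tn - cr / cn

scores = {name: uplift(*row) for name, row in segments.items()}
# Target only segments whose incremental response is positive.
target = [name for name, u in sorted(scores.items(), key=lambda kv: -kv[1])
          if u > 0]
print(scores)
print("target:", target)
```

This is exactly what separates uplift modeling from traditional response modeling: "sure things" respond heavily but add little incremental revenue, and "do not disturb" segments respond worse when contacted.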
Venu Silvanose, Developing and Assessing a Multiple Logistic Regression Model on Mortgage Data to Determine the Association of Different Predictor Variables and Borrower Default, June 3, 2009 (Martin Levy, Norman Bruvold, Yan Yu)
The purpose of this paper is to develop and assess a logistic regression model to determine the association between different predictor variables and mortgage borrower default. In the current housing market, none of the models widely used in the industry was able to predict with any certainty the high level of default by borrowers; such models are still used, albeit with extreme caution, to identify good and bad credit risks.
Manish Kumar, Intelligent Allocation of Safety Stock in Multi-item Inventory System to Increase Order Service Level and Order Fill Rate, June 3, 2009 (Amitabh Raturi, Michael Magazine)
In this study, we propose a model to establish safety stock in a multi-item inventory system to increase order fill rate and order service level based on correlation between the demands of multiple products. A customer order to a multi-item inventory system consists of several different products in different quantities. The rate at which a manufacturer is able to fulfill the demand for all products to the customer's order in a specified time is termed as order fill rate (OFR). Whereas, the statistical picture of how successful the manufacturer is in fulfilling all the orders completely by the required date is termed as order service level (OSL). The OFR and OSL are very important indices in measuring the performance of the manufacturer and customer satisfaction. We evaluated the order fill rate and order service level performance of the inventory system in a model in which total customer order demand process is based on normally distributed but correlated demands. We show that if the safety stock level is adjusted in accordance with the level of correlation in product demand, both the order fill rate and order service level can be improved.
Larisa Vaysman, Quantifying the Impact of Draft Round on Draft Pick Quality Using Non-Parametric Median Comparison, June 2, 2009 (Michael Fry, Jeffrey Ohlmann, Geoff Smith)
At the beginning of each season, NFL teams take turns selecting rookies to add to their rosters in a days-long process known as the NFL Draft. The NFL Draft consists of seven rounds. Since each team wants to have the strongest possible roster, players who are thought to have the potential to be outstanding are chosen early, and less desirable players are generally chosen later in the process or not at all. We seek to quantify the “cost,” in terms of player quality, that is incurred when a team chooses to wait until a later round to draft a player at a particular position. We also examine a number of position-specific metrics to measure player quality. We use the Kruskal-Wallis test, a non-parametric comparison of medians, to determine which draft rounds are likely to offer picks of equivalent quality, and which draft rounds are likely to offer picks of significantly better or worse quality. Our analysis is meant to assist teams during the decision-making process of drafting players by quantifying the tradeoffs inherent in each potential decision.
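The Kruskal-Wallis statistic the project relies on can be computed directly from pooled ranks. The sketch below uses invented draft-round data (for instance, career games started), not the project's dataset; with k = 3 groups, H is compared against the chi-square critical value with 2 degrees of freedom.

```python
# Kruskal-Wallis sketch with made-up draft-round data (e.g., career
# games started); these values are illustrative, not the project's data.
groups = {
    "round 1": [112, 98, 120, 87, 140, 103],
    "round 2": [95, 60, 88, 72, 110, 54],
    "round 3": [40, 25, 66, 30, 48, 71],
}

# Rank all observations pooled together (no ties in this toy data).
pooled = sorted(v for vals in groups.values() for v in vals)
rank = {v: i + 1 for i, v in enumerate(pooled)}

n_total = len(pooled)
h = -3.0 * (n_total + 1)
for vals in groups.values():
    rank_sum = sum(rank[v] for v in vals)
    h += 12.0 / (n_total * (n_total + 1)) * rank_sum ** 2 / len(vals)

# Compare against the chi-square critical value, df = k - 1 = 2, alpha = 0.05.
critical = 5.991
print(f"H = {h:.3f}; reject equal medians at 5%: {h > critical}")
```

A large H indicates that at least one round's median quality metric differs from the others.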
Michael D. Platt, Distribution Network Model Using Mixed Integer Programming and a Combination of Distribution Centers and Cross-Dock Terminals, June 2, 2009 (Jeffrey Camm, Michael Fry)
In an effort to reduce manufacturing costs, a company is considering moving its manufacturing facilities from the United States to Mexico. Though the facility costs and labor costs will be much lower at the Mexico facility, the company is concerned that the move could have an adverse effect on its transportation costs. The goal of this project is to determine the distribution network that will result in the lowest transportation and material handling cost while maintaining desired customer service levels. Specifically, the project will focus on incorporating cross-docking terminals in the solution in conjunction with fully stocked distribution centers. At a cross-docking terminal, product is moved directly from a receiving dock to a shipping dock, spending very little time in the facility. This process eliminates the need to hold these finished goods in inventory, thus reducing inventory costs and material handling costs.
Taylor W. Barker III, The Expected Box Score Method: An Objective Method for NFL Power Rankings, May 29, 2009 (Martin Levy, Michael Magazine, co-chairs)
One of the more interesting pages on ESPN.com during the NFL season is the NFL “Power Rankings” that they compile each week. This is basically a ranking of the relative strengths (during that week) of the NFL teams based on the votes of several panel members (ESPN.com NFL writers/bloggers). While the results take into account the subjective rankings of each of the panel members, it would be interesting to see if there is an "objective" method to develop weekly power rankings based on current season statistics to date. An objective method for weekly power rankings is found through a process I have named the Expected Box Score (EBS) Method. The EBS Method determines expected box scores between two teams at a given venue based on current season data and then plugs them into a linear regression model based on 20 years of data to get a current estimated point differential between the two teams. This process is repeated for every team playing against every other team exactly twice (once at home and once away), and the results are used to determine how many of those games each team would be expected to win. The team with the most wins is ranked #1, and so on. Shortcomings of other methods are addressed and then considered in the development of the EBS Method. Validation for this method is provided via comparisons with Las Vegas point spreads and NFL.com Power Rankings.
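The regression step of an EBS-style method can be illustrated with ordinary least squares on box-score differentials. Everything below is a hypothetical sketch: the feature set, data, and values are invented for illustration and are not the EBS Method's actual inputs or coefficients.

```python
import numpy as np

# Hypothetical historical games: each row holds home-minus-away stat
# differentials [rushing yards, passing yards, turnovers]; y is the
# home-minus-away point differential. All values are illustrative.
X = np.array([[ 50, 120, -1],
              [-30,  40,  2],
              [ 10, -80,  0],
              [ 75,  60, -2],
              [-60, -20,  1],
              [ 20, 150, -1]], dtype=float)
y = np.array([14, -3, -7, 21, -10, 17], dtype=float)

# Add an intercept column (captures average home-field advantage).
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Predict the point differential for one expected box score between
# two teams: [intercept, rushing diff, passing diff, turnover diff].
expected_box = np.array([1.0, 25, 90, -1])
print("Predicted point differential:", float(expected_box @ coef))
```

Repeating the prediction for every home/away pairing and counting expected wins would yield the ranking described above.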
Lori Mueller, Norwood Fire-Department Simulation Models: Present and Future, May 28, 2009 (David Kelton, Jeffrey Camm)
The Norwood Fire Department (NFD) currently operates one fire station, serving approximately 22,000 people. In 2008, the NFD made approximately 4,400 runs, which averages to about 12 runs per day. With an increase in retail and business development in the city, there has been a subsequent increase in the number of emergencies the department responds to each year. If the development in the city continues over the next few years, the NFD will have to grow along with the city. The NFD has a few options for expansion. One option is to open a second fire station at a location currently owned by the city, which used to be the Norwood fire station before a new station was opened at its current location. Another option is to expand the current station, which is located near the geographical center of the city, so that it could house more equipment and firefighters. Using simulation modeling, these different options were explored to determine which is best for Norwood when the time comes for expansion.
Vinod Iyengar, Call Volume Forecasting and Call Center Staffing for a Financial Services Firm, March 13, 2009 (Uday Rao, Martin Levy)
In this project, we use statistics and data analytics to build scalable and robust models for call center forecasting and staffing. The core of the problem involves predicting call volumes with lead times of a few months, when conditions are dynamic and there is high variability with multiple types of calls. We use data from a US-based prepaid debit card vendor with two types of calls: application calls and customer service calls. We predict application calls using a model of historical effectiveness of marketing dollars and incorporate data on card activation history and customer attrition. We predict customer service calls from active cardholders using time series analysis and regression to capture trend, seasonality, and cyclicity. Call volume predictions are then input into a stochastic newsvendor model to set a staffing level that effectively trades off staffing costs with lost-sales penalty costs for unsatisfied calls. The impact of different staffing level choices on expected costs is explored by simulating call center volume. Performance improvement resulting from this work includes more accurate forecasts with increased service levels and agent occupancy.
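The staffing step described above follows classic newsvendor logic: staff up to the demand quantile given by the critical ratio of underage to total cost. This is a minimal sketch using Python's standard library; every number below is an assumption, not the firm's data.

```python
import math
from statistics import NormalDist

# Newsvendor staffing sketch: all numbers are illustrative assumptions.
mean_calls, sd_calls = 5000.0, 600.0   # forecast monthly call volume
calls_per_agent = 400.0                # calls one agent can handle per month
cu = 5.0   # penalty per call lost to understaffing
co = 2.0   # cost per call of idle overstaffed capacity

# Staff to the demand quantile at the critical ratio cu / (cu + co).
critical_ratio = cu / (cu + co)
target_capacity = NormalDist(mean_calls, sd_calls).inv_cdf(critical_ratio)
agents = math.ceil(target_capacity / calls_per_agent)
print(f"Critical ratio {critical_ratio:.3f} -> staff {agents} agents")
```

Simulating call volumes around the forecast, as the project does, would then show how expected cost varies as the staffing level moves away from this quantile.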
Lei Yu, A Comparison of Portfolio Optimization Models, March 13, 2009 (Martin Levy, Uday Rao)
Applications of portfolio optimization models have developed rapidly. One issue is determining which model should be followed as a guide for investors to make an informed portfolio decision. In this paper, five optimization models (the classical Markowitz model, MiniMax, Gini's Mean Difference, Mean Absolute Deviation, and minimizing Conditional Value-at-Risk) are presented and compared. Solutions generated by the different models applied to the same data sets provide insights for investors. The data sets employed include real-world data and simulated data. MATLAB, VBA (with Excel as host), and COIN-OR software were employed. Observations about alternative selection, similarities, and discrepancies among these models are described.
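As one concrete instance of the models compared, the classical Markowitz global minimum-variance portfolio has a closed form: w = S⁻¹1 / (1ᵀS⁻¹1). The sketch below uses an illustrative covariance matrix, not the paper's data.

```python
import numpy as np

# Global minimum-variance portfolio sketch; the covariance matrix S
# below is an illustrative assumption, not data from the paper.
S = np.array([[0.040, 0.006, 0.002],
              [0.006, 0.025, 0.004],
              [0.002, 0.004, 0.010]])
ones = np.ones(3)

# w = S^{-1} 1, normalized so the weights sum to one.
w = np.linalg.solve(S, ones)
w /= w.sum()
print("Minimum-variance weights:", np.round(w, 4))
print("Portfolio variance:", float(w @ S @ w))
```

The other models in the comparison (MiniMax, MAD, CVaR) replace the variance objective with different risk measures, which generally requires a linear-programming solver rather than a closed form.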
Moumita Hanra, Assessing ultimate impact of Brand Communication on market share using Path Models and its comparison to Ridge regression, March 12, 2009 (Martin Levy, Uday Rao)
Path modeling, based on structural equation modeling, is a widely used technique in the market research industry for analyzing interrelationships between various measures and identifying which ones are really significant in driving sales. In this study, the objective is to find the best-fitting path model to assess which attributes are really important to a consumer in terms of sales, using respondent-level survey data. The model also predicts which media sources companies should focus on when advertising their brand to gain maximum public awareness, how this awareness drives the way one thinks about the brand along different dimensions, and how this in turn drives sales. The second half of this study compares the results of the path model to ridge regression to assess which model yields a better fit and more intuitive results. Ridge regression reduces multicollinearity among independent variables by modifying the X'X matrix used in Ordinary Least Squares regression with a ridge control parameter. The results indicate that the path model gives a much better fit than ridge regression, especially when multicollinearity is not extreme.
Man Xu, Forecasting Default: A comparison between Merton Model and Logistic Model, March 11, 2009 (Yan Yu, Uday Rao)
The Merton default model, which is based on Merton's (1974) bond pricing model, has been widely used in both academic research and industry to forecast bankruptcy. This work reexamines the Merton default model as well as the relationship of default risk with equity returns and the firm size effect, using an updated database covering 1986 to 2006 obtained from Compustat and CRSP. We concur with most of the findings in Vassalou and Xing (2003). We find that both default risk and size have an impact on equity returns; the highest returns come from the smallest firms with the highest default risk. We then focus on the comparison between the Merton model (a financial model) and a logistic regression model (a statistical model) for default forecasting. We compare the default likelihood indicator (DLI) from the Merton model with the estimated default probability from the logistic model using rank correlation and deciles rankings based on out-of-sample prediction. We find that the functional form of the Merton model is very useful in determining default: its structure captures important aspects of default probability. However, if bankruptcy forecasting is desired, our empirical results show that the logistic model seems to provide better predictions. We also add distance to default (DD) from the Merton model as a covariate in our best logistic model and find that it is not a significant predictor.
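The Merton-style distance to default and default likelihood indicator discussed above can be sketched in a few lines; the firm inputs below are invented for illustration and the formulation is a simplified textbook version, not the study's exact estimation procedure.

```python
import math
from statistics import NormalDist

def distance_to_default(V, D, mu, sigma, T=1.0):
    """Simplified Merton-style distance to default (DD) and default
    likelihood indicator (DLI). V: market value of assets, D: face
    value of debt due at T, mu: expected asset return, sigma: asset
    volatility (all annualized)."""
    dd = (math.log(V / D) + (mu - 0.5 * sigma ** 2) * T) / (sigma * math.sqrt(T))
    dli = NormalDist().cdf(-dd)  # probability assets fall below debt at T
    return dd, dli

# Illustrative firm (assumed numbers):
dd, dli = distance_to_default(V=120.0, D=100.0, mu=0.08, sigma=0.25)
print(f"DD = {dd:.3f}, DLI = {dli:.4f}")
```

A smaller DD (assets closer to the debt barrier, or more volatile) produces a larger DLI, which is the quantity the study compares against the logistic model's estimated default probability.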
Luke Robert Chapman, A Current Review of Electronic Medical Records, March 11, 2009 (Michael Magazine, Craig Froehle)
In this project, we research the imminent installation of Electronic Medical Records (EMR) in all hospitals and clinics throughout the United States. This project was motivated by our interaction with the Cincinnati Department of Health (CDH) via a project that focused on persuading the Cincinnati city council to invest immediately in EMR at all six of the CDH clinics. We review the advantages of EMR and also recognize the disadvantages, some of which were overlooked in the original project with CDH. The current growth of Electronic Medical Records in the US, and what the future holds for EMR, is reviewed. The main analysis reviews the claim that EMR helps to reduce medical errors, using multivariate techniques such as factor and cluster analysis.
Chetan Vispute, Improving a Debt-Collection Process by Simulation, March 9, 2009 (David Kelton, Norman Bruvold)
The Auto-Search Process is an automated business process flow that has been designed by Sallie Mae for its in-house collection agency; it works sequentially to procure good phone numbers of delinquent borrowers. The process involves outsourcing of data to private vendors, wherein the failed data from one vendor are sent to the next vendor until all vendors have been tried. The process is also governed by time-related business rules that allow the data to be sent to the next vendor only after a certain period. By keeping the cheapest vendor first, the process aims to reduce cost while increasing the procurement of good phone numbers. Before this process could go live, the analytical team was required to analyze it by building a time-related model and making recommendations. This thesis explores the building of this time-based model using dynamic discrete-event simulation with Arena, and then discusses the findings and recommendations developed during the project, which helped the company improve its annual revenue position by over $440,000.
Cary Wise, Cincinnati Children's Hospital Block-Schedule Optimization, February 10, 2009 (Kipp Martin, Craig Froehle, Michael Magazine)
Cincinnati Children's Hospital is implementing an automated process to schedule clinical and surgical patient visits. The goal is to create a program that allocates operating rooms to requests submitted by individual doctors for clinical time and surgical time. The schedule creation process takes place in two phases: the first phase schedules spaces for specialties (Ortho, Cardio, etc.); the second phase allocates doctors to the specialty schedule. The program that generates the specialty allocation is named the Space Request Feasibility Solver (SRFS). The inputs of the SRFS are a set of specialty requests and information about the operating rooms; the output is the schedule of specialty assignments. The problem is formulated as a mixed-integer linear program (MILP) that minimizes the number of unfulfilled space requests. A very large number of potential assignments may be generated depending on whether the request parameters are very specific or general. Indeed, the instance quickly becomes intractable for a realistic problem. We implement a branch-and-price column generation algorithm to overcome the problem of an intractable number of variables. The SRFS invokes a COIN-OR solver named “bcp” to perform the procedures of branching, solving the LP at each node, and managing the search tree. The scope of this master's project is to implement a column generation scheme in the SRFS. Testing of the SRFS was performed by verifying that the selected column had the minimum reduced cost, and by checking the results of the LP relaxation and IP against the solution from exhaustive enumeration of all columns. The performance of the SRFS, in terms of the number of columns and nodes created to arrive at a solution, was also investigated.
Omkar Saha, Design and Develop Cincinnati Children's Scheduling System for use in the Optimization of Hospital Scheduling, October 19, 2009 (Kipp Martin, Michael Magazine, Craig Froehle)
Escalating health care costs continue to increase the demand on hospital administrators for greater efficiency, creating tighter constraints on doctors and human resources. Cincinnati Children's Hospital Medical Center (CCHMC) is seeking to address its existing scheduling inefficiencies, while also addressing the additional resource demands of a new satellite location. CCHMC currently uses a manual scheduling method based on legacy schedules, and each specialty maintains its own schedule. As part of an on-going project with UC MS - Business Analytics faculty, several attempts have been made at optimizing the scheduling process across all locations and specialties. This project attempts to consolidate the entire request-making process for the different specialties irrespective of location, generate an optimal schedule satisfying all the requests, and report the approved schedule based on different criteria and in a common format. This would help achieve greater efficiency for administrators, doctors, and clinical and surgical spaces.
Edmund A. Berry, National Estimates of the Inpatient Burden of Pediatric Bipolar Disorder in an Inpatient Setting. An Analysis of the 2003 and 2006 Kids Inpatient Databases (KID) Data, September 25, 2009 (Martin Levy, Pamela Heaton)
Bipolar disorder (BPD) is a debilitating recurrent chronic mental illness, characterized by cycling states of depression, mania, hypomania, and mixed episodes. This disease, generating tremendous societal and economic impact, is associated with a high degree of morbidity and mortality and is particularly costly and debilitating in pediatric patients. The objectives of this study were 1) to calculate national estimates of the annual burden of inpatient hospitalizations of children and adolescents with BPD, where burden is measured specifically in terms of charges, cost, and length of stay; 2) to describe and compare the burden across various demographic characteristics, hospital characteristics, and key comorbidities associated with BPD; and 3) to determine the independent effects of these demographic, hospital-type, and comorbidity factors on hospitalization costs. To accomplish these objectives, we examined data from both the 2003 and 2006 Kid's Inpatient Databases (KID). National estimates of the means and standard errors for cost, charges, and length of stay for inpatient pediatric BPD were computed using the complex sample design of the 2003 and 2006 KID data, which contains weighting, stratification, and clustering variables. Two Ordinary Least Squares regression models, using 2003 and 2006 KID data, were used to determine key predictors of cost along demographic characteristics, hospital characteristics, and comorbidities. Finally, the Chow test was used to determine whether the underlying regression models estimated in 2003 and 2006 were the same.
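The Chow test in the final step can be sketched as follows: fit OLS on each year's sample and on the pooled sample, then form an F statistic from the sums of squared residuals. The data here are synthetic stand-ins, not KID data.

```python
import numpy as np

def chow_test(X1, y1, X2, y2):
    """Chow test F statistic for equality of OLS coefficients across
    two samples (an intercept is added here); returns F and its dof."""
    def ssr(X, y):
        A = np.column_stack([np.ones(len(X)), X])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        return float(resid @ resid), A.shape[1]

    ssr1, k = ssr(X1, y1)
    ssr2, _ = ssr(X2, y2)
    ssr_pooled, _ = ssr(np.vstack([X1, X2]), np.concatenate([y1, y2]))
    n = len(y1) + len(y2)
    F = ((ssr_pooled - ssr1 - ssr2) / k) / ((ssr1 + ssr2) / (n - 2 * k))
    return F, k, n - 2 * k

# Synthetic example: cost vs. length of stay in two "years" whose true
# coefficients differ, so the test should reject equality.
rng = np.random.default_rng(0)
x1 = rng.uniform(1, 10, 40); y1 = 2000 + 900 * x1 + rng.normal(0, 300, 40)
x2 = rng.uniform(1, 10, 40); y2 = 2500 + 700 * x2 + rng.normal(0, 300, 40)
F, df1, df2 = chow_test(x1.reshape(-1, 1), y1, x2.reshape(-1, 1), y2)
print(f"F({df1}, {df2}) = {F:.2f}")
```

A large F relative to the F(df1, df2) critical value indicates the two years' regressions are not the same.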
Deepankar Arora, A Decision Support Methodology for Distribution Networks in a Stochastic Environment using Mixed Integer Programming in Spreadsheets, September 11, 2009 (Jeffrey Camm, Kipp Martin)
In an effort to reduce distribution costs from distribution centers to customer locations, a company is considering opening a set of five distribution centers to serve all of its customer locations. The main problem the company faces is demand uncertainty at the customer locations, which can have an adverse effect on its transportation costs. The main goal of this project is to introduce a decision support methodology for identifying a robust distribution network that minimizes transportation and handling costs under stochastic demands, using VBA (Visual Basic for Applications) in Excel. Specifically, the methodology helps the decision maker narrow down the choices by providing the cost distribution corresponding to each candidate solution, an efficient frontier, the cost of the best worst-case scenario, the value at risk (VaR), and the expected loss below a certain value. The project aims to develop a concept that can aid the decision maker when parameters in an optimization model are stochastic. The final call in such cases is subjective, and a "good" decision depends on the decision maker's preferences, but this methodology gives the decision maker tools to facilitate and inform the decision-making process.
Bethany Harding, Safety Stock Level Analysis for Replenishment Planning using Actual-to-Forecast Demand Ratios, September 4, 2009 (Uday Rao, Amitabh Raturi)
Senco Brands, Inc., currently stocks approximately 10,000 items at one or all of their domestic distribution locations. The planning and replenishment for these items is performed using a basic MRP planning system. Forecasts are created for each item and prorated for each distribution center based on historic usage percentage to total corporate demand. Desired ending inventory levels are set using a safety-time factor of weeks. The current model requires a safety-time level defined for each item at each distribution center. Desired ending inventory is calculated in each weekly time period of the planning horizon by accumulating the demand forecasts over contiguous future weeks specified by the safety-time. Using the company's data, an Excel-based tool was developed to: (1) recreate the MRP planning system's approach to setting planned order releases using the desired ending inventory approach and the input safety-time; (2) apply an actual-to-forecast demand ("A/F") ratio approach to determine the probability distribution of demand over the planning horizon; (3) simulate various scenarios for future demand; (4) use the simulated demand scenarios to determine the performance of a chosen safety-time (or desired ending inventory) using key performance indicators such as expected customer fill rate, inventory investment, and working capital; and (5) calculate an optimized safety-time that achieves satisfactory performance (e.g., a 95% fill rate), as determined by the company. Various applications of the Excel-based tool are illustrated.
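The A/F-ratio idea in steps 2-4 can be sketched in a few lines: resample historical actual-to-forecast ratios to turn a point forecast into demand scenarios, then score a candidate buffer by its expected fill rate. All numbers below are assumptions, and the single-period buffer stands in for the tool's safety-time logic.

```python
import random

# Assumed history of actual/forecast demand ratios (illustrative only).
history_af = [0.82, 1.10, 0.95, 1.30, 0.76, 1.05, 0.90, 1.22]
forecast_next_week = 500
safety_stock = 100          # candidate buffer standing in for safety-time

random.seed(42)
n_scenarios = 10_000
filled, total = 0.0, 0.0
for _ in range(n_scenarios):
    # Each scenario scales the forecast by a resampled A/F ratio.
    demand = forecast_next_week * random.choice(history_af)
    available = forecast_next_week + safety_stock
    filled += min(demand, available)
    total += demand

fill_rate = filled / total
print(f"Expected fill rate: {fill_rate:.3%}")
```

Sweeping `safety_stock` (or, in the tool, safety-time) over a range and repeating this evaluation traces the fill-rate versus inventory-investment tradeoff used to pick the optimized setting.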
Andreas Kuncoro, Empirical Study of Supply Chain Disruptions' Impact on the Financial and Inventory Performance of Manufacturing and Non-manufacturing Firms, August 28, 2009 (Amitabh Raturi, Uday Rao)
Supply chain disruptions are various unanticipated events in the supply chain, caused by internal and external factors, that cause a firm to deviate significantly from its original plans and consequently affect its performance. This work assesses the relationship between supply chain disruptions and overall firm performance as measured by financial (return on assets and leverage) and operational (inventory turnover) metrics. We first chronicle 75 supply disruptions in 47 firms as reported in the business press over a three-year period (2005-2007). We then categorize these disruptions on causal factors as internally versus externally caused, and across several origin sources. The performance metrics are then observed from Compustat quarterly data, from one year before through one year after the disruption announcements. The impact of such disruptions is first analyzed by firm size, firm type (manufacturing versus non-manufacturing), reason, and responsibility. In multivariate analysis of covariance tests, firm size showed a significant positive association with overall firm performance, while disruption event announcement showed a significant negative association with overall performance. Consistent with previous studies, our findings indicate that supply chain disruptions negatively impact both financial and operational performance. Firm size significantly moderates this impact. One year after the event announcement, the firms are able to recover their performance.
James Andrew Kirtland III, Simulation Efficiency of the Finitized Logarithmic Power Series, August 27, 2009 (Martin Levy, David Kelton)
It is often appropriate or desirable to limit a distribution's support. This can be due to the actual environment that an analyst is trying to model, or to increase the efficiency of simulating random variates from a model. This can be done using traditional truncation. However, when truncation is used, undesired and often unpredictable effects occur to the moments of the parent distribution. Finitization is a method of limiting a power series distribution's support while preserving its moments up to the order of finitization, n. The logarithmic power series distribution is used to discuss properties of theoretical, truncated, and finitized distributions. Four algorithms designed to generate random variates from a theoretical logarithmic power series distribution are compared to an alias method designed to generate random variates from a finitized logarithmic power series distribution. The variates created by these four algorithms, as well as by the alias method, are tested against a theoretical logarithmic power series distribution to check whether the moments hold. Finally, a timing "horserace" is used to test whether the finitized logarithmic distribution using an alias method is more efficient at generating random variates than the four other algorithms based on the infinitely supported logarithmic distribution.
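The alias method at the heart of the comparison builds an O(n) table once and then samples in O(1) time per draw. This is a generic Vose-style implementation with an assumed pmf on a finite support, not the thesis's finitized logarithmic distribution itself.

```python
import random

def build_alias(probs):
    """Vose's alias method setup: O(n) table build for a finite pmf."""
    n = len(probs)
    scaled = [p * n for p in probs]
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    prob, alias = [0.0] * n, [0] * n
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l
        scaled[l] -= 1.0 - scaled[s]
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:      # leftovers are exactly 1 up to rounding
        prob[i] = 1.0
    return prob, alias

def alias_draw(prob, alias, rng=random):
    """O(1) sampling: pick a column, then a biased coin flip."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]

# Assumed decreasing pmf on {1, 2, 3, 4}, loosely logarithmic in shape.
pmf = [0.52, 0.26, 0.14, 0.08]
prob, alias = build_alias(pmf)
random.seed(1)
draws = [alias_draw(prob, alias) + 1 for _ in range(100_000)]
emp_mean = sum(draws) / len(draws)
print("Empirical mean:", emp_mean)
```

Because a finitized distribution has finite support by construction, it fits the alias table directly, which is exactly why it can win the horserace against generators for the infinitely supported parent.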
Shannon Peterson, Development of a Long Range Capacity and Purchasing Plan for a Manufacturing Environment, August 24, 2009 (Jeffrey Camm, Uday Rao)
Long range capacity planning is an essential part of business planning. This can be complicated by seasonality of products, varying material pricing plans and supplier capacities, criticality and substitutions of raw materials, and multiple production sites and bills of materials. This project develops a flexible tool that reveals an optimal, high-level long range production schedule and purchasing plan to satisfy customer demand and identify potential outages.
Ndanatsiwa Anne Chambati, Locating an Optimal Site for a New Natorp's Garden Center, August 21, 2009 (Michael Magazine, Uday Rao)
A well-known aphorism states that "the most important attributes of stores are location, location and location". Research on optimal store location has grown rapidly in the last decade. Most of the research in this area has been undertaken by marketing researchers, urban geographers, and economists, with applied mathematicians recently entering the field through the development of algorithms and mathematical models applicable to location problems. At the mathematical level the problem is abstract and exact, removed from the practical problems of the real estate developer or marketing expert. Natorp is a family-owned business that has been around since 1916. It currently has two Garden Center locations, a nursery, and landscaping services. It would like to open an additional Garden Center in the Ohio, Kentucky, and Indiana (OKI) region and needs to know the optimal location for it. First, we review the current literature on optimal store location and identify the most important factors for Natorp to consider in the expansion. Next, we evaluate each of the eight counties in the OKI region using a multi-factor site location rating system and identify potential sites for the new Garden Center. These potential sites are evaluated based on population projections over the next 30 years, median household income, median home value, and proximity to competitors.
Ashutosh Mhasekar, Application of Statistical Procedures to Target Specific Segments for Upgrading Marginally Sub-par Members to Rewards-eligible Level in a Retail Loyalty Environment, August 19, 2009 (Michael Magazine, Uday Rao, Marc Schulkers)
The retail industry has become extremely competitive, with loyalty programs constantly used to monitor customer behavior and engage customers for incremental sales / revenue. Retailer R runs a points-based loyalty program. Members can earn rewards certificates which are good towards future purchases. With the current economy and stiff competition, the Retailer is using targeted bonus offers to members who need additional points to earn a reward certificate. In this project we use various statistical tools to efficiently target members who need additional points to earn a reward certificate and to maximize certificate redemption, which results in incremental sales for the company. Also, a test and control group approach is employed to monitor and measure the incremental behavior / performance of this “Bonused” group during both the promotional period and the post period. Using the targeted segmentation approach, an increase in redemption rate was noted. There was a significant increase in revenue during the promotional period, without impacting post-period sales.
Shaonan Tian, Data Sample Selection Issues for Bankruptcy Prediction, August 12, 2009 (Yan Yu, Martin Levy)
Bankruptcy prediction is of paramount interest to both academics and practitioners. This paper devotes special care to an important aspect of bankruptcy prediction modeling: the data sample selection issue. We first explore the effect of different data sample selection methods by comparing out-of-sample predictive performance in a Monte Carlo simulation study under the logit regression model. The simulation study suggests that if forecasting the probability of bankruptcy is of interest, the complete data sampling technique provides more accurate results. However, if a binary bankruptcy decision or corporate rating is desired, the choice-based sampling technique may still be suitable. In particular, within the logit regression context, a simple remedy can be applied to adjust the cut-off probability, such that the choice-based sampling technique and the complete data sampling technique display the same explanatory power in forecasting the bankruptcy classification. We also find that appropriate adjustment of the cut-off probability is complementary when different misclassification costs are taken into account. Finally, we contextualize the proposed recommendations by applying them to an updated bankruptcy database. We further investigate the effect of the different data selection methods on this corporate bankruptcy database with a non-linear classification method, Support Vector Machines (SVM), which has recently gained popularity in applications.
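A cut-off remedy of the kind described can be illustrated by a prior correction: shift the logit of a sample-trained score by the log of the sampling odds so that scores from a choice-based (oversampled-bankruptcy) sample map back to the population scale. The rates below are assumptions, not the paper's exact adjustment.

```python
import math

pop_rate = 0.02      # assumed true bankruptcy rate in the population
sample_rate = 0.50   # assumed bankruptcy share in the choice-based sample

# Log of the ratio of sampling odds to population odds: this is the
# amount by which oversampling inflates the fitted logit intercept.
offset = math.log((sample_rate / (1 - sample_rate)) /
                  (pop_rate / (1 - pop_rate)))

def corrected_probability(p_sample):
    """Map a sample-trained predicted probability to the population scale."""
    logit = math.log(p_sample / (1 - p_sample)) - offset
    return 1 / (1 + math.exp(-logit))

print(f"Sample score 0.50 -> population probability "
      f"{corrected_probability(0.50):.4f}")
```

Equivalently, instead of correcting every probability one can keep the sample-scale scores and move the classification cut-off by the same offset; either way the choice-based model classifies like the complete-data model.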
Xinhao Yao, Option Pricing: A Comparison Between Black-Scholes-Merton Model and Monte Carlo Simulation, August 7, 2009 (Martin Levy, Uday Rao)
An option, a kind of financial derivative, is a special contractual arrangement giving the owner the right to buy or sell an asset at a fixed price on a given date. In this project, we focus on a comparison between two option pricing methods: the Black-Scholes-Merton model and Monte Carlo simulation. The results from both methods can be considered equivalent, and an equivalence test is applied to determine the number of iterations of the Monte Carlo simulation. We also try some modifications of the Monte Carlo simulation to see how to improve the pricing method when rare events occur.
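The two pricing methods compared can be sketched side by side for a European call: the closed-form Black-Scholes-Merton price versus a Monte Carlo average of discounted payoffs under geometric Brownian motion. Parameters below are illustrative.

```python
import math
import random
from statistics import NormalDist

def black_scholes_call(S, K, r, sigma, T):
    """Black-Scholes-Merton price of a European call."""
    d1 = (math.log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    N = NormalDist().cdf
    return S * N(d1) - K * math.exp(-r * T) * N(d2)

def monte_carlo_call(S, K, r, sigma, T, n_paths=200_000, seed=7):
    """Price the same call by simulating terminal prices under GBM."""
    rng = random.Random(seed)
    drift = (r - 0.5 * sigma**2) * T
    vol = sigma * math.sqrt(T)
    payoff_sum = 0.0
    for _ in range(n_paths):
        ST = S * math.exp(drift + vol * rng.gauss(0.0, 1.0))
        payoff_sum += max(ST - K, 0.0)
    return math.exp(-r * T) * payoff_sum / n_paths

# Illustrative at-the-money call: S = K = 100, r = 5%, sigma = 20%, T = 1.
bs = black_scholes_call(100, 100, 0.05, 0.2, 1.0)
mc = monte_carlo_call(100, 100, 0.05, 0.2, 1.0)
print(f"Black-Scholes: {bs:.4f}  Monte Carlo: {mc:.4f}")
```

Increasing `n_paths` shrinks the Monte Carlo standard error at the usual 1/sqrt(n) rate, which is what an equivalence test on the two prices can be used to calibrate.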
Wei Huai, Bankruptcy Prediction: A Comparison between Simple Hazard Model and Logistic Regression Model, July 27, 2009 (Yan Yu, Uday Rao)
As a serious issue for both firms and individuals, bankruptcy has recently drawn increased attention from society, making its prediction an important topic. In this research project, two popular bankruptcy forecasting models, the Shumway (2001) Simple Hazard Model and the Logistic Regression Model, are studied and compared. Three different measurements, Deciles Ranking, Area under the ROC Curve, and the Hosmer-Lemeshow goodness-of-fit test, are implemented to evaluate and compare the bankruptcy forecasting results. The conclusion that the simple hazard model is superior to the logistic regression model in bankruptcy forecasting accuracy is reconfirmed.
Mayur Bhat, Study of Uplift Modeling and Logistic Regression to increase ROI of Marketing Campaigns, June 5, 2009 (Uday Rao, Amitabh Raturi)
In this research project, we study a technique known as Uplift Modeling which uses control groups judiciously to measure the true lift in sales that a marketing campaign generates. In addition, Uplift Modeling proposes customer segmentation to achieve better campaign results by way of selective targeting. The results show how using test versus control groups helps in measuring true lift. We also demonstrate that selective targeting of customers using Uplift Modeling increases incremental revenue when compared to the existing alternative called Traditional Response Modeling. Logistic Regression, using categorical attitudinal data, is also used to further strengthen and complement the results seen from Uplift Modeling.
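The test-versus-control computation at the core of uplift measurement is simple arithmetic: true lift is the treated response rate minus the control response rate, and incremental revenue scales that lift by the targeted population. The counts below are invented for illustration.

```python
# Illustrative campaign counts (assumptions, not the project's data).
treated, treated_resp = 50_000, 2_600   # mailed group and its responders
control, control_resp = 50_000, 2_100   # held-out group and its responders
revenue_per_response = 40.0             # assumed revenue per responder

# True lift: the response the campaign itself caused, net of customers
# who would have responded anyway (measured by the control group).
lift = treated_resp / treated - control_resp / control
incremental_revenue = lift * treated * revenue_per_response
print(f"True lift: {lift:.2%}; incremental revenue: ${incremental_revenue:,.0f}")
```

Uplift modeling then segments customers and estimates this lift per segment, so the campaign can target only the segments where the lift, not the raw response rate, is highest.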
Venu Silvanose, Developing and Assessing a Multiple Logistic Regression Model on Mortgage Data to Determine the Association of Different Predictor Variables and Borrower Default, June 3, 2009 (Martin Levy, Norman Bruvold, Yan Yu)
The purpose of this paper is to develop and assess a logistic regression model to determine the association between different predictor variables and mortgage borrower default. In the current housing market, where none of the industry's widely used models was able to predict the high level of borrower default with much certainty, such models are still used, albeit with extreme caution, to identify good and bad credit risks.
Manish Kumar, Intelligent Allocation of Safety Stock in Multi-item Inventory System to Increase Order Service Level and Order Fill Rate, June 3, 2009 (Amitabh Raturi, Michael Magazine)
In this study, we propose a model to set safety stock in a multi-item inventory system so as to increase the order fill rate and order service level, based on the correlation between the demands of multiple products. A customer order to a multi-item inventory system consists of several different products in different quantities. The rate at which a manufacturer is able to fulfill the demand for all products in a customer's order within a specified time is termed the order fill rate (OFR), while the fraction of orders the manufacturer fulfills completely by the required date is termed the order service level (OSL). OFR and OSL are important indices of manufacturer performance and customer satisfaction. We evaluate the OFR and OSL performance of the inventory system in a model in which the total customer order demand process consists of normally distributed but correlated demands. We show that if the safety stock level is adjusted in accordance with the level of correlation in product demand, both the order fill rate and order service level can be improved.
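A minimal simulation sketch (illustrative parameters, not the project's model) shows why correlation matters: the same safety stock yields a different complete-order service level depending on how the two product demands co-move.

```python
import math
import random

def order_service_level(rho, z_safety, n_orders=100_000, seed=7):
    """Fraction of two-product orders filled completely when each product is
    stocked at its mean demand plus z_safety standard deviations, and the
    standardized demands are bivariate normal with correlation rho."""
    rng = random.Random(seed)
    filled = 0
    for _ in range(n_orders):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        d1 = z1                                         # standardized demand, product 1
        d2 = rho * z1 + math.sqrt(1 - rho ** 2) * z2    # correlated demand, product 2
        if d1 <= z_safety and d2 <= z_safety:
            filled += 1
    return filled / n_orders

# One standard deviation of safety stock on each product:
osl_independent = order_service_level(rho=0.0, z_safety=1.0)
osl_correlated = order_service_level(rho=0.8, z_safety=1.0)
```

In this toy case, positively correlated demands make a complete order fill more likely at the same stock level, so the safety stock needed to hit a target OSL depends on the correlation.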
Larisa Vaysman, Quantifying the Impact of Draft Round on Draft Pick Quality Using Non-Parametric Median Comparison, June 2, 2009 (Michael Fry, Jeffrey Ohlmann, Geoff Smith)
At the beginning of each season, NFL teams take turns selecting rookies to add to their rosters in a days-long process known as the NFL Draft. The NFL Draft consists of seven rounds. Since each team wants to have the strongest possible roster, players who are thought to have the potential to be outstanding are chosen early, and less desirable players are generally chosen later in the process or not at all. We seek to quantify the “cost,” in terms of player quality, that is incurred when a team chooses to wait until a later round to draft a player at a particular position. We also examine a number of position-specific metrics to measure player quality. We use the Kruskal-Wallis test, a non-parametric comparison of medians, to determine which draft rounds are likely to offer picks of equivalent quality, and which draft rounds are likely to offer picks of significantly better or worse quality. Our analysis is meant to assist teams during the decision-making process of drafting players by quantifying the tradeoffs inherent in each potential decision.
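A minimal, from-scratch sketch of the Kruskal-Wallis comparison (illustrative quality scores, not the position-specific NFL metrics used in the project):

```python
def kruskal_wallis_H(groups):
    """Kruskal-Wallis H statistic: rank all observations together (averaging
    ranks for ties), then measure how far each group's mean rank departs
    from the overall mean rank (N + 1) / 2."""
    pooled = sorted(x for g in groups for x in g)
    N = len(pooled)
    rank = {}
    i = 0
    while i < N:                      # assign average ranks, handling ties
        j = i
        while j < N and pooled[j] == pooled[i]:
            j += 1
        rank[pooled[i]] = (i + 1 + j) / 2.0   # mean of ranks i+1 .. j
        i = j
    H = 0.0
    for g in groups:
        mean_rank = sum(rank[x] for x in g) / len(g)
        H += len(g) * (mean_rank - (N + 1) / 2.0) ** 2
    return 12.0 / (N * (N + 1)) * H

# Hypothetical quality scores for picks from three draft rounds:
H = kruskal_wallis_H([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# H is compared with a chi-square critical value (5.991 for 2 df at alpha = 0.05);
# here H exceeds it, so the rounds differ in median pick quality.
```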
Michael D. Platt, Distribution Network Model Using Mixed Integer Programming and a Combination of Distribution Centers and Cross-Dock Terminals, June 2, 2009 (Jeffrey Camm, Michael Fry)
In an effort to reduce manufacturing costs, a company is considering moving its manufacturing facilities from the United States to Mexico. Though facility and labor costs will be much lower in Mexico, the company is concerned that the move could adversely affect its transportation costs. The goal of this project is to determine the distribution network that yields the lowest transportation and material handling cost while maintaining desired customer service levels. Specifically, the project focuses on incorporating cross-docking terminals into the solution in conjunction with fully stocked distribution centers. At a cross-docking terminal, product is moved directly from a receiving dock to a shipping dock, spending very little time in the facility. This eliminates the need to hold finished goods in inventory, reducing inventory and material handling costs.
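The underlying fixed-charge trade-off (open-facility costs versus shipping costs) can be sketched by brute force on a toy instance with invented costs; the actual project formulates a full mixed-integer program and would use a MILP solver rather than enumeration:

```python
from itertools import combinations

# Hypothetical facilities: name -> (fixed operating cost, per-unit shipping
# cost to each customer).
facilities = {
    "DC":        (10.0, {"c1": 1.0, "c2": 5.0}),  # fully stocked distribution center
    "crossdock": (10.0, {"c1": 5.0, "c2": 1.0}),  # cross-dock terminal
}
demand = {"c1": 1.0, "c2": 1.0}

def network_cost(open_set):
    """Fixed cost of the open facilities plus cheapest-assignment shipping cost."""
    fixed = sum(facilities[f][0] for f in open_set)
    ship = sum(min(facilities[f][1][c] for f in open_set) * q
               for c, q in demand.items())
    return fixed + ship

# Enumerate every nonempty subset of facilities: tractable only for toy
# instances, which is exactly why the real problem is posed as a MILP.
names = list(facilities)
best = min(network_cost(s)
           for k in range(1, len(names) + 1)
           for s in combinations(names, k))
```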
Taylor W. Barker III, The Expected Box Score Method: An Objective Method for NFL Power Rankings, May 29, 2009 (Martin Levy, Michael Magazine, co-chairs)
One of the more interesting pages on ESPN.com during the NFL season is the weekly NFL “Power Rankings.” This is a ranking of the relative strengths (during that week) of the NFL teams based on the votes of several panel members (ESPN.com NFL writers and bloggers). While the results incorporate the subjective rankings of each panel member, it would be interesting to see whether an "objective" method can produce weekly power rankings based on current-season statistics to date. Such a method is developed here through a process I have named the Expected Box Score (EBS) Method. The EBS Method determines expected box scores between two teams at a given venue based on current-season data, then feeds them into a linear regression model, built on 20 years of data, to estimate the current point differential between the two teams. This process is repeated so that every team plays every other team exactly twice (once at home and once away), and the results are used to determine how many of those games each team would be expected to win. The team with the most wins is ranked #1, and so on. Shortcomings of other methods are addressed and considered in the development of the EBS Method. The method is validated via comparisons with Las Vegas point spreads and NFL.com power rankings.
Lori Mueller, Norwood Fire-Department Simulation Models: Present and Future, May 28, 2009 (David Kelton, Jeffrey Camm)
The Norwood Fire Department (NFD) currently operates one fire station serving approximately 22,000 people. In 2008, the NFD made approximately 4,400 runs, an average of about 12 runs per day. With increased retail and business development in the city has come an increase in the number of emergencies the department responds to each year. If development continues over the next few years, the NFD will have to grow along with the city. The NFD has a few options for expansion. One option is to open a second fire station at a city-owned location that housed the Norwood fire station before the current station opened. Another option is to expand the current station, located near the geographic center of the city, so that it can house more equipment and firefighters. Simulation modeling was used to explore these options and determine which is best for Norwood when the time comes for expansion.
Vinod Iyengar, Call Volume Forecasting and Call Center Staffing for a Financial Services Firm, March 13, 2009 (Uday Rao, Martin Levy)
In this project, we use statistics and data analytics to build scalable and robust models for call center forecasting and staffing. The core of the problem involves predicting call volumes with lead times of a few months, when conditions are dynamic and there is high variability with multiple types of calls. We use data from a US-based prepaid debit card vendor with two types of calls: application calls and customer service calls. We predict application calls using a model of historical effectiveness of marketing dollars and incorporate data on card activation history and customer attrition. We predict customer service calls from active cardholders using time series analysis and regression to capture trend, seasonality, and cyclicity. Call volume predictions are then input into a stochastic newsvendor model to set a staffing level that effectively trades off staffing costs with lost-sales penalty costs for unsatisfied calls. The impact of different staffing level choices on expected costs is explored by simulating call center volume. Performance improvement resulting from this work includes more accurate forecasts with increased service levels and agent occupancy.
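The staffing step can be sketched as a standard newsvendor critical-fractile calculation; the forecast and cost figures below are illustrative, not the firm's:

```python
from statistics import NormalDist

def newsvendor_staffing(mu, sigma, understaff_cost, overstaff_cost):
    """Choose call-handling capacity for normally distributed call volume by
    stocking at the critical fractile: the point where the marginal penalty
    for an unanswered call balances the marginal cost of idle capacity."""
    fractile = understaff_cost / (understaff_cost + overstaff_cost)
    z = NormalDist().inv_cdf(fractile)
    return mu + z * sigma

# Forecast of 10,000 calls with std dev 800; a lost call is assumed to cost
# five times as much as an idle staffed slot, so we staff above the mean.
capacity = newsvendor_staffing(10_000, 800, understaff_cost=5.0, overstaff_cost=1.0)
```

Simulating call volumes against several candidate capacities, as the project does, then shows how expected cost varies around this analytic choice.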
Lei Yu, A Comparison of Portfolio Optimization Models, March 13, 2009 (Martin Levy, Uday Rao)
Applications of portfolio optimization models have developed rapidly. One issue is determining which model investors should follow to make an informed portfolio decision. In this paper, five optimization models (the classical Markowitz model, MiniMax, Gini's mean difference, mean absolute deviation, and minimization of conditional value-at-risk) are presented and compared. Solutions generated by the different models on the same data sets provide insights for investors. The data sets include both real-world and simulated data. MATLAB, VBA (with Excel as host), and COIN-OR software were employed. Observations about alternative selection, similarities, and discrepancies among the models are described.
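For intuition, the two-asset case of the classical Markowitz model has a closed form for the minimum-variance weights; the volatilities below are illustrative, not drawn from the paper's data sets:

```python
def min_variance_weights(sigma1, sigma2, rho):
    """Two-asset minimum-variance portfolio: minimize
    w^2*s1^2 + (1-w)^2*s2^2 + 2*w*(1-w)*rho*s1*s2
    over the weight w placed on asset 1."""
    cov = rho * sigma1 * sigma2
    w1 = (sigma2 ** 2 - cov) / (sigma1 ** 2 + sigma2 ** 2 - 2 * cov)
    return w1, 1 - w1

# A 20%-volatility asset paired with a 30%-volatility asset, correlation 0.2:
w1, w2 = min_variance_weights(0.20, 0.30, 0.2)
```

The alternative models in the paper (MiniMax, mean absolute deviation, CVaR) replace the variance objective with other risk measures, which is what drives the discrepancies among their solutions.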
Moumita Hanra, Assessing ultimate impact of Brand Communication on market share using Path Models and its comparison to Ridge regression, March 12, 2009 (Martin Levy, Uday Rao)
Path modeling, based on structural equation modeling, is widely used in market research to analyze the interrelationships among measures and to identify which ones significantly drive sales. In this study, the objective is to find the best-fitting path model, using respondent-level survey data, to assess which brand attributes matter most to consumers in driving sales. The model also indicates which media sources companies should emphasize in advertising to maximize public awareness of a brand, how that awareness shapes the way consumers think about the brand along different dimensions, and how those perceptions in turn drive sales. The second half of the study compares the path model with ridge regression to assess which yields a better and more intuitive fit. Ridge regression reduces multicollinearity among independent variables by adding a ridge control parameter to the X'X matrix used in ordinary least squares regression. The results indicate that the path model fits much better than ridge regression, especially when multicollinearity is not extreme.
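The ridge mechanics described above can be sketched with two deliberately collinear predictors and toy data (pure Python, solving the 2x2 penalized normal equations; none of this is the study's survey data):

```python
def ridge_2d(x1, x2, y, lam):
    """Solve (X'X + lam*I) b = X'y for two centered predictors.
    lam = 0 reproduces ordinary least squares; lam > 0 shrinks the
    coefficients, stabilizing them under multicollinearity."""
    def center(v):
        m = sum(v) / len(v)
        return [vi - m for vi in v]
    x1, x2, y = center(x1), center(x2), center(y)
    a = sum(v * v for v in x1) + lam          # (X'X + lam*I) entry (1,1)
    b = sum(u * v for u, v in zip(x1, x2))    # entry (1,2) = (2,1)
    c = sum(v * v for v in x2) + lam          # entry (2,2)
    g1 = sum(u * v for u, v in zip(x1, y))    # X'y
    g2 = sum(u * v for u, v in zip(x2, y))
    det = a * c - b * b
    return ((c * g1 - b * g2) / det, (a * g2 - b * g1) / det)

# x2 is x1 plus small noise, so X'X is nearly singular and OLS is unstable.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.0, 2.1, 2.9, 4.2, 4.8]
y  = [2.1, 4.0, 6.1, 8.0, 9.9]
ols   = ridge_2d(x1, x2, y, lam=0.0)
ridge = ridge_2d(x1, x2, y, lam=1.0)
```

Here the penalty pulls the two coefficients toward a smaller, more balanced pair instead of letting one collinear predictor absorb all the weight.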
Man Xu, Forecasting Default: A comparison between Merton Model and Logistic Model, March 11, 2009 (Yan Yu, Uday Rao)
The Merton default model, based on Merton's (1974) bond pricing model, has been widely used in both academic research and industry to forecast bankruptcy. This work reexamines the Merton default model, as well as the relationship of default risk with equity returns and the firm-size effect, using an updated database covering 1986 to 2006 obtained from Compustat and CRSP. We concur with most of the findings in Vassalou and Xing (2003): both default risk and size have an impact on equity returns, and the highest returns accrue to the smallest firms with the highest default risk. We then focus on comparing the Merton model (a financial model) with a logistic regression model (a statistical model) for default forecasting. We compare the default likelihood indicator (DLI) from the Merton model with the estimated default probability from the logistic model using rank correlation and decile rankings based on out-of-sample prediction. We find that the functional form of the Merton model is very useful in determining default: its structure captures important aspects of default probability. However, if bankruptcy forecasting is the goal, our empirical results show that the logistic model provides better predictions. We also add distance to default (DD) from the Merton model as a covariate in our best logistic model and find that it is not a significant predictor.
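A minimal sketch of the distance-to-default calculation at the heart of the Merton model (illustrative inputs, and taking asset value and volatility as given, whereas the full model backs them out from equity data):

```python
import math
from statistics import NormalDist

def distance_to_default(V, D, mu, sigma, T=1.0):
    """Merton distance to default: the number of standard deviations by
    which log asset value is expected to exceed the default point D at
    horizon T, given asset value V, drift mu, and asset volatility sigma."""
    return (math.log(V / D) + (mu - 0.5 * sigma ** 2) * T) / (sigma * math.sqrt(T))

def default_probability(dd):
    """Default likelihood indicator: probability assets fall below the debt level."""
    return NormalDist().cdf(-dd)

# Assets worth 100 against a default point of 60, 6% drift, 25% asset volatility:
dd = distance_to_default(100.0, 60.0, 0.06, 0.25)
pd = default_probability(dd)
```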
Luke Robert Chapman, A Current Review of Electronic Medical Records, March 11, 2009 (Michael Magazine, Craig Froehle)
In this project, we examine the imminent installation of Electronic Medical Records (EMR) in hospitals and clinics throughout the United States. The project was motivated by our interaction with the Cincinnati Department of Health (CDH) on a project aimed at persuading the Cincinnati city council to invest immediately in EMR at all six CDH clinics. We review the advantages of EMR and also recognize the disadvantages, some of which were overlooked in the original project with CDH. We also review the current growth of EMR in the US and what the future holds. The main analysis examines the claim that EMR helps reduce medical errors, using multivariate techniques such as factor and cluster analysis.
Chetan Vispute, Improving a Debt-Collection Process by Simulation, March 9, 2009 (David Kelton, Norman Bruvold)
The Auto-Search Process is an automated business process flow designed by Sallie Mae for its in-house collection agency; it works sequentially to procure good phone numbers for delinquent borrowers. The process outsources data to private vendors: records that fail with one vendor are sent to the next until all vendors have been tried. The process is also governed by time-related business rules that allow data to be sent to the next vendor only after a certain period. By placing the cheapest vendor first, the process aims to reduce cost while increasing the procurement of good phone numbers. Before the process could go live, the analytics team was required to analyze it by building a time-based model and to make recommendations. This project builds that time-based model using dynamic discrete-event simulation in Arena and then presents the findings and recommendations, which helped the company improve its annual revenue position by over $440,000.
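The cheapest-first waterfall logic (before the time-based rules the simulation adds) can be sketched directly; the vendor costs and hit rates below are hypothetical:

```python
def waterfall_stats(vendors):
    """Expected cost per record and overall hit rate when records are sent
    to vendors in order, moving on only if the previous vendor fails.
    Each vendor is (cost per record queried, probability of a good number)."""
    reach = 1.0          # probability a record reaches the current vendor
    exp_cost = 0.0
    for cost, hit in vendors:
        exp_cost += reach * cost
        reach *= (1.0 - hit)
    return exp_cost, 1.0 - reach    # reach is now P(every vendor failed)

# Cheapest vendor first: expensive vendors only see the hard-to-find numbers.
cost, hit_rate = waterfall_stats([(1.0, 0.5), (2.0, 0.3), (5.0, 0.4)])
```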
Cary Wise, Cincinnati Children's Hospital Block-Schedule Optimization, February 10, 2009 (Kipp Martin, Craig Froehle, Michael Magazine)
Cincinnati Children's Hospital is implementing an automated process to schedule clinical and surgical patient visits. The goal is to create a program that allocates operating rooms to requests submitted by individual doctors for clinical and surgical time. Schedule creation takes place in two phases: the first phase schedules spaces for specialties (Ortho, Cardio, etc.); the second allocates doctors within the specialty schedule. The program that generates the specialty allocation is named the Space Request Feasibility Solver (SRFS). Its inputs are a set of specialty requests and information about the operating rooms; its output is the schedule of specialty assignments. The problem is formulated as a mixed-integer linear program (MILP) that minimizes the number of unfulfilled space requests. Depending on how specific or general the request parameters are, a very large number of potential assignments may be generated; indeed, the instance quickly becomes intractable for realistic problems. We implement a branch-and-price column generation algorithm to overcome this intractable number of variables. The SRFS invokes a COIN-OR solver named “bcp” to perform branching, solve the LP at each node, and manage the search tree. The scope of this master's project is to implement the column generation scheme in the SRFS. The SRFS was tested by verifying that each selected column had the minimum reduced cost and by checking the results of the LP relaxation and the IP against the solution obtained by exhaustively enumerating all columns. The performance of the SRFS, in terms of the number of columns and nodes created to arrive at a solution, was also investigated.