University of Cincinnati

Carl H. Lindner College of Business
MS Business Analytics

MS Business Analytics Capstone Projects

2016

Uma Lalitha Chockalingam, Customer Churn Propensity Modelling, August 2016, (Dungang Liu, Edward Winkofsky)

Churn is a measure of subscription termination by customers. Churn incurs a loss to the company when investments are made in customers with a high propensity to churn. Churn propensity models can help improve the customer retention rate and hence increase revenue. This paper focuses on the churn problem faced by companies and on predicting customer churn by building churn propensity models. Data for this project are taken from the IBM Watson Analytics Sample Datasets, which contain around 7,043 instances of telecommunication customers’ churn data. In this paper, churn propensity models are built using techniques such as logistic regression, support vector machines, neural networks, random forests, and decision trees. Comparing the models’ performance shows that for out-of-sample prediction, neural networks, logistic regression, and random forests perform better. While neural networks and random forests are black-box algorithms, logistic regression gives good insight into which predictor variables are effective in modelling churn. The in-sample prediction measures of random forests show an ideal misclassification rate, indicating overfitting to the training data. Hence logistic regression is recommended, owing to its good out-of-sample prediction performance along with insights on which predictor variables are significant to the model.
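
The abstract does not include code, but the out-of-sample comparison it describes can be illustrated with a minimal Python/scikit-learn sketch. This is an illustration under assumptions: the file name `telco_churn.csv` and the column `Churn` are hypothetical placeholders rather than the actual IBM Watson sample schema, and logistic regression and random forest stand in for the full set of models compared.

```python
# Hedged sketch: compare out-of-sample AUC of logistic regression vs. random forest
# on a churn dataset. File name and column names are illustrative placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

df = pd.read_csv("telco_churn.csv")                  # hypothetical file
X = pd.get_dummies(df.drop(columns=["Churn"]))       # one-hot encode predictors
y = (df["Churn"] == "Yes").astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=300, random_state=1),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: out-of-sample AUC = {auc:.3f}")
```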

Mohan Sun, Customer Analytics for Financial Lending Industry, November 2016, (Zhe Shan, Peng Wang)

This research involves discovering customers’ experience, attributes and performance to help the company make better decisions to increase profit in all aspects of service, origination and collection. This was done by examining different datasets, such as customer attribute and performance data, using tools such as SAS and GIS, and performing various analyses. Upon examination of these datasets, it becomes clear that answering customers’ phone calls in a timely manner, targeting customers and locating stores in areas with a high demand index, and tracking store and customer performance over time will help the company understand its operations and increase profit.

 

Wenwen Yang, P&G Stock Price Forecasting using the ARIMA Models in R and SAS, December 2016, (Yichen Qin, Dungang Liu)

Time series analysis is commonly used in economic forecasting as well as in analyzing climate data over long periods of time. It helps identify patterns in correlated data, understand and model the data, and predict short-term trends from previous patterns. The aim of this paper is to present a concise demonstration of one of the most common time series forecasting approaches, the ARIMA model, in both R and SAS. The daily stock prices of Procter & Gamble from January 1, 2013 to September 30, 2016 (693 points) were used as an example. The autocorrelation function/partial autocorrelation function plots were used to examine the adequacy of the model, along with the Akaike Information Criterion (AIC). The daily stock prices from October 1, 2016 to November 4, 2016 (25 points) were used to test the model’s performance by calculating the accuracy of the forecasts. The time series modeling was first conducted in R and then validated using SAS. The final model was identified as a moving average model on the first difference. The AIC was 1494 and the average accuracy was 97%, which suggested that the ARIMA model does a good job for short-term prediction. In addition, a log transformation, often preferred in economic forecasting, was applied and produced the same modeling results. In conclusion, this paper demonstrates a comprehensive time series analysis in R and SAS that could serve as useful documentation for beginners.
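
The paper's analysis was done in R and SAS; as a rough companion, here is a minimal Python/statsmodels sketch of the same workflow (fit an ARIMA(0,1,1), check the AIC, and score a 25-day holdout). The price series below is a simulated stand-in, not the actual P&G data pull.

```python
# Hedged sketch of the ARIMA workflow described (original analysis used R and SAS).
# The series here is simulated; replace it with the actual daily closing prices.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

prices = pd.Series(80 + np.cumsum(np.random.normal(0, 0.5, 693)))  # placeholder series

# Final model reported in the paper: a moving average model on the first difference,
# i.e. ARIMA(0, 1, 1)
model = ARIMA(prices, order=(0, 1, 1)).fit()
print("AIC:", round(model.aic, 1))

forecast = model.forecast(steps=25)                  # 25-day holdout horizon
actual = 80 + np.zeros(25)                           # placeholder holdout values
mape = np.mean(np.abs((actual - forecast.values) / actual)) * 100
print("average forecast accuracy (%):", round(100 - mape, 1))
```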

Scott Woodham, Time Series Analysis Using Seasonal ARIMAX Methods, December 2016, (Yan Yu, Martin Levy)

The goal of this analysis is to develop a model that forecasts sales using time series methodology. First, the ARIMA and SARIMA models are presented in their polynomial forms. Second, the process of developing a model from start to finish is carried out, addressing issues such as the stationarity of the data and the interpretation of the ACF and PACF plots to infer the model orders before estimating the parameters. After forecasts are made with the multiplicative seasonal model, the model is extended to include an exogenous variable (SARIMAX) to enhance performance. The current heuristic used to predict sales is the value of sales from a week prior (a lag 6 value). The final model selected has both seasonal and non-seasonal AR and MA components as well as a binary indicator variable. This is sometimes referred to as intervention analysis, though that term is not used here because it usually implies a large sudden shift from which the system recovers, whereas in this analysis the data are more sinusoidal as sales shift from weekdays to weekends.
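
A compact way to see the SARIMAX step is the sketch below, which fits seasonal and non-seasonal AR/MA terms plus a binary exogenous indicator using Python's statsmodels. The simulated daily series, the weekly season (s = 7), and the (1,0,1)(1,0,1,7) orders are assumptions for illustration, not the orders selected in the paper.

```python
# Hedged sketch of a seasonal ARIMA model with an exogenous binary indicator (SARIMAX).
# Data and model orders are illustrative only.
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

n = 365
sales = pd.Series(100 + 20 * np.sin(2 * np.pi * np.arange(n) / 7)
                  + np.random.normal(0, 5, n))                  # placeholder daily sales
weekend = ((np.arange(n) % 7) >= 5).astype(int).reshape(-1, 1)  # exogenous indicator

model = SARIMAX(sales, exog=weekend,
                order=(1, 0, 1), seasonal_order=(1, 0, 1, 7)).fit(disp=False)
print(model.summary().tables[1])

# Forecasting requires the future values of the exogenous variable as well
future_weekend = ((np.arange(n, n + 14) % 7) >= 5).astype(int).reshape(-1, 1)
print(model.forecast(steps=14, exog=future_weekend))
```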

Jin Sun, Internship at West Chester Protective Gear, December 2016, (Yichen Qin, Yan Yu)

West Chester Protective Gear, founded in 1978, is a known leader in the marketplace for providing high-performance protective gear for industrial, retail and welding customers. From gloves to rainwear to disposable clothing, WCPG offers a wide range of quality products, including core, seasonal and promotional products, and is one of the largest glove importers in the United States. This capstone is composed of five projects, most of which are interactive reports built with Microsoft Power BI, a cloud-based business analytics tool. The Order Picking and Fill Rate reports greatly increase the work efficiency of the Warehouse Department, and the reports for the Purchasing Department provide another view of the sales data, which will help the company make a better inventory plan. The last project analyzes the relationship between average sales price and sales units: a linear regression model is built to explain how a change in price affects sales units, model diagnostics are conducted, and model performance on a hold-out sample is evaluated. Throughout this internship, I applied my knowledge from the MSBA program to real-world applications.

Ally Taye, Predicting Hospital Readmissions of Diabetes Patients, December 2016, (Yichen Qin, Yan Yu)

Diabetes is an increasingly common disease among the U.S. population. According to the CDC, the number of people diagnosed with diabetes increased fourfold from 1980 to 2014. In addition, if not well controlled, diabetes can lead to serious complications such as cardiovascular disease, kidney disease, peripheral artery disease, and many others that can result in hospitalization or even death. In light of the seriousness of this condition, it is worth looking into the causes of hospitalization of diabetes patients and what factors influence whether they stay healthy enough to avoid future hospitalizations. This paper looks at a de-identified dataset with information about diabetes patients admitted to hospitals across the U.S. over an extended period of time, and analyzes multiple variables from their hospital records to see whether there is a statistically significant relationship between any combination of these factors and whether the patients were readmitted to the hospital at any time during the observation period after their initial visit.

Hamed Namavari, Disney Princess: Strong and Happy or Weak and Sad, A Sentiment Analysis of Seven Disney Princess Films, December 2016, (Michael Fry, Jeffrey Shaffer)

In a world that is predominantly run by men, several researchers have suggested that entertainment content is shaped more by male influence than by female influence (see Friedman et al.). But what if the general male dominance in the study context is eliminated from the research process? In this capstone, the significance of the main female characters in a select list of Disney Princess movies is explored by comparing their scripts to those of the other main character, who is never female, in each title. The research completed here supports the idea that Disney princess characters are the most positive and most frequently speaking characters in their movies.

Pramit Singh, Sentiment Analysis of First Presidential Debate of 2016, December 2016, (Amitabh Raturi, Aman Tsegai)

Datazar is a platform that uses open data to generate meaningful insights. The sentiment analysis was performed as part of a scalable plan that would allow analysts to reuse the analysis to calculate sentiment scores based on Twitter feeds. It was performed after the first presidential debate to capture the mood of people on social media: the tweets were classified as positive, negative or neutral, and a sentiment score was calculated for each of the presidential candidates.

In addition, logistic regression and random forest techniques were used to predict negative sentiment. While this was implemented for the presidential debate, the functions are reusable and can be used to score any other brand, ensuring the process is scalable and repeatable.

Huangyu Ju, Regression Analysis for Exploring Contributing Factors Leading to Decrease of Cincinnati Opera Attendance, December 2016, (Dungang Liu, Tong Yu)

In recent years, there has been growing concern about the diminishing audience for opera nationwide. Cincinnati Opera has faced a steady decline in total audience attendance over the past decade. The audience of Cincinnati Opera includes subscribers and single ticket buyers. While the numbers of both subscribers and single ticket buyers are decreasing year by year, the number of single ticket buyers is not decreasing as rapidly as that of subscribers. The goal of this project is to explore and identify the variables that may influence, in particular, the number of subscribers. In this project, regression analysis is adopted to explore the contributing factors that impact audience attendance. The analysis identified four categories of variables: variables related to the origin of the opera piece, variables regarding the show time, variables related to popularity, and the theatre capacity. To boost audience attendance, it is recommended that opera pieces with a European background and a good reputation be included in each season’s performances, and that more performances be scheduled during weekends so that they can attract larger audiences.

Darryl Dcosta, Analysis of Industry Performance for Credit Card Issuing Banks, August 2016, (Dungang Liu, Ryan Flynn)

Argus Information and Advisory Services, LLC, is a financial services company that uses credit card transaction-level data collected from different banks and credit bureaus to offer various analytical services to credit card issuers. Argus possesses transaction, risk, behavioral and bureau-sourced data that covers around 85-95% of the banks in the US and Canada. The dataset contains transaction-level data provided by nearly 30 banks across 24 months, with more than 3 million records. This study looks at how Argus can offer an early-bird analysis of the variance in industry performance while abiding by legal regulations that prevent the company from revealing more than a certain level of data, which would pose a threat of price fixing by the client bank. Data are pulled from tables containing different dimensions in the SQL Server database and aggregated to produce client-level reports. After the transaction-level bank data were loaded, validated, normalized and queried from the database, the analysis showed that the projections made were fairly consistent with the observed industry trend, which is a good indication of the accuracy of the projections. Client banks use the flash report to tailor their revenue model and customer acquisition strategy. The spike in total new accounts in the industry for March 2016 was not captured by the projection, which would need to be revisited from a business point of view.

Shivaram Prakash, Predicting Online Purchases, Navistone®, August 2016, (Efrain Torres, Dungang Liu)

E-commerce, the relatively new platform for online retail sales, has seen burgeoning usage since its inception. Although dominated by giants like Amazon and eBay, almost all businesses have their own online store or website, which contributes a sizable chunk of total revenue. Navistone® collects visitor browsing behavior data and analyzes patterns to predict prospective buyers for its clients. The objective of this exercise is to analyze the browsing behavior of online visitors in order to predict whether each visitor will make a purchase. To achieve this goal, visitor browsing data are collected from various client websites, checked for erroneous entries, cleaned and analyzed. Binary response models are then built on a reduced, choice-based dataset to enable better prediction. The first model, a classification tree, is built to help management understand the importance of the different features in the dataset, while the second, a logistic regression model, is built to predict the response more accurately than the classification tree. The logistic regression model produces better predictions in both the training and testing datasets, and the classification tree provides evidence that the number of carts opened is the most important variable, prompting management to focus marketing efforts on visitors who put items in the cart and then abandon them later.
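
As a rough illustration of the modeling steps described (a choice-based sample, a tree for variable importance, and a logistic regression for prediction), here is a minimal Python/scikit-learn sketch. The file name `browsing_sessions.csv` and the `purchased` column are hypothetical placeholders for the visitor-level data.

```python
# Hedged sketch: balanced (choice-based) sample, tree-based feature importance,
# and a logistic regression benchmark. Names are illustrative placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

df = pd.read_csv("browsing_sessions.csv")        # hypothetical visitor-level file
buyers = df[df["purchased"] == 1]
non_buyers = df[df["purchased"] == 0].sample(n=len(buyers), random_state=1)
balanced = pd.concat([buyers, non_buyers])       # choice-based (balanced) sample

X = pd.get_dummies(balanced.drop(columns=["purchased"]))
y = balanced["purchased"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

tree = DecisionTreeClassifier(max_depth=4, random_state=1).fit(X_tr, y_tr)
print(pd.Series(tree.feature_importances_, index=X.columns)
        .sort_values(ascending=False).head())    # which features matter most

logit = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("logistic AUC:", roc_auc_score(y_te, logit.predict_proba(X_te)[:, 1]))
```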

Lavneet Sidhu, Predicting YELP Business Rating, August 2016, (Yan Yu, Glenn Wegryn)

Sentiment analysis, or opinion mining, is the computational study of people’s opinions, sentiments, attitudes, and emotions expressed in written language. It is one of the most active research areas in natural language processing and text mining in recent years. Its popularity is due to its wide range of applications, because opinions are central to almost all human activities and are key influencers of our behaviors. Whenever we need to make a decision, we want to hear others’ opinions. The focus of this study is to quantify people’s opinions on a numerical scale of 1 to 5. Various predictive models were explored and their performance was evaluated to determine the best model. Attempts were made to extract the semantic space from all the reviews using latent semantic indexing (LSI). LSI finds ‘topics’ in reviews, which are words having similar meanings or occurring in a similar context. Similar reviews were then clustered into different categories using the semantic space.
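
For readers unfamiliar with LSI, the sketch below shows the basic idea in Python: build a TF-IDF matrix, project it onto a small number of latent "topics" with truncated SVD, and cluster reviews in that semantic space. The toy reviews and the choice of two components are assumptions for illustration only.

```python
# Hedged sketch of latent semantic indexing (LSI) followed by clustering.
# The reviews are a toy stand-in for the Yelp data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

reviews = ["great food and friendly staff",
           "terrible service, cold food",
           "friendly waiters, amazing desserts",
           "slow service and rude staff"]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(reviews)
lsi = TruncatedSVD(n_components=2, random_state=1)      # latent "topics"
semantic_space = lsi.fit_transform(tfidf)

clusters = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(semantic_space)
print(clusters)     # reviews grouped by their position in the semantic space
```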

Ashok Maganti, Internship with Argus Information and Advisory Services, August 2016, (Harsha Narain, Michael Magazine)

Argus Information and Advisory Services is a leading benchmarking, scoring and analytics provider for financial institutions. Argus helps its clients maximize the value of data and analytics to allocate and align resources to strategic objectives, manage and mitigate risk (default, fraud, funding, and compliance), and optimize financial objectives. One of the core competencies of Argus is being able to link the different accounts of a customer across financial institutions and obtain a complete view of the customer’s wallet: deposits, transfers and spending can be linked, and the complete profile and spending behavior can be studied. The Wallet Analysis team is responsible for the linkage and validation of the data. As a part of the Data and Applications vertical and the Wallet Analysis team, my primary objective was to study the concepts of record linkage and identity resolution, and to develop an algorithm that identifies unique customers from different data sources and populates them into a single normalized flat database using a deterministic record linkage process for the UK market. The records contain credit card account and customer details from different banks; they are linked and integrated so as to identify the same customer across banks and remove duplication. Apart from identifying the accounts of customers across banks, changes in customer details are captured and maintained in the integrated flat database with the help of a Type 2 slowly changing dimension.

Lian Duan, Fair Lending Analysis, August 2016, (Julius Heim, Dungang Liu)

The Consumer Financial Protection Bureau (CFPB) requires lenders to comply with fair lending laws, which prohibit unfair and discriminatory practices when providing consumer loans. Applicants’ demographic information is usually prohibited from collection, yet it is needed to perform fair lending analysis. The objective of this project is to show that the race distribution is similar across the three bins of a predictor from our scorecard. In this project, customers’ race categories were predicted from last names and residential locations according to the Bayesian Improved Surname Geocoding (BISG) proxy method published by the CFPB. Modifications in our analysis include using customers’ Core Based Statistical Area (CBSA) information instead of home address, and using R instead of Stata for data preparation and analysis. The predictor was evaluated based on the race distribution in each bin, and our results suggest that the race distributions across the three bins of this predictor are similar.
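
The Bayesian updating idea behind the BISG proxy can be illustrated with a short sketch: a surname-based race distribution is treated as the prior and updated with a geography-based likelihood, then normalized. This is a simplified illustration only; the actual CFPB method uses Census surname lists and geography population counts, and the numbers below are made up.

```python
# Simplified sketch of the Bayesian update behind the BISG proxy.
# All probabilities below are illustrative, not Census values.
import numpy as np

races = ["white", "black", "hispanic", "asian"]

# P(race | surname): from a Census-style surname table (illustrative values)
p_race_given_surname = np.array([0.70, 0.10, 0.15, 0.05])

# P(geography | race): share of each race's population living in this CBSA (illustrative)
p_geo_given_race = np.array([0.010, 0.030, 0.020, 0.015])

# Bayes update: posterior is proportional to prior x likelihood, then normalize
posterior = p_race_given_surname * p_geo_given_race
posterior = posterior / posterior.sum()

for race, p in zip(races, posterior):
    print(f"P({race} | surname, CBSA) = {p:.3f}")
```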

Juvin Thomas George, Automation of Customer-Centric Retail Banking Dashboards, August 2016, (Andrew Harrison, David Bolocan)

Retail banking is a competitive arena focused on customer-centric service. Customers interacting with banks through multiple channels have created an explosion of data that banks use to generate insights into their behaviors. Understanding customer data is crucial to developing better products and services. Performing analytics on transactional data and utilizing benchmarking studies require the creation of standard dashboards on a regular basis, and automating data input processes and dashboard updates is critical to delivering these services on time. This capstone project was completed at Argus Information & Advisory Services, part of Verisk Analytics, located in White Plains, NY.

Sudarshan K Satishchandra, Prediction of Credit Defaults by Customers Using Learning Outcomes, August 2016, (Peng Wang, Yichen Qin)

Most financial services firms have realized the importance of analyzing credit risk. Predicting credit defaults with higher accuracy can save financial services firms a considerable amount of capital. Many machine learning algorithms can be leveraged to increase the accuracy of prediction. Popular and effective algorithms such as logistic regression, generalized additive models, classification trees, support vector machines, random forests, extreme gradient boosting, neural networks and the lasso are apt for predicting credit defaults. These algorithms were compared using an asymmetric misclassification rate and AUC for out-of-sample prediction. Data from the UCI Machine Learning Repository, donated by I-Cheng Yeh of Chung Hua University, Taiwan, were used.
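
The asymmetric misclassification rate mentioned here can be made concrete with a small sketch: weight false negatives (missed defaults) more heavily than false positives and compare that cost alongside AUC. The synthetic data, the 5:1 cost ratio, and the two models shown are assumptions for illustration.

```python
# Hedged sketch: compare classifiers on an asymmetric misclassification cost and AUC.
# Cost weights and data are illustrative only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def asymmetric_cost(y_true, y_pred, fn_weight=5.0, fp_weight=1.0):
    """Average cost, with missed defaults weighted more heavily than false alarms."""
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return (fn_weight * fn + fp_weight * fp) / len(y_true)

X, y = make_classification(n_samples=5000, weights=[0.8, 0.2], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

for name, model in [("logistic", LogisticRegression(max_iter=1000)),
                    ("random_forest", RandomForestClassifier(random_state=1))]:
    prob = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    pred = (prob > 1.0 / 6.0).astype(int)   # cutoff implied by a 1:5 cost ratio
    print(name, "cost:", round(asymmetric_cost(y_te, pred), 3),
          "AUC:", round(roc_auc_score(y_te, prob), 3))
```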

Rishabh Virmani, Kobe Bryant Shot Selection, August 2016, (Michael Magazine, Yichen Qin)

This report consists of insights about Kobe Bryant’s shot selection throughout his career. The data consist of all of his career shots and whether or not each went in, which is the response variable. In addition, we try to predict Kobe’s performance in the last two seasons of his career, that is, whether he actually sank each shot. For this purpose we employ three algorithms: random forest (a bootstrap aggregating technique), support vector machine (a non-probabilistic technique) and XGBoost (a boosting technique).

Alicia Liermann, The Analytics of Consumer Behavior:  Customer Demographics, August 2016, (Jeffrey Shaffer, Uday Rao)

This project focuses on consumer buying behavior in retail grocery stores across the United States. The data were obtained from historical Dunnhumby data generated by shopping cards and recorded coupon codes, accompanied by transaction information. The project was approached from a business sales and marketing orientation as a means to target customers and increase sales.

Sarthak Saini, Predicting Caravan Insurance Policy Buyers, August 2016, (Peng Wang, Glenn Wegryn)

The project involves analyzing customer data for an insurance company. The aim is to predict whether a customer will buy caravan insurance based on demographic data and data on ownership of other insurance policies. The data consist of 86 variables, including product usage data and socio-demographic data derived from zip codes. There are 5,822 observations in the training data set and 4,000 observations in the testing data set. Predictive models were built to describe customer behavior and classify customers as potential buyers or non-buyers. Given that this is a classification problem, lasso logistic regression, classification trees, random forests, support vector machines (SVM) after dimension reduction by principal component analysis (PCA), linear discriminant analysis (LDA, after PCA) and quadratic discriminant analysis (QDA, after PCA) were used to predict potential customers. Dimension reduction was employed because there are many predictor variables; PCA was used and the first twenty components were retained to build the models. The best results were obtained with LDA and SVM, with a misclassification rate as low as 7% on the testing data. Dimension reduction significantly improved the performance of the models.
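
The PCA-then-classify step can be summarized with a short scikit-learn pipeline: standardize, keep the first twenty principal components, and fit LDA on the reduced space. The synthetic data below merely mimics the dataset's dimensions; it is not the caravan data, and LDA stands in for the several post-PCA classifiers compared.

```python
# Hedged sketch: PCA dimension reduction followed by LDA, as in the analysis described.
# The data are synthetic, generated only to match the rough shape of the problem.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_classification(n_samples=5822, n_features=85, n_informative=15,
                           weights=[0.94, 0.06], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4, random_state=1)

model = make_pipeline(StandardScaler(),
                      PCA(n_components=20),              # first twenty components
                      LinearDiscriminantAnalysis())
model.fit(X_tr, y_tr)
print("misclassification rate:", round(1 - model.score(X_te, y_te), 3))
```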

Abhishek Chaurasiya, Tracking Web Traffic Data Using Adobe Analytics, August 2016, (Dan Klco, Dungang Liu)

A website is the major source of information and interaction between the consumer and producer in any kind of organization or environment. It can be accessed by hundreds to millions of users, which generates huge volumes of data. This data contains important information about customer profiles, demographics, technology used, user patterns, consumer trends, etc. Tracking this data and reporting it in the desired format is therefore a huge and important task. This project uses Adobe Analytics, along with Dynamic Tag Manager (DTM), to track and effectively report this data. The reports are then analyzed with their business value in mind. The analysis concludes that the author ‘Ryan McCollough’ garners the most views, around 90% of the total, through his posts. It is also concluded that Twitter is the most preferred social media channel, driving around 80% of the traffic that follows the blog.

Rutuja Gangane, Customer Targeting for Paper Towels – Trial Campaign, August 2016, (Sajjit Thampy, Yichen Qin)

Customer targeting has been a marketing challenge for many years. The idea behind customer targeting is to target the right kind of customer at the right time with the right kind of product in order to maximize sales, save business resources and maximize profit. Quotient Technology Inc.’s website Coupons.com delivers personalized digital offers in accordance with users’ purchasing behavior data. A customer is shown many different combinations of coupons based on buying patterns/segments created using data-driven techniques. This project is an ad-hoc predictive analysis to determine the target customers for a paper towel producing CPG brand (say, YZ) that is targeting its customers with personalized coupon offers for various retailers. The main idea behind the campaign is to generate trial. It is easy to determine user behavior based on previous trial campaigns, but as this is the first campaign of its sort for the brand, we make use of multiple machine learning techniques, heuristics and business knowledge to make the best predictions about which customers are likely to try the product.

This project makes extensive use of SQL queries (Hadoop Impala) and R to perform data analyses, market basket analysis, logistic regression, random forests, SVM models and similar machine learning techniques to find which customers are more likely to buy YZ Brand’s paper towels and should be targeted in this trial campaign.

Hardik Vyas, Analysis of Kobe Bryant Shot Selection, August 2016, (Michael Magazine, Peng Wang)

The key objective of this project is to explore the data pertaining to all of the 30,697 shots taken by Kobe Bryant during his NBA career. We also develop various models to predict which of these shots would have made the basket had the outcome been unknown. The problem is based on a now-closed Kaggle competition, introduced following Kobe Bryant’s retirement from professional basketball on April 12, 2016. Kobe played his entire 20-year NBA career with the Los Angeles Lakers. He had an illustrious career to say the least: he holds numerous records and is regarded as one of the most celebrated players to ever grace the game.

Nikita Mokhariwale, Reporting Analyst Internship at BlackbookHR, Cincinnati, August 2016, (Marc Aiello, Peng Wang)

The importance of data interpretability is often overlooked in executive reporting. The customer experience can improve manifold if executive reports are made user-friendly, in such a manner that executives are encouraged to see patterns and trends in the data, and even to question the data. I transformed the traditional reports that BlackbookHR used to create for all its clients in the talent analytics space. The traditional reports consisted of numbers and tables that were tedious to read and provided little insight beyond the results of surveys taken by the client’s employees. I introduced innovative visualizations and charts in the reports and minimized the use of numbers in depicting the data. The visualizations helped executives view their organization in one snapshot without having to perform any mental calculations, as there were no numbers involved. This received very positive feedback from clients because such charts helped them find patterns even in areas where they weren’t expecting them. For example, one client was able to identify a possible negative correlation between team size and levels of employee engagement. My work was primarily based on Excel and Tableau. I later created Excel and Tableau templates that could be used for all future reporting purposes, and ran scalability tests so that the reporting templates would work for larger clients and stay robust when varied data is introduced.

Joshua Roche, Market Analysis Framework for Mobile Technology Startups, August 2016, (Amit Raturi, Michael Magazine)

The current technological revolution has created a veritable modern-day “gold rush” due to an ever-growing market and much lower barriers to entry than in traditional industries. Many startups do not pursue an analytical study of the market they seek to enter before development begins, which can lead to a tremendous undertaking that is in effect useless. This paper seeks to establish an initial analytical framework for testing market-potential assumptions before work begins, so that entities with limited resources, limited analytical prowess and information asymmetry can make more informed decisions.

Shashank Pawar, Hybrid Movie Recommender System Using Probabilistic Inference over a Bayesian Network, August 2016, (Peng Wang, Edward Winkofsky)

Recommender systems are widely used to help users on the Internet by suggesting products or services they would be interested in, based on their historical behavior as well as the behavior of other, similar users. Two different approaches are usually adopted when developing a recommender system: content-based and collaborative filtering. This project studies the application of a hybrid approach, combining content-based and collaborative filtering techniques, in developing a recommender system for movies. The data set used is the MovieLens 100K data set, consisting of 100,000 ratings by 943 users of 1,682 movies, where each movie is described by one or more of 19 features or genres. The objective is to predict how a given user would rate a movie he has not yet rated. A Bayesian network is used to represent the interactions and dependencies among movies, users and movie features, which are represented as nodes in the graph. To find users similar to a given user for the collaborative filtering part, two similarity measures are computed on the ratings the two users have given to common items: first, the Pearson correlation coefficient between the two sets of ratings, and second, the count of instances where the two users have both rated a movie low or both rated it high.
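
The Pearson-correlation similarity used in the collaborative-filtering part can be sketched directly on a toy ratings matrix: correlate two users over their commonly rated movies, then predict an unseen rating as a similarity-weighted average of neighbors' ratings. The matrix and the non-negative weighting below are assumptions for illustration, not the MovieLens data or the Bayesian-network machinery.

```python
# Hedged sketch: Pearson-correlation user similarity on commonly rated movies,
# and a similarity-weighted rating prediction. Toy data only.
import numpy as np
import pandas as pd

ratings = pd.DataFrame({            # rows = users, columns = movies (NaN = unrated)
    "m1": [5, 4, np.nan, 1],
    "m2": [np.nan, 3, 4, 2],
    "m3": [4, 5, 5, np.nan],
    "m4": [1, 2, np.nan, 5],
}, index=["u1", "u2", "u3", "u4"])

def pearson_similarity(a, b):
    common = ratings.loc[[a, b]].dropna(axis=1).T    # movies rated by both users
    if len(common) < 2:
        return 0.0
    return common[a].corr(common[b])                 # Pearson correlation coefficient

# Predict u1's rating of m2 as a weighted average over users who rated m2
target, movie = "u1", "m2"
neighbors = [u for u in ratings.index
             if u != target and not np.isnan(ratings.loc[u, movie])]
weights = np.array([max(pearson_similarity(target, u), 0) for u in neighbors])
scores = np.array([ratings.loc[u, movie] for u in neighbors])
prediction = (weights @ scores) / weights.sum() if weights.sum() > 0 else scores.mean()
print("predicted rating:", round(prediction, 2))
```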

Raunak Bose, Machine Learning - Comparison Matrix, August 2016, (Uday Rao, Michael Magazine)

With the availability of several options, selecting tools for machine learning algorithms has become cumbersome. Each algorithm brings its own pros and cons to the machine learning community, and many have similar uses. The era of huge data collection is already here, and current machine learning tools need real-time processing abilities to meet the requirements of their users. Through this paper, I wish to give researchers a starting point for utilizing machine learning with Python. In order to evaluate tools, one should have a thorough understanding of what to look for. This paper uses the Python platform to evaluate machine learning algorithms in terms of confusion matrices and hardware usage. We look at libraries such as scikit-learn and study their use in processing data for supervised learning algorithms.
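
As a concrete example of the kind of scikit-learn evaluation the paper discusses, the sketch below fits one classifier on synthetic data and prints its confusion matrix and per-class metrics. The data and the choice of random forest are illustrative assumptions.

```python
# Hedged sketch: evaluating a scikit-learn classifier with a confusion matrix.
# Synthetic data for illustration only.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report

X, y = make_classification(n_samples=2000, n_features=20, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

clf = RandomForestClassifier(n_estimators=200, random_state=1).fit(X_tr, y_tr)
pred = clf.predict(X_te)

print(confusion_matrix(y_te, pred))        # rows = true class, columns = predicted class
print(classification_report(y_te, pred))   # precision, recall, F1 per class
```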

Ryan Stadtmiller, Predicting Season Football Ticket Renewals for the University of Cincinnati Using Logistic Regression and Classification Trees, July 2016, (Michael Magazine, Brandon Sosna)

Season ticket holders (STHs) are important for both collegiate and professional sports teams. Season tickets allow fans to take ownership in the team and also provide a significant amount of overall revenue for the team’s ownership. For these reasons, maintaining a high STH renewal rate is important to the team’s on- and off-field performance. I focus on analyzing STH renewals for the University of Cincinnati’s football team. I use statistics and data mining techniques to predict whether an STH is likely to renew their seats based on predictor variables such as quantity of tickets, section, percentage of tickets used throughout the year, and percentage of games attended, among many others. If a customer is not likely to renew their tickets, the athletic department can take preemptive measures to retain the customer.

Nicholas Imholte, Optimizing a Baseball Lineup: Getting the Most Bang for Your Buck, July 2016, (Michael Magazine, Yichen Qin)

Given a fixed payroll, and focusing purely on the offensive side of the ball, how should a baseball team assign its funds to give itself the highest average number of runs possible? In this essay, I will attempt to answer this question using regression, clustering, optimization, and simulation. First, I will use regression to model baseball scores, with the goal being to determine how each event in a baseball game impacts how many runs a team scores. Second, I will use clustering to determine what kinds of hitters there are, and how much each type of hitter costs. Third, I will use optimization to determine the optimal arrangement of hitter clusters for a variety of payrolls. Finally, I will complement this analysis with a simulation, and see how the results from the two approaches compare.

Nidhi Shah, Revenue Optimization through Merchant-Centric Pricing, July 2016, (Jay Shan, Madan Dharmana)

A payment processor that processes credit and debit card transactions wanted to devise a strategy to maximize the revenue it makes from merchant transactions by periodically re-pricing its merchants’ processing rates. The biggest challenge with increasing a merchant’s rate, as with any customer of a business, is that there is a very fine line between driving the customer away due to price sensitivity and finding an optimum price point that extracts the most revenue while retaining the customer.

To address this challenge, we implemented a dynamic, merchant-centric pricing strategy where each merchant is treated individually - based on their profile - while determining the pricing action to be taken. In order to achieve this, we designed an automated solution in SAS that came up with a unique pricing recommendation for each merchant based on certain decision rules. The strategy to maximize revenue was implemented by increasing processing rates up to the merchant’s segment (industry and volume tier) benchmark along with certain other constraints. This automated solution allowed re-pricing to be done more frequently (monthly) which resulted in an annual incremental revenue of ~$500,000 for the payment processor.

Kristofer R. Still, Forecasting Commercial Loan Charge-Offs Using Shumway’s Hazard Model for Predicting Bankruptcy, July 2016, (Yan Yu, Jeffrey Shaffer)

In the course of lending money, a certain percentage of a bank’s outstanding loans will be deemed uncollectible and charged off. Because charge-offs can lead to significant losses, commercial banks try to minimize them by closely monitoring borrowers for signs of default or worse. Commercial banks maintain detailed financial records for their customers, which include numerous accounting ratios. This analysis seeks to leverage this accounting data to predict corporate charge-offs using a sample of firms from January 1, 2000 through the present. A simple hazard model is used and compared to older discriminant analysis methods based on out-of-sample classification accuracy.

Sahithi Reddy Pottim, Building a Probability of Default Model for Personal Loans, July 2016, (Dungang Liu, Yichen Qin)

The consumer lending industry is growing rapidly, with a wide spread of loan types, and lending personal loans over the internet is gaining importance. The main goal of the project is to determine which customers should be offered a loan in order to maximize the profit of a small finance company that issues loans to customers over the internet. The data set has information on past loan performance and contains about 26,194 loans with 70 variables. The variables can be categorized as application data, credit data, loan information and loan performance. The crux of the project is the selection of variables using the weight of evidence (WoE) and information value (IV) concepts, which measure the predictive power of a variable with respect to the response. It was noticed that the weight of evidence is high for variables where the percentage of good and bad loans changes significantly from bin to bin. Variables with information value between 0.02 and 0.26, which can be classified as weak, average and strong predictors, were considered for building a logistic regression model, which resulted in an AUC of 0.67. However, information value does not take into account correlation or multicollinearity among the variables; a further check using the variance inflation factor (VIF) reduced the variable set. A step-wise logistic regression model built on the variables selected using information value resulted in a further reduction of variables, an AUC of 0.69, and a lower misclassification rate of good and bad risk loans. The results showed that information value is one of the best variable selection procedures and that a step-wise logistic regression model was best suited for predicting the probability of default on this dataset.
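
The weight-of-evidence and information-value screening can be illustrated with a short pandas sketch over a binned predictor and a default flag; the toy data and the good/bad labeling convention below are assumptions, not the project's loan file.

```python
# Hedged sketch of the weight-of-evidence (WoE) / information-value (IV) calculation
# for one binned predictor. Toy data only.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income_bin": ["low", "low", "mid", "mid", "mid", "high", "high", "high", "high", "low"],
    "default":    [1,     1,     0,     1,     0,     0,      0,      0,      1,      0],
})

def woe_iv(data, bin_col, target_col):
    grouped = data.groupby(bin_col)[target_col].agg(bad="sum", total="count")
    grouped["good"] = grouped["total"] - grouped["bad"]
    dist_bad = grouped["bad"] / grouped["bad"].sum()
    dist_good = grouped["good"] / grouped["good"].sum()
    grouped["woe"] = np.log(dist_good / dist_bad)             # WoE per bin
    grouped["iv"] = (dist_good - dist_bad) * grouped["woe"]   # each bin's IV contribution
    return grouped, grouped["iv"].sum()

table, iv = woe_iv(df, "income_bin", "default")
print(table[["good", "bad", "woe", "iv"]])
print("information value:", round(iv, 3))
```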

Joseph Chris Adrian Regis, Human Activity Recognition using Machine Learning, July 2016, (Yichen Qin, Dungang Liu)

The Weight Lifting dataset is investigated in terms of "how (well)" an activity is performed, which can have real-life applications in the sports and healthcare space. In this capstone, machine learning algorithms are applied with the intention of checking the feasibility of their application in terms of accuracy. The data were collected from wearable accelerometers and consist of 39,242 observations with 159 variables. Features were calculated on the Euler angles (roll, pitch and yaw), as well as the raw accelerometer, gyroscope and magnetometer readings from the wearable devices. We chose algorithms in order of increasing complexity to probe accuracy with respect to the algorithm used: decision trees, random forests, stochastic gradient boosting and adaptive boosting were applied. We saw that there is not much difference among the latter three (less than 0.25% apart in accuracy), but they were much better than decision trees, as expected. Having to choose among the three, we selected adaptive boosting as the final algorithm. We obtained an accuracy of 99.95% with adaptive boosting on the scoring dataset, and this is the expected accuracy in a general application using the same setup.

Jigisha Mohanty, Analyzing the Relationship between Customers for the Commercial Business of the Bank to Identify the Nature of Dependency and to Predict the Direction of Risk in Cases of Possible Adverse Effects, July 2016, (Kristofer Still, Michael Magazine)

The Commercial banking business deals with many customers that buy various products from the bank. There are scenarios where a company and its parent company are both customers of the bank. Further, a bigger company can guarantee the loan requested by another company. Each loan or credit service established carries a certain amount of risk for the bank. Each relationship is rated based on such factors of risk. The direct risk to the bank is established by the direct exposure amount assigned to the company. When a different company owns or guarantees for a company, the latter’s direct exposure also shows up as indirect exposure for the former. This implies that if the smaller company defaults in paying back its loan, the company owning or guaranteeing for it is responsible for the entire loan taken by the smaller company.

The objective of this study is to create a network map to identify such connections. The network map will provide a visual description of the relationship between two customers and show the dependencies between customers. The second objective of the study is to identify the direction of risk in terms of direct and indirect exposure for primary and secondary companies and so on. This will help the bank establish a line of action and quantify the exposure amount attributed to each customer. Establishing the direction of risk will also open up analysis of the impact of possible adverse events.

Minaz Josan, Sentiment Analysis for the Verbatim Response Provided by Clients for Satisfaction Survey for Fifth-Third Bank, July 2016, (Kristofer Still, Yan Yu)

The financial services industry is still struggling with high churn rates, as customers have numerous options for where they can bank. This leads to the need for understanding hidden customer sentiments. The industry has realized the need to strengthen relationships with its customers. One measure taken is to monitor the performance of representatives and the satisfaction of the clientele with the institution as well as with the representative. An overall satisfaction score is assigned to every representative based on surveys completed by clients on the performance of the bank and the representative. These surveys also include verbatim responses. In this project, an attempt is made to identify the sentiment behind these verbatim responses and its correlation with the overall satisfaction score. The responses are analyzed on a three-category scale of positive, negative or neutral using the supervised learning models of SVM (support vector machines) and logistic regression.

Adam Sullivan, Predicting the Rookie Season of 2016 NFL Wide Receivers, July 2016, (Yichen Qin, Michael Magazine)

The NFL has never been more popular than it is today, and part of why the sport has become so popular is the expansion and exponential growth of fantasy football. According to American Express, nearly 75 million people were expected to play fantasy football and spend nearly $5 billion to play over the course of the 2015 season. The leagues people play in range from daily fantasy football, where different players can be selected each week, to dynasty fantasy football, where players can be kept for their whole careers. This analysis is focused through the lens of dynasty fantasy football, which is seeing its own explosion of participants. In dynasty fantasy football the wide receiver is king, with 16 of the top 20 ranked players being wide receivers. The purpose of this analysis is to give insight into which 2016 rookie wide receivers are in the best position to succeed in their rookie seasons and would justify being selected early in dynasty football drafts.

Eulji Lim, Cincinnati Crime Classification, July 2016, (Yan Yu, Dungang Liu)

Every citizen expects prompt service from the police, and the police department wants to satisfy citizens through resource management and other tools. This study aims to build “Cincinnati crime category prediction models” and to gain insight into the crime data through appropriate data visualization. The Cincinnati Police Crime Incident dataset is provided by the City of Cincinnati Open Data Portal. It contains the time and location of crimes in the six districts of Cincinnati from 1991 to the present and is updated daily. Specifically, there are over one hundred eighty thousand incidents from January 2011 to May 2016, which is the subset chosen for the analysis. The crime classification idea and the model evaluation method are inspired by one of the Kaggle competitions, “San Francisco Crime Classification”. Data exploration revealed that month and season affect the number of crimes rather than the types of crime. Logistic regression models are built in R with different time and geographical attributes. The hour, year and neighborhood factors are found to be more effective than other factors, such as latitude and longitude, in building the model with the lowest log-loss (2.133). In addition, random forest and tree models are built in SAS Enterprise Miner, and the random forest model with hour and neighborhood factors shows the best performance with the lowest misclassification rate (0.67).

Joshua Horn, Analysis and Identification of Training Impulses on Long-Distance Running Performance, July 2016, (David Rogers, Brian Alessandro)

Long-distance running is one of the most popular participatory sports in the United States; in 2015 there were 17.1 million road race finishers and over 500,000 marathon finishers, each collecting a trove of untapped data. The subject of this analysis has been a semi-competitive runner since 2000 and began collecting personal running data in 2004, with an increase in detail in 2007 while competing collegiately and again with the inclusion of GPS data in 2014. Using these data and background knowledge of training theory and exercise physiology, a variety of new variables were defined and explored for their ability to explain changes in athlete fitness, defined by VDOT, a pseudo form of VO2max, the maximal oxygen consumption rate. The primary objective was to identify the primary drivers of VDOT to inform future training decisions. Based on a combination of heuristic, ensemble, and complete search methods across linear, additive, and tree regressions, the 48-week measure of training impulse, measured in intensity points, was identified as the primary driver of changes in VDOT. From these results, future training for the athlete should focus on maintaining long-term consistency, with the 48-week training impulse between 3,500 and 4,500 points, a zone that produces VDOT outcomes in the 66th to 86th percentiles without inducing the substantial physiological (muscular degradation due to insufficient recovery) and psychological (mental strain accompanying the 10 to 16 hours required for weekly training) stress associated with higher training loads.

Linlu Sun, Analysis and Forecast of Istanbul Stock Data, July 2016, (Yichen Qin, Peng Wang)

The Istanbul Stock Exchange data set is collected from imkb.gov.tr and finance.yahoo.com and is organized by working days of the Istanbul Stock Exchange. The objective of this exercise is to forecast the response variable ISE. First, we use the mean of ISE as a baseline forecast; if that does not perform well, we build a linear regression model on an 80% sample of the data. The initial approach involves performing exploratory data analysis to understand the variables and selecting the best model with the most appropriate variables using linear regression. Based on the best model, we forecast each predictor variable and then use the model to forecast the ISE values for the next 10 days.

Subhashish Sarkar, Sentiment Analysis of Windows 10 – Through Tweets, July 2016, (Dungang Liu, Peng Wang)

With the advent of mobile operating systems (OS), Microsoft revamped its value offering and launched the Windows 10 OS in July 2015; it works across devices (laptops, desktops, tablets and mobile phones). To gain market share and attract existing users to install the new OS, Microsoft offered a free upgrade that was expected to end on July 29, 2016. However, Microsoft has not been able to generate the targeted traction for its new OS among its user base. The purpose of this project is to explore the sentiments of the user base and thereby the reasons why Windows 10 is not getting the traction targeted by Microsoft. Sentiment analysis helps brands determine the wider public perception of a product on social media. Results from the analysis can be used as direct feedback to alter product strategy or to prune and add product features. In this case, lexicon-based sentiment analysis of tweets on Windows 10 revealed that only 24% of users had a positive opinion. The analysis using ordinal regression further highlighted specific issues that contributed to the negative opinions; for example, the negative emotions were due to bugs, crashes, installation errors and the aggressive promotion adopted by Microsoft. The positive opinions about Windows 10 centered on the host of features available in the OS. The report also identifies frequently used contextual words that can be added to the lexicon to improve the parsing of emotions.

Zhiyao Zhang, Methodology on Term Frequency to Define Relationship between Public Media Articles and British Premier League Game Results, July 2016, (Yichen Qin, Michael Magazine)

This project is intended to determine whether a relationship exists between public media coverage and the results of Premier League games. The Premier League is a soccer league filled with rumors, sources, news, and critics. Players in the league suffer from the pressure and anxiety of the critics, which may potentially affect their game performance; however, we do not know whether the relationship actually exists, and even if it does, we do not know how the public media affect the games. In this research, I investigate these two questions using term frequency.

Emily Meyer, Demonstration of Interactive Data Visualization Capability for Enhancement of Air Force Science and Technology Management, July 2016, (Yan Yu, Jeff Haines)

Data must often pass through certain people and channels before it becomes information and reaches someone who makes a decision. In order to make the data-to-decision-maker pipeline more expedient, an effort is being undertaken to set up Tableau and Tableau Server within the Air Force Research Laboratory Materials and Manufacturing Directorate (AFRL/RX). This effort is a long-term project whose initial stage focuses on setting up Tableau Server and demonstrating the capabilities of blending data and using interactive dashboard visualizations for personnel within AFRL, in order to create early adopters within the organization. The following report is an intern’s contribution toward demonstrating the capabilities of Tableau through creating PowerPoint presentations, Tableau Story Points, and Tableau dashboards, identifying principles for structuring data, cleaning up datasets, and refining already-created dashboards.

Zhaoyan Li, Identifying Outliers for the TAT Analysis, July 2016, (Michael Magazine, Ron Moore)

The goal of our company is to provide the best healthcare services (cleaning, equipment delivery, etc.) to our client, Cincinnati Children’s Hospital Medical Center (CCHMC), and of course to the patients who visit the hospital. Naturally, data analysis plays a role in improving the quality and efficiency of our services. My data analysis work falls into four categories: turnaround time analysis, supervisor inspection analysis, patient survey analysis, and full-time employee analysis. Since October 2016, the profits our company receives from CCHMC are all based on metrics. For example, the hospital requires that 90% of all cleaning requests be completed within 60 minutes. If we hit this goal, we receive 100% of the payment; if we hit 90% within 65 minutes, we get 75%, and so on. The goal from the perspective of a data analyst is to generate graphs that tell how we performed in the past as well as to figure out a way to meet the objective metrics set by CCHMC.

Shivanand Yashasvi Meka, Predicting Customer Response to Bank Marketing Campaigns, July 2016, (Peng Wang, Dungang Liu)

Banks often market their credit cards, term deposits, etc. by cold-calling their customers. Each call has a cost associated with it, and calling the entire customer base is not prudent for a bank, as only a small percentage of those customers actually convert. Therefore, in order to reduce costs and improve efficiency, it is important to have a good prediction model for identifying those customers who are likely to respond positively to the marketing campaign. The dataset used for this project contains information on 41,188 customers who were approached about subscribing to a term deposit offered by a Portuguese bank. The objective of this project is to develop a model that accurately predicts whether a contacted customer will subscribe to the term deposit.

Six different modeling techniques were used in this project. These models use 19 variables that span a customer’s demographic information, credit and previous-campaign information available to the bank, and macro-economic variables. Four of the six models have very similar performance, and they improve the net profit generated from the marketing campaign by 50%.

Subhasree Chatterjee, Movie Recommendation System Using Collaborative Filtering, July 2016, (Yan Yu, Peng Wang)

A movie recommendation system recommends movies to users based on their ratings of movies they have already watched. Collaborative filtering is used in this project to achieve that goal. This report is a comparative study of the different collaborative filtering methods used in industry and seeks to find the best method for the data at hand. The project also predicts the top 5 recommended movies per user, based on their historical ratings from the MovieLens database.

Sanjita Jain, Incident Rate Analysis, July 2016, (Dungang Liu, Michael Magazine)

XYZ, a major wireless network operator in the United States, offers its subscribers a couple of handset protection (insurance) programs, which cover all mobile devices including kit accessories (wall charger, battery, SIM card) against total-loss claims such as loss/theft or physical/liquid damage for as long as the feature is paid for. Understanding the subscriber features that affect claim propensity would help in better budgeting for the future. To understand the relationship between incident rate and subscriber features, the subscriber base (approximately 10 million per month over a period of 29 months) was analyzed by features such as credit class, region and device, and the fulfilled claims (approximately 115 thousand per month over the same period) were analyzed by features such as day of week and tenure. The analysis shows that the P credit class, in spite of making the maximum number of claims, has the lowest incident rate; March and August have the maximum incident rates; the regions have no significant effect on claim-filing propensity; and the incident rate decreases as tenure increases. These findings will be helpful in improving forecast accuracy and recommending improvements to reverse logistics activities.

Sai Shashanka Suryadevara, Image Classification: Classifying Images Containing Dogs and Cats, July 2016, (Yan Yu, Peng Wang)

Many websites check for HIP (Human Interactive Proof) or CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) when users try to register for their web services. This verification serves many purposes, such as reducing email and blog spam and preventing brute-force attacks on website passwords. Users are presented with a challenge that is supposed to be easy for people to solve but difficult for computers. The common practice is to ask the user to identify the letters or numbers in an image, which are usually distorted. Solving such a CAPTCHA is not always easy, and sometimes users get frustrated in the process. Another practice for verifying human interaction is to show users pictures containing both dogs and cats and ask them to identify the images that contain cats (or dogs). Studies show that users can accomplish this task quickly and accurately; many even think it’s fun. For computers, however, this task is not so easy because of the many similarities that exist between these animals. In this project, supervised learning models are implemented to classify the images. The images are processed to obtain pixel information by standardizing and rescaling. Using the pixel information, features are extracted and models are trained using methods such as k-nearest neighbors, neural networks and support vector machines.
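
The pixel-based pipeline described (rescale, flatten, classify) can be sketched as follows; the file paths, labels, image size, and the choice of k-nearest neighbors are hypothetical placeholders, and PIL is assumed for image loading.

```python
# Hedged sketch: rescale images, flatten pixels into feature vectors, fit k-NN.
# File names and labels are placeholders for the dogs-vs-cats images.
import numpy as np
from PIL import Image
from sklearn.neighbors import KNeighborsClassifier

def pixels(path, size=(64, 64)):
    """Load an image, convert to grayscale, rescale, and return normalized pixels."""
    img = Image.open(path).convert("L").resize(size)
    return np.asarray(img, dtype=float).ravel() / 255.0

train_paths = ["cat_001.jpg", "dog_001.jpg", "cat_002.jpg", "dog_002.jpg"]  # placeholders
train_labels = ["cat", "dog", "cat", "dog"]

X_train = np.array([pixels(p) for p in train_paths])
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, train_labels)

print(knn.predict([pixels("mystery_pet.jpg")]))      # hypothetical test image
```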

Santosh Kumar Molgu, SMS Spam Classification using Machine Learning Approach, July 2016, (Dungang Liu, Peng Wang)

Spam messages are identical messages sent to numerous recipients by email or text message for purposes such as mass marketing, gaining clicks on websites, scamming users and stealing data. Many carriers have started addressing SMS spam by allowing subscribers to report spam and taking action after appropriate investigation. In some places, they have imposed limits on text length and on the number of messages per hour and per day to crack down on spam, but SMS spam has steadily grown over the past decade. A lot of research and work has been done on email spam and implemented by many mail service providers; SMS spam is relatively new and differs from email in the nature of communication and the availability of features. This paper applies data mining techniques to build multiple SMS spam classifiers and compares their performance. The results show that Naive Bayes has high precision and a low blocked-ham rate, and that using term frequency with inverse document frequency increases the spam-caught rate.
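
The TF-IDF plus Naive Bayes pipeline can be sketched in a few lines of scikit-learn; the toy messages below stand in for the SMS corpus, and the pipeline is a simplified illustration rather than the paper's exact preprocessing.

```python
# Hedged sketch of a TF-IDF + Naive Bayes spam classifier. Toy messages only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

messages = ["WIN a free prize now, click here",
            "Are we still meeting for lunch today?",
            "URGENT: claim your cash reward",
            "Can you send me the notes from class?"]
labels = ["spam", "ham", "spam", "ham"]

clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(messages, labels)

print(clf.predict(["free cash prize, claim now"]))   # likely classified as spam
print(clf.predict(["see you at lunch"]))             # likely classified as ham
```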

Rujuta Kulkarni, Income Capacity Prediction, July 2016, (Peng Wang, Dungang Liu)

This analysis aims at segmenting customers into income groups above and below $50K, which can be used for target marketing. The data were collected from the 1994 census database by selecting records for ages 16 to 100 and applying a few additional filters. An initial exploratory analysis was performed, and the data were modified as required for simplification. Different modelling techniques were then fitted to the data and tested for their in-sample and out-of-sample performance on the basis of cost and AUC. Techniques such as QDA, SVM, and an ensemble of classifiers were also tried.

Geran Zhao, Call Center Problem Category Prediction and Call Volume Forecasting, July 2016, (Yichen Qin, Yan Yu)

This paper is about the call center of United Way 211-Greater Cincinnati. There are two objectives. The first is to use logistic regression to predict whether a call falls into the basic-needs category, based on caller information; Hamilton, Kenton, Campbell, Montgomery, Clark, Warren, Individual, Self Referred, Referral, Information Only, Black/African American, Family, and Agency were found to be related to basic needs. The second is to build an ARIMA model to forecast the call center's call volume and identify its trend; the forecast indicates that call volume is decreasing.
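
For the second objective, a minimal R sketch of an ARIMA-based call-volume forecast is shown below, using the forecast package with automatic order selection as a stand-in for the paper's own model identification; monthly_call_volume is an assumed numeric vector of historical monthly call counts.

    # Minimal sketch: ARIMA forecast of monthly call volume
    library(forecast)

    calls <- ts(monthly_call_volume, frequency = 12)  # assumed monthly series
    fit   <- auto.arima(calls)                        # selects (p, d, q) by AICc
    fc    <- forecast(fit, h = 12)                    # 12-month-ahead forecast
    plot(fc)                                          # visual check of the predicted trend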

Shikha Shukla, Grocery Analytics: Analysis of Consumer Buying Behavior, July 2016, (Christopher Leary, Dungang Liu)

The client is an American grocery retailer in Cincinnati, Ohio. The focus of this research is to help the client's marketing department make data-driven decisions and identify prospects for sales growth in the coming year. The study helps the client identify and target their customer base more effectively through market basket analysis. The data provided by the client comprised household-level demographic data, campaign data, and item-level point-of-sale data for the past two years. Market basket analysis identified the most frequently purchased product and a number of other products that are always purchased with it; many product combinations had a confidence of 1, meaning the second product appeared in every transaction containing the first. This information should help the client segment customers based on their buying behavior so that these segments can be micro-targeted more effectively through specialized marketing campaigns. Using statistical techniques such as the t-test, we showed that marketing campaigns have a significant impact on sales and lead to a substantial increase in sales; however, the number of campaigns active in a given time period and the type of campaign do not significantly affect sales. Based on this information, the client can plan marketing campaigns more cost-effectively.
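
A minimal sketch of the market basket step with the arules package is shown below; trans_list is an assumed list of character vectors (one basket of item names per transaction), and the support/confidence thresholds are illustrative rather than the values used in the study.

    # Minimal sketch: association rules for co-purchased products
    library(arules)

    trans <- as(trans_list, "transactions")
    rules <- apriori(trans,
                     parameter = list(supp = 0.01, conf = 0.8, minlen = 2))
    inspect(head(sort(rules, by = "lift"), 10))          # strongest co-purchase rules
    inspect(subset(rules, subset = confidence >= 0.999)) # rules that (almost) always hold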

Radostin Tanov, NFL Player Grading with Predictive Modeling, July 2016, (Michael Magazine, Ed Winkofsky)

The goal of this analysis is to use predictive modeling software to determine the most important variables in player grading for three types of reporting: contribution, durability, and performance. Understanding which variables are important can be used to adjust the weights of the previously developed model or to restructure how the grades are calculated, as well as to identify alternative models that could be used to compute the grade.

Ninad Shreekant Dike, Forecasting Bike Sharing Demand, July 2016, (Peng Wang, Michael Magazine)

The objective of this analysis is to predict the number of bikes rented per hour in the Capital Bikeshare system in Washington, D.C. The Bike Sharing Dataset is taken from the UCI Machine Learning Repository and comprises 14 predictor attributes and 17,379 instances. An initial exploratory analysis was performed to gain an understanding of the data and the variables. Some inconsistent or inaccurate data, including 3 variables, were removed or modified to ensure the cleanest possible data, and two additional predictor variables for bikes rented in the past two hours were added. The data were randomly sampled (stratified) into 80% training and 20% testing sets five separate times as a substitute for cross-validation. The programming language R was used for the analysis throughout, with Microsoft Excel used for some graphs. Data modeling was initially done using linear regression; then, noting that the assumptions of linear regression were violated, a log transformation and non-parametric methods (generalized additive model, regression tree, and random forest) were employed. Finally, a time series model was fitted to the data. The models were evaluated using the R-squared value (training data) and the mean absolute error and mean squared error (testing data), with each statistic reported as the mean over the five iterations of training and testing sets. The time series model produced the best forecast, coming within 18 units of the actual values on average for a 5-hour forecast.
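
One of the five train/test iterations could look like the sketch below, shown here with a random forest and the MAE/MSE test metrics; bike and its response column cnt are assumed names, and this simplified version uses plain random sampling rather than the stratified sampling described above.

    # Minimal sketch: one random 80/20 split with a random forest
    library(randomForest)

    set.seed(42)
    idx   <- sample(nrow(bike), 0.8 * nrow(bike))
    train <- bike[idx, ];  test <- bike[-idx, ]

    rf   <- randomForest(cnt ~ ., data = train, ntree = 300)
    pred <- predict(rf, newdata = test)

    c(MAE = mean(abs(pred - test$cnt)),     # Mean Absolute Error
      MSE = mean((pred - test$cnt)^2))      # Mean Squared Error
    # In the study this was repeated five times and the metrics averaged.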

Abhilash Mittapalli, Framework for Measuring Quality of User Experience on Cerkl and Analyzing Factors of High Impact, July 2016, (Dungang Liu, Tarek Kamil)

Measuring the quality of user experience is of paramount importance to web platforms like Cerkl, since the survival of the platform and its revenue potential depend on user experience levels. The major challenge is building a score for evaluating user experience. Available scoring models are highly specific to their web platforms, so a novel model had to be built from scratch in Cerkl's context. The scope of this project is confined to building a set of metrics (for example open rate, click rate, and email bounce rate) that are aggregated into a score, and to analyzing the key factors driving this score. Formulae were developed for evaluating the user experience score at different levels, along with interpretations of the scores (good, bad, worse, etc.). Based on data from 200 organizations and 15 variables, a decision tree model with a misclassification rate of 0.30 and an AUC of 0.74 on a 20% holdout set beat logistic regression, which had a misclassification rate of 0.38 and an AUC of 0.56. 'Number of topics' stood out as the most important factor driving the user experience score, with a correlation of 0.41 at the 95% confidence level. Audience size is a negative contributing factor, whereas the number of emails delivered on weekdays is a positive factor. These results can help improve processes such as encouraging clients to publish content from a larger number of topics or proactively making weekday email delivery the default option.

Amol Dilip Jadhav, Machine Learning Approaches for Classification and Prediction of Time Series Activity Data Obtained from Smartphone Sensors, July 2016, (Peng Wang, Dungang Liu)

Smartphone devices have become increasingly popular as technology has made them cheaper, more energy efficient, and multifunctional. The data from the sensors embedded within the device can provide accurate and timely information on a user's activities and behavior, a primary area of research in pervasive computing. Numerous applications can be envisioned from such activity analysis, for instance in patient management, rehabilitation, personnel security, and preemptive scenarios. Human activity recognition (HAR) has been an active area of research over the last decade, but key aspects, such as the development of new techniques to improve accuracy under more realistic conditions, still need to be addressed. The ultimate goal of such an endeavor is to understand the way people interact with mobile devices, make recognition inherent, and provide personalized as well as collective information. The goal of this project is to recognize patterns in the raw data obtained from users wearing a smartphone (Samsung Galaxy S II) on the waist, extract useful information by classifying the signals, and predict the measured activities. A simple feature extraction technique was employed to process the raw data, and various machine learning algorithms were then applied for multi-class classification. The results surpass the prediction performance of previously published work: a prediction accuracy of 97.49% was achieved, higher than that reported in the literature.

Mayank Gehani, Customer Churn Prediction in Telecom Industry, July 2016, (Dungang Liu, Peng Wang)

This project is about customer churn in the telecom industry. Acquiring a new customer requires a large investment, so reducing customer attrition has a major impact on the business. The project presents recommendations for reducing churn. The goal is to predict whether a customer will churn in the future on the basis of a dataset describing customer phone usage, and to identify parameters that can help reduce churn. The work includes building and identifying the best predictive model for this goal, along with other findings that support better recommendations. Several classification models were built, and the support vector machine model performed best with an accuracy of 92%. The states of New Jersey, California, and Texas showed the most churn, while Hawaii, Arkansas, and Arizona had the lowest churn rates. On the basis of the model, different per-minute rates for daytime calls were recommended.

Nitin Nigam, Understanding Online Customer Behaviour Using Adobe Analytics, July 2016, (Dan Klco, Dungang Liu)

All major businesses today strive for a strong online presence, since it makes it much easier to connect with consumers, understand their browsing patterns, and draw actionable insights from this data to intelligently expand market penetration. Surprisingly, a large number of firms struggle to do so because of factors such as a lack of understanding of consumer buying needs and interests, a lack of technical expertise to understand customers' digital interactions, and inexperience in correctly identifying key performance indicators. This project aims to address these problems by tracking customers' online browsing patterns and generating reports from the data using Adobe Analytics, Adobe Tag Manager, and JavaScript tools. Metrics such as author name, page tags, page category, and post dates were tracked for 16 different pages on Perficient Inc.'s blog site. It was found that readers prefer certain authors and topics over others, and that readership peaks on certain days. This kind of analysis is extremely useful for companies, which can tailor and time their content based on the analyzed data to generate maximum readership on their sites, thereby generating more revenue.

Pratap Krishnan, Prediction of Used Car Prices, July 2016, (Peng Wang, Michael Magazine)

The dataset consists of several hundred 2005 used GM cars. The aim is to build a model that predicts the prices of used cars based on important factors such as mileage, make, model, engine size, interior style, and cruise control. A multivariate regression model is developed using this set of predictors, with price as the dependent variable. While accuracy is important, good interpretability is also important in order to see which factors affect the price of used cars the most. Other models tried included regression trees, random forests, and LASSO regularization.

Rajiv Nandan Sagi, Prediction of Phishing Websites, July 2016, (Dungang Liu, Peng Wang)

Phishing is a security attack that involves obtaining sensitive and private data by presenting oneself as a trustworthy entity. Phishers exploit users' trust in the appearance of a site by using webpages that are visually similar to an authentic site. Not many articles discuss the methods or features through which these phishing websites can be identified. To address this problem and support internet users in identifying such malpractice, Prof. Mohammad Rami of the University of Huddersfield and Prof. Fadi Thabtah of the Canadian University of Dubai identified a set of important features, collected information from a number of websites, and published the dataset on the UCI Machine Learning Repository. This paper builds prediction models on that publicly available dataset using various machine learning techniques to recognize phishing websites. The models are compared against each other to identify the best-performing one, and the importance of each feature in the dataset for predicting phishing websites is studied. The analysis shows that the model built using the random forest technique gives the best results, with a prediction accuracy of 95.07%, from which we can conclude that the data collection was accurate and the variables designed to identify phishing websites are relevant.
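
A minimal sketch of the random forest step is shown below, assuming the UCI data have been loaded into a data frame phish whose Result column (an assumed name) marks phishing versus legitimate sites; the variable importance plot gives the feature ranking discussed above.

    # Minimal sketch: random forest on the phishing features with variable importance
    library(randomForest)

    phish$Result <- factor(phish$Result)
    set.seed(7)
    idx  <- sample(nrow(phish), 0.8 * nrow(phish))
    rf   <- randomForest(Result ~ ., data = phish[idx, ], importance = TRUE)
    pred <- predict(rf, phish[-idx, ])
    mean(pred == phish$Result[-idx])        # out-of-sample accuracy
    varImpPlot(rf)                          # which features matter most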

Grant Lents, Non Linear Modeling and Clustering for Rate Pricing, July 2016, (Yan Yu, Edward Winkofsky)

The insurance industry tends to be stagnant and resistant to change and new ideas. Statistical analysis is not new to the industry, as actuaries have always been a part of it, but predictive modeling is not commonplace in insurance. In recent years the business world has started adopting analytics, and the insurance industry is finally starting to catch up. In this new environment, old methods that have been in place for decades are being reevaluated and improved. Looking deeper into rate charging, it is clear that basic methods are not sufficient given the statistical methods available to analysts. With predictive modeling and clustering analysis, the industry can move away from old methods that rely on executives' intuition and base rate charging on the numbers. While this process has started, it can always be improved.

Nikita Bali, LOGIC and CloudCoder, July 2016, (Peng Wang, Michael Magazine)

There is a substantial rise in the number of people engaging in learning activities, either through a learning management system or through in-class learning technologies, which is leading to the collection of large amounts of user data. Analytics tools can be used to make online education adaptive and personalized based on a student's past performance trends. Using clustering analysis, the problems were categorized into three sets, using several factors as proxies for difficulty and complexity, corresponding to three difficulty levels: Easy, Medium, and Hard. The goal of this exercise is to implement a recommendation system that automatically assigns problems to users based on their performance trends, so that they can ultimately improve their learning curve.

Ishan Singh, Campaign Analytics for Customer Retention-Brillio, July 2016, (Kunal Agrawal, Michael Fry)

The capstone project involves setting up campaigns and defining audience size, test goals, KPIs, and measurement methodologies for the consumers of a large multinational client of Brillio. The objective is to increase the renewal rate of a product subscription by offering a certain segment of customers promotional offers for renewing. This entails setting up A/B or multivariate tests to experiment, understand consumer behavior, and obtain directional learnings for the business. These tests use a hypothesis about promotional discounts offered to a treatment group, asking customers to renew the service before it expires. The project involves defining the sample size of the audience, creating relevant metrics to measure the outcome of the experiment, and building frameworks to calculate the statistical significance of the results. A second mini-project involves finding the correlation between a prepaid customer's usage pattern and the customer's propensity to respond to a campaign, by mining service usage data and mapping it to the historical response rates of customer profiles. The project also includes creating various visualization scenarios with graphs and reports to generate business insights and laying out the steps in the campaign lifecycle.
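
The sample-sizing and significance steps can be sketched in base R as below. The baseline renewal rate, detectable lift, and renewal counts are illustrative assumptions, not figures from the engagement.

    # Minimal sketch: sample size for an A/B renewal test, then a significance check
    # Assumes a 10% baseline renewal rate and a 2-point lift worth detecting (illustrative).
    power.prop.test(p1 = 0.10, p2 = 0.12, sig.level = 0.05, power = 0.80)  # n per group

    # After the campaign: two-sample proportion test on renewals
    # x = renewals and n = customers in the control and treatment groups (hypothetical counts)
    prop.test(x = c(410, 468), n = c(4000, 4000))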

Anvesh Kollu Reddi Gari, Using Unstructured Data to Predict Attrition, July 2016, (Eric Hickman, Dungang Liu)

Companies that serve customers or businesses at a large scale inevitably have customer service departments where valuable information is stored as free-form text. Unlocking value from these data sources presents a huge opportunity for companies to stay responsive and address concerns proactively. In this report we look at one instance where customer call notes are used as an indication of possible attrition. We build a binary prediction model with comments as the sole predictor and attrition within a defined time period as the response variable. We used a support vector machine (SVM) model with a linear kernel, which has been shown to be well suited to text mining. We show that tuning the penalty and class-weight parameters is important for arriving at the best model, especially in a class-imbalance problem like the present one. The results show that comments alone can be a good predictor of attrition and can be used as a valuable predictor alongside other demographic variables.
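
A minimal sketch of a linear-kernel SVM with class weights, in the spirit of the model described above, is shown below using e1071; dtm_mat (a numeric term matrix built from the call notes) and the factor attrit are assumed names, and in practice the cost and weights would be tuned on held-out data.

    # Minimal sketch: weighted linear SVM for an imbalanced text classification problem
    library(e1071)

    w   <- c(no = 1, yes = sum(attrit == "no") / sum(attrit == "yes"))  # up-weight the rare class
    fit <- svm(x = dtm_mat, y = attrit, kernel = "linear",
               cost = 1, class.weights = w)
    pred <- predict(fit, dtm_mat)
    table(predicted = pred, actual = attrit)
    # cost and the class weights can be tuned, e.g. with tune.svm(), on a holdout set.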

Saketh Jellella, Forecasting Exchange Rates, July 2016, (Yichen Qin, Eric Rademacher)

In the modern world, investing abroad is very common, and for firms trading with other countries, trends in the exchange rate can be very important. If firms can predict exchange rate movements, they can plan ahead. Exchange rate forecasts are based on the value of one currency relative to another over a period of time. There are many theories and models that can be used for prediction, but all of them have limitations. In this project, a time series modelling approach was used to fit a model to the historical data, and the predicted values were plotted to observe the future trend. This helps firms judge whether it is a good time to invest and thus avoid losses. The capstone describes in detail the data extraction, model building, model selection, and model diagnostics. The end result is an AR-ARCH model fitted to the time series data, from which the exchange rate trend for the next month was observed. The trend is decreasing, i.e., the exchange rate is expected to come down over the next month.
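
As a sketch of what an AR-ARCH fit might look like, the example below uses the fGarch package on log returns; the AR(1)+ARCH(1) orders, the choice of fGarch, and the series name rate are assumptions for illustration, not necessarily the specification chosen in the capstone.

    # Minimal sketch: AR(1) mean equation with ARCH(1) errors on exchange-rate returns
    library(fGarch)

    ret <- diff(log(rate))                                   # log returns of the assumed series
    fit <- garchFit(~ arma(1, 0) + garch(1, 0), data = ret,  # garch(1, 0) = ARCH(1)
                    trace = FALSE)
    summary(fit)
    predict(fit, n.ahead = 22)                               # roughly one trading month ahead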

Sagar Umesh, PPNR: PCL (HELOC) Balance Forecast, July 2016, (Yan Yu, Omkar Saha)

As part of the requirements for Huntington Bancshares Inc.'s (HBI's) Annual Capital Plan (ACP) and its participation in the Federal Reserve's (Fed's) Comprehensive Capital Analysis and Review (CCAR), the Personal Credit Line (PCL) balance models were developed to provide forecasts of the balances on HBI's book under different economic scenarios. The PCL portfolio consists of 1st and 2nd lien Home Equity Line of Credit (HELOC) products. The primary macroeconomic drivers for the 1st lien balance are the All-Transactions Home Price Index in the Huntington footprint and the rate spread between the Freddie Mac 30-year rate and the prime rate; the primary drivers for the 2nd lien balance are the prime rate and the All-Transactions Home Price Index in the Huntington footprint.

Sagar Vinaykumar Tupkar, Predicting Credit Card Defaults, July 2016, (Yichen Qin, Peng Wang)

Credit card defaults pose a major problem to all major financial service providers, who must invest heavily in collection strategies whose outcomes are uncertain. Analysts in the financial industry have had great success in developing methods to predict credit card holder defaults based on various factors. This study uses the previous six months of customer data to predict whether a customer will default in the next month, applying various statistical and data mining techniques and building several models. Exploratory data analysis is also important for checking the distributions and patterns followed by customers that eventually lead to default. Of the four models built, logistic regression on principal components and an adaptive boosting classifier performed best, predicting defaults with around 83% accuracy while minimizing the penalty to the company. The study produced a list of important variables that affect the model and should be considered for predicting defaults. Even though the accuracy of the predictions is good, further research and more powerful techniques could enhance the results and bring a revolution to the credit card industry.
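
The PCA-then-logistic idea can be sketched as below; credit and its 0/1 response column default are assumed names, the 90% variance cutoff is illustrative, and a proper evaluation would use a held-out test set as the study does.

    # Minimal sketch: logistic regression on principal components of the customer features
    x   <- credit[, setdiff(names(credit), "default")]              # numeric predictors only
    pca <- prcomp(x, center = TRUE, scale. = TRUE)
    k   <- which(cumsum(pca$sdev^2) / sum(pca$sdev^2) >= 0.90)[1]   # components for ~90% variance

    dat  <- data.frame(default = credit$default, pca$x[, 1:k])
    fit  <- glm(default ~ ., data = dat, family = binomial)
    prob <- predict(fit, type = "response")
    mean((prob > 0.5) == credit$default)                            # in-sample accuracy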

Leon Corriea, An Analysis of European Soccer Finances and Their Impact on On-field Success, July 2016, (Michael Magazine, Edward Winkofsky)

Analytics is revolutionizing every industry it touches. From banking to manufacturing to healthcare, every industry has been made better, more successful with the use of analytics. The sports industry has been the latest to embrace analytics. The use of analytics in sports is often referred to as the “Moneyball revolution”, alluding to the famous book and movie of the same name.  This report takes a closer look at the business of soccer and how analytics can be used to improve financial decisions that impact performance on-field. It identifies all the essential levers that are involved in the financial decision making process at the top European soccer clubs and, through the use of analytics, assigns importance to each one of them. By recognizing the most important factors, soccer clubs can prioritize their efforts in improving those areas that have the maximum impact on on-field success.

Haribabu Inuganti, Predicting Default of Credit Card Customers, July 2016, (Dungang Liu, Edward Winkofsky)

It is important for banks and credit card companies to know whether a customer is going to default, as this helps them assess cash flows and the total risk at hand. For a credit card customer, attributes such as income range, education, marital status, and past payment history affect this outcome. The current project builds a predictive model for the probability of default of credit card customers using these attributes. The data are taken from the UCI Machine Learning Repository and contain records of 30,000 customers with 24 attributes, such as limit balance, sex, education, marital status, age, and past repayment status. Initially, exploratory data analysis is performed to understand the distributions of the variables and to check for outliers and missing values. The dataset is divided into training and testing sets by random sampling. After exploratory data analysis, logistic regression, lasso, support vector machine, and random forest models are built on the training data. AUC on the testing data is used as the criterion to evaluate model performance. The best model is a logistic regression built with a stratified sample, with an AUC of 0.74 on out-of-sample data. This model can be used to predict the probability of default for new customers.
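
The AUC criterion can be computed as in the sketch below, assuming hypothetical train/test data frames with a 0/1 response column named default; the 0.74 figure quoted above comes from the study, not from this code.

    # Minimal sketch: out-of-sample AUC for a logistic model
    library(pROC)

    fit     <- glm(default ~ ., data = train, family = binomial)
    prob    <- predict(fit, newdata = test, type = "response")
    roc_obj <- roc(test$default, prob)
    auc(roc_obj)    # the criterion used to compare the candidate models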

Wenwen Yang, Omni-channel Fulfillment Path Costing, July 2016, (Yichen Qin, Erick Wikum)

An innovative, integrated, and customer-oriented retail business model, Omni-channel retailing is flourishing with the advent of online and digital channels. A key challenge for Omni-channel retailers is to fulfill each customer order in a cost-efficient and timely manner. There can be many possible fulfillment paths for an order, and understanding the cost of each alternative is complex. A 2015 JDA study found that only 16% of retailers surveyed could fulfill Omni-channel demand profitably, so a universal approach to estimating the cost of alternative fulfillment paths would benefit retailers. The ultimate purpose of this project is to construct a practical approach allowing retailers to optimize order fulfillment. The focus was on four types of fulfillment paths and on the net cost allocated by the activity-based costing (ABC) method. My internship at Tata Consultancy Services (TCS) focused on three components: first, define four common fulfillment paths and a method to compute the cost to serve for each; second, group similar shopping items based on their attributes using a clustering algorithm; and third, predict the type of packing carton to be used for an order using a classification model.

Bhrigu Shree, Speed Dating Analysis, July 2016, (Peng Wang, Yichen Qin)

Understanding the behavior of women and men when it comes to choosing partners is something that has baffled mankind since the beginning, and an immense number of books have been written on the subject over the ages. In today's data-driven world, it makes sense to take a data-driven approach to analyzing dating preferences as well. In this capstone project, I analyze the 'Speed Dating Experiment' dataset compiled and released by Columbia Business School professors Ray Fisman and Sheena Iyengar for their paper "Gender Differences in Mate Selection: Evidence from a Speed Dating Experiment". The data were gathered from participants in experimental speed dating events from 2002 to 2004. During the events, attendees had a four-minute "first date" with every other participant of the opposite sex. At the end of the four minutes, participants were asked whether they would like to see their date again and to rate their date on certain attributes. The project consists of two parts: first, exploratory data analysis, in which the data are analyzed along various dimensions to find interesting patterns; and second, building a predictive model for the probability of a successful date based on the characteristics of the participants.

Alok Agarwal, A Forecasting Model to Predict Monthly Spend Percentages and an Approach to Assess Analytic Tools, July 2016, (Michael Magazine, Tim Hopmann)

The objective of this report is two-fold. The primary objective is to forecast the monthly spend percentages of different internal projects managed by the PPS team. The secondary objective is to design a framework that can be used to assess different analytical tools for the team. For the forecasting model, the percentages were calculated based on the amount reserved for each project. The techniques tested were multiple regression, regression trees, and additive models. The final model was selected based on out-of-sample mean squared error and assessed on decision-level prediction in the 2015 portfolio. For the assessment of the analytic tools, I gathered the team's analytical requirements and identified four evaluation phases. This assessment was local to the team and was part of the enterprise-wide data governance strategy.

Karthekeyan Anumanpalli Kuppuraj, An Analysis on Personal Loans Offered by Lending Club, July 2016, (Dungang Liu, Michael Magazine)

Lending Club offers personal loans in the range of $1,000 to $35,000 to applicants from various categories. The interest rate is decided based on the applicant's grade, income level, lien status, and other information provided in the application. After the loan is issued, its repayment is tracked to maintain a loan status. Building models on the available data to determine the interest rate for an applicant and to predict the status of the loan once it is issued will minimize the risk involved in lending to applicants with low repayment capability. A linear regression model and a regression tree are constructed and compared, and the approach with better results is used to predict the interest rate; a logistic regression model and a classification tree are constructed to predict the loan status. The linear model provided better results than the regression tree for predicting the interest rate, with mean squared error used to compare the two models. The logistic regression model had a lower misclassification rate than the classification tree, and after cross-validating with various samples it was concluded that the logistic regression model predicts the loan status more accurately. These two models will enable Lending Club to make more accurate decisions based on the available data and will reduce the risk of lending to an applicant with low repayment capability.
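
The interest-rate comparison can be sketched as below with assumed train/test data frames and a hypothetical response column int_rate; the same pattern with glm(..., family = binomial) and rpart(..., method = "class") gives the loan-status comparison by misclassification rate.

    # Minimal sketch: linear model vs. regression tree compared by test-set MSE
    library(rpart)

    lm_fit   <- lm(int_rate ~ ., data = train)
    tree_fit <- rpart(int_rate ~ ., data = train, method = "anova")

    mse <- function(pred, actual) mean((pred - actual)^2)
    c(linear = mse(predict(lm_fit,   newdata = test), test$int_rate),
      tree   = mse(predict(tree_fit, newdata = test), test$int_rate))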

Udayan Maurya, Telecom Customer Churn Analysis, June 2016, (Peng Wang, Yan Yu)

This study analyzes customer churn data for a telecom service provider. The objective was to develop predictive models that identify the variables responsible for customer churn and predict which customers may churn out of telecom services. Homogeneous clusters of similar states were identified, and a predictive model was developed for each cluster. The in-sample and out-of-sample performance of all the models was analyzed and found to be satisfactory by industry standards.

August Spenlau Powers, Exploring the Use of and Attitudes Towards Drugs and Alcohol in the Kenton County Public School System, June 2016, (Yan Yu, Edward Winkofsky)

The illicit use of substances, both legal and illegal, is a pervasive and potentially very dangerous problem among teenagers. As such, identifying the influential factors that lead to substance use is critical and has been the subject of much research over the past few decades. Using several years of survey data provided by the Kenton County Alliance, the main factors influencing the use of alcohol, cigarettes, marijuana, and inhalants were determined using several techniques: classification trees, chi-squared automatic interaction detection (CHAID), and logistic regression. Unsurprisingly, the two variables that were highly significant throughout every analysis of all four substances were the student's perception of the substance and ease of access to it. Other important factors were the parent (or parents) with whom the student primarily lived, especially for the use of inhalants. While many of these factors are fairly intuitive, these insights should allow the KCA to better develop and tailor their community programs to ensure higher effectiveness.

Nitish Kumar Singh, Predict Online News Popularity, June 2016, (Dungang Liu, Edward Winkofsky)

With the increased use of the internet for information sharing, a large number of news articles are published online and subsequently shared on social media. The popularity of an article can be measured by the number of views or shares it receives. An algorithm that could determine whether a news article will be popular would greatly help online publishers: it could assist editors by filtering out weak articles and help determine the positioning of articles on the website. In this project, five different learning algorithms, including logistic regression, decision trees, and boosted trees, are implemented to classify news articles as popular or non-popular based on different features. The best model is also checked to see whether it can differentiate very popular (viral) articles from other articles.

Justin Blanford, Analyzing Consumer Rating Data on Beer Products to Build a Recommendation System Using Collaborative Filtering, April 2016, (Yan Yu, Michael Seitz)

The Information Age has made accessible a growing amount of data quantifying our world and human behavior.  However, at times, it has been unclear how we can benefit from this data and gather insight.  A recommendation system is one tool that can solve this problem.  Recommendation systems have been used for many years on Amazon, Netflix, Pandora and other platforms to guide customers to more products or content they would enjoy.  Using collaborative filtering and a consumer review dataset I was able to recommend products to a given user based on their preferences and interests.  Specifically, the recommended products are beers from a consumer review aggregator named BeerAdvocate.  Included in the dataset collected from January 1998 to November 2011 are 1.5+ million reviews, 30+ thousand users and 65+ thousand beer products.
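
A minimal sketch of user-based collaborative filtering with the recommenderlab package is shown below; it assumes the reviews have been reduced to a hypothetical data frame ratings with columns user, beer, and rating, and UBCF is used here as one common collaborative filtering variant rather than the project's exact algorithm.

    # Minimal sketch: user-based collaborative filtering recommendations
    library(recommenderlab)

    mat <- as(ratings, "realRatingMatrix")      # users x beers rating matrix
    rec <- Recommender(mat, method = "UBCF")    # user-based collaborative filtering
    top <- predict(rec, mat[1:5], n = 10)       # top-10 recommendations for the first 5 users
    as(top, "list")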

Alex Michael Wolfe, An Evaluation of Tax-Loss Harvesting Trading Strategies, April 2016, (Yan Yu, David Kelton)

Investment managers and financial advisors actively seek investment strategies that generate positive excess returns over time. The majority of academic research indicates that trading strategies do not consistently generate excess returns over time. However, the strategy of tax-loss harvesting (TLH) has become popular and is claimed to generate excess returns; this strategy seeks to take advantage of the U.S. tax code as it pertains to the realization of capital losses and the difference between short-term and long-term marginal tax rates on capital gains.

While TLH has been used by traditional financial advisors for years, the advent of robo-advisors has amplified its importance. Using algorithmic trading, robo-advisors such as Betterment and Wealthfront claim to generate substantial positive excess returns for investors through daily TLH. This paper tests a single version of TLH using backtesting and simulation based on the daily stock price returns of 147 individual stocks from 1985 through 2014. Backtesting models indicate that annual and daily TLH strategies generate positive excess returns, with the daily excess returns being the largest. The annual simulation model shows a negative mean excess return from TLH, while the daily simulation model shows no statistically significant effect of TLH on mean returns, likely due to a limited number of simulation replications, which is in turn due to heavy computational requirements.

Shivang Desai, Heart Disease Prediction, April 2016, (Dungang Liu, Edward Winkofsky)

Healthcare management is of great importance in today's world, and governments, corporations, and individuals have a lot at stake when it comes to public health. This research project uses a data-driven approach to tackle the issue of heart disease in individuals. Analytical tools are used to perform a thorough data exploration that leads to key insights beneficial to the modeling process. A logistic regression model is built to estimate each individual's risk of heart disease, and a classification tree is built to classify individuals into two groups based on whether they have heart disease. The project can help insurance companies decide what an individual's insurance premium should be based on personal characteristics such as age, sex, and cholesterol level. It can also help health organizations screen individuals for heart tests based on the model results.

Yogesh Kauntia, Identifying Bad Car Purchases at Auctions, April 2016, (Dungang Liu, Edward Winkofsky)

Used car dealers often buy cars from auctions, which sometimes do not allow a thorough inspection of the vehicle before purchase. However, the auction provides a list of metrics (model, sub-model, odometer reading, etc.) to help dealers make a decision. The objective of this analysis is to build a model to help dealers decide whether an automobile at auction is a good purchase, that is, whether the dealer can resell it for a profit. Different data mining classification algorithms are tried and compared to identify the best model for this problem.

Naga-Venkata-Bhargav Gannavarapu, Gentrification of Hamilton County, April 2016, (Peng Wang, Olivier Parent)

Gentrification can be defined as the process of renewal and rebuilding accompanying the influx of middle-class or affluent people into deteriorating areas, which often displaces poorer residents. Hamilton County is the third most populous county in Ohio, and the University of Cincinnati is located there. This analysis was performed to identify the census tracts in Hamilton County that might have gentrified between 2000 and 2010. Census tracts are small, relatively permanent statistical subdivisions of a county or equivalent entity that are updated by local participants prior to each decennial census as part of the Census Bureau's Participant Statistical Areas Program.

Satwinderpal Makkad, Functional Data Analysis of Plant Closings, April 2016, (Amitabh Raturi, Peng Wang)

This study examines the performance of firms that announced plant closings over a period of 17 quarters (8 quarters before and 8 quarters after the announcement) using functional data analysis (FDA) methodology. Individual firms' financial dynamics are analyzed using return on assets (ROA), operational dynamics using a sales-by-inventory ratio (SalesByInv), and resource-deployment or capacity-utilization dynamics using a sales-by-property-plant-and-equipment ratio (SalesByPpe), treating them as continuous functional data over these time periods. The aim is to answer questions such as: did firms benefit from closing a plant in terms of ROA, SalesByInv, and SalesByPpe; were any firms negatively impacted by closing a plant; or did plant closing have no impact on firm performance? Hierarchical clustering analysis is used to segment the dataset into classes of firms that show similar performance patterns. An analysis of 3-cluster, 4-cluster, and 5-cluster segmentations indicates that the 3-cluster analysis provides adequate sample sizes as well as logical differences for our tests. Three sets of hypotheses are developed that assess the antecedent and ex post financial metrics of firms that made such announcements, and we relate these hypotheses to the cluster analysis results.

Calvin Kang, Analyzing NBA Rookie Data with Predictive Analytics, April 2016, (Michael Magazine, Peng Wang)

Since the inception of the NBA, professional basketball teams have relied on the yearly draft to bring in new talented players. More often than not, these players played one or more years at the college level before going professional; those who joined the NBA without college were exceptional high school players or players who played professionally overseas. Teams get a limited number of picks each year, so it is a huge benefit to choose an exceptional player or to trade a pick as an asset. It is never possible to completely gauge how a new player will adjust to NBA play. However, past performance in college or other leagues can give some insight into the kind of splash a player can make in the NBA. In this report, I use statistics and data mining techniques to analyze and predict player performance using data on players from the past 10 years.

Apoorv Agarwal, Privacy Preserving Data Mining: Comparing Classification Techniques while Maintaining Data Anonymity, April 2016, (Dungang Liu, Peng Wang)

The objective of this study is to compare different methods for binary prediction and report the accuracy of each method. Using real-life data from the banking industry, classification techniques are run and the results from each model are compared. The analysis is done while maintaining data privacy: the variables and observations are de-identified, so no information about the variables and their values is known, and the decisions have to be purely statistics based. Such privacy-preserving data mining techniques have grown in significance recently. Finally, a conclusion is drawn on the logic that can be used for selecting the final models, based on the cost of Type I and Type II errors and on the application of such methods to real-life problems.

Abhas Bhargava, Wine Analysis, April 2016, (Dungang Liu, Edward Winkofsky)

The wine industry is on the rise, which can be attributed to both social and casual drinking. The big players in the industry are getting bigger and the small wineries are getting swallowed up. As of April 2016, the global wine trade turned in a healthy performance as wine lovers increased consumption. Competition is intense, and different players in the industry implement different strategies to keep pace. Some believe in spending on acres of desirable vineyard land, while others believe in spending on nothing more than brand names. However, what matters most is how wine tasters perceive a particular wine. Wine tasting is the sensory examination and evaluation of wine.

Another notable feature is that all wines possess a set of physicochemical properties that may or may not affect their quality. The idea of this paper is to analyze wine ratings with respect to these physicochemical properties; the question everybody wants answered is whether wine ratings are related to them. Using supervised learning techniques, we identify a set of physicochemical properties that lead to high ratings for a particular wine. As secondary research, we also try to differentiate between red and white wines based on their physicochemical properties, that is, whether we can distinguish the two types of wine without looking at their appearance.

Rahul Chaudhary, EPL Transfers Analysis, April 2016, (Michael Magazine, Peng Wang)

The English Premier League, soccer's richest league, is used to multi-million-pound deals to bring in the best soccer talent from around the world. The focus of this study is to predict the transfer values of players coming into the English Premier League and to identify the most important components that go into a player's transfer amount. The study also describes the data collection process and why particular variables were selected for the model. The analysis was based on linear regression and regression trees, since interpretability of the model is of the utmost importance; hence more complex models were not explored. It was vital to test thoroughly for normality, homoscedasticity, and independence in the linear model to ensure that predictive power was not affected.

Jeremy Santiago, Predicting Success of Future Rookie NFL Running Backs, April 2016, (Yan Yu, Martin Levy)

Sports as a business is big money. Each year, athletes' contracts reach new highs, for superstars and rookies alike. However, the money put into these rookie contracts does not always yield returns on the playing field, owing to the unmeasurable uncertainty of how rookies will handle the transition to the highest competitive level. In this capstone, I focus specifically on analyzing and predicting the performance of NFL rookie running backs. For decades, the analyses done to predict rookies' abilities in the NFL have been very subjective: evaluations rely on game-tape analysis and scouts' intuitions about players' abilities, and often compare athletes to those they are competing against. This limits analysis, since the majority of the data gathered is subjective or circumstantial. This capstone uses completely quantitative data to predict the performance level of future rookie running backs in the NFL.

Fadiran Oluwafemi, An Investigation on Factors affecting Internet Banking in Nigeria, April 2016, (Yan Yu, Edward Winkofsky)

This study investigates the factors that influence and discourage the adoption of electronic banking in Nigeria. Various studies show that electronic banking systems bring about cost reduction, time efficiency, ease of access, and improved customer relationship, to both the financial systems and customers. Conversely, research shows that a huge percentage of the Nigerian population still adopts the traditional methods of banking, thus internet banking facilities are largely underutilized.

This research uses the Theory of Reasoned Action (TRA), the Technology Acceptance Model (TAM), and the Theory of Planned Behavior to categorize the factors that influence and discourage the adoption of internet banking: demographic factors, perceived risk factors, and limitations. Results from this study suggest that well-educated young adults with average income levels are the significant group of people in Nigeria who currently utilize internet banking systems. Findings also show that perceived financial, performance, and social risks discourage the use of internet banking, owing to a history of high crime rates and corruption that negatively affects how consumers perceive internet banking and its usefulness. Lastly, limitations in the form of poor infrastructure, such as inefficient power supply and poor internet services, were also found to discourage the use of internet banking.

2015

Joseph Charles Frost, Optimizing Course Schedules Using Integer Programming: Minimizing Conflicts in the Assignment of Required Courses Across Majors, November 2015, (Michael Magazine, David Rogers)

Universities commonly strive to develop schedules containing enough course options for students of any major to advance in their programs each semester.  Organizing these courses into available periods can be as important to the scheduling process as deciding which courses to offer, and often proves itself the more challenging task.  Offering two courses that are both required for the same major during overlapping periods can often create conflicts that constrain students’ schedules, forcing them to delay enrolling in certain requirements until they are offered in future semesters or even years.  Some universities spend hours generating schedules with nothing more than intuition.  Using integer programming, I was able to reorganize the schedule of one of these universities, Cincinnati Christian University, significantly decreasing the number of required courses for each major offered in overlapping periods.
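
A toy version of the underlying integer program is sketched below with the lpSolve package: three courses, two periods, and one pair of courses required by the same major. The binary variable x_cp places course c in period p and y_p flags a conflict in period p; this illustrates the structure only and is far smaller than the actual university schedule.

    # Minimal sketch: conflict-minimizing course-to-period assignment as a binary program
    library(lpSolve)

    # variables (in order): x11 x12 x21 x22 x31 x32 y1 y2
    obj <- c(0, 0, 0, 0, 0, 0, 1, 1)              # minimize total conflicts
    con <- rbind(
      c(1, 1, 0, 0, 0, 0,  0,  0),                # course 1 gets exactly one period
      c(0, 0, 1, 1, 0, 0,  0,  0),                # course 2 gets exactly one period
      c(0, 0, 0, 0, 1, 1,  0,  0),                # course 3 gets exactly one period
      c(1, 0, 1, 0, 0, 0, -1,  0),                # courses 1 & 2 both in period 1 forces y1 = 1
      c(0, 1, 0, 1, 0, 0,  0, -1))                # courses 1 & 2 both in period 2 forces y2 = 1
    dir <- c("=", "=", "=", "<=", "<=")
    rhs <- c(1, 1, 1, 1, 1)

    sol <- lp("min", obj, con, dir, rhs, all.bin = TRUE)
    sol$objval      # zero conflicts are achievable in this toy instance
    sol$solution    # one optimal assignment of courses to periods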

Debjit Nayak, Operations Improvement using Data Analytics in the Cincinnati Police Department, November 2015, (Amitabh Raturi, Michael Magazine)

I took the data visualization class at the University of Cincinnati during my Master's in Business Analytics, where I was introduced to visualization using Tableau and how it can provide quick insight into data. The internship at the City of Cincinnati gave me an opportunity to use Excel, Tableau, ArcMap, and Visio for process improvement and data visualization. These tools became part of ongoing processes used by the City; in fact, the City thereafter specifically used dashboards that I created to discuss issues in their "stat meeting". In sections 2-4, I provide several examples of visualizations I enabled, including police overtime, homicides, fatal and nonfatal shootings, and police response sheets. I conclude with observations about what I learned and some implementation hurdles.

Yunzheng He, ChoreMonster Data Mining with R, December 2015, (Michael Fry, Uday Rao)

Studies have shown that giving children household chores at an early age helps to build a lasting sense of mastery, responsibility, and self-reliance. ChoreMonster provides an easy solution that replaces chore charts and makes chores more enjoyable for children and parents. In-app user information and reward data were collected to improve app functionality and provide a better user experience. In this project, we use R to perform data mining on the ChoreMonster dataset. The data were converted, imported into the R workspace, and cleaned, and summary statistics were extracted for the numeric variables. Linear and logistic regression models were fitted to reveal relationships between reward type and the gender and sex of a child, and text mining was applied to a character variable in order to investigate the forms of reward given to kids.

Thanish Alex Varkey, SPLUNK Dashboard Proof of Concept - Charles Schwab, December 2015, (Peng Wang, Edward Winkofsky)

Compliance is very expensive for companies, but the cost of noncompliance is shutting down shop. This mentality in corporate America, which has seen the likes of Enron shut down, makes compliance a very important matter.

Schwab Compliance Technology (CST) had recently implemented SPLUNK for network monitoring and log analysis. The project was to study SPLUNK's search processing language and implement dashboards that help CST not only troubleshoot problems but also get to know their customers better. Patterns of usage, duration of usage, and capacity planning needs were not known. If the system were implemented, some of the benefits would be that Schwab would be

  • Able to identify their top customers
  • Able to detect customers who sent incorrect files
  • Able to understand what devices the customers use for logging in
  • Able to monitor capacity management for the server
  • Able to monitor performance of the application by studying response time for each screen

The desired system will help not only the monitoring team but also the development team, by showing the performance of the application, and the infrastructure management team, by showing the amount of server usage for capacity planning. The end deliverable was a dashboard of the use cases developed by CST. This report contains the queries used as the building blocks of the dashboard and the XML version of the dashboard itself. The project also included an understanding of CST's market segments and uses the data to predict server capacity and customer usage.

David Rodriguez, Cincinnati Bell, August 2015, (Yan Yu, Michael Magazine)

Cincinnati Bell, like many telecommunication companies, faces the problem of customer churn. There are two types of churn, voluntary and involuntary. Voluntary churn occurs when a customer chooses to opt out of their Cincinnati Bell service; involuntary churn occurs when a customer fails to pay their bill for four consecutive months and Cincinnati Bell terminates the service. This is an issue from a business perspective because each new customer comes with fixed costs; for Cincinnati Bell to see a positive ROI, customers must continue their service for a minimum of several months. Cincinnati Bell's business plan is to build out to areas that statistically present a low risk of involuntary churn. This paper explores different types of models to help predict involuntary churn.

Vijendra Krishnamurthy, Breast Cancer Diagnosis by Machine Learning Algorithms, August 2015, (Dungang Liu, Mike Magazine)

This dataset (Wisconsin Breast Cancer Data) was obtained from the University of Wisconsin Hospitals, Madison, from Dr. William H. Wolberg. This project uses different machine learning algorithms to diagnose cancer as benign or malignant. The study also compares the results from the various algorithms and combines their predictions using ensemble methods to try to obtain better predictive performance than is possible with individual methods. The classification algorithms used in this project include decision trees, support vector machines, logistic regression, the naive Bayes classifier, and k-nearest neighbors. Each algorithm is trained on 80% of the dataset and then tested on the remaining 20%. Finally, ensemble methods are used to combine the results from the various algorithms.
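
A minimal sketch of the majority-vote ensemble idea is shown below with three of the listed classifiers; train/test data frames with a factor response Class are assumed names, and the actual study combines more models than are shown here.

    # Minimal sketch: majority-vote ensemble of three classifiers
    library(rpart)
    library(e1071)

    tree_p <- predict(rpart(Class ~ ., data = train, method = "class"),
                      test, type = "class")
    svm_p  <- predict(svm(Class ~ ., data = train, kernel = "radial"), test)
    nb_p   <- predict(naiveBayes(Class ~ ., data = train), test)

    votes    <- data.frame(tree_p, svm_p, nb_p)
    ensemble <- factor(apply(votes, 1, function(r) names(which.max(table(r)))),
                       levels = levels(train$Class))
    mean(ensemble == test$Class)     # ensemble accuracy, to compare with individual models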

Shashi Kanth Kandula, Internship with Ethicon (Johnson & Johnson), August 2015, (Andrew Harrison, Yichen Qin)

Ethicon Endo-Surgery, Inc., is a subsidiary of Johnson & Johnson that focuses primarily on surgical instrument manufacturing. My association with Ethicon as an intern was quite eventful, with direct involvement in projects affecting the sales strategy. The internship gave me an opportunity to get an inside view of the health care industry and to learn new tools such as IBM Cognos and concepts of data warehousing. During my tenure at Ethicon I worked on multiple projects using tools such as SQL, R, and advanced Excel for data mining and data analysis. I was also involved in a sales forecasting project using time series analysis, in which we built a model to forecast sales for the coming 12 months for a given product line using historical data mined from the data warehouse.

Kala Krishna Kama, Network Intrusion Detection System Using Supervised Learning Algorithms, August 2015, (Dungang Liu, Yichen Qin)

Intrusion detection is a term we come across fairly regularly these days. With the expansion of the World Wide Web and the massive growth in computer networks, network security is becoming a key issue to tackle. This has resulted in an enormous research effort towards building intrusion detection systems capable of monitoring network or system activity for malicious behavior.

The aim of this capstone project is to build a classifier capable of distinguishing between "bad" connections, called intrusions or attacks, and "good" normal connections. A comprehensive set of classification algorithms is evaluated on the KDD dataset. Since there are four different attack types, different algorithms are likely to exhibit different performance for a given attack category. The aim is to verify the effectiveness of the different classification algorithms and build a final model that does the best job of predicting intrusions. The supervised learning algorithms multinomial regression, decision trees, naive Bayes classifier, k-NN classifier, and random forests are used to build the classifier models. Measures such as classification percentage, misclassification rate, and misclassification cost are used to evaluate the models.

Xiaoyu Zhu, The Comprehensive Capital Analysis and Review, August 2015, (Peng Wang, Yichen Qin)

To promote a safe and stable banking and financial system, the Federal Reserve needs to regulate and supervise financial institutions, including bank holding companies, savings and loan holding companies, state member banks, and systemically important nonbank financial institutions.

One of the supervisory programs is the Comprehensive Capital Analysis and Review (CCAR), which evaluates a bank holding company's capital and its planned capital distributions to keep risk within an acceptable range. For bank holding companies with assets of $50 billion or more, this program ensures they have effective capital planning processes and sufficient capital to absorb losses during stressful conditions, such as the great recession in 2008. If the Federal Reserve objects to a bank holding company's capital plan, the company may not be able to make any capital distributions. In general, the modeling team's job is to build various models that forecast the bank's risk under different economic scenarios.

Charu Kumar, Forecasting Crude Oil Prices, August 2015, (Dungang Liu, Brian Kluger)

Oil prices depend on many factors, leading to a volatile market, which makes forecasting necessary. Monthly data from 1986 to 2015 are collected and divided into training and testing sets in order to validate the model. ARIMA modeling is then used to derive the model: the data are first transformed to account for non-stationarity in variance, and a step-wise iterative procedure is outlined for model fitting and diagnostics. The forecast for 2015 is compared with the testing data, and the mean squared error (MSE) is reported. The same approach is applied to 2014, 2013, and 2012 and compared to the testing data for each of those years.

Vasudev Daruvuri, Commercial General Liability Forms: Text Mining for Key Words and Scalable ML & Movie Recommendations over Apache Spark, August 2015, (David J. Curry, Jaime Windeler)

Commercial General Liability Forms (Text Mining for Key Words):

Project 1: The objective of this project is to identify and extract the important keywords for each document from the large pool of commercial insurance documents (1,500+ so far) and to associate the keywords with the existing search engine for those documents, enabling faster querying and more accurate document search.

Project 2: This is a Proof of Concept (POC) project to perform movie rating analysis and predict personalized user ratings for movies. The objective of this project is to implement collaborative filtering on a large data set of 500,000 ratings in a distributed environment on Apache Spark.
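
A minimal sketch of how collaborative filtering might be set up with Spark's ALS implementation in Python (the ratings.csv path and the userId/movieId/rating column names are hypothetical placeholders, not the project's actual data):

from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

spark = SparkSession.builder.appName("movie-recs").getOrCreate()
# Hypothetical ratings file; the actual project used roughly 500,000 ratings
ratings = spark.read.csv("ratings.csv", header=True, inferSchema=True)
train, test = ratings.randomSplit([0.8, 0.2], seed=42)

als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
          rank=10, maxIter=10, regParam=0.1, coldStartStrategy="drop")
model = als.fit(train)

predictions = model.transform(test)
rmse = RegressionEvaluator(metricName="rmse", labelCol="rating",
                           predictionCol="prediction").evaluate(predictions)
print("test RMSE:", rmse)
user_recs = model.recommendForAllUsers(10)   # top-10 personalized recommendations per user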

Nikhil Shaganti, RiQ Precision Targeting- Coupons.com, August 2015, (Yan Yu, Peng Wang)

Retailers and brands alike are facing many challenges today due to changes in the marketplace, e.g., shopper expectations, channel proliferation, and trip erosion. Coupons.com recognizes the power of context and has built the Retailer iQ (RiQ) platform, which engages consumers with a client's content in ways that are most relevant to them, when they are most apt to receive it. The RiQ platform benefits all parties involved. It offers omnichannel engagement, meaning shoppers can engage however and whenever they want, whether through web, mobile, or social. It utilizes shopping behavior data to personalize and target coupons and media to shoppers, delivering an experience that is most relevant to them. This report gives a brief overview of two retailer-specific targeting campaigns designed for the biggest yogurt brand and the biggest personal care products brand in their verticals. The main objective of these campaigns is to drive trial, capture incremental new buyers, drive brand loyalty and migration up to premium brands, and regain lost or lapsed customers. With the use of point-of-sale (POS) data shared by the retailer, it is easy to analyze the shopping behavior of a consumer, which in turn supports behavioral targeting. The targeting campaigns were a huge win for both the CPGs and the retailer, driving trial with 69% incremental buyers and 220% larger basket rings, respectively.

 Ama Singh Pawar, A Study of Approaches To Enhance The Comprehensibility of Both Opaque & Transparent Data Mining Techniques, August 2015, (James Evans, Jaime Windeler)

Recent advances in computing technology in terms of speed and cost, as well as access to tremendous amounts of computing power and the ability to process huge amounts of data in reasonable time, have spurred increased interest in data mining applications. In this Capstone project, the Adult census data set is selected from the UCI Machine Learning Repository, with the objective of predicting whether a selected individual's income exceeds $50K/yr. Across the several data mining techniques applied to this classification task, the high-accuracy predictive models tend to be opaque ("black box" techniques), i.e., techniques that can generate valid, highly accurate predictions but are not capable of identifying the specific nature of the interrelations between the variables on which the predictions are based. Moreover, even transparent models become difficult to interpret when they have too many variables. In many real-world business scenarios this is unacceptable, because models need to be comprehensible; to obtain comprehensibility, accuracy is often sacrificed by using simpler but transparent models. This project demonstrates the trade-off between accuracy and comprehensibility. I discuss rule extraction, genetic ensembles, and some visualization techniques to extract accurate but comprehensible rules from opaque predictive models, and I cover concepts such as feature selection, dimensionality reduction, the lasso, and ridge regression to make transparent models with many parameters easier to comprehend.
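
One small illustration of that trade-off, sketched in Python with scikit-learn on synthetic data standing in for the Adult records (the penalty strength C is an arbitrary choice): an opaque random forest is compared against an L1-penalized ("lasso") logistic regression, whose zeroed-out coefficients leave a sparser, more comprehensible model.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the Adult census data (binary target: income > $50K or not)
X, y = make_classification(n_samples=4000, n_features=30, n_informative=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

opaque = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
# L1 penalty drives many coefficients to exactly zero, leaving fewer terms to interpret
transparent = LogisticRegression(penalty="l1", solver="liblinear", C=0.05).fit(X_tr, y_tr)

print("random forest accuracy:", accuracy_score(y_te, opaque.predict(X_te)))
print("lasso-logistic accuracy:", accuracy_score(y_te, transparent.predict(X_te)))
print("nonzero coefficients kept:", int(np.sum(transparent.coef_ != 0)), "of", X.shape[1])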

Xinqi Lin, A Comparison of Regression Tree Techniques for Customer Retention Rate Prediction, August 2015, (David Rogers, Peng Wang)

American Modern Insurance Group is a widely recognized, nationally leading small niche company in the specialty insurance business. As for any other business, retaining customers is very important to American Modern Insurance Group, and they assigned a project to find the driving forces behind their customer retention rate. In a previous group case study, we developed the model we believed contained the critical predictors of the policy renewal variable and also generated a pricing elasticity curve. This project is an extension of that work. The objective is to determine the algorithm best suited to predicting customer retention from a list of predictors believed to influence it. I explore different tree-based modelling techniques, such as classification trees, random forests, bagging, and boosting, and attempt to find the best-fitting model for predicting the response variable. This paper can serve as a useful comparison of these modelling techniques in both R and SAS Enterprise Miner.

Hannah Cunningham, Incorporating Motor Vehicle Records into Trucking Pricing Model for Great American Insurance Group, August 2015, (David Rogers, Edward Winkofsky)

The objective of this project was to incorporate driving record data into the Great American Insurance Group's current trucking physical damage pricing model. For the case study, we intended to use linear regression to determine which categories of driving violation, as specified by Great American's underwriters, had the most impact on the loss ratio, and to use those findings to create a scoring system for drivers or policies. Linear regression did not result in statistically significant categories, so we could not use those results to calculate scores. The client then provided weights and scoring methods for us to test, and we found that an unweighted arithmetic average was the scoring method with the most statistically significant relationship to the loss ratio. For my extension of the case study, I used a generalized linear model with a log link function to determine the relationship between violation categories and the loss ratio. This technique showed that Minor and Major Accident violations are significantly related to the loss ratio. Using the coefficient estimates from the generalized linear model as weights, I calculated new scores and compared them to the previously calculated scores. Again, the unweighted arithmetic average was the most statistically significant.
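
A minimal sketch of fitting such a log-link GLM in Python with statsmodels, on hypothetical violation-count and loss-ratio data (the category names, the Gaussian family, and the simulated relationships are assumptions for illustration only):

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical policy-level data: violation-category counts and the observed loss ratio
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "minor": rng.poisson(1.0, 500),
    "major_accident": rng.poisson(0.3, 500),
    "speeding": rng.poisson(0.8, 500),
})
df["loss_ratio"] = np.exp(0.1 * df["minor"] + 0.4 * df["major_accident"]) * rng.gamma(2.0, 0.3, 500)

X = sm.add_constant(df[["minor", "major_accident", "speeding"]])
# Gaussian family with a log link, so coefficients act multiplicatively on the expected loss ratio
model = sm.GLM(df["loss_ratio"], X, family=sm.families.Gaussian(link=sm.families.links.Log()))
result = model.fit()
print(result.summary())
weights = np.exp(result.params.drop("const"))   # candidate category weights for a driver score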

Yang Yang, Determining the Primary Factor in Kids’ Chore Rewarding Strategy, August 2015, (Michael Fry, Edward Winkofsky)

ChoreMonster is an app developed to help parents create and track chores of their kids who are to be rewarded in many ways for their finished chores. The reward data is constructed along with child’s age, gender, parent’s gender, and other geographical and time records. The objective of this project is to test the hypothesis that age is the primary factor in predicting suggested rewards by analyzing a fragment of data provided by the developer in order to explore and build up our understanding of the mechanism of the chore rewarding system. The insights will also allow the developer to create new areas of segmentation for family-suggested rewards and enrich the user experience. We utilize grouping schemes and logistic regression to analyze the data, and also deploy Tableau for data processing and data visualization.

Ambrose Wong, Sensitivity of Client Retention Modeling with Respect to Changes in the Costs of Misclassification, August 2015, (David Rogers, Peng Wang)

This paper is an extension of the project produced for the American Modern Insurance Group as part of the case studies course in BANA 7095. The previous project focused on exploratory data analysis and modeling, while this paper focuses on adding a cost function to two different models: logistic regression and classification trees. The cost function mitigates the imbalanced nature of the dataset, which stems from approximately 90% of the customers staying with the firm and the remaining 10% leaving. The cost function was used to adjust the cost of false positives and to choose the optimal cut-off probability for the models. Naturally, this led to changes in the confusion matrix and the misclassification rate for both models.
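
A minimal sketch of how a cost function can drive the choice of cut-off probability, using synthetic imbalanced data and an illustrative 5:1 cost ratio (neither reflects the actual American Modern data or costs):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data: roughly 90% retained (0) vs 10% leaving (1)
X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=2)
p = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

cost_fn, cost_fp = 5.0, 1.0     # missing a leaver is assumed 5x as costly as a false alarm
cutoffs = np.linspace(0.01, 0.99, 99)
costs = [cost_fp * np.sum((p >= c) & (y_te == 0)) + cost_fn * np.sum((p < c) & (y_te == 1))
         for c in cutoffs]
best = cutoffs[int(np.argmin(costs))]
print("optimal cut-off:", round(best, 2))
print("misclassification rate at that cut-off:", round(np.mean((p >= best) != y_te), 3))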

Luv Sharma, Forecasting Commercial Mortgages under Various Macroeconomic Scenarios, August 2015, (David Rogers, Michael Magazine)

The objective of the case study is to forecast aggregate commercial mortgages of four kinds (non-farm & non-residential, multifamily residential real estate, construction & development, and farmland) for all FDIC-insured institutions for the next 12 quarters. Each mortgage type is forecasted under three macroeconomic scenarios: Baseline, Adverse, and Severely Adverse. These scenarios are built on broad macroeconomic indicators that are included in the model to reflect the effects of the macro economy on mortgages. The case study also includes an analysis of risk with respect to mortgage aggregates in recessionary environments and a deeper look at how these variables behave and relate to each other. The forecasting is carried out using dynamic regression models that combine the inherent time series nature of the variables with a regression component capturing the relationship between the mortgages and the macroeconomic indicators. The statistical software R was used to analyze the time series, build the models, and forecast. The historical and forecasted datasets were provided by US Bank and other publicly available sources.
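
A minimal sketch of a dynamic regression of this kind in Python, using statsmodels' SARIMAX with exogenous regressors (the quarterly series, the two macro indicators, and the scenario path below are all hypothetical stand-ins; the actual study was done in R):

import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(3)
idx = pd.period_range("2000Q1", periods=60, freq="Q").to_timestamp()
# Hypothetical macro indicators and one mortgage aggregate
macro = pd.DataFrame({"gdp_growth": rng.normal(2, 1, 60),
                      "unemployment": rng.normal(6, 1, 60)}, index=idx)
vals = 100 + np.cumsum(0.5 * macro["gdp_growth"].values
                       - 0.3 * macro["unemployment"].values + rng.normal(0, 1, 60))
mortgages = pd.Series(vals, index=idx)

# Dynamic regression: ARIMA errors plus a regression on the macro indicators
fit = SARIMAX(mortgages, exog=macro, order=(1, 1, 0)).fit(disp=False)

# A 12-quarter scenario path for the indicators (Baseline/Adverse/Severely Adverse would differ here)
future_idx = pd.period_range("2015Q1", periods=12, freq="Q").to_timestamp()
scenario = pd.DataFrame({"gdp_growth": np.full(12, 1.0),
                         "unemployment": np.full(12, 8.0)}, index=future_idx)
print(fit.forecast(steps=12, exog=scenario))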

Kathleen Teresa Towers, Endangered or Threatened?  Manatee Mortalities 2000-2014, August 2015, (Roger Chiang, Yan Yu)

In 2007, the U.S. Fish and Wildlife Service recommended downlisting the Florida manatee (Trichechus manatus latirostris) from endangered to threatened status. Since then, record levels of mortality due to severe winter weather and recurrent toxic red tide algal blooms indicate that the proposal may have been premature. A shortage of resources has delayed execution of the U.S. Fish and Wildlife Service's most recent scheduled review of the species' status, and therefore the available datasets detailing the latest mortality trends have not been sufficiently analyzed. In order to provide an accessible, intuitive interface for non-experts concerned with the manatee population's status, this project compiled recent mortality data from two different datasets, modeled updated estimates for causes of manatee mortality, and created a series of interactive visualizations. It is essential that policymakers consider recent years' trends in causes of manatee mortality before deciding whether it is time to reclassify the Florida manatee's conservation status.

Jhinak Sen, Analytical Approach Used for Business Strategy by Retail Industry, August 2015, (Peng Wang, Michael Magazine)

With ever-increasing point-of-sale information at retail stores, it is important to utilize these data to understand customer behavior and the factors that drive customers to purchase items. To understand customer behavior, a coupon-redemption study was carried out by the company. The aim was to identify and categorize brands according to coupon redemption relative to total sales volume. Coupon redemption indicates consumer response to an advertisement, which in turn indicates the effectiveness of the advertisement and the opportunity for a brand to make innovative products. The study comprised designing the experiment, gathering datasets, analyzing the data, and developing new business insights. The project considered a particular retail chain with stores throughout the country. Sales and discount information was collected for selected products, and performance scores were calculated based on relative sales and discount ratios. Using these performance scores, products were categorized as good, neutral, or bad performers. The outcome of the study helped companies understand their brand placement and optimize their targeting strategy for the retail store.

Ankit Jha, Analyzing a Churn Propensity for an Insurance Data Set Using CHAID Models, August 2015, (Amitabh Raturi, Mike Magazine)

This paper gives a glimpse of the insurance industry solutions offered by IBM as part of its Predictive Customer Intelligence (PCI) stack. One of the product features provides insights into churn propensity, which forms the key focus of this paper. However, as this product is the first of its kind in the market, obtaining actual customer data was not possible, so our team leveraged a sample dataset created in consultation with industry experts. Academic instincts pointed us toward models such as logistic regression and classification trees. I followed the IBM SPSS Insurance Customer Retention and Growth Blueprint as the definitive guideline for this modeling exercise. Finally, a CHAID model was used for the classification tree, with misclassification rate as the model performance metric.

Sruthi Susan Thomas, Analyzing Yelp Reviews using Text Mining and Sentiment Analysis, August 2015, (Zhe Shan, Peng Wang)

The objective of this project is to analyze Yelp restaurant reviews with text mining and sentiment analysis in Python, in order to understand how many reviews expressed positive or negative sentiments about the food and the service, and what they had to say about them. This will help a restaurant owner understand the areas in which they are doing well and the areas they need to improve with regard to their food and service.
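
A minimal sketch of scoring review sentiment in Python, here using NLTK's VADER scorer as one possible approach (the abstract does not name a specific library, and the review snippets below are made up):

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)   # lexicon used by the VADER sentiment scorer
sia = SentimentIntensityAnalyzer()

# Hypothetical review snippets; in the project these would come from the Yelp dataset
reviews = ["The food was amazing but the service was painfully slow.",
           "Terrible experience, our order was wrong and the staff was rude.",
           "Great atmosphere, friendly servers, and the pasta was delicious."]
for text in reviews:
    score = sia.polarity_scores(text)["compound"]     # ranges from -1 (negative) to +1 (positive)
    label = "positive" if score >= 0.05 else "negative" if score <= -0.05 else "neutral"
    print(label, round(score, 2), "--", text)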

Ravi Shankar Sumanth Sarma Peri, A Study on Prediction of Client Subscription to a Term Deposit, August 2015, (Yan Yu, Peng Wang)

As the number of marketing campaigns increases, it is getting harder to target the right segment of the general public. Firms use data mining techniques to gain competitive advantage: through these techniques, they can identify valuable customers and thereby increase the efficiency of their marketing campaigns. In this project, bank marketing data are used that provide information about each client, such as age, marital status, and education level. The classification goal is to predict whether the client will subscribe to a term deposit. Different models are built and then compared to assess their accuracy.

Rajesh Doma, Creating Business Intelligence Dashboards for Education Services Organization, August 2015, (Jeffrey Shaffer, Michael Magazine)

The objective of this report is to give details of the work performed as part of my internship at Education Services Group, Cincinnati, Ohio. Education Services Group (ESG) is a small company (fewer than 50 employees) located in the eastern part of Cincinnati. It partners with technology companies to create best-in-class education and training businesses for those companies. It also provides a full range of professional services, support, and tools specific to an education business, helping clients increase education revenue and lower operational costs. Educational products such as on-demand trainings, in-class trainings, labs, and materials are all hosted, managed, and delivered using Learning Management Systems (LMS). An LMS can be in-house or driven through third-party software products, and it contains training data on the individuals or users who have purchased the education products. Sales and marketing of these products are managed using Customer Relationship Management (CRM) products such as Salesforce.com and Marketo, respectively.

As a Business Intelligence Analyst, I have worked on Data Extraction, Analysis and Visualization of trainings’ and sales data mined from these systems. The primary focus of the work is to develop descriptive analytics - Reports and Dashboards that will give actionable insights for the senior management.

Sean Ashton, Cincinnati Bell Internship Capstone: Survival Analysis on Customer Tenure, August 2015, (David Kelton, Edward Winkofsky)

In the telecommunications industry, revenue is almost entirely generated through customer subscriptions: customers pay a monthly bill and the company provides them with services. The more monthly bills a customer pays, the more valuable that customer is to the company, so the length of a customer's tenure is an essential factor in determining the customer's value. This research paper focuses on the survival analysis performed to understand how long certain types of customers will stay with the company. Customers were classified into different groups based on their demographic information, and significant differences in tenure length were found across demographic groups.
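
A minimal sketch of comparing tenure between two demographic groups with a Kaplan-Meier estimator and a log-rank test, using Python's lifelines package (the tenure durations and censoring indicators are simulated; the abstract does not state the exact estimator used):

import numpy as np
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

rng = np.random.default_rng(4)
# Hypothetical tenure (months) for two demographic groups; 1 = churned, 0 = still subscribed (censored)
t_a, e_a = rng.exponential(36, 300), rng.integers(0, 2, 300)
t_b, e_b = rng.exponential(24, 300), rng.integers(0, 2, 300)

kmf = KaplanMeierFitter()
kmf.fit(t_a, event_observed=e_a, label="group A")
print("median tenure, group A:", kmf.median_survival_time_)
kmf.fit(t_b, event_observed=e_b, label="group B")
print("median tenure, group B:", kmf.median_survival_time_)

# Log-rank test for whether the two tenure (survival) curves differ significantly
result = logrank_test(t_a, t_b, event_observed_A=e_a, event_observed_B=e_b)
print("log-rank p-value:", result.p_value)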

Navin Kumar, An Investigation of the Factors Affecting Passenger Air Fare Using Historical Data, August 2015, (Jeffrey Camm, Martin Levy)

Air fare and the cost of flying have always been as much a matter of discussion as of speculation. While there are dedicated advisory services that advise on the best time to buy tickets and on whether the current price is high or low based on historical data, not much has been said about the various factors affecting the cost of flying from one destination to another. In this project the author tries to understand the components influencing the "average" price a customer pays for flying. Analyzing factors such as distance, passenger traffic, competition among airlines, and the market dominance of a particular airline, an attempt is made to arrive at statistically significant results using linear regression.

George Albert Hoersting, Predicting Impact on Trucking Physical Damage Policies Using Historical Driver Data for Personal Vehicles, August 2015, (David Rogers, Edward Winkofsky)

The goal of this capstone is to determine the usefulness of driver motor vehicle records in predicting a loss ratio, defined as the total dollar amount of losses due to claims divided by the total dollar amount of monthly premiums on a trucking insurance policy. Three data sets were provided by the Great American Insurance Group. In the data set with driver motor vehicle records, each observation has several potential violation types aggregated to a unique company policy number and renewal year. The sum of the weighted counts of these violations provides a driver's score. New weights for the score variable are developed and inspected to see whether the score is significant in predicting the loss ratio. The score variables were originally aggregated in three different ways suggested by Great American. New weights derived from the parameter coefficients of regression models run in the original case study are evaluated, and models from this capstone and the original case study are compared to see whether the new weights significantly affected the ability of a person's personal driving record to predict the loss ratio.

Ankur Jain, Credit Scoring, August 2015, (Peng Wang, Amitabh Raturi)

Credit scoring can be defined as a technique that helps credit providers decide whether to grant credit to consumers or customers. The goal of this project is to identify bad or risky customers by observing trends in spending and payments. Logistic regression is used to identify customers likely to default. The data were collected from the Barclays data warehouse and contain nearly 139 variables over 12 months. The steps involved are exploratory analysis, data preparation, missing-value imputation, model building, diagnostics, and validation. Sixteen of the 138 independent variables were found significant and were used to fit the multiple logistic regression model. Maximum likelihood parameter estimates from the model are significant at the 5% level. A higher concordance, lower misclassification rate, and higher AUC indicate that the fitted model has good discriminatory power and clearly separates events from non-events. Leverage is used to identify outliers. The higher AUC for the testing dataset indicates that the model is not overfitting the data and thus has good predictive power.

Xiaoka Xiang, Revenue Forecasting of Medical Devices, August 2015, (Uday Rao, Yichen Qin)

Ethicon is a top medical device company that has recently purchased a forecasting tool in an effort to improve the accuracy of their account-level forecasting. Account forecasting is a challenge for medical device companies especially because this industry is highly influenced by contractual changes, unlike other retail industries. Efforts are taken to improve the accuracy of the forecasting including interviewing the managers of the accounts and comparing the output of another forecasting model built by internal analysts. Another factor to be considered is the revenue erosion due to rebate and selling through distribution channels.

Ravishankar Rajasubramanian, Text Classification: Identifying if a Passage of Text Is Humorous or Not, August 2015, (Peng Wang, Jay Shan, Michael Magazine)

Sentiment analysis in particular has been very useful for retail companies to understand how their products are being perceived by the customers and for ordinary people to see how a particular topic is being received on microblogging sites etc. Even though a lot of research has already been done in this field, there seems to be a lot of scope for implementing better models with increased accuracy in the classification task. Identifying positive or negative emotion in a passage of text is a problem that has been extensively researched in the recent past. In this study, a variant of the problem is chosen wherein the identification of whether a passage of text is humorous or not is the main goal.  The study is based on a competition conducted by “Yelp” as part of a yearly Data Challenge. This particular variant of sentiment analysis is very useful for Yelp because it would help them identify which of the newly written reviews are most likely to be humorous and display those at the top of the web page. Also from an academic standpoint, this problem is slightly more challenging to solve compared to the positive and negative sentiment problem because of the different flavors of humor such as sarcasm, irony, hyperbole etc. Solving this problem could be the first step in trying to identify those sub-classes within the humor category.

Pavan Teja Machavarapu, Industry Analytical Solutions – IBM, August 2015, (Amit Raturi, Peng Wang)

This project report demonstrates my contribution to the Analytics Division at IBM. Our team was responsible for developing software analytical solutions for various industries such as banking, insurance, and wealth management. I primarily worked on the Banking Solution during the summer and describe my contributions to it in this report.

The IBM Behavior Based Customer Insight for Banking solution provides the information and insight needed to deliver proactive service to a client's customers. It works with IBM Predictive Customer Insight.

The solution includes reporting and dashboard templates, sample predictive models, and application interfaces for integration with operational systems. It uses banking data related to transactions, accounts, customer information, and location to divide customers into segments based on their spending and saving habits, and predicts the probability of various life events. By anticipating customer needs, the solution enables banks to deliver personalized, timely, and relevant offers. For example, it can send alerts and targeted offerings and provide insights that help banks develop direct marketing campaigns. It also helps the bank's customers manage their finances.

Sai Teja Rayala, Credit Scoring of Australian Data, August 2015, (Dungang Liu, Yichen Qin)

In this project, credit scoring analysis is performed on an Australian credit scoring data set. Credit scoring is the set of decision models and underlying techniques that aid lenders in granting consumer credit. The data are partitioned into training and testing sets using simple random sampling, and models are built on the training data using three data mining techniques: logistic regression, decision trees, and linear discriminant analysis. The models are then validated on the testing set and evaluated on the basis of the area under the ROC curve and the misclassification rate.

Pushkar Shanker, Brand Actualization Study, August 2015, (Uday Rao, Edward Winkofsky)

The Brand Actualization study at FRCH | Design Worldwide was intended to evaluate ratings of various brands and build a model for brand assessment. The study comprised designing the survey, gathering response data, analyzing the responses, developing new insights for brand strategy, and building a Brand Actualization Score model. The Brand Actualization Model was built using the four key Brand Power Utilities, viz. 'Recognize', 'Evaluate', 'Experience', and 'Communicate'; each utility comprises various attributes. A survey was designed and rolled out, and respondents rated each brand on those attributes. To arrive at a Brand Utility Score and gain insight into the key attributes, an analysis was performed on the attributes constituting each Brand Power Utility. Further, the variance in each Brand Power Utility was analyzed and a model was developed to determine the Brand Actualization Score.

Zhengrui Yang, Cincinnati Reds Baseball Game Tickets Sold Prediction Model, August 2015, (Yichen Qin, Michael Fry)

The Cincinnati Reds are a professional baseball team based in Cincinnati. The number of tickets sold for each game can be influenced by different factors, such as the weather, the opponent, and the month. In this paper, a dataset containing Cincinnati Reds game records from 2014 is analyzed using several techniques, including linear models, leave-one-out cross-validation, and the lasso. The goal of this study is to build models that predict the number of tickets sold from these factors. Based on the results, the lasso model with λ equal to 403.429 is preferred.
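
A minimal sketch of fitting a lasso with a cross-validated penalty in Python with scikit-learn (the game-level features below are hypothetical stand-ins for the 2014 Reds records, and the penalty found here will not match the paper's λ of 403.429):

import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(5)
n = 81   # one season of home games
# Hypothetical game-level features; the study used 2014 Reds game records
games = pd.DataFrame({"temperature": rng.normal(72, 12, n),
                      "weekend": rng.integers(0, 2, n),
                      "rival_opponent": rng.integers(0, 2, n),
                      "month": rng.integers(4, 10, n)})
tickets = (18000 + 250 * games["temperature"].clip(50, 85) + 8000 * games["weekend"]
           + 5000 * games["rival_opponent"] + rng.normal(0, 3000, n))

# LassoCV chooses the penalty (alpha) by cross-validation; leave-one-out would use cv=n
lasso = LassoCV(cv=5, random_state=0).fit(games, tickets)
print("chosen alpha:", round(lasso.alpha_, 2))
print(dict(zip(games.columns, np.round(lasso.coef_, 1))))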

Joel Andrew Schickel, A Probit Classification Model for Credit Scoring Using Bayesian Analysis with MCMC Gibbs Sampling, August, 2015, (Jeffrey Mills, Martin Levy)

Classification models are widely used tools in credit scoring, and Bayesian approaches are growing in popularity in a variety of fields. The aim of this study is to show some of the advantages and disadvantages of Bayesian models as applied to a particular data set used for classification in credit scoring. First, Bayesian methods are used to attempt to improve on William H. Greene's credit scoring model, which predicts cardholder status and risk of default. Second, Bayesian methods are used to improve on the author's own predictive models. Although the benefits of a Bayesian approach are not clearly seen when it is compared to Greene's models, Bayesian methods do provide a modest advantage over the author's frequentist model for predicting cardholder status. Finally, it is seen that a Bayesian model based on a very small subset of the credit scoring data set can be significantly improved through the use of prior information about model parameters. This result reinforces the claim that Bayesian analysis can be especially helpful when decisions are made on the basis of a small data set or when the decision maker already possesses some knowledge about the factors that predict the relevant response variable.
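
As a compact sketch of the machinery involved, the following implements the Albert-Chib Gibbs sampler for a Bayesian probit model on simulated data (the data, the normal prior N(b0, B0), the iteration count, and the burn-in are illustrative assumptions, not the settings used in the study):

import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(6)
# Simulated stand-in for the credit data: two predictors plus an intercept, binary outcome
n, p = 400, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([-0.5, 1.0, -0.8])
y = (X @ beta_true + rng.normal(size=n) > 0).astype(int)

# Prior beta ~ N(b0, B0); a tighter B0 encodes stronger prior information
b0, B0 = np.zeros(p), 10.0 * np.eye(p)
B0_inv = np.linalg.inv(B0)
B_post = np.linalg.inv(B0_inv + X.T @ X)   # posterior covariance is fixed since var(z) = 1
chol = np.linalg.cholesky(B_post)

beta, draws = np.zeros(p), []
for it in range(3000):
    # 1. Draw latent utilities z_i from normals truncated at 0 (Albert-Chib data augmentation)
    mu = X @ beta
    lower = np.where(y == 1, -mu, -np.inf)
    upper = np.where(y == 1, np.inf, -mu)
    z = truncnorm.rvs(lower, upper, loc=mu, scale=1.0, random_state=rng)
    # 2. Draw beta from its conditional normal posterior given z
    mean = B_post @ (B0_inv @ b0 + X.T @ z)
    beta = mean + chol @ rng.normal(size=p)
    if it >= 500:                          # discard burn-in
        draws.append(beta)

print("posterior means:", np.round(np.mean(draws, axis=0), 2))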

Ruiyi Sun, Simulation of the Collection Center Operation System and the CACM Score Model, August 2015, (Dungang Liu, David Kelton)

This essay describes the author's contribution to two projects during an internship from January 2015 to April 2015: the "CACM Score Model" project and the "Collection Center Operation System Simulation using Arena" project. The "CACM Score Model" ("CACM" stands for "Collection Analytics Contact Model") groups accounts with similar characteristics into one of several score bands and predicts whether a defaulted account in a given score band will pay back its loan. It is a team project to build models that group defaulted accounts into three score bands and to design experiments for the placement strategy of the different score-band accounts. The "Collection Center Operation System Simulation using Arena" project contains Arena models built for different collection-operation scenarios. Two main models are built: a random model and a non-random model. Briefly, the random model assigns account calls randomly to a human or an automated computer system, while the non-random model uses attributes of the accounts for this assignment. For the random model, the Process Analyzer and Output Analyzer were used to analyze results; for the non-random model, three sub-scenarios were built. An overall comparison of the results for the two models was conducted via the Output Analyzer, and implementation of the results is discussed at the end.

Liberty Holt, Robot Viability and Optimization Study, July 2015, (Craig Froehle, Michael Magazine)

Surgeries are performed in many settings throughout healthcare organizations. In this study, the possibility of moving a DaVinci robot, used for robotically assisted minimally invasive surgery, from its primary location in a full-service, acute-care hospital to a new minimally invasive surgery center is explored and analyzed for viability. Many factors influence the final decision of the Operations Committee; the purpose of this analysis is to determine viability and, if the move is found to be viable, an optimal solution for cases in the specialties of General Surgery and Gynecology. An underlying assumption and expectation of opening a surgery center is that the cases moved there are at lower risk of inpatient admission and patient complications and will be performed more efficiently. This analysis uses case times and physician behaviors in the main operating room, with the intention that if current case times and turnover would fit in the surgery center, then gaining efficiencies would be very attainable given the nature of a surgery center versus a larger hospital.

Sravya Kasinadhuni, Email Fraud Detection- Spam and Ham Classification for Enron Email Dataset, July 2015, (Andrew Harrison, Edward Winkofsky)

Enron was a Texas-based energy trading giant and America's seventh-largest company before it declared bankruptcy. In this project, we look at Enron email data and classify the emails as spam or ham. The data are first divided into training and testing sets; the data are then cleaned and the classifier is trained on a subset of the data. The classifier used is a Naïve Bayes classifier, which is based on calculating the probability of each term in an email appearing in a particular class. Emails in the testing data are then classified with this classifier, and its accuracy on the testing data is obtained. Based on this accuracy, it can be determined whether Naïve Bayes classification is suitable for classifying these emails.
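
A minimal sketch of this kind of Naïve Bayes text classifier in Python with scikit-learn (the toy messages below stand in for labeled Enron emails; the original work may have computed the per-term class probabilities directly rather than via a library):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for labeled Enron messages (1 = spam, 0 = ham)
train_texts = ["win a free prize now", "claim your free offer today",
               "meeting moved to 3pm", "please review the attached contract"]
train_labels = [1, 1, 0, 0]

# Bag-of-words counts feed the per-class term probabilities inside multinomial Naive Bayes
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(train_texts, train_labels)

test_texts = ["free prize offer", "contract review meeting"]
print(clf.predict(test_texts))          # expected: [1, 0]
print(clf.predict_proba(test_texts))    # class probabilities for ham vs spam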

 Anila Chebrolu, Zoo Visitor Prediction, July 2015, (Craig Froehle, Zhe Shan)

The Zoo serves over a million visitors each year. Understanding the visitor patterns would allow the zoo authorities to optimize the visitor experience by enhancing their services. It would allow them to make better operational decisions like hiring the appropriate number of support staff for various seasons or stocking up food adequately. Better planning will ultimately help in driving up the revenues for the Cincinnati Zoo. The goal is to identify the model that best predicts the number of zoo visitors.  The approach uses neural networks and random forest methods to predict the count of the number of zoo visitors on a particular day using R and builds a time series model based on the daily zoo visitor data using SAS.

 Charith Acharya, Forecasting Daily Attendance at the Cincinnati Zoo and Botanical Gardens, July 2015, (Craig Froehle, Yichen Qin)

The Cincinnati Zoo and Botanical Gardens needed a forecast of the number of people arriving at the Zoo at daily and weekly levels, rolled up to monthly and yearly levels. Methods such as the simple average or the simple moving average were not very effective. This paper uses the ARIMA time series forecasting procedure to arrive at weekly, and subsequently daily, forecasts for the Cincinnati Zoo and Botanical Gardens for 2015.

 Elise Mariner, Analysis of American Modern Insurance Group Mobile Home Policies, July 2015, (David Rogers, David Kelton)

American Modern Insurance Group is a national leading small niche company in the specialty insurance business.  Located in Amelia, Ohio, the company has close to 50 years of experience in residential and recreational policies. Here the main focus will be on residential policies, specifically on mobile homes.  The original group-project objective for the Case-Studies course was to see if there is a predictive model to determine whether a customer will renew his or her policy, and to see if there is a correlation between the predictor variables and the binary response variable. 

Top significant factors that affect retention were identified through logistic regression modeling, decision trees and random forest modeling. To extend the original class group project to this individual capstone project, model refinement through further regression analysis was performed to develop a better model than the model that was originally chosen.

Ahffan Mohamed Ali Kondeth, Cincinnati Zoo – Daily and Monthly Forecasting Number of Visitors, July 2015, (Craig Froehle, Dungang Liu)

The aim of this project is to forecast daily and monthly attendance for a year; this forecast will help the zoo with staffing decisions. The project uses various analytic techniques, such as regression and time series forecasting methods, to forecast attendance. The attendance dataset provided by the Cincinnati Zoo contains daily attendance details from 1996 through 2014. Further features were added to the dataset to create a regression model, such as temperature, weather events, weekend indicators (Saturday/Sunday), and special-event flags (Christmas, Halloween, etc.).

 Anvita Shashidhar, Data Warehouse and Dashboard Design for a Hospital Asset Management System, July 2015, (Andrew Harrison, Brett Harnett)

Increasingly analytics is finding application in the field of healthcare and medicine, not only in dealing with clinical data but also in the efficient running of the hospital. Hospitals are a vast source of data and are now beginning to rely on analytics to make effective use of that data in improving the overall efficiency.  The purpose of this project is to build a data warehouse to create a master list of the hospital’s assets. In addition to building the data warehouse, the project also aims to create an interactive dashboard to present this data and publish the same on the hospital’s intranet, so as to help hospital staff track the hardware across different departments and locations.

This project is the first undertaking of its kind at the hospital and will prove to be invaluable in helping manage the hospital’s assets efficiently.

 Rajat Garewal, Simulation of Orders & Forecast to Align with the Demand, July 2015, (Michael Fry, Michael Magazine)

A multinational consumer goods company’s supply chain is managed by SAP’s planning method known as Distribution Requirements Planning (DRP). DRP’s built-in parameters control how the raw forecast gets modified to account for recent demand and supply. The modified forecast is then used in demand planning to request replenishments from the plants.  This modified forecast is known as DRP forecast. The objective of this simulation is to mimic SAP’s logic to use the order, shipment and forecast data to generate DRP forecast and determine parameter values that would minimize the forecast error. Producing more units than the orders would lead to additional costs to store the items in the distribution centers; on the other hand, producing fewer units would result in not being able to fulfill customer orders. Forecast accuracy depends upon the consistency of DRP forecast with the shipments. Simulation is a feasible way to experiment with different settings with minimal additional costs to the company, and without disrupting the current planning method.

Pratish Nair, Developing Spotfire Tools to Perform Descriptive Analytics of Manufacturing Data, July 2015, (Uday Rao, Michael Magazine)

As an analyst on the Business Intelligence team at Interstates Control Systems, West Chester, OH, I worked on data extraction, data cleaning, and data visualization, with a prime focus on descriptive analytics. Data from the production lines of various P&G plants are aggregated and pushed onto a dashboard for reporting; we aim to process the data and get it into a presentable form. Over a period of 10 weeks, I worked on multiple projects, gaining knowledge of manufacturing analytics and how processes are defined for the analyses related to them. Our BI team is currently working on extracting data from P&G production lines located at different sites; the plants that currently have the systems in place are located in Winton Hills, OH; Mehoopany, PA; Euskirchen, Germany; and Targowek, Poland. I worked on two experimental studies. The first was a comparison of the latency in the start time of the tool when set up with different servers, a Historian server and MS SQL. The second was a proof-of-concept study of integrating the R language into the Spotfire dashboarding environment. In addition, I explored opportunities where it could benefit existing projects.

Nanditha Narayanan, Finding the Sweet Spot, June 2015, (Maria Palmieri, Yan Yu)

The School of Human Services in the College of Education, Criminal Justice, and Human Services offers a Bachelor's degree in Athletic Training. Every year only a select number of students are accepted into the Athletic Training cohort, and because it is a niche program, attrition is desired to be minimal. This study uses SAS and R to develop a heuristic prediction of student performance and ability to graduate from the cohort, as a determinant for offering admission into the cohort. High school GPA and ACT/SAT scores, the student's UC GPA at the time of application to the Athletic Training cohort, and demographic information are scrutinized to measure student success. This project will enable the admissions office to make an informed decision to offer admission to promising applicants who are most likely to succeed in the program.

Ramkumar Selvarathinam, Identification of Child Predators using Naïve Bayes Classifier, June 2015, (Yan Yu, Michael Magazine)

The objective of this document is to detail the steps followed and the results obtained in a project that aims to identify child predators using text-mining methodology. A cyber predator is a person who uses the Internet to hunt for victims to exploit in any way, including sexually, emotionally, psychologically, or financially. Child predators are individuals who immediately engage in sexually explicit conversation with children; some offenders primarily collect and trade child-pornographic images, while others seek face-to-face meetings with children via online contacts. Child predators know how to manipulate kids, creating trust and friendship where none should exist. In the fight against online pedophiles and predators, a non-profit organization named Perverted-Justice has pioneered an innovative program to identify child predators by pretending to be a victim. More than 587 conversations between child predators and pseudo-victims are available, and this information base can be used to train a model that identifies a person as a child predator by analyzing the conversation. This is a classification problem that aims to identify whether or not a person is a predator based on his chat logs, and the Naïve Bayes classifier algorithm is leveraged to solve it.

 Praveen Kumar Selvaraj, Data Mining Study in Power Consumption & Renewable Energy Production, June 2015, (Yan Yu, Michael Magazine)

The supply and consumption of renewable energy resources are expected to increase significantly over the next couple of decades. According to the US Energy Information Administration, the share of renewable energy could rise from 13% in 2011 to up to 31% in 2040. Most of the renewable energy is going to come from wind and solar power sources. Understanding the dynamics of energy consumption and renewable energy production is important for effective load balancing in the grid; since energy cannot be stored, optimizing grid load is crucial for energy management. In this study, prediction models for wind power, solar power, and power consumption are built using weather data and applied to scenario data to arrive at the power shortfall that needs to be met.

 Lee Saeugling, Assortment Planning and Optimization Based on Localized Demand, June 2015, (Jeffrey Camm, Michael Fry)

We consider the problem of optimizing assortments for a single product category across a large national retail chain. The objective of this paper is to develop a methodology for picking from 1 to N (complete localization) assortments in order to maximize revenue. We first show how to estimate demand based on product attributes. SKUs are broken into a set of attribute levels and the fractional demand for these levels is estimated along with an overall demand for the attribute space. We use the estimated fractional demand and overall demand to create from 1 to N assortments and assign each store a single assortment in order to maximize revenue. We found a significant increase in revenue when going from one national assortment to complete localization. Further we found that most of the revenue increase of complete localization can be accomplished with far fewer than N assortments. The advantage to our approach is it only requires transactional data that all retailers have.

Chaitanya V. Jammalamadaka, Analysis of Twitter Data Using Different Text Mining Techniques, June 2015, (Jeffrey Camm, Peng Wang)

For organizations and companies, managing public perception is very important. Increasing usage of the internet means that it is important for companies to maintain an online presence. It is very important for these companies to monitor the public sentiment so that they are able to react to changing sentiments of potential and existing customers.  Considering the instant communications that take place on Twitter, it can be a very useful tool to achieve this goal.  The aim of this project is to identify and compare different methods of visualizing Twitter data related to the company Apple. These methods give useful insights which could help any business. The scope of the project is limited to exploring ways to get these insights.

Abhinav Abhinav, Forecasting the Hourly Arrivals in Cincinnati Zoo Based on Historical Data, April, 2015 (Craig Froehle, Yichen Qin)

The Cincinnati Zoo is a popular place with a high frequency of visitors all year round. Maintaining the optimum level of supplies and staff can become problematic without proper planning; in its absence, Zoo management can face issues related to customer satisfaction and revenue management. A suitable way to plan is to forecast future hourly arrivals based on historical hourly arrival data. The primary obstacle in this exercise is the multiple levels of seasonality present: there are three types of seasonality in the hourly data, namely daily, weekly, and yearly. To tackle this, a combination of two time series models is selected as the final solution. The first model is a combination of individual Fourier time series models for each hour from 9 AM to 5 PM during a regular day; its controlling factor is the hour of the day, and it is more effective at capturing the yearly seasonality. The second model is a seasonal ARIMA at the monthly level; its controlling factor is the month of the year, and it is more effective at capturing the daily seasonality. Day-related variables are used as external regressors in both models to capture the weekly pattern, along with other external regressors such as early-entry-period and promotion indicators. Forecasted arrival figures for May 2015 are generated using the combination of these two models.

Swati Adhikarla, Store Clustering & Assortment Optimization, April 2015, (Jeffrey Camm, Michael Fry)

Gauging a product's demand and performance has always been a difficult task for retailers. To maintain a competitive mix and achieve targeted profits in today's retail chains, clustering and localization of stores are important, and they present an opportunity to gain a competitive edge. The baseline of this project is to develop a prototype solution for retailers that allows them to provide a better product assortment to meet the unique preferences of their customers. Clustering is crucial as retailers become more focused on customer-centric retailing. Hence, the first step of the project is to perform store clustering based on customer behavioral attributes. The second step is to achieve localization of the stores within each cluster; the essence of store localization is finding the right mix of products to carry in the assortment.
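
A minimal sketch of the store-clustering step in Python with scikit-learn (the store-level behavioral attributes below are hypothetical; real inputs would be derived from customer transaction data, and the choice of four clusters is illustrative):

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(7)
# Hypothetical store-level behavioral attributes
stores = pd.DataFrame({"avg_basket_size": rng.normal(45, 12, 200),
                       "trips_per_customer": rng.normal(3.2, 0.8, 200),
                       "share_premium": rng.beta(2, 5, 200),
                       "share_promo": rng.beta(3, 4, 200)})

X = StandardScaler().fit_transform(stores)       # put attributes on a comparable scale
for k in range(2, 7):                            # screen candidate k by silhouette score
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, "clusters, silhouette:", round(silhouette_score(X, labels), 3))

best = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)   # e.g. proceed with 4 clusters
stores["cluster"] = best.labels_
print(stores.groupby("cluster").mean().round(2))                # cluster profiles for localization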

Sally Amkoa, An Exploratory Analysis of Real Estate Data, April, 2015 (Shaun Bond, Yan Yu)

The vast amount of data available in real estate is still an underutilized resource in the industry. This study aims to explore a real estate dataset on the acquisition of properties by real estate investment trusts (REITs) in various geographic locations. The goal of this study is to utilize statistical software (R under R Studio) to manipulate the dataset into a format that will allow for the creation of statistical models that can provide actionable information.

Eric Anderson, Forecasting Company Financials for FootLocker, Inc., April, 2015 (Jeffrey Camm, Yichen Qin)

Forecasting company financials is very useful in predicting future Revenue and future Earnings per Share for various companies. The ability to understand company growth drivers can be very useful in making investment decisions. In order to predict future performance, it is important to examine the past observations for the dependent variable and all of the past observations for the explanatory variables. Of course, this assumes that company performance will follow a similar pattern in the future.  Time series models cannot always predict an exact forecast because unexpected shocks can occur to the economy.  However, in industries like consumer retail, there are statistics that seem to drive the performance of companies and can give us a close estimate of future performance. In this capstone paper, we use data visualization to show the impact of the economy on FootLocker company performance. Once we have used visualization to understand the drivers for company performance, we use an ARIMA model to account for seasonality. It takes into account past revenues for forecasting FootLocker Revenue.

Ginger Simone Castle, Using Simulation and Optimization to Inform Hiring Decisions, April, 2015 (W. David Kelton, Amit Raturi)

Historical project management data are used to create a simulation model that closely mirrors the day-to-day work of a team of employees at a marketing firm on client and internal projects. Output from the simulation is used to compare scenarios to evaluate queuing rules and to run optimization scenarios related to the future hiring of both permanent and contract employees. In the end, this analysis makes recommendations regarding future hiring during a period of rapid sales growth, thereby answering some key questions the executives have regarding the resources that will be required to successfully meet current and upcoming growth challenges. The final hiring recommendation for the largest growth scenario is 7 new permanent employees (2 C's, 4 D's, and 1 who can do both WPR and D) and the utilization of 9 contractors (2 C's, 4 D's, and 3 PRO's).

Damon Chengelis, A Study of Simulation in Baseball: How the Tomahawk Better Fits and Better Predicts MLB Win Records, April, 2015, (Jeffrey Camm, Yan Yu)

This study introduces the Tomahawk simulation method as a novel win estimator similar to the established Pythagorean expectation. I theorized that by simulating a season using a team's given record, the resulting win total will be a closer fit to the team's particular run distribution, more robust against outliers, and more readily adaptable outside of the MLB environment in which Pythagorean expectation was developed. Furthermore, the simulation provides a more relevant number for understanding how an MLB team will be affected by roster changes. I tested Tomahawk simulation in R using MLB game data from 2010 to 2014. The first test used the first half of a season to predict the second half; the other used one season to predict the next. Tomahawk simulation was compared to Pythagorean win expectancy, Pythagenpat win expectancy, and naïve interpolation, which simply assumes a team will repeat the same number of wins and so provides a relevant baseline for a win estimator with insight beyond the box score. Since I am interested in robustness against outliers, I measured fit using mean squared error. Tomahawk simulation outperformed the Pythagorean and Pythagenpat expectations when comparing one season to the next, and using one half of the season to predict the other showed Tomahawk on par with Pythagenpat. However, Tomahawk simulation provides a confidence interval and does not require finding an ideal exponent, suggesting it is an appropriate alternative to the Pythagorean and Pythagenpat win expectancies.
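
For reference, the classic Pythagorean expectation estimates a team's winning percentage from runs scored (RS) and runs allowed (RA) as RS²/(RS² + RA²). The following is a rough Python sketch of a resampling-style season simulation in the same spirit; it is a hypothetical illustration, not the paper's Tomahawk method, and the run distributions and tie-breaking rule are assumptions.

import numpy as np

rng = np.random.default_rng(8)
# Hypothetical per-game runs scored and allowed for one 162-game season
rs = rng.poisson(4.5, 162)
ra = rng.poisson(4.2, 162)

# Classic Pythagorean expectation (fixed exponent of 2)
pyth_wins = 162 * rs.sum() ** 2 / (rs.sum() ** 2 + ra.sum() ** 2)

def simulate_season(rs, ra, rng):
    # Replay a season by resampling game scores from the observed run distributions
    s = rng.choice(rs, size=162, replace=True)
    a = rng.choice(ra, size=162, replace=True)
    ties = s == a
    # break ties at random, as a crude stand-in for extra innings
    return int(np.sum(s > a) + rng.integers(0, 2, size=ties.sum()).sum())

sims = np.array([simulate_season(rs, ra, rng) for _ in range(5000)])
print("Pythagorean wins:", round(pyth_wins, 1))
print("simulated wins: mean", round(sims.mean(), 1),
      "95% interval", np.percentile(sims, [2.5, 97.5]))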

Auroshis Das, Assortment Planning Strategy for a Retailer, April, 2015 (Jeffrey Camm, Michael Fry)

Today, in the fast moving consumer goods space, retailers are facing quite a lot of challenges in providing the best offerings in terms of variety and affordability of products. This is further fueled by the increase in the number of retailers and thereby the competition. Under such a scenario it becomes imperative to leverage the power of consumer data, to make smart decisions that would translate to customer loyalty and satisfaction. This project was undertaken to optimize the assortments that go into stores of a retailer, considering the need to meet the variety in customer demands, while minimizing the cost involved in customization of the assortment. The aim here was to find out the optimal number and product-mix of assortments which would lie somewhere between the ideal store level customization (one assortment for each store) and the naive single assortment (one assortment for all stores). The approach taken was to first identify similar behaving / performing stores and then to collectively roll out assortments optimized for such similar stores. The first part of the approach involved clustering while the latter involved optimization for the share of different kinds of products that go into the assortments. The results from this project could serve as a guideline for retailers in assortment planning strategy. This is because it captures the true demand of their customers from the historical sales data while also finding the right product-mix which would serve the demand with minimum cost involved.

Preetham Datla, Predicting the Winners of Men’s Singles Championship Australian Open 2015, April, 2015 (Jeffrey Camm, Dungang Liu)

The role of analytics has increased a lot in the sports industry in recent times. It helps different stakeholders to analyze the various factors associated with the outcomes of different matches. In this project, tennis data for all men’s singles matches of the major tournaments was analyzed to predict the winners of Australian Open 2015 matches. Classification techniques like Logistic Regression and Support Vector Machines were used and their performance was compared. It was found that the Support Vector Machines algorithm gave better prediction results. This study was initially performed for the Sports Analytics competition organized by UC INFORMS and the Center for Business Analytics at the University of Cincinnati.

Nitish Deshpande, Forecasting based impact analysis for internal events at the Cincinnati Zoo, April, 2015 (Craig Froehle, Jay Shan)

The Cincinnati Zoo wanted to forecast visitor attendance for the next six months and quantify the impact of internal events on attendance. To do so, we analyzed the seasonality in the visitor-count series and found it to be doubly seasonal, with weekly and yearly seasonality. To forecast such a series, we built an autoregressive moving average (ARMA) model with factors for seasonality, internal events, external events, and bad weather. Data for 2013 and 2014 were used to build and test the model. The model gave several interesting results: hosting an internal event has a positive impact of 59% on average on the visitor count, and on bad-weather days attendance drops by 25% on average. The 'PNC Festival of Lights' was identified as the most popular and successful event at the zoo. Based on the findings and forecasts, we recommended that the zoo introduce more events along the lines of the Festival of Lights and provide incentives such as discounts or indoor events to offset the impact of bad weather.

Krishna Kiran Duvvuri, Predicting the Subscription of a Term Deposit Product of a Portuguese Bank, April 2015, (Jeffrey Camm, Peng Wang)

Direct marketing is a technique many institutions employ to reach potential customers who in general would not otherwise know about the product being marketed. The technique is especially employed for specialized products that are not essential on a day-to-day basis for the general public, such as insurance products, mutual funds, and special bank schemes. As these products have fewer takers than mainstream products such as household goods, targeting and marketing to a specific group of potential clientele becomes crucial for their successful sale.

In this project, we summarize models that predict whether a client of a Portuguese financial institution will subscribe to its term deposit product or not. These models will help the bank in identifying clients to whom calls can be made directly to successfully market the product. The project also helps in identifying key variables that influence the decision of subscription.

Brian Floyd, Stochastic Simulation of Perishable Inventory, April 2015, (Amitabh Raturi, David Kelton)

Antibodies are a critical and perishable inventory component for the operations of the Diagnostic Immunology Laboratory (DIL) and require time-intensive quality validations upon receipt. In this study, current ordering practices were evaluated to identify opportunities to reduce the workload surrounding antibody ordering and validations, for the DIL. Stochastic simulation, via Arena software, was used to estimate needed adjustments to order sizes that will reduce the frequency of quality validations and to assess the influence of expiration dates on waste levels over the course of a year. Simulations showed order sizes sufficient for one year of testing can lead to a 28% decrease in quality validations work and the expiration of antibodies does not create an appreciable level of waste. Given that the expiration of stock is not a substantial influence, in the future, static simulation in a spreadsheet environment can be used to estimate order sizes within the confines of a year.

Mrinmayi Gadre, Cincinnati Zoo: Predicting the number of visitors (zoo members and non-members) using forecasting methods, April 2015, (Craig Froehle, Yichen Qin)

This project aims to forecast the number of visitors, and the proportion of members among them, likely to visit the Cincinnati Zoo each day in 2015. Historical visitor and membership data were used, and relationships with other variables that affect attendance were taken into consideration when calculating the forecasts. This was done using ARIMA modeling in R; the data showed a weekly seasonal component, so a seasonal ARIMA model was used. The accuracy of the fitted model was tested using different accuracy measures, and the model was then used for forecasting. Knowing an estimate of the number of visitors ahead of time will aid the zoo in decisions such as setting the annual budget, the number of memberships to be offered, and the offers or events to be organized. It should help the zoo devise strategies that increase the number of visitors on days when fewer visitors are expected and thus increase revenue.

Chandrashekar Gopalakrishnan, Analysis of visitor patterns at the Cincinnati Zoo and the effect of weather on the visitor counts, April 2015, (Craig Froehle, Yichen Qin)

The Cincinnati Zoo wants to use the visitor data it has collected over the past few years to come up with better estimates of the number of visitors expected to arrive in the next few months. The zoo hopes to use these estimates to make its planning process more efficient.

In this project, I will attempt to understand the influence of weather-related external factors on the number of visitors to the zoo on any given day.

The focus of the analysis will be on how the actual amount of precipitation on a given day affects the visitor count, and whether it also has an effect one or two days later. I will also try to understand the effect of temperature on the visitor count by using each day's maximum and minimum temperatures as variables.

Sai Avinash Gundapaneni, Forecast use of a city bike-share system, April 2015, (Jeffrey Camm, Yichen Qin)

Bike-sharing systems provide a service that enables people to rent bicycles. The system is set up so that the whole process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city. These systems provide a convenient way to rent a bike from one location and return it to a different location based on the user's need. The use of such bike-sharing systems is on the rise; currently there are about 500 bike-sharing programs around the world.

The goal of this project is to forecast bike rental demand in the Capital Bikeshare program in Washington, D.C. by combining historical usage patterns with weather data. These systems generate rich data on trip duration, departure location, arrival location, and time elapsed, so a bike-sharing system can also function as a sensor network for studying mobility in a city.

In this project, I build statistical models using linear regression and CART to forecast demand as accurately as possible. The data are split into training and test samples, which allows out-of-sample testing, the most reliable way to assess the performance of a model fit.
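
A minimal sketch of the train/test comparison described above, assuming Python with scikit-learn and hypothetical feature and file names:

```python
# Sketch: out-of-sample comparison of a linear model and a CART-style tree
# for hourly bike-rental demand; feature and file names are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

df = pd.read_csv("bike_share.csv")
X = df[["hour", "workingday", "temp", "humidity", "windspeed"]]
y = df["count"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

for model in (LinearRegression(), DecisionTreeRegressor(max_depth=6)):
    model.fit(X_train, y_train)
    rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
    print(type(model).__name__, "test RMSE:", round(rmse, 1))
```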

Manu Hegde, CrossRoads Church Survey, April 2015, (Jeffrey Camm, Peng Wang)

In this project, we seek to understand the characteristics of church-goers who believe that attending has been a spiritually enriching experience. By understanding these characteristics, the church can identify its most satisfied parishioners and also understand the nature of dissatisfied parishioners. With this segmentation, it can choose different courses of action for its incoming population.

A Likert-scale survey was circulated, and the results are analyzed.

Anthony Frank Igel, Analysis and Predicted Attendance of the Cincinnati Zoo: Special Event Influence, April 2015, (Craig Froehle, Yan Yu)

The Cincinnati Zoo opens its doors to the public 365 days a year, with nearly half of those days featuring special events; 33 different special events have been hosted over the past five years. This investigation documents the approach, numerical analysis, and business implications of prediction models containing special events in unique combinations. An autoregressive time-series forecasting method and generalized linear regression models were used to predict future attendance of Zoo patrons based on three different regression models. The results show that not grouping the special events as a single entity yields a more reliable model, and those predicted results were given to the Zoo for additional analysis.

Adhokshaj Shrikant Katarni, Forecasting number of visitors to Cincinnati Zoo using ARIMA and TBATS, April 2015, (Craig Froehle, Yichen Qin)

In the zoo business it is very important to know the arrival patterns of customers across the months in advance. Forecasting has long been one of the biggest factors in deciding the success or failure of an organization: it forms the basis for planning events, promotions, capacity, and staffing at the zoo. Using forecasting techniques, we can better understand the seasonalities and trends in the data.

The objective of the project was to produce forecasts six months ahead at the daily and weekly levels and then examine the trends in the forecasts to identify potential outliers in the predicted data. Forecasting was done using two techniques, seasonal ARIMA and TBATS: seasonal ARIMA was used to predict the number of visitors at the weekly level, while TBATS was used to predict the number of visitors at the daily level.

Vikas Konanki, Credit Scoring of Australian Data Using Logistic Regression, April 2015, (Jeffrey Camm, Yichen Qin)

In this project we perform a credit-scoring analysis on Australian credit data.

Credit scoring can be done using various statistical techniques; here we use logistic regression. The approach is to first divide the data into training and test sets. We then decide which independent variables are significant enough to predict the dependent variable. Once the variables are selected, we build a model with them to predict the dependent variable. This model is then validated on the test data set, and its performance is checked using the misclassification rate, the ROC curve, the area under the ROC curve, and the KS statistic.
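
A minimal sketch of this workflow, assuming Python with scikit-learn, a hypothetical data file, and a 0/1 target column named "default"; the project itself used other tooling:

```python
# Sketch: logistic-regression credit scoring with the out-of-sample checks
# described above (misclassification rate, AUC, KS); names are hypothetical.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve

df = pd.read_csv("australian_credit.csv")
X, y = df.drop(columns="default"), df["default"]          # y assumed coded 0/1
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
p = clf.predict_proba(X_test)[:, 1]

misclass = np.mean((p > 0.5).astype(int) != y_test)
auc = roc_auc_score(y_test, p)
fpr, tpr, _ = roc_curve(y_test, p)
ks = np.max(tpr - fpr)           # KS statistic: max gap between the two CDFs
print(f"misclassification={misclass:.3f}, AUC={auc:.3f}, KS={ks:.3f}")
```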

Regina Krahenbuhl, Predicting the Outcome of School Levies, April 2015, (David Brasington, Yichen Qin)

In the state of Ohio, a public K-12 school district goes to the public to ask for money to cover the remainder of its budget after state and federal dollars. The district asks for the money in the form of a tax levy on the ballot during a local election. With over 600 school districts in 88 counties, Ohio has a very diverse populace. In this study, regression analysis was used to identify the factors that affect the outcome of a levy: pass or fail. A detailed explanation is given of how these taxes are assessed and of what other studies have identified as predictors. To perform the analysis, data from 2013-2014 were gathered. The dependent variable (the outcome) is dichotomous; therefore, a binary-response regression with a probit link was chosen. In addition, the following statistical procedures were examined: bivariate probit, interaction terms, exclusion of classes of variables, and several robustness checks. The final model found these variables to have a statistically significant impact on the outcome of a levy: percent of the population with a bachelor's degree or higher, district state revenues per pupil, the month the election was held, millage amount, type of levy, percent of minority students in the district, and district salaries as a percent of operating expenditures.

Karthik Reddy Mogulla, A Trend Analysis and Forecasting of Cincinnati Zoo Membership and Member Arrivals, April 2015, (Craig Froehle, Yichen Qin)

The Cincinnati Zoo & Botanical Garden is the second-oldest zoo in the United States, and over 1.5 million people visit it annually. As part of its services it offers Gold, Premium, and Basic annual membership plans for Family, Single Parent, and Individual members. The goal of this project is twofold: (1) to build interactive dashboards in the QlikView visualization software that capture trends over time in member arrivals and Zoo membership and identify the underlying patterns; and (2) to forecast member arrivals at a daily level using an autoregressive integrated moving average (ARIMA) time-series model. Internal and external factors affecting the arrival patterns are identified and used as external regressors in the ARIMA model. The impact of factors such as temperature, precipitation, internal Zoo events, corporate and educational events, day of week, and month on member arrivals is analyzed.

Swaraj Mohapatra, Process mining: the missing link between model-based process analysis and data-oriented analysis techniques, April 2015, (Yan Yu, Yichen Qin)

Process mining is used to extract process-related information from event data. The goal of this report is to introduce process mining not only as a technique but also as a method. The method makes it possible to automatically discover a process model from the events recorded by an enterprise system. The importance of process mining is underscored by the astounding growth of event data. Process mining addresses the limitations of traditional approaches to business process management and of classical data-mining techniques. The report explains the three different types of process mining, namely process discovery, conformance checking, and enhancement, and also explains how to implement process mining on event data.

Chaitra Nayini, Using Visual Analytics and Dynamic Regression Modeling to Forecast Trends and Optimize Station Capacity for a Bike Share Service, April 2015, (Yichen Qin, Jeffrey Camm)

Forecasting trends for data that exhibit time-series behavior at multiple levels can be very complex. However, having these forecasts can help an organization's planning and decision making. It also enhances customer service, since the organization can anticipate future trends and be prepared to meet customers' expectations. This project aims at building a dynamic regression model that quantifies the influence of predictor variables on the dependent variable while taking the time-series behavior into consideration.

The dataset for this project contains information released by Capital Bikeshare, a short-term bike rental service operating in the Washington, D.C., metro area; it covers every trip taken since September 2010. A supplementary dataset with hourly weather information and trip volume for each hour is also used. This paper discusses the application of negative binomial regression and ARIMA across three different models. A set of geographical and temporal visualizations is also built for exploratory data analysis and to enhance understanding of the statistical models. These methods can be extended to predict trip volume for each individual station.
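
A minimal sketch of a negative binomial regression for hourly trip counts, assuming Python with statsmodels and hypothetical file and column names; the project's exact specification may differ:

```python
# Sketch: negative binomial regression of hourly trip counts on weather and
# calendar covariates; file and column names are hypothetical.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_csv("bikeshare_hourly.csv")

nb = smf.glm("trips ~ temp + humidity + windspeed + C(hour) + C(workingday)",
             data=df,
             family=sm.families.NegativeBinomial(alpha=1.0)).fit()
print(nb.summary())
```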

Matthew Martin Norris, Predicting Remission in Bipolar Disorder: An Exploratory Study, April 2015, (Yan Yu, James Eliassen)

Predicting treatment outcomes in psychiatric illnesses remains understudied, and analytic approaches from existing studies can be effectively applied to this problem. Fewer than 50% of first-line treatments work in bipolar disorder, so identifying which individuals will respond to a specific course of treatment is the ultimate goal. In this study we aim to predict who will respond to one of two medications. Several models were tested for predicting remission in drug-naïve subjects with bipolar disorder, and one model performed sufficiently well to report the results. The neural network performed well for both the hypothesis-driven model and the data-driven model; however, in the data-driven model its accuracy was somewhat reduced, and the other modeling techniques, except for classification trees, showed improvement.

Chaitanya Peri, Predicting “No Real Spiritual Growth”, April 2015, (Jeffrey Camm, Peng Wang)

Eleven friends started the Crossroads church community in 1995 to create an authentic community for people seeking truth about spiritual growth. In this project, models are built that predict whether a person attending services at Crossroads church experiences any spiritual growth. These models will help Crossroads recognize the key variables responsible for people not experiencing spiritual growth. The original dataset consisted of 10,222 observations and 35 survey questions. All data cleaning, data manipulation, and statistical analysis were performed using the open-source statistical software R.

Logistic regression is the approach used in this project, given the dichotomous nature of the dependent variable. Variable reduction for the first logistic model was done using stepwise logistic regression based on the AIC criterion; of the 35 variables, 10 proved to be significant. R treats categorical variables as factors without requiring dummy variables, which makes model building easier, but in hindsight the model includes all levels of a categorical variable without taking their individual significance into consideration. Hence, an alternative model was built using only the significant levels of the categorical variables found significant in the previous model. The second model turned out to be more robust, and it is simple and easy to replicate in other statistical software.

Apurv Singh, To Predict Whether a Woman Makes Use of Contraceptives or not Based on Her Demographics and Socio-economic Characteristics, April 2015, (Jeffrey Camm, Michael Fry)

The world population is rising rapidly. This is a problem for citizens across the globe because the availability of resources per individual decreases as the population increases. There is a need to create awareness among people of all sects about the use of contraceptives to tackle this problem. It has been observed that contraceptive use is much lower in developing countries than in developed ones. In this project, data about married Indonesian women were obtained, containing demographic and socio-economic information such as age, religion, number of children, and media exposure. Based on these characteristics, the goal is to predict whether or not a woman uses contraceptives. Logistic regression and classification and regression tree (CART) approaches from the course Data Mining I were used to build several models, which were then compared to see which gives the best prediction. Using this information, we can better understand the factors that influence contraceptive use among such women, which can help in targeting certain sections of society globally to educate them about the long-term benefits of contraceptive use.

Dan Soltys, Confidence Sets for Variable Selection via Bag of Little Bootstraps, April 2015, (Yichen Qin, Yan Yu)

Using the sample mean as our estimator, we replicate the functionality of the Bag of Little Bootstraps (BLB). Then, utilizing the BLB's improved efficiency for complex tasks, we apply the algorithm to the highly complex task of automated variable selection through stepwise regression and Lasso regression. The goal of this analysis is to obtain a confidence interval for the true model such that the LCL is a subset of the true model, which is, in turn, a subset of the UCL; that is, we want P(LCL ⊆ True Model ⊆ UCL) = 1 – α. To test our ability to generate this interval, we use Monte Carlo simulations for a variety of selection criteria and determine how close the proportion of true models captured by our interval is to 1 – α.
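
A minimal sketch of the BLB procedure for the sample mean, with illustrative parameter choices (subset size n^0.6, 20 subsets, 100 multinomial resamples); this is a generic implementation of the algorithm on toy data, not the project's code:

```python
# Sketch: Bag of Little Bootstraps (BLB) confidence interval for the sample
# mean; data and parameter choices are illustrative.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)          # toy data
n = len(x)
b = int(n ** 0.6)                         # little-bootstrap subset size
s, r, alpha = 20, 100, 0.05               # subsets, resamples, 1 - coverage

subset_cis = []
for _ in range(s):
    subset = rng.choice(x, size=b, replace=False)
    means = []
    for _ in range(r):
        # Size-n resamples are represented implicitly by multinomial weights
        # on the b subset points.
        w = rng.multinomial(n, np.full(b, 1.0 / b))
        means.append(np.dot(w, subset) / n)
    subset_cis.append(np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)]))

lcl, ucl = np.mean(subset_cis, axis=0)    # average the per-subset intervals
print(f"BLB {100 * (1 - alpha):.0f}% CI for the mean: [{lcl:.4f}, {ucl:.4f}]")
```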

Richard Tanner, Capital Bike-Share Rebalancing Optimization, April 2015, (Jeffrey Camm, David Rogers)

Bicycle-sharing systems are increasingly common fixtures in urbanized areas throughout the world. In large metropolitan areas, thousands of commuters use these systems every weekday to travel to and from work. As the systems grow larger and more complex, it becomes necessary to employ station rebalancers, who each day transport bicycles in a large van from stations with too many bikes to stations with no bikes. Because rebalancing operations are expensive, it is advantageous to perform them in a cost-minimizing manner. Using publicly available data, I attempt to compute the cost-minimizing rebalancing plan for the bicycle-sharing system that services the Washington, D.C. area.

Bijo Thomas, Daily Customer Arrival Forecasting at Cincinnati Zoo Using ARIMA Errors from Historic Customer Arrival Regressed with Fourier Terms, Promotion and Extreme Weather, April 2015, (Craig Froehle, Yichen Qin)

Presently, the Cincinnati Zoo estimates customer arrivals on a particular day based entirely on historical arrivals. In this article we discuss a methodology for forecasting daily customer arrivals at the Cincinnati Zoo using the annually seasonal time series while also accounting for promotions and weather conditions. It is challenging to account for annual seasonality directly in an ARIMA model, since R runs out of memory for seasonal periods longer than about 200. To handle the long seasonality, we first decomposed the time series into Fourier terms that account for its seasonality. We then used the Fourier terms, mean temperature, and a promotion dummy variable as regressors in a linear regression model and fit an ARIMA model to the error terms from this regression. The forecasts of the ARIMA errors combined with the linear regression equation gave the final forecast of daily customer arrivals.
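
A minimal sketch of the Fourier-terms-plus-ARIMA-errors approach, assuming Python with statsmodels instead of R and hypothetical file and column names:

```python
# Sketch: regression on Fourier terms, mean temperature, and a promotion
# dummy, with an ARIMA model on the regression errors; names are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.tsa.arima.model import ARIMA

df = pd.read_csv("zoo_daily.csv", parse_dates=["date"], index_col="date")
t = np.arange(len(df))
period, K = 365.25, 3   # annual seasonality approximated by K Fourier harmonics

fourier = {f"sin{k}": np.sin(2 * np.pi * k * t / period) for k in range(1, K + 1)}
fourier.update({f"cos{k}": np.cos(2 * np.pi * k * t / period) for k in range(1, K + 1)})
X = sm.add_constant(pd.DataFrame(fourier, index=df.index)
                      .join(df[["mean_temp", "promotion"]]))

ols = sm.OLS(df["visitors"], X).fit()            # linear regression step
arima = ARIMA(ols.resid, order=(2, 0, 1)).fit()  # ARIMA on the regression errors

# A final forecast adds the regression prediction for future regressor values
# to the ARIMA forecast of the error series, e.g. arima.forecast(steps=30).
```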

Aditya Utpat, Study of Campaign Oriented Enrollment Performance for Mailing Solutions, April 2015, (Uday Rao, Edward Winkofsky)

Mailing-solution businesses that run promotion campaigns using mailing lists seek to improve the success of their promotions, where success is measured by the campaign's enrollment rate. Typically, mailing lists come with additional data, and a structured analysis of these data can shed light on the suitability of prospective targets for a specific campaign. This project seeks to quantify and better understand the impact of these additional factors on the success (enrollment rate) of a campaign. The additional data include parameters such as PRIZM codes and census information that are widely accepted and used in the industry. Statistical models help uncover relationships among the different factors, identify the important or significant factors, and give a quantitative measure of their impact on the outcome. The results and findings from the analysis provide management with a data-driven approach for making policies and decisions regarding the fate of a campaign or product.

Karthick Vaidyanathan, Predictive Analytics in Sports Events, April 2015, (Jeffrey Camm, Michael Magazine)

Machine-learning techniques are often used to predict outcomes in areas such as retail, banking, insurance, and defense. The realization of their benefits has also led to their use in sports. Sports such as tennis, baseball, and football have started to tap the potential of predictive analytics, which can convert the vast amount of player data into useful strategies to win.

In this project, I use logistic regression to predict the outcomes of a very famous world event, the 2015 Australian Open, using past data on all 128 players participating in the tournament. The tennis data consist of all matches played from 2000 through December 2014 and include information on the winner, the loser, each player's rank and points in each set, the venue, the court surface, and tournament details. Logistic regression is a supervised learning technique in which past results are provided as input and the model predicts a dichotomous outcome for future matches. I used the matches of 2014 to train the model and then predicted the results of all 127 matches of the 2015 Australian Open.

 

2014

Zachary A. Finke, Determining the Effect of Traveling Across Time Zones on Major League Baseball Teams in the 2013 Regular Season, August 5, 2014 (Michael Magazine, David Rogers)
Major League Baseball (MLB) is composed of thirty teams spread out across the United States, including one team in Toronto, Canada, so travel is a large aspect of professional baseball. The goal of this project was to analyze the 2013 regular season to determine the effect of traveling across time zones on a baseball team's success in games. The hypothesis is that (1) traveling to another time zone significantly correlates with a team's success rate, and (2) when a team is playing away from home in another time zone, their chance of winning the game decreases, specifically in the first game after travel. The number of days of rest between games, the number of time zones from home, the number of time zones from the previous game, being home or away, and traveling after a day or night game were tested as independent variables to correlate with the dependent variable of winning or losing a game. Linear regression models were used and compared to test for statistical significance of correlation. The statistical analysis showed that there was no statistical significance in these models and traveling across time zones does not affect a team's success in games.

Bryce A. Alurovic, Emergency-Department Overcrowding: A Patient-Flow Simulation Model, August 4, 2014 (David Kelton, Uday Rao)
In the United States, annual visits to emergency departments increased from 90.3 million in 1996 to 119.2 million in 2006 (Saghafian et al. 2012, p. 1080). Continuation of these trends has helped to identify emergency-department overcrowding as a very serious problem. This project models the path of a patient through two systems. One base-case scenario models the current system while an alternative scenario re-directs non-critical patients to a local urgent-care center. Length of stay for re-directed and urgent-care patients, along with emergency-department and urgent-care center utilization, are compared across models. Patients classified as non-critical (triage level 4 or 5) and re-directed see a significant decrease in length of stay, while urgent-care patients see an increase in length of stay. While re-directing patients has a positive impact on emergency-department utilization, the magnitude is not as pronounced as expected. Patient re-direction helps to improve the emergency department and healthcare system, although additional re-direction policies may be preferable. All conclusions drawn from this research are justified by proper statistical analysis.

Mingye Su, Retail Store Sales Forecasting: A Time-Series Analysis, July 25, 2014 (Yan Yu, David Rogers)
In this project, common time-series forecasting techniques are implemented to determine the best forecasting models for retail-store sales prediction. ARIMA and exponential-smoothing methods are utilized and combined with regression analysis and seasonal-trend decomposition with Loess (STL). Forecasting models are iteratively built on the sales of stores at the department level. The study identified that ARIMA and exponential-smoothing models with STL decomposition generally perform well on the data across all departments, and that averaging forecasts of all models is superior to any single forecasting model that has been used in the study.

Sumit Makashir, Statistical Meta-Analysis of Differential Gene Co-Expression in Lupus, July 25, 2014 (Yan Yu, Yichen Qin)
In this study, we developed a statistical framework for meta-analysis of differential gene co-expression. We then applied this framework to systemic lupus erythematosus (SLE) disease. To perform meta-analysis of differential gene co-expression in SLE, we used data from five microarray gene expression studies. Several interesting results were observed from this study. Gene networks built from top differentially co-expressed gene pairs showed a consistent enrichment for established SLE-associated genes. Analysis of the network consisting of the top 1500 differentially co-expressed gene pairs showed that ELF1, an established SLE-associated gene, was differentially co-expressed with the largest number of other genes. Several results from analysis of this network are consistent with well-established facts related to SLE. The enrichments of gene modules for viral-defense response, bacterial response, and other immune-response-related terms are the key consistent findings. Many other results have very interesting biological implications.

Lingchong Mai, Supply-Chain Modeling and Analysis: Activity-Costs Comparison Between the Current Supply Chain and the "Ideal-State" Supply Chain, July 24, 2014 (Jeffrey Camm, Michael Fry)
This report examines the salon products supply-chain network of a local manufacturing company. The current supply-chain network is a multi-node supply chain with production sites, the company's mixing centers, customer distribution centers, and the retail stores. Products are delivered between sites by trucks. In order to reduce the number of nodes, activity costs, and lead-time durations, an "ideal-state" supply-chain model is developed. In this "ideal-state" supply chain, mixing centers and customer distribution centers are proposed to be eliminated. Production sites perform partial functions of mixing and distribution centers, and products are directly shipped to retail stores via third-party service vendors. Activity-costs analysis and sensitivity analysis are conducted on both the current supply-chain model and the "ideal-state" supply-chain model under different scenarios. The project is part of a supply-chain research project undertaken by the University of Cincinnati Simulation Center (http://www.min.uc.edu/ucsc) and the company. Due to the confidential agreements with the company, the company's name is not directly mentioned in this report, and all data mentioned in this report are disguised forms of the real data.

Chris Fant, American Athletic Conference Football Division Alignment: Minimizing the Travel Distance and Maintaining Division Balance, July 24, 2014 (David Rogers, Michael Fry)
Starting in 2015, the American Athletic Conference (AAC) will be adding additional universities and forming two Divisions and a conference championship game between Division winners for football. The AAC will consist of twelve teams with varying rankings and locations throughout the country. A mixed-integer linear-programming model was solved to provide the optimal Divisions to minimize the average travel distance while keeping a balance of ranking and travel between the Divisions. Hierarchical and K-means clustering analysis were also used to compare with the optimization results. Travel distance and opponent ranking were used to evaluate the realignment scenarios. The optimization analysis showed two teams were not placed in the appropriate Division. The proposed optimal alignment will save the AAC 13,299 miles round-trip every four years.

Ole Jacobsen, NFL Decision Making: Evaluating Game Situations with a Markov Chain, July 24, 2014 (Michael Fry, David Rogers)
This research analyzes and quantifies all of the scenarios or starting plays that take place in a National Football League game. A Markov chain is used to create an expected point value for 700 situations based on possession, down, distance, and field position, using 150,000 plays from five seasons of National Football League play-by-play data. The variability and mean of each upcoming play can then be weighed and considered from this model. The expected value is then modeled to test for linearity, as well as to find coefficients that value each starting situation. The model also identified the most volatile situations on the football field: the situations where running a single play consistently has the greatest impact on the expected point value. From these Markov-chain models, it is determined that the home-field advantage on any given play is worth 0.52 points, a same-field-position first-down turnover is worth 5.01 points, and the successful yardage required to make up for the loss of a down is 6.2112 yards.
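
A minimal sketch of how expected point values can be recovered from a Markov chain over game situations, using a toy three-state example with made-up transition probabilities (the actual model has roughly 700 states):

```python
# Sketch: expected point value of game situations from a Markov chain.
# Toy example with three hypothetical "situation" states; the remaining
# probability mass in each row absorbs into terminal scoring states.
import numpy as np

P = np.array([[0.55, 0.25, 0.10],
              [0.30, 0.45, 0.15],
              [0.10, 0.25, 0.55]])   # substochastic: rows sum to less than 1

# Expected immediate points collected when leaving each state (absorption
# probability times points scored); values are illustrative only.
r = np.array([0.20, 0.05, -0.10])

# Expected point value v solves v = r + P v, i.e. (I - P) v = r.
v = np.linalg.solve(np.eye(3) - P, r)
print(v)
```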

Xuejiao Diao, Sentiment Analysis on Amazon Tablet Computer Reviews, July 22, 2014 (Roger Chiang, Yan Yu)
The project uses multinomial naive Bayes classification to classify Amazon.com tablet-computer reviews as helpful or unhelpful, and into ratings 1 through 5. The top 50 features for helpful and unhelpful reviews were selected by ranking chi-square scores from highest to lowest; for ratings 1 to 5, the top 20 features were selected for each class. The optimal numbers of features for classifying reviews were obtained by comparing accuracy, precision, and recall for 1,000, 3,000, 10,000, 30,000, and 100,000 features. It was found that the more positive the rating, the more likely peer customers are to vote the review positively, and polarized reviews attract more total helpful/unhelpful votes than neutral reviews.
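
A minimal sketch of multinomial naive Bayes with chi-square feature selection, assuming Python with scikit-learn and hypothetical file and column names:

```python
# Sketch: multinomial naive Bayes with chi-square feature selection for
# helpful/unhelpful review classification; file and column names hypothetical.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

df = pd.read_csv("tablet_reviews.csv")          # columns: review_text, helpful
X_train, X_test, y_train, y_test = train_test_split(
    df["review_text"], df["helpful"], test_size=0.3, random_state=1)

clf = make_pipeline(CountVectorizer(),
                    SelectKBest(chi2, k=3000),   # keep the top chi-square features
                    MultinomialNB())
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```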

Paul Reuscher, The Affordable Care Act: A Meta-Analysis of the 18-34 Demographic Through Market Surveys, Price Elasticities, and Variable Dominance Analysis, July 21, 2014 (Jeffrey Camm, Jeffrey Mills [Department of Economics])
The purpose of this research is to better understand the issues concerning utilization of the Affordable Care Act program by the 18-34 demographic, which is disproportionately not participating in the program as intended. Participation by the 18-34 demographic is not just desired by the government; it is necessary to keep the system operable and affordable in the long run. Through market survey research, price-elasticity concerns, and variable dominance (hierarchical partitioning) analysis, I provide empirical information to further insight into these issues for the healthcare and public-policy audience.

Ya Meng, An Analysis of Predictive Modeling and Optimization for Student Recruitment (Jeffrey Camm, B.J. Zirger [Department of Management])
The Department of Admissions at the University of Cincinnati receives four to five thousand applications every year. After the evaluation of each application, the Department of Admissions sends out offers with or without financial aid to qualified applicants. With one or multiple admission offers, applicants make their own decisions to decline or accept the offer. This project provides a detailed analysis of the factors that influence applicants' decisions, based on the applicant's characteristics, from records from 2002 to 2012. The objective of this project is to find influential factors, especially financial aid, that applicants take into consideration when the decision is made. This paper introduces two predictive models: a logistic regression model and a classification-tree model, to unveil the association between offer acceptance and applicants' personal information, application and financial aid. The results suggest that the logistic regression model has better predictive performance than does the tree model, and application preference, age, test score, home distance, amount of financial aid, and high-school type are crucial factors for applicants' choices. As the final step of the project, optimization analysis of financial aid is conducted for applicants in 2012. Since financial aid is the only factor that the school can control in admission, the allocation of limited financial aid is necessary to attract ideal applicants. Based on the logistic regression model, different optimization models are discussed and a heuristic method of financial-aid allocation is developed.

Vignesh Rajendran, Recursive Clustering on Non-Profit Contributions: An Application of Hierarchical Clustering to Non-Profit CRM Data, July 11, 2014 (Michael Fry, Jeffrey Camm)
Customer segmentation is an important part of strategic decision making in customer relationship management. In this project, segmentation is applied to contributors of the United Way of Greater Cincinnati (UWGC), a non-profit organization that receives contributions from around 300,000 employees of more than 2,500 companies every year. By recursively using hierarchical clustering with Ward's linkage, this project proposes a novel method to cluster the contributing companies for UWGC. Applying this method to metrics that closely depict growth potential and customer behavior, the contributing companies are segmented into clusters that help UWGC better understand its contributors and better plan its contributor campaigns.
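
A minimal sketch of the recursive Ward-linkage clustering idea, assuming Python with SciPy and hypothetical metric names; the actual metrics and cut points used in the project may differ:

```python
# Sketch: recursive use of Ward-linkage hierarchical clustering on company
# contribution metrics; file, column, and cluster counts are hypothetical.
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.stats import zscore

df = pd.read_csv("uwgc_companies.csv", index_col="company_id")
features = df[["total_giving", "participation_rate", "growth_rate"]].apply(zscore)

Z = linkage(features.values, method="ward")
df["cluster"] = fcluster(Z, t=4, criterion="maxclust")    # first-level split

# "Recursive" step: re-cluster the largest segment with the same procedure.
largest = df["cluster"].value_counts().idxmax()
sub = features[df["cluster"] == largest]
sub_Z = linkage(sub.values, method="ward")
df.loc[sub.index, "subcluster"] = fcluster(sub_Z, t=3, criterion="maxclust")
```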

Aashin Singla, Modeling the Preference of Wine Quality Using Logistic-Regression Techniques, July 9, 2014 (Yichen Qin, Yan Yu)
The Vinho Verde wine is a unique product with a perfect blend of aroma and petillance that makes it one of the most delicious natural beverages. Quality is assessed in many ways, including physicochemical properties and sensory tests. The Viticulture Commission of the Vinho Verde Region rated the wine quality using the physicochemical properties, and these properties can be used to model wine quality. This analysis extends a group report done in a Data-Mining course project using decision trees, support vector machines, and neural-network methods. I analyzed the same data set to predict wine taste preferences by using two logistic regression approaches. As the output (dependent) variable is an ordered (or categorical) set, we considered using ordinal and multinomial logistic regression. For ordinal logistic regression, the dependent variable has to be an ordered response variable and the independent variables may be categorical, interval, or continuous scale and not collinear. For solving the problem of multicollinearity, linear regression is applied to determine the best model. After applying this technique, we realized that sulphate, an anti-oxidant and smelling agent, and volatile acidity, responsible for the acidic taste, are significant factors for grading the wine. When some of the assumptions of the ordinal logistic model were violated, multinomial logistic regression was applied. Multinomial logistic regression is used when the dependent variable has a set of categories that cannot be treated as ordered. Using this technique, residual sugar and alcohol have statistically significant negative effects throughout the various quality levels.

Ashmita Bora, Analyzing Student Behavior to Understand Their Rate of Success in STEM Programs, June 2, 2014 (Yichen Qin, Michael Fry)
The objective of this project is to analyze the students of the University of Cincinnati and their success rate in STEM (Science, Technology, Engineering, and Mathematics) programs. We are interested in identifying the patterns of the enrollment of students in the STEM programs in UC and understand if factors like gender, race, etc. have an impact on the success of students in the program. Although there are many interesting questions that can be answered, we focus our study on three analyses: (1) Exploratory analysis of the covariates for successful and unsuccessful students in UC. The unsuccessful students are defined as students enrolled in a program for more than 6 years and didn't graduate. (2) The second analysis is to perform predictive modeling to identify statistically significant covariates that predict the probability to switch from a STEM program to a non-STEM program. (3) The third analysis is predictive modeling to identify variables that determine the amount of time a student needs to graduate in a STEM program. Modeling techniques like logistic regression and accelerated failure time (AFT) models are used to model the probability of switching from a STEM to a non-STEM program and time to graduate, respectively.

Subramanian Narayanaswamy, A Descriptive and Predictive Modeling Approach to Understand the Success Factors of STEM and Non-STEM Students at the University of Cincinnati, May 16, 2014 (Yichen Qin, Michael Fry)
The objective of this project is twofold: (1) to understand and identify the demographic and academic factors that differentiate the performance of STEM and non-STEM students at the University of Cincinnati (UC); (2) to perform an in-depth descriptive analysis and build preliminary predictive models to identify the predictors of successful students at UC. Successful students are defined as those who graduate within six years of enrollment. In addition to descriptive statistics analyzing student performance, predictive models are presented that use logistic regression to estimate student success based on a variety of potential predictor variables. This work uncovers interesting comparisons between STEM and non-STEM students based on demographics and student background, while also identifying important characteristics of successful students at UC.

Vishal Ugle, Developing a Score Card for Selecting Fund Managers, May 7, 2014 (David Rogers, George Polak [Wright State University])
Absolute Return Strategies (ARS) analyzes directional investment performance patterns in the global foreign-exchange market using a proprietary data set provided to it through a strategic relationship with Citibank FX. ARS has data for about 40 currency hedge fund managers across the globe. The objective of the research is to develop a model that identifies the managers who can give better returns in the next six-month period and to develop a score card to rate the managers. ARS currently employs a model named the Alpha consistency model that rates managers on a five-point scale based on their returns in every six-month period since they have been active in the market. I call the new approach developed here Alpha consistency 2.0; it uses performance parameters such as returns, volatilities, and drawdowns for each manager over the period he or she has been active in the market. Since the data have a high degree of multicollinearity, I used exploratory factor analysis to generate factors that identify the underlying correlation structure among these variables and give loadings for each variable. I generated five factors, of which three are used to represent volatility and downside, returns, and drawdowns. After generating the factors, I calculated factor scores for each manager and used these scores to rate each manager.

Kiran Krishnakumar, Sales Prediction Using Public Data: An Emerging-Markets Perspective, April 18, 2014 (Yichen Qin, Edward Winkofsky, co-chairs)
In emerging markets, the availability of marketing data is limited by poor quality and poor reliability. This project describes ways in which companies can leverage the unexploited pool of publicly-available data to deliver analytical insights in order to support marketing initiatives. The project studies the case of a Spanish manufacturer in the bathroom-spaces industry. They are interested in pursuing marketing interests in emerging markets in Asia. This project uses various regression modeling and analytical techniques to build a statistical model to help predict product sales. The dataset used is custom sourced, which combines their internal point-of-sale data with 50+ sourced public datasets that include financial indicators, demographic indicators, and risk factors. The project uses SAS to conduct the analysis and leverages concepts including multivariate regression, generalized linear models, and logistic regression. This project provides an overview of issues related to data quality in emerging markets. It shows how companies can leverage public data to develop analytical insights when constrained by availability of reliable data.

Yichen Liu, Group LASSO in Logistic Regression, April 16, 2014 (Yichen Qin, Yan Yu)
When building a regression model it is important to select the relevant variables from a large pool that contains both continuous and categorical variables. Group LASSO (Least Absolute Shrinkage and Selection Operator) is an advanced method for variable selection in regression modeling. After groups of variables are predefined using a group index, it minimizes the corresponding negative log-likelihood function subject to the constraint that the sum of the Euclidean norms of the group-wise coefficient vectors is less than a tuning constant. Due to this constraint, the regression coefficients are shrunk and some groups are reduced to zero entirely. Thus, Group LASSO can improve the interpretability of the regression model via variable selection and stabilize the regression by shrinking the coefficients. The predefined group index ensures that once a variable is shrunk to zero, all the other variables in the same group are also shrunk to zero. Therefore, Group LASSO can select variables for models containing both continuous and categorical variables. This research project studies Group LASSO's performance in logistic regression. To show its advantages, Group LASSO is applied both to simulated data and to real data from the Worcester Heart Attack Study (WHAS). Ten-fold cross-validation was performed to find the tuning constant.
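
For reference, the group LASSO penalized logistic-regression objective is commonly written (a standard formulation, not quoted from the project) as
$$\hat{\beta} \;=\; \arg\min_{\beta}\; -\ell(\beta) \;+\; \lambda \sum_{g=1}^{G} \sqrt{p_g}\,\lVert \beta_g \rVert_2,$$
where $\ell(\beta)$ is the logistic log-likelihood, $\beta_g$ collects the coefficients of group $g$, $p_g$ is the size of group $g$, and $\lambda$ is the tuning constant chosen here by ten-fold cross-validation. Because the penalty is applied to whole group norms, an entire group of coefficients enters or leaves the model together.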

Abhimanyu Kumbara, Hospital Performance Rating: A Super-Efficiency Data-Envelopment Analysis Model, April 15, 2014 (Craig Froehle, Michael Fry)
The United States spends approximately 18% of its GDP on healthcare. Nearly $650 billion of the spending is due to inefficiencies in the healthcare market. Cost-containment proposals have focused primarily on payment reforms, with approaches such as pay for performance and bundled payments. The non-emergency nature of elective procedures provides great opportunity for reducing costs. Percutaneous cardiovascular surgery is a popular elective procedure, with more than half a million Americans undergoing the procedure every year. The average procedure charge ranges from around $27,000 to nearly $100,000. This, along with the shift towards outcome-based care models, motivates hospitals to become more efficient and provide high-quality, cost-effective care. This project provides a measure of efficiency by performing a super-efficiency data-envelopment analysis (SEDEA) on hospitals from Ohio and Kentucky that perform percutaneous cardiovascular surgery with drug-eluting stents. The data used in this study contain the procedure charges and quality information for approximately 86 hospitals from Ohio and Kentucky. The data were obtained from healthdata.gov, a healthcare-related data repository. The SEDEA model, implemented in R, uses cost and quality measures for each hospital to calculate the hospital efficiency scores and ranks the hospitals accordingly. Hospitals can use the ratings to assess their current market standings. Other healthcare-market participants, such as insurers, could use the ratings as a comparison tool for making cost-effective decisions about elective procedures.

Joshua Phipps, Exploratory Data Analysis for United Way, April 11, 2014 (Michael Fry, Jeffrey Camm)
United Way, a non-profit organization that collects donations and provides opportunities to volunteers in order to help the community, has recently digitized the past seven years of their donations and volunteering. The open-ended question of "what do these data tell us?" was posed. Through exploratory data analysis we show what the data hold, what they do not hold, and possible deficiencies, and we produce some insights into the segmentation of their donations. Further exploratory data analysis was conducted to determine the effect of volunteering on donation amount. Grouping volunteers by their donation behavior allowed United Way to better evaluate the interaction between volunteering and donating. We are able to show that it is worthwhile to push for more volunteers, and we give recommendations for better analyses in the future to tailor their efforts.

Mark E. Nichols, Evaluation and Improvement of New-Patient Appointment Scheduling, April 11, 2014 (David Kelton, Jeffrey Camm)
New Patient Lag (NP Lag) is the primary metric used to determine the efficiency of initial patient scheduling at University of Cincinnati Medical Clinics. It is defined as the Julian date of a new patient's appointment minus the Julian date when the new patient scheduled the appointment. NP Lag is the first impression of the clinic to the public and it varies across the thirty-plus clinics from approximately 1 week to over 7 weeks. Open-Access is a system of prescribed changes in clinic scheduling and operating procedures that is designed to bring NP Lag down to near-same-day access or near an NP Lag of zero. In an effort to examine some of the Open-Access methods prior to implementation, the scheduling at a particular clinic was studied. The neurology clinic, with its average 36 work-day NP Lag, was modeled in Arena with current scheduling logic and rules. After testing and validation, the model was altered based on selected aspects of Open-Access to evaluate the effects on NP Lag: removal of new/existing patient restrictions on open appointment slots, reduction of standard appointment slots from 30 to 20 minutes, and reduction in appointment cancellations near the appointment date. The simulation showed removal of the new/existing patient requirement from open appointments resulted in a simulated reduction in NP Lag of approximately 36 work-days, to a NP Lag of less than one work-day. The 36 work-day improvement was based on a 99% confidence interval and was the best NP Lag improvement of the three trials.

Matthew Schmucki, Visualizing Distribution Optimization, March 11, 2014 (Jeffrey Camm, Michael Fry)
Manufacturers strive to operate as efficiently as possible. Moving product incurs handling costs and shipping costs. Ensuring that product from manufacturing plants is optimally shipped to distribution centers that supply customers is a potential cost savings. Currently, linear programmers are able to model these distribution networks in programs such as AMPL and CPLEX but the solution is shown in a crosstab table format. These tables make it difficult to explore the solution. If the linear programmers were able to visualize the solution, it would be easier to share the data and discuss it with non-programmers. Ideally, the visualization software would also solve the problem creating a one-stop-shop. This would eliminate the need to enter information to AMPL or CPLEX, export the numerical solution, and import it to an additional piece of software. This paper discusses a no-cost Visual Basic Application (VBA) program that can be executed in Excel, solve a distribution problem, and then display the solution visually via Google Chrome. This method will save time, will not require programming knowledge, and will create a visualization of the solution with no additional effort. This visualization will help users see numerical solutions and better understand what a solution is indicating.

Joel C. Weaver, Enhancing Classroom Instruction by Finding Optimal Student Groups, March 3, 2014 (David Rogers, Jeffrey Camm)
Classrooms today are structured around a philosophy of teaching and learning that deviates from the traditional direct-instruction model of the past, in which students sit in rows and listen to lectures for the duration of the class period. Conversely, students in classrooms today are spending the majority of their time working together in one of many different types of student grouping arrangements. The assignment of students to different groups is typically driven by data representing students' ability levels, with the exact method of data utilization varying depending on the type of grouping arrangement that is desired by the educator (i.e., ranked ability, mixed ability, etc.). Determining the grouping assignments based on those ability levels, which should be derived from multiple types of student data, can be a time-consuming task. Additionally, when combining that data analysis with the multitude of potential classroom grouping constraints (separating students, limiting groups by gender, special needs, etc.), the student grouping task can become exceptionally arduous. This research project provides a solution to the difficulties that educators face when trying to create optimal student groups for specific grouping arrangements. A working optimization model was developed to provide educators with a useful tool for addressing their changing needs. In order to maximize accessibility and minimize cost to educators, the model was created in Microsoft Excel with a free open-source add-in called OpenSolver. By implementing this model, educators can enhance the classroom learning experience more easily through accurate and timely determination of optimal student grouping assignments.


2013

Wenjing Song, Methods for Solving Vehicle-Routing Problems in a Supply Chain, December 6, 2013 (Michael Fry, Jeffrey Camm)
The capacitated vehicle-routing problem (CVRP) is the problem of determining the optimal set of routes to be performed by a group of vehicles to meet the demand of a given set of customers or suppliers, subject to vehicle-capacity restrictions. The goal here is to determine optimal or near-optimal routes to get all the necessary materials from the suppliers to the manufacturing plant at the lowest cost. Input data include the volume of materials that must be picked up from each supplier and the distance matrix providing distances between all possible pairs of suppliers. Each vehicle is assumed to have a capacity of 52 units. Optimal and heuristic solution methods are explored in this project: optimal models are solved using AMPL/CPLEX, and a genetic algorithm coded in C represents a meta-heuristic solution method. We compare the solutions from the optimal and heuristic approaches; the difficulties and challenges of each method are also discussed.

Pramod Badri, Identifying Cross-Sell Opportunities Using Association Rules, December 6, 2013 (Jeffrey Camm, David Rogers)
Retailers process huge amounts of data on a daily basis, and each transaction contains details about customer behavior and purchase patterns. The objective of this study is to analyze prior point-of-sale (POS) data and identify groups of products that have an affinity to be purchased together. In this paper, I detail the steps involved in developing associated product sets using association rules. These rules can be used to perform market-basket analysis that helps retailers understand the purchase behavior of customers. With a given set of rules, the retailer would be better able to cross-sell, up-sell, and improve store design for higher sales.
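
A minimal sketch of mining such rules, assuming Python with the mlxtend implementation of Apriori and hypothetical file and column names (the project's own tooling and thresholds may differ):

```python
# Sketch: market-basket association rules from POS transaction lines;
# file, column names, and thresholds are hypothetical.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = (pd.read_csv("pos_lines.csv")
                  .groupby("transaction_id")["product"].apply(list).tolist())

te = TransactionEncoder()
basket = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

frequent = apriori(basket, min_support=0.01, use_colnames=True)
rules = association_rules(frequent, metric="lift", min_threshold=1.2)
print(rules.sort_values("lift", ascending=False)
           .head(10)[["antecedents", "consequents", "support", "confidence", "lift"]])
```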

Hang Cheng, An Analysis of the Cintas Corporation's Uniform Service, December 5, 2013 (David Rogers, Yichen Qin)
The Cintas Uniform Service provides services such as renting, designing, and manufacturing customized uniforms for employees in various companies. The objective of this paper is to identify influential elements for the Cintas Uniform Service, and to predict future performance and tendencies. The business data used in this study are from Cintas, and geographic and demographic data are from the US Government Census. We first explore the relationship between the Cintas uniform service and influential factors such as location, industry, and employment in a linear regression. Time-series analysis is used to predict the service usage based on the previous five years' data. This research provides a regression model that shows the influential factors for the Cintas uniform service and a statistical prediction for the future number of wearers, which guides future manufacture and market planning. By implementing the findings from this study, Cintas could optimize its uniform service management and form a comprehensive sales strategy in various regions and industries.

Ting Li, Estimation and Prediction on the Term Structure of the Bond Yield Curve, December 5, 2013 (Yan Yu, Hui Guo [Department of Finance])
Estimation and prediction of the term structure of the bond yield curve have been studied for decades. In this work, the Treasury bond yield data from 1985 to 2000 are studied. Both in-sample fitting and out-of-sample forecasting performance of the yield curve are evaluated. The Nelson-Siegel model is applied to the data for estimation. Then the time-varying regression coefficients are further studied based on time-series analysis with a Box-Jenkins model. We first fit the linear model with a fixed shape parameter and three parameters, which can be interpreted as level, slope, and curvature. Different scenarios are investigated further to search for the optimal shape parameter and improve overall performance. We consider the dynamic Nelson-Siegel model for improvement as well. Nonlinear regression is conducted where we treat the shape parameter as the fourth coefficient. Alternatively, a grid-search method is studied as a simplified dynamic model, where we grid-search on the shape parameter and find the optimal value within the range that achieves the minimum root mean squared error. Different models are compared and discussed based on in-sample fitting and out-of-sample forecasting at the end of the study. The results show that the model with a fixed shape parameter fits better in the short-term and long-term yields, while the nonlinear method performs better in fitting the medium-term yields and forecasting at longer horizons. The statistical software SAS is used for implementation.
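
For reference, the Nelson-Siegel yield curve takes the standard form (general formulation, not project-specific parameter values)
$$ y(\tau) \;=\; \beta_0 \;+\; \beta_1 \frac{1 - e^{-\lambda\tau}}{\lambda\tau} \;+\; \beta_2\!\left(\frac{1 - e^{-\lambda\tau}}{\lambda\tau} - e^{-\lambda\tau}\right), $$
where $\tau$ is the time to maturity, $\beta_0$, $\beta_1$, and $\beta_2$ are the level, slope, and curvature coefficients, and $\lambda$ is the shape parameter that is either fixed, grid-searched, or estimated nonlinearly as described above.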

Fei Xu, Multi-period Corporate Bankruptcy Forecasts with Discrete-Time Hazard Models, December 5, 2013 (Yan Yu, Hui Guo [Department of Finance])
Bankruptcy prediction is of great interest to regulators, practitioners, and researchers. In this study, I employ the discrete-time hazard model to predict corporate bankruptcy, using the manufacturing sector data covering the period of 1980-2008. The model has high in-sample and out-of-sample prediction accuracy, with all AUCs higher than 0.85. The study of distance to default reveals that it adds little prediction power to the model. Compared with the model with variables selected by the least absolute shrinkage and selection operator and with the Campbell, Hilscher, and Szilagyi (2008) model, the models chosen by stepwise selection have the same or better accuracy. In addition, stepwise selection gives robust models across different training/testing periods. Within the data scope of this study, including macroeconomic variables (three-month Treasury bill rate and trailing one-year S&P index) provides little improvement in prediction accuracy.

Mengxia Wang, Credit-Risk Assessment Based on Different Data-Mining Approaches, December 5, 2013 (David Rogers, Yan Yu)
Data mining is a computational process to discover patterns in large data sets. Credit scoring is one of the data-mining research areas, and is commonly used by banks and credit-card companies. A dataset that comes from a private-label credit-card operation of a major Brazilian retail chain was analyzed. The dataset contains 50,000 records of the application information from the credit card applicants. Six data-mining approaches, including the generalized linear model, classification and regression trees, the generalized additive model, linear discriminant analysis, neural networks, and support vector machines were examined to help identify unqualified applicants based on the given explanatory information. For each approach, at least one model was built using the R software. The performance of each model was evaluated by the area under the receiver operating characteristic curve.

Qi Sun, Application of Data-Mining Methods in Bank Marketing Campaigns, December 5, 2013 (Jeffrey Camm, Yichen Qin)
Direct marketing is widely used among retailers and financial companies due to the competitive market environment. The increasing cost of marketing campaigns, coupled with declining response rates, has encouraged marketers to search for more sophisticated techniques. In today's global marketplace, organizations can monetize their data through the use of data-mining methods to select those customers who are most likely to be responsive and suggest targeted creative messages. This project will present the application of data-mining approaches in direct marketing in the banking industry. The objective of this project is to identify the variables that can increase the predictive outcomes in terms of response/subscription rates. A decision model, chi-square automatic interaction detection (CHAID) is built to determine and interpret the variables. A logistic regression model is also built in this study to compare with the decision-tree model. By applying both methodologies to the direct-marketing campaign data of a Portuguese banking institution, which has 45,211 records and 16 fields, we concluded that the CHAID decision-tree model performs better than the logistic regression model in terms of predictive power and stability. From the results, we illustrate the strengths of both non-parametric and parametric methods.

Arathi Nair, Demand Forecasting for Low-Volume, High-Variability Industrial Safety Products under Seasonality and Trend, December 4, 2013 (Uday Rao, David Kelton)
This project studies demand-forecasting methods for different items that are sold by West Chester Holdings Inc. In this study, based on data from West Chester Holdings, we apply moving averages (which represent the current approach by the company), exponential smoothing, and Winters' method to predict future demand. We also provide a brief overview of ARIMA-based forecasting in SAS. Using forecast-quality metrics such as bias, root mean squared error (RMSE), mean absolute deviation (MAD), and tracking signal, we identify the best forecasting methods and their parameter settings. These forecasts will form the basis for setting target inventory levels of items at West Chester Holdings, which will drive their procurement and sales to maintain acceptable inventory turns.
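
A minimal R sketch of one of these methods, Winters' (Holt-Winters) exponential smoothing, together with the forecast-quality metrics listed above; the demand series, its monthly frequency, its assumed start date, and the train/test split are placeholders for illustration only.

    # Winters' method plus bias, RMSE, MAD, and tracking signal (illustrative only).
    dem   <- ts(demand_vector, start = c(2010, 1), frequency = 12)  # 'demand_vector' is assumed
    train <- window(dem, end = c(2012, 12))
    test  <- window(dem, start = c(2013, 1))

    fit <- HoltWinters(train)                      # level, trend, and seasonality
    fc  <- predict(fit, n.ahead = length(test))

    err  <- as.numeric(test) - as.numeric(fc)
    bias <- mean(err)                              # signed forecast bias
    rmse <- sqrt(mean(err^2))                      # root mean squared error
    mad  <- mean(abs(err))                         # mean absolute deviation
    tsig <- sum(err) / mad                         # tracking signal
    c(bias = bias, rmse = rmse, mad = mad, tracking = tsig)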

Yi Tan, Development of Growth Curves for Children with End-Stage Renal Disease, December 4, 2013 (Christina Kelton [Department of Finance], Yichen Qin, Teresa Cavanaugh [College of Pharmacy])
Growth retardation is one of the greatest problems in children with end-stage renal disease (ESRD) on dialysis. Growth failure results from multiple causes including poor nutritional status; comorbidities, such as anemia, bone and mineral disorders, and changes in hormonal responses; and the use of steroids in treatment. Although research has documented differences in growth rates between dialysis patients and healthy children, no large-scale effort has been devoted to the development of growth charts specifically for children with ESRD. The primary objectives of this study were to develop and validate height, weight, and body-mass-index (BMI) growth curves for children with ESRD. Using data from the United States Renal Data System (USRDS), all patients aged 20 or younger without previous transplantation, and undergoing dialysis, were initially selected for study. They were stratified into age groups, with finer (6-month) categories for the younger children, and grosser (1-year) categories for the older children. Then, children with height, weight, or BMI greater (less) than the mean plus (minus) 3 standard deviations (outliers) were excluded. The standard lambda-mu-sigma (LMS) methodology for developing growth charts for healthy children was applied to height, weight, and BMI data from USRDS. Growth-curve percentile values (3rd, 5th, 10th, 25th, 50th, 75th, 90th, 95th, and 97th percentiles) for each age group by gender were calculated. The performance of the LMS model was evaluated using three different criteria, and results were compared with those previously published for healthy children. Advantages and disadvantages of parametric versus nonparametric (spline) growth-curve estimation were explored.
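
For reference, the LMS method summarizes each age-sex group by a Box-Cox power (L), a median (M), and a coefficient of variation (S), and recovers a percentile curve as M*(1 + L*S*z)^(1/L). The small R sketch below shows that conversion; the L, M, and S values are made up for illustration, not estimates from the USRDS data.

    # Convert LMS parameters to growth-chart centiles (Cole's LMS formula).
    lms_centile <- function(L, M, S, p) {
      z <- qnorm(p)
      if (abs(L) < 1e-7) M * exp(S * z) else M * (1 + L * S * z)^(1 / L)
    }
    pcts <- c(.03, .05, .10, .25, .50, .75, .90, .95, .97)
    round(lms_centile(L = -1.2, M = 135, S = 0.05, p = pcts), 1)   # hypothetical heights (cm)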

Yuan Zhu, Multivariate Methods for Survival Analysis, December 4, 2013 (David Rogers, Jeffrey Camm)
Multivariate analysis refers to a set of statistical methods for simultaneously analyzing multiple observations on each individual or object under investigation. Multivariate methods like principal component analysis (PCA) are widely used for variable reduction and for eliminating high correlations. Survival analysis, also called time-to-event analysis, was primarily used in the biomedical sciences to study time to death; it is now also widely applied to other problems, such as the working life of machines. Survival data are usually multivariate and often contain significantly correlated variables, which results in parameter estimates with inflated standard errors and decreased power to detect the true effects in multiple-regression analysis. In this research project, the data are first transformed by PCA on the correlation matrix of the original variables. The transformed data are then input into Cox survival models, which are compared across different PCA methods. Results suggest that PCA with a combination of different correlation coefficients has the best performance; it can reduce redundant variables without losing much accuracy, since a large portion of the covariance from the raw data set is retained.
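
A minimal R sketch of the two-step idea, PCA on the correlation matrix followed by a Cox model on the component scores; the data frame dat, its time and status columns, and the choice of four components are assumptions for illustration.

    # PCA on standardized covariates, then a Cox model on the leading components.
    library(survival)

    X   <- scale(dat[, setdiff(names(dat), c("time", "status"))])
    pca <- prcomp(X)                        # equivalent to PCA on the correlation matrix
    summary(pca)                            # inspect cumulative variance explained
    scores <- as.data.frame(pca$x[, 1:4])   # keep the first few components (assumed)

    cox_fit <- coxph(Surv(dat$time, dat$status) ~ ., data = scores)
    summary(cox_fit)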

Anni Zhou, Insurance Customer Retention Improvement Analysis, December 4, 2013 (David Rogers, Jeffrey Camm)
In saturated markets, customer retention has become increasingly important and can bring substantial benefits to companies. In such a situation, it is advantageous to study how to improve customer retention using statistical and business-analytics methods. The objective of this project is to use different statistical and analytical methods, applied to site-built-dwelling data, to identify how to design an insurance product and improve customer retention. Suggestions for insurance companies regarding how to build such products for improving customer retention are presented. In this project, the principal component analysis method is used to identify which variables are important. The use of non-parametric survival analysis and parametric survival analysis with different distributions to model the survival object is discussed. A comparison of results from survival and event-history analysis with different distributions is also shown.

Jing Sang, A Statistical Analysis of the Workers' Compensation Claims System, December 3, 2013 (Uday Rao, Hui Guo [Department of Finance])
The workers' compensation claims system is used to provide injured workers with coverage of medical costs and income replacement. The frequency of paid workers' compensation claims advises employers and insurance providers where prevention activities are most needed, and which firms are likely to have good safety programs. From the perspectives of employers and insurance providers, understanding the type of people who potentially file a claim and the factors that lead to a high-cost claim is important. This study aims to identify variables that drive higher/lower costs in workers' compensation claims by analyzing past claims and payment records. This paper first describes the workers' compensation claims system, and then uses real-world data to explore the attributes and variables that may be important to a claim. It then determines the significant factors that influence the cost. Microsoft Excel, SQL Server Management Studio, and SAS are the main tools used in this study.

Yi Ying, Application of Data-Mining Methods in Credit Risk Analysis and Modeling, December 2, 2013 (David Rogers, Yichen Qin)
The financial crisis has led to an increasingly important role of risk management for financial-services institutions providing products such as loans, mortgages, and other financing. Specifically, risk departments utilize data-mining tools to monitor, analyze, and predict risks of various kinds in business. One of the key risks a financial institution has to deal with on a daily basis is credit risk: whether a borrower will default or make the payments on the debt. The goal of the project is to study different data-mining models for predicting the potential of loan delinquency based on borrowers' demographic information and payment history, and to identify the most important factors (variables) for risk assessment. Various modeling techniques are investigated to understand the credit risk: generalized linear models (McCullagh and Nelder, 1989), classification and regression tree models (Breiman et al., 1984), and the chi-squared automatic interaction detector (Kass, 1980). A combination of R, SAS, and SPSS Modeler was used to conduct the analysis. Ultimately, the four models developed predicted "Default"/"No default" correctly over 75% of the time in the training sample and over 60% in the testing sample. In terms of prediction precision, the generalized linear model outperformed the others. In terms of stability, the chi-squared automatic interaction detector model was the most robust model to use. The classification and regression tree model was the least stable.

Yucong Huang, An Analysis of the Twitter Sentiment System in the Financial-Services Industry, November 15, 2013 (Yan Yu, Yichen Qin)
Twitter has gained increasing worldwide popularity since its launch in 2006. In 2013, there were about 58 million tweets per day, with 135,000 new users signing up every day (http://www.statisticbrain.com/twitter-statistics/). The large volume of tweets can offer substantial business value and insight for decision making. In this project, we identify the sentiment of tweets in the financial-services industry using a supervised machine-learning approach. After collecting 852 tweets related to this industry, a rule-based quality-control method is designed to decrease human error in Twitter sentiment ratings. The results using support vector machines (SVM) are promising: we achieve an accuracy of around 70% on a three-category scale (negative/neutral/positive) and 60% on a five-category scale (strongly negative/negative/neutral/positive/strongly positive). After examining the impact of different features, we conclude that unigram features contribute the most to the overall accuracy, followed by lexicon features, and then encoding features. Finally, we study different combinations of features applied to different sentiment scales.
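
A toy R sketch of the supervised setup, unigram (bag-of-words) features fed to an SVM; the six tweets and their labels below are placeholders, not the 852 collected tweets.

    # Tiny bag-of-words SVM classifier (e1071); data are placeholders.
    library(e1071)

    tweets <- c("great service from my bank", "love the new mobile app",
                "fees are awful", "worst customer support ever",
                "card arrived today", "statement posted this morning")
    labels <- factor(c("positive", "positive", "negative",
                       "negative", "neutral", "neutral"))

    tokens <- strsplit(tolower(tweets), "\\s+")
    vocab  <- unique(unlist(tokens))
    dtm    <- t(sapply(tokens, function(w) as.integer(vocab %in% w)))
    colnames(dtm) <- vocab                 # unigram document-term matrix

    fit <- svm(x = dtm, y = labels, kernel = "linear")
    predict(fit, dtm)                      # in-sample predictions for the toy data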

Suxing Zeng, Cluster-Based Predictive Models in Online Education Management, November 4, 2013 (David Rogers, Yichen Qin)
As online education becomes a popular learning approach, the large amount of data generated by learning activities can be effectively utilized for evaluation and assessment purposes. Traditional analytical tools and techniques are being adopted by the online-education industry to improve services in critical areas such as student retention, grades, and graduation. To evaluate an online-learning environment for students at the University of Phoenix, cluster-analysis and regression-analysis techniques were implemented to develop a cluster-specific predictive model and a simple direct regression model for student service management. In the cluster-specific predictive model, finite mixture models are used to classify students based on their learning attributes. Then, based on the learners-segmentation framework, predictive models were developed to predict the target scores for given new-learner attributes. Numerical results show that the cluster-specific model performs better for model fitting. The advantages and limitations for each method are discussed and recommendations are provided for management to drive academic excellence.

Chiao-Ying Chang, Cintas Website Visitors Profile Analysis, October 11, 2013 (David Rogers, Jeffrey Camm)
This study is aimed at analyzing the profile of online mobile visitors and recommending divisions with high mobile traffic for future web enhancements. The study also explores attributes of visitors requesting more information through the website. Google Analytics is the main tool used for the visitor-profile analysis. During the analysis period, the Fire Protection, Hospitality, and Uniforms divisions saw an increase in the percentage of mobile visitors, and mobile traffic in these divisions is very likely to keep growing. Over 50% of the visitors requesting more information online use the Internet Explorer browser with the Windows operating system. Certain non-branded keywords such as "shred" and "extinguish" are heavily used in web searches, and Tuesday, Wednesday, and Thursday are the days with the most visits. This profile analysis helps Cintas better understand its visitors' backgrounds and create a website that is easier to navigate and more effective to use.

Partha Tripathy, Analysis and Implementation of Allocation of Papers to Conference Sessions using a K-Means Clustering Algorithm, August 7, 2013 (Jeffrey Camm, B.J. Zirger [Department of Management])
This paper looks at a problem for a professional association of management science that organizes an annual symposium and invites papers in multiple branches of management. Due to resource constraints, the sessions in this conference are restricted in size and duration. The association also promotes discussion among participants from different institutions, so only one author from an institution can present a paper in a single session. The sessions are defined by a common topic, which is identified by keywords associated with each paper. The keywords also carry a priority when allocating papers into similar groups. Optimization algorithms have been designed to assign resources to tasks based on priority, availability, and business constraints. These algorithms are closely related to classification problems in which categories have to be identified among observations. Cluster analysis is one such automatic-classification approach that can be formulated as a multi-objective optimization problem. By combining efficient data pre-processing, suitable modeling of input parameters, and iterations of cluster analysis, the study attempts to discover the best solution with the desired properties. This study uses a natural heuristic and a k-means clustering algorithm to identify natural clusters among papers in specific divisions. It involves analysis of the data and subsequent preprocessing to provide acceptable session sizes using the SAS software. The outputs of the heuristic are analyzed and the results are reported to the user.

Regina Akrong, Kentucky High-School Athletic Association Ninth Region Realignment: Minimizing the Traveling Miles between Schools, August 7, 2013 (David Rogers, Jeffrey Camm)
As with all costs for high-school education, travel costs for sports teams should be closely monitored. Here, the 20 high schools in the Ninth Region located in Northern Kentucky are examined and placed into four Districts to minimize the overall travel distances for the entire region. A mixed integer linear programming model was adapted and solved to provide the optimal Districts. It was found that indeed two schools were not currently placed in the most appropriate Districts. The savings achieved by redistricting will be considered with respect to reduction in total miles, fuel costs, student time spent, and safety.
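
A simplified stand-in for this kind of formulation is sketched below in R with lpSolve: schools are assigned to four district anchors so that total school-to-anchor distance is minimized and each district receives five schools. The coordinates and anchors are randomly generated for illustration; the project's actual model minimizes within-district travel among the real Ninth Region schools.

    # Simplified MILP sketch: assign n schools to k district anchors (lpSolve).
    library(lpSolve)

    set.seed(1)
    n <- 20; k <- 4
    coords  <- matrix(runif(2 * n, 0, 50), ncol = 2)    # hypothetical school locations
    anchors <- coords[sample(n, k), ]                   # hypothetical district anchors
    dist_mat <- as.matrix(dist(rbind(anchors, coords)))[1:k, (k + 1):(k + n)]

    obj <- as.vector(t(dist_mat))        # decision vector x[d, s], district-major order
    # Each school is assigned to exactly one district
    A1 <- t(sapply(1:n, function(s) as.integer(rep(1:n, times = k) == s)))
    # Each district holds exactly n/k schools
    A2 <- t(sapply(1:k, function(d) as.integer(rep(1:k, each = n) == d)))

    sol <- lp("min", obj, rbind(A1, A2), rep("=", n + k),
              c(rep(1, n), rep(n / k, k)), all.bin = TRUE)
    matrix(sol$solution, nrow = k, byrow = TRUE)         # district-by-school assignment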

Shilesh Karunakaran, Predictive Modeling for Student Recruitment, August 6, 2013 (Jeffrey Camm, B.J. Zirger [Department of Management])
The objective of this study is to analyze past student enrollment behavior and build a statistical model to predict the prospect of a future applicant's enrolling at the university. This paper details the steps involved in developing a predictive model through a data-driven analysis of past enrollment behavior, fine-tuning the prediction accuracy by training the model on in-sample data and testing it on an out-of-sample data set. The model ranks incoming applicants according to their likelihood of enrolling at the university. This ranking, in conjunction with other qualitative factors, can be used by the admissions department to make better decisions on admission offers, financial aid, and other matters such as program-specific campaigning.

Junyi Li, Cintas Customer-Preference Analysis Using Data-Mining Methods, August 6, 2013 (David Rogers, Jeffrey Camm)
The Cintas Corporation provides highly specialized services to approximately 900,000 businesses of all types, mainly throughout North America. In this project, we target only four services: mats, first aid and safety, hygiene, and document shredding. To understand its customers better and provide them better service, the Cintas Corporation collects large amounts of data from customers. This study is aimed at analyzing customer preference for these four products/services and predicting the number of customers lost. Excel is used for the basic customer analysis and R for predicting customer loss. Two procedures are used for prediction: a logistic regression model and a classification-tree model. The Akaike information criterion (AIC), Bayesian information criterion (BIC), and area under the ROC curve (AUC) are the criteria used to determine the best model. The results show that, for these data, the classification-tree model outperforms the logistic regression model.

Jerry Moody, A Simulation-Based Data Analysis of Production Lines at OMI, Inc., August 6, 2013 (David Rogers, David Kelton)
A local manufacturer is planning to expand its facility in 2014. A simulation study using Arena is employed to determine whether the company's two production lines used for the majority of its products will meet potential demand through 2022. Scenarios are examined to determine production amounts for various configurations they may wish to employ. Potential labor costs are also examined to give the managers further information when making expansion determinations. Results are presented in interactive-dashboard format using Tableau.

Adebukola Faforiji, Investigating Factors Associated with High-School Dropout Tendency via Logistic Regression and Classification Trees, August 5, 2013 (David Kelton, Edward Winkofsky, co-chairs)
An average of nearly 7,000 students become dropouts each day. This adds up annually to about 1.2 million students who will not graduate from high school with their peers as scheduled. Lacking a high-school diploma, these individuals will be far more likely than graduates to spend their lives periodically unemployed, on government assistance, or cycling in and out of the prison system (Alliance for Excellent Education, November 2011). Most high-school dropouts see the result of their decision to leave school very clearly in their earning potential. The average annual income for a high-school dropout in 2009 was $19,540, compared to $27,380 for a high-school graduate. The impact on the country's economy is less visible, but its cumulative effect is staggering (Alliance for Excellent Education, November 2011). Although it cannot establish causality, statistical analysis helps reveal how the factors under consideration are associated with a student's decision to graduate or drop out of high school. This research was approached using two separate methodologies, both to compare results and to determine which one provides clearer results. The methods considered are logistic regression and classification trees. Results from the analyses reveal that factors such as English-speaking proficiency, self-determination to succeed and finish high school, grades of B and above in science and English, discipline and safety within the school environment, race, and perception (self and external) have statistically significant relationships with dropout tendency.

Chandhrika Venkataraman, A Simulation Study of Manufacturing Lead Time: The Case of Tire-Curing Presses, August 5, 2013 (David Kelton, Uday Rao)
Non-assembly-line manufacturing systems are not easily streamlined using off-the-shelf solutions provided by standard operations-improvement methods such as just-in-time, KANBAN, lean manufacturing, etc. In this paper, we introduce a non-assembly-line manufacturing system that produces a custom-made finished product. Manufacturing lead time is extremely long, ranging from four to six months, and profit margins are razor-thin, about 8%. With material costs forming about 70% of the finished-product sale price, unpredictable manufacturing lead times eat away what is left of the profit margin because there is no visibility into final costs incurred in manufacture at the time of providing a quote to the customer. We study a simulation of the factory floor under different real-time scenarios to generate a range of finished-product lead times, both stage-wise and overall, as output measures of interest. The aim of the study is not so much to prescribe improved ways of working as it is to provide an understanding of how changes in raw-material rejections and in machine scheduling affect lead times. It is expected that with this information, the factory can arrive at better estimates of manufacturing costs and hence provide more realistic quotes to their customers. The study finds that even though individual sub-assemblies can have mismatches due to batching at the machine shop, final assembly times are lower than individual sub-assembly lead times, probably because the system corrects itself of mismatches before entering into final assembly, thus showing that orderly planning might be valuable ultimately, even if not immediately.

Vince Baldasare, Inventory Analysis of Restaurant Products, August 5, 2013 (David Kelton, David Rogers)
Restaurants have to control many factors in their daily operations. The processes they choose to use for managing their inventory can have major implications on their bottom lines. Having too much product in inventory can be costly due to space and waste. Not having enough product in inventory can decrease revenue and customer satisfaction. A method of inventory analysis is explored for a specific restaurant that uses fresh products. The logic for determining appropriate inventory levels is built into a custom tool that will allow the restaurant to make adjustments based on its business needs. Statistical process control charts are then created as a means for monitoring future results for the restaurant.

Ryan Prasser, A Regression Model Relating the Pass/Run Ratio to Score Differential and Elapsed Time in the NFL, August 3, 2013 (Michael Fry, Jeffrey Camm, Paul Bessire [PredictionMachine.com])
This research examines data from the 2012 NFL season to determine how a team's decision making in terms of calling running and passing plays changes as the game progresses. We generate several different multiple regression models relating a team's pass/run play-calling ratio on a particular drive to the predictor variables of elapsed time and score differential. While both of these variables and their interaction terms are statistically significant, the regression models explain only a small amount of the observed variance.

Jingfan Yu, An Application of Data Mining in House-Price Analysis, August 2, 2013 (Jeffrey Camm, Craig Froehle)
House-price prediction has always been an active field of study, of interest to both real-estate developers and consumers. This project describes the application of different statistical models to a house-price data set and tests which model has the most predictive power. Classification methods have been widely used in the real-estate industry: they can help developers better target potential buyers and better plan new construction, and they can help consumers choose house locations wisely within their budgets. We present the application of data-mining approaches to house-price analysis in the real-estate industry. The objective of this project is to compare the performance of two predictive methodologies, multiple linear regression and regression trees; we also consider k-means clustering models. These approaches were applied to Boston house-price data from the UC Irvine Repository of Machine Learning. The results suggest that the regression-tree model has the best predictive performance but the least model stability. The multiple linear regression model has the best stability and acceptable predictive power. Clustering is not recommended for our data.

Brian Arno, Predicting Graduation Success of Student Athletes, August 2, 2013 (Jeffrey Camm, David Kelton)
Athletics are an important part of a university -- they provide community and pride in the school, a source of revenue, and indirectly serve as a recruitment tool. The success of a university athletic department can directly be attributed to the success of the athletes themselves. The purpose of this study is to analyze data to identify what, if any, variables can be used to predict the success, in terms of graduation, of student athletes. This paper discusses the two methods employed in the study -- both exploratory data analysis utilizing visualization techniques, followed by logistic regression for development of an analytical model. The data analyzed constituted real-life information on student athletes from the University of Cincinnati.

Matthew Sonnycalb, Simulating Correlated Random Variates Using Reverse Principal Component Analysis, August 2, 2013 (David Kelton, David Rogers)
Failing to model correlated input variables appropriately is one of the most common inadequacies in dynamic simulation software and can lead to significant errors in simulation results. Principal component analysis is a multivariate technique that forms new uncorrelated random variates as linear combinations of the original correlated variates. This study evaluates whether independently sampling from those new variates and reversing the principal-component-analysis transformation can efficiently match the correlations, means, and variances of an original sample. A sampling algorithm is developed in R using bootstrapping and accept-reject criteria. The method is then evaluated using samples of correlated Weibull variables. The method performs well when the correlated variates have reasonably symmetric distributions, with no observable differences in correlations, means, and variances. The method becomes inefficient and introduces significant bias when the variates become highly positively skewed.
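
A compact R sketch of the procedure described above, on simulated correlated Weibull variates: transform to principal-component scores, bootstrap each score column independently, then reverse the transformation. The correlation of 0.6 and the Weibull parameters are arbitrary choices for illustration, and the simple column-wise bootstrap stands in for the study's accept-reject sampling algorithm.

    # Reverse-PCA resampling sketch on simulated correlated Weibull variates.
    set.seed(1)
    n <- 1000
    z <- matrix(rnorm(2 * n), ncol = 2) %*% chol(matrix(c(1, .6, .6, 1), 2))
    x <- qweibull(pnorm(z), shape = 2, scale = 1)      # correlated Weibull sample

    pca    <- prcomp(x, center = TRUE, scale. = FALSE)
    scores <- pca$x
    # Independently bootstrap each (uncorrelated) component column
    new_scores <- apply(scores, 2, function(s) sample(s, n, replace = TRUE))
    # Reverse the principal-component transformation
    x_new <- sweep(new_scores %*% t(pca$rotation), 2, pca$center, "+")

    cor(x); cor(x_new)            # correlations should be roughly preserved
    colMeans(x); colMeans(x_new)  # means should be roughly preserved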

Meghan Moore, Workflow Simulation of the Emergency/Radiology Department Handoff at UC Medical Center, August 1, 2013 (Craig Froehle, David Kelton)
University Hospital's Radiology Department in Cincinnati is responsible for capturing body-tissue images to help diagnose a patient's ailment, which subsequently dictates the patient's course of treatment. It is of particular importance that patients coming in from the Emergency Department (ED) have an accurate and efficient visit, as many of these patients are in serious condition and require treatment as soon as possible. The efficiency of this process hinges on the availability of radiology physicians, as the attending physician is present only during day shifts while the resident physicians are available around the clock. The goal of this study is to simulate the current workflow handoff between the ED and the Radiology Department. After development of a valid model, the value of adding an additional attending is explored, considering four different scheduled shifts for the extra physician. For system improvement, the simulation results suggest the implementation of another attending physician during an evening shift (3pm to 3am) or overnight shift (7pm to 7am). With a baseline of around seven hours in the initial system, these two scenarios reduce an ED patient's average time in the radiology department to less than two hours.

Shannon Downs, Optimal Bivariate Clustering of Binary Data Matrices, August 1, 2013 (David Rogers, George Polak [Wright State University])
Bivariate clustering can be applied to data in a matrix to optimize similarity or dissimilarity among elements by rows and by columns simultaneously. Areas of relevance include cellular manufacturing. In this project, a series of programs was coded in GAMS to perform bivariate clustering of a binary dataset of any dimension into any given number of clusters. Two models from the literature and two new models were explored, where each model makes use of a distance measure between the elements of the dataset. Seven methods of calculating the distance measure were used to evaluate the effectiveness of each model. Because the various distance measures and different objective-function equations made the objective-function values not directly comparable across the four models, the models were evaluated using popular cellular-manufacturing clustering quality metrics such as the proportion of exceptional elements, machine utilization, and grouping efficacy. The best approach clustered by rows and by columns with the addition of an interaction term, a linear indicator of whether an element (a point in the dataset with a value of one) fell in the same row and column grouping. The performance of this best linear model was comparable to an equivalent nonlinear model, but the execution time of the linear version was orders of magnitude faster, making it the more desirable model.

Matthew Skantz, Optimization of Airline Fleet Assignment and Ticket Distribution, July 26, 2013 (Jeffrey Camm, David Rogers)
Among the decisions with the greatest implications for airlines' profitability are fleet assignment and the methods used for passenger ticket distribution. Improper fleet assignment can result in lost revenue and unit costs too high to allow profitability, while distribution costs for tickets distributed through third-party agencies, paid by the airline, can amount to a significant portion of the ticket's value and must therefore be thoroughly understood. Two mixed integer linear programming models, each of which incorporates integer relaxation to lessen computational requirements, are developed and tested using an airline's ticket-purchase records over two days with very different demand profiles in order to recommend changes in these areas. Multiple runs of the models using different segments of passenger data in combination with demand unconstraining estimates and known distribution agency rates are used to find the most profitable combination of fleet assignment and distribution outlet retention. Results show that, while most aircraft are assigned to the best routes given fleet constraints on the days under review, there are areas of significant opportunity to increase or decrease capacity on a handful of routes. Moreover, given current demand, the results suggest that it is not in the airline's interest to limit the number of distribution channels, though the relative strength of the distribution agencies is determined and one is targeted as an opportunity for possible future disengagement.

Benjamin Milroy, A Comparison of Agent-Based Modeling, Ordinary Least Squares Regression, and Linear-Programming Optimization for Forecasting Sales, July 24, 2013 (Jeffrey Camm, Edward Winkofsky)
Due to advances in business intelligence and more widely available data, accurate sales forecasting and an understanding of media effectiveness continue to grow in importance in today's business. This wealth of data has led companies to new, sophisticated modeling approaches. This study examines three such methodologies: agent-based modeling, ordinary least squares regression, and linear-programming optimization. Using a data set from the consumer packaged-goods industry, all three models are used to fit three years of historical data. The forecasting ability of each methodology is then tested over a holdout period of one additional year. Finally, using the findings from each approach, I hope to gain some understanding of the effectiveness of media and trade promotion for the brand.

Jana Sudnick, Effective Automotive Issue Prioritization with Neural Network Pattern Recognition, May 30, 2013 (Uday Rao, David Rogers)
Automotive quality engineers process large amounts of "found issue" reports weekly and decide how to prioritize new issues. It is important that engineers do not miss potentially urgent or high-customer-impact items, because excellent customer service is expected. Engineers have many sources of customer data available to them, and it is imperative that they utilize as many relevant sources as possible to properly rank issues. The purpose of this project is to develop a tool that uses data from existing customer data sources to aid quality engineers in properly prioritizing the issues to address. A neural-network pattern-recognition model was trained to help engineers prioritize issues with quicker response times by emulating past issue-ranking decisions made by a panel of highly knowledgeable subject experts. Effective issue prioritization could result in more issues investigated, quality improvements, improved early detection, and potentially a reduction in warranty claims. The project resulted in two neural-network models that help engineers identify and address new customer issues.
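
A minimal R sketch of a single-hidden-layer classifier of the kind described, using the nnet package; the data frame issues, its priority factor, and the network size are assumptions for illustration, not the project's actual models or data.

    # Neural-network pattern-recognition sketch with nnet (illustrative names only).
    library(nnet)

    set.seed(1)
    idx   <- sample(nrow(issues), 0.7 * nrow(issues))
    fit   <- nnet(priority ~ ., data = issues[idx, ], size = 5,
                  decay = 0.01, maxit = 500)            # one hidden layer of 5 units
    preds <- predict(fit, issues[-idx, ], type = "class")
    table(predicted = preds, actual = issues$priority[-idx])   # holdout confusion matrix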

Zhen Guo, Decision Making in a Random-Yield Supply Chain, April 19, 2013 (Uday Rao, Michael Fry)
Supply uncertainty is widespread and has a significant impact on business operations, so it is receiving increased attention from both industry and academia. This project studies a two-echelon, single-supplier, single-retailer random-yield supply chain. We determine how suppliers and retailers make operations decisions geared toward optimizing their profits when supply is uncertain. Equilibrium decisions include the supplier's wholesale price and planned production quantity, and the retailer's order quantity and retail price. We study how supply uncertainty and the salvage value of leftover products affect these decisions. We show that the optimal production inflation rate, defined as the ratio of the supplier's planned production quantity to the retailer's order quantity, depends only on the wholesale price and is independent of the retailer's order quantity. We also find that, ceteris paribus, the optimal production inflation rate increases with the salvage value. Numerical examples for uniformly distributed supply uncertainty are provided to illustrate our findings.

Sha Fan, An Application of Markov Chain Model for Ohio's Unemployment Rate, April 19, 2013 (Uday Rao, David Rogers)
Stochastic processes have been applied in many fields, for example, marketing, gambling, inventory control, biology, and healthcare, and the labor market is no exception. The labor force is usually classified as employed, unemployed, or out of the labor force. This project applies Markov-chain modeling to Ohio's unemployment rate. Public data from 1990 to 2011, collected by the Bureau of Labor Statistics and the U.S. Census Bureau, are used. A Markov-chain model proposed by Rothman (2008) is then applied. There are four sub-models in this project: r = 1 (monthly) & 1st order; r = 3 (quarterly) & 1st order; r = 1 & 2nd order; and r = 3 & 2nd order, where the 2nd-order Markov chain assumes that transition probabilities depend on both the current state and the previous state. In the long run, we conclude that the unemployment rate is more likely to decrease than to increase, and that a business-cycle fluctuation exists. When considering geographic influence, the unemployment rates among Ohio's 88 counties are statistically different. Job hunters' attributes such as race and age have a significant influence on employment, while gender does not play a notable role.
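
For intuition, the R sketch below estimates a first-order transition matrix from a sequence of labor-force states and iterates it to a long-run distribution; the state vector and three-state coding are illustrative stand-ins, not the Rothman (2008) specification of unemployment-rate movements used in the project.

    # Generic first-order Markov-chain sketch; 'state' is an assumed vector of
    # monthly observations taking the values "employed", "unemployed", "out".
    states <- c("employed", "unemployed", "out")
    trans  <- table(factor(head(state, -1), states), factor(tail(state, -1), states))
    P <- prop.table(trans, margin = 1)                   # row-stochastic transition matrix
    P <- matrix(P, nrow = length(states), dimnames = dimnames(trans))

    pi0 <- rep(1 / length(states), length(states))       # start from a uniform distribution
    for (i in 1:500) pi0 <- as.numeric(pi0 %*% P)        # iterate toward the stationary vector
    round(pi0, 3)                                        # implied long-run shares by state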

Herbert Ahting, Regression Modeling of Multivariate Process Systems Data, April 17, 2013 (Martin Levy, Jeffrey Camm)
Modern industrial process-control systems archive vast quantities of data pertaining to flow, temperature, pressure, level, and other parameters. Valuable information regarding process performance is contained in these data histories, but they are seldom tapped to their fullest potential. Most data archives are contaminated by extreme values caused by measurement errors or by transient or persistent disruptions to the process. For this study, a data set consisting of energy-related process variables was obtained from an industrial process. These variables were analyzed to determine which ones have the greatest impact on overall steam consumption. Three regression approaches were evaluated: (1) linear regression using stepwise variable selection, (2) autoregressive modeling, and (3) principal-components regression. Stepwise-selection regression identified several key variables as energy drivers. The autoregressive model provided better results than stepwise regression alone by eliminating the autocorrelation in the residuals. While principal-components regression showed promise by reducing multicollinearity, its results were difficult to interpret because the original variables had been transformed. Principal-components analysis does, however, provide a useful set of tools for identifying extreme observations.

Saurabh Jain, Support Vector Regression vs. Neural Networks in Stock Pricing, April 17, 2013 (Uday Rao, Amitabh Raturi)
Asset pricing is one of the most researched areas in investment management. While the CAPM provides the basic framework for understanding stock returns, it also carries the assumption that a stock's return is linearly dependent on the market return. This relation fails to hold in many cases, and extensions of the CAPM appear to explain stock returns better. In our analysis, we explore various factors that can be included in the asset-pricing model. We also allow a non-linear relationship between the independent variables and the dependent variable, stock return. While neural networks are the most popular and widely used technique for such cases, support vector regression is adopted for our analysis to reduce the risk of overfitting the data. The historical data for training and testing the model were obtained from vendors of stock data (Bloomberg) and other resources. With the help of this modified model we estimate alpha, an indicator of a stock's superior performance and a good criterion for picking stocks that can outperform market returns. We compare the results of neural networks with support vector regression and establish the superiority of one method over the other.

Aashish Reddy Takkala, Next-Purchase Propensity of a Customer, April 15, 2013 (David Rogers, Jeffrey Camm)
This study is aimed at finding the next purchase of a customer given the current purchase. The buying pattern is modeled as a Markov chain, and transition-probability matrices are calculated for several product categories. A stable Markov equilibrium vector is obtained by solving the system of equations in Matlab and by iterative matrix multiplication in SAS. Further, the mean first-passage times are calculated for each of the transitions. These matrices help the marketing team streamline their campaigns. The new campaigns developed using this model also make customers feel more connected, because they are targeted with the product categories they are most likely to seek next.

Adebola Abanikanda, Regression with Aggregated Crime Data - A Study Using Poisson and Negative Binomial Regression, April 15, 2013 (David Rogers, Yan Yu)
Several researchers have argued that parameter estimates from a disaggregated model may differ significantly from those of the aggregated model. This research investigates this issue using two regression methods: Poisson regression and negative-binomial regression. These models are applied to crime data at different hierarchical levels, including the national, state, and county levels. The crime index is also disaggregated into violent crime and property crime, and regression models are built with both approaches to explore how different levels of aggregation affect the results.
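
A brief R sketch of the two count-data models, fit with an exposure offset; the data frame crime and the covariates unemployment, median_income, and population are illustrative names, not the variables used in the study.

    # Poisson vs. negative-binomial regression on aggregated counts (names assumed).
    library(MASS)

    pois_fit <- glm(violent ~ unemployment + median_income + offset(log(population)),
                    data = crime, family = poisson)
    nb_fit   <- glm.nb(violent ~ unemployment + median_income + offset(log(population)),
                       data = crime)

    # A markedly lower AIC for the negative binomial points to overdispersion
    c(poisson = AIC(pois_fit), negbin = AIC(nb_fit))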

Kevin Michael Roa, Scheduling NFL Games to Maximize Viewers, April 12, 2013 (Michael Magazine, Michael Fry, co-chairs)
The total number of television viewers for NFL games declined over the past season. These games are still among the most-viewed programming in the United States, but have seen a slight decrease in popularity. The games are incredibly valuable to advertisers, as people are likely to watch them live, making it much more likely that they will view commercials. This makes it important for both the league and the networks to keep the number of viewers high throughout the season. Currently, no method is employed to optimize the schedule to maximize the number of viewers. The games to be played are pre-determined, but a looming question is in what weeks these games should be played, and which should be nationally televised during prime time. This paper presents a model created with the intent of maximizing the viewership of regular-season NFL games. The model can be used to control the distribution of excitement appropriately throughout the season.

Sarbani Mishra, Bayesian Forecasting of Utilization of Antidepressant Drugs in U.S. Medicaid, March 29, 2013 (Martin Levy, Jeffrey Mills [Department of Economics])
Mental illness is one of the most prevalent health disorders, and its incidence is trending upward at an alarming rate. Over recent years, the utilization of antidepressant drugs, in the form of doctors' prescriptions and Medicaid reimbursements, has been rising steadily; Medicaid antidepressant prescriptions grew over 40% from 1995 to 1998. In the present study, we forecast the utilization of antidepressant drugs using applied Bayesian methods, which can be of great aid in statistical models used for forecasting. Because Bayesian methods dynamically update current information along with information accrued in the past, they provide greater accountability for the reliability of the forecasts obtained from the models than does the frequentist approach to forecasting. We used the software BATS (Bayesian Analysis of Time Series) to compute the forecasts and the MAD, MSE, and log-likelihood values. The forecasts were found to emulate the actual values fairly well, with the exception of a few drugs for which a number of outlying points appear.

Peipei Yuan, Stratified Random Sampling Design for Capital Expenditure Survey, March 15, 2013 (Martin Levy, Yan Yu) 
The aim of this project is to produce a set of estimates (forecasts) for the year 2013 of the spending intentions of the 60,000+ plants that engage in the manufacture of machine tools. Stratified random sampling design and Neyman allocation are used to design the capital-expenditure survey. A total of 10,000 plants with known plant size are solicited in this study. There are five strata, of which four are statistical strata and one is non-statistical. SAS PROC SURVEYMEANS and SURVEYFREQ, and SPSS Complex Samples, are used to estimate means, proportions, and totals, and to produce standard deviations and confidence intervals. Among the 10,000 plants selected, 711 completed the survey and returned it to Gardner's Publication. The response rates are 16.88%, 10.57%, 6.92%, 2.30%, and 5.85% for strata 1 to 5, respectively. The range of the planned capital expenditure is 9,975,000, with a minimum of 25,000 and a maximum of 10,000,000. The 95% confidence interval for the mean is 359,728.19 to 497,168.63, and the 95% confidence interval for the total is 1.75E+10 to 2.42E+10. The statistics for subpopulations, such as states and machinery categories, are also calculated using domain analysis.
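
For reference, Neyman allocation assigns the sample to strata in proportion to N_h * S_h. A small R sketch with hypothetical stratum sizes and standard deviations (not the survey's actual values):

    # Neyman allocation sketch; stratum sizes and standard deviations are hypothetical.
    N_h <- c(30000, 15000, 10000, 5000)     # plants per stratum (assumed)
    S_h <- c(40000, 90000, 200000, 600000)  # std. dev. of expenditure per stratum (assumed)
    n   <- 10000                            # total sample to be solicited

    n_h <- n * (N_h * S_h) / sum(N_h * S_h) # Neyman allocation across strata
    round(n_h)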

Hsin-Yi Wang, Bayesian Decisive Prediction Using BLINEX Loss and BLINEX Parameter Choice, March 7, 2013 (Martin Levy, Jeffrey Camm)
Direct marketing is a popular and successful way for businesses to reach prospective customers. It starts with the compilation of a name list that contains the set of targets together with their associated attributes. A scoring model, such as logistic regression, is used to compute activation scores from the set of attributes, and these are ranked in descending order of activation likelihood. Names of the best prospects must be selected via an intelligent cutoff. Selecting too many names results in more revenue but perhaps less profit, i.e., profits diminish when costs outweigh gains from activation among bad prospects. The decision problem is to predict an optimal mailing size or cutoff in a future mailing to maximize profit. From a decision-theoretic point of view, a realistic loss function should be asymmetric (failure to choose good prospects carries a higher penalty than including too many bad prospects). The BLINEX loss function is a parsimonious loss function with three parameters: a bounding parameter, a scaling parameter, and an asymmetry parameter. Ideally, the data should be the collection of optimal scores gleaned from past direct-mail campaigns. Since we have information from only one campaign, we illustrate using a bootstrapping-like strategy to generate "historical" trials. In addition to the loss function, the elements of the Bayesian decisive prediction setup include the likelihood function of the optimal activation scores, assumed to be normal-gamma with unknown mean and variance, and conjugate priors with four hyperparameters. These are obtained using the empirical Bayes technique called the ML-II method. We show that, in terms of both frequency and profit domination, in a set of 200 simulations the BLINEX loss outperforms the naive squared-error loss approach handily. Fractional-factorial analysis and half-normal plotting applied to more simulated data show that of the six parameters, the BLINEX asymmetry and scale parameters, together with the prior mean, are the most influential factors leading to BLINEX domination of squared-error loss in terms of profit.

Sandeep J. Patkar, Examination of Capacity and Delay at Airports and in the U.S. National Airspace System, January 14, 2013 (David Rogers, Edward Winkofsky)
Space-time diagrams are employed in airport master planning to enable visualization of aircraft flow along the approach pathway by projection of velocity trajectories onto a 2-dimensional surface. We present the opening and closing case to examine capacity and delay at the Cincinnati airport, which may serve as a basis for advanced dynamic stochastic programming models for air traffic-flow management. The arrival and departure pathways are examined independently, and then in mixed operations, which combine simultaneous pathways to mimic reality. We find that airline schedules, which are out of the control of local airport board authorities, influence capacity and delay fluctuations faced at major airports nationwide. In response, local air-traffic control towers may elect to use mitigation strategies to minimize aircraft space and time separation with the addition of a space-time buffer for each aircraft pairing on approach. We also consider random shocks to the airspace, which occur under the rubric of irregular operations due to weather fluctuations. Irregular operations are presented as Markov-chain scenarios where we argue that, from a practical standpoint, the total event space can be decomposed to circumvent the complication of dependencies connected to previous event nodes in our sequence. We focus on observable, partially connected Markov chains that are normalized to become row stochastic. The transition matrices form a basis to consider more advanced questions for strategic planning.


2012

Liang Xia, Application of Decision Trees in Credit-Score Analysis, December 7, 2012 (Martin Levy, Yan Yu)
Excessive abuse of credit cards has contributed to increasing credit risks, which have become a heavy burden for credit-card companies. In such a situation, it is important to build and use models to estimate the potential risk and to try to maximize profits from credit-card use. Classification methods have been widely used in the credit-banking industry. They can help lenders decide whether an applicant is a good candidate for a loan. This project presents the application of data-mining approaches to credit-scoring analysis in the financial industry. The objective of this project is to compare the performance of three predictive methodologies: chi-square automatic interaction detection (CHAID) decision trees, classification and regression trees (CART), and logistic regression models. The three approaches were applied to the German credit data from the UC Irvine Machine Learning Repository. The results suggest that the CART decision-tree model has the best predictive performance but the least model stability. The logistic regression model has the best model stability and acceptable predictive ability. The CHAID decision tree is more robust than CART in model building and interpretation. From the results, we illustrate the strengths of both non-parametric and parametric methods.

Xiaoning Guo, A Comparison of Data-Mining Methods in Direct Marketing, December 7, 2012 (Martin Levy, Yan Yu)
Direct marketing is used to target the consumers who are most likely to respond to marketing campaigns. Companies typically send promotional materials to about 20% of the potential buyers on their lists; the question is how to select the best customers. The purpose of this study is to compare different data-mining methods for selecting the best customers to receive the promotional catalog. The methods include generalized linear models, generalized additive models, classification and regression trees, neural networks, and support vector machines. Based on their misclassification rates and areas under the ROC curve, the logistic regression model is best for predicting consumers' response behavior.

Jun Sun, A Simulation Model for Evaluating the Performance of Fire Departments in Hamilton County, Ohio, December 6, 2012 (Uday Rao, Jeffrey Camm)
This research project focuses on (1) the evaluation of current Hamilton County fire-department performance, including response time (more importantly, dispatch-to-arrival time) and vehicle utilization, and (2) forecasting performance under different situations (increasing incident frequency and a reduced number of vehicles). Based on the needs of the local fire departments, this project addresses "whether the fire departments did a good job in 2010-2011" and "what the performance level will be if the fire departments' budget is cut." To achieve this goal, statistical-analysis methods are used to evaluate current fire-department performance and to help build a simulation model through input analysis of the raw data. Simulation modeling is then used to forecast different scenarios based on varying inputs. Four input variables are used in the simulation model: dispatch-to-arrival time, arrival-to-closed time, frequency of incidents, and number of vehicles responding to an emergency. All of these variables come directly from the 2010 and 2011 records provided by the Hamilton County Communication Center. Results from this study indicate that the overall performance of the Hamilton County fire departments, based on the six-minute criterion, is good, and that the frequency of incidents is growing year by year. Based on this growth rate, the simulation results show that within five years the dispatch-to-arrival time will increase by 20 seconds (5.6% of the criterion), and within ten years the increase will be 62 seconds (17% of the criterion).

Ece Ceren Izgi, A Model for Financial Risk Analyses of Mass Customization, December 5, 2012 (Jeffrey Camm, Amitabh Raturi)
Mass customization is gaining popularity as a viable business model. The value proposition of customized products is very different from that of a commodity product. Customized products are more valuable than non-customized products, but to be sustainable the manufacturing process must be robust, with system efficiencies close to those of mass production. In this research project, Procter and Gamble's mass-customization efforts on a product are evaluated with a financial model via key financial performance metrics under three process-design scenarios: an in-house semi-manual process, an in-house automated process, and an outsourced semi-manual process. First, a deterministic model was designed; afterwards, probabilistic components were introduced. The @RISK software is used to execute sensitivity and scenario analyses. The main goal of this project is the design of a decision-support system to illustrate and measure the risk, flexibility, and potential impact of decisions involving mass customization.

Rebekah Wilson, Comparing Data-Mining Techniques to Build a Predictive Model to Understand Customer Risk, December 4, 2012 (Martin Levy, Yan Yu)
Businesses use data-mining techniques to evaluate and manage large amounts of data. Specifically, risk departments use data mining to develop rules and models to rate or score new and existing customers for numerous reasons. In this project, we look at multi-divisional, credit-card risk performance data and develop rules that target specific cardholders. The goal is to find cardholders who have frozen accounts due to a returned payment and classify them as "good" or "bad" as defined by the company. The dataset contains 10,337 accounts, each with 370 fields such as risk score, history code, and last payment amount. The project uses CHAID and CART classification trees to create decision rules that most accurately predict which frozen accounts would be "good" enough to unfreeze 60 days after the returned payment had been made. The good/bad flag is defined as a frozen account that, 6 months after having a returned payment, is either current or 30 days past due on a payment. The two decision trees are compared to determine which method allows for the most accurate and stable rules. Ultimately, both models correctly predicted the "good" cardholders over 60% of the time (67.71% for CART and 63.27% for CHAID). In terms of stability, CART outperforms CHAID due to the distribution of a key variable that the CHAID process used. However, CHAID did much better at separating the "good" and "bad" cardholders, with a more consistent and higher KS statistic. It was decided to look more closely into the business criteria of each decision tree and determine which tree, paired with a cutoff, would allow for the most profit.
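
The KS statistic referred to above is the maximum gap between the cumulative score distributions of the "good" and "bad" groups; a small R sketch, assuming illustrative vectors score (model score) and flag ("good"/"bad"):

    # KS separation statistic for a scoring model; 'score' and 'flag' are assumed names.
    ks_stat <- function(score, flag) {
      cuts <- sort(unique(score))
      cdf_good <- sapply(cuts, function(c) mean(score[flag == "good"] <= c))
      cdf_bad  <- sapply(cuts, function(c) mean(score[flag == "bad"]  <= c))
      max(abs(cdf_good - cdf_bad))           # maximum distance between the two CDFs
    }
    ks_stat(score, flag)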

Torrie A. Wilson, The Effects of Cessation of Assessing Credit Card Late Fees, December 3, 2012 (Martin Levy, Edward Winkofsky)
ABC Financial (name changed to hide the identity of the actual credit-card company) currently charges accounts late fees when they reach 30, 60, 90, 120, 150, and 180 days past due. Given current economic conditions, and given ABC Financial's mission statement that it is customer-friendly, a recommendation has been made that it stop charging late fees after 90 days past due. If an account charges off, the late fees associated with that account are written off along with the unpaid balance; this is assessed as a loss for the company, since the money will not be collected. If an account does not charge off, the late fees associated with that account are assessed as a profit, since the money will be collected. The analytics department was asked to produce preliminary results that would support or oppose the decision to implement a test. The findings show that, for accounts with a delinquency status of 90 days past due, ABC Financial is more likely to incur a loss than a profit based solely on late fees. Therefore, the recommendation to stop assessing late fees after 90 days past due could actually be more beneficial to the company than its current late-fee structure.

Valerie Lynn Schneider, A Simulation Study of Traffic-Intersection Signalization, December 3, 2012 (David Kelton, David Rogers)
The intersection of North Bend Road and Edger Drive is under constant scrutiny by drivers who get stuck in long queues while waiting on Edger to turn onto North Bend. The purpose of this study is to simulate the intersection and determine whether a traffic light should be installed. An analysis of several key output performance-metric statistics, including average queue length and average time spent in the intersection per car, was completed. Two models were constructed: one of the intersection as it operates now, and a second of the intersection as if a traffic light were installed. Several scenarios were examined to analyze the potential effects of increased traffic through the intersection. Results of the traffic study show that installing a traffic light in the intersection now will not be beneficial. However, if the amount of traffic through the intersection were to increase upwards of 20 percent, then a traffic light could help decrease average queue lengths and the overall time spent in the intersection by all cars throughout the day.

Akshay Mahesh Jain, Customer Segmentation and Profiling, November 30, 2012 (David Rogers, Edward Winkofsky)
Companies invest a lot of resources in developing database systems that store voluminous information on their customers. Appropriate data-mining and multivariate techniques are used to leverage the data to identify customers who are valuable to the organization, as this can help companies generate the maximum return on their marketing dollars. In this project, a two-stage study of customer behavior, segmentation and profiling, was done on the customer database of a retail company. Segmentation is the process of dividing the database into distinct customer segments such that each customer belongs to one segment; this process helps identify the most valuable customer segments. Profiling is the process of describing the demographic and socioeconomic profile of the segments. The main goals of the project are to identify at most ten customer segments, each at least 5% in size, and to profile these segments. For this purpose, a sample of 300,000 records was used and randomly split into two datasets: training (60%) and validation (40%). The k-means algorithm was applied to the training dataset to identify groups of customers based on recency, frequency, monetary, and duration variables. The segments generated were validated using the validation dataset, and the results suggested that the cluster solution obtained is not sample-specific and is representative of the population. After customer segmentation, profiling of the customer segments was done using demographic and socioeconomic variables. Based on the segments and profiles obtained, appropriate marketing strategies were devised.
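
A condensed R sketch of the segmentation step, k-means on standardized RFM-plus-duration variables with validation cases assigned to the nearest training centroid; the data frame cust, its column names, the 60/40 split, and the choice of six clusters are assumptions for illustration.

    # K-means segmentation sketch on RFM-D variables (illustrative names only).
    set.seed(1)
    vars  <- c("recency", "frequency", "monetary", "duration")
    idx   <- sample(nrow(cust), 0.6 * nrow(cust))        # 60% training split
    Xtr   <- scale(cust[idx, vars])

    km <- kmeans(Xtr, centers = 6, nstart = 25)
    table(km$cluster)                                    # segment sizes
    km$centers                                           # standardized segment profiles

    # Assign validation customers to the nearest training centroid
    Xval <- scale(cust[-idx, vars],
                  center = attr(Xtr, "scaled:center"), scale = attr(Xtr, "scaled:scale"))
    val_cluster <- apply(Xval, 1, function(r) which.min(colSums((t(km$centers) - r)^2)))
    table(val_cluster)                                   # compare segment shares with training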

Jie He, Evaluation of Response Time and Service Performance of the Fire Departments in Hamilton County, Ohio, November 30, 2012 (Uday Rao, Jeffrey Camm)
We provide a framework for evaluating the existing resources and service performance of the fire departments of Hamilton County. The primary focus is on the efficiency of the current fire-station resources. Specifically, this study is designed to answer the question of whether the current personnel and emergency-equipment resources assigned to fire stations are able to meet the increasing demand for fire and medical-emergency response service. Geographic information system (GIS) network analysis will be used to calculate optimal routes and theoretical driving times between the responding fire stations and incident locations. In practice, the driving route is chosen by the driver based on experience or evaluation of route length, speed limit, and number of turns. The theoretical driving time is calculated based on street-level conditions including factors such as street types, slopes, and speed limits, with a view to minimizing turns. The theoretical driving time will be compared with the actual response time (between "Dispatch Time" and "Arrival Time") in the 2010 and 2011 records from the Hamilton County Communication Center (911 Dispatch Center) to determine emergency-response efficiency. Results indicate that the majority of fire stations are able to provide timely emergency response to the neighborhood. The results also reveal possible resource and personnel shortages in the future.

John Lawrence Ewing, Advanced Forecasting Using ARIMA Modeling: Sales Forecasting for OMI Industries, November 15, 2012 (David Rogers, Martin Levy)
Sales are the lifeblood of a company, and accurate sales forecasting helps management make key business decisions. This study forecasts sales for OMI Industries. The Box-Jenkins methodology of model identification, estimation, and validation is applied to generate autoregressive integrated moving average (ARIMA) models. An outline of the steps needed to use ARIMA time-series models to forecast sales is presented. The results produced by the model indicate that ARIMA forecasting is efficient at generating short-term forecasts.
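
For readers new to the Box-Jenkins workflow, a minimal R sketch of the identify-estimate-validate-forecast cycle follows; the simulated monthly sales series and the ARIMA(1,1,1) candidate order are illustrative assumptions, not OMI's data or the model actually selected.

  # Hypothetical monthly sales series; the actual OMI data are not public
  set.seed(1)
  sales <- ts(100 + cumsum(rnorm(60, 1, 5)), frequency = 12, start = c(2008, 1))
  # Identification: inspect ACF/PACF of the differenced series
  acf(diff(sales)); pacf(diff(sales))
  # Estimation: a candidate ARIMA(1,1,1), chosen here only for illustration
  fit <- arima(sales, order = c(1, 1, 1))
  AIC(fit)                          # compare candidate orders by AIC
  tsdiag(fit)                       # validation: residual diagnostics
  predict(fit, n.ahead = 6)$pred    # short-term sales forecast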

Shashi Sharma, Empirical Assessment of the Ohlson (1995) Equity Valuation Model using Dynamic Linear Modeling Methodology within a Bayesian Inference Framework, October 5, 2012 (Jeffrey Mills [Department of Economics], Martin Levy)
The purpose of estimating the fundamental (intrinsic) value of an asset is to take advantage of mispriced assets. The guiding principle of all savvy investors, "Buy Low and Sell High," basically means that if the market price of an asset is below its fundamental value then the investor may/should consider purchasing the asset, and if the market price is above its fundamental value then the investor may consider selling the asset. This paper provides an empirical assessment of Ohlson's equity valuation equation proposed in Ohlson (1995) to estimate a firm's fundamental value of equity using the statistical approach of dynamic linear modeling. The results of this assessment can help an investor build a profitable trading strategy, for example by investing in stocks that are found to have a significant difference between their market price and their fundamental value.

Dan Larsen, A Mixed-Integer Programming Approach to a Profitable Airline Route Network Design, July 30, 2012 (Jeffrey Camm, Uday Rao)
In 2007, Delta Airlines and Northwest Airlines announced merger plans. Airline executives want regulatory agencies and the general public to believe that a merger will have positive impacts for the consumer. While there are some back-office functionalities that become redundant (and therefore eliminated) in a merger, most cost savings will come from better realigning airframes to markets. This project formulates a mixed integer program and uses publicly available fare and passenger data from 2007 to determine what a profitable route network would look like at some point in the unknown future. As this analysis was conducted in 2008, we also have the ability to take a retrospective look and see how our analysis and assumptions played out five years later.

Benjamin Noah Grant, The Benefit of Reallocation Using Scenario-Based Robust Optimization and Conditional Value-at-Risk in a Long-Only Equities Portfolio, July 30, 2012 (Jeffrey Camm, David Rogers)
One of the most important aspects of investment in volatile assets is risk control. Many mathematical models have been developed to try to control risk in an investment portfolio, with one of the most widely used models being the value-at-risk (VaR) model. Conditional value-at-risk (CVaR) was developed as a model to address expected tail loss and to help mitigate catastrophic losses in portfolios. This project examines five different reallocation time periods applied to a long-only equities portfolio, under the assumption that assets can only be bought and no short positions are allowed. The benefit to risk control of actively reallocating the portfolio using a CVaR optimization model is then assessed.
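
A minimal sketch of one common scenario-based CVaR formulation (the Rockafellar-Uryasev linearization) is shown below in R with the lpSolve package; the simulated return scenarios, the 95% confidence level, and the four-asset universe are assumptions for illustration only and do not reproduce the project's model.

  library(lpSolve)                       # generic LP solver (assumed available)
  set.seed(1)
  S <- 200; n <- 4; beta <- 0.95         # scenarios, assets, CVaR confidence level
  R <- matrix(rnorm(S * n, 0.005, 0.04), S, n)     # hypothetical scenario returns
  # Decision vector: n weights, S auxiliary z_s >= 0, and the VaR level alpha split
  # into two nonnegative parts (a+ - a-) so it can take either sign
  obj <- c(rep(0, n), rep(1 / ((1 - beta) * S), S), 1, -1)
  A1  <- cbind(R, diag(S), 1, -1)        # z_s + r_s'x + alpha >= 0 for every scenario
  A2  <- c(rep(1, n), rep(0, S), 0, 0)   # long-only, fully invested: sum of weights = 1
  sol <- lp("min", obj,
            rbind(A1, A2),
            c(rep(">=", S), "="),
            c(rep(0, S), 1))
  w    <- sol$solution[1:n]              # optimal long-only weights
  cvar <- sol$objval                     # minimized 95% CVaR of portfolio loss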

Yuanyuan Niu, A Study on Bond Yield Curve Forecasting, July 26, 2012 (Yan Yu, Uday Rao)
There is considerable research on in-sample fitting and out-of-sample forecasting performance of yield curves. This project first studies the model and methodology of forecasting the term structure of government bond yields (Diebold and Li 2006). The Nelson-Siegel factor model is used to fit the Treasury bond yield data from 1985 through 2000. Three time-varying regression coefficients are interpreted as level, slope, and curvature-factor loadings of the yield curve. Various scenarios are constructed to find the optimal shape parameter. In addition, unlike a constant shape parameter in the previous literature, we develop both linear and non-linear least-squares grid search algorithms to find an optimal time-varying shape parameter. Autoregressive models and recursive forecasting are also involved in predicting future yields. Finally, we compare the in-sample and out-of-sample model fits with different choices of shape parameters. The statistical software package SAS is used for implementation.
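
As a sketch of the Diebold-Li style fitting step, the following R code computes Nelson-Siegel factor loadings over a grid of shape parameters and estimates the level, slope, and curvature coefficients by ordinary least squares; the maturities, yields, and grid values are hypothetical and stand in for a single cross-section of the Treasury data.

  # Nelson-Siegel loadings for maturities tau (in months) and shape parameter lambda
  ns_loadings <- function(tau, lambda) {
    l1 <- (1 - exp(-lambda * tau)) / (lambda * tau)
    cbind(level = 1, slope = l1, curvature = l1 - exp(-lambda * tau))
  }
  # Hypothetical cross-section of yields at standard maturities
  tau <- c(3, 6, 12, 24, 36, 60, 84, 120)
  y   <- c(4.9, 5.0, 5.1, 5.3, 5.5, 5.7, 5.8, 5.9)
  # Grid search over lambda: for each value, the betas come from ordinary least squares
  grid <- seq(0.01, 0.2, by = 0.005)
  sse  <- sapply(grid, function(lam) sum(lm(y ~ ns_loadings(tau, lam) - 1)$residuals^2))
  lam_star <- grid[which.min(sse)]                          # "optimal" shape parameter
  beta_hat <- coef(lm(y ~ ns_loadings(tau, lam_star) - 1))  # level, slope, curvature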

Alexander Muff, Bottleneck Analysis via Simulation of a Steel Barrel Manufacturing Mill, July 26, 2012 (David Kelton, Uday Rao)
The purpose of this study was to build a simulation model of a steel-barrel mill for analysis. This analysis would include identifying bottlenecks via cycle times and machine breakdowns. It would also be used to quantify the return on investment of process and capital improvements. Since this was the first time the company had used simulation modeling, it was also a proof-of-concept that simulation modeling was the correct approach to identify these problems. The project resulted in verifying the plant manager's intuition about bottlenecks and provided valuable data about scheduled capital improvements. The company has also rolled out simulation modeling to its other facilities across North America.

Andrew Dempsey, Using Markov Chains to Analyze a Football Drive, July 24, 2012 (David Rogers, Jeffrey Camm)
The 2007 Highlands team was facing a critical drive that threatened their season. The drive, starting from the twenty-six-yard line, was the highlight of the season. Fans of past years' teams started questioning whether the 2006 team would have been able to pull off such a drive. The probability of scoring and of turning the ball over is analyzed by the use of matrix calculations. Markov models and matrix manipulation allow expected points and expected downs to be calculated based on a spreadsheet of the 2006 season data. The calculations are used to gain situational awareness. Situational awareness is used by coaches to gain insight into the degree of success of plays and for future play calling. The analysis of down, yard line, and distance to the goal line allows the drive situation to be calculated and analyzed.
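
A toy version of the absorbing-Markov-chain calculation is sketched below in R; the three transient drive states, the transition probabilities, and the seven-point score value are invented for illustration and are far coarser than the down, yard-line, and distance states used in the study.

  # Toy absorbing Markov chain: 3 transient drive states, 2 absorbing outcomes
  states <- c("own40_1st", "own45_2nd", "opp30_1st")   # hypothetical transient states
  Q <- matrix(c(0.0, 0.6, 0.2,          # transitions among transient states
                0.0, 0.0, 0.4,
                0.0, 0.0, 0.0), 3, 3, byrow = TRUE,
              dimnames = list(states, states))
  R <- matrix(c(0.05, 0.15,             # transitions into the absorbing states
                0.10, 0.50,
                0.60, 0.40), 3, 2, byrow = TRUE,
              dimnames = list(states, c("score", "turnover")))
  N <- solve(diag(3) - Q)               # fundamental matrix: expected visits per state
  absorb_prob <- N %*% R                # probability of ending in a score vs a turnover
  exp_points  <- absorb_prob %*% c(7, 0)   # expected points if every score is worth 7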

Silky Abbott, Simulation of the Convergys Contact Center in Erlanger, KY, July 24, 2012 (David Kelton, Jaime Newell)
The Convergys (CVG) Contact Center currently operates at three locations, one onshore in Erlanger, KY and two offshore in the Philippines and in India. The contact center located in Erlanger, KY serves different types of processes with inbound as well as outbound service. Clients from different sectors such as retail banking, credit cards, cell-phone providers, satellite/TV providers, healthcare, and insurance outsource their processes to the CVG inbound and outbound contact center. With business development among their clients, more processes are getting outsourced to CVG by the clients and there has been a subsequent increase in the call volume to which the agents respond every day. The most immediate concern for the corporation is the presence of unsatisfied customers due to lack of operational facilities for them to speak with a customer-service representative (CSR). In this model, we look at one of the processes served by the Convergys contact center; however the process client name is not mentioned due to confidentiality concerns. Several factors such as call arrival rates, total time spent on the phone, total time spent in the queue, holds, and transfers are looked at in this report. Several scenarios such as increasing the number of CSRs and trunk lines are simulated to decrease the percent of customers who are thrown out of the contact center queue due to system overflow, while at the same time keeping a check on the total cost of the system. Using simulation modeling, these different options are explored to determine which option is best for the Convergys Contact Center.

Shangpeng Pan, An Analysis of Running Records Using Frontier Analysis, July 20, 2012 (Jeffrey Camm, Martin Levy)
The objective of this study is to develop a handicapping technique by analyzing world running records, using frontier analysis. A handicapping technique is useful in comparing performances of different ages and genders in long-distance running. Using the world records for 5K, 10K, and Marathon races, the frontier function can be calculated based on linear programming. The estimated frontier function is discussed for different ages and genders. The handicapping technique is developed based on the frontier function and applied to an example. Such a handicapping technique using frontier analysis can be extended to other individual sports with precise measures.

Madan Mohan Dharmana, Customer-Centric Pricing Analysis, June 1, 2012 (Amitabh Raturi, David Rogers)
Price is an important driver of profitability.  Even though price has a higher impact on increasing profits than other levers of operations management, companies often do not focus on pricing appropriately, for fear of losing customers to competition.  One thing is clear: the higher the value of the product perceived by the customer, the less price-sensitive the customer is.  In an ideal world, the right value of the product, as perceived by each customer, should be evaluated and the product priced accordingly.  Having a customized pricing policy based on the characteristics of each segment can potentially enhance sales and thus maximize profits by extracting the complete value created by the product for the different segments.  In this project, a customer-centric pricing strategy is illustrated.  The customers are classified into different segments and current pricing benchmarks are obtained for each segment.  The potential for a price increase for each customer is identified based on how the current price compares to the segment benchmarks.  Probability of attrition is used to identify how sensitive the customer will be to a price increase.  A logistic regression model is built to obtain the probability of attrition.  Based on the upside potential for a price increase and the price sensitivity of the customer, a strategy for revenue increase is identified for each customer.  We develop a flowchart of the methodology for customer-centric pricing, illustrate this methodology using several examples, and show the magnitude of differences in the overall profitability of a firm with customized pricing policies in different scenarios.  Several avenues for future research are also identified.

Ou Liu, College Athletic Conference Realignment: Minimize the Traveling Miles, May 31, 2012 (Michael Magazine, Jeffrey Camm)
Unnecessary flying distance costs college athletic teams money and leaves players exhausted.  Thus realigning the NCAA athletic conferences to reduce flying miles is of practical importance.  This project formulates an optimization problem that minimizes the total distance NCAA athletic teams travel by finding a realignment based on a distance metric.  The project selects 66 teams from six conferences (Big Ten, S.E.C., A.C.C., Big Twelve, Pacific Twelve, and Big East) and four independent college athletic teams (Notre Dame, BYU, Navy, and Army).  The optimization model is based on seven conferences in total and ten teams in each conference.

David Teng, Simulation Analysis of the UC Bearcats Transportation System, May 31, 2012 (David Kelton, Jeffrey Camm)
The objective of the study is to simulate the University of Cincinnati Bearcats Transportation System (BTS).  Many of the analyses can be applied to other transportation systems as tools for cost reduction and effectiveness improvement.  This study contains three major parts.  The first part describes the development of the model.  Based on the available data, five assumptions were identified.  We describe the limitations of Arena when building a shuttle-bus system simulation model and provide resolutions for those limitations.  In the second part, the model goes through the validation process and the accuracy of the model is confirmed.  Finally, we conduct experimentation in the third part of the study.  The experimental outputs suggest recommendations, such as reducing the total number of trips or seats, that could lead to potential cost cuts.

Richard Walker III, Asia-to-US Supply Network Simulation and Analysis, May 23, 2012 (David Kelton, Uday Rao)
Container freight is a key component of imported merchandise and has shown robust growth since its introduction in the 1950s.  Within this segment, the China-to-US trade flow is the largest inter-nation trade route by volume.  To manage supply chains that avail themselves of this tremendous trade flow, major uncertainties in lead-time and demand forecasts work against the need to provide high service reliability at minimum cost.  A dynamic simulation model was developed for a trans-Pacific, intermodal supply chain from Guangzhou, China to the southwestern quadrant of Ohio.  The model used a combination of primary shipper data and literature values for transit-time probability distributions and freight-cost variables to investigate the impact of shipper reliability and mean delivery times on logistics costs and service levels.  Using regression analysis of the simulation data, it was determined that for selected situations, the cost advantage and competitive service rates attainable by direct train delivery of trans-Pacific, intermodal container merchandise can make train delivery an economically preferred choice for supply chains delivering above 95% fill rates.  Additionally, we found that faster, more reliable shipment methods can significantly increase inventory levels and inventory holding costs for supply chains that are operating between 73% and 99% fill rates.

Lianlin Chi, Nurse Schedule Optimization at a Children's Hospital Emergency Department: A Linear-Programming Approach, May 2, 2012 (Jeffrey Camm, Yan Yu)
A hospital provides patient treatment with specialized staff and equipment.  It usually has an emergency department, which is a medical-treatment facility specializing in acute care of patients who arrive without prior appointment.  Because of the variation and uncertainty in demand, emergency-department staffing is particularly challenging.  In terms of planning for this demand, the hospital needs to produce duty schedules for its emergency-department nursing staff.  The schedule has an impact on budget, nursing functions, and health-care quality.  Nurses at the Cincinnati Children's Hospital's emergency department are given a lot of flexibility to serve their patients best according to their own working preferences.  In this study, a computerized nurse-scheduling model was developed and adapted to Cincinnati Children's Hospital's emergency department.  We solved the problem of minimizing staffing shortages using a binary linear goal-programming approach with OpenSolver.
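
The following R sketch shows the flavor of a binary covering model with shortage (goal) variables, solved here with the lpSolve package rather than OpenSolver; the four nurses, three shifts, demands, and workload limits are hypothetical and far smaller than the real scheduling problem.

  library(lpSolve)                         # assumed available
  N <- 4; S <- 3                           # hypothetical: 4 nurses, 3 daily shifts
  demand <- c(2, 3, 2)                     # nurses required per shift
  maxsh  <- rep(2, N)                      # each nurse works at most 2 shifts
  # Decision vector: N*S binary assignments x[n,s] (nurse-major order), then S shortages u[s]
  obj <- c(rep(0, N * S), rep(1, S))       # goal: minimize total staffing shortage
  # Coverage: sum_n x[n,s] + u[s] >= demand[s]
  cover <- cbind(t(sapply(1:S, function(s) as.numeric(rep(1:S, N) == s))), diag(S))
  # Workload: sum_s x[n,s] <= maxsh[n]
  load  <- cbind(t(sapply(1:N, function(n) as.numeric(rep(1:N, each = S) == n))),
                 matrix(0, N, S))
  sol <- lp("min", obj, rbind(cover, load),
            c(rep(">=", S), rep("<=", N)),
            c(demand, maxsh),
            binary.vec = 1:(N * S))
  matrix(round(sol$solution[1:(N * S)]), N, S, byrow = TRUE)   # nurse-by-shift roster
  sol$objval                                                   # total unmet demand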

Sashi Kommineni, Nurse Scheduling at a Children's Hospital Emergency Department, January 6, 2012 (Jeffrey Camm, Uday Rao)
Pediatric nursing is a specialty encompassing the care of children, adolescents, and their families in a variety of settings.  Handling patients and their families in an emergency situation is a challenging task.  Nurses at the emergency department of a children's hospital in Cincinnati are given a lot of flexibility in drawing up their schedules in order to allow them to serve patients in the best possible way.  This is because scheduling quality directly influences nursing quality and working morale.  As it exists today, a full-time scheduler works with nurses and draws up schedules for a period of six weeks while trying to accommodate individual preferences and change requests.  The underlying goal is to minimize overtime and staffing shortages while utilizing existing resources.  In this project, we attempt to automate and solve a scaled-down nurse-scheduling scenario to cover minimum staffing levels at the emergency department using a multi-objective linear-programming approach with OpenSolver.


2011

Yanyun (Lance) Wang, Non-parametric Density Estimation in VLSI - Statistical Static Timing Analysis Boosting, December 2, 2011 (Yan Yu, Uday Rao)
In Very Large Scale Integrated (VLSI) circuit design, it is important to investigate the longest path delay from inputs to outputs, which is also termed Static Timing Analysis (STA).  As the VLSI industry steps into the deep sub-micron era, process variation becomes more and more important, especially in STA.  Statistical Static Timing Analysis (SSTA) thus helps to deal with the variation in critical path delays.  This project studies how to employ non-parametric density estimation methods in SSTA to help the hardware-design industry further understand its device-manufacturing variability.  The simulation results using kernel density estimation show a significant reduction in the necessary Monte Carlo (MC) simulation iterations in SSTA, which may potentially shed light on future design improvements.  This project takes the extracted parameters from a manufacturing chip-testing process and inputs them into a test bench of 12 serially connected inverters.  Outputs are analyzed and summarized with Perl and ported to SAS and R.  Non-parametric (NP) Kernel Density Estimation (KDE) is implemented, investigated, and compared with parametric modeling methods such as Gaussian distribution fitting.  The results indicate that NP KDE demonstrates strong predictive ability, with a high level of accuracy and lower cost in time and memory usage than brute-force Monte Carlo simulation.  The NP KDE method outperforms all the other statistical curve-fitting methods.
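
A minimal R sketch of the parametric-versus-non-parametric comparison is given below; the bimodal sample of path delays is simulated purely for illustration and is not the chip-testing data used in the project.

  # Hypothetical path-delay sample from a circuit-simulation test bench
  set.seed(1)
  delay <- c(rnorm(800, 100, 5), rnorm(200, 120, 8))   # a skewed, bimodal delay mix
  # Parametric fit (single Gaussian) vs. non-parametric kernel density estimate
  kde <- density(delay, kernel = "gaussian")           # bandwidth chosen automatically
  hist(delay, freq = FALSE, main = "Path delay", xlab = "delay (ps)")
  lines(kde, lwd = 2)
  curve(dnorm(x, mean(delay), sd(delay)), add = TRUE, lty = 2)
  # Tail quantile (e.g., 99th percentile of delay) under each view of the data
  quantile(delay, 0.99)                                # empirical / KDE-based view
  qnorm(0.99, mean(delay), sd(delay))                  # Gaussian-fit view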

Hongbing Chen, Forecasting Loan Loss Rates Using Multivariate Time-Series Models, November 28, 2011 (David Rogers, Hui Guo)
The ability to forecast credit loss accurately is of vital importance to every financial institution for both decision support and regulatory compliance.  This study proposes a vector autoregressive moving-average with exogenous regressors (VARMAX) model to overcome constraints and limitations imposed by commonly used roll-rate models.  The VARMAX model allows for multivariate forecasting and takes advantage of information contained in the time series of the forecasting variables.  In particular, the VARMAX technique allows for the joint forecasting of the loan loss rate (the forecasted variable) and the delinquency rates (one of the forecasting variables), which substantially enhances forecasting performance, especially when the forecast window is lengthened.  Using historical performance data of various auto loan portfolios at a regional commercial bank, the paper demonstrates that the proposed VARMAX model consistently outperforms roll-rate models across various loan portfolios.
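
The sketch below fits a plain vector autoregression in R's vars package as a simplified stand-in for the SAS VARMAX model described above; the simulated loss-rate and delinquency-rate series, the lag order, and the forecast horizon are assumptions made only for the example.

  library(vars)                                   # assumed available
  set.seed(1)
  # Hypothetical monthly series: delinquency rate drives next month's loss rate
  n      <- 120
  delinq <- as.numeric(arima.sim(list(ar = 0.8), n, sd = 0.2)) + 3
  loss   <- 0.5 + 0.4 * c(3, delinq[-n]) + rnorm(n, 0, 0.1)
  y      <- cbind(loss = loss, delinq = delinq)
  # A VAR is a simplified stand-in here for the VARMAX model used in the project
  fit <- VAR(y, p = 2, type = "const")
  serial.test(fit)                                # residual autocorrelation check
  predict(fit, n.ahead = 12)                      # joint forecast of both rates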

Lakshmi Palaparambil Dinesh, Robust Optimization for Resource Allocation in the Energy Sector, November 22, 2011 (Jeffrey Camm, Uday Rao)
The highly volatile nature of energy prices makes it important for utility companies to have a plan in place to buy, sell, and store energy in high-capacity batteries.  The hourly Locational Marginal Prices (LMPs) change as a function of power consumption, and the companies need to manage the batteries accordingly so that the objectives of profit maximization and optimal power allocation are met.  One way to do this is to use scenario-based optimization with the best worst-case profit model, which is tested in this project.  The best worst-case profit model maximizes a guaranteed profit level that is no higher than the profit of any individual scenario; this is an extremely conservative approach.  In addition to the above-mentioned approach, other buy, sell, and store plans based on the mean prices and their confidence intervals can be used.  The objective of this project is to show empirically why the robust optimization model performs better than the other models in the face of uncertainty.  The results show that the robust optimization model performs the best in terms of stability.
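
A minimal R/lpSolve sketch of the best worst-case (maximin) construction follows; the three price scenarios, three peak hours, and 10 MWh capacity are hypothetical and ignore the state-of-charge dynamics of a real battery.

  library(lpSolve)                              # assumed available
  # Hypothetical: commit x MWh of discharge in each of 3 peak hours, capacity 10 MWh,
  # under three price scenarios ($/MWh); maximize the worst-case revenue
  P <- matrix(c(40, 60, 55,
                30, 80, 50,
                70, 20, 45), nrow = 3, byrow = TRUE)   # scenario-by-hour prices
  # Variables: x1, x2, x3, t  (t = guaranteed revenue level)
  obj <- c(0, 0, 0, 1)                                 # maximize t
  A   <- rbind(cbind(P, -1),                           # P %*% x - t >= 0 in every scenario
               c(1, 1, 1, 0))                          # total discharge <= capacity
  sol <- lp("max", obj, A, c(rep(">=", 3), "<="), c(0, 0, 0, 10))
  sol$solution[1:3]    # robust hourly discharge plan
  sol$objval           # guaranteed (worst-case) revenue across the scenarios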

Hexi Gu, Regularization Methods for Ill-Posed Inverse Problems: Empirical Research on Realistic Macroeconomic Data, November 17, 2011 (Uday Rao, Yan Yu)
Driven by research on many practical problems in the natural sciences and engineering, inverse problems have received much attention since the 1960s.  Widespread application of inverse problems in medicine, mathematical physics, meteorology, and economics has attracted much research.  This project briefly introduces the theory of the inverse problem and the regularization methods used to solve ill-posed inverse problems.  Several well-known regularization methods, such as the Tikhonov regularization method, the Landweber regularization method, and the conjugate gradient method, are discussed and analyzed.  Regression models for parameter estimation based on these methods are developed and applied through a case study using China's real economic data from 1990 to 2008.  In order to test the effects of the regularization regression models, ordinary least squares (OLS) and EViews-based estimates are also computed and compared with the regularization methods, and the results show that regularization is better than OLS when dealing with ill-posed inverse problems.  It is also suggested by the case study that, in order to have sustainable and healthy growth in China's economy, it is important that the government take measures to promote domestic consumption (final consumption), which forms a crucial part of GDP.
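
A minimal R sketch of Tikhonov (ridge-type) regularization on a nearly collinear design follows; the simulated regressors and the two trial values of the regularization parameter are assumptions for illustration, not the Chinese macroeconomic data used in the case study.

  # Tikhonov (ridge) regularization for a nearly collinear design: a minimal sketch
  set.seed(1)
  n  <- 40
  x1 <- rnorm(n); x2 <- x1 + rnorm(n, 0, 0.01)        # almost collinear regressors
  X  <- cbind(1, x1, x2)
  y  <- 2 + 1 * x1 + 1 * x2 + rnorm(n, 0, 0.5)
  ols <- solve(t(X) %*% X) %*% t(X) %*% y             # unstable: (X'X) is ill-conditioned
  tikhonov <- function(lambda)
    solve(t(X) %*% X + lambda * diag(ncol(X))) %*% t(X) %*% y
  cbind(ols = ols, ridge_0.1 = tikhonov(0.1), ridge_1 = tikhonov(1))
  # Coefficients shrink toward stable values as lambda grows; in practice lambda is
  # chosen by criteria such as cross-validation or the L-curve.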

David A. Pasquel, Operational Cost-Curve Analysis for Supply-Chain Systems, November 2, 2011 (David Rogers, Amitabh Raturi)
Inventory managers of multi-level supply-chain environments experience continuous pressure, especially from website competitors, to minimize total inventory costs.  A general, non-linear mathematical formulation with the objective of minimizing total system costs (the summation of backorder penalty costs and inventory holding costs) while constraining the proportion of backorder costs to total system costs is initially considered.  This analysis is then expanded to compare this constraint to a backorder rate constraint.  Finally, the model is modified to consider a mixture of back orders and lost sales in the retail inventory system.  Cost curves are developed for these various scenarios to provide graphical support of the effects and tradeoffs inventory system decisions can have on total costs.

Ben Cofie, A Data-Mining Approach to Understanding Cincinnati Zoo Customer Behavior, October 18, 2011 (Yan Yu, Jeffrey Camm)
The Cincinnati Zoo is one of the top zoos in the nation.  It serves approximately 1,000,000 visitors every year and looks forward to serving even more visitors in the future.  To understand the needs of its customers and provide better services, the zoo collects and manages large amounts of data on its customers through surveys and membership applications.  In this project an attempt is made to study and identify groups, structures, or patterns that exist in the Cincinnati Zoo data using cluster analysis and association-rule mining.  Clustering is used to identify useful groups that exist in the data sets.  Both the K-means and Ward's clustering methods are implemented in R using the kmeans() and hclust(data, method="ward") functions, respectively.  Association-rule mining (ARM) is used to identify food/retail purchasing patterns of members by studying the correspondence/associations between items purchased together.  Association-rule mining is implemented in R using the Apriori algorithm.  Results showed that certain food/retail items are almost always purchased together even though the Zoo sells them separately, and that Zoo members with fewer children are more likely to drop membership than those with more children.
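
As a small illustration of the association-rule step (with made-up baskets rather than Zoo transactions), the following R sketch uses the arules package, which implements the Apriori algorithm; the support and confidence thresholds are arbitrary example values.

  library(arules)                                 # assumed available
  # Hypothetical member purchase baskets (food/retail items)
  baskets <- list(c("ice cream", "soda"), c("ice cream", "soda", "plush toy"),
                  c("hot dog", "soda"),   c("ice cream", "soda"),
                  c("hot dog", "fries", "soda"), c("plush toy"))
  trans <- as(baskets, "transactions")
  rules <- apriori(trans, parameter = list(supp = 0.3, conf = 0.7))
  inspect(sort(rules, by = "lift"))               # e.g., {ice cream} => {soda}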

Andy Craig Starling, Modeling a Small Wireless Network for the Telecom Industry, September 16, 2011 (David Kelton, Jeffrey Camm)
When deploying a wireless network in the telecom industry, it is important to develop a proper sales strategy that will maximize revenue while filling the network to capacity with sales to both residential and business customers.  The wireless network described here has three towers from which calls or internet connections originate; they are relayed to a building in downtown Cincinnati, where calls are routed for termination and internet connections hop onto the internet backbone.  This study uses computer simulation to examine the many variables that make up such a wireless network, and conclusions are drawn regarding the best sales strategy with which to begin.

Lily Elizabeth John, A Maximal-Set-Covering Model to Determine the Allocation of Police Vehicles in Response to 911 Calls, August 24, 2011 (Jeffrey Camm, Michael Magazine)
Police departments have long made significant efforts to ensure immediate response to 911 calls.  One of the many factors that directly influence response is the travel distance to the location where response is required.  The closer a patrol car is to the location when a request is made, the less time is required to respond.  However, due to resource limitations in terms of the number of patrol cars and officers on duty at any given time, it becomes important that patrol cars be strategically located to meet the demand.  This project attempts to solve this problem by using a strategic maximal-set-covering location model to allocate patrol cars optimally.  An application using real call data from the City of Cincinnati is presented.
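
A toy maximal-covering formulation is sketched below in R with the lpSolve package; the five demand zones, four candidate posts, coverage matrix, and two available patrol cars are hypothetical and only illustrate the structure of the model.

  library(lpSolve)                                 # assumed available
  # Hypothetical maximal-covering setup: 5 demand zones, 4 candidate patrol posts
  demand <- c(30, 20, 40, 10, 25)                  # expected 911 calls per zone
  # a[i, j] = 1 if post j can reach zone i within the response-time standard
  a <- matrix(c(1, 1, 0, 0,
                1, 0, 0, 1,
                0, 1, 1, 0,
                0, 0, 1, 1,
                1, 0, 1, 0), nrow = 5, byrow = TRUE)
  p <- 2                                           # patrol cars available
  # Variables: x_j (post used) then y_i (zone covered), all binary
  obj <- c(rep(0, 4), demand)                      # maximize covered demand
  A   <- rbind(cbind(-a, diag(5)),                 # y_i <= sum_j a[i,j] x_j
               c(rep(1, 4), rep(0, 5)))            # at most p posts selected
  sol <- lp("max", obj, A, c(rep("<=", 5), "<="), c(rep(0, 5), p),
            all.bin = TRUE)
  sol$solution[1:4]                                # which posts get a patrol car
  sol$objval                                       # total demand covered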

Kenneth Darrell, An Investigation of Classification, August 24, 2011 (Jeffrey Camm, Raj Bhatnagar)
Classifying a response variable based on predictor variables is now a common task.  Methods of classifying data and the models they produce can vary widely.  Classification methods can have different predictive capabilities, stemming from model assumptions and underlying theory.  Disparate classification methods will be compared on their predictive capabilities as well as the steps required in constructing each model.  A collection of data sets with varying numbers and types of predictor variables will be used to train and test the various classification methods.  The data sets under consideration will all have dichotomous response variables.  The following methods will be evaluated: logistic regression, generalized additive models, decision trees, naive Bayes classification, linear discriminant analysis, and neural networks.  These methods will be gauged to see whether one model will rise to the top and always outperform the other methods or if each type of model is applicable to a certain range of problems.  The evaluation will be based on common binary evaluation parameters.  These parameters consist of accuracy, precision, recall, specificity, F-measure, the receiver operating characteristic (ROC) curve, and AUC.
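
As a small-scale illustration of this kind of comparison, the R sketch below trains two of the listed methods (logistic regression and a decision tree) on simulated data and compares out-of-sample AUC computed with the rank-based (Wilcoxon) formula; the data, split, and model settings are assumptions for the example only.

  library(rpart)                                    # decision trees (assumed available)
  set.seed(1)
  # Hypothetical data with a dichotomous response, split into train and test
  n  <- 1000
  d  <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
  d$y <- factor(rbinom(n, 1, plogis(0.8 * d$x1 - 1.2 * d$x2)))
  tr <- d[1:700, ]; te <- d[701:1000, ]
  auc <- function(y, score) {                       # rank-based (Wilcoxon) AUC
    r <- rank(score); n1 <- sum(y == 1); n0 <- sum(y == 0)
    (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
  }
  p_glm  <- predict(glm(y ~ x1 + x2, family = binomial, data = tr), te, type = "response")
  p_tree <- predict(rpart(y ~ x1 + x2, data = tr, method = "class"), te, type = "prob")[, "1"]
  c(logistic = auc(te$y, p_glm), tree = auc(te$y, p_tree))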

Vijay Bharadwaj Chakilam, Bayesian Decisive Prediction Approach to Optimal Mailing Size Using BLINEX Loss, August 22, 2011 (Martin Levy, Jeffrey Camm)
Direct-mail marketing is a form of advertising that reaches its audience directly and is among the most rapidly growing forms of major marketing campaigns.  A direct-mail marketing campaign starts by obtaining a name list through in-house data warehouses or external providers.  A logistic scoring model is built to create response or activation scores from the characteristic attributes that describe the name list.  The activation scores are then listed in descending order to rank and select names for direct mail solicitation purposes.  More selected names result in more revenues but not necessarily more profits.  The problem of interest is to predict an optimal mailing size to use for mailing future marketing catalogs that maximizes the profits.  A collection of trials of direct-mail solicitation campaigns is made available by using a bootstrapping-like strategy to generate a set of historical trials.  The likelihood function of the optimal activation scores is assumed to follow a normal distribution with unknown mean and unknown variance.  Bayesian decisive prediction is then applied by using conjugate priors and a BLINEX loss function to predict the optimal activation score for a future trial.

Mark Richard Boone, Minimization of Deadhead Travel and Vehicle Allocation in a Transportation Network via Integer Programming, August 19, 2011 (Jeffrey Camm, Michael Magazine)
This paper examines an everyday issue in logistics: minimizing cost as well as unused resources.  In order to find a solution for a specific case of 25 loads that needed to be transported from one city to another, data were acquired from a local firm and cities within Texas (or close proximity) were selected.  After calculating deadhead distances between all possible city combinations, an integer-programming model was crafted that would take the data and minimize the deadhead travel distance in the system, given the number of trucks available for use and the maximum mileage allowed per truck.  The findings, using AMPL and CPLEX as a solver, showed that the fewer resources employed, the longer it took for a solution to be found.  Ultimately, for the 25 city-pairs selected, with 700 miles per truck set as a constraint, the fewest trucks that could transport all loads was 15, with a total deadhead distance of 1,373 miles (the trucks in the system were loaded nearly 82% of the time they were on the road).

Aswinraj Govindaraj, Nurse Scheduling Using a Column-Generation Heuristic, August 10, 2011 (Jeffrey Camm, Michael Magazine)
This paper describes a mathematical approach to solving a nurse-scheduling problem (NSP) arising at a hospital in Cincinnati.  The hospital management finds difficulty in manually deriving a nurse roster for a six-week period while trying to place an adequate number of nurses in the emergency-care unit of the hospital.  The aim of this project is to provide proof of concept that binary integer programming can be used effectively to address the NSP.  This model employs a two-stage approach where multiple schedules are generated for all nurses in phase I based on the organizational and personal constraints, and the best-fit schedule for populating the roster is selected in phase II so as to effectively satisfy the demand.  The study also evaluates the effectiveness of schedules thus generated to help the hospital management judiciously decide on the number of full-time and part-time nurses to be employed at the emergency-care unit.

Andrew Nguyen, Optimizing Clinic Resource Scheduling Using Mixed-Integer and Scenario-Based Stochastic Linear Programming, August 5, 2011 (Craig Froehle, Jeffrey Camm)
Efficient clinic operations are vital to ensuring that patients receive care in a timely manner.  This becomes paramount when care is urgent, patients are abundant, and resources are limited.  Clinic operations are more efficient when patients wait less, staff members are less idle, and total clinic duration is shorter.  The goal of this paper is to develop a valid deterministic mixed-integer linear-programming (MILP) model from which a valid stochastic model can be derived, and to explore how such models can potentially be utilized in an actual clinical setting.  An approach to minimizing patient waiting, staff idle time, and total duration at a clinic is to develop a MILP model that optimally schedules tasks of the clinic's staff members.  This can be accomplished with a deterministic model.  However, processing times of patients for each type of staff member vary in an actual clinical setting, so a stochastic model may be more appropriate.  Also, a valid scheduling model is less valuable to efficient clinic operations if the model cannot be readily implemented for routine use by staff members.  This paper describes a deterministic MILP model that simultaneously minimizes patient waiting, staff idle time, and total operating time.  Then, from the deterministic model, a scenario-based stochastic model that assumes varying processing times is developed.  Finally, prototype software solutions emphasizing clinic staff usability are discussed.

Logan Anne Kant, Robust Optimization Based Decision-Making Methodology for Improved Management of High-Capacity Battery Storage, August 4, 2011 (Jeffrey Camm, David Rogers)
Utility companies have the option of buying, selling, or storing power in high-capacity batteries to maximize profits in the face of fluctuating energy prices.  The problem companies confront is optimally managing the batteries when hourly LMPs (Locational Marginal Prices) vary as a reflection of power consumption changes and power availability over the course of a day.  The goal of this project is to identify and compare scenario-based robust-optimization planning models that may be used to achieve the desired outcome of maximizing profits from battery management.  The following optimization planning models are considered and empirically tested: simple simulation/optimization, best worst-case, value at risk (VaR), conditional value at risk (CVaR), minimum expected downside risk, and maximum expected profit.  The project aims to give decision makers a toolbox, a decision-making methodology containing robust-optimization models that improve battery management in an uncertain market.  This toolbox facilitates and informs the process but it does not aim to solve the battery-management problem by replacing the educated judgment of the decision maker.

Wei Zhang, A Study of Count Data by Poisson Regression and Negative Binomial Regression, July 28, 2011 (Martin Levy, Jeffrey Camm)
Count data are one of the most common data types and many statistical models have been developed for their analysis.  In this work two regression methods are investigated for applications in count-data analysis.  They are Poisson regression and negative binomial regression.  Following a brief introduction of the Poisson and negative binomial distributions, the regression models were developed based on these two distributions.  The models were then applied in a case study, in which an insurance company was trying to model the number of emergency visits due to ischemic heart disease among 778 subscribers.  The problem was approached with both regression methods and the performance of each method was evaluated.  It turned out that the negative binomial regression model outperformed the Poisson regression model in this case.  Linear regression was also attempted but failed for these data.
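
A minimal R sketch of the comparison follows, using simulated overdispersed counts; the covariates, data generation, and the use of AIC as the comparison criterion are assumptions for illustration and do not reproduce the insurance data or the full evaluation in the study.

  library(MASS)                                   # for glm.nb (assumed available)
  set.seed(1)
  # Hypothetical subscriber data: overdispersed counts of emergency visits
  n   <- 778
  age <- rnorm(n, 55, 10); comorbid <- rbinom(n, 1, 0.3)
  mu  <- exp(-3 + 0.05 * age + 0.8 * comorbid)
  visits <- rnbinom(n, size = 1.2, mu = mu)       # variance > mean (overdispersion)
  pois <- glm(visits ~ age + comorbid, family = poisson)
  nb   <- glm.nb(visits ~ age + comorbid)
  # Overdispersion makes the Poisson fit too optimistic; compare the fits by AIC
  c(AIC_poisson = AIC(pois), AIC_negbin = AIC(nb))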

Lili Wang, Donation Prediction Using Logistic Regression, June 27, 2011 (Martin Levy, Jeffrey Camm)
Increasing the accuracy of prediction of potential responders can save a charitable organization a lot of money.  By soliciting only the most likely donors, the organization would spend less money on solicitation efforts and spend more money for charitable concerns.  This project aims to predict who would be interested in donation and explain why those people would make a donation.  The dataset contains individuals' information from a national veterans' organization, including demographic information and past behavior information that relates to donation.  The response variable of interest is binary, indicating whether the recipient will respond or not.  This paper will discuss the application of the logistic regression model and compare three models based on different variable-selection methods.  The methods can also be applied in different companies to market their products or services.

Kristen Bell, Effects of Bus Arrivals on Emergency-Department Patients, June 1, 2011 (David Kelton, Craig Froehle)
University Hospital's Emergency Department (ED) treats nearly 100,000 patients annually.  Patients arrive by air care (helicopter), ambulance, personal vehicle, and bus.  The bus schedule limits arrival times, but may result in multiple patients arriving at one time.  Using data from the hospital's record-keeping system, a simulation model was developed to examine the effects of transportation mode on wait times in the ED.  Model modifications allowed for analysis of sensitivity to bus schedules and delays, determination of peak-load handling capabilities, arrival-time-of-day impacts, and comparison of scheduled arrival to a smoothed, continuous arrival function.  Modeled results aim to help ED planners account for and adapt to changes in arrival-mode patterns.

Mahadevan Sambasivam, Factors Affecting Inpatient Reimbursements Using the Medicare Hospital Cost Reports -- A Case Study, June 1, 2011 (James Evans, Uday Rao)
Total Medicare payments provide the largest single source of a hospital's revenues.  CMS (the Centers for Medicare and Medicaid Services) has a system, called the Inpatient Prospective Payment System, which pre-determines how much a hospital should be paid for a particular service based on the Diagnosis Related Group (DRG) codes for which the patient qualifies, depending upon the patient's condition and diagnosis.  These standardized payments are calibrated every year depending upon the wage index, cost of the procedure, inflation, cost of technology, and other such criteria.  This project is a case study in identifying which factors really affect the reimbursement rates of medical procedures, based on the Medicare cost-report data submitted to CMS for 2006-2009.  The statistical analysis uses two techniques, principal component analysis and multiple linear regression, to interpret which factors affect the reimbursement rates the most and how they affect them.  Based upon the results of the analysis, it was found that the revenue generated from total inpatient services was negatively correlated with the net inpatient income but was positively correlated with the overall net income of the hospitals.

Avinash Parthasarathy, Campaign-Coupon Analysis Using Integer Programming, May 26, 2011 (Jeffrey Camm, David Rogers)
In an effort to strike a balance between retaining a set of loyal customers and attracting new customers, a retailer is considering reshuffling and reducing the current number of coupons under each campaign.  The main goal of this project is to explore optimization techniques, driven by binary integer programming, to analyze the campaign-coupon structure of a grocery store.  The model helps the retailer understand the coupon-redemption behavior of his customers and eventually reduces the number of campaigns and coupons, and results in maximizing the number of households redeeming a set of coupons.  The importance and usefulness of optimization techniques when applied directly to the data will be illustrated.  It also elucidates the process of preparing the right kind of data required to apply these techniques from a collection of datasets or tables.  The project uses SAS -- PROC SQL & PROC TRANSPOSE -- to extract and prepare the data required to feed the optimization process.

David Burgstrom, Foreclosures in Cincinnati: An Analysis of Associated Factors, May 25, 2011 (Yan Yu, Martin Levy)
The wave of recent home foreclosures across the nation was a hallmark of the financial crisis, lowering property values and in some cases leading to neighborhood blight.  This project looks at a record of over 50,000 home sales in the city of Cincinnati from the past eleven years in order to identify predictors of increased odds of foreclosure.  Part of this project is the creation of a dataset using ArcGIS software to geocode addresses from the Hamilton County Auditor and identify their respective neighborhoods, which enables demographic information from census data to be joined as possible predictors.  The analysis is performed in the R computing environment using a generalized linear mixed model.  The year of sale and the neighborhood are entered as random effects and all other predictors are evaluated as fixed effects.  AIC, BIC, AUC, and mean residual deviance are criteria used to determine the optimal collection of predictor variables.  The results show that while some predictors are similar to the findings from other foreclosure studies, other variables show that Cincinnati's experience with the foreclosure crisis was unique.
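
The R sketch below fits a logistic generalized linear mixed model with lme4::glmer, with neighborhood and year of sale as random intercepts, on simulated data; the predictors, effect sizes, and group counts are invented for illustration and are not the Hamilton County data.

  library(lme4)                                  # assumed available
  set.seed(1)
  # Hypothetical sales records: foreclosure indicator with neighborhood and year effects
  n  <- 5000
  d  <- data.frame(neighborhood = factor(sample(1:50, n, replace = TRUE)),
                   year         = factor(sample(2000:2010, n, replace = TRUE)),
                   price_log    = rnorm(n, 11.5, 0.6),
                   owner_occ    = rbinom(n, 1, 0.6))
  hood_eff <- rnorm(50, 0, 0.7)
  d$foreclosed <- rbinom(n, 1, plogis(-2 - 0.5 * (d$price_log - 11.5) -
                                       0.8 * d$owner_occ + hood_eff[d$neighborhood]))
  # Logistic mixed model: year and neighborhood as random intercepts, the rest fixed
  m <- glmer(foreclosed ~ price_log + owner_occ + (1 | neighborhood) + (1 | year),
             data = d, family = binomial)
  summary(m)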

Lei Xia, A Study of Panel Data Analysis, May 23, 2011 (Martin Levy, Yan Yu)
Panel data refer to multi-dimensional data that contain observations on multiple phenomena observed over time for the same objects.  Results from panel data analysis are more informative, and estimation based on panel data can be more efficient, compared to time-series data only or cross-sectional data only.  The analysis of panel data has been widely applied in the social- and behavioral-science fields.  In the first part of this project, a thorough review of "Analysis of Panel Data" by Cheng Hsiao (2003), "Econometric Analysis of Panel Data" by Badi H. Baltagi (2008), and published papers written on this subject is presented to give an overall introduction to panel data analysis and its methodology.  There are two major types of panel data models discussed in the second part of this project: the simple regression model with variable intercept, and the dynamic model with variable intercept.  For each of these two types, fixed-effects models and random-effects models, as well as the corresponding methodology, are discussed.  A small case study about the cost of six U.S. airlines, conducted by researchers at Indiana University, is revisited at the end to demonstrate the implementation of panel data analysis in SAS using the PANEL procedure (PROC PANEL).

Brian Sacash, Data Envelopment Analysis in the Application of Bank Acquisitions, May 23, 2011 (Jeffrey Camm, David Rogers)
Data Envelopment Analysis (DEA) is an application of linear programming that helps determine and measure the efficiency of a particular type of system with multiple operating units that have behaviors based on the same principles.  Quantifiable parameters are determined to be inputs and outputs, which creates a data set that allows comparison of efficiency across similar units.  To determine an efficiency rating, these defined inputs and outputs for each decision-making unit (DMU) are parameters in a linear optimization model, which can be solved with off-the-shelf optimization tools.  In this work, we use DEA to determine the efficiency of bank branches.  The motivation was that the bank in the study took on a new set of branches through an acquisition and wished to determine the relative efficiency of the merged set of branches.
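
A minimal R sketch of the input-oriented CCR envelopment model, solved branch-by-branch with the lpSolve package, is given below; the five branches, two inputs, and two outputs are hypothetical numbers chosen only to show the structure of the linear program.

  library(lpSolve)                                  # assumed available
  # Hypothetical branch data: 2 inputs (staff FTEs, operating cost) and 2 outputs
  # (loans and deposits, in $M); one row per branch
  X <- matrix(c(8, 120,  6, 90,  10, 150,  5, 100,  7, 110), ncol = 2, byrow = TRUE)
  Y <- matrix(c(30, 55,  25, 40,  28, 70,  20, 48,  26, 60), ncol = 2, byrow = TRUE)
  dea_ccr <- function(k) {              # input-oriented CCR efficiency of branch k
    n <- nrow(X)
    obj <- c(1, rep(0, n))              # variables: theta, lambda_1..lambda_n
    A   <- rbind(cbind(-X[k, ], t(X)),  # sum_j lambda_j * x_j <= theta * x_k  (inputs)
                 cbind(0,        t(Y))) # sum_j lambda_j * y_j >= y_k          (outputs)
    lp("min", obj, A,
       c(rep("<=", ncol(X)), rep(">=", ncol(Y))),
       c(rep(0, ncol(X)), Y[k, ]))$objval
  }
  round(sapply(1:nrow(X), dea_ccr), 3)  # efficiency score of each branch (1 = efficient)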

Chaojiang Wu, Partially Linear Modeling for Conditional Quantiles, May 23, 2011 (Yan Yu, Martin Levy)
We consider the estimation problem of conditional quantiles when high-dimensional covariates are involved.  To overcome the "curse of dimensionality" yet retain model flexibility, we propose two partially linear models for conditional quantiles: partially linear single-index models (QPLSIM) and partially linear additive models (QPLAM).  The unknown univariate functions are estimated by penalized splines.  An approximate iteratively reweighted least-squares algorithm is developed.  To facilitate model comparisons, we develop effective model degrees of freedom for penalized-spline conditional quantiles.  Two smoothing-parameter selection criteria, Generalized Approximate Cross-Validation (GACV) and a Schwarz-type Information Criterion (SIC), are studied.  Some asymptotic properties are established.  Finite-sample properties are studied by simulation.  A real-data application demonstrates the success of the proposed approach.  Both the simulations and the real application show encouraging results for the proposed estimators.

Ying Wang, The Use of Stated-Preference Techniques to Model Mode Choices for Container Shipping in China, May 13, 2011 (Uday Rao, Yan Yu)
This project presents a case study on the possibility of shifting containers off the road and onto intermodal coastal shipping services in China by analyzing the main determinants of mode choice.  The data were collected through a mix of revealed and stated preference questionnaire surveys, and then analyzed using the logit model; the case study has been carried out on routes from Wenzhou to Ningbo.  The results show that, in the decision-making process of choosing a mode for container distribution, Coastal Shipping Cost, Coastal Shipping Time Reliability, Slot Availability for High Cube Container, Road Cost, and Road Time Reliability are significant determinants.

Neelima Kodumuri, Ratings and Rankings - A Comparative Study Based on Application of the Bradley-Terry Method to Real-World Survey Data, March 11, 2011 (Norman Bruvold, Martin Levy)
Understanding individual choice behavior is of utmost importance to organizations in order to compete in today's marketplace with fickle customer preferences.  This is most evident in the crowded fast-food industry where the restaurants are competing with each other every breakfast, lunch and dinner to capture their customers and keep them coming back for more.  In this study, we look at the choices and preferences of individuals using two separate rating and ranking scales across ten fast-food restaurants on twelve dimensions such as cleanliness, quality of food, etc.  The individual choice is estimated from primary research data based on responses of about 5000 people to two market surveys – one conducted in August 2009 and another in November 2009.  The respondents were randomly asked either to rate or rank restaurants based on their past experience.  The experiment was set up as an incomplete block design.  To achieve the objective of comparing how the restaurants perform against each other on each of the dimensions based on ratings and rankings, we employed the extended Bradley-Terry method of a paired comparison approach with an underlying logistic regression model that accommodates ties.

Ketan Kollipara, A Study of Scenario-Based Portfolio Optimization using Conditional Value-at-Risk, March 9, 2011 (Jeffrey Camm, Kipp Martin [Professor, Booth School of Business, University of Chicago])
Risk management is an essential part of portfolio management.  After the financial crisis of the past three years, there is a need for more stringent measures to control exposure to market risk.  Value at risk (VaR) has been used extensively in the financial world as a measure for quantifying risk, but VaR has been widely criticized in the wake of the financial debacle of the past few years.  Conditional value-at-risk (CVaR) can help overcome some of the serious limitations of VaR.  One way to construct a portfolio is to use historical stock prices and take a scenario-based optimization approach to minimize or contain risk.  Including CVaR in the objective or the constraints of a portfolio-optimization problem is one such approach.  The goals of this project are to understand CVaR and to observe the behavior of CVaR as the target threshold for the portfolio changes.  Finally, I show how the portfolio built using the scenario-based CVaR optimization performed during the two years 2008-2009.

Zibo Wang, Building Predictive Models on Cleveland Clinic Foundation Data on the Diagnosis of Heart Disease by Data-Mining Techniques, March 7, 2011 (Martin Levy, Yan Yu)
Mining clinical data sets is challenging.  The main objective of this report is to develop and propose data-mining techniques useful in diagnosing the presence of heart disease.  Data-mining techniques such as the generalized linear model (GLM) have been widely used for quantitative analysis of clinical-trial data.  In this report, we examine heart-disease data provided by the Medical Center of Long Beach and the Cleveland Clinic Foundation.  In order to extract various features, we compare the performance of models built by logistic regression, which is a special case of the GLM, where the response variable is the presence of heart disease.  Classification and regression trees (CART), an alternative methodology, are also applied to help fit the model.  We select models using AUC (area under the ROC curve) and the misclassification rate.  To assess the effect of random sampling error, each model is fit on 90% training data and evaluated on 10% testing data, and a 10-fold cross-validation is also conducted.

Zhufeng Zhao, Application of Quantitative Analysis to Solving Some Real-World Business Problems, February 3, 2011 (Martin Levy, Yan Yu)
The thesis focuses on the application of quantitative analysis to solving some real-world business problems. It is composed of two projects. The first project uses some linear regression models to estimate the property tax imposed on some houses by a local government and to investigate whether the governmental property taxation is appropriate. After exploring some multiple linear regression models as well as some simple linear regression models, we have decided to develop our analysis further using simple linear regression because we desire simplicity and because we want to avoid regression coefficients with the wrong algebraic signs given by the multiple linear regression models. Residual analysis has been conducted to identify the outlying and influential observations for each simple linear regression model.  On the basis of the estimated property tax given by the models, we have found that the governmental property taxation needs some correction. We also categorize the houses into three ranges according to their governmental valuation or sale prices. The comparison between the low-sale-price range and the other ranges in terms of the property tax underpaid/overpaid clearly indicates that the home owners of the low-sale-price houses are heavily taxed by the local government in an inappropriate manner. The second project uses SAS programming to manipulate the performance data of a call center that has operations in multiple sites and business areas, and to help analyze its improvement in terms of AHT (average handling time, a metric to measure the time a representative spends handling an inbound call). The AHT analysis has broken down the overall AHT improvement by each site and by each business area and thus identified the drivers of the AHT improvement at the different levels of the performance metrics.


2010

Lili Wang, Determining Sample Size in Design of Experiments, December 1, 2010 (Martin Levy, Yan Yu)
This research project is a summary of the book "How Many Subjects? -- Statistical Power Analysis in Research" by Helena Chmura Kraemer and Sue Thiemann.  Sample size refers to the number of subjects or participants planned to be included in an experiment or study.  Sample size is not decided arbitrarily; it is usually determined using a statistical power analysis.  Generally, a larger sample size yields a more accurate decision and less error in the parameter estimate.  This report includes methods of calculating sample sizes for different statistical tests.  The definition, calculation methods, and illustrative examples are provided for each test.  SAS IML programs are listed for each example.  The master tables for sample-size calculation are not provided in this project because SAS programs can easily be used to calculate the sample size and power.  This project could be used as a reference to obtain the sample size and power for several different hypothesis tests.
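
As a quick illustration with base R (which gives the same kind of answer as the book's tables), the power.t.test function below solves for the per-group sample size of a two-sample t test and, conversely, for the power of a fixed sample size; the effect size, significance level, and power target are arbitrary example values.

  # Sample size for a two-sample t test: detect a medium effect (delta/sd = 0.5)
  # with 80% power at a two-sided 5% significance level
  power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80,
               type = "two.sample", alternative = "two.sided")
  # Conversely, the power achieved by a fixed sample size of 50 per group
  power.t.test(n = 50, delta = 0.5, sd = 1, sig.level = 0.05)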

Amin Khatami, A Study of Mixed-Effects Regression Models Using the SAS and R Software, November 29, 2010 (Yan Yu, David Kelton)
In standard linear regression, we deal with models in which the residuals have a normal distribution and are independent across observations.  In linear mixed models (LMMs), on the other hand, although the residuals are still normally distributed, they are not assumed to be independent of each other.  LMMs are well suited to two common sources of dependent or correlated data.  First, units on which the response is measured or observed may be clustered into different groups.  Second, repeated measures may be taken on the same unit over time or at different spatial points.  In this project, we present a description of LMMs by building a model in the field of marketing, using a step-up model-building strategy with which we can illustrate the hierarchical structure of LMMs.  We present a summary of published papers and textbooks written on this subject to discuss mathematical notation, covariance structures, and model-building and model-selection methods in LMMs.  We revisit an application of LMMs in the field of dentistry, conducted by researchers at The University of Michigan, and perform the same analysis on the data using the SAS and R programs.

Carlos Alberto Isla-Hernández, Simulating a Retail Pharmacy: Modeling Considerations and Options, November 24, 2010 (David Kelton, Alex Lin [UC College of Pharmacy])
In a sector as highly competitive and with such complex operational aspects as chain drug stores, simulation modeling has become a valuable and powerful tool to make better management decisions.  The literature about simulation of retail pharmacies is scarce, though.  We have recently conducted an exhaustive research project for one important chain drug store in which discrete-event simulation was one of the key tools.  The objective of this Research Project is to provide future researchers with knowledge and suggestions to build reliable and useful simulation models of a retail pharmacy.  Many of the topics analyzed will also be useful to researchers modeling different healthcare settings like hospital facilities or emergency rooms.  The first part of this study compiles a description of the main challenges we found to build an accurate simulation model of a retail pharmacy.  Some important aspects will be analyzed, such as deciding what the entities should be, identifying the resources in the model, and the management of different levels of priorities throughout the model.  The second part of the study identifies some useful output statistics and describes how to produce them in the simulation model.  Finally, the third part of the study defines and analyzes three different ways in which staff behavior can affect the main performance statistics at the pharmacy: personal time, workload effects, and fatigue effects. We will describe how to include them in the simulation model.

Whitney Brooke Gaskins, Simulating Meal Pattern Behavior in the Visual Burrow System, November 22, 2010 (David Kelton, Jeffrey Johnson [UC Department of Biomedical Engineering])
America is home to some of the most obese people in the world.  A staggering 33% of American adults are obese and, as a result, obesity-related deaths have climbed to more than 300,000 a year.  This is second only to tobacco-related deaths.  High-fat diets are a contributor to these conditions.  Like humans, rodents also show a preference for high-fat diets.  To examine the eating pattern and behavior of rats to help identify the reasons for obesity, a study was developed by the University of Cincinnati (Melhorn et al. 2010b).  The time has now come to expand on this research with the use of simulation.  Simulation has been used extensively in the manufacturing world; however, there are many untapped opportunities within the world of health care.  This project will simulate and model the Visual Burrow System, a controlled habitat used to monitor the behavior of rats, to help with future experimentation.  With the use of simulation, this project will examine the parameters such as feeding frequency and number of meals.  From this, researchers will be able to examine the effects that physiological and environmental factors have on the test subjects without having to change a test subject's environment physically, thereby saving much time and money.  Our Virtual Visual Burrow System (VVBS) model's results agree with those from the Visual Burrow System, demonstrating the validity of our simulation, and suggesting that simulation modeling can provide an important adjunct to traditional physical experimentation, and in so doing, provide a much faster and cheaper way to explore initially a wide variety of different experimental scenarios, and suggest which are the best candidates for more intensive follow-up physical experiments.

Michael D. Bernstein, A Transportation-Model Approach to Empty-Rail-Car-Scheduling Optimization, August 26, 2010 (Jeffrey Camm, Michael Fry)
The daily operational task of assigning empty freight cars to serve customer demand is a complex, detailed problem that affects customer service levels, transportation costs, and the operations of a railroad.  As railroads grow larger and traffic levels increase, the problem becomes ever more difficult and ever more important.  This paper presents an optimization model developed with the open-source COIN-OR OSI project.  Matrix generation is done using VBA algorithms with a Microsoft Access interface.  Data sources and outputs are designed to connect easily with a "real-world" platform.  By calculating costs and feasible arcs outside of the optimization solution stage, a simple transportation problem is created with fast run times and implementable results.  Car availability, transportation costs, off-schedule delivery penalties, and customer priorities are all taken into consideration.
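As a rough illustration of the transportation-problem structure the paper describes, the sketch below solves a tiny empty-car assignment instance with scipy's LP solver instead of the COIN-OR OSI and VBA tooling used in the project; the yards, customers, supplies, demands, and costs are all invented.

```python
# Transportation-problem sketch for empty-car assignment (illustrative data).
import numpy as np
from scipy.optimize import linprog

supply = np.array([30, 20, 25])            # empty cars available at 3 yards (assumed)
demand = np.array([15, 20, 25, 15])        # cars required at 4 customer locations (assumed)
cost = np.array([[4.0, 6.0, 9.0, 5.0],     # cost[i, j]: move one car from yard i to customer j
                 [5.0, 4.0, 7.0, 8.0],
                 [6.0, 3.0, 4.0, 6.0]])

m, n = cost.shape
c = cost.ravel()                           # x[i, j] flattened row-wise

# Supply constraints: sum_j x[i, j] <= supply[i]
A_ub = np.zeros((m, m * n))
for i in range(m):
    A_ub[i, i * n:(i + 1) * n] = 1.0

# Demand constraints: sum_i x[i, j] == demand[j]
A_eq = np.zeros((n, m * n))
for j in range(n):
    A_eq[j, j::n] = 1.0

res = linprog(c, A_ub=A_ub, b_ub=supply, A_eq=A_eq, b_eq=demand,
              bounds=[(0, None)] * (m * n), method="highs")
print("optimal cost:", res.fun)
print("car assignments (yards x customers):\n", np.round(res.x.reshape(m, n), 1))
```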

Valentina Pilipenko, Ph.D., Improving the Calling Genotyping Algorithm Using Support Vector Machines, August 20, 2010 (Martin Levy, Lisa Martin [UC College of Medicine])
Genome-wide single nucleotide polymorphism (SNP) chips provide researchers the opportunity to examine the effects of hundreds of thousands of SNPs in a single experiment.  To process this information, software packages have been designed to convert laboratory results, fluorescent signal intensities, into genotype calls.  While this process works well for most SNPs, a small portion equating to thousands of SNPs is problematic.  This is a problem because these SNPs are removed from analysis and thus their disease-causing role cannot be explored.  Therefore, the objective of this study was to determine whether statistical methods, namely support vector machines (SVMs) and regression trees, could be used to improve genotyping.  To accomplish this objective, we used 664 individuals from the Cincinnati Children's Medical Center Genotyping Data Repository who had Affymetrix 6.0 genotyping data available.  We then evaluated the performance of SVM and regression-tree analysis with respect to reduction of missingness and correction of the batch effect.  We found that neither method improved missingness.  However, SVM consistently resolved the batch effect.  As batch effects are a serious issue for genome-wide studies, the ability to resolve this issue could have substantial impact on future gene-discovery studies.
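A minimal sketch of the SVM step, assuming simulated two-channel intensities in place of the Affymetrix 6.0 signal data and using scikit-learn rather than the tools used in the study:

```python
# SVM sketch for genotype calling from two-channel signal intensities (simulated data).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Simulate allele-A / allele-B intensities for the three genotype clusters AA, AB, BB.
centers = {"AA": (3.0, 0.5), "AB": (2.0, 2.0), "BB": (0.5, 3.0)}
X, y = [], []
for label, center in centers.items():
    X.append(rng.normal(loc=center, scale=0.4, size=(200, 2)))
    y += [label] * 200
X, y = np.vstack(X), np.array(y)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# RBF-kernel SVM; scaling the intensities keeps the kernel well behaved.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X_tr, y_tr)
print("held-out genotype-call accuracy:", clf.score(X_te, y_te))
```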

Vikram Kirikera, Quantitative Analysis of Highway Crack Treatment - A Case Study, August 20, 2010 (Martin Levy, Uday Rao)
This project is a case study of analytical approaches to assess the effectiveness of crack sealing on highway pavements for the Ohio Department of Transportation.  The study determines the viability of crack sealing on different pavement conditions and quantifies the improvement in age resulting from the crack-seal process.  The primary focus is on analyzing the effectiveness of crack sealing on two types of pavements (flexible and composite) and two types of surface layers (gravel and limestone) based on a Pavement Condition Rating (PCR) measure.  In this case study, we conduct a trendline analysis of different pavement-performance measures to determine the types of pavements that are receptive to crack sealing and to find the optimum PCR range where crack sealing is effective.  We find that the service life of pavements increased by 1.5 to 1.7 years and the maximum percentage improvement in PCR from crack sealing was in the Prior PCR range of 66% to 80% and was most effective in composite pavements.

Ying Yuan, A Comparison of the Naive Bayesian Classifier and Logistic Regression in Predictive Modeling, August 13, 2010 (Yan Yu, Uday Rao)
The naive Bayesian classifier (NBC) is based on Bayes' theorem and the attribute-conditional-independence assumption.  Despite its simple structure and unrealistic assumptions, the NBC competes well with more sophisticated methods in terms of classification performance, and is remarkably successful in practice.  This project studies the algorithm of the NBC and investigates how it can be applied successfully in practice to predictive modeling.  We discuss and compare several different approaches and techniques in building an NBC.  By fitting home-equity-loan data, we demonstrate how to develop the NBC in SAS and propose ways that can improve its performance.  By comparing the model results from the NBC with those from a logistic regression, we conclude that the NBC performs as well as logistic regression in correctly classifying targeted objects, and its performance is slightly more robust than logistic regression.  Finally, we discuss the limitations of the NBC in application.
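For readers who want to reproduce the flavor of the comparison outside SAS, the sketch below fits a (Gaussian) naive Bayes classifier and a logistic regression on a synthetic, imbalanced binary dataset and compares accuracy and AUC; the data and settings are illustrative only.

```python
# Comparing a naive Bayes classifier with logistic regression on synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Imbalanced two-class problem standing in for the home-equity-loan data.
X, y = make_classification(n_samples=5000, n_features=12, n_informative=6,
                           weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {"naive Bayes": GaussianNB(),
          "logistic regression": LogisticRegression(max_iter=1000)}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    p = model.predict_proba(X_te)[:, 1]
    print(f"{name}: accuracy={model.score(X_te, y_te):.3f}, "
          f"AUC={roc_auc_score(y_te, p):.3f}")
```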

Mingying Lu, Modeling the Probability of Default for Home-Equity Lines of Credit, August 11, 2010 (Martin Levy, Jeffrey Camm)
Credit-risk predictive modeling, as a tool to evaluate the level of risk associated with applicants or customers, has gained increasing popularity in financial firms such as banks, insurance companies, and credit-card companies.  The new Basel Capital Accord provides several alternatives for banks to calculate economic capital.  The advanced internal-rating method allows a bank to develop its own models to quantify the three components of expected loss:  probability of default (PD), exposure at default, and loss given default.  In this research, a model of PD for home-equity lines of credit is developed.  The PD model estimates the likelihood that a loan will not be repaid and will therefore fall into default within 12 months.  More than 50 predictors were considered in the model, including account-application data, performance data, credit-bureau data, and economic data.  The response variable is binary (good vs. bad).  A weight-of-evidence transformation was introduced for the numerical variables, and logistic regression was applied to formulate the model.  The model was validated on both holdout data and out-of-time data.  The results of Kolmogorov-Smirnov statistics and the receiver operating characteristic curve show the strong predictive power of the model, while the system-stability index and attribute profiling demonstrate the stability of the model.
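The weight-of-evidence idea can be sketched in a few lines; the example below computes WoE and an information value for one invented binned predictor. The bin labels, default rates, and the convention WoE = ln(share of goods / share of bads) are assumptions for illustration, not the project's specification.

```python
# Weight-of-evidence (WoE) transformation sketch for one binned predictor.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Toy data: a binned utilization variable and a binary default flag (1 = bad, 0 = good).
df = pd.DataFrame({"util_bin": rng.choice(["low", "medium", "high"],
                                          size=5000, p=[0.5, 0.3, 0.2])})
p_bad = df["util_bin"].map({"low": 0.02, "medium": 0.05, "high": 0.15})
df["default"] = rng.binomial(1, p_bad)

grouped = df.groupby("util_bin")["default"].agg(bads="sum", total="count")
grouped["goods"] = grouped["total"] - grouped["bads"]
good_share = grouped["goods"] / grouped["goods"].sum()
bad_share = grouped["bads"] / grouped["bads"].sum()
grouped["woe"] = np.log(good_share / bad_share)
grouped["iv_part"] = (good_share - bad_share) * grouped["woe"]

print(grouped[["woe", "iv_part"]])
print("information value:", grouped["iv_part"].sum())

# The WoE-coded column would then replace the raw bins in the logistic regression.
df["util_woe"] = df["util_bin"].map(grouped["woe"])
```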

Guo Jiang, A Test of Stock Portfolio Selection using Scenario-Based Mixed Integer Programming, July 14, 2010 (Jeffrey Camm, Martin Levy)
Generally, in the portfolio-selection problem, the decision maker considers simultaneously conflicting objectives such as rate of return, liquidity, and risk.  The main goal of this project is to introduce a decision-support method for identifying a robust stock portfolio that will meet different requirements of investor types (risk neutral, risk preference, risk averse) accordingly, using the AMPL optimization software.  Specifically, the method helps the decision maker narrow the choices by providing variance corresponding to the Markowitz model, downside risk, probability of being within k% of optimal, return and cost value that correspond to the best worst-case scenario, and value at risk.  Then the decision maker can choose the appropriate model in specific cases.  The project aims to develop a concept that can aid the fund manager to make a decision when parameters in the optimization model are stochastic.  The final call in such cases is subjective and a "good" decision depends on the choice of decision maker.

Surya Ghimire, Forest-Cover Change: An Application of Logistic Regression and Multi-Layer Neural Networks to a Case Study from the Terai Region of Nepal, July 2, 2010 (Yan Yu, Uday Rao)
Forest-cover change has been taking place in the Terai Region of Nepal for the past two decades due to accelerated urbanization and population growth.  This project investigates the forest-cover change during the period 1989-2005 in the region, with the combined use of remote-sensing satellite images, geographic information systems (GIS), and data-mining techniques.  The results indicate a tremendous loss in forest cover: almost 10,438 hectares (10.32% of the region) turned into agricultural land and built-up area in the last 16 years.  This study finds that eight explanatory variables (distance to the settlement, topography with different slopes, land tenure with no established title, collective control of land, household size, livestock unit, community forestry, and presence of reforestation programs) are important for deforestation in the region.  The study further demonstrates that the integration of GIS, remote sensing, and mathematical-modeling approaches is beneficial in analyzing and predicting forest-cover change in the region.

Timothy Murphy, Investigating Break-Point Analysis as a Predictor of Bankruptcy Risk, June 30, 2010 (Yan Yu, Martin Levy)
Previous studies in bankruptcy prediction have used models such as discriminant analysis and logistic regression to estimate a firm's risk of bankruptcy, using selected financial ratios from discrete moments in the past as predictor variables.  These methods yield fairly effective predictions of bankruptcy risk, but ample misclassification still exists.  One possible method of improving the misclassification rate might be to use the rate at which a firm's financial ratios are changing as an additional predictor in these models.  This paper will summarize investigations into this hypothesis using Bayesian change-point (or "breakpoint") analysis.  Future paths of study will also be discussed.  This subject is of interest both in refining bankruptcy-prediction methods and in lending or retracting support from the efficient-market hypothesis.

Varun Mangla, Bankruptcy Prediction Models: A Comparison of North America and Japan, May 24, 2010 (Yan Yu, Uday Rao)
This study formulates and compares North American and Japanese bankruptcy-prediction models using logistic regression, linear discriminant analysis, and quadratic discriminant analysis.  These models are used to find the similarities between North America and Japan with regard to bankruptcy prediction.  Model comparisons are made on the basis of the cost of misclassification with a case study of two scenarios.  The scenarios differ in the cost of misclassifying a bankrupt company as a non-bankrupt company, where the chosen cost values are guided by relevant previous results.  These scenarios help illustrate the importance of the cost function in the prediction models.  Variable-selection techniques are then used to identify important variables for bankruptcy prediction within the domains of the chosen countries.  For this project, data are obtained from the COMPUSTAT North America and Global databases.  Ten financial variables are adopted to build models for bankruptcy prediction.  This work may help investors untangle the intricacies behind corporate investments and better prepare them to make judicious investment decisions when investing in North American and Japanese firms without the consideration of boundaries.

Deepsikha Saha, Comparison between Stepwise Logistic Regression on the Predictor Variables and Logistic Regression using the Factors from Factor Analysis, May 21, 2010 (Martin Levy, Norman Bruvold)
Identifying customers who are likely to respond to a product offering is an important issue in direct marketing, and data mining has been used extensively for such target selection.  The purpose of this research project is to identify the customers who are most likely to respond to a catalog mailing offer.  To this end, a predictive response model is built from historical customer purchase data using two data-mining approaches: (1) stepwise logistic regression on the original predictor variables, and (2) factor analysis to generate factors from the predictor variables, followed by a logistic regression on those factors.  The two logistic regression models are then compared.

Yuhui Qiu, Payback Study for Residential Air Conditioning Load Reduction Program, April 23, 2010 (Martin Levy, Yan Yu, Don Durack [Duke Energy])
Many utility companies implement direct load control programs to actively shift load from peak periods to non-peak periods and to add a level of grid security through the ability to reduce the distribution grid's load in case of equipment failures or excessive electrical usage.  A large portion of energy is used for HVAC systems.  One of the direct load control programs at Duke Energy cycles residential customers' air conditioners during peak electric load demand (normally on a hot day) for a certain period (2-6 hours) to reduce the electric load demand during these peak hours.  Previous studies show that A/C usage will often increase after the control hours, compared to what would have occurred without the control event (frequently termed payback or snapback).  This project used A/C duty-cycle data collected from a randomly selected research group in 2007.  In this study, a Tobit duty-cycle model and a fixed-effects panel-data model were developed to quantify the payback effects for the event days across two cycling strategies.  The net energy benefits from the residential air-conditioning cycling program, including both the initial load reductions and the rebound amount, were calculated.  The results from the two regression models were compared to investigate whether the cycling program reduces the overall daily kWh consumed by customers or simply shifts the usage to non-cycling hours after the cycling event interruptions are released.

Hasnaa Agouzoul, Green Driveway Survey: A Consumer Research Study Based on Discrete Choice Modeling, March 11, 2010 (David Curry, Yan Yu)
The objective of this project is to evaluate the willingness of homeowners to choose an environmentally friendly surface for paving their driveways at home.  This is a full research study with survey development, administration, data collection, and analysis.  The new pavement alternative is a permeable surface that allows rain water and snow melt to seep through, thus reducing the amount of water, called storm water runoff, flowing into a city's sewer system.  This is particularly important during heavy storms, when sewers tend to overflow and cause floods.  Additionally, storm water runoff often carries pollutants and sediments that may end up in rivers and streams, thus polluting local water supplies.  The purpose of this research is:  (1) to evaluate the importance of selected pavement attributes in shaping a homeowner's decision to buy (or not buy) a new driveway surface, the chosen attributes being impact on water quality, impact on the environment, installed cost, and possible financial incentives from the government; (2) to determine the relationship, if any, between homeowner demographics and choice of driveway surface; and (3) to develop a model to predict the driveway-surface choice as a function of homeowner demographics.  Phase one of the project involved designing and administering a survey to homeowners in Ohio.  Phase two analyzed the collected data using a latent-class approach to discrete-choice modeling.

Yin Li, Using Data-Mining Techniques to Build Predictive Models and to Gain Understanding of Current Medical Health Insurance Status, March 5, 2010 (Martin Levy, Yan Yu)
Generalized linear models (GLMs) have been widely used for the quantitative analysis of social-science data.  In this report, we examine the choice of medical health insurance among adults based on the China Health and Nutrition Survey (CHNS).  In many applications of data mining, prior to using predictive models, the most important variables must be selected.  In SAS/STAT, variable-selection methods are provided by the PROC LOGISTIC procedure; among these are backward, forward, and stepwise selection.  We review the stepwise method and compare it with a rank-of-predictors method based on the idea of bootstrapping.  We compare the performance of the model built by binary logistic regression with the classification-tree methodology.  Using ROC curves and the area under the curve (AUC), we identify the better-fitting model.  We also introduce an alternative method, namely a penalized-likelihood approach, to deal with the challenge of complete separation.  Finally, splitting the data into 90% training and 10% testing sets, we conduct 10-fold cross-validation to test the effect of sampling error on each model.

Huiqing Li, Statistical Analysis of Knee Bracing Efficacy in Off-road Motorcycling Knee Injuries, March 5, 2010 (Martin Levy, Yan Yu)
The use of Prophylactic Knee Bracing (PKB) in off-road motorcycling is frequent, while the effectiveness of PKB in preventing injuries remains a controversial topic.  An internet-based survey was conducted to pursue this issue further.  The purpose of this paper is to explore and quantify the effectiveness of wearing a knee brace vs. not wearing a brace in preventing motorcycling knee injuries, by providing statistical analysis of the questionnaire data.  Four logistic analyses are presented that statistically characterize the association of a number of risk factors with the odds of a motorcycle driver suffering a knee injury.  All four models in some way involve the factors AGE, BRACE ("Do you wear a knee brace?"), and a particular type or brand of knee brace, Air Townsend.  Standard statistical diagnostic measures deem all models acceptable.  Logistic regression for each of the four types of injuries (ACL, MCL, meniscus, and tibia fracture) was conducted to explore the association between covariates and the likelihood of having a specific type of knee injury, and significant factors for each type of injury were identified.  A motorist with 5-10 years of riding experience has a greater chance of incurring an ACL knee injury.  In general, wearing a knee brace is helpful in preventing an MCL knee injury.  Motorists in AGE3 (25-30) have a higher chance of a meniscus knee injury.  Motorists with less than 5 years of riding experience have a larger likelihood of a tibia-fracture injury.  In addition, several brands of brace design were compared regarding their degree of protectiveness for a particular type of injury.  Wearing Air_Townsend increases the likelihood of an ACL injury; wearing Air_Townsend carries a higher chance of a meniscus injury than wearing EVS or Asterisk; and the brace brand Asterisk has, to some degree, a negative effect on tibia-fracture injury.

Yi-Chin Huang, Optimal Vehicle Routing for a Pharmacy Prescription-Delivery Service, March 5, 2010 (Michael Fry, Alex Lin)
The objective of this project is to determine the optimal vehicle routing for prescription home delivery for a local chain of pharmacy stores, Clark's Pharmacy, to reduce its delivery costs.  Currently, the pharmacy chain delivers its prescriptions from seven sites using seven vehicles.  The preliminary analysis indicates that the prescription home-delivery cost mainly depends on the number of vehicles needed for delivery and the aggregate distances traveled by the vehicles.  Three scenarios are examined to determine the best delivery policy: (1) Decentralized Solution, representing the current delivery system using seven vehicles; (2) Hybrid Solution, three separate service areas each served by one vehicle; and (3) Centralized Solution, one service area served by three vehicles.  A cost analysis is conducted to determine the number of vehicles and routing options that provide the lowest cost.  Historical delivery-point locations and vehicle-related costs are collected from the company.  A Travelling Salesperson Problem (TSP) model is solved to determine the shortest delivery tour, and a max-flow model is used to identify violated tours.  A combination of SAS and AMPL is employed to manipulate the data and solve the models.  The results indicate that the Hybrid Solution is the most effective strategy for Clark's Pharmacy.
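For a sense of how such a tour can be computed, the sketch below formulates a small TSP in PuLP using Miller-Tucker-Zemlin subtour-elimination constraints, a simpler alternative to the max-flow check for violated tours used in the project; the coordinates are random and the instance is deliberately tiny.

```python
# TSP sketch with Miller-Tucker-Zemlin (MTZ) subtour-elimination constraints.
# (The project used a max-flow check for violated tours; MTZ is a simpler alternative.)
import itertools
import math
import random
import pulp

random.seed(3)
n = 8  # delivery points (kept small so the MTZ model solves quickly)
pts = [(random.uniform(0, 10), random.uniform(0, 10)) for _ in range(n)]
dist = {(i, j): math.dist(pts[i], pts[j])
        for i, j in itertools.permutations(range(n), 2)}

prob = pulp.LpProblem("delivery_tour", pulp.LpMinimize)
x = pulp.LpVariable.dicts("x", dist.keys(), cat="Binary")            # arc used?
u = pulp.LpVariable.dicts("u", range(n), lowBound=0, upBound=n - 1)  # MTZ ordering

prob += pulp.lpSum(dist[a] * x[a] for a in dist)
for i in range(n):
    prob += pulp.lpSum(x[i, j] for j in range(n) if j != i) == 1     # leave each node once
    prob += pulp.lpSum(x[j, i] for j in range(n) if j != i) == 1     # enter each node once
for i, j in dist:
    if i != 0 and j != 0:
        prob += u[i] - u[j] + n * x[i, j] <= n - 1                   # forbid subtours

prob.solve(pulp.PULP_CBC_CMD(msg=False))

# Trace the tour starting from node 0.
tour, node = [0], 0
while True:
    node = next(j for j in range(n) if j != node and pulp.value(x[node, j]) > 0.5)
    if node == 0:
        break
    tour.append(node)
print("tour:", tour, "length:", round(pulp.value(prob.objective), 2))
```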

Parama Nandi, Response Modeling in Direct Marketing, February 25, 2010 (Martin Levy, Jeffrey Camm)
In direct marketing, predictive modeling has been used extensively to identify potential customers for a new product; identifying those who are more likely to respond to an offer is a central issue.  The purpose of this thesis is to build a model that identifies targets for a future mailing campaign, using logistic regression as the predictive modeling technique.  The dataset consists of information on 10,000 donors to the Paralyzed Veterans of America Fund from past fund-raising mailing campaigns, obtained from the KDD Cup competition database.  The predictive model is built from the donors' historical donation data (behavioral variables) together with demographic and census data.  Response modeling generally has two stages: identifying respondents from a customer database, and then estimating the purchase amounts of those respondents; this paper focuses on the first stage, which is a classification problem.  Several practical issues must be addressed along the way.  There is a large number of predictors, which is common because organizations collect a great deal of customer information, yet many of these predictors contain little or no useful information, so the ability to exclude redundant variables is important.  Many predictors have missing values; some are continuous and some are categorical.  Some categorical predictors have many levels with small exposure, that is, few observations at a level, while the continuous variables can have extreme values or take only a small number of unique values.  There is also potential for significant interaction between predictors.  Finally, the responses are highly unbalanced: only about 5% of the observations are positive, a low response rate that is typical of direct-marketing datasets.  Because irrelevant or redundant features degrade model performance, feature selection was performed in two steps, exploratory data analysis followed by stepwise selection, to determine the inputs to the model.  The project addresses all of these issues and proceeds to the final model-building and model-evaluation phases.

 

2009

Omkar Saha, Design and Develop Cincinnati Children's Scheduling System for use in the Optimization of Hospital Scheduling, October 19, 2009 (Kipp Martin, Michael Magazine, Craig Froehle)
Escalating health care costs continue to increase the demands on hospital administrators for greater efficiency, creating tighter constraints on doctors and human resources.  Cincinnati Children's Hospital Medical Center (CCHMC) is seeking to address its existing scheduling inefficiencies while also addressing the additional resource demands of a new satellite location.  CCHMC currently uses a manual scheduling method based on legacy schedules, and each specialty maintains its own schedule.  As part of an on-going project with UC MS - Business Analytics faculty, several attempts have been made at optimizing the scheduling process across all locations and specialties.  This project attempts to consolidate the entire request-making process for the different specialties irrespective of location, generate an optimal schedule satisfying all the requests, and report the approved schedule by different criteria in a common format.  This would help achieve greater efficiency in the use of administrators, doctors, and clinical and surgical spaces.

Edmund A. Berry, National Estimates of the Inpatient Burden of Pediatric Bipolar Disorder in an Inpatient Setting. An Analysis of the 2003 and 2006 Kids Inpatient Databases (KID) Data, September 25, 2009 (Martin Levy, Pamela Heaton)
Bipolar disorder (BPD) is a debilitating, recurrent, chronic mental illness characterized by cycling states of depression, mania, hypomania, and mixed episodes.  This disease, generating tremendous societal and economic impact, is associated with a high degree of morbidity and mortality and is particularly costly and debilitating in pediatric patients.  The objectives of this study were (1) to calculate national estimates of the annual burden of inpatient hospitalizations of children and adolescents with BPD, where burden is measured specifically in terms of charges, cost, and length of stay; (2) to describe and compare the burden across various demographic characteristics, hospital characteristics, and key comorbidities associated with BPD; and (3) to determine the independent effects of these demographic, hospital-type, and comorbidity factors on hospitalization costs.  To accomplish these objectives, we examined data from the 2003 and 2006 Kids' Inpatient Databases (KID).  National estimates of the means and standard errors of the mean for cost, charges, and length of stay for inpatient pediatric BPD used the complex sample design of the 2003 and 2006 KID data, which contains weighting, stratification, and clustering variables.  Two ordinary least squares regression models, using 2003 and 2006 KID data, were used to determine key predictors of cost across demographic characteristics, hospital characteristics, and comorbidities.  Finally, the Chow test was used to determine whether the underlying regression models estimated for 2003 and 2006 were the same.

Deepankar Arora, A Decision Support Methodology for Distribution Networks in a Stochastic Environment using Mixed Integer Programming in Spreadsheets, September 11, 2009 (Jeffrey Camm, Kipp Martin)
In an effort to reduce the distribution costs from distribution centers to customer locations, a company is considering opening a set of five distribution centers to serve all of its customer locations.  The main problem the company faces is demand uncertainty at the customer locations, which can have an adverse effect on its transportation costs.  The main goal of this project is to introduce a decision-support methodology for identifying a robust distribution network that minimizes transportation and handling costs under stochastic demand, using VBA (Visual Basic for Applications) in Excel.  Specifically, the methodology helps the decision maker narrow the choices by providing cost distributions corresponding to a candidate solution, an efficient frontier, the cost value corresponding to the best worst-case scenario, value at risk (VaR), and the expected loss below a certain value.  The project aims to develop a concept that can aid the decision maker when parameters in an optimization model are stochastic.  The final call in such cases is subjective and a "good" decision depends on the decision maker, but this methodology aims to give the decision maker tools to facilitate and inform the decision-making process.

Bethany Harding, Safety Stock Level Analysis for Replenishment Planning using Actual-to-Forecast Demand Ratios, September 4, 2009 (Uday Rao, Amitabh Raturi)
Senco Brands, Inc., currently stocks approximately 10,000 items at one or all of its domestic distribution locations.  The planning and replenishment for these items is performed using a basic MRP planning system.  Forecasts are created for each item and prorated for each distribution center based on its historic share of total corporate demand.  Desired ending-inventory levels are set using a safety-time factor measured in weeks.  The current model requires a safety-time level defined for each item at each distribution center.  Desired ending inventory is calculated in each weekly time period of the planning horizon by accumulating the demand forecasts over the contiguous future weeks specified by the safety time.  Using the company's data, an Excel-based tool was developed to: (1) recreate the MRP planning system's approach to setting planned order releases using the desired-ending-inventory approach and the input safety time; (2) apply an actual-to-forecast demand ("A/F") ratio approach to determine the probability distribution of demand over the planning horizon; (3) simulate various scenarios for future demand; (4) use the simulated demand scenarios to determine the performance of a chosen safety time (or desired ending inventory) using key performance indicators such as expected customer fill rate, inventory investment, and working capital; and (5) calculate an optimized safety time that achieves satisfactory performance (e.g., a 95% fill rate), as determined by the company.  Various applications of the Excel-based tool are illustrated.
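A stripped-down sketch of the A/F-ratio simulation idea: resample historical actual-to-forecast ratios to generate demand scenarios and estimate the fill rate implied by a candidate safety time. The MRP logic here is simplified to fixed planned receipts plus an opening buffer, and every number is an assumption, not company data.

```python
# A/F-ratio demand simulation and fill-rate check for a chosen safety time (illustrative).
import numpy as np

rng = np.random.default_rng(7)

weekly_forecast = np.full(12, 100.0)                        # flat 12-week forecast (assumed)
af_ratios = np.array([0.6, 0.8, 0.9, 1.0, 1.1, 1.3, 1.5])   # historical actual/forecast ratios (assumed)
safety_time_weeks = 2                                       # candidate safety time to evaluate
n_scenarios = 20000

# Planned receipts simply follow the forecast; the safety time provides the opening buffer.
receipts = weekly_forecast.copy()
opening_stock = weekly_forecast[:safety_time_weeks].sum()

fill_rates = []
for _ in range(n_scenarios):
    # One demand scenario: forecast scaled by A/F ratios resampled with replacement.
    demand = weekly_forecast * rng.choice(af_ratios, size=len(weekly_forecast))
    inventory, filled = opening_stock, 0.0
    for t, d in enumerate(demand):
        inventory += receipts[t]
        shipped = min(d, inventory)     # unmet demand is lost (simplifying assumption)
        filled += shipped
        inventory -= shipped
    fill_rates.append(filled / demand.sum())

print(f"expected fill rate with a safety time of {safety_time_weeks} weeks: "
      f"{np.mean(fill_rates):.3f}")
```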

Andreas Kuncoro, Empirical Study of Supply Chain Disruptions' Impact on the Financial and Inventory Performance of Manufacturing and Non-manufacturing Firms, August 28, 2009 (Amitabh Raturi, Uday Rao)
Supply chain disruptions are unanticipated events, caused by internal or external factors, that cause a firm to deviate significantly from its original plans and consequently affect its performance.  This work assesses the relationship between supply chain disruptions and overall firm performance as measured by financial (return on assets and leverage) and operational (inventory turnover) metrics.  We first chronicle 75 supply disruptions in 47 firms as reported in the business press over a three-year period (2005-2007).  We then categorize these disruptions by causal factor as internally versus externally caused, and across several origin sources.  The performance metrics are then observed from Compustat quarterly data from one year before through one year after the disruption announcements.  The impact of such disruptions is first analyzed by firm size, firm type (manufacturing versus non-manufacturing), reason, and responsibility.  In multivariate analysis-of-covariance tests, firm size showed a significant positive association with overall firm performance, while the disruption-event announcement showed a significant negative association with overall performance.  Consistent with previous studies, our findings indicate that supply chain disruptions negatively impact both financial and operational performance, and that firm size significantly moderates this impact.  One year after the event announcement, the firms are able to recover their performance.

James Andrew Kirtland III, Simulation Efficiency of the Finitized Logarithmic Power Series, August 27, 2009 (Martin Levy, David Kelton)
It is often appropriate or desirable to limit a distribution's support.  This can be due to the actual environment an analyst is trying to model, or to increase the efficiency of simulating random variates from a model.  This can be done using traditional truncation; however, when truncation is used, undesired and often unpredictable effects occur to the moments of the parent distribution.  Finitization is a method of limiting a power-series distribution's support while preserving its moments up to the order of finitization, n.  The logarithmic power-series distribution is used to discuss properties of theoretical, truncated, and finitized distributions.  Four algorithms designed to generate random variates from a theoretical logarithmic power-series distribution are compared to an alias method designed to generate random variates from a finitized logarithmic power-series distribution.  The variates created by these four algorithms, as well as by the alias method, are tested against the theoretical logarithmic power series to check whether the moments hold.  Finally, a horse race is used to test whether the finitized logarithmic distribution using an alias method is more efficient at generating random variates than the four other algorithms based on the infinitely supported logarithmic distribution.
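The alias method at the center of the comparison is easy to sketch. The code below implements Vose's version of the alias tables for an arbitrary finite discrete distribution and, purely for illustration, feeds it logarithmic probabilities truncated to a small support and renormalized (plain truncation, not the paper's finitization).

```python
# Alias-method sketch (Vose's algorithm) for O(1) sampling of a finite discrete distribution.
import math
import random

def build_alias(probs):
    """Preprocess probabilities into probability/alias tables in O(n)."""
    n = len(probs)
    scaled = [p * n for p in probs]
    prob_table, alias_table = [0.0] * n, [0] * n
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s, big = small.pop(), large.pop()
        prob_table[s], alias_table[s] = scaled[s], big
        scaled[big] = scaled[big] + scaled[s] - 1.0
        (small if scaled[big] < 1.0 else large).append(big)
    for i in small + large:            # whatever is left gets probability 1
        prob_table[i] = 1.0
    return prob_table, alias_table

def alias_draw(prob_table, alias_table, rng=random):
    """One variate: pick a column uniformly, then flip that column's biased coin."""
    i = rng.randrange(len(prob_table))
    return i if rng.random() < prob_table[i] else alias_table[i]

# Example: logarithmic(theta) probabilities truncated to {1,...,6} and renormalized.
theta = 0.5
support = list(range(1, 7))
raw = [-theta**k / (k * math.log(1.0 - theta)) for k in support]
probs = [p / sum(raw) for p in raw]

random.seed(11)
pt, at = build_alias(probs)
draws = [support[alias_draw(pt, at)] for _ in range(100_000)]
print("empirical mean:  ", sum(draws) / len(draws))
print("theoretical mean:", sum(k * p for k, p in zip(support, probs)))
```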

Shannon Peterson, Development of a Long Range Capacity and Purchasing Plan for a Manufacturing Environment, August 24, 2009 (Jeffrey Camm, Uday Rao)
Long range capacity planning is an essential part of business planning.  This can be complicated by seasonality of products, varying material pricing plans and supplier capacities, criticality and substitutions of raw materials, and multiple production sites and bills of materials.  This project develops a flexible tool that reveals an optimal, high-level long range production schedule and purchasing plan to satisfy customer demand and identify potential outages.

Ndanatsiwa Anne Chambati, Locating an Optimal Site for a New Natorp's Garden Center, August 21, 2009 (Michael Magazine, Uday Rao)
A well-known aphorism states that the three most important attributes of a store are "location, location, and location."  The area of research on optimal store location has grown rapidly in the last decade.  Most of the research in this area has been undertaken by marketing researchers, urban geographers, and economists, with applied mathematicians recently entering the field.  Applied mathematicians have become involved in the study of retail location theory through the development of algorithms and mathematical models applicable to location problems.  At the mathematical level the problem is abstract and exact, removed from the practical problems of the real-estate developer or marketing expert.  Natorp is a family-owned business that has been around since 1916.  It currently has two Garden Center locations, a nursery, and landscaping services.  The company would like to open an additional Garden Center in the Ohio-Kentucky-Indiana (OKI) region and needs to know the optimal location for it.  First, we review the current literature on optimal store location and then look at the most important factors for Natorp to consider in the expansion.  Next, we evaluate each of the eight counties in the OKI region using a multi-factor site-location rating system and identify potential sites for the new Garden Center.  These potential sites are evaluated based on population projections over the next 30 years, median household income, median home value, and proximity to competitors.

Ashutosh Mhasekar, Application of Statistical Procedures to Target Specific Segments for Upgrading Marginally Sub-par Members to Rewards-eligible Level in a Retail Loyalty Environment, August 19, 2009 (Michael Magazine, Uday Rao, Marc Schulkers)
The retail industry has become extremely competitive with loyalty programs constantly used to monitor customer behavior and engage customers for incremental sales / revenue. Retailer R runs a points-based loyalty program. Members can earn rewards certificates which are good towards future purchases. With the current economy and stiff competition, the Retailer is using targeted bonus offers to members that need additional points to earn a reward certificate. In this project we use various statistical tools to efficiently target members that need additional points to earn a reward certificate and to maximize certificate redemption which results in incremental sales to the company. Also, a test and control group approach is employed to monitor and measure the incremental behavior / performance of this “Bonused” group during the promotional period and post period as well. Using the targeted segmentation approach an increase in redemption rate was noted. There was significant increase in revenue during the promotional period, without impacting the post period sales.

Shaonan Tian, Data Sample Selection Issues for Bankruptcy Prediction, August 12, 2009 (Yan Yu, Martin Levy)
Bankruptcy prediction is of paramount interest to both academics and practitioners.  This paper devotes special care to an important aspect of bankruptcy-prediction modeling: the data-sample-selection issue.  We first explore the effect of different data-sample-selection methods by comparing out-of-sample predictive performance in a Monte Carlo simulation study under the logit regression model.  The simulation study suggests that if forecasting the probability of bankruptcy is of interest, the complete-data sampling technique provides more accurate results.  However, if a binary bankruptcy decision or a corporate rating is desired, the choice-based sampling technique may still be suitable.  In particular, within the logit regression context, a simple remedy can be applied to adjust the cut-off probability so that the choice-based sampling technique and the complete-data sampling technique display the same explanatory power in forecasting the bankruptcy classification.  We also find that appropriate adjustment of the cut-off probability is complementary when different misclassification costs are taken into account.  Finally, we contextualize the proposed recommendations by applying them to an updated bankruptcy database.  We further investigate the effect of the different data-selection methods on this corporate bankruptcy database with a non-linear classification method, support vector machines (SVM), which has recently gained popularity in applications.

Xinhao Yao, Option Pricing: A Comparison Between Black-Scholes-Merton Model and Monte Carlo Simulation, August 7, 2009 (Martin Levy, Uday Rao)
An option, a kind of financial derivative, is a special contractual arrangement giving the owner the right to buy or sell an asset at a fixed price on a given date.  In this project, we focus on comparison between two option pricing methods: Black-Scholes-Merton model and Monte Carlo simulation.  The results from both methods can be considered equivalent and an equivalence test is applied to determine the number of iterations of Monte Carlo simulation.  We also try some modifications of the Monte Carlo simulation to see how to improve the pricing method when rare events happen.
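A compact sketch of the two methods side by side, with the market inputs (spot, strike, rate, volatility, maturity) chosen arbitrarily for illustration:

```python
# Black-Scholes-Merton price of a European call vs. a Monte Carlo estimate (illustrative inputs).
import numpy as np
from scipy.stats import norm

S0, K, r, sigma, T = 100.0, 105.0, 0.03, 0.25, 1.0   # assumed market inputs

# Closed-form Black-Scholes-Merton price.
d1 = (np.log(S0 / K) + (r + 0.5 * sigma**2) * T) / (sigma * np.sqrt(T))
d2 = d1 - sigma * np.sqrt(T)
bsm = S0 * norm.cdf(d1) - K * np.exp(-r * T) * norm.cdf(d2)

# Monte Carlo: simulate terminal prices under the risk-neutral measure.
rng = np.random.default_rng(0)
n = 200_000
z = rng.standard_normal(n)
ST = S0 * np.exp((r - 0.5 * sigma**2) * T + sigma * np.sqrt(T) * z)
payoff = np.exp(-r * T) * np.maximum(ST - K, 0.0)
mc, se = payoff.mean(), payoff.std(ddof=1) / np.sqrt(n)

print(f"Black-Scholes-Merton price: {bsm:.4f}")
print(f"Monte Carlo estimate:       {mc:.4f} (std. error {se:.4f})")
```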

Wei Huai, Bankruptcy Prediction: A Comparison between Simple Hazard Model and Logistic Regression Model, July 27, 2009 (Yan Yu, Uday Rao)
As a serious issue for both firms and individuals, bankruptcy has recently drawn increased attention from society, making its prediction an important topic.  In this research project, two popular bankruptcy-forecasting models, Shumway's (2001) simple hazard model and the logistic regression model, are studied and compared.  Three different measures, decile ranking, the area under the ROC curve, and the Hosmer-Lemeshow goodness-of-fit test, are implemented to evaluate and compare the bankruptcy forecasts.  The conclusion that the simple hazard model is superior to the logistic regression model in the accuracy of bankruptcy forecasting is reconfirmed.

Mayur Bhat, Study of Uplift Modeling and Logistic Regression to increase ROI of Marketing Campaigns, June 5, 2009 (Uday Rao, Amitabh Raturi)
In this research project, we study a technique known as Uplift Modeling which uses control groups judiciously to measure the true lift in sales that a marketing campaign generates. In addition, Uplift Modeling proposes customer segmentation to achieve better campaign results by way of selective targeting. The results show how using test versus control groups helps in measuring true lift. We also demonstrate that selective targeting of customers using Uplift Modeling increases incremental revenue when compared to the existing alternative called Traditional Response Modeling. Logistic Regression, using categorical attitudinal data, is also used to further strengthen and complement the results seen from Uplift Modeling.

Venu Silvanose, Developing and Assessing a Multiple Logistic Regression Model on Mortgage Data to Determine the Association of Different Predictor Variables and Borrower Default, June 3, 2009 (Martin Levy, Norman Bruvold, Yan Yu)
The purpose of this paper is to develop and assess a logistic regression model to determine the association of different predictor variables and mortgage borrower default. In the current housing market, where none of the widely used models in the industry were able to predict with some certainty the high level of default by borrowers, models are still used albeit with a sense of extreme caution to identify good and bad credit risks. 

Manish Kumar, Intelligent Allocation of Safety Stock in Multi-item Inventory System to Increase Order Service Level and Order Fill Rate, June 3, 2009 (Amitabh Raturi, Michael Magazine)
In this study, we propose a model for establishing safety stock in a multi-item inventory system to increase order fill rate and order service level based on the correlation between the demands of multiple products.  A customer order to a multi-item inventory system consists of several different products in different quantities.  The rate at which a manufacturer is able to fulfill the demand for all products in a customer's order within a specified time is termed the order fill rate (OFR), whereas the measure of how successful the manufacturer is in fulfilling all orders completely by the required date is termed the order service level (OSL).  The OFR and OSL are very important indices for measuring the performance of the manufacturer and customer satisfaction.  We evaluated the order fill rate and order service level performance of the inventory system in a model in which the total customer order demand process is based on normally distributed but correlated demands.  We show that if the safety stock level is adjusted in accordance with the level of correlation in product demand, both the order fill rate and the order service level can be improved.
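The core idea, that correlation among item demands changes the chance an entire order can be filled, can be checked with a short simulation. The sketch below uses two items, an assumed correlation, and a common safety factor; all parameters are illustrative, not the study's.

```python
# Order-fill-rate sketch under correlated item demands (illustrative parameters).
import numpy as np

rng = np.random.default_rng(5)

mu = np.array([100.0, 60.0])           # mean demand per order cycle for items A and B (assumed)
sd = np.array([20.0, 15.0])
rho = 0.6                              # assumed demand correlation between the two items
cov = np.array([[sd[0]**2, rho * sd[0] * sd[1]],
                [rho * sd[0] * sd[1], sd[1]**2]])

z = 1.28                               # common safety factor (assumed)
stock = mu + z * sd                    # base stock carried for each item

n_orders = 100_000
demand = np.maximum(rng.multivariate_normal(mu, cov, size=n_orders), 0.0)

# An order counts as filled only if every item's demand is within stock.
order_filled = np.all(demand <= stock, axis=1)
print("order fill rate (both items available):", order_filled.mean())
print("item-level fill rates:", (demand <= stock).mean(axis=0))
```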

Larisa Vaysman, Quantifying the Impact of Draft Round on Draft Pick Quality Using Non-Parametric Median Comparison, June 2, 2009 (Michael Fry, Jeffrey Ohlmann, Geoff Smith) 
At the beginning of each season, NFL teams take turns selecting rookies to add to their rosters in a days-long process known as the NFL Draft. The NFL Draft consists of seven rounds. Since each team wants to have the strongest possible roster, players who are thought to have the potential to be outstanding are chosen early, and less desirable players are generally chosen later in the process or not at all. We seek to quantify the “cost,” in terms of player quality, that is incurred when a team chooses to wait until a later round to draft a player at a particular position. We also examine a number of position-specific metrics to measure player quality.  We use the Kruskal-Wallis test, a non-parametric comparison of medians, to determine which draft rounds are likely to offer picks of equivalent quality, and which draft rounds are likely to offer picks of significantly better or worse quality. Our analysis is meant to assist teams during the decision-making process of drafting players by quantifying the tradeoffs inherent in each potential decision.

Michael D. Platt, Distribution Network Model Using Mixed Integer Programming and a Combination of Distribution Centers and Cross-Dock Terminals, June 2, 2009 (Jeffrey Camm, Michael Fry)
In an effort to reduce manufacturing costs, a company is considering moving its manufacturing facilities from the United States to Mexico. Though the facility costs and labor costs will be much lower at the Mexico facility, they are concerned that the move could have an adverse effect on their transportation costs.  The goal of this project is to determine the distribution network that will result in the lowest transportation and material handling cost while maintaining desired customer service levels. Specifically, the project will focus on incorporating cross-docking terminals in the solution in conjunction with fully stocked distribution centers. At a cross-docking terminal, product is moved directly from a receiving dock to a shipping dock, spending very little time in the facility. This process eliminates the need to hold these finished goods in inventory, thus reducing inventory costs and material handling costs.

Taylor W. Barker III, The Expected Box Score Method: An Objective Method for NFL Power Rankings, May 29, 2009 (Martin Levy, Michael Magazine, co-chairs)
One of the more interesting pages on ESPN.com during the NFL season is the NFL "Power Rankings" compiled each week.  This is basically a ranking of the relative strengths (during that week) of the NFL teams based on the votes of several panel members (ESPN.com NFL writers/bloggers).  While the results take into account the subjective rankings of each of the panel members, it would be interesting to see whether there is an "objective" method to develop weekly power rankings based on current-season statistics to date.  An objective method for weekly power rankings is found through a process I have named the Expected Box Score (EBS) Method.  The EBS Method determines expected box scores between two teams at a given venue based on current-season data and then plugs them into a linear regression model, based on 20 years of data, to get a current estimated point differential between the two teams.  This process is repeated for every team playing against every other team exactly twice (once at home and once away), and the results are used to determine how many of those games each team would be expected to win.  The team with the most wins is ranked #1, and so on.  Shortcomings of other methods are addressed and then considered in the development of the EBS Method.  Validation of this method is provided via comparisons with Las Vegas point spreads and NFL.com Power Rankings.

Lori Mueller, Norwood Fire-Department Simulation Models: Present and Future, May 28, 2009 (David Kelton, Jeffrey Camm) 
The Norwood Fire Department (NFD) currently operates one fire station, serving approximately 22,000 people. In 2008, the NFD made approximately 4,400 runs, which averages to about 12 runs per day. With an increase in retail and business development in the city, there has been a subsequent increase in the number of emergencies the department responds to each year. If the development in the city continues over the next few years, the NFD will have to grow along with the city.  The NFD has a few options for expansion. One option is to open a second fire station at a location currently owned by the city, which used to be the Norwood fire station before a new station was opened at its current location. Another option is to expand their current station, which is located near the geographical center of the city, so that they could increase the amount of equipment and firefighters. Using simulation modeling, these different options were explored to determine which option is best for Norwood, when the time comes for expansion.

Vinod Iyengar, Call Volume Forecasting and Call Center Staffing for a Financial Services Firm, March 13, 2009 (Uday Rao, Martin Levy)
 
In this project, we use statistics and data analytics to build scalable and robust models for call center forecasting and staffing. The core of the problem involves predicting call volumes with lead times of a few months, when conditions are dynamic and there is high variability with multiple types of calls. We use data from a US-based prepaid debit card vendor with two types of calls: application calls and customer service calls. We predict application calls using a model of historical effectiveness of marketing dollars and incorporate data on card activation history and customer attrition. We predict customer service calls from active cardholders using time series analysis and regression to capture trend, seasonality, and cyclicity. Call volume predictions are then input into a stochastic newsvendor model to set a staffing level that effectively trades off staffing costs with lost-sales penalty costs for unsatisfied calls. The impact of different staffing level choices on expected costs is explored by simulating call center volume. Performance improvement resulting from this work includes more accurate forecasts with increased service levels and agent occupancy.
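The staffing step can be sketched as a classic newsvendor calculation followed by a simulation check. Everything below (call volumes, agent productivity, costs, penalties) is an invented placeholder standing in for the firm's data.

```python
# Newsvendor-style staffing sketch: trade off staffing cost against lost-call penalties.
import numpy as np
from scipy.stats import norm

calls_mean, calls_sd = 1200.0, 180.0   # forecast call volume for the period (assumed)
calls_per_agent = 60.0                 # calls one agent can handle in the period (assumed)
cost_per_agent = 150.0                 # staffing cost per agent, the overage cost (assumed)
penalty_per_lost_call = 4.0            # lost-sales penalty per unanswered call (assumed)

# Critical fractile in units of call capacity: underage / (underage + overage).
c_under = penalty_per_lost_call
c_over = cost_per_agent / calls_per_agent
beta = c_under / (c_under + c_over)
capacity_star = norm.ppf(beta, loc=calls_mean, scale=calls_sd)
agents_star = int(np.ceil(capacity_star / calls_per_agent))
print(f"critical fractile = {beta:.3f}, staff about {agents_star} agents")

# Simulation check of expected cost around the recommended staffing level.
rng = np.random.default_rng(2)
volume = rng.normal(calls_mean, calls_sd, size=50_000)
for agents in (agents_star - 2, agents_star, agents_star + 2):
    lost = np.maximum(volume - agents * calls_per_agent, 0.0)
    cost = agents * cost_per_agent + penalty_per_lost_call * lost.mean()
    print(f"agents={agents}: expected cost {cost:,.0f}")
```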

Lei Yu, A Comparison of Portfolio Optimization Models, March 13, 2009 (Martin Levy, Uday Rao) 
Applications of portfolio optimization models have developed rapidly. One issue is determining which model should be followed as a guide for investors to make an informed portfolio decision. In this paper, five optimization models: classical Markowitz model, MiniMax, Gini's Mean Difference, Mean Absolute Deviation, and Minimizing Conditional Value-at-Risk, are presented and compared. Solutions generated by different models applied to the same data sets provide insights for investors. The data sets employed include real world data and simulated data. MATLAB, VBA (Excel as host), and COIN-OR software were employed. Some observations about alternative selection, similarities, and discrepancies among these models are found and described.
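As a small, self-contained example of the first of the five models, the sketch below solves a long-only Markowitz minimum-variance problem with a return floor on simulated returns using scipy; the other four models would swap in different objectives and constraints. All inputs are illustrative.

```python
# Markowitz minimum-variance portfolio for a target return (simulated return data).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
n_assets, n_obs = 5, 500
true_mu = np.array([0.08, 0.10, 0.12, 0.07, 0.09]) / 252       # assumed daily means
returns = rng.normal(true_mu, 0.01, size=(n_obs, n_assets))    # simulated daily returns

mu = returns.mean(axis=0)
Sigma = np.cov(returns, rowvar=False)
target = mu.mean()                       # require at least the average asset return

def variance(w):
    return w @ Sigma @ w

constraints = [{"type": "eq", "fun": lambda w: w.sum() - 1.0},      # fully invested
               {"type": "ineq", "fun": lambda w: w @ mu - target}]  # return floor
bounds = [(0.0, 1.0)] * n_assets                                    # long-only
w0 = np.full(n_assets, 1.0 / n_assets)

res = minimize(variance, w0, bounds=bounds, constraints=constraints, method="SLSQP")
w = res.x
print("weights:", np.round(w, 3))
print("portfolio return:", round(float(w @ mu), 6),
      "variance:", round(float(variance(w)), 8))
```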

Moumita Hanra, Assessing ultimate impact of Brand Communication on market share using Path Models and its comparison to Ridge regression, March 12, 2009 (Martin Levy, Uday Rao)
Path modeling, based on structural equation modeling, is a widely used technique in market research to analyze the interrelationships among various measures and to determine which ones are really significant in driving sales.  In this study, the objective is to find the best-fitting path model to assess, using respondent-level survey data, which attributes are most important to a consumer in terms of sales.  The model is also used to identify the media sources on which companies should focus their brand advertising to gain maximum public awareness of the brand, and to show how this awareness shapes the way consumers think about the brand along different dimensions and, in turn, drives sales.  The second half of this study compares the results of the path model with ridge regression to assess which model yields the better fit and the more intuitive results.  Ridge regression reduces the multicollinearity among independent variables by modifying the X'X matrix used in ordinary least squares regression with a ridge control parameter.  The results indicate that the path model gives a much better fit than ridge regression, especially when multicollinearity is not extreme.

Man Xu, Forecasting Default: A comparison between Merton Model and Logistic Model, March 11, 2009 (Yan Yu, Uday Rao) 
The Merton default model, which is based on Merton's (1974) bond-pricing model, has been widely used in both academic research and industry to forecast bankruptcy.  This work reexamines the Merton default model as well as the relationship of default risk with equity returns and the firm-size effect, using an updated database covering 1986 to 2006 obtained from COMPUSTAT and CRSP.  We concur with most of the findings in Vassalou and Xing (2003).  We find that both default risk and size have an impact on equity returns, with the highest returns coming from the smallest firms with the highest default risk.  We then focus on the comparison between the Merton model (a financial model) and a logistic regression model (a statistical model) for default forecasting.  We compare the default likelihood indicator (DLI) from the Merton model with the estimated default probability from the logistic model using rank correlation and decile rankings based on out-of-sample prediction.  We find that the functional form of the Merton model is very useful in determining default, and that its structure captures important aspects of default probability.  However, if bankruptcy forecasting is desired, our empirical results show that the logistic model seems to provide better predictions.  We also add distance to default (DD) from the Merton model as a covariate in our best logistic model and find that it is not a significant predictor.
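For reference, the distance to default and the default likelihood indicator follow directly from the model's lognormal asset dynamics once the asset value and asset volatility are in hand (the full procedure backs these out iteratively from equity data, which is omitted here); the inputs below are illustrative.

```python
# Merton-model distance to default (DD) and default likelihood indicator (DLI).
# Assumes the asset value V and asset volatility sigma_V are already estimated;
# backing them out iteratively from equity values is omitted from this sketch.
import numpy as np
from scipy.stats import norm

V = 120.0        # market value of the firm's assets (assumed)
F = 80.0         # face value of debt due at the horizon (assumed)
mu = 0.06        # expected asset return (assumed)
sigma_V = 0.30   # asset volatility (assumed)
T = 1.0          # forecast horizon in years

dd = (np.log(V / F) + (mu - 0.5 * sigma_V**2) * T) / (sigma_V * np.sqrt(T))
dli = norm.cdf(-dd)   # probability that assets fall below the debt level at T

print(f"distance to default: {dd:.3f}")
print(f"default likelihood indicator: {dli:.4f}")
```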

Luke Robert Chapman, A Current Review of Electronic Medical Records, March 11, 2009 (Michael Magazine, Craig Froehle) 
In this project, we research the imminent installation of Electronic Medical Records (EMR) in all hospitals and clinics throughout the United States. This project was motivated by our interaction with the Cincinnati Department of Health (CDH) via a project that focused on persuading the Cincinnati council that EMR should be immediately invested in at all six of the CDH clinics. We review the advantages of EMR and also recognize the disadvantages, some of which were overlooked in the original project with CDH. The current growth of Electronic Medical Records in the US and what the future holds for EMR is reviewed. The main analysis will review the claim that EMR helps to reduce medical errors. The analysis will use multivariate techniques such as factor and cluster analysis.

Chetan Vispute, Improving a Debt-Collection Process by Simulation, March 9, 2009 (David Kelton, Norman Bruvold)
The Auto-Search Process is an automated business-process flow designed by Sallie Mae for its in-house collection agency; it works sequentially to procure good phone numbers for delinquent borrowers.  The process involves outsourcing data to private vendors, with the failed records from one vendor sent to the next until all vendors have been tried.  The process is also governed by time-related business rules that allow the data to be sent to the next vendor only after a certain period.  By keeping the cheapest vendor first, the process aims to reduce cost while increasing the procurement of good phone numbers.  Before this process could go live, the analytical team needed to analyze it by building a time-related model and making recommendations.  This thesis explores the building of this time-based model using dynamic discrete-event simulation with Arena, and then discusses the findings and recommendations developed while working on the project, which helped the company improve its annual revenue position by over $440,000.

Cary Wise, Cincinnati Children's Hospital Block-Schedule Optimization, February 10, 2009 (Kipp Martin, Craig Froehle, Michael Magazine)
Cincinnati Children's Hospital is implementing an automated process to schedule clinical and surgical patient visits.  The goal is to create a program that allocates operating rooms to requests submitted by individual doctors for clinical time and surgical time.  The schedule-creation process takes place in two phases: the first phase schedules spaces for specialties (Ortho, Cardio, etc.); the second phase allocates doctors to the specialty schedule.  The program that generates the specialty allocation is named the Space Request Feasibility Solver (SRFS).  The inputs of the SRFS are a set of specialty requests and information about the operating rooms; the output is the schedule of specialty assignments.  The problem is formulated as a mixed-integer linear program (MILP) that minimizes the number of unfulfilled space requests.  A very large number of potential assignments may be generated, depending on whether the request parameters are very specific or general; indeed, the instance quickly becomes intractable for a realistic problem.  We implement a branch-and-price column-generation algorithm to overcome the problem of an intractable number of variables.  The SRFS invokes a COIN-OR solver named "bcp" to perform the procedures of branching, solving the LP at each node, and managing the search tree.  The scope of this master's project is to implement a column-generation scheme in the SRFS.  Testing of the SRFS was performed by verifying that the column selected had the minimum reduced cost and by verifying the results of the LP relaxation and IP against the solution from exhaustive enumeration of all columns.  The performance of the SRFS, in terms of the number of columns and nodes created to arrive at a solution, was also investigated.

 

2008

F. Alan Shukairy, NFL Fourth Down Decision Making: 2002 - 2007, November 21, 2008 (Michael Fry, James Deddens, Richard Males)
This paper uses categorical data analysis and logistic regression to explore National Football League (NFL) fourth-down decision making using data from the 2002 through 2007 seasons.  The focus of the analysis is on game situations where Romer's (2006) dynamic programming model predicts that teams should go for it.  The likelihood of going for it on fourth down is examined, including factors such as game time, score differential, yards to go, and field position.  The impact of going for it on game outcome is also reviewed.  Conversion rates and play calling for both third and fourth downs are examined.  The impacts of home-field advantage and momentum, the latter defined as an increase in the probability of scoring after a successful fourth-down conversion, are also considered.  Results indicate that teams deviate from Romer's optimal policies.  We also find that teams employ play calling that appears contrary to what would maximize the likelihood of a successful conversion.  We find that the home-field advantage is real but that the home team's advantage decreases as the game goes on.  We also find some momentum benefits with fourth-down conversions.

Fred Ahrens, A Build to Forecast Model using Real Options, September 26, 2008 (Amitabh Raturi, Jeffrey Camm)
The Build to Forecast (BTF) production strategy described by Meredith, Camm, Raturi, et al. is a response mechanism to the divergent requirements of a long-lead-time product with high customization and a short customer-accepted lead time. The BTF model, developed in the 1990s, addresses this challenge by initiating production of a product prior to receipt of actual sales-order functional requirements. The 'Build to Forecast with Real Options' strategy proposes to achieve the same objective, while also increasing engineering and procurement flexibility, by delaying fundamental design-intent decisions. Using the original BTF model as an inspiration, the new model defers both component allocation and final design configuration until late in the build cycle. This is achieved by using 'real options', an adaptation of the investment concept to an operational environment. Options enable, but do not obligate, the selection of a product design parameter at a later point in time. While the original BTF model selects components based on a forecast and then attempts to match product to sales orders, the new model selects only options, based on a forecast, that allow components to be added later if the option is enabled. The specific component would then be selected based on an actual sales order, as in a traditional Make to Order model. This paper describes the original BTF model and studies the new BTF with Real Options concept.

Linda Kay Kromer, Design and Development of a Data Acquisition Application for CCHMC Scheduling Optimization, August 25, 2008 (Michael Magazine, Kipp Martin)
Rising health care costs continue to increase the demand for greater efficiency, creating tighter constraints on physical and human resources. Cincinnati Children's Hospital Medical Center (CCHMC) is trying to address its existing scheduling inefficiencies while also addressing the additional resource demands of a new satellite location. CCHMC currently uses a manual scheduling method based on legacy schedules, believed to be optimal and maintained by each distinct specialty. As part of an ongoing project with UC MS - Business Analytics faculty, several students have attempted to optimize the scheduling process across all locations and specialties. However, the only request data available are for the new location; all other data consist of schedules, which do not allow for optimization. The problem addressed here is the acquisition of request data that allow for optimization. The data obtained will include each specialty's requests for clinic space and surgical space as well as doctors' individual requests. These data will provide the opportunity for a feasible, and possibly optimal, allotment of physical and human resources. Further, the application must be user-friendly and PC-based in order to gain administrative buy-in.

Weiqun Wu, An Empirical Study of Corporate Bankruptcy Prediction Using Hazard Models, August 25, 2008 (Yan Yu, Martin Levy)
This project investigates proportional hazards model approaches to corporate bankruptcy prediction using multi-period accounting data. One of the critical issues in the use of bankruptcy prediction models is poor out-of-sample forecasting accuracy, especially when the bankruptcy rate is extremely low. Recently developed corporate bankruptcy prediction models adopt Cox proportional hazards analysis to create dynamic models that incorporate time-dependent covariates. Shumway (2001) developed a simple hazard model; because its maximum-likelihood estimates can be computed with logit estimation, the hazard model can be interpreted either as a logit model fit on firm-year observations or as a discrete accelerated failure-time model. In this project, we investigate US IT-sector data obtained from the COMPUSTAT database annually from 1986 through 2006. We find that both the traditional Cox proportional hazards model and Shumway's hazard model estimated with the logistic regression approach perform well with time-dependent covariates in dynamic models and yield nearly identical estimated coefficients. Shumway's method also performs well in forecasting out-of-sample data. Incorporating categorical variables in the Cox model and a baseline hazard function in Shumway's model is also explored.
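A minimal sketch of the firm-year logit formulation of the discrete-time hazard model, with a hypothetical panel file and ratio names:

    # Stack one observation per firm per year and fit a logit whose response
    # indicates bankruptcy in that year (the idea attributed to Shumway, 2001).
    import pandas as pd
    import statsmodels.formula.api as smf

    panel = pd.read_csv("firm_years.csv")    # columns: firm, year, bankrupt, ratios
    hazard = smf.logit("bankrupt ~ ni_ta + tl_ta + rel_size + excess_return",
                       data=panel).fit()
    print(hazard.summary())

    # out-of-sample check: fit on early years, score later firm-years
    train, test = panel[panel.year <= 2000], panel[panel.year > 2000]
    fit = smf.logit("bankrupt ~ ni_ta + tl_ta + rel_size + excess_return",
                    data=train).fit()
    predicted_hazard = fit.predict(test)     # one-year bankruptcy probabilities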

Luchan Byrd III, Statistical Analysis of Testing and Production and Yield-To-Complete Data for Reactors at SUMCO, August 25, 2008 (Uday Rao, James Evans)
SUMCO Phoenix Corporation (SUMCO) manufactures electronic-grade silicon wafers for the semiconductor industry and employs about 1,500 people at three manufacturing facilities in the United States. The company is currently analyzing components of enhanced production planning systems at the Cincinnati facility to optimize the scheduling of its reactors, high-value assets used to deposit a thin film of material on silicon wafers. The scheduling problem is multifaceted: dependencies exist based on the relationships of part number (finished product type) to reactor type, reactor model, reactor capability, availability of material, the qualification process, and several other factors. This project focuses on the analysis of recent testing and production data to help determine drivers of delivery performance. More specifically, batch testing and production data are analyzed using descriptive statistics to determine whether there are differences in performance among reactors, product types, work shifts, testing reason codes, etc. Additional analysis was performed to determine yield-to-complete (YTC) statistics for each stage of production. Because the data lack normality, the nonparametric analogue to ANOVA, the Kruskal-Wallis test, is used to determine whether certain identified differences are statistically significant. Results indicate that different reactors show significant differences in the amounts of finished product and scrapped parts. Knowledge of these differences, and of how to schedule particular product types (parts) on the reactors that are most efficient at producing them, can lead to improved machine maintenance, decreased scrap, and increased cost savings.
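A minimal sketch of the Kruskal-Wallis comparison described above, with a hypothetical batch file and columns:

    # Nonparametric test of whether finished-product yield differs across reactors.
    import pandas as pd
    from scipy.stats import kruskal

    batches = pd.read_csv("reactor_batches.csv")     # columns: reactor, yield_pct
    groups = [g["yield_pct"].values for _, g in batches.groupby("reactor")]
    stat, p_value = kruskal(*groups)
    print(f"H = {stat:.2f}, p = {p_value:.4f}")      # small p: yields differ by reactor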

Andrew Faehnle, Estimation of Radiology Patient Wait Times, June 3, 2008 (Craig Froehle, Michael Magazine) 
Making accurate estimates of patient waiting times within the Radiology queue at Cincinnati Children's Hospital is a non-trivial exercise. Herein three different methods for predicting the time a patient will wait are investigated: a heuristic “Simple Algorithm”, linear regression, and iterated logistic regression. We find that while the iterated logistic regression performs the best of the approaches tested, the performance of the approaches depends on the homogeneity of the dataset.

Zhiyuan Dong, A Matrix Approach for Comparing Estimates of a Population Total Under a Many-to-Many Frame Structure, April 11, 2008 (Martin Levy, Yan Yu) 
We propose a matrix approach for comparing estimates of a population total under a many-to-many frame structure, an improved method for calculating the second-order inclusion probabilities of the Horvitz-Thompson estimator in this many-to-many context, an improved method for characterizing the eigenstructure of the Arc-Weight method, and a Mathematica-based package for performing the corresponding analyses.

Praveen Singaraju, How the Plant Closing Announcement Affects the Stock Price of a Firm, March 31, 2008 (Amitabh Raturi, Uday Rao) 
Plant closings are widespread throughout the US economy; the affected businesses are not limited by industry, size, or any other factor. This work tries to understand the impact of plant-closing announcements on the stock market. We propose that there are two antithetical perceptions of a plant-closing announcement: sometimes the market sees it as positive news and sometimes as negative. The results tend to support an overall negative stock-market reaction; at the same time, firms that experience a positive effect possess certain identifiable characteristics. We find that all the price inclines are associated with optimistic announcements and all declines with pessimistic announcements. By examining the quarterly financial statements of all companies, we identify the variables that best discriminate between the inclines and the declines. The results validate the argument that there are indeed two types of plant closings.

Sangeetha Mallya, Applied Bayesian Forecasting of U.S. Medicaid Program Expenditure on Antidepressant, March 4, 2008 (Martin Levy, Jeff Guo, Christina Kelton)
Expenditure on mental health drugs, especially prescription medicines for depression, has been rising steadily. Depression is among the most prevalent major mental disorders today, with about 10% of the US population suffering from it. The Social Security Act established Medicaid as a jointly funded, federal-state health insurance program. Medicaid plays a fundamental role in the provision of prescription drugs to over 42 million low-income and disabled beneficiaries. The state Medicaid programs together spent approximately $2 billion on antidepressant drugs in the US in 2005, across three categories of antidepressants. To better understand this spending and to safeguard the Medicaid program from excessive expenditure on mental health drugs, state-of-the-art forecasting models can be of great aid. Here, we focus on exploring, building, and interpreting forecasting models for Medicaid's expenditure using applied Bayesian modeling methodology. The synthesis of routine model output with dynamic assimilation of external information is the centerpiece of Bayesian forecasting. Further, a comparative assessment of the forecasts is performed against prior results from classical time-series models. The results from these forecasting processes can be leveraged by Medicaid for research, planning, optimization, and inferential purposes.

Rudranil Manna, Development of a Predictive Model for Food Consumption in USA, March 3, 2008 (Norman Bruvold, David Rogers)
The accuracy of predictions of household expenditure on food is a major concern for retailers and manufacturers engaged in food-marketing campaigns. The purpose here is to develop a model to predict household spending in the major food categories based on geographic location and household demographics. The modeling is done with the Consumer Expenditure Diary Survey data obtained from the public domain of the US Department of Labor. A mixed-modeling methodology is adopted, combining fixed effects of the socio-economic characteristics of the household with household-specific random intercepts. The model takes into account the correlation between household expenditures across the different food categories. Finally, the model predictions are benchmarked against a univariate tobit regression model widely used in the literature for similar predictions of household food consumption.
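A minimal sketch of a mixed model with household-specific random intercepts in the spirit of the approach described above (statsmodels MixedLM; the file and column names are hypothetical):

    import pandas as pd
    import statsmodels.formula.api as smf

    diary = pd.read_csv("diary_survey.csv")  # one row per household x food category
    md = smf.mixedlm(
        "log_spend ~ income + family_size + region + C(food_category)",
        data=diary,
        groups=diary["household_id"],        # random intercept for each household
    )
    result = md.fit()
    print(result.summary())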

Qiuhong Zhang, Empirical Verification of Optimal-Portfolio-Based Foreign Exchange Rate Theory, February 29, 2008 (Srdjan Stojanovic, Yan Yu) 
The recent optimal-portfolio-based foreign exchange rate theory, introduced by S. Stojanovic in Foreign exchange rates, is implemented and verified using market data for the economies of Canada, Japan, the UK, and the US. The key parameter in the implemented theory is the market (relative) risk-aversion parameter, also interpreted as the market sentiment. Therefore, one of the main goals of this empirical study was to estimate the value of the relative risk-aversion parameter for the pairs of economies considered and to determine whether it takes the same or a similar value for all of them. Finally, the statistical hypothesis of whether the foreign exchange rate data conform to the theoretical model is tested as well.


2007

Feng Yu, A Simple Discrete-Time Hazard Model for Forecasting Bankruptcy in Construction Companies, December 19, 2007 (Martin Levy, Jeffrey Camm, Uday Rao)
The construction industry has played a powerful role in sustaining economic growth and aiding economic recovery. The industry is inherently fragile and extremely risky, and the failure of construction firms has had a serious impact on the economy and society. Consequently, predicting the failure of construction firms is important not only for the economy but also for society. To date, many bankruptcy prediction models have been developed to predict the probability of failure of construction firms based on company financial information and economic information. However, these models have limitations and disadvantages for one reason or another, which are reviewed in this study. There is a need for prediction models capable of forecasting long-term failure for construction firms of different sizes. In this study, a discrete-time hazard model is proposed to predict the probability of bankruptcy for construction firms over a long time frame. The research is based on a statistical analysis of healthy and bankrupt construction firms and related financial and economic data over a time frame of about 10 years. A prediction model using survival analysis is developed through this study.

Balkrishna Apte, Worldwide Desktop Computer Supply Chain Complexity and Performance Models for the Hewlett Packard Company, November 30, 2007 (David Rogers, Amitabh Raturi, Michael Stephenson)
This project quantifies supply-chain complexity for different business regions across the world for the personal computer desktop business and relates it to supply-chain performance parameters. Regional supply-chain performance is consolidated and quantified with parameters for order cycle time, forecast accuracy, inventory cost, and excess and/or obsolescence. Statistical techniques are used to determine whether there is a correlation between product-line complexity and key supply-chain performance measures. Statistical models indicate the impact of changes in supply-chain complexity on various supply-chain performance parameters. The results provide management with guidelines for determining the impact of product-line complexity on various supply-chain performance measures and, ultimately, on profit. Changes to decisions regarding the offering of additional products, informed by the impact of complexity, are also posited.

Jeremy Scheidt, Clinical and Surgical Scheduling Across Multiple Facilities Using Integer Linear Programming, November 28, 2007 (Michael Magazine, Craig Froehle, Jeffrey Camm)
Rising health care costs are a complicated issue. Health care organizations must perform a delicate balancing act between scarce resources and high standards for care and service. Large-scale operations can have several advantages for efficiency and service, but coordinating so many resources using manual methods can be cumbersome and time-consuming and carries the risk of being less than optimal. Scheduling doctors at several facilities in a metropolitan area is an example of such a problem. Cincinnati Children's Hospital has several locations that share many resources, such as doctors and administrators. The problem considered here is how to efficiently coordinate the scheduling of doctors at various facilities for consistency in quality and service while minimizing the already heavy demands on personnel. The proposed model uses integer programming to choose the schedule that best meets the multiple objectives of a good schedule in this situation. It handles a wide variety of scheduling requests in an automated manner that reduces manual work, minimizes the number of schedule requests that cannot be met, minimizes travel between facilities, minimizes the changes required to accommodate ongoing schedule updates, and provides a consistent space for each doctor to use.
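A toy integer program in the spirit of the model described above is sketched below using the PuLP library; the doctors, slots, requests, and penalty weights are made up, and this is not the hospital's actual formulation.

    # Assign doctors to facility/half-day slots, penalizing unmet requests and
    # travel between facilities.
    from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary

    doctors = ["dr_a", "dr_b"]
    facilities = ["main", "satellite"]
    periods = ["mon_am", "mon_pm"]
    requests = {"dr_a": 2, "dr_b": 1}        # half-day clinic sessions requested

    prob = LpProblem("clinic_schedule", LpMinimize)
    x = LpVariable.dicts("assign",
                         [(d, f, p) for d in doctors for f in facilities
                          for p in periods], cat=LpBinary)
    unmet = LpVariable.dicts("unmet", doctors, lowBound=0)
    travel = LpVariable.dicts("travel", doctors, cat=LpBinary)

    # objective: an unmet request is much worse than a facility change
    prob += 10 * lpSum(unmet.values()) + lpSum(travel.values())

    for f in facilities:                     # at most one doctor per room slot
        for p in periods:
            prob += lpSum(x[(d, f, p)] for d in doctors) <= 1
    for d in doctors:
        for p in periods:                    # a doctor is in one place at a time
            prob += lpSum(x[(d, f, p)] for f in facilities) <= 1
        # shortfall relative to the doctor's request
        prob += unmet[d] >= requests[d] - lpSum(
            x[(d, f, p)] for f in facilities for p in periods)
        # travel[d] must be 1 if the doctor appears at both facilities
        for p1 in periods:
            for p2 in periods:
                prob += x[(d, "main", p1)] + x[(d, "satellite", p2)] <= 1 + travel[d]

    prob.solve()
    for key, var in x.items():
        if var.value() == 1:
            print(key)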

Feng Ji, An Introduction to Credibility Theory With An Actuarial Frequency Case Study, November 21, 2007 (Martin Levy, Jeffrey Camm, Yan Yu)
Credibility theory is a set of quantitative tools that allows an insurer to perform prospective experience rating (adjusting future premiums based on past experience) on a risk or group of risks. A manual rate is designed to reflect the expected experience of the entire rating class and implicitly assumes that the risks are homogeneous. However, no rating system is perfect, and some heterogeneity in risk levels always remains after all the underwriting criteria are accounted for. Credibility theory provides models that compromise between the historical observations and the manual rate, yielding a more credible premium. In this paper, three classic credibility approaches are discussed: the Bayesian methodology, Bühlmann credibility, and nonparametric empirical credibility. A case study with actual claim experience from Humana Inc. then shows that credibility premiums outperform both the manual rate and the estimate based on the historical observations alone.
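A minimal sketch of the Bühlmann (nonparametric empirical) credibility calculation, using a made-up claims matrix rather than the Humana data:

    # Credibility premium = Z * (group mean) + (1 - Z) * (overall mean),
    # with Z = n / (n + k) and k = EPV / VHM.
    import numpy as np

    # rows = risk groups, columns = years of observed average claims (made up)
    claims = np.array([
        [312.0, 295.0, 330.0, 301.0],
        [410.0, 398.0, 425.0, 440.0],
        [205.0, 240.0, 215.0, 228.0],
    ])
    n = claims.shape[1]                       # years of experience per group

    group_means = claims.mean(axis=1)
    overall_mean = claims.mean()

    epv = claims.var(axis=1, ddof=1).mean()   # expected process variance
    vhm = group_means.var(ddof=1) - epv / n   # variance of hypothetical means
    vhm = max(vhm, 1e-12)                     # guard against a negative estimate

    k = epv / vhm
    Z = n / (n + k)                           # credibility factor
    credibility_premiums = Z * group_means + (1 - Z) * overall_mean
    print(f"Z = {Z:.3f}")
    print(credibility_premiums)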

Yanping Chen, A Case Study on the Linear Modeling Fitting with Outlier, November 14, 2007 (Martin Levy, Norman Bruvold, Jeffrey Camm)
In applications of ANOVA for hypothesis testing, assumptions such as homogeneity of error variance and normality are often violated because of scale effects, the design of the experiment, outliers, and the nature of the measurements. This experiment deals with the design and statistical analysis of the balance-control capability of obese workers. Functional Reach (FR) is a measure of how far a person can reach without losing balance. The hypothesis is that obese workers, because of their larger body mass, may not be able to reach as far as non-obese people without losing balance. In addition to obesity level (obese and non-obese), gender is chosen as a second primary factor in the hypothesis testing. However, plots of the residuals from fitting the 2x2 ANOVA show heteroscedasticity, apparently because one subject is an outlier. Remedial measures are applied to cure the heteroscedasticity, including removal of the apparent outlier; log, square-root, inverse, and Box-Cox transformations; evaluation of model adequacy and inadequacy; verification; and the rank ANOVA. The consequences of these techniques are compared, and the ANCOVA model succeeds in reducing the variance and removing the heteroscedasticity for the hypothesis testing.
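A minimal sketch of one remedial measure mentioned above, estimating a Box-Cox transformation of the response (the data values are illustrative only):

    import numpy as np
    from scipy.stats import boxcox

    functional_reach = np.array([14.2, 15.1, 9.8, 13.4, 12.9, 30.5, 11.7, 14.8])
    transformed, lam = boxcox(functional_reach)   # requires positive responses
    print(f"estimated lambda = {lam:.2f}")        # lambda near 0 suggests a log transform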

Yann Ferrand, Forecasting U.S. Medicaid Program Expenditures on Antidepressant Drugs, November 14, 2007 (Christina Kelton, Jeffrey Guo, Martin Levy, Yan Yu)
Healthcare costs and drug prices have been on the rise, and the state Medicaid programs together spent approximately $2 billion on antidepressant drugs in 2005. Our goal is to build forecasting models that can be used to predict U.S. Medicaid's future spending on antidepressants. We gather quarterly data (1991-2004, Centers for Medicare & Medicaid Services) on Medicaid national antidepressant expenditure. We use Box-Jenkins forecasting techniques on expenditure time series for specific antidepressants including Prozac®, Zoloft®, Wellbutrin®, Paxil®, Effexor®, and amitriptyline. Intervention analysis is used to determine the effects of patent expiration, new branded-drug entry, and new indication approval. Forecasts are computed and compared to a holdout sample, comprising the 2005 data, to assess the performance of the models. The Prozac® and Paxil® models incorporate an intervention term corresponding to patent expiration. The model for Wellbutrin® has a pulse-with-decay intervention term for the increase in direct-to-consumer advertising. The model for Zoloft® has an autoregressive factor, and the model for Effexor® has both an autoregressive and a moving-average factor. For amitriptyline, the final model is a random walk. Maximum likelihood was used for estimation, and the usual residual checks proved satisfactory. We find that the drugs studied are affected differently by generic entry, and we found no effect of either new branded-drug entry or newly approved indications.
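A minimal sketch of an ARIMA model with a step-type intervention term, in the spirit of the analysis above; the series, file name, and patent-expiration date are hypothetical:

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    spend = pd.read_csv("drug_expenditure.csv", index_col="quarter",
                        parse_dates=True)["expenditure"]
    step = (spend.index >= "2001-07-01").astype(float)   # patent-expiration step dummy

    model = sm.tsa.SARIMAX(spend, exog=step, order=(1, 1, 0))
    res = model.fit(disp=False)
    print(res.summary())

    # forecast the next four quarters (the step dummy remains 1 after expiration)
    forecast = res.get_forecast(steps=4, exog=np.ones((4, 1)))
    print(forecast.predicted_mean)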

Claudia Rosales, Optimal Inbound Trailer Allocation at a Crossdock - Optimizing Operations and Balancing Workload, August 29, 2007 (Michael Fry, Jeffrey Camm, Rajesh Radhakrishnan)
Transfreight, LLC is a third-party logistics provider that supports Toyota's lean manufacturing operations in North America. Our work provides the optimal allocation of inbound trailers to docks at a crossdocking facility operated by Transfreight. We focus on improving the efficiency of operations as well as balancing workload among crossdock workers. We compare two different implementation tools for our models: a spreadsheet-based solver and CPLEX. Since 2006, Transfreight has successfully used our implementation model for its inbound trailer assignments, leading to considerable cost savings and growth opportunities.

Bhaskar Narayanaswamy, Impact of Interruption and Forgetting in a Knowledge-Intensive Environment on Productivity, August 21, 2007 (Craig Froehle, Jeffrey Camm, Uday Rao)
With the rise of the telephone, email, and ubiquitous connectivity, one increasingly common barrier to productivity in professional and knowledge-intensive environments is interruptions. Interruptions cause stoppage of the current task and often induce forgetting on the part of the worker. Beyond the direct delay caused by the interruption, the induced forgetting also causes rework; in order to complete the interrupted task, additional effort and time are required to return to the same level of task-specific knowledge the worker had attained prior to the interruption. Together, these phenomena (interruptions, forgetting, and rework) create significant barriers to productivity in knowledge-intensive work environments. In service environments, interruptions pose an especially significant problem due to the "interruption conundrum" of facing negative consequences from both ignoring and accommodating interruptions. When customer relationships are damaged both by addressing and by ignoring a potential interruption, there is no obvious best recourse. This research employs observational and process data gathered from a hospital radiology department as inputs into a simulation model in order to better understand the impact of interruptions, forgetting, and rework. To help mitigate the deleterious effect of interruption-induced rework, we introduce and test the operational policy of sequestering, in which one of the service resources is protected from interruptions. Our results suggest two key conclusions. First, sequestering can improve overall productivity and cost performance of the system, but the decision to implement a sequestering policy must consider the costs associated with delaying both interruptions and production work, as well as the forgetting rate of the system's human workers. Second, if interruption-induced forgetting is not explicitly considered, the model's results tend to substantially underestimate the benefits of a sequestering policy.

Hsin-Chih Kao, Asymmetric-Response Study among Stock Markets of South Korea, Japan, China, and the US, July 9, 2007 (Martin Levy, Norman Bruvold, David Kelton, Weihong Song)
This project investigates whether asymmetric responses exist among the stock-price indices of South Korea, Japan, and China. Magnitude asymmetry and pattern asymmetry are the two main foci of the project and are tested using regression analysis and vector autoregression (VAR) models, respectively. The main findings are as follows: first, magnitude asymmetry exists in the effect of the Japanese index on the South Korean index; second, by analyzing impulse-response functions derived from the VAR models, we find that pattern asymmetry exists among the three Asian stock indices. When the possible US effect is accounted for in the analysis, the results show that movements in US stock-index returns influence those of South Korea and Japan, but not those of China.
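A minimal sketch of the VAR and impulse-response analysis described above, with hypothetical return series and column names:

    import pandas as pd
    from statsmodels.tsa.api import VAR

    returns = pd.read_csv("index_returns.csv", index_col="date", parse_dates=True)
    # columns assumed: kospi, nikkei, shanghai (daily index returns)
    model = VAR(returns[["kospi", "nikkei", "shanghai"]])
    res = model.fit(maxlags=5, ic="aic")
    print(res.summary())

    irf = res.irf(10)          # impulse responses over 10 periods
    irf.plot(orth=True)        # orthogonalized responses, e.g. Nikkei -> KOSPI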

Hua Zou, Developing a Predictive Model for Targeting Potential Donors: Application of Logistic Regression, Classification Trees, and Support Vector Machines in Analysis of Responses to Direct Mailing, May 29, 2007 (Yan Yu, Martin Levy, David Kelton)

Direct-mail campaigns are employed as a core marketing strategy by various organizations, from catalogue-order companies and direct retailers to credit-card and insurance institutions. Because the response of a given random selection of prospects is uncertain, many data-mining techniques are used to target good prospects and improve the likelihood of response. In this study, we compare the performance of models built with binary logistic regression, classification trees, and support vector machines (SVMs), and show that lift and gain tables are better suited than ROC curves and the area under the curve (AUC) for identifying the best model and selecting the target size, because they take profit and gain into account. Finally, support vector machines stand out from the other classification algorithms for understanding customer behavior and maximizing profit in this case.
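A minimal sketch of the lift/gain calculation emphasized above: rank prospects by predicted probability, cut into deciles, and compare each decile's response rate with the overall rate (inputs are simulated for illustration):

    import numpy as np
    import pandas as pd

    def lift_table(y_true, y_score, n_bins=10):
        df = pd.DataFrame({"y": y_true, "score": y_score})
        df["decile"] = pd.qcut(df["score"].rank(method="first"),
                               n_bins, labels=False)
        df["decile"] = n_bins - df["decile"]          # 1 = highest-scoring decile
        overall = df["y"].mean()
        table = (df.groupby("decile")["y"]
                   .agg(responders="sum", rate="mean", n="count")
                   .sort_index())
        table["lift"] = table["rate"] / overall
        table["cum_gain"] = table["responders"].cumsum() / df["y"].sum()
        return table

    # example with simulated response flags and model scores
    rng = np.random.default_rng(0)
    y = rng.binomial(1, 0.05, size=20_000)
    score = y * rng.normal(0.6, 0.2, 20_000) + (1 - y) * rng.normal(0.4, 0.2, 20_000)
    print(lift_table(y, score))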

Raja Nooti, Analyzing Search-Engine Server Patterns, May 25, 2007 (David Kelton, Jeffrey Camm, Uday Rao)
This paper deals with resequencing server patterns in a search engine, with the objective of increasing resource utilization and decreasing the time taken per query in the search process.  A query is a request for information from a database.  A server is a computer that holds information and responds to requests for it (based on the query).  Server patterns refer to the allocation of queries to servers based on query type or frequency.  The problem is motivated by the highly competitive search-engine market, where every second saved matters and there are many potential ways to improve the search process.  A base search-engine model is simulated in Arena with a real-world time-distribution input to reflect current search engines' server patterns.  Real times are obtained from AOL search logs to develop the model as accurately as possible.  Building on this base, an alternate model is developed that incorporates logical constraints on query flow to improve resource utilization and reduce the time taken per search.  In addition to being amenable to implementation, this remodeled scenario has several significant advantages over the base scenario, all of which are analyzed.  Furthermore, a new model featuring further possible enhancements is developed and analyzed, and it proves to be more effective than the remodeled scenario.

Guoxiang Xu, What Factors Explain Investor Sentiment?, March 1, 2007 (Brian Hatch, Martin Levy, David Kelton)

The sentiment index recently reported by Baker and Wurgler (2006) reveals dramatic cross-sectional performance patterns in stock returns based on a variety of factors such as firm size (ME), earnings-to-book ratio (E/BE), book-to-market ratio (BE/ME), and sales growth (GS).  When the sentiment index is negative, subsequent returns are relatively high on small stocks, young stocks, high-volatility stocks, unprofitable stocks, non-dividend-paying stocks, extreme-growth stocks, and distressed stocks.  Because the sentiment index has some ability to forecast stock returns, it would be valuable to know whether any factors explain the sentiment index itself.  Initial efforts reveal that macroeconomic factors have little correlation with the sentiment index; however, lagged equal-weighted stock index (EW) returns have a strong correlation.  Equal-weighted stock index returns annualized over the previous six years (EWP6Y) explain a majority of the variance of the sentiment index.  I discuss two possible explanations for this phenomenon: the business cycle and how fund management is evaluated.  For further investigation, I use a Hodrick-Prescott filter to decompose the sentiment index into a general trend and the deviation from that trend.  My analyses reveal that the trend and the deviation are composed of different groups of the six variables initially used to synthesize the sentiment index.  Logistic regression reveals that EWP6Y has strong predictive power for the sign of the sentiment index.
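A minimal sketch of the Hodrick-Prescott decomposition used above (the sentiment file is hypothetical; lamb = 14400 is the conventional smoothing value for monthly data):

    import pandas as pd
    import statsmodels.api as sm

    sentiment = pd.read_csv("sentiment_index.csv", index_col="month",
                            parse_dates=True)["sentiment"]
    cycle, trend = sm.tsa.filters.hpfilter(sentiment, lamb=14400)
    decomposed = pd.DataFrame({"trend": trend, "deviation": cycle})
    print(decomposed.head())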


2006

Prido Lumbantoruan, Univariate and Multivariate Time Series Modeling Application on the Unseasonally Adjusted US Index of Industrial Production, December 8, 2006 (Martin Levy, Norman Bruvold, David Kelton)
The Index of Industrial Production is an important indicator that can be used as a barometer of a country's economic activity.  In this project we used three monthly economic series, the S&P 500 index (SP), the unemployment rate (UR), and the money stock measure (M2), as input series to model the Index of Industrial Production (IP).  Two multivariate frameworks, a dynamic regression with a transfer function and a multiequation time-series model, were built to model the Industrial Production Index.  Dynamic regression and multiequation time-series models are immensely useful for examining the relationships of past values of multiple time series with each other.  Additionally, a univariate time-series model was built using the Box-Jenkins method as a baseline for comparison with the multivariate models' results.

Jonathan Healey, A New Model for the Cost- and Priority-Based Carrier-Selection Problem, November 29, 2006 (Jeffrey Camm, Michael Fry, David Rogers)
The three-dimensional bin-packing problem is to pack all or a subset of boxes into one or more bins.  In the three-dimensional single-bin packing problem, the objective is to minimize the wasted bin volume; in the three-dimensional multiple-bin packing problem, the objective is to minimize either some type of bin cost or the number of bins used.  Bin-packing problems have important theoretical and practical implications.  On the theoretical side, they have challenged computer scientists and discrete mathematicians for decades because they are NP-hard, and there is no universal algorithm that finds the exact solution in a reasonable amount of time.  For this reason, heuristics have been developed to find approximate solutions more quickly, some of which I describe.  On the practical side, they have many applications in industry, including scheduling and loading cargo into trucks (Sweep 2004).  However, many approaches to the three-dimensional bin-packing problem that appear in the operations-research literature are applicable to only a portion of the situations encountered in practice because of their assumptions.  The objective of this work is to develop a model for the cost- and priority-based carrier-selection problem and to determine the problem sizes that are solvable to optimality in a reasonable amount of time.  This model is a more practical approach to the three-dimensional bin-packing problem and builds upon a model by Chen, Lee, and Shen (1995).  After showing how the new model works, I present results, observations, and statistical analyses from testing it.
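As an illustration of the kind of bin-packing heuristic referred to above, the sketch below implements first-fit decreasing for the simpler one-dimensional case; it is not the 3-D carrier-selection model developed in the paper:

    def first_fit_decreasing(item_sizes, bin_capacity):
        bins = []                                   # remaining capacity per open bin
        packing = []                                # items assigned to each bin
        for size in sorted(item_sizes, reverse=True):
            for i, remaining in enumerate(bins):
                if size <= remaining:               # first open bin it fits into
                    bins[i] -= size
                    packing[i].append(size)
                    break
            else:                                   # no open bin fits: open a new one
                bins.append(bin_capacity - size)
                packing.append([size])
        return packing

    print(first_fit_decreasing([4, 8, 1, 4, 2, 1, 7, 3], bin_capacity=10))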

Rossana Bandyopadhyay, A Two-Stage Newsvendor Problem for a Call Center with Downward Substitution, October 12, 2006 (Amitabh Raturi, Uday Rao, Jeffrey Camm)
The call-center industry faces a persistent demand problem: it is difficult to determine the inventory of seats that must be maintained to meet demand without an overflow of idle seats, and many studies have explored this.  Call centers also have different seat types, and customers are specific to certain seat types, so while one seat type experiences idle hours, another may face a demand surge and be unable to fulfill all customer calls.  This paper explores how revenue may be affected when we allow substitution of seats between classes.  We evaluate the case of a two-stage call center offering high-service-level and low-service-level seats to customers.  Upward substitution of seats for callers is generally not a concern; we explore how downward substitution of seats can affect overall revenue.  An integer-programming model is first created to define the process and identify the parameters.  Three scenarios are presented for studying the problem: the first determines the effect of demand variance on the level of substitution; the next evaluates how downward substitution may vary with the relative profit rate between the seat classes; finally, the influence of the differential target service level is evaluated.  We use simulation with Crystal Ball to evaluate the process.  From the results, we conclude that downward substitution does contribute to increasing the overall revenue of the firm, so it is a viable option that can be considered.  We also find that downward substitution gives marginally decreasing returns, and hence we recommend that call-center managers set a policy on the extent of downward substitution a priori, based on the additional value generated by this flexibility and the marginal cost (such as goodwill losses in either market or the cost of transferring demand).
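A minimal sketch of the single-class newsvendor calculation that underlies models like the one above, with made-up cost and demand parameters:

    # The optimal capacity equates the critical fractile cu / (cu + co) with
    # the demand distribution.
    from scipy.stats import norm

    cu = 30.0        # underage cost: margin lost per unmet call-hour (assumed)
    co = 12.0        # overage cost: cost of an idle seat-hour (assumed)
    critical_fractile = cu / (cu + co)

    mu, sigma = 200.0, 35.0                      # demand for a seat class (assumed)
    optimal_seats = norm.ppf(critical_fractile, loc=mu, scale=sigma)
    print(f"critical fractile = {critical_fractile:.2f}, "
          f"optimal capacity ~ {optimal_seats:.0f} seats")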

Andrew W. Lundberg, Modeling a Sports Draft Using Dynamic and Linear Programming, August 25, 2006 (Michael Fry, Jeffrey Ohlmann [University of Iowa], Jeffrey Camm)

We model a professional sports draft using dynamic and linear programming.  Our goal is to determine the best drafting strategy for a team competing in a multiple-round sports draft.  We formulate the problem first as a stochastic dynamic program, using a team's needs at each player position and the current pool of available players as the state of the dynamic program.  However, this formulation is not generally solvable for reasonably sized problems.  Therefore, we introduce a number of additional assumptions and relaxations that result in a more tractable deterministic dynamic program.  To solve our models, we reformulate the problem as a linear program.  We develop an easy-to-use application in Microsoft Excel that allows the user to implement our algorithm to determine drafting strategies under a variety of conditions.  The application allows the user to change a number of parameters, including player rankings and valuations, the length of the draft, and the team's initial drafting needs.  We then compare our algorithm to several competing draft strategies by measuring the performance of each in a fantasy football draft for the 2005 season.  Our results indicate that our drafting strategy outperforms these competing strategies in every instance.

Rachel M. LaRosa, Optimal Sequencing in a Multiple Machine Job Shop, August 21, 2006 (Jeffrey Camm, Michael Fry, David Rogers)
In this paper I present an optimization of a specific deterministic Job Shop Scheduling Problem (JSSP).  The JSSP studied involves six machines performing a total of eight processes on ten jobs in a real-world company.  The schedule was obtained through a model developed with Premium Solver Platform Version 6.5 for Microsoft Excel.  Comparison with the current scheduling practices of this job shop revealed many points, including insights into bottlenecking and downtime of machines and operators.  As described in the literature, this type of problem is extremely hard and time-consuming to solve.  This model may be further developed in the future for implementation in the job shop's schedule planning.

Dongmei Yang, Comparison of Import Vector Machines with Support Vector Machines to Make Predictions in Marketing, July 21, 2006 (David Curry, Martin Levy, Yan Yu)
Many marketing problems require accurately predicting the outcome of a future event.  In today's business environment, analysts often face datasets with hundreds of variables related in complex ways, so that outcome classes are not linearly separable.  In the 1990s, the support vector machine (SVM) was developed for problems of this type; it uses kernel transformations to turn a highly nonlinear problem (in the original attribute space) into a linear problem in a higher-dimensional "feature" space.  The SVM performs well (Cui and Curry 2005, 2003) but is limited by the facts that it does not naturally produce probability estimates, it cannot be easily extended to multi-class problems, and it may be computationally "expensive," depending on the kernel selected.  In this project, we propose and test a new technique, the import vector machine (IVM), which also employs kernel transformations but overcomes the shortcomings of the SVM.  The IVM provides classification probability estimates, it naturally generalizes to the multi-class case, and it requires less computation than the SVM.  We compare the SVM and IVM using data from two sources: (1) a discrete-choice problem based on simulated data, and (2) a large-scale field study involving the prediction of the incidence of client repeat business in the marketing-research industry.  Each technique is also benchmarked against logistic regression.  Results indicate that the IVM performs (nearly) as well as the SVM on these problems and that both machine-learning techniques significantly outperform logistic regression.  Because the IVM provides class-membership probabilities, it leads to deeper understanding than the SVM in both problems.

Kartheek K. Reddy, Regression and Time Series Modeling of the United States Civilian Unemployment Rate, July 6, 2006 (Martin Levy, Norman Bruvold, David Rogers)
The unemployment rate (UER) is an important indicator of the economic performance of a country, and there are many ways of forecasting the UER.  Economic indicators such as the gross domestic product (GDP), the inflation rate (IR), the civilian labor force (LF), and the industrial production index (IPI) may have a statistically significant influence on the UER.  The relationships among various economic indicators were examined, and regression and time-series models were developed for the UER.  Ordinary least-squares regression was used to develop the regression model, and univariate autoregression (PROC AUTOREG) and multivariate vector autoregression (VAR) procedures were used to develop the univariate and multivariate time-series models, respectively.  The forecasting abilities of the regression, univariate, and multivariate time-series models were compared by performing static and dynamic forecasts of the UER.

Elena Bichescu, Bankruptcy Prediction using Logistic Regression and Multiple Imputation, June 29, 2006 (Martin Levy, Jeffrey Camm, Timothy Keyes [General Electric], Yan Yu)
Altman (1968) notes that bankruptcy represents a serious state of financial distress that not only affects the bankrupt company but also has negative social and macroeconomic ramifications.  In this context, models that can accurately predict the probability of a company filing for bankruptcy have wide applications, e.g., as criteria for bank loans and financial investments, financial turnaround measures, etc.  This work proposes the use of logistic regression models and multiple imputation techniques to predict bankruptcy.  Our analysis is based on a dataset created by the author that contains 165 companies, of which 55 have been declared bankrupt.  We formulate a logistic regression model where the bankruptcy state is a binary dependent variable and the predictors are continuous financial ratios.  Model building is performed on the dataset that results after applying listwise deletion to the initial input data.  The models thus obtained are then validated using two approaches: train/test, where the models are validated on separate test sets, and cross-validation.  The misclassification rates returned by our logistic regression models average around 10%, a performance similar to models proposed by Altman and Beaver.  The proposed logistic models show that among the best predictors of bankruptcy are financial ratios based on total or current liabilities and on total or current assets.  This result confirms both previous work by Altman and Beaver and the intuition that a company's financial health depends crucially on the delicate balance between assets and debt.
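A minimal sketch of imputation followed by logistic regression in the spirit of the study above, shown with scikit-learn's iterative imputer rather than a formal multiple-imputation pooling step; the data file and columns are hypothetical:

    import pandas as pd
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    firms = pd.read_csv("bankruptcy_ratios.csv")    # financial ratios + 'bankrupt'
    X = firms.drop(columns=["bankrupt"])            # ratios, with missing values
    y = firms["bankrupt"]

    clf = make_pipeline(IterativeImputer(random_state=0),
                        LogisticRegression(max_iter=1000))
    scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
    print(f"cross-validated accuracy: {scores.mean():.3f}")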

Jeremy Jesse, Optimal Warehouse Delay for a Supply Chain Backorders Optimization Model, May 26, 2006 (David Rogers, Amitabh Raturi, Jeffrey Camm)
In this paper a multi-level retailer inventory distribution model with backorders is considered.  It is a periodic review system where the optimal base stock levels are determined by minimizing the total penalty cost of backorders subject to delay time constraints.  Lead times are deterministic with possible delays, lateral shipments are not allowed, and shipment times are integer constrained to model situations where a fleet of trucks is only able to make one delivery per day.  A highly nonlinear mathematical programming model was adapted for this setting.  The case of non-identical retailers created a formidable challenge for standard software to yield reliable results.  Interval search techniques and optimal selection were utilized within Excel and VBA to provide numerical results for the case of multiple identical retailers.

Yue Wu, An Empirical Study of the Post-Deregulation Electric Utility Wholesale Market, May 23, 2006 (Yan Yu, Martin Levy, Norman Bruvold)
This work explores the volatility structure of daily electricity price returns for six markets across the US.  Based on daily data from 1998 to 2005, we examine wholesale electricity prices for Cinergy, Entergy, PJM, Chicago, Michigan, and ERCOT with a parametric modeling methodology.  A family of GARCH-type models is implemented to model the return behavior, in which exogenous explanatory variables, seasonality, and asymmetric effects are taken into account.  The behavior of electricity prices exhibited a strong tendency to stabilize, like a common commodity, after deregulation at the end of the last century.  Several misspecification tests are conducted to evaluate model appropriateness, and different backtesting techniques are applied to identify the best model.  Finally, a bootstrap simulation methodology is applied to compare the performance of an updated model using data from 2001 to 2005 with an overall model using data from 1998 to 2005.  The updated model turns out to generate a much narrower prediction interval and is more accurate, which supports the conclusion that a structural change happened around 2001.
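A minimal sketch of a GARCH(1,1) fit of the kind described above, using the Python arch package; the price file is hypothetical and returns are expressed in percent:

    import numpy as np
    import pandas as pd
    from arch import arch_model

    prices = pd.read_csv("hub_prices.csv", index_col="date",
                         parse_dates=True)["price"]
    returns = 100 * np.log(prices).diff().dropna()

    am = arch_model(returns, mean="Constant", vol="GARCH", p=1, q=1, dist="t")
    res = am.fit(disp="off")
    print(res.summary())

    # one-step-ahead conditional variance forecast
    print(res.forecast(horizon=1).variance.iloc[-1])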

Robert E. Carter, Estimating Tuition Elasticity Using a Dynamic Discrete Choice Model, May 19, 2006 (David Curry, Jeffrey Camm, Michael Fry)
Prior research on tuition elasticity for institutions of higher learning has consistently found a downward sloping demand curve. That is, as tuition increases, enrollment decreases. However, most published studies relied on aggregate data covering multi-year time frames. Elasticities estimated in prior research reflect the likelihood that a student will attend any college or university. The research does not provide guidance on the choice of college that an individual student may choose to attend. The research presented in this thesis is unique because it employs discrete choice experiments on an individual student basis in order to determine the tuition elasticity for 12 colleges within the University of Cincinnati. Additionally, web-based survey software containing a unique "rules engine" was developed (as none were available commercially) so that the list of competitive schools in the choice set could be dynamic and, hence, reflect the college consideration set for each student. Thus the discrete choice experiment employed here uses a data collection format personalized for each respondent in the study.  Results are consistent with prior research in that we identified a downward sloping demand curve. However, our estimated elasticities are considerably greater than those reported in previous research due to the focus on individual student level data as compared with aggregate level analysis.  Furthermore, within the University of Cincinnati, we found that students attending the Colleges of Pharmacy, Medicine, and The College Conservatory of Music (CCM) exhibited the lowest tuition elasticity, while students from Business, Engineering, and the College of Education, Criminal Justice, and Human Services (CECH) displayed the highest relative elasticity.

Mayank Seksaria, Portfolio Risk Management Techniques for Electricity Generating Companies, May 19, 2006 (Yan Yu, David Rogers, Martin Levy)
In the past decade, electricity markets have been deregulated all around the world. In this new environment, energy is traded like any other commodity, yet price volatility in deregulated electricity markets is higher than for any other commodity. Confronted with this extreme price volatility, market participants and traders face enormous risks and hence need risk management in electricity markets. Given the volatility that fuel prices have exhibited in the recent past, price risk is paramount for an electricity company's risk management. In this thesis I start by calculating the volatility of electricity spot prices using historical simulation methods. I then use time-series models to determine the characteristics of spot-price returns and to perform comparative forecasting of electricity spot prices. Risk cannot be avoided in any market, and modern utility theory is an approach to decision making under uncertainty. I develop an optimal portfolio consisting of bilateral contracts and spot purchases, and I use sequential optimization to determine the effect of various factors on the allocation ratio in the portfolio. Besides modern portfolio theory (MPT), I also use value at risk (VaR) as a risk-control technique, calculating its value for individual assets as well as for the portfolio and comparing those values to illustrate the diversification provided by an optimal portfolio. Overall, I provide a risk-management framework for generating companies in the competitive electricity market. The proposed energy-allocation model provides an analytical and quantitative approach to energy trading.

Zhouzhou Peng, A Dynamic Self-Adaptive Algorithm and Simulation Study for Warehouse Organizing, May 18, 2006 (David Rogers, Amitabh Raturi, Uday Rao)
How well the contents of a warehouse, i.e., the variety of items stored in it, are organized is among the most important factors determining productivity and efficiency.  Current organizing methods are inadequate and cost-prohibitive for volatile warehouses where a huge variety of goods is frequently transferred in and out in large and unpredictably fluctuating numbers.  The reason for that inadequacy is twofold: first, current methods tend to focus only on the storing process and ignore the impact of the order-picking function; second, current methods often use a top-down approach and lack the flexibility needed for an ever-changing environment.  This thesis presents a new algorithm that integrates both the storing and the order-picking activities and employs a bottom-up perspective to solve the problem, utilizing only basic information readily available within a modern computerized warehouse management system (WMS).  A simulation study based upon a real-life case is used to show the algorithm's dynamics and to analyze its improved performance over the current method.

Andrew R. Remington, A Study of Unsupervised Learning, April 20, 2006 (Yan Yu, Martin Levy, David Kelton)
Unsupervised learning is a collection of methods that are extremely effective in producing accurate summaries of relationships in a data set. With the recent evolution of computing power and the free implementation of the statistical programming language R, these powerful methods are now readily available to anyone interested in data mining. This project studies association rule analysis, cluster analysis, self-organizing maps, principal components, independent component analysis, and multidimensional scaling, offering summaries of each method, descriptions of each method's implementation in R, examples of the application of each method to a real data set, and an assessment of the attributes of each method. Because the field is relatively new and the documentation for each method is fragmented, this project consolidates the usage and understanding of each method in a freely available software language, providing novice data miners with a framework for understanding and instructions for applying each method. The project summarizes journal publications, textbooks, and R code that deal with each method individually. The results show that many unsupervised learning methods are easy to apply, execute quickly, and provide similar results across differing methods. Furthermore, the results demonstrate the redundancy of different methods on gene tumor data and the effectiveness of unsupervised learning as exploratory analysis. The significance of this finding is that, because the methods are freely available and easily applicable to a data set, it is prudent for data miners and statisticians to apply unsupervised learning methods during their initial exploration of a data set in order to define their starting assumptions more accurately.

Honghua Shang, A Model for Profiling Asian American Association Telecom Services Customers Using Logistic Regression, February 21, 2006 (Martin Levy, Norman Bruvold, James Evans)
Data mining is an information-extraction activity whose goal is to discover hidden facts contained in databases.  Using a combination of machine learning, statistical analysis, modeling techniques, and database technology, data mining finds patterns and subtle relationships in data and infers rules that allow the prediction of future results.  Typical applications include market segmentation, customer profiling, fraud detection, evaluation of retail promotions, and credit-risk analysis.  This project attempts to develop a model for profiling potential customers using statistical methods, such as logistic regression for a given data set.  That is, the relationship of some responses and explanatory variables will be explored so that we can determine which variables are the most and least correlated with the response variable.  The goal is to segment data provided by the Asian American Association Telecommunication Services into potential customers and non-interested customers.  Logistic regression was chosen mainly because of its ability to analyze categorical data.  Gender, language, age, dwelling, household income, location, and time zone were variables found to be statistically significant and are therefore important contributors in determining the potential AAATS customers.  AAATS will adjust their future marketing campaign based on these findings.

Guohua Wu, A Study of Value-at-Risk Methods, February 7, 2006 (Martin Levy, Jeffrey Camm, Norman Bruvold)
Value at risk (VaR) is a method widely used in financial corporations to measure the risk of holding a portfolio over a period. Three basic methods for computing VaR are the delta-normal method, the historical method, and Monte Carlo simulation. Among these three, Monte Carlo simulation is the most powerful, while the delta-normal method is the most popular because it is economical. However, these methods have a serious drawback if VaR is forecast over a volatile period, because they assume constant variance. Univariate and multivariate ARCH/GARCH models are discussed to deal with such heterogeneous data. Since software for a time-varying-covariance multivariate GARCH model is not currently available, the constant-correlation multivariate GARCH model is used. The vector autocorrelation model is based on the idea that the conditional variances of the portfolio components are correlated over time not only with themselves but also with the other components; such a model can improve VaR estimation beyond the basic GARCH model.
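A minimal sketch of the three basic VaR calculations named above (delta-normal, historical, and Monte Carlo) for a single return series at the 99% level; the returns are simulated for illustration:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(42)
    returns = rng.standard_t(df=5, size=2500) * 0.01     # daily returns (simulated)
    alpha = 0.99
    portfolio_value = 1_000_000

    # 1) delta-normal: assumes normally distributed returns
    mu, sigma = returns.mean(), returns.std(ddof=1)
    var_delta_normal = -(mu + sigma * norm.ppf(1 - alpha)) * portfolio_value

    # 2) historical simulation: empirical quantile of past returns
    var_historical = -np.quantile(returns, 1 - alpha) * portfolio_value

    # 3) Monte Carlo: simulate from a fitted model (here, simply normal)
    sims = rng.normal(mu, sigma, size=100_000)
    var_monte_carlo = -np.quantile(sims, 1 - alpha) * portfolio_value

    print(f"delta-normal VaR: {var_delta_normal:,.0f}")
    print(f"historical   VaR: {var_historical:,.0f}")
    print(f"Monte Carlo  VaR: {var_monte_carlo:,.0f}")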

 

2005

Ying Huang, Development of RFID Technology Measurement Scales, December 2, 2005 (Craig Froehle, Michael Fry, Suzanne Masterson)
Over the past few decades, radio frequency identification (RFID) technology has been used to track and identify goods, assets, and even living things. It is gaining momentum in supply-chain management.  Compared with barcodes, it is a more powerful tracking tool in many aspects and can provide more detailed and accurate information in a more timely manner.  As the most promising ID technology that might revolutionize the industrial world, it has drawn a lot of interest from supply-chain participants.  Millions of dollars have been invested into research to examine its potential and improve its features and benefits.  Although a number of surveys have been conducted to explore people's concerns about this hot topic, it is important that RFID technology as a concept be subject to the same serious and careful academic study that has been focused on the technology itself.  This could help reveal current and potential RFID users' interests and expectations.  Perceptions of RFID are not well understood, likely due in part to a lack of valid measurement instruments.  In this paper, we summarize the current state of RFID application. We then propose four important attributes of RFID - reliability, durability, flexibility, and security - and develop multi-item scales to assess the importance of each to managers.  Employing a combination of primary field (internet survey) and artificial datasets, we perform reliability and validity analyses using the SAS and AMOS statistical tools.  The results of the iterative reliability and confirmatory factor analyses suggest that two of the tested items should not be employed in further applications of the instrument.  The results and limitations of the research are then discussed.

Ying Li, Using Bayes Estimation Under BLINEX Loss to determine the Mailing Size for a Direct Mail Marketing Campaign, November 22, 2005 (Martin Levy, Norman Bruvold, David Kelton)
Direct mail marketing is a growing area of marketing practice.  Many corporations use a data-mining technique called a scoring model to estimate the response probability of each household in the mailing list; the selection of targets is based on the assigned probabilities in descending order.  The problem that remains unsolved is the size of the mailing.  In practice, direct marketers make the decision based either on budget or on maximizing the response rate, both of which are suboptimal for profit-driven firms.  Bayes estimation, which takes cost into consideration, has been applied to find the optimal mailing size.  Traditionally, point estimates are derived by implicitly assuming a squared error loss (SEL) function, but the SEL may not reflect the actual loss in a direct-marketing problem.  This paper uses Bayes estimation under bounded linear exponential (BLINEX) loss to find the response rate that corresponds to the optimal mailing size, leading to maximized profits.  A case study with real data sets from a catalogue company demonstrates the BLINEX loss structure and the financial advantage of the BLINEX method over the SEL and mailing-to-all scenarios.

Pooja Singh, Application of Linear and Non-Linear Modeling with Random effects to Analyze Biomechanical Data, November 21, 2005 (Martin Levy, Jeffrey Camm, Norman Bruvold)
This project deals with the design and statistical analysis of biomechanical data.  The biomechanical data pertain to a tissue-engineering experiment that aims at accelerating tissue repair.  Repair of tendons, ligaments, and capsular structures is common, given that these injuries represent almost 45% of the 32 million musculoskeletal cases in the US each year.  As a consequence, surgeons and basic scientists have sought to identify new approaches, such as tissue engineering, for repairing tissue and returning the patient to pre-injury activities.  This experiment sought to understand how the cell-to-collagen ratio affects the contraction kinetics of mesenchymal stem cells (MSCs) as they mature around posts in a culture.  A split-plot design was successfully applied to the experiment and hypotheses were tested using the model.  In addition, a nonlinear model was fit relating the response variable, the contraction factor, to time; this model allowed the experiment's random effect to enter nonlinearly.  The analysis was implemented using PROC MIXED and PROC NLMIXED in SAS.

Vinutha Nagesh, Clinical Data Mining: Frozen Shoulder, November 18, 2005 (Martin Levy, Yan Yu, Norman Bruvold)
Data mining, an interdisciplinary research area drawing on artificial intelligence, statistics, and databases, is the science of extracting useful information from large databases.  In this research project, data-mining techniques were used to analyze relationships in a clinical condition called frozen shoulder.  The data set, derived from the clinical database of a shoulder surgeon at Cincinnati Sportsmedicine and Orthopedic Center, consists of 65 patients' records.  Records include patients' demographics and clinical diagnosis information.  The severity of the frozen-shoulder problem is measured in terms of the Simple Shoulder Test (SST) score, the range of motion of the aggravated arm in different elevations, and the American Shoulder and Elbow Surgeons (ASES) score, which is calculated from the patients' responses regarding the functionality of their shoulder.  Treatment included physical therapy or surgery.  The data were used to perform comparative analyses of pre-treatment and post-treatment measurements using paired and unpaired methodologies.  Predictive studies are performed to predict the treatment group to which a patient is assigned as a function of demographics, pre-treatment scores, and clinical diagnoses.

Steven Harrod, Numerical Methods for Realizing Nonstationary Poisson Processes with Piecewise-Constant Instantaneous-Rate Functions, October 24, 2005 (David Kelton, Uday Rao, Martin Levy)
Nonstationary Poisson processes are appropriate in many applications, including disease studies, transportation, finance, and social policy.  We review the risks of failing to model nonstationary Poisson processes properly and discuss three algorithms for the generation of Poisson processes with piecewise-constant instantaneous rate functions.  We test these algorithms in C programs and make comparisons of accuracy, speed, and stability across disparate rate functions and microprocessor architectures.  Choice of optimal algorithm could not be predicted without knowledge of microprocessor architecture.
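As an illustration of the generation problem itself (not of the paper's C implementations or its timing comparisons), the R sketch below produces one exact realization of a Poisson process whose rate is piecewise constant, by running a homogeneous process within each constant-rate interval; the breakpoints and rates are arbitrary.

    # One realization of a nonstationary Poisson process with a piecewise-constant
    # rate: rates[i] applies on the interval [breaks[i], breaks[i+1]).
    sim_nhpp_pc <- function(breaks, rates) {
      stopifnot(length(rates) == length(breaks) - 1)
      arrivals <- numeric(0)
      for (i in seq_along(rates)) {
        t <- breaks[i]
        repeat {                                  # homogeneous process within the interval
          t <- t + rexp(1, rate = rates[i])
          if (t >= breaks[i + 1]) break
          arrivals <- c(arrivals, t)
        }
      }
      arrivals
    }
    set.seed(42)
    x <- sim_nhpp_pc(breaks = c(0, 2, 5, 8), rates = c(0.5, 4, 1))
    length(x)    # arrival count; most arrivals fall in [2, 5), where the rate is 4

Because the exponential distribution is memoryless, restarting the clock at each breakpoint and discarding the overshoot yields an exact realization for piecewise-constant rates.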

Vishva Raj Bangad, Bioequivalence and Sample-Size Determination in the Pharmaceutical Industry, October 5, 2005 (Martin Levy, Jeffrey Camm, Uday Rao)
Assessing bioequivalence between the bioavailability of a generic drug product and the innovator drug product has gained importance in recent years, since the generic-drug manufacturer does not need to perform costly clinical trials to demonstrate the safety and efficacy of the generic product if the bioavailabilities of the two drug products are demonstrated to be bioequivalent.  However, this bioequivalence must be demonstrated in a statistically sound way to protect the consumer from ineffective and unsafe drugs.  Until the 1970s, the statistical analysis of bioequivalence studies relied on a test of the hypothesis of no difference between the bioavailabilities of two drug formulations, usually supplemented by an assessment of what the power of the test would have been if the true averages had been bioequivalent.  Westlake proposed a new approach based on a confidence interval for the difference between the true means.  During the same period, Schuirmann proposed the two one-sided tests (TOST) procedure.  Anderson and Hauck proposed a new test and claimed that their test was always more powerful than the above two tests.  A nonparametric version of the TOST, based on the Wilcoxon-Mann-Whitney procedure, is available when the assumption of normality or lognormality is not valid.  We will discuss and compare these methods in this paper.  We will also determine the power and sample size of Schuirmann's TOST.  In the end, we will briefly discuss some of the new approaches that have been proposed in the last decade and define population bioequivalence and individual bioequivalence.
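As a small, hedged illustration of Schuirmann's procedure (simplified to a parallel-group comparison on the log scale rather than the crossover designs typical of bioequivalence trials, and run on simulated data), the TOST can be carried out in R with two ordinary one-sided t tests against the conventional 80%-125% limits.

    # Schuirmann's two one-sided tests (TOST) on log(AUC) -- simulated, simplified example.
    set.seed(7)
    logT <- rnorm(24, mean = log(100), sd = 0.25)   # test formulation
    logR <- rnorm(24, mean = log(105), sd = 0.25)   # reference formulation
    theta <- log(1.25)                              # equivalence limits: 80% to 125%

    lower <- t.test(logT, logR, mu = -theta, alternative = "greater", var.equal = TRUE)
    upper <- t.test(logT, logR, mu =  theta, alternative = "less",    var.equal = TRUE)
    max(lower$p.value, upper$p.value)   # conclude bioequivalence if this is below alpha (e.g., 0.05)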

Bogdan Bichescu, Channel Power: Its Implication on Supply-Chain Performance, September 1, 2005 (Michael Fry, Amitabh Raturi, Pradyot Sen, George Polak [Wright State University])
Our work, comprising two essays, examines decentralized supply chains composed of one supplier and one retailer facing stochastic customer demand.  We develop models for both periodic-review (first essay) and continuous-review (second essay) inventory policies when the decision-making rights are split between supply-chain agents.  We seek to answer two questions: (1) when does decentralized decision making result in the greatest loss in supply-chain performance, and (2) what effect does the distribution of channel power have on system and individual agent performance?  In our first essay, we assume the retailer is responsible for choosing order sizes and the supplier chooses delivery frequency.  We find that performance losses from decentralized control are strongly influenced by the relative holding and penalty costs, but somewhat invariant to demand uncertainty due to risk pooling.  Furthermore, our numerical results suggest that concentrating channel power with the supplier can lead to supply-chain profits that are very close to the centralized scenario, but also results in lower customer-service levels.  Our second essay studies supply-chain performance under a vendor-managed inventory (VMI) agreement where the supplier controls delivery sizes and the retailer sets customer-service levels.  Within the VMI setting, we model various power scenarios: equally powerful retailer and supplier, powerful retailer, and powerful supplier.  According to our numerical results, the best system performance is achieved when the supplier acts as the Stackelberg leader.  Furthermore, somewhat contrary to intuition, we find that individual agent performance is greatest when the agent acts as a follower.

Mohammad Rouholiman, Evaluating Mezzanine Finance in Real Estate: A Monte Carlo Simulation Approach, August 26, 2005 (Jim Clayton, James Evans, David Kelton)
Mezzanine finance has emerged as an important source of financing in commercial real estate.  It helps to complete the market by bridging the gap between what equity investors are willing to put down and what conventional senior lenders provide.  The mezzanine position is structured as a junior debt piece or preferred equity share that takes the first loss after the equity investor in the event of a default.  Due to the riskiness of the position, a more rigorous analysis of the property's future cash flows (pro forma) is warranted.  Traditional property valuation relies on a static ten-year pro forma.  A more risk-adjusted approach is very timely given the aggressive pricing of equity and debt in property markets over the past few years.  Real-estate prices have soared and spreads on debt have contracted, leaving investors and bankers with very little room for error.  This paper aims to provide a methodology for using Monte Carlo simulation to evaluate the riskiness of a property and to aid the mezzanine lender in the decision-making process.  The goal is to use Crystal Ball software to provide the mezzanine lender with a better picture of the possible outcomes for the property and to see whether the property meets the lender's initial underwriting criteria.  Then OptQuest is used to search for the set of loan attributes that meet the lender's IRR and default risk requirements.

Guoqiang Zhang, Ph.D., Numerical Methods in Valuation of American Options, July 29, 2005 (Michael Ferguson, David Kelton, Martin Levy)
Unlike European options, which can only be exercised at the time of maturity and can be priced with the explicit Black-Scholes formula, American-style options can be exercised at any time before maturity and there is no closed-form formula to price them.  American Asian options, such as arithmetic average American Asian options and geometric average American Asian options, pose more difficulties in valuation since their values depend not only on the underlying assets, but also on the arithmetic or geometric averages of the underlying asset over a certain time interval.  Numerical methods, such as binomial trees, least-squares Monte Carlo simulation, and finite differences, must be used to value American options.  The binomial tree method (Cox et al. 1979) provides a simplified numerical approach for valuing options and assumes that the price of the underlying can go up or down by fixed multiples.  Each price jump is assigned a probability and a tree of possible underlying prices is built.  Working backward from the tree nodes at the option maturity date, the worth of the option can be back-calculated until the option is valued at the desired date.  Least-squares Monte Carlo simulation (Longstaff and Schwartz 2001) uses regression to estimate the conditional expected payoff to the option holder from continuation, and is readily applicable to path-dependent and multifactor financial instruments.  The finite-difference approach transforms the partial differential equation into a difference equation that can be solved numerically; it is the most commonly used numerical method for solving differential equations.  In this project, we discuss the explicit, implicit, and Crank-Nicolson finite-difference methods for the one-factor model, and the explicit and ADI methods for two-factor models such as arithmetic average American Asian options and geometric average American Asian options.
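To make the first of these methods concrete, a compact Cox-Ross-Rubinstein binomial tree for a plain American put is sketched below in R (illustrative parameters; the Asian-option variants discussed in the project require augmenting the tree or grid with the running average and are not shown).

    # Cox-Ross-Rubinstein binomial tree for an American put (illustrative parameters).
    american_put_crr <- function(S0, K, r, sigma, Tm, n) {
      dt   <- Tm / n
      u    <- exp(sigma * sqrt(dt)); d <- 1 / u
      p    <- (exp(r * dt) - d) / (u - d)              # risk-neutral up probability
      disc <- exp(-r * dt)
      S <- S0 * u^(0:n) * d^(n:0)                      # terminal asset prices
      V <- pmax(K - S, 0)                              # terminal payoffs
      for (step in (n - 1):0) {                        # roll back through the tree
        S <- S0 * u^(0:step) * d^(step:0)
        V <- pmax(K - S, disc * (p * V[-1] + (1 - p) * V[-(step + 2)]))  # early exercise vs continuation
      }
      V
    }
    american_put_crr(S0 = 50, K = 50, r = 0.05, sigma = 0.3, Tm = 1, n = 500)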

Paul Bessire, Measuring Individual and Team Effectiveness in the NBA Through Multivariate Regression, June 3, 2005 (Michael Fry, Jeffrey Ohlmann [University of Iowa], David Kelton)

At the conclusion of the 2003-04 National Basketball Association (NBA) season, the Detroit Pistons, without one player among the NBA's top ten scoring leaders, found themselves atop the NBA with a championship ring.  Conversely, Team USA, composed of the most individually talented players in the world, failed to win gold in the 2004 Olympics.  How could this happen?  We believe that much of the variation found in a basketball team's success can be explained mathematically by examining the interactions of the five players on the court, and not just individual player abilities.  We examine several methods for rating individual NBA players and we utilize multivariate regression analysis to assist in building successful NBA teams.  We seek to predict the success of an NBA lineup consisting of the five players on a court at any time.  We measure success as the lineup's average scoring margin per minute.  In order to predict a lineup's success, we consider a set of individual player attributes that serve as our explanatory variables.  We use two-way interactions between player abilities to help explain teamwork in the NBA.  Applications of the model include examining which players should play at each position, predicting the lineups that should have the greatest team success, and specifying which skill areas the coaching staff should seek to improve through the annual NBA draft, free agency, and trades.
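A hedged sketch of the kind of regression described (with entirely hypothetical lineup-level data and aggregated skill ratings standing in for the project's player attributes) might look as follows in R, where the squared term in the formula expands to all two-way interactions.

    # Lineup scoring-margin regression with two-way interactions between aggregated
    # skill ratings -- hypothetical data frame 'lineups', one row per five-man unit.
    fit <- lm(margin_per_min ~ (shooting + passing + rebounding + defense)^2,
              data = lineups, weights = minutes)   # weight lineups by minutes played together
    summary(fit)                                   # interaction terms capture complementarities
    predict(fit, newdata = candidate_lineups)      # screen hypothetical lineups (another assumed data frame)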

Jason Crabtree, Construction and Tests of an Interactive Genetic Algorithm for New Product Design, June 3, 2005 (David Curry, David Kelton, Yan Yu)
Affinova IDEA(TM) is a commercial software product with marketing applications in the area of new product design.  At its core is an interactive genetic algorithm (GA), which provides certain advantages over traditional product design methods, such as conjoint analysis.  These advantages include the ability to handle products with many design features and many levels for each feature, as well as nonlinear consumer utility functions involving complex effects.  The goal of this project is to construct and test an interactive genetic algorithm similar to Affinova's.  The analysis portion of the project will test the GA over a variety of operating conditions and illuminate the strengths and weaknesses of a genetic-algorithm-based approach to product design.

Neelima M. Reddy, A Route-Sharing Tool for Optimization of Resource Allocation in Logistics Planning, June 3, 2005 (Uday Rao, Michael Fry, David Kelton)
The optimum allocation of resources is one of the biggest challenges faced by a third-party logistics firm during the planning phase of operations.  The problem becomes complicated with uncertainty of demand, outsourcing of resources, and dynamic constraints on the availability of resources.  The resources in this particular problem are tractors and drivers, and they must be allocated to pre-designed routes such that all the routes are run at the design-specified times using a minimum of tractors.  Traditionally, it has been a slow manual process, taking a logistics planner about 2-3 days to come up with a feasible allocation of tractors, let alone an efficient allocation.  Also, every time a new route or a set of routes is added, or route specifications are changed, the tractors have to be entirely reallocated.  The long, cumbersome process does not allow comparative studies between scenarios or the selection of the most cost-effective scenario.  In this project, I have developed a software tool called the 'Route-Sharing Tool' for one such logistics firm (Transfreight) that uses a heuristic approach to the resource-allocation problem and provides a good solution in minutes.  It creates a weekly tractor-route flow schedule and is all the more valuable when route specifications change frequently and the resources have to be reallocated.  The tool is also useful for comparative studies and can be used during route design to develop an efficient set of routes within the constraints, which reduces the idle time of resources.  The tool also gives a visual representation of tractor usage and idle time, which makes it easy to understand and implement the desired changes.

Kanampully Sunny Paul, Analysis of Some Finitized Distributions for Use in Simulation, May 27, 2005 (Martin Levy, David Kelton, Norman Bruvold)
Simulation modeling helps us to replicate real processes using computer programs that are helpful in determining various important parameters of the process.  As simulation modeling assumes greater significance today and finds applications in numerous fields, emphasis is placed on developing accurate, efficient, and fast random-variate-generation algorithms.  Levy and Golnabi have proposed a new methodology, called finitization, that converts an infinitely supported discrete power-series distribution into another distribution having support of a specified finite size.  An essential feature of the finitized version is that it preserves the moments of the parent distribution up to the order of finitization.  In this paper we seek to explore the possible advantages of using such a finitized distribution in simulating random variates that belong to the family of discrete power-series distributions.  We also check the accuracy of distributions derived using the method of finitization against the theoretical distributions.  We have studied the various methods of simulating random variates and their relative advantages with respect to computational times.  We have carried out the simulation in SAS and compared the computational speed with that of the conventional methods SAS uses to generate these distributions.  After analyzing the various processing times required for simulation, we conclude that the method of finitization is advantageous in reducing processing times by reducing an infinitely supported series to its finitely supported form.  We also conclude that the advantages in processing times may depend on other factors such as the software used, the operating system, and the hardware configuration of the computers used to carry out the simulations.

JianJian Cheng, Projecting the Charge-Off Rate for Consumer Loan Products at HSBC Household, May 23, 2005 (Martin Levy, Norman Bruvold, Yan Yu)

Consumer loan portfolios comprise millions of dollars of receivables at HSBC Household.  The ability to anticipate the loss, mainly the charge-off, has become essential.  Yet today there are few models available that address this area at HSBC Household.  The focus of this paper is primarily on consumer loan charge-off rate forecasts.  The goal is to predict monthly performance from two months ahead to four months ahead.  This paper addresses the question faced by the senior management of HSBC Household: 'How can we better project the charge-off for consumer loans?'  Given the absence of a formal forecasting model, this paper presents the forecasts of six models, including cohort average, Winters' method, linear regression, simple ARIMA time series models, ARIMA intervention models, and ARIMAX models.  This case study concludes that, overall, the ARIMA intervention model and Winters' method provide very good forecasts for both two months ahead and four months ahead, and they are recommended.  The ARIMAX model's forecasting accuracy is not stable.  It produces the best forecasting result for the two-months-ahead window, but is the second worst for the four-months-ahead window.  So this model should be used carefully.  Linear regression provides good results with stable accuracy.  It can be used as a benchmark for other alternative forecasting models, if the delinquency data are accessible.
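The comparison of smoothing and ARIMA forecasts can be illustrated with a short R sketch on a simulated monthly series (the HSBC data, the cohort-average and intervention models, and the exact model orders are not reproduced here); the hold-out MAPE mirrors the two- and four-month-ahead accuracy checks described.

    # Winters' seasonal exponential smoothing versus a seasonal ARIMA model,
    # compared on a 4-month hold-out of a simulated monthly rate series.
    set.seed(3)
    y <- ts(2 + 0.01 * (1:60) + 0.3 * sin(2 * pi * (1:60) / 12) + rnorm(60, 0, 0.1),
            frequency = 12)
    train <- ts(y[1:56], frequency = 12)
    test  <- as.numeric(y[57:60])

    hw    <- HoltWinters(train)                            # Winters' additive seasonal method
    fc_hw <- as.numeric(predict(hw, n.ahead = 4))

    ar    <- arima(train, order = c(1, 1, 1),
                   seasonal = list(order = c(0, 1, 1), period = 12))
    fc_ar <- as.numeric(predict(ar, n.ahead = 4)$pred)

    mape <- function(actual, fc) 100 * mean(abs(actual - fc) / actual)
    c(Winters = mape(test, fc_hw), ARIMA = mape(test, fc_ar))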

Peter G. Donley, Intervention Forecasting: How to Forecast Appropriately for Categorical Demand when a New Wal-Mart Superstore Enters a Retail-Dominated Market, May 20, 2005 (Martin Levy, Norman Bruvold, David Rogers)
As Wal-Mart continues to saturate the retail market, competing retailers are trying to find ways to adjust for the inevitable changes that they will face in the future.  Consumers now have a wider selection of retailers to choose from than the usual local grocery store down the street.  As a new Wal-Mart Supercenter enters the marketplace, there is an obvious change, an intervention, in consumers' shopping patterns.  This project is focused on one appropriate method of forecasting consumer demand in a particular category, given that a Wal-Mart Supercenter has entered the marketplace.  Using ARIMA intervention modeling, the appropriate steps are taken to find an accurate model for forecasting category purchases when a Wal-Mart Supercenter enters and to measure the direct effects on consumer demand.
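The core of an ARIMA intervention analysis can be sketched in a few lines of R: a step dummy marks the entry date, and its coefficient estimates the permanent level shift in category demand.  The series below is simulated; identifying the ARIMA orders and the dynamic form of the intervention follows the steps described in the project.

    # ARIMA model with a step intervention at the (simulated) Wal-Mart entry month.
    set.seed(11)
    n <- 100; entry <- 61
    step  <- as.numeric(seq_len(n) >= entry)               # 0 before entry, 1 after
    sales <- 200 + arima.sim(list(ar = 0.6), n = n, sd = 5) - 30 * step

    fit <- arima(sales, order = c(1, 0, 0), xreg = step)
    coef(fit)       # the 'step' coefficient estimates the level shift (about -30 here)
    confint(fit)    # approximate intervals; does the step interval exclude zero?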

Nelly Louise Jorgensen Shapero, Human Resources Forecasting Models for Small Companies, March 11, 2005 (David Rogers, Norman Bruvold, James Evans)
Small companies should make data collection for human resource measures a routine task.  A trend and regression analysis may work well for short-term forecasts of manpower requirements, even though it may be difficult to get a detailed forecast using these models.  A Markov model may be useful for analyzing how many people will be in each position at some future time.  The models are fitted to conditions at Transfreight LLC.  Two curves are fitted in the trend analysis, an exponential and a linear curve.  The trend analysis provided very reliable forecasting results using both models, with an R-square of 0.981 for the linear model and 0.989 for the exponential.  A multiple regression analysis may work well for many small companies, but for Transfreight the results were not as good as the trend analysis.  Using stepwise regression, the only variable entered was time, and an F-test comparing the single-variable linear model with the multiple-variable regression model does not favor the more complex model.  A Markov model was developed and used to describe the system but was not used for forecasting employee numbers.  Many of the transition probabilities in this model are very small.  The distribution of the standard errors therefore becomes very skewed, and the normality assumptions necessary for accurate predictions were unreasonable; predictions made with this model may therefore contain large errors.  Several qualitative and quantitative models for human-resource planning are briefly described and evaluated for fit to small companies.
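The Markov manpower calculation itself is simple enough to show in a short R sketch; the grades, transition probabilities, and headcounts below are hypothetical, and, as the abstract notes, the usefulness of such a model depends on how well the transition probabilities are estimated.

    # Markov manpower projection: rows are current grades, columns are next-period
    # states (the last column absorbs exits); all numbers are hypothetical.
    P <- matrix(c(0.70, 0.15, 0.00, 0.15,
                  0.00, 0.75, 0.10, 0.15,
                  0.00, 0.00, 0.85, 0.15),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("Junior", "Senior", "Manager"),
                                c("Junior", "Senior", "Manager", "Exit")))
    head0 <- c(Junior = 40, Senior = 25, Manager = 10)

    project <- function(h, P, periods) {
      grades <- names(h)
      out <- matrix(NA, periods, length(h), dimnames = list(NULL, grades))
      for (t in 1:periods) {
        h <- as.vector(h %*% P[, grades])   # leavers flow into the Exit column and drop out
        out[t, ] <- h
      }
      out                                   # expected headcount by grade, period by period
    }
    project(head0, P, periods = 4)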

Anand Mathew, Work-In-Process Inventory Entitlement for the Aircraft-Engine Industry, March 11, 2005 (David Rogers, Amtiabh Raturi, Uday Rao)
Understanding, visualizing, and controlling inventory flow is one of the challenges faced by the modern manufacturing industry.  Too much or too little inventory in any form - raw materials, work-in-process (WIP), or finished goods - is undesirable.  Of these three types of inventory, work-in-process inventory is an indication of a lack of coordination within the organization.  By constantly monitoring and properly managing work-in-process inventory levels, an organization can substantially reduce its operating costs.  Most of the parameters that affect work-in-process inventory are within the organization, and hence projects related to work-in-process inventory require a significant amount of impetus and organizational restructuring to succeed.  The complexity of modern machinery, unstable and seasonal demand patterns, constant design alterations, and widely dispersed manufacturing locations have made visualization, analysis, and optimization of the work-in-process inventory flow cumbersome and time consuming.  This project was undertaken in order to develop a scenario-analysis platform for evaluating the impact of various design parameters on work-in-process inventory.  The new application gives the user the ability to alter the demand schedule, bill of material, product cost, assembly levels, or cycle time of each component in order to analyze its impact on work-in-process inventory levels.  Currently this tool is being used for inventory forecasting and resource allocation at one of the world's largest aircraft-engine manufacturers.

 

2004

Kelly Herrmann, Optimal Portfolio Allocations for Hedge Funds with Asymmetric Returns, November 24, 2004 (Yan Yu, David Rogers, Norman Bruvold)
'Hedge fund' is a phrase describing a broad range of alternative investment strategies.  What they all have in common is a goal to create positive returns in any market environment.  They are unregulated and privately organized, allowing for very flexible investment styles (e.g., using leverage).  Non-normality and asymmetric returns are usually observed, which make traditional quantitative studies based on Gaussian symmetric assumptions difficult to justify.  Portfolio allocation, for example, is greatly affected by asymmetric returns.  The goal of this project is to determine the optimal allocation for a portfolio of hedge funds.  The hedge-fund universe is divided into eight strategy categories, and recommendations of the percentage of wealth invested in each strategy are given.  Strategies are represented through indices developed by Hedge Fund Research.  Also, the non-normality of returns is accounted for using two optimization methods: modified value at risk via the Cornish-Fisher expansion, and Duarte's unifying formulation.  These methods will be explained and the portfolios they produce will be compared.
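The Cornish-Fisher adjustment at the heart of the modified value at risk is compact enough to show directly; the R sketch below uses a simulated left-skewed return series and standard moment formulas (Duarte's formulation and the optimization layer are not reproduced here).

    # Modified (Cornish-Fisher) value at risk for an asymmetric return series.
    mvar_cf <- function(r, p = 0.95) {
      z  <- qnorm(1 - p)                               # e.g., -1.645 at 95%
      m  <- mean(r); s <- sd(r)
      sk <- mean((r - m)^3) / s^3                      # sample skewness
      ek <- mean((r - m)^4) / s^4 - 3                  # sample excess kurtosis
      zcf <- z + (z^2 - 1) * sk / 6 +
             (z^3 - 3 * z) * ek / 24 -
             (2 * z^3 - 5 * z) * sk^2 / 36             # Cornish-Fisher adjusted quantile
      -(m + zcf * s)                                   # loss reported as a positive number
    }
    set.seed(5)
    r <- 0.02 - rgamma(600, shape = 4, scale = 0.005)  # simulated left-skewed monthly returns
    c(normal_VaR = -(mean(r) + qnorm(0.05) * sd(r)), modified_VaR = mvar_cf(r, 0.95))

For a left-skewed, fat-tailed series like this one, the modified VaR exceeds the Gaussian VaR, which is exactly the correction that matters when optimizing hedge-fund weights.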

Sujan Balachandran, Bayes Estimation under Bounded Asymmetric BLINEX Loss in  a Direct-Mail Decision Problem, November 24, 2004 (Martin Levy, Norman Bruvold, David Rogers)
While unbounded symmetric loss functions, such as squared-error loss, are widely used in Bayesian statistical decision theory because of their mathematical convenience, there are many situations where a bounded and asymmetric loss, such as the BLINEX loss, is more desirable.  The aim of a direct-mail marketing problem is to maximize profitability by increasing the order size and to increase market share by familiarizing potential customers with the firm's products.  However, we restrict our problem to the quantitative realm and present an application of Bayes estimation under BLINEX loss to a direct-mail decision problem in which maximum profit is the main decision goal and mailing size is the decision variable.  Our aim is to recreate, using a real data set, a scenario very similar to one previously studied by simulation: a scenario that demonstrates and quantifies how the profitability of Bayesian estimation can be improved by incorporating the intrinsic boundedness and asymmetry of the direct-mail loss structure.  An algorithm is used to fit the BLINEX loss based upon information elicited from decision makers in general circumstances.

Ning Shao, Semiparametric Estimation for Credit Scoring, November 16, 2004 (Liang Peng, Martin Levy, David Kelton)
Credit scoring is a statistical system used for assessing the creditworthiness of potential borrowers and classifying customers into 'good' or 'bad' risk classes.  With the explosive growth in the consumer credit market, credit scoring methods have become increasingly important.  They are now standard tools of credit card companies, banks, and mortgage companies for assessing loan applications and minimizing the cost of failure across risk groups.  Common classification and regression methods for credit scores are usually linear in the explanatory variables.  However, in many applications, there is not always evidence of a generalized linear relationship.  Data-driven nonparametric and semiparametric modeling techniques, such as generalized additive models, generalized partially linear models, and generalized single-index models, emerge as promising alternatives that offer the flexibility to fit curvature and yet retain ease of interpretability.  They are often considered important data-mining techniques in the initial stage of exploratory data analysis.  This project investigates various semiparametric modeling techniques on French bank credit-scoring data: generalized linear models (GLM), generalized additive models (GAM), generalized partially linear models (GPLM), generalized partially linear single-index models by P-splines (GPLSIM-P), and generalized partially linear single-index models by kernel smoothing (GPLSIM-K).  The response variable of interest is a binary variable indicating default/no default on a loan.  The predictors are variables based on the customers' information and credit history.  The goal of this project is to study these semiparametric models from recent research using credit-scoring data, to reveal the relationship between variables, and to capture the curvature if any non-linearity exists.  Alternative methods such as classification and regression trees (CART) and neural networks are also discussed.
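Of the models listed, the generalized additive model is the most direct to sketch in R with the mgcv package; the data frame and variable names below are hypothetical, and smooths whose effective degrees of freedom stay near one indicate that a linear term would have sufficed.

    # Generalized additive model for default (1) versus no default (0).
    # 'credit' and its columns are hypothetical stand-ins for the bank data.
    library(mgcv)
    fit <- gam(default ~ s(age) + s(income) + s(debt_ratio) + employment_type,
               family = binomial, data = credit, method = "REML")
    summary(fit)                    # edf well above 1 signals non-linearity in that predictor
    plot(fit, pages = 1)            # visual check of each fitted smooth
    p_hat <- predict(fit, type = "response")   # estimated default probabilities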

Keli Feng, Identical Jobs Cyclic Scheduling: Formulation and Solution, October 8, 2004 (Uday Rao, Amtibah Raturi, Norman Bruvold)
We study the computationally hard, re-entrant-flow cyclic scheduling problem considered by Graves et al. (1983) and Roundy (1992).  We present two problem formulations to minimize job flow time (work-in-process), given a target cycle length (throughput).  We describe an efficient method to solve the problem to optimality; in computational experiments this method was significantly faster than commercial optimization software (CPLEX 8.0) and solved 40% more of the test instances to optimality within the specified run time and memory limits.  We also develop a new ImproveAlignment (IA) heuristic algorithm, which we test against the optimal solution or bounds.  Numerical experiments indicate that the proposed IA heuristic quickly produced solutions whose flow times were, on average, (i) 22% better than those from the Graves et al. heuristic and (ii) within 14% of the optimal flow time.

Vladimir V. Pashkevich, The Role of Culture-Level Factors in Shaping Online Purchase Intentions: A Cross-Country Comparison, August 17, 2004 (David Curry, James Evans, Yan Yu)
The primary goal of this research is to enhance our understanding of the moderating role that culture-specific variables - individualism/collectivism and cultural context - play regarding an individual's intentions to use the Internet for obtaining product information and shopping.  Specifically, this research (a) operationalizes the concept of cultural context by constructing an index with formative indicators, (b) develops reliable and valid scales for measuring constructs comprising the theory of planned behavior (TPB), and (c) examines the boundary conditions and generalizability of the TPB in Internet-mediated consumption settings.  The proposed model is used to examine effects of variables, at the culture level, on the strength of relationships among individual attitudes, experience, subjective norms, and purchase intentions.  Predictions under TPB are evaluated across two samples drawn from the United States and Belarus.  Findings reveal that subjective norms tend to influence decisions in high context/high collectivist cultures, but not in high individualist/low context cultures.  The effects of attitudes and past behaviors on intentions were equal for the American and Belarusian cultures.  Results of the proposed study are expected to yield implications for marketing practices across cultures.

Rachna Jaison, Volatility of Demand and its Operational Consequences: A Simulation Study of Systems Dynamics in the Machine-Tool Industry, August 12, 2004 (Amitabh Raturi, David Rogers, Jeffrey Camm)
The machine-tool industry, a small but vital sector in U.S. manufacturing, suffers high volatility in demand due to a combination of factors.  Several trillion dollars' worth of inventory lie wasted in the supply-chain pipelines when demand recedes; alternatively, major opportunity losses in business are incurred when firms are unable to deliver during periods of high demand.  Machine-tool firms, furthermore, have a severe organizational problem of maintaining a skilled labor force in this highly volatile scenario.  Many studies have tried to understand the sources of the volatility and to test alternate policies to reduce volatility, such as reducing order lead time, information lead time, and capacity-planning lead time, altering the work force, and encouraging smoother customer ordering policies.  In this study, I use systems dynamics and dynamic simulation to model the non-linear causal, delay, and feedback loops in the machine tool industry.  A simple model of a machine-tool maker and a customer is created using Vensim to test various strategies that firms can implement to mitigate the effect of volatility on the industry.  From my simulations, I conclude that: (1) the bullwhip effect and the investment-accelerator effect are the two main factors responsible for the extreme amplification of volatility in the machine tool industry, (2) a decrease in the volatility in product orders by the customer increases the average productivity of the machine tool builder significantly, (3) an increase in the customer-order volatility leads to a significant decrease in the average experience level of the machine tool maker's employees, (4) reducing the production lead time reduces the backlog for the machine tool maker and benefits the entire supply chain (although the sensitivity tests reveal that reductions in lead time can have unexpected effects on the machine tool maker's production level and capacity), and (5) smoother customer order policies are the most effective vehicle for reducing order volatility significantly compared to other changes in the machine tool operating policies or parameters.

Severine Renault, Forecasting Residual Value Insurance Using Logistic Regression, May 24, 2004 (Martin Levy, Jeffrey Camm, Norman Bruvold)
The purpose of this paper is to develop and assess a logistic regression model to predict the probability of claim for a Residual Value Insurance (RVI) portfolio. This type of insurance is a highly specialized asset-management tool through which an insurance provider assumes the market risk associated with end values of leased assets, automobiles in this case. In the past, the vehicle type, either at the make or model level, has been used to segment data into different groups, for each of which a separate model was built. The focus here is to include a categorical variable representing these groups in the model itself in order to fit a single regression for the entire portfolio. Fitting the model involves looking for confounding and interaction between the categorical variables and other independent variables, testing the significance of each input variable in the model, and finally deciding whether the vehicle make or the vehicle model is the more relevant representation of vehicle type as a risk factor. The log-likelihood ratio test and the Wald chi-square statistic were used at this stage, the former to compare different regression models and the latter to test individual coefficient estimates. Once a satisfactory set of variables has been defined, the next step is to assess the model. For this we relied on commonly used statistics for logistic regression, namely the c statistic for the area under the ROC curve, the Hosmer-Lemeshow goodness-of-fit statistic, and the Osius and Rojek normal approximation to the distribution of the Pearson chi-square statistic. Since this second stage led us to conclude that the model was not a good fit, this paper ends with a brief comparison with results obtained from models where the data were partitioned by vehicle type and the corresponding categorical variable removed.
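The assessment step can be illustrated with a short R sketch: the c statistic follows from the rank ordering of the fitted probabilities, and a decile-based Hosmer-Lemeshow statistic follows from comparing observed and expected claims within risk groups. The data frame and predictors named below are hypothetical, and packaged implementations of both statistics exist; this is only the arithmetic.

    # Logistic model assessment: c statistic and a decile-based Hosmer-Lemeshow statistic.
    # 'rvi' and its columns are hypothetical stand-ins for the portfolio data.
    fit <- glm(claim ~ lease_term + mileage_band + vehicle_group,
               family = binomial, data = rvi)
    p <- fitted(fit); y <- rvi$claim

    c_stat <- (mean(rank(p)[y == 1]) - (sum(y == 1) + 1) / 2) / sum(y == 0)   # Wilcoxon form of the AUC

    g     <- cut(p, quantile(p, 0:10 / 10), include.lowest = TRUE)            # ten risk deciles
    obs   <- tapply(y, g, sum)                                                # observed claims per decile
    expct <- tapply(p, g, sum)                                                # expected claims per decile
    ng    <- tapply(p, g, length)
    HL    <- sum((obs - expct)^2 / (expct * (1 - expct / ng)))                # ~ chi-square with 8 df
    c(c_statistic = c_stat, HL = HL, p_value = pchisq(HL, 8, lower.tail = FALSE))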

Karen L. Bickel, Evaluating Intensive Care Unit Mortality: a Comparison of Risk-Adjustment Methods , April 2, 2004 (Norman Bruvold, David Rogers, David Kelton)
Adjusting for differences in patient characteristics present on admission to the intensive care unit (ICU) is essential when comparing ICU outcomes. Mortality risk-prediction models measure variation in patient outcomes by severity of illness and predicted risk of death. Much of the literature refers to the use of risk-prediction models to evaluate the clinical performance and cost-effectiveness of ICUs. Computerization of commonly used laboratory variables, in conjunction with the often extraordinary costs associated with manual data entry, presents an opportunity for the development of an automated, risk-adjusted ICU mortality model. We compare the performance characteristics of two different risk-adjusted ICU mortality models: the National Veterans Administration (VA) Surgical Quality Improvement Program (NSQIP) surgical risk model, a partially manual data-collection process that identifies pre-surgical risk factors and uses those risk factors in the development of a 30-day mortality model for major surgical procedures, and the Veterans Administration Intensive Care Unit Risk Adjustment model (VIR). Assessment of model fit was completed using the Hosmer-Lemeshow goodness-of-fit statistic, sensitivity and specificity measures, and the c-statistic as performance metrics in evaluating the behavior of each model. Our results indicate that the VIR automated mortality risk-prediction model produced similar, if not improved, model performance compared with the widely used manual data-collection method of the NSQIP model. These results demonstrate that the VIR computerized mortality risk-prediction method yields results comparable to the NSQIP mortality risk-prediction model for these data and warrants further study.

Piu Bose, Analysis of Covariance Model to Evaluate the Impact of a $40 Million Ad Campaign in a test Market, Using Retailer-Level Data , March 19, 2004 (Norman Bruvold, Jeffrey Camm, Martin Levy)
Market researchers are concerned with the effects of different interventions or experimental conditions (treatments) on a set of consumers. These experiments are used to reject or affirm a hypothesis and, in the case of rejection, to provide support for an alternative conclusion. But in the real world, these treatments often become confounded with extraneous factors that constantly play in the marketplace. As a result, the impact on consumers is a function of both the test treatment and the external factors. It therefore becomes impossible for the researcher to evaluate the true impact of the test treatment and thereby accept or reject the hypothesis. This paper examines the methodology called ANCOVA, or Analysis of Covariance, which is used to evaluate a test treatment while eliminating the influences of extraneous non-test factors. ANCOVA combines two statistical techniques, regression analysis and ANOVA. Here the dependent variable scores and treatment conditions constitute the data, but the model includes not only experimental conditions but also one or more quantitative predictor variables. These quantitative predictors, known as covariates, represent sources of variation that are thought to influence the dependent variable but have not been controlled by the experimental procedures. ANCOVA determines the covariation (correlation) between the covariate(s) and the dependent variable and then removes the variance associated with the covariate(s) from the dependent variable scores prior to determining whether the differences between the experimental condition (dependent variable score) means are significant.
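In R, the whole procedure reduces to adding the covariate to the linear model ahead of the treatment factor, so that the treatment is tested on what remains after the covariate-related variance is removed; the data frame and variable names below are hypothetical stand-ins for the retailer-level test-market data.

    # Analysis of covariance for the ad-campaign effect, adjusting for pre-campaign sales.
    # 'stores' (sales_during, sales_pre, campaign) is a hypothetical data frame.
    fit_cov <- lm(sales_during ~ sales_pre + campaign, data = stores)
    anova(fit_cov)                                    # covariate entered first, then the treatment test
    anova(lm(sales_during ~ campaign, data = stores)) # for contrast: the unadjusted comparison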

Yahong Cui, Application of Multivariate Adaptive Regression Splines (MARS) in Direct Marketing, March 15, 2004 (Martin Levy, Jeffrey Camm, Yan Yu)
Increasing costs of direct marketing campaigns coupled with declining response rates have prompted many direct marketers to turn to more sophisticated techniques to model response behavior. The underlying premise is that even a small improvement in prediction accuracy can have significant implications for the bottom line. This study investigates the use of a recently developed technique, Multivariate Adaptive Regression Splines (MARS), together with logistic regression in the context of modeling direct response. In this study, we report a performance analysis among MARS models, logistic regression models, and expectation models, i.e. MARS and logistic regression combined. The MARS procedure builds flexible regression models by fitting separate splines to distinct intervals of the predictor variables. Specifically, our goal is to assess the relative effectiveness of MARS models vis-à-vis logistic regression with original predictor variables in modeling direct response behavior. Our analyses show that the expectation models and the MARS models outperform the logistic model in general, leading us to conclude that MARS offers a number of advantages over a logistic model and MARS can improve the performance of logistic regression models. Direct marketing strategy implications in variable selection, model evaluation, and error variation stabilization are also discussed in this study.
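One widely used open-source implementation of MARS is the earth package in R; the sketch below (hypothetical data frames and column names) fits hinge functions with two-way interactions and refits the selected basis functions with a logistic link, which is one way to realize the MARS-plus-logistic combination the study examines.

    # MARS for a binary response via the 'earth' package.
    # 'dm_train' and 'dm_test' are hypothetical direct-response data frames with a 0/1 'response'.
    library(earth)
    fit <- earth(response ~ ., data = dm_train, degree = 2,     # degree = 2 allows interactions
                 glm = list(family = binomial))                 # refit basis functions with a logit link
    summary(fit)                                                # selected hinge functions
    p <- predict(fit, newdata = dm_test, type = "response")     # scored probabilities for the hold-out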

Hui Hui, Comparing Logistic Regression, Classification Trees, and Hybrid Tree-Logit Models on Building Scoring Models for Catalog Mailing Campaign Data, March 10, 2004 (Martin Levy, Norman Bruvold, Jeffrey Camm)
In the last few decades, the direct mailing campaign has become an important field of direct marketing. An effective direct mailing campaign aims at selecting the target groups, offer, and communication elements (at the right time) that maximize profits. Of these four components, the list of customers to be selected is considered the most important. Therefore, a large amount of direct marketing research focuses on list segmentation or target selection techniques. The scoring model is an effective methodology for realizing the purpose of target selection. It assigns every observation in a database a score indicating how likely someone is to respond to a mailing campaign. Thus, according to these scores, the direct marketer can pick a specific number of people to receive a particular offer so that the response to the mailing is maximized. The objective of this project is to compare the performance of three predictive methodologies, Logistic Regression, Classification Tree, and the Hybrid Tree-Logit model, in building scoring models to distinguish between likely responders and nonresponders. By applying these three methodologies to a catalog mailing campaign data set, which has 106,284 records and 47 fields, I came to the conclusion that the hybrid model is the best one for creating the scoring model in this project, since it fits the data better while maintaining performance properties similar to those of logistic regression. From the analysis results, I also found that while a classification tree is not as good for building the scoring model, it is the best choice for the classification task here.
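One common way to build such a hybrid, sketched below in R with hypothetical data and not necessarily the exact construction used in the project, is to grow a classification tree, convert its terminal nodes into a segment factor, and then fit a logistic regression that uses the segments alongside continuous predictors.

    # Hybrid tree-logit scoring sketch ('camp' is a hypothetical campaign data frame
    # with a 0/1 'respond' column and RFM-style predictors).
    library(rpart)
    tree <- rpart(respond ~ ., data = camp, method = "class",
                  control = rpart.control(cp = 0.002, minbucket = 200))
    camp$segment <- factor(tree$where)            # terminal-node membership for each record

    hybrid <- glm(respond ~ segment + recency + frequency + monetary,
                  family = binomial, data = camp)
    camp$score <- predict(hybrid, type = "response")   # sort descending to pick the mailing list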

Rajesh Radhakrishnan, Interactive Route Builder for Logistics Planning, February 27, 2004 (Jeffrey Camm, Michael Fry, David Curry, Robert Martichenko)
Logistics can be described as the planning, organizing, and managing of activities that provide goods or services. Route design is at the core of the 'planning' phase of operations for a 3PL (third-party logistics) company. The first step involves plotting all the locations on a map to identify clusters of suppliers (based on their location and freight information). Then routes can be designed that send freight from the suppliers either directly to the plant or to a consolidation center called a 'crossdock.' Route design has traditionally been a slow manual process done on an Excel worksheet with the use of mapping software to print a map of the supplier locations. The designer also has to rely on his or her experience to come up with a good design on the very first attempt, since the process has historically taken too long to allow for multiple designs and comparative studies among them to choose the best one. In this project, I have designed a software tool named the 'Interactive Route Builder' (IRB) to facilitate route-grouping and route design in general. It has made the route-design process quicker and more efficient (locations are added to a route by simply clicking on them on an embedded map). The IRB allows the route designer to quickly generate a number of routing scenarios and compare them based on different parameters (such as total cost, cost per cube, etc.). This report also includes mathematical and simulation analysis of the parameters to use when geo-locating a crossdock. The solution that minimizes the sum, over suppliers, of the product of cube and distance from the proposed crossdock is recommended over the cube-center-of-gravity solution.
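The crossdock-location comparison at the end lends itself to a small numerical illustration; the R sketch below, with simulated supplier coordinates and cube values, minimizes the total of cube times straight-line distance and contrasts the result with the cube-weighted center of gravity.

    # Minisum crossdock location versus the cube-weighted center of gravity (simulated data).
    set.seed(9)
    sup <- data.frame(x = runif(30, 0, 100), y = runif(30, 0, 100),
                      cube = runif(30, 10, 100))

    cost <- function(loc) sum(sup$cube * sqrt((sup$x - loc[1])^2 + (sup$y - loc[2])^2))
    opt  <- optim(c(mean(sup$x), mean(sup$y)), cost)          # minimize sum(cube * distance)
    cog  <- c(weighted.mean(sup$x, sup$cube), weighted.mean(sup$y, sup$cube))

    rbind(minisum           = c(x = opt$par[1], y = opt$par[2], cost = opt$value),
          center_of_gravity = c(x = cog[1],     y = cog[2],     cost = cost(cog)))

Because the minisum point is, up to numerical convergence, the optimum of this objective, its cost cannot exceed that of the center-of-gravity location, which is the basis for the report's recommendation.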

Samir Kulkarni, An Exploration of the Resource Constrained Scheduling Capabilities of Microsoft Project, February 13, 2004 (Amitabh Raturi, Jeffrey Camm, Michael Fry)
The resource-constrained scheduling problem (RCSP) is a significant challenge because of the mathematical complexities that exist within the problem's formulation. Over time, software packages have been developed to aid practitioners in solving the RCSP, and these programs have become increasingly user-friendly and versatile in the amount of data they can incorporate. As the software became more complex, it is claimed that the mechanisms used to determine the best resource-constrained schedule began to deviate from what had been proven academically. In this paper, we explore the gap between academic research and the capabilities of scheduling software, specifically in the software's ability to produce a schedule optimal with respect to certain objectives. We study the RCSP literature and analyze the leveling capabilities of Microsoft Project to gain insight into the aforementioned gap. The exact goals of this paper are to: 1) discuss the major developments in RCSP that have brought the field to where it is today, 2) discuss the leveling capabilities of Microsoft Project, a leading scheduling software package, and the methods that it uses to obtain a feasible resource-constrained schedule, and 3) provide insight into the effectiveness of Microsoft Project's leveling algorithm by comparing the results of several problems implemented both in MSP and as mixed-integer programs in AMPL/CPLEX.

Huiqing Zhou, Response Models In Direct Marketing, January 21, 2004 (Martin Levy, Jeffery Camm, Norman Bruvold)
Direct marketing (DM) is a key area where scientific methods are often applied to analyze a massive amount of business data. The core of the decision process in DM is a response model, which is applied to assess the purchase propensity of each customer in the list prior to the mailing. A variety of approaches have been developed in the direct marketing industry to model response, e.g., RFM (Recency, Frequency, Monetary) variables, tree-structured automatic segmentation methods such as AID (Automatic Interaction Detection), CHAID (Chi-squared Automatic Interaction Detection), and CART (Classification and Regression Trees), and linear statistical models such as logistic regression. In this paper, two popular models (the logistic regression model and the RFM model) are introduced, built, and evaluated. It is shown that logistic regression slightly outperforms the RFM model, while each model has its own specific advantages.

 

2003

Hong Gu, Using Data Mining Technology to Build a Predictive Model and to Gain Understanding of Customer Characteristics for a Multi-division Catalog Company, December 1, 2003 (Martin Levy, Jeffrey Camm, Yan Yu)
Data mining techniques enable companies to evaluate historical transaction data from consumer databases and to develop a good consumer model, grouping customers based on visit frequency, profitability, etc. In this project, the data are the catalog purchases from a multi-division company that mails different catalogs to a unified customer base. The dataset contains 96,551 customer records, and each record has 163 fields, including life-to-date orders, dollars, items, payment method, and very minimal demographics. All the customers receive the Division D catalog. The project is aimed at identifying the characteristics of would-be responders and at constructing a model that can predict which customers are most likely to respond to their Division D catalog solicitation. The outcome variable, “buying from Division D”, is binary, while the predictor variables are either continuous or categorical. Logistic regression, CHAID, and CART approaches are employed. Since there are 163 variables involved, reducing the variables to a manageable size prior to model building is an essential and large step. Due to the dominant number of non-responders in our dataset and the limitations of the software Answer Tree 1.0, logistic regression contributes a great deal to variable screening. The final logistic regression model can correctly predict 60.8% of the total would-be responders; the CHAID model can correctly predict 54.09% of the would-be responders, and the CART algorithm in Answer Tree 1.0 can correctly predict 66.58% of the would-be responders. In terms of prediction, CART outperforms the other two. Furthermore, the tree maps provide an intuitive understanding of why certain segments respond better than others. However, the 15-node CART tree can only provide 15 different estimated probabilities, whereas the logistic regression model produces a unique predicted probability for every record.

Snehlata Bomma, Conjoint Bridging and Optimization Project, November 24, 2003 (David Curry, Jeffrey Camm, Uday Rao)
Political polling, whether of public opinion about issues, such as gun control, or direct preference polling for political candidates, has traditionally relied on very distinct survey methodology. Respondents are asked to select their preferred candidate in a mock election or to answer “yes or no” regarding a specific issue. However, most of the supercharged issues of today are multidimensional. Their complexity is “dumbed down” by standard methods, a disservice to political constituents most affected by polling results. This thesis suggests an alternative technique for assessing public opinion that deals well with complexity. The basic method, called conjoint analysis, has been employed in marketing and psychological research for several decades. However, recent developments in conjoint bridging designs and conjoint optimization enhance the applicability of the overall “technique package” to political polling, yielding many insights unavailable with today's standard approaches. In this thesis, we analyze results from an online survey that involves conjoint analysis. We test a theory of “conjoint bridging” that pools parameter estimates between two conjoint exercises. Respondents are asked to react to various hypothetical candidates for US president based on the candidate's positions on several dimensions of Homeland Security policy. Output from the conjoint analysis is then used in a conjoint optimization phase to find an “optimal position on Homeland Security”. Optimal means that even though individual voters weight attributes differently and prefer different levels, there is a single combination of levels that will please the most voters.

Xuming Yang, Framingham Heart Study Data Analysis -- A Case Study for GLM, GPLSIM and GAM, November 12, 2003 (Yan Yu, Jeffrey Camm, Norman Bruvold)
One of the most important techniques in statistics is regression analysis. Applications lie in a variety of fields, such as finance, marketing, and many medical fields. Linear regression can provide useful and interpretable descriptions of the linear relationship between response and predictor variables. Generalized linear models are powerful for fitting the linear relationship between variables when the response is from a general exponential family, for instance, binomial or Poisson. Unfortunately, in many applications, there is not always evidence of a generalized linear relationship. Other data-driven nonparametric modeling techniques, e.g., generalized single-index models and generalized additive models, emerge as promising alternatives that offer the flexibility of fitting the curvature and yet retain ease of interpretability. This project focuses on the application of several different modeling techniques -- generalized linear models (GLM), generalized partially linear single-index models (GPLSIM), and generalized additive models (GAM) -- to Framingham Heart Study data. The response variable of interest is a binary variable indicating the occurrence of coronary heart disease. The predictors are the patients' age, cholesterol level, systolic blood pressure, and smoking status. The objective of this project is to apply and compare different models using Framingham Heart Study data to reveal the relationship between variables and to capture the curvature if any non-linearity exists. From this case study we conclude that logistic or probit regression performs well in fitting the linear relationship. When a nonlinear relationship exists, generalized additive models and generalized partially linear single-index models are better at capturing the non-linearity. GAMs are very helpful for a visual inspection of non-linearity. The GPLSIM fits the Framingham data best and retains ease of interpretability.

Neil D. Eisner, A Daily Replenishment Production Scheduling and Inventory Minimization Simulation, October 15, 2003 (Michael Fry, David Kelton, David Rogers)
General Cable Corporation is a $1.6 billion manufacturer of industrial and specialty cable products, spread over seven major product groups. Within a major product group, products are initially subdivided into families, termed by management “product lines.” The Portable Cord major product group is manufactured exclusively at the company's facility in Lincoln, Rhode Island. While the firm is relatively early in its implementation of more modern manufacturing practices, several cells are currently in operation at the Lincoln plant, with each cell dedicated to the manufacture of a particular group of product lines. This study addresses demand planning and production scheduling for a single cell involved in the manufacture of product lines 40, 42, 43, 46, P5, and Q5. A well-known advantage of cellular manufacturing configurations is the enhanced capability for quick and more effective response to highly variable demand. The daily bookings for these product lines, aggregated across the company's five distribution centers, demonstrate extreme variability (i.e., a coefficient of variation of 79.16). Current demand planning is simplistic and leads to excessively high inventory carrying costs. Using a quarterly planning horizon, the mean plus 2.06 standard deviations (corresponding to a 98% service level) of the previous quarter's demand is calculated, and production is scheduled for the upcoming quarter at a fixed daily rate sufficient to equal last quarter's demand. A simulation model is developed using the most recent two years of historical booking data. We provide an estimate of the inventory levels the firm would need to carry if a daily replenishment production scheduling system were to be implemented, maintaining the same 98% service level to customers. The product lines under investigation exhibit strong commonality in both their manufacturing processes and their bills of material. Manufacturing cycle times are on the order of hours; therefore, the cell dedicated to a particular product line is capable of a one-day turnaround time in response to bookings. Additionally, the individual distribution centers demonstrate their own unique characteristics. The mean demands at the regional distribution centers (RDCs) differ widely (all with similar high variability), implying that the relative contribution of each is very different with respect to meeting the overall service-level goal. Neither shipping lead times nor shipping frequency is the same for any two RDCs. Another complicating factor is the occasional need to make a large shipment from the plant directly to a customer. This study first identifies the forecasting method that will drive the daily production schedule. The proposed process through which products are distributed, manufactured, and replenished is mapped in detail. A simulation model of this system is built using Arena® discrete-event simulation software. Variants of the model are explored, such as the sequence of RDC fulfillment, the daily production control limits, and the pallet (lot) size. Lastly, the potential effect of forecast accuracy on inventory levels is evaluated and described.

Marione P. Gonzales, A Model for Profiling Radio-Stations Listeners, Using Logistic Regression, CART, and CHAID for a Given Data Set, August 28, 2003 (Martin Levy, Norman Bruvold, Jeffrey Camm)
Data mining has drawn much attention in the business, marketing, and medical fields. Data-mining techniques can be used to find relationships and patterns in historical data for the purpose of predicting or classifying future observations. Practical applications include assessing the creditworthiness of loan applicants and predicting a patient's risk of developing an illness. This paper attempts to develop a model for profiling radio-station listeners using statistical methods such as logistic regression, CART (Classification And Regression Trees), and CHAID (Chi-Square Automatic Interaction Detection) for a given data set. More specifically, the data provided by “RadioX” (the radio station of interest) were segmented into listeners of RadioX and listeners of other radio stations. The ability to analyze categorical data is the primary reason for choosing logistic regression, CART, and CHAID. Results from the analyses using these three methods are compared and combined to profile RadioX listeners. The predictive performance of each statistical method can be measured by the misclassification rate, the proportion of observations classified incorrectly. In terms of accurately distinguishing RadioX listeners from non-RadioX listeners, logistic regression, CART, and CHAID give almost the same total misclassification rate, but logistic regression gives a better misclassification rate for segmenting specific RadioX listeners. However, difficulties exist in interpreting interaction effects in the logistic regression analysis. So we use logistic regression mainly to isolate important variables, and then use CART and CHAID to determine categorical values.

Yanrong Cao, Penalized Spline Estimation For Functional Coefficient Regression Models for Nonlinear Time Series, July 25, 2003 (Yan Yu, Martin Levy, David Kelton)
A penalized spline approach is proposed to estimate functional coefficient regression models for nonlinear time series. Functional coefficient regression models assume that the regression coefficients vary with certain lower-dimensional covariates, providing appreciable flexibility in capturing the underlying dynamics in data and avoiding the so-called “curse of dimensionality” in multivariate nonparametric time series estimation. One of the appeals of the proposed model lies in the efficient estimation of the coefficient functions via a global smoothing method. In addition, different degrees of smoothness are allowed for different functional coefficients, which is enabled by assigning a different penalty to each. The penalty terms, selected by minimizing generalized cross-validation (GCV) scores, balance goodness-of-fit and smoothness. The number and location of knots are no longer crucial once a minimum number of knots is reached. The consistency and asymptotic normality of the penalized least-squares estimators are obtained. Our penalized spline approach also enables multi-step-ahead forecasting with an explicit model expression, in contrast to local smoothing methods. The proposed approach is demonstrated by both simulation examples and a real data application.
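A loosely analogous fit is available off the shelf in R through mgcv's 'by'-variable smooths, which also use penalized splines, though with mgcv's own basis, penalty, and GCV machinery rather than the estimator developed in the paper; the example below is simulated, with the second lag acting as the index variable.

    # Functional-coefficient style fit via penalized splines: the coefficient of the
    # first lag varies smoothly with the second lag (simulated AR series).
    library(mgcv)
    set.seed(8)
    n <- 500; x <- arima.sim(list(ar = 0.5), n = n + 2)
    d <- data.frame(y  = as.numeric(x[3:(n + 2)]),
                    x1 = as.numeric(x[2:(n + 1)]),    # lag 1
                    u  = as.numeric(x[1:n]))          # lag 2, the index variable
    fit <- gam(y ~ s(u) + s(u, by = x1), data = d, method = "GCV.Cp")
    summary(fit)
    plot(fit, pages = 1)    # the second panel shows the estimated coefficient function of x1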

Sara Dziech, Exploratory Analysis of Horse Racing Data, May 27, 2003 (Martin Levy, Norman Bruvold, Yan Yu)
Gambling or wagering is big business and is becoming even bigger in the greater Cincinnati area. State lotteries and the gambling boats have brought legal betting back into the spotlight. This surge of renewed interest in gambling has brought more attention to one of the oldest forms of wagering, horse racing. With the increase in the use of home computers and the internet, now more than ever, an overwhelming amount of data is available on individual horse performance, track entries, and results. Using these data, an exploratory statistical study was conducted to look for trends in the data and to create predictive models to help select the “winners.” The study also examined whether the results were truly random, or whether there were commonalities that would allow the astute handicapper to have an advantage over the common bettor. The ability to predict which horses will finish in the money (first, second, or third) would be key to actually making money at the track, since the bigger payoffs come from the exotic bets such as the Daily Double or Exacta. The models and analysis presented here may prove useful in successfully selecting the horses that will finish in the top three, or in the money. Basic statistics were reviewed and key elements presented. Weighted general linear models were created using the percent of finishes in the money as the dependent variable. The logistic models were developed using a binary dependent variable -- finished in the money or did not finish in the money. CHAID analysis using Answer Tree was also performed. Each type of analysis was conducted from two views -- all tracks combined, with the emphasis on overall trends and between-track differences, and track-specific models. The resulting models and their appropriateness were compared. The final part of the project involved testing the predictive ability of the logistic model against the selection performance of a few average people to see if the model was more successful than random guesses.

Thushan Wijesinghe, A Comparison of Two Heuristic Solutions for Scheduling Time-Shared Jets, May 27, 2003 (Jeffrey Camm, Michael Magazine, David Rogers)
Fractional ownership of jets has grown rapidly in popularity over the last three to four years. Under the concept of "fractional ownership," customers become partial owners of an aircraft, which in turn entitles them to fly a predetermined number of hours per year. If the requesting customer has enough flying hours left, it then becomes the task of the scheduler of the airline company to assign a jet to fulfill the demand. A rather limited amount of academic research has been done in the area of scheduling time-shared jets. So far, an integer-programming solution and a minimum cost-flow heuristic solution have been put forward. In this paper two heuristic approaches are presented for solving this problem, and both are compared to an IP approach for a base-case scenario. The first heuristic minimizes the relocation times for jets (the objective of the problem) by using a “one-step look ahead” rule (a “greedy heuristic”). The second heuristic allocates trips to jets based on the number of remaining trips that each jet should serve. The prime advantages of such a heuristic solution are the ease of formulation, minimum user intervention, fewer variables (compared to the IP solution), and no need for sophisticated software such as CPLEX.

Qiang Zhu, Building Credit-Scoring Models Using Logistic Regression, CART, and CHAID, May 27, 2003 (Martin Levy, Jeffrey Camm, Norman Bruvold)
Classification methods have been widely used in the identification of respondent profiles. One of the most important applications is in credit scoring, a method used by lenders to help decide whether or not an applicant is a good candidate for a loan. In this work, three classification techniques are applied and compared to analyze a complex data set of credit risks. The dependent variable is a binary variable indicating whether or not an applicant defaulted on a loan. The three classification techniques are logistic regression, Classification and Regression Trees (CART), and Chi-squared Automatic Interaction Detection (CHAID). It turns out that, on a given test sample, the logistic regression technique achieves the best predictive performance. An additional two-step CHAID and logistic regression analysis is applied to produce a combined prediction, in order to determine whether the combination of two techniques achieves better performance than a single technique. This combined prediction turns out to be slightly worse than the logistic regression model, but performs better than the CHAID and CART models. Therefore, we propose CHAID as a method of enhancing the interpretation of a logistic regression model through the examination of the significant predictors and interaction terms.
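
A compact illustration of the kind of out-of-sample comparison described (logistic regression versus a CART-style tree) is sketched below in R with simulated applicant data; the variable names and coefficients are invented for the example, and CHAID is omitted because it is not available in base R.

    # Compare holdout misclassification rates of logistic regression and rpart (CART).
    library(rpart)
    set.seed(2)
    n <- 2000
    d <- data.frame(
      income     = rlnorm(n, 10, 0.5),
      debt_ratio = runif(n),
      n_late     = rpois(n, 1)
    )
    p <- plogis(-2 + 2 * d$debt_ratio + 0.5 * d$n_late)
    d$default <- factor(rbinom(n, 1, p))
    train <- d[1:1500, ]; test <- d[1501:2000, ]
    lr   <- glm(default ~ income + debt_ratio + n_late, data = train, family = binomial)
    tree <- rpart(default ~ income + debt_ratio + n_late, data = train, method = "class")
    lr_class   <- ifelse(predict(lr, test, type = "response") > 0.5, "1", "0")
    tree_class <- predict(tree, test, type = "class")
    mean(lr_class != as.character(test$default))                   # logistic misclassification rate
    mean(as.character(tree_class) != as.character(test$default))   # CART misclassification rate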

Wensui Liu, A Case Study Comparing CART and Neural Networks, March 21, 2003 (Martin Levy, Jeffrey Camm, Norman Bruvold)
Traditionally, statistical methods such as logistic regression and discriminant analysis have been widely used for classification. However, when the assumptions for statistical analysis are not met, alternatives need to be considered. In this project, two popular nonstatistical methods for classification, Classification and Regression Trees (CART) and Neural Networks (NNs), are discussed. For neural networks, two widely used paradigms in classification are covered: the feed-forward network trained by the back-propagation algorithm (BPN) and the Generalized Regression Neural Network (GRNN). To evaluate the performance of these methods, I apply them to benchmark classification data, build models to make predictions, and then compare the results. After comparing six models from these two methods, we find that BPN outperforms the other methods in predictive performance. However, it needs more computational effort and longer training times, and its results are difficult to interpret. CART is easier to use and can be interpreted intuitively, and its results are almost as good as those from BPN. Therefore, we conclude that none of these models is clearly superior to the others, and a possible compromise is to combine the two methods to improve classification analysis.

 

2002

WenWen Wu, A Study of Tree Models, November 26, 2002 (Yan Yu, Martin Levy, Sung-Eun Kim)
Tree models, which recursively partition data into more homogeneous subsets, are widely used for data mining. Seven popular methods are discussed in this work: Classification and Regression Trees (CART), Bayesian CART, FACT, Quick Unbiased Efficient Statistical Tree (QUEST), Treed models, Bayesian Treed models, and Multiple Additive Regression Trees (MART). CART, one of the exhaustive search methods, is taken as the base model. The best split in CART is on the variable that minimizes the impurity of the nodes, and mean values in terminal nodes are used as predicted values. Bayesian CART adds stochastic methods of parameter estimation and model selection to CART. FACT and QUEST split nodes only on selected variables, with unbiased variable selection and fast computational speed. Treed models and Bayesian Treed models place a subset of the original data set in each terminal node and fit a separate statistical model to it. MART is an additive model of many small regression trees, such as CART; its strengths are robustness and predictive accuracy. We apply the above methods to two simulated data sets (with categorical and continuous responses, respectively) and one real data set, the Boston Housing data. However, owing to software availability, only CART, the Bayesian Treed model, and QUEST are applied in the project. Traditional statistical methods, linear regression and logistic regression, are also used for comparison purposes. When the response variable is categorical, the misclassification rate is used as the criterion for model comparison; when the response variable is numerical, root mean squared error (RMSE) is used. Residual plots are also used to compare model fit. From the applications in this project, CART and QUEST perform best for the categorical response case, while CART and the Bayesian Treed model are best for the numerical response case. For the more complicated Boston Housing data, the Bayesian Treed model has the best performance. The results indicate that the strengths of the Bayesian Treed model and QUEST are most apparent for data with complicated structure. In the general case, CART is strongly recommended: it is not only easy to use (via built-in routines in software such as S-Plus) but also yields good results.
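
MART, in particular, is closely related to gradient-boosted regression trees; a rough, hedged analogue (not the software used in this project) can be run in R with the gbm package on simulated data.

    # Boosted regression trees: a MART-style additive expansion of small trees.
    library(gbm)
    set.seed(3)
    n  <- 1000
    x1 <- runif(n); x2 <- runif(n); x3 <- runif(n)
    y  <- 2 * sin(pi * x1) + (x2 - 0.5)^2 + rnorm(n, sd = 0.2)
    d  <- data.frame(y, x1, x2, x3)
    fit <- gbm(y ~ x1 + x2 + x3, data = d, distribution = "gaussian",
               n.trees = 2000, interaction.depth = 3, shrinkage = 0.01, cv.folds = 5)
    best <- gbm.perf(fit, method = "cv")     # number of trees chosen by cross-validation
    pred <- predict(fit, d, n.trees = best)
    sqrt(mean((pred - d$y)^2))               # in-sample RMSE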

Yuan Yao, Analysis of Volatility Time Series Models and an Evaluation of their Forecasting Performance, November 8, 2002 (Martin Levy, Norman Bruvold, Yan Yu)
The coefficient of variation is one statistical measure of volatility; another simple estimate, for example, is the standard deviation of the closing price around its simple moving average. In finance, volatility is an important input to predictive and pricing models such as Black-Scholes and the CAPM. Moreover, an accurate estimate of return volatility may shed some light on the process generating the returns. We develop a methodology for good estimation and forecasting of volatility. This paper examines three types of time-series models for their performance in volatility forecasting of economic data and compares and evaluates their forecasting performance. For the US ten-year bond return, a set of Simple Moving Average models (SMA(M)) with different values of M is used to estimate and capture its volatility. We introduce GARCH, a well-known time series model for economic volatility analysis, to produce another measure of volatility. Based on the strong similarity between the volatilities produced by the two models, we find that SMA(M) can serve as a simpler substitute for GARCH in some situations. For the S&P 500 monthly excess return and its volatility, which show non-linear features, we build both linear time series models and nonlinear GARCH models for estimation and forecasting. Although volatility has some non-linear features, good linear models can still be developed to describe it, because the statistical theory and computational tools for linear models are well developed. Finally we determine which model performs better in volatility forecasting based on RMSE (root mean squared error).
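
The contrast between a rolling SMA(M) volatility estimate and a GARCH(1,1) fit can be sketched in R as below; the series is simulated (not the bond or S&P data used in the paper) and the example relies on the tseries package.

    # Simulate a return series with volatility clustering, then compare
    # an SMA(M) volatility estimate with GARCH(1,1) conditional volatility.
    library(tseries)
    set.seed(4)
    n <- 1000
    r <- numeric(n); s2 <- numeric(n)
    s2[1] <- 1e-4; r[1] <- rnorm(1, sd = sqrt(s2[1]))
    for (t in 2:n) {
      s2[t] <- 1e-5 + 0.10 * r[t - 1]^2 + 0.85 * s2[t - 1]
      r[t]  <- rnorm(1, sd = sqrt(s2[t]))
    }
    M <- 20
    sma_vol <- sqrt(stats::filter(r^2, rep(1 / M, M), sides = 1))  # SMA(M) estimate
    g <- garch(r, order = c(1, 1))                                 # GARCH(1,1) fit
    garch_vol <- fitted(g)[, 1]                                    # conditional standard deviations
    plot(sma_vol, type = "l"); lines(garch_vol, col = "red")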

Vivek Kalpande, On the Mean Length of Two-Component Systems Under Some Bivariate Survival Functions, October 22, 2002 (Martin Levy, Jeffrey Camm, David Rogers)
We examine the expected survival time of series and parallel systems whose components have bivariate distributed lifetimes. The mean lifetime of such systems is a function of the dependence structure of the component lifetimes. Results are extended to multi-component systems.
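
For reference (these identities are standard and are not quoted from the paper), with joint survival function \bar{F}(t_1, t_2) = P(T_1 > t_1, T_2 > t_2) and marginal survival functions \bar{F}_1, \bar{F}_2, the mean lifetimes of the two systems are:

    E[T_series]   = E[min(T_1, T_2)] = \int_0^\infty \bar{F}(t, t) \, dt
    E[T_parallel] = E[max(T_1, T_2)] = \int_0^\infty [ \bar{F}_1(t) + \bar{F}_2(t) - \bar{F}(t, t) ] \, dt

The dependence structure enters only through the joint term \bar{F}(t, t), which is why the mean system lifetime varies with the assumed bivariate survival function.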

Derek H. Wang, Return Analysis, Volatility Estimation and Trading on the Shanghai Stock Market, July 26, 2002 (Yan Yu, Martin Levy, Michael Ferguson)
In this project, the intraday return behavior of the Shanghai stock market is first examined with five-minute index data, and some interesting intraday seasonal patterns are found. The standard variance ratio test is used to test the random-walk hypothesis in order to understand the Shanghai stock market's microstructure efficiency. Three volatility models, including a continuous-time model, a GARCH model, and a time-dependent coefficient diffusion model with kernel regression estimation, are applied to estimate and compare the expected returns and volatilities of the five-minute data. In addition, this work presents a penalized spline approach to estimate time-dependent drift and volatility in term structure dynamics. The drift and diffusion (volatility) components are estimated iteratively with weighted least squares. Two other methods, moment matching and maximum likelihood estimation, are described. The new time-dependent diffusion model can be considered an extension of most term-structure models and a special case of the general time-dependent diffusion model. Compared with other estimation methods, the penalized spline estimation method is easy to implement and computationally inexpensive. Moreover, there is no problem of discontinuous coefficient estimates when estimating the time-dependent coefficients with logsplines. With different volatility estimates, a de-volatilization technique is used to resample the data into different de-volatilized series for trading.

Natasha Lukiantseva, Using Inverse Optimization for Calculating Link Penalties for Traffic Flow, July 2, 2002 (Jeffrey Camm, David Rogers, George Polak [Wright State University])
In this paper the application of the Inverse Optimization Technique to optimally adjust link penalty factors so as to closely match historical multicommodity traffic flow is presented. When various tools are used to simulate railway traffic flow, network link cost factors (impedances) are introduced to reflect preferred routes, which often differ from shortest paths. The Inverse Optimization Technique is illustrated with an example, and its application to a real problem at CSX Transportation is discussed.

Usha Viriyala, Bayesian and Classical ARIMA Analysis of Time Series and Forecasting – A Comparison Study, June 28, 2002 (Martin Levy, Uday Rao, David Rogers)
The field of time series analysis and forecasting is gaining increasing recognition in today's business world. It is becoming increasingly important for firms to learn how they performed in the past in order to plan ahead and, more importantly, to predict the future as accurately as possible. Many techniques and software tools have evolved over the past few years in time series analysis and forecasting. The objective of this research project was to consider the Bayesian method of time series analysis and forecasting and draw a comparison with conventional Box-Jenkins ARIMA modeling. A real data set from a department store in Florida was used for the research. The analysis was performed using both of the above-mentioned modeling techniques, and the results were compared. Recommendations are provided for further possible explorations of the problem at hand. Part of the project involved exploring new software, BATS (Bayesian Applied Time Series), which was used for the Bayesian modeling and which, it is felt, will be valuable for future classroom instruction. The ARIMA modeling was done using SAS.
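
For orientation, a classical ARIMA fit-and-forecast step of the kind compared here looks roughly as follows in R (the project itself used SAS and BATS; the series below is simulated and the model order is chosen only for illustration).

    # Fit an ARIMA model to a simulated monthly series and forecast 12 months ahead.
    set.seed(5)
    sales <- ts(100 + arima.sim(list(ar = 0.6), n = 120, sd = 5), frequency = 12)
    fit <- arima(sales, order = c(1, 0, 0))
    fit                                   # estimated coefficients and AIC
    predict(fit, n.ahead = 12)$pred       # 12-step-ahead point forecasts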

Junying Wu, A Heuristic Approach to a Product Design Problem Under Conjoint and Hybrid-Conjoint Analysis, June 3, 2002 (Jeffrey Camm, Uday Rao, George Polak [Wright State University])
Product design, where the objective is to design a single product that will maximize the market share of the producer by selecting the appropriate levels of the various attributes, is one of the key factors in determining the success of a new product. Increasing attention has been given to this problem when firms decide to introduce new products. As a result, techniques that lead to an optimal product design are of great interest to every firm seeking to survive and succeed in a competitive business environment. However, because the product design problem based on conjoint analysis data is NP-hard, searching for optimal solutions within a reasonable amount of time is impractical when the data are extremely large. Consequently, heuristic techniques that try to identify near-optimal product designs have been proposed. In this paper, a new heuristic algorithm is proposed that can generate “good” (i.e., close to optimal) solutions for the product design problem. The paper focuses on (1) how the algorithm can be applied to the product design problem, (2) evaluating the overall performance of this algorithm in generating solutions to the product design problem, including comparative results between this algorithm and the GA heuristic (Balakrishnan and Jacob [1996]), and (3) limitations and further improvements of the algorithm.

Raymond Mapuranga, The Factors Affecting Contrived Collegiality Among Teachers: An Exploratory Study Through a Path Analysis, May 29, 2002 (Wei Pan, Martin Levy, Jeffrey Camm)
This paper involves an empirical study of factors influencing contrived collegiality among teachers in schools. The research entails the development of a path analytic model that incorporates the core constructs of Hargreaves' (1961) model of combined features that contribute to contrived collegiality. The model links 4 exogenous variables (administrative regulation, compulsory in nature, implementation and time orientation) and 2 endogenous variables (predictable outcomes and collegiality). In order to construct this model, fifty teachers, education students and professionals were surveyed at random. The CALIS procedure in SAS and the AMOS package are the two statistical software programs used in this project. The study found that administrative regulation does not encourage collegiality and that when teachers are required to work together, the outcomes, in terms of collegiality, are not predictable.

Vikas Sharma, Revenue Maximization by Capacity Rationing in an Uncertain Environment, March 5, 2002 (Amitabh Raturi, Jeffrey Camm, David Rogers)
The paper discusses and implements a capacity rationing policy that allows manufacturing firms encountering expected total demand less than available capacity to discriminate between two classes of products, one yielding a higher profit contribution than the other. The result is a selective rationing of orders, yielding an increase in total profit when compared to the base case that implements no capacity rationing. Implementation of the policy requires forecasts of demand parameters. The result indicates that, on average, the rationing policy is quite robust in improving the profits.

Aaron M. Freed, Empirical Test of a Stock Portfolio Optimization Model, February 11, 2002 (Jeffrey Camm, Martin Levy, Brian Hatch)
This empirical study examined a stock-portfolio optimization model that does not use mean-variance statistics like the classic Markowitz model. Instead, the model optimizes portfolio weights by maximizing the number of periods in which the portfolio meets or beats a stock market index. Using randomly sampled data sets generated from the population of Standard & Poor's 500 (S&P 500) stock components, this study empirically tested the model by benchmarking the optimized weighted portfolios against evenly weighted portfolios. Further, the study explored the difference between using 12 monthly periods and 60 monthly periods of stock return data in the weight optimization. Standard statistical hypothesis tests for paired data were used to determine significance at an alpha level of five percent. With respect to the difference in performance between optimized and evenly weighted portfolios on 'future' data, the tests indicated no significance for 12 monthly periods and significance for 60 monthly periods, with a p-value of 0.024.

 

2001

Chris Christopherson and Nicole Howerton, Increasing Profits while Decreasing Scrap, August 21, 2001 (Michael Magazine, Jeffrey Camm, Robert Gould)
Increasing Profits while Decreasing Scrap is a project designed to assist Technicote, Inc. in minimizing trim loss. Technicote purchases large rolls of raw adhesive-backed labels, hereafter referred to as master rolls, and cuts them into smaller labels according to customer specifications. Customers, in turn, graphically enhance these labels and affix them either to finished products or to packaging materials. The inherent problem in this industry is minimizing the trim loss associated with the series of cuts that make up a customer order. Since master rolls may be spliced together, roll length is not a factor in this problem; in a true cutting-stock problem, roll length is a variable as well as roll width. As a result, this problem is a slight variation of the cutting-stock problem. The solution to this minimization problem comes in the form of an Excel spreadsheet. Inputs include the master roll widths and the customer specifications, or ordered widths. Utilizing the Solver Add-In, the spreadsheet returns the widths that are to be cut from each master roll, giving the least amount of trim loss.

Feng Jiao, Reexamination of Schumpeter's Hypothesis: Market Concentration and R&D Expenditure, June 5, 2001 (Norman Miller, Martin Levy, Yan Yu)
Governments are, with good reason, more interested in a competitive environment than in a monopolistic structure. However, technological advancement may require a more concentrated market structure: Schumpeter's hypothesis holds that a monopolistic structure is more conducive to technological development. Previous studies do not conclusively show a significant relationship between market concentration and technological advancement. In this study, I find supporting evidence for Schumpeter's hypothesis. More than that, I show that a more concentrated market structure is more conducive to technological development, as measured by firms' R&D expenditure. Looking further, I show that product technology is more significantly related to market concentration; this conclusion contrasts with the earlier study. Examining industry characteristics, I find evidence that entry barriers are an important determinant of why some industries adopt a more concentrated market structure. Causality tests show that market concentration is significantly related to firms' R&D expenditure.

Xiaoling Sun, A Study of the Use of Multiple Additive Regression Trees for Caravan Insurance Policy Prediction, April 27, 2001 (Yan Yu, Norman Bruvold, David Kelton)
Multiple Additive Regression Trees (MART) is a novel methodology for predictive data mining that can be applied in many areas, such as credit card companies, insurance companies, and mortgage companies. MART builds additive expansions of decision trees and carries out numerical optimization in function space instead of parameter space. A major advantage of MART over other classical methods (logistic regression and Classification and Regression Trees (CART)) is its robustness, accuracy, and immunity to the adverse effects of wide tails and outliers in the distributions of the predictor variables. This work focuses on an application of MART to caravan insurance policy prediction. The problem was initially motivated by direct mailing problems faced by many companies and was raised by a competition aimed at finding out why customers have a caravan insurance policy and how these customers differ from other customers. Companies want a better understanding of their potential customers so that they can target them more accurately and reduce waste and expense. In this thesis, we propose to apply MART, logistic regression, CART, two-step logistic and CART, and two-step logistic and MART to the caravan insurance policy prediction problem and to compare the results.

Sriram Kannan, Finding all Optimal Solutions to Covering Problems, April 2, 2001 (Jeffrey Camm, James Cochran [Louisiana Tech University], Dennis Sweeney)
Set covering and maximal covering problems are widely used by managers to model decision-making problems. These two problems are closely related and are both modeled as binary integer programs. Some common applications are in reserve site selection, location of facilities, and the list selection problem in direct mail advertising. Managers are often interested in obtaining multiple optimal solutions to these problems when they exist. The advantage of having multiple optimal solutions is that managers have flexibility in choosing an optimal solution based on factors not considered in the model. Such factors, when built into these models, can make them difficult to solve to optimality, and in many cases it may not be possible to build a model with all the important factors. Existing methods rely on the cut generation approach to obtain all the alternate optimal solutions to a given problem instance. These methods are not efficient and frequently fail to generate all the alternate optimal solutions when numerous optimal solutions exist. We propose an algorithm that works in two phases. In Phase I, the principle of divide and conquer is employed to reduce the size of the problem. In Phase II, a backtracking algorithm strengthened by Lagrangian and logic-based bounds is used to generate all the alternate optimal solutions. We apply this algorithm to the generalized set cover and maximal cover data sets available from facilities location problems.
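
To make the baseline cut-generation idea concrete, the sketch below (in R with the lpSolve package, on a small simulated instance that is not the paper's data) repeatedly re-solves a unit-cost set covering problem after adding an exclusion cut for each optimal cover already found.

    # Enumerate alternate optimal set covers by cut generation (illustration only).
    library(lpSolve)
    set.seed(6)
    m <- 12; n <- 15                               # 12 elements, 15 candidate sets
    A <- matrix(rbinom(m * n, 1, 0.3), m, n)       # A[i, j] = 1 if set j covers element i
    A[cbind(1:m, sample(1:n, m, replace = TRUE))] <- 1   # ensure every element is coverable
    cost <- rep(1, n)
    solve_cover <- function(extra_mat = NULL, extra_dir = NULL, extra_rhs = NULL) {
      lp("min", cost, rbind(A, extra_mat),
         c(rep(">=", m), extra_dir), c(rep(1, m), extra_rhs), all.bin = TRUE)
    }
    best  <- solve_cover()
    zstar <- best$objval
    sols  <- list(round(best$solution))
    repeat {
      cuts <- do.call(rbind, sols)                 # one exclusion cut per known optimum
      res  <- solve_cover(extra_mat = rbind(cuts, cost),
                          extra_dir = rep("<=", nrow(cuts) + 1),
                          extra_rhs = c(rowSums(cuts) - 1, zstar))
      if (res$status != 0 || res$objval > zstar + 1e-6) break
      sols[[length(sols) + 1]] <- round(res$solution)
    }
    length(sols)                                   # number of distinct optimal covers found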

Timothy J. Shockley, A Simulation Analysis of Supply Chain Fill-Rate Models, March 15, 2001 (David Rogers, David Kelton, Michael Magazine)
In today's business environment, a large emphasis is being placed on making the supply chain more efficient. As E-Commerce becomes more prevalent, many companies understand that remaining competitive depends on how quickly they can get their product to the end user. Supply chain efficiency involves not only the delivery of product to the consumer, but also how much inventory to hold and where along the supply chain to hold it. Many mathematical programs may be employed to determine the amount and location of inventory to hold based on parameters such as the demand distribution at the retailer level and the various holding and penalty costs along the supply chain. However, many real-world situations do not satisfy the assumptions made by the mathematical model. Therefore, it is necessary to test the optimal solutions of mathematical programs prior to implementing them in an organization. Simulating a real-world scenario allows an objective view of the effects of a variety of inputs. Simulation generally does not seek an optimal solution, but it allows the optimal solution from a perhaps oversimplified mathematical programming model to be tested in a more realistic environment. It is important to build the simulation to represent the real world as closely as possible in order to establish the validity of the model, and when a simulation recreates a mathematical program, the simulation should make the same assumptions as the mathematical program. A simulation of inventories in a one-warehouse, n-retailer (non-identical) case is performed and the results are compared to those from a common nonlinear mathematical programming model.
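
A heavily stripped-down version of such a simulation (a single retailer following a base-stock policy under normal demand, with replenishment placed at the end of each period and received at the start of the next; all parameter values are arbitrary) can be written in a few lines of R to estimate the fill rate.

    # Estimate the fill rate of a base-stock policy by simulation.
    set.seed(7)
    n_periods <- 10000
    mu <- 100; sigma <- 20          # demand mean and standard deviation per period
    S  <- 110                       # order-up-to (base-stock) level
    inv <- S                        # on-hand inventory
    on_order <- 0                   # order placed last period, arrives this period
    filled <- 0; demanded <- 0
    for (t in 1:n_periods) {
      inv <- inv + on_order                         # receive last period's order
      d <- max(0, round(rnorm(1, mu, sigma)))       # this period's demand
      filled <- filled + min(d, inv)                # unmet demand is lost
      demanded <- demanded + d
      inv <- max(0, inv - d)
      on_order <- S - inv                           # order up to S
    }
    filled / demanded                               # estimated fill rate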

 

2000 and Prior

Svetlana Nikolaeva, Trading Rules and Stock Returns: A Simulation Analysis, July 21, 2000 (David Kelton, David Rogers, Gary Raines)
This paper tests three popular trading rules used in the technical analysis of securities: moving averages, the Relative Strength Index, and Lane's Stochastics. The trading indicators are applied to simulated stock-price time series generated for six different market environments. Standard statistical analysis was used to test stock returns following buy and sell signals. Overall, the results provide support for all of the studied trading strategies: the returns following buy signals are higher than the returns following sell signals. Moreover, the absolute difference between the sell and buy returns is higher for more volatile markets. The method developed in the paper can be used for preliminary testing of any stock-trading rule in any specific market environment.
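
The general recipe (simulate prices, generate signals from a rule, compare returns after buy versus sell signals) can be illustrated with a moving-average crossover rule in R; the price path, window lengths, and test below are placeholders rather than the paper's settings.

    # Moving-average crossover signals on a simulated price path.
    set.seed(8)
    price <- 100 * exp(cumsum(rnorm(1000, 0, 0.01)))       # simulated price series
    ma <- function(x, k) stats::filter(x, rep(1 / k, k), sides = 1)
    short <- ma(price, 10); long <- ma(price, 50)
    signal <- ifelse(short > long, 1, -1)                  # 1 = buy regime, -1 = sell regime
    ret <- c(0, diff(log(price)))                          # daily log returns
    buy_idx  <- which(signal == 1);  buy_idx  <- buy_idx[buy_idx < length(ret)]
    sell_idx <- which(signal == -1); sell_idx <- sell_idx[sell_idx < length(ret)]
    t.test(ret[buy_idx + 1], ret[sell_idx + 1])            # compare next-day returns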

Qiang Lin, A Survey of Power Analysis in Design of Experiments, July 21, 2000 (Martin Levy, David Rogers, Jeffrey Camm)
This is a technical report summarized from the book 'Statistical Power Analysis for the Behavioral Sciences' by Jacob Cohen.  The power of a statistical test is the probability of rejecting the null hypothesis when the null hypothesis is false.  We want the power to be high so that when we fail to reject the null hypothesis based on the sample results, the probability that we have mistakenly accepted a false null hypothesis is low.  For some statistical tests, power analysis and sample-size analysis can be very complicated.  This report summarizes methods for computing power values and the sample sizes needed to attain desired power for different tests.  For each test, the definitions of the important parameters and the computational methods are followed by illustrative examples.  SAS IML programs are provided for each example.  The power tables are not reproduced in the report because using the SAS programs to compute power is much easier than looking values up in a table.  This report can be used as a handbook to obtain power and sample-size values for different statistical tests.
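
In base R (not the SAS IML programs the report provides), the analogous calculations for a two-sample t test are a one-liner each; the effect size and error rates below are arbitrary examples.

    # Power for a given n, and required n for a target power (two-sample t test).
    power.t.test(n = 30, delta = 0.5, sd = 1, sig.level = 0.05)          # power when n = 30 per group
    power.t.test(power = 0.80, delta = 0.5, sd = 1, sig.level = 0.05)    # n per group for 80% power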

Boris A. Orlov, An Analysis of Impact of Price Protection on Supply Chain Profits in Short-Cycle Industries, June 6, 2000 (Nikhil Jain, Michael Magazine, Sean Willems)
In short life-cycle industries, such as the personal computer industry, the price and costs of a product decline rapidly over the product life cycle. Declining prices increase the cost of holding a unit of inventory, and without price protection distributors would hold less inventory, increasing unsatisfied demand. Under price protection, the manufacturer compensates the distributor for the difference in price if the price declines within a specified period of time, or for a proportion of units left unsold. A two-period inventory model is developed in order to measure the influence of price protection on channel profitability in short life-cycle industries. The level of price protection that maximizes the individual profits of the manufacturer and the distributor differs from the optimal level that maximizes the total profit of the supply chain. Examples are given to illustrate the impact of a change in profit margins on the optimal level of price protection, and some implications for supply chain management are discussed briefly.

Reeja Marath, Integration of E-Business into the Supply Chain, June 6, 2000 (Ram Ganeshan, Michael Magazine, Nikhil Jain)
The idea of doing business electronically has been around for some time. Today many companies are moving away from phone and fax to the Internet. Companies have started using the Web to communicate and to achieve real business value by incorporating Internet technology into their core businesses. E-Business is about better customer service, integrating with suppliers and partners, and being able to expand the physical market through electronic means. It is about streamlining current business processes, which in turn adds to the value provided to customers. Organizations that succeed in grasping and adopting the new elements of Web-based E-Business have an edge over their competitors. This study focuses on the integration of E-Business into the supply chain. Case studies are presented of various companies that market varied products, offer distinct services, and have implemented Web-based E-Business. In this project, we conducted a case study of Company X, a distributor of specialty goods located in Cincinnati. We looked into implementing Web-based E-Commerce in the firm by analyzing the existing system and providing a proposal for implementation of a new system. The objectives are to ensure that front-end order entry and the back-end functions (inventory management, supplier and customer relationship management, and forecasting) are coordinated effectively through efficient integration of information systems, and to transact business directly between the customer and the company with minimum response time and negligible overhead costs. To achieve these objectives, a detailed study was conducted on the existing bottlenecks and a proposal to overcome these pitfalls is presented. A cost comparison between the existing and the proposed systems, based on the required software and hardware resources, is provided. The methodology for implementing the proposed system and the areas where benefits would be achieved are also described.

Eric W. Kramer, A Heuristic Method for the Honors Plus Program Interview Scheduling at the University of Cincinnati, June 2, 2000 (Michael Magazine, Norman Baker, Jeffrey Camm)
Timetabling is an area in which it has often been difficult to generate solutions. Many universities have difficulty scheduling courses, exams, and other activities that require assigning various entities to specified time periods. The Interview Scheduling System is a problem that requires the scheduling of a series of interviews between employers and Honors Plus students at the University of Cincinnati's College of Business Administration. The students are freshmen who will have completed their first year of study in June and are filling positions as interns with companies in the greater Cincinnati area. In January of each school year, scheduling of companies to interview students begins. After the companies have been assigned interview times, the students are assigned interview times with the companies. Finally, after the interviews are conducted in February, students are assigned as interns with the various companies. A heuristic method based on a set of integer programs is developed for solving the interview-scheduling problem. The problem is formulated in terms of reducing conflicts between interview times and student course schedules. The heuristic yields a solution to an otherwise difficult problem.

Gautam Dalvi, Finding All Optimal Solutions for the Generalized Set Covering Problem, May 19, 2000 (Jeffrey Camm, David Kelton, James Cochran [Louisiana Tech University])
The generalized set covering problem (GSP) arises in the development of an optimal network of land sites for the conservation of natural and biological resources. Since developing a conservation network may involve purchasing or leasing sites from existing owners, an optimal solution obtained by solving the GSP may not always be feasible to implement within budget constraints. Consequently, during negotiations with site owners, decision-makers must be aware of alternative ways of developing the network and of the relative importance of the sites in ensuring an optimal network, represented by their irreplaceability indices (IRI). Since the IRI of a site is the percentage of all optimal solutions in which that site is present, we must determine all optimal solutions to the GSP to compute the IRI for any site. In this project, we study the percent reservation problem, formulated as a GSP, for the New South Wales National Parks and Wildlife Service of Australia. We first explore computational issues involved in determining all optimal solutions for the percent reservation problem. We then present a problem reduction technique and two algorithms, a restrictive enumeration algorithm and a replacement site algorithm, for estimating IRI. Problem reduction decreases the solution space and identifies sites that are absolutely essential for the optimal network. The restrictive enumeration algorithm allows new optimal solutions to be generated in a controlled way. The replacement site algorithm algebraically generates large numbers of optimal solutions in a very short time. We present computational results using the above three algorithms for data sets provided by the park service and evaluate the efficacy of the algorithms in determining all optimal solutions.

Linda A. Hirsch, Telephone vs. Internet Interviewing - A Comparison of Scale Usage, May 19, 2000 (Norman Bruvold, David Rogers, Martin Levy)
For many years, telephone interviewing has been the cornerstone for data collection on countless marketing research studies. Now, however, with the influx of telephone management options (voice mail, answering machines, Caller ID, etc.) and rising refusal rates among those who can be reached, the research community must explore alternative means for collecting quality data. The Internet has the potential to be at the forefront of the next generation of data collection for the marketing research industry. As such, it is important to assess the quality of the results obtained from this medium. This research, which was conducted among individuals with access to the Internet, examines the similarities and differences between data collected via the Internet and data collected via telephone interviewing. Specifically, it explores participation rates, scale usage, and the impact of offering respondents a 'Don't Know' response on Internet surveys. This study also compares Internet and telephone interviews from the respondent's perspective by examining the extent to which they enjoyed the interview experience and their likelihood to participate in similar studies in the future.

Dapeng Cui, Archetypal Analysis and Its Applications in Business Research, February 7, 2000 (James Cochran [Louisiana Tech University], Jeffrey Camm, Martin Levy)
There are multiple statistical methods for analyzing multivariate data. This paper discusses and illustrates a recently developed multivariate technique, archetypal analysis, and explores its applications to business problems. Archetypal analysis, developed by Cutler and Breiman (1994), stems from the need to find archetypal patterns, mixtures of which can well represent each observation in a data set; it also requires that the archetypal patterns themselves be mixtures of the observations in the same data set. Archetypes are constructed by minimizing the squared error that results from representing each individual as a mixture of archetypes. Two applications to survey data in this paper show that archetypal analysis is valuable because it aids in identifying archetypal patterns in the data and in analyzing and understanding the heterogeneity of consumers in a market. Another application of archetypal analysis, to conjoint data, indicates that archetypal analysis is not always helpful, probably due to the vast heterogeneity of consumer behavior. Limitations of archetypal analysis are analyzed and discussed.

Zaizai Lu, Infant Feeding Behavior and its Impact on Child's Health in China, January 27, 2000 (Martin Levy, Marcia Bellas, Jeffrey Camm)
This study examined the factors affecting children's feeding behavior in China, and the impact of feeding behavior on children's health and growth, using the 1993 China Health and Nutrition Survey data. I selected 330 children aged 3 or younger for the final sample and used the 222 children with feeding information in the final analysis. The data show that living area, household income, and the mother's age, educational level, occupation, and smoking or drinking habits do not have any significant effect on a child's feeding behavior, while the father's smoking habit and occupation do have a significant effect. A child's gender also significantly affects feeding behavior in the expected direction. The data do not support the argument that breastfed children have lower body weight and height, nor do they support the argument that breastfed children are healthier than non-breastfed children. The data show that a child's feeding behavior or duration of breastfeeding does not have any significant effect on his or her health status or growth indices. I recommend that a more representative sample with more complete and clean data be used in future studies.

Kenneth W. Schmahl, Application of an Unconstrained Multi-Product Newsboy Model for a Style Goods Business, September 8, 1999 (Amitabh Raturi, David Rogers, Michael Magazine)
Inventory analysis is critical to the profit and loss of many businesses; this is especially true in the style-goods retail market. The fickle nature of fashion and fads makes it important to estimate inventory requirements accurately in order to be successful in this industry. Because of the seasonality of fashion goods, it is critical to find a balance between overestimating and underestimating the demand for each season's inventory. A common technique for such an analysis is the newsboy problem. This paper examines the inventory requirements of a maternity wear rental business, Classic Maternity Sales and Leasing. An applied analysis of the unconstrained multi-product newsboy inventory model is used to examine the inventory needs of Classic Maternity. I discuss how the newsboy model has been modified to meet the criteria necessary for the inventory analysis of such a business. I also provide some sensitivity analysis to show the effects of overestimating or underestimating the demand for the maternity wear and its salvage value. Included in the paper is a brief comparative study of other literature relating to the newsboy problem and the extensions that have been made to it.
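
For a single item with normally distributed demand, the textbook newsvendor calculation behind such models reduces to the critical-fractile rule; the numbers in the R sketch below are hypothetical and are not Classic Maternity's figures.

    # Single-item newsvendor order quantity at the critical fractile.
    p <- 60; c <- 25; s <- 5                 # selling price, unit cost, salvage value
    cu <- p - c                              # underage cost (margin lost per unit short)
    co <- c - s                              # overage cost (cost not recovered per unit left over)
    mu <- 500; sigma <- 120                  # demand mean and standard deviation
    qnorm(cu / (cu + co), mean = mu, sd = sigma)   # optimal stocking quantity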

Molina Beck, Approaches to Handling Missing Data, August 31, 1999 (Martin Levy, Norman Bruvold, Jeffrey Camm)
Missing data occur in statistical analysis in most practical situations. They present a problem because the units with missing data represent an absence of information, so that overall there is a loss of information. For example, model selection and estimation for time series are based on the assumption that the time series is complete; in practice, this is not usually the case. Incomplete series should not be fitted with models, as this can lead to a serious lack of fit, especially when the number of missing observations is large. For the same reason, it is also not advisable simply to omit the missing observations from the series. Further, most common software packages used for estimation, such as SAS, SPSS, or RATS, will produce errors when analyzing data with missing observations, since their procedures expect input data sets to contain observations for a contiguous time sequence. This poses the question of how to estimate a model for such data and how to estimate the missing observations if these values are of interest in themselves. Historically, missing data have been estimated in an ad hoc manner. The traditional approaches consist of discarding the observations with missing values, imputing the missing values by replacing them with the means of the available observations, or regressing the missing values on the observed values for a case and replacing them with the predicted values thus obtained. In recent years, researchers have advocated the use of model-based procedures: a model is defined for the missing data, and inferences are based on the likelihood under that model, with parameter estimation carried out by procedures such as maximum likelihood. This approach has the advantages of flexibility and the avoidance of ad hoc methods, since the model assumptions are known and can be evaluated. In order to maximize the likelihood function for these models, several iterative algorithms, such as the Newton-Raphson algorithm, the EM algorithm, and the Kalman filter, are discussed and evaluated for both univariate and multivariate data. The application of the EM algorithm in estimating means and covariance matrices, in multiple regression, and in time-series data is also discussed. This project compares the various methods of estimating missing data for the purpose of statistical analysis. The first part of the project is a discussion and comparison of the different ways of estimating missing data, and the latter part consists of the practical application of one or more of these methods to the available data.
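
Two of the traditional approaches mentioned (mean imputation and regression imputation) can be contrasted in a few lines of R on simulated data; this toy example is not taken from the project and mainly shows how mean imputation shrinks the variability of the imputed variable.

    # Mean imputation versus regression imputation on simulated data.
    set.seed(9)
    n <- 200
    x <- rnorm(n)
    y <- 2 + 1.5 * x + rnorm(n)
    y_mis <- y
    y_mis[sample(n, 40)] <- NA                         # make 20% of y missing at random
    y_mean <- ifelse(is.na(y_mis), mean(y_mis, na.rm = TRUE), y_mis)        # mean imputation
    fit    <- lm(y_mis ~ x)                            # fitted on complete cases only
    y_reg  <- ifelse(is.na(y_mis), predict(fit, data.frame(x = x)), y_mis)  # regression imputation
    c(mean_imp = sd(y_mean), reg_imp = sd(y_reg), complete = sd(y))         # compare variability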

Kemal H. Sahin, Development of Scheduling and Waste Minimization Techniques for Batch Processing Plants, August 27, 1999 (Amitabh Raturi, Jeffrey Camm, Amy Ciric)
Batch processing is the preferred option for industries that produce a wide range of products in small amounts. The scheduling of available processing equipment to satisfy demand for all products has been investigated in detail in operations research, and many of these methods have concentrated on optimizing economic performance. However, waste recovery, which can contribute very large costs, has not been analyzed in detail from a combined economic and scheduling perspective. The aim of this project is to develop a method that includes waste recovery considerations in the scheduling of batch processes. Two different approaches are used to analyze the effect of waste treatment costs: an aggregated approach simultaneously determines the optimal schedule for both processing and waste treatment, while disaggregated methods develop waste recovery schedules after the optimization of the production section is completed. Both simultaneous and continuous modes of operation are considered for comparison. In the simultaneous mode, waste must be treated every time a product is generated, while the continuous mode examines the common practice of continuous waste treatment. The models are developed for a single-product/single-waste process as well as a multi-product/multi-waste operation, and case studies are used to determine the efficiency of both methods. The aggregate approach yields cost savings in the range of 6% over the disaggregated approaches but takes seven times longer even for small problems; for larger problems, aggregate approaches may be too complex and time-consuming for realistic implementation.

Meghna Sinha, An Evaluation of Combined Ranking, Selection and Multiple Comparison Procedures in an Industrial Application, August 25, 1999 (David Kelton, Jeffrey Camm, James Cochran [Louisiana Tech University])
In the simulation literature, ranking and selection procedures have often been recommended for comparing system designs, particularly when the goal is to select the best design.  However, in empirical research multiple comparison procedures are commonly employed.  For example, the researcher interested in making pair-wise comparisons among the groups can do so by constructing a confidence interval for the difference between the performance measures of the pair of results.  The difference between ranking and selection procedures and multiple comparison procedures is analogous to the difference between hypothesis testing and interval estimation.  The former results in a decision, rather than an estimate, so it is less informative.  Typically, ranking and selection procedures provide inference only about the design selected as the best or one of the best in some sense.  Two-stage sampling or sequential sampling is needed to attain a pre-specified probability of selecting the best design.  In contrast, multiple comparison procedures provide inference about relationships among all system designs and can be implemented in a single stage of sampling, but they do not guarantee a decision.  However, when using simulation experiments to estimate the expected performance, the best system can neither be selected nor the differences between the systems be bounded with certainty.  In 1995, Nelson and Matejcik presented procedures that simultaneously control the error in selecting the best and in bounding the differences.  These procedures combine the standard indifference-zone selection procedures, that control the error when choosing the best, and the standard multiple-comparison procedures that control the error in making simultaneous comparisons.  The procedures assume that data are normally distributed, but they do not assume known or equal variances across systems.  In this paper we apply the simulation ranking technique and the multiple comparison procedure simultaneously, as proposed by Nelson and Matejcik, to compare three product-mix scenarios in a manufacturing plant.  The objective is to determine the optimal mix, where an optimal mix is one that allows all machines to remain idle for a minimum amount of time.  The results will also determine how much better the best mix is relative to each alternative.  We also compare the Bonferroni selection procedure to Nelson and Matejcik's new procedure, NM.  Both procedures exploit the use of Common Random Numbers (CRN) to reduce variance and hence reduce computation efforts.

Xinxin Liu, A Comparative Study of Neural Networks and Statistical Models for Customer Choice Modeling, June 18, 1999 (David Rogers, David Kelton, Norman Bruvold)
This paper is an empirical study intended to be a bridge between the behavioral and statistical lines of research in customer choice behavior.  The relationship between retail store characteristics and customer buying behavior from a choice set of two stores is explored using the following approaches: the conditional logit model and the neural network (NN) model.  Using a data set of 400 survey responses, a NN was created using store characteristic variables and its accuracy checked with a holdout sample.  The same was done for the conditional logit model.  The comparison of results revealed that the NN outperformed the conditional logit model in terms of predictive accuracy.  Sensitivity analysis was conducted for the NN model and managerial implications were outlined.

Brian L. Sersion, An Application of Optimization for Establishing a Landfill Sampling Network, June 4, 1999 (Jeffrey Camm, Amitabh Raturi, David Rogers)
Waste-management operations require significant capital expenditure for ground-water sampling of sanitary landfills. High costs associated with outsourcing make internalization of this service an attractive proposition. The facilities-location problem, in this context, involves determining the optimal number and location of sampling teams to service landfill customers. The solution process includes the completion of a customer survey and linear regression to estimate demand for a two-stage mixed integer linear program. The results of this study support a managerial recommendation for Browning-Ferris Industries' landfill-sampling network.

Sanjay Chadha, Analysis of Salaries of College of Business Administration Professors at the University of Cincinnati, March 25, 1999 (Martin Levy, Norman Bruvold, David Rogers)
In this project a regression model is formulated that explains 80% of the variation in salaries in the College of Business Administration, with some exceptions. Three research hypotheses are tested using regression: 1) newly hired professors in the College of Business Administration are being offered higher salaries than professors who have been serving for the last 5-15 years; 2) professors' salaries differ by national origin; and 3) professors' salaries differ by gender. The results are that the first hypothesis is accepted, while the other two hypotheses are rejected.

Lubov Skurina, Exchange Rates and the Value of Foreign Operations, March 18, 1999 (Yong Kim, David Rogers, Martin Levy)
In this study I examine the effect of exchange rates on the value of foreign operations.  I perform a pooled cross-sectional and time-series regression analysis of company data and include exchange-rate trend and volatility as independent variables.  The results indicate that the exchange-rate trend does not have a significant effect on the value of foreign operations, but the volatility of exchange rates has a significant negative effect.

Chay Hoon Lee,  The Relationship of Team Members' Cognitive Decision Styles and Team Performance, January 12, 1999 (Charles Matthews, David Rogers, Martin Levy)
In most organizations, teams play a central role in planning and strategic decision making (Gilad & Gilad, 1986). Although many studies have examined the influence of demographic characteristics on team performance, few have examined the cognitive decision-making styles of team members, which can also influence team performance. Hackman and Morris (1975) proposed that the extent to which a team uses the knowledge and skills of its members can influence the quality of the team's performance. Understanding the team members' cognitive decision-making styles that influence a team's effectiveness therefore seems critical, because teams can shape an organization's future through the decisions they make. The challenge for any organization is to maximize the level of effort and knowledge that teams bring to bear on their performance. Thus, this paper explores the influence of the cognitive decision-making styles of team members on team performance.

Jeffrey D. Rieder, Estimating Store-Level Promotion Effects from Market-Level Data, December 10, 1998 (Norman Bruvold, Martin Levy, David Rogers)
The debiasing procedure outlined in "Using Market-Level Data to Understand Promotion Effects in a Nonlinear Model" (Christen, et al., Journal of Marketing Research, August, 1997) attempts to quantify both the direction and magnitude of the bias associated with market-level promotion effects. Since merchandising response functions are typically non-linear, and market-level data are aggregated linearly over a set of heterogeneous stores, market-level estimates of these response functions are often severely biased. Christen, et al. claim to be able to estimate the bias and provide a mechanism for reducing the bias through the application of regression analysis. This research applies the methodology outlined by Christen, et al. to a real world data set and, after some modifications and assumptions are incorporated to fit the methodology to the available data parameters, produces some encouraging results. Using regression analysis, the market-level bias is found to be a function of the marketing environment. The resulting regression model is then used to predict future merchandising responses.

Girish Kulkarni, Determination of the Optimal Routing for the Consumer Products Division of the University of Cincinnati, October 16, 1998 (Ram Ganeshan, Jeffrey Camm, George Polak [Wright State University])
The Master Plan for the University of Cincinnati envisages a pedestrian-friendly campus with open spaces, intended to positively affect the quality of student life on campus by creating an environment conducive to the educational experience. One of the major issues is reducing the conflict zones between pedestrians and service-vehicle traffic. The Consumer Products Division supplies soft-drink cans to vending machines in nearly 40 buildings on the West Campus, servicing these machines via three trucks on a predetermined schedule. This division therefore needed to re-investigate and realign its servicing and routing scheme to fit the new Master Plan. Using quantitative techniques, we helped the Consumer Products Division by: (1) performing an efficiency analysis of the available vending-machine demand data and making recommendations for a servicing schedule; and (2) using optimization techniques to recommend a servicing route that works with the above schedule, resulting in shorter travel times for the vehicles.

Shailesh Kulkarni, An Optimal Clustering Model for Cellular Manufacturing, August 31, 1998 (David Rogers, Jeffrey Camm, James Cochran [Louisiana Tech University])
In this paper the problem of simultaneously clustering parts into part families and machines into machine cells in a cellular manufacturing context is addressed. A mixed integer linear programming model is developed for the problem. This model is solved using conventional branch-and-bound procedures for small-sized problems. Considering the NP-complete nature of this class of problems, a genetic algorithm-based solution procedure is developed to solve realistically sized problems of larger dimensions. Two problems from the literature are solved using the genetic algorithm. The attractiveness of the proposed model and the solution procedure for providing simultaneous grouping of parts and machines is evaluated on the basis of grouping efficacy.

Amanda R. Angle, Fill-Rate Optimization Models for Supply Chain Systems, June 24, 1998 (David Rogers, Michael Magazine, Ram Ganeshan)
Multi-echelon inventory management is very important when attempting to influence the performance of a supply chain. Formulating a complete inventory model often requires more than simply minimizing inventory levels to reduce holding costs: customer satisfaction must be taken into consideration, or the cost of lost sales could outweigh any inventory savings. In this paper, four models of multi-echelon inventory systems in which several finished goods are produced from a common component are considered. These models optimize base stock levels when there is a penalty cost for a backorder, with fill rate used to measure the customer service level. The first model maximizes the fill rate subject to a budget constraint on holding costs. The second minimizes the expected number of backorders subject to a budget constraint and a fill-rate constraint. The third minimizes the penalty costs of backorders and inventory holding costs subject to a fill-rate constraint. In the last model, the penalty costs of backorders and inventory holding costs are minimized subject to both a budget constraint on holding costs and a fill-rate constraint. In the results section, demand is assumed to be normally distributed in all the models, and a non-linear optimization model is used to determine base stock levels.

Joga R. Palutla, Minimizing Maximum Lateness In a Family Single Machine Scheduling Problem, June 11, 1998 (Michael Magazine, James Cochran [Louisiana Tech University], Amitabh Raturi)
This paper studies the problem of scheduling jobs on a single machine in order to minimize the maximum lateness.  The jobs are grouped according to processing requirements in families.  The problem is NP-hard and computationally intensive.  Heuristics are the only feasible means of solving large problems.  The paper describes several existing heuristics and analyzes heuristic performance relative to one another and the optimal.  Lower bounds are developed in place of the optimal solution in this analysis.  The paper attempts to determine the best heuristic for a given set of problem parameters and its closeness to the optimal solution.

Nawal K. Roy, Risk Management: Exploration of Value at Risk, May 29, 1998 (David Rogers, Martin Levy, Ram Ganeshan)
Risk management is one of the fastest growing fields in the investment and financial industry. This paper covers Value at Risk (VaR) modeling, one of the most widely used risk-management methodologies. As an overview paper, it deals with the main issues related to Value at Risk modeling: the different methodologies for estimating the VaR parameters, their strengths and shortfalls, and the regulatory status. It also discusses the statistical model underlying J.P. Morgan's RiskMetrics and expected future developments (the course of future research) in the field of Value at Risk modeling.
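
As a minimal orientation to the variance-covariance (parametric) approach that RiskMetrics popularized, the R sketch below computes a one-day VaR from simulated returns; the portfolio value, confidence level, and return series are arbitrary placeholders.

    # Parametric (variance-covariance) Value at Risk from a return series.
    set.seed(10)
    ret   <- rnorm(1000, mean = 0.0003, sd = 0.012)   # placeholder daily returns
    value <- 1e6                                      # portfolio value
    alpha <- 0.99                                     # confidence level
    var_1d <- -(mean(ret) + qnorm(1 - alpha) * sd(ret)) * value
    var_1d                                            # 1-day 99% VaR (a positive loss figure)
    var_1d * sqrt(10)                                 # approximate 10-day VaR (square-root-of-time rule)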

Ronald N. Gnau, A Comparison of Logistic Regression and Discriminant Analysis as Classification Techniques, May 26, 1998 (Martin Levy, David Rogers, Norman Bruvold)
Strategic marketing in modern business organizations involves three key elements: segmenting, targeting, and positioning. The development of a sound marketing strategy in today's competitive environment is barely possible without the use of multivariate statistical analysis. Two multivariate techniques that can be useful in assigning customers to the most appropriate market segment are logistic regression analysis and discriminant analysis. Each technique makes assumptions about the types of data used for its variables: the independent variables in logistic regression models can be categorical, whereas discriminant models generally require that the data for independent variables come from normal populations with identical covariance matrices. This empirical study applies both techniques to the same data to classify customers into market segments, and compares the performance of the two techniques on the basis of classification accuracy.
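
A minimal sketch of such a comparison, using synthetic data in place of the study's customer data, fits both classifiers with scikit-learn and compares holdout classification accuracy.

    from sklearn.datasets import make_classification
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    # Synthetic stand-in for customer data: two segments, a handful of predictors.
    X, y = make_classification(n_samples=1000, n_features=8, n_informative=4,
                               random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                        random_state=0)

    for name, model in [("logistic regression", LogisticRegression(max_iter=1000)),
                        ("discriminant analysis", LinearDiscriminantAnalysis())]:
        model.fit(X_train, y_train)
        acc = accuracy_score(y_test, model.predict(X_test))
        print(f"{name}: holdout classification accuracy = {acc:.3f}")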

Glenn A. Dahl, Core Carrier Selection: A Comparison of Solution Approaches, May 8, 1998 (Jeffrey Camm, Martin Levy, David Rogers)
In this work the problem of choosing preferred transportation companies for shipping, called core carriers, is examined. Optimal selection is treated as an extension of the Maximal Set Covering Problem. Three versions are examined. In the first model the desired core coverage is expressed as a percentage of total coverage, and all decision variables are binary. The second model relaxes the binary restriction on the variables that represent lane assignments; a simple rounding heuristic is used to convert fractional solutions to integer ones. The third version is a goal-programming weighting method: total core load is treated as a goal, allowing the coverage constraint to be removed from the model.
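
A sketch in the spirit of the first model is given below: with assumed lane loads and carrier-lane eligibilities, it selects the fewest core carriers whose assigned lanes cover at least a target percentage of total load, with all variables binary. The PuLP formulation and the data are illustrative only, not the exact model from the paper.

    import pulp

    # Illustrative data (assumed): lanes with annual loads, carriers, and which
    # carrier can serve which lane.
    loads  = {"L1": 120, "L2": 80, "L3": 60, "L4": 150}
    serves = {"C1": ["L1", "L2"], "C2": ["L2", "L3", "L4"], "C3": ["L1", "L4"]}
    coverage_target = 0.90 * sum(loads.values())     # desired core coverage

    prob = pulp.LpProblem("core_carrier_selection", pulp.LpMinimize)
    y = {c: pulp.LpVariable(f"use_{c}", cat="Binary") for c in serves}
    x = {(c, l): pulp.LpVariable(f"assign_{c}_{l}", cat="Binary")
         for c in serves for l in serves[c]}

    prob += pulp.lpSum(y.values())                                   # fewest core carriers
    prob += pulp.lpSum(loads[l] * x[c, l] for (c, l) in x) >= coverage_target
    for l in loads:                                                  # each lane assigned at most once
        prob += pulp.lpSum(x[c, l] for c in serves if (c, l) in x) <= 1
    for (c, l) in x:                                                 # assign only to selected carriers
        prob += x[c, l] <= y[c]

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    print("core carriers:", [c for c in serves if y[c].value() == 1])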

Detelina Marinova, Between Strategic Intent and Inertia: Tracing Individual Knowledge Structure Evolution in Organizations, April 15, 1998 (Martin Levy, Murali Chandrashekaran, David Rogers)
Though organizations often employ multifunctional teams in strategic decision making to ensure maximal information dissemination in the organization, the alleged benefits of teams are seldom realized.  The central objective of this paper is to explore the process underlying individual learning in group settings, and to secure an understanding of why groups often do not produce extensive collaborative efforts.  Accordingly, we develop a conceptual model that traces individual behavior as well as knowledge structure evolution in group settings.  Our central thesis is that despite the strategic intent of each decision maker to make a 'good' decision and choose the 'best' course of action from a set of alternatives, communication with group members is likely to be shaped by the balance of intent and inertia.  As a result, communication flow in groups and, hence, individual learning is likely to proceed in a selective fashion.  We further identify possible drivers of inertia and propose hypotheses about their effect on individual knowledge structure evolution as well as on communication and influence in groups.  Econometric analyses of data obtained from a longitudinal field experiment converge to strongly support our conceptual model.

Jun Zhou, Low Birth Weight Prediction Models for the State of Ohio, April 3, 1998 (David Rogers, Martin Levy, Edward Donovan)
Low Birth Weight (LBW) prediction models were built based on Ohio birth certificate data from 1993 and 1994. Maternal age, education level, smoking, alcohol consumption, pre-pregnancy weight, race, fetus gender, marital status, and pre-term medical complications were found significant in the logistic model. The study also showed that some interaction terms between the main factors made significant contributions to LBW. Two logistic models were built and were validated with the 1995 birth certificate data. The models provide a quantitative tool to direct limited resources to the population at high risk of LBW in order to achieve more cost-efficient prevention.
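
A hedged sketch of this kind of model follows: a logistic regression with main effects and an interaction term, fitted with statsmodels on synthetic records. The variable names (mother_age, smoker, preterm) and the simulated data are illustrative stand-ins, not the actual Ohio birth-certificate fields.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Synthetic stand-in for birth-certificate records (assumed structure).
    rng = np.random.default_rng(0)
    n = 5000
    df = pd.DataFrame({
        "mother_age": rng.integers(15, 45, n),
        "smoker":     rng.integers(0, 2, n),
        "preterm":    rng.integers(0, 2, n),
    })
    logit_p = (-3 + 0.03 * (df.mother_age - 28) + 0.8 * df.smoker
               + 1.5 * df.preterm + 0.5 * df.smoker * df.preterm)
    df["lbw"] = (rng.random(n) < 1 / (1 + np.exp(-logit_p))).astype(int)

    # Main effects plus a smoker-by-preterm interaction, mirroring the idea of
    # testing interaction terms alongside the main factors.
    model = smf.logit("lbw ~ mother_age + smoker + preterm + smoker:preterm",
                      data=df).fit(disp=0)
    print(model.summary())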

Srilatha S. Sekaripuram, Distribution Planning in Supply Chains - The Equal Periods of Supply (EPOS) Approach, March 26, 1998 (Ram Ganeshan, David Rogers, Michael Magazine)
One of the important challenges facing a distribution manager is the effective control of inventory.  Inventory is necessary and useful, but too much inventory is expensive.  If improperly managed, inventories become a significant liability, resulting in a reduction of profit and possible erosion of the firm's competitive advantage.  Hence, determining the proper inventory-management technique is important for the firm.  Distribution resource planning (DRP) is a computerized tool that has been aiding distributors in planning and in resolving some of the problems inherent in statistical ordering techniques.  Equal periods of supply (EPOS) is a DRP approach to scheduling replenishments for multiple products.  Using EPOS with DRP helps to reduce overall costs, keep inventory in check, and make planning convenient.  In this paper a heuristic method by which a distribution planner can incorporate the EPOS approach into DRP is presented.  Using this method results in optimal costs in situations where transportation costs dominate.
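
The EPOS idea can be illustrated in a few lines of Python: every SKU is ordered up to the same number of periods of forecast supply, so replenishments line up across products. The SKUs, forecasts, and on-hand quantities below are assumed values, not data from the paper.

    # Equal-periods-of-supply (EPOS) replenishment sketch with assumed data:
    # each product is ordered up to the same number of periods of forecast demand.
    forecast = {"SKU1": 40, "SKU2": 25, "SKU3": 10}    # per-period forecast demand
    on_hand  = {"SKU1": 55, "SKU2": 10, "SKU3": 35}
    periods_of_supply = 3                              # bring every SKU up to 3 periods of cover

    for sku, d in forecast.items():
        target = periods_of_supply * d
        order_qty = max(0, target - on_hand[sku])
        print(f"{sku}: order {order_qty} (target {target}, on hand {on_hand[sku]})")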

Timothy J. Cantor, Evaluating A Taxonomy of Supply Chain Management Research, November 7, 1997 (Ram Ganeshan, Michael Magazine, Amitabh Raturi)
As we approach the twenty-first century, the evolution of emerging management practices continues to unfold. Supply-chain management is one of the more rigorously debated movements. Supply-chain management covers the flow of goods from supplier through manufacturing and distribution chains to the end user. While not difficult to define, its complexity makes for uncertain boundaries and abstract scope. The areas where the discipline has been researched, and those where opportunities exist, must be identified. A taxonomical analysis shows that at least four such opportunities exist. The taxonomy is hierarchical, consisting of two principal levels. At the strategic level, papers generally deal with the means by which objectives and policies should be developed. At the operational level, authors explore the efficient operation of an established aspect of the chain. Both of these principal levels can then be divided horizontally: at the strategic level into the sub-levels explanatory essays and system representations, and at the operational level into the sub-levels coordination analysis and material flow analysis. Finally, each of these sub-levels is further segregated into categories by which a selection of current supply chain management literature is classified.

William Pordan, Evaluating NFL Quarterback Performance Efficiency Using Data Envelopment Analysis, July 10, 1997 (Michael Magazine, Jeffrey Camm, James Evans)
Managers are often faced with evaluating the performance of numerous operating units which produce multiple products and services. Comparison analyses can be desirable for identifying which units are performing at an efficient level, and which units are utilizing resources in an inefficient manner. This task becomes difficult when there exists no proper valuation mechanism for determining the worth of one product relative to another, or when expended resources are not readily priceable. A mathematical programming method known as data envelopment analysis (DEA) has been applied to such situations in performance assessment. DEA allows each operating unit to assign a unique set of weighting factors to its outputs and inputs so as to maximize its efficiency ratio. Constraints on the weight selections lead to the identification of relatively efficient and relatively inefficient units. This research project presents an overview of the theory and formulation of data envelopment analysis, and offers an application of its use in evaluating the performance efficiency of 1996 National Football League (NFL) quarterbacks. The production of each is ranked based on his DEA efficiency score, and a comparison is made with the NFL passer rating system currently used by the league.
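
The sketch below solves the standard input-oriented CCR multiplier model with SciPy's linear-programming routine for a handful of hypothetical quarterbacks, taking attempts and sacks as inputs and yards and touchdowns as outputs. The data and the choice of inputs and outputs are assumptions made for illustration, not the measures used in the project.

    import numpy as np
    from scipy.optimize import linprog

    # Illustrative quarterback data (assumed): rows are decision-making units (QBs).
    X = np.array([[520, 30], [470, 22], [390, 40], [610, 18]], dtype=float)   # inputs
    Y = np.array([[3800, 28], [3400, 25], [2500, 14], [4300, 33]], dtype=float)  # outputs

    def ccr_efficiency(k):
        """Input-oriented CCR efficiency of unit k, multiplier form."""
        n, m = X.shape            # n units, m inputs
        s = Y.shape[1]            # s outputs
        # variables: [u_1..u_s, v_1..v_m]; maximize u'y_k  ->  minimize -u'y_k
        c = np.concatenate([-Y[k], np.zeros(m)])
        # u'Y_j - v'X_j <= 0 for every unit j
        A_ub = np.hstack([Y, -X])
        b_ub = np.zeros(n)
        # normalization v'x_k = 1
        A_eq = np.concatenate([np.zeros(s), X[k]]).reshape(1, -1)
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                      bounds=[(0, None)] * (s + m), method="highs")
        return -res.fun

    for k in range(len(X)):
        print(f"QB {k}: DEA efficiency = {ccr_efficiency(k):.3f}")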

Christopher M. Lynd, Heuristic Solution to a Baseball Scheduling Problem, July 2, 1997 (Michael Magazine, Jeffrey Camm, James Evans)
Heuristic techniques and mathematical programming have often been at odds with one another. The mathematical-programming camp preaches global optimization, whereas the heuristic camp preaches tradeoffs. The question of which method to use should be decided on an individual problem basis. Some problems, especially large combinatorial problems, lend themselves to heuristic techniques. For instance, mathematical-programming techniques such as branch and bound and dynamic programming perform essentially no better than complete enumeration for NP-hard problems like the traveling salesman problem, which has (N-1)!/2 distinct tours for N cities. Users and developers must weigh the costs of global optimization, whether computing time, software, or development dollars, against the resulting benefits. In this paper, I define an NP-hard baseball-scheduling problem. I outline three different approaches to solving the problem: two heuristic techniques and one mathematical-programming technique. The two heuristics employed are tabu search and genetic algorithms. The mathematical-programming technique is integer programming. I present the results and outline the advantages and disadvantages of each technique.
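
As an illustration of one of the heuristic approaches mentioned, the following is a generic tabu-search skeleton (best admissible swap move, fixed tabu tenure, aspiration on new best solutions) applied to a toy sequencing problem; it is not the scheduling implementation developed in the paper.

    import random

    def tabu_search(initial, cost, neighbors, tenure=7, iterations=200):
        """Generic tabu-search skeleton: best-admissible swap moves with a
        fixed-tenure tabu list and an aspiration criterion."""
        current = best = initial
        best_cost = cost(best)
        tabu = {}                                   # move -> iteration it becomes legal again
        for it in range(iterations):
            candidates = []
            for move, sol in neighbors(current):
                c = cost(sol)
                aspiration = c < best_cost          # allow a tabu move if it beats the best
                if tabu.get(move, 0) <= it or aspiration:
                    candidates.append((c, move, sol))
            if not candidates:
                break
            c, move, current = min(candidates)
            tabu[move] = it + tenure
            if c < best_cost:
                best, best_cost = current, c
        return best, best_cost

    # Toy use: order 6 items to minimize a random pairwise "travel" cost.
    random.seed(1)
    n = 6
    d = [[0 if i == j else random.randint(1, 9) for j in range(n)] for i in range(n)]
    cost = lambda perm: sum(d[perm[i]][perm[i + 1]] for i in range(n - 1))
    def neighbors(perm):
        for i in range(n - 1):
            for j in range(i + 1, n):
                s = list(perm)
                s[i], s[j] = s[j], s[i]
                yield (i, j), tuple(s)

    print(tabu_search(tuple(range(n)), cost, neighbors))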

Ian Clough, Body Image2: Data Analyses, July 1, 1997 (Martin Levy, Terri Byczkowski [Cincinnati Children's Hospital], David Rogers)
This document is a report of a number of statistical analyses performed on a variety of data sets. Programs and computer output have been included in the appendices. The work was performed over a ten-month period.

Amy M. Anneken, Applying GIS and Benders' Partitioning to the Uncapacitated Facility Location Problem, June 12, 1997 (Dennis Sweeney, Jeffrey Camm, David Curry)
Facility location problems are very important and practical in business decision making today. The facility location model concerns finding locations to serve customers in an economical and high quality way. This project aims at providing a way to solve these types of problems in a manner that incorporates both objective and subjective means. The objective of this project is to explore the potential for an algorithm that involves both human and mathematical iterations. The problem studied is the Uncapacitated Facility Location problem. A Geographical Information System is used to assist the human decision maker in selecting good solutions. A Benders' partitioning algorithm is used to generate bounds and to suggest alternatives for the decision maker. A geographic computer interface that serves as a front end to an Operations Research algorithm has many advantages. Finding an optimal solution to a problem is the best alternative, but many companies never do this because they do not have the time or the expertise to do so. The results from this project can provide many benefits to both the business and OR community.
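
For reference, the underlying uncapacitated facility location model can be stated compactly. The PuLP sketch below solves a tiny assumed instance directly with a MIP solver; it does not reproduce the Benders' partitioning or the GIS interaction described in the project.

    import pulp

    # Illustrative instance (assumed data): fixed[i] = cost of opening site i,
    # c[i][j] = cost of serving customer j from site i.
    fixed = {"S1": 100, "S2": 120, "S3": 90}
    c = {"S1": {"A": 10, "B": 30, "C": 25},
         "S2": {"A": 20, "B": 12, "C": 18},
         "S3": {"A": 28, "B": 22, "C": 8}}
    customers = ["A", "B", "C"]

    prob = pulp.LpProblem("UFLP", pulp.LpMinimize)
    y = {i: pulp.LpVariable(f"open_{i}", cat="Binary") for i in fixed}
    x = {(i, j): pulp.LpVariable(f"serve_{i}_{j}", lowBound=0, upBound=1)
         for i in fixed for j in customers}

    prob += (pulp.lpSum(fixed[i] * y[i] for i in fixed)
             + pulp.lpSum(c[i][j] * x[i, j] for (i, j) in x))
    for j in customers:                       # every customer fully served
        prob += pulp.lpSum(x[i, j] for i in fixed) == 1
    for (i, j) in x:                          # only from open facilities
        prob += x[i, j] <= y[i]

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    print("open:", [i for i in fixed if y[i].value() == 1],
          "cost:", pulp.value(prob.objective))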

Angela Bansal, Discounts/Premiums on Country Funds - Time Series and Multivariate Analysis, June 3, 1997 (Martin Levy, David Rogers, Yong Kim)
In this paper, the time-series behavior of discounts/premiums of closed-end country funds is examined by using the models of Hardouvelis, La Porta, and Wizman (1993). The results show that most of the funds of emerging markets trade at a premium. This premium has predictive power for the fund's return but not for its net asset value returns. Results also show that country funds are a good diversification tool for US investors and that at least three local stock markets are cointegrated with the US market.

Rajdeep Grewal, The Long Run Advertising-Sales Relationship: Incorporating the Impact of Economic and Political-Legal Environments, May 12, 1997 (Martin Levy, Jeff Mills, Raj Mehta)
A methodological framework for investigating marketing parameter functions with time-varying coefficients is adopted to investigate the relationship between market performance (e.g., sales, market share), marketing effort (e.g., advertising, sales promotion), and environmental conditions (e.g., market growth, inflation). The nine-step framework relies on recent methodological developments in the econometric and time series (ETS) literature to present a sequence of statistical tests and estimation techniques. The authors elaborate on the framework to provide a rationale for expecting specific behavior by marketing performance variables, marketing effort variables, and environmental variables. Further, the authors illustrate the framework for the famous case of the Lydia Pinkham Medicine Company.

Jennie Bao Jin, A Markov Chain Analysis of the New York Stock Exchange Composite Index, May 2, 1997 (David Rogers, Martin Levy, Norman Bruvold)
The behavior of stock-market prices has been researched extensively via different empirical methods (Fama 1970, Poterba and Summers 1988, Fama and French 1988, Fama 1991). Whether certain price trends and patterns exist to enable the investor to make better predictions of the expected values of future stock market prices is still debatable. A number of researchers have shown that both the relative strength of a security in the market and the nature of its successive price movements may be interpreted within the framework of Markov theory (Dryden 1969, Fielitz and Bhargava 1973, Fielitz 1975, McQueen and Thorley 1991), and these studies are modeled in such a way as to provide useful information to individual investors and portfolio managers concerning stock-market movements. While most of the previous work in the area has been done in the individual-security setting, I investigate the relevant Markovian behavior with the entire stock market, which is represented by the NYSE Composite in this project. Relatively new data (from 1985 to 1995) are used to test and formulate both a first-order three-state (up, unchanged, and down) and a first-order two-state (up and down) Markov-chain model based on daily price changes of the NYSE Composite. Statistical inferences are conducted to test whether the NYSE Composite movements are random, which means the probabilities for the stock-market price's going up or down on a daily basis are the same. The organization of the paper is as follows: Section II is a brief review of the literature on Markov chain analysis of security prices. Section III is a description of the methodology and data used in this project. In Section IV the three-state Markov chain model is formulated and estimated. In Section V the two-state Markov chain model is estimated and a statistical inference test regarding the hypothesis of randomness of stock market movements is conducted. Section VI is a summary and conclusion of the paper.
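
The estimation step for the two-state model can be sketched briefly: count one-day transitions between up and down states, form the transition matrix, and test whether the rows differ (a chi-square test of independence is one simple choice). The code below uses synthetic up/down moves in place of the NYSE Composite series.

    import numpy as np
    from scipy.stats import chi2_contingency

    # Synthetic stand-in for daily index changes (+1 = up, -1 = down).
    rng = np.random.default_rng(0)
    moves = np.where(rng.random(2500) < 0.53, 1, -1)

    # First-order two-state transition counts: rows = today's state, cols = tomorrow's.
    states = {1: 0, -1: 1}
    counts = np.zeros((2, 2))
    for today, tomorrow in zip(moves[:-1], moves[1:]):
        counts[states[today], states[tomorrow]] += 1

    P = counts / counts.sum(axis=1, keepdims=True)     # estimated transition matrix
    print("transition matrix:\n", np.round(P, 3))

    # Chi-square test: if successive moves are independent (a 'random' market),
    # the two transition rows should not differ.
    chi2, pval, _, _ = chi2_contingency(counts)
    print(f"chi-square = {chi2:.2f}, p-value = {pval:.3f}")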

Himani Mohan, Application of Simulation Techniques in Operations Analysis and Facility Design, December 4, 1996 (David Kelton, David Rogers, Jeffrey Camm)

Marie D. Lane, Capacity Planning in the Machine Tool Environment: A Case Study of Ahaus Tool & Engineering, Inc., August 15, 1996 (David Rogers, Jeffrey Camm, Amitabh Raturi)
Issues that affect resource and production planning in the machine-tool industry are discussed in this paper. One company and its particular operating characteristics are the focus of the paper. Suggestions are made for possible improvements to its production-planning system. These suggestions are based on a literature search of the variety of production-planning systems and models that are available, as well as the researcher's observations of the company's operating practices and discussions with the company's management. A goal-programming model was developed that can be used as part of the production-planning process.

Laura Miser, Enrollment Projection Models at the University of Cincinnati, August 12, 1996 (Martin Levy, Jeffrey Camm, Corey Brewer)

Gregory A. Graman, The Effect of Variation in the Intermediate Delay on the Solution to the Multi-Echelon Inventory Problems with Newsboy-Style Results and Backorder Optimization, March 4, 1996 (David Rogers, Jeffrey Camm, Martin Levy)
The statistics of variance and standard deviation are used in many disciplines to provide a measure of the level of uncertainty that exists in a wide variety of situations and studies. The uncertainty of the intermediate delay in a multi-level inventory problem with newsboy-style results and an objective of minimizing backorders is examined. An expression for the standard deviation is derived, and the implementation of these results is discussed.

Thomas Osterhus, Development and Testing of an Integrated Model of Conservation Behavior, July 1995 (Martin Levy, Jeffrey Camm)

Mary J. Frey, A Discussion and Analysis of Mathematical Modeling Techniques for the Location of Retail Establishments Using Geodemographic Data, Autumn Quarter 1994 (Jeffrey Camm, David Curry, Dennis Sweeney)

Bernard B. O'Bryan, An Evaluation of Software System Designs Using Data Flow Diagrams, Data Dictionaries and Mini-Specifications, 1993 (Roger Pick)
An evaluator for an experiment involving software engineering discusses his part in the project.  The experiment had the evaluator -- without prior knowledge of the experiment (blind) -- rate data-flow diagrams, data dictionaries, and mini-specifications from software projects performed in teams.  The 'blind' evaluations were then used to rate the effectiveness of using computer-assisted software engineering (CASE) technologies.  Ten three-person teams, composed of undergraduate information-systems majors, independently developed a software product -- a Pascal pretty printer.  Four teams used the same automated CASE software, while the remaining teams did not use automated CASE software.  The major results of this experiment were (1) those teams that used the automated CASE software were able to code the programs in less time than those who did not, (2) all of the teams using automated CASE software were able to meet more of the requirements than those who did not use the software, and (3) the quality measures of the CASE-group designs were rated superior to those of the non-CASE-group designs.  Also, some literature is reviewed to give the reader a point of reference on data-flow diagrams, data dictionaries, and CASE tools in general.  Further, some biographical data on the 'blind' evaluator (the author) is included.

Patricia Laber, ACL Knee Brace Design Study: Data Analysis, September 3, 1993 (Martin Levy,  Jeffrey Camm)

Stephen E. Kelley, A Multiple Regression Model Used to Predict Indicated Airspeed, May 24, 1993 (Jeffrey Camm, Martin Levy, David Rogers)

William Milligan, Assessment of Collective Bargaining Issues with Sample Survey Methods - Design, August 3, 1992 (Martin Levy, Thomas Innis [Adjunct Associate Professor])

Karen Averbeck, Assessment of Collective Bargaining Issues with Sample Survey Methods - Analysis, August 3, 1992 (Martin Levy, Thomas Innis [Adjunct Associate Professor])

Deryck Lampe, On the Analysis of a Repeated Measures Design, June 11, 1992 (Martin Levy, David Rogers)

Jo A. Gallagher, Rating of Designs for a Study on Computer Assisted Software Engineering, July 17, 1991 (Roger Pick, Jeffrey Camm, Timothy Sale)

Barbara C. Zellner, Using Aggregation Methods to Solve Single-Commodity Transportation Problems, May 1990 (James Evans, David Rogers, Jeffrey Camm)
Many companies must routinely solve transportation problems.  However, because of time and hardware constraints, these problems are often not solved to optimality.  In many cases, the problems are not modeled.  This paper examines single-commodity transportation problems solved to optimality using a personal computer.  Aggregation is used to convert the original problem into a two-source transportation problem.  After solving the modified problem to optimality, the solution is disaggregated and used as a starting solution to the original problem.  The time to reach optimality using this two-step method is compared to the computational time of using a poor starting solution in the original problem and solving to optimality in one step.  Various methods of aggregation are used and discussed.
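
A small sketch of the aggregation-disaggregation idea follows: six sources are aggregated into two with supply-weighted costs, the small problem is solved, and the aggregate flows are split back among the member sources in proportion to their supplies to give a feasible starting solution whose cost can be compared with the one-step optimum. The instance, the grouping, and the supply-weighted aggregation rule are assumptions for the example; SciPy's LP solver is used for the solves and does not expose a warm-start interface, so only the costs are compared here.

    import numpy as np
    from scipy.optimize import linprog

    def solve_transportation(cost, supply, demand):
        """Balanced transportation LP via SciPy's HiGHS solver."""
        m, n = cost.shape
        A_eq, b_eq = [], []
        for i in range(m):                       # row sums = supply
            row = np.zeros((m, n)); row[i, :] = 1
            A_eq.append(row.ravel()); b_eq.append(supply[i])
        for j in range(n):                       # column sums = demand
            col = np.zeros((m, n)); col[:, j] = 1
            A_eq.append(col.ravel()); b_eq.append(demand[j])
        res = linprog(cost.ravel(), A_eq=np.array(A_eq), b_eq=b_eq,
                      bounds=(0, None), method="highs")
        return res.x.reshape(m, n), res.fun

    # Assumed instance: 6 sources, 4 destinations, balanced supply and demand.
    rng = np.random.default_rng(1)
    cost   = rng.integers(2, 20, size=(6, 4)).astype(float)
    supply = np.array([30, 20, 25, 15, 40, 20], dtype=float)
    demand = np.array([45, 35, 40, 30], dtype=float)

    # Aggregate the six sources into two groups; aggregate costs are supply-weighted.
    groups = [[0, 1, 2], [3, 4, 5]]
    agg_supply = np.array([supply[g].sum() for g in groups])
    agg_cost = np.array([(supply[g][:, None] * cost[g]).sum(axis=0) / supply[g].sum()
                         for g in groups])
    agg_flow, _ = solve_transportation(agg_cost, agg_supply, demand)

    # Disaggregate: split each group's flows among its members in proportion to supply.
    start = np.zeros_like(cost)
    for k, g in enumerate(groups):
        start[g] = np.outer(supply[g] / supply[g].sum(), agg_flow[k])

    opt_flow, opt_cost = solve_transportation(cost, supply, demand)
    print("cost of disaggregated starting solution:", (cost * start).sum())
    print("optimal cost of the original problem:   ", opt_cost)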

Calvin Taylor, Multivariate Analyses of Telephone Company Data, August 2, 1989 (John Bryant, Martin Levy)

Yiching Lee, Bayesian Approach to Testing Equilibrium in a Segmented Line Model, 1987 (John Bryant, Martin Levy, Jeffrey Camm)

Steve Nielsen, Cost Reduction of Paper Manufacturing Through Quality Control of Pulp Production, December 11, 1987 (Jeffrey Camm, David Anderson)

Mark Kleinhenz, An Algorithm Using Aggregation to Solve a Large Scale Linear Programming Problem to Optimality, June 26, 1986 (James Evans, David Rogers, Jeffrey Camm)
As Lasdon has remarked, the solution of linear-programming problems is often hampered by size -- the problem is simply too big. The cost of supercomputers and technological limitations are two reasons for the difficulty in solving such problems. Supercomputers, such as those manufactured by Cray of Minnesota, can cost between five and 15 million dollars. But even supercomputers are limited in the size of problems that they can solve, although these limitations are continually being extended by technological advances. The need for solutions to large-scale linear-programming problems has inspired the development of several solution strategies. One such strategy is that of aggregation. After developing a smaller but similar problem to the original problem through the clustering of the latter's columns or rows, this aggregate problem is solved obtaining a solution that is 'close' to that of the original problem. Techniques have been described for reformulating the aggregate problem and improving its accuracy of solution. In this paper the technique of aggregation is applied as a step in an algorithm to solve a large-scale general linear-programming problem to optimality. The format of this paper is as follows. The experimental design and the method of problem generation are presented in Section I. Section II describes the algorithm and a sample problem is solved to illustrate the working of the algorithm. Section III details the computer software and hardware employed in the research project: computational results are presented as well. Section IV is an evaluation of the algorithm using the quality of the basis as the criterion. A related issue is addressed in Section V: the question of whether the use of even-weighting or the use of weighting provides better quality of solution after solving the aggregate problem. In the final section, conclusions are drawn and further work is suggested.

Joel I. Kahn, Analysis of Automatic Warehousing System Operating Policies, June 1981 (James Evans)
This study deals with operating policies for Automatic Storage/Retrieval (AS/RS) warehousing systems.  Four system design parameters are investigated: storage algorithms, retrieval algorithms, level of storage utilization, and level of crane utilization.  The system investigated has a single crane which stores and retrieves product from either side of a storage aisle.  The aisle contains two racks, each having 1,000 storage locations.  Each storage location is capable of handling one pallet, and each pallet contains only one product type.  The question to be focused on, relative to the system studied, is the way in which the four parameters mentioned above affect crane travel distance.  This study answers the question by developing a digital simulation model of the system being investigated.  Where possible, the results of the simulation are compared with analytic and other simulation results in order to validate the model.  The simulation can act as a design aid for future systems by allowing the designer to vary the four parameters mentioned and obtain their impact on system measures of performance.  The type of system studied is common in Japanese industry and is beginning to appear in America.  A perceived major barrier to more widespread utilization of these systems in America has been the inability to accurately predict the rate of return on these investments.  Having a model which can accurately predict system operating characteristics will greatly aid the rate-of-return analysis process.
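
A much simplified Monte Carlo sketch of the crane-travel question is given below: for an assumed 50-bay by 20-level rack, it compares average single-command travel under random storage and closest-open-location storage at several utilization levels. It is a static illustration only, not the digital simulation model developed in the study.

    import random

    # Sketch of one side of a single-aisle AS/RS rack: 50 bays x 20 levels = 1,000
    # locations. Crane travel is Chebyshev (simultaneous horizontal and vertical
    # movement) from the input/output point at bay 0, level 0.
    BAYS, LEVELS = 50, 20
    locations = [(b, l) for b in range(BAYS) for l in range(LEVELS)]
    travel = lambda loc: max(loc[0], loc[1])          # one-way travel "distance"

    def average_travel(policy, utilization, trials=2000, seed=1):
        rng = random.Random(seed)
        total = 0.0
        for _ in range(trials):
            occupied = set(rng.sample(locations, int(utilization * len(locations))))
            empty = [loc for loc in locations if loc not in occupied]
            if policy == "random":
                loc = rng.choice(empty)
            else:                                     # closest-open-location storage
                loc = min(empty, key=travel)
            total += 2 * travel(loc)                  # store and return (single-command cycle)
        return total / trials

    for util in (0.5, 0.8, 0.95):
        print(f"utilization {util:.0%}: random = {average_travel('random', util):.1f}, "
              f"closest-open = {average_travel('closest', util):.1f}")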

Sharon Hannig-Smith, An Airport Passenger Processing Simulation Model, January 1981 (James Evans)