Predictive Analytics

Predicting Outcomes of New Entrants

In business consulting and competitor analysis, a question that often arises is “what would happen if I were to open a new business here?” This is a counterfactual question, and the consultant or analyst cannot answer it from data analysis alone. The relationships among variables would have been different had the new entrant existed, but by definition the new entrant was not there when the data was collected, so the “naïve” patterns learned from historical data are subject to logical contradictions and poor generalizability.

In this article, I argue that with the help of a conceptual model, it may be possible to predict outcomes of a new entrant in a logically consistent way. I will use a simple example to show what this means.

Suppose I were a fitness franchisor deciding whether to open a new gym in a city. I gathered some market intelligence and found that there were L locations where customers might come from (e.g. the more affluent neighborhoods) and G gyms already competing with each other in the city. I also did some market research and obtained the total fitness sales revenue for each of the L locations (e.g. from the Consumer Expenditure Surveys).

With all this information, I define a market penetration score (MPS): $$MPS_{lg}=\frac{ISR_{lg}}{TSR_{l}},$$ where ISR is the individual sales revenue (of gym g at location l), and TSR is the total sales revenue (at l). While I cannot observe every g’s individual sales revenue, I do know the total sales revenue based on my market research, and I also know that $$TSR_{l}=\sum_{g=1}^{G}ISR_{lg},$$ which is just a definition.

I then imagine opening a new gym somewhere in the city, and call it h. Instinctively, I would predict the ISR of the new entrant at l via a predicted MPS (by training a machine learning model, which I will describe later) together with my information on TSR: $$\widehat{MPS}_{lh}\times TSR_{l}.$$ However, the scores of the existing gyms at l already sum to 1, so adding the new gym brings the total to $$1+\widehat{MPS}_{lh}$$ and every score has to be rescaled by that amount for the revenues to still add up to TSR. The ISR of each current g at location l is therefore expected to fall by $$\frac{\widehat{MPS}_{lh}}{1+\widehat{MPS}_{lh}}\times ISR_{lg}.$$ As a result, the predicted ISR of the new entrant should be modified to $$\frac{\widehat{MPS}_{lh}}{1+\widehat{MPS}_{lh}}\times TSR_{l}.$$
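To make the adjustment concrete, here is a minimal sketch of the arithmetic at a single location. The function name and the numbers are made up for illustration; only the formulas come from the text.

```javascript
// Hypothetical helper: applies the adjustment at one location l.
// mpsHat       - predicted "raw" MPS of the new gym h at l
// tsr          - total sales revenue at l
// incumbentIsr - current ISR of each existing gym at l (sums to tsr)
function adjustForNewEntrant(mpsHat, tsr, incumbentIsr) {
  const scale = 1 + mpsHat; // scores now sum to 1 + mpsHat, so rescale
  return {
    naiveIsr: mpsHat * tsr, // the "instinctive" prediction
    entrantIsr: (mpsHat * tsr) / scale, // the adjusted prediction
    incumbentIsr: incumbentIsr.map((isr) => isr / scale),
  };
}

// Illustrative numbers only.
const adjusted = adjustForNewEntrant(0.25, 1000, [600, 400]);
console.log(adjusted);
```

With these inputs the naive forecast of 250 shrinks to 200, the two incumbents drop to 480 and 320, and the revenues still add up to the original total of 1000.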

This is what I mean by having a conceptual model: thinking clearly about the observed patterns being learned by machine learning models vs. the inner workings being applied to get logically consistent predictions.

Now comes the machine learning part. The predicted “raw” MPS can be obtained by fitting $$MPS_{lg}=MPS\left(D_{lg},A_{lg};\Theta\right)$$ to past data from each of my other gyms g at every location l, in the same city or in another area. Here, D represents the distance-specific characteristics between g and l, A represents other store-specific characteristics, and $$\Theta$$ is the parameter set to be estimated. There are a great many ways in machine learning to automatically find a function and parameter set for MPS, but oftentimes a traditional functional form specification such as the Huff model might work just as well (and is easy to interpret).
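As a rough illustration of the Huff-style specification mentioned above, the sketch below computes each store's share at one location from an attractiveness measure and a distance. This is one common textbook form of the model, not necessarily the exact specification I would fit; the exponents `alpha` and `beta` stand in for the parameter set and would normally be estimated from data, and all names and numbers are assumptions.

```javascript
// Huff-type shares at a single customer location.
// stores: [{ attractiveness, distance }] (e.g. floor area and travel distance)
// alpha, beta: assumed exponents; in practice they are estimated, not chosen
function huffShares(stores, alpha, beta) {
  const utilities = stores.map(
    (s) => Math.pow(s.attractiveness, alpha) / Math.pow(s.distance, beta)
  );
  const total = utilities.reduce((sum, u) => sum + u, 0);
  // Each store's share is its utility relative to all stores at this location.
  return utilities.map((u) => u / total);
}

// Illustrative inputs only: two equally attractive gyms, one twice as far away.
const shares = huffShares(
  [
    { attractiveness: 1000, distance: 2 },
    { attractiveness: 1000, distance: 4 },
  ],
  1,
  2
);
console.log(shares);
```

With a distance exponent of 2, the gym that is twice as far away gets a quarter of the nearer gym's utility, which is where the easy interpretation comes from.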

Finally, by changing site locations and store features, I can search for the best set of characteristics (subject to certain constraints) that maximizes the sales revenue of my new gym. Sweet!
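The search itself can be as simple as scoring every candidate and keeping the best one. The sketch below assumes the predicted raw MPS of each candidate at each location is already available (from whatever model was fitted) and applies the adjusted formula before summing over locations. The candidate names and numbers are hypothetical, and constraints are left out.

```javascript
// Total predicted revenue of one candidate across all locations,
// using the adjusted share mpsHat / (1 + mpsHat) at each location.
function predictedRevenue(mpsHatByLocation, tsrByLocation) {
  return mpsHatByLocation.reduce(
    (sum, m, l) => sum + (m * tsrByLocation[l]) / (1 + m),
    0
  );
}

// Plain exhaustive search; a real search would also filter by constraints.
function bestCandidate(candidates, tsrByLocation) {
  let best = null;
  for (const c of candidates) {
    const revenue = predictedRevenue(c.mpsHat, tsrByLocation);
    if (best === null || revenue > best.revenue) {
      best = { name: c.name, revenue };
    }
  }
  return best;
}

// Illustrative numbers only: two locations, two candidate sites.
const tsrByLocation = [1000, 2000];
const candidates = [
  { name: "site A", mpsHat: [0.25, 0.0] },
  { name: "site B", mpsHat: [0.0, 0.25] },
];
console.log(bestCandidate(candidates, tsrByLocation));
```

Here site B wins because the same predicted score is applied to a location with twice the total revenue.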

Data Visualization, Exploratory Data Analysis

Data Visualization Using Chart.js

Chart.js is a popular library that produces fast, elegant, and interactive charting and data visualization solutions using HTML and JavaScript. The API is very straightforward and easy to use. Chart.js has been widely deployed in admin dashboards and other user-friendly metrics-driven applications, working seamlessly with responsive web design libraries such as Bootstrap 4.

Chart.js also appears in statistics and data analytics oriented applications, though it is less known in the data science community than the more heavyweight D3.js. While no tool alone can solve all data visualization problems, I argue that Chart.js (together with its rich set of plugins) is able to handle most of the data visualization requests during the exploratory data analysis (EDA) phase of everyday data science practice. Chart.js is especially useful when such tasks are performed in a web application that is accessible to various stakeholders who simply want to discover useful information from data to support their decision-making, regardless of how much programming experience they have.

For univariate data visualization, Chart.js is very handy for plotting summary statistics and distributions via the bar chart (histogram of a numerical variable as well as frequency counts of a categorical variable), the pie chart or the doughnut chart (relative frequencies of a categorical variable), and the line chart (values of a time series variable). The tooltip element of Chart.js makes it convenient for the user to instantly locate the exact number behind a bar or a slice of pie.
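Chart.js has no dedicated histogram type, so a histogram is usually drawn as a bar chart over counts that are binned beforehand. The sketch below shows one way to do that with equal-width bins; the helper name, the bin count, and the sample values are all assumptions for illustration.

```javascript
// Hypothetical helper: bins numeric values into equal-width intervals.
function histogramCounts(values, binCount) {
  const min = Math.min(...values);
  const max = Math.max(...values);
  const width = (max - min) / binCount || 1; // fall back to 1 if all values are equal
  const counts = new Array(binCount).fill(0);
  for (const v of values) {
    // The maximum value is placed in the last bin rather than one past it.
    const i = Math.min(Math.floor((v - min) / width), binCount - 1);
    counts[i] += 1;
  }
  const labels = counts.map((_, i) => {
    const lower = min + i * width;
    return `${lower.toFixed(1)} to ${(lower + width).toFixed(1)}`;
  });
  return { labels, counts };
}

const binned = histogramCounts([1, 2, 2, 3, 5, 8, 9, 10], 3);

// A plain bar chart configuration over the binned counts.
const histogramConfig = {
  type: "bar",
  data: {
    labels: binned.labels,
    datasets: [{ label: "Frequency", data: binned.counts }],
  },
};
// In a browser with Chart.js loaded: new Chart(canvasElement, histogramConfig);
console.log(JSON.stringify(histogramConfig.data));
```

The same configuration shape works for frequency counts of a categorical variable; only the binning step goes away.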

For multivariate data visualization, in many cases, the user can simply add a second, third, … variable to the one-variable chart by appending data to the value for the datasets key, if it is appropriate to compare these variables with each other. Chart.js creates a “dataset” label for each new variable and shows it in a different color. (See official examples here and here.) When the user clicks on such labels/legends, Chart.js will toggle the visibility of the clicked variables, which essentially performs a select/filter operation.
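As a minimal sketch of what appending a variable looks like, the configuration below starts with one dataset and then pushes a second one onto the `datasets` array. The helper function, the labels, and the numbers are hypothetical; the `type`, `data.labels`, and `data.datasets` keys follow the usual Chart.js configuration layout.

```javascript
// A one-variable bar chart configuration (illustrative data).
const chartConfig = {
  type: "bar",
  data: {
    labels: ["Q1", "Q2", "Q3", "Q4"],
    datasets: [
      { label: "Variable A", data: [12, 19, 3, 5], backgroundColor: "steelblue" },
    ],
  },
};

// Hypothetical helper: appends one more variable as a new dataset.
function addVariable(config, label, values, color) {
  if (values.length !== config.data.labels.length) {
    throw new Error("values must align with the existing labels");
  }
  config.data.datasets.push({ label, data: values, backgroundColor: color });
  return config;
}

addVariable(chartConfig, "Variable B", [7, 11, 5, 8], "darkorange");

// In a browser with Chart.js loaded: const chart = new Chart(canvasElement, chartConfig);
// For a chart that already exists, push to chart.data.datasets and call chart.update().
console.log(chartConfig.data.datasets.map((d) => d.label));
```

Each dataset's `label` is what shows up in the legend, so the click-to-toggle behavior comes for free once the second dataset is in place.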

Chart.js is also good for presenting some important features of a two-way table with an intuitive interface. Here I give two examples (here and here) of two-variable data visualization for EDA in Chart.js. My first example uses the scatter chart to plot a y variable against an x variable, where y is categorical. My second example utilizes the bubble chart to plot a categorical y against a similarly categorical x. The flexibility of Chart.js makes it simple to customize configurations using callbacks, and data can be freely manipulated through the API itself. In my second example, the raw data numbers in the r dimension (radius of the bubbles) are correctly squared and rounded as they are displayed in the tooltips.
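A tooltip callback along these lines could do the squaring and rounding. This is a sketch rather than the code behind my example: it assumes each bubble's radius is stored as the square root of a cell count (so that bubble area tracks the count), and it uses the Chart.js v3+ callback signature, where the label callback receives a single context object. In Chart.js v2 the callback lives under `options.tooltips.callbacks` and takes `(tooltipItem, data)` instead. The category names and counts are invented.

```javascript
// Assumed convention: radius = sqrt(count). Any extra display scaling
// would also have to be undone in the tooltip.
function toBubble(x, y, count) {
  return { x, y, r: Math.sqrt(count) };
}

// Tooltip label callback (Chart.js v3+ style: one context argument).
function bubbleLabel(context) {
  const point = context.raw;
  // Square the radius to recover the count, then round off float error.
  const count = Math.round(point.r * point.r);
  return `${point.y} / ${point.x}: ${count}`;
}

const bubbleConfig = {
  type: "bubble",
  data: {
    datasets: [
      {
        label: "Counts",
        data: [toBubble("Male", "Member", 7), toBubble("Female", "Member", 12)],
      },
    ],
  },
  options: {
    scales: {
      // Both axes are categorical, as in a two-way table.
      x: { type: "category", labels: ["Male", "Female"] },
      y: { type: "category", labels: ["Member", "Non-member"] },
    },
    plugins: { tooltip: { callbacks: { label: bubbleLabel } } },
  },
};

// Simulate what Chart.js would pass for the first bubble.
console.log(bubbleLabel({ raw: bubbleConfig.data.datasets[0].data[0] }));
```

Without the rounding, squaring a stored square root can produce values such as 7.000000000000001, which is why the callback rounds before display.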

I am sure there are numerous other ways in Chart.js to play with the data and the charts, and hopefully, this article will help us get started.
