A very brief review of Predictive Analytics

While every project has its own peculiarities and history, four key steps appear in most: Data Review, Variable Generation, Model Building and, of course, Deployment. These steps are described only minimally on this page; the reader should keep in mind that the brevity of the treatment and the practical nature of such work often raise more questions than they answer. The intended audience is both those who are simply interested in predictive analytics and practitioners looking for a different characterization and different approaches.

Though my site uses no JavaScript, this page experiments with MathJax to display math seamlessly alongside the text. Some of the content may come from the books I have been working on, though the level of math here is kept friendly to HTML.

The image was created using TikZ (a LaTeX package), and then converted from PDF to JPEG using Adobe Photoshop.
1. Data Review
Let us assume that the data is available in a form that can be read by the available tools. Data Warehousing is covered well by many technical and non-technical publications and blogs; a simple web search will likely suffice. It is easy enough to figure out how to read a database or to aggregate data once one knows the utility or goal at the end of the data processing step, which could be as mundane as the finance team needing it for forecasting.
A) Provenance
The very first piece of useful information is how many of the values are missing. Often the data is the result of joining tables from several databases, and a missing value has a trail that contains important information. For example, consider the case where a customer has never purchased a product. Typical processing steps will mark the relevant column as NULL. That is not necessarily the same as zero. If the product is Windows 8 and the last date on which the customer purchased anything is August 13, 2004, that is a completely different situation.

Consider a more realistic case where we join multiple tables based on a customer's unique ID. Let's say we begin with the customer information, and then join records containing purchases that meet certain criteria. As the next step we join the returns. A customer who did not purchase the item of course could not have returned it. That is different from the case where the customer could have returned it but did not do so. Such distinctions are critical in approaching the problem. Practitioners, particularly with SAS, use special values, often very large negative values, to enumerate the combinations.
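As a minimal sketch of how the trail survives the joins, the following uses three tiny hypothetical tables (the table and column names are purely illustrative, not from any real schema). A left join leaves NaN where no matching row exists, which lets us separate a customer who never purchased from one who purchased but never returned.

```python
import pandas as pd

# Hypothetical toy tables; column names are illustrative only.
customers = pd.DataFrame({"cust_id": [1, 2, 3]})
purchases = pd.DataFrame({"cust_id": [1, 2], "purchase_amt": [120.0, 75.5]})
returns = pd.DataFrame({"cust_id": [1], "return_amt": [120.0]})

# Left joins preserve the trail: a customer with no purchase row ends up
# with NaN, which is not the same thing as a purchase of zero.
df = (customers
      .merge(purchases, on="cust_id", how="left")
      .merge(returns, on="cust_id", how="left"))

# Distinguish "never purchased" (purchase missing) from
# "purchased but never returned" (purchase present, return missing).
never_purchased = df["purchase_amt"].isna()
purchased_no_return = df["purchase_amt"].notna() & df["return_amt"].isna()
print(df.assign(never_purchased=never_purchased,
                purchased_no_return=purchased_no_return))
```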

B) Summarizing information

Consider the case where we wish to learn more about a variable or field of interest. For ordinal or nominal variables, the preliminary approach is obvious. It is for continuous variables that one has to be careful. With the great leaps in computation, a double carries roughly 16 decimal digits of precision. That precision is invaluable when performing deep scientific calculations or implementing algorithms such as Baum-Welch or a Bayesian model, where one must avoid numerical underflow. But for exploratory data analysis, at least, those decimal places are not the most relevant.
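To make the underflow point concrete, here is a small sketch (the numbers are synthetic, not from any real model) showing why such algorithms are usually run in log space:

```python
import numpy as np

# Products of many small probabilities underflow a double,
# which is why algorithms like Baum-Welch work in log space.
probs = np.full(500, 1e-3)           # 500 factors of 0.001
naive_product = np.prod(probs)       # underflows to exactly 0.0
log_product = np.sum(np.log(probs))  # stays finite: 500 * ln(1e-3)

print(naive_product)   # 0.0
print(log_product)     # about -3453.9
```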

A practical solution is some sort of binning or bucketing, where all observations in a range are grouped together. It might help to start with the better-known histogram, introduced by none other than Karl Pearson. Graphically, histograms are drawn as bars or columns, and the width of each bar stands for the range of data it covers. Very often, histograms use bins of identical width. For data with central tendency and compact support that is quite reasonable, though one still fusses over how many columns to create and how to display the labels.
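A minimal sketch of an equal-width histogram follows; the data here is a synthetic stand-in, not the actual tournament results discussed below.

```python
import numpy as np

# Synthetic stand-in for margins of victory (not the real NCAA data).
rng = np.random.default_rng(0)
margins = rng.poisson(lam=11, size=2000) + 1   # counts, bounded below at 1

# Equal-width bins: every bar covers the same range of margins.
counts, edges = np.histogram(margins, bins=15)
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"[{lo:5.1f}, {hi:5.1f})  {'#' * (c // 10)}")
```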

For illustration, we use the margins of victory in NCAA Men's March Madness historical results from 1985 to 2012 (the plot was generated in R). Of course, the margin of victory is at least one. The average is 11.6195 but the median is 10, which tells us the data is skewed a bit to the right. That is expected because the margin is bounded on the lower end at 1. The standard deviation computes to 8.86. Of course the distribution is not Gaussian at all; in fact it is not even symmetric. If we plotted the histogram in reverse order of the margin, there would not be much overlap with the plot shown here.

That many games are close is no surprise. However, there seems to be a separate clump around 14 and another around 17, and perhaps one at 9. Relatively mismatched teams will likely have some very large margins, or maybe the margins become tighter after the first week? Has it been changing over the years? Are there more teams in the draw, or do they otherwise play more games? All these questions of course are possible hypotheses which the analyst or data scientist can weigh with available information.

[Figure: histogram of the margins of victory, NCAA Men's March Madness, 1985-2012.]

Data binning can be thought of as quantization that ideally loses as little information as possible about the underlying process generating the observations. Much of the secret sauce in organizations using Predictive Analytics is manifest in the choice of bins.
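As one illustration of why the choice matters, the sketch below contrasts equal-width bins with equal-frequency (quantile) bins on a synthetic right-skewed sample; all numbers are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
# Right-skewed synthetic values, e.g. purchase amounts.
values = rng.lognormal(mean=3.0, sigma=1.0, size=10_000)

# Equal-width bins: most observations pile into the first bin or two.
width_edges = np.linspace(values.min(), values.max(), 11)
width_counts, _ = np.histogram(values, bins=width_edges)

# Equal-frequency (quantile) bins: edges adapt, each bin holds about 10%.
quant_edges = np.quantile(values, np.linspace(0, 1, 11))
quant_counts, _ = np.histogram(values, bins=quant_edges)

print("equal width    :", width_counts)
print("equal frequency:", quant_counts)
```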

Univariate frequencies are almost trivial but can be insightful. With Big Data in particular, one has to be careful not to read too much too soon. Is it really informative to say that the average is 24.91? How useful is the knowledge that the standard deviation is 101.86? Prima facie it appears that the spread greatly overshadows the central tendency and perhaps the values are completely unpredictable.

C) The "sniff" test

To some extent the above information is useful. How does it change if 94.3% of the values are zero? Because the average is a linear operator, we can back out the average of the non-zero values: 24.91/0.057 ≈ 436.90. In other words, the values are very often zero, but when they are greater than zero they are of the order of several hundred. In many analyst positions that alone is already an insight, by way of a tighter specification and greater acuity.

A measure of spread for the non-zero values is only a little more involved, and doable in this case, computing to 44.34. That is a huge change. The coefficient of variation (defined as the ratio of the standard deviation to the average), \(c=\sigma/\mu\), moves from a large value of 101.86/24.91 = 4.09 to a much tighter 44.34/436.9 = 0.10!
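A small sketch of the same arithmetic on synthetic data (the fractions and magnitudes below only roughly mirror the numbers quoted above) looks like this:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
# Synthetic zero-inflated values: roughly 94% zeros, the rest of the
# order of a few hundred.
is_nonzero = rng.random(n) < 0.057
values = np.where(is_nonzero, rng.gamma(shape=100, scale=4.37, size=n), 0.0)

overall_mean, overall_sd = values.mean(), values.std()
nonzero = values[values > 0]
nonzero_mean, nonzero_sd = nonzero.mean(), nonzero.std()

# Because the mean is linear, overall_mean ~ fraction_nonzero * nonzero_mean.
print("fraction non-zero     :", is_nonzero.mean())
print("overall mean, sd, CV  :", overall_mean, overall_sd, overall_sd / overall_mean)
print("non-zero mean, sd, CV :", nonzero_mean, nonzero_sd, nonzero_sd / nonzero_mean)
```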

To some it looks like magic, but in reality it is grade-school math and analytical thinking. While the first two moments are easy to calculate, they are most meaningful when the distribution of values has one strong peak and the range of values has no holes. Mathematically speaking, these properties are central tendency and compact support. It is these underlying assumptions that must be probed before proceeding on the basis of two or three numerical values.

D) Shape of measurements

Consider the above case with a further modification. For the same average and the same standard deviation, the general spread could be skewed in one of two ways: there could be a few large values and many small values, or vice versa. That information is material and often reveals the general nature of the physical reality the measurements are attempting to capture. It is very dangerous to assume that everything follows the Gaussian distribution.
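A quick sketch of that point on synthetic data, using scipy's sample skewness: the two samples below share the same mean and standard deviation but are mirror images in shape.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(3)
right_skewed = rng.gamma(shape=2.0, scale=10.0, size=50_000)  # many small, a few large
# Reflecting about the mean keeps the mean and standard deviation but flips the skew.
left_skewed = 2 * right_skewed.mean() - right_skewed

for name, x in [("right-skewed", right_skewed), ("left-skewed ", left_skewed)]:
    print(name, round(x.mean(), 2), round(x.std(), 2), round(skew(x), 2))
```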

D1) Discrete measurements

Many measurements in big data are counts. Counts are easy to implement and they reflect reality: number of visits to a portal or search engine, number of purchases from an e-Commerce site, number of complaints, number of positive and negative reviews... the list is long. What distribution can one expect for counts? Counts cannot be negative by definition, though an observed count of 0 might arise from missing data, latent variables and other such factors. Unless the counts are aggregated over a time scale much greater than that of the dynamics they are trying to capture, one is unlikely to see a Gaussian distribution. Instead, the first check should be for the Poisson distribution, which is defined as follows \begin{equation} P(k) = \frac{e^{-\lambda}\, \lambda^k}{k!} \; \mathrm{for} \; k=0,1,2,\ldots \end{equation} Note that it has only one parameter \(\lambda\), and the mean and variance of the Poisson are equal: both are \(\lambda\)! A relevant quantity here is the dispersion, defined as the ratio of the variance to the mean -- for Poisson it is 1 by definition. It is a very simple distribution which has found use in many areas: the number of buses arriving in a time period T, the number of visits to the supermarket in a week.
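The dispersion check is easy to sketch; the counts below are synthetic, standing in for something like weekly supermarket visits.

```python
import numpy as np

rng = np.random.default_rng(4)
counts = rng.poisson(lam=3.2, size=20_000)   # synthetic count column

mean, var = counts.mean(), counts.var()
dispersion = var / mean
print(f"mean={mean:.3f}  variance={var:.3f}  dispersion={dispersion:.3f}")
# Dispersion close to 1 is consistent with Poisson; well above 1 suggests
# overdispersion and a Negative Binomial (see below).
```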

What if the distribution has a higher variance than mean? The Negative Binomial distribution might be of interest here. If the probability of a failure is \(p\) and the experiment is stopped after \(r\) failures, then the distribution of the number of successes \(k\) is given as follows \begin{equation} NB(r,p) = {k+r-1 \choose k} p^r (1-p)^k \end{equation}

While the original definition arose from Bernoulli trials (only two outcomes, typically labeled 0 and 1), one can simply interpret failure as the outcome of interest and cast a success as its complement. From that point of view, \(k\) is simply how many uninteresting events we have to observe before seeing \(r\) events of interest. The Negative Binomial distribution can be extended to real values of \(r\) and \(k\), and in that case it is often called the Pólya distribution. This distribution is skewed to the right for low values of \(pr\), but with a high value of \(r\) and a non-degenerate value of \(p\) it becomes symmetric.
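The pmf above can be checked against scipy's parametrization, in which the role of the text's "failure" is played by what scipy labels a success; the sketch also shows the overdispersion directly.

```python
import numpy as np
from scipy.special import comb
from scipy.stats import nbinom

# r failures stop the experiment; p is the probability of a failure.
r, p = 4, 0.3
k = np.arange(0, 15)

manual = comb(k + r - 1, k) * p**r * (1 - p)**k
library = nbinom.pmf(k, r, p)
print(np.allclose(manual, library))    # True: same pmf, different labeling

# Overdispersion: the variance exceeds the mean whenever p < 1.
mean, var = nbinom.stats(r, p, moments="mv")
print(float(mean), float(var), float(var / mean))
```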

This is by no means a comprehensive list of discrete distributions one should consider. In fact, before considering a distribution one would do well to apply logic. If one column of values (appearing to be an integer or a long integer) has many unique values and each value appears once or twice, there is a good chance it is a hash code. If it only seems to take one value, perhaps more data needs to be analyzed: it might just be the hour of the day, and very often transactions in a database are recorded in (almost) exact temporal order. The caveat is for real-time and parallel processing systems.

In summary, keep the following patterns in mind (a small diagnostic sketch follows the list):

  1. One or two unique values: perhaps the hour of day, or hashed or anonymized Personally Identifiable Information (PII) such as a social security number (SSN), date of birth (DOB) or phone number. The number of digits offers strong hints. If you think you are seeing some PII, the best course of action is to throw a security cordon around the data and limit access to the files or directories containing it, unless it is your job to deal with PII and you have all the clearances needed.
  2. Many unique values with similarly low counts: Probably a primary key or Id, store number, zip codes (esp. zip+4)
  3. Central tendency: If the values look like counts, the mean and variance are similar and the distribution is roughly symmetric, the Poisson distribution is advised. For overdispersed cases, consider the Negative Binomial distribution
  4. More than one peak: Possibly a mixture of two or more distributions. It is an advanced topic, but looking for this pattern can be quite instructive
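The sketch below is one rough way to automate this first pass; the fields reported and the demo column are purely illustrative.

```python
import numpy as np
import pandas as pd

def quick_profile(col: pd.Series) -> dict:
    """A rough first pass over one column; the fields are illustrative, not exhaustive."""
    values = col.dropna()
    profile = {
        "n": int(len(values)),
        "n_missing": int(col.isna().sum()),
        "n_unique": int(values.nunique()),
        "max_count_per_value": int(values.value_counts().max()) if len(values) else 0,
    }
    if len(values) and np.issubdtype(values.dtype, np.number):
        mean, var = values.mean(), values.var()
        profile.update({
            "mean": float(mean),
            "variance": float(var),
            "dispersion": float(var / mean) if mean != 0 else float("nan"),
            "skewness": float(values.skew()),
        })
    return profile

# Example: a count-like column with dispersion near 1 points toward Poisson.
demo = pd.Series(np.random.default_rng(5).poisson(4, size=1_000))
print(quick_profile(demo))
```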
D2) Continuous measurements

In the digital age it is hard to rigorously define continuous versus discrete, since computers can never have infinite precision. Intuitively, continuous measurements are those whose values have decimal points, and of course they include the integers. Mathematically, continuous measurements are easier to manipulate in many cases and much less tractable in others.

Consider what continuous measurements arise from. In the simplest case, one has an amount. Currency typically has a precision of 0.01, whereas weight, volume and distance can have more precision, though even weight and distance have limited precision in most cases. A day has 86,400 seconds; consequently, if every measurement arises from dividing a count of seconds by 86,400, one can easily deduce that.
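Here is one small sketch of such a deduction on synthetic values: multiply by the suspected divisor and check for near-integers.

```python
import numpy as np

rng = np.random.default_rng(6)
seconds_of_day = rng.integers(0, 86_400, size=1_000)
column = seconds_of_day / 86_400        # what actually lands in the table

# Multiplying back by the suspected divisor should give near-integers.
rescaled = column * 86_400
is_integerish = np.allclose(rescaled, np.round(rescaled), atol=1e-6)
print(is_integerish)   # True: the column is almost certainly seconds / 86,400
```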

In many cases, continuous measurements are ratios. Ratios can represent several concepts:

  1. Velocity: Number of items purchased, or visits, per unit of time. This is also known as a rate. Someone who visited a supermarket 43 times in the past 180 days has a rate of 43/180 = 0.2389.
  2. Fractions: Fraction of visits to a web site where any link in the Trending box was clicked. As one can see, fractions are limited to a range of [0,1]. In some cases, they appear as percentages, but are easy to make out. Fractions are tricky to deal with, and we shall have more to say on that topic in a moment.
  3. Acceleration: If a user had velocity of 0.2389 in the first half of the year and 0.2944 in the second half, then the acceleration is (0.2944-0.2389)/(0.2389) = 0.2326. In other words, the rate at which the supermarket was visited grew by 23.26%. If the calculation is performed using observed values (i.e. 53 visits vs. 43 visits), precision is retained; however if such ratios are taken on what are effectively ratios themselves, the general variability is enhanced considerably. Acceleration is also a ratio.
  4. Ratio: a ratio is more general than a fraction or an acceleration. If one considers the ratio of a person's visits to a portal this month to their visits two months ago, several possibilities arise. Since the numerator and denominator are computed over disjoint periods (i.e. they have no overlap), either of them can be zero.

When the denominator is zero, a ratio is undefined. However, 0/0 is different from 3/0 in this case. The first case probably tells us about a very infrequent user, or perhaps that the timescale we are looking to model is much larger than the one we sample at. In the latter case, timescales are probably not as much of an issue; consideration needs to be given to the business and process one is analyzing. If it is a monthly subscription, the behavior is not at all surprising.
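A hedged sketch of building such features while keeping the zero cases apart (the column names and counts are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical visit counts over two disjoint 180-day periods.
df = pd.DataFrame({"visits_prev": [43, 0, 0, 12],
                   "visits_curr": [53, 3, 0, 0]})
days = 180

df["velocity_prev"] = df["visits_prev"] / days
df["velocity_curr"] = df["visits_curr"] / days

# Acceleration is undefined when the previous count is zero; masking the
# zero denominators to NaN keeps it undefined instead of producing infinities.
prev, curr = df["visits_prev"], df["visits_curr"]
df["acceleration"] = (curr - prev) / prev.where(prev > 0)

# Keep 0/0 and x/0 apart instead of collapsing them into one "missing" bucket.
df["ratio_case"] = np.select(
    [(prev == 0) & (curr == 0), (prev == 0) & (curr > 0)],
    ["0/0: dormant or undersampled", "x/0: new or reactivated"],
    default="well defined",
)
print(df)
```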

For ratios, plotting the logarithm is often revealing. One has to take care to add a small \(\varepsilon\) to the observed value before taking the logarithm, since \(\ln(0) = -\infty\). For fractions, the Beta distribution is often a great choice -- except where 0 and 1 dominate almost exclusively -- and in that case one does not need to model a distribution anyway. Generally for ratios, Gamma distribution is a good choice, particularly with the ratio of two positive numbers which often has a fat tail to the right. If the skewness is not large, logistic and log-logistic distributions and even lognormal distribution can model the ratios very well.
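As a closing sketch, here is one way to apply the log transform and fit a Gamma as a candidate model; the ratios are synthetic and the fit is illustrative rather than a recommendation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Synthetic ratio-like values with a fat right tail (stand-in for real ratios).
ratios = rng.gamma(shape=1.5, scale=2.0, size=5_000)

# Logarithms reveal the shape; add a small epsilon in case of exact zeros.
eps = 1e-6
log_ratios = np.log(ratios + eps)
print("log-scale mean/sd:", log_ratios.mean(), log_ratios.std())

# Fit a Gamma distribution (location fixed at 0) as one candidate model.
shape, loc, scale = stats.gamma.fit(ratios, floc=0)
print("fitted gamma shape/scale:", shape, scale)
```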