# Errors using inadequate data are much less than those using no data at all

Statistician George Box, who passed away in 2013 at the advanced age of 93, pioneered many statistical techniques bearing his name, yet he is as well known for one short, simple quote as for anything else. He famously wrote “all models are wrong, but some are useful”. By this Box did not mean that models, as in fashion models, are stupid but can be fun, although perhaps such a statement about fashion models could well be true!

For early-stage private equity investments, where we evaluate variables to estimate market success, Box’s quote goes to the heart of statistics: any time we try to find the best predictors of x we will have error, and any time we try to predict x from a, b, c, and d we will have error. In private equity, both the number of variables included in a model and the choice of those variables matter, yet statistics can still give us a model whose predictions are accurate enough to be invaluable, and certainly more accurate than predictions based merely on intuition.

A defense of reductionism demonstrates a surprisingly high correlation between a true “unknown” model and a simplified model. Consider 600 instances, put into the context of drugs being tested by the FDA and the likely ROI of investing in 30 different drugs at the R&D stage, where companies must decide which drugs’ R&D is a worthy investment versus others that may bring little or no return on R&D investment dollars. The true model consists of an additive relationship between ten variables (FDA staff, longevity of FDA staff, previous drug class history, etc.). We assume that the correlation between each pair of these variables is zero. SPSS was used to assign a pseudo-random T score to each variable for each instance. (T scores have a mean of 50 and a standard deviation of 10. They are just transformed z scores, very useful as we often do not want the mean of a scale to be zero.) The true score is simply the sum of all these scores for each instance. The observed model uses only 3 or 5 of the variables. If the true model is seen as the criterion and the observed model as the test, then their correlation represents a validity coefficient.
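The simulation described above can be sketched in a few lines. This is only an illustration, not the original SPSS procedure: it uses Python's standard library, draws each T score from a normal distribution, and the function names (`pearson`, `validity_coefficient`) and the seed are assumptions made for the example.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def validity_coefficient(n_used, n_total=10, n_instances=600, seed=0):
    """Correlate the full additive model with one using only n_used variables."""
    rng = random.Random(seed)  # arbitrary seed, chosen only for repeatability
    true_scores, observed_scores = [], []
    for _ in range(n_instances):
        # Independent T scores: mean 50, standard deviation 10.
        scores = [rng.gauss(50, 10) for _ in range(n_total)]
        true_scores.append(sum(scores))               # "true" model: all variables
        observed_scores.append(sum(scores[:n_used]))  # observed model: a subset
    return pearson(true_scores, observed_scores)


if __name__ == "__main__":
    for k in (3, 5):
        print(k, "variables:", round(validity_coefficient(k), 3))
```

Under exactly these assumptions (independent, equally weighted variables with equal variance) the population correlation between the partial and full sums is √(k/10), about .55 for three variables and .71 for five, so a rerun of this sketch will scatter around those values and need not reproduce the SPSS figures reported in this article.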

The model will be used to make a dichotomous decision; let us say we define success as a drug getting FDA approval. Let us also say that 20% of drugs reaching a certain stage of testing obtain FDA approval. We can now see how much information the observed model gives us, expressed as the percentage of correct decisions made using it compared to the base rate (making decisions by guessing).

The correlations between the true and observed models were almost .8 using five variables and almost .7 using only three variables. This means the following: if 30 drugs were selected for R&D investment out of the 600 being tested, using the model with three variables, we would expect a success rate of .8 on our choices, compared to the base rate of success, which is again .2. In other words, using the model, the expected results on our investments would be 24 hits and 6 misses, while the expected results of not using any model would be 6 hits and 24 misses! That is quite a difference. If each hit meant earning a million dollars and each miss losing a million dollars, it is the difference between earning 18 million dollars and losing 18 million dollars! And notice again how “poor” our model is: we managed to measure only 30% of the relevant variables!
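The step from a validity coefficient to a hit rate, and from a hit rate to dollars, can be checked with a small Monte Carlo sketch. It assumes standardized, normally distributed true and observed scores with a given correlation, defines success as landing in the top 20% of true scores, and picks the 30 highest observed scores out of 600. The function names, trial count, and seed are illustrative choices, not part of the original analysis.

```python
import math
import random


def simulated_hit_rate(validity, base_rate=0.20, n_candidates=600,
                       n_selected=30, n_trials=300, seed=0):
    """Estimate the share of successes among the top-scoring candidates."""
    rng = random.Random(seed)  # arbitrary seed, chosen only for repeatability
    n_success = round(base_rate * n_candidates)
    noise_sd = math.sqrt(1.0 - validity ** 2)
    hits = 0
    for _ in range(n_trials):
        pairs = []
        for _ in range(n_candidates):
            true = rng.gauss(0, 1)
            # Observed score correlates with the true score at `validity`.
            observed = validity * true + noise_sd * rng.gauss(0, 1)
            pairs.append((true, observed))
        # Success: being among the top `base_rate` share of true scores.
        cutoff = sorted(t for t, _ in pairs)[-n_success]
        # Decision: invest in the candidates with the highest observed scores.
        chosen = sorted(pairs, key=lambda p: p[1], reverse=True)[:n_selected]
        hits += sum(1 for t, _ in chosen if t >= cutoff)
    return hits / (n_trials * n_selected)


def expected_net_payoff(hit_rate, n_selected=30, gain=1.0, loss=1.0):
    """Expected net result (in millions) when each hit earns `gain`
    and each miss costs `loss`."""
    hits = hit_rate * n_selected
    misses = n_selected - hits
    return hits * gain - misses * loss


if __name__ == "__main__":
    print("hit rate with model:", round(simulated_hit_rate(0.7), 2))
    print("hit rate by guessing:", round(simulated_hit_rate(0.0), 2))
    print("net payoff at .8:", round(expected_net_payoff(0.8), 2))
    print("net payoff at .2:", round(expected_net_payoff(0.2), 2))
```

With a validity of .7 the simulated hit rate comes out at roughly .8, and with a validity of 0 it falls back to the base rate of about .2. The payoff function turns those two rates into the 24 hits and 6 misses, or 6 hits and 24 misses, of the example, that is, plus or minus 18 million dollars.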

Our title is another quote. Charles Babbage (1791–1871) was an English mathematician and mechanical engineer, credited with inventing the idea of a programmable computer. Though he died before the birth of modern statistics, he wrote “Errors using inadequate data are much less than those using no data at all.” He was right.

*Written by Alexander Nussbaum PhD, Statistical Consultant to Analytic Medtek Consultants and Professor at St. John’s University, with edited notes by Ken Peters PhD, Principal, Analytic Medtek Consultants and Professor of Economics, Baruch College, CUNY*