# The Joy of Sets, Huge Sets

The technological ability to deal with huge data sets in novel ways is revolutionizing early-stage private equity investment decisions. Flying by the seat of the pants will fail spectacularly in competition with data-intensive, systematic computer processing.

Are too many decisions predicated on history or gut feelings? A simple demonstration is a look back at the classic Monty Hall problem. This seeming paradox, which some of you may have heard of, was based on the game show “Let’s Make a Deal”. Contestants were given the choice of three doors. Behind one door was a great prize, a new car for example; behind the others were gag gifts, say a rusted bicycle and an old wheelbarrow. The contestant picks one door. Note that regardless of which door they pick, at least one of the remaining doors hides a gag gift. Monty Hall, who knows what’s behind the doors, opens another door, which has junk, leaving two unopened doors: the one the contestant picked and one other. He then asks the contestant, “Do you want to stick with your original pick or switch doors?” Does switching make a difference to the probability of winning?

The problem became famous when it was posed to Marilyn vos Savant, who, on the strength of an extremely high IQ, wrote a magazine column answering readers’ queries. Her response was that the contestant should switch to the other door. She was in fact correct. But vos Savant received an avalanche of letters, over 10 thousand, accusing her of being wrong! Included among those who condemned her were academics, mathematicians, and PhD statisticians! But it was this plethora of statisticians, mathematicians, and researchers, as well as ordinary Joes, who were wrong.

Cognitive psychologist Massimo Piattelli-Palmarini wrote that “no other statistical puzzle comes so close to fooling all the people all the time” and “that even Nobel physicists systematically give the wrong answer, and that they insist on it, and they are ready to berate in print those who propose the right answer”. Nobel physicists? I was in good company, then, in holding to the wrong answer!

Despite teaching statistics since January 1987, I too still thought that changing doors would not improve the chances of winning. At a lecture on probability in June 2015, the Monty Hall problem was brought up, with the remark that about 90% of people get it wrong. But it was not really explained why changing your pick was the right answer. Riding home, I thought the problem over and again came to the conclusion that changing doors would not improve the odds of winning. Suddenly a light bulb lit in my head! It was simple: of course changing doors would double the chances of winning! There are three doors, so 1 in 3 people initially choose the right door. If no one switches when given the opportunity, 1 in 3 get the car. If everybody switches, the 1/3 who would have won by sticking to their initial pick now lose, but the 2/3 who would have lost now win, because Monty has already opened the other junk door and the only door left to switch to hides the car. Overall, by switching doors the chances of winning go from 1 in 3 to 2 in 3!
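For readers who remain unconvinced, the counting argument can be checked by brute force. The following is a minimal simulation sketch, not anything from the original column: the function names `play_round` and `win_rate`, the number of trials, and the fixed seed are all illustrative choices. It assumes Monty always opens a junk door other than the contestant's pick, choosing at random when he has two options.

```python
import random


def play_round(switch, rng):
    """Play one game; return True if the contestant ends up with the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)    # door hiding the car
    pick = rng.choice(doors)   # contestant's first pick
    # Monty opens a junk door: never the contestant's pick, never the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Move to the one door that is neither the first pick nor the opened one.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car


def win_rate(switch, trials=100_000, seed=0):
    """Fraction of games won when the contestant always stays or always switches."""
    rng = random.Random(seed)  # fixed seed so a run is repeatable
    wins = sum(play_round(switch, rng) for _ in range(trials))
    return wins / trials


if __name__ == "__main__":
    print(f"stay:   {win_rate(switch=False):.3f}")
    print(f"switch: {win_rate(switch=True):.3f}")
```

With a large number of trials, the "stay" rate should land near 1/3 and the "switch" rate near 2/3, matching the reasoning above.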

If you want to predict a dichotomous variable (one that has two levels, like success or failure) from a number of continuous variables, the traditional and still widely used approach is discriminant analysis. Discriminant analysis assumes that each independent variable is normally distributed within each level of the grouping variable; logistic regression makes no such assumption. Logistic regression also appears, in general, to be slightly more accurate in making predictions. However, it is more processing intensive, which was a much bigger issue before ultrafast computers with large working memories. Yet plain linear multiple regression, which is meant for a continuous dependent variable, also works well as long as neither probability (of success or failure) is less than 20%. Its advantage was simply that multiple regression was available in computer packages that did not have the other procedures, and it was familiar to more people. The purpose in going through all this is that, while a professor of advanced statistical methods is going to want a maximum of bells and whistles, a relatively less sophisticated approach will still produce results superior to guessing. The only hurdle is having the data to work with.
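As a rough illustration of the "less sophisticated approach", here is a sketch of plain least-squares regression applied to a 0/1 outcome, with a fitted value of 0.5 or more read as a predicted success. Everything in it is an assumption made for illustration: the data are made-up toy values, not from any real study, the single predictor stands in for the several continuous variables discussed above, and the names `fit_line` and `predict_success` and the 0.5 cutoff are hypothetical choices.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict_success(intercept, slope, x, cutoff=0.5):
    """Call it a success when the fitted value reaches the cutoff."""
    return intercept + slope * x >= cutoff


# Made-up toy data (NOT from the article): a predictor score and a 0/1 outcome.
scores = [1, 2, 3, 4, 5, 6, 7, 8]
outcomes = [0, 0, 0, 1, 0, 1, 1, 1]  # 1 = success, 0 = failure

intercept, slope = fit_line(scores, outcomes)

# Share of the toy cases whose predicted class matches the recorded outcome.
hits = sum(
    predict_success(intercept, slope, x) == bool(y)
    for x, y in zip(scores, outcomes)
)
accuracy = hits / len(scores)
```

One caveat of this shortcut is that the fitted values are not true probabilities: at the extremes of the predictor they can fall below 0 or above 1, which is one reason logistic regression is usually preferred when it is available.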

Again, the statistical (analytic) techniques are incredibly complicated, but the good news is that they are readily available, without laborious hand calculation, thanks to modern statistical software programs such as SPSS, which even undergraduate students use. A simple case illustrates how statistics can make the complex understandable. Suppose we use SPSS on information for 50 college students that includes gender, GPA, hours of study, surveys on how they like various aspects of their college, etc. With this data we can do a discriminant analysis, where the idea is: if we had all this data on other students, excluding gender, could we predict their gender? The goals were twofold: come up with a model that had high predictive power, and come up with a model that was parsimonious. The first goal is always present in discriminant analysis; parsimony is optional.

A model using only two predictor variables, height and GPA, worked extremely well. Because on average males were taller than females, and because on average females had higher GPAs than males, those two variables alone allowed for a discriminant function that categorized 92% of cases correctly as to gender! This was under the “cross-validation” criterion, where each case in turn is removed from the analysis and categorized on the basis of a discriminant function derived from all the other cases. The resulting percentage, here 92%, estimates how many new cases the model is expected to categorize correctly as to gender if all we have is student height and GPA. We come up with a simple formula for the discriminant:

**Discriminant = 0.390 × height − 1.325 × GPA − 22.039**

For any new case, just plug in the numbers: a positive discriminant means male is predicted, and a negative discriminant means female is predicted. The point is that, with the right data, almost anything can be predicted: will a stock tank, will a draft pick make it, will a movie make a profit, will a drug make it to market? SPSS (or another statistical package, of course) will happily churn out the analysis; it just needs to be fed “nutritious” data.
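Plugging numbers into the discriminant formula can be sketched in a few lines. The coefficients are the ones quoted above; everything else is an assumption for illustration. In particular, the text does not state the unit of height (inches are assumed here, since that puts typical heights near the zero crossing), the function names are hypothetical, and the two example students are invented.

```python
def discriminant(height, gpa):
    """Discriminant score from the formula quoted in the text.

    Assumption: height is in inches (the text does not state the unit).
    """
    return 0.390 * height - 1.325 * gpa - 22.039


def predict_gender(height, gpa):
    """Positive score -> male, negative score -> female, as the text describes."""
    score = discriminant(height, gpa)
    if score > 0:
        return "male"
    if score < 0:
        return "female"
    return "undetermined"  # a score of exactly zero is not covered by the text


# Hypothetical students, invented for illustration only.
print(predict_gender(height=70, gpa=3.0))  # male   (score is about 1.286)
print(predict_gender(height=63, gpa=3.6))  # female (score is about -2.239)
```

The signs of the coefficients mirror the reasoning in the text: greater height pushes the score toward "male", while a higher GPA pushes it toward "female".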

The work would be exactly the same if we had data on past and upcoming clinical trials of many medical devices under FDA review, together with attributes of the FDA employees considering them, such as gender, age, time on the job, and position rank.

Richard Thaler, a pioneer of behavioral economics, emphasized the difference between traditional economic predictions and the emotion-driven decisions actually made by humans, and that using data offers consistently greater success than personal judgment, even for those who have spent a lifetime relying on their own experience. Relying on such personal experience runs counter to the famous words usually attributed to Heraclitus, “everything changes and nothing stays the same”. Using data analytics in the healthcare field is the best method of validation in an ever-changing and ambiguous world of facts and details.