Apr 17, 2017

Machine Learning For Beginners

So you are hearing a lot about machine learning these days.

You are hearing words like models, training, forks, splits, branches, leaves, recursion, test data, and overfitting, and you don’t know what any of them mean.

Well, I have some help, courtesy of my colleague Jacqueline, who shared this scrolling lesson in machine learning with her colleagues at USV (me included) this weekend.

This scrolling machine learning lesson was made by Stephanie and Tony. It is great work. Thanks!

#machine learning

Comments (Archived):

awaldstein Apr 17, 2017

so well done.
lpress Apr 17, 2017

Wow, that takes me back to my 1967 dissertation: IDEA, interactive data exploration and analysis (http://som.csudh.edu/fac/lp…). IDEA grew decision trees using different statistical techniques depending upon the level of measurement of the variables at each node. It could grow the entire tree automatically or pause at each node, present the user with a statistical summary of the best potential splits and let him or her decide the split. (The interactive mode was inspired by J.C.R. Licklider’s “Man-Computer Symbiosis” paper.) One question: what tool was used to make the scrolling presentation?
1. Twain Twain Apr 17, 2017
  
  Is there another link to your dissertation? That one doesn’t seem to work. Thanks!
  1. PhillipT Apr 17, 2017
    
    There is an extra parenthesis on the URL. The URL should be: http://som.csudh.edu/fac/lp…
    1. Twain Twain Apr 18, 2017
      
      Thanks, very interesting read!
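
A minimal sketch of the step lpress describes above, scoring the candidate splits at one node and picking the best one. It is not a reconstruction of IDEA: it assumes Gini impurity as the score (the CART criterion, swapped in here for the statistical tests IDEA chose by level of measurement), and the elevation and city values are made-up toy data.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the node is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_split(values, labels):
    """Return (threshold, weighted_impurity) for the best 'value <= threshold' split.

    If no split improves on the unsplit node, the threshold is None.
    """
    n = len(values)
    best = (None, gini(labels))
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (t, score)
    return best


# Hypothetical toy data: elevation in meters and the city each home is in.
elevation = [2, 5, 8, 10, 40, 55, 73, 90]
city = ["NY", "NY", "NY", "NY", "SF", "SF", "SF", "SF"]

threshold, impurity = best_split(elevation, city)
print(threshold, impurity)
```

An interactive mode like the one described would show the user the top few scored thresholds instead of silently taking the best one.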
William Mougayar Apr 17, 2017

Nice. I will need to do one for blockchain using that toolset. ML, AI and blockchain are fairly technical fields. End-users only care about their application, but a bit of technical understanding doesn’t hurt.
1. Ryan Etheridge Apr 17, 2017
  
  When you have to justify your decisions to stakeholders, you’ve got to have a basic understanding of the technical aspects of your decision-making model. This won’t make everyone experts in machine learning but hopefully it will help people manage the micropolitical aspects of leadership. I’d love to see you model blockchain!
2. Bharath Apr 17, 2017
  
  That would be tremendously helpful!
3. Girish Mehta Apr 17, 2017
  
  I have wondered for a while if something like a ‘bit of technical understanding’ applies (or should apply) to AI. If not neck deep in the coding and technology, can one have a deep and nuanced understanding of the current state of its development? To an extent I think blockchain is different: as an end-user, you can understand concepts like the Byzantine Generals’ Problem, distributed trust, mining. And move on to what matters to you as an end-user. But I don’t know about AI.
  1. Twain Twain Apr 17, 2017
    
    To really understand AI, a maths degree is super helpful. Sure, a number of frameworks and mathematical libraries (such as TensorFlow) have been put out to make it easier for developers to do plug+play AI. CS students do learn some foundational maths, e.g. statistics for A/B testing. However, that’s nowhere close to the rigors of theorem proving from first principles that have to be done in a maths degree. For end-users, do they care as long as the AI recognizes their photos and can tag them, even if 20% of those tags may be amiss? Where the hard problem of AI rears its head is in Natural Language Understanding (NLU). The maths for that still has to be invented because language is so subjective, cultural and contextual (time-space issues of: what the user’s looking for, where they’re looking, how they’re looking etc.). In the case of Blockchain, the maths there also isn’t hard. It’s Merkle trees and, at the most, quadratic voting mechanisms.
4. avip Apr 17, 2017
  
  This one is the best I’ve seen so far: https://anders.com/blockchain/
jason wright Apr 17, 2017

“Keep scrolling” – instruction or plea?
1. falicon Apr 17, 2017
  
  +1 … and click some ads on your way down! 😉
Ryan Etheridge Apr 17, 2017

So many questions: 1. How was this visualization created? Like nuts and bolts, where did the data set come from? What was used to manipulate the data? How was the website designed? 2. I’m in K12 Ed: is there still a way for me to learn how to do this? Machine learning stands to have a dramatic impact on K12 ed (where we struggle to convince people a nuanced approach to success is better than a single measure like graduation rate). 3. In a field where you are accountable to the broad public, how do you go about convincing people that machine learning is superior to intuition, A/B testing, incremental change, committees, and any of the other decision-making methods?
1. falicon Apr 17, 2017
  
  in regards to your #3 – I don’t think you should think about it in terms of “superior”… it’s another tool, or rather in most of the things you mention, it’s the application of those things in combination. What I mean is that, for machine learning to actually work, picking the ‘rules’ and the method involves some “intuition”… building your training data often includes some “A/B testing” and getting things right absolutely includes “incremental change”… and of course all of those things generally include some form of committees and other decision making methods… so it’s not an argument of “vs.”, it’s a discussion/education about “where the puck is heading” and how you skate towards that spot…
  1. Ryan Etheridge Apr 18, 2017
    
    That’s a good point Falicon. I can see a lot of my bias reading back through my comment. I am ready for ML to be actively used as a tool in our decision making process.
Pete Griffiths Apr 17, 2017

I have a nasty feeling that ML is another one of those topics where a little knowledge is a dangerous thing. It is one thing to have an appreciation of a topic but quite another to apply sophisticated tools, that have become relatively easy to apply due to powerful libraries, without the mathematical background to really understand their proper application or how to qualify the results.
1. Twain Twain Apr 17, 2017
  
  Funny you should say that, Pete… https://www.technologyrevie… https://uploads.disquscdn.c… http://www.newsweek.com/art… https://uploads.disquscdn.c…
  1. Pete Griffiths Apr 17, 2017
    
    This is at the heart of the problem of what constitutes explanation. I have for some time argued that for many years scientific explanation relied on metaphor/analogy (the atom is like an orange surrounded by ping pong balls) but as we probe deeper and deeper into the nature of reality, we now have models for which there is no everyday metaphor or analogy. And without such metaphors or analogies how do we explain things? Explanations that refer to models with coefficients that have no evident ‘natural’ meaning are problematic. This kind of problem probably started at the quantum scale, reared its head also in astronomy but is now appearing in area after area as models are being validated but we can’t understand them.
    1. Twain Twain Apr 18, 2017
      
      Welcome to my part of the Universe. https://uploads.disquscdn.c…
    2. Twain Twain Apr 18, 2017
      
      In terms of enabling explicability at the quantum scale, two things:

      (1.) The classical approach inherited from Leibniz circa 1655 is that at an atomic level, 0 = nothing exists, 1 = something exists. Along come Pascal and Bayes who say, “Well, the likelihood of that thing existing or not existing is some probability %.” Now, does 0, 1 or % likelihood or co-occurrence actually explain WHY it exists or how it emerged into existence? Nope.

      (2.) Quantum Physicists have to decide if atoms are purely objective and mechanistic OR if there’s some form of subjectivity entangled with objectivity.

      => This gets us into the territory of Einstein’s “God doesn’t play dice (probability) with the Universe.”

      => Max Tegmark of MIT’s proposal for the existence of “Perceptronium, the most general substance that feels subjectively self-aware.” He hasn’t (yet) provided a cogent proof or how to build a system to test/prove this hypothesis. Maybe the LHC can do it, maybe it can’t.

      In 2010, I read this from a Quantum Physicist at Oxford University and it infuriated me because it speaks to how imprecisely and interchangeably Quantum Physicists like him treat information as probabilistic (and Einstein would turn in his grave at this generation of quantum physicists).

      “At first sight, all types of information look very different from one another. For example, contrast thermodynamics – how chaotic a system is – with the information in your genome. You’d say: what on earth is the relationship between these two types of information? One looks much more orderly, the living system, while the other is disorder. But it’s actually one and the same information… you actually need very little to define the concept of information in the first place. When you strip out all the unnecessary baggage, at the core is the concept of probability. You need randomness, some uncertainty that something will happen, to let you describe what you want to describe. Once you have a probability that something might happen, then you can define information. And it’s the same information in physics, in thermodynamics, in economics.”

      IT’S NOT THE SAME INFORMATION AT ALL!!! In economics, subjectivity is happening every time we make a consumption choice based on information. In physics and thermodynamics, until Tegmark and co can build the machines to test and prove for subjectivity at a sub-atomic level, we only have OBJECTIVE energy, mechanics and gravitational fields to work with.
      1. Pete Griffiths Apr 18, 2017
        
        I know you’re passionate about this. I just can’t find a way into it.
      2. Twain Twain Apr 18, 2017
        
        Philosophers (and Liberal Arts, generally) think of and define consciousness and reality differently from scientists (in this case, physicists). The first lot are concerned with INTERNAL REFLECTIONS on the mind, the soul, feelings, aesthetics, morals, ethics, social interactions and qualification of reasoning through language and artistic expression. The second lot are concerned with EXTERNAL OBSERVATIONS of information, pattern recognition using probability and statistical methods, experiments to prove theorems, repeatability of results and quantification of results via numbers and process logic. Therein are some fundamental gaps between why+how philosophers (Liberal Artists) and scientists think differently in terms of explicability.
      3. Pete Griffiths Apr 18, 2017
        
        I’m not sure that difference is as clear as you suggest once a philosopher or social scientist thinks with any rigor. Social science and philosophy are no longer educated poetry.
    3. sigmaalgebra Apr 18, 2017
      
      Those thingies with the coefficients are usually weighted sums and/or orthogonal projections onto a hyper-plane. In a nice case, the components summed are orthogonal so that then can identify the contribution of each term in the sum. When the data is not orthogonal, can retreat to factor analysis where do have orthogonality (the factors) and get some information, loadings, that try to make the factors more meaningful. In a nice problem, the factors are relatively stable so that, then, you do have some decent explanations. E.g., some people have long believed that, for people, IQ was such a stable factor. This is all old, standard, although not always clearly taught, multi-variate linear statistics, e.g., regression analysis.
      1. Pete Griffiths Apr 18, 2017
        
        True. But my point is that whilst it isn’t too hard to get a sense of the relative importance of factors and cofactors, a coefficient of 73.0987498 isn’t exactly intuitive, is it?
      2. sigmaalgebra Apr 18, 2017
        
        Nine solutions:

        (1) Make it simpler; just round off the number to just 73. :-)! In the ML world, we are eager to drink really simple shit Kool-Aid!

        (2) Notice that another coefficient is 91 and all the rest are less than 5 in absolute value. So, really have just two important variables. Confirm this stuff with statistical hypothesis t-tests on the coefficients, i.e., with null hypothesis that the coefficients are 0. Maybe for the small coefficients, can’t reject that they are 0. So, keep it simple, first-cut, call’em zero.

        (4) Do some work to check stability of the calculations, e.g., looking at the ratio of the largest and smallest eigenvalues (condition number) of the variance/covariance matrix from the normal equations, maybe some resampling (B. Efron, P. Diaconis, etc.), maybe some Monte Carlo perturbations. Look at the confidence intervals around the predicted values; if they are really large, then apply the sniff test for something rotten.

        A point is, two variables can be nearly orthogonal to what are trying to predict and, thus, one at a time essentially useless as predictors but jointly predict nearly perfectly; this situation stands to be both rare and unstable so it should be easy enough to detect it and rule it out.

        E.g., think about that situation a little more, e.g., in the geometric context that are projecting onto a subspace spanned by those two variables.

        (5) In case there seems to be some stability, for those two variables look one by one into what might be behind them. E.g., in the social sciences, look for the usual suspects, age, education, income. And look for what might be driving those two variables in the particular case.

        (6) Get with some people, unlike me, who are deep into linear, multi-variate statistics. 
E.g., some people in econometrics are so convinced that such statistics should work well that they are 1000 feet deep into the goo of peripheral details and may be able to say something useful (small chance).

        (7) Of course, you took the original data and partitioned it into two buckets and used only the first bucket to do the fitting. Then check the fit with the data in the second bucket. The applied statistics people have been doing this for a long time, and when the ML people do this they are not fully wrong. There are some relatively recent theorems that can help here.

        (8) Call it machine learning and/or AI and, thus, the supremacy of the robots, “giant electronic human brains,” now far beyond any hope of human understanding. Get headlines, speaking invitations, book deals, a busy Web site with lots of ad revenue, grants for an AI institute in your name, a tenure track slot in a CS department (easier with a grant), lots of eager, devoted, adoring coeds (each with a mother and two grandmothers who want to see some BABIES) eager to work as administrative assistants in your institute and ready to do “just anything” to advance social science, etc., i.e., real applied math! Right, they are just drop dead gorgeous, with perfect hair and nails, wear blue jeans except they are skirts and only about five inches long: not subtle! Ah, what I passed up when I was a B-school prof!

        Yes, since I was the Chair of the college computing committee, the de facto Chair two weeks after arriving on campus, a woman in the administration asked me to come to her office to discuss a computer application, faced me, put her feet up on her desk, and informed me that she and her husband had an “open marriage”. Gee, surprising what subjects are related to computer applications! Maybe now that is the situation for AI/ML? 
Maybe, finally, THAT explains the interest?

        Or, while the usual explanatory advice is “follow the money”, maybe also “follow the short skirts”!

        Gee, it’s not all just theorems and proofs! Right, theorems and proofs can lead to some applied math, code, ad targeting, revenue, a high end Corvette, and short skirts?????? I may not be the first to think of such things. A Corvette: an eight piston, super-charged, coed-catching, sex machine! Real applied statistics!

        But, if what really want from the data analysis is just better ad targeting, e.g., that Mary Meeker at KPCB can measure, and, thus, more revenue (for a Corvette or whatever), then stay with the best statistics you can apply and leave ML to the California dream’n, fakers, hypsters, hipsters, Stanford Computer Science department, and fad followers.

        (9) There’s a paradigm, meta-theorem few people really appreciate and those that do are embarrassed to explain since explanations would make clear assumptions they have been making that they have never mentioned or justified.

        The theorem is that nearly everything is smooth enough to be differentiable. Then, for such a function of several real variables, can take partial derivatives, and these determine a tangent plane that is the best local linear approximation to the function. Well, what you get in linear multi-variate statistics is a best estimate of that tangent plane, that best local linear approximation.

        This meta-theorem stopped me from much of mathematical physics: I kept seeing physics people assuming all the math was nice without comment or justification when I knew enough math to know that some bizarre, pathological special cases did exist. Eventually I understood the meta-theorem. E.g., if the function of several variables is continuous, then with meager assumptions can approximate it as closely as please with a differentiable function. So, just say that physics is working with the differentiable approximations. 
Of course, the physics profs rarely understand this point and still more rarely explain it. Another point, related, is, really, in much of physics, what they want are not functions but distributions.

        The physics people are so eager to smooth over this point that commonly they don’t even understand the basics: E.g., at the MIT open course lectures, there is one on quantum mechanics that says that the wave functions are differentiable and also continuous. Of COURSE they are continuous you doofus: every differentiable function is continuous. The physics prof also was not clear on the difference between independent and uncorrelated. Right, in the multi-variate Gaussian case they are the same, but otherwise usually not.

        So, in a sufficiently smooth universe, your linear multi-variate statistics is making a best locally linear approximation, finding a tangent plane from partial derivatives. Right, the coefficients are the partial derivatives.

        Let continuity, differentiability, and linearity be your powerful, good friends!
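
Point (7) in the list above, fitting on one bucket and checking on the other, can be sketched in a few lines. This is only an illustration under stated assumptions: the data is synthetic (a made-up line y = 2 + 3x plus Gaussian noise), there is a single predictor instead of several, and the names `fit_line` and `r_squared` are hypothetical helpers written for this example.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return a, b


def r_squared(xs, ys, a, b):
    """Fraction of the variance in ys explained by the line a + b*x."""
    my = sum(ys) / len(ys)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


rng = random.Random(0)  # fixed seed so the run is repeatable
# Synthetic data (an assumption): true line y = 2 + 3x plus Gaussian noise.
data = [(i / 10, 2 + 3 * (i / 10) + rng.gauss(0, 0.5)) for i in range(200)]
rng.shuffle(data)

train, holdout = data[:150], data[150:]   # the two "buckets"
a, b = fit_line(*zip(*train))             # fit on the first bucket only
score = r_squared(*zip(*holdout), a, b)   # check the fit on the second
print(f"intercept={a:.2f} slope={b:.2f} holdout R^2={score:.3f}")
```

The fitted slope `b` is the estimate of the partial derivative that point (9) talks about; a holdout score much worse than the training fit would be the "something rotten" signal from point (4).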
pointsnfigures Apr 17, 2017

The more people know how machine learning and AI work, the less intimidated they will be by it and the more they can wrap their head around how they can fit in. An aside, in the model they used I would have just started by sifting by zip code : )
1. Twain Twain Apr 17, 2017
  
  Zip code would be an example of “zero shot” learning. It would have captured pretty much 100% of properties in SF as different from NY straight away. There’s no scrolling visual lesson in that!
Tony Chu Apr 17, 2017

Thanks Fred (and Jacqueline) for sharing our site! To answer a couple of the questions in the comments:

“What tools were used to make this?” Mainly D3.js, plus a lot of custom JavaScript.

“Wouldn’t it be easier to use zip code/lat/lng as a feature?” Yes, but then it wouldn’t be as illustrative. Our goal here is to explain the process, not build the best model.

“How do you go about convincing your team/stakeholders that ML is a better solution?” In my experience you need to address both the process (i.e., how a model is created, like r2d3 does here) and the results (showing the accuracy metrics like AUC and highlighting outliers).

Ok, will answer the next batch of questions in a little bit.
1. Twain Twain Apr 17, 2017
  
  Nice work and this is my favorite visual: https://uploads.disquscdn.c… For problems like this, where the data set is super-objective (elevation of the building, year built, $ per sqft), Machine Learning and tree-structures come into their own. Wondering if your team has tackled problems where the data sets are more subjective and ambiguous, e.g. to do with human emotions and/or languages. Thanks!
2. fredwilson Apr 18, 2017
  
  thanks for stopping by and engaging in the conversation Tony!
3. ShanaC Apr 18, 2017
  
  “‘How do you go about convincing your team/stakeholders that ML is a better solution’: in my experience you need to address both the process (i.e., how a model is created, like r2d3 does here) and the results (showing the accuracy metrics like AUC and highlighting outliers).” Just as big an issue: why x model or y method?
4. cavepainting Apr 18, 2017
  
  hi, brilliant work. One thing that confused me a bit was that the graphs use feet and price per square foot while the commentary is all in meters and price per square meter. Are you planning to open-source the code or make it possible for people to use it to write pictorial educational stories like these? Maybe there is a business underlying this somewhere?
Diego Ventura Apr 17, 2017

This is an excellent introduction! I would recommend reading https://blog.monkeylearn.co…
creative group Apr 17, 2017

FRED: We appreciate the inclusion. Invaluable in our view. Much more of this and much less of using this premium space for hyphenation. (Your space, your rules!)
sigmaalgebra Apr 17, 2017

Apparently to a large extent current work in machine learning is a slightly different take on parts of the classic field of statistics going back about 100 years to, say, K. Pearson and farther to C. Gauss and more. So, machine learning is nearly all old wine with a lot of added hype and diluted and adulterated in new bottles with new labels, terminology, etc. But, there is much more to statistics than the parts polluted and re-bottled by machine learning. Also statistics is part of applied math, and there is much more to applied math than statistics.

Machine learning is a lot like some auto mechanic’s first efforts at home made wine instead of the glories of the best of France and Germany for 200+ years. A fairly accurate description of machine learning is the common “That work has a lot that is good and new, however, the good is not new and the new, not good.”

Here are some of the reasons for much of the current interest in machine learning instead of just classic statistics:

(1) In the economy, computing is important. It is easier to see the importance of computing than that of statistics (so far).

(2) Statistics nearly always needs computing to do the arithmetic and data handling. The academic experts in computing are supposed to be from departments of computer science. So, such departments have gotten interested in applications of statistics.

Alas, the people in the computer science departments rarely have good backgrounds in the important pure/applied math and statistics. So, machine learning does a lot of floundering around, being crude where polished work was long since done in statistics, lacking good knowledge of math prerequisites, doing too much low quality research, etc.

Too much of machine learning is essentially academic theft followed by adulteration of the content.

(3) To raise interest in their work, computer science machine learning uses the importance of computing and applications of statistics to hype their work. 
The hype is incredibly strong and overwhelms the old, relatively small, advanced, well developed, polished, at times deep, niche world of mathematical and applied statistics.

The hype is irresponsible and harmful, possibly even dangerous. E.g., the learning is nothing like human learning; apparently the theme of the hype is to anthropomorphize some computing as in some old IBM marketing hype that their computers were “giant electronic human brains”. They were no such thing. And the learning in machine learning is nothing like even kitty cat learning and a drastic change in any prior meaning of learning. We’re talking language misuse and corruption here.

(4) Much of any real justification for anything new from the interest in machine learning is from the ability of computing to process quantities of data far larger than what was possible in the past or what the field of statistics commonly considered. E.g., some of the machine learning processing makes use of the astounding floating point performance of graphics processing units (GPUs) from NVIDIA.

(5) The interest in machine learning has brought increased interest in some issues in mathematical statistics, and there has been some good, new work in the associated statistics, apparently, however, by people with good backgrounds in the pure/applied math and mathematical statistics prerequisites and not from computer science departments. But, good, new work in mathematical statistics is welcome.

E.g., there is more interest in fitting sigmoid functions to data.

E.g., Stanford has and long had some of the best people for work in statistics and associated applied math, e.g., Kalman filtering, but not from the computer science departments. Similarly at MIT, Cornell, CMU, Chicago, Johns Hopkins, U. Washington, UC Davis, UNC, etc.

The lesson shared by Ms. Garavente starts with what used to be called common sense on the way to what used to be called descriptive statistics which got improved by J. 
Tukey, long at Princeton and Bell Labs, into exploratory data analysis.

For software to aid the grunt work of descriptive statistics can use SPSS (statistical package for the social sciences) now heavily marketed by IBM, SAS (statistical analysis system), some computer-based spreadsheet software, e.g., Microsoft’s Excel, Mathematica, SQL (structured query language for relational database when the data is in a relational database) and other relational database software, system R, some of the packages for Python. For more one can write some code in any of the popular general purpose programming languages, maybe calling the old IBM Scientific Subroutine Package (SSP), using some open source software, or at times maybe use just a good text editor that can sort data and apply some macros that are simple to write.

E.g., with just a little work with a text editor and some simple macros, the distribution of the 20 most common words in Fred’s post is

the 113
and 73
to 63
of 53
a 47
i 41
is 36
it 29
on 29
in 28
this 25
that 21
you 17
for 16
are 15
at 13
but 13
has 12
april 10
be 10

So, to get such a distribution, have a Web browser copy the blog post text to the computer system clipboard, pull that text into a text editor, e.g., KEdit, run a macro to replace all punctuation with blanks, run a macro to put one word per line, convert all the words to lower case, sort the lines with the words, delete lines with numbers, run a macro to append to each word the number of duplicates and, then, delete the duplicates, sort the result in descending order on the numbers, pull the top 20 lines into a file for this post. Done.

The Garavente lesson mentions dimension, and that can be important. 
The main pillar of dimension relevant here is not something from relativity theory in physics but just the concept of linear independence from linear algebra (and matrix theory) (the role in physics is closely related but with a much more elaborate context of differential geometry, tangent spaces, etc.).

The decision tree terminology is a stretch: what is described is, yes, a tree but a decision tree is something else. E.g., statistics, going way back, e.g., to some work in WWII in sequential testing of Abraham Wald, et al., e.g., as at

https://en.wikipedia.org/wi…

This work is part of statistical decision theory where we try to minimize costs with, yes, commonly a decision tree, and more generally is part of stochastic optimal control.

What is being described in the Garavente lesson would be better called a classification tree. Then likely should mention classification and regression trees (CART) as in

Leo Breiman, Jerome H. Friedman, Richard A. Olshen, Charles J. Stone, Classification and Regression Trees, ISBN 0-534-98054-6, Wadsworth & Brooks/Cole, Pacific Grove, California, 1984.

Right, CART is now old work. It does appear that much of the tree-based work in machine learning (random forests, boosted trees) is based on Breiman’s CART.

And, of course, there is old work in classification from discriminant analysis in classic multi-variate statistics long commonly taught and used in the social sciences.

As long as we are mentioning Breiman, we should mention what might be called the seminal explanation of the difference between classic statistics and, if you will, machine learning in his

Leo Breiman, “Statistical Modeling: The Two Cultures,” Statistical Science, Vol. 16, No. 3, 199–231, 2001.

available at

http://projecteuclid.org/DP…

The suggestion that machine learning is “automatic” is a bit of a stretch. Instead, in a new application, typically there is still a lot of care, knowledge, thinking, experimentation, etc. 
needed.

Apparently now one of the more important directions of predictions and statistics is ad targeting for ad supported Internet media content. Some of the reasons are in, say,

Shelly Palmer, “TV May Actually Die Soon – Stay Tuned,” ShellyPalmer, April 9, 2017.

at

http://www.shellypalmer.com…

and

https://news.ycombinator.co…

Just how to do such ad targeting has some techniques in use, likely heavily borrowed from statistics, and, then, a wide open field of new techniques that can be created from some original research in applied math. Here it helps, is crucial, to have the applied math work with well-done theorems and proofs so that know quite early on a lot about the power and value of the work.

For what Palmer describes, there’s a lot of money, e.g., a good chunk of the value of all of advertising, that is, a good chunk of the revenue of now and the future of all of the FANG: Facebook, Amazon, Netflix, Google. Big bucks. No joke. The key is some applied math research for some new, much more powerful, valuable statistical, etc. means. Let’s not insult such good work with contemptible terminology such as machine learning.

Of course, such good, original applied math will likely be proprietary, trade-secrets, etc. Yes, hearing about good, original applied math with advanced math prerequisites causes essentially every investor in Silicon Valley to tremble, bite their iPhone, and wish they were back in history class at Williams College or doing business development for some smartphone-based, social, local, mobile, sharing startup. So, such math is not just rare, not just neglected, but in major ways hated. Then the flip side of this situation is an opportunity.

Silicon Valley will continue to hate original applied math until after, maybe long after, an applied math startup has a market capitalization of $500+ billion.

Closely related to the ad targeting is how to let people for each of their interests find Internet content they will like best for that interest. 
Again, there are some techniques in use and a wide open field of new techniques that can be created from some original research in applied math.

The relevant, even crucial, applied math can draw from some relatively advanced work in pure math. Taken seriously, as high value situations deserve, some of the math tools can become advanced quickly.

Some of statistics not well duplicated by machine learning is, say, with some irony, detecting anomalies in computer server farms and communications networks. Sure, this work is part of monitoring in the associated system management.

So, such monitoring is necessarily some continually applied statistical hypothesis tests. With high irony, apparently so far the computer science and machine learning efforts (e.g., a project at Stanford and Berkeley funded by Sun, Microsoft, and Google) along with a river of computer science conference papers, along with, say,

Herve Debar, “IDFAQ: What is behavior based Intrusion Detection?”, SANS,

as at

http://www.sans.org/securit…

have yet to notice the crucial role of hypothesis testing and, instead, flounder around neglecting some of the crucial points; in their slog to reinvent 100+ years of statistics on their own, they have yet to see the crucial points.

The crucial points in anomaly detection: Sure, rates of false alarms and of missed detections. In the now classic field of statistical hypothesis testing going back at least to K. Pearson 100 years ago, the first is called Type I error and the second, Type II error. E.g., can see

Subutai Ahmad, “Numenta Anomaly Benchmark: A Benchmark for Streaming Anomaly Detection”, Domino, at

https://blog.dominodatalab….

with “The algorithm should minimize false positives and false negatives (this is true for batch scenarios as well).” 
Sorry, can’t really do that: To give a clearer description, what is being requested should be called a perfect detector, and that’s nearly always asking for too much. Instead, and a fundamental point, essentially always there is a trade-off between the rate of false alarms and the rate of missed detections: If we are willing to accept more false alarms, then we can get fewer missed detections. Just how the trade-off goes is crucial and is the key issue in picking one means of detection over another. That is, some detectors (hypothesis tests) are more powerful than others; the extra power can mean extra money saved; less power, more money wasted. Money wasted from poor anomaly detection? E.g., ask Sony, the NYSE, Target, etc., any major company with big losses from being hacked. Then, as is standard in a junior level college first course in mathematical statistics, a crucial result for this trade-off is the 1933 Neyman-Pearson result, that is, from E. Pearson (son of the K. Pearson of 100 years ago) and J. Neyman, long in statistics at Berkeley. That is, right, anomaly detection via machine learning has yet to catch up with mathematical statistics of 80+ years ago and long in college junior level courses. A word comes to mind: incompetent. For the Debar essay, he’s wrong: He claims that, for his context, behavioral monitoring, false alarm rates have to be too high. Wrong. Trivially wrong. And beyond being trivial, profoundly wrong. In particular, in the classic field of statistical hypothesis testing, e.g., in polished texts going back 50+ years, it is standard to be able to select the rate of false alarms in advance, set that rate in a test, and in practice get that rate exactly. Net, for monitoring, statistical hypothesis testing is crucial with lots of solid, well polished results, and machine learning is floundering around ignoring what has long been known and doing bad jobs reinventing the old work. 
Bummer. It’s like having the hospital bedpan nurses being self-taught in heart surgery, killing patients by the thousands all for no good reason. Computer Science Machine Learning: Risky, adulterated classic statistics with new labels. What is explained here largely applies also to artificial intelligence (AI).
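To make the trade-off between false alarms and missed detections concrete, here is a minimal sketch in Python. It is an illustration under loud assumptions, not anything from the comment or the cited papers: a single measurement that is N(0, 1) when the system is healthy and N(shift, 1) during an anomaly, with a one-sided threshold test. The function names and the `shift` parameter are hypothetical.

```python
import random
from statistics import NormalDist


def threshold_for_false_alarm_rate(alpha):
    """Threshold t with P(X > t) = alpha when X ~ N(0, 1), i.e. no anomaly.

    Assumption: healthy measurements are standard normal.
    """
    return NormalDist().inv_cdf(1.0 - alpha)


def missed_detection_rate(alpha, shift):
    """Type II error: P(X <= t) when X ~ N(shift, 1), i.e. anomaly present.

    Assumption: an anomaly moves the mean by `shift` and leaves the
    variance unchanged.
    """
    t = threshold_for_false_alarm_rate(alpha)
    return NormalDist(mu=shift, sigma=1.0).cdf(t)


def empirical_false_alarm_rate(alpha, n=100_000, seed=0):
    """Simulate healthy data and count how often the test raises an alarm."""
    rng = random.Random(seed)
    t = threshold_for_false_alarm_rate(alpha)
    alarms = sum(1 for _ in range(n) if rng.gauss(0.0, 1.0) > t)
    return alarms / n


if __name__ == "__main__":
    # Accepting more false alarms (larger alpha) buys fewer missed detections.
    for alpha in (0.001, 0.01, 0.05, 0.2):
        print(alpha,
              round(empirical_false_alarm_rate(alpha), 4),
              round(missed_detection_rate(alpha, shift=2.0), 4))
```

The false alarm rate is chosen first and the threshold follows from it, so under the stated assumption about healthy data the simulated rate should land close to the chosen `alpha`, while the missed detection rate falls as `alpha` rises.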
1. falicon Apr 17, 2017
  
  You’ve got an interesting rant here…but IMHO what people are looking for right now is not *perfect* predictions…they are looking for “automated, data-driven, better guesses”…and that’s what the current state of ‘machine learning’ is all about. “Better guesses” can directly bring costs down and sales up…both things that *directly* affect the bottom line in all companies…and that’s why there is so much hype and excitement. It doesn’t have to be “real science”, right now it just has to result in “better guesses at scale” than what we had yesterday…
  1. Twain Twain Apr 18, 2017
    
    Right, the AI that gets green-lighted does one of two things: (1.) Maximizes profits. (2.) Reduces costs / inefficiencies. The ML for that doesn’t even need to be hard: it’s basically Operational Research on speed.
  2. sigmaalgebra Apr 18, 2017
    
    “You’ve got an interesting rant here” Or, why drink wine some amateur made in his bathtub when you can get some darned good wine from France, Germany, Italy, elsewhere in Europe, and parts of the US, often for quite reasonable prices? And, no longer an analogy, why pay top dollar for California Chardonnay when you can get the much better, real stuff from France as just Macon Blanc? For “IMHO what people are looking for right now is not *perfect* predictions”, that’s an old confusion. For a while some statisticians beat up on everyone else by saying that big buck consulting from statisticians was needed to make the results “statistically valid”. What got valid was the bank account of the statisticians. In practice, next to nothing in statistics is “perfect”. Here’s the root of much of the confusion: There is curve fitting, especially fitting a straight line. Or in several variables, fitting a plane or, if you will, a hyperplane: same algebra but just more terms like the rest and by itself no big issue. So, go through the usual derivations to find the coefficients of the line, plane, etc. that minimize the sum of squared errors from the observed data. 
Physics has long, commonly called that a best fitting line, and that sounds good, right? The usual derivation, and I have a much better one, is, in the expression for the sum of squared errors, to take the partial derivative with respect to each coefficient, set them all to zero (a necessary condition and, in this case due to some convexity, also a sufficient condition for minimization), and get the normal equations with a matrix that is symmetric and thus has all real eigenvalues; all the eigenvalues are zero or positive, and except in a case of over-fitting all the eigenvalues are positive, so the matrix has an inverse, and multiplying by the inverse solves the normal equations and gives the coefficients. Okay. Right, the eigenvectors are all orthogonal, and we will see that next semester in factor analysis. A lot more is true: What you are really doing is taking a perpendicular projection where you get, right, the Pythagorean theorem, which in that part of statistics is: total sum of squares = regression sum of squares + error sum of squares. This stuff is commonly taught to undergrads in sociology, psychology, educational testing, and economics (econometrics). Then, with the students all hanging on by their fingernails or just lost, trot out: If the errors are homoscedastic (didn’t see that one in your sixth grade spelling lessons), Gaussian distributed with mean zero and the same variance, then the estimates of the coefficients are best linear, unbiased, minimum variance. That’s a cute result. Guess the guys who first proved that were proud. But, the result has hypotheses that likely don’t hold in practice and are no doubt in practice and likely in theory essentially impossible to confirm. So, logically this is like a boat ride in the South Pacific that takes you from an island to an island of paradise except there is no transportation to the first island! Still, that result is not fully silly: The result is robust, which means that it is not very sensitive to the assumptions. 
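The normal equations and the sum of squares identity described above can be checked on a straight-line fit with a few lines of Python. This is only a sketch for the one-variable case with an intercept; the function names are hypothetical, and it is the textbook derivation, not the "much better one" the comment alludes to.

```python
def fit_line(xs, ys):
    """Least-squares line y = a + b*x, solved from the 2x2 normal equations.

    Normal equations:  [n   sx ] [a]   [sy ]
                       [sx  sxx] [b] = [sxy]
    """
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx  # zero only when all x values are identical
    if det == 0:
        raise ValueError("all x values are equal; the matrix has no inverse")
    # Multiply the right-hand side by the inverse of the 2x2 matrix.
    a = (sxx * sy - sx * sxy) / det
    b = (n * sxy - sx * sy) / det
    return a, b


def sums_of_squares(xs, ys, a, b):
    """Return (total, regression, error) sums of squares for the fitted line."""
    mean_y = sum(ys) / len(ys)
    fitted = [a + b * x for x in xs]
    total = sum((y - mean_y) ** 2 for y in ys)
    regression = sum((f - mean_y) ** 2 for f in fitted)
    error = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    return total, regression, error


if __name__ == "__main__":
    xs = [0.0, 1.0, 2.0, 3.0, 4.0]
    ys = [1.1, 2.9, 5.2, 6.8, 9.1]  # made-up illustrative data
    a, b = fit_line(xs, ys)
    total, regression, error = sums_of_squares(xs, ys, a, b)
    # The two printed sums should agree up to floating-point rounding.
    print(total, regression + error)
```

The identity holds here because the fit includes an intercept, so the residuals are perpendicular to both the constant vector and the fitted values.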
And the assumptions and result give some nice hypothesis tests (F ratio, t-test) for checking the quality of the calculations, and the usual gossip is that those tests are also robust. Net, for practice, do use the tests, if only as rough, first-cut guidance, unless you have a solid reason not to. For “they are looking for ‘automated, data-driven, better guesses’”, that’s not nearly new. The statistics community has had all that, way back to when Dad was in college, in highly polished texts from bone-head and cookbook. Then for students who want the good stuff, there was junior level mathematical statistics … to statistics with prerequisites in measure theory and results of Halmos, Dynkin, Kolmogorov, Wiener, Kalman, Wald, Tukey, Neyman, etc. Moreover, J. Tukey long did a lot of public outreach, pulling up the skirts, dropping the pants, as one might prefer, on stiff necked statistics: He explained that it was from common up to nearly universal for people to use regression, etc. largely ignoring the strict mathematical assumptions and for the results to be useful, just as you outlined. And for more, Tukey was nice enough to explain that a lot of the practical value in statistics was in just descriptive statistics and, moreover, made nice progress in such work as a field: exploratory data analysis, with a book of the same title. For “‘Better guesses’ can directly bring costs down and sales up”, yup, and statistics has been good at that stuff since Dad was in college. Indeed, it was much of how he long organized technical training in the US Navy: engines, electronics, hydraulics, sheet metal, etc. Those Navy fighter planes tend to need a lot of maintenance! For “It doesn’t have to be ‘real science’, right now it just has to result in ‘better guesses at scale’ than what we had yesterday…”, yup. Been there, done that. E.g., suppose the US and Russia get into a spat, at sea, and start shooting, at sea, and it goes nuclear, but only at sea. 
The US wants to keep its nuclear missile firing subs in reserve as a deterrent so wonders how long the subs might last. So, the Navy wanted an estimate: right, statistics. One day while I was working to support both myself and my wife through our Ph.D. degrees (my financial planning flopped when she was several years late doing her dissertation) I discovered that I was to give the estimate and that the Navy wanted the results in two weeks. Hmm … And my work was to be reviewed for solid applied math by J. Keilson, well known in mathematical statistics. So, I did the applied math, wrote and ran the software, and got the results in the two weeks. Good, because the next day my wife had reservations for us for a vacation she wanted in Shenandoah. The math? Okay, borrow from some WWII work by B. Koopman. See a continuous time Markov process subordinated to a Poisson process. Input data has to do with the weapons systems on both sides. Yes, from Kolmogorov, etc., there is a closed form solution but in this case that exact calculation was too difficult. But it was easy enough to use some Monte Carlo to generate 500 sample paths and average. So, that’s what my software did. Apparently the Navy liked it. The company I worked for sold the software to a US intelligence agency; I could tell you which one but then I’d have to …. Uh, there were many assumptions, or, if you will, quick and dirty approaches and intuitive approximations in there. Yup, maybe it was better than nothing. Yet, if taken too seriously it could have been dangerous: destroy the planet. Gee, I wanted to get my wife and me through our Ph.D. degrees! Yes, statistics can be valuable stuff, maybe even when computer science, machine learning people do it. But, in pharmaceutical testing, epidemiology, actuarial estimates, etc., I still want good statistics with no computer science, machine learning, artificial nonsense people within 10 miles. Why drink that bathtub swill when you can have the good stuff?
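The "generate 500 sample paths and average" step can be sketched in Python on a toy model. Everything here is an assumption for illustration and is not the Navy model or Koopman's work: a pure death process in which each surviving sub is lost independently at a constant rate, so event times are exponential. The names `simulate_survivors`, `monte_carlo_mean`, `kill_rate`, and `horizon` are hypothetical.

```python
import math
import random


def simulate_survivors(n_subs, kill_rate, horizon, rng):
    """One sample path of a toy continuous-time Markov (pure death) process.

    Assumption: each surviving sub is lost independently at rate `kill_rate`,
    so with `alive` subs the time to the next loss is exponential with
    rate alive * kill_rate.
    """
    t = 0.0
    alive = n_subs
    while alive > 0:
        t += rng.expovariate(alive * kill_rate)
        if t > horizon:
            break  # the next loss would fall after the time of interest
        alive -= 1
    return alive


def monte_carlo_mean(n_subs, kill_rate, horizon, n_paths=500, seed=0):
    """Average the number of survivors at `horizon` over `n_paths` paths."""
    rng = random.Random(seed)
    total = sum(simulate_survivors(n_subs, kill_rate, horizon, rng)
                for _ in range(n_paths))
    return total / n_paths


if __name__ == "__main__":
    n_subs, kill_rate, horizon = 20, 0.1, 5.0  # made-up inputs
    estimate = monte_carlo_mean(n_subs, kill_rate, horizon)
    # For this toy model only, the exact mean is n * exp(-rate * T).
    exact = n_subs * math.exp(-kill_rate * horizon)
    print(estimate, exact)
```

This toy case has a simple closed form to compare against, which is exactly what the harder two-sided problem lacked in practice; the Monte Carlo average stays usable when the exact calculation does not.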
    1. falicon Apr 18, 2017
      
      Just remember, “in the land of the blind the one eyed man is king”…
Alex Apr 18, 2017

Great post! Thanks for sharing. Would appreciate more content like this.
Zachary Reiss-Davis Apr 18, 2017

This is outstanding overall — but have you seen anything in a similarly engrossing style (or just ANY easy to understand teaching style) that explains “what is machine learning” overall? I’ve been trying to explain the concept a few times recently and would love something to just send people.
David Pethick Apr 18, 2017

Outstanding piece of work. A difficult part of the job of a data scientist is overcoming “black box” objections from smart people who are not trained in the field. Showing them how machine learning is similar to their own internal decision-making process makes that easier. Cheers. Dave P.
Mark Mullin Apr 18, 2017

I work in this space (multi image fusion and 3D depth extraction from static view and subsequent SfM) and I’d advise a healthy sense of caution. Is it probably a next big thing? Yup. Do we understand it? Poorly. Can we absolutely explain why it did or did not work in a given situation? Nope. Are we finding it doesn’t quite work the way we thought? Yup. See https://blog.openai.com/adv… for examples of exactly how brittle this is. By all means venture on, Fred, but carry a healthy sense of skepticism. And the presentation is too simplistic: not wrong per se, but so brutally oversimplified as to be trivial. For example, there is no discussion whatsoever of hyperdimensional linear separability (all this is just bulldozing stuff into ever higher dimensional spaces so we can draw a line between things) nor of how overfitting is effectively memorization of a training set (which I might point out is a flaw with immediate fungible value of its own 🙂 )
Hiral Shah Apr 19, 2017

Thank you for sharing. This is one of the best articles I have seen on ML, and the best one I have found to share with non-technical folks who want to understand this better.
Steven Kane Apr 22, 2017

Fantastic! Thanks for posting. Now that is a really great example of why God created the Interwebs!