Posts from Statistics and Probability

DIY Data Science In Action

I wrote a post about DIY Data Science back in March. In that post I said that hacking on public data sets and posting about it has the potential to be a big deal in the coming years. I saw a great example of exactly what I was thinking about this morning.

Alastair Coote pulled a bunch of turnstile data from the MTA and figured out what the most used NYC subway stations are during rush hour. And he posted his code to GitHub and embedded it on his blog.
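For anyone who wants to try something similar, here is a minimal sketch in Python of the general idea: add up station entries during weekday rush hours and rank the stations. This is not Alastair's code. The rows are made-up placeholders, the `busiest_rush_hour_stations` helper is hypothetical, and the rush-hour window is my own assumption. The real MTA turnstile files report cumulative counters per turnstile, so they would need to be differenced before a step like this.

```python
from collections import defaultdict
from datetime import datetime

# Made-up placeholder rows: (station, reading time, entries in that interval).
# These are NOT real MTA numbers, and the layout is an assumption.
SAMPLE_ROWS = [
    ("Station A", "2013-06-03 08:00", 1200),
    ("Station A", "2013-06-03 13:00", 400),   # midday, ignored
    ("Station B", "2013-06-03 08:00", 900),
    ("Station B", "2013-06-03 17:00", 800),
    ("Station C", "2013-06-08 08:00", 5000),  # Saturday, ignored
    ("Station A", "2013-06-04 17:00", 300),
]

# Assumed definition of rush hour: readings stamped 7-9 or 16-18 (hour of day).
RUSH_HOURS = {7, 8, 9, 16, 17, 18}


def busiest_rush_hour_stations(rows, rush_hours=RUSH_HOURS):
    """Return (station, total entries) pairs, busiest first."""
    totals = defaultdict(int)
    for station, timestamp, entries in rows:
        moment = datetime.strptime(timestamp, "%Y-%m-%d %H:%M")
        # Keep only weekday readings (Monday=0 .. Friday=4) in a rush-hour slot.
        if moment.weekday() < 5 and moment.hour in rush_hours:
            totals[station] += entries
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


ranking = busiest_rush_hour_stations(SAMPLE_ROWS)
for station, total in ranking:
    print(station, total)
```

Swapping the placeholder rows for real per-interval counts is where the actual work is, and it is the part that would make a good class project.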

If I were a high school math teacher, I would take his work and make it a project for my students to work on together. The MTA makes a lot of data available to play with. This kind of stuff is highly relevant to teenagers in NYC. They would understand the data and the exercise.

The data and tools to do DIY Data Science are becoming more accessible every day. I hope we all get into data hacking and start collaborating on this stuff together publicly. At a minimum, it will lead to more data scientists, and we might learn some interesting things about ourselves and our world at the same time.

BTW – Union Square is the most active subway station at rush hour. Midtown South FTW!

DIY Data Science

In a comment on yesterday's hobbyist post, Pete Griffiths offered "Do It Yourself Data Science" and I really liked that suggestion for a bunch of reasons.

I think data science and machine learning (I know they are not the same thing) are going to be a very big part of tech innovation in the coming years. And I also know that putting powerful tools in the hands of "everyman" produces more innovation than can happen when the tools are limited to mathematicians and scientists.

The blogging revolution in publishing is a great example of this. Once everyone could have a printing press, we got to see many important developments that did not and would not have happened as long as publishing was a high cost operation limited to professionals.

So what is the Tumblr or Blogger or WordPress of data science? When will my son and his friends be able to take the NBA dataset and start running algorithms against it to produce better fantasy picks? When will my daughter and her friends be able to take the TV viewing dataset to decide what TV shows to go back and watch that they missed last year?

I believe data science is going to go mainstream in the coming years. What will be the platform(s) that make that happen?