DIY Data Science In Action

I wrote a post about DIY Data Science back in March. In that post I said that hacking on public data sets and posting about it has the potential to be a big deal in the coming years. I saw a great example of exactly what I was thinking about this morning.

Alastair Coote pulled a bunch of turnstile data from the MTA and figured out what the most used NYC subway stations are during rush hour. And he posted his code to GitHub and embedded it on his blog.
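
For anyone who wants to try this kind of analysis, here is a minimal sketch in Python. It is not Alastair's code. The sample rows, the station labels, the `RUSH_HOURS` set, and the `busiest_at_rush_hour` function are all made up for illustration, and the sketch assumes the turnstile counts have already been turned into entries per station per hour.

```python
from collections import defaultdict

# Hypothetical, already-cleaned rows: (station, hour_of_day, entries_in_that_hour).
# The numbers are invented for illustration and are NOT real MTA figures.
# Assumption: raw counts have already been converted to per-hour entries.
SAMPLE_ROWS = [
    ("14 St-Union Sq", 8, 900),
    ("14 St-Union Sq", 17, 1100),
    ("14 St-Union Sq", 13, 300),
    ("Times Sq-42 St", 8, 700),
    ("Times Sq-42 St", 17, 800),
    ("Times Sq-42 St", 13, 1500),
    ("Grand Central-42 St", 8, 600),
]

# Assumed definition of rush hour (morning and evening peaks).
RUSH_HOURS = {7, 8, 9, 16, 17, 18}


def busiest_at_rush_hour(rows, rush_hours=RUSH_HOURS):
    """Return (station, total_entries) pairs, busiest station first."""
    totals = defaultdict(int)
    for station, hour, entries in rows:
        # Only count entries that fall inside the rush-hour window.
        if hour in rush_hours:
            totals[station] += entries
    return sorted(totals.items(), key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for station, total in busiest_at_rush_hour(SAMPLE_ROWS):
        print(station, total)
```

In this toy data a station can have the most entries overall and still not come out on top at rush hour, because midday traffic is filtered out before the totals are ranked.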

If I were a high school math teacher, I would take his work and make it a project for my students to work on together. The MTA makes a lot of data available to play with. This kind of stuff is highly relevant to teenagers in NYC. They would understand the data and the exercise.

The data and tools to do DIY Data Science are becoming more accessible every day. I hope we all get into data hacking and start collaborating on this stuff together publicly. At a minimum, it will lead to more data scientists and we might learn some interesting things about ourselves and our world at the same time.

BTW – Union Square is the most active subway station at rush hour. Midtown south FTW!

#hacking education #NYC #Statistics and Probability

Comments (Archived):

  1. jason wright

    in films featuring NYC subway stations people are often seen jumping the turnstiles. is this myth, and how accurate is the raw data? p.s. was it really March? time suck.

    1. falicon

      In my experience it’s not super common, but it does happen…and for good reason -> http://goo.gl/5CWo9 🙂

      1. jason wright

        those turnstiles have a perfect hopping design. that would be my defense in court. it made me do it 🙂

    2. Anne Libby

      I haven’t seen it very often, and probably not once in the last 5 years…

    3. fredwilson

      kids do it. particularly kids in less affluent neighborhoods

  2. William Mougayar

    It’s good you didn’t call this Big Data because this isn’t big data. It’s open data. If it can fit in a spreadsheet or a standard database, it’s not Big Data, but it’s DIY Data, and that’s more powerful from a peer to peer usage perspective. But not everybody is a programmer. Could this have been done by spreadsheet manipulations & some macros?

    1. Cam MacRae

      Brenda Dietrich said something in a keynote yesterday that I’ve been mulling over — paraphrasing: “Big data really means data that’s a bit bigger than I’m used to”. I don’t think ‘big data’ is a helpful term. (Oh, and to answer your question, yep, but anyone who did so would be a programmer by definition.)

      1. William Mougayar

        To reach the masses, it’s got to be simpler than that. Maybe there’s a hierarchy of Open/Big Data users:
        Scientists – few thousands
        Developers – few millions
        Users – hundreds of millions

        1. Cam MacRae

          The masses need actionable information, not data (irrespective of its bigness). That doesn’t mean you can’t leave data lying around for curious types to discover.

          1. Matt A. Myers

            Agreed. The data has to have a layer on top of it that makes it useful for the context it is presented in – to present its inherent value(s).

          2. Dave W Baldwin

            Understand your point, yet Middle Schoolers do learn about samples for surveys and so on. Instructing the transformation of gross into net is something they can understand, just don’t make too big a deal of it. I do think it is worth establishing a chart of names related to data that is easy to understand and is used consistently, ranging from Meta, Big and so on.

        2. awaldstein

          Already touching the masses. Train schedules in subway stations are a relatively new thing.

          1. William Mougayar

            I meant in terms of “hacking” the data.

          2. awaldstein

            The mass markets don’t hack, they use. The win is when it becomes flexible enough and the same things.

          3. William Mougayar

            Yes, I hope there is hope that we can lower the barriers of data usage/manipulations/collaboration by consumers. That was my understanding of Fred’s wish “I hope we all get into data hacking and start collaborating on this stuff together publicly.”

          4. awaldstein

            Data usage yes, data manipulation as in hacking for the mass market…not a reality. The beauty of local data driven apps is that they create a format for use that the developers hack for them. That’s the synergy that makes sense. Data for the masses is not the same as data hacking for the masses.

          5. ShanaC

            where are they

          6. awaldstein

            ? Just look up 😉 In every subway station. Hard to appreciate for those that are here now, but for a generation it was sweating and standing in the subway waiting with no idea which train was coming and when. No way to know whether to take the local or wait for the express. This is data that really touches my life big time!

          7. falicon

            especially at 4am after a night of ‘exploring the town’…not that I get to do that much these days (but I do remember many an early morning standing down there for what felt like 3 days before a train would eventually come) 😉

          8. ShanaC

            i’ve never seen them – hmmm

          9. awaldstein

            A life changer for New Yorkers

        3. Abdallah Al-Hakim

          I am not a programmer but I do appreciate it when others explain thoroughly how they did things. I find it to be an extremely useful experience. I hope some of the DIY data hackers take the time to describe their process.

  3. kumarbshah

    On a related note Fred, today’s news in India was about a Cornell student who managed to access the entire results of one of the major national level exams. He has posted his work on Quora along with his data analysis. Interestingly enough, his takeaway is that the scoring system in these exams, which literally make or break admissions into top schools here, has been tampered with. http://deedy.quora.com/Hack

    1. fredwilson

      wow

  4. JimHirshfield

    Every day begins and ends for me between the two busiest subway stops. I know, it’s a limited dataset, but that’s my DIY Data du Jour.

    1. Anne Libby

      I love the selection of musicians in the Union Square station. I always look forward to being surprised by what I might hear there.

          1. Cam MacRae

            Very cool.

        1. JimHirshfield

          I remember that story. Fascinating.

      1. JimHirshfield

        “Surprised” being the operative word!

        1. Anne Libby

          Exactly. It is NYC, after all. Though I’ve thought quite a bit about “surprise” since hearing Lee Ann Reininger, who spoke a bit about it at Etsy’s conference on the future of commerce. There’s a kind of delight that comes from good surprises…like when you run into a friend on the street in another city. She talked about building some of these good surprises into our workplaces. Thought provoking.

      2. ShanaC

        so far the best was when some opera singers started to sing an opera

        1. Anne Libby

          Yes, I loved them!

        2. Matt A. Myers

          I kind of wish this was on every street corner. A culture and society of the arts.

  5. Dave W Baldwin

    Very good and DIY Data Science is something that would be great to promote in schools. Off subject, going back to a former post, you are mentioned in the following story http://www.ft.com/cms/s/0/e… in the intro. I think the comparison of Google to GE is a good one.

    1. fredwilson

      weird. seems to be some sort of paywall. i can’t take the time to fill out whatever i have to fill out right now. so i will miss that one unfortunately

  6. Brandon Kessler

    And this is the ongoing competition to make more software using their data: http://2013mtaappquest.chal

    1. fredwilson

      yesssssss

  7. awaldstein

    For a decade we’ve had search. For only two years we’ve had digital signs in the subways that tell us if and when a train is coming. Trickle down data is what matters at a human level. This is the great next frontier.

    1. Matt A. Myers

      I really enjoy riding in the new subway that’s in Toronto. The UI/UX is so much better, and it makes me feel more connected – safer, and more relaxed. It’s very simple data that’s displayed and updated in real-time, though if all data in the world is available in the same way – presented in a proper / helpful form (means of learning its true value in different contexts) – then I think everyone can grasp it and feel more connected to the world.

      1. awaldstein

        Data as a targeting mechanism for marketing and data as a product that soothes and helps are two different things. Good point.

        1. Matt A. Myers

          Both take into account context though too or at least are more valuable / useful when in context.

    2. fredwilson

      those signs have been a game changer. but as my friend Michael pointed out to me a few weekends ago, those signs should be at street level before you go down into the station, or at least before you swipe and get onto the platform. because it is possible that there are quicker options for short trips if the train is delayed (walking, citibike, etc)

      1. awaldstein

        Agree completely. As a marketer, I have huge amounts of data to plumb the market through, to target potential customers. So much data that my understanding is the currency of value. As an individual, I have massive amounts of data to make myself as informed as possible about everything. My opinion, not the information itself, is the currency of value. As an individual on the street, where all this digital data intersects with my very much analog self, it’s a huge gap. We are just getting started. I’m excited about a damn sign that says duh, the subway is coming. It’s like the Pleistocene era man discovering tools!

  8. kirklove

    Huh. Would have guessed 42nd Times Square would take the top spot. Curious what the actual numbers are and the difference btwn 1 and 2.

    1. jonathanjaeger

      Times Square has the top spot for overall traffic, but I guess according to the data Alastair Coote pulled, Union Square comes out ahead during rush hour. Here’s what it is overall: http://www.mta.info/nyct/fa

      1. kirklove

        Ahh, makes sense Times Square nearly doubled Union Square overall. Thanks.

    1. Matt A. Myers

      If I was home (in Toronto right now), I’d do a mockup and post it … anyone else with Illustrator or PS nearby? 🙂

    2. LE

      That’s a nice idea. I came from a background where the pantone books were lying around like paper cups and you had to replace them because they “faded” over time. Although I would say it’s not something that Fred is going to take the time to do.

      As a side note, doing things like that makes a great story line to show the extent that a company or founder thinks about details and understands marketing. [1] Which you can debate is good or bad depending on your perspective on it. Reminded of the story of Steve Jobs and the factory machine colors, although if you can’t see the value in what he is saying here you probably need to have someone else better qualified in marketing. At least if marketing is important to your business.

      Here’s a Jobsian that I can relate to: “He insisted that the machinery on the 165-foot assembly line be configured to move the circuit boards from right to left as they got built, so that the process would look better to visitors who watched from the viewing gallery.”

      Back in 2001 I had managed to work myself in with some local reporters about doing a story on domains. The story wasn’t about me, but I was a source and came up with some sound bites for them and interesting facts that I knew they would go for. This was after spending about a year simply feeding them unsolicited info about a fairly new subject (domains) they knew nothing about. They asked some questions over the phone, and then said they wanted to send a photographer to take some pictures of what I did or something like that. Knowing a bit about publicity, photography, newspapers and marketing, I spent the entire evening before the set day rearranging my desk so that the three monitors that I used could be positioned in a way that looked good in photos. Previously they weren’t all set up in a way that you could capture 3 in one shoot. I also cleaned the desk and made everything perfect. And got some interesting things on each screen. I didn’t just say “oh ok when is that guy getting here to take the pictures?”.

      The photographer arrived and took the ladder (I had it waiting for him to give the right perspectives I had blocked out) and he snapped away. I got a call that they liked the photos so much it was going to be the lead photo on the business page of the local paper. They were then able to syndicate the story and it appeared in probably 20 papers around the country over the next 4 weeks or so. I would know when it hit an area because all of a sudden orders came in from the new city. That business paid off for years in renewals and residuals. Many of those customers are still around today. In some cases the story was really short but the photo with my name and company was on a magazine section cover.

      This wasn’t the first time I did this. When I had gotten publicity at a prior company I mocked up a fake book cover as a prop knowing that would be important to the story. That one ended up in the local paper, the Wall Street Journal (not the prop but the story), the local TV station and other major media across the country.

      Like this: Local Paper (because of photo) -> local news station -> WSJ -> all other media who follow the Journal. Without the mockup that created the photo the rest would not have happened most likely.

      Bottom line: Details matter in everyday business. Maybe internet hockey sticks are the exception (because it’s more like catching the things that go flying by) but the little things truly do make a difference in the 99.9% of other businesses out there.

      1. jason wright

        i love stories. the best ones are the most authentic, and we want authentic, because we want to believe and we want to trust. it’s not just marketing. it has to be from the heart.

    3. fredwilson

      it’s funny you suggested that. i have tried several times, unsuccessfully, to convince our team to adopt a subway motif for our brand.

      1. jason wright

        design by committee? no, that doesn’t normally produce excellence.

        1. fredwilson

          Sadly we do suffer from that. I think we may have finally locked down on a new brand identity but boy was it hard

          1. Robert Holtz

            Fred, pursuant to your other post about 3-letter domain names, recommend you change logo to something that features “USV” type treatment as part of the main logo form.

          2. jason wright

            “long is the way and hard that out of hell leads up to light” the subway is a great metaphor for the interwebs and perhaps your thesis.

  9. pointsnfigures

    City of Chicago has done a good job opening its data to entrepreneurs so they can try and build cool stuff. It’s one of the things governments can do to support a local entrepreneurial ecosystem. The downside to them is we might find out things they don’t want us to know! (Especially in Chicago: “We don’t want nobody that nobody sent” - Paddy Bauler)

  10. Matt A. Myers

    Awareness is what directs people. Data helps increase / create awareness. It can lead to solutions and innovation through seeing problems, and through seeing opportunity. That’s the base value of data. It’s pretty important to everything. Also, in order for people to learn, they need an end goal – something to drive them, a mission – and then with that mission / end goal in mind, they will run into obstacles, challenges, more things they must learn in order to achieve their final goal. This is the passion we must try to induce, to foster in children – and adults. Passion as the leading metric to innovation and learning.

  11. John Best

    I’m off to nwhealthhack(.com) in a few days, looking at the intersection between health and technology. The amount and variety of data they’ve provided for us to play with is staggering.

    1. Cam MacRae

      Looks great. Good luck!

  12. ShanaC

    by turnstile – we don’t know how it is being used as a transfer point 🙂

    1. Alastair Coote

      Indeed, the data is only useful in some contexts. Because the NYC system doesn’t require a swipe on exit (like, say, the London Tube does) it’ll never be possible to know exactly where someone goes in the system, we can only deal in approximations. As it happens, that’s fine for my project. I’m (slowly, the process takes ages) merging these ten stations with yet more open data, this time CitiBike rental stations, to show which areas benefit the most: http://experimenting.alasta

      1. awaldstein

        Very cool. East<–>West downtown usage is what I love personally as there is no coherent subway route.

      2. ShanaC

        ah – though think you should take a look at how local stations are affecting this system (more than half the stations in your list are transfer points)

      3. fredwilson

        oh man, you are killing it. i just reblogged your heatmap [http://fredwilson.vc/post/5…] and spent time playing with it. i did not understand why the west village, where i live, did not show up on it. but then i checked only union square, where i commute to, and it showed up bigtime. such nice work.

  13. Robert Holtz

    Fred, if you haven’t already, you MUST play with Wolfram Alpha. http://www.wolframalpha.com Not only do you get to all those massive supersets of data but you can express entire calculations using natural language as part of your query. You’ll get instant powerful answers to things that would normally be impossible or take massive computational resources. It is an amazing gift to the world — especially since it is also accessible via an API for integration into other products and services. To give you an idea just how powerful it is, check out the Examples section: http://www.wolframalpha.com… You’re going to LOVE this… I promise you!

    1. kenberger

      we use it all the time to help make important biz decisions such as which countries to open new dev offices in. Example query to compare country stats: “gdp per capita vietnam ukraine argentina”

      1. Matt A. Myers

        GDP doesn’t tell you much other than how much movement there is.

        1. kenberger

          it can give a rough idea of cost levels for some countries, at least relative to other countries, in the absence of better measurement tools.

      2. Robert Holtz

        Thanks for contributing the example. It is marvelous for questions and what-if scenarios that involve data that is always changing.

    2. Abdallah Al-Hakim

      That is a good reminder about wolfram alpha. I need to revisit it myself

    3. fredwilson

      thanks. will do.

      1. Robert Holtz

        I promise you, Fred, you’re gonna fall in love with it. I don’t get any compensation at all for evangelizing it. I’m just a big fan giving you my sincere raves. In fact, about six months ago, I upgraded to the pay level of the service just as a gesture of my appreciation. Clearly they need to keep feeding in new source data to keep it fresh and comprehensive. Enjoy! Tools in the cloud like this point to the end of not-knowing.

  14. Guest

    Maybe the school kids can also play with this DIY Machine Learning toolset from wise.io: http://techcrunch.com/2013/… The tools available today are an amazing sandbox to play in as a teen.

    1. fredwilson

      ooh. thanks for leaving this link. i am collecting a list of all of them.

      1. Guest

        If not on your list already, worth adding http://import.io At around the time you were involved with thestreet.com I was working at a data startup (jv with the ‘FT’ later acquired by the financial data provider, Bureau van Dijk). We had a proprietary browser add-on to enable us to extract unstructured data from different news sources and the stock exchanges, to put it all into structured form that we re-sold to the banks and asset mgmt companies and syndicated across the larger data platforms. When I think about what’s possible now with Open Calais, Google Freebase and these open Data Science tools…

  15. Richard

    Real time non structured data predictive analysis. Anyone want to guess the date of the photo?

  16. Richard

    Non Structured real-time data analysis that is predictive is the holy grail. It is a long time in the making. Anyone want to guess the date of this photo?

    1. John Best

      Given the shirt, I’m guessing ’96

  17. Alastair Coote

    In case anyone is curious as to why I even looked up this data, it’s feeding into this heatmap I’ve made using yet more open data (this time Citibike station locations) to examine which areas of the city benefit most from the new bike sharing program: http://experimenting.alasta… The number crunching takes some time though, so there are only the top three stations for now. I could imagine a fun experiment for students in letting them remove subway lines, draw new ones, and see what effect that would have on the city. I already have plans to do a similar heat map analysis of what’ll happen when the Second Ave subway arrives (and the 7 line extension, though that should be more obvious). Just to make myself really sad, maybe I’ll do one for the entire proposed Second System of the NYC Subway: http://en.wikipedia.org/wik

    1. Kirsten Lambertsen

      I like how your domain reads like something Yoda would say 🙂

      1. Alastair Coote

        Ha, thank you. It came out of a joke I had with a friend of exactly that: the Yoda Naming convention. Hence I have blogging.alastair.is, tumbling.alastair.is, experimenting.alastair.is, and so on. Glad to see that others get the joke!

    2. nosh petigara

      Very interesting stuff, Alastair! Curious what your algorithm (at a high level) is for merging the datasets and figuring out the time saved by Citibike. I just started collecting some Citibike data last night from http://citibikenyc.com/stat… to play around with. Currently it logs station status (bike availability, etc) every 15 min to a MongoDB database. Was going to use it to figure out what stations were most active, which had surpluses of bikes, peak usage times, etc. If anyone is interested in the code or data, feel free to drop me a line ([email protected])
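
One hedged way to get at "most active" from periodic station snapshots, sketched in Python: add up how much the bike count changes between consecutive readings at each station. This is not the commenter's code. The snapshot tuples, station names, and the `activity_by_station` function are hypothetical, and plain in-memory tuples stand in for whatever is stored in the database.

```python
from collections import defaultdict

# Hypothetical snapshots: (station, minutes_since_start, bikes_available).
# Invented values, standing in for status records logged every 15 minutes.
SNAPSHOTS = [
    ("Station A", 0, 10), ("Station A", 15, 4), ("Station A", 30, 9),
    ("Station B", 0, 20), ("Station B", 15, 19), ("Station B", 30, 20),
]


def activity_by_station(snapshots):
    """Sum absolute changes in bike counts between consecutive snapshots."""
    by_station = defaultdict(list)
    for station, minute, bikes in snapshots:
        by_station[station].append((minute, bikes))

    activity = {}
    for station, readings in by_station.items():
        readings.sort()  # order readings by time
        activity[station] = sum(
            abs(later - earlier)
            for (_, earlier), (_, later) in zip(readings, readings[1:])
        )
    return activity


if __name__ == "__main__":
    print(activity_by_station(SNAPSHOTS))
```

This measure is only a lower bound on real activity: a pickup and a return inside the same 15 minute window cancel out and never show up in the snapshots.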

      1. Alastair Coote

        I intend to do a writeup of my method when I’ve got all the data in a row, but the basics are that I set up an instance of OpenTripPlanner and repeatedly hammer it with requests from 13,000 points across the city. Apparently there is some software called OTP Analyzer that does all of this in a more organised way, so I want to look at that first.

        1. nosh petigara

          Ah. That makes a lot of sense and is much simpler than what I was thinking it would be! Looking forward to the writeup

    3. fredwilson

      i already replied to a similar comment in the thread above, but i must say i am really impressed by your work

  18. David DiAngelo

    I must be getting a little jaded. The first thing that I thought about was how dangerous this information could be for someone like the Boston Bombers or folks of their ilk looking to do maximum damage with limited resources.

    1. kidmercury

      the criminals already have the data, the only question is whether honest people will get it too.

      1. fredwilson

        yes!!!

    2. Matt A. Myers

      It’s not hard to assume where a crowd will be. It’s not hard for someone to find out information by researching on their own. The only solution is having everyone taken care of, giving them non-violent reasons to live, things to be passionate about – family, learning, innovation, etc..

  19. LaVonne Reimer

    There is so much going on in this space. It is a great day that starts out with you initiating a great conversation about data science! I’m working on an open data deal involving commercial credit so tend to pay more attention to that domain but came across a really good venn diagram yesterday from Drew Conway–the intersection of hacking skills, substantive expertise, and math & stats knowledge. The DIY part of the title made me think of it. Here’s the link. http://drewconway.com/zia/2

    1. fredwilson

      that’s a great visual. these are the domains i am encouraging my son to study in college. i think there’s so much to be done in them in the coming years.

      1. Guest

        Another key domain is Neuroscience. Not along maths lines but to really understand the biochemical, genomic interactions of intelligence (data and human). Those are the domains I focus on in building my technologies.

  20. matthughes

    That’s a great point about high school teachers creating a project around data that is relevant to the kids. That’s the best way to teach and learn.

  21. Pete Griffiths

    Nice.

  22. sachmo

    Isn’t there a risk that data like this can be used for very evil purposes as well, i.e. terror attacks, foreign governments targeting critical infrastructure, etc…? Should there be some kind of authentication system? Maybe any citizen can request the data, but the person needs to provide their license and other relevant information to release it?

    1. fredwilson

      i don’t like to worry about stuff like that

  23. Lancewlars

    Not sure if anyone has linked this yet, but here’s a pretty good TC article on the two edges of the open data sword from a government perspective. Do politicians have the willpower to open more and more data without crossing legal/constitutional lines? So far that seems to be difficult. http://m.techcrunch.com/201

  24. Gregg Thaler

    Profiler for Salesforce-Search engine research automation. Uses deep web data mining technology to gather just-in-time Contact data using the internet as the data source. This DIY data technology has been under development for 10 years by Donato Diorio who pioneered the use of data mining the Internet for contact information. http://bit.ly/10CebMR