# DIY Data Science

In a comment on yesterday's hobbyist post, Pete Griffiths offered "Do It Yourself Data Science" and I really liked that suggestion for a bunch of reasons.

I think data science and machine learning (I know they are not the same thing) are going to be a very big part of tech innovation in the coming years. And I also know that putting powerful tools in the hands of "everyman" produces more innovation than can happen when the tools are limited to mathematicians and scientists.

The blogging revolution in publishing is a great example of this. Once everyone could have a printing press, we got to see many important developments that did not and would not have happened as long as publishing was a high cost operation limited to professionals.

So what is the Tumblr or Blogger or WordPress of data science? When will my son and his friends be able to take the NBA dataset and start running algorithms against it to produce better fantasy picks? When will my daughter and her friends be able to take the TV viewing dataset to decide what TV shows to go back and watch that they missed last year?

I believe data science is going to go mainstream in the coming years. What will be the platform(s) that make that happen?

Humm – Your point hit home for me with the following comment: “When will my son and his friends be able to take the NBA dataset and start running algorithms against it to produce better fantasy picks” #BigData

Spot on. The other side of that is releasing data sets. Data sets in private hands will always be protected, but we need to insist more on liberating government data sets. Data sets do two things. One, they lend themselves to the creation of apps or other innovations to manipulate and present the data, which in themselves can be very valuable. Two, and tied to the issue/trend spotting point, they help both entrepreneurs and investors find potential market opportunities and/or test hypotheses. Spend government money making usable data sets from information that the PEOPLE own. For example, think of court cases, and how one must pay an arm and a leg to private vendors to access case information. Imagine the power of releasing clever teams of folks on that data. The result would be providing cheaper disruptive ways for consumers to access that information, and disrupting expensive business models and customer deadweight loss along the way. There is also a justice principle at play: consumers should be paying for innovation value-add to public information, not for access to data the PEOPLE should already own.

i agree. but easy to use tools that the masses are using might bring more datasets public

That’s interesting. I think you’re right. In terms of private sets, owners will see the value of releasing them to build loyalty and perhaps to learn something. In fact, Manchester United did that. I blogged about it a year ago. One of the team officials put it well: “I want our industry to find a Bill James. Bill James needs data, and whoever the Bill James of football is, he doesn’t have the data because it costs money”

it’s going to be the same problem that always occurs. it’s all fun and games and everyone is sharing data and smiling and BFF and talking about how wonderful everything is. then someone monetizes the data and it’s straight up gang warfare, crips vs bloods, capulets vs montagues, etc. clearer agreements are needed from the outset. the best models will be those that provide a business model for creating new data items, similar to how google shares ad revenue with publishers who can find new ways of monetizing adwords via adsense.

That’s why I think government data sets are most interesting; they are outside the reach of private ownership. Although once the corruption is exposed, then we’ll see how open the government data stays as well. Smart stuff on your thoughts of upfront business model/revenue share arrangements for private data sets.

government data is increasingly private. FOIA requests increasingly getting declined: http://www.huffingtonpost.c… plus there is the fact that companies get tax dollars for government projects but get to keep a lot of the research private. there was an issue not too long ago about the rights to photos from outer space being kept by the company that did the project, with just a license given to government on a non-transferable basis. tax dollars at work……

the worst for me, given my legal background, is Lexis/Nexis and Westlaw, which make billions of dollars by privatizing court opinions.

That is probably one of the most vexing parts of the legal industry. Do you think their monopoly might change?

Domestically perhaps, but internationally the opposite trend is true in some countries. Why? Because governments simply have less money to do more with. By open sourcing data sets, they empower the private sector and civil society organizations to solve problems that they know they aren’t going to be able to address. Then, when problems get solved, they can take the credit for opening up the data, helping these third parties, and ultimately improving the quality of life of their constituents (where the problems have been solved). So even if it’s only marketing for them, the net impact is for the better. See: OpenDataInstitute (http://theodi.org) or Kenya’s open data efforts (https://opendata.go.ke/) for examples.

great point. i do believe the bankruptcy of governments has the potential to create some great opportunities. my only fear is that we will see more government confiscation/protectionism, but as you note, there are some positive examples around the world.

RIP Aaron Swartz. He was in the business of liberating such data – which we have all paid for!

Actually, the Obama administration has been pretty aggressive about putting government data online. Check out http://data.gov for more info – may not all be current, but I applaud the effort. Noah

The more the merrier.

Agreed, every inch is a mile toward this type of initiative continuing into the future. Also worth lauding the efforts of others who worked tirelessly behind the scenes to make such efforts palatable to our government to begin with, people like Todd Park, Andrew McLaughlin, and Harper Reed.

Government has its own gangs, contracting companies, private research institutions battling for control and resources.

This.

Basketball has rapidly gone stat / microslice video happy, but it seems to lend itself to that analysis (as does the static nature of baseball). I am not so sure the beautiful game does: it’s fluid & there is no shot clock to force compact, discrete situations. Trust Man U to be forward thinking about it tho.

A private company provides the video datasets for NBA teams, fyi.

Have you checked out Statwing? http://www.statwing.com They are definitely moving in this direction. YC company (I forget which class). More sophisticated analysis than the average Excel user would attempt, with seemingly less user error = pretty interesting value proposition.

thanks. will check out

very cool

Thanks for the mention. Really solid DIY data analysis is definitely our mission. The reference to the NBA dataset is funny timing, since we just found and made available for analysis a dataset of every NFL play last year. Here’s the blog post we wrote about it, along with links to analyze the dataset in Statwing: http://blog.statwing.com/nf

OMG – you just extinguished all my free time with that.

I hadn’t seen statwing before – thanks. Q. What is the most popular user interface for anything remotely numerate? A. Spreadsheets – notably Excel. Hence the very common decision of those trying to reach the masses to provide powerful interaction with spreadsheets. It leverages the familiar.

You would think that after 25 years and probably billions of dollars spent on developing Excel, there would be more examples of the “everyman” interaction with data science.

Mainstream = graphs / charts / PPT no?

Mainstream certainly.

It seems like Fivetran, presented today at YC, follows that route. Surely there will be a lot of competition (and acquisitions) in this space in the next two years.

Thanks for the tip re Fivetran.

I can’t believe nobody’s pointed out http://www.kaggle.com yet. We’ve hosted a crowdsourced data-science challenge with them to great effect.

I agree. It’s amazing to contrast the innovation and number of options to search and review the patent database (made public for full download in partnership with Google, but Google also has its own platform for searching/using it) versus court records.

What about in-between data sets, e.g. genetic information in publicly funded studies?

Last year, GSK started to take steps to open up some of its clinical trial data sets, in the hopes that greater access and the ability to mash together multiple data sets might reveal insights that were previously missed. GSK announced: “GSK is fully committed to sharing information about its clinical trials. It posts summary information about each trial it begins and shares the summary results of all of its clinical trials – whether positive or negative – on a website accessible to all. Today this website includes almost 4,500 clinical trial result summaries and receives an average of almost 10,000 visitors each month. The company has also committed to seek publication of the results of all of its clinical trials that evaluate its medicines – regardless of what the results say – to peer-reviewed scientific journals. Expanding further on its commitments to openness and transparency, GSK also announced today that the company will create a system that will enable researchers to access the detailed anonymised patient-level data that sit behind the results of clinical trials of its approved medicines and discontinued investigational medicines. To ensure that this information will be used for valid scientific endeavour, researchers will submit requests which will be reviewed for scientific merit by an independent panel of experts and, where approved, access will be granted via a secure web site. This will enable researchers to examine the data more closely or to combine data from different studies in order to conduct further research, to learn more about how medicines work in different patient populations and to help optimise the use of medicines with the aim of improving patient care. This initiative is a step towards the ultimate aim of the clinical research community developing a broader system where researchers will be able to access data from clinical trials conducted by different sponsors. GSK hopes the experience gained through this initiative will be of value in developing and catalysing this wider approach.”

De-anonymization is still very easy. And awkward with genetic and trial data.
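To make the re-identification worry concrete, here is a minimal sketch of a linkage attack: records with names removed are joined to a public list on shared attributes (quasi-identifiers). Every record, field name, and the `link` helper below are hypothetical and invented for illustration; nothing here comes from GSK's data or any real release.

```python
# Hypothetical "anonymised" records: names removed, quasi-identifiers kept.
trial = [
    {"zip": "10001", "birth_year": 1970, "sex": "F", "diagnosis": "condition A"},
    {"zip": "10002", "birth_year": 1985, "sex": "M", "diagnosis": "condition B"},
]

# Hypothetical public list (for example a member directory) with names attached.
public = [
    {"name": "Person One", "zip": "10001", "birth_year": 1970, "sex": "F"},
    {"name": "Person Two", "zip": "10002", "birth_year": 1985, "sex": "M"},
    {"name": "Person Three", "zip": "10002", "birth_year": 1990, "sex": "M"},
]

# Attributes present in both data sets that can act as a join key.
QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")


def link(anon_records, known_people):
    """Return {name: diagnosis} for every record matching exactly one person."""
    index = {}
    for person in known_people:
        key = tuple(person[k] for k in QUASI_IDENTIFIERS)
        index.setdefault(key, []).append(person["name"])

    matches = {}
    for record in anon_records:
        names = index.get(tuple(record[k] for k in QUASI_IDENTIFIERS), [])
        if len(names) == 1:  # a unique match means the record is re-identified
            matches[names[0]] = record["diagnosis"]
    return matches


print(link(trial, public))
```

The more attributes a release keeps, the more likely each combination is unique, which is why removing names alone may not be enough.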

Summaries are better than nothing but rather defeat the point. Modern techniques benefit from huge raw data sets.

Amazing strides are being made with D3 in making data digestible and dare I say fun to play with. – https://github.com/mbostock

thanks. i will check it out

D3 IS BEST OPEN OPTION FOR MAKE DATA PRETTY.

Follow Mike Bostock (main guy behind D3) and Jason Davies on twitter.

Many of the people involved with d3, including Bostock, were all hired by the New York Times.

we use d3 quite a lot for visualizations, it’s fantastic.

This is the coolest thing I’ve seen this week. Thanks so much for sharing.

interesting – how difficult is d3 to learn?

Depends where you’re coming from. To use it, you need some knowledge of JavaScript, CSS, svg, graphics programming concepts, and of course data visualization itself. It’s much more of a toolkit than a plug-and-play library like, say, highcharts, but with that complexity comes a great deal of power and flexibility.

processing. my js is weak. but i need an excuse to strengthen it

I haven’t played with processing in a while, and not much at that. D3 definitely does things in a unique way (to me at least), but I suspect that coming from processing, it won’t be too tough to pick up.

Platforms that pair rapid machine curation (ML) AND rapid human curation. Clusters for healthcare, business, & finance need names (from humans) to be actionable. You can’t simply have a machine point at a cluster to gain insight.

“I believe data science is going to go mainstream in the coming years. What will be the platform(s) that make that happen?” The ones who make the data simple to digest, and interesting/pretty enough to get people’s attention. We, here at AVC, like numbers. Most people don’t even like figuring out how to split and tip on a restaurant bill. “Enough” data presented simply and compellingly will win over “big” data.

Totally right. Edward Tuftian?

Tufte is a technical guy who knows the importance of design to deliver the message. He knows that when there’s too much information, it’s useless unless presented in the right way. At some point, everyone else will realize his wisdom. The reign of the designer is coming. :o)

It’s always been here, just not fully realized. I’ll give you an example from completely outside the tech world and Big Data trends. I have spent the last decade or so making legal and economic arguments in courts and agencies in the US, EU, and China, among other places. We have at our disposal the best economists, often ones who have won or will win Nobels. We always have the most complicated and latest, greatest, state of the art econometric analyses at our disposal. The above is the price of admission, the validation of our credibility. Despite that, without fail, what wins the day at the end is the most important information simply but powerfully presented, which provides the common sense explanation to the adjudicator and cuts through millions of pages of documents, days of testimony by economics experts, and all sorts of other evidence that cancels each other out in the volume of noise it brings.

yep!

I’m so anti-tufte. He ignores the repercussions of what he says, and what insisting on systematic, “proper” data display does to actual innovation.

Woah! I don’t think I’ve ever met an anti-tufte. I’d love to hear more about what bothers you about him…

it is a philosophy of science thing. Tufte is very insistent that there is a correct way to display quantitative information. The problem with quantitative data is that in collecting it, you’ve implicitly assumed that the data will show the answer to your question. This is highly problematic in that the question itself is not neutral and is framed by the knowledgebase the question is based on (inherently communal and subject to change – kuhn’s point). “correct” displays may only reinforce that community’s paradigms about the knowledge rather than pointing to the fact that the data and the question were wrong in the first place. So his point about “there is a right way to display information” annoys me and makes me anti-tufte. Though there are ideas about making information easy to understand and not throwing people off with it that I do appreciate.

Interesting. I never saw him as quite that proscriptive. Opinionated, yes, but I came to him from the graphic design side, and saw his stuff as advice on how to make information displays clearer. I don’t really see a priori why clearer information has to be used to reinforce the hypothesis. It’s still a matter of applying techniques and looking at data with integrity. You almost seem to be criticizing design in general. Of course it can be used to tell beautiful lies, but that doesn’t mean the techniques and ideas don’t help strengthen or clarify a message when it’s true.

bill splitting needs to come from the restaurant and needs to be integrated with the menu and with order taking. that’s why there’s no killer bill splitting app yet, which frustrates me so much.

This is an important problem to be solved from a guy who talks of rome burning (monetary problems) while Washington fiddles? … (sung to the tune of “another photo sharing app”). You know when restaurants will care about this? When they lose business to restaurants offering this (“hey instead of the food, atmosphere and crowd that we like let’s go to the place that will make it easier to split the bill guys!”). Starbucks has a great app that I use every day that makes my Starbucks experience great. But I was a fan before that and I wouldn’t switch to a coffee shop that had the app if Starbucks didn’t.

let me know where i said it was an important problem. i believe they already are missing out on money because of this.

“let me know where i said it was an important problem.” Ok, you said: “which frustrates me so much”. The word “frustrates” seems to indicate “important problem” although I will remove “important” and just leave “problem” and the rest of the sentence. And “killer bill splitting app” where I interpret “killer” in a way to indicate the importance of the problem to be solved or how it is viewed. “i believe they already are missing out on money because of this.” The word “they” is open to interpretation. If you are referring to specific restaurants sure it could be true. If you are referring to restaurants widely then I don’t agree. That said my first paragraph was a bit unfair and knee jerk but I guess the comment seemed out of character for you.

‘The word “frustrates” seems to indicate “important problem”’: no it doesn’t. the person sitting next to me on the subway frustrates me. that is not an important problem either.

you guys are hilarious. :o)

Who you calling a funny guy? http://www.youtube.com/watc… (about 1 minute in past the pre-roll which is a big frustration to me that needs to be whacked)

lol. looooooove that movie.

Those apps might not convince people to switch, but they might increase loyalty. And easier bill splitting could easily tip me if I was organizing a large dinner party and didn’t want to be stuck with 30% of it when people left early and didn’t pay enough!

in ny, just find out what the tax is and double it, then divide it by the total number of people at the table 🙂
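For anyone who would rather let a script do it, the "double the tax" shortcut above works out as follows. This is only a sketch of that rule of thumb: the function name and the example bill are made up, and it assumes the tax line on the check is the number being doubled.

```python
def tip_per_person(tax_amount: float, diners: int) -> float:
    """Per-person tip using the 'double the tax, divide by the table' shortcut."""
    if diners <= 0:
        raise ValueError("diners must be a positive number")
    return round(2 * tax_amount / diners, 2)


# Hypothetical check: $18.00 of tax, four people at the table.
print(tip_per_person(18.00, 4))  # 9.0
```

With a sales tax a little under 9%, doubling it lands the tip somewhere around 18% of the pre-tax bill.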

if only everyone could process things as easily as you, shana. :o)

The question is how much is ‘enough.’ If the tools are powerful enough the amount of data shouldn’t be the issue. That is pretty much invisible. The reason that ‘big’ data means anything at all is that we have discovered that many AI algorithms get better results when trained on huge data sets. So ‘enough’ should really be ‘big enough for great results’ and that may vary from problem to problem. The user shouldn’t have to care.

Question – easy to use data analysis tool – visually represented response. Nice.

And when everyman starts applying ML to their own health measurements, how we approach health care might shift considerably. We are moving to an age of connected devices that can produce a lot of data about your body and well-being. For example, I just talked to a person with type 1 diabetes, who is really looking for innovative cloud-connected devices for blood glucose monitoring that help him take better responsibility for his own personal health. Just think of the possibilities if he can combine real-time glucose measurements with all the other data about his body.

yup. quantified self feeds right into this

I think that GNU R has great potential as a tool for analysis. It is open source and commonly used. A great tutorial anyone can use to play with it is available at: http://tryr.codeschool.com. There is a learning curve to really unlock its power though, and I have not gotten that far. I think people really need to take and understand stats (I last took a class like six years ago as an undergrad.) Sites like http://www.gapminder.org/ are easy and fun to use with pre-loaded data sets. One of the largest problems is standardizing and getting data sets in a usable format. Also geocoding addresses can be difficult and demanding. http://www.datasciencetoolk… is one of my favorite things for dealing with data sets. Also check out OpenRefine (formerly Google Refine) https://github.com/OpenRefine I truly believe if the open data movement is going to take off we need to build a culture where the data is open and accessible from government and other entities. Sometimes you are legally entitled to data sets but a local government will drag its feet on providing it, or you’ll get it in an unusable format. We’ll only fix this if you educate your local government about data and the importance of storing and providing it in usable formats. Make your own information request and see how it works where you live. Also it’s fun to check out: http://alpha.data.gov for applications using federal data.

Yep, cleaning up the data sets may be as big a problem as any.

google refine plus a bit of python (google refine supports jython) is super excellent at the cleaning data part
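As a rough illustration of the "bit of python" side of cleaning, here is a sketch that uses only the standard library. The messy export, its column names, and the `clean` helper are all hypothetical; a real data set would need its own rules, and this is not how Refine itself works internally.

```python
import csv
import io

# Hypothetical messy export; the names and numbers are made up for illustration.
RAW = '''player, team ,points
 Alice Example , nyk ,"1,204"
alice example,NYK,1204
Bob Sample,bos,
'''


def clean(text):
    """Normalise whitespace and case, parse numbers, and drop duplicate rows."""
    reader = csv.reader(io.StringIO(text))
    header = [name.strip().lower() for name in next(reader)]
    seen, rows = set(), []
    for raw in reader:
        if not raw:  # skip blank lines
            continue
        record = dict(zip(header, (cell.strip() for cell in raw)))
        record["player"] = " ".join(record["player"].split()).title()
        record["team"] = record["team"].upper()
        digits = record["points"].replace(",", "")  # "1,204" -> "1204"
        record["points"] = int(digits) if digits else None  # keep gaps explicit
        key = (record["player"], record["team"])
        if key in seen:  # same player and team already kept
            continue
        seen.add(key)
        rows.append(record)
    return rows


for row in clean(RAW):
    print(row)
```

Missing values are kept as `None` rather than guessed, so whoever analyzes the data later can decide how to handle them.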

what do you think of pandas for python?

It’s coming along nicely, but doesn’t have the depth or breadth of R.

udacity uses matplotlib and scipy/numpy for teaching statistics. Pandas just makes it easier to import? I’m not sure most people need R. R is a heavy tool. Python is closer to English and is very lightweight

At the moment the primary benefit is some nice data structures. Numpy is a dependency. R isn’t particularly heavy really but it does require a deft hand.

right now, I’m using pandas/scipy/numpy as an excel replacement so I have more time to do other stuff. I’m not quite that deft with code, and I get extremely nervous around i/o so pandas is really useful in that regard 🙂

Awesome! So, do you know what you want to be when you grow up yet? (It must be a year since we last had this conversation). For the record, I still don’t!

I think I should probably spend some more time learning python once I finish Ruby on Rails. I played with it a little (scipy and numpy) by playing along with a Data Science Bootcamp video series from O’Reilly (Hilary Mason and some others taught it at a conference). Nowhere near enough to express an opinion but I’ll stick it in my queue to check out when I do Python.

There are three open source tools that we should be teaching in every AP Stats course in high school: R, D3 and ggplot.

R is a pretty amazing open source statistical environment. (As a programming language it’s a train-wreck: it makes the hard things easy and the easy things hard.) Octave is to Matlab as R is to SAS/Splus. There’s also Python scikit-learn.

i agree they are very important. but i am looking for something that is easier to use that could be more mainstream

Check out IPython Notebooks. They appear in the browser, with which kids are intimately familiar. Its integration with SciPy, LaTeX, MathJax and matplotlib allows kids to think in math but create data viz. I teach SPSS so I hate that menu driven, memorize the key sequence monkey learning. IMO IPython is still clunky but it has great potential. I also love org-mode.

Check out fivetran.com – we are a YC 2012 company and this is exactly what we are building.

I’ve been using Equametrics (http://equametrics.com/) RIZM platform recently. I think it is a great platform for the layman and really fills a hole in the market for technical analysis tools. Before, there were super high end products for people that knew how to code algorithms and there was excel, with little in between.

thanks. i will check out

This is a hands-on sport: 1) learn how to use R, 2) pick up a book on mathematical statistics and know everything possible about a normal distribution, 3) read the book “the myth of statistical significance”.

For predictions, there is a great tool for professionals to do DIY analytics created by a company called Rapid Insight. The tool is called Analytics. It is great and at a much lower price point than established products. I use it myself when I need to throw a model together and do not want to pay a data science consultant to build a prototype. They also have a cool tool called Veera that deals with data no matter the platform.

Your challenge is spot on. For too many years the analysts have both hoarded the source data and spoon-fed us the results of their analysis. For data science to work we need to get out of the artisan stage and into a world of self-service that is so easy even my mom can do it.

There are a number of roadblocks that I see at the moment:

1. Mentioned by others – open and available datasets (as an example, I fell in love with Twitter because they started out providing this, but my love has since been scorned as they now have *way* too many complex restrictions on the data…you can still get it all easily, but you are very limited in what you are *allowed* to do with that data…that’s crushing).
2. The hidden power, and the ugly step-child, of machine learning (and data science) is actually personalization. Your fantasy picks example illustrates that perfectly…it’s not about giving everyone the same picks, it’s about giving Josh the perfect picks for how his brain works….problem is personalization as a platform is still *very* expensive and hard to scale…trust me, I knowabout.it ;-)
3. There are no clear cut algos. or processes winning the machine-learning/data science mindsets yet…probably because it’s still really early days…but I don’t think we’ve found even one specific problem or vertical where a given algo. or approach is widely accepted as *the* way to do something (even with great results, there is some lengthy debate about whether how Watson does what it does is a good long-term/general solution).

So I think until these three issues make more general headway, the best we are going to see over the next few years are proprietary platforms that use data-science/machine learning in their background to solve a very specific set of problems for their clients (i.e. numberfire.com for the fantasy picks example).

The other problem is that these proprietary systems (at least the good ones) are going to be huge profit centers…and so the motivation to make the good stuff behind the scenes general purpose and open-sourced is just not going to be there for a while…
