DIY Data Science
In a comment on yesterday's hobbyist post, Pete Griffiths offered "Do It Yourself Data Science" and I really liked that suggestion for a bunch of reasons.
I think data science and machine learning (I know they are not the same thing) are going to be a very big part of tech innovation in the coming years. And I also know that putting powerful tools in the hands of "everyman" produces more innovation than can happen when the tools are limited to mathematicians and scientists.
The blogging revolution in publishing is a great example of this. Once everyone could have a printing press, we got to see many important developments that did not and would not have happened as long as publishing was a high cost operation limited to professionals.
So what is the Tumblr or Blogger or WordPress of data science? When will my son and his friends be able to take the NBA dataset and start running algorithms against it to produce better fantasy picks? When will my daughter and her friends be able to take the TV viewing dataset and decide which of the shows they missed last year are worth going back and watching?
I believe data science is going to go mainstream in the coming years. What will be the platform(s) that make that happen?
Hmm – your point hit home for me with this comment: "When will my son and his friends be able to take the NBA dataset and start running algorithms against it to produce better fantasy picks." #BigData
Spot on. The other side of that is releasing data sets. Data sets in private hands will always be protected, but we need to insist more on liberating government data sets.

Data sets do two things. One, they lend themselves to the creation of apps or other innovations to manipulate and present the data, which can themselves be very valuable. Two, tied to the issue/trend-spotting point, they help both entrepreneurs and investors find potential market opportunities and/or test hypotheses.

Spend government money making usable data sets from information that the PEOPLE own. For example, think of court cases, and how one must pay an arm and a leg to private vendors to access case information. Imagine the power of releasing clever teams of folks on that data. The result would be cheaper, disruptive ways for consumers to access that information, disrupting expensive business models and customer deadweight loss along the way.

There is also a justice principle at play: consumers should be paying for innovation value-add to public information, not for access to data the PEOPLE should already own.
i agree. but easy to use tools that the masses are using might bring more datasets public
That’s interesting. I think you’re right. In terms of private sets, owners will see the value of releasing them to build loyalty and perhaps to learn something.

In fact, Manchester United did that. I blogged about it a year ago. One of the team officials put it well: "I want our industry to find a Bill James. Bill James needs data, and whoever the Bill James of football is, he doesn’t have the data because it costs money."
it’s going to be the same problem that always occurs. it’s all fun and games and everyone is sharing data and smiling and BFF and talking about how wonderful everything is. then someone monetizes the data and it’s straight up gang warfare, crips vs bloods, capulets vs montagues, etc.

clearer agreements are needed from the outset. the best models will be those that provide a business model for creating new data items, similar to how google shares ad revenue with publishers who find new ways of monetizing adwords via adsense.
That’s why I think government data sets are most interesting; they’re outside the reach of private ownership. Although once the corruption is exposed, we’ll see how open the government data stays as well.

Smart stuff on your thoughts of upfront business model/revenue share arrangements for private data sets.
government data is increasingly private. FOIA requests are increasingly getting declined: http://www.huffingtonpost.c…

plus there is the fact that companies get tax dollars for government projects but get to keep a lot of the research private. there was an issue not too long ago about the rights to photos from outer space being kept by the company that did the project, with just a license given to the government on a non-transferable basis. tax dollars at work……
the worst for me, given my legal background, is Lexis/Nexis and Westlaw, which make billions of dollars by privatizing court opinions.
That is probably one of the most vexing parts of the law industry. Do you think their monopoly might change?
Domestically perhaps, but internationally the opposite trend is true in some countries. Why? Because governments simply have less money to do more with. By open sourcing data sets, they empower the private sector and civil society organizations to solve problems that they know they aren’t going to be able to address. Then, when problems get solved, they can take credit for opening up the data, helping these third parties, and ultimately improving the quality of life of their constituents (where the problems have been solved). So even if it’s only marketing for them, the net impact is for the better.

See the Open Data Institute (http://theodi.org) or Kenya’s open data efforts (https://opendata.go.ke/) for examples.
great point. i do believe the bankruptcy of governments has the potential to create some great opportunities. my only fear is that we will see more government confiscation/protectionism, but as you note, there are some positive examples around the world.
RIP Aaron Swartz. He was in the business of liberating such data – which we have all paid for!
Actually, the Obama administration has been pretty aggressive about putting government data online. Check out http://data.gov for more info – it may not all be current, but I applaud the effort.

Noah
The more the merrier.
Agreed, an inch is a mile towards this type of initiative continuing into the future. Also worth lauding are the efforts of others who worked tirelessly behind the scenes to make such efforts palatable to our government to begin with, people like Todd Park, Andrew McLaughlin, and Harper Reed.
Government has its own gangs, contracting companies, private research institutions battling for control and resources.
Basketball has rapidly gone stat / microslice-video happy, but it seems to lend itself to that analysis (as does the static nature of baseball). I am not so sure the beautiful game does: it’s fluid & there is no shot clock to force compact, discrete situations. Trust Man U to be forward thinking about it tho.
A private company provides the video datasets for NBA teams, fyi.
Have you checked out Statwing? http://www.statwing.com They are definitely moving in this direction. YC company (I forget which class). More sophisticated analysis than the average Excel user can do, with seemingly less user error = pretty interesting value proposition.
thanks. will check out
Thanks for the mention. Really solid DIY data analysis is definitely our mission.

The reference to the NBA dataset is funny timing, since we just found and made available for analysis a dataset of every NFL play last year. Here’s the blog post we wrote about it, along with links to analyze the dataset in Statwing: http://blog.statwing.com/nf…
OMG – you just extinguished all my free time with that.
I hadn’t seen Statwing before – thanks.

Q. What is the most popular user interface for anything remotely numerate? A. Spreadsheets – notably Excel.

Hence the very common decision of those trying to reach the masses to provide powerful interaction with spreadsheets. It leverages the familiar.
You would think that after 25 years and probably billions of dollars spent on developing Excel, there would be more examples of the “everyman” interaction with data science.
Mainstream = graphs / charts / PPT no?
It seems like Fivetran, presented today at YC, follows that route. Surely there will be a lot of competition (and acquisitions) in this space in the next two years.
Thanks for the tip re Fivetran.
I can’t believe nobody’s pointed out http://www.kaggle.com yet. We’ve hosted a crowdsourced data-science challenge with them to great effect.
I agree. It’s amazing to contrast the innovation and number of options to search and review the patent database (made public for full download in partnership with Google, but Google also has its own platform for searching/using it) versus court records.
What about in-between data sets, e.g. genetic information in publicly funded studies?
Last year, GSK started taking steps to open up some of its clinical trial data sets, in the hope that greater access and the ability to mash together multiple data sets might reveal insights that were previously missed. GSK announced:

"GSK is fully committed to sharing information about its clinical trials. It posts summary information about each trial it begins and shares the summary results of all of its clinical trials – whether positive or negative – on a website accessible to all. Today this website includes almost 4,500 clinical trial result summaries and receives an average of almost 10,000 visitors each month. The company has also committed to seek publication of the results of all of its clinical trials that evaluate its medicines – regardless of what the results say – in peer-reviewed scientific journals.

Expanding further on its commitments to openness and transparency, GSK also announced today that the company will create a system that will enable researchers to access the detailed anonymised patient-level data that sit behind the results of clinical trials of its approved medicines and discontinued investigational medicines. To ensure that this information will be used for valid scientific endeavour, researchers will submit requests which will be reviewed for scientific merit by an independent panel of experts and, where approved, access will be granted via a secure web site. This will enable researchers to examine the data more closely or to combine data from different studies in order to conduct further research, to learn more about how medicines work in different patient populations and to help optimise the use of medicines with the aim of improving patient care.

This initiative is a step towards the ultimate aim of the clinical research community developing a broader system where researchers will be able to access data from clinical trials conducted by different sponsors. GSK hopes the experience gained through this initiative will be of value in developing and catalysing this wider approach."
De-anonymization is still very easy, which is awkward with genetic and trial data.
Summaries are better than nothing but rather defeat the point. Modern techniques benefit from huge raw data sets.
Amazing strides are being made with D3 in making data digestible and dare I say fun to play with. – https://github.com/mbostock…
thanks. i will check it out
D3 is the best open option for making data pretty.
Follow Mike Bostock (main guy behind D3) and Jason Davies on twitter.
Many of the people involved with d3, including Bostock, were all hired by the New York Times.
we use d3 quite a lot for visualizations, it’s fantastic.
This is the coolest thing I’ve seen this week. Thanks so much for sharing.
interesting – how difficult is d3 to learn?
processing. my js is weak. but i need an excuse to strengthen it
I haven’t played with Processing in a while, and not much at that. D3 definitely does things in a unique way (to me at least), but I suspect that coming from Processing, it won’t be too tough to pick up.
Platforms that pair rapid machine curation (ML) AND rapid human curation. Clusters for healthcare, business, & finance need names (from humans) to be actionable. You can’t simply have a machine point at a cluster to gain insight.
“I believe data science is going to go mainstream in the coming years. What will be the platform(s) that make that happen?”

The ones that make the data simple to digest, and interesting/pretty enough to get people’s attention. We, here at AVC, like numbers. Most people don’t even like figuring out how to split and tip on a restaurant bill. “Enough” data presented simply and compellingly will win over “big” data.
Totally right. Edward Tuftian?
Tufte is a technical guy who knows the importance of design in delivering the message. He knows that when there’s too much information, it’s useless unless presented in the right way. At some point, everyone else will realize his wisdom. The reign of the designer is coming. :o)
It’s always been here, just not fully realized.

I’ll give you an example from completely outside the tech world and Big Data trends. I have spent the last decade or so making legal and economic arguments in courts and agencies in the US, EU, and China, among other places. We have at our disposal the best economists, often ones who have won or will win Nobels. We always have the most complicated and latest, greatest, state-of-the-art econometric analyses at our disposal. The above is the price of admission, the validation of our credibility.

Despite that, without fail, what wins the day in the end is the most important information, simply but powerfully presented, that provides the common sense explanation to the adjudicator and cuts through millions of pages of documents, days of testimony by economics experts, and all sorts of other evidence that cancels itself out in the volume of noise it brings.
I’m so anti-Tufte. He ignores the repercussions of what he says and the effect that insisting on systematic, “proper” data display has on actual innovation.
Woah! I don’t think I’ve ever met an anti-tufte. I’d love to hear more about what bothers you about him…
it is a philosophy of science thing. Tufte is very insistent that there is a correct way to display quantitative information. The problem with quantitative data is that in collecting it, you’ve implicitly assumed that the data will show the answer to your question. This is highly problematic in that the question itself is not neutral and is framed by the knowledge base the question rests on (inherently communal and subject to change – Kuhn’s point). “Correct” displays may only reinforce that community’s paradigms about the knowledge rather than pointing to the fact that the data and the question were wrong in the first place.

So his point that “there is a right way to display information” annoys me and makes me anti-Tufte. Though there are ideas about making information easy to understand, and not throwing people off with it, that I do appreciate.
Interesting. I never saw him as quite that prescriptive. Opinionated, yes, but I came to him from the graphic design side, and saw his stuff as advice on how to make information displays clearer. I don’t really see a priori why clearer information has to be used to reinforce the hypothesis. It’s still a matter of applying techniques and looking at data with integrity.

You almost seem to be criticizing design in general. Of course it can be used to tell beautiful lies, but that doesn’t mean the techniques and ideas don’t help strengthen or clarify a message when it’s true.
bill splitting needs to come from the restaurant and needs to be integrated with the menu and with order taking. that’s why there’s no killer bill splitting app yet, which frustrates me so much.
This is an important problem to be solved, from a guy who talks of Rome burning (monetary problems) while Washington fiddles?… (sung to the tune of “another photo sharing app”).

You know when restaurants will care about this? When they lose business to restaurants offering this (“hey, instead of the food, atmosphere and crowd that we like, let’s go to the place that will make it easier to split the bill, guys!”).

Starbucks has a great app that I use every day that makes my Starbucks experience great. But I was a fan before that, and I wouldn’t switch to a coffee shop that had the app if Starbucks didn’t.
let me know where i said it was an important problem.

i believe they already are missing out on money because of this.
“let me know where i said it was an important problem.”

Ok, you said: “which frustrates me so much”. The word “frustrates” seems to indicate “important problem”, although I will remove “important” and just leave “problem” and the rest of the sentence. And “killer bill splitting app”, where I interpret “killer” in a way that indicates the importance of the problem to be solved or how it is viewed.

“i believe they already are missing out on money because of this.”

The word “they” is open to interpretation. If you are referring to specific restaurants, sure, it could be true. If you are referring to restaurants broadly, then I don’t agree.

That said, my first paragraph was a bit unfair and knee-jerk, but I guess the comment seemed out of character for you.
‘The word “frustrates” seems to indicate “important problem”’ – no, it doesn’t. the person sitting next to me on the subway frustrates me. that is not an important problem either.
you guys are hilarious. :o)
Who you calling a funny guy?

http://www.youtube.com/watc…

(about 1 minute in, past the pre-roll, which is a big frustration to me that needs to be whacked)
lol. looooooove that movie.
Those apps might not convince people to switch, but they might increase loyalty. And easier bill splitting could easily tip me if I was organizing a large dinner party and didn’t want to be stuck with 30% of it when people left early and didn’t pay enough!
in ny, just find out what the tax is and double it, then divide it by the total number of people at the table 🙂
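That rule of thumb is a one-liner; a minimal sketch, assuming NYC's 8.875% sales tax (the rate and party size are just illustrative):

```python
def split_bill(subtotal, tax_rate=0.08875, people=2):
    """Split a restaurant bill using the NY trick: tip = double the tax."""
    tax = subtotal * tax_rate
    tip = 2 * tax
    return round((subtotal + tax + tip) / people, 2)

# e.g. a $100 dinner for four
share = split_bill(100, people=4)
```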
if only everyone could process things as easily as you, shana. :o)
The question is how much is ‘enough.’

If the tools are powerful enough, the amount of data shouldn’t be the issue. That is pretty much invisible. The reason that ‘big’ data means anything at all is that we have discovered that many AI algorithms get better results when trained on huge data sets. So ‘enough’ should really be ‘big enough for great results’, and that may vary from problem to problem. The user shouldn’t have to care.
Question: easy-to-use data analysis tool, visually represented response. Nice.
And when everyman starts applying ML to their own health measurements, how we approach health care might change considerably. We are moving into an age of connected devices that can produce a lot of data about your body and well-being.

For example, I just talked to a person with type 1 diabetes who is really looking for innovative cloud-connected devices for blood glucose monitoring that help him take better responsibility for his own personal health. Just think of the possibilities if he can combine real-time glucose measurements with all the other data about his body.
yup. quantified self feeds right into this
I think that GNU R has great potential as a tool for analysis. It is open source and commonly used. A great tutorial anyone can use to play with it is available at http://tryr.codeschool.com. There is a learning curve to really unlock its power though, and I have not gotten that far. I think people really need to take and understand stats (I last took a class like six years ago as an undergrad).

Sites like http://www.gapminder.org/ are easy and fun to use with pre-loaded data sets.

One of the largest problems is standardizing and getting data sets into a usable format. Also, geocoding addresses can be difficult and demanding. http://www.datasciencetoolk… is one of my favorite things for dealing with data sets. Also check out OpenRefine (formerly Google Refine): https://github.com/OpenRefine

I truly believe that if the open data movement is going to take off, we need to build a culture where data is open and accessible from government and other entities. Sometimes you are legally entitled to data sets but a local government will drag its feet on providing them, or you’ll get them in an unusable format. We’ll only fix this if we educate our local governments about data and the importance of storing and providing it in usable formats. Make your own information request and see how it works where you live.

Also it’s fun to check out http://alpha.data.gov for applications using federal data.
Yep, cleaning up the data sets may be as big a problem as any.
google refine plus a bit of python (google refine supports jython) is super excellent at the data cleaning part
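As a minimal sketch of the kind of cleanup Refine's clustering does at scale (the messy values here are invented for illustration), a few lines of python go a long way:

```python
# messy category labels of the sort you get from hand-entered spreadsheets
raw = ["New York", "new york ", "NEW  YORK", "Boston", "boston"]

def clean(value):
    # collapse whitespace and normalize capitalization
    return " ".join(value.split()).title()

cities = sorted({clean(v) for v in raw})  # ["Boston", "New York"]
```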
what do you think of pandas for python?
It’s coming along nicely, but doesn’t have the depth or breadth of R.
udacity uses matplotlib and scipy/numpy for teaching statistics. Pandas just makes it easier to import? I’m not sure most people need R. R is a heavy tool. Python is closer to english and is very lightweight
At the moment the primary benefit is some nice data structures. Numpy is a dependency.

R isn’t particularly heavy really, but it does require a deft hand.
right now, I’m using pandas/scipy/numpy as an excel replacement so I have more time to do other stuff. I’m not quite that deft with code, and I get extremely nervous around i/o so pandas is really useful in that regard 🙂
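That Excel-replacement workflow might look something like this (the game-log table is made up for illustration):

```python
import pandas as pd

# a tiny hand-made game log, the kind of table you'd otherwise keep in Excel
games = pd.DataFrame({
    "player": ["A", "A", "B", "B"],
    "points": [20, 30, 10, 40],
})

# one line replaces a pivot table: average points per player
avg = games.groupby("player")["points"].mean()
```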
Awesome!

So, do you know what you want to be when you grow up yet? (It must be a year since we last had this conversation.) For the record, I still don’t!
I think I should probably spend some more time learning Python once I finish Ruby on Rails. I played with it a little (scipy and numpy) by following along with a Data Science Bootcamp video series from O’Reilly (Hilary Mason and some others taught it at a conference). Nowhere near enough to express an opinion, but I’ll stick it in my queue to check out when I do Python.
There are three open source tools that we should be teaching in every AP Stats course in high school: R, D3, and ggplot.
R is a pretty amazing open source statistical environment. (As a programming language it’s a train wreck; it makes the hard things easy and the easy things hard.) Octave is to Matlab as R is to SAS/S-Plus. There’s also Python’s scikit-learn.
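To give a flavor of scikit-learn's appeal, a classifier is a few lines; a minimal sketch on the built-in iris data (not a serious evaluation, since it scores on the training set):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# fit a simple classifier to the classic iris measurements
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)
accuracy = clf.score(X, y)  # training accuracy, just to show the API
```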
i agree they are very important. but i am looking for something that is easier to use that could be more mainstream
Check out IPython Notebooks. They appear in the browser, with which kids are intimately familiar. Integration with SciPy, LaTeX, MathJax, and matplotlib lets kids think in math but create data viz. I teach SPSS, so I hate that menu-driven, memorize-the-key-sequence monkey learning. IMO IPython is still clunky, but it has great potential. I also love org-mode.
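The math-to-viz loop that makes notebooks appealing is only a few lines with matplotlib (rendered off-screen here; in a notebook the figure simply appears inline):

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend; notebooks render inline instead
import matplotlib.pyplot as plt
import numpy as np

# one line of math, one line of plotting
x = np.linspace(0, 2 * np.pi, 100)
fig, ax = plt.subplots()
line, = ax.plot(x, np.sin(x))
ax.set_title("sin(x) on [0, 2pi]")
fig.savefig("sine.png")
```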
Check out fivetran.com – we are a YC 2012 company and this is exactly what we are building.
I’ve been using Equametrics (http://equametrics.com/) RIZM platform recently. I think it is a great platform for the layman and really fills a hole in the market for technical analysis tools. Before, there were super high end products for people that knew how to code algorithms and there was excel, with little in between.
thanks. i will check it out
This is a hands-on sport: 1) learn how to use R, 2) pick up a book on mathematical statistics and learn everything possible about a normal distribution, 3) read the book “the myth of statistical significance”.
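Step 2 is easy to poke at interactively; for example, the familiar 68% rule falls straight out of scipy (used here as a stand-in for R):

```python
from scipy.stats import norm

# probability mass of a standard normal within one standard deviation
within_one_sigma = norm.cdf(1) - norm.cdf(-1)  # about 0.683
```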
For predictions, there is a great tool for professionals to do DIY analytics, created by a company called Rapid Insight. The tool is called Analytics. It is great and at a much lower price point than established products. I use it myself when I need to throw a model together and do not want to pay a data science consultant to build a prototype.

They also have a cool tool called Veera that deals with data no matter the platform.
Your challenge is spot on. For too many years the analysts have hoarded the source data and spoon-fed us the results of their analysis. For data science to work we need to get out of the artisan stage and into a world of self-service that is so easy even my mom can do it.
There are a number of roadblocks that I see at the moment:

1. Mentioned by others – open and available datasets. (As an example, I fell in love with Twitter because they started out providing this, but my love has since been scorned, as they now have *way* too many complex restrictions on the data. You can still get it all easily, but you are very limited in what you are *allowed* to do with that data. That’s crushing.)

2. The hidden power, and the ugly step-child, of machine learning (and data science) is actually personalization. Your fantasy picks example illustrates that perfectly. It’s not about giving everyone the same picks, it’s about giving Josh the perfect picks for how his brain works. Problem is, personalization as a platform is still *very* expensive and hard to scale. Trust me, I know about it ;-)

3. There are no clear-cut algos or processes winning the machine-learning/data-science mindshare yet, probably because it’s still really early days. I don’t think we’ve found even one specific problem or vertical where a given algo or approach is widely accepted as *the* way to do something (even with great results, there is lengthy debate over whether how Watson does what it does is a good long-term/general solution).

So I think until these three issues make more general headway, the best we are going to see over the next few years are proprietary platforms that use data science/machine learning in the background to solve a very specific set of problems for their clients (i.e. numberfire.com for the fantasy picks example).

The other problem is that these proprietary systems (at least the good ones) are going to be huge profit centers, so the motivation to make the good stuff behind the scenes general purpose and open-sourced is just not going to be there for a while.
http://datarpm.com/. Only they don’t have access to public data sets (yet).
Well, I’m biased here since it is technology I am working on.

I think deep learning neural networks are a really cool tool to make this sort of thing possible. Right now you still need lots of technical chops to make things happen, but you don’t need tons of domain experience. As these techniques mature, and things like high-speed GPUs become more ubiquitous, it will be pretty easy to throw a deep, simple, fully connected multi-layer neural net at a problem and get a predictor or classifier as output after some training. Right now there is still a lot of guesswork, lots of knobs and levers you have to pull and tweak, but as we find newer techniques this is getting better.

There are also techniques like AdaBoost that can take the outputs of several classifiers and basically play moneyball with the results, getting a good result from a team of weaker classifiers. AdaBoost is the technique that makes face recognition so good, but the classifiers face recognition uses took domain knowledge to create (AdaBoost needs weak classifiers as input). With neural networks, and the GPUs to train them quickly, we can get lots of weak classifiers without much domain knowledge. So I think deep learning + ensemble methods could make learning classifiers and predictors from data much easier and more accessible. It’s not there yet, but it is close.

I’d love to have the cash to buy an Nvidia GTX Titan or Tesla K20X and an expensive workstation to speed my stuff up, but multithreading and my laptop do just fine for my problem domain. Give it a generation or two, and that sort of power becomes very accessible for everyone. Once processing power is more accessible and the code to use it matures a little more, we are going to see an explosion in this space.
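A minimal sketch of that ensemble idea with scikit-learn (synthetic data and default decision-stump weak learners; the parameters are illustrative, not tuned):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# synthetic two-class problem standing in for a real dataset
X, y = make_classification(n_samples=500, random_state=0)

# AdaBoost reweights the data each round so that 50 weak learners
# (shallow decision stumps by default) combine into a strong classifier
ens = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
train_accuracy = ens.score(X, y)
```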
I love the idea of a kind of IFTTT for data. For example, I’d like to tell it to count all the Amazon lists that include the word “blogging” in the description and then tell me the 10 most-mentioned books in those lists.I think journalists and content producers of all stripes would find lots of uses for this kind of data discovery and crunching. Startupeers could certainly use it 🙂
that’s a great way to put it. but IFTTT is not mainstream either
ifttt is like intro control structures for programming
Building the Tumblr or WordPress of data science is exactly what we are trying to do at Silk. Basically, we let you build a site with collections of data and give you easy ways to visualize that data. More info can be found at http://silkapp.com/product.

Coincidentally, I built a Silk site to pick a TV show a few months ago at http://ratings.silkapp.com. I compared 4 shows by looking at different ratings (like IMDB and Rotten Tomatoes). Might be something your daughter might like, or she could create a site herself, of course 🙂
This is very cool!
OK. You’ve got my attention now
Thanks! We have people analyzing UN data sets (The OCEAN Project), making sites about every block in Minecraft (Minecraft data), and more. Have a look at http://www.silkapp.com/exam….
what would you say is the most interesting thing a user (outside your company) has built with the Silk site? the most interesting data set that they have used?
Many of those examples have been made by users outside of our company. I personally like EDMbase (http://www.edmbase.org/) – a site about Electronic Dance Music with really nice content (check the “Highlight of the week”).

I also really like The Ocean site mentioned by Jurian below – it might be less ‘fun’ but it’s data with a real-world impact.

And although the data seems a bit outdated right now, The Next Web Index (http://index.thenextweb.com/) is probably of particular interest to this community: it’s basically the CrunchBase dataset in Silk. For instance, here’s a graph showing how much USV’s portfolio companies have raised in their first two rounds: http://is.gd/HwVpf0. (Note that you can easily switch to a different view here, e.g. a map or table.)
Well, I really like the OCEAN site I mentioned below, and another surprising one is http://endurance.silkapp.com which has tons of graphics on endurance horse racing (it’s in Dutch though).
I have asked some of my Team to use some real data that we have for some projects that are trying to deconstruct why things worked or didn’t — I’ll be interested to see the results …..
Agree with @takingpitches:disqus – very cool!
this is pretty awesome. I’m not sure how it would do displaying complex information types….
Thanks, it can get pretty complex, see http://ocean.silkapp.com/ for example.
This is super cool and addresses a very pervasive problem of data presentation. What is your take on the intersection of data science and visualization? Actually, let me ask another way: in your product roadmap, how important are advanced numerical techniques or big data (let’s arbitrarily call it > 1 terabyte) support?
If I may intrude. Imho visualization is critically important and there is some really cool stuff coming our way. Check out: http://vis.stanford.edu/
Yeah, seems like there’s an infinite opportunity in visualization. To me the exciting days are ahead of us when we start to layer more vertical semantics on top of data to aid in analysis and presentation. There were some other comments about story-telling and the problem is that (AFAIK) there are no general frameworks for telling information-driven stories and certainly nothing that can link that back into data sources. Would be thrilled to find out something like that existed.
Really well done site. Making data beautiful is, well, beautiful. Kudos.
I tried a quick dive into the rate of employment-rate changes with a US map. I found some census bureau data in a few minutes of googling but couldn’t figure out how to easily plot it with Silk. I was hoping more for a combination of google-for-open-spreadsheet/csv data and quick ingestion/automated plotting software. I’m signed up, so will check in again in a few months to see how you folks are coming along.
Mapping data and leveraging the tools of social network analysis has started to go mainstream, and will continue to. DIY tools include Gephi and NodeXL. The problem with DIY tools for social network analysis is that an individual can create a data map, but the more “science” portion of data science will continue to be passed on to the experts for deeper analysis and interpretation. DIY can only go so far.
“as long as publishing was a high cost operation limited to professionals.” Understand the point, but you might say that at the core the problem was really distribution, not creation or reproduction of the information. It has been possible for quite some time now for people to cost-effectively write things down and reproduce them (word processors and typewriters and copiers, even). What hasn’t been possible is the distribution of those ideas that the internet made possible. I point this out only because in order to solve a problem it’s necessary to understand the root problem to be solved. You may have been meaning this, though, simply using the word “publishing” as a catch-all for the entire process of dissemination of ideas; I see the use of that word differently. With respect to music and video, all the great tools (GarageBand, iMovie, etc.) wouldn’t mean anything w/o YouTube, iTunes, etc. – the method of distribution of those ideas.
I stand corrected. You are right Larry
Thinking along the same lines; you said it better than I would have
so maybe data isn’t a tool problem. There seem to be a number of great tools out there – it is an ease-of-use and dissemination problem. The problem with big data by itself is that there isn’t much to disseminate.
“it is an ease of use and dissemination problems.” Ease of use allows for iteration. When I did photography and had a darkroom I was limited by the following:
1) 24 or 36 exposures per roll (each roll cost $$) (although I could self-wind a roll to any length with a bulk loader)
2) I had to develop those rolls in my darkroom (which cost $$)
3) I had to take the time to frame each shot because of the scarcity and cost
So there was a constraint on the iterations that I could do. Once #1-#3 were removed (digital photography) I could iterate more (literally no practical upper limit), so the chance of coming up with something nice greatly increases (as does the competition, for that matter). Additionally, the number of people who would be willing to do 1-3 is quite small; the barriers are trivial but they are there. I’m probably a better photographer because of this, same with shooting videos, since my brain is trained to think a certain way. Same with graphic design, since I started at a time when you had to do “paste ups” and stats and phototypesetting. Photo of bulk loader attached (that film is really, really old; I just opened it for the photo I took). Note that there is friction even in the process of adding the photo to this Disqus post – less friction than years ago, but friction nonetheless, and it certainly limits others posting photos on Disqus.
I recently decided that I ‘needed’ to start playing in this field and have been having a great time with R. I use the RStudio package as a GUI. They recently released a package called Shiny http://www.rstudio.com/shiny/ that I think will give web developers the tools to create places where the everyman can perform data science. Here are some ‘hobbyist’ sites where people are playing with Shiny and R (everyone seems to be focusing mostly on finance-themed topics): http://glimmer.rstudio.com/… http://glimmer.rstudio.com/… http://glimmer.rstudio.com/… Combine this with data sources such as Quandl http://www.quandl.com/ or the St. Louis FRB Federal Reserve Economic Data (FRED) extractor http://research.stlouisfed…. and you can tie your data to anything.
Big Data for the masses… and outside of IT/data analysts! Yes!! …and easily manipulated. Big data is when you can’t fit the data on a spreadsheet. But the key part is the analysis and how easily it can be done. The needed breakthrough is in data manipulation/analysis. Reporting data is dumb. Analyzing data is smart. Big data without analytics is a waste of time. If anyone can use data to tell stories, then we’re on to something.
Data to tell stories will be awesome…the in-between step is data to make suggestions/recommendations…not as awesome as data to tell stories, but still pretty cool…
check out Rhizalabs…
ProPublica is doing great work on this. Check out their tools section: http://www.propublica.org/t…
Using data to tell a story requires planning and a strong sense of creative analysis. Plus it means a very strong understanding of the impact of qualitative data on quantitative data, and the limits of each. It breaks away from Tufte in that “there is no true story,” so there is no true data.
I enjoy this when you open a topic I know very little about and the very knowledgeable commenters expand on it in the thread. My education continues at A VC, even without MBA Mondays. Thanks everyone.
It is selfish in that I want to be schooled but the schooling happens in public so we all get schooled
“When will my son and his friends be able to take the NBA dataset and start running algorithms against it to produce better fantasy picks? When will my daughter and her friends be able to take the TV viewing dataset to decide what TV shows to go back and watch that they missed last year?” The above reminds me a bit of those ads that appeared years ago that talked about a boy and his baseball card collection and where it might lead him in the future because of the effort he put into it. On the surface, of course, it comes across as a little out of touch and spoiled. The NBA dataset (even though linked to something I thought people gambled on?) could lead to skills in a particular area of benefit to a child who used it as a basis to learn about something that could provide future benefit or skills, because they had fun in the process. The TV show part I can’t really spin in any positive way.
Speaking of baseball cards: are they a buggy whip?
I don’t know anything about that (I do know that orthodox Jewish kids trade rabbi trading cards) but I can’t imagine that, other than collectors, young kids are into that. All of these are the same; they came from an era where there were few things to occupy your time with other than school:
baseball cards
train collections
model building
stamp collections
mowing lawns
washing cars
shoveling snow
I will repeat my answer to yesterday’s post for this one, too. I think your son and his friends will use Numenta to gain actionable intelligence on their NBA data set.
I guess I need to explain what Numenta is a little. Numenta’s Grok technology recognizes patterns in streaming datasets and makes predictions. The hard part in analyzing data is to create models, test them against data, and improve them iteratively. In short, the hard part is to figure out what to look for. Your son and his friends, unless it is their profession, will never be able to do that. Grok does this automatically, which brings data analysis to the mainstream. All of a sudden, with Grok all you need is data, knowing how to submit this data to Grok (which will become simpler and simpler over time), and a lot of curiosity.
Tools for creation/experimentation/self-expression through data analytics, free. What’s the model? The past two days have been great – real value in the comments across the board. I got into D3 just a bit ago… good to see it mentioned here.
Are you thinking of tools like Statwing? https://www.statwing.com/
will dataset owners be content with the world ‘slicing and dicing’ their data?
it depends. platform governance is key to determining winners.
it is already mainstream and will continue to be; the internet has been an information economy since day one. i believe there are two ways this game is playing out:
1. winner take all — for generic data processing needs for casual users: google and amazon.
2. niche — for superusers, experts, professionals: niche platforms. countless number.
VC model is designed to finance #1, and this is resulting in many malinvestments and will continue to do so until the necessary macroeconomic changes are made. till then, more bubbles. eventually data products made on niche platforms will reach casual users. will the niche platforms be built on technologies made by google or amazon? i believe this is a major question, and increasingly i believe the answer is yes, especially since both companies see this opportunity and are moving accordingly — especially amazon. what are the key enabling technologies that disruptive companies in this space will be built around?
1. RSS
2. microformats
3. probably lots of proprietary technology standards, i.e. kindle file format, .kf8
mainstream tool? Excel. This may seem like a joke at first, but it’s not.
Very true – most of us use less than 5% of the capability of Excel.
So true. Especially when you combine it with SQL Server 2012 data mining or Predixion.
The real power of Excel is user familiarity with the user interface paradigm. Incidentally, this is a HUGE opportunity for MS.
I disagree that familiarity is the real power of Excel. I think what makes spreadsheets intuitive is their interactivity. Every time you touch it, the whole thing updates to reflect what you’ve done. Compare with SQL, where you write a query, check to see if the syntax is right, fix the syntax, then see the final result. And if you’ve done something complex like a subselect or join, you can’t see the intermediate steps except by breaking it up and running as separate queries. Reducing the feedback time is crucial to helping people understand what they’re doing to their data.
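The contrast can be made concrete. A minimal sketch (table and column names are made up for illustration): the spreadsheet-style workflow keeps every intermediate result visible and inspectable, whereas a SQL query with a subselect hides the inner result until the whole thing runs.

```python
# Toy data standing in for two database tables (names are hypothetical).
orders = [
    {"user_id": 1, "amount": 10},
    {"user_id": 1, "amount": 20},
    {"user_id": 2, "amount": 5},
]
users = {1: "Ann", 2: "Bo"}

# Step 1: aggregate per user (what an inner SELECT ... GROUP BY would do).
totals = {}
for row in orders:
    totals[row["user_id"]] = totals.get(row["user_id"], 0) + row["amount"]
print(totals)  # inspectable before moving on, like a spreadsheet column

# Step 2: join the totals against the users table (the outer SELECT).
report = [(users[uid], amt) for uid, amt in sorted(totals.items())]
print(report)
```

In SQL the two steps would be one nested query, and the intermediate `totals` would never be visible without rewriting the query; here each step gives immediate feedback.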
It doesn’t sound as if we disagree.
Great point David. We (Fivetran – Y Combinator 2013) just released a spreadsheet platform that aims to do exactly this – making the more complex data analysis functions like SQL select or join more accessible and in real time by adding them into a spreadsheet ui. Check it out. http://www.fivetran.com
Also, MS is certainly thinking about this. They’ve done some interesting things with the latest version of Excel, like querying databases from your spreadsheet, and even teaming up with Bing to create a search engine for datasets that can be imported directly into the workbook. Though some of that is hidden in add-ons that you have to learn about.
Makes complete sense. And others have developed DS plugins for Excel.
Absolutely. I used to do some cool stuff with it years ago when I was a power-user of it – used to often surprise me – huge scope with Excel education I feel. So under-utilised and in everyone’s hands already.
Interesting point… especially combined with other comments about how data can help tell stories, which is arguably what PPT is about. Maybe we start with the story and work our way back to the data with the help of software. Basically just the opposite of what we’re currently doing with big data.
In the end, data analysis is about asking questions and getting answers. The key is to make it easy to ask a series of questions, so that you figure out what is a good question to ask. The responses need to be in a form that a non-number-crunching audience grasps. Only number-friendly types can see the answers in a spreadsheet.
Ah, but if you look at the data science crowd, their message is that you don’t know what questions to ask in the first place. The answers are in the data and fancy math will reveal patterns you would never think to ask for.To your point about making it easy to ask questions, I think there’s probably a pony that lives somewhere between analytical processing and visualization – something that has just enough semantics to help guide a user’s exploration and story creation but flexible enough to work with any data domain. Might be more of a unicorn.
What stories does this DATA tell?
DATA –> STORY
inductive making

This story can be explained by what underlying DATA?
STORY –> DATA
deductive matching

What purpose-driven social-stories can be dynamically (iteratively) constructed by interconnecting neural-networks of actionable data?
DATA –> STORY CREATION
organic social evolution
“There’s a danger in the internet and social media. The notion that information is enough, that more and more information is enough, that you don’t have to think, you just have to get more information – gets very dangerous.” ~ Edward de Bono.
lots of truth here. in medicine there is constantly a rush for more information (largely because of business models that benefit from this). half the time, instead of quantifying every single part of a person’s body, if they just spent $50 for a high-quality vitamin/mineral supplement and used their common sense instead of eating garbage food they’d be fine.
Hard to express, Kid – I just feel increasingly that our industry – and examples such as you cite – is so insular and meme/echo-chamber driven that we ‘disappear up our own arseholes’ (as we Brits say – unsure how universal an expression that is?)… we’re entering the domain of the financial industry, with more and more abstract products/solutions that are looking for a problem/market after the event. It’s like everything is becoming too much like a Ponzi scheme. Plenty of real problems/needs out there that we should be applying ourselves to.
i agree with you 100%. and i never heard that expression before — i love it! going to start using it 🙂
Lol, cool – it’s a hard expression to exactly define when it should best be used: I recently tried to put it into context in a blog… http://carl-rahn-griffith.t… When it’s a good fit it’s just right! Enjoy using it! 😉
Agreed. I think people who’ve been in the industry a long time lose touch with the average Jane who just wants stuff to do jobs for her.
What we are going through is just a déjà vu replay of what happened at the hype cycle peak of PCs – loads of apps etc. followed, of which 95% were utter crap and died away – an utter waste of time, money and energy. The newer a device is that we are led to believe we can configure to our whims, the more apps etc. spew out to entice us to buy them, try them and waste our money and everyone’s time. This is deeply offensive waste. I blogged about it ages ago; won’t bore you all with the full rant/link now, lol ;-) Mobile in particular is at that phase now re: the hype cycle – the trough of disillusionment is just around the corner. The best recent example has been all that bullshit hype for that new email app I can’t even remember the name of – hundreds of thousands of people tweeting their desperation for it and their angst and lament at being 623,000th in the queue for it. Give me a break – how many bloody people have an email problem of such a magnitude that they see something like this as a life-changing event? I worry for humanity, I truly do. We have disappeared up our arsehole.
.I agree more with you than you do with yourself. We are visiting upon ourselves an increasing degree of complexity in which the increasing degree of complexity becomes the new problem set. At some point, one cannot drink from a fire hose without the hose itself becoming the problem. JLM.
I know nothing, JLM … I can’t even get a job at a supermarket, stacking shelves.
.Yes, you are left only with wisdom. JLM.
.It is even more fundamental than that. If folks just watched what time they ate, exercised just a smidgen and got adequate sleep — things would be much better. JLM.
Nothing correlates more strongly to one of the big 5 diseases than having extra body fat (about 9% for men, 14% for women)
i believe that…..i think one of the reasons people are fat is because they don’t get enough nutrients, so the body keeps creating hunger in hopes you’ll eat something that actually has nutrients instead of the “food” that is marketed
I agree – it’s a misconception that more information and more tools for working with information are going to reduce the need for critical thinking. The reality is that our need to think through what we’re doing and why will need to grow as exponentially as the technologies. (Ex.: “So we have big data tools, why are we using them? Okay, we now know why we’re using them, what did we find? What is that telling us? What was the original question? How does the new data reframe the question? What does the data tell us that we may not have anticipated?” and so on…) Big Data is an invitation to think more, not less.
I don’t know, Jon – the great thinkers of our time did pretty well without all this ‘big data’ bullshit – and I don’t see m/any contemporary great thinkers, even with all this big data and analytics at our disposal… I suspect big data is a crutch for many, and it is used to obviate the need for any critical faculty – let alone for fostering any imagination and philosophy. As with HFT/algobots and the screwed-up and manipulated/detached-from-reality markets, so it will be with our thinking unless we are careful. We’re not known for our wisdom nowadays, sadly. Great for consumer brands, media/ad placements and targeting thereof, and demographics (unless you’re Facebook, which has oodles and oodles of big data and yet still utterly hapless ‘targeted’ ads) et al – but these things don’t exactly enhance the quality of life. Disk is so cheap now that we feel obliged to fill it with anything, and that suddenly becomes big data and has gravitas. There is an old comic routine about how you have a kitchen bin and (of course) you keep filling it, and when that annoys you the modern-day solution is… to get two bins.
Ha… I remember hearing almost exactly the same words about 30 years ago, except he was talking about spreadsheets – people don’t put in nearly the level of critical thought they would if they had to do it on whatever that green-ruled ledger paper was called. I reckon the same is true of someone coding in the cloud vs. creating a stack of punch cards.Although it’s pretty hard to argue against more efficient processing, it would be interesting to see a parallel universe where critical/systems thinking dominated rather than iteration/trial and error. Not saying we’d be further along… but I’m not saying we wouldn’t.
Can’t agree, Carl. Let’s take an example that is admittedly at the currently extreme end of the continuum – computational astronomy. This field is uncovering discoveries in old data sets that no great thinker could have realistically uncovered. This is not controversial. If we now move to a less blindingly obvious example we find the same pattern – bioinformatics. Machines have typed cancers and folded proteins in ways that were opaque to humans. We just aren’t very good at picking obscure patterns out of gigantic data sets. Big data isn’t BS, and humans have limits.
In such an example that’s great – agree, Pete – I should have been more granular. I am primarily cynical about it being used as a template/substitute for knowledge and thought in the areas of commerce and culture, etc. Your example is perfect for its appropriate use/relevance, indeed. Thanks! “A good decision is based on knowledge and not on numbers.” ~ Plato.
Thanks for clarification. And for the foreseeable future = I agree. 🙂
Also computational geometry, biology, linguistics, etc., etc.
Good examples. I think the point is that if you have an objective function of the simple form y = a*x1 + b*x2 + c*x3 … and it turns out that the coefficients are a = 0.374958548, b = 7.3495870457 and c = 759.5704504398504985, these aren’t the kind of values that spring to mind. And it is hard for a human being to imagine that tweaks after the ninth decimal place may make a difference. But this is precisely the kind of thing that we learn by working with huge data sets. To some degree this kind of modeling challenges our ideas about explanation.
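The point above can be sketched in a few lines: given enough data, ordinary least squares recovers coefficients to a precision no human would guess. (The simulated data and the coefficient values, truncated from the comment above, are purely illustrative.)

```python
import numpy as np

# "Unmemorable" true coefficients for y = a*x1 + b*x2 + c*x3.
a, b, c = 0.374958548, 7.3495870457, 759.5704504

# Simulate a data set of 1000 observations of the three inputs.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([a, b, c])

# Least squares recovers the coefficients from the data alone,
# including the digits far past the decimal point.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)
```

With noiseless data the fit is essentially exact; with noisy data the same machinery still pins the coefficients down to a precision that would never "spring to mind."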
Unless there is an analytical solution. But yeah, if you’ve got to prune a billion nodes to arrive there it’s highly unlikely coefficients with even one tenth that precision will just spring to mind.
Pretty much any problem with non-linear relationships doesn’t have an analytical solution. In fact, despite the elegance of analytical solutions, they are a shrinking subset of the problems we have to solve. Wouldn’t spring to my mind, that’s for sure. 🙂
Indeed. I outsource all my thinking to cplex 😉
This just released simple app creates open data from any spreadsheet on Google Drive. http://app.easyopendata.com/
There is a Toronto startup called Canopy Labs http://www.canopylabs.com/ – it’s a brilliant concept. Wojciech used to work at one of the big C companies. He said that he ended up doing the exact same process for almost every client on the data side, but that they were charging as if it was completely customized. He turned that knowledge and insight into a platform for SaaS. I haven’t talked to him since he was early in his launch, so I’m not sure how it has evolved, but I understand that they are starting to overlay social data as well, which is going to be huge for big brand clients in the coming couple of years.
The thing with predictive analytics is that people confuse causation with correlation. To get predictive insights, you need to find the cause, not the correlation. For example: umbrellas and rain, drownings and ice cream, etc.
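The drownings/ice cream example can be simulated in a few lines: both series are driven by a shared cause (temperature), so they correlate strongly even though neither causes the other, and controlling for the confounder makes the correlation vanish. (All data and coefficients here are made up for illustration.)

```python
import numpy as np

rng = np.random.default_rng(1)

# Shared cause: daily temperature drives both series.
temperature = rng.uniform(15, 35, size=500)
ice_cream = 2.0 * temperature + rng.normal(0, 1, 500)   # sales
drownings = 0.5 * temperature + rng.normal(0, 1, 500)

# The two effects correlate strongly with each other...
r = np.corrcoef(ice_cream, drownings)[0, 1]
print(round(r, 2))

# ...but after removing the temperature effect (we know the true
# coefficients because we generated the data), the link disappears.
resid_ice = ice_cream - 2.0 * temperature
resid_drown = drownings - 0.5 * temperature
r_partial = np.corrcoef(resid_ice, resid_drown)[0, 1]
print(round(r_partial, 2))
```

The first correlation is near 0.9; the second is near zero. A purely correlational model would happily "predict" drownings from ice cream sales right up until the confounder changes.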
when will my daughter be able to take the data sets of automotive parts, aerospace parts, and related materials, their use histories, failure rates in real applications and finally purchase everything to build go-kart that will rival a Honda in quality and tesla in technology. …and finally go on to make the flying cars we’ve all been dreaming about 🙂
One great example of where data has been public for years but still inaccessible in meaningful ways is oceanography/bathymetric data. This data has been open for a long time but initially comes as encoded sonar data that is many, many gigabytes in size, and usually has to be processed with either expensive software or not-very-friendly free software like MB-System, with arcane mapping and gridding options. Finally, the gov-sponsored entities that collect it are putting up GeoTIFFs, but that’s not a great solution. Also, some of the more popular data sets are now in Google Earth, but that still doesn’t open them up to numerical analysis. It will be interesting to see what http://www.oneoceancorp.com can do in this area (if anything), but given the price of their data plans I guess I will never get to find out. May be the start of something though…
Hinton & Co at DNNresearch.com, I would guess. Oh wait: https://plus.google.com/u/0… I remember being excited messing around with backprop in the early 1980s. Kids today definitely have more tools. And AWS :) Brainmaker, anyone? Have some dusty DOS floppies somewhere.
I’m still an early student of the intersection of big data science and machine learning, especially the interactive learning systems. My search is for the open web version of big data science. I have come across two presentations recently, the nuggets of interest being pretty deeply buried. Here’s a link to a podcast interview of Jeffrey Heer, who is a data visualization prof at Stanford: http://datastori.es/episode… The nugget is that much of the early big data deals have been about creating “really good black boxes.” I think (hope) there’s a battle line of sorts emerging – big data to serve the black box vs. self-insight. Protocols like OAuth 2.0 are important here as well. I see many of my buddies from early open source days advocating for the open web view.
I’m today attending Om Malik’s awesome Structure/Data conf in NYC http://event.gigaom.com/str… The prevailing thought this year is how much of this stuff today remains a bunch of techie words (Hadoop, NLP, machine learning, etc.), and the theme is how to get away from the vernacular and shift thinking to real-world problem terms – just as you conclude this post with (“platforms”, meaning usage concepts rather than tech concepts). btw, surprisingly, the best-received talk so far was from the CIA’s CTO. He was surprisingly warm, open, funny, and informative.
That’s a timely event, but my take is that it’s focused on the enterprise needs and complicated stuff.To paraphrase Fred’s ask, it would be nice if we could dumb down big data and make it available for the masses, so that innovation happens at the application level.
This sort of thing often does start at the enterprise level. Your 2nd paragraph is exactly what I meant as one of the key conf themes: bringing things out beyond the company, to serve the masses’ needs.
Yup. Big Data = big headaches, today. I thought getting it out and dumbing it down would benefit from crowdsourced solutions and more creative uses.
Big Data requires a Big Spreadsheet => Fivetran.com check it out!
One of my investors sent me this link yesterday that’s relevant to our deal: http://gigaom.com/2013/03/2… When you’re immersed in this stuff it’s easy to geek out, as you say. What helps me is to stay focused on the question I think big data can answer. That drives how, and therefore what, innovations are relevant. It’s hard though. I experience this as something of the wild west. It’s pretty exciting and easy to lose track of a goal.
Great point, and I too am in the wild west:My company (a dev shop) makes the “picks and shovels” for the many gold miners, rather than needing to think much about the mining. It’s easy to just serve up the deep specialties and get lost in the esoteric.To offset this, I actively advise founders on the business building end, and we incubate companies.
Love the gold miners reference!! We’re running pilots of our app with businesses and small business lenders which is focusing just like your advisory work I’d imagine.
Nice reminder for me … Thanks!!
It would be fantastic if you could write a blog post report about Structure/Data conference!
the gigaom team did much better, more comprehensive coverage than I ever could: http://gigaom.com/2013/03/2… Lots of great real-world topics.
Fred – that was the vision of BuzzData, one of the companies that presented at DemoCamp in Toronto when you were there with me a couple of years ago, if you recall Pete Forde. They wanted to enable any data provider to upload their datasets and allow users to work with them. But my understanding is that traction was challenging, so they pivoted twice. Now their business is called LookBook, and they let you drag and drop your artifacts to tell a story – Excel, PowerPoint, Word, PDF files, links, visualizations, infographics and charts, etc. go into a single shareable/publishable doc. http://lookbookhq.com
the vision was right. the product was not.
you are 100% right.
BuzzData is (was?) one of many attempts to build something conceptually obvious: one site with all the world’s data, nicely formatted and documented; an omni-platform. Platforms aspiring to this objective keep appearing and disappearing. They appear because they are great ideas. They disappear because they demand that data publishers upload and maintain data on an external site. Data publishers don’t comply because they have enough work just maintaining the data in their own database, let alone someone else’s. So, if the data won’t come to the platform, the only alternative is that the platform comes to the data. What does that mean? It means that to succeed in building a truly comprehensive data platform, you must ask nothing of data publishers. You have to create a solution that feeds off whatever the publisher is spitting out, regardless of how absurdly the data might be published. That’s what we’re doing at Quandl. We’ve built a sort of “universal data parser” which has thus far parsed about 4 million datasets. We ask nothing of any data publisher. As long as they spit out data somehow (excel, text file, blog post, xml, api, etc.) the “Q-bot” can slurp it up and make it available. Quandl users thus get to access data from 100s of different sources and in 1000s of different formats, all in 1 place and in the format of their choice. Check it out: http://www.quandl.com. Incidentally, William, we’re in Toronto too. We just launched our beta a couple months ago and are getting some excellent traction. Would love to tell you more over a coffee some time…
So, it looks like a google/search of data sets. Great. Can the user manipulate the sets too? Yes, I’d like to learn more about it. Can you pls email me wmougayar AT gmail.
Yup, absolutely. You can do basic data transforms (trimming, sampling, scaling, % changes, etc.) directly on Quandl. Another cool feature is supersets: you can combine datasets from multiple sources quite seamlessly; Quandl does all the work in converting and merging and synchronizing the data. If you want to do more advanced analysis “at home”, you can download every single dataset in the format of your choice (csv, tab, xml, json, etc.) and again Quandl does all the work in converting and re-formatting and cleaning up the data. Finally there’s the API, which I think is perhaps the most powerful feature on Quandl. Every single dataset on Quandl (now at 4 million and counting, from over 300 sources) is accessible via a simple and consistent API. It doesn’t matter where or how the data was originally published. In fact Quandl provides API access even for sources that don’t have their own APIs. So that dramatically increases the scope of what users can do with Quandl: it’s like a “universal API” for data on the internet. Will connect separately via email. Cheers!
Pretty skeptical on the concept of DIY data science. It feels like a rationalization of the huge amount of money thrown at big data storage/processing. No doubt that better math will find its way into software and help improve decisions in specific use cases (as it already is), but that seems like a feature, not a whole new class of mainstream end-user product.Seems that the biggest hurdle to answering the questions we all have is data access, transformation, semantics… that sort of stuff rather than fancy math.
Agree, and agree wholeheartedly with your first paragraph. I think we’ll see vastly improved decision support tools along with a significant cost reduction, but by and large data science will remain an expert domain.
Very interesting that you highlighted data and machine learning. Just the other day I was reading about Sift Science (I’m in no way related to it). It was interesting to see how they are combining machine learning (fighting online spam and scams) with the traditional payment checks (IP, name, location, amount, etc.) to bring a better fraud-fighting technique to e-commerce/CNP transaction types!
we just announced our investment in Sift Science. we are impressed with what they have built.
It’s an interesting idea, but it risks undermining the skills that go into good data science. Extracting value from data requires a fair amount of training in a variety of subjects (i.e., statistics, computer science, the business domain, etc.). Data can easily lead someone into a false sense of security about their decision making (hey, the data said make the bet, so I did). After all, it is easy to be “fooled by randomness.” A good data scientist needs to be able to ask the right questions and also know if the data is capable of answering those questions. And since there is likely no such thing as a standard data set nor question, automating some sort of data science process might only work on the most trivial of problems. But I’m always interested in good innovation, so we’ll see what happens. An interesting idea proposed in the other comments is the democratization of data. Data analysis is always subject to the biases and understanding of the analyst. Allowing more access to data by qualified analysts creates a better opportunity for these biases to be averaged away.
Have seen at least one – exversion.com promising to be the github of data, sharing + revisioning + social editing data sets.
Not quite the “Tumblr of Data Science” (I agree with your assessment wholeheartedly) but this resource is an interesting one for people looking to visualize different types of data: http://selection.datavisual… Most of these are packaged as libraries of code, but no reason they can’t be taken further and built into a service that’s even easier for the average person to use.
Fred, we are a TechStars company that is building the WordPress of data science. Let me know if you’d like to know more.
yes. send me an email please
“putting powerful tools in the hands of “everyman” produces more innovation than can happen when the tools are limited to mathematicians and scientists.” Word. Fred, you are on a roll the last couple of days. You are killing it.
We’re doing something along those lines at http://www.quandl.com . As mentioned upthread by falicon, takingpitches, kidmercury and others, one big problem for DIY data analysis is not the absence of tools, but the lack of data. Even professional data analysts spend hours finding, validating, importing, re-formatting, cleaning, merging and synchronizing messy data from multiple sources every single day. It’s unpleasant, tedious, repetitive gruntwork.

Quandl aims to ease this pain a little bit. We’re building a “search engine for data”. The idea is that users find the precise data that they need on Quandl, and more importantly, they get that data in the precise format they want. It doesn’t matter who published the data, or where, or in what format; Quandl does all the work in cleaning it up, parsing it, and spitting it out.

Where does Quandl fit in the ecosystem? Well, ultimately we envisage a sort of DIY data stack: platforms like Quandl form the acquisition layer, then you have tools like Statwing in the analysis layer, and finally there are apps like Silk in the dissemination layer. You need all 3 layers to really complete the process and add value. It may take some time for the pieces to coalesce but it will happen.

Here’s a bit more about our vision for data on the internet: http://www.quandl.com/about… . Thanks for reading!
At MetaLayer (http://metalayer.com) we’ve been working on providing solutions for ‘drag and drop data science’ for over two years now. We’re completely bootstrapped to date and growing on earned revenue.

Ironically, the biggest problem we faced when speaking to VCs was that no one believed in ‘data science’ in and of itself as a market. The problem is that when you begin to segment areas of the market, the solution becomes far less compelling. Think about the market of ‘Search’ versus the market of ‘Search for Medical Records’. So, I thank you for pointing out this pervasive (and growing) problem.

In regards to how MetaLayer works, it’s pretty simple: just ‘drag and drop’ data onto our Dashboard. We algorithmically parse it and try to make it actionable. If you don’t have your own data, you can use our modules to get some (from Twitter, Facebook, RSS, news articles, etc.).

If this sounds a little abstract, viewing this demo video may help: https://vimeo.com/59237822
Good point about general data science vs application area specific.
When you think about it, we all have THE big data tool for the masses right within us – it is called – wait for it – the brain.

It is amazing how much data it processes to do the simple things we do that we take for granted. If only there were a way for our brain to gloss over all the data sets, such as NBA stats, and just understand everything that needs to be understood. Now that would be cool.
If we all had one brain to draw from that might make sense, but the problem is that your highly educated, well-trained, and hyper-informed brain is not the same as Billy who never graduated college, who doesn’t travel, and who never did well at Math.

The market opportunity is for a product that can exist to empower anyone to work with data, regardless of expertise. But remember, we’re talking about the need for a product, not a process or methodology.

In other words, the equivalent of what search did for document and information retrieval – put it in the hands of a well-trained professional and you get the best results, but if you put it in the hands of a complete novice you still get results… and in many ways that novice can improve their understanding of the process by simply using the tool. But the point is that there are products that democratized search for anyone, regardless of training. This post is making the same case for data science.
Well, the scenario I described above is about how technology can be used to enable that; the technology will take care of brains of all capabilities. I only laid out the use case – I didn’t say there was no tech involved. For example, the user may be wearing Google glasses that can understand queries from the brain directly and display the answers while the user glosses over the data sets (or is plugged into data in the cloud). This sounds like science fiction but it is not that far away.
Check out a company called Ayasdi.
Hmm… they are similar to Palantir – large-scale problems! Kaggle might be more suited.
“So what is the Tumblr or Blogger or WordPress of data science?”

Aaah. Now I understand what you meant by the blogging sites. You were speaking metaphorically; it was ease of use and empowering the masses you were referring to. Excuse me for being so dense.
I was doing some big data research on the aviation industry the other night (flight tracking). I came across this challenge that GE came up with: they announced a competition to come up with algorithms to make flying more efficient and save fuel. They published sample data sets as part of the competition.
Analyzing data sets… I was thinking about this post earlier today and thought of Nate Silver and his FiveThirtyEight blog. I watched the video that Fred posted a few months ago – he talked about analyzing various data sets – I think he predicted/suggested pro football teams should go for more 2-point conversions.
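The 2-point-conversion argument is just an expected-value comparison, which is the kind of thing a hobbyist could check in a few lines. A minimal sketch, using purely illustrative success rates (not real NFL statistics):

```python
# Expected-points comparison: kick the extra point vs. go for 2.
# The success rates below are illustrative assumptions, not real NFL numbers.
def expected_points(points: int, success_rate: float) -> float:
    """Average points earned per attempt."""
    return points * success_rate

kick = expected_points(1, 0.95)       # assumed extra-point success rate
two_point = expected_points(2, 0.50)  # assumed 2-point conversion rate

print(f"kick: {kick:.2f}, two-point: {two_point:.2f}")
# With these assumed rates, the 2-point try comes out slightly ahead on average.
```

The interesting part is how sensitive the answer is: with a 50% conversion rate the two options are nearly a wash, so the argument turns entirely on what the real rates are, which is exactly the kind of question the dataset would settle.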
The platforms aren’t invented for the masses yet. First B2B channels; then, when they become easy to use and manipulate, they’ll go mainstream. Your son may be the inventor.
Kaggle? Data science as a sport.
they are definitely best suited for this use case.
I’ve been working with our engineering team recently on building a data mining tool that any business person can use to find patterns in their customer feedback to help drive satisfaction, loyalty and retention. It’s a cool analytics and data visualization challenge to design such a tool that allows for more analysis in less time, and the ability to communicate more patterns to more people with less time and effort. We call it Spotlight, there is a video of the tool here http://www.allegiance.com/p…
I was doing exactly that over the last few years and I am happy that you have mentioned this topic. My theoretical playground for data science was posted as Egont, a Web Orchestration Language and Egont Part II. The idea is very simple: it is like a web spreadsheet or Yahoo Pipes, in the sense that you connect different elements and they are “recalculated” only when there is a change, but it adds social namespaces (sorry for the buzzword). For example, I can reference the movies from my friends to give me recommendations; when a friend adds a movie, the depending operations are recalculated. You can currently obtain similar results by interconnecting Google Spreadsheets, but I don’t think that’s the correct UI (nor is Yahoo Pipes) to achieve that.

In my opinion the success factors for data science experimentation will be: finding interesting information in the long tail that is difficult (time consuming) to search for, serendipity, and emergent behavior from the system complexity. In some ways it moves the current state of crowdsourcing forward: instead of people contributing data and leaving the processing to the cloud, the people build the machine by connecting small pieces.

It is no coincidence that the basis of this system is receiving recent research interest: Advances in IC-Scheduling Theory: Scheduling Expansive and Reductive Dags and Scheduling Dags via Duality – and one of these guys works at Google Research.
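The recalculate-on-change idea can be sketched in a few lines of code. This toy cell graph is a hypothetical illustration of the concept (the names and structure are mine, not Egont’s actual design): changing a source cell recomputes only the cells that depend on it, like a spreadsheet.

```python
# Toy "web spreadsheet" cells: changing an input recalculates its dependents.
# This is a naive sketch of the dependency-graph idea, not Egont's implementation.
class Cell:
    def __init__(self, name, compute=None, value=None):
        self.name = name
        self.compute = compute    # function of dependency values; None for inputs
        self.value = value
        self.deps = []            # cells this cell reads
        self.dependents = []      # cells that read this cell

    def depends_on(self, *cells):
        for c in cells:
            self.deps.append(c)
            c.dependents.append(self)

    def set(self, value):
        self.value = value
        self._propagate()

    def _propagate(self):
        # Recompute downstream cells; a real system would topologically sort
        # the DAG to avoid repeated work, but this naive version works on chains.
        for cell in self.dependents:
            cell.value = cell.compute([d.value for d in cell.deps])
            cell._propagate()

# Example: a friend adds a movie, and a "recommendations" cell updates.
alice = Cell("alice_movies", value=["Alien"])
bob = Cell("bob_movies", value=["Heat"])
recs = Cell("recommendations", compute=lambda vals: sorted(set(sum(vals, []))))
recs.depends_on(alice, bob)

alice.set(["Alien", "Blade Runner"])  # triggers recalculation of recs
print(recs.value)  # → ['Alien', 'Blade Runner', 'Heat']
```

The social-namespace part of the idea corresponds to those input cells being owned by different people; everything downstream updates automatically when any of them changes.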
I think you’ve seen it already, but check out https://www.quantopian.com. We’re making algorithmic trading available to a world of people who have never had access to it before. Read our manifesto, too: http://blog.quantopian.com/…
Personal in-stream filtering (which tweets, pages, and emails to bother reading). Do you trust Twitter and Facebook to do this?
I think companies that already have the data (for example, a sports site) could let users have access to tools to see the data in different ways. A drawback with having a generic site (like Tumblr) is that it may take a long time to upload all the data.
Here’s big data for food, mentioned by @pointsandfigures on GG’s blog. http://getfoodgenius.com/
I’m hoping somebody can figure out the Total-Team-Tattoos to Wins ratio in NCAA basketball to aid me in completing my brackets.
Some new tools for me in the discussion, and great points about the data set and data relevancy. I’m surprised no one (that I could see) mentioned Tableau (http://tableausoftware.com/) or Excel 2013’s (O365’s) new BI capabilities (http://www.microsoft.com/en….
Isn’t this platform called Wolfram Mathematica?
DIY Data tools for a very non-technical industry, Market Research, is a great summary of what we do. @fredwilson:disqus let me know if that’s of interest.
FWIW, your kids can probably do most supervised learning reasonably well at home right now just by using Excel’s Solver and the Analysis ToolPak.
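What Solver does for a fit like that is just error minimization: pick the parameters that make the squared error smallest. A minimal sketch of the same idea, with made-up data and plain gradient descent standing in for Solver’s optimizer:

```python
# Fit y ≈ a*x + b by minimizing the sum of squared errors, which is what
# Excel's Solver does when you point it at an error cell. The data below
# is made up for illustration (roughly y = 2x with some noise).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

a, b = 0.0, 0.0
lr = 0.01  # step size; small enough that the descent converges here
for _ in range(20000):
    # gradients of the sum of squared errors with respect to a and b
    grad_a = sum(2 * (a * x + b - y) * x for x, y in zip(xs, ys))
    grad_b = sum(2 * (a * x + b - y) for x, y in zip(xs, ys))
    a -= lr * grad_a
    b -= lr * grad_b

print(f"a ≈ {a:.2f}, b ≈ {b:.2f}")  # → a ≈ 1.94, b ≈ 0.15
```

In Excel you would put the squared-error sum in one cell, the parameters in two others, and tell Solver to minimize the former by changing the latter; the loop above is the same search written out by hand.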
I was fortunate when I was growing up to have a father who was an isotope hydrologist. He had all manner of datasets relating to rainfall, water flow, catchment levels, etc., as well as isotope spectra of water samples and meteorological readings. The data was always messy, noisy and incomplete. A large amount of his job was turning these datasets into models that would be able to predict the reaction of complex environmental systems to changes and shocks, which were used amongst other things to help inform policy on water use in arid countries. We spent a lot of time together exploring the data and imagining how to model the systems; this is surely responsible for my decision to make a career in data-driven science.

Back then (in the 80s) there were no machine learning or data science tools as we know them today, but the elegance and accuracy of the models was sometimes pretty impressive, even by today’s standards. The tools used were nothing more sophisticated than Excel and its predecessors; what was important was the inventive step around modelling the system, and the mental agility to create models compatible with the available tools and data. Often the data itself was of secondary importance to what the model implied about the structure of the world; how can a purely empirical model tell a government that diverting a river will reduce the amount of water available due to its effect on aquifer replenishment, without first diverting the river?

Today, things are different. We have all manner of sophisticated tools that are accessible to the layman, and even Excel has become vastly more powerful due to extra features and more computing power. It is far easier to get access to data; to slice it and dice it; to process it in vast quantities; and to create stunning visualizations of its structure.
These are really powerful tools, but from my point of view, they are not going to engage the hobbyist. What is missing is the ability to add creative input in terms of modelling the data, and especially the ability to easily test out models and see how they perform. The high you get from producing a stunning visualization in an afternoon wears off pretty quickly. The high you get from discovering a base truth about the world that makes your model far better is enough to sustain you for a significant part of your career.

Datasets are available (e.g., Kaggle) but what we are missing is a platform that facilitates three things:

1. Getting data in very easily
2. The creative process around exploring and analyzing, but also imagining, implementing and testing ideas. In other words, modelling.
3. Making something useful out of these models

A lot of the tools out there today either ignore the “creating” part, or try to address it as a black box or a wizard. As an example, the Google Prediction API doesn’t give you any ability to understand what structure was found in your data or to influence the structure of the model. This makes sense from a pure business point of view: most people don’t want to know, and hence the addressable market is much bigger. However, to get hobbyists using these tools, and to have truly engaged commercial users, they need to be far more supportive of the creative process. People are willing to put in extra effort to learn a hobby, and the tools should reward this effort.

Note that under this model, machine learning is part of the process (I find machine learning tools to be very useful in helping me create models) but it is not the main thrust. I see machine learning as being a kind of assembly language for the modelling process.

I think that the best analogy out there at the moment is Lego Mindstorms. It provides enough structure to support creativity, without stifling it by being too prescriptive.
It’s one thing to see a number that says “simulated obstacle avoidance accuracy = 100%”. It’s another thing to see your robot change course to avoid the toy car that your sister rolled towards it, and to understand the algorithms so well that your robot is an extension of your own brain.We need a Lego Mindstorms for data science. That is what we are working on creating at Datacratic.