Why You Should Share Your Big Data

Posted by Taylor Haney on Fri, Apr 26, 2013

As April comes to an end, we are wrapping up our Big Data theme, but we still have a little more awesome content to share. Beth Logan, PhD, Director of Optimization at DataXu, writes a great post about the benefits of sharing big data and how sharing has paid off in other industries. As a reminder, our next guest blog series starts in May and the topic is Mobile. If you are interested in contributing, please e-mail me at taylor [at] MITX [dot] org.

Beth Logan holds a PhD in Speech Recognition from the University of Cambridge, UK. She has worked in many fields including speech, music indexing, computational biology, medical informatics and activity monitoring, and has over 30 publications and 11 issued patents. Since joining DataXu (http://www.dataxu.com/) in 2009 she has turned her talents to online advertising, relishing the vast amounts of data available and the many interesting problems to be solved.

I started working in "Big Data" before it had a name. My roots are in speech recognition, where at the time (and likely still) it was hard to publish at a respectable conference unless you could demonstrate a significant improvement on a large dataset. “Large” meant hundreds of hours of speech, which took many hours to process on a cluster. Fortunately, the community was quite mature, and many such large datasets were readily available with excellent labels and tools to get started (e.g. see http://www.ldc.upenn.edu/).

The availability of such public datasets combined with cheap and powerful computing resources was the breakthrough speech recognition needed to become ubiquitous.  These datasets allowed accurate benchmarking of results across teams, facilitating innovation.  However, they cost tens of thousands of dollars for non-academic institutions, reflecting the high cost and labor intensity of labeling.  Yet the value generated by commercial speech recognition algorithms no doubt outweighs this cost.  This raises the obvious question of what other communities would benefit from large, shared datasets.

As I soon discovered when I moved to other fields, public labeled data became rarer, even if you were willing to pay for it. Typically, what was available (with some exceptions, e.g. www.physionet.org, trec.nist.gov) was “Small Data.” The alternative was to collect and label the data yourself, with all the associated biases and costs. In some fields, even if resources were available to create a good dataset, it was impossible to share due to privacy, legal or confidentiality concerns. The unfortunate result is that few clear paradigms have emerged in those fields, and arguably a lot of R&D dollars are wasted because it’s impossible to compare different algorithms on sufficiently large datasets.

Lack of sharing still applies to much of Big Data today. Organizations have exabytes of data behind firewalls, but they are reluctant to share it. Therefore, they must rely on their internal teams or expensive consultants to mine it for insights, which is far from ideal. No matter how brilliant the people you have working on your dataset, innovation comes from collaboration and competition.

However, if you believe in the benefits of sharing, there is much to be hopeful about. People willingly share all sorts of personal data online, particularly in the personal health space. If you are trying to eat well, you can share your meals online. Just the thought of others reading what you eat makes you accountable. Similarly, if you log gym visits in a public place, or the miles and routes you have run, you will be motivated by the charts of your progress and perhaps by the thought of logging more time than your friend.

We are also seeing companies willing to share. They make their datasets available for others to analyze on Kaggle (www.kaggle.com/) and similar sites. Yes, it is work to anonymize and curate a dataset and people will not necessarily analyze it for free, but the payout can be terrific. As in the field of speech recognition, the ability to compare and contrast different approaches on the same dataset can save years of work.

So next time you create a dataset and have a Big Data problem to solve, consider sharing it.  The results may surprise you.