Big data, big assumptions
While David Cameron talked vaguely about how we needed to embrace Big Society, technology firms talked equally vaguely about how the future was Big Data. The poster child of Big Data was a system for predicting flu outbreaks, but researchers at Northeastern University in Boston reckon the accolades were premature and the predictions were almost always wrong.
What is Big Data? It is a term applied to the large and often unstructured data sets which have arisen as a result of networking the planet, data sets so large that traditional data analysis techniques are no longer viable. For instance, consider the data contained in millions of tweets each day. It has no formal database structure or well-defined values. One person might tweet that the sun is shining over Sheffield, another might tweet that it's raining in Runcorn, whilst a third might want to know who wrote the song "Raining in my heart". If you had some way of sifting through and interpreting all the tweets coming out of the UK, you could use them to produce a real-time weather report.
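To make the sifting problem concrete, here is a toy sketch in Python. It is an illustration only: the word lists, place names and the extract_report function are invented for this example, and a real system would need far more than keyword matching. It shows why the song query has to be filtered out rather than counted as rain.

```python
import re

# Hypothetical, hand-picked vocabularies for illustration only.
WEATHER_WORDS = {"sun": "sun", "shining": "sun", "sunny": "sun",
                 "rain": "rain", "raining": "rain", "snowing": "snow"}
KNOWN_PLACES = {"sheffield", "runcorn"}

def extract_report(tweet):
    """Return (place, condition) if the tweet names both, otherwise None."""
    words = re.findall(r"[a-z']+", tweet.lower())
    # Take the first weather word and the first known place, if any.
    condition = next((WEATHER_WORDS[w] for w in words if w in WEATHER_WORDS), None)
    place = next((w for w in words if w in KNOWN_PLACES), None)
    if condition and place:
        return place, condition
    return None

if __name__ == "__main__":
    tweets = [
        "The sun is shining over Sheffield",
        "It's raining in Runcorn",
        'Who wrote the song "Raining in my heart"?',
    ]
    for tweet in tweets:
        print(tweet, "->", extract_report(tweet))
```

Insisting on a known place name discards the song lyric here, but it would be fooled just as easily by a tweet about a band called Sheffield Rain. That is the interpretation problem in miniature.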
Clearly it takes exceptional computing resources to process such masses of data, and even more so if you need results in time to do anything useful with them. However, we now live in a world of distributed networks and cloud computing, which enable such problems to be split up into subtasks and analysed in parallel. One of the long-held ideas in computing is that sufficiently powerful computers could digest these massive collections of data and give us insights that we'd never spot for ourselves, or find those insights and trends much quicker than we could.
Back in 2008, Google began work on "Flu Trends" as an example of what could be achieved with Big Data. Identifying flu outbreaks early allows better resource allocation by medical services, and spotting the signals is just what Flu Trends claims to do. The dataset in question was the billions of search engine queries which Google handles each day. Reasoning that when people start showing symptoms of flu they will typically search for health information about it, Google used its computing resources to find the correlations between search terms and flu outbreaks. It said that with this approach it could identify flu outbreaks in communities within a day of them starting, with something like 97% accuracy. This is much faster than the government departments for public health, which rely on the much slower process of people visiting family doctors and doctors reporting back to their health authorities, who in turn report data to government, where it is eventually collated. If Flu Trends works, it is a valuable tool.
The problem, according to lead researcher David Lazer, is that although Flu Trends is often cited as an example of what can be achieved with Big Data, the predictions that it has made are frequently wrong. Its predictions were wrong in 100 of the 108 weeks of the researchers' study, and it often predicted around twice as many flu cases as actually occurred. Furthermore, it completely missed the swine flu pandemic of 2009.
The problem with this approach is that it is too easy to "over-fit" the data, find spurious correlations, and fall into the trap of seeing shapes and patterns which are really just manifestations of randomness. To illustrate this concept, if you had enough raw data about people in the UK and did enough cross-tabs, you might find that people who drive red cars and listen to Abba are most likely to win the lottery. That may well have been the case, but one is not a consequence of the other. Changing to a red car will not increase your chances of picking winning numbers, and scooping the jackpot will not make you want to rush out and purchase the DVD of Mamma Mia. No matter how good the correlation looks in your historic data, there's no real reason to suppose the same random pattern will pop up again year after year or that the correlation has any predictive value.
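The red car effect is easy to reproduce with nothing but random numbers. The Python sketch below is a made-up illustration, not the method used by Google or by the researchers: the function names, sample sizes and seed are all assumptions. It invents 200 yes/no attributes for 200 imaginary people, picks the attribute that best correlates with their (equally random) winnings in the first half of the people, then checks the same attribute on the second half.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5

def spurious_demo(n_people=200, n_attributes=200, seed=1):
    """Find the best-looking attribute in one half, then test it on the other."""
    rng = random.Random(seed)
    # Winnings are pure noise: nothing below can genuinely predict them.
    winnings = [rng.gauss(0, 1) for _ in range(n_people)]
    # Each attribute (red car, Abba fan, ...) is an independent coin flip.
    attributes = [[rng.randint(0, 1) for _ in range(n_people)]
                  for _ in range(n_attributes)]
    half = n_people // 2
    # "Historic data": search every attribute for the strongest correlation.
    train = [pearson(a[:half], winnings[:half]) for a in attributes]
    best = max(range(n_attributes), key=lambda i: abs(train[i]))
    # "Next year": the same attribute, measured on people not used in the search.
    holdout = pearson(attributes[best][half:], winnings[half:])
    return best, train[best], holdout

if __name__ == "__main__":
    best, train_r, holdout_r = spurious_demo()
    print(f"Best attribute: #{best}")
    print(f"Correlation in historic data: {train_r:+.2f}")
    print(f"Correlation in fresh data:    {holdout_r:+.2f}")
```

Because the search tries 200 attributes, one of them will usually look respectably correlated in the historic half purely by chance. On the fresh half the same attribute typically shows a correlation close to zero, which is the point: the pattern was found, not real.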
Big Data certainly does contain big potential for addressing big social issues, but we are still some way from being able to realise that potential in full. For more information on the issues with Flu Trends, there is an excellent podcast from Science magazine which features an interview with David Lazer.
24th April 2014
This article comes from the SKILLZONE email newsletter, published monthly since January 2008, and covering topics related to technology and the internet. All articles and artwork in the SKILLZONE newsletter are original content.