Big data won’t save us


There’s a lot of focus on “big data” these days after the recent Facebook IPO. The term is becoming as ubiquitous as “the cloud.”

There’s a great line in Michael Wolff’s article, “The Facebook Fallacy,” at the MIT Technology Review. “The company knows so much about so many people that its executives are sure that the knowledge must have value.” That sums it up in a nutshell – “We have so much of this stuff, it MUST be valuable. We just don’t know what that value is yet.” This is a faith-based business plan, not one based on any demonstrable facts.

Because so many companies have bet the farm on finding the gold supposedly buried in all this data, they are anxious to apply the approach to everything. This has led to the impulse to collect and analyse every click, twitch, and glance that can possibly be collected, as well as the belief that data set analysis can be applied to absolutely everything and is the answer to every business problem. Hack Days are springing up like flowers, full of earnest young developers looking to do interesting things with data sets.

Here’s the problem – these data sets are not the essence of a business or person, they are simply the byproduct of activity. In other words, detritus. These big data surfers are the electronic equivalent of dumpster divers. A bit like A. J. Weberman, most famous for writing a book about what he found digging through Bob Dylan’s trash.


Yes, but what does this have to do with making programmes, you may ask?

Because audio and video production has migrated to computer-based systems, assumptions have been made that it has become IT-like and data driven, as most things using computers have been up to now. So it’s assumed that things like the cloud and big data can be applied to media. The jury’s still out on the cloud, but there are elements that look promising. Big data is another story.

The production of audio and video material generates massive amounts of data in the form of media files, but it generates very little textual information of the type usually found in big data’s data sets. Essentially, we’re dealing with “large data” not “big data.”

And this is where the mismatch arises. The discipline of data analysis takes a reductionist, text-centric view of the world – essentially believing that reality arises from data, as opposed to data describing reality. This approach does not work with media files, because any textual data associated with them is not part of their essence and must be created after the fact via notoriously unreliable text to speech and pattern recognition processes, or more often by hand. A video file can easily be multiple gigabytes in size, yet have associated metadata that can be measured in kilobytes. In the big data perspective of the world, a 10 petabyte library of media would still be a small data set.
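To put the gigabytes-versus-kilobytes point in perspective, here is a back-of-the-envelope sketch. The library size, average file size, and per-asset metadata figures are illustrative assumptions, not measurements from any real archive:

```python
# Toy illustration (hypothetical numbers): how little searchable text
# a large media library actually yields.
PB = 1000 ** 5   # one petabyte, in bytes (decimal)
GB = 1000 ** 3
KB = 1000

library_media_bytes = 10 * PB      # assumed total size of the media files
avg_asset_size = 5 * GB            # assumed programme-length video file
avg_metadata_per_asset = 50 * KB   # assumed hand-entered textual metadata

num_assets = library_media_bytes // avg_asset_size
total_metadata_bytes = num_assets * avg_metadata_per_asset

print(f"assets in library: {num_assets:,}")
print(f"searchable text: {total_metadata_bytes / GB:.0f} GB")
print(f"text as share of media: {total_metadata_bytes / library_media_bytes:.5%}")
```

Under these assumptions, 10 petabytes of media boils down to roughly 100 gigabytes of text – about a thousandth of one percent of the library – which is indeed a small data set by big data standards.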

Because most metadata is generated by hand, and because time and budgets are tight, only limited, purely essential metadata is recorded. Therefore, big data tricks can’t be used to tease out patterns and relationships. I can easily do a search on “Matt Smith,” or “Doctor Who,” and get very similar results, but it is highly unlikely I could search on the phrase, “Matt Smith as Doctor Who but without the tweed jacket,” and get any useful result unless, by a quirk of fate, a PA with OCD happened to enter that metadata.
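The failure mode is easy to demonstrate with a toy catalogue. The clip entries and metadata strings below are entirely hypothetical, and the search is the naive keyword matching typical of an asset-management system: a query only succeeds if someone happened to type the words in:

```python
# Hypothetical catalogue: sparse, hand-entered metadata for a few clips.
catalogue = [
    {"id": "clip001", "metadata": "Doctor Who, Matt Smith, TARDIS interior"},
    {"id": "clip002", "metadata": "Doctor Who, Matt Smith, location shoot"},
]

def search(query: str) -> list[str]:
    """Naive keyword search: every query term must appear in the metadata."""
    terms = [t.strip().lower() for t in query.split()]
    return [
        clip["id"]
        for clip in catalogue
        if all(term in clip["metadata"].lower() for term in terms)
    ]

print(search("Matt Smith"))    # matches both clips
print(search("tweed jacket"))  # matches nothing: no one entered that detail
```

A search on the actor’s name finds every clip, but the query about the jacket returns nothing – the information exists in the pictures, just not in the text.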


What the broadcasting industry needs is a sound and vision approach to analysis of “large data.” Non-textual ways of indexing and searching media are required. Face recognition and music matching applications are just scratching the surface, but there is much more possible. Pattern recognition, spatial positioning, geotagging, and timestamped telemetry feeds would be a good start, but it needs to go beyond that and take into account the ways we interpret the world around us with our five senses. An ideal search interface would involve acoustic cues, haptic interfaces, and visual references.

I look forward to what the future may bring.