How Do We Define 'Big Data' And Just What Counts As A 'Big Data' Analysis?


In an era when almost everything is touted as “big data,” how do we define just what we mean by “big data,” and what precisely counts as a “big data” analysis? Does merely keyword searching a multi-petabyte dataset count? Does using a date filter to extract a few million tweets from the full trillion-tweet archive count as “big data?” Does running a hundred-petabyte file server, or merely storing a hundred-petabyte backup, count? What exactly should count as “big data” today?

I used to open my data science talks back in 2013 by saying I had just run several hundred analyses the previous day over a 100-petabyte database totaling more than 30 trillion rows, with more than 200 indicators incorporated into the analysis. When I would ask the audience whether this counted as a “big data” analysis, there was typically unanimous assent.

However, when I noted that what I had done was run a bunch of Google searches, and that every day people all over the world were running billions of identical analyses over Google’s 100-petabyte index, the audience usually changed its mind and argued that this was clearly not a “big data” analysis but merely “search.” Indeed, it seems hardly satisfying to argue that a 10-year-old running a Google search should count as a bleeding edge “big data” analyst.

What about Twitter? With just over a trillion tweets sent in its history as of spring 2018, growing at the time by around 350 million tweets a day and spanning text, images, audio and video, Twitter would seem a nearly textbook example of the traditional “three V’s” of big data: volume, velocity and variety.

The problem is that the overwhelming majority of all Twitter analyses look at just a microscopic subset of the full trillion-tweet archive. Take a researcher who runs a keyword search to identify every tweet since Twitter’s founding containing a certain term and comes up with a dataset of 100,000 tweets. As with Google searches, does merely keyword searching a trillion tweets really count as performing a trillion-tweet “big data” analysis? If the actual analysis involves performing sentiment mining on the matching tweets, then at the end of the day that analysis is only being performed on 100,000 tweets. All of the resulting findings, patterns and results represent not the contents of a trillion tweets, but rather just 100,000 of them.

Moreover, many commercial social media analysis platforms that offer advanced tools like sentiment mining or topical tagging apply those algorithms only to a random sample of the total results, sometimes as few as 1,000 randomly selected tweets out of the entire result set.

Does a keyword search of a trillion tweets that yields a set of 100,000 results, of which just 1,000 randomly selected tweets are ultimately analyzed, really count as a “big data” analysis?
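To make that concrete, here is a minimal sketch in Python of the filter-then-sample workflow described above, using a hypothetical in-memory list of tweet records and a toy word-list scorer standing in for a real platform’s sentiment model: the keyword filter may scan the whole archive, but the sentiment analysis itself only ever sees the sampled handful of tweets.

```python
import random

def keyword_search(tweets, keyword):
    """Filter step: scans the archive but matches only a tiny slice of it."""
    return [t for t in tweets if keyword.lower() in t["text"].lower()]

def score_sentiment(text):
    """Toy word-list scorer; a real platform would call its own model here."""
    positive = {"great", "love", "good"}
    negative = {"bad", "hate", "awful"}
    words = set(text.lower().split())
    return len(words & positive) - len(words & negative)

def analyze(tweets, keyword, sample_size=1000):
    matches = keyword_search(tweets, keyword)                         # e.g. 100,000 of a trillion
    sample = random.sample(matches, min(sample_size, len(matches)))   # e.g. 1,000 of 100,000
    return sum(score_sentiment(t["text"]) for t in sample) / max(len(sample), 1)
```

Whatever the size of the archive being filtered, the numbers that come out of `analyze` describe only the 1,000 sampled tweets.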

What about the commercial world? In 2013 Facebook’s internal enterprise data warehouse held more than 250PB of data. Yet, according to the company, just 320TB of that data was queried per day. In short, while Facebook held an impressive 250 petabytes of data at the time, only around a tenth of one percent of it was actually queried each day. The rest simply sat archived at rest.

This raises the question of what actually matters: the size of the underlying dataset, the amount of data actually touched by a query, the complexity of the query, or the size of the results that come back?

In Facebook’s case in 2013, having a 250PB warehouse was quite impressive and still is even by 2019 standards. From the standpoint of the size of the underlying dataset, Facebook’s analyses are clearly “big data” if data size alone is our metric.

What if we look only at the amount of that data actually touched by queries each day? Scanning 320TB is still quite impressive. However, divided among the roughly 850 employees who ran those queries each day, it works out to an average of just 376GB per user per day, which, while still large, is far less notable. Saying your data mart runs hundreds of terabytes of queries per day may be impressive. Saying each member of a small group of employees runs a few hundred gigabytes of queries per day is not.
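As a back-of-the-envelope check on those figures, a sketch only, using decimal units and the numbers quoted above:

```python
TB, PB = 10**12, 10**15   # decimal units

warehouse = 250 * PB        # total warehouse size, 2013
scanned_daily = 320 * TB    # data touched by queries per day
employees = 850             # people running those queries each day

print(f"{scanned_daily / warehouse:.3%} of the warehouse touched per day")  # ~0.128%
print(f"{scanned_daily / employees / 10**9:.0f} GB per employee per day")   # ~376 GB
```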

Facebook is not alone in this: most analyses consider only a small fraction of the dataset they are nominally examining. A single month of the Twitter Decahose in 2012 contained 2.8TB of uncompressed JSON. Yet most Twitter analyses are likely to focus only on the text of those tweets, which accounts for just 112.7GB, around 4% of the entire dataset. Indeed, this is one of the reasons production analytic platforms like Google’s BigQuery use columnar storage formats.
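The benefit is easy to see with any columnar format. The sketch below uses Apache Parquet via pyarrow purely as an illustration (the file name and schema are hypothetical, and the analyses discussed here aren’t tied to any particular library): requesting only the text column means only that roughly 4% slice of the bytes ever needs to be read.

```python
import pyarrow.parquet as pq

# Columnar read: only the bytes backing the requested column are fetched,
# so a text-only analysis pays for the ~112.7GB of tweet text,
# not the full 2.8TB of records. (Hypothetical file name for illustration.)
table = pq.read_table("decahose_2012_month.parquet", columns=["text"])
texts = table.column("text").to_pylist()
```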

Does the complexity of the analysis matter? If the majority of those 320TB of queries per day were merely numeric range searches, does numeric searching of 250PB count as “big data” in a way that a Google keyword search of 100 petabytes does not?

In contrast, would running a massive neural network that requires an entire 100 petaflop TPU pod to execute, but applying it to just a few gigabytes of input data, count as a “big data” analysis? Is it the size of the data being analyzed or the complexity of the analysis being performed on it that counts?

Is it the size of the results that matter? A massive neural network might take days to run on a multi-petaflop system and examine tens of petabytes of data but yield just a single “go / no-go” result. Does that still count as “big data?” Given that many "big data" analyses are designed to extract simple findings like timelines or "go / no-go" results from massive piles of input data, it would seem the size of the output data would be a less than satisfactory metric for assessing what precisely counts as a "big data" analysis.

These feed into the larger question of “big data management” versus “big data analysis.” I once sat on an advisory board with the CIO of a Fortune 50 company who argued that his company was at the bleeding edge of the big data revolution in its industry because it had petabytes of data. Yet, when I asked what accounted for all that data, he said it was the backup images of its tens of thousands of desktop and laptop computers, and that they almost never actually accessed the data.

Maintaining an on-premises multi-petabyte storage infrastructure certainly can be a major endeavor and require specialized hardware and software engineering. At the same time, keeping a few petabytes in cold storage with just a handful of accesses per year involves very different engineering requirements than building a realtime analytic platform that can perform complex analyses over petabytes in minutes, thousands of times per day.

Similarly, by 2013 CERN had archived more than 100PB of data. However, only 13PB of it was stored on disk, with the remaining 88PB spread across eight robotic archival tape libraries. Does storing data on tape, where it must be staged back onto disk with extremely long latencies before it can actually be used, still count as “big data?” Again, the answer revolves around whether “big data” covers only the analysis of large datasets or also the operational complexities of storing them, whether or not they are ever accessed.

Does storing 100PB count as “big data” if it is in the form of a file server, rather than cold storage? By 2012 Facebook stored more than 100PB of photos and videos from its users in what amounted to a giant file server. Data on a file server is actually accessed, rather than sitting in cold storage, but somehow it doesn’t seem satisfying to count serving files as a “big data” analysis.

Putting this into perspective, five or so years ago Google, Facebook and CERN each had datasets on the order of 100PB. Google’s 100PB was used for keyword search. CERN’s was mostly cold-stored on tape. Facebook’s was on disk but accessed as a file server. Do any of these, or all three, count as “big data?”

As companies increasingly outsource “big data storage” to the commercial cloud, can a company claim to be “big data” if it just stores a few petabytes in Google or Amazon’s cold storage cloud offerings? If the company no longer has to worry about the engineering and operational requirements of managing all that data, does it cease to be a “big data” company if it never actually does anything with the data it stores in the cloud?

To put it another way, if a Fortune 50 company builds an on-premises five-petabyte tape backup system for its global desktop backup program, it can at least argue that it is managing a petascale storage fabric. On the other hand, if all those desktops simply upload their backup images directly to the public cloud, it is the cloud provider that is running the petascale storage fabric. Can the company still claim to be a “big data” company if it outsources everything?

Of course, this is the point of the commercial cloud, to outsource petascale storage and analytics to the companies that have pioneered it. Why manage a petascale analytics infrastructure when you can simply let a company like Google run it for you with BigQuery and bring to bear the collective engineering, performance and analytic creativity of its workforce?

Muddying the waters even further, when tallying “big data” sizes, do we count all of the data generated, or just the data actually recorded to disk? In 2013 the Large Hadron Collider generated more than a petabyte of data per second, an immense volume even by 2019’s standards. However, that data wasn’t actually stored. A specialized prefiltering process selected just 1/10,000th of the stream, yielding roughly 100 gigabytes per second, which is still significant, and that in turn was cut by a further 99% to just one gigabyte per second. From a petabyte per second of raw data down to a gigabyte per second actually archived, only around 25 petabytes per year ultimately needed long-term storage.
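The reduction cascade is easier to follow as arithmetic. A minimal sketch using the per-second rates quoted above, in decimal units (the roughly 25PB archived per year also reflects that the collider does not record around the clock):

```python
PB, GB = 10**15, 10**9

raw = 1 * PB                    # ~1 PB of detector data generated per second
prefiltered = raw // 10_000     # prefilter keeps 1 in 10,000 -> ~100 GB/s
archived = prefiltered // 100   # a further 99% cut -> ~1 GB/s

print(prefiltered // GB, "GB/s after prefiltering")  # 100
print(archived // GB, "GB/s actually archived")      # 1
```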

Should we rate “big data” projects by their potential data sizes or their actual data sizes? A company that instruments its vehicle fleet with GPS trackers could theoretically record the location of each vehicle millions of times per second and generate petabytes of data per day. In reality, GPS hardware simply doesn’t update that fast, and such high-precision data would likely be of little use, so recording a fix every few seconds or every minute is a far more sensible cadence. Should the company be viewed as analyzing a theoretically multi-petabyte-per-day data stream or merely the gigabyte-per-day stream it actually records?
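A rough comparison of the two rates, with a hypothetical fleet size and record size chosen only for illustration:

```python
GB, PB = 10**9, 10**15

vehicles = 10_000          # hypothetical fleet size
record_bytes = 100         # rough size of one timestamped GPS fix
seconds_per_day = 86_400

# Theoretical: a (physically implausible) million position updates per vehicle per second
theoretical = vehicles * 1_000_000 * seconds_per_day * record_bytes
# Actual: one position update per vehicle per minute
actual = vehicles * (seconds_per_day // 60) * record_bytes

print(f"theoretical: {theoretical / PB:.1f} PB/day")  # ~86.4 PB/day
print(f"actual:      {actual / GB:.2f} GB/day")       # ~1.44 GB/day
```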

Does data have to be stored in digital format for it to be considered “big data?” By 2012 Google had scanned and digitized more than 20 million books, producing an immense archive of scanned page imagery, OCR’d text and search indexes. At the same time, the traditional academic library is more often associated in the public mind with dust and obsolescence than with bleeding edge “big data” technology. Why might we consider Google Books a “big data storage” application, while the libraries it draws from, and of which it represents just a small fraction, are not?

Putting this all together, what does it mean to do “big data” in 2019? Is it the size of the underlying dataset that counts, or the data actually used in a query? Do we count the theoretical data size or just what is recorded to disk? Does storing data count, or only analyzing it? Does it still count if it is outsourced to the commercial cloud, or if it isn’t in digital form? In the end, there are no easy answers, but as we increasingly tout everything and anything as “big data,” it is worth stepping back to ask what precisely we mean.