What is ‘big data’?

Every time you make a purchase online, you're adding to the data stream.
©Monkey Business/Thinkstock

In a way, big data is exactly what it sounds like -- a lot of data. Since the advent of the Internet, we've been producing data in staggering amounts. It's been estimated that in all the time leading up to the year 2003, only 5 exabytes of data were generated -- that's equal to 5 billion gigabytes. But from 2003 to 2012, the amount reached around 2.7 zettabytes (or 2,700 exabytes, or 2.7 trillion gigabytes) [sources: Intel, Lund]. According to Berkeley researchers, we are now producing roughly 5 quintillion bytes (5 exabytes in the decimal units used above, or around 4.3 if an exabyte is counted in binary terms as 2^60 bytes) of data every two days [source: Romanov].
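The unit conversions behind these figures can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes decimal (SI) prefixes, where a gigabyte is 10^9 bytes and an exabyte is 10^18 bytes, and the function and variable names are made up for this example. It also shows how 5 quintillion bytes comes out near 4.3 when measured in binary units of 2^60 bytes.

```python
# Decimal (SI) byte units. This is an assumption; the cited sources do not
# say which convention they use.
GIGABYTE = 10**9
EXABYTE = 10**18
ZETTABYTE = 10**21

# Binary counterpart of the exabyte: an exbibyte is 2**60 bytes.
EXBIBYTE = 2**60


def convert(num_bytes, unit):
    """Express a byte count in the given unit."""
    return num_bytes / unit


pre_2003 = 5 * EXABYTE           # estimate for everything before 2003
by_2012 = 27 * ZETTABYTE // 10   # 2.7 zettabytes, kept as an exact integer
two_days = 5 * 10**18            # 5 quintillion bytes

print(convert(pre_2003, GIGABYTE))   # pre-2003 total in gigabytes
print(convert(by_2012, EXABYTE))     # 2012 total in exabytes
print(convert(by_2012, GIGABYTE))    # 2012 total in gigabytes
print(convert(two_days, EXBIBYTE))   # two-day figure in binary units
```

Keeping the byte counts as integers and dividing only at the end avoids small floating-point errors in the intermediate values.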

The term 'big data' is usually used to refer to massive, rapidly expanding, varied and often unstructured sets of digitized data that are difficult to maintain using traditional databases. It can include all the digital information floating around out there in the ether of the Internet, the proprietary information of companies with whom we've done business and official government records, among a great many other things. There's also the implication that the data is being analyzed for some purpose.


We've generated lots of it ourselves by making online purchases and participating in social media, but that is just the tip of the iceberg. Big data can include digitized documents, photographs, videos, audio files, tweets and other social networking posts, e-mails, text messages, phone records, search engine queries, RFID tag and barcode scans and financial transaction records, though those aren't the only sources. You're producing data every time you do anything online, leaving a digital trail that others can come along and mine for useful information.

The numbers and types of devices that produce data have been proliferating as well. Besides home computers and retailers' point-of-sale systems, we have Internet-connected smartphones, WiFi-enabled scales that tweet our weight, fitness sensors that track and sometimes share health-related data, cameras that can automatically post photos and videos online and Global Positioning System (GPS) devices that can pinpoint our location on the globe, to name a few. Don't forget weather and traffic sensors, surveillance cameras, sensors in cars and airplanes and other things not connected with individuals that are constantly collecting data. The large numbers of electronic devices that generate and upload data have given rise to the term "the Internet of things."

You'll find multiple definitions of big data out there, so not everyone agrees entirely on what is included, but it can be anything anyone might be interested to know that can be subjected to computer analysis. And these large, unwieldy sets of data require new methods to collect, store, process and analyze them.


How Big Data is Analyzed and Used

Server farms like this one in San Jose, Calif. are processing massive amounts of data in an effort to identify patterns and associations.
© Bob Sacha/Corbis

Big data has to be collected, massaged, linked together and interpreted for it to be of any use to anyone. Companies and other entities need to filter the vast amount of available data to get to what's most relevant to them. Fortunately, hardware and software that can process, store and analyze huge amounts of information are becoming cheaper and faster, so the work no longer requires massive and prohibitively expensive supercomputers. Some of the software is becoming more user friendly so that it doesn't necessarily take a team of programmers and data scientists to wrangle the data (although it never hurts to have knowledgeable people who can understand your requirements).

Companies take advantage of cloud computing services so that they don't even have to buy their own computers to do all that data crunching. Data centers, also called server farms, can distribute batches of data for processing over multiple servers, and the number of servers can be scaled up or down quickly as needed. This scalable distributed computing is accomplished using tools and techniques like Apache Hadoop, MapReduce and massively parallel processing (MPP). NoSQL databases have been developed as more easily scalable alternatives to traditional SQL-based database systems.
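To make the MapReduce idea concrete, here is a minimal single-machine sketch of the classic word-count example. It is not Hadoop's API and does not distribute anything; the function names are invented for illustration, and the loop over documents simply stands in for the separate servers that would each handle a batch of data.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # On a real cluster each document (or block of data) would be mapped
    # on a different server; here the "servers" are just loop iterations.
    mapped = []
    for document in documents:
        mapped.extend(map_phase(document))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data is big", "data is everywhere"])
print(counts)
```

Because each map call looks at only one document and each reduce call at only one key, the work can be split across as many machines as are available, which is what makes the approach easy to scale up or down.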


Much of this big data processing and analysis is aimed at finding patterns and correlations that provide insights that can be exploited or used to make decisions. Businesses can now mine massive amounts of data for information about consumer habits, their products' popularity or more efficient ways to do business. Companies can use big data analytics to target relevant ads, products and services at the customers they believe are most likely to buy them, or to create ads that are more likely to appeal to the public at large. Companies are now even starting to do things like send real-time ads and coupons to people's smartphones for places near where those people have recently used their credit cards.

It's not just for making us buy stuff, however. Businesses can use the information to improve efficiency and practices, such as finding the most cost-effective delivery routes or stocking merchandise more appropriately. Government agencies can analyze traffic patterns, crime, utility usage and other statistics to improve policy decisions and public service. Intelligence agencies can use it to, well, spy, and hopefully foil criminal and terrorist plots. News outfits can use it to find trends and develop stories, and, of course, write more articles about big data.

In essence, big data allows entities to use nearly real-time data to inform decisions, rather than relying mostly on old information as in the past. But this ability to see what's going on with us in the present, and even sometimes to predict our future behavior, can be a bit creepy.


Big Data: Friend or Foe?

Your ATM transactions and credit and debit card purchases are part of the data profile that helps companies predict your spending habits.
© Erik Tham/Corbis

The idea of big data makes a lot of us uneasy. It sounds a lot like Orwell's Big Brother, and with ads from companies that seem to know what we're doing and the recent NSA domestic spying revelations, it is understandable that some people find the massive amount of information out there about all of us disturbing.

People can tell lots about you from this data, including your age, gender, sexual orientation, marital status, income level, health status, tastes, hobbies, habits and a whole host of other things that you may or may not want to be public knowledge. They need only have the means and the will to gather and analyze it. And whether they mean well or ill, it can have unintended consequences.


We give up more information than we realize to companies with whom we do business, especially if we use loyalty cards or pay with credit or debit cards. Someone can learn a lot about you just from analyzing your purchases. Target received some press when it was discovered that they could pinpoint which customers were pregnant and even how close they were to their due dates from things like the types of supplements and lotions they were buying. In one case, Target began mailing coupons for baby products directly to a teenage girl, sparking her father's ire against the company for sending her what he considered age-inappropriate ads -- until he found out about her pregnancy [sources: Datoo, Duhigg, Economist].

Governments and privacy advocates have made attempts to regulate the way people's personally identifiable information (PII) is used or disclosed in order to give individuals some amount of control over what becomes public knowledge. But predictive analytics can bypass many existing laws (which mainly deal with specific types of data like your financial, medical or educational records) by letting companies conclude things about you indirectly, and likely without your knowledge, using disparate pieces of information gathered from digital sources. Some companies are using the information to do things like check potential customers' creditworthiness using data other than the typical credit score, which can be good or bad for you, depending upon what they find and how they interpret it. One worry, though, is that this type of personal information can lead to hard-to-detect employment, housing or lending discrimination. And worse yet, it may not always be entirely accurate.

It's also possible for patterns seen in big data to be misinterpreted and lead to bad decisions. As with any tool, the results depend upon how well it is used. Even though math is involved, big data analytics is not an exact science, and human planning and decision-making have to come in somewhere. With huge data sets, judgment calls need to be made about what is important and what can be ignored. But performing big data analytics well can give companies a competitive advantage.

Such analysis can be used for things that are obviously good, such as fighting fraud. Banks, credit card providers and other companies that deal in money now increasingly use big data analytics to spot unusual patterns that point to criminal activity. On an individual account, they can quickly be alerted to red flags like purchases of unusual items, amounts the customer normally wouldn't spend, an odd geographical location or a small test purchase followed by a very large purchase. Patterns across multiple accounts, like similar charges on different cards from the same area, can also alert a company to possible fraudulent behavior.
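The single-account red flags described above can be sketched as a handful of hand-written rules. This is only an illustration under stated assumptions: real fraud systems rely on statistical models trained on far more data, and the `Purchase` class, the `red_flags` function and the threshold values here are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Purchase:
    amount: float
    country: str


def red_flags(history, new, amount_factor=5.0, test_limit=2.0):
    """Return the reasons a new purchase looks unusual for this account.

    history: earlier purchases on the account, oldest first.
    amount_factor and test_limit are made-up thresholds for illustration.
    """
    flags = []
    if not history:
        return flags  # nothing to compare against

    # Rule 1: an amount the customer normally wouldn't spend.
    typical = sum(p.amount for p in history) / len(history)
    is_large = new.amount > amount_factor * typical
    if is_large:
        flags.append("amount far above the account's average")

    # Rule 2: an odd geographical location.
    if new.country not in {p.country for p in history}:
        flags.append("unfamiliar location")

    # Rule 3: a small test purchase followed by a very large one.
    if history[-1].amount <= test_limit and is_large:
        flags.append("small test purchase followed by a large one")

    return flags


history = [Purchase(40.0, "US"), Purchase(55.0, "US"), Purchase(1.0, "US")]
suspicious = red_flags(history, Purchase(900.0, "RO"))
ordinary = red_flags(history, Purchase(45.0, "US"))
print(suspicious)
print(ordinary)
```

In this example the 900.00 purchase trips all three rules, while the 45.00 purchase in the usual country trips none. Spotting patterns across multiple accounts would need data from many customers at once and is beyond this sketch.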

Huge data sets can aid in scientific and sociological research, election predictions, weather forecasts and other worthwhile pursuits. Social media posts and Google searches have even been used to quickly find out where disease outbreaks are occurring. So it's not all bad news. It'll just take a while to work out all the potential problems and to implement laws that would protect us from potential harm. Until then, if you're worried, you might want to revert to cash purchases and watch what you put out there about yourself. Still, we're probably too far down the rabbit hole for any of us to be entirely off the radar.


Lots More Information

Author's Note: What is 'big data'?

Like anything, big data can be used for good, for ill, and for lots of stuff in between. Having ads and coupons targeted at us can be a convenience or a major annoyance. And it's more than a little unnerving how much strangers can learn about us just because we're swiping plastic in their stores or using their cards.

I'd always figured loyalty cards were ways to gather data on our purchases, but until now I hadn't really appreciated how much similar data was being tied to us individually through debit and credit purchases, or the incredible detail about our lives that could be discerned from it. And this isn't even including all the other information about us out there on the Internet.

The thought of my every move being analyzed makes me want to go off the grid somewhat, stop posting online and use cash for everything. Still, most of us, including me, will probably continue on as we are for the sake of convenience. I just might post and buy as though I'm being watched.

Sources

  • Apache. "Hadoop." (Nov. 30, 2013) http://hadoop.apache.org/
  • Arthur, Lisa. "What Is Big Data?" Forbes. Aug. 15, 2013. (Dec. 1, 2013) http://www.forbes.com/sites/lisaarthur/2013/08/15/what-is-big-data/
  • Brooks, David. "What Data Can't Do." New York Times. Feb. 18, 2013. (Dec. 4, 2013) http://www.nytimes.com/2013/02/19/opinion/brooks-what-data-cant-do.html?_r=1&
  • Brooks, David. "What You'll Do Next." New York Times. April 15, 2013. (Dec. 4, 2013) http://www.nytimes.com/2013/04/16/opinion/brooks-what-youll-do-next.html
  • Brust, Andrew. "MapReduce and MPP: Two sides of the Big Data coin?" ZDNet. March 2, 2012. (Dec. 5, 2013) http://www.zdnet.com/blog/big-data/mapreduce-and-mpp-two-sides-of-the-big-data-coin/121
  • Butler, Brandon. "Lessons From Numbers Guru Nate Silver About Working With Big Data." Network World. Sept. 11, 2013. (Dec. 4, 2013) http://www.networkworld.com/news/2013/091113-nate-silver-big-data-273740.html
  • Cox, Ryan. "Nate Silver Skeptical of Big Data Trends, Keys in on Culture." Silicon Angle. Sept. 12, 2013. (Dec. 4, 2013) http://siliconangle.com/blog/2013/09/12/nate-silver-skeptical-of-big-data-trends-keys-in-on-culture/
  • Crawford, Kate and Jason Schultz. "Big Data and Due Process: Toward a Framework to Redress Predictive Privacy Harms." New York University School of Law. Oct. 1, 2013. (Dec. 4, 2013) http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2325784
  • Datoo, Siraj. "Rapid Development in Big Data Analytics Has Led to Increased Investment." Guardian. Nov. 22, 2013. (Nov. 29, 2013) http://www.theguardian.com/news/2013/nov/22/rapid-development-in-big-data-analytics-has-led-to-increased-investment
  • Duhigg, Charles. "How Companies Learn Your Secrets." New York Times. Feb. 16, 2012. (Dec. 2, 2013) http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html?pagewanted=6&_r=3&hp&pagewanted=all&
  • Economist. "Big Data - Crunching the Numbers." May 19, 2012. (Dec. 1, 2013) http://www.economist.com/node/21554743
  • EMC. "EMC: Behind the Big Data Curtain." 2012. (Dec. 1, 2013) http://www.emc.com/campaign/global/big-data/hfbd-infographic-4web-1500.jpg?cmp=micro-big_data-general-emc
  • Fitzgerald, Michael. "Big Data: Big Threat Or Big Lie?" InformationWeek. Nov. 21, 2013. (Dec. 4, 2013) http://www.informationweek.com/big-data-big-threat-or-big-lie/d/d-id/1112668?
  • Gartner. "Big Data." (Nov. 29, 2013) http://www.gartner.com/it-glossary/big-data/
  • Gnau, Scott. "Putting Big Data in Context." Wired. Sept. 10, 2013. (Dec. 4, 2013) http://www.wired.com/insights/2013/09/putting-big-data-in-context/
  • Henschen, Doug. "Big Data Reshapes Weather Channel Predictions." InformationWeek. Nov. 25, 2013. (Dec. 4, 2013) http://www.informationweek.com/big-data/software-platforms/big-data-reshapes-weather-channel-predictions/d/d-id/1112776?
  • IBM. "What is big data?" (Dec. 4, 2013) http://www-01.ibm.com/software/data/bigdata/what-is-big-data.html
  • Intel. "Big Data 101: How Big Data Makes Big Impacts." (Nov. 29, 2013) http://www.intel.com/content/www/us/en/big-data/big-data-101-animation.html
  • Intel. "Combat Credit Card Fraud with Big Data." (Nov. 30, 2013) http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/combat-credit-card-fraud-with-big-data-whitepaper.pdf
  • Intel. "What is Big Data?" (Nov. 30, 2013) http://www.intel.com/content/www/us/en/big-data/big-data-what-is-big-data-landing.html
  • Laney, Doug. "Deja VVVu: Others Claiming Gartner's Construct for Big Data." Gartner. Jan. 14, 2012. (Dec. 1, 2013) http://blogs.gartner.com/doug-laney/deja-vvvue-others-claiming-gartners-volume-velocity-variety-construct-for-big-data/
  • Lund, Susan, James Manyika, Scott Nyquist, Lenny Mendonca, and Sreenivas Ramaswamy. "Game Changers: Five Opportunities for US Growth and Renewal." McKinsey Global Institute. July 2013. (Dec. 3, 2013) http://www.mckinsey.com/insights/americas/us_game_changers
  • MongoDB. "Big Data Explained." (Dec. 5, 2013) http://www.mongodb.com/learn/big-data
  • Naughton, John. "Why Big Data Has Made Your Privacy a Thing of the Past." Guardian. Oct. 5, 2013. (Nov. 29, 2013) http://www.theguardian.com/technology/2013/oct/06/big-data-predictive-analytics-privacy
  • Novet, Jordan. "Here's Why 2014 Will be the Year of the 'Internet of Things.'" Venturebeat. Nov. 25, 2013. (Dec. 1, 2013) http://venturebeat.com/2013/11/25/heres-why-2014-will-be-the-year-of-the-internet-of-things/
  • Romanov, Alex. "Putting a Dollar Value on Big Data Insights." Wired. July 17, 2013. (Dec. 4, 2013) http://www.wired.com/insights/2013/07/putting-a-dollar-value-on-big-data-insights/
  • SAS. "What is Big Data?" (Dec. 1, 2013) http://www.sas.com/big-data/
  • Sicular, Svetlana. "Gartner's Big Data Definition Consists of Three Parts, Not to Be Confused with Three 'V's." Forbes. March 27, 2013. (Dec. 1, 2013) http://www.forbes.com/sites/gartnergroup/2013/03/27/gartners-big-data-definition-consists-of-three-parts-not-to-be-confused-with-three-vs/
  • Zettaset. "What is Big Data and Hadoop?" (Nov. 29, 2013) http://www.zettaset.com/info-center/what-is-big-data-and-hadoop.php