
The History of Big Data: How Big Data Shaped the Current Data Stack

[Image: History museum of data]

Over the past couple of decades, the use of data and data management tools has evolved dramatically, and it continues to evolve. We'll tell the history of Big Data and data management tools, beginning with the introduction of Enterprise Resource Planning (ERP) in the 1960s, how its limitations led to the distributed file system (2003), MapReduce (2004), and the Hadoop File System (2005), and how the push for accessibility and performance led to Pig (2008), Hive (2010), Impala (2013), and, eventually, Spark (2014). Then we'll cover how Data as a Service emerged, in part, to reduce the labor required to connect cloud storage to Spark and Hadoop; examples include BigQuery (2010), Snowflake (2014), and Athena (2016).

And we'll address the modern data tool ecosystem, drawing a direct lineage between the solutions to storage and performance in Big Data and the specialization and fragmentation of the tools for analyzing Big Data. Finally, we'll explain Kaldea's place in this lineage and how Kaldea plans to unify the tools of analysis during this latest fast-growing period for data.

[Figure: Timeline illustrating how companies solved the challenges of analyzing Big Data]

 

Enterprise Resource Planning (ERP) and Early History of Big Data

The ERP system is widely used in traditional companies to manage clients' business activities. ERP is a type of software that helps organize data, but it requires "officers" to manage the client data: clients go through officers who are familiar with the software that writes the data to storage (a database).

[Figure: Workflow showing how clients access data in an ERP system]

Before data analytics and management tools were developed, people (usually on the business side) thought that the client data stored through the ERP system could help them better serve the needs of their clients and therefore make the business more successful. They had a very powerful tool: Microsoft Excel. Excel was great for looking at data in sheets, analyzing it and feeding the results back as additional data, or visualizing the data as charts.

 

[Figure: Workflow showing how analysts used Excel to analyze data and feed it back to the ERP system]

Although Excel worked well, business folks had to ask their engineering team to extract the data into an Excel sheet before they could start analyzing, and that effort was intense and repetitive every time analysts requested it. To make the feedback loop more efficient, traditional companies created tools (like NetSuite) to extract data from ERP systems into Excel sheets.

 

Limitations of the ERP System to Handle Big Data

The ERP system plus extracting data into Excel worked well. It worked so well, in fact, that businesses expanded it to include more data points or changed their business models in ways that led to more complicated data structures. This made data volumes significantly larger and eventually caused issues for ERP and Excel.

With the increased data volume, the ERP app started to crash from:

  1. database storage spaces overflowing 
  2. analysis taking too long, because simply extracting the data into an Excel sheet took much longer

From these limitations, the whole data management tool industry began to evolve:

  1. to replace Excel as the tool of analysis, and 
  2. to enhance storage systems for databases. 

1.0 Transition from OLTP to OLAP, Improving Time-to-Analysis

To address the first limitation and shorten time-to-analysis, the industry introduced the concept of the Data Warehouse. To differentiate between the original databases and Data Warehouses, we refer to the former as Online Transaction Processing (OLTP) and the latter as Online Analytical Processing (OLAP). Although OLAP had existed since the 1970s, its commercialization began in the 90s to improve analysis.

What is a Data Warehouse (OLAP) and how is it different from existing databases (OLTP)? 

OLTP was mainly built to update specific data points by managing data as rows. Row-based management is natural when you store information about people: when you retrieve a record, you get each person's name, phone number, address, and age together. OLAP, on the other hand, was built to read data in bulk and run query operations by managing data as columns, so it lets you retrieve, for example, a collection of all of the phone numbers stored in the database.
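To make the row-versus-column distinction concrete, here is a minimal Python sketch. It is purely illustrative, since real OLTP and OLAP engines lay data out on disk very differently:

```python
# Row-oriented layout (OLTP-style): each record is kept together,
# which makes it cheap to read or update one person at a time.
rows = [
    {"name": "Ada", "phone": "555-0100", "address": "1 Main St", "age": 36},
    {"name": "Grace", "phone": "555-0101", "address": "2 Oak Ave", "age": 45},
]

# Column-oriented layout (OLAP-style): each column is kept together,
# which makes it cheap to scan one attribute across every record.
columns = {
    "name": ["Ada", "Grace"],
    "phone": ["555-0100", "555-0101"],
    "address": ["1 Main St", "2 Oak Ave"],
    "age": [36, 45],
}

# OLTP-style access: fetch everything about one person.
print(rows[0])

# OLAP-style access: fetch one attribute for every person.
print(columns["phone"])
```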

 

1.1 Distributed File System (2003) and MapReduce (2004), Solving Data Storage in the History of Big Data

Although OLAP seemed to improve time-to-analysis, database storage was a more critical problem because it affected the actual data needed for analysis. Since databases at the time were physical machines storing data, large volumes of data started to exceed the space of a single machine.

The first two companies to experience and resolve the data storage problem were Google and Yahoo. They created distributed file systems, the Google File System (GFS) in 2003 and the Hadoop File System (HDFS) in 2005, respectively. Distributed file systems allowed data to be divided and distributed over multiple machines as raw data in key-value pairs, with a Namespace recording where each piece of divided data was located.

[Figure: A distributed file system]

For example, the diagram below shows file A divided into A1, A2, and A3 and distributed over three Hadoop File System (HDFS) servers. The HDFS servers contain only the raw data; they do not contain any information about the source of the divided data. Another file is therefore needed to record where the divided data is located, telling the client that A1 is in D1, A2 in D2, and A3 in D3. In Hadoop, this file is called the Namespace.

[Figure: The Hadoop File System storing information about data's location]

[Figure: How the Namespace changed how data was stored]
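As a rough illustration of the idea (not the actual HDFS implementation), the Namespace can be thought of as a small lookup table from a file's blocks to the machines that hold them. The Python sketch below uses hypothetical file and node names:

```python
# Toy model of a distributed file system's metadata (illustrative only).
# The Namespace records where each block of a file lives; the data nodes
# hold only the raw blocks.
namespace = {
    "A": [("A1", "D1"), ("A2", "D2"), ("A3", "D3")],  # file A split across three nodes
}

data_nodes = {
    "D1": {"A1": b"...first chunk of file A..."},
    "D2": {"A2": b"...second chunk of file A..."},
    "D3": {"A3": b"...third chunk of file A..."},
}

def read_file(name):
    """Reassemble a file by asking the Namespace where its blocks live."""
    return b"".join(data_nodes[node][block] for block, node in namespace[name])

print(read_file("A"))
```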

In order to work with a Namespace and HDFS, there had to be a processing unit that could properly map the required data from multiple machines and reduce the retrieved data into an actual data set that could be analyzed. This is when Google introduced the famous MapReduce into the Google File System to make the distributed file system work as intended. With MapReduce programming knowledge, it was possible to map and reduce the distributed data so that it could be analyzed.
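As a minimal sketch of the programming model (plain Python, not the actual Google or Hadoop APIs), a MapReduce-style word count maps over each distributed chunk independently, groups the intermediate key-value pairs, and then reduces each group into a final result:

```python
from collections import defaultdict
from functools import reduce

# Chunks of a file as they might sit on different machines.
chunks = ["error user login", "user logout", "error disk full"]

# Map phase: runs independently on every chunk, emitting (key, value) pairs.
def map_phase(chunk):
    return [(word, 1) for word in chunk.split()]

# Shuffle: group the intermediate pairs by key.
grouped = defaultdict(list)
for chunk in chunks:
    for key, value in map_phase(chunk):
        grouped[key].append(value)

# Reduce phase: collapse each key's values into a final count.
word_counts = {key: reduce(lambda a, b: a + b, values) for key, values in grouped.items()}
print(word_counts)  # {'error': 2, 'user': 2, 'login': 1, 'logout': 1, 'disk': 1, 'full': 1}
```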

 

1.2 Hadoop Pig (2008) and Hive (2010), Accessibility for Data Analysis

The invention of MapReduce solved the data storage issue, but it was not an ideal solution for business analysts because it required programming knowledge. Without a programming background or knowledge of the MapReduce system, people could not access or analyze big data in the distributed file system. Since removing the need for repetitive engineering work was the original goal, there was now a need to automate the MapReduce program.

The Hadoop community developed another powerful framework, Hadoop Hive, to overcome this problem. Hive is a SQL layer: it parses traditional data languages such as SQL, converts them into MapReduce jobs that compile and interact with HDFS, and returns the results to the analyst. As a SQL layer, Hive was limited because it didn't offer programming conditional statements. Hadoop Pig, a high-level platform for creating programs that run on Hadoop, addressed this gap.

[Figure: How Hadoop Hive allows analysts to access data]

With the SQL layer, the MapReduce program, and the distributed file systems, Google and Yahoo resolved the problem of overflowing database storage.
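To give a feel for what a SQL layer does, here is a deliberately simplified Python sketch (not Hive's actual planner). A query such as SELECT COUNT(*) FROM logs WHERE level = 'ERROR' can be translated into a map step that filters rows and a reduce step that aggregates them:

```python
# Hypothetical rows of a "logs" table, spread across machines in reality.
distributed_rows = [
    {"level": "ERROR", "msg": "disk full"},
    {"level": "INFO", "msg": "user login"},
    {"level": "ERROR", "msg": "timeout"},
]

# Map step: emit a 1 for every row matching the WHERE clause.
mapped = [1 for row in distributed_rows if row["level"] == "ERROR"]

# Reduce step: aggregate the emitted values into the COUNT(*).
print(sum(mapped))  # 2
```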

 

2.1 Hadoop Impala (2013), Query Engine

Once there was a solution to the storage problem, people started thinking about performance. The MapReduce path is not very efficient: the SQL is analyzed in the SQL layer, converted into MapReduce, and compiled before it even reaches the file system, and the results are reduced outside of HDFS.

Since query engines were already being used in traditional databases, a query engine was built on top of the Hadoop File System. That meant the SQL no longer had to be analyzed and compiled into MapReduce; instead, the query ran directly against the Hadoop File System, making the process faster. This project was called Hadoop Impala.

[Figure: How Hadoop Impala changed the workflow for using SQL to retrieve data]

Today, people choose between Hadoop Impala and Hadoop Pig, using the former for faster query results and the latter when they need conditional programming statements.

 

2.2 Spark (2014)

Impala was great because it bypassed the need to use MapReduce for better performance. However, what if you really needed to use MapReduce, either because you couldn't afford to move to a different platform or because an essential feature was only offered in MapReduce? People who wanted to keep using MapReduce started thinking about its limitations and how they could improve its performance.

One of the problems with MapReduce was that many potentially reusable intermediate results were inseparable from the job that produced them, so each MapReduce job took performance hits on repetitive steps. For example, if we have a 100GB file with a 10GB column for errors and we want just 1KB of a specific week's data, we have to read the whole 100GB every time we ask for a different week's data.

To enhance the performance of map and reduce, people started applying a cache, storing the intermediate results. However, because the intermediate results could themselves be too large, the cache also had to be kept in storage, causing a disk read and a disk write (Disk I/O). With Disk I/O still required, performance did not improve dramatically.

Spark was created in 2010 (becoming a top-level Apache project in 2014) to remove the Disk I/O by maximizing the efficiency of memory. The aim of Spark was to solve the performance issue by:

  1. Reusing as much as possible (caching)
  2. Removing Disk I/O (using memory more efficiently)

Spark introduced stream-style processing, reading data line by line, applying a filter for errors, and dropping any line that does not meet the filter requirements. Taking the errors of a specific week as an example, Spark reads line by line, checks whether the line is an error entry from the desired week, and throws the line out if not. Since Spark is a processing engine (like MapReduce), it was first targeted at data engineers with a programming background. Spark later added Spark SQL to broaden the target audience by providing a SQL layer that converts conventional data languages into Spark operations.
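Here is a minimal PySpark sketch of that pattern, assuming a local Spark installation and a hypothetical logs.txt file. The filtered errors are cached in memory so that questions about different weeks do not have to re-read the whole file:

```python
from pyspark import SparkContext

sc = SparkContext("local", "error-filter")

# Read the (hypothetical) log file line by line as an RDD.
lines = sc.textFile("logs.txt")

# Filter as the lines stream through; nothing is materialized yet.
errors = lines.filter(lambda line: "ERROR" in line)

# Keep the filtered result in memory so later queries on errors
# do not have to re-read the whole file from disk.
errors.cache()

# Reuse the cached data for several narrower questions.
print(errors.filter(lambda line: "2014-03-01" in line).count())
print(errors.filter(lambda line: "2014-03-08" in line).count())
```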

With the introduction of Spark SQL, Apache also introduced Spark DataFrames. A DataFrame is a distributed collection of data organized into columns with rich optimization behind it, playing a role similar to a SQL Abstract Syntax Tree (SQL AST). Both Spark DataFrames and SQL ASTs optimize the SQL and enhance performance, so the quality of the DataFrame engine or SQL AST largely determines the performance of SQL processing.

Oracle is known for having one of the best SQL ASTs; Databricks, the software company co-founded by Spark creator Matei Zaharia, is known for having the best DataFrames. SQL ASTs and DataFrames are still hot topics in the data scene because they continue to improve and enhance performance.
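For the DataFrame and Spark SQL side, here is a rough sketch (assuming the same local Spark setup and a hypothetical errors.json file) showing that the same question can be asked through the SQL layer or directly through DataFrame operations, with Spark's optimizer handling both forms:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-example").getOrCreate()

# A DataFrame is a distributed, column-organized collection with an
# optimizer behind it, so SQL and DataFrame code compile to the same plan.
logs = spark.read.json("errors.json")  # hypothetical semi-structured log file
logs.createOrReplaceTempView("logs")

# SQL form: the SQL layer converts this into optimized Spark operations.
spark.sql(
    "SELECT week, COUNT(*) AS errors FROM logs WHERE level = 'ERROR' GROUP BY week"
).show()

# Equivalent DataFrame form.
logs.filter(logs.level == "ERROR").groupBy("week").count().show()
```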

 

History of Big Data: Data as a Service (DaaS) in 2016

Before Data as a Service (DaaS) was introduced in 2016, data engineers had to go through a manual process of connecting cloud storage (Amazon S3, Google Cloud Storage, MinIO, etc.) with Spark and Hadoop. Although Spark and Hadoop reduced the work data engineers had to do, the manual connection process still took a lot of time and effort. To reduce the engineering dependency even further, services began to package Spark or Hadoop together with cloud storage: Data as a Service. Examples of DaaS are Google's BigQuery, AWS Athena, and Snowflake.

[Figure: How DaaS packages cloud storage with Spark or Hadoop]

AWS Athena

AWS Athena takes Facebook's Presto as its basis, which was built to address Hadoop's compute limitations. Hadoop, although it had enough data storage space, could not provision CPU on demand, which made it unsuitable for compute-heavy operations. Presto solved this by creating a Presto server layer and allowing compute units to be added on demand.

[Figure: Cost vs. performance of Presto compared to other databases]

The chart above represents cost versus performance for Presto. Presto, as well as AWS Athena (since it is Presto-based), has higher setup costs, but it scales out and surpasses the performance-to-cost ratio of MySQL and other databases.

Another potential advantage of AWS Athena is that it supports the Data Lake pattern, accommodating unstructured data (like image files) and semi-structured data (like JSON) in addition to structured data. Data Lakes are a good solution for companies that want to store data first and then create schemas and analyze it later.

With all the benefits of AWS Athena, why do some people choose other products? The limitation of AWS Athena is data synchronization. AWS Athena is a query engine that sits over S3, but it needs a metadata management (schema management) solution to store the structure and schema of the data in S3. This store is commonly referred to as a Catalog. AWS addressed the need for a Catalog by creating AWS Glue, which "glues" together AWS Athena and S3 by serving as a Catalog. Since these are three separate entities, however, someone must keep them in sync: when there is a change in the data in S3, AWS Glue needs to be notified so it can tell AWS Athena, and vice versa. Because of this management requirement, using AWS Athena takes significant data engineering effort to set up and maintain.

[Figure: How AWS Glue connects S3 and Athena]
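As a rough sketch of what using Athena looks like in practice (with boto3, and with hypothetical bucket, database, and table names), a query runs against data in S3 whose schema is registered in the Glue Catalog:

```python
import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Start a query against a table whose schema lives in the Glue Catalog
# and whose data lives in S3 (all names here are hypothetical).
query = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) FROM web_logs GROUP BY status",
    QueryExecutionContext={"Database": "analytics_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = query["QueryExecutionId"]

# Poll until Athena finishes writing its results back to S3.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=query_id)
    for row in results["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```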

Google BigQuery and Snowflake

BigQuery and Snowflake are managed Data Warehouses, so data management isn't required, but data must be structured and follow the predefined schema in order to be stored.

The fact that Data Warehouses only allow structured data is a big limitation, because logs are generally JSON (semi-structured data). To expand the market, Google created a system called Dremel, making it possible to save semi-structured data as tables. The mechanism that converts semi-structured data into tables is called a SerDe (serializer/deserializer). Google's BigQuery is built on Dremel, and Snowflake takes a similar approach to handling semi-structured data in its OLAP engine.
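As a toy illustration of the idea (not Dremel's or Snowflake's actual algorithm), converting semi-structured JSON logs into a table amounts to flattening nested fields into columns:

```python
import json

# Semi-structured log records, as they might arrive from an application.
raw_logs = [
    '{"user": {"id": 1, "country": "KR"}, "event": "login"}',
    '{"user": {"id": 2, "country": "US"}, "event": "purchase", "amount": 42}',
]

def flatten(record, prefix=""):
    """Flatten nested JSON objects into dotted column names."""
    columns = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            columns.update(flatten(value, prefix=f"{name}."))
        else:
            columns[name] = value
    return columns

# Records that lack a field (like "amount") simply leave that column empty.
table = [flatten(json.loads(line)) for line in raw_logs]
for row in table:
    print(row)
# {'user.id': 1, 'user.country': 'KR', 'event': 'login'}
# {'user.id': 2, 'user.country': 'US', 'event': 'purchase', 'amount': 42}
```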

History of Big Data: Kaldea and Confronting the Fragmented Tools of Data Analysis

From the history of the evolution of data management tools, we can see how and why people moved from OLTP and Microsoft Excel to OLAP software that delivers the same analytical benefits Excel once provided.

Revisiting the advantages of Microsoft Excel, it is a good tool for:

  1. Editor for data analysis (SQL)
  2. Catalogs through Excel Sheets
  3. Visualization

As data volumes increased, Microsoft Excel became less and less viable, so OLAP software started replacing it.

Out of these three categories, historically speaking, emerged the slew of tools for analysis, output, and reporting that we see fragmenting the current data tool landscape.

Below are some examples of products that replaced the advantages of Excel:

  1. Editor for SQL
    1. Apache Zeppelin
  2. Catalogs
    1. Amundsen
    2. DataHub
    3. SelectStar
  3. Visualization
    1. Looker
    2. Redash
    3. Tableau

So as data volumes increased, so did the fragmentation of tools intended to analyze data, aid discovery, and communicate results. And as the number of tools increased, the risk of losing the context of data or analysis increased, as did the number of data silos among tools and the people who use them.

After the fragmentation of tools in the history of big data, Kaldea rebundles your data analysis stack so you can complete your data analysis in a single platform, providing an analyst workbench for OLAP that incorporates a query interface, a catalog and other discovery tools, and visualization and reporting. By consolidating your metadata and automatically generating metadata as you work, Kaldea breaks down data silos between individual data owners and accelerates analysis by making sure you never lose the context of your analysis.

Try us out at Kaldea.com.