Over the past couple of decades, the use of data and data management tools has evolved dramatically, and it continues to evolve. We’ll tell the history of Big Data and data management tools, beginning with the introduction of Enterprise Resource Planning (ERP) in 1960, how its limitations led to the distributed file system (2003) and MapReduce (2004), and how performance improvements led to the Hadoop Distributed File System (2005), Pig (2008), Hive (2010), Impala (2013), and, eventually, Spark (2014). Then we’ll cover how Data as a Service emerged, in part, to reduce the labor required to connect cloud storage to Spark and Hadoop; these services include BigQuery (2010), Snowflake (2014), and Athena (2016).
And we'll address the modern data tool ecosystem, drawing a direct lineage from the solutions to Big Data’s storage and performance problems to the specialization and fragmentation of today’s tools for analyzing Big Data. Finally, we'll motivate Kaldea’s place in this history of Big Data and data tools, and explain how Kaldea plans to unify the tools of analysis during this latest period of fast growth for data.
Enterprise Resource Planning (ERP) and Early History of Big Data
The ERP system is widely used in traditional companies to manage clients’ business activities. ERP is a type of software that helps organize data, but it requires “officers” to manage the client data. Clients need to go through these officers, who are familiar with the software that writes the data to storage (a database).
Before data analytics and management tools were developed, people (usually on the business side) believed that the client data stored through the ERP system could help them better serve the needs of their clients and therefore make the business more successful. They had a very powerful tool: Microsoft Excel. Excel was great for looking at data in sheets, analyzing it and feeding the results back as additional data, or visualizing the data as charts.
Although Excel worked well, business folks had to ask their engineering team to extract the data into an Excel sheet before they could start analyzing, and this took significant time and effort for every request. To make the feedback loop more efficient, traditional companies created tools (like NetSuite) to extract data from ERP systems into Excel sheets.
Limitations of the ERP System to Handle Big Data
The ERP system and the practice of extracting data into Excel worked well. So well, in fact, that businesses expanded it to include more data points or changed their business models in ways that led to more complicated data structures. This made the data volume significantly larger and eventually caused issues for both ERP and Excel.
With the increased data volume, the ERP app started to crash from:
- database storage spaces overflowing
- analysis taking too long, because simply extracting the data into Excel sheets took far longer than before
From these limitations, the whole data management tools industry began to evolve:
- to replace Excel as the tool of analysis, and
- to enhance storage systems for databases.
1.0 Transition from OLTP to OLAP, Improving Time-to-Analysis
To address the first limitation and shorten time-to-analysis, the industry introduced a concept called the Data Warehouse. To differentiate the original databases from Data Warehouses, we refer to the former as Online Transaction Processing (OLTP) and the latter as Online Analytical Processing (OLAP). Although OLAP has existed since the 1970s, its commercialization began in the 90s as a way to improve analysis.
What is a Data Warehouse (OLAP) and how is it different from existing databases (OLTP)?
OLTP was built mainly to update specific data points by managing the data as rows. As an example of row-based management, suppose you store information about people: when you look someone up, you retrieve that person’s name, phone number, address, and age together as one record. OLAP, on the other hand, was built to read data in bulk and run query operations by managing the data as columns, so it lets you retrieve, for example, a collection of all of the phone numbers stored in the database.
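As a rough illustration of the difference, here is a minimal Python sketch (purely illustrative; it is not how any real database lays out storage, and the records are made up):

```python
# Row-oriented storage (OLTP-style): each record is kept together,
# which makes it cheap to read or update one person at a time.
rows = [
    {"name": "Ada", "phone": "555-0100", "address": "1 Elm St", "age": 36},
    {"name": "Grace", "phone": "555-0101", "address": "2 Oak St", "age": 45},
]

def get_person(name):
    """Typical OLTP access: fetch everything about one person."""
    return next(r for r in rows if r["name"] == name)

# Column-oriented storage (OLAP-style): each attribute is stored as its own
# array, so scanning one attribute across all records touches only that column.
columns = {
    "name": ["Ada", "Grace"],
    "phone": ["555-0100", "555-0101"],
    "address": ["1 Elm St", "2 Oak St"],
    "age": [36, 45],
}

def all_phone_numbers():
    """Typical OLAP access: fetch one attribute for every record."""
    return columns["phone"]

print(get_person("Ada"))
print(all_phone_numbers())
```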
1.1 Distributed File System (2003) and MapReduce (2004), Solving Data Storage in the History of Big Data
Although OLAP improved the time to analysis, database storage was the more critical problem because it affected the actual data needed for analysis. Since databases at the time were physical machines storing data, large volumes of data started to exceed the space of a single machine.
The first two companies to experience and resolve the data storage problem were Google and Yahoo. They created distributed file systems, the Google File System (GFS) in 2003 and the Hadoop Distributed File System (HDFS) in 2005 respectively. Distributed file systems allowed data to be divided and distributed over multiple machines as raw data, with a Namespace recording where each piece of the divided data was located.
For example, the diagram below shows file A divided into A1, A2, and A3 and distributed over three HDFS servers. The HDFS servers contain only the raw data; they hold no information about which file the divided pieces came from. Another file is therefore needed to record the location of the divided data, telling the client that A1 is in D1, A2 in D2, and A3 in D3. In Hadoop, this file is called the Namespace.
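To make the idea concrete, here is a toy Python model of the Namespace and the HDFS servers (the file, piece, and server names come from the example above; real HDFS metadata is managed by the NameNode and is far more involved):

```python
# Toy "Namespace": records which server holds each piece of file A.
namespace = {"A": [("A1", "D1"), ("A2", "D2"), ("A3", "D3")]}

# Toy "HDFS servers": each one stores only raw pieces of data and knows
# nothing about which file the pieces belong to.
servers = {
    "D1": {"A1": b"first part of file A, "},
    "D2": {"A2": b"second part of file A, "},
    "D3": {"A3": b"third part of file A"},
}

def read_file(name):
    """Reassemble a file by looking up its pieces in the Namespace."""
    return b"".join(servers[srv][piece] for piece, srv in namespace[name])

print(read_file("A"))
```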
In order to work with a Namespace and HDFS, there had to be a processing unit that could properly map the required data from multiple machines and reduce the retrieved data into an actual data set that could be analyzed. This is when Google introduced the famous MapReduce into the Google File System, allowing distributed file systems to work as intended. With MapReduce programming knowledge, it was possible to map and reduce the distributed data so that it could be analyzed.
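Here is a minimal, single-machine sketch of the map and reduce steps in Python, counting error lines per week across distributed pieces of a log (in a real MapReduce job, the map tasks run on the machines holding each piece and the intermediate key-value pairs are shuffled before the reduce step; the log format here is made up):

```python
from collections import defaultdict

# Pretend each chunk of the log lives on a different machine.
chunks = [
    ["ERROR 2014-W01 disk full", "INFO 2014-W01 ok"],
    ["ERROR 2014-W02 timeout", "ERROR 2014-W01 retry"],
]

def map_phase(chunk):
    """Emit a (week, 1) pair for every error line in a chunk."""
    for line in chunk:
        level, week, _ = line.split(" ", 2)
        if level == "ERROR":
            yield week, 1

def reduce_phase(pairs):
    """Sum the counts for each week."""
    totals = defaultdict(int)
    for week, count in pairs:
        totals[week] += count
    return dict(totals)

intermediate = [pair for chunk in chunks for pair in map_phase(chunk)]
print(reduce_phase(intermediate))  # {'2014-W01': 2, '2014-W02': 1}
```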
1.2 Hadoop Pig (2008) and Hive (2010), Accessibility for Data Analysis
The invention of MapReduce solved the data storage issue, but it was not an ideal solution for business analysts because it required programming knowledge. Without a programming background or knowledge of the MapReduce system, people could not access or analyze big data in the distributed file system. Since reducing the dependence on engineering resources was the original goal, there was now a need to automate writing MapReduce programs.
The Hadoop ecosystem gained another powerful framework, Hive, to overcome this problem. Hive is a SQL layer: it accepts traditional data languages such as SQL, converts them into MapReduce jobs that compile and interact with HDFS, and then returns the results. As a SQL layer, Hive was limited because it did not offer programming constructs such as conditional statements. Hadoop Pig, a high-level platform for creating programs that run on Hadoop, addressed this.
With the SQL layer, the MapReduce program, and the distributed file systems, Google and Yahoo resolved the problem with overflowing database storage.
2.1 Hadoop Impala (2013), Query Engine
Once there was a solution to the storage problem, people started thinking about performance. The MapReduce path is not very efficient: SQL is analyzed in the SQL layer and compiled into MapReduce jobs, and the map and reduce steps run outside of HDFS before the results come back.
Since query engines were already being used in traditional databases, the Hadoop ecosystem built a query engine directly over the file system. That meant SQL no longer had to be analyzed and compiled into MapReduce; the query went straight into the Hadoop File System, making the process faster. This project was called Hadoop Impala.
Today, people choose between Hadoop Impala and Hadoop Pig, using the former to get faster query results and the latter to use conditional programming statements.
2.2 Spark (2014)
Impala was great because it bypassed MapReduce for better performance. However, what if you really needed to use MapReduce, either because you couldn’t afford to move to a different platform or because an essential feature was only offered in MapReduce? People who wanted to keep using MapReduce started thinking about its limitations and how they could improve its performance.
One of the problems with MapReduce was that it contained a lot of potentially reusable work that could not be separated out, so each MapReduce job took a performance hit from repeating the same steps. For example, if we have a 100GB file with a 10GB column for errors, and we want only about 1KB of a specific week’s data, we have to read the whole 100GB every time we ask for a different week’s data.
To enhance the performance of map and reduce, people started applying a cache to store the intermediate results. However, because the intermediate data could itself be too large, the cache also had to be written to storage, causing a disk read and a disk write (Disk I/O). With Disk I/O still required, performance was not enhanced dramatically.
Spark was created to remove the Disk I/O by maximizing the efficiency of memory (the project began around 2010 and reached its 1.0 release in 2014). Spark aimed to solve the performance issue by:
- Reusing as much as possible (caching)
- Removing Disk I/O (using memory more efficiently)
Spark introduced stream-style processing: it reads data line by line, applies a filter for errors, and drops any line that does not meet the filter requirements. Taking the errors of a specific week as an example, Spark reads line by line, checks whether each line is an error from the desired week, and throws the line out if not. Since Spark is a processing unit (like MapReduce), it was first targeted at data engineers with a computer programming background. Spark later added Spark SQL to broaden the target audience by creating a SQL layer that converts conventional data languages into Spark operations.
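A short PySpark sketch of this pattern, reusing the error-report example (the file path and column names here are hypothetical, and running it assumes a working Spark installation with the pyspark package):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("error-report").getOrCreate()

# Hypothetical log file with columns: ts, week, level, message.
logs = spark.read.option("header", "true").csv("/data/app_logs.csv")

# Keep only the error rows and cache that intermediate result in memory,
# so asking for a different week does not re-read the full file.
errors = logs.filter(F.col("level") == "ERROR").cache()

# One week's errors; later weeks reuse the cached 'errors' data.
errors.filter(F.col("week") == "2014-W01").show()

# The same question expressed through the Spark SQL layer.
errors.createOrReplaceTempView("errors")
spark.sql("SELECT * FROM errors WHERE week = '2014-W02'").show()
```

Calling errors.explain() on the cached DataFrame prints the optimized plan Spark builds for it, which is where the DataFrame optimizations discussed next come in.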
With the introduction of Spark SQL, Apache also introduced Spark DataFrames. A DataFrame is a distributed collection of data organized into columns with rich optimization, similar in spirit to a SQL Abstract Syntax Tree (SQL AST). Both Spark DataFrames and SQL ASTs optimize the SQL and enhance performance, so having the best DataFrame or SQL AST implementation determines the performance of SQL processing.
Oracle is known for having one of the best SQL ASTs; Databricks, the software company co-founded by Matei Zaharia, who also created Spark, is known for having the best DataFrames. SQL ASTs and DataFrames are still hot topics in the data scene because they continue to improve and enhance performance.
Continue reading Part 2!
Curious about what we are building? Try us out at Kaldea.com.