
How Big Data Shaped the Current Data Stack - Part 2

[Image: History museum of data]

Before Data as a Service (DaaS) arrived around 2016, data engineers had to manually connect cloud storage (Amazon S3, Google Cloud Storage, MinIO, etc.) to Spark and Hadoop. Spark and Hadoop reduced the overall workload for data engineers, but wiring them to storage still took a lot of time and effort. To reduce the engineering dependency even further, services began to appear that package Spark or Hadoop together with cloud storage: Data as a Service. Examples of DaaS are Google’s BigQuery, AWS Athena, and Snowflake.


[Figure: How DaaS packages cloud storage and Spark or Hadoop]

History of Big Data: Data as a Service (DaaS) in 2016

AWS Athena

AWS Athena is based on Facebook’s Presto, which was built to overcome Hadoop’s limited computing power. Hadoop offered plenty of storage space but gave little control over CPU resources, so it was a poor fit for compute-heavy operations. Presto solved this by adding a Presto server layer that provides compute units on demand.

[Figure: Cost vs. performance, showing Presto vs. other databases]

The chart above shows cost versus performance for Presto. Presto, and AWS Athena with it (since Athena is Presto-based), costs more to set up, but it scales out and eventually surpasses the performance-to-cost ratio of MySQL and other databases.

Another potential advantage of AWS Athena is that it follows the Data Lake model: the underlying storage accepts unstructured data (like image files) and semi-structured data (like JSON) in addition to structured data. Data Lakes are a good fit for companies that want to store the data first and then, later, create schemas and analyze it.
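The "store first, create schemas later" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the `schema` mapping, and the `read_with_schema` helper are all hypothetical and are not part of AWS Athena or any other product.

```python
import json

# Raw events stored as-is, with no schema enforced at write time (made-up data).
raw_store = [
    '{"user_id": "42", "amount": "19.90", "note": "first order"}',
    '{"user_id": "7", "amount": "5"}',
]

# A schema decided later, at analysis time: column name -> type converter.
schema = {"user_id": int, "amount": float}


def read_with_schema(lines, schema):
    """Parse each raw line and keep only the schema's columns, cast to their types."""
    for line in lines:
        record = json.loads(line)
        yield {col: cast(record[col]) for col, cast in schema.items() if col in record}


rows = list(read_with_schema(raw_store, schema))
total = sum(row["amount"] for row in rows)
print(rows)
print(total)
```

The raw lines never change; only the schema applied at read time does, which is why a different analysis can later read the same data with a different set of columns.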

With all the benefits of AWS Athena, why do some people choose other products? The limitation of AWS Athena is data synchronization. AWS Athena is a query engine that runs on top of S3, but it needs a metadata management (schema management) solution to store the structure and schema of the data held in S3. This metadata store is commonly referred to as a Catalog. AWS addressed the need for a Catalog with AWS Glue, which “glues” AWS Athena and S3 together by serving as the Catalog. Because the three are separate services, however, someone must keep them synchronized: when the data in S3 changes, AWS Glue has to be updated so that AWS Athena sees the change, and vice versa. Because of this management overhead, AWS Athena takes significant data engineering effort to set up and maintain.

[Figure: How AWS Glue connects S3 and Athena]
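To make the synchronization problem concrete, here is a minimal sketch of a catalog drifting out of step with storage. It is a toy model under stated assumptions, not the AWS Glue API: the `Catalog` class, the `sync_catalog` function, and the object keys are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Catalog:
    """Hypothetical stand-in for a metadata catalog such as AWS Glue."""
    partitions: set = field(default_factory=set)


def partitions_in_storage(object_keys):
    """Derive partition prefixes (everything before the file name) from object keys."""
    return {key.rsplit("/", 1)[0] for key in object_keys if "/" in key}


def sync_catalog(catalog, object_keys):
    """Bring the catalog in line with storage; return (added, removed) partitions."""
    actual = partitions_in_storage(object_keys)
    added = actual - catalog.partitions
    removed = catalog.partitions - actual
    catalog.partitions = actual
    return added, removed


# Objects as they might appear in a bucket (made-up keys).
keys = [
    "logs/dt=2022-01-01/part-0.json",
    "logs/dt=2022-01-02/part-0.json",
    "logs/dt=2022-01-02/part-1.json",
]
# The catalog still lists a partition that was deleted and misses a new one.
catalog = Catalog(partitions={"logs/dt=2021-12-31", "logs/dt=2022-01-01"})

added, removed = sync_catalog(catalog, keys)
print("added:", sorted(added))
print("removed:", sorted(removed))
```

Until something like `sync_catalog` runs, the query engine would be working from a stale picture of the data, and that gap is the management burden described above.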

Google BigQuery and Snowflake

BigQuery and Snowflake are managed Data Warehouses, so that kind of manual data management isn't required. In exchange, data must be structured and must follow a predefined schema before it can be stored.

The restriction to structured data is a big limitation for Data Warehouses because logs are generally JSON (semi-structured data). To expand the market, Google created a system called Dremel, which makes it possible to store semi-structured data as tables. The component that converts between semi-structured records and table rows is commonly called a SerDe (serializer/deserializer). Google’s BigQuery is built on Dremel, and Snowflake takes a comparable approach to semi-structured data.
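As a rough illustration of turning semi-structured records into a table, the sketch below flattens nested JSON logs into rows with dotted column names. It is a deliberately simplified stand-in and not Dremel's actual encoding, which stores nested data column by column; the `flatten` helper and the sample logs are hypothetical.

```python
import json


def flatten(record, prefix=""):
    """Turn nested objects into a single-level dict with dotted column names."""
    row = {}
    for key, value in record.items():
        column = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, prefix=f"{column}."))
        else:
            row[column] = value
    return row


# Made-up log lines with differing shapes, as JSON logs often have.
raw_logs = [
    '{"user": {"id": 1, "country": "KR"}, "event": "click"}',
    '{"user": {"id": 2}, "event": "view", "meta": {"ms": 120}}',
]

rows = [flatten(json.loads(line)) for line in raw_logs]
# The table's columns are the union of every field seen; missing values become None.
columns = sorted({name for row in rows for name in row})
table = [[row.get(name) for name in columns] for row in rows]

print(columns)
for line in table:
    print(line)
```

Fields that appear in only some records become columns that are empty (`None`) in the other rows, which is how records of different shapes can share a single table.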

History of Big Data: Kaldea and Confronting the Fragmented Tools of Data Analysis

The history of data management tools shows how and why people moved from OLTP databases and Microsoft Excel to OLAP software: they wanted the same analytical benefits that Excel gave them.

To revisit those advantages, Microsoft Excel works well as:

  1. Editor for data analysis (SQL)
  2. Catalogs through Excel Sheets
  3. Visualization

As data volumes increased, Microsoft Excel became less and less viable, so OLAP software started replacing it.

Historically speaking, these three categories gave rise to the multitude of tools for analysis, output, and reporting that fragments the current data tool landscape.

Below are some examples of products that took over each of Excel's roles:

  • Editor for SQL: Apache Zeppelin, Redash

  • Catalogs: Amundsen, DataHub, Acryl Data, SelectStar

  • Visualization: Looker, Redash, Tableau, Metabase, Lightdash

As data volumes increased, so did the fragmentation of tools intended to aid discovery and analysis and to communicate the results. As the number of tools increased, so did the risk of losing the context of data or analysis, along with the number of data silos among tools and the people who use them.

Kaldea responds to this fragmentation by rebundling your data analysis stack so you can complete your analysis on a single platform. It provides an analyst workbench for OLAP that combines a query interface, a Catalog and other discovery tools, and visualization and reporting. By consolidating your metadata and automatically generating more as you work, Kaldea breaks down data silos between individual data owners and accelerates analysis by making sure you never lose the context for your work.

Try us out at Kaldea.com.