Over the past couple of decades, the use of data and data management tools has evolved...
How modern data solutions utilize DAGs
The advent of big data introduced a plethora of technical requirements for defining, storing, processing, and managing data. As data requirements have grown, so too has the need for more sophisticated tools, including more efficient methods of defining and visualizing data and processes. Directed Acyclic Graphs (DAGs), a variant of the graph data structure, have become increasingly important in the data domain and now underpin many tools in the modern data stack, which use DAGs to better process and manage large amounts of data.
When working with data, it's essential to understand the paradigms behind efficient data processing. This prevents issues when making decisions about what tools to use while architecting and implementing components of data infrastructure. Data Analysts can utilize this knowledge of DAGs to choose the best tool for the job at hand, taking into account volume, velocity, and variety concerns. Additionally, Data Engineers and Scientists who understand DAGs can apply this concept when building custom pipelines and processes for transforming data efficiently.
DAGs are increasingly leveraged in modern data solutions. This article provides a brief introduction to DAGs before exploring how they're utilized in various applications and use cases. You'll learn about the benefits of leveraging DAGs and some everyday use cases for them in data capabilities. Finally, you'll get an inside look into Kaldea's internal scheduler, which uses DAGs to schedule jobs across databases and data warehouse solutions.
What this article covers
- A brief introduction to Directed Acyclic Graphs (DAGs)
- How DAGs are used in modern data solutions
- Benefits of DAGs to Data Engineers, Data Scientists, and Data Analysts
- How DAGs are used in data orchestration
- Kaldea's DAG-based scheduler
DAGs in Data
DAGs are data structures in which directed edges connect vertices (nodes) in a manner that avoids cycles, loops, or repetitions between vertices: no path through the graph ever returns to a vertex it has already visited.
DAGs consist of nodes and directed edges. The nodes represent processes; the edges depict the relationships between processes, and the direction of the edges represents the flow of process execution through the graph and, in some cases, the transfer of data or signals.
Properties of DAGs:
- There are no cycles, loops, or repetitions in the graph.
- The execution order of the nodes can be topologically sorted (arranged sequentially).
- Well-designed DAGs are idempotent: given the same inputs, any execution path taken will produce the same output.
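The second property can be illustrated with a short example. The following is a minimal sketch of topological sorting using Kahn's algorithm (the function and pipeline names are illustrative, not from any particular tool); it also detects cycles, since a graph that cannot be fully sorted is not a DAG:

```python
from collections import deque

def topological_sort(graph):
    """Return one valid execution order for a DAG, or raise if a cycle exists.

    `graph` maps each node to the list of nodes that depend on it.
    """
    # Count incoming edges (unmet dependencies) for every node.
    indegree = {node: 0 for node in graph}
    for downstream in graph.values():
        for node in downstream:
            indegree[node] += 1

    # Start with nodes that have no dependencies.
    ready = deque(node for node, degree in indegree.items() if degree == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for downstream in graph[node]:
            indegree[downstream] -= 1
            if indegree[downstream] == 0:
                ready.append(downstream)

    if len(order) != len(indegree):
        raise ValueError("graph contains a cycle; not a DAG")
    return order

# A small pipeline: extract feeds two transforms, which both feed load.
pipeline = {
    "extract": ["clean", "aggregate"],
    "clean": ["load"],
    "aggregate": ["load"],
    "load": [],
}
print(topological_sort(pipeline))  # ['extract', 'clean', 'aggregate', 'load']
```

Note that "clean" and "aggregate" have no dependency on each other, so any order between them is equally valid, which is exactly the slack a scheduler exploits for parallelism.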
In the data domain, applications of DAGs reside in data workflow engines, data pipelines, scientific computing platforms, data orchestration, and scheduling tools. Data Engineers use workflow engines to model, visualize, and implement data workflows, operations, and process dependencies. Data Scientists leverage DAG-based data orchestration tools to implement, manage, and schedule data pipelines that represent ETL workflows as DAGs.
The key takeaway is that DAGs can model any system with a directed flow of information or execution.
DAGs, as data structures or process modeling tools, provide significant advantages in many data-centric use cases. The properties of DAGs enable data practitioners to create, develop, and manage complex processes that are scalable, flexible, and efficient. DAGs also allow for parallel processing, as long as the execution flow follows the structural constraints imposed on DAGs (i.e., no cycles or repetitions between nodes).
The following sections explore the relevance and applicability of DAGs to the modern data stack through an overview of their utilization in areas such as data analysis, data orchestration, job scheduling, and data processing.
Data Analysts are no strangers to spreadsheet software such as Google Sheets, Microsoft Excel, and OpenOffice. These spreadsheet tools are built on a fundamental logical paradigm that embodies the characteristics of DAGs, enabling cell functions to execute and cell values to update in a structured manner.
Let's explore a typical scenario. Spreadsheets are often filled with cells containing formulas and functions that modify the value of the cell itself as well as other associated cells that receive the cell's value as a parameter. For example, changes in the value or state of Cell A affect the values of Cell B, Cell C, and so on. Changing the value of Cell A triggers the recalculation of values and re-execution of formulas for every cell associated with Cell A through a direct or indirect relationship.
This series of activities, in which a change to one cell's value invokes changes in other cells, can be modeled as a DAG depicting the sequence in which tasks execute (possibly in parallel). Following the properties of DAGs discussed earlier, there are no loops or cycles in the execution of activities, such as cell changes or updates, within a spreadsheet. When cyclical behavior does occur in the execution of a spreadsheet's activities, it is called a circular reference.
A circular reference, in the context of spreadsheet software, occurs when a cell's formula references another cell whose formula, in turn, references the first cell. Note that a circular reference can also exist when a cell's formula refers to the cell itself. Within a spreadsheet, this results in a cyclical chain of execution that can significantly increase calculation time and can even create an infinite execution loop.
Spreadsheets that adopt a DAG-based logical paradigm for routine or task execution management reduce the presence and effect of circular references.
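To make this concrete, here is a toy sketch of how a spreadsheet engine might propagate a cell change through its dependency DAG and reject circular references. The `Sheet` class and its methods are hypothetical, invented for illustration, not the API of any real spreadsheet product:

```python
class Sheet:
    """Toy spreadsheet: each cell holds a constant or a formula over other cells."""

    def __init__(self):
        self.formulas = {}  # cell -> (function, list of input cells)
        self.values = {}    # cell -> current value

    def set_value(self, cell, value):
        self.values[cell] = value
        self._recalculate(cell)

    def set_formula(self, cell, func, inputs):
        self._check_no_cycle(cell, inputs)  # reject circular references up front
        self.formulas[cell] = (func, inputs)
        self.values[cell] = func(*(self.values[i] for i in inputs))

    def _recalculate(self, changed):
        # Propagate the change to every cell whose formula reads the changed cell.
        for cell, (func, inputs) in self.formulas.items():
            if changed in inputs:
                self.values[cell] = func(*(self.values[i] for i in inputs))
                self._recalculate(cell)

    def _check_no_cycle(self, start, inputs):
        # Depth-first walk of the inputs; reaching `start` means a circular reference.
        stack = list(inputs)
        while stack:
            cell = stack.pop()
            if cell == start:
                raise ValueError(f"circular reference involving {start}")
            stack.extend(self.formulas.get(cell, (None, []))[1])

sheet = Sheet()
sheet.set_value("A1", 10)
sheet.set_formula("B1", lambda a: a * 2, ["A1"])  # B1 = A1 * 2
sheet.set_formula("C1", lambda b: b + 5, ["B1"])  # C1 = B1 + 5
sheet.set_value("A1", 20)                         # propagates: B1 -> 40, C1 -> 45
print(sheet.values["C1"])  # 45
```

Attempting to give A1 a formula that reads C1 would close a cycle (A1 → B1 → C1 → A1), and the cycle check raises instead of allowing an infinite recalculation loop.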
Modern-day applications are utilized by large numbers of consumers that produce and consume data on a large scale. The data infrastructure of modern applications needs to match and meet consumers' demands for high consumption and generation of data. This modern data infrastructure requires data pipelines that support large data streams and execute data processing tasks on petabytes of batched or streaming data. This is part of the responsibility of a Data Engineer.
Data processing tasks involve ETL (Extract, Transform, Load) processes alongside data modification and validation routines. These tasks form a series of activities that can be represented by a directed acyclic graph: they execute in a defined order, allow for parallelism, and must avoid cycles or circular dependencies between tasks.
In a typical DAG diagram, the data processing activities are the nodes, while the edges between them represent the dependency of one task on another, the flow of data through processes as the inputs and outputs of processing tasks, and the execution order of those processes.
Data Engineers do not have to implement DAG-based data processing tools from scratch. Tools such as Airflow, Apache Beam, and Dagster all offer capabilities for creating, managing, and modifying data pipeline processes and tasks. These tools also visualize data pipeline processes as DAGs through a GUI (graphical user interface). With tools such as Airflow, data processing tasks, job schedules, and dependencies are defined programmatically as Python or Bash scripts.
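For instance, Airflow expresses dependencies between tasks with the `>>` operator. The stand-in `Task` class below mimics that declaration style in plain Python; it is a sketch of the idea only, not Airflow's actual operator classes:

```python
class Task:
    """Minimal stand-in for an Airflow-style operator: `a >> b` makes b run after a."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []  # tasks that depend on this one

    def __rshift__(self, other):
        self.downstream.append(other)
        return other  # returning `other` allows chaining: a >> b >> c

extract = Task("extract")
transform = Task("transform")
load = Task("load")

# Declare the execution order: extract, then transform, then load.
extract >> transform >> load

print([t.task_id for t in extract.downstream])    # ['transform']
print([t.task_id for t in transform.downstream])  # ['load']
```

The `>>` syntax reads like the arrows in a DAG diagram, which is precisely why these tools adopt it: the code is a direct textual transcription of the graph.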
DAGs provide a crucial benefit to the definition and execution of data processing tasks: the acyclicity constraint imposed on the relationships between tasks. A data processing operation cannot depend on another operation that, in turn, depends on the first, and no operation can depend on itself. A violation of this rule is called a circular dependency.
Another benefit of DAGs to Data Engineers is the visual illustration of the data processing tasks and the interconnectivity of processes and operations in the entire illustrated data pipeline. Visualizing DAGs allows for the well-thought-out formulation of algorithms and programmatic definition of data processing tasks. Using a GUI-based data processing or orchestration tool enables the definition of data workflow through the definition and arrangement of DAGs.
DAGs allow for parallelism and branching, which in data processing translates to tasks that can run in parallel or reach the desired outcome via alternate data flow routes and execution paths.
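As a sketch of how that parallelism can be exploited (illustrative names, standard library only), the runner below submits every task whose dependencies have completed to a thread pool, so independent branches of the DAG execute concurrently:

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def run_dag(graph, action):
    """Run `action(node)` for every node in dependency order, parallel where possible.

    `graph` maps each node to the list of nodes that depend on it.
    """
    # Count how many upstream tasks each node is waiting on.
    indegree = {node: 0 for node in graph}
    for downstream in graph.values():
        for node in downstream:
            indegree[node] += 1

    completed = []
    with ThreadPoolExecutor() as pool:
        # Submit every task with no dependencies immediately.
        running = {pool.submit(action, node): node
                   for node, degree in indegree.items() if degree == 0}
        while running:
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                node = running.pop(future)
                completed.append(node)
                for downstream in graph[node]:
                    indegree[downstream] -= 1
                    if indegree[downstream] == 0:  # all upstream tasks finished
                        running[pool.submit(action, downstream)] = downstream
    return completed

pipeline = {
    "extract": ["clean", "aggregate"],
    "clean": ["load"],
    "aggregate": ["load"],
    "load": [],
}
# "clean" and "aggregate" have no mutual dependency, so they may run concurrently.
order = run_dag(pipeline, lambda node: print("running", node))
```

"extract" always runs first and "load" always runs last, but the two middle tasks can finish in either order; that slack is exactly what the DAG's structure makes safe to parallelize.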
DAGs provide an intuitive way of thinking about the order of execution and events in a data processing system, and Data Engineers and Data Scientists use them to build data pipelines and workflows via data orchestration tools. Data orchestration is the development and scheduling of processes that consolidate data from multiple sources and perform one or more transformations on that data in a defined order, so that the data can then be made available for downstream use, such as data visualization or analysis.
Data orchestration involves several connected activities and processes that follow a specific order and have workflow constraints defined within the execution of operations. This makes DAGs a perfect candidate for depicting data orchestration processes, or for basing the job execution paradigm of data orchestration tools on.
The order of process execution within data orchestration is defined by the directed edges present in DAGs. Within an ETL workflow, for example, the loading of data does not occur before the data has been extracted and consolidated from one or more data sources. Astronomer, Google Cloud Composer, Flyte, and Azure Data Factory are common data orchestration tools.
So far, the main focus has been on the order of execution of the processes, tasks, and jobs represented in DAGs. Another component of DAG-based systems for data processing and orchestration is scheduling. Scheduling, in the context of DAGs and data orchestration, means executing processes, tasks, or jobs at a defined time. The execution schedule is usually determined and defined by Data Engineers or Data Scientists and managed by robust data orchestration tools.
Scheduling systems go hand in hand with triggers and event-based data workflows. Events and triggers initiate the execution of a process, task, or job once certain criteria are met by related upstream processes in a data pipeline. These criteria are typically based on the current process receiving a signal or data from its upstream processes.
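That trigger mechanism can be sketched in a few lines: each task counts the upstream signals it is still awaiting and fires only once the count reaches zero. The `TriggeredTask` class and task names are illustrative, not taken from any particular orchestration tool:

```python
class TriggeredTask:
    """Fires its action only once every upstream task has signalled completion."""

    def __init__(self, name, action, upstream_count):
        self.name = name
        self.action = action
        self.pending = upstream_count  # upstream signals still awaited
        self.downstream = []

    def signal(self):
        # An upstream task reports completion; fire once no signals remain.
        self.pending -= 1
        if self.pending == 0:
            self.action()
            for task in self.downstream:
                task.signal()

fired = []
clean = TriggeredTask("clean", lambda: fired.append("clean"), upstream_count=1)
aggregate = TriggeredTask("aggregate", lambda: fired.append("aggregate"), upstream_count=1)
load = TriggeredTask("load", lambda: fired.append("load"), upstream_count=2)
clean.downstream.append(load)
aggregate.downstream.append(load)

clean.signal()      # clean fires; load still waits on aggregate
aggregate.signal()  # aggregate fires, which in turn triggers load
print(fired)        # ['clean', 'aggregate', 'load']
```

Because "load" waits for two signals, it cannot fire early no matter which upstream task completes first, mirroring how orchestration tools gate downstream jobs on upstream success.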
Kaldea's single DAG-based scheduler
Several of the tools mentioned above were built to enable Data Engineers and Scientists to build, maintain, and schedule data processing tasks. Kaldea is one such tool: a unified analytics platform that data practitioners use to centralize knowledge and streamline analytics workflows.
Data Engineers and Scientists are responsible for orchestrating data processing tasks, which include consolidating and aggregating data from various sources such as data warehouses (DWH), data lakes, databases, and local or cloud storage.
Kaldea utilizes DAGs as a design and programmatic methodology for defining the extraction and consolidation of data from multiple sources, including the execution and scheduling of queries, referred to as Jobs on the Kaldea platform. Jobs defined in Kaldea embody the nature of DAGs: they can be connected through dependencies while avoiding cyclic execution and circular dependencies.
Kaldea handles dependency management between Jobs automatically, which reduces the time Data Engineers and Scientists spend ensuring that circular dependencies are identified and rectified. Job scheduling is likewise defined and managed by Kaldea's internal scheduler, which orders execution based on the interdependencies between Jobs.
Jobs in Kaldea are defined through an intuitive user interface or programmatically using SQL. Because SQL is the language used most by Data Analysts, this makes it possible for Data Analysts to consolidate data from various sources themselves instead of relying on Data Engineers and Scientists.
Kaldea provides a Graph view that enables an organization's data flow to be depicted with all dependencies and cross-job dependencies illustrated. This feature of Kaldea allows transparency and complements an organization's governance efforts.
Using Kaldea, most data team members can define, develop, maintain, and schedule data processing pipelines that focus on transforming data consolidated from sources such as Google BigQuery, Snowflake, Amazon Redshift, or PostgreSQL.
Kaldea currently focuses only on database-related jobs. Its functionality does not yet extend to local storage or other storage solutions, but this is set to change in future releases, extending the benefits of Kaldea to the broader data pipeline. Check out the job demo!
Quick takeaways on Kaldea
- Kaldea is a modern data solution that uses Directed Acyclic Graphs to help with job scheduling. This makes it an excellent choice for data teams that consolidate data from multiple sources.
- Kaldea utilizes Directed Acyclic Graphs as a design methodology for extracting and consolidating data from multiple sources, including the scheduling of Jobs.
- The Kaldea platform automatically handles the dependencies between jobs, which saves data engineers and scientists time on job management and scheduling.
- Kaldea's Graph view provides an overview of an organization's data flow, illustrating job dependencies. This helps with transparency and governance efforts.
Wrapping things up...
In this article, we have looked at Directed Acyclic Graphs (DAGs), how they are used in the modern data stack, and how Kaldea utilizes DAGs to schedule Jobs. DAGs provide a way to model data dependencies and can be used for various purposes, such as representing workflows, data pipelines, or even Jobs.
DAGs have a broad range of applications within a modern data application, from data process definition to the scheduling of data process execution. DAGs also provide a basis for modern data solutions, such as Airflow, Flyte, and Kaldea, that aim to create productive and efficient data teams and infrastructure.
We also saw how Data Engineers and Scientists can use Kaldea to manage the execution of data processing tasks through the defined dependencies between Jobs. Kaldea uses DAGs to schedule Jobs efficiently by considering the dependencies between them. This allows Kaldea to run Jobs in parallel where possible and makes monitoring and troubleshooting Jobs easy.
Finally, we saw how using Kaldea enables Data Analysts to play a role in data consolidation and transformation. Kaldea's user-friendly interface and ability to handle dependencies between Jobs make it an ideal tool for Data Analysts who want to be involved in the data transformation process.
In conclusion, Directed Acyclic Graphs provide a powerful way to model dependencies between Jobs. Kaldea utilizes this power to provide an efficient and user-friendly way to manage data processing tasks. Data Analysts can also use Kaldea to be involved in the data transformation process, making it an essential tool in the modern Data Stack.
Frequently Asked Questions on Directed Acyclic Graphs (DAGs)
What are Directed Acyclic Graphs?
Directed Acyclic Graphs, or DAGs, are graph data structures in which directed edges connect vertices without forming cycles, loops, or repetitions between vertices. DAGs are vital in current data solutions because they allow for neater modeling of data flows and dependencies compared to traditional linear data structures.
How are DAGs leveraged in the modern data stack?
DAGs provide a method of modeling relationships, connectivity, and dependencies between siloed data processes. DAGs also provide Data Engineers with the methodology for designing the execution order of data processes.
What are some benefits of using DAGs in the modern data stack?
- DAGs provide intuitive methods of visualizing data processes, dependencies pipelines and workflows present in data systems.
- Through structured and organized visualization methods, DAGs promote transparency and governance.
What are some applications of DAGs?
Data processing, data workflows, data orchestration, job scheduling, etc.
What is Kaldea?
Kaldea is an analytics platform designed to streamline your data workflow from discovery, query management, docs, collaboration, and jobs, making all data assets searchable and self-serviceable by anyone across the company.