The data architecture built around data-driven decision making has exploded in the last decade and, especially recently, has created a complex system of tools and applications for serving analytics (see a16z’s modern data infrastructure). While these tools have created more optimal and specialized ways of setting up, storing, pulling, and analyzing massive amounts of data, they comprise an ecosystem that requires intensive resources and intervention across several roles. Today, to put a single new metric onto a BI tool, companies work through a series of tools spanning ETL (ELT), storage, query and processing, observability, discovery, and reporting. Even more challenging, roles from the data engineer, scientist, and analyst to the PM or business analyst own different parts of these tools and workflows, and they are often frustratingly dependent on one another.
Given this current architecture and ecosystem, Kaldea set out to understand how we could:
- simplify the analytics architecture and workflow without losing deep insight,
- accelerate the time-to-insight (no more delays in data!),
- increase accessibility and utilization of the modern data analysis architecture, and
- add some joy to the analysis workflow.
In the end, we’d decrease the following for the data scientist and analyst:
- (A) the manual labor required to maintain the data architecture and the metadata that data producers rely on, and
- (B) the intensive communication and documentation required to prevent the loss of data’s context.
Connecting the modern data stack: people to data and people to people
To understand the issues of the modern data stack and the problems encountered by the teams working in it, we interviewed hundreds of data scientists, analysts, and other data consumers such as product managers, marketers, and business owners across the globe. Some of the questions we ask in this article are the very questions data producers and consumers are asking right now about their work.
Have you ever served an analysis to a data consumer whose request carried none of its original context, only to redo the analysis to meet a new or changed requirement? Or have you ever been frustrated trying to explain what data you’re looking for, only to find out that it doesn’t exist or that you have to wait three weeks? Whether you’re a data scientist or analyst, or attempting to make data-driven product or business decisions, you’re probably familiar with the friction in questions like:
Data Scientist and Analyst frequently asked questions
- What data or metric is relevant to this analysis?
- Is this the right table for this problem set?
- Who owns this table and can I use it?
- Is it up to date or are there updates that I should be aware of?
- Is this the table you need me to export?
- Has anyone seen other queries related to this table?
- Why did you join this table A with B?
- Is my query wrong or is it the table?
- Rinse and repeat.
Product and Business frequently asked questions
- How can we optimize LTV (for example) of our customers?
- What types of data or metrics are relevant for analyzing and optimizing LTV?
- Can we change the reporting around with different parameters?
- How does the analysis change when we change X to Y?
- What if we change Y to Z? Was X right?
- Rinse and repeat.
Ideally, when we have to serve an analysis or make a business or product decision, we start with a well-defined context and question, and we can easily find the data ourselves. This is data discovery, and as we know, it’s not always easy to find and understand the context for a table, column, query, or doc. Scheduling a job without filing an engineering request is another example of the same model: giving access to data while balancing it with governance. This is connecting people to data.
When data is hard to find or we’re uncertain of the data’s context, folks trying to discover data will find an expert to help them understand what’s there. This is data collaboration, or connecting people to data when the data are people. Discovery and collaboration are two essential parts of the analyst workflow and a key part of Kaldea’s evolution into a product.
Other parts of the analyst workflow, like querying and modeling, documentation and visualization, and scheduling, are also crucial pieces that require intensive communication, manual processes, or engineering dependencies. These processes are not discovery- or context-agnostic. They are themselves important parts of the context of data: the patterns and ways people use data, and the changes that occur to data and the processes behind those changes (for a description of metadata in these terms, see the article on Amundsen, which in turn references Ground). This is a hybrid: connecting people to data and to the people behind it. For example, seeing the queries that evolved into a final table exposes the process someone used to develop that table and can provide crucial context. This is connecting people to people when the data is information developed by people, the ways people use other data, or the history of how people interacted with the data, all of it a kind of metadata.
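To make that concrete, here is a loose sketch of how query history alone can surface this people-to-people context. It is an illustration only, not a description of Kaldea’s internals; the query log, table names, and authors are all hypothetical.

```python
# A loose illustration (not Kaldea's internals): recovering "who built this
# table, from what, and how" out of a warehouse query log.
import re
from dataclasses import dataclass

@dataclass
class QueryRecord:
    author: str
    executed_at: str
    sql: str

# Hypothetical excerpt of a query history export.
query_log = [
    QueryRecord("jin", "2023-01-10", "CREATE TABLE mart.ltv_daily AS SELECT ... FROM raw.orders JOIN raw.users ON ..."),
    QueryRecord("amy", "2023-02-02", "INSERT INTO mart.ltv_daily SELECT ... FROM raw.orders WHERE ..."),
    QueryRecord("amy", "2023-02-03", "SELECT * FROM mart.ltv_daily LIMIT 100"),
]

def context_for(table: str, log: list[QueryRecord]) -> dict:
    """Collect the queries (and people) that wrote to `table`, plus its upstream tables."""
    writes = [q for q in log
              if re.search(rf"(CREATE TABLE|INSERT INTO)\s+{re.escape(table)}\b", q.sql, re.I)]
    upstream = {t for q in writes
                for t in re.findall(r"(?:FROM|JOIN)\s+([\w.]+)", q.sql, re.I)}
    return {
        "table": table,
        "authors": sorted({q.author for q in writes}),
        "last_modified": max((q.executed_at for q in writes), default=None),
        "upstream_tables": sorted(upstream - {table}),
        "defining_queries": [q.sql for q in writes],
    }

print(context_for("mart.ltv_daily", query_log))
```

Even this crude pass tells a new analyst who last touched the table, which upstream tables feed it, and which queries define it, which is exactly the context that otherwise lives only in a colleague’s head.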
Leveraging and enriching metadata: How do we connect people to data when data is data?
In our discussions with data producers and consumers, the first problem surfaced right away: finding and understanding data is difficult when the context, query, catalog and lineage, governance, and so on live in separate tools owned by different roles, or when there’s insufficient metadata to understand the data being looked at. The context is non-existent because documentation is missing, or it exists but is scattered across several places because the architecture is fragmented, or the knowledge is siloed in the mind of your fellow analyst.
All this context, as data and metadata, might be better served inside the platform where you work, so you can access it without switching contexts. The questions you ask yourself mid-analysis look a lot like the lists above.
The solution is two-fold:
- automatically generating metadata about the application, behavior, and changes of data in one place;
- taking the data in the fragmented data architecture and making it available as data or metadata in the place you do your analysis (a rough sketch of the idea follows below).
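As a minimal sketch of the first point, and not Kaldea’s implementation, the snippet below harvests table and column metadata straight from the engine’s own catalog so nobody has to type it by hand. It uses sqlite3 purely as a self-contained stand-in for a warehouse; a real setup would read information_schema in Snowflake, BigQuery, or Postgres instead.

```python
# A minimal sketch, assuming sqlite3 as a stand-in for the warehouse:
# pull table and column metadata from the engine's catalog automatically.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL, created_at TEXT);
    CREATE TABLE users  (user_id  INTEGER PRIMARY KEY, country TEXT, signup_at TEXT);
""")

def harvest_metadata(conn: sqlite3.Connection) -> dict[str, list[dict]]:
    """Return {table: [column descriptors]} read from the engine's own catalog."""
    catalog: dict[str, list[dict]] = {}
    tables = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
    for (table,) in tables:
        cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
        catalog[table] = [
            {"column": name, "type": ctype, "nullable": not notnull, "primary_key": bool(pk)}
            for _cid, name, ctype, notnull, _default, pk in cols
        ]
    return catalog

# The harvested catalog can then be enriched with usage stats, owners, and docs.
for table, columns in harvest_metadata(conn).items():
    print(table, columns)
```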
When analysts are connected to data in its fullest context, tools and experts don’t inadvertently become bottlenecks to another analyst’s work.
Metadata is the key to connecting people to data. The modern data analysis architecture rightly subsumes this under discovery, but the challenge is that metadata is generated outside the data analysis workflow, or it relies on fragmented tools and disparate people to collect, store, manage, and provide it.
Kaldea understands that metadata is the key to making data not only more easily discoverable, but also more contextualized and automated across the entire analysis workflow. Kaldea also takes metadata to mean something broader: the description of data; how people use, represent, and report on data across the analyst workflow; and how data changes. All of these are crucial aspects of metadata, necessary to understand the full context of the data you work with.
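To make that broader definition concrete, one hypothetical shape for such a metadata record might look like the following. The field names are ours, chosen for illustration, not Kaldea’s schema.

```python
# Hypothetical shape for an "expanded" metadata record: not just a description,
# but how the data is used, represented, reported on, and changed.
from dataclasses import dataclass, field

@dataclass
class TableMetadata:
    # Classic, descriptive metadata.
    name: str
    owner: str
    description: str = ""
    columns: dict[str, str] = field(default_factory=dict)          # column -> type

    # Usage: how people actually work with the table.
    defining_queries: list[str] = field(default_factory=list)      # queries that built it
    downstream_reports: list[str] = field(default_factory=list)    # dashboards / docs that read it
    frequent_joins: list[str] = field(default_factory=list)        # tables it is commonly joined with

    # Change: how the table evolves over time.
    last_refreshed: str | None = None
    schema_changes: list[str] = field(default_factory=list)        # e.g. "2023-03-01 added column churn_flag"

# Illustrative example record.
ltv = TableMetadata(
    name="mart.ltv_daily",
    owner="amy",
    description="Daily customer lifetime value, one row per user per day.",
    columns={"user_id": "INTEGER", "ltv": "REAL", "as_of": "DATE"},
    defining_queries=["CREATE TABLE mart.ltv_daily AS SELECT ..."],
    downstream_reports=["Weekly growth review"],
    frequent_joins=["raw.users"],
    last_refreshed="2023-03-02",
)
print(ltv)
```

The point of a record like this is that the usage and change fields can be populated automatically from query logs and schedules rather than maintained by hand.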
Bridging the gap in data context: How do we connect people to data when the data are people?
A second problem surfaced right away: ad-hoc communication to fill in missing context whenever data engineers, data scientists and analysts, and/or product managers collaborate.
Communication cost for Data Scientists and Analysts
When data producers need help with data, they seek out other experts or data owners. Data analysts, scientists, engineers, and their managers hold a lot of data, and a lot of context about the data actually in the warehouse, either because they own it or because they’ve worked with it before. Call it metadata in people, if you will. They might know the best data for a specific question, be the table owner who knows its contents and columns well, know the query (and query history) that produced a table, or be familiar with a specific report.
Metadata in people has a steep organizational cost: it asks people either to maintain documentation and an intensive taxonomy, or to proceed without docs and often without necessary context. It is also hard to rely on internal processes to get people to maintain and share their work so others can use it in the future.
Communication cost for Product and Business
When consumers don’t know how to find data, they go to their data producers. Product and business decision-makers can’t easily write a query or add a new data pipeline, so every time they need a new analysis, they have to hand over the full business context and the analyst has to work out which data is relevant (and which isn’t). Compounding the problem, these analyses typically need to move quickly. Analysts can plan around frequently used dashboards or metrics, but when you’re making a quick decision on a campaign or weighing a product change, you lose time-to-market while you wait, or you lose impact if you proceed without a data-driven decision.
The solution requires taking the data in people’s experience, minds, and work and making it data and metadata in the platform where you do your analysis. In this case, metadata must be seen more expansively as the way a data producer has used, processed, queried, represented, reported, or changed data, so that another analyst can find it easily.
Lack of metadata, maintenance and communication costs in the modern data analysis stack
To summarize briefly, two reasons why data analysis today, especially ad-hoc analysis, is lengthy, complex, and repetitive are:
- (A) the need for increasingly flexible metadata, including a more expansive working definition of metadata, and
- (B) intensive collaboration and communication to address gaps in context (e.g., if only you knew whether anyone had performed a similar analysis).
These two problems alone result in longer turnaround cycles and, at times, mean that decisions go unsupported by data, or that data can’t be delivered on time because of the required time investment.
Certain companies have solved (A) by moving from a centralized to a distributed data team, where an analyst sits on every team and has all the knowledge necessary to make fast analyses with less communication. Solutions for (B) in search of self-service data (BI tools, dashboards, metrics, no-code, and so on) have not yet landed on the right tool and still require the intensive work of managing a taxonomy. On one hand, few companies have the resources to dedicate analysts to every individual team. On the other, self-service solutions demand intense upfront effort to anticipate future data needs, can compromise deep analysis, and still suffer a troubling lack of context. How many times have you run into dashboards that were used once or twice for a particular analysis and never again?
Kaldea: flexible metadata, faster analysis, and more productivity in one data collaboration platform
So how do you reduce the intensive cost of running a modern data analysis stack, increase the speed of your analysis, and lessen the labor to organize your data and metadata?
With Kaldea, you won’t ask, “where was that table?” You won’t have a quarterly routine to manage taxonomy, reinvent an old report, or painstakingly search across Slack, calls, Jupyter, Athena, BigQuery, and Looker for previous work. You won’t have to create a Jira ticket to schedule a job. You won’t have to scratch your head at a table, column, or query, wonder whether it’s up to date, or book a call to understand what you’re looking at.
From discovery and definition to query, docs, and reporting, Kaldea consolidates and automatically generates metadata about data, how it’s used, represented, and reported, and how it changes across the entire analysis workflow. So you can discover, model, and analyze data faster and in its full context, without delays in analysis. If an analyst needs expert help from a colleague, Kaldea merges its flexible, expanded metadata, catalog, lineage, table Wiki, and Notion-like docs with collaborative tools to bridge or eliminate the context gap.