Data vs metadata, and why it matters for analysis
“What is data vs metadata?” sounds like a simple question, but it has more implications for data analysis than you might think. How broadly and flexibly you define data and metadata matters, because the definition determines which data is available to the analyst for analysis and which data is available to enrich the context of that analysis. Often, the tools in the modern data analysis stack leave some of this data and metadata out.
Imagine you join a data team as an analyst and you’re ramping up. You might know the canonical tables, see the metrics and reports, and so on. But what about your team’s frequently used tables? What about the tables they frequently join with? What about the queries they write or how those queries evolved? What about changes recently made to a table or schema? What about the reports or visualizations the team made, who uses them, and how often? What about the docs? Which sources or data are not frequently used, and are they useful at all?
Some of this metadata may be uninteresting on its own, but combined with the primary data for analysis, it becomes a powerful way to see the most relevant data, how others use it, what has been built from it, and even what is being underutilized.
This is not the traditional domain of data and metadata, but it is a key to eliminating data silos in the modern data analysis stack for faster analysis, fewer meetings, and less context switching.
This article will cover the following as it explores data vs metadata:
- The formal definitions of data and metadata
- The working definition of data
- The working definition of metadata
- How the synergy of data vs metadata solves data silos in the modern data analysis stack
- How Kaldea consolidates data and metadata in one analytics workbench
Formal definition of ‘data’ and ‘metadata’
According to the International Organization for Standardization (ISO), data is the “reinterpretable representation of information in a formalized manner suitable for communication, interpretation or processing” (ISO 2382‑1:2015, 2121272).
And metadata is the “data about data or data elements, possibly including their data descriptions, and data about data ownership, access paths, access rights” and the ways that data changes (ISO/IEC 2382:2015, 17.06.05).
Working definition of ‘data’
A working definition of data—structured, semi-structured, and unstructured data—includes the formal definition above and the relevant sources of data and the information they contain, their storage, and processing.
Sources of data can tell us a lot about what information the data contains. Sources include:
- Online Transaction Processing (OLTP) and Change Data Capture (CDC)
- Internal operations data via Enterprise Resource Planning (ERP) or Customer Relationship Management (CRM) tools
- Operational apps and SaaS tools related to sales, marketing, customer support, etc.
- Event collectors and streams
- 3rd party APIs
- File and object storage
- Analytical and reporting tools
- Data analytics workbenches, workspaces, notebooks, and so on
- Dashboards, metric stores, other analytical reporting tools
Data includes (but is not limited to):
- Data stores: tables, schemas, data warehouses, data lakes, and so on
- Analytic outputs: saved queries, metric stores, dashboard and reports from BI tools
- Events, logs, and streams
- Processes: ETLs, Reverse ETLs, ML workflows, etc.
- People working with data, their knowledge and use of data
Working definition of ‘metadata’
A good working definition of metadata can be understood through the ABCs of metadata. In an article about Amundsen, Mark Grover adapts the ABC of metadata from an article on Ground (Hellerstein, et al., CIDR ’17 January 8-11, 2017) to describe metadata.
Essentially, metadata is a set of data that gives information about another set of data.
What kind of information does metadata give about data?
According to the ABC model of metadata in Hellerstein et al., metadata gives the following information about data:
- Application context: information that describes how raw data is interpreted for use by people and applications, including encodings, schemas, taxonomies, tags, models, and user annotations.
- Behavior: information about how the data was created and used over time, including upstream and downstream lineage (i.e. the source of data and its products) and usage logs that can reveal patterns of use, frequent users, and so on.
- Change: information about the version history of data and the code used to produce it, schema and taxonomy evolution, and so on.
So metadata is not simply the descriptive elements of data that allow us to understand its structure or qualities, but it also includes how people use the data and how the data changes over time.
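To make the three parts of the ABC model concrete, here is a minimal Python sketch of a metadata record for a single table. It is only an illustration: the class names, fields, table name, and sample values are all assumptions made up for this example, not part of the Ground paper, Amundsen, or any Kaldea API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ApplicationContext:
    """How raw data is interpreted: schema, tags, user annotations."""
    schema: Dict[str, str] = field(default_factory=dict)  # column name -> type
    tags: List[str] = field(default_factory=list)
    annotations: List[str] = field(default_factory=list)


@dataclass
class Behavior:
    """How the data was created and used: lineage and usage logs."""
    upstream: List[str] = field(default_factory=list)
    downstream: List[str] = field(default_factory=list)
    query_count_by_user: Dict[str, int] = field(default_factory=dict)

    def frequent_users(self, top_n: int = 3) -> List[str]:
        """Users ranked by query count, ties broken alphabetically."""
        ranked = sorted(
            self.query_count_by_user.items(), key=lambda kv: (-kv[1], kv[0])
        )
        return [user for user, _ in ranked[:top_n]]


@dataclass
class Change:
    """How the data evolved: one schema snapshot per version, oldest first."""
    schema_versions: List[Dict[str, str]] = field(default_factory=list)

    def added_columns(self) -> List[str]:
        """Columns present in the latest schema version but not in the first."""
        if len(self.schema_versions) < 2:
            return []
        first, latest = self.schema_versions[0], self.schema_versions[-1]
        return sorted(set(latest) - set(first))


@dataclass
class TableMetadata:
    name: str
    application_context: ApplicationContext
    behavior: Behavior
    change: Change


# Hypothetical table and values, for illustration only.
orders = TableMetadata(
    name="analytics.orders",
    application_context=ApplicationContext(
        schema={"order_id": "int", "amount": "float", "currency": "str"},
        tags=["finance", "canonical"],
    ),
    behavior=Behavior(
        upstream=["raw.orders_cdc"],
        downstream=["dashboards.weekly_revenue"],
        query_count_by_user={"ana": 42, "ben": 7, "chloe": 19},
    ),
    change=Change(
        schema_versions=[
            {"order_id": "int", "amount": "float"},
            {"order_id": "int", "amount": "float", "currency": "str"},
        ]
    ),
)

print(orders.behavior.frequent_users(2))  # ['ana', 'chloe']
print(orders.change.added_columns())      # ['currency']
```

Keeping the three parts as separate objects is one possible design choice: it lets a tool fill in application context from a catalog, behavior from query logs, and change from version history independently of each other.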
Data vs Metadata - how the synergy of data and metadata solves data silos in the modern data stack
Current definitions of metadata tend to emphasize the Application Context and neglect user behavior with respect to data and how data changes.
Yet expanding metadata to include how data producers use data, how data consumers consume it, and how data changes would help break down data silos in both tools and people: the data owners and experts (or former one-person teams) who hold crucial information about the data.
Imagine you’re an analyst preparing an analysis for a product team for the first time. Now imagine being able to see the most popular tables, what they are most frequently joined with, which reports and metrics these tables serve, and who uses them most, with the catalog, lineage, schemas, and notebook all in one place. With all this rich context, you can understand, discover, and define requirements faster and with fewer meetings.
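As a rough illustration of how this kind of behavioral metadata could be derived, the Python sketch below counts table popularity, frequent join partners, and top users from a small query log. Everything here is an assumption for the example: the log format, the table and user names, and the deliberately naive regular expression, which only picks up the identifier right after FROM or JOIN and is no substitute for a real SQL parser.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical query log of (user, SQL text) pairs. In practice this might
# come from a warehouse's query history; that is an assumption here.
QUERY_LOG = [
    ("ana", "SELECT * FROM orders o JOIN customers c ON o.customer_id = c.id"),
    ("ben", "SELECT count(*) FROM orders"),
    ("ana", "SELECT * FROM orders o JOIN customers c ON o.customer_id = c.id "
            "JOIN refunds r ON r.order_id = o.id"),
    ("chloe", "SELECT * FROM sessions"),
]

# Naive pattern: the identifier that directly follows FROM or JOIN.
TABLE_PATTERN = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def tables_in(sql):
    """Return the sorted, de-duplicated table names referenced in one query."""
    return sorted({name.lower() for name in TABLE_PATTERN.findall(sql)})


def usage_stats(log):
    """Count table usage, co-occurring table pairs, and users per table."""
    table_counts = Counter()
    join_counts = Counter()
    users_by_table = {}
    for user, sql in log:
        tables = tables_in(sql)
        table_counts.update(tables)
        # Every pair of tables in the same query counts as one join.
        join_counts.update(combinations(tables, 2))
        for table in tables:
            users_by_table.setdefault(table, Counter())[user] += 1
    return table_counts, join_counts, users_by_table


table_counts, join_counts, users_by_table = usage_stats(QUERY_LOG)

print(table_counts.most_common(1))              # [('orders', 3)]
print(join_counts.most_common(1))               # [(('customers', 'orders'), 2)]
print(users_by_table["orders"].most_common(1))  # [('ana', 2)]
```

Even this simple pass over the log answers three of the questions above: which table is most popular, what it is usually joined with, and who uses it most.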
Kaldea consolidates your data and metadata in one analytics workbench
Kaldea is a collaborative analytics workbench that consolidates your data, its context, its sources and products, and your metadata into one platform, so you can analyze and discover data more productively and quickly. By working with an expanded and flexible definition of data and metadata, Kaldea connects your data to its fullest context, sources, and products so it is easily discoverable and searchable, and ultimately simpler to analyze.
Try us out at Kaldea.com.