5 Best Open Source Data Lineage Tools to Consider in 2022
The essence of data governance is to help companies create data policies and ensure that people can comply with those policies. These policies address a range of data-related processes, including guidelines for data protection, verification and use. Data stewards must solicit data requirements from business users and work with data governance council members to agree on common data definitions, specify data quality metrics, articulate relevant policies, and develop methods to measure compliance.
However, bridging the gap between defining data governance policies and implementing them is often a formidable challenge. The purpose of these policies is to control and monitor the quality of data assets across business workflows, but data stewards with key data quality management responsibilities are often not properly trained or qualified. This is where data lineage tools come in. In this article, we introduce five of the best open source data lineage tools available in 2022.
Best Open Source Data Lineage Tools – 1. Tokern
Tokern is built for cloud data warehouses and data lakes, and takes a dedicated approach to obtaining column-level data lineage from databases and data warehouses hosted on Google BigQuery, AWS Redshift, and Snowflake. Support for more sources, such as SparkSQL, AWS Athena, and Presto, is in development. Tokern has considerable integration capabilities because it works well with most open source data catalogs and ETL frameworks.
Tokern Data Lineage Features:
Tokern was released a while ago and takes into account the latest data engineering and design patterns. For example, in addition to building data lineage from dbcat (its data catalog), Tokern also allows you to build data lineage from query history or ETL scripts, making it well suited to BI and ETL tool integration. Tokern stores the data catalog and lineage in a PostgreSQL database. Users can query this database with SQL for further analysis, or feed it into other visualization and analysis engines.
The visualization engine Kedro-Viz and the network graph analysis library NetworkX are behind Tokern’s visualization and analysis capabilities. These libraries help you track, visualize, and analyze column-level lineage data. You can also interact with lineage data using Tokern’s SDK or API.
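To make the idea of column-level lineage analysis concrete, here is a minimal sketch of the kind of graph query such a library answers: given a derived column, which upstream columns feed it? The edge list and column names are hypothetical, and the traversal is written in plain Python rather than against Tokern's actual SDK. With NetworkX, the equivalent lookup on a directed graph is networkx.ancestors.

```python
from collections import deque

# Hypothetical column-level lineage edges: (source column, derived column).
EDGES = [
    ("raw.orders.amount", "staging.orders.amount_usd"),
    ("raw.orders.currency", "staging.orders.amount_usd"),
    ("staging.orders.amount_usd", "mart.revenue.total_usd"),
    ("raw.customers.id", "mart.revenue.customer_id"),
]


def upstream_columns(edges, target):
    """Return every column that feeds `target`, directly or indirectly."""
    # Index the edges by derived column so parents can be looked up quickly.
    parents = {}
    for source, derived in edges:
        parents.setdefault(derived, set()).add(source)

    # Breadth-first walk from the target back toward the raw sources.
    seen = set()
    queue = deque([target])
    while queue:
        column = queue.popleft()
        for parent in parents.get(column, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


print(sorted(upstream_columns(EDGES, "mart.revenue.total_usd")))
```

Running the sketch lists the staging column and the two raw order columns, while the unrelated customer column is left out.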
In addition to its data lineage capabilities, Tokern uses PIICatcher to detect PII (Personally Identifiable Information) and PHI (Protected Health Information). The built-in tool combines regular expressions with several standard NLP libraries for PII detection, such as spaCy and Stanford NER.
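The regular-expression half of that approach can be sketched in a few lines. This is an illustration of the general technique only, not PIICatcher's code: the pattern set, the sample column names, and the detect_pii helper are all assumptions made for the example, and real detectors cover far more formats.

```python
import re

# Illustrative patterns only; real detectors cover many more formats.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def detect_pii(sample_values):
    """Return the sorted names of PII patterns found in any sample value."""
    found = set()
    for value in sample_values:
        for name, pattern in PII_PATTERNS.items():
            if pattern.search(value):
                found.add(name)
    return sorted(found)


# Hypothetical sampled values per column.
columns = {
    "contact": ["alice@example.com", "bob@example.org"],
    "tax_id": ["123-45-6789"],
    "notes": ["shipped on time"],
}
report = {column: detect_pii(values) for column, values in columns.items()}
print(report)
```

Regular expressions work well for rigidly formatted values like these. Free-text fields such as names and addresses are where the NLP libraries come in.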
Best Open Source Data Lineage Tools – 2. Egeria
Described as the world’s first open source metadata standard, Egeria provides a way to seamlessly integrate data engineering tools for a reliable and consistent view of metadata. In addition to cataloging and searching metadata, the standard enables users to build more advanced solutions for data lineage tracing, data quality checking, PII identification, and more.
Many data engineering architectures involve a great deal of avoidable chatter between various data tools. Egeria moves away from this and instead adopts a hub-and-spoke model, where everything goes through Egeria. In this way, each tool needs to talk to only one counterpart.
Egeria Data Lineage Features:
Egeria uses a well-known open standard, OpenLineage, to capture and store data lineage. OpenLineage also gives you greater insight into your data by providing both horizontal and vertical lineage.
Egeria listens for Kafka events emitted by source systems to capture data lineage information. After obtaining this information, Egeria asks lineage stewards to match and link the lineage graphs that it could not connect on its own. After that, the lineage is ready for business consumption.
The data lineage capabilities in Egeria are well aligned with the capabilities of data discovery and management, metadata provenance, and so on. These capabilities and Egeria’s lineage design and architecture make it a compelling and well-thought-out data governance and data lineage tool.
Best Open Source Data Lineage Tools – 3. Pachyderm
Like Tokern, Pachyderm is another specialized data lineage tool. Rather than focusing on cloud data warehouses, it aims to enable developers to build machine learning pipelines in a language- and framework-independent way.
It implements a version control system, similar to lakeFS or Git, to maintain the lineage of data objects. Changes to these objects (think commits) are captured and stored by Pachyderm to maintain a complete and immutable audit trail of events. This audit trail gives you a data lineage map for viewing and analysis, and allows you to reproduce data and code at any point in time for debugging or compliance reasons.
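The general idea of a commit-based, immutable audit trail can be illustrated with a small hash chain. This is a sketch of the concept only and makes no claim about how Pachyderm implements commits internally. The commit and verify helpers and the file names are hypothetical.

```python
import hashlib
import json


def _commit_id(parent, files):
    """Derive an ID from the commit content and the ID of its parent."""
    payload = json.dumps({"parent": parent, "files": files}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def commit(history, files):
    """Return a new history with one more commit appended."""
    parent = history[-1]["id"] if history else None
    entry = {"id": _commit_id(parent, files), "parent": parent, "files": files}
    return history + [entry]


def verify(history):
    """Recompute every ID to confirm no commit was altered after the fact."""
    parent = None
    for entry in history:
        if entry["parent"] != parent:
            return False
        if entry["id"] != _commit_id(parent, entry["files"]):
            return False
        parent = entry["id"]
    return True


history = commit([], {"train.csv": "v1"})
history = commit(history, {"train.csv": "v2", "labels.csv": "v1"})
print(verify(history))  # the untouched history passes verification

# Editing an old commit breaks the chain, so the tampering is detectable.
tampered = [dict(entry) for entry in history]
tampered[0]["files"] = {"train.csv": "edited"}
print(verify(tampered))
```

Because each ID depends on its parent's ID, any past state can be referenced unambiguously, which is what makes reproducing data at a given point in time possible.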
Pachyderm Data Lineage Features:
To achieve seamless data lineage tracking and data versioning, Pachyderm uses a central repository backed by an object store such as AWS S3 and exposed through a custom file system called PFS (Pachyderm File System). PFS helps your object store (such as S3) become the single source of truth for your data, with its complete history.
Pachyderm also enforces immutability in your data source, which allows it to assign global IDs to lineage events and data objects. Pachyderm lets you view the immutable data lineage graphs as DAGs in the UI. Both of these features are beneficial when you work with ML pipelines and want to trace results back to their inputs.
Pachyderm integrates with the most widely used databases, data warehouses, and data lakes. In addition, you can import data from any database into Pachyderm using an SQL-based ingestion tool. However, Pachyderm has limitations as a general-purpose data lineage tool, which is why most of Pachyderm’s enterprise customers use it to handle MLOps, unstructured data ETL, and NLP workloads.
Best Open Source Data Lineage Tools – 4. OpenLineage
OpenLineage was founded by Datakin, the company that took over Marquez’s development after WeWork open-sourced it. Datakin turned over the OpenLineage project to the Linux Foundation as a sandbox project in mid-2021. Inspired by OpenTelemetry, which is ubiquitous in the observability field, OpenLineage aims to establish an open standard for data lineage collection and analysis.
Integration is central to OpenLineage’s design and mission. It integrates with ETL frameworks, data orchestration engines, metadata catalogs, data quality engines, and data lineage tools. OpenLineage uses JSON Schema for its API definition and supports various languages and frameworks. Egeria is one of the popular data tools that builds on OpenLineage.
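Because the standard is defined as JSON, a lineage event can be assembled with nothing more than a standard library. The sketch below builds a dictionary shaped like an OpenLineage run event. The namespaces, job name, and producer URL are hypothetical, and the schemaURL value is an assumption that should be checked against the spec version you target. The build_run_event helper is ours, not part of any OpenLineage client.

```python
import json
import uuid
from datetime import datetime, timezone

# Assumed spec URL; verify it against the OpenLineage version you target.
SCHEMA_URL = "https://openlineage.io/spec/1-0-5/OpenLineage.json#/definitions/RunEvent"


def build_run_event(event_type, job_name, inputs, outputs, run_id=None):
    """Build a dict shaped like an OpenLineage run event."""
    return {
        "eventType": event_type,  # e.g. "START" or "COMPLETE"
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id or str(uuid.uuid4())},
        "job": {"namespace": "example_namespace", "name": job_name},
        "inputs": [{"namespace": "example_db", "name": n} for n in inputs],
        "outputs": [{"namespace": "example_db", "name": n} for n in outputs],
        "producer": "https://example.com/hypothetical-producer",
        "schemaURL": SCHEMA_URL,
    }


event = build_run_event(
    "COMPLETE",
    "daily_revenue",
    inputs=["raw.orders"],
    outputs=["mart.revenue"],
)
print(json.dumps(event, indent=2))
```

A backend that understands the standard can read the inputs and outputs of each event and connect jobs to datasets to form the lineage graph.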
WeWork’s Marquez is also at the heart of OpenLineage’s architecture: Marquez provides the UI and metadata repository, while the metadata collection API comes from OpenLineage. You can also access OpenLineage through GraphQL and REST APIs.
OpenLineage is an attractive choice because it works with most existing data engineering stacks and lets you collect, track, and analyze data lineage comprehensively.
Best Open Source Data Lineage Tools – 5. TrueDat
As a complete data governance solution, TrueDat enables you to categorize, search, and track data in detail. With its data lineage capabilities, TrueDat can also help you visualize the entire life cycle of your data, giving you insight into the journey of your data over time.
TrueDat was built by BlueTab (an IBM company) in 2017 and has been in active development since then, with its latest version, V4.39, released in March 2022.
TrueDat Data Lineage Features:
TrueDat allows you to use data lineage to analyze the impact of database changes and better understand your reporting business logic. It allows you to trace the lineage of a data object with point-in-time visibility. For advanced analysis, you can also apply filters to lineage objects to examine specific parts of the lineage diagram. In addition to the graphical representation in the UI, you can download the collected data lineage information as a CSV file. Because TrueDat provides a solid set of data governance and lineage capabilities, it is a real contender to solve your data lineage problems.
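Once lineage is available as CSV, impact analysis can also be scripted outside the UI. The following sketch assumes a hypothetical two-column export (source, target). TrueDat's real export format may differ, so treat the column names and sample rows as placeholders.

```python
import csv
import io

# Hypothetical export; the column names are assumptions, not TrueDat's format.
LINEAGE_CSV = """source,target
crm.customers,staging.customers
staging.customers,mart.customer_360
staging.customers,mart.churn_report
erp.invoices,mart.revenue
"""


def impacted_by(csv_text, changed):
    """Return all objects downstream of `changed`, in sorted order."""
    # Map each source object to the objects derived from it.
    children = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        children.setdefault(row["source"], []).append(row["target"])

    # Depth-first walk from the changed object toward its consumers.
    impacted = set()
    stack = [changed]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            if child not in impacted:
                impacted.add(child)
                stack.append(child)
    return sorted(impacted)


print(impacted_by(LINEAGE_CSV, "crm.customers"))
```

For a change to crm.customers, the sketch reports the staging table and both marts built from it, and leaves the unrelated revenue mart out.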
Thank you for reading our article. We hope it helps you find the best open source data lineage tool for your needs. If you want to learn more about data lineage, we recommend visiting Gudu SQLFlow for more information.
As one of the best data lineage tools available on the market today, Gudu SQLFlow can not only analyze SQL script files, extract data lineage, and display it visually, but also lets users supply data lineage in CSV format for visualization. (Published by Ryan on Jul 14, 2022)