7 Best Open Source Data Governance Tools to Consider in 2022
Finding a good open source data governance tool can be a challenge for many reasons. First and foremost, the biggest hurdle in deciding anything related to data governance is the lack of a standardized approach: the goals are not well defined.
In addition, the data governance capabilities of most open source tools are unclear, so you must sift through the documentation pages and GitHub repositories to determine if a particular tool is appropriate for a particular use case. To simplify your evaluation process, we’ve compiled a list of the best open source data governance tools in 2022.
Best Open Source Data Governance Tools
Best Open Source Data Governance Tools – 1. Amundsen
Amundsen was originally built at Lyft and is currently hosted and maintained by LF AI & Data Foundation. In terms of data governance, it mainly addresses data security and compliance with data privacy and sovereignty laws. The idea is to tag and categorize all the data at the metadata layer.
By using Amundsen, you can search for metadata and learn who is using the data and how often they are using it. You can get a sense of the data by looking at these data access patterns, but this approach is reactive. For a more proactive approach, you need fine-grained data access policies that restrict access by team, role, individual, system, and so on.
Although Amundsen does not yet have role-based access control (RBAC), it still offers some essential data governance capabilities, such as tagging and categorizing metadata.
Because of the limited data governance capabilities available with the default Neo4j backend, the Amundsen project added support for Apache Atlas. Since Apache Atlas is one of the most mature metadata management platforms, many of its features have been tried and tested in various systems, bringing reliability to data cataloging and governance solutions. Amundsen has good support for data lineage and label/badge propagation (using lineage).
The Neo4j or Atlas backend is generally suitable for most enterprises, although some teams want more advanced functionality from their data cataloging and governance solutions.
Best Open Source Data Governance Tools – 2. DataHub
LinkedIn created DataHub after WhereHows was no longer a viable solution for the growing demand for metadata search and discovery tools. Prior to DataHub, LinkedIn had used other tools in conjunction with WhereHows to add some data governance capabilities.
DataHub allows you to have fine-grained access control over metadata. Access is driven by policies that you can declare from the Web UI and GraphQL API. DataHub’s policies apply at two levels: platform and metadata. Platform policies allow you to control user permissions for DataHub itself, for example, which features users can see and how far they can use them.
You can apply these policies to individual users or groups. Metadata policies, on the other hand, allow you to control which users have access to different metadata entities (charts, data sources, dashboards, and so on) and what actions they can perform on them. However, DataHub does not currently allow you to control read permissions.
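To make the two policy levels more concrete, here is a minimal Python sketch. It does not use DataHub's actual policy schema or API: the `Policy` class, the privilege names, the resource identifiers, and the `is_allowed` helper are all hypothetical and only illustrate platform-level versus metadata-level policies applied to users or groups.

```python
from dataclasses import dataclass


# Hypothetical model for illustration; this is NOT DataHub's real policy schema.
@dataclass(frozen=True)
class Policy:
    level: str                          # "platform" or "metadata"
    privilege: str                      # illustrative names, e.g. "MANAGE_POLICIES"
    users: frozenset = frozenset()      # individual users the policy applies to
    groups: frozenset = frozenset()     # groups the policy applies to
    resources: frozenset = frozenset()  # metadata policies only; empty means "all"


def is_allowed(policies, user, user_groups, privilege, resource=None):
    """Return True if any policy grants `privilege` to the user."""
    for p in policies:
        if p.privilege != privilege:
            continue
        # The policy must name the user directly or one of the user's groups.
        if user not in p.users and not (p.groups & set(user_groups)):
            continue
        if p.level == "platform":
            return True  # platform privileges are not tied to a resource
        if p.level == "metadata" and (not p.resources or resource in p.resources):
            return True
    return False


policies = [
    Policy("platform", "MANAGE_POLICIES", users=frozenset({"alice"})),
    Policy("metadata", "EDIT_TAGS", groups=frozenset({"analytics"}),
           resources=frozenset({"dataset:orders"})),
]
```

With these two example policies, `alice` can manage policies across the platform, while anyone in the `analytics` group can edit tags, but only on the `dataset:orders` entity.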
Several other features are part of the DataHub roadmap, but there is no clearly defined timeline yet. One of the primary planned data governance capabilities is role-based access control (RBAC) for entities and aspects (PDL records). RBAC would not only provide finer access control over metadata, but also help with tag management, data preview access control, and more.
In terms of governance and privacy, DataHub supports data set level classification, governed data movement, automatic data deletion, data export, and more. The maintainers plan to open source some compliance capabilities as part of their roadmap.
Best Open Source Data Governance Tools – 3. Apache Atlas
Apache Atlas was one of the first open source data catalogs to integrate data governance capabilities. The project has had a somewhat slow development cycle, and it was purpose-built for the Hadoop ecosystem. It works with anything integrated with Hive.
Apache Atlas is particularly good at classification, with the ability to create data sensitivity, expiration, and quality categories on the fly. Classification ties directly into data lineage, another popular feature of Apache Atlas. Atlas implements true data lineage; that is, the lineage is operational.
By using lineage data, Apache Atlas can propagate metadata properties to entities in a lineage hierarchy, a feature you won’t find in most other data governance tools.
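The propagation idea can be sketched as a walk over a lineage graph. This is an illustration only, not Apache Atlas code or its API: the graph layout, the entity names, and the `propagate` function are assumptions made for the example.

```python
from collections import deque


def propagate(lineage, classifications):
    """Copy each entity's classifications to everything downstream of it.

    lineage: dict mapping an entity to the entities derived from it.
    classifications: dict mapping an entity to its directly assigned labels.
    Returns a new dict holding direct plus inherited labels.
    """
    result = {entity: set(labels) for entity, labels in classifications.items()}
    for source, labels in classifications.items():
        queue = deque(lineage.get(source, []))
        seen = set()  # guards against cycles in the lineage graph
        while queue:
            node = queue.popleft()
            if node in seen:
                continue
            seen.add(node)
            result.setdefault(node, set()).update(labels)
            queue.extend(lineage.get(node, []))
    return result


# Hypothetical lineage: raw table -> staging table -> reporting table.
lineage = {
    "raw.customers": ["staging.customers"],
    "staging.customers": ["mart.customer_summary"],
    "raw.products": ["mart.product_summary"],
}
direct = {"raw.customers": {"PII"}}
effective = propagate(lineage, direct)
```

Here the `PII` label assigned to `raw.customers` reaches both tables derived from it, while the unrelated `mart.product_summary` stays unlabeled.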
Apache Atlas also has a number of data privacy and security features. For example, it has fine-grained access control over entities and categories, and works well with Apache Ranger for data authorization and masking. When working together, these features form an effective data privacy and security net, allowing data to be masked or classified as PII, sensitive, and so on. Notably, it also provides you with a framework to control who can access PII and sensitive data.
Best Open Source Data Governance Tools – 4. Magda
Magda was developed by Data61, the data science arm of CSIRO (Commonwealth Scientific and Industrial Research Organisation of Australia). The name stands for Making Australian Government Data Available. CSIRO deployed Magda to create an open data portal containing more than 70,000 datasets from the federal and state governments of Australia, and open-sourced the project for others to use.
Although Magda’s richest and most mature features remain search and discovery, it also provides powerful support for tagging and defining data set topics. In addition, Magda has built-in data preview options, including spreadsheets and interactive charts. Other tools, such as Amundsen, need to be integrated with Superset for this. One caveat: integrating with a tool like Superset for data preview is more scalable.
While Magda does not currently support role-based access control (RBAC) out of the box, it does support features that allow strict control of access to resources ingested into Magda. Magda runs on Kubernetes to remain cloud-agnostic. It uses Open Policy Agent (OPA) to manage access policies, which makes different types of access control possible, such as role-based, attribute-based, and so on.
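The difference between role-based and attribute-based checks is easier to see side by side. The following is a plain Python sketch, not Magda's policy language or API; the role names, the record fields (`published`, `owner_org_unit`), and both functions are hypothetical.

```python
def role_based_allows(user, action):
    """Role-based: the decision depends only on the roles the user holds."""
    role_permissions = {
        "admin": {"read", "write"},
        "viewer": {"read"},
    }
    return any(action in role_permissions.get(role, set()) for role in user["roles"])


def attribute_based_allows(user, action, record):
    """Attribute-based: the decision compares attributes of the user and the record."""
    if action == "read" and record["published"]:
        return True  # published records are readable by everyone
    # Otherwise only members of the owning organisational unit get access.
    return user["org_unit"] == record["owner_org_unit"]


user = {"roles": ["viewer"], "org_unit": "finance"}
draft = {"published": False, "owner_org_unit": "finance"}
other_draft = {"published": False, "owner_org_unit": "hr"}
```

In this sketch the same `viewer` user cannot write anything under the role-based check, yet can write to `draft` under the attribute-based check because the record belongs to the user's own unit. A policy engine lets you combine both styles in one set of rules.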
Best Open Source Data Governance Tools – 5. OpenMetadata
Announced in August 2021, OpenMetadata defines specifications to standardize metadata using a schema-first approach. It consists of a centralized metadata store and an ingestion framework that supports popular connectors in the data stack.
OpenMetadata takes a different approach to tagging: it allows you to tag data sets with their owners, and it also allows you to sort data sets into multiple tiers based on their importance. OpenMetadata also versions all metadata. Database entities (tables, views, schemas), tags, data set ownership details, and business glossary terms are all under version control, so you can see what changed, who changed it, and when.
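As a rough illustration of what versioned metadata means in practice, here is a small Python sketch. It is not OpenMetadata's data model or API; the `VersionedEntity` class, its fields, and the example owner and tier values are assumptions for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class VersionedEntity:
    """Hypothetical metadata entity that records every change made to it."""
    name: str
    metadata: dict = field(default_factory=dict)
    history: list = field(default_factory=list)  # one entry per recorded change

    def update(self, changed_by, **changes):
        """Apply changes and record who changed what, and when."""
        diff = {
            key: {"old": self.metadata.get(key), "new": value}
            for key, value in changes.items()
            if self.metadata.get(key) != value
        }
        if not diff:
            return  # nothing actually changed, so no new version
        self.metadata.update(changes)
        self.history.append({
            "version": len(self.history) + 1,
            "changed_by": changed_by,
            "changed_at": datetime.now(timezone.utc),
            "diff": diff,
        })


table = VersionedEntity("orders")
table.update("alice", owner="sales-team", tier="Tier1")
table.update("bob", tier="Tier2")
table.update("bob", tier="Tier2")  # repeated value: no new version is recorded
```

After these calls, `table.history` holds two versions, and the second one shows that `bob` moved the table from `Tier1` to `Tier2`.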
Best Open Source Data Governance Tools – 6. Egeria
Launched in 2019 and maintained by the LF AI & Data Foundation, Egeria is designed to exchange metadata between tools and platforms in a vendor-neutral manner. Other tools do this through SDKs and APIs, but their capabilities are limited, whereas Egeria does a good job of this because it is built around the principles of platform independence, ease of extensibility, and data accessibility.
While all the other tools we’ve seen so far address metadata management and governance issues primarily from a user’s perspective, Egeria tries to solve problems for users and systems, and works well with a variety of data tools.
Egeria gives you very fine-grained control over your metadata through governance zones, validity dates, metadata archiving, metadata provenance, and more, some of which are unique to Egeria. It is also worth mentioning that it comes with more than 800 predefined metadata types. You are not limited to these: you can also define your own types, which makes Egeria flexible enough to adapt to your business needs.
Best Open Source Data Governance Tools – 7. Truedat
Finally, Truedat, arguably the only mature open source data governance tool on the list, was created by Bluetab (now part of IBM) after understanding the market’s needs as a data solution provider and finding gaps in the data governance space.
Truedat shares a number of features with the other tools mentioned above, including a data catalog, search, data lineage capabilities, and so on. Still, its most popular features are the business glossary and the ability to share data across teams, with fine-grained controls that focus on data management, data ownership, classification, and so on.
Other features make Truedat unique on this list. One is a data-sharing feature, similar to Snowflake data sharing, that makes it easier for teams to share data and collaborate. In addition, to ensure a high degree of security and control over data, subscription and notification capabilities can be used to record change events in audit trails and monitor them in real time.
Thank you for reading our article. We hope it is helpful when you’re looking for the best open source data governance tools. If you want more information about open source data governance tools, we recommend visiting Gudu SQLFlow.
As one of the best data lineage tools available on the market today, Gudu SQLFlow can not only analyze SQL script files, extract data lineage, and display it visually, but also lets users supply data lineage in CSV format and visualize it. (Published by Ryan on Jul 16, 2022)