Everything You Need to Know About Data Lineage
What is data lineage? What are the characteristics of data lineage? What are the uses of data lineage? What are the methods of data lineage collection? If you are looking for answers to the above questions, you have come to the right place. In this article, we will go into detail about everything you need to know about data lineage.
Everything You Need to Know About Data Lineage – What is data lineage?
Throughout the life of data, from generation through processing, integration, and circulation to its eventual demise, relationships naturally form between data. By analogy with family relationships in human society, these are called data lineage relationships. Data lineage is one of the components of metadata.
It can be used to analyze the lineage path of tables and fields from the data source to the current table, to check whether the relationships between lineage fields hold, whether the data of concern is consistent, and whether the table design is reasonable. It can also be used to analyze the impact of changes in upstream data on downstream data, and to trace a problem back to its upstream source when downstream data changes unexpectedly.
Everything You Need to Know About Data Lineage – The Characteristics of Data Lineage
- Attribution: Generally speaking, specific data belongs to a specific organization or individual, and the data has attribution.
- Multisource: The same data can have multiple sources. One piece of data can be generated by processing several others, and it may pass through more than one processing step.
- Traceability: The data lineage relationship reflects the entire life cycle of data, from generation to demise, which makes the data traceable.
- Hierarchical: Data lineage relationships are hierarchical. Descriptive information derived from the data, such as classifications, generalizations, and summaries, forms new data, and descriptions at different degrees of abstraction form the levels of the data.
Everything You Need to Know About Data Lineage – The Uses of Data Lineage
- Data Traceability: The data lineage relationship reflects the ins and outs of the data, which helps us track the source of the data and the way it was processed. When data is abnormal, it is necessary to trace the cause of the abnormality and control the risk to an appropriate level. As enterprises grow, data sources multiply and their quality varies, which inevitably affects the data results. The ability to trace data is therefore very valuable.
- Impact Analysis: With more and more data applications, the data flow chain is getting longer and longer. When the core business data at a source changes, downstream analysis applications must be kept in sync. Without impact analysis, such a change can cause abnormal access in every data service that depends on it. Impact assessment is often a core requirement of development teams.
- The Data Value: Measuring the value of data is a very difficult problem, and any evaluation needs a basis. Data lineage provides reference data in several aspects to facilitate evaluation:
  - Data audience: The more consumers a piece of data has, the greater its value.
  - Update magnitude: The larger the amount of updated data, the greater the value of the data.
  - Update frequency: The more frequent the update, the greater the value of the data.
- Quality Assessment: Data analysis applications need to control the quality of their final result data. When data problems are found, it is necessary to trace the data back to its source, analyze the causes of the quality problems, and then solve them. The processing steps can be read directly from the data lineage diagram, and the possible quality problems can then be analyzed.
- The Life Cycle: Through the data lineage, the entire life cycle of the data can be intuitively obtained. For less valuable data, consider tiering, archiving, or even destruction.
- Compliance Requirements: This is often the need of upper management. For regulatory compliance, all points and sources of data flow are the key points that need to be regulated. Therefore, a comprehensive understanding of the relationship between data is required.
- Security Control: Security compliance often requires understanding how the data is used and who its audience is. Both are available from data lineage, and they provide the basis for permission management (table-level and field-level authorization), thereby ensuring data security at a higher level. Combined with the security identifiers in the metadata (for example, which data needs to be desensitized), security management and control can be carried out across the whole domain.
- Architecture Design: Data lineage makes it possible to understand data from the perspective of data processing, which is particularly useful in some scenarios. For example, when an old project is terminated and a new project has to take over, without a data flow mapping table it takes a lot of time to sort out the data flows, and it is difficult to ensure the integrity and correctness of the migration.
- Data Assets: For the company management, the overall data flow can be understood through data lineage. This is of great benefit to the formulation of the company’s data asset strategy.
- R&D Needs: For data R&D personnel, if there is data lineage, it is convenient to find and understand the data processing logic and troubleshoot problems.
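The "data audience" criterion from the list above can be read straight off a lineage graph. The following is a minimal sketch, assuming lineage is available as a list of (upstream, downstream) pairs; the table names and the helper functions `audience_counts` and `review_candidates` are hypothetical and only illustrate the idea. Objects with no consumers in the graph are merely candidates for review, since they may still be read by reports or users that the lineage does not capture.

```python
# Hypothetical lineage edges: (upstream object, downstream object).
EDGES = [
    ("ods.orders", "dwd.order_detail"),
    ("ods.customers", "dwd.order_detail"),
    ("dwd.order_detail", "ads.daily_sales"),
    ("dwd.order_detail", "ads.customer_ltv"),
    ("ods.legacy_log", "dwd.legacy_log_clean"),
]


def audience_counts(edges):
    """Count the distinct direct consumers of every object in the graph."""
    consumers = {}
    for upstream, downstream in edges:
        consumers.setdefault(upstream, set()).add(downstream)
        # Register the downstream object too, so leaf objects get a count of 0.
        consumers.setdefault(downstream, set())
    return {name: len(targets) for name, targets in consumers.items()}


def review_candidates(edges):
    """Objects with no recorded consumers, sorted by name."""
    counts = audience_counts(edges)
    return sorted(name for name, n in counts.items() if n == 0)


if __name__ == "__main__":
    for name, n in sorted(audience_counts(EDGES).items()):
        print(f"{name}: {n} direct consumer(s)")
    print("Review for tiering or archiving:", review_candidates(EDGES))
```

Update magnitude and update frequency would need runtime statistics in addition to lineage, so they are left out of this sketch.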
Everything You Need to Know About Data Lineage – The Methods of Data Lineage Collection
- Automatic Parsing: Automatic parsing is currently the main collection method. It works by parsing SQL statements, stored procedures, ETL procedures, and similar files. Because code and application environments are complex, the experience of international vendors is that automatic parsing can cover 70-95% of enterprise data, but it currently cannot reach 100%.
- System Tracking: With this method, the data processing program itself is responsible for sending lineage information as the data flows through it, for example by adding a description of the lineage in an annotation section. The advantage of this approach is that collection is accurate, timely, and fine-grained; the disadvantage is that it is intrusive. However, if the company adopts a unified processing platform, this method is highly recommended.
- Machine Learning: This method calculates the similarity of data based on the dependencies between data sets. The advantage of this method is that it does not depend on tools or business knowledge; the disadvantage is that the results need to be confirmed manually.
- Manual Collection: The last one is the “universal” method, in which R&D staff organize the data lineage by hand.
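To make the automatic parsing method more concrete, here is a deliberately simplified sketch that extracts table-level lineage from a single `INSERT INTO ... SELECT` statement using regular expressions. The function name `table_lineage` and the sample tables are assumptions made for illustration. Real tools use a full SQL parser; this toy version would, for instance, mistake CTE names for source tables and cannot see dynamic SQL, which hints at why automatic parsing does not reach full coverage.

```python
import re

# Target table of an INSERT INTO statement.
TARGET_RE = re.compile(r"\binsert\s+into\s+([\w.]+)", re.IGNORECASE)
# Any table name that directly follows FROM or JOIN.
SOURCE_RE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def table_lineage(sql):
    """Return sorted (source_table, target_table) pairs for one statement.

    Toy heuristic only: handles a plain INSERT INTO ... SELECT statement
    and ignores subquery aliases, CTEs, and field-level mappings.
    """
    target = TARGET_RE.search(sql)
    if target is None:
        return []
    target_name = target.group(1).lower()
    sources = {name.lower() for name in SOURCE_RE.findall(sql)}
    sources.discard(target_name)
    return [(source, target_name) for source in sorted(sources)]


SAMPLE_SQL = """
INSERT INTO dwd.order_detail
SELECT o.id, c.name
FROM ods.orders o
JOIN ods.customers c ON o.customer_id = c.id
"""

if __name__ == "__main__":
    for source, target in table_lineage(SAMPLE_SQL):
        print(f"{source} -> {target}")
```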
Everything You Need to Know About Data Lineage – The Application of Data Lineage
Building on the capabilities of data lineage, the following applications are common:
- Given a table or field, trace its forward objects across multiple levels.
- Given a table or field, find its associated backward objects across multiple levels.
- Beyond a single object, you can understand the overall data flow at a larger scale (for example, within a project). This is useful for analyzing hot objects, data cleaning, and similar tasks.
- Starting from the perspective of “operations”, analyze the operations that come before and after a given operation. This is useful for troubleshooting, quality analysis, and more.
- In addition to lineage in the usual sense, there is a more generalized lineage. Objects are related not only at the data level but also at the reference level, that is, through the way objects use one another, including but not limited to tables, views, models, reports, and jobs. This is meaningful for data value assessment, data asset management, and similar work.
- Traditional data exploration is usually carried out according to the data hierarchy definition. Data lineage provides another exploration method: a step-by-step exploration that follows the data processing logic.
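The first two applications above amount to a multilevel traversal of the lineage graph in one direction or the other. A minimal sketch follows, assuming lineage is stored as (upstream, downstream) pairs; the object names and the helpers `build_index` and `trace` are hypothetical, and which direction the article calls "forward" or "backward" is left open here by naming the two directions upstream and downstream.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges: (upstream object, downstream object).
EDGES = [
    ("ods.orders", "dwd.order_detail"),
    ("ods.customers", "dwd.order_detail"),
    ("dwd.order_detail", "ads.daily_sales"),
    ("ads.daily_sales", "report.sales_dashboard"),
]


def build_index(edges):
    """Build two adjacency maps: object -> upstream set, object -> downstream set."""
    upstream = defaultdict(set)
    downstream = defaultdict(set)
    for src, dst in edges:
        downstream[src].add(dst)
        upstream[dst].add(src)
    return upstream, downstream


def trace(start, neighbors, max_depth=None):
    """Breadth-first traversal; returns {object: level away from start}."""
    levels = {}
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue  # do not expand beyond the requested number of levels
        for nxt in sorted(neighbors.get(node, ())):
            if nxt not in seen:
                seen.add(nxt)
                levels[nxt] = depth + 1
                queue.append((nxt, depth + 1))
    return levels


UPSTREAM, DOWNSTREAM = build_index(EDGES)

if __name__ == "__main__":
    print("Sources of the dashboard:", trace("report.sales_dashboard", UPSTREAM))
    print("Affected by ods.orders:", trace("ods.orders", DOWNSTREAM))
```

The same `trace` function works at field level if the edges connect `table.field` identifiers instead of tables, and the `max_depth` argument limits how many levels are expanded.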
Thank you for reading our article; we hope it is helpful to you. If you want to learn more about data lineage, we recommend visiting Gudu SQLFlow for more information.
As one of the best data lineage tools available on the market today, Gudu SQLFlow can not only analyze SQL script files, obtain data lineage, and display it visually, but also lets users provide data lineage in CSV format and display it visually. (Published by Ryan on Aug 29, 2022)