Why Data Lineage Matters and Why It’s So Challenging?
Change brings risk. This is one of the fundamental principles of software development that most of us learn early in our careers. Yet despite these repeated life lessons, it always seems to keep coming. Our inability to foresee the impact of change, even a small one, often leads to negative outcomes. This problem is exacerbated as the complexity of interconnected IT systems increases. It is increasingly difficult to predict how even small modifications will affect the upstream or downstream systems of the change. Data lineage accomplishes this, but doing it well is extremely difficult.
Why Data Lineage Matters
Why Data Lineage Matters?
In the old world, data resided in silos with relatively few connection points between them. But today, data flows in multiple directions between multiple systems. Many of these systems share data with external organizations, which means that information then flows to internal systems there as well. As a result, it’s easier than ever to ignore the influence of a single change on multiple other systems.
For example, let’s say your company offers a mobile application that enables customers to request new service calls quickly and easily. In the past, your customers could specify the product variants they had. Your design team has determined that since you already have that information in your CRM database, you no longer need to collect this information from users of your mobile app.
Your task was to simplify the application, so you removed that input field and replaced it with a query to extract customer product information from your CRM. In theory, the generated work order should include all the information your field service personnel need to get the job done.
Unluckily, there is a problem: some customer product records in your CRM system contain null values. Your mobile application was looking for this information, but it provided invalid results to your ticketing system. Your field service personnel no longer know how to prepare for the next service call.
This sounds like a relatively simple data quality issue – one that might have been foreseen if someone had reviewed CRM data more thoroughly. However, if the null values in the CRM database were not introduced until later, even that might not solve the problem. For example, if you incorporate customer data from a newly acquired company into your CRM system, you may not have product information associated with those new records.
Sometimes these problems are harder to predict and may not be immediately apparent. Imagine what happens when your marketing team decides to redefine its customer segmentation criteria. A new data field is added to specify the customer category, which then populates all customer values, and the old field is not recommended.
Unbeknownst to the marketing department, however, the company’s top executives were still looking at old data. The dashboards they use depend on data in deprecated fields. New customers are being added every day, but no one is entering any value in the deprecated field. As a consequence, executive teams are looking at skewed data. Suddenly, top management is tracking KPIs that are very different from the rest of the organization. Worse, no one even noticed there was a problem until weeks or months after the fact.
Next, imagine a similar scenario, but instead of populating an executive dashboard, the source system feeds data to an AI algorithm that drives customer recommendations on your e-commerce site. When an AI is trained on flawed data—or if the data changes significantly in some way—it can greatly affect the effectiveness of the technology. In this case, it distorts your customer recommendation engine.
Addressing Data Lineage Challenges
Data lineage tools provide a systematic approach to understanding the impact of changes by providing a complete roadmap of potential upstream and downstream impacts.
This is equivalent to risk reduction and cost control. If we can predict how changes might affect other systems, we can avoid the problems we described earlier. This leads to less wasted effort, faster time to market and lower costs. We are all familiar with the principle that when problems are identified early in the development process, they are solved faster and cheaper. Data lineage provides a critical missing link, and problems can be identified before they even reach QA.
There are also benefits to data governance and compliance. When the European Union passed the General Data Protection Regulation (GDPR) in 2016, it imposed a series of new requirements on systems containing personally identifiable information (PII). If a customer asks you to delete their data, you are legally obligated to comply with that request. But what happens upstream and downstream of this change? Will it break anything? Or will some of the customer’s PII remain elsewhere on your system, keeping you out of compliance with the law?
Data lineage addresses these issues by automating the process of detecting upstream and downstream impacts. It provides clear auditability that can help determine the source of changes and how the data reached its current state.
Column-level data lineage is particularly difficult because it needs parsing SQL queries to identify exactly what has changed and how, which means leveraging database logs, but because there are many database vendors, each with their own unique SQL dialect, parsing this information for various databases can be particularly challenging. As each SQL dialect evolves over time, data lineage tools must evolve to accommodate these changes.
Column-level data lineage will become even more important as interconnected systems further expand their scope and information flow accelerates. Today, it’s a very useful tool that saves time, effort, and money. Soon, data lineage will become indispensable.
Thank you for reading our article and we hope it can let you have a better understanding of why data lineage matters and why it’s so challenging. If you want to learn more about data lineage, we would like to advise you to visit Gudu SQLFlow for more information.
As one of the best data lineage tools available on the market today, Gudu SQLFlow can not only analyze SQL script files, obtain data lineage, and perform visual display, but also allow users to provide data lineage in CSV format and perform visual display. (Published by Ryan on Jul 23, 2022)