Get the Data Lineage of Complex SQL Statements in One Minute

Data lineage is a very important link in enterprise data governance. For the important role of data lineage in enterprise data governance, you can refer to What’s Data Lineage and Why Is It So Important?. The SQL language is widely used in data processing, and SQL statements contain rich data lineage. For what is data lineage in SQL and how to discover these data lineages hidden in SQL statements, please refer to How to Discover Data Lineage in SQL Language in 5 Minutes?.

How to Get the Data Lineage of Complex SQL Statements in One Minute?

How to Get the Data Lineage of Complex SQL Statements in One Minute?

This article mainly introduces how to use the UI, Rest API and Java library provided by the Gudu SQLFlow data lineage analysis tool to quickly obtain the data lineage in complex SQL statements, and integrate it into your own data governance platform as needed.

A slightly complex SQL statement

Now we use the following slightly complicated SQL statement to demonstrate how to use Gudu SQLFlow to quickly obtain various data lineages. If you have more complex SQL statements or stored procedures to process, then a data lineage analysis tool like Gudu SQLFlow is even more urgent.

Our goal is to know which columns are included in the top-level select list, and which other tables and columns the source data of these columns come from. An ideal result should look like this:

Data Lineage for Top Level Select List

Data Lineage for Top Level Select List

To get the above result, we need to do these things:

  1. Automatically expand the * (asterisk) in data.* to find the corresponding column.
  2. Perform data source tracing for each column in the select list to find the original table and column. This process may require multiple levels of source tracing until the final data source is found.

About the automatic expansion of the *

* (asterisk) in the Select List represents all the columns in the relation, which need to be expanded to specific column names. In this case, Gudu SQLFlow can automatically expand based on the contextual information provided in the SQL. But sometimes, the SQL statement itself doesn’t provide enough information to determine what columns * contains. At this time, you need to provide metadata information to Gudu SQLFlow in order to correctly expand asterisks.

Gudu SQLFlow provides three ways to help you quickly obtain the data lineage of complex SQL statements, which is convenient for you to use in different occasions.

1. Gudu SQLFlow UI

You can directly access the Gudu SQLFlow Cloud version without installing any software to use it. It is very simple to use. You can directly paste the SQL statement to be processed into the input area, and then click Visualize.

Gudu SQLFlow Data Lineage

Gudu SQLFlow Data Lineage

We can ignore the intermediate processing steps and detailed information of data lineage, and directly display source data and target data. To obtain the concise results as shown in the figure above, the settings used are as follows:

Gudu SQLFlow Setting

Gudu SQLFlow Setting

You can download the JSON result containing the data lineage for further processing. (The storage path of data lineage: data -> sqlflow -> relationships)

Gudu SQLFlow Data Lineage in JSON

Gudu SQLFlow Data Lineage in JSON

2. Gudu SQLFlow Rest API

You can call the Gudu SQLFlow Rest API in your code to automatically submit the SQL statements that need to be processed, get the processing results in real time, and then process the returned data lineage in your code according to business requirements.

In order to use the Gudu SQLFlow Rest API, you need a Gudu SQLFlow Cloud Premium Account. Here is the detailed tutorial. Or you can install Gudu SQLFlow On-Premise Version inside your company to use Rest API.

  1. Python code connects to the Gudu SQLFlow server ( Gudu SQLFlow on-premise server ) deployed in the enterprise to obtain data lineage results.
  2. Python code connects to the SQLFlow cloud server to obtain data lineage results.

Click here to view the data lineage results generated by the above Python code analysis.

3. Gudu SQLFlow Java library

You can also use the Gudu SQLFlow Java class library to analyze the data lineage of SQL statements. The advantage of using the Gudu SQLFlow Java class library is that you do not need to install the Gudu SQLFlow server software, and it can run independently without relying on any third-party Java class library, which is easy to integrate into your own projects.

We have created a Java demo program for you, so you can use it to compile and run directly. To get all the columns and their data sources returned by the top-level select list of the SQL statement in this article, we can run the Java demo with the following parameters:

The generated data lineage is stored in XML format, and you can click here to open this XML file.

Conclusion

Thank you for reading our article and we hope it can help you to have a better understanding of how to get the data lineage of complex SQL statements in one minute. If you want to learn more about data lineage, we would like to advise you to visit Gudu SQLFlow official website for more information.

As one of the best data lineage analysis tools on the market today, Gudu SQLFlow can not only analyze SQL script files, obtain data lineage, and perform visual display, but also allow users to provide data lineage in CSV format and perform visual display. (Published by Ryan on May 22, 2022)

Try Gudu SQLFlow Live

SQLFlow Cloud version

Subscribe to the Weekly Newsletter

Leave A Comment