Amazon Redshift Data Lineage | Gudu SQLFlow

In your Amazon warehouse environment, use Amazon Redshift Spectrum to query data directly from files on Amazon S3, save data in Redshift databases, and use Business Intelligence tools such as Tableau, PowerBI, Looker, Qlik, Superset to generate reports from the data. The data originates from your business source system and lands on Amazon S3, then using an ETL tool such as DBT to transfer and store it in Redshift Database for later uses.

Amazon Redshift Data Lineage

Amazon Redshift Data Lineage

In order to have an overview of the data flow in your Amazon warehouse system, you need a data lineage tool to help you understand how the data arrived at a particular location, as well as the intermediate steps and transformations which happen as the data moves through the business system.

One way to get the data linage automatically from the Amazon warehouse environment is to analyze all SQL queries used during the data loading, transforming, and analysis. The good news is that all those SQL statements are stored in the Redshift user activity log and Gudu SQLFlow can parse those log files to discover the data lineage automatically.

Redshift User activity log

The user activity log is useful primarily for troubleshooting purposes and here we use it for the data lineage discovery. It tracks information about the types of queries that both the users and the system perform in the database.

Logs each query before it is run on the database.

Column name Description
recordtime Time the event occurred.
db Database name.
user User name.
pid Process ID associated with the statement.
userid User ID.
xid Transaction ID.
query A prefix of LOG: followed by the text of the query, including newlines.
Redshift user activity log

Please check this article to see how to enable logging.

Amazon Redshift User Activity Audit Log Sample

'2018-05-21T06:00:09Z UTC [ db=prod_sales user=duc pid=99753 userid=95 xid=6728324 ]' LOG: create table SumoProdbackUp.organization as (select * from SumoProd.simpleuser) 
'2018-05-21T06:00:09Z UTC [ db=vendor user=ankit pid=36616 userid=53 xid=2956702 ]' LOG: DELETE FROM SumoProd.employee WHERE id = 38; 
'2018-05-21T06:20:09Z UTC [ db=dev user=himanshu pid=64458 userid=35 xid=5143208 ]' LOG: drop user testuser3

 

Automatic data lineage analysis

Gudu SQLFlow is a tool that automated SQL data lineage analysis across Databases, ETL, Business Intelligence, Cloud, and Hadoop environments by parsing SQL Script and stored procedure. It can also analyze the Redshift user activity log files to discover the data lineage and depict all the data movement graphically.

Here is a part of a data lineage picture that was generated after analyzing the Amazon Redshift user activity log files:

Amazon Redshift Data Lineage 
Amazon Redshift Data Lineage

 

Conclusion 

Thank you for reading our article and we hope it can help you to have a better understanding of Amazon Redshift Data Lineage. If you want to know more about Amazon Redshift Data Lineage, we would like to advise you to visit our website Gudu SQLFlow for more information. Gudu SQLFlow, as a data lineage analysis tool, can not only analyze SQL script files, obtain data lineage, and perform visual display, but also allow users to provide data lineage in CSV format and perform visual display. Thanks again! ( Edited by Ryan on Apr 26, 2022)

Try Gudu SQLFlow Live

SQLFlow Cloud version

Subscribe to the Weekly Newsletter

Leave A Comment