Discover data lineage of SQL in CSV file

2021/09/05 SQLFlow 2577 characters,About 8 minutes

SQLFlow can discover data lineage of SQL queries in a CSV file.

CSV file format

In order to discover data lineage from SQL queries, at least one column that includes the SQL query must be presented in the CSV file.

Other columns such as database, schema, object type, and object name are helpful to help SQLFlow to make a more precise data lineage analysis. But there are not mandatory.

Here is the common structure of a CSV file that may submit to SQLFlow for the data lineage analysis.

Database,Schema,ObjectType,ObjectName,ObjectCode,Notes

Your CSV file may have a different structure, as long as the column incudes the SQL query exits, and you tell the SQLFlow the column position of this SQL query, then it can be analyzed correctly.

column information

  • database: database that this SQL query will run against.
  • schema: schema that includes database objects in the SQL query.
  • objectType: type of the SQL query such as procedure, function, trigger, view.
  • objectName: name of the SQL query.
  • objectCode: the SQL query code.
  • Notes: comments for this SQL query.

Position of the columns in the CSV file

In order to help SQLFlow understand your CSV file correctly, you need to tell the column position of each predefined column through the UI interface of the SQLFlow.

Please note that the column name needn’t be the exact same as the predefined one, the SQLFlow only cares about the position.

The default columns required is in a format like this:

Database,Schema,ObjectType,ObjectName,ObjectCode,Notes

if your CSV file has the same format, you may tell SQLFlow that

Database is in column 1
Schema is in column 2
ObjectType is in column 3
ObjectName is in column 4
ObjectCode is in column 5
Notes is in column 6

If you have a CSV file in format like this:

catalogschematypecode
database1schema1procedure“create procedure test_db.test_schema.PRC_STG(IN register_batch_id INT64) BEGIN # STEP 1 # Declare variables –DECLARE date_to_query DATE; –DECLARE hour_to_query INT64; DECLARE v_insert_time DATETIME; ….”

As you can see, the catalog (AKA database) is at the first column; the schema is at the second column, object type is at the third column, and query code is at the fourth column.

In this CSV file, other columns such as ObjectType, Notes in the default CSV file format don’t exist.

objectCode column

ObjectCode column includes the SQL query. Usually, the SQL query will be in multi lines, So it should be enclosed with an enclosed char, which is " by default.

If the SQL query is enclosed with other char, then it must be set using objectCodeEncloseChar.

The objectCodeEncloseChar itself inside the SQL query should be escaped by objectCodeEscapeChar. The default objectCodeEscapeChar is ".

enclosed char of other columns

Besides the objectCode column, if other column is enclosed by the objectCodeEncloseChar as well, SQLFlow will remove this char automatically.

Set the CSV format

  • Upload a CSV file by creating a SQLFlow job.
  • After upload the CSV file, click “This is a csv file” checkbox.

  • Tells the SQLFlow column position. 0 means this columns doesn’t exist in CSV file.

Document information

Search

    Table of Contents