SQLFlow can discover the data lineage of SQL queries stored in a CSV file.

## CSV file format

To discover data lineage from SQL queries, the CSV file must contain at least one column that holds the SQL query. Other columns such as database, schema, object type, and object name help SQLFlow perform a more precise data lineage analysis, but they are not mandatory.

Here is the common structure of a CSV file that may be submitted to SQLFlow for data lineage analysis:

```
Database,Schema,ObjectType,ObjectName,ObjectCode,Notes
```

Your CSV file may have a different structure. As long as the column that includes the SQL query exists, and you tell SQLFlow the position of that column, the file can be analyzed correctly.

### Column information

- database: the database that this SQL query will run against.
- schema: the schema that includes the database objects in the SQL query.
- objectType: the type of the SQL query, such as procedure, function, trigger, or view.
- objectName: the name of the SQL query.
- objectCode: the SQL query code.
- notes: comments for this SQL query.

### Position of the columns in the CSV file

To help SQLFlow understand your CSV file correctly, you need to specify the position of each predefined column through the SQLFlow UI. Note that the column names needn't be exactly the same as the predefined ones; SQLFlow only cares about the positions.

The default column layout looks like this:

```
Database,Schema,ObjectType,ObjectName,ObjectCode,Notes
```

If your CSV file has the same format, you may tell SQLFlow that:

- Database is in column 1
- Schema is in column 2
- ObjectType is in column 3
- ObjectName is in column 4
- ObjectCode is in column 5
- Notes is in column 6

If you have a CSV file in a format like this:

```
catalog,schema,type,code
database1,schema1,procedure,"create procedure test_db.test_schema.PRC_STG(IN register_batch_id INT64)
BEGIN
# STEP 1
# Declare variables
--DECLARE date_to_query DATE;
--DECLARE hour_to_query INT64;
DECLARE v_insert_time DATETIME;
..."
```

then the catalog (AKA database) is in the first column, the schema is in the second column, the object type is in the third column, and the query code is in the fourth column. Other columns from the default CSV format, such as ObjectName and Notes, don't exist in this file.

### objectCode column

The ObjectCode column includes the SQL query. The SQL query usually spans multiple lines, so it should be wrapped in an enclosing character, which is `"` by default. If the SQL query is enclosed by a different character, that character must be set via `objectCodeEncloseChar`. Any occurrence of the objectCodeEncloseChar inside the SQL query itself must be escaped with `objectCodeEscapeChar`, which also defaults to `"` (a short sketch showing these defaults in practice follows the setup steps below).

### Enclosing character of other columns

If any column besides ObjectCode is enclosed by the objectCodeEncloseChar as well, SQLFlow will remove this character automatically.

## Set the CSV format

1. Upload a CSV file by creating a SQLFlow job.
2. After uploading the CSV file, check the "This is a csv file" checkbox.
3. Tell SQLFlow the column positions. 0 means the column doesn't exist in the CSV file.
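To make the enclosing and escaping rules above concrete, here is a minimal Python sketch that writes a CSV in the default `123456` layout. Python's `csv` module uses `"` as the quote character and escapes embedded quotes by doubling them, which happens to match SQLFlow's default objectCodeEncloseChar and objectCodeEscapeChar. The file name and sample query are hypothetical, for illustration only:

```python
import csv

# A sample multi-line SQL query containing a double quote that must be escaped.
sql = 'CREATE VIEW v_orders AS\nSELECT id, "status"\nFROM orders;'

rows = [
    # Database, Schema, ObjectType, ObjectName, ObjectCode, Notes
    ("sales_db", "public", "view", "v_orders", sql, "demo row"),
]

with open("queries.csv", "w", newline="") as f:
    # QUOTE_ALL encloses every field in double quotes; doublequote=True (the
    # default) escapes an embedded quote by doubling it, producing "" inside
    # the field -- the same convention as SQLFlow's defaults.
    writer = csv.writer(f, quoting=csv.QUOTE_ALL, doublequote=True)
    writer.writerow(("Database", "Schema", "ObjectType", "ObjectName",
                     "ObjectCode", "Notes"))
    writer.writerows(rows)
```

A file produced this way can be submitted with csvFormat `123456` and the default enclose/escape characters.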
## RESTful API

### Submit a CSV file

You may also submit the CSV file using the RESTful API.

- API URL: `http://ip:port/gspLive_backend/sqlflow/job/submitUserJob`
- Method: POST
- Body type: form-data

Parameters:

| parameter | type | optional | comment |
| --- | --- | --- | --- |
| dbvendor | string | NO | available database vendors: dbvazuresql, dbvbigquery, dbvcouchbase, dbvdb2, dbvgreenplum, dbvhana, dbvhive, dbvimpala, dbvinformix, dbvmdx, dbvmysql, dbvnetezza, dbvopenedge, dbvoracle, dbvpostgresql, dbvredshift, dbvsnowflake, dbvmssql, dbvsparksql, dbvsybase, dbvteradata, dbvvertica |
| jobName | string | NO | job name |
| userId | string | NO | user id |
| token | string | YES | must be specified if you are using the Cloud version: https://github.com/sqlparser/sqlflow_public/blob/master/api/readme.md#1-generate-a-token |
| csvFormat | string | YES | the format of the CSV file, i.e. the column positions; default value is 123456 |
| objectCodeEncloseChar | string | YES | the character that encloses the SQL code, default is the double quote |
| objectCodeEscapeChar | string | YES | escape character, default is the double quote |
| sqlfiles | file | NO | the full path of the CSV file |

Response:

| parameter | type | comment |
| --- | --- | --- |
| code | int | 200 means successful |
| jobId | string | SQLFlow job id |
| jobName | string | job name |
| fileNames | array[] | CSV file name |
| status | string | status of the job; available values: success, partial_success, fail |

Demo: upload the CSV file BigQuery-Sample-Procs.csv with csvFormat 123456 and `"` as both objectCodeEncloseChar and objectCodeEscapeChar. curl command line:

```bash
curl --location --request POST 'http://ip:port/gspLive_backend/sqlflow/job/submitUserJob' \
--form 'sqlfiles=@"/Users/mac/Documents/BigQuery-Sample-Procs.csv"' \
--form 'userId="gudu|0123456789"' \
--form 'csvFormat="123456"' \
--form 'objectCodeEncloseChar="\""' \
--form 'objectCodeEscapeChar="\""' \
--form 'dbvendor="dbvoracle"' \
--form 'jobName="test"'
```

Response:

```json
{
    "code": 200,
    "data": {
        "jobId": "8620465045bd46d88c148f253ae947f7",
        "jobName": "test",
        "userId": "gudu|0123456789",
        "dbVendor": "dbvoracle",
        "dataSource": {},
        "fileNames": ["BigQuery-Sample-Procs.csv"],
        "createTime": "2022-02-15 21:29:26",
        "status": "create",
        "hasMetadata": false,
        "csvFormat": "123456",
        "objectCodeEncloseChar": "\"",
        "objectCodeEscapeChar": "\"",
        "onTop": false,
        "simple": false,
        "parallel": false
    }
}
```

You may find the resulting data lineage for this created job in the SQLFlow UI.

### Fetch data lineage

- URL: `http://ip:port/gspLive_backend/sqlflow/job/exportLineageAsJson`
- Method: POST
- Body type: form-data

Parameters:

| name | type | optional | comment |
| --- | --- | --- | --- |
| jobId | string | NO | job id that the data lineage belongs to |
| ignoreFunction | boolean | YES | ignore functions in the result |
| ignoreRecordSet | boolean | YES | ignore record sets in the result |
| userId | string | NO | always use gudu\|0123456789 in the on-premise version |
| token | string | NO | must be specified when using the SQLFlow Cloud version: https://github.com/sqlparser/sqlflow_public/blob/master/api/readme.md#1-generate-a-token |

Response:

| name | type | comment |
| --- | --- | --- |
| code | int | return code, 200 means successful |
| summary | Summary | summary info |
| sqlflow | Sqlflow | data lineage |

curl sample:

```bash
curl --location --request POST 'http://ip:port/gspLive_backend/sqlflow/job/exportLineageAsJson' \
--form 'userId="gudu|0123456789"' \
--form 'jobId="8620465045bd46d88c148f253ae947f7"' \
--form 'ignoreFunction="true"' \
--form 'ignoreRecordSet="true"'
```

Result:

```json
{
    "code": 200,
    "data": {
        "mode": "summary",
        "summary": {
            "schema": 20,
            "process": 0,
            "database": 3,
            "view": 0,
            "mostRelationTables": [
                {
                    "schema": "SMFPRE",
                    "database": "SMART-METER-DEVELOPMENT",
                    "table": "T_REGISTER_READ_INVALID_RECORD"
                }
                ...
], "column":3348, "relationship":4931, "table":198 }, "sqlflow":{ "dbvendor":"dbvoracle", "relationships":[], "dbobjs":[ { "schema":"SMF_RPTPRE", "database":"SMART-METER-DEVELOPMENT", "columns":[], "name":"SMART-METER-DEVELOPMENT.SMF_RPTPRE.T_GCP_RECON_MUDR_PROCESS", "coordinates":[], "id":"2", "type":"table" } ... ] } }, "sessionId":"acb8d71f299044b9e15741ff716373eba97749668034b3ea0b4f23d57d1ef839_1644931773637" }