Capture pipeline audit data
An important part of the data factory pipeline process is monitoring and auditing your pipeline runs. Fortunately, data factory pipelines include some built-in system variables that make it easy to capture this audit data.
1. Before capturing audit data from your pipeline runs, you first need to create a lakehouse table to store that data. The following code snippets can be run from a notebook to create your audit_pipeline_run delta table:
Cell 1 – Spark configuration
spark.conf.set("spark.sql.parquet.vorder.enabled", "true")
spark.conf.set("spark.microsoft.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.microsoft.delta.optimizeWrite.binSize", "1073741824")
Cell 2 – Create delta table
from pyspark.sql.functions import *
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, TimestampType, ArrayType

# Schema for the pipeline audit records
schema = StructType([
    StructField("PipelineRunId", StringType()),
    StructField("PipelineId", StringType()),
    StructField("StartTimeUTC", StringType()),
    StructField("EndTimeUTC", StringType()),
    StructField("WorkspaceId", StringType()),
    StructField("PipelineTriggerId", StringType()),
    StructField("ParentPipelineRunId", StringType()),
    StructField("PipelineCompletedSuccessfully", IntegerType()),
    StructField("Process", StringType())
])

# Write an empty DataFrame with this schema as the audit_pipeline_run delta table
data = []
table_name = "audit_pipeline_run"

metadata_df = spark.createDataFrame(data=data, schema=schema)
metadata_df.write.mode("overwrite").option("overwriteSchema", "true").format("delta").save("Tables/" + table_name)
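To confirm that the empty table was created with the expected schema, you can optionally run a quick check in another cell. This is a minimal sketch, assuming the notebook's default lakehouse is the one that holds the new table:

# Optional check: load the new table and inspect its schema and row count
audit_df = spark.read.format("delta").load("Tables/audit_pipeline_run")
audit_df.printSchema()
print(audit_df.count())  # expected to be 0 immediately after creation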
2. Next, you will need to create a notebook that inserts and then updates an audit record for each pipeline run. This notebook will be called from your data factory pipeline in a later step. Save the following code snippets as a notebook in your Fabric workspace:
Cell 1 – Spark configuration
spark.conf.set("spark.sql.parquet.vorder.enabled", "true")
spark.conf.set("spark.microsoft.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.microsoft.delta.optimizeWrite.binSize", "1073741824")
Cell 2 – Parameter cell
Important: this cell needs to be toggled as a parameter cell in order to pass values between your notebook and a data factory pipeline.
PipelineRunId = "e3680a99-cb15-41dd-8d4e-2eb3c3e3a315"
PipelineId = "111fb227-7de7-482c-8afa-7277c912d46b"
StartTimeUTC = "8/1/2023 10:59:46"
EndTimeUTC = ""
WorkspaceId = "48cfb6f5-d490-432d-9c9b-42ed05108b4b"
PipelineTriggerId = "4561afd5-d561-641c-9d5b-42e56sa1df4b"
ParentPipelineRunId = "95651dfc6-e954-521c-9d65-6542s5df45b"
PipelineCompletedSuccessfully = 0
Process = "Copy blob storage tables to lakehouse files"

Cell 3 – Delta table merge statement to write pipeline audit data to lakehouse table
from pyspark.sql.functions import *
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, TimestampType, ArrayType
from datetime import datetime
from delta.tables import *

# Schema must match the audit_pipeline_run table created earlier
schema = StructType([
    StructField("PipelineRunId", StringType()),
    StructField("PipelineId", StringType()),
    StructField("StartTimeUTC", StringType()),
    StructField("EndTimeUTC", StringType()),
    StructField("WorkspaceId", StringType()),
    StructField("PipelineTriggerId", StringType()),
    StructField("ParentPipelineRunId", StringType()),
    StructField("PipelineCompletedSuccessfully", IntegerType()),
    StructField("Process", StringType())
])

# Build a single-row DataFrame from the notebook parameters
source_data = [(PipelineRunId, PipelineId, StartTimeUTC, EndTimeUTC, WorkspaceId, PipelineTriggerId, ParentPipelineRunId, PipelineCompletedSuccessfully, Process)]
source_df = spark.createDataFrame(source_data, schema)
display(source_df)

# Merge into the audit table: if the PipelineRunId already exists, update the
# end time and completion flag; otherwise insert a new audit record
target_delta = DeltaTable.forPath(spark, 'Tables/audit_pipeline_run')

(target_delta.alias('target')
    .merge(source_df.alias('source'), "source.PipelineRunId = target.PipelineRunId")
    .whenMatchedUpdate(
        set = {
            "target.EndTimeUTC": "source.EndTimeUTC",
            "target.PipelineCompletedSuccessfully": "source.PipelineCompletedSuccessfully"
        }
    )
    .whenNotMatchedInsert(
        values = {
            "target.PipelineRunId": "source.PipelineRunId",
            "target.PipelineId": "source.PipelineId",
            "target.StartTimeUTC": "source.StartTimeUTC",
            "target.EndTimeUTC": "source.EndTimeUTC",
            "target.WorkspaceId": "source.WorkspaceId",
            "target.PipelineTriggerId": "source.PipelineTriggerId",
            "target.ParentPipelineRunId": "source.ParentPipelineRunId",
            "target.PipelineCompletedSuccessfully": "source.PipelineCompletedSuccessfully",
            "target.Process": "source.Process"
        }
    )
    .execute()
)
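To verify the merge behavior while developing the notebook, you can optionally read the audit record back for the run id that was just merged. This is a minimal sketch, assuming the notebook's default lakehouse contains the audit_pipeline_run table:

# Optional check: read back the audit record for the current PipelineRunId
from pyspark.sql.functions import col

audit_df = spark.read.format("delta").load("Tables/audit_pipeline_run")
display(audit_df.filter(col("PipelineRunId") == PipelineRunId))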
3. Once your notebook is created, head back to your data pipeline and add a "Notebook" activity. In the "Settings" menu of the Notebook activity, first select your notebook from the dropdown menu. Next, add base parameters with the same names you defined in cell two of your notebook (see above), and set their values using the expression builder.

Base parameters:
- PipelineRunId = @pipeline().RunId
- PipelineId = @pipeline().Pipeline
- StartTimeUTC = @pipeline().TriggerTime
- EndTimeUTC = Treat as Null (see the note after this list)
- WorkspaceId = @pipeline().DataFactory
- PipelineTriggerId = @pipeline().TriggerId
- ParentPipelineRunId = @pipeline()?.TriggeredByPipelineRunId
- PipelineCompletedSuccessfully = 0
- Process = Copy blob storage tables to lakehouse files
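Note that depending on how the EndTimeUTC base parameter is passed, the first audit record may store an empty string rather than a true null while the run is still in flight. If you prefer a proper null in the table, one option is to normalize the value in the audit notebook before building the source DataFrame in cell three. This is an optional sketch, not part of the original notebook:

# Optional (assumption): treat an empty EndTimeUTC string as a true null so the
# delta table stores NULL for runs that have not finished yet
if isinstance(EndTimeUTC, str) and EndTimeUTC.strip() == "":
    EndTimeUTC = None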
4. After configuring this "Notebook" activity, copy and paste it at the end of your pipeline. In the copied activity, update the EndTimeUTC and PipelineCompletedSuccessfully parameters to reflect a successful pipeline run.

Base parameter changes:
- EndTimeUTC = @utcNow()
- PipelineCompletedSuccessfully = 1
5. Finally, save and run your pipeline. Navigate to the audit_pipeline_run table in your lakehouse to confirm that the pipeline run data has been captured correctly.
After the first "Notebook" activity runs, the table should contain a single record for the run, with an empty EndTimeUTC and PipelineCompletedSuccessfully set to 0.

If the pipeline run succeeds, the second "Notebook" activity should then update the same record with an EndTimeUTC value and PipelineCompletedSuccessfully set to 1.
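Once a few runs have been captured, the audit table can also be queried directly for monitoring. The snippet below is a minimal sketch, assuming it runs in a notebook whose default lakehouse contains the audit_pipeline_run table; it lists runs that never recorded a successful completion:

# Optional monitoring query: audit records that have not been marked successful
from pyspark.sql.functions import col

audit_df = spark.read.format("delta").load("Tables/audit_pipeline_run")
display(
    audit_df
        .filter(col("PipelineCompletedSuccessfully") == 0)
        .select("PipelineRunId", "Process", "StartTimeUTC", "EndTimeUTC")
)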
