Microsoft Fabric Notebooks and PySpark: Data Engineering Guide

Master data engineering in Microsoft Fabric using notebooks and PySpark. Learn medallion architecture, Delta Lake operations, OneLake integration, Spark configuration, and Power BI semantic model connectivity.

By EPC Group

<h2>Why Fabric Notebooks Are the Center of Modern Data Engineering</h2>

<p>Microsoft Fabric notebooks provide a fully managed Apache Spark environment that eliminates the infrastructure overhead traditionally associated with big data engineering. Unlike standalone Spark clusters that require sizing, provisioning, patching, and monitoring, Fabric notebooks run on shared capacity with automatic Spark session management, built-in OneLake integration, and native connectivity to every other Fabric workload. When a data engineer opens a notebook in a Fabric workspace, the platform provisions a Spark session within seconds, pre-configured with access to the workspace Lakehouse, Delta Lake libraries, and the full PySpark API.</p>

<p>This guide covers the complete data engineering workflow in Fabric notebooks: reading and writing data to OneLake, implementing Delta Lake operations, building medallion architecture pipelines, configuring Spark for performance, managing libraries, scheduling parameterized notebooks, and connecting transformed data to Power BI semantic models. Our <a href="/services/microsoft-fabric">Microsoft Fabric consulting services</a> help organizations design and implement these patterns at enterprise scale.</p>

<h2>The Fabric Notebook Environment</h2>

<p>Fabric notebooks support four languages: PySpark (Python), Spark SQL, Scala, and R. Each cell in a notebook can use a different language by specifying a magic command (<code>%%pyspark</code>, <code>%%sql</code>, <code>%%scala</code>, <code>%%sparkr</code>) at the top of the cell. PySpark is the dominant choice for data engineering because it combines the expressiveness of Python with the distributed computing power of Spark, and the majority of the Fabric ecosystem documentation and samples target PySpark.</p>

<h3>Session Configuration</h3>

<p>When a notebook starts, Fabric allocates a Spark session from the workspace capacity. The session configuration determines the number of executors, memory per executor, and available cores. For F64 or higher capacities, Fabric provides enough resources for production-scale data engineering. For smaller capacities (F2, F4), sessions are limited but sufficient for development and testing.</p>

<p>Configure session properties using the <code>%%configure</code> magic command at the top of your notebook:</p>

<pre><code>%%configure
{
    "conf": {
        "spark.sql.shuffle.partitions": "200",
        "spark.sql.adaptive.enabled": "true",
        "spark.sql.adaptive.coalescePartitions.enabled": "true",
        "spark.sql.parquet.vorder.enabled": "true",
        "spark.microsoft.delta.optimizeWrite.enabled": "true"
    }
}</code></pre>

<p>Two Fabric-specific settings deserve attention: <strong>V-Order</strong> (<code>spark.sql.parquet.vorder.enabled</code>) applies a special sort optimization to Parquet files that dramatically accelerates Power BI Direct Lake reads. <strong>Optimize Write</strong> (<code>spark.microsoft.delta.optimizeWrite.enabled</code>) automatically coalesces small output files into optimally sized 128 MB files, preventing the small file problem that degrades Lakehouse query performance.</p>

<h3>Connecting to a Lakehouse</h3>

<p>Every Fabric notebook can attach to one or more Lakehouses in the workspace. The default Lakehouse is mounted into the session, with Delta tables exposed under <code>Tables/</code> and raw files under <code>Files/</code>. When you attach a Lakehouse, its tables appear in the notebook's Lakehouse explorer panel on the left, and you can reference them directly in PySpark and Spark SQL:</p>

<pre><code># PySpark: Read a Delta table from the attached Lakehouse
df = spark.read.format("delta").load("Tables/sales_raw")</code></pre>

<pre><code>%%sql
-- Spark SQL (in a separate cell): Query the same table
SELECT * FROM lakehouse_name.sales_raw LIMIT 10</code></pre>

<h2>Reading and Writing Data to OneLake</h2>

<h3>Reading from External Sources</h3>

<p>Data engineering pipelines typically begin by ingesting data from external sources into the Lakehouse. Fabric notebooks support reading from numerous sources:</p>

<pre><code># Read CSV files uploaded to the Files section
df_csv = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("Files/raw/customers.csv")

# Read JSON files
df_json = spark.read.format("json") \
    .load("Files/raw/events/*.json")

# Read Parquet files
df_parquet = spark.read.format("parquet") \
    .load("Files/raw/transactions/")

# Read from Azure SQL Database via JDBC
df_sql = spark.read.format("jdbc") \
    .option("url", "jdbc:sqlserver://server.database.windows.net:1433;database=mydb") \
    .option("dbtable", "dbo.Orders") \
    .option("user", mssparkutils.credentials.getSecret("keyvault-url", "sql-user")) \
    .option("password", mssparkutils.credentials.getSecret("keyvault-url", "sql-password")) \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .load()</code></pre>

<h3>Writing Delta Tables to OneLake</h3>

<p>All persistent tables in a Fabric Lakehouse are stored as Delta Lake format in OneLake. Writing a DataFrame as a Delta table makes it immediately available to the SQL analytics endpoint, other notebooks, dataflows, and Power BI Direct Lake models:</p>

<pre><code># Write as a managed Delta table
df_transformed.write.format("delta") \
    .mode("overwrite") \
    .saveAsTable("sales_clean")

# Append new records to an existing table
df_new_records.write.format("delta") \
    .mode("append") \
    .saveAsTable("sales_clean")

# Write with partitioning for large datasets
df_transformed.write.format("delta") \
    .mode("overwrite") \
    .partitionBy("year", "month") \
    .saveAsTable("sales_partitioned")

# Write to the Files section as Parquet (not a managed table)
df.write.format("parquet") \
    .mode("overwrite") \
    .save("Files/output/report_data/")</code></pre>

<h3>OneLake Shortcuts</h3>

<p>Fabric shortcuts allow notebooks to read data from external storage (Azure Data Lake Storage Gen2, Amazon S3, Google Cloud Storage) without copying it into OneLake. Create shortcuts through the Lakehouse UI, then read them like any other table or file path in your notebook. This enables hybrid architectures where raw data remains in existing data lakes while Fabric handles transformation and analytics. Learn more in our <a href="/blog/onelake-shortcuts">OneLake shortcuts guide</a>.</p>

<h2>Delta Lake Operations in Fabric</h2>

<p>Delta Lake is the default table format in Fabric Lakehouses. Understanding Delta operations is essential for building reliable, performant data engineering pipelines.</p>

<h3>Merge (Upsert) Operations</h3>

<p>The <code>MERGE</code> operation is the workhorse of incremental data loading. It compares incoming data against existing records and performs insert, update, or delete actions based on matching conditions:</p>

<pre><code>from delta.tables import DeltaTable

# Load the target Delta table
target_table = DeltaTable.forName(spark, "customers_silver")

# Incoming data (new and updated records)
df_updates = spark.read.format("delta").load("Tables/customers_staging")

# Merge: update existing records, insert new ones
target_table.alias("target").merge(
    df_updates.alias("source"),
    "target.customer_id = source.customer_id"
).whenMatchedUpdate(
    condition="source.modified_date &gt; target.modified_date",
    set={
        "customer_name": "source.customer_name",
        "email": "source.email",
        "city": "source.city",
        "modified_date": "source.modified_date"
    }
).whenNotMatchedInsertAll().execute()</code></pre>

<h3>Time Travel</h3>

<p>Delta Lake maintains a transaction log that enables querying previous versions of a table. This is invaluable for debugging data quality issues, auditing changes, and recovering from bad writes:</p>

<pre><code># Read a specific version
df_v5 = spark.read.format("delta").option("versionAsOf", 5).load("Tables/sales_silver")

# Read as of a specific timestamp
df_yesterday = spark.read.format("delta") \
    .option("timestampAsOf", "2026-02-23T00:00:00Z") \
    .load("Tables/sales_silver")</code></pre>

<pre><code>%%sql
-- View the transaction history (in a separate cell)
DESCRIBE HISTORY sales_silver</code></pre>

<h3>Table Maintenance</h3>

<p>Delta tables require periodic maintenance to maintain query performance. Fabric provides built-in maintenance commands:</p>

<pre><code>%%sql
-- OPTIMIZE: Compact small files into larger, more efficient files
OPTIMIZE sales_silver;

-- OPTIMIZE with Z-ORDER: Co-locate related data for faster filtered queries
OPTIMIZE sales_silver ZORDER BY (customer_id, order_date);

-- VACUUM: Remove old file versions to reclaim storage (retain 7 days)
VACUUM sales_silver RETAIN 168 HOURS;</code></pre>

<p>In Fabric, V-Order optimization (enabled via Spark configuration) applies an additional sort optimization during write that accelerates Direct Lake reads. Always enable V-Order for tables that will be consumed by Power BI semantic models.</p>

<h2>Implementing Medallion Architecture in Notebooks</h2>

<p>The medallion architecture (Bronze, Silver, Gold) is the standard pattern for organizing data engineering pipelines in Fabric Lakehouses. Each layer progressively refines data quality and structure.</p>

<h3>Bronze Layer: Raw Ingestion</h3>

<p>The Bronze layer ingests raw data from source systems with minimal transformation. The goal is to create a complete, immutable record of all source data with metadata for lineage tracking:</p>

<pre><code>from pyspark.sql.functions import current_timestamp, lit, input_file_name

# Read raw CSV files from the Files section
df_raw = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("Files/raw/sales/*.csv")

# Add ingestion metadata columns
df_bronze = df_raw \
    .withColumn("_ingestion_timestamp", current_timestamp()) \
    .withColumn("_source_file", input_file_name()) \
    .withColumn("_batch_id", lit("batch_20260224_001"))

# Write to Bronze table (append mode preserves history)
df_bronze.write.format("delta") \
    .mode("append") \
    .saveAsTable("sales_bronze")</code></pre>

<h3>Silver Layer: Cleaned and Conformed</h3>

<p>The Silver layer applies data quality rules, deduplication, type casting, standardization, and business key alignment. Silver tables are the single source of truth for downstream analytics:</p>

<pre><code>from pyspark.sql.functions import col, trim, upper, to_date, when, row_number
from pyspark.sql.window import Window

# Read from Bronze
df_bronze = spark.read.format("delta").load("Tables/sales_bronze")

# Data quality transformations
df_silver = df_bronze \
    .filter(col("order_id").isNotNull()) \
    .withColumn("customer_name", trim(upper(col("customer_name")))) \
    .withColumn("order_date", to_date(col("order_date"), "yyyy-MM-dd")) \
    .withColumn("amount", col("amount").cast("decimal(18,2)")) \
    .withColumn("region", when(col("region") == "N/A", None).otherwise(col("region")))

# Deduplicate: keep the latest record per business key
window_spec = Window.partitionBy("order_id").orderBy(col("_ingestion_timestamp").desc())
df_silver = df_silver \
    .withColumn("_row_num", row_number().over(window_spec)) \
    .filter(col("_row_num") == 1) \
    .drop("_row_num")

# Write Silver table using merge for incremental updates
from delta.tables import DeltaTable

if spark.catalog.tableExists("sales_silver"):
    target = DeltaTable.forName(spark, "sales_silver")
    target.alias("t").merge(
        df_silver.alias("s"),
        "t.order_id = s.order_id"
    ).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()
else:
    df_silver.write.format("delta").saveAsTable("sales_silver")</code></pre>

<h3>Gold Layer: Business-Ready Aggregations</h3>

<p>The Gold layer creates business-oriented aggregations, dimensional models, and summary tables optimized for Power BI consumption:</p>

<pre><code># Note: sum, count, and avg shadow the Python built-ins in this cell
from pyspark.sql.functions import sum, count, avg, month, year

# Read from Silver
df_silver = spark.read.format("delta").load("Tables/sales_silver")

# Monthly sales summary (Gold)
df_gold_monthly = df_silver \
    .groupBy(
        year("order_date").alias("year"),
        month("order_date").alias("month"),
        "region",
        "product_category"
    ).agg(
        sum("amount").alias("total_revenue"),
        count("order_id").alias("order_count"),
        avg("amount").alias("avg_order_value")
    )

# Write Gold table with V-Order for Direct Lake optimization
df_gold_monthly.write.format("delta") \
    .mode("overwrite") \
    .option("vorder.enabled", "true") \
    .saveAsTable("sales_monthly_gold")</code></pre>

<p>This three-layer pattern ensures data lineage, enables reprocessing at any layer, and delivers clean, performant data to <a href="/services/power-bi-consulting">Power BI dashboards</a> and other analytics consumers. Learn how this integrates with your broader analytics strategy in our <a href="/blog/modern-data-lakehouse-fabric">modern data lakehouse guide</a>.</p>

<h2>Scheduling and Parameterized Notebooks</h2>

<h3>Scheduling with Data Pipelines</h3>

<p>Fabric Data Pipelines orchestrate notebook execution on schedules or triggers. Create a pipeline, add a Notebook activity, select your notebook, and configure the schedule. Pipelines support:</p>

<ul> <li><strong>Sequential execution</strong>: Run Bronze notebook, then Silver notebook, then Gold notebook in order</li> <li><strong>Parallel execution</strong>: Run independent Bronze notebooks for different source systems simultaneously</li> <li><strong>Conditional logic</strong>: Use If Condition activities to branch pipeline logic based on notebook output or external conditions</li> <li><strong>Error handling</strong>: Configure retry policies, failure notifications, and fallback activities</li> <li><strong>Triggers</strong>: Schedule-based (cron), event-based (file arrival in OneLake), or manual triggers</li> </ul>

<h3>Parameterized Notebooks</h3>

<p>Parameters allow a single notebook to process different data based on runtime inputs. Define parameters in a dedicated cell tagged as a Parameters cell:</p>

<pre><code># Parameters cell (tag this cell as "Parameters" in the notebook UI)
batch_date = "2026-02-24"
source_system = "crm"
process_mode = "incremental"</code></pre>

<p>When a Data Pipeline invokes this notebook, it can override these parameter values. The same notebook processes CRM data, ERP data, or any other source simply by changing the <code>source_system</code> parameter. This pattern eliminates notebook duplication and simplifies maintenance.</p>

<p>Reference parameters in subsequent cells:</p>

<pre><code>from pyspark.sql.functions import col

# Use parameters in data processing logic
if process_mode == "incremental":
    df = spark.read.format("delta").load(f"Tables/{source_system}_bronze") \
        .filter(col("ingestion_date") == batch_date)
else:
    df = spark.read.format("delta").load(f"Tables/{source_system}_bronze")

print(f"Processing {df.count()} records from {source_system} for {batch_date}")</code></pre>

<h3>Calling Notebooks from Other Notebooks</h3>

<p>Use <code>mssparkutils.notebook.run()</code> to invoke one notebook from another, enabling modular pipeline design:</p>

<pre><code># Run a child notebook with parameters
result = mssparkutils.notebook.run(
    "silver_transform",
    600,  # timeout in seconds
    {"source_system": "crm", "batch_date": "2026-02-24"}
)
print(f"Child notebook result: {result}")</code></pre>

<h2>Library Management and Spark Configuration</h2>

<h3>Installing Libraries</h3>

<p>Fabric notebooks include common data engineering libraries pre-installed (pandas, numpy, scikit-learn, delta-spark, etc.). For additional libraries, you have several options:</p>

<ul> <li><strong>Inline installation</strong>: Use <code>%pip install</code> in a notebook cell for session-scoped packages. These packages are available only for the current session and must be reinstalled on next session start.</li> <li><strong>Environment resources</strong>: Create a Fabric Environment item in your workspace, add required packages to its configuration, and attach the environment to your notebook. Packages in the environment persist across sessions and are shared by all notebooks attached to that environment.</li> <li><strong>Workspace-level libraries</strong>: Upload custom <code>.whl</code> or <code>.jar</code> files to the workspace for proprietary or internal packages not available on PyPI.</li> </ul>

<pre><code># Inline installation (session-scoped)
%pip install great-expectations==0.18.0
%pip install azure-identity==1.15.0

# Refresh the import system so newly installed packages are found
import importlib
importlib.invalidate_caches()</code></pre>

<h3>Spark Configuration Best Practices</h3>

<p>Optimal Spark configuration significantly impacts notebook performance. Key settings for Fabric data engineering:</p>

<pre><code>%%configure
{
    "conf": {
        "spark.sql.shuffle.partitions": "auto",
        "spark.sql.adaptive.enabled": "true",
        "spark.sql.adaptive.coalescePartitions.enabled": "true",
        "spark.sql.adaptive.skewJoin.enabled": "true",
        "spark.sql.parquet.vorder.enabled": "true",
        "spark.microsoft.delta.optimizeWrite.enabled": "true",
        "spark.microsoft.delta.optimizeWrite.binSize": "134217728",
        "spark.sql.files.maxPartitionBytes": "134217728",
        "spark.sql.broadcastTimeout": "600"
    }
}</code></pre>

<ul> <li><strong>Adaptive Query Execution (AQE)</strong>: Enabled by default in Fabric. AQE dynamically adjusts shuffle partitions, optimizes skew joins, and coalesces small partitions at runtime based on actual data statistics.</li> <li><strong>V-Order</strong>: Applies a sort optimization to written Parquet files that accelerates Power BI Direct Lake reads by 10-50%. Always enable for Gold layer tables consumed by Power BI.</li> <li><strong>Optimize Write</strong>: Automatically coalesces small output files to the target bin size (128 MB default), preventing the small file problem without requiring manual OPTIMIZE runs.</li> <li><strong>Broadcast Timeout</strong>: Increase for notebooks that join large dimension tables (default 300 seconds may not be enough for dimension tables exceeding 1 GB).</li> </ul>

<h2>Data Transformation Patterns</h2>

<h3>Schema Evolution</h3>

<p>Source systems change schemas over time (new columns, renamed columns, type changes). Delta Lake supports schema evolution to handle these changes gracefully:</p>

<pre><code># Enable automatic schema merge for append operations
df_new.write.format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .saveAsTable("events_bronze")

# Overwrite with a new schema (replacing the table schema entirely)
df_new.write.format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .saveAsTable("events_bronze")</code></pre>

<h3>Data Quality Validation</h3>

<p>Embed data quality checks directly into your notebooks to catch issues before they propagate to downstream tables:</p>

<pre><code>from pyspark.sql.functions import col, count, when, isnull

# Data quality checks
total_rows = df.count()
null_checks = df.select(
    count(when(isnull(col("order_id")), True)).alias("null_order_ids"),
    count(when(isnull(col("amount")), True)).alias("null_amounts"),
    count(when(col("amount") &lt; 0, True)).alias("negative_amounts")
).collect()[0]

# Fail the notebook if quality thresholds are breached
null_rate = null_checks.null_order_ids / total_rows
if null_rate &gt; 0.01:
    raise ValueError(f"Data quality failure: {null_rate:.2%} null order_ids exceeds 1% threshold")

print(f"Quality check passed: {total_rows} rows, {null_checks.null_order_ids} null IDs, {null_checks.negative_amounts} negative amounts")</code></pre>
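<p>The threshold logic can be factored into a small Spark-independent helper so every notebook applies the same rule consistently. The sketch below is illustrative (the function name and signature are our own, not a Fabric API); it takes the counts computed by the PySpark checks above:</p>

```python
def check_null_rate(metric_name, failed_count, total_count, max_rate=0.01):
    """Raise ValueError if the failure rate for a metric exceeds the threshold.

    Returns the computed rate so callers can log it on success.
    """
    if total_count == 0:
        raise ValueError(f"{metric_name}: dataset is empty, nothing to validate")
    rate = failed_count / total_count
    if rate > max_rate:
        raise ValueError(
            f"Data quality failure: {rate:.2%} {metric_name} exceeds {max_rate:.0%} threshold"
        )
    return rate

# Hypothetical usage with the counts from the PySpark cell above:
# check_null_rate("null_order_ids", null_checks.null_order_ids, total_rows)
```

<p>Raising an exception is the right failure mode here: an unhandled exception fails the notebook run, which a Data Pipeline surfaces as an activity failure and can route to retry or notification logic.</p>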

<h3>Slowly Changing Dimensions (SCD Type 2)</h3>

<p>Implement SCD Type 2 tracking in Fabric notebooks to maintain historical dimension records:</p>

<pre><code>from pyspark.sql.functions import col, current_timestamp, lit, sha2, concat_ws

# Generate hash of tracked attributes for change detection
df_source = df_source.withColumn(
    "_hash",
    sha2(concat_ws("|", col("name"), col("address"), col("city"), col("state")), 256)
)

# Compare with existing current records
df_current = spark.read.format("delta").load("Tables/customer_dim") \
    .filter(col("is_current") == True)

# Identify changed records
df_changes = df_source.alias("s").join(
    df_current.alias("c"),
    col("s.customer_id") == col("c.customer_id"),
    "inner"
).filter(col("s._hash") != col("c._hash"))

# Close old records and insert new versions
# (implementation varies based on specific SCD requirements)</code></pre>
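<p>The close-and-insert step that the final comment elides follows a consistent pattern regardless of engine: expire the current version of each changed key, then append the new version. The sketch below expresses that logic in plain Python over dictionaries so it is easy to follow and test outside Spark; column names mirror the example above, and in a real Fabric pipeline the same logic would be a Delta <code>MERGE</code> (to close old rows) followed by an append:</p>

```python
from datetime import datetime

def scd2_apply(dim_rows, changed_rows, now=None):
    """Apply SCD Type 2: close current versions of changed keys, insert new versions.

    dim_rows: existing dimension rows (dicts with customer_id, is_current, ...).
    changed_rows: new attribute versions detected by the hash comparison.
    Returns the updated dimension as a new list.
    """
    now = now or datetime.now()
    changed_keys = {r["customer_id"] for r in changed_rows}
    updated = []
    for row in dim_rows:
        if row["is_current"] and row["customer_id"] in changed_keys:
            # Close the old version: flag it historical and stamp its end date
            updated.append(dict(row, is_current=False, end_date=now))
        else:
            updated.append(row)
    for row in changed_rows:
        # Insert the new current version with an open-ended validity window
        updated.append(dict(row, is_current=True, start_date=now, end_date=None))
    return updated
```

<p>Note the invariant this preserves: exactly one row per business key has <code>is_current = True</code>, and historical rows carry non-overlapping validity windows, which is what makes point-in-time joins against the dimension possible.</p>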

<h2>Integration with Power BI Semantic Models</h2>

<p>The ultimate consumer of most Gold layer tables is Power BI. Fabric provides two primary integration paths:</p>

<h3>Direct Lake Mode</h3>

<p>Direct Lake is the preferred connection mode for Fabric Lakehouse tables. Unlike Import mode (which copies data into the Power BI model) or DirectQuery (which queries the source on every interaction), Direct Lake reads Delta Parquet files directly from OneLake into the VertiPaq engine on demand. This provides Import-mode performance without requiring scheduled refresh:</p>

<ul> <li><strong>No data movement</strong>: Power BI reads Gold layer Delta tables directly from OneLake</li> <li><strong>Automatic freshness</strong>: When a notebook updates a Gold table, the Direct Lake model reflects the changes on the next query (framing)</li> <li><strong>V-Order optimization</strong>: Tables written with V-Order enabled load into VertiPaq 10-50% faster</li> <li><strong>Fallback to DirectQuery</strong>: If the data exceeds VertiPaq memory, Direct Lake automatically falls back to DirectQuery for those tables</li> </ul>

<p>To maximize Direct Lake performance from your notebooks, ensure Gold layer tables have V-Order enabled, are OPTIMIZED periodically (compact small files), and use appropriate data types (integer keys, minimal string columns in fact tables). See our <a href="/blog/power-bi-vertipaq-analyzer-model-optimization-2026">VertiPaq optimization guide</a> for detailed model tuning techniques.</p>

<h3>SQL Analytics Endpoint</h3>

<p>Every Fabric Lakehouse automatically generates a SQL analytics endpoint that exposes all Delta tables as read-only SQL views. Power BI can connect to this endpoint using DirectQuery or Import mode, enabling existing Power BI reports to query Lakehouse data without modification. The SQL endpoint also supports T-SQL views, stored procedures (read-only), and cross-database queries.</p>

<h2>Monitoring and Optimization</h2>

<h3>Spark UI and Monitoring</h3>

<p>Fabric notebooks include a built-in Spark UI accessible from the notebook toolbar. The Spark UI shows:</p>

<ul> <li><strong>Jobs and Stages</strong>: Execution timeline, task distribution, shuffle statistics</li> <li><strong>SQL/DataFrame</strong>: Query plans, physical execution plans, scan statistics</li> <li><strong>Storage</strong>: Cached DataFrame statistics, memory usage</li> <li><strong>Executors</strong>: Per-executor metrics (memory, disk, task counts)</li> </ul>

<h3>Performance Optimization Checklist</h3>

<ul> <li><strong>Partition pruning</strong>: Filter on partition columns early in the query to skip reading unnecessary data files</li> <li><strong>Column pruning</strong>: Select only the columns you need (<code>df.select("col1", "col2")</code>) rather than reading all columns</li> <li><strong>Broadcast joins</strong>: For dimension tables under 1 GB, use broadcast joins to avoid expensive shuffle joins: <code>df_fact.join(broadcast(df_dim), "key")</code></li> <li><strong>Cache strategically</strong>: Use <code>df.cache()</code> or <code>df.persist()</code> for DataFrames read multiple times in the same notebook, but unpersist when done</li> <li><strong>Avoid UDFs when possible</strong>: PySpark UDFs serialize data between JVM and Python. Use built-in PySpark functions instead of Python UDFs for 10-100x better performance</li> <li><strong>Right-size writes</strong>: Target 128 MB per output file. Use Optimize Write or manually repartition before writing: <code>df.repartition(num_partitions)</code></li> </ul>

<h3>Cost Monitoring</h3>

<p>Fabric notebook sessions consume capacity units (CUs) from your workspace capacity. Monitor notebook costs through:</p>

<ul> <li><strong>Fabric Capacity Metrics app</strong>: Shows CU consumption by workload type, including notebook/Spark sessions</li> <li><strong>Notebook session duration</strong>: Sessions that run longer consume more CUs. Optimize code to complete faster and configure idle timeout to release unused sessions</li> <li><strong>Session configuration</strong>: Larger sessions (more executors, more memory) consume CUs proportionally faster. Use the smallest session configuration that meets your performance requirements</li> </ul>

<h2>Enterprise Patterns and Best Practices</h2>

<h3>Source Control and CI/CD</h3>

<p>Fabric workspaces integrate with Git (Azure DevOps or GitHub) for source control. Enable Git integration in workspace settings, connect to a repository, and all notebooks, pipelines, and other items are version-controlled. Implement CI/CD using deployment pipelines to promote notebooks from Development to Test to Production workspaces.</p>

<h3>Security and Access Control</h3>

<p>Fabric enforces security at multiple levels:</p>

<ul> <li><strong>Workspace roles</strong>: Admin, Member, Contributor, Viewer roles control who can create, edit, and run notebooks</li> <li><strong>OneLake data access roles</strong>: Control which tables and folders within a Lakehouse are accessible to specific users</li> <li><strong>Row-level security</strong>: Implement RLS in the SQL analytics endpoint or Power BI semantic model for fine-grained data access control</li> <li><strong>Managed private endpoints</strong>: Connect to on-premises or VNet-protected data sources through managed private endpoints for secure data ingestion</li> </ul>

<h3>Error Handling and Logging</h3>

<p>Production notebooks need robust error handling:</p>

<pre><code>import logging
from datetime import datetime

logger = logging.getLogger("data_engineering")
logger.setLevel(logging.INFO)

try:
    # Bronze ingestion
    df_raw = spark.read.format("csv").load("Files/raw/daily/*.csv")
    record_count = df_raw.count()
    logger.info(f"Ingested {record_count} records from source")

    # Transform and write
    df_clean = transform_data(df_raw)
    df_clean.write.format("delta").mode("append").saveAsTable("events_bronze")
    logger.info(f"Successfully wrote {record_count} records to events_bronze")

    # Set notebook exit value for pipeline monitoring
    mssparkutils.notebook.exit(f"SUCCESS: {record_count} records processed at {datetime.now()}")

except Exception as e:
    logger.error(f"Pipeline failed: {str(e)}")
    mssparkutils.notebook.exit(f"FAILURE: {str(e)}")
    raise</code></pre>

<p>For enterprise-scale Fabric implementations including notebook development, medallion architecture design, Spark optimization, and Power BI integration, <a href="/contact">contact EPC Group</a>. Our <a href="/services/microsoft-fabric">Microsoft Fabric consulting</a> team has delivered data engineering solutions across healthcare, financial services, and government organizations requiring <a href="/services/compliance-governance">compliance-grade data governance</a> and enterprise security. We help organizations migrate from legacy ETL platforms to Fabric notebooks, design scalable medallion architectures, optimize Spark performance, and build end-to-end analytics pipelines from ingestion through Power BI visualization.</p>

Frequently Asked Questions

What programming languages are supported in Microsoft Fabric notebooks?

Microsoft Fabric notebooks support four languages: PySpark (Python with Spark), Spark SQL, Scala, and R. Each cell in a notebook can use a different language by adding a magic command at the top of the cell (%%pyspark, %%sql, %%scala, %%sparkr). PySpark is the most widely used for data engineering because it combines Python expressiveness with Spark distributed computing capabilities, and the majority of Fabric documentation, samples, and community resources target PySpark. Spark SQL is useful for analysts familiar with SQL syntax and for quick ad-hoc queries against Lakehouse tables. Scala offers the best Spark performance for compute-intensive workloads since Spark is natively written in Scala. R is available for statistical analysis and data science workloads. All four languages share the same SparkSession, so a table created in a PySpark cell is immediately queryable in a Spark SQL cell within the same notebook. Most enterprise data engineering teams standardize on PySpark for transformation pipelines and use Spark SQL for validation queries and interactive exploration.

How does the medallion architecture work in Microsoft Fabric Lakehouses?

The medallion architecture organizes data into three progressive refinement layers within a Fabric Lakehouse. The Bronze layer ingests raw data from source systems with minimal transformation, preserving the original data format and adding metadata columns like ingestion timestamp, source file name, and batch identifier. Bronze tables use append mode to maintain a complete historical record of all ingested data. The Silver layer reads from Bronze and applies data quality transformations: deduplication, null handling, type casting, standardization, and business key alignment. Silver tables use merge (upsert) operations for incremental processing and represent the single source of truth for clean, conformed data. The Gold layer reads from Silver and creates business-oriented aggregations, dimensional models, KPIs, and summary tables optimized for Power BI consumption. Gold tables are written with V-Order enabled for Direct Lake performance and are typically structured as star schemas with fact and dimension tables. Each layer is stored as Delta Lake tables in the Lakehouse, enabling time travel, ACID transactions, and schema evolution. Data Pipelines orchestrate the flow between layers on schedules or triggers. This architecture provides data lineage, reprocessing capability at any layer, and clean separation between engineering and analytics concerns.

What is V-Order and why should I enable it for Power BI Direct Lake models?

V-Order is a Fabric-specific write optimization that applies a special sort order to Parquet files stored in OneLake Delta tables. When V-Order is enabled (via spark.sql.parquet.vorder.enabled = true in Spark configuration), the Spark engine sorts data within each Parquet file in a way that is optimized for the VertiPaq columnar engine used by Power BI. This optimization enables Direct Lake semantic models to load data from OneLake into VertiPaq memory 10-50 percent faster compared to standard Parquet files. V-Order does not change the logical data or query results; it only changes the physical layout of Parquet files on disk. Enable V-Order for all Gold layer tables that will be consumed by Power BI Direct Lake models. The write-time overhead is minimal (typically 5-15 percent slower writes) but the read-time benefit for Power BI is substantial. V-Order is compatible with all other Delta Lake features including partitioning, Z-Order, time travel, and merge operations. For Bronze and Silver layer tables that are consumed only by other notebooks and not by Power BI, V-Order is optional and can be skipped to maximize write throughput.

How do I schedule Fabric notebooks to run automatically?

Fabric notebooks are scheduled through Data Pipelines, which provide orchestration capabilities similar to Azure Data Factory. Create a new Data Pipeline in your Fabric workspace, add a Notebook activity from the activity palette, and select the target notebook. Configure the schedule trigger (hourly, daily, weekly, or custom cron expression) in the pipeline trigger settings. Pipelines support advanced orchestration: chain multiple notebook activities sequentially (Bronze then Silver then Gold), run independent notebooks in parallel, add If Condition activities for branching logic, and configure retry policies and failure notifications. Parameterized notebooks receive runtime values from the pipeline: define parameters in a cell tagged as the Parameters cell in your notebook, then map pipeline parameters or expressions to notebook parameters in the Notebook activity settings. This enables a single notebook to process different data sources or date ranges based on pipeline inputs. For monitoring, the Fabric monitoring hub shows pipeline run history, duration, status, and detailed per-activity execution logs. Set up alerts via the pipeline notification settings to receive emails on failure. Enterprise deployments typically combine scheduled triggers for regular processing with event-based triggers that fire when new files arrive in OneLake.

Can I use pandas in Fabric notebooks or do I have to use PySpark for everything?

You can use pandas in Fabric notebooks and it is fully supported. Fabric notebooks include pandas pre-installed along with other common Python libraries. However, understanding when to use pandas versus PySpark is critical for performance. Pandas operates on a single node in memory, so it is limited by the memory of the Spark driver node (typically 8-64 GB depending on capacity). PySpark distributes processing across all executor nodes and can handle terabyte-scale data. Use pandas for small datasets (under 1 GB), quick data exploration, and operations that are simpler in pandas than PySpark. Use PySpark for production pipelines processing large datasets. Fabric also supports the pandas API on Spark (formerly Koalas), which provides a pandas-compatible interface that runs on the distributed Spark engine. Convert between PySpark and pandas DataFrames using toPandas() and spark.createDataFrame(pandas_df), but be cautious with toPandas() on large DataFrames as it collects all data to the driver node. For the best of both worlds, use PySpark for heavy transformations and aggregations on large data, then convert the final result to pandas for visualization or export when the result set is small enough to fit in driver memory.

Tags: Microsoft Fabric, PySpark, Data Engineering, Notebooks, Delta Lake, OneLake, Medallion Architecture, Spark, ETL, Lakehouse
