Getting Started with Fabric Notebooks and PySpark
Data Engineering · 12 min read

Process data at scale with PySpark in Microsoft Fabric notebooks: a hands-on guide to data engineering, transformation, and Lakehouse integration.

By Errin O'Connor, Chief AI Architect

Microsoft Fabric notebooks provide the most productive PySpark development environment available in 2026, combining managed Apache Spark compute with direct Lakehouse integration, IntelliSense, and Copilot AI assistance in a single browser-based interface. If you are evaluating Fabric notebooks for data engineering, the short answer is that they replace the complexity of configuring Databricks clusters, Synapse Spark pools, or standalone Jupyter environments with an experience that starts in under 10 seconds and reads from your Lakehouse tables with zero configuration.

In my 25+ years of implementing enterprise data platforms, I have watched Spark evolve from a research project at Berkeley to the dominant distributed processing engine for Fortune 500 analytics. The biggest barrier to Spark adoption was never the API—it was the infrastructure. Configuring clusters, managing libraries, tuning Spark settings, and connecting to storage consumed 60-70% of a data engineer's time before they wrote a single transformation. Fabric notebooks eliminate that overhead entirely. Our Microsoft Fabric consulting team has migrated organizations from Databricks, Synapse Spark, and EMR to Fabric notebooks, and the consistent result is a 40-50% reduction in time-to-first-pipeline.

Setting Up Your First Fabric Notebook

Creating a notebook in Fabric starts in any workspace assigned to a Fabric capacity. Navigate to your workspace, select New, then Notebook, and choose your preferred language. Fabric supports PySpark (Python), Spark SQL, Scala, and R. For most enterprise data engineering, PySpark is the recommended choice due to its ecosystem of libraries, readability, and broad talent availability.

The notebook environment includes:

  • IntelliSense code completion that understands your Lakehouse schema, suggesting table and column names as you type
  • Inline documentation for PySpark functions with examples
  • Variable explorer showing DataFrame schemas, row counts, and sample data without running additional cells
  • Managed Spark session that initializes in 5-10 seconds compared to 2-5 minutes for traditional cluster startup
  • Built-in Copilot that generates PySpark code from natural language descriptions

The Fabric Spark session uses a managed pool with autoscaling. You do not provision nodes, select VM sizes, or configure autoscaling policies. Fabric handles all of this based on your capacity SKU. An F64 capacity provides 128 Spark vCores (two per capacity unit) that scale dynamically based on workload.

PySpark Fundamentals in Fabric

PySpark is the Python API for Apache Spark, and it is the most popular language in Fabric notebooks by a wide margin. Every data engineer working in Fabric should master these core operations:

Reading Data from Lakehouse

Load data from your Lakehouse using Delta format for optimal performance:

| Read Method | Use Case | Performance |
| --- | --- | --- |
| spark.read.format("delta").load() | Read Delta tables from the Files section | Good: explicit path control |
| spark.sql("SELECT * FROM lakehouse.table") | Read tables from the Tables section | Best: uses the metadata catalog |
| spark.read.parquet() | Read raw Parquet files | Good: no Delta features |
| spark.read.csv() | Read CSV uploads | Acceptable: schema inference needed |

For Lakehouse tables registered in the Tables section, always use Spark SQL syntax. It leverages the Delta transaction log for predicate pushdown, partition pruning, and file skipping—techniques that can reduce scan time by 90% on large tables.

Transformations That Matter

The transformations you will use in 90% of enterprise data engineering pipelines:

  • filter() — Row-level filtering with conditions. Chain multiple filters for complex business rules
  • select() and withColumn() — Column projection and derived columns. Use withColumn for adding calculated fields
  • groupBy().agg() — Aggregations with multiple functions (sum, count, avg, min, max, countDistinct)
  • join() — Table joins with explicit join types (inner, left, right, full, semi, anti)
  • window functions — Row numbering, running totals, lag/lead for time-series calculations
  • unionByName() — Combining DataFrames by matching column names (safer than union(), which matches columns by position)

Writing Results Back to Lakehouse

Write transformed data to your Lakehouse Tables section to make it immediately queryable by SQL analytics endpoints and Direct Lake Power BI models:

Best practices for writing Delta tables:

  • Use mode("overwrite") for full refreshes of dimension tables
  • Use mode("append") for incremental loads into fact tables
  • Partition by date for large fact tables—use .partitionBy("year", "month") to enable partition pruning
  • Apply Z-ordering on high-cardinality filter columns after writing to optimize read performance
  • Set table properties for auto-optimization: delta.autoOptimize.optimizeWrite and delta.autoOptimize.autoCompact

Building a Medallion Architecture with Notebooks

The medallion architecture (Bronze, Silver, Gold) is the standard pattern for organizing data in Fabric Lakehouses. Notebooks are the primary tool for implementing the transformations at each layer:

Bronze Layer (Raw Ingestion)

  • Read raw files from external sources via OneLake shortcuts or pipeline copy activities
  • Preserve original data exactly as received—add metadata columns (ingestion timestamp, source file, batch ID) but never modify source columns
  • Store as Delta tables for ACID guarantees and time travel

Silver Layer (Cleansed and Conformed)

  • Read from Bronze tables and apply data quality rules: null handling, type casting, deduplication, standardization
  • Conform naming conventions across sources (e.g., "CustomerID" vs "cust_id" vs "CUSTOMER_ID" all become customer_id)
  • Apply business logic that is universal across use cases: currency conversion, timezone normalization, reference data joins

Gold Layer (Business-Ready Aggregates)

  • Read from Silver tables and build star schema models optimized for Power BI consumption
  • Create aggregated fact tables for common reporting patterns (daily summaries, monthly rollups)
  • Build dimension tables with slowly changing dimension logic (Type 1 overwrites, Type 2 history tracking)

Performance Optimization for Fabric Notebooks

Spark Configuration Tuning

While Fabric manages cluster infrastructure, you still control critical Spark configuration properties within your notebook:

  • spark.conf.set("spark.sql.shuffle.partitions", "auto") — Let Fabric auto-tune shuffle partitions based on data size (default is 200, which is often too many for small datasets or too few for very large ones)
  • Broadcast joins — For dimension tables under 500 MB, use broadcast hints to avoid shuffle joins: df_fact.join(broadcast(df_dim), "key")
  • Predicate pushdown — Always filter early in your pipeline to reduce data volume before joins and aggregations
  • Cache strategically — Use df.cache() only for DataFrames accessed multiple times. Unnecessary caching wastes memory

Common Performance Anti-Patterns

| Anti-Pattern | Impact | Fix |
| --- | --- | --- |
| collect() on large DataFrames | Crashes the driver with OOM | Use .limit() or aggregate first |
| UDFs for simple operations | 10-100x slower than native functions | Use pyspark.sql.functions instead |
| Narrow transformations in loops | Creates deeply nested query plans | Use reduce() or a union pattern |
| Reading CSV without a schema | Schema inference scans the entire file twice | Provide an explicit StructType schema |
| Too many small files | Slow reads from file-listing overhead | Run OPTIMIZE after writes |

Notebook Scheduling and Orchestration

Notebooks become production pipelines when scheduled through Fabric Data Factory:

  • Direct notebook activity — Drag a notebook into a pipeline and configure parameters, timeout, and retry policies
  • Parameterized notebooks — Pass pipeline parameters (date ranges, source paths, flags) to notebook cells tagged as parameter cells
  • Chaining — Connect multiple notebooks in sequence with conditional logic (on success, on failure, on completion)
  • Monitoring — Fabric provides Spark application monitoring with stage-level metrics, data read/written, and error details
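The parameterized-notebook item deserves a concrete shape. The cell below is what a parameter cell typically contains: plain assignments with safe defaults. In Fabric you tag the cell as a parameter cell, and pipeline parameters overwrite these values at run time; the names and defaults here are hypothetical.

```python
# --- Parameter cell (tagged in the Fabric notebook UI) ---
# Defaults run interactively; a pipeline's notebook activity overrides them.
start_date = "2026-01-01"        # e.g. overridden with the pipeline run date
source_path = "Files/raw/sales"  # hypothetical Lakehouse path
full_refresh = False             # flag consumed by downstream write logic

# --- Downstream cells then use the parameters, for example: ---
query = f"SELECT * FROM sales WHERE order_date >= '{start_date}'"
print(query)
```

Keeping every path, date, and flag in the parameter cell is also the first item on the production checklist later in this article.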

For advanced CI/CD patterns with notebooks, see our guide on Git integration in Fabric and CI/CD pipelines.

Copilot in Fabric Notebooks

Copilot for Fabric notebooks generates PySpark code from natural language descriptions. In my experience, Copilot produces correct, runnable code about 75% of the time for standard data engineering tasks. It excels at:

  • Generating boilerplate read/write operations
  • Building common transformations (joins, aggregations, window functions)
  • Explaining existing code when you highlight a cell
  • Suggesting optimizations for slow operations

Where Copilot struggles: complex business logic requiring domain knowledge, multi-step transformations with intricate dependencies, and advanced Spark tuning. Treat Copilot output as a first draft that requires review by an experienced data engineer.

Ready to accelerate your data engineering with Fabric notebooks? Contact our team for a free architecture review and migration assessment.

PySpark Notebook Production Checklist

Before promoting any Fabric notebook to production, validate against this checklist:

  • Parameterized inputs: All file paths, database names, and configuration values passed via notebook parameters — never hardcoded.
  • Error handling: Try/except blocks around all I/O operations with meaningful error messages logged to a monitoring table.
  • Idempotent writes: Use MERGE INTO for Delta table updates so re-running a failed notebook does not create duplicates.
  • Resource cleanup: Explicitly unpersist cached DataFrames and close any database connections in a finally block.
  • Logging: Write start time, end time, rows processed, and success/failure status to a central audit table for every run.
  • Testing: Unit test transformation functions separately from I/O operations. Use small sample DataFrames for validation.
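The error-handling and logging items can be combined into one small wrapper. This sketch uses plain Python so it runs anywhere; in Fabric the audit record would be appended to a Delta monitoring table (shown in a comment), and the function and table names are made up.

```python
import time
import traceback

def run_with_audit(job_name, fn):
    """Run one pipeline step and return an audit record for a monitoring table."""
    audit = {"job": job_name, "start": time.time()}
    try:
        # fn() is the step's transformation logic; it returns rows processed.
        audit["rows_processed"] = fn()
        audit["status"] = "success"
    except Exception as exc:
        audit["status"] = "failed"
        audit["error"] = f"{exc}\n{traceback.format_exc()}"
    finally:
        audit["end"] = time.time()
        # In Fabric, persist the record for every run, success or failure:
        # spark.createDataFrame([audit]).write.format("delta") \
        #      .mode("append").saveAsTable("ops_audit_log")
    return audit

record = run_with_audit("silver_load", lambda: 1234)
print(record["status"])  # success
```

Because the audit write sits in the finally block, failed runs are logged with their tracebacks rather than disappearing silently, which is what makes the central audit table useful.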

For help building production-grade Fabric notebook pipelines, contact our team.

Frequently Asked Questions

Do I need PySpark experience to use Fabric notebooks?

Basic Python knowledge is sufficient to get started. Fabric notebooks also support SQL, which many analysts already know. PySpark syntax is similar to pandas but designed for distributed processing. Microsoft provides templates and Copilot assistance to help beginners write Spark code.

How does Fabric Spark compare to Azure Databricks?

Fabric Spark is tightly integrated with the Fabric ecosystem (Lakehouse, Warehouse, Power BI) and requires less infrastructure management. Databricks offers more advanced ML capabilities and a larger partner ecosystem. For organizations already invested in Power BI and Fabric, Fabric notebooks provide a more unified experience.

What is the best data format to use in Fabric notebooks?

Delta Lake is the recommended format for Fabric Lakehouses. It provides ACID transactions, schema evolution, time travel, and optimized read performance. Use Delta for all production tables. Raw files (CSV, JSON) should be converted to Delta during ingestion for best query performance.

Tags: Microsoft Fabric · PySpark · Notebooks
