
Getting Started with Fabric Notebooks and PySpark
Process data at scale with PySpark in Microsoft Fabric notebooks. Tutorials for data engineering, transformation, and lakehouse integration.
Microsoft Fabric notebooks provide the most productive PySpark development environment available in 2026, combining managed Apache Spark compute with direct Lakehouse integration, IntelliSense, and Copilot AI assistance in a single browser-based interface. If you are evaluating Fabric notebooks for data engineering, the short answer is that they replace the complexity of configuring Databricks clusters, Synapse Spark pools, or standalone Jupyter environments with an experience that starts in under 10 seconds and reads from your Lakehouse tables with zero configuration.
In my 25+ years of implementing enterprise data platforms, I have watched Spark evolve from a research project at Berkeley to the dominant distributed processing engine for Fortune 500 analytics. The biggest barrier to Spark adoption was never the API—it was the infrastructure. Configuring clusters, managing libraries, tuning Spark settings, and connecting to storage consumed 60-70% of a data engineer's time before they wrote a single transformation. Fabric notebooks eliminate that overhead entirely. Our Microsoft Fabric consulting team has migrated organizations from Databricks, Synapse Spark, and EMR to Fabric notebooks, and the consistent result is a 40-50% reduction in time-to-first-pipeline.
Setting Up Your First Fabric Notebook
Creating a notebook in Fabric starts in any workspace assigned to a Fabric capacity. Navigate to your workspace, select New, then Notebook, and choose your preferred language. Fabric supports PySpark (Python), Spark SQL, Scala, and R. For most enterprise data engineering, PySpark is the recommended choice due to its ecosystem of libraries, readability, and broad talent availability.
The notebook environment includes:
- IntelliSense code completion that understands your Lakehouse schema, suggesting table and column names as you type
- Inline documentation for PySpark functions with examples
- Variable explorer showing DataFrame schemas, row counts, and sample data without running additional cells
- Managed Spark session that initializes in 5-10 seconds compared to 2-5 minutes for traditional cluster startup
- Built-in Copilot that generates PySpark code from natural language descriptions
The Fabric Spark session uses a managed pool with autoscaling. You do not provision nodes, select VM sizes, or configure autoscaling policies. Fabric handles all of this based on your capacity SKU. An F64 capacity provides 128 Spark vCores (each capacity unit maps to two Spark vCores) that scale dynamically based on workload.
PySpark Fundamentals in Fabric
PySpark is the Python API for Apache Spark, and it is the most popular language in Fabric notebooks by a wide margin. Every data engineer working in Fabric should master these core operations:
Reading Data from Lakehouse
Load data from your Lakehouse using Delta format for optimal performance:
| Read Method | Use Case | Performance |
|---|---|---|
| spark.read.format("delta").load() | Read Delta tables from Files section | Good—explicit path control |
| spark.sql("SELECT * FROM lakehouse.table") | Read tables from Tables section | Best—uses metadata catalog |
| spark.read.parquet() | Read raw Parquet files | Good—no Delta features |
| spark.read.csv() | Read CSV uploads | Acceptable—schema inference needed |
For Lakehouse tables registered in the Tables section, always use Spark SQL syntax. It leverages the Delta transaction log for predicate pushdown, partition pruning, and file skipping—techniques that can reduce scan time by 90% on large tables.
Transformations That Matter
The transformations you will use in 90% of enterprise data engineering pipelines:
- filter() — Row-level filtering with conditions. Chain multiple filters for complex business rules
- select() and withColumn() — Column projection and derived columns. Use withColumn for adding calculated fields
- groupBy().agg() — Aggregations with multiple functions (sum, count, avg, min, max, countDistinct)
- join() — Table joins with explicit join types (inner, left, right, full, semi, anti)
- window functions — Row numbering, running totals, lag/lead for time-series calculations
- unionByName() — Combining DataFrames with matching column names (safer than union(), which matches columns by position)
Writing Results Back to Lakehouse
Write transformed data to your Lakehouse Tables section to make it immediately queryable by SQL analytics endpoints and Direct Lake Power BI models:
Best practices for writing Delta tables:
- Use mode("overwrite") for full refreshes of dimension tables
- Use mode("append") for incremental loads into fact tables
- Partition by date for large fact tables—use .partitionBy("year", "month") to enable partition pruning
- Apply Z-ordering on high-cardinality filter columns after writing to optimize read performance
- Set table properties for auto-optimization: delta.autoOptimize.optimizeWrite and delta.autoOptimize.autoCompact
Building a Medallion Architecture with Notebooks
The medallion architecture (Bronze, Silver, Gold) is the standard pattern for organizing data in Fabric Lakehouses. Notebooks are the primary tool for implementing the transformations at each layer:
Bronze Layer (Raw Ingestion)
- Read raw files from external sources via OneLake shortcuts or pipeline copy activities
- Preserve original data exactly as received—add metadata columns (ingestion timestamp, source file, batch ID) but never modify source columns
- Store as Delta tables for ACID guarantees and time travel
Silver Layer (Cleansed and Conformed)
- Read from Bronze tables and apply data quality rules: null handling, type casting, deduplication, standardization
- Conform naming conventions across sources (e.g., "CustomerID" vs "cust_id" vs "CUSTOMER_ID" all become customer_id)
- Apply business logic that is universal across use cases: currency conversion, timezone normalization, reference data joins
Gold Layer (Business-Ready Aggregates)
- Read from Silver tables and build star schema models optimized for Power BI consumption
- Create aggregated fact tables for common reporting patterns (daily summaries, monthly rollups)
- Build dimension tables with slowly changing dimension logic (Type 1 overwrites, Type 2 history tracking)
Performance Optimization for Fabric Notebooks
Spark Configuration Tuning
While Fabric manages cluster infrastructure, you still control critical Spark configuration properties within your notebook:
- spark.conf.set("spark.sql.shuffle.partitions", "auto") — Let Fabric auto-tune shuffle partitions based on data size (default is 200, which is often too many for small datasets or too few for very large ones)
- Broadcast joins — For dimension tables under 500 MB, use broadcast hints to avoid shuffle joins: df_fact.join(broadcast(df_dim), "key")
- Predicate pushdown — Always filter early in your pipeline to reduce data volume before joins and aggregations
- Cache strategically — Use df.cache() only for DataFrames accessed multiple times. Unnecessary caching wastes memory
Common Performance Anti-Patterns
| Anti-Pattern | Impact | Fix |
|---|---|---|
| collect() on large DataFrames | Crashes driver with OOM | Use .limit() or aggregate first |
| UDFs for simple operations | 10-100x slower than native functions | Use pyspark.sql.functions instead |
| withColumn or union called in a loop | Builds deeply nested query plans | Use a single select() or functools.reduce() |
| Reading CSV without schema | Schema inference scans entire file twice | Provide explicit StructType schema |
| Too many small files | Slow reads due to file listing overhead | Run OPTIMIZE after writes |
Notebook Scheduling and Orchestration
Notebooks become production pipelines when scheduled through Fabric Data Factory:
- Direct notebook activity — Drag a notebook into a pipeline and configure parameters, timeout, and retry policies
- Parameterized notebooks — Pass pipeline parameters (date ranges, source paths, flags) to notebook cells tagged as parameter cells
- Chaining — Connect multiple notebooks in sequence with conditional logic (on success, on failure, on completion)
- Monitoring — Fabric provides Spark application monitoring with stage-level metrics, data read/written, and error details
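A parameter cell might look like the following sketch (all parameter names, paths, and values are illustrative):

```python
# Parameter cell: tag this cell as a parameter cell in the Fabric notebook UI.
# The defaults below are overridden by values passed from the pipeline's
# notebook activity at run time.
run_date = "2024-01-31"
source_path = "Files/raw/orders"
full_refresh = False

# Downstream cells build on the parameters instead of hardcoding values.
input_path = f"{source_path}/{run_date}"
write_mode = "overwrite" if full_refresh else "append"
```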
For advanced CI/CD patterns with notebooks, see our guide on Git integration in Fabric and CI/CD pipelines.
Copilot in Fabric Notebooks
Copilot for Fabric notebooks generates PySpark code from natural language descriptions. In my experience, Copilot produces correct, runnable code about 75% of the time for standard data engineering tasks. It excels at:
- Generating boilerplate read/write operations
- Building common transformations (joins, aggregations, window functions)
- Explaining existing code when you highlight a cell
- Suggesting optimizations for slow operations
Where Copilot struggles: complex business logic requiring domain knowledge, multi-step transformations with intricate dependencies, and advanced Spark tuning. Treat Copilot output as a first draft that requires review by an experienced data engineer.
Ready to accelerate your data engineering with Fabric notebooks? Contact our team for a free architecture review and migration assessment.
PySpark Notebook Production Checklist
Before promoting any Fabric notebook to production, validate against this checklist:
- Parameterized inputs: All file paths, database names, and configuration values passed via notebook parameters — never hardcoded.
- Error handling: Try/except blocks around all I/O operations with meaningful error messages logged to a monitoring table.
- Idempotent writes: Use MERGE INTO for Delta table updates so re-running a failed notebook does not create duplicates.
- Resource cleanup: Explicitly unpersist cached DataFrames and close any database connections in a finally block.
- Logging: Write start time, end time, rows processed, and success/failure status to a central audit table for every run.
- Testing: Unit test transformation functions separately from I/O operations. Use small sample DataFrames for validation.
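The idempotent-write item can be sketched as a small helper that builds a MERGE INTO statement (table and column names are illustrative; in a notebook you would execute the result with spark.sql):

```python
def build_merge_sql(target: str, source_view: str, key_cols: list, update_cols: list) -> str:
    """Build an idempotent upsert so re-running a failed load creates no duplicates."""
    on_clause = " AND ".join(f"t.{c} = s.{c}" for c in key_cols)
    set_clause = ", ".join(f"t.{c} = s.{c}" for c in update_cols)
    all_cols = key_cols + update_cols
    insert_cols = ", ".join(all_cols)
    insert_vals = ", ".join(f"s.{c}" for c in all_cols)
    return (
        f"MERGE INTO {target} t USING {source_view} s "
        f"ON {on_clause} "
        f"WHEN MATCHED THEN UPDATE SET {set_clause} "
        f"WHEN NOT MATCHED THEN INSERT ({insert_cols}) VALUES ({insert_vals})"
    )

# Hypothetical target table and staged source view:
merge_sql = build_merge_sql("silver.orders", "staged_orders", ["order_id"], ["amount", "status"])
```

Because MERGE matches on the key columns, replaying the same batch updates existing rows instead of appending duplicates, which is what makes the notebook safe to re-run after a failure.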
For help building production-grade Fabric notebook pipelines, contact our team.
Frequently Asked Questions
Do I need PySpark experience to use Fabric notebooks?
Basic Python knowledge is sufficient to get started. Fabric notebooks also support SQL, which many analysts already know. PySpark syntax is similar to pandas but designed for distributed processing. Microsoft provides templates and Copilot assistance to help beginners write Spark code.
How does Fabric Spark compare to Azure Databricks?
Fabric Spark is tightly integrated with the Fabric ecosystem (Lakehouse, Warehouse, Power BI) and requires less infrastructure management. Databricks offers more advanced ML capabilities and a larger partner ecosystem. For organizations already invested in Power BI and Fabric, Fabric notebooks provide a more unified experience.
What is the best data format to use in Fabric notebooks?
Delta Lake is the recommended format for Fabric Lakehouses. It provides ACID transactions, schema evolution, time travel, and optimized read performance. Use Delta for all production tables. Raw files (CSV, JSON) should be converted to Delta during ingestion for best query performance.