Getting Started with Fabric Notebooks and PySpark
Data Engineering · 12 min read

Process data at scale with PySpark in Microsoft Fabric notebooks. Tutorials for data engineering, transformation, and lakehouse integration.

By Errin O'Connor, Chief AI Architect

Microsoft Fabric notebooks provide the most productive PySpark development environment available in 2026, combining managed Apache Spark compute with direct Lakehouse integration, IntelliSense, and Copilot AI assistance in a single browser-based interface. If you are evaluating Fabric notebooks for data engineering, the short answer is that they replace the complexity of configuring Databricks clusters, Synapse Spark pools, or standalone Jupyter environments with an experience that starts in under 10 seconds and reads from your Lakehouse tables with zero configuration.

In my 25+ years of implementing enterprise data platforms, I have watched Spark evolve from a research project at Berkeley to the dominant distributed processing engine for Fortune 500 analytics. The biggest barrier to Spark adoption was never the API—it was the infrastructure. Configuring clusters, managing libraries, tuning Spark settings, and connecting to storage consumed 60-70% of a data engineer's time before they wrote a single transformation. Fabric notebooks eliminate that overhead entirely. Our Microsoft Fabric consulting team has migrated organizations from Databricks, Synapse Spark, and EMR to Fabric notebooks, and the consistent result is a 40-50% reduction in time-to-first-pipeline.

Setting Up Your First Fabric Notebook

Creating a notebook in Fabric starts in any workspace assigned to a Fabric capacity. Navigate to your workspace, select New, then Notebook, and choose your preferred language. Fabric supports PySpark (Python), Spark SQL, Scala, and R. For most enterprise data engineering, PySpark is the recommended choice due to its ecosystem of libraries, readability, and broad talent availability.

The notebook environment includes:

  • IntelliSense code completion that understands your Lakehouse schema, suggesting table and column names as you type
  • Inline documentation for PySpark functions with examples
  • Variable explorer showing DataFrame schemas, row counts, and sample data without running additional cells
  • Managed Spark session that initializes in 5-10 seconds compared to 2-5 minutes for traditional cluster startup
  • Built-in Copilot that generates PySpark code from natural language descriptions

The Fabric Spark session uses a managed pool with autoscaling. You do not provision nodes, select VM sizes, or configure autoscaling policies. Fabric handles all of this based on your capacity SKU. An F64 capacity provides approximately 64 Spark vCores that scale dynamically based on workload.

PySpark Fundamentals in Fabric

PySpark is the Python API for Apache Spark, and it is the most popular language in Fabric notebooks by a wide margin. Every data engineer working in Fabric should master these core operations:

Reading Data from Lakehouse

Load data from your Lakehouse using Delta format for optimal performance:

| Read Method | Use Case | Performance |
| --- | --- | --- |
| spark.read.format("delta").load() | Read Delta tables from Files section | Good; explicit path control |
| spark.sql("SELECT * FROM lakehouse.table") | Read tables from Tables section | Best; uses metadata catalog |
| spark.read.parquet() | Read raw Parquet files | Good; no Delta features |
| spark.read.csv() | Read CSV uploads | Acceptable; schema inference needed |

For Lakehouse tables registered in the Tables section, always use Spark SQL syntax. It leverages the Delta transaction log for predicate pushdown, partition pruning, and file skipping—techniques that can reduce scan time by 90% on large tables.

Transformations That Matter

The transformations you will use in 90% of enterprise data engineering pipelines:

  • filter() — Row-level filtering with conditions. Chain multiple filters for complex business rules
  • select() and withColumn() — Column projection and derived columns. Use withColumn for adding calculated fields
  • groupBy().agg() — Aggregations with multiple functions (sum, count, avg, min, max, countDistinct)
  • join() — Table joins with explicit join types (inner, left, right, full, semi, anti)
  • window functions — Row numbering, running totals, lag/lead for time-series calculations
  • unionByName() — Combining DataFrames with matching column names (safer than union which uses position)

Writing Results Back to Lakehouse

Write transformed data to your Lakehouse Tables section to make it immediately queryable by SQL analytics endpoints and Direct Lake Power BI models.

Best practices for writing Delta tables:

  • Use mode("overwrite") for full refreshes of dimension tables
  • Use mode("append") for incremental loads into fact tables
  • Partition by date for large fact tables—use .partitionBy("year", "month") to enable partition pruning
  • Apply Z-ordering on high-cardinality filter columns after writing to optimize read performance
  • Set table properties for auto-optimization: delta.autoOptimize.optimizeWrite and delta.autoOptimize.autoCompact
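
A sketch applying these practices; the helper function, table name, and partition columns are hypothetical, and the code assumes a Fabric notebook where `spark` and Delta support are preconfigured:

```python
# Hypothetical helper wrapping the write best practices above.
def write_fact_table(df, table_name: str, mode: str = "append"):
    """Write a DataFrame to the Lakehouse Tables section as a partitioned Delta table."""
    (
        df.write
        .format("delta")
        .mode(mode)                    # "overwrite" for dimensions, "append" for facts
        .partitionBy("year", "month")  # enables partition pruning on date filters
        .saveAsTable(table_name)
    )

# Auto-optimization properties are set once per table, e.g.:
# spark.sql("ALTER TABLE gold_sales SET TBLPROPERTIES ("
#           "'delta.autoOptimize.optimizeWrite' = 'true', "
#           "'delta.autoOptimize.autoCompact' = 'true')")
```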

Building a Medallion Architecture with Notebooks

The medallion architecture (Bronze, Silver, Gold) is the standard pattern for organizing data in Fabric Lakehouses. Notebooks are the primary tool for implementing the transformations at each layer:

Bronze Layer (Raw Ingestion)

  • Read raw files from external sources via OneLake shortcuts or pipeline copy activities
  • Preserve original data exactly as received—add metadata columns (ingestion timestamp, source file, batch ID) but never modify source columns
  • Store as Delta tables for ACID guarantees and time travel

Silver Layer (Cleansed and Conformed)

  • Read from Bronze tables and apply data quality rules: null handling, type casting, deduplication, standardization
  • Conform naming conventions across sources (e.g., "CustomerID" vs "cust_id" vs "CUSTOMER_ID" all become customer_id)
  • Apply business logic that is universal across use cases: currency conversion, timezone normalization, reference data joins

Gold Layer (Business-Ready Aggregates)

  • Read from Silver tables and build star schema models optimized for Power BI consumption
  • Create aggregated fact tables for common reporting patterns (daily summaries, monthly rollups)
  • Build dimension tables with slowly changing dimension logic (Type 1 overwrites, Type 2 history tracking)

Performance Optimization for Fabric Notebooks

Spark Configuration Tuning

While Fabric manages cluster infrastructure, you still control critical Spark configuration properties within your notebook:

  • spark.conf.set("spark.sql.shuffle.partitions", "auto") — Let Fabric auto-tune shuffle partitions based on data size (default is 200, which is often too many for small datasets or too few for very large ones)
  • Broadcast joins — For dimension tables under 500 MB, use broadcast hints to avoid shuffle joins: df_fact.join(broadcast(df_dim), "key")
  • Predicate pushdown — Always filter early in your pipeline to reduce data volume before joins and aggregations
  • Cache strategically — Use df.cache() only for DataFrames accessed multiple times. Unnecessary caching wastes memory

Common Performance Anti-Patterns

| Anti-Pattern | Impact | Fix |
| --- | --- | --- |
| collect() on large DataFrames | Crashes driver with OOM | Use .limit() or aggregate first |
| UDFs for simple operations | 10-100x slower than native functions | Use pyspark.sql.functions instead |
| Narrow transformations in loops | Creates deeply nested query plans | Use reduce() or union pattern |
| Reading CSV without schema | Schema inference scans entire file twice | Provide explicit StructType schema |
| Too many small files | Slow reads due to file listing overhead | Run OPTIMIZE after writes |

Notebook Scheduling and Orchestration

Notebooks become production pipelines when scheduled through Fabric Data Factory:

  • Direct notebook activity — Drag a notebook into a pipeline and configure parameters, timeout, and retry policies
  • Parameterized notebooks — Pass pipeline parameters (date ranges, source paths, flags) to notebook cells tagged as parameter cells
  • Chaining — Connect multiple notebooks in sequence with conditional logic (on success, on failure, on completion)
  • Monitoring — Fabric provides Spark application monitoring with stage-level metrics, data read/written, and error details
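
A sketch of a parameter cell (the parameter names and values are hypothetical). In Fabric, you mark the cell as a parameter cell from the cell options menu; values supplied by the pipeline's notebook activity then override these defaults at runtime:

```python
# Parameter cell: defaults for interactive runs, overridden by the pipeline.
start_date = "2026-01-01"
source_path = "Files/bronze/orders"
full_refresh = False

# Later cells consume the parameters as ordinary variables:
write_mode = "overwrite" if full_refresh else "append"
```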

For advanced CI/CD patterns with notebooks, see our guide on Git integration in Fabric and CI/CD pipelines.

Copilot in Fabric Notebooks

Copilot for Fabric notebooks generates PySpark code from natural language descriptions. In my experience, Copilot produces correct, runnable code about 75% of the time for standard data engineering tasks. It excels at:

  • Generating boilerplate read/write operations
  • Building common transformations (joins, aggregations, window functions)
  • Explaining existing code when you highlight a cell
  • Suggesting optimizations for slow operations

Where Copilot struggles: complex business logic requiring domain knowledge, multi-step transformations with intricate dependencies, and advanced Spark tuning. Treat Copilot output as a first draft that requires review by an experienced data engineer.

Ready to accelerate your data engineering with Fabric notebooks? Contact our team for a free architecture review and migration assessment.

PySpark Notebook Production Checklist

Before promoting any Fabric notebook to production, validate against this checklist:

  • Parameterized inputs: All file paths, database names, and configuration values passed via notebook parameters — never hardcoded.
  • Error handling: Try/except blocks around all I/O operations with meaningful error messages logged to a monitoring table.
  • Idempotent writes: Use MERGE INTO for Delta table updates so re-running a failed notebook does not create duplicates.
  • Resource cleanup: Explicitly unpersist cached DataFrames and close any database connections in a finally block.
  • Logging: Write start time, end time, rows processed, and success/failure status to a central audit table for every run.
  • Testing: Unit test transformation functions separately from I/O operations. Use small sample DataFrames for validation.
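
A sketch combining several checklist items into one helper: error handling, cleanup in a finally block, and an audit record. The function, audit fields, and MERGE key are hypothetical, and the commented SQL requires a Delta-enabled Fabric session:

```python
def run_step(df, target_table: str) -> dict:
    """Run one pipeline step with error handling, cleanup, and an audit record."""
    audit = {"table": target_table, "rows": None, "status": None}
    df.cache()
    try:
        audit["rows"] = df.count()
        # Idempotent upsert: re-running a failed notebook does not duplicate rows.
        # df.createOrReplaceTempView("updates")
        # spark.sql(f"""
        #     MERGE INTO {target_table} AS t
        #     USING updates AS s ON t.id = s.id
        #     WHEN MATCHED THEN UPDATE SET *
        #     WHEN NOT MATCHED THEN INSERT *
        # """)
        audit["status"] = "success"
    except Exception as exc:
        audit["status"] = f"failed: {exc}"
        raise
    finally:
        df.unpersist()  # release the cache even when the step fails
    return audit        # in production, append this record to a central audit table
```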

For help building production-grade Fabric notebook pipelines, contact our team.

Enterprise Implementation Best Practices

Deploying Microsoft Fabric at enterprise scale requires a structured approach that addresses governance, security, and organizational readiness from day one. Organizations that skip the planning phase typically face costly rework within the first 90 days.

Establish a Fabric Center of Excellence (CoE) before provisioning production capacities. The CoE should include a Fabric admin, at least one data engineer, a Power BI developer, and a business stakeholder who understands the reporting requirements. This cross-functional team defines workspace naming conventions, capacity allocation policies, and data classification standards that prevent sprawl as adoption grows.

Implement environment separation from the start. Use dedicated workspaces for development, testing, and production with deployment pipelines automating the promotion process. Every Lakehouse, warehouse, and semantic model should follow a consistent naming convention that includes the business domain, data layer (bronze, silver, gold), and environment identifier. This structure makes governance auditable and reduces the risk of accidental production changes.

Right-size your Fabric capacity based on actual workload profiles, not vendor sizing guides. Run a two-week proof of concept on an F64 capacity with representative data volumes and query patterns. Monitor CU consumption using the Fabric Capacity Metrics app, then adjust the SKU based on measured peak and sustained usage. Over-provisioning wastes budget; under-provisioning creates throttling that frustrates users during critical reporting windows.

Data security must be layered. Configure workspace-level RBAC for broad access control, OneLake data access roles for table-level permissions, and row-level security in semantic models for user-specific filtering. Sensitivity labels from Microsoft Purview should be applied to all datasets containing PII, financial data, or protected health information to ensure compliance with HIPAA, SOC 2, and GDPR requirements.

Measuring Success and ROI

Quantifying Microsoft Fabric impact requires tracking metrics across infrastructure cost reduction, operational efficiency, and business value creation.

Infrastructure savings are the most immediately measurable. Compare monthly Azure spend before and after Fabric migration, including compute, storage, and data movement costs across all replaced services. Organizations typically see 30-60% reduction in total analytics infrastructure costs within the first six months, primarily from eliminating redundant storage copies and consolidating multiple service SKUs into a single Fabric capacity.

Operational efficiency gains show up in reduced time-to-insight. Measure the average time from data availability to published report before and after Fabric adoption. Track pipeline failure rates, data freshness SLAs, and the number of manual data preparation steps eliminated by OneLake unified storage. Target a 40-50% reduction in data engineering effort within the first year.

Business value metrics connect Fabric capabilities to revenue and decision-making speed. Track the number of business decisions supported by Fabric-powered analytics per quarter, the time to answer ad-hoc business questions, and user adoption rates across departments. Establish quarterly business reviews where stakeholders quantify decisions that were enabled or accelerated by the platform.

Ready to move from strategy to execution? Our team of certified consultants has delivered 500+ enterprise analytics projects across healthcare, financial services, manufacturing, and government. Whether you need architecture design, hands-on implementation, or ongoing optimization, our Microsoft Fabric implementation services are designed for organizations that demand production-grade results. Contact us today for a free assessment and learn how we can accelerate your analytics transformation.

Frequently Asked Questions

Do I need PySpark experience to use Fabric notebooks?

Basic Python knowledge is sufficient to get started. Fabric notebooks also support SQL, which many analysts already know. PySpark syntax is similar to pandas but designed for distributed processing. Microsoft provides templates and Copilot assistance to help beginners write Spark code.

How does Fabric Spark compare to Azure Databricks?

Fabric Spark is tightly integrated with the Fabric ecosystem (Lakehouse, Warehouse, Power BI) and requires less infrastructure management. Databricks offers more advanced ML capabilities and a larger partner ecosystem. For organizations already invested in Power BI and Fabric, Fabric notebooks provide a more unified experience.

What is the best data format to use in Fabric notebooks?

Delta Lake is the recommended format for Fabric Lakehouses. It provides ACID transactions, schema evolution, time travel, and optimized read performance. Use Delta for all production tables. Raw files (CSV, JSON) should be converted to Delta during ingestion for best query performance.

