
Getting Started with Fabric Notebooks and PySpark
Learn to process data at scale with PySpark in Microsoft Fabric notebooks, from core data engineering and transformation patterns to Lakehouse integration.
Microsoft Fabric notebooks provide a powerful environment for data engineering at scale. Built on Apache Spark, they support PySpark, Scala, SQL, and R, enabling data engineers to process terabytes of data with familiar programming paradigms while leveraging Fabric's unified analytics platform.
Setting Up Your First Fabric Notebook
Creating a notebook in Fabric starts in any workspace backed by a Fabric capacity. Navigate to your workspace, select New > Notebook, and choose your preferred language. Fabric notebooks connect directly to your Lakehouse, making data immediately accessible without complex configuration.
The notebook environment includes IntelliSense code completion, inline documentation, and variable exploration. Fabric uses managed Spark pools that start in seconds, compared with the minutes traditional Spark clusters typically require.
PySpark Fundamentals in Fabric
PySpark is the Python API for Apache Spark, and it is the most popular language choice in Fabric notebooks. Key operations every data engineer should master include:
Reading Data: Load data from your Lakehouse using Delta format for best performance. Fabric supports reading from CSV, Parquet, JSON, and Delta tables. Use `spark.read.format("delta").load("Tables/your_table")` for Lakehouse tables or `spark.read.format("csv").option("header", "true").load("Files/data.csv")` for raw files.
Transformations: PySpark DataFrames support operations like `select`, `filter`, `groupBy`, `join`, `withColumn`, and `agg`. Chain transformations together for readable pipelines: read source data, apply filters, join reference tables, calculate aggregates, and write results.
Writing Data: Write results back to your Lakehouse in Delta format using `df.write.format("delta").mode("overwrite").saveAsTable("clean_sales")`. Delta format provides ACID transactions, schema evolution, and time travel capabilities.
Delta Lake Integration
Delta Lake is the default storage format in Fabric Lakehouses and provides significant advantages over raw Parquet:
- ACID Transactions: Multiple writers can safely update the same table concurrently
- Schema Evolution: Add new columns without rewriting existing data using `mergeSchema` option
- Time Travel: Query previous versions of your data for auditing or rollback
- MERGE Operations: Perform upserts (insert or update) efficiently with Delta MERGE syntax
- Optimize and Vacuum: Compact small files and clean up old versions to maintain performance
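As a syntax sketch, the MERGE, time travel, and maintenance operations above can all be issued through `spark.sql` from a notebook cell. The table and column names below (`clean_sales`, `sales_updates`, `order_id`) are hypothetical; the statements are only constructed here, and the calls that would execute them in a Fabric notebook are shown as comments.

```python
# Upsert: update matching rows, insert new ones (hypothetical table names).
merge_sql = """
MERGE INTO clean_sales AS target
USING sales_updates AS source
ON target.order_id = source.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
"""

# Time travel: query an earlier version of the table for auditing or rollback.
time_travel_sql = "SELECT * FROM clean_sales VERSION AS OF 3"

# Maintenance: compact small files, then remove files past the retention window.
optimize_sql = "OPTIMIZE clean_sales"
vacuum_sql = "VACUUM clean_sales RETAIN 168 HOURS"

# Inside a Fabric notebook:
#   spark.sql(merge_sql)
#   df_v3 = spark.sql(time_travel_sql)
#   spark.sql(optimize_sql)
#   spark.sql(vacuum_sql)
```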
Performance Optimization Techniques
Fabric Spark performance depends on proper data engineering practices:
Partitioning: Partition large tables by commonly filtered columns such as date or region. Use `df.write.partitionBy("year", "month")` for time-series data. Avoid over-partitioning, which creates too many small files.
Broadcast Joins: When joining a large table with a small lookup table (under 100MB), use broadcast joins: `df_large.join(broadcast(df_small), "key")`. This avoids expensive shuffle operations.
Caching: Cache frequently accessed DataFrames with `df.cache()` when they are used multiple times in the same notebook. Unpersist when no longer needed to free memory.
File Compaction: Run `OPTIMIZE` on Delta tables regularly to merge small files into larger ones, improving read performance. Set target file size based on your query patterns.
Predicate Pushdown: Apply filters early in your pipeline so Spark can skip reading unnecessary data partitions and row groups.
Collaborative Development
Fabric notebooks support real-time collaboration, so multiple data engineers can edit the same notebook simultaneously. Combined with Git integration, teams can:
- Branch notebooks for feature development
- Use pull requests for code review before merging to main
- Track all changes with full version history
- Roll back to any previous version if issues arise
Scheduling and Orchestration
Notebooks can be scheduled to run automatically or orchestrated through Fabric Data Pipelines. Common patterns include:
- Daily ETL: Schedule notebooks to refresh Lakehouse tables each morning
- Pipeline Integration: Call notebooks from Data Pipeline activities with parameterized inputs
- Dependency Chains: Use pipeline activities to run notebooks in sequence with error handling
- Monitoring: Track notebook run history, duration, and failures through the Fabric monitoring hub
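For the pipeline-integration pattern above, a parameterized notebook call can be sketched with `mssparkutils`, the utility library available in Fabric notebooks. The notebook name, timeout, and parameters below are hypothetical, and the call itself is commented out because it only works inside a Fabric session.

```python
# Hypothetical downstream notebook and parameters; in a pipeline these values
# would come from Data Pipeline activity settings instead of being hardcoded.
params = {
    "run_date": "2024-06-01",
    "source_path": "Files/raw/sales",
}

# Inside a Fabric notebook session:
#   from notebookutils import mssparkutils
#   result = mssparkutils.notebook.run("Refresh_Sales", 600, params)
```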
Best Practices for Production Notebooks
- Parameterize inputs: Use notebook parameters for dates, file paths, and configuration values instead of hardcoding
- Error handling: Wrap critical sections in try/except blocks and log errors to a monitoring table
- Modular design: Break complex ETL into multiple focused notebooks called from a pipeline
- Testing: Create validation notebooks that check row counts, null percentages, and data freshness after each ETL run
- Documentation: Use markdown cells to explain business logic, data lineage, and transformation rationale
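The testing practice above can be sketched as a small, framework-free validation helper. The metric names and thresholds are hypothetical; in a real validation notebook the metrics dictionary would be computed from the freshly loaded table (for example with PySpark aggregations) rather than hardcoded, and any failures would be logged to a monitoring table.

```python
def validate_load(metrics, min_rows=1, max_null_pct=5.0, max_age_hours=24):
    """Return a list of human-readable failures; an empty list means the load passed."""
    failures = []
    if metrics["row_count"] < min_rows:
        failures.append(f"row_count {metrics['row_count']} below minimum {min_rows}")
    if metrics["null_pct"] > max_null_pct:
        failures.append(f"null_pct {metrics['null_pct']}% above limit {max_null_pct}%")
    if metrics["age_hours"] > max_age_hours:
        failures.append(f"data is {metrics['age_hours']}h old, limit {max_age_hours}h")
    return failures

# Example: a healthy load passes, a stale one is flagged.
ok = validate_load({"row_count": 10_000, "null_pct": 0.2, "age_hours": 3})
stale = validate_load({"row_count": 10_000, "null_pct": 0.2, "age_hours": 40})
```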
Frequently Asked Questions
Do I need PySpark experience to use Fabric notebooks?
Basic Python knowledge is sufficient to get started. Fabric notebooks also support SQL, which many analysts already know. The PySpark DataFrame API is similar to pandas but designed for distributed processing. Microsoft provides templates and Copilot assistance to help beginners write Spark code.
How does Fabric Spark compare to Azure Databricks?
Fabric Spark is tightly integrated with the Fabric ecosystem (Lakehouse, Warehouse, Power BI) and requires less infrastructure management. Databricks offers more advanced ML capabilities and a larger partner ecosystem. For organizations already invested in Power BI and Fabric, Fabric notebooks provide a more unified experience.
What is the best data format to use in Fabric notebooks?
Delta Lake is the recommended format for Fabric Lakehouses. It provides ACID transactions, schema evolution, time travel, and optimized read performance. Use Delta for all production tables. Raw files (CSV, JSON) should be converted to Delta during ingestion for best query performance.