Optimizing Spark Jobs in Fabric


Improve Microsoft Fabric notebook performance with Spark tuning best practices. Optimize partitioning, caching, joins, and cluster configuration.

By Administrator

Spark jobs in Microsoft Fabric notebooks offer enormous processing power for data engineering, but that power comes with complexity. An unoptimized Spark job can take 10x longer and consume 10x more capacity units than a well-tuned one processing identical data. Unlike traditional Spark on Databricks or HDInsight where you manage cluster infrastructure directly, Fabric abstracts cluster management—but this does not mean optimization is automatic. You still control data layout, join strategies, caching, serialization, and code patterns that determine whether your notebook runs in 2 minutes or 2 hours.

Understanding Fabric's Spark Architecture

Compute Model

Fabric Spark uses a capacity-based compute model rather than dedicated clusters:

| Concept | Fabric Spark | Traditional Spark |
|---|---|---|
| Cluster management | Automatic (starter pools) | Manual sizing and configuration |
| Scaling | Auto-scales within capacity limits | Manual, or auto-scale with policies |
| Cold start | 15-30 seconds (starter pools) | 3-10 minutes (new cluster creation) |
| Session isolation | Per notebook; shared within high-concurrency mode | Per cluster, or shared via interactive pools |
| Cost model | Capacity Units (CU) per second | VM hours per node |

Starter pools pre-warm Spark nodes so your first cell executes quickly. However, the available compute is bounded by your Fabric capacity SKU. An F64 capacity provides significantly more Spark CUs than an F2—know your limits before designing large-scale processing jobs.

Session Lifecycle

A Spark session starts when the first cell in a notebook executes and persists across all subsequent cells. The session ends after an idle timeout (default 20 minutes, configurable in workspace settings). Key optimization insight: keep sessions alive for iterative development by executing lightweight cells periodically, but end sessions explicitly in production pipelines to release capacity for other workloads.
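Releasing the session at the end of a production pipeline notebook can be done with a single call. A minimal sketch — `mssparkutils` is the utility library pre-loaded in Fabric notebooks, but verify the exact session API against current Fabric documentation:

```python
# At the end of a production pipeline notebook, release capacity explicitly
# instead of waiting for the 20-minute idle timeout.
mssparkutils.session.stop()   # Fabric/Synapse notebook utility (assumed pre-imported)

# Plain PySpark equivalent if you hold the SparkSession object:
# spark.stop()
```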

High concurrency mode shares a single Spark session across multiple notebooks, reducing cold start time for concurrent users. Enable this for development and exploration workloads where multiple data engineers work in the same workspace simultaneously.

Data Partitioning Strategies

Partitioning is the single most impactful optimization for large datasets. A well-partitioned Delta table enables Spark to skip entire file groups (partition pruning) instead of scanning all data.

Choosing Partition Columns

Select partition columns based on your dominant query and filter patterns:

  • Date-based partitioning (Year/Month or Year/Month/Day): Best for time-series data where queries typically filter by date range. Creates a manageable number of partitions with even data distribution.
  • Category-based partitioning (Region, Department, Product Line): Best when queries consistently filter by a low-cardinality category column.
  • Hybrid partitioning (Year + Region): Combine two columns when queries filter on both dimensions. Use cautiously—this multiplies partition count.

Partition Sizing Rules

  • Target 100MB-1GB per partition file for optimal read performance
  • Avoid more than 10,000 partitions per table (metadata overhead exceeds benefits)
  • Never partition small tables (under 1GB total)—partitioning adds overhead that outweighs benefits on small data
  • Use V-Order optimization (enabled by default in Fabric) which applies special sorting within Parquet files for faster reads by Power BI Direct Lake models
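The sizing rules above reduce to simple arithmetic you can run before writing a table. A back-of-envelope sketch — the function name and the 256 MB target are illustrative choices within the 100 MB-1 GB range, not a Fabric API:

```python
def suggest_file_count(total_bytes: int,
                       target_file_bytes: int = 256 * 1024 ** 2) -> int:
    """Suggest how many files to write for a dataset, aiming for the
    100 MB-1 GB per-file sweet spot (256 MB target is illustrative)."""
    if total_bytes < 1024 ** 3:      # under 1 GB total: don't partition at all
        return 1
    return max(1, round(total_bytes / target_file_bytes))

# A 10 GB table lands at ~40 files of ~256 MB each
print(suggest_file_count(10 * 1024 ** 3))   # -> 40
```

The result is a reasonable argument for `df.repartition(n)` before a write, or a sanity check on how many date/category partitions your scheme will produce.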

Repartitioning In-Flight Data

When processing data that arrives with suboptimal partitioning, use repartition() or coalesce() strategically:

  • `df.repartition(numPartitions, col("partition_column"))` — Full shuffle, use when you need even distribution or repartitioning by a new column
  • `df.coalesce(numPartitions)` — No shuffle, only reduces partition count. Use when you have too many small partitions and need to consolidate

Join Optimization

Joins are the most expensive Spark operation because they typically require shuffling data across executors. Choosing the right join strategy can reduce job runtime by orders of magnitude.

Broadcast Joins (Map-Side Joins)

When one side of a join is small enough to fit in executor memory (typically under 100MB), broadcast it:

```
from pyspark.sql.functions import broadcast

result = large_fact_df.join(broadcast(small_dim_df), "key_column")
```

Broadcast joins eliminate the shuffle entirely. The small table is serialized and sent once to every executor, where the join runs locally against each partition of the large table. Instead of shuffling both tables across the network, only the small table moves — the large table's partitions never leave their executors.

Fabric's Spark automatically broadcasts tables under 10MB by default (controlled by `spark.sql.autoBroadcastJoinThreshold`). Increase this threshold when you have dimension tables in the 10-200MB range that would benefit from broadcasting.
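Raising the threshold is a one-line session configuration. A sketch — the 200 MB value is an example; tune it to your dimension table sizes and available executor memory:

```python
# Broadcast tables up to ~200 MB (value in bytes); the Spark default is 10 MB.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(200 * 1024 * 1024))

# Setting it to -1 disables automatic broadcasting entirely,
# leaving only explicit broadcast() hints in effect.
```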

Sort-Merge Joins

For joins between two large tables where neither fits in memory, Spark defaults to a sort-merge join, whose cost is dominated by the shuffle and sort phases. If both sides are already bucketed or partitioned on the join key, Spark can skip the shuffle, reducing memory pressure and improving stability for very large joins.

Skew Handling

Data skew—where one join key has disproportionately more rows than others—causes a single executor to process most of the data while others sit idle. Detect skew by examining the Spark UI for tasks that take 10x+ longer than the median. Mitigate with:

  • Salting: Add a random suffix to the skewed key, join on the salted key, then aggregate away the salt
  • Splitting: Process the skewed key separately from the rest and union the results
  • AQE (Adaptive Query Execution): Fabric Spark enables AQE by default, which dynamically handles moderate skew by splitting overloaded partitions at runtime
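The salting idea can be illustrated without a cluster. A plain-Python sketch of how a salt spreads one hot key across N buckets — the key `"US"` and the bucket count of 8 are arbitrary; in PySpark you would build the salted column with `concat` and `rand`:

```python
import random

def salted_key(key: str, num_salts: int = 8) -> str:
    """Append a random salt so one hot key hashes to num_salts distinct
    shuffle partitions instead of a single overloaded one."""
    return f"{key}#{random.randrange(num_salts)}"

random.seed(0)
hot_rows = [salted_key("US") for _ in range(10_000)]
print(sorted(set(hot_rows)))   # the single key 'US' now spans up to 8 salted keys
```

On the other side of the join, each key must be exploded into all `num_salts` salted variants so every salted row still finds its match; after the join, aggregate on the original key to discard the salt.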

Caching Strategy

Caching keeps DataFrames in memory (or on disk) so subsequent operations avoid re-reading and re-computing from source:

| Method | Storage | When to Use |
|---|---|---|
| `df.cache()` | Memory with disk spill (MEMORY_AND_DISK default) | Default choice for DataFrames used 2+ times |
| `df.persist(StorageLevel.MEMORY_ONLY)` | Memory only | Small DataFrames where recompute on eviction is acceptable |
| `df.persist(StorageLevel.DISK_ONLY)` | Disk only | Very large DataFrames, memory-constrained sessions |
| Delta table (write and re-read) | OneLake | Between notebook steps or across notebooks |

Critical caching rules:

  • Only cache DataFrames used multiple times in downstream operations. Caching a DataFrame used once adds overhead with zero benefit.
  • Unpersist when done: Call `df.unpersist()` after the cached DataFrame is no longer needed to free memory for subsequent operations.
  • Cache after filtering: Cache the filtered/transformed result, not the raw source. Caching a 100GB raw table wastes memory when your analysis only needs 5GB of filtered data.
  • Materialization requirement: Cache is lazy—call `df.count()` or another action after `.cache()` to force materialization.
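These rules combine into one standard pattern. A PySpark sketch that assumes a live Fabric session — the table path and column names are illustrative:

```python
from pyspark.sql import functions as F

# Cache the *filtered* projection, not the raw source, and force materialization.
filtered = (
    spark.read.format("delta").load("Tables/sales")   # path is illustrative
         .filter(F.col("year") == 2024)
         .select("id", "amount", "region")
)
filtered.cache()
filtered.count()        # action forces the cache to actually materialize

summary = filtered.groupBy("region").agg(F.sum("amount"))          # reuse 1
top_rows = filtered.orderBy(F.col("amount").desc()).limit(10)      # reuse 2

filtered.unpersist()    # release memory once downstream work is done
```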

Code-Level Optimizations

Column Pruning

Select only the columns you need as early as possible in your pipeline. Every unnecessary column consumes memory through serialization, shuffling, and caching:

```
# Good: select only the needed columns during the read
df = (spark.read.format("delta").load(path)
      .select("id", "date", "amount", "category"))

# Bad: read all 50 columns, trim later
df = spark.read.format("delta").load(path)
```

Predicate Pushdown

Filter as early as possible. When reading from Delta tables, Spark pushes filter predicates down to the file scan level, skipping entire Parquet row groups that do not match:

```
# Good: filter during read (predicate pushdown)
df = spark.read.format("delta").load(path).filter("year = 2024 AND region = 'US'")

# Usually equivalent: Catalyst can still push a later filter down to the scan,
# but an intervening non-pushable operation (e.g. a Python UDF) will block it
df = spark.read.format("delta").load(path)
df_filtered = df.filter("year = 2024 AND region = 'US'")
```

Avoid collect() and toPandas() on Large Data

`collect()` pulls all data to the driver node as a Python list. `toPandas()` converts the entire DataFrame to a Pandas DataFrame on the driver. Both will crash your session if the data exceeds driver memory. Use these only on aggregated results (thousands of rows, not millions).
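The safe pattern is to aggregate inside Spark first, then convert only the small result. A sketch that assumes a live session and a large DataFrame `df` with `year`, `month`, and `amount` columns (all illustrative):

```python
from pyspark.sql import functions as F

# Safe: reduce to a small result inside Spark before leaving it.
monthly = (
    df.groupBy("year", "month")                 # df is a large Spark DataFrame
      .agg(F.sum("amount").alias("total"))
)
pdf = monthly.toPandas()    # dozens of rows — safe on the driver

# Unsafe on big data: df.toPandas() or df.collect() pull every row
# to the driver and can exhaust its memory, crashing the session.
```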

Use DataFrame API Over SQL Strings

The DataFrame API enables Spark's Catalyst optimizer to build and optimize the full execution plan. While Spark SQL strings also go through Catalyst, the DataFrame API provides better IDE support, type safety, and composability for complex transformations.

Monitoring with Spark UI

The Spark UI (accessible from the notebook toolbar during and after execution) is your primary diagnostic tool:

  • Jobs tab: See all Spark jobs triggered by your notebook cells. Identify which cells are slow.
  • Stages tab: Drill into stages to find shuffle-heavy operations (large "Shuffle Read" and "Shuffle Write" values indicate expensive data movement).
  • Tasks tab: Identify skew by looking for tasks with durations 10x+ the median—these are bottlenecks.
  • SQL/DataFrame tab: View the physical execution plan including which joins are broadcast vs shuffle, which filters were pushed down, and where data was cached.


Frequently Asked Questions

What is the default Spark cluster size in Fabric?

Fabric automatically scales Spark clusters based on workload. You can configure starter pool settings, but the system manages scaling dynamically. Check your capacity settings for limits.

How do I reduce Spark job cold start time?

Use high concurrency mode to share sessions across notebooks, keep frequently used notebooks running with scheduled refreshes, and consider workspace pools for dedicated compute.

