Monitoring Fabric with Capacity Metrics

Track Microsoft Fabric capacity utilization and identify performance bottlenecks. Dashboard monitoring, alerts, and optimization recommendations.

By Errin O'Connor, Chief AI Architect

The Microsoft Fabric Capacity Metrics app is the essential monitoring tool for Fabric administrators — it shows whether your capacity is right-sized, which workloads consume the most resources, when throttling occurs, and which items or users are monopolizing shared compute. Install this app on day one of any Fabric deployment. Without it, you are making capacity decisions blindfolded.

I configure the Capacity Metrics app as the first step in every Fabric engagement. In one recent deployment, the app revealed that a single unoptimized Spark notebook was consuming 43% of a client's F64 capacity during business hours — throttling interactive Power BI queries for 2,000 users. We optimized the notebook (broadcast joins, column pruning, proper partitioning), reduced its CU consumption by 78%, and eliminated all interactive throttling without upgrading the capacity SKU. Our Microsoft Fabric consulting services include capacity optimization as a standard engagement component.

Understanding Fabric Capacity Units

All Fabric workloads consume a shared pool of Capacity Units (CUs). Understanding how CUs work is fundamental to interpreting metrics:

| Concept | Description | Key Detail |
| --- | --- | --- |
| CU Seconds | Unit of compute measurement | 1 CU consumed for 1 second = 1 CU-second |
| CU Allocation | CUs available per SKU | F2 = 2 CUs, F8 = 8 CUs, F64 = 64 CUs, etc. |
| Interactive Operations | User-initiated queries, report renders | Evaluated in 30-second windows, throttled at 100% utilization |
| Background Operations | Scheduled refreshes, pipelines, Spark jobs | Evaluated in 24-hour windows, throttled at sustained overuse |
| Smoothing | CU consumption is smoothed over evaluation windows | A 10-second spike does not immediately trigger throttling |
| Burst | Short-term consumption can exceed SKU allocation | Bursting borrows from future capacity, repaid over the evaluation window |

The distinction between interactive and background operations is critical. Interactive operations (report queries, dashboard refreshes) are throttled quickly when capacity is overloaded — users see slow or failed reports within seconds. Background operations (scheduled refreshes, Spark jobs) are throttled more gradually, with jobs queued rather than rejected. I always tell clients: "Interactive throttling is a user-facing emergency. Background throttling is a planning problem."
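The effect of smoothing can be illustrated with a minimal sketch. This is a simplification for intuition only, not the actual Fabric smoothing algorithm; the window sizes follow the table above (30 seconds for interactive, 24 hours for background), and the sample values are hypothetical.

```python
# Simplified sketch of window smoothing (illustrative only -- the real
# Fabric algorithm is more involved).

def smoothed_utilization(cu_samples, window_seconds, sku_cus):
    """Average CU draw over the last `window_seconds` samples,
    as a fraction of the SKU's allocation (1.0 = 100%)."""
    window = cu_samples[-window_seconds:]
    return sum(window) / (len(window) * sku_cus)

# An F64 capacity: a 10-second spike to 128 CUs inside an otherwise
# quiet 30-second window does not push the smoothed value over 100%.
samples = [16] * 20 + [128] * 10          # one hypothetical sample per second
util = smoothed_utilization(samples, window_seconds=30, sku_cus=64)
print(f"smoothed utilization: {util:.0%}")   # 83% -- no interactive throttling
```

This is why the table above notes that a 10-second spike does not immediately trigger throttling: only the smoothed value is evaluated against the limit.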

Installing and Configuring the App

Prerequisites

  • Fabric Capacity Administrator or Global Administrator role
  • At least one Fabric capacity (F2 or higher, or Power BI Premium P1 or higher)
  • A workspace to host the app

Installation Steps

  1. Navigate to Power BI Service > Apps > Get apps and search for "Microsoft Fabric Capacity Metrics"
  2. Click Install and select the workspace where the app will be stored
  3. After installation, open the app and click Connect to configure the data source
  4. Enter your Capacity ID (found in the Fabric Admin Portal under Capacity Settings) and select the date range
  5. The app begins loading historical data — initial load may take several minutes for large capacities

Data Retention

The app stores 14 days of detailed metrics and 30 days of summarized data. For longer retention, create a dataflow that exports capacity metrics to a Lakehouse on a scheduled basis. I recommend 90-day retention for trend analysis and 12-month retention for capacity planning and budget justification.

Key Dashboard Pages

Overview Page

The overview page shows the most critical health indicators at a glance:

  • CU Utilization Trend: Line chart showing CU consumption percentage over time. Sustained utilization above 80% indicates you are approaching capacity limits. Sustained utilization below 20% suggests over-provisioning — you may be paying for capacity you do not need.
  • Throttling Events: Count and duration of throttling events. Any throttling of interactive operations means users experienced degraded performance. Zero interactive throttling should be your target.
  • Top Items by CU: Ranked list of Fabric items (datasets, notebooks, pipelines) consuming the most CUs. This identifies your optimization targets — the Pareto principle applies here. In most environments, 10-15% of items consume 70-80% of capacity.
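The Pareto observation in the last bullet is easy to verify against your own export. The sketch below, with hypothetical item names and CU figures, finds the smallest set of items covering a target share of total consumption:

```python
# Given (item, CU-seconds) pairs exported from the Capacity Metrics app,
# return the smallest set of items covering `target_share` of consumption.
# Item names and numbers are hypothetical.

def top_consumers(items, target_share=0.75):
    """Rank items by CU consumption and pick until target_share is reached."""
    ranked = sorted(items.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(items.values())
    picked, running = [], 0.0
    for name, cu in ranked:
        picked.append(name)
        running += cu / total
        if running >= target_share:
            break
    return picked

usage = {"SalesModel": 5200, "SparkETL": 3900, "FinanceRefresh": 900,
         "AdHocReports": 600, "DevNotebook": 400}
print(top_consumers(usage))  # ['SalesModel', 'SparkETL'] -- 2 of 5 items, ~83%
```

In this toy data, two of five items already account for roughly 83% of consumption, which is the shape of distribution you should expect to see in most real capacities.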

Workload Breakdown Page

Shows CU consumption segmented by workload type:

  • Power BI: Report queries, dataset refreshes, paginated report renders
  • Data Engineering: Spark notebook execution, Lakehouse operations
  • Data Factory: Pipeline runs, dataflow refreshes, copy activities
  • Real-Time Analytics: Eventstream processing, KQL queries
  • Data Science: ML experiment runs, model training

This breakdown reveals which workloads dominate your capacity. If 70% of CU consumption comes from Spark notebooks, optimizing those notebooks delivers the most savings. If Power BI queries dominate during business hours, focus on DAX optimization and aggregation tables.

Item-Level Detail Page

Drill into individual items to see:

  • CU consumption per refresh or execution
  • Duration trends over time (increasing duration signals degradation)
  • User who triggered the operation
  • Success/failure status
  • Queuing time (time waiting for available CUs before execution starts) — queuing time above zero is an early warning sign of capacity pressure

Interpreting Capacity Health

Healthy Capacity

  • CU utilization averages 40-60% during peak hours
  • Zero or minimal interactive throttling events
  • Background operations complete within scheduled windows
  • No single item consumes more than 20% of total CU budget

Capacity at Risk

  • CU utilization peaks above 80% regularly
  • Occasional interactive throttling during peak hours
  • Background operations starting to queue and delay
  • One or two items dominate CU consumption — these are your immediate optimization targets

Capacity in Crisis

  • CU utilization sustained above 100% (consuming burst capacity)
  • Frequent interactive throttling — users reporting slow or failed reports
  • Background operations significantly delayed or failing
  • Throttling policy may reject new operations entirely
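The three states above can be collapsed into a simple triage function for weekly reviews. The thresholds mirror this article's rules of thumb, not an official Microsoft formula, and the throttling-event cutoffs are illustrative assumptions:

```python
# Triage sketch using this article's rules of thumb (not an official formula).

def capacity_health(peak_utilization, interactive_throttling_events):
    """Classify capacity state from peak CU utilization (1.0 = 100%) and
    interactive throttling events observed in the review period."""
    if peak_utilization > 1.0 or interactive_throttling_events > 5:
        return "crisis"       # sustained burst debt, users seeing failures
    if peak_utilization > 0.8 or interactive_throttling_events > 0:
        return "at risk"      # occasional throttling, background queuing
    return "healthy"

print(capacity_health(0.55, 0))   # healthy
print(capacity_health(0.87, 2))   # at risk
print(capacity_health(1.10, 12))  # crisis
```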

Optimization Strategies by Finding

Finding: Single Refresh Consuming Excessive CUs

Root cause: A large dataset refresh (often with full refresh instead of incremental) monopolizes capacity during refresh windows.

Solution: Implement incremental refresh to reduce the data volume refreshed each cycle. Switch from full refresh to partition-level refresh where possible. Schedule the refresh during off-peak hours when capacity headroom is available. I once reduced a client's refresh CU consumption by 92% simply by implementing incremental refresh on their three largest datasets.
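The potential savings can be estimated before you do the work. Assuming refresh CU cost scales roughly with the volume of data processed (a simplifying assumption that holds for many import-mode models), the expected reduction falls out of the archive window versus the incremental window:

```python
# Back-of-the-envelope estimate: refresh CU cost assumed roughly
# proportional to rows processed (simplifying assumption).

def incremental_savings(history_days, incremental_days):
    """Fraction of refresh CU consumption avoided by refreshing only the
    incremental window instead of the full history."""
    return 1 - incremental_days / history_days

# Refreshing only the last 7 days of a 365-day model:
print(f"{incremental_savings(365, 7):.0%} of refresh CUs avoided")  # 98%
```

This is also why the savings from incremental refresh are largest on your biggest, longest-history datasets: the ratio of incremental window to total history drives the result.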

Finding: Spark Notebooks Consuming Majority of CUs

Root cause: Unoptimized Spark jobs with excessive shuffling, missing partitioning, or oversized cluster configurations.

Solution: Review Spark job optimization: apply proper partitioning, use broadcast joins for small tables, cache intermediate results, and reduce the cluster size to the minimum needed for the workload. Column pruning alone (selecting only needed columns early in the pipeline) typically reduces CU consumption by 20-40%.

Finding: Many Users Running Reports Simultaneously at Peak

Root cause: Report rendering generates interactive CU demand that exceeds capacity during business hours.

Solution: Optimize DAX queries in the most-consumed reports (use Performance Analyzer to identify slow measures). Enable query caching on frequently accessed datasets. Consider scaling the capacity up during peak hours and down during off-hours (Fabric supports capacity pause/resume via API for scheduling).
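The pause/resume scheduling mentioned above goes through the Azure management plane. The sketch below only builds the request URL; the provider path and api-version are assumptions you should verify against current Azure documentation, and real use requires a bearer token from Azure AD:

```python
# Hedged sketch of schedule-based scaling: Fabric capacities can be
# suspended/resumed via the Azure management API. The provider path and
# api-version are assumptions -- verify against current Azure docs before
# use. This only builds the URL; authentication is not shown.

API_VERSION = "2023-11-01"  # assumed; check the Microsoft.Fabric provider docs

def capacity_action_url(subscription_id, resource_group, capacity_name, action):
    """Build the management-plane URL for a suspend/resume POST."""
    assert action in ("suspend", "resume")
    return (
        "https://management.azure.com"
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Fabric/capacities/{capacity_name}"
        f"/{action}?api-version={API_VERSION}"
    )

# A scheduler (Power Automate, Azure Automation, cron) would POST to this
# URL with a bearer token at the start and end of business hours.
print(capacity_action_url("SUB_ID", "rg-fabric", "fabriccap01", "suspend"))
```

Pairing a resume call before business hours with a suspend call after hours is the cheapest way to capture the off-peak savings discussed above.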

Finding: Development Workloads Consuming Production Capacity

Root cause: Data engineers and analysts running ad-hoc Spark notebooks or dataflow tests on the production capacity.

Solution: Create a separate development capacity (smaller SKU, paused when not in use). Move development workspaces to the development capacity. Implement workspace governance that prevents development workloads from running on production. This is one of the most common issues I find in Fabric environments — development and production sharing capacity is a recipe for user-facing performance problems.

Setting Up Alerts

Create proactive alerts to catch capacity issues before users report them:

  • Power BI data alert: Pin the CU Utilization card to a dashboard, then create an alert for when utilization exceeds 85%
  • Power Automate flow: Trigger a Teams notification or email when throttling events are detected
  • Azure Monitor integration: For Premium capacities, configure Azure Monitor alerts on capacity metrics with automatic scaling responses
  • Weekly capacity review: Schedule a 15-minute weekly review of the Capacity Metrics app with your Fabric admin team. Track utilization trends and identify items that need optimization before they cause throttling.

Capacity Planning with Metrics Data

Use historical metrics data to plan capacity changes:

  • Growth trending: If average CU utilization increases 5% month-over-month, project when you will hit capacity limits. Plan your SKU upgrade 2-3 months before projected saturation.
  • Workload forecasting: When onboarding a new department or workload, review similar existing workloads to estimate CU impact
  • SKU optimization: If utilization never exceeds 30%, consider downgrading to a smaller SKU. If throttling occurs regularly, upgrade to the next SKU tier.
  • Cost modeling: Calculate cost-per-CU-second to determine ROI of optimization efforts. A $100/hour optimization effort that reduces daily CU consumption by 20% may pay for itself within a week.
  • Seasonal patterns: Many organizations have monthly, quarterly, or annual peaks (month-end close for finance, enrollment periods for education, holiday seasons for retail). Track these patterns over multiple cycles to right-size capacity for peak periods rather than average periods.
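The growth-trending bullet reduces to a one-line compound-growth projection. A small sketch, assuming utilization compounds at a steady month-over-month rate (real trends are noisier, so treat the result as a planning estimate):

```python
# Months until average utilization crosses a planning threshold,
# assuming steady compound month-over-month growth.

import math

def months_to_saturation(current_util, monthly_growth, limit=0.8):
    """E.g. current_util=0.5, monthly_growth=0.05 for 5% MoM growth."""
    if current_util >= limit:
        return 0
    return math.ceil(math.log(limit / current_util) / math.log(1 + monthly_growth))

# 50% average utilization growing 5% MoM hits the 80% planning threshold in:
print(months_to_saturation(0.5, 0.05))  # 10 months -> start the SKU upgrade at month 7-8
```

Subtracting the 2-3 month lead time recommended above from this figure tells you when to start the upgrade conversation with procurement.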

Building a Capacity Governance Process

The Capacity Metrics app is a tool, not a process. To maximize its value, build a governance process around it:

  1. Weekly review: A 15-minute standing meeting where the Fabric admin reviews the past week's utilization, throttling events, and top-consuming items
  2. Monthly optimization sprint: Identify the top 3 CU consumers and assign optimization tasks to the responsible teams
  3. Quarterly capacity planning: Project forward based on growth trends and planned workload additions. Present to IT leadership with specific SKU recommendations and cost implications.
  4. Incident response protocol: When interactive throttling is detected, have a documented escalation path — who gets notified, what immediate actions to take (pause non-critical background jobs, scale capacity via API), and how to conduct post-incident review

This process turns reactive capacity management ("users are complaining reports are slow") into proactive capacity optimization ("we have 3 months before we need to upgrade SKU based on current growth").

Frequently Asked Questions

How often does the Capacity Metrics app refresh?

The app refreshes every 30 minutes by default. You can see real-time utilization in the Azure portal, but the app provides more detailed historical analysis.

Can I set up alerts for capacity issues?

Yes, you can create Power BI data alerts on key metrics like utilization percentage or throttling events to receive notifications when thresholds are exceeded.

Tags: Microsoft Fabric, Monitoring, Capacity, Performance
