Using the Fabric Monitoring Hub

Microsoft Fabric · 16 min read

Monitor all Microsoft Fabric activities from a central hub. Track pipeline runs, Spark jobs, data refreshes, and capacity utilization in real time.

By Errin O'Connor, Chief AI Architect

The Microsoft Fabric Monitoring Hub provides unified visibility into every operation running across your Fabric environment: pipeline runs, Spark jobs, notebook executions, dataflow refreshes, and semantic model refreshes, all in a single interface. For administrators and data engineers managing production Fabric workloads, it is the first place to look when something fails, runs slowly, or consumes unexpected resources. If you operate Fabric at enterprise scale, check the Monitoring Hub daily, and, more importantly, build alerting and custom monitoring on top of it so issues find you before users report them.

I manage Fabric environments processing 500+ pipeline runs daily for enterprise clients, and the Monitoring Hub is the operational nerve center. In one engagement, Monitoring Hub data helped us identify that a single misconfigured Spark notebook was consuming 40% of total capacity during business hours, causing report slowdowns for 2,000 users. Without centralized monitoring, that issue would have taken weeks to diagnose. Our Microsoft Fabric consulting team builds comprehensive monitoring strategies that combine the built-in Monitoring Hub with custom alerting and operational dashboards.

Accessing the Monitoring Hub

The Monitoring Hub is available from two entry points. At the workspace level, navigate to any workspace and select "Monitoring hub" from the left navigation pane, which shows activities for that specific workspace. At the Fabric homepage level, select "Monitoring hub" from the main navigation, which shows activities across all workspaces you have access to. I recommend the homepage-level view for operations teams who need cross-workspace visibility.

Permissions determine what you see. Workspace admins and members see all activities in their workspaces. Contributors see their own activities and activities on items they have contributor access to. Viewers see only their own triggered activities. Fabric administrators see everything across the entire tenant. For effective monitoring, ensure your operations team has at least Member access to all production workspaces.

Understanding the Activity View

The Monitoring Hub displays a table of activities with key columns:

| Column | Description | What I Look For |
| --- | --- | --- |
| Item Name | The artifact that ran (pipeline, notebook, semantic model) | Identifying which specific workload ran |
| Item Type | Category of the item (Pipeline, Spark Job, etc.) | Filtering by workload type during investigation |
| Status | Running, Succeeded, Failed, Cancelled | Failed items first, then long-running items |
| Start Time | When the activity began | Correlating failures with infrastructure events |
| Duration | Total elapsed time | Comparing against historical baselines |
| Submitted By | User or service principal that triggered the run | Accountability and debugging ad-hoc runs |
| Workspace | Which workspace contains the item | Organizational context for cross-workspace triage |

Filtering and Search Strategies

Effective monitoring depends on finding relevant activities quickly among potentially thousands of entries. Here are the filtering strategies I use daily:

Status Filters: Filter by Running (currently active), Succeeded (completed successfully), Failed (completed with errors), or Cancelled (manually or automatically stopped). For troubleshooting, always filter by Failed first. For capacity planning, filter by Running to see current load.

Item Type Filters: Narrow to specific workload types: Pipeline activities, Spark job runs, Dataflow Gen2 refreshes, Semantic model refreshes, KQL queryset runs, or Notebook executions. This is essential when investigating a specific workload category. When users report slow reports, I filter to Semantic model refreshes to check if refresh delays are causing stale data.

Time Range Filters: The Monitoring Hub shows the last 30 days of activity by default. Adjust the time range to focus on recent issues (last 24 hours) or investigate historical patterns (last 7 days, last 30 days). For trend analysis beyond 30 days, export data to your own storage.

Workspace Filters: When monitoring cross-workspace from the Fabric homepage, filter by specific workspaces to focus on a particular team or project. I create workspace naming conventions (Prod-Finance, Prod-Sales) that make filtering intuitive.

Investigating Failed Activities

When an activity fails, the Monitoring Hub provides the starting point for root cause analysis. Here is my systematic investigation approach:

Error Messages: Click on a failed activity to view the error details. Error messages range from clear (connection timeout, insufficient permissions) to cryptic (internal error codes). I maintain a runbook mapping common error codes to their solutions. After 6 months of operation, your runbook will cover 90% of failure scenarios.
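
A runbook can start as something as simple as a lookup table. Here is a minimal Python sketch; the error substrings and resolutions are illustrative examples I chose for this article, not an official Fabric error catalog:

```python
# Minimal error-code runbook: map known failure substrings to documented
# resolutions. Entries here are illustrative, not a Fabric error catalog.
RUNBOOK = {
    "connection timeout": "Verify source connectivity and firewall rules.",
    "insufficient permissions": "Re-grant workspace access to the service principal.",
    "out of memory": "Scale up Spark executors or reduce partition sizes.",
}

def lookup_resolution(error_message: str) -> str:
    """Return the first matching runbook entry, or a fallback prompting a new entry."""
    msg = error_message.lower()
    for pattern, resolution in RUNBOOK.items():
        if pattern in msg:
            return resolution
    return "No runbook entry yet -- investigate and add one."
```

Every time the fallback fires, you have found a new failure type to document, which is how the runbook converges on covering most scenarios.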

Activity Details: The detail pane shows parameters passed to the activity, input/output datasets, and execution stages. For pipelines, you can see which specific activity within the pipeline failed, saving you from investigating the entire pipeline.

Spark Job Details: For failed Spark jobs (notebooks, Spark SQL), the Monitoring Hub links to the Spark application UI where you can examine executor logs, DAG visualization, and SQL query plans. This is essential for debugging distributed computation failures. The most common Spark failures I see are out-of-memory errors (scale up executors), data type mismatches (fix source schema), and timeout issues (optimize query or increase timeout).

Common Failure Patterns and Resolutions:

| Failure Pattern | Typical Cause | Resolution |
| --- | --- | --- |
| Connection failures | Source systems unreachable, expired credentials | Verify connectivity, refresh credentials, check firewall rules |
| Timeout failures | Long-running queries, capacity throttling | Optimize query, increase timeout, schedule during off-peak |
| Data errors | Schema changes in source, null violations | Update schema mapping, add null handling in transformations |
| Capacity errors | Insufficient CUs, concurrent job limits | Scale capacity, stagger job schedules, optimize heavy workloads |
| Permission errors | Service principal access revoked | Re-grant workspace access, verify tenant settings |

Performance Monitoring and Trending

Beyond failure detection, the Monitoring Hub helps identify performance degradation before it becomes a user-reported incident:

Duration Trending: Sort by duration to find your slowest activities. If a pipeline that normally completes in 5 minutes is now taking 30 minutes, investigate before it becomes a failure. I track the 95th percentile duration for critical pipelines and alert when it exceeds 2x the baseline.
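
The P95 alert logic is simple enough to sketch in a few lines of Python. The nearest-rank percentile and the 2x factor mirror the thresholds above; the function names are my own:

```python
from math import ceil

def p95(durations_min: list[float]) -> float:
    """Nearest-rank 95th percentile of run durations (minutes)."""
    ordered = sorted(durations_min)
    return ordered[ceil(0.95 * len(ordered)) - 1]

def duration_alert(recent_durations: list[float], baseline_p95: float,
                   factor: float = 2.0) -> bool:
    """True when the recent P95 duration exceeds factor x the baseline P95."""
    return p95(recent_durations) > factor * baseline_p95
```

Feed it, say, the last seven days of durations for a critical pipeline against a baseline established over a stable month.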

Queue Time Analysis: When activities show significant queue time (time between submission and execution start), your capacity is overloaded. Either scale up the capacity SKU, reschedule activities to off-peak times, or optimize existing workloads to consume fewer CUs. Queue time over 5 minutes during business hours is my threshold for action.
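
The queue-time check is a one-line comparison once you have the submission and start timestamps; the 5-minute threshold is the one stated above:

```python
from datetime import datetime, timedelta

QUEUE_THRESHOLD = timedelta(minutes=5)  # action threshold during business hours

def queued_too_long(submitted_at: datetime, started_at: datetime) -> bool:
    """True when queue time (submission to execution start) exceeds the
    threshold, a sign the capacity is overloaded."""
    return (started_at - submitted_at) > QUEUE_THRESHOLD
```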

Concurrent Activity Analysis: Multiple long-running activities competing for the same capacity cause mutual slowdown. Identify overlapping heavy workloads and stagger their schedules. I map out a 24-hour capacity timeline showing when each major workload runs, ensuring heavy Spark jobs do not overlap with peak report viewing hours.

Refresh Duration Tracking: For semantic model refreshes specifically, I track refresh duration trends weekly. A gradually increasing refresh time (5 minutes becoming 8, then 12, then 18 over several weeks) indicates growing data volumes or degrading query performance that needs proactive attention.
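
A gradual creep like 5, 8, 12, 18 minutes is easy to miss by eye but trivial to detect with a least-squares slope over the weekly measurements. A minimal sketch (the function name is my own):

```python
def refresh_trend_slope(weekly_durations_min: list[float]) -> float:
    """Least-squares slope (minutes per week) of refresh durations.
    A persistently positive slope signals growing data volumes or
    degrading query performance that needs proactive attention."""
    n = len(weekly_durations_min)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(weekly_durations_min) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, weekly_durations_min))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den
```

A slope of several minutes per week on a critical model is my cue to investigate partitioning, incremental refresh, or source query performance.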

Building Custom Monitoring Solutions

The Monitoring Hub's 30-day retention and basic filtering may be insufficient for enterprise monitoring. Here is how I extend it:

Export to Lakehouse: Use the Fabric REST APIs to programmatically export Monitoring Hub data to a Lakehouse for long-term retention and advanced analysis. Build Power BI reports on top of this data for executive-level operational dashboards. I run an export pipeline every 6 hours to capture monitoring data before it ages out.
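
The export pattern can be sketched in Python using only the standard library. Treat this as a sketch: the endpoint path follows the Fabric "List Item Job Instances" REST API, but the exact path, response fields (such as startTimeUtc/endTimeUtc), and token acquisition should be verified against the current Fabric REST API reference; to_rows shows flattening records before appending them to a Lakehouse table:

```python
import json
import urllib.request

FABRIC_API = "https://api.fabric.microsoft.com/v1"

def fetch_job_instances(token: str, workspace_id: str, item_id: str) -> list[dict]:
    """List job instances for one Fabric item via the REST API.
    Endpoint path per the 'List Item Job Instances' API; verify against docs."""
    url = f"{FABRIC_API}/workspaces/{workspace_id}/items/{item_id}/jobs/instances"
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp).get("value", [])

def to_rows(instances: list[dict], workspace_id: str) -> list[dict]:
    """Flatten API records into rows ready to append to a Lakehouse table.
    Field names assume the API's camelCase response shape."""
    return [
        {
            "workspace_id": workspace_id,
            "job_id": j.get("id"),
            "status": j.get("status"),
            "start_time": j.get("startTimeUtc"),
            "end_time": j.get("endTimeUtc"),
        }
        for j in instances
    ]
```

Schedule a notebook running this logic every few hours and append the rows to a Delta table, and the 30-day retention limit stops mattering.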

Azure Monitor Integration: Configure Fabric diagnostic settings to send activity logs to Azure Log Analytics. This enables KQL-based querying, alerting on failure patterns, and integration with broader Azure monitoring infrastructure (Grafana dashboards, PagerDuty alerts). For clients with existing Azure Monitor investments, this integration provides a unified operations view.

Data Activator Alerts: Connect Monitoring Hub data to Data Activator to create automated alerts based on conditions like "any pipeline failure in production workspaces," "Spark job duration exceeds 2x historical average," or "more than 3 failures in the last hour." Data Activator sends notifications to Teams, email, or Power Automate flows. I configure escalation paths: first alert goes to the data engineer, second alert (if unresolved after 30 minutes) goes to the team lead.

Custom Operational Dashboard: The dashboard I build for every client tracks these KPIs:

  • Pipeline success rate (target: 99%+ for production)
  • Mean time to recovery (MTTR) for failed activities
  • Average and P95 duration for critical pipelines
  • Capacity utilization by hour of day
  • Top 10 most expensive activities by CU consumption
  • Failure trend by category (connection, data, timeout, capacity)

Operational Best Practices

  • Check the Monitoring Hub daily as part of your morning operational review. I schedule 15 minutes at 8 AM.
  • Set up alerts for all production pipeline failures. Do not rely on manual checking.
  • Track mean time to recovery (MTTR) for failed activities as an operational KPI. Target under 30 minutes.
  • Maintain a runbook documenting common failure patterns and their resolutions. Update it after every new failure type.
  • Export monitoring data monthly for capacity planning and cost optimization analysis.
  • Review activity durations weekly to catch gradual performance degradation before users notice.
  • Document expected schedules for all production pipelines so you can quickly identify unexpected runs or missing runs.
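
The last practice above, comparing documented schedules against observed runs, reduces to a set comparison. A minimal sketch (the pipeline names are hypothetical):

```python
def schedule_diff(expected_pipelines: set[str],
                  observed_runs: list[tuple[str, str]]) -> tuple[set[str], set[str]]:
    """Compare documented schedules against observed Monitoring Hub runs.

    expected_pipelines: pipeline names due in the review window.
    observed_runs: (pipeline_name, status) tuples pulled from the hub.
    Returns (missing, unexpected) pipeline-name sets.
    """
    observed = {name for name, _ in observed_runs}
    return expected_pipelines - observed, observed - expected_pipelines
```

Run this against each morning's activity list and both missing scheduled runs and surprise ad-hoc runs surface immediately.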

Frequently Asked Questions

How long does Monitoring Hub retain historical data?

Monitoring Hub retains job history for 30 days by default. For longer retention, export data to your own storage or use Azure Log Analytics integration for extended historical analysis.

Can I set up alerts from Monitoring Hub?

Currently, direct alerting from Monitoring Hub is limited. Use Data Activator or Azure Monitor for comprehensive alerting on Fabric job status and performance metrics.
