
Data Factory Pipelines in Fabric
Orchestrate complex data workflows with Microsoft Fabric Data Factory pipelines. ETL patterns, scheduling, monitoring, and error handling best practices.
Data Factory pipelines in Microsoft Fabric provide enterprise-grade data orchestration for building, scheduling, and monitoring complex data workflows that move and transform data across lakehouses, warehouses, and external sources. They replace the need for standalone Azure Data Factory in many scenarios by integrating pipeline orchestration directly into the Fabric platform with native OneLake connectivity, capacity-based billing, and unified monitoring. If you are building a data platform on Fabric, Data Factory pipelines are the glue that coordinates everything, and getting pipeline architecture right from the start saves months of rework later.
I have designed pipeline architectures for Fabric environments processing everything from 10GB daily incremental loads to 5TB full refreshes across 200+ source tables. The common mistake I see is treating pipelines as simple copy jobs when they should be engineered as resilient, observable, and maintainable data workflows. Our Microsoft Fabric consulting team builds production-grade pipeline architectures that handle failures gracefully and scale with your data volumes.
Pipeline Architecture Fundamentals
A Data Factory pipeline is a container of activities executed in a defined sequence with control flow logic. Understanding the three execution patterns is essential for designing efficient workflows:
Sequential Execution: Activities run one after another. Activity B starts only after Activity A succeeds. This is the default pattern for dependent operations like: extract data, then transform data, then load to warehouse. Use this when downstream activities depend on upstream output.
Parallel Execution: Independent activities run simultaneously. Loading data from five different source systems can happen in parallel, reducing total pipeline duration from the sum of individual load times to the duration of the slowest single load. I always parallelize source system extractions because they have no interdependencies.
Conditional Branching: Activities execute based on the success, failure, or completion status of previous activities. If the data extraction succeeds, proceed to transformation. If it fails, send an alert notification and skip downstream activities. This is what separates production pipelines from prototypes.
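The duration difference between sequential and parallel execution can be sketched in plain Python. This is an illustration, not pipeline code: the `load_source` function and its sleep durations are stand-ins for Copy activities.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-source load functions; in a real pipeline these would be
# Copy activities. Sleeps stand in for load durations.
def load_source(name, seconds):
    time.sleep(seconds)
    return f"{name}: loaded"

sources = {"crm": 0.2, "erp": 0.3, "web": 0.1}

# Sequential execution: total duration is the sum of the individual loads.
start = time.perf_counter()
sequential = [load_source(n, s) for n, s in sources.items()]
seq_elapsed = time.perf_counter() - start

# Parallel execution: total duration approaches the slowest single load.
start = time.perf_counter()
with ThreadPoolExecutor() as pool:
    parallel = list(pool.map(lambda kv: load_source(*kv), sources.items()))
par_elapsed = time.perf_counter() - start

print(seq_elapsed > par_elapsed)  # parallel finishes sooner
```

The same arithmetic applies at pipeline scale: five independent 20-minute loads take 100 minutes sequentially but roughly 20 minutes in parallel.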
Core Activities and When to Use Each
Copy Activity
The Copy activity moves data between 100+ supported sources and sinks. It handles schema mapping, data type conversion, and can process terabytes of data efficiently.
Common Patterns I Implement:
- SQL Server to Lakehouse (initial and incremental loads)
- REST API to Lakehouse (API ingestion with pagination handling)
- File storage to OneLake (CSV, Parquet, JSON file ingestion)
- Lakehouse to Warehouse (promotion from raw to curated layers)
- Cross-tenant data sharing (OneLake shortcuts combined with Copy for compliance boundaries)
Performance Tuning: Configure parallel copy degree (number of concurrent data movement threads) based on source system capacity. Set appropriate batch sizes for database sources (10,000-50,000 rows per batch is my typical starting point). Enable staging through OneLake for cross-region copies. Use column mapping to select only required columns, reducing data transfer volume by 30-60% in my experience.
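These tuning knobs live in the Copy activity's JSON definition, visible in the pipeline's code view. The sketch below uses the ADF-style schema that Fabric pipelines expose; activity, source, and sink type names vary by connector, and the values shown are illustrative starting points, not recommendations for every workload.

```json
{
  "name": "CopySqlToLakehouse",
  "type": "Copy",
  "typeProperties": {
    "source": { "type": "SqlServerSource" },
    "sink": { "type": "AzureSqlSink", "writeBatchSize": 20000 },
    "parallelCopies": 8,
    "enableStaging": true,
    "translator": {
      "type": "TabularTranslator",
      "mappings": [
        { "source": { "name": "OrderId" },   "sink": { "name": "order_id" } },
        { "source": { "name": "OrderDate" }, "sink": { "name": "order_date" } }
      ]
    }
  }
}
```

The explicit `translator` mapping is what implements the column-selection advice above: only mapped columns move across the wire.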
Notebook Activity
The Notebook activity executes a Fabric Spark notebook as a pipeline step. This is the primary mechanism for running PySpark transformations within an orchestrated workflow.
Parameterization: Pass pipeline parameters to notebooks as cell parameters. The notebook receives values for date ranges, file paths, processing modes, and configuration settings. This makes notebooks reusable across different pipeline contexts. I create one generic transformation notebook per source system and parameterize everything.
Output Capture: Notebooks can return values to the pipeline using the mssparkutils.notebook.exit(value) function. Subsequent pipeline activities can reference these output values for conditional logic or parameter passing. I use this to pass row counts from transformation notebooks to data validation activities.
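The parameterization and output-capture patterns together look roughly like the notebook sketch below. The parameter names and the transformation stub are hypothetical; in a Fabric notebook, the parameters cell is overridden by the Notebook activity's base parameters, and the final call would be `mssparkutils.notebook.exit(payload)`, whose value downstream activities typically read via an expression like `@activity('RunNotebook').output.result.exitValue`.

```python
import json

# Parameters cell: overridden at runtime by the pipeline's base parameters.
load_date = "2024-01-01"
processing_mode = "incremental"

def run_transformation(load_date, processing_mode):
    # Stand-in for the real PySpark transformation; returns rows processed.
    return 42

row_count = run_transformation(load_date, processing_mode)

# Return a structured result to the pipeline. In Fabric this payload would be
# passed to mssparkutils.notebook.exit(payload) instead of printed.
payload = json.dumps({"row_count": row_count, "status": "succeeded"})
print(payload)
```

Returning JSON rather than a bare number keeps the exit value extensible: you can add fields (status, warnings, watermarks) without changing downstream expressions that parse it.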
Dataflow Gen2 Activity
Execute Power Query-based transformations within a pipeline. Dataflow Gen2 provides a visual, low-code transformation experience that complements code-heavy notebook transformations.
Best Use Cases: Data cleansing that business analysts can maintain (column mapping, type conversion, filtering), simple aggregations, and transformations that do not require Spark's distributed computing power. I use Dataflow Gen2 for dimension table preparation and reserve notebooks for large fact table transformations.
Stored Procedure Activity
Call stored procedures in Fabric Warehouse or external SQL databases. This enables leveraging existing SQL-based transformation logic without rewriting in PySpark. For organizations migrating from SQL Server-based ETL, this preserves years of tested business logic.
Web Activity
Call external REST APIs, trigger Azure Functions, or interact with third-party services. I use Web activities for: notifying Slack/Teams channels on pipeline completion, triggering downstream processes in external systems, fetching metadata from configuration APIs, and calling Azure Functions for complex validation logic.
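For the notification use case, the Web activity simply POSTs a JSON body to a webhook URL. A sketch of building that body, with hypothetical field content (the simple `text` message format; channels requiring Adaptive Cards need a richer payload):

```python
import json

def build_teams_message(pipeline_name, status, error=None):
    """Build the JSON body a Web activity would POST to an incoming webhook."""
    text = f"Pipeline {pipeline_name} finished with status {status}."
    if error:
        text += f" Error: {error}"
    return json.dumps({"text": text})

body = build_teams_message("DailySalesLoad", "Failed", error="Timeout in Copy")
print(body)
```

In practice the pipeline builds this body with dynamic-content expressions rather than Python, but the shape of the payload is the same.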
Advanced Pipeline Patterns
ForEach Loop with Metadata-Driven Architecture
The ForEach loop iterates over a collection of items and executes activities for each item. Combined with a Lookup activity, this enables metadata-driven pipelines that are my preferred architecture pattern for enterprise Fabric deployments.
How It Works: A configuration table in your warehouse lists all source tables with their schema, load type (full or incremental), schedule, and target destination. A Lookup activity reads this table. A ForEach loop iterates through each row, executing a Copy activity parameterized with the source table details. Adding a new source table requires only a row in the configuration table, not pipeline modifications.
Configuration: Define the items collection (an array expression), set the batch count for parallel execution within the loop (default 20, max 50), and place activities inside the loop body. For 50+ source tables, I set batch count to 10-15 to avoid overwhelming source systems.
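The core of the metadata-driven pattern can be sketched in Python: the Lookup activity returns rows shaped like the config rows below, and the ForEach body turns each row into Copy activity parameters. Table, column, and target names here are hypothetical.

```python
# Rows as a Lookup activity might return them from a configuration table.
config_rows = [
    {"source_table": "dbo.Orders",    "load_type": "incremental", "target": "lh_raw.orders"},
    {"source_table": "dbo.Customers", "load_type": "full",        "target": "lh_raw.customers"},
]

def build_copy_parameters(row):
    """Translate one config row into the parameters a Copy activity needs."""
    if row["load_type"] == "incremental":
        query = f"SELECT * FROM {row['source_table']} WHERE modified_at > @watermark"
    else:
        query = f"SELECT * FROM {row['source_table']}"
    return {"query": query, "target": row["target"]}

plans = [build_copy_parameters(r) for r in config_rows]
for p in plans:
    print(p["target"], "<-", p["query"])
```

Onboarding a new source is then a single INSERT into the configuration table; the loop body never changes.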
Until Loop for Polling Patterns
Repeat activities until a condition is met. Useful for polling scenarios: check if a source file has arrived, wait 5 minutes if not, check again, repeat until the file appears or timeout is reached. I implement a maximum iteration count (typically 24 iterations at 5-minute intervals = 2 hours maximum wait) to prevent infinite loops.
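The polling logic with a bounded iteration count looks like this sketch (function and parameter names are illustrative; in a pipeline this is an Until loop wrapping a Get Metadata check and a Wait activity):

```python
import time

def wait_for_file(file_exists, max_iterations=24, interval_seconds=300):
    """Poll until file_exists() is True or max_iterations is reached.
    Mirrors the Until pattern: 24 checks x 5 minutes = 2 hours max wait."""
    for attempt in range(1, max_iterations + 1):
        if file_exists():
            return attempt
        time.sleep(interval_seconds)
    raise TimeoutError(f"File did not arrive within {max_iterations} checks")

# Example: a stub source that "finds" the file on the third check.
checks = iter([False, False, True])
attempts = wait_for_file(lambda: next(checks), interval_seconds=0)
print(attempts)  # 3
```

The explicit `max_iterations` bound is the important part: without it, a file that never arrives turns into an infinitely looping pipeline consuming capacity.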
Parent-Child Pipeline Architecture
For complex workflows, I use a parent pipeline that orchestrates child pipelines. The parent handles scheduling, error handling, and notification. Child pipelines handle individual workload domains (finance data, HR data, sales data). This separation enables independent testing and deployment of each domain while maintaining centralized orchestration.
Error Handling and Monitoring
Robust error handling separates production-grade pipelines from fragile prototypes. Here is the error handling framework I implement for every client:
Retry Policies: Configure retry count and interval on individual activities. Transient failures (network timeouts, temporary API unavailability) resolve automatically with retries. Set 3 retries with 30-second intervals as a starting default. For API-based sources with rate limiting, increase the interval to 60-120 seconds.
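The retry semantics are configured declaratively on the activity, but behave like this sketch: the activity gets one initial attempt plus the configured number of retries, with a fixed wait between attempts.

```python
import time

def with_retries(action, retries=3, interval_seconds=30):
    """Run action(), retrying transient failures like an activity retry policy."""
    for attempt in range(retries + 1):  # initial try + `retries` retries
        try:
            return action()
        except Exception:
            if attempt == retries:
                raise  # retries exhausted: surface the failure to the pipeline
            time.sleep(interval_seconds)

# Example: a stand-in extract that fails twice (transiently) then succeeds.
calls = {"n": 0}
def flaky_extract():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient timeout")
    return "extracted"

result = with_retries(flaky_extract, retries=3, interval_seconds=0)
print(result, calls["n"])  # extracted 3
```

Note the asymmetry this creates: a "3 retries" policy means up to four total attempts, which matters when sizing timeouts for the whole pipeline.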
Failure Paths: Connect activities with "On Failure" dependencies to execute error-handling activities when upstream activities fail. My standard failure handler includes: sending Teams/email notification with error details, logging the failure to a monitoring table with timestamp and error message, and cleaning up partial outputs to prevent downstream data quality issues.
Timeout Configuration: Set activity-level timeouts to prevent runaway processes. A notebook that normally completes in 10 minutes should have a 30-minute timeout. I set pipeline-level timeouts at 2x the expected total duration. A timeout firing is always better than a hung pipeline consuming capacity indefinitely.
Monitoring Dashboard: All pipeline runs appear in the Fabric Monitoring Hub with status, duration, and error details. I build a custom monitoring Power BI dashboard on top of monitoring data that tracks pipeline success rates, average durations, error patterns over time, and SLA compliance. This dashboard is the first thing the operations team checks every morning.
Scheduling and Triggers
Scheduled Triggers: Run pipelines on a recurring schedule (hourly, daily, weekly). Define the start time, recurrence pattern, and end time. I stagger pipeline schedules to avoid capacity spikes: finance data loads at 2 AM, sales data at 3 AM, HR data at 4 AM.
Event-Based Triggers: Trigger pipelines when events occur, such as a new file arriving in OneLake, a table being updated, or an external webhook being received. Event triggers enable near-real-time data processing without polling. I use these for customer-facing data products where freshness is critical.
Manual Triggers: Execute pipelines on demand through the Fabric portal, REST API, or from other pipelines (pipeline chaining). I always expose a manual trigger for production pipelines so operations teams can re-run after fixing source system issues.
Cost Optimization Tips
- Schedule heavy pipelines during off-peak hours when capacity has headroom
- Use incremental loads instead of full refreshes wherever possible (reduces CU consumption by 70-90%)
- Configure Copy activity to select only needed columns rather than SELECT *
- Monitor CU consumption per pipeline and optimize the most expensive ones first
- Pause development pipelines that run on schedule but serve no active users
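The incremental-load tip deserves a concrete shape: the Copy activity's source query filters on a watermark column so only changed rows move. A minimal sketch, with hypothetical table and column names; in production the watermark would be read from and written back to a control table, not hard-coded.

```python
from datetime import datetime

def build_incremental_query(table, watermark_column, last_watermark):
    """Build the source query for an incremental Copy: only rows changed
    since the last successful load are transferred."""
    return (f"SELECT * FROM {table} "
            f"WHERE {watermark_column} > '{last_watermark.isoformat()}'")

# In practice, read this from a watermark/control table after each run.
last = datetime(2024, 1, 1, 2, 0, 0)
query = build_incremental_query("dbo.Sales", "modified_at", last)
print(query)
```

This is where the 70-90% CU reduction comes from: a daily incremental pull touches one day of rows instead of re-copying the full table.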
Frequently Asked Questions
Is Data Factory in Fabric the same as Azure Data Factory?
They share similar concepts and a similar interface, but they are distinct services. Fabric Data Factory is integrated with the Fabric platform, uses capacity-based billing, and has native OneLake integration. ADF remains a separate Azure service with its own billing model.
Can I migrate ADF pipelines to Fabric?
Some migration is possible by exporting and importing pipeline definitions. However, connections and Fabric-specific features may require reconfiguration. Microsoft provides migration guidance and tools.