Power BI Dataflows Gen2: Self-Service ETL at Enterprise Scale
Build self-service ETL pipelines with Dataflows Gen2 in Microsoft Fabric. Gen1 vs Gen2 comparison, incremental refresh, and enterprise patterns.
<h2>Dataflows Gen2: The Evolution of Self-Service ETL</h2> <p>Dataflows Gen2 in Microsoft Fabric enable business analysts to build self-service ETL pipelines that output directly to lakehouse Delta tables without writing code. Built on the same Power Query engine that millions of users already know from Excel and Power BI Desktop, Gen2 adds Fabric lakehouse output, enhanced Spark-based compute, automatic OneLake staging, and tighter integration with the broader data platform.</p> <p>In my 25+ years of enterprise consulting, the biggest challenge in data platforms has always been the gap between what IT can deliver and what business teams need. IT builds centralized pipelines that take weeks to modify; business teams need new data integrations in days. Dataflows Gen2 bridges this gap by giving business analysts a governed, scalable, no-code ETL tool that outputs to the same enterprise lakehouse that data engineers use — eliminating the shadow IT spreadsheet pipelines that proliferate when central IT cannot keep pace with demand.</p> <p><a href="/services/data-analytics">Data analytics consulting</a> helps enterprises design dataflow architectures that balance self-service flexibility with governance and performance requirements.</p>
<h2>Gen1 vs Gen2: Key Differences</h2> <p>Critical differences that affect enterprise architecture:</p> <ul> <li><strong>Output destination</strong> — Gen1 writes to CDM folders in Azure Data Lake Storage Gen2; Gen2 writes directly to Fabric lakehouse tables (Delta Lake format)</li> <li><strong>Compute engine</strong> — Gen2 uses Fabric Spark compute for enhanced mashup operations</li> <li><strong>Staging</strong> — Gen2 automatically stages data in OneLake for better performance</li> <li><strong>Integration</strong> — Gen2 is a first-class Fabric item, integrable with Data Pipelines, notebooks, and other Fabric items</li> <li><strong>Refresh</strong> — Gen2 supports orchestration via Data Pipelines with dependencies and error handling</li> <li><strong>Licensing</strong> — Gen1 requires Power BI Premium or PPU; Gen2 requires Fabric capacity</li> </ul>
<p>The single most important difference for enterprise architects is the output destination. Gen1 dataflows write to CDM (Common Data Model) folders in Azure Data Lake Storage — a format that requires additional processing to be useful. Gen2 writes directly to Delta Lake tables in a Fabric lakehouse, making the output immediately queryable by the SQL analytics endpoint, consumable by Direct Lake semantic models, and accessible to Spark notebooks. This eliminates an entire layer of data movement and transformation that Gen1 required.</p>
<h2>Creating Dataflows in Fabric</h2> <p>Navigate to your Fabric workspace, select New > Dataflow Gen2. The Power Query Online editor opens with familiar transformation capabilities: connect to 200+ data sources, apply transforms, and configure output to lakehouse tables. The authoring experience is virtually identical to Power BI Desktop's Power Query editor — any analyst who knows Power Query in Excel or Power BI Desktop can author Gen2 dataflows with zero additional training. This is the key advantage over Spark notebooks, which require Python or Scala programming skills that most business analysts do not have.</p>
<h2>Power Query Online Capabilities</h2> <p>The full Power Query M language is available with additional capabilities in Gen2:</p> <ul> <li><strong>Query folding</strong> — Pushes transformations to source databases for optimal performance. <a href="/blog/query-folding-power-query-troubleshooting-guide-2026">Query folding troubleshooting</a> is critical for dataflow performance.</li> <li><strong>Diagram view</strong> — Visual representation of query dependencies and data lineage</li> <li><strong>AI Insights</strong> — Text analytics, vision, and Azure ML model scoring within Power Query</li> <li><strong>Schema detection</strong> — Automatic schema inference with explicit type mapping</li> </ul>
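<p>To make the folding behavior concrete, here is an illustrative M query (server, database, and column names are placeholders): the filter and column selection fold to the source as a WHERE clause and SELECT list, while the index column does not, so non-foldable steps belong at the end.</p>

```m
// Hypothetical query illustrating step ordering for query folding.
let
    Source = Sql.Database("sales-srv.database.windows.net", "SalesDb"),
    Customers = Source{[Schema = "dbo", Item = "Customers"]}[Data],
    // Folds to the source as WHERE IsActive = 1
    Active = Table.SelectRows(Customers, each [IsActive] = true),
    // Folds to the source as a narrow SELECT list
    Slim = Table.SelectColumns(Active, {"CustomerID", "Name", "Region"}),
    // Does NOT fold — everything after this step runs in the mashup engine
    Indexed = Table.AddIndexColumn(Slim, "RowId", 1, 1)
in
    Indexed
```

<p>In Power Query Online, the step folding indicators in the Applied Steps pane show exactly where folding breaks, which is the first thing to check when a dataflow refresh is slower than expected.</p>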
<h2>Incremental Refresh in Dataflows</h2> <p>Configure incremental refresh to process only new or changed data:</p> <ol> <li>Add RangeStart and RangeEnd parameters (DateTime type)</li> <li>Filter your source query using these parameters</li> <li>Enable incremental refresh in dataflow settings</li> <li>Configure the refresh and archive windows</li> </ol> <p>This is especially important for large datasets where full refresh would exceed timeout or capacity limits. See <a href="/blog/power-bi-incremental-refresh-data-partitioning-guide-2026">incremental refresh patterns</a> for detailed implementation guidance.</p>
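<p>The filter in step 2 looks like the following sketch, where <code>RangeStart</code> and <code>RangeEnd</code> are the required DateTime parameters and the table and column names are illustrative. Using <code>&gt;=</code> on the start and <code>&lt;</code> on the end prevents boundary rows from being loaded into two partitions.</p>

```m
// Hypothetical incremental refresh filter. RangeStart/RangeEnd are the
// DateTime parameters the engine substitutes per refresh window.
let
    Source = Sql.Database("sales-srv.database.windows.net", "SalesDb"),
    Orders = Source{[Schema = "dbo", Item = "Orders"]}[Data],
    // Folds to a WHERE clause, so only the current window is read
    Filtered = Table.SelectRows(
        Orders,
        each [OrderDate] >= RangeStart and [OrderDate] < RangeEnd
    )
in
    Filtered
```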
<h2>Computed and Linked Entities</h2> <p>In Gen1, computed entities reference other entities within the same dataflow (avoiding re-querying the source). Linked entities reference entities from other dataflows. In Gen2, the staging lakehouse achieves similar benefits — all intermediate data is persisted in Delta Lake format, enabling downstream queries to read from the lakehouse rather than re-querying sources. This is a significant architectural improvement because it means every dataflow refresh creates a durable, queryable dataset in OneLake that can be consumed by any Fabric workload, not just the downstream dataflow or semantic model it was originally built for.</p>
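<p>A downstream query can read a previously staged lakehouse table through the Fabric lakehouse connector instead of re-querying the source. This sketch assumes the <code>Lakehouse.Contents</code> connector; the workspace, lakehouse, and table names are placeholders, and the navigation record fields may differ slightly from what the connector generates in your tenant.</p>

```m
// Sketch: consume a staged Delta table from OneLake rather than
// hitting the source system again. All names are illustrative.
let
    Source = Lakehouse.Contents(null),
    Workspace = Source{[workspaceName = "Analytics"]}[Data],
    Staging = Workspace{[lakehouseName = "StagingLakehouse"]}[Data],
    StagedOrders = Staging{[Id = "Orders", ItemKind = "Table"]}[Data]
in
    StagedOrders
```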
<h2>Enterprise Dataflow Patterns</h2> <h3>Staging Pattern</h3> <p>Create a staging dataflow that ingests raw data from sources with minimal transformation — essentially a direct copy from source to lakehouse. Downstream dataflows or notebooks then transform the staged data. This pattern separates ingestion from transformation, provides a reusable raw data layer, and isolates downstream transformations from source system changes. When a source API changes, you only fix the staging dataflow — all downstream logic remains untouched.</p>
<h3>Medallion Architecture</h3> <p>Use dataflows for Bronze (raw ingestion) and Silver (cleansed, conformed) layers. Gold-layer aggregations can be handled by dataflows for simpler transformations or <a href="/blog/microsoft-fabric-notebooks-pyspark-data-engineering-2026">Fabric notebooks</a> for complex logic requiring Python or Spark SQL. The medallion architecture provides progressive data refinement where each layer adds quality, consistency, and business meaning to the raw data. Business analysts can confidently build reports on Silver or Gold layer tables knowing the data has been validated and conformed.</p>
<h3>Shared Certified Dataflows</h3> <p>Publish certified dataflows that multiple teams can reference. This creates a governed self-service data layer where IT manages the core data pipelines and business users consume certified entities in their own dataflows or semantic models. Certification is a signal to consumers that the data has been validated, is refreshed on a reliable schedule, and has a designated owner responsible for quality. Without certification, you end up with 15 different versions of the "customer" table across the organization, each with different definitions, filters, and quality levels.</p>
<h3>When Dataflows Are Not the Right Choice</h3> <p>Dataflows Gen2 are not the right tool for every scenario. Use Fabric notebooks instead when you need complex transformations requiring Python libraries (pandas, scikit-learn), when data volumes exceed 100 million rows per table (notebooks parallelize across Spark executors), when you need to write custom data quality validation logic, or when the transformation requires joining data from both Fabric lakehouses and external REST APIs in a single pipeline. The best enterprise architectures use both tools: dataflows for the 80% of data integrations that are straightforward source-to-lakehouse pipelines, and notebooks for the 20% that require programming flexibility.</p>
<h2>Monitoring and Scheduling</h2> <p>Schedule dataflows individually or orchestrate them via Fabric Data Pipelines for dependency management and conditional execution. Monitoring options include refresh history in the workspace, Fabric Monitoring Hub for capacity-level visibility, and custom monitoring via the <a href="/blog/power-bi-rest-api-automating-enterprise-operations-2026">REST API</a>. Configure alerts for refresh failures using <a href="/blog/microsoft-fabric-data-activator-reflex-alerting-2026">Data Activator</a> or Power Automate — in production environments, every dataflow should have failure notification configured. A silent dataflow failure that goes undetected for days produces stale dashboards that erode user trust in the entire analytics platform.</p> <p>For enterprise deployments with 20+ dataflows, build a monitoring dashboard that tracks refresh duration trends (increasing duration signals growing data volumes or degrading source performance), success/failure rates, and capacity consumption per dataflow. This operational visibility enables proactive optimization before performance issues impact end users.</p>
<h2>Migration from Gen1 to Gen2</h2> <p>Migration involves recreating dataflows in the Gen2 experience. While direct migration tools are limited, the query logic (M code) can be copied. The main effort is redirecting output from CDM folders to Fabric lakehouse tables and updating downstream dependencies. Plan for 1-2 weeks of testing per critical dataflow.</p>
<h2>Error Handling and Resilience Patterns</h2> <p>Production dataflows need robust error handling. Across enterprise deployments, the number one cause of dataflow failures is source system change — a renamed column or an API endpoint updated without notification. Build resilience with schema validation steps, explicit null handling, row count validation against expected ranges, and Data Pipelines orchestration with retry policies and failure notifications. Configure failure alerts through <a href="/blog/power-bi-data-alerts-subscriptions-notification-2026">data alerts</a> or Power Automate so the right people know immediately when a dataflow fails.</p>
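<p>The schema and row count validations can be expressed directly in M so the refresh fails fast with a descriptive error instead of silently loading bad data. A sketch with illustrative table, column names, and thresholds:</p>

```m
// Defensive validation sketch: fail the refresh loudly on schema
// drift or an implausible row count. All names are placeholders.
let
    Source = Sql.Database("sales-srv.database.windows.net", "SalesDb"),
    Orders = Source{[Schema = "dbo", Item = "Orders"]}[Data],
    // Schema check: error out if expected columns went missing
    Expected = {"OrderID", "OrderDate", "Amount"},
    Missing = List.Difference(Expected, Table.ColumnNames(Orders)),
    SchemaChecked =
        if List.IsEmpty(Missing) then Orders
        else error Error.Record(
            "SchemaDrift",
            "Missing columns: " & Text.Combine(Missing, ", ")
        ),
    // Row count check: guard against empty or runaway loads
    RowCount = Table.RowCount(SchemaChecked),
    Validated =
        if RowCount > 0 and RowCount < 50000000 then SchemaChecked
        else error Error.Record(
            "RowCountOutOfRange",
            "Got " & Text.From(RowCount) & " rows"
        )
in
    Validated
```

<p>A failed refresh with a named error reason is far easier to triage and alert on than a successful refresh that loaded the wrong data.</p>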
<h2>Performance Optimization Strategies</h2> <p>Maximize query folding using <a href="/blog/query-folding-power-query-troubleshooting-guide-2026">diagnostics</a> to verify which steps fold to the source. Apply filters as early as possible — loading 100M rows then filtering to 1M is dramatically slower than filtering at the source. Remove unnecessary columns immediately after the source step. Gen2 automatic staging in OneLake means intermediate results persist, improving reliability and debuggability for complex multi-step transformations. One enterprise client with 45 active dataflows reduced their average refresh failure rate from 12% to under 2% by implementing these optimization patterns alongside proper error handling and monitoring.</p>
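<p>Step ordering is the cheapest of these optimizations. In this sketch (server, table, and column names are illustrative), the filter and column trim sit before the join, so both fold to the source and only the current year's narrow rows cross the wire; the same filter placed after the join would pull the full fact table first.</p>

```m
// Ordering sketch: filter and trim before joining so the heavy
// steps fold to the source. All names are placeholders.
let
    Source = Sql.Database("sales-srv.database.windows.net", "SalesDb"),
    Sales = Source{[Schema = "dbo", Item = "FactSales"]}[Data],
    // Both steps fold: WHERE + narrow SELECT at the source
    Recent = Table.SelectRows(Sales, each [SaleDate] >= #datetime(2025, 1, 1, 0, 0, 0)),
    Slim = Table.SelectColumns(Recent, {"SaleID", "SaleDate", "CustomerKey", "Amount"}),
    Customers = Source{[Schema = "dbo", Item = "DimCustomer"]}[Data],
    Joined = Table.NestedJoin(
        Slim, {"CustomerKey"},
        Customers, {"CustomerKey"},
        "Customer", JoinKind.LeftOuter
    )
in
    Joined
```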
<p>Whether you are migrating from Gen1 dataflows, replacing custom Python ETL scripts, or building a self-service data platform from scratch, Dataflows Gen2 provides the right balance of accessibility for business analysts and governance for IT administrators. Ready to modernize your ETL with Dataflows Gen2? <a href="/contact">Contact EPC Group</a> for a free consultation on Fabric data engineering.</p>
Frequently Asked Questions
Should I use Dataflows Gen2 or Fabric notebooks for ETL?
Use Dataflows Gen2 when: the transformations are achievable in Power Query (joins, filters, pivots, type conversions), the users are business analysts familiar with Power Query, and the data volumes are moderate. Use notebooks when: you need complex logic (ML, advanced statistics), data volumes are very large (billions of rows), or you need programming language flexibility (Python, Scala, SQL).
Can I migrate my Gen1 dataflows to Gen2?
There is no automated migration tool. You need to recreate dataflows in the Gen2 experience. The M code can be copied, but output configuration changes from CDM folders to Fabric lakehouse tables. Plan for testing all downstream dependencies (semantic models, reports) after migration.
How does query folding work in Dataflows Gen2?
Query folding in Gen2 works the same as in Power BI Desktop — the Power Query engine translates M operations into native source queries (SQL, OData, etc.) when possible. Gen2 adds automatic staging in OneLake, which means even non-foldable operations benefit from intermediate persistence rather than re-querying the source.
What is the maximum data volume for Dataflows Gen2?
Dataflows Gen2 can handle substantial data volumes, limited by your Fabric capacity size and refresh timeout settings. For very large datasets (hundreds of millions of rows), use incremental refresh to process only changed data. If you hit capacity limits, consider Fabric notebooks with Spark for parallel processing.
How do Dataflows Gen2 affect Fabric capacity consumption?
Dataflows Gen2 consume Fabric Capacity Units (CUs) during refresh operations. The compute cost depends on data volume, transformation complexity, and whether query folding is achieved. Staging operations also consume storage in OneLake. Monitor consumption via Fabric Capacity Metrics to optimize costs.