Power BI Dataflows & Power Query Online: Enterprise ETL Guide for 2026
Data Engineering · 14 min read

Master Power BI Dataflows and Power Query Online for scalable enterprise ETL—covering Gen1 vs Gen2, incremental refresh, computed entities, and ADLS Gen2.

By EPC Group

Power BI Dataflows solve one of the most persistent problems in enterprise business intelligence: the duplication of data transformation logic across dozens or hundreds of Power BI datasets. Without dataflows, every dataset author independently connects to sources, writes their own Power Query M code, and publishes their own version of "clean" data—leading to conflicting business rules, redundant gateway load, and no single source of truth for transformations. Dataflows centralize that transformation layer into a shared, governed, reusable asset that multiple datasets consume. Our data analytics consulting team has deployed dataflow architectures for Fortune 500 organizations managing thousands of datasets across regulated industries.

What Are Power BI Dataflows?

A dataflow is a collection of Power Query transformations that execute in the Power BI Service (cloud) rather than in Power BI Desktop. The output is stored as tables—either in Power BI's internal storage (CDM folders in Azure Data Lake Storage) or in Azure Data Lake Storage Gen2 under your control. Datasets then connect to dataflow tables instead of connecting directly to source systems.

The architecture creates a clear separation of concerns:

| Layer | Responsibility | Owner |
|---|---|---|
| Source systems | Operational databases, APIs, files | IT / Database admins |
| Dataflows | Extract, transform, cleanse, standardize | Data engineering team |
| Datasets / Semantic models | Star schema modeling, DAX measures, RLS | BI developers |
| Reports | Visualizations, dashboards, paginated output | Report authors |

This separation is critical for enterprise governance. Source credentials live in the dataflow (managed by data engineers with appropriate access), not scattered across individual datasets. Business rules—currency conversion rates, fiscal calendar mappings, customer segmentation logic—are defined once in the dataflow and inherited by every downstream dataset. When a rule changes, you update it in one place.

Dataflows Gen1 vs Gen2 (Fabric)

Microsoft now offers two generations of dataflows, and the distinction matters for architecture decisions:

Dataflows Gen1

  • Available in Power BI Pro, Premium Per User (PPU), and Premium capacity workspaces
  • Output stored as CDM (Common Data Model) folders in internal Azure Data Lake Storage
  • Can optionally output to your own ADLS Gen2 storage account (organization-level setting)
  • Refreshed via the Power BI service refresh engine
  • Supports linked entities (referencing tables from other dataflows) and computed entities (transformations that run on top of previously materialized tables)
  • Mature, production-proven, widely deployed

Dataflows Gen2 (Fabric)

  • Available in Microsoft Fabric workspaces (requires Fabric capacity)
  • Output lands directly in Fabric Lakehouse or Warehouse as Delta tables
  • Uses the Fabric Data Factory pipeline engine for orchestration
  • Supports 300+ connectors (same connector library as Fabric Data Factory)
  • Integrates with Fabric data pipelines for conditional logic, error handling, and multi-step orchestration
  • Output is immediately queryable by SQL analytics endpoints, Spark notebooks, and Direct Lake semantic models
  • Represents Microsoft's forward investment direction

For organizations already on Fabric, Gen2 dataflows are the clear choice—they eliminate the intermediate CDM storage layer and write directly to Lakehouse Delta tables. For organizations on Power BI Premium without Fabric, Gen1 dataflows remain fully supported and production-grade. Our Microsoft Fabric consulting practice helps organizations plan the Gen1-to-Gen2 migration path.

| Factor | Dataflows Gen1 | Dataflows Gen2 (Fabric) |
|---|---|---|
| Required license | Pro, PPU, or Premium | Fabric capacity (F64+) |
| Output format | CDM folder (CSV/Parquet in ADLS) | Delta table in Lakehouse/Warehouse |
| Connector count | ~90 | 300+ |
| Orchestration | Built-in schedule only | Fabric Data Pipelines (full orchestration) |
| Downstream consumers | Power BI datasets (Import/DirectQuery) | Datasets, Spark, SQL endpoint, Direct Lake |
| Computed entities | Yes (Premium/PPU only) | Yes (via staging Lakehouse) |
| Incremental refresh | Yes (Premium/PPU) | Yes |
| Microsoft investment | Maintenance mode | Active development |

Power Query Online M Code Patterns

Dataflows author transformations using Power Query M—the same language used in Power BI Desktop, but executed in the cloud. Mastering a core set of M patterns covers 90% of enterprise ETL requirements.

Source Connection Patterns

```
// SQL Server with query folding
let
    Source = Sql.Database("server.database.windows.net", "SalesDB"),
    Sales = Source{[Schema = "dbo", Item = "FactSales"]}[Data],
    Filtered = Table.SelectRows(Sales, each [OrderDate] >= #date(2024, 1, 1))
in
    Filtered
```

The key principle is query folding: write transformations that the M engine can translate into native SQL and push down to the source database. Operations like `Table.SelectRows`, `Table.SelectColumns`, `Table.Group`, and `Table.Sort` fold when applied directly after a SQL source. Operations like `Table.AddColumn` with custom functions typically break folding.
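To illustrate the contrast, here is a minimal M sketch of a folding-breaking step next to a folding-friendly alternative. `PreviousStep` stands in for a filtered SQL table, and `fnClassify` is a hypothetical custom function; column names are assumptions for illustration.

```
// Contrast: a custom function breaks folding; a plain expression often folds.
let
    Source = PreviousStep,

    // Breaks folding: fnClassify is opaque to the SQL translator, so this
    // step (and every step after it) runs in the mashup engine instead:
    // Broken = Table.AddColumn(Source, "Tier", each fnClassify([SalesAmount])),

    // Often folds: a simple comparison can translate to a SQL CASE expression
    Folds = Table.AddColumn(Source, "IsLargeOrder",
        each [SalesAmount] >= 10000, type logical)
in
    Folds
```

When in doubt, check the step with "View Native Query" in Power Query Online: if the option is greyed out, folding has broken at or before that step.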

Data Cleansing Patterns

```
// Standardize and cleanse customer data
let
    Source = PreviousStep,
    TrimmedNames = Table.TransformColumns(Source, {
        {"CustomerName", Text.Trim},
        {"Email", each Text.Lower(Text.Trim(_))}
    }),
    RemovedNulls = Table.SelectRows(TrimmedNames,
        each [CustomerName] <> null and [CustomerName] <> ""),
    DeduplicatedByEmail = Table.Distinct(RemovedNulls, {"Email"})
in
    DeduplicatedByEmail
```

Fiscal Calendar Generation

```
// Generate fiscal calendar (July fiscal year start)
let
    StartDate = #date(2020, 1, 1),
    EndDate = #date(2026, 12, 31),
    DateList = List.Dates(StartDate, Duration.Days(EndDate - StartDate) + 1, #duration(1, 0, 0, 0)),
    DateTable = Table.FromList(DateList, Splitter.SplitByNothing(), {"Date"}, null, ExtraValues.Error),
    Typed = Table.TransformColumnTypes(DateTable, {{"Date", type date}}),
    FiscalYear = Table.AddColumn(Typed, "FiscalYear",
        each if Date.Month([Date]) >= 7
            then Date.Year([Date]) + 1
            else Date.Year([Date]),
        Int64.Type),
    FiscalQuarter = Table.AddColumn(FiscalYear, "FiscalQuarter",
        each if Date.Month([Date]) >= 7
            then Number.RoundUp((Date.Month([Date]) - 6) / 3)
            else Number.RoundUp((Date.Month([Date]) + 6) / 3),
        Int64.Type)
in
    FiscalQuarter
```

These patterns—and dozens more—are the building blocks of production dataflow development. Our Power BI architecture consultants develop M code libraries tailored to each client's source systems and business rules.

Incremental Refresh for Dataflows

Full refresh of large dataflows is expensive—both in compute time and gateway load. Incremental refresh solves this by refreshing only new or changed data while preserving historical partitions.

Configuration requirements:

  1. The dataflow table must have a date/time column that reliably indicates when rows were created or modified
  2. Define the store period (total historical window, e.g., 3 years) and the refresh period (how far back each refresh looks, e.g., 10 days)
  3. Power Query parameters `RangeStart` and `RangeEnd` must be used to filter the source query (this enables partition pruning)
  4. Query folding must work through the date filter (the filter must push down to the source database as a WHERE clause)

Implementation steps:

  • Add `RangeStart` and `RangeEnd` as Power Query date/time parameters
  • Filter your source table: `Table.SelectRows(Source, each [ModifiedDate] >= RangeStart and [ModifiedDate] < RangeEnd)`
  • In the dataflow settings, enable incremental refresh and set the store/refresh windows
  • The service automatically creates daily or monthly partitions and refreshes only the partitions within the refresh window
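Putting the steps above together, a minimal incremental-refresh query looks like the sketch below. Server, database, and column names are assumptions; `RangeStart` and `RangeEnd` must be declared as date/time parameters in the dataflow.

```
// RangeStart and RangeEnd are datetime parameters that the service binds
// to each partition's boundaries at refresh time
let
    Source = Sql.Database("server.database.windows.net", "SalesDB"),
    FactSales = Source{[Schema = "dbo", Item = "FactSales"]}[Data],
    // Use >= / < (a half-open interval) so a row on a partition boundary
    // lands in exactly one partition, never two
    Filtered = Table.SelectRows(FactSales,
        each [ModifiedDate] >= RangeStart and [ModifiedDate] < RangeEnd)
in
    Filtered
```

Because `Table.SelectRows` folds against a SQL source, this filter arrives at the database as a WHERE clause, which is what makes partition pruning effective.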

For a 50-million-row fact table, incremental refresh typically reduces refresh time from 45 minutes to under 5 minutes—and reduces source database load proportionally. This is not optional for enterprise dataflows; it is a requirement.

Computed Entities and Linked Entities

These two features enable the most powerful dataflow architecture pattern: layered dataflows that mirror the medallion architecture (Bronze, Silver, Gold).

Linked Entities

A linked entity references a table from another dataflow without re-executing the source query. Dataflow B can link to a table in Dataflow A, consuming its already-materialized output as a starting point. This creates a dependency chain: Dataflow A must refresh before Dataflow B.

Computed Entities

A computed entity references a table within the same dataflow that has already been loaded to storage. Instead of re-querying the source, the computed entity reads from the materialized output and applies additional transformations. Computed entities require Premium, PPU, or Fabric capacity—they do not work in Pro workspaces.
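In M, a computed entity is simply a query that references another query in the same dataflow by name. The sketch below assumes a materialized table named `Sales Raw` with hypothetical `CustomerKey` and `SalesAmount` columns.

```
// Computed entity: reads the already-materialized output of "Sales Raw"
// (referenced by query name) instead of re-querying the source system
let
    Source = #"Sales Raw",
    Cleansed = Table.SelectRows(Source, each [SalesAmount] <> null),
    Aggregated = Table.Group(Cleansed, {"CustomerKey"},
        {{"TotalSales", each List.Sum([SalesAmount]), type nullable number}})
in
    Aggregated
```

In a Premium or Fabric workspace the enhanced compute engine executes this against the stored output, so the expensive extraction in `Sales Raw` runs only once per refresh.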

Layered Architecture Pattern

| Layer | Dataflow Role | Example |
|---|---|---|
| Bronze (Raw) | Extract from source systems, minimal transformation, store raw data | Dataflow: Sales_Raw (extracts from ERP with type casting only) |
| Silver (Cleansed) | Linked entity from Bronze, apply business rules, deduplication, standardization | Dataflow: Sales_Cleansed (linked to Sales_Raw, applies cleansing logic) |
| Gold (Modeled) | Computed/linked from Silver, aggregate, join dimensions, prepare for BI consumption | Dataflow: Sales_Modeled (star schema output for datasets) |

This architecture provides clear lineage, independent refresh control at each layer, and the ability for different teams to own different layers. The data analytics team at EPC Group implements this pattern across healthcare systems, financial platforms, and government reporting environments.

CDM (Common Data Model) Integration

Dataflows Gen1 store output as CDM folders—a Microsoft standard that defines a metadata layer (`model.json`) describing the schema of Parquet or CSV files in Azure Data Lake Storage. CDM enables interoperability: the same data can be consumed by Power BI, Azure Data Factory, Azure Synapse, Dynamics 365, and any tool that reads CDM metadata.

CDM folder structure:

```
/dataflow-name/
    model.json              (schema definition, column types, relationships)
    /TableName/
        part-00001.csv      (or .parquet)
        part-00002.csv
```

For organizations adopting Fabric, CDM folders give way to Delta tables with Lakehouse metadata—but the principle remains: dataflow output is stored in an open format accessible by multiple engines.

Azure Data Lake Storage Gen2 for Dataflows

By default, Gen1 dataflow output is stored in Microsoft-managed internal storage. For enterprise deployments, configuring your own ADLS Gen2 account provides critical advantages:

  • Data sovereignty: Output files reside in your Azure subscription, your region, your storage account
  • Cross-service access: Azure Data Factory, Azure Synapse, Databricks, and custom applications read from the same ADLS Gen2 account
  • Backup and compliance: Apply Azure Blob lifecycle policies, immutability policies, and geo-redundant replication
  • Cost control: Storage costs are billed to your Azure subscription (typically cheaper than Power BI managed storage for large volumes)

Configuration: An Azure AD Global Admin or Power BI Admin sets the ADLS Gen2 connection at the tenant or workspace level in Power BI Admin settings. Once configured, all new dataflows in that scope write to the specified ADLS Gen2 account. Existing dataflows can be migrated by re-creating them in the new scope.

Error Handling and Monitoring

Production dataflows fail. Sources go offline, schemas change, credentials expire, data volumes spike. Robust error handling is non-negotiable for enterprise deployments.

Error Handling Patterns

  • `try...otherwise` in M code: Wrap unreliable transformations in try blocks to return default values instead of failing the entire refresh
  • Data quality columns: Add validation columns (e.g., `IsValid`, `ValidationErrors`) that flag problematic rows without stopping the pipeline
  • Schema drift protection: Use `Table.SelectColumns` with `MissingField.UseNull` to handle columns that may not exist in every refresh
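The three patterns above can be combined in a single query. This is a sketch with hypothetical column names (`Amount`, `CustomerName`, `Region`); `PreviousStep` stands in for the upstream table.

```
let
    Source = PreviousStep,
    // try ... otherwise: a bad value becomes null instead of failing the refresh
    SafeAmount = Table.TransformColumns(Source,
        {{"Amount", each try Number.From(_) otherwise null, type nullable number}}),
    // Data quality flag: mark problem rows rather than dropping them
    Validated = Table.AddColumn(SafeAmount, "IsValid",
        each [Amount] <> null and [CustomerName] <> null, type logical),
    // Schema drift: tolerate a column that may be absent in some refreshes
    Stable = Table.SelectColumns(Validated,
        {"CustomerName", "Amount", "Region", "IsValid"},
        MissingField.UseNull)
in
    Stable
```

Downstream Silver/Gold dataflows can then filter on `IsValid` or route flagged rows to a quarantine table, keeping data quality decisions explicit and auditable.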

Monitoring

  • Power BI REST API: Query dataflow refresh history programmatically (`GET /groups/{groupId}/dataflows/{dataflowId}/transactions`)
  • Email alerts: Configure refresh failure alerts in dataflow settings (sends email to workspace admins)
  • Power Automate integration: Trigger Power Automate flows on refresh failure for advanced alerting (Teams, Slack, PagerDuty, ServiceNow)
  • Fabric monitoring hub: For Gen2 dataflows, Fabric provides a unified monitoring hub with detailed run history, duration tracking, and error details

Implement monitoring from day one—not after the first production outage.

Dataflow vs Dataset Refresh Architecture

Understanding the relationship between dataflow refresh and dataset refresh is critical for scheduling and performance:

  1. Dataflow refreshes first: Extracts data from sources, applies transformations, writes output to storage (CDM folder or Lakehouse)
  2. Dataset refreshes second: Reads from dataflow output tables, builds the in-memory model (VertiPaq), and makes data available to reports
  3. Schedule dependency: Dataset refresh must start after dataflow refresh completes—schedule datasets 30-60 minutes after their upstream dataflows
  4. For Gen2 + Direct Lake: Dataflow writes Delta tables to Lakehouse; the Direct Lake semantic model reads directly from Delta files without a separate import refresh—eliminating the dataset refresh step entirely

| Architecture | Refresh Chain | Total Latency |
|---|---|---|
| Gen1 Dataflow + Import Dataset | Dataflow refresh → Dataset refresh → Reports updated | 30-90 min |
| Gen2 Dataflow + Import Dataset | Dataflow refresh → Dataset refresh → Reports updated | 20-60 min |
| Gen2 Dataflow + Direct Lake | Dataflow refresh → Reports updated (no dataset refresh) | 5-20 min |

Direct Lake with Gen2 dataflows is the lowest-latency architecture available in the Microsoft BI stack. Our Power BI architecture specialists design refresh chains optimized for each client's SLA requirements.

Reusable ETL Patterns Across Workspaces

The highest-value dataflow architecture pattern is cross-workspace reuse. A centralized "data engineering" workspace contains core dataflows (customer master, product catalog, fiscal calendar, exchange rates). Multiple departmental workspaces link to these core dataflows, adding their own domain-specific transformations while inheriting the centralized business rules.

Implementation pattern:

  • Central workspace (Premium/Fabric): Core entity dataflows maintained by the data engineering team
  • Department workspaces: Linked entities from the central workspace + department-specific computed entities
  • Governance: Central workspace has restricted contributor access; department workspaces have broader contributor access with deployment pipelines for promotion

This pattern eliminates the "50 different definitions of Active Customer" problem that plagues large Power BI deployments. Every department's reports derive from the same cleansed, standardized, governed entity definitions.

Enterprise Best Practices

  1. Separate extraction from transformation: First dataflow extracts raw data (Bronze); second dataflow applies business rules (Silver/Gold). This isolates source connectivity issues from transformation logic issues.
  2. Use parameters for environment management: Define server names, database names, and file paths as Power Query parameters. Use deployment pipelines to swap parameter values between Dev, Test, and Production.
  3. Document every dataflow: Use the dataflow description field and table descriptions in Power Query to explain business context—not just technical steps.
  4. Enforce query folding: Verify folding for every step using the "View Native Query" option in Power Query Online. If a step breaks folding on a large table, restructure the query.
  5. Set refresh schedules intentionally: Align dataflow refresh schedules with source system availability windows and downstream dataset schedules. Do not default to "every hour" without justification.
  6. Version control M code externally: Export dataflow JSON definitions to Git. Power BI does not natively version-control dataflows; manual export/import or API-based CI/CD is required.
  7. Monitor refresh duration trends: A dataflow that took 5 minutes last month and takes 25 minutes this month indicates data volume growth or query folding regression. Catch it before it hits the 2-hour timeout.
  8. Use dataflow-level credentials: Assign service accounts or service principals for source connections. Do not use individual user credentials that expire when the user changes passwords or leaves the organization.
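Practice 2 (parameters for environment management) might look like the M sketch below; the parameter names and values are assumptions, with the values swapped per stage by deployment pipeline rules rather than edited by hand.

```
// ServerName and DatabaseName are Power Query text parameters, e.g.:
//   ServerName   = "dev-sql.database.windows.net"
//   DatabaseName = "SalesDB_Dev"
// Deployment pipeline rules override these values in Test and Production.
let
    Source = Sql.Database(ServerName, DatabaseName),
    Orders = Source{[Schema = "dbo", Item = "FactOrders"]}[Data]
in
    Orders
```

Because no connection string is hard-coded, the same dataflow definition promotes cleanly from Dev to Test to Production without touching the M code.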

Getting Started

Power BI Dataflows are the most underutilized enterprise feature in the Power BI ecosystem. Organizations that implement centralized dataflow architectures consistently report 40-60% reduction in dataset development time, 70%+ reduction in data inconsistencies across reports, and significantly lower gateway load from consolidating source queries.

If your organization manages more than 20 Power BI datasets or has multiple teams building reports against the same source systems, dataflows should be a foundational component of your BI architecture.

Ready to implement a production-grade dataflow architecture? Contact EPC Group. Our data analytics practice and Power BI architecture specialists design and deploy dataflow solutions for healthcare, financial services, and government organizations with strict compliance requirements.

Frequently Asked Questions

What is the difference between Power BI Dataflows Gen1 and Gen2?

Gen1 dataflows store output as CDM folders in Azure Data Lake Storage (internal or your own ADLS Gen2) and are available in Pro, PPU, and Premium workspaces. Gen2 dataflows are part of Microsoft Fabric, write output directly as Delta tables in a Fabric Lakehouse or Warehouse, support 300+ connectors (vs ~90 in Gen1), and integrate with Fabric Data Pipelines for orchestration. Gen2 represents Microsoft's forward investment direction, but Gen1 remains fully supported and production-grade for organizations not yet on Fabric.

Do I need Power BI Premium to use dataflows?

Basic dataflows work in Power BI Pro workspaces. However, computed entities, linked entities, incremental refresh, and enhanced compute (query folding to the dataflow engine) require Premium, Premium Per User (PPU), or Fabric capacity. For enterprise deployments, Premium or Fabric capacity is effectively required to access the features that make dataflows valuable at scale.

How do Power BI Dataflows handle incremental refresh?

Dataflows support incremental refresh by defining RangeStart and RangeEnd Power Query parameters that filter source data by a date/time column. You configure a store period (total historical window) and a refresh period (how far back each refresh looks). The service creates time-based partitions and only refreshes partitions within the refresh window. This typically reduces refresh time by 80-90% for large tables and proportionally reduces source system load. Query folding must work through the date filter for incremental refresh to function correctly.

Can multiple Power BI datasets consume the same dataflow?

Yes, and this is the primary value proposition of dataflows. A single dataflow table (e.g., a cleansed Customer dimension) can be consumed by unlimited downstream datasets across multiple workspaces. All datasets inherit the same transformation logic, business rules, and data quality standards. This eliminates duplicate source queries, reduces gateway load, and ensures every report in the organization uses consistent entity definitions.

What is query folding and why does it matter for dataflows?

Query folding is the ability of the Power Query M engine to translate transformation steps into native source queries (e.g., SQL WHERE clauses, GROUP BY, JOINs) that execute on the source database rather than in the Power BI service. For dataflows processing millions of rows, query folding is the difference between a 5-minute refresh and a 2-hour timeout. Operations like Table.SelectRows, Table.SelectColumns, and Table.Group typically fold when applied directly after a relational source. Custom M functions and certain transformations break folding. Always verify folding using View Native Query in Power Query Online.

How do I connect my own Azure Data Lake Storage Gen2 to Power BI Dataflows?

A Power BI Admin or Azure AD Global Admin configures the ADLS Gen2 connection at the tenant or workspace level in the Power BI Admin portal under Dataflow Storage settings. You provide the storage account URL with a hierarchical namespace-enabled ADLS Gen2 account. Once configured, new dataflows in that scope write CDM folders to your storage account. This gives you data sovereignty, cross-service access (Azure Data Factory, Synapse, Databricks can read the same files), backup control, and Azure lifecycle policy support.

What happens if a Power BI Dataflow refresh fails?

When a dataflow refresh fails, the previously materialized data remains available—downstream datasets continue to work with the last successful refresh output. The failure is logged in the dataflow refresh history accessible via the Power BI service UI or REST API. Email alerts can be configured to notify workspace admins. For advanced alerting, Power Automate flows can trigger on refresh failure to send notifications to Teams, Slack, PagerDuty, or ServiceNow. In Fabric, the Monitoring Hub provides detailed error information. Best practice is to implement monitoring and alerting from day one of production deployment.

Tags: Power BI Dataflows, Power Query Online, ETL, Data Engineering, Dataflows Gen2, Microsoft Fabric, Incremental Refresh, Common Data Model, Azure Data Lake Storage, M Code, Computed Entities, Data Transformation, Enterprise BI, Power BI Consulting, Reusable ETL
