Building a Modern Data Lakehouse with Microsoft Fabric

Build a modern data lakehouse with Microsoft Fabric and OneLake. Learn proven architecture patterns for scalable enterprise analytics solutions.

By Errin O'Connor, Chief AI Architect

The data lakehouse architecture combines the ACID transactions and SQL query support of a data warehouse with the scalability, flexibility, and low cost of a data lake, eliminating the forced choice between the two that has plagued enterprise analytics for over a decade. Traditional data warehouses provide schema enforcement, ACID transactions, and fast SQL queries, but at high cost, with limited scalability, and with rigidity when schemas change. Data lakes offer unlimited storage, schema-on-read flexibility, and support for diverse data types, but lack transaction support, have no built-in data quality enforcement, and require separate tools for BI access.

The lakehouse combines both: Delta Lake provides ACID transactions and schema enforcement on top of scalable object storage, while SQL analytics endpoints enable direct BI tool connectivity without data movement.

Microsoft Fabric makes building this architecture accessible through its integrated Lakehouse item. I have helped organizations migrate from legacy warehouse architectures costing $200K+/year in Azure Synapse licensing to Fabric Lakehouses at a fraction of the cost while gaining capabilities they never had: schema evolution, time travel, and Direct Lake reporting. Our Microsoft Fabric consulting team designs lakehouse architectures for enterprise clients across healthcare, finance, and government.

Why the Lakehouse Architecture Wins

| Capability | Data Warehouse | Data Lake | Lakehouse |
|---|---|---|---|
| ACID transactions | Yes | No | Yes (Delta Lake) |
| Schema enforcement | Strict (schema-on-write) | None (schema-on-read) | Flexible (both modes) |
| Storage cost | High ($$$) | Low ($) | Low ($) |
| SQL query support | Native | Requires external engine | Built-in SQL endpoint |
| Unstructured data | Not supported | Yes | Yes (Files section) |
| BI tool access | Direct connection | Requires ETL to warehouse | Direct Lake / SQL endpoint |
| Data science access | Limited | Native (Spark, Python) | Native (Spark, Python) |
| Time travel | Limited | No | Yes (Delta versioning) |
| Real-time data | Difficult | Streaming ingestion | Eventstreams + mirroring |

In my experience implementing lakehouse architectures for 15+ enterprise clients since Fabric's general availability, the pattern delivers the best ROI for organizations building new analytics platforms or modernizing aging warehouse infrastructure. The lakehouse architecture does not replace data warehouses for every use case. Organizations with deeply established T-SQL workloads, complex stored procedures, and certified data marts may keep their warehouse and add a lakehouse alongside it. But for new analytics platforms, the lakehouse provides the most complete architecture.

Fabric Lakehouse Components

When you create a Lakehouse in Fabric, you automatically get:

OneLake Storage

All data stored in OneLake using the Delta Lake format (Parquet files + transaction log). OneLake provides hierarchical namespace, RBAC security, and access from any Fabric workload. Storage costs are the same as Azure Data Lake Storage Gen2.

SQL Analytics Endpoint

An automatically generated, read-only SQL endpoint that exposes your Lakehouse tables as T-SQL queryable objects. This endpoint supports:

  • Standard T-SQL SELECT queries
  • Views and stored procedures (read-only)
  • Power BI DirectQuery and Direct Lake connections
  • SSMS, Azure Data Studio, and ODBC/JDBC tool connectivity

Spark Runtime

Full Apache Spark runtime with PySpark, Spark SQL, Scala, and R support. Use Spark for:

  • Large-scale data transformations (ETL/ELT)
  • Data quality validation and cleansing
  • Feature engineering for machine learning
  • Ad-hoc data exploration

Tables and Files

Lakehouses contain two top-level sections:

  • Tables: Delta-formatted tables with schema, managed by the SQL analytics endpoint. This is where structured data lives.
  • Files: Unstructured or semi-structured files (CSV, JSON, Parquet, images, PDFs). These can be read by Spark but are not automatically queryable via SQL.

Building Your Lakehouse: Step by Step

Step 1: Design the Layer Architecture

Before creating any tables, design your medallion layer structure:

Bronze Lakehouse — Raw ingestion landing zone. Receives data as-is from sources. Tables partitioned by ingestion date. Minimal transformations (add ingestion metadata only). Purpose: safety net for reprocessing.

Silver Lakehouse — Cleansed and conformed data. Validated against quality rules, deduplicated, standardized. Slowly changing dimensions managed with SCD Type 2. Purpose: trusted, consistent data for all consumers.

Gold Lakehouse — Business-ready consumption layer. Star schema fact and dimension tables. Aggregation tables for performance. V-Order optimized for Direct Lake. Purpose: serve Power BI and data science.

For small-to-medium projects, all three layers can coexist in a single Lakehouse using folder conventions (bronze/, silver/, gold/ in the Files section and bronze_, silver_, gold_ table name prefixes). For enterprise deployments, separate Lakehouses per layer enable independent access control and capacity management. I strongly recommend separate Lakehouses for any organization with more than 50 data consumers or strict compliance requirements—the access control granularity alone justifies the additional setup effort.

Step 2: Build the Bronze Ingestion Layer

Batch ingestion with Data Pipelines: Create a Fabric Data Pipeline with Copy Activity to pull data from source systems (SQL databases, REST APIs, file storage) and write to Bronze Lakehouse tables. Schedule the pipeline for daily or hourly execution.

Real-time ingestion with Eventstreams: For streaming sources (IoT devices, clickstream, application events), create an Eventstream that routes events to a Bronze Lakehouse table in near-real time.

Zero-copy with Shortcuts: For data already in Azure Data Lake Storage, S3, or another Fabric Lakehouse, create OneLake shortcuts instead of copying data. This provides instant access without storage duplication.

Database mirroring: For supported operational databases (Azure SQL, Cosmos DB, Snowflake), configure Fabric Mirroring for near-real-time CDC replication directly to your Bronze layer.

Step 3: Build the Silver Transformation Layer

Use Spark notebooks to transform Bronze data into Silver:

  • Schema enforcement: Define the expected schema and validate incoming data. Route schema violations to a quarantine table.
  • Deduplication: Use window functions to identify and remove duplicate records based on business keys and timestamps.
  • Data quality gates: Check null rates, range validations, referential integrity. Log quality metrics to a monitoring table.
  • Slowly Changing Dimensions: Implement SCD Type 2 using Delta Lake merge operations with effective dating.
  • Standardization: Normalize codes, formats, time zones, and units across all source systems.

Step 4: Build the Gold Consumption Layer

Transform Silver data into business-ready datasets:

  • Star schema: Build fact tables (transactions, events) and dimension tables (customer, product, date, geography) with integer surrogate keys.
  • Aggregation tables: Pre-compute daily, weekly, and monthly summaries for high-performance dashboard queries.
  • V-Order optimization: Ensure all Gold tables are written with V-Order enabled for optimal Direct Lake performance.
  • Semantic model alignment: Design Gold table names, column names, and relationships to match your planned Power BI semantic model. This simplifies model creation—Gold tables map directly to model tables.

Step 5: Connect Power BI with Direct Lake

Create a semantic model from your Gold Lakehouse tables using Direct Lake storage mode:

  1. Navigate to the Lakehouse, click "New semantic model"
  2. Select Gold layer tables
  3. Define relationships, hierarchies, and measures
  4. Build reports on top of the model

Direct Lake reads Delta tables directly from OneLake with near-import performance—no data copying, no scheduled refresh (metadata framing takes seconds). When Gold tables update, reports automatically reflect the latest data.

Orchestrating the Pipeline

Use Fabric Data Pipelines to orchestrate the entire flow:

  1. Bronze activities run first: Copy Activity from sources, Eventstream for streaming
  2. Silver notebooks run after Bronze completes: Data quality, deduplication, SCD processing
  3. Gold notebooks run after Silver completes: Star schema building, aggregation, V-Order optimization
  4. Semantic model framing triggers automatically when Gold tables update

Add error handling at each stage: if one source's Bronze ingestion fails, Silver processing for other sources can still proceed. Log pipeline status and duration to a monitoring table for operational visibility.

Best Practices

  • Use Delta format for all tables: Parquet files without Delta transaction logs lack ACID transactions, time travel, and schema evolution support
  • Partition large tables by date: Tables over 100 million rows benefit from date-based partitioning for efficient query pruning
  • Run OPTIMIZE regularly: Compact small files into larger ones (100MB-1GB target) for optimal read performance
  • Monitor with Capacity Metrics: Track CU consumption per Lakehouse to right-size your Fabric capacity
  • Implement Git integration: Version control your notebook code and pipeline definitions for change tracking and rollback
  • Document table lineage: Maintain a data dictionary mapping each Gold table to its Silver sources and Bronze origins
  • Plan for disaster recovery: OneLake provides built-in redundancy, but maintain runbook documentation for restoring pipeline state after failures. Test recovery procedures quarterly

Sizing and Capacity Planning

Proper capacity sizing prevents the most common lakehouse deployment failures:

| Workload Profile | Recommended Starting SKU | Typical Monthly Cost | Supports |
|---|---|---|---|
| Small team (5-10 users, <100GB) | F2 | ~$260/month | Basic medallion architecture |
| Mid-size (20-50 users, 100GB-1TB) | F8 | ~$1,000/month | Full medallion + Direct Lake reporting |
| Enterprise (100+ users, 1-10TB) | F32-F64 | ~$4,000-8,000/month | Multiple Lakehouses, concurrent Spark + reporting |
| Large enterprise (500+ users, 10TB+) | F128+ | ~$16,000+/month | Full platform with real-time, ML, and heavy ETL |

Monitor capacity utilization weekly for the first month after deployment. I typically see organizations right-size within 2-3 SKU adjustments during the first quarter. The Fabric Capacity Metrics app provides the data needed for informed scaling decisions.

Frequently Asked Questions

What is the difference between a data lake and a lakehouse?

A data lake stores raw data in various formats without enforcing structure. A lakehouse adds data warehouse capabilities like ACID transactions, schema enforcement, and SQL query support while maintaining the cost benefits of lake storage.

Why use medallion architecture in Microsoft Fabric?

Medallion architecture (Bronze, Silver, Gold layers) provides clear data lineage, separates concerns, enables incremental processing, and ensures data quality improves at each stage before reaching business users.

