
Building a Modern Data Lakehouse with Microsoft Fabric
Step-by-step guide to implementing a data lakehouse architecture using Microsoft Fabric and OneLake.
The data lakehouse architecture eliminates the forced choice between data warehouse structure and data lake flexibility. Traditional data warehouses provide schema enforcement, ACID transactions, and fast SQL queries—but they come at high cost, scale poorly, and resist schema change. Data lakes offer low-cost storage, schema-on-read flexibility, and support for diverse data types—but they lack transaction support, have no built-in data quality enforcement, and require separate tools for BI access. The lakehouse combines both: Delta Lake provides ACID transactions and schema enforcement on top of scalable object storage, while SQL analytics endpoints enable direct BI tool connectivity without data movement. Microsoft Fabric makes building this architecture accessible through its integrated Lakehouse item.
Why the Lakehouse Architecture Wins
| Capability | Data Warehouse | Data Lake | Lakehouse |
|---|---|---|---|
| ACID transactions | Yes | No | Yes (Delta Lake) |
| Schema enforcement | Strict (schema-on-write) | None (schema-on-read) | Flexible (both modes) |
| Storage cost | High ($$$) | Low ($) | Low ($) |
| SQL query support | Native | Requires external engine | Built-in SQL endpoint |
| Unstructured data | Not supported | Yes | Yes (Files section) |
| BI tool access | Direct connection | Requires ETL to warehouse | Direct Lake / SQL endpoint |
| Data science access | Limited | Native (Spark, Python) | Native (Spark, Python) |
| Time travel | Limited | No | Yes (Delta versioning) |
| Real-time data | Difficult | Streaming ingestion | Eventstreams + mirroring |
The lakehouse architecture does not replace data warehouses for every use case. Organizations with deeply established T-SQL workloads, complex stored procedures, and certified data marts may keep their warehouse and add a lakehouse alongside it. But for new analytics platforms, the lakehouse provides the most complete architecture.
Fabric Lakehouse Components
When you create a Lakehouse in Fabric, you automatically get:
OneLake Storage
All data stored in OneLake using the Delta Lake format (Parquet files + transaction log). OneLake provides hierarchical namespace, RBAC security, and access from any Fabric workload. Storage costs are the same as Azure Data Lake Storage Gen2.
SQL Analytics Endpoint
An automatically generated, read-only SQL endpoint that exposes your Lakehouse tables as T-SQL queryable objects. This endpoint supports:
- Standard T-SQL SELECT queries
- Views and stored procedures (read-only)
- Power BI DirectQuery and Direct Lake connections
- SSMS, Azure Data Studio, and ODBC/JDBC tool connectivity
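Because the endpoint speaks the SQL Server protocol, any ODBC-capable client can connect to it. A minimal sketch of composing such a connection string in Python, assuming ODBC Driver 18 and Azure AD interactive authentication; the server and database names are placeholders, so copy the real SQL connection string from the Lakehouse settings in Fabric:

```python
# Sketch: composing an ODBC connection string for a Lakehouse SQL analytics
# endpoint. Server and database names below are placeholders; copy the real
# values from the Lakehouse's "SQL connection string" settings in Fabric.

def sql_endpoint_connection_string(server: str, database: str) -> str:
    """Build an ODBC connection string using Azure AD interactive auth."""
    return (
        "Driver={ODBC Driver 18 for SQL Server};"
        f"Server={server};"
        f"Database={database};"
        "Authentication=ActiveDirectoryInteractive;"
        "Encrypt=yes;"
    )

conn_str = sql_endpoint_connection_string(
    "myworkspace.datawarehouse.fabric.microsoft.com",  # placeholder endpoint
    "gold_lakehouse",                                  # placeholder name
)
```

Pass the resulting string to pyodbc (or embed the same keywords in an SSMS or JDBC connection dialog) to run read-only T-SQL against the Lakehouse tables.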
Spark Runtime
Full Apache Spark runtime with PySpark, Spark SQL, Scala, and R support. Use Spark for:
- Large-scale data transformations (ETL/ELT)
- Data quality validation and cleansing
- Feature engineering for machine learning
- Ad-hoc data exploration
Tables and Files
Lakehouses contain two top-level sections:
- Tables: Delta-formatted tables with schema, managed by the SQL analytics endpoint. This is where structured data lives.
- Files: Unstructured or semi-structured files (CSV, JSON, Parquet, images, PDFs). These can be read by Spark but are not automatically queryable via SQL.
Building Your Lakehouse: Step by Step
Step 1: Design the Layer Architecture
Before creating any tables, design your medallion layer structure:
Bronze Lakehouse — Raw ingestion landing zone. Receives data as-is from sources. Tables partitioned by ingestion date. Minimal transformations (add ingestion metadata only). Purpose: safety net for reprocessing.
Silver Lakehouse — Cleansed and conformed data. Validated against quality rules, deduplicated, standardized. Slowly changing dimensions managed with SCD Type 2. Purpose: trusted, consistent data for all consumers.
Gold Lakehouse — Business-ready consumption layer. Star schema fact and dimension tables. Aggregation tables for performance. V-Order optimized for Direct Lake. Purpose: serve Power BI and data science.
For small-to-medium projects, all three layers can coexist in a single Lakehouse using naming conventions: `bronze/`, `silver/`, `gold/` folders in the Files section and `bronze_`, `silver_`, `gold_` table name prefixes. For enterprise deployments, separate Lakehouses per layer enable independent access control and capacity management.
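The single-Lakehouse convention is easy to enforce with a couple of small helpers. A sketch, assuming the layer prefixes and folder names described above (the function names are illustrative, not a Fabric API):

```python
# Sketch: helpers enforcing the single-Lakehouse layer conventions described
# above (bronze/silver/gold folders under Files, layer prefixes on tables).

LAYERS = ("bronze", "silver", "gold")

def table_name(layer: str, name: str) -> str:
    """Prefix a table name with its medallion layer, e.g. 'silver_customers'."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer}")
    return f"{layer}_{name}"

def files_path(layer: str, relative: str) -> str:
    """Build a Files-section path, e.g. 'Files/bronze/sales.csv'."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer}")
    return f"Files/{layer}/{relative}"
```

Routing every table and file name through helpers like these keeps notebooks consistent and makes a later split into per-layer Lakehouses a mechanical change.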
Step 2: Build the Bronze Ingestion Layer
Batch ingestion with Data Pipelines: Create a Fabric Data Pipeline with Copy Activity to pull data from source systems (SQL databases, REST APIs, file storage) and write to Bronze Lakehouse tables. Schedule the pipeline for daily or hourly execution.
Real-time ingestion with Eventstreams: For streaming sources (IoT devices, clickstream, application events), create an Eventstream that routes events to a Bronze Lakehouse table in near-real time.
Zero-copy with Shortcuts: For data already in Azure Data Lake Storage, S3, or another Fabric Lakehouse, create OneLake shortcuts instead of copying data. This provides instant access without storage duplication.
Database mirroring: For supported operational databases (Azure SQL, Cosmos DB, Snowflake), configure Fabric Mirroring for near-real-time CDC replication directly to your Bronze layer.
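Whichever ingestion path a source uses, Step 1's rule still applies: Bronze adds ingestion metadata and nothing else. A sketch of that stamping step, using plain dicts where a Fabric notebook would use Spark `withColumn` calls; the metadata field names are illustrative, not a Fabric convention:

```python
from datetime import datetime, timezone
import uuid

# Sketch: stamping Bronze records with ingestion metadata only (no business
# transformations). Field names starting with "_" are an illustrative choice.

def stamp_bronze(record: dict, source_system: str, batch_id: str) -> dict:
    """Return a copy of the record with ingestion metadata columns added."""
    return {
        **record,
        "_ingested_at": datetime.now(timezone.utc).isoformat(),
        "_source_system": source_system,
        "_batch_id": batch_id,
    }

batch_id = str(uuid.uuid4())  # one id per pipeline run, for reprocessing
rows = [stamp_bronze(r, "crm_api", batch_id) for r in [{"customer_id": 1}]]
```

Keeping the batch id on every row is what makes Bronze a safety net: a bad run can be identified and reprocessed without touching other batches.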
Step 3: Build the Silver Transformation Layer
Use Spark notebooks to transform Bronze data into Silver:
- Schema enforcement: Define the expected schema and validate incoming data. Route schema violations to a quarantine table.
- Deduplication: Use window functions to identify and remove duplicate records based on business keys and timestamps.
- Data quality gates: Check null rates, range validations, referential integrity. Log quality metrics to a monitoring table.
- Slowly Changing Dimensions: Implement SCD Type 2 using Delta Lake merge operations with effective dating.
- Standardization: Normalize codes, formats, time zones, and units across all source systems.
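The deduplication step above is typically a Spark `row_number()` over a window partitioned by business key and ordered by timestamp descending. A pure-Python sketch of the same keep-the-latest logic, so the pattern is visible without a Spark session:

```python
# Sketch: keep only the most recent record per business key, mirroring a
# Spark window (partition by key, order by timestamp desc, row_number = 1).

def dedupe_latest(records, key_fields, ts_field):
    """Keep only the most recent record for each business key."""
    latest = {}
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        if key not in latest or rec[ts_field] > latest[key][ts_field]:
            latest[key] = rec
    return list(latest.values())

raw = [
    {"customer_id": 1, "email": "old@example.com", "updated": "2024-01-01"},
    {"customer_id": 1, "email": "new@example.com", "updated": "2024-03-01"},
    {"customer_id": 2, "email": "b@example.com",   "updated": "2024-02-01"},
]
clean = dedupe_latest(raw, ["customer_id"], "updated")  # 2 rows survive
```

In the Silver notebook the equivalent Spark version feeds the SCD Type 2 merge, so only one candidate row per business key reaches the dimension table.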
Step 4: Build the Gold Consumption Layer
Transform Silver data into business-ready datasets:
- Star schema: Build fact tables (transactions, events) and dimension tables (customer, product, date, geography) with integer surrogate keys.
- Aggregation tables: Pre-compute daily, weekly, and monthly summaries for high-performance dashboard queries.
- V-Order optimization: Ensure all Gold tables are written with V-Order enabled for optimal Direct Lake performance.
- Semantic model alignment: Design Gold table names, column names, and relationships to match your planned Power BI semantic model. This simplifies model creation—Gold tables map directly to model tables.
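The surrogate-key step above follows a standard pattern: assign an integer key per dimension row, then swap natural keys for surrogate keys in the facts. A dict-based sketch of that lookup pattern; in Spark this is usually done with an identity column or `monotonically_increasing_id()`:

```python
# Sketch: assigning integer surrogate keys to a Gold dimension and using the
# resulting lookup to key fact rows. Plain Python stands in for Spark here.

def assign_surrogate_keys(dim_rows, natural_key, start=1):
    """Return (rows with an 'sk' column, natural-key -> surrogate-key map)."""
    lookup = {}
    out = []
    for sk, row in enumerate(dim_rows, start=start):
        lookup[row[natural_key]] = sk
        out.append({**row, "sk": sk})
    return out, lookup

dim_customer, customer_sk = assign_surrogate_keys(
    [{"customer_id": "C-100"}, {"customer_id": "C-200"}], "customer_id"
)
# Fact rows then carry the surrogate key instead of the natural key:
fact = [{"customer_sk": customer_sk["C-100"], "amount": 42.0}]
```

Integer surrogate keys keep fact tables narrow and make joins in the semantic model cheap, which is exactly what Direct Lake queries benefit from.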
Step 5: Connect Power BI with Direct Lake
Create a semantic model from your Gold Lakehouse tables using Direct Lake storage mode:
- Navigate to the Lakehouse, click "New semantic model"
- Select Gold layer tables
- Define relationships, hierarchies, and measures
- Build reports on top of the model
Direct Lake reads Delta tables directly from OneLake with near-import performance—no data copying, no scheduled refresh (metadata framing takes seconds). When Gold tables update, reports automatically reflect the latest data.
Orchestrating the Pipeline
Use Fabric Data Pipelines to orchestrate the entire flow:
- Bronze activities run first: Copy Activity from sources, Eventstream for streaming
- Silver notebooks run after Bronze completes: Data quality, deduplication, SCD processing
- Gold notebooks run after Silver completes: Star schema building, aggregation, V-Order optimization
- Semantic model framing triggers automatically when Gold tables update
Add error handling at each stage: if one source's Bronze ingestion fails, Silver processing for other sources can still proceed. Log pipeline status and duration to a monitoring table for operational visibility.
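The control flow above, including the per-source error isolation, can be sketched as follows. The `run_*` callables are placeholders for pipeline activities or notebook runs, not a Fabric API:

```python
# Sketch: medallion orchestration with per-source error isolation. A failed
# Bronze ingestion is logged and skipped so other sources still flow through.

def run_medallion(sources, run_bronze, run_silver, run_gold, log):
    succeeded = []
    for src in sources:
        try:
            run_bronze(src)           # Copy Activity / Eventstream per source
            succeeded.append(src)
        except Exception as exc:
            log.append({"stage": "bronze", "source": src, "error": str(exc)})
    for src in succeeded:
        run_silver(src)               # quality gates, dedup, SCD processing
    if succeeded:
        run_gold()                    # star schema build; framing follows
    return succeeded

def flaky_bronze(src):
    if src == "erp":                  # pretend one source is down
        raise RuntimeError("timeout")

log = []
ok = run_medallion(["crm", "erp"], flaky_bronze, lambda s: None, lambda: None, log)
```

The same shape maps onto a Fabric Data Pipeline: on-failure paths write to the monitoring table, and downstream activities depend only on the sources that succeeded.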
Best Practices
- Use Delta format for all tables: Parquet files without Delta transaction logs lack ACID transactions, time travel, and schema evolution support
- Partition large tables by date: Tables over 100 million rows benefit from date-based partitioning for efficient query pruning
- Run OPTIMIZE regularly: Compact small files into larger ones (100MB-1GB target) for optimal read performance
- Monitor with Capacity Metrics: Track CU consumption per Lakehouse to right-size your Fabric capacity
- Implement Git integration: Version control your notebook code and pipeline definitions for change tracking and rollback
- Document table lineage: Maintain a data dictionary mapping each Gold table to its Silver sources and Bronze origins
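The OPTIMIZE practice above is easy to schedule as a small maintenance notebook. A sketch that generates the Delta SQL statements such a notebook might pass to `spark.sql(...)`; the table names are placeholders, and the 168-hour VACUUM retention is an example value to check against your recovery requirements:

```python
# Sketch: generating Delta maintenance statements for a scheduled notebook.
# OPTIMIZE compacts small files; VACUUM removes files older than the retention
# window. Table names and the retention value are illustrative.

def maintenance_statements(tables, vacuum_retain_hours=168):
    stmts = []
    for t in tables:
        stmts.append(f"OPTIMIZE {t}")
        stmts.append(f"VACUUM {t} RETAIN {vacuum_retain_hours} HOURS")
    return stmts

stmts = maintenance_statements(["gold_fact_sales", "gold_dim_customer"])
```

Note that VACUUM removes the old file versions that Delta time travel depends on, so the retention window bounds how far back you can query.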
Frequently Asked Questions
What is the difference between a data lake and a lakehouse?
A data lake stores raw data in various formats without enforcing structure. A lakehouse adds data warehouse capabilities like ACID transactions, schema enforcement, and SQL query support while maintaining the cost benefits of lake storage.
Why use medallion architecture in Microsoft Fabric?
Medallion architecture (Bronze, Silver, Gold layers) provides clear data lineage, separates concerns, enables incremental processing, and ensures data quality improves at each stage before reaching business users.