
Building a Modern Data Lakehouse with Microsoft Fabric
Step-by-step guide to implementing a data lakehouse architecture using Microsoft Fabric and OneLake.
The data lakehouse architecture eliminates the forced choice between data warehouse structure and data lake flexibility. Traditional data warehouses provide schema enforcement, ACID transactions, and fast SQL queries—but they come at high cost, scale poorly, and resist schema change. Data lakes offer low-cost storage, schema-on-read flexibility, and support for diverse data types—but they lack transaction support, have no built-in data quality enforcement, and require separate tools for BI access. The lakehouse combines both: Delta Lake provides ACID transactions and schema enforcement on top of scalable object storage, while SQL analytics endpoints enable direct BI tool connectivity without data movement. Microsoft Fabric makes building this architecture accessible through its integrated Lakehouse item.
Why the Lakehouse Architecture Wins
| Capability | Data Warehouse | Data Lake | Lakehouse |
|---|---|---|---|
| ACID transactions | Yes | No | Yes (Delta Lake) |
| Schema enforcement | Strict (schema-on-write) | None (schema-on-read) | Flexible (both modes) |
| Storage cost | High ($$$) | Low ($) | Low ($) |
| SQL query support | Native | Requires external engine | Built-in SQL endpoint |
| Unstructured data | Not supported | Yes | Yes (Files section) |
| BI tool access | Direct connection | Requires ETL to warehouse | Direct Lake / SQL endpoint |
| Data science access | Limited | Native (Spark, Python) | Native (Spark, Python) |
| Time travel | Limited | No | Yes (Delta versioning) |
| Real-time data | Difficult | Streaming ingestion | Eventstreams + mirroring |
The lakehouse architecture does not replace data warehouses for every use case. Organizations with deeply established T-SQL workloads, complex stored procedures, and certified data marts may keep their warehouse and add a lakehouse alongside it. But for new analytics platforms, the lakehouse provides the most complete architecture.
Fabric Lakehouse Components
When you create a Lakehouse in Fabric, you automatically get:
OneLake Storage
All data stored in OneLake using the Delta Lake format (Parquet files + transaction log). OneLake provides hierarchical namespace, RBAC security, and access from any Fabric workload. Storage costs are the same as Azure Data Lake Storage Gen2.
SQL Analytics Endpoint
An automatically generated, read-only SQL endpoint that exposes your Lakehouse tables as T-SQL queryable objects. This endpoint supports:
- Standard T-SQL SELECT queries
- Views and stored procedures (read-only)
- Power BI DirectQuery and Direct Lake connections
- SSMS, Azure Data Studio, and ODBC/JDBC tool connectivity
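Because the endpoint speaks the SQL Server protocol, any ODBC-capable client can connect to it. A minimal sketch of composing such a connection string in Python, assuming ODBC Driver 18 and Azure AD interactive authentication; the server and database names are placeholders, so copy the real SQL connection string from the Lakehouse settings in Fabric:

```python
# Sketch: composing an ODBC connection string for a Lakehouse SQL analytics
# endpoint. Server and database names below are placeholders; copy the real
# values from the Lakehouse's "SQL connection string" settings in Fabric.

def sql_endpoint_connection_string(server: str, database: str) -> str:
    """Build an ODBC connection string using Azure AD interactive auth."""
    return (
        "Driver={ODBC Driver 18 for SQL Server};"
        f"Server={server};"
        f"Database={database};"
        "Authentication=ActiveDirectoryInteractive;"
        "Encrypt=yes;"
    )

conn_str = sql_endpoint_connection_string(
    "myworkspace.datawarehouse.fabric.microsoft.com",  # placeholder endpoint
    "gold_lakehouse",                                  # placeholder name
)
```

Pass the resulting string to pyodbc (or embed the same keywords in an SSMS or JDBC connection dialog) to run read-only T-SQL against the Lakehouse tables.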
Spark Runtime
Full Apache Spark runtime with PySpark, Spark SQL, Scala, and R support. Use Spark for:
- Large-scale data transformations (ETL/ELT)
- Data quality validation and cleansing
- Feature engineering for machine learning
- Ad-hoc data exploration
Tables and Files
Lakehouses contain two top-level sections:
- Tables: Delta-formatted tables with schema, managed by the SQL analytics endpoint. This is where structured data lives.
- Files: Unstructured or semi-structured files (CSV, JSON, Parquet, images, PDFs). These can be read by Spark but are not automatically queryable via SQL.
Building Your Lakehouse: Step by Step
Step 1: Design the Layer Architecture
Before creating any tables, design your medallion layer structure:
Bronze Lakehouse — Raw ingestion landing zone. Receives data as-is from sources. Tables partitioned by ingestion date. Minimal transformations (add ingestion metadata only). Purpose: safety net for reprocessing.
Silver Lakehouse — Cleansed and conformed data. Validated against quality rules, deduplicated, standardized. Slowly changing dimensions managed with SCD Type 2. Purpose: trusted, consistent data for all consumers.
Gold Lakehouse — Business-ready consumption layer. Star schema fact and dimension tables. Aggregation tables for performance. V-Order optimized for Direct Lake. Purpose: serve Power BI and data science.
For small-to-medium projects, all three layers can coexist in a single Lakehouse using naming conventions: `bronze/`, `silver/`, `gold/` folders in the Files section and `bronze_`, `silver_`, `gold_` table name prefixes. For enterprise deployments, separate Lakehouses per layer enable independent access control and capacity management.
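The single-Lakehouse convention is easy to enforce with a couple of small helpers. A sketch, assuming the layer prefixes and folder names described above (the function names are illustrative, not a Fabric API):

```python
# Sketch: helpers enforcing the single-Lakehouse layer conventions described
# above (bronze/silver/gold folders under Files, layer prefixes on tables).

LAYERS = ("bronze", "silver", "gold")

def table_name(layer: str, name: str) -> str:
    """Prefix a table name with its medallion layer, e.g. 'silver_customers'."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer}")
    return f"{layer}_{name}"

def files_path(layer: str, relative: str) -> str:
    """Build a Files-section path, e.g. 'Files/bronze/sales.csv'."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer}")
    return f"Files/{layer}/{relative}"
```

Routing every table and file name through helpers like these keeps notebooks consistent and makes a later split into per-layer Lakehouses a mechanical change.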
Step 2: Build the Bronze Ingestion Layer
Batch ingestion with Data Pipelines: Create a Fabric Data Pipeline with Copy Activity to pull data from source systems (SQL databases, REST APIs, file storage) and write to Bronze Lakehouse tables. Schedule the pipeline for daily or hourly execution.
Real-time ingestion with Eventstreams: For streaming sources (IoT devices, clickstream, application events), create an Eventstream that routes events to a Bronze Lakehouse table in near-real time.
Zero-copy with Shortcuts: For data already in Azure Data Lake Storage, S3, or another Fabric Lakehouse, create OneLake shortcuts instead of copying data. This provides instant access without storage duplication.
Database mirroring: For supported operational databases (Azure SQL, Cosmos DB, Snowflake), configure Fabric Mirroring for near-real-time CDC replication directly to your Bronze layer.
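Whichever ingestion path a source uses, Step 1's rule still applies: Bronze adds ingestion metadata and nothing else. A sketch of that stamping step, using plain dicts where a Fabric notebook would use Spark `withColumn` calls; the metadata field names are illustrative, not a Fabric convention:

```python
from datetime import datetime, timezone
import uuid

# Sketch: stamping Bronze records with ingestion metadata only (no business
# transformations). Field names starting with "_" are an illustrative choice.

def stamp_bronze(record: dict, source_system: str, batch_id: str) -> dict:
    """Return a copy of the record with ingestion metadata columns added."""
    return {
        **record,
        "_ingested_at": datetime.now(timezone.utc).isoformat(),
        "_source_system": source_system,
        "_batch_id": batch_id,
    }

batch_id = str(uuid.uuid4())  # one id per pipeline run, for reprocessing
rows = [stamp_bronze(r, "crm_api", batch_id) for r in [{"customer_id": 1}]]
```

Keeping the batch id on every row is what makes Bronze a safety net: a bad run can be identified and reprocessed without touching other batches.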
Step 3: Build the Silver Transformation Layer
Use Spark notebooks to transform Bronze data into Silver:
- Schema enforcement: Define the expected schema and validate incoming data. Route schema violations to a quarantine table.
- Deduplication: Use window functions to identify and remove duplicate records based on business keys and timestamps.
- Data quality gates: Check null rates, range validations, referential integrity. Log quality metrics to a monitoring table.
- Slowly Changing Dimensions: Implement SCD Type 2 using Delta Lake merge operations with effective dating.
- Standardization: Normalize codes, formats, time zones, and units across all source systems.
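The deduplication step above is typically a Spark `row_number()` over a window partitioned by business key and ordered by timestamp descending. A pure-Python sketch of the same keep-the-latest logic, so the pattern is visible without a Spark session:

```python
# Sketch: keep only the most recent record per business key, mirroring a
# Spark window (partition by key, order by timestamp desc, row_number = 1).

def dedupe_latest(records, key_fields, ts_field):
    """Keep only the most recent record for each business key."""
    latest = {}
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        if key not in latest or rec[ts_field] > latest[key][ts_field]:
            latest[key] = rec
    return list(latest.values())

raw = [
    {"customer_id": 1, "email": "old@example.com", "updated": "2024-01-01"},
    {"customer_id": 1, "email": "new@example.com", "updated": "2024-03-01"},
    {"customer_id": 2, "email": "b@example.com",   "updated": "2024-02-01"},
]
clean = dedupe_latest(raw, ["customer_id"], "updated")  # 2 rows survive
```

In the Silver notebook the equivalent Spark version feeds the SCD Type 2 merge, so only one candidate row per business key reaches the dimension table.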
Step 4: Build the Gold Consumption Layer
Transform Silver data into business-ready datasets:
- Star schema: Build fact tables (transactions, events) and dimension tables (customer, product, date, geography) with integer surrogate keys.
- Aggregation tables: Pre-compute daily, weekly, and monthly summaries for high-performance dashboard queries.
- V-Order optimization: Ensure all Gold tables are written with V-Order enabled for optimal Direct Lake performance.
- Semantic model alignment: Design Gold table names, column names, and relationships to match your planned Power BI semantic model. This simplifies model creation—Gold tables map directly to model tables.
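The surrogate-key step above follows a standard pattern: assign an integer key per dimension row, then swap natural keys for surrogate keys in the facts. A dict-based sketch of that lookup pattern; in Spark this is usually done with an identity column or `monotonically_increasing_id()`:

```python
# Sketch: assigning integer surrogate keys to a Gold dimension and using the
# resulting lookup to key fact rows. Plain Python stands in for Spark here.

def assign_surrogate_keys(dim_rows, natural_key, start=1):
    """Return (rows with an 'sk' column, natural-key -> surrogate-key map)."""
    lookup = {}
    out = []
    for sk, row in enumerate(dim_rows, start=start):
        lookup[row[natural_key]] = sk
        out.append({**row, "sk": sk})
    return out, lookup

dim_customer, customer_sk = assign_surrogate_keys(
    [{"customer_id": "C-100"}, {"customer_id": "C-200"}], "customer_id"
)
# Fact rows then carry the surrogate key instead of the natural key:
fact = [{"customer_sk": customer_sk["C-100"], "amount": 42.0}]
```

Integer surrogate keys keep fact tables narrow and make joins in the semantic model cheap, which is exactly what Direct Lake queries benefit from.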
Step 5: Connect Power BI with Direct Lake
Create a semantic model from your Gold Lakehouse tables using Direct Lake storage mode:
- Navigate to the Lakehouse, click "New semantic model"
- Select Gold layer tables
- Define relationships, hierarchies, and measures
- Build reports on top of the model
Direct Lake reads Delta tables directly from OneLake with near-import performance—no data copying, no scheduled refresh (metadata framing takes seconds). When Gold tables update, reports automatically reflect the latest data.
Orchestrating the Pipeline
Use Fabric Data Pipelines to orchestrate the entire flow:
- Bronze activities run first: Copy Activity from sources, Eventstream for streaming
- Silver notebooks run after Bronze completes: Data quality, deduplication, SCD processing
- Gold notebooks run after Silver completes: Star schema building, aggregation, V-Order optimization
- Semantic model framing triggers automatically when Gold tables update
Add error handling at each stage: if one source's Bronze ingestion fails, Silver processing for other sources can still proceed. Log pipeline status and duration to a monitoring table for operational visibility.
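The control flow above, including the per-source error isolation, can be sketched as follows. The `run_*` callables are placeholders for pipeline activities or notebook runs, not a Fabric API:

```python
# Sketch: medallion orchestration with per-source error isolation. A failed
# Bronze ingestion is logged and skipped so other sources still flow through.

def run_medallion(sources, run_bronze, run_silver, run_gold, log):
    succeeded = []
    for src in sources:
        try:
            run_bronze(src)           # Copy Activity / Eventstream per source
            succeeded.append(src)
        except Exception as exc:
            log.append({"stage": "bronze", "source": src, "error": str(exc)})
    for src in succeeded:
        run_silver(src)               # quality gates, dedup, SCD processing
    if succeeded:
        run_gold()                    # star schema build; framing follows
    return succeeded

def flaky_bronze(src):
    if src == "erp":                  # pretend one source is down
        raise RuntimeError("timeout")

log = []
ok = run_medallion(["crm", "erp"], flaky_bronze, lambda s: None, lambda: None, log)
```

The same shape maps onto a Fabric Data Pipeline: on-failure paths write to the monitoring table, and downstream activities depend only on the sources that succeeded.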
Best Practices
- Use Delta format for all tables: Parquet files without Delta transaction logs lack ACID transactions, time travel, and schema evolution support
- Partition large tables by date: Tables over 100 million rows benefit from date-based partitioning for efficient query pruning
- Run OPTIMIZE regularly: Compact small files into larger ones (100MB-1GB target) for optimal read performance
- Monitor with Capacity Metrics: Track CU consumption per Lakehouse to right-size your Fabric capacity
- Implement Git integration: Version control your notebook code and pipeline definitions for change tracking and rollback
- Document table lineage: Maintain a data dictionary mapping each Gold table to its Silver sources and Bronze origins
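The OPTIMIZE practice above is easy to schedule as a small maintenance notebook. A sketch that generates the Delta SQL statements such a notebook might pass to `spark.sql(...)`; the table names are placeholders, and the 168-hour VACUUM retention is an example value to check against your recovery requirements:

```python
# Sketch: generating Delta maintenance statements for a scheduled notebook.
# OPTIMIZE compacts small files; VACUUM removes files older than the retention
# window. Table names and the retention value are illustrative.

def maintenance_statements(tables, vacuum_retain_hours=168):
    stmts = []
    for t in tables:
        stmts.append(f"OPTIMIZE {t}")
        stmts.append(f"VACUUM {t} RETAIN {vacuum_retain_hours} HOURS")
    return stmts

stmts = maintenance_statements(["gold_fact_sales", "gold_dim_customer"])
```

Note that VACUUM removes the old file versions that Delta time travel depends on, so the retention window bounds how far back you can query.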
Frequently Asked Questions
What is the difference between a data lake and a lakehouse?
A data lake stores raw data in various formats without enforcing structure. A lakehouse adds data warehouse capabilities like ACID transactions, schema enforcement, and SQL query support while maintaining the cost benefits of lake storage.
Why use medallion architecture in Microsoft Fabric?
Medallion architecture (Bronze, Silver, Gold layers) provides clear data lineage, separates concerns, enables incremental processing, and ensures data quality improves at each stage before reaching business users.