
Building a Modern Data Lakehouse with Microsoft Fabric
The data lakehouse architecture combines the ACID transactions and SQL query support of a data warehouse with the scalability, flexibility, and low cost of a data lake—eliminating the forced choice between the two that has plagued enterprise analytics for over a decade.

Traditional data warehouses provide schema enforcement, ACID transactions, and fast SQL queries—but at high cost, limited scalability, and rigidity when schemas change. Data lakes offer unlimited storage, schema-on-read flexibility, and support for diverse data types—but lack transaction support, have no built-in data quality enforcement, and require separate tools for BI access. The lakehouse combines both: Delta Lake provides ACID transactions and schema enforcement on top of scalable object storage, while SQL analytics endpoints enable direct BI tool connectivity without data movement.

Microsoft Fabric makes building this architecture accessible through its integrated Lakehouse item. I have helped organizations migrate from legacy warehouse architectures costing $200K+/year in Azure Synapse licensing to Fabric Lakehouses at a fraction of the cost while gaining capabilities they never had—schema evolution, time travel, and Direct Lake reporting. Our Microsoft Fabric consulting team designs lakehouse architectures for enterprise clients across healthcare, finance, and government.
Why the Lakehouse Architecture Wins
| Capability | Data Warehouse | Data Lake | Lakehouse |
|---|---|---|---|
| ACID transactions | Yes | No | Yes (Delta Lake) |
| Schema enforcement | Strict (schema-on-write) | None (schema-on-read) | Flexible (both modes) |
| Storage cost | High ($$$) | Low ($) | Low ($) |
| SQL query support | Native | Requires external engine | Built-in SQL endpoint |
| Unstructured data | Not supported | Yes | Yes (Files section) |
| BI tool access | Direct connection | Requires ETL to warehouse | Direct Lake / SQL endpoint |
| Data science access | Limited | Native (Spark, Python) | Native (Spark, Python) |
| Time travel | Limited | No | Yes (Delta versioning) |
| Real-time data | Difficult | Streaming ingestion | Eventstreams + mirroring |
In my experience implementing lakehouse architectures for 15+ enterprise clients since Fabric's general availability, the pattern delivers the best ROI for organizations building new analytics platforms or modernizing aging warehouse infrastructure. The lakehouse architecture does not replace data warehouses for every use case. Organizations with deeply established T-SQL workloads, complex stored procedures, and certified data marts may keep their warehouse and add a lakehouse alongside it. But for new analytics platforms, the lakehouse provides the most complete architecture.
Fabric Lakehouse Components
When you create a Lakehouse in Fabric, you automatically get:
OneLake Storage
All data stored in OneLake using the Delta Lake format (Parquet files + transaction log). OneLake provides hierarchical namespace, RBAC security, and access from any Fabric workload. Storage costs are the same as Azure Data Lake Storage Gen2.
SQL Analytics Endpoint
An automatically generated, read-only SQL endpoint that exposes your Lakehouse tables as T-SQL queryable objects. This endpoint supports:
- Standard T-SQL SELECT queries
- Views and stored procedures (read-only)
- Power BI DirectQuery and Direct Lake connections
- SSMS, Azure Data Studio, and ODBC/JDBC tool connectivity
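Because the endpoint speaks standard TDS, any ODBC client can query it. As a minimal sketch, here is how a Python client might build a connection string for the endpoint—the server host and database name below are hypothetical placeholders; copy the real values from the Lakehouse's "SQL connection string" in the Fabric portal:

```python
def fabric_sql_conn_str(server: str, database: str) -> str:
    """Build an ODBC connection string for a Fabric SQL analytics endpoint.
    Assumes ODBC Driver 18 for SQL Server and Azure AD interactive auth."""
    return (
        "Driver={ODBC Driver 18 for SQL Server};"
        f"Server={server};Database={database};"
        "Authentication=ActiveDirectoryInteractive;Encrypt=yes;"
    )

# Hypothetical endpoint host and Lakehouse name:
conn_str = fabric_sql_conn_str(
    "myworkspace.datawarehouse.fabric.microsoft.com", "gold_lakehouse"
)

# With pyodbc installed, the endpoint is then queried like any SQL Server:
# import pyodbc
# with pyodbc.connect(conn_str) as conn:
#     rows = conn.execute("SELECT TOP 10 * FROM dbo.dim_customer").fetchall()
```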
Spark Runtime
Full Apache Spark runtime with PySpark, Spark SQL, Scala, and R support. Use Spark for:
- Large-scale data transformations (ETL/ELT)
- Data quality validation and cleansing
- Feature engineering for machine learning
- Ad-hoc data exploration
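In a Fabric notebook these checks run as PySpark over full tables; as a pure-Python sketch of the same data-quality logic (the equivalent of counting `col(c).isNull()` per column in Spark), with sample rows invented for illustration:

```python
def null_rates(rows, columns):
    """Fraction of null values per column -- the kind of quality metric a
    Spark validation job computes before promoting data to Silver."""
    total = len(rows)
    return {
        c: sum(1 for r in rows if r.get(c) is None) / total
        for c in columns
    }

# Hypothetical sample of ingested order records:
sample = [
    {"order_id": 1, "customer_id": "C1", "amount": 42.0},
    {"order_id": 2, "customer_id": None, "amount": 10.5},
    {"order_id": 3, "customer_id": "C2", "amount": None},
]
rates = null_rates(sample, ["order_id", "customer_id", "amount"])
```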
Tables and Files
Lakehouses contain two top-level sections:
- Tables: Delta-formatted tables with schema, managed by the SQL analytics endpoint. This is where structured data lives.
- Files: Unstructured or semi-structured files (CSV, JSON, Parquet, images, PDFs). These can be read by Spark but are not automatically queryable via SQL.
Building Your Lakehouse: Step by Step
Step 1: Design the Layer Architecture
Before creating any tables, design your medallion layer structure:
Bronze Lakehouse — Raw ingestion landing zone. Receives data as-is from sources. Tables partitioned by ingestion date. Minimal transformations (add ingestion metadata only). Purpose: safety net for reprocessing.
Silver Lakehouse — Cleansed and conformed data. Validated against quality rules, deduplicated, standardized. Slowly changing dimensions managed with SCD Type 2. Purpose: trusted, consistent data for all consumers.
Gold Lakehouse — Business-ready consumption layer. Star schema fact and dimension tables. Aggregation tables for performance. V-Order optimized for Direct Lake. Purpose: serve Power BI and data science.
For small-to-medium projects, all three layers can coexist in a single Lakehouse using folder conventions (bronze/, silver/, gold/ in the Files section and bronze_, silver_, gold_ table name prefixes). For enterprise deployments, separate Lakehouses per layer enable independent access control and capacity management. I strongly recommend separate Lakehouses for any organization with more than 50 data consumers or strict compliance requirements—the access control granularity alone justifies the additional setup effort.
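For the single-Lakehouse layout, it pays to centralize the naming convention in one place so every notebook derives table names and folder paths the same way. A minimal sketch (the specific prefix and folder scheme are one reasonable convention, not a Fabric requirement):

```python
LAYERS = ("bronze", "silver", "gold")

def table_name(layer: str, entity: str) -> str:
    """Prefix convention for a single-Lakehouse medallion layout,
    e.g. bronze_orders, silver_orders, gold_fact_orders."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer}")
    return f"{layer}_{entity}"

def files_path(layer: str, source: str, ingest_date: str) -> str:
    """Folder convention for the Files section: <layer>/<source>/<date>/."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer}")
    return f"{layer}/{source}/{ingest_date}/"
```

A notebook would then write to `table_name("silver", "orders")` rather than hard-coding strings, making a later move to separate per-layer Lakehouses a one-function change.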
Step 2: Build the Bronze Ingestion Layer
Batch ingestion with Data Pipelines: Create a Fabric Data Pipeline with Copy Activity to pull data from source systems (SQL databases, REST APIs, file storage) and write to Bronze Lakehouse tables. Schedule the pipeline for daily or hourly execution.
Real-time ingestion with Eventstreams: For streaming sources (IoT devices, clickstream, application events), create an Eventstream that routes events to a Bronze Lakehouse table in near-real time.
Zero-copy with Shortcuts: For data already in Azure Data Lake Storage, S3, or another Fabric Lakehouse, create OneLake shortcuts instead of copying data. This provides instant access without storage duplication.
Database mirroring: For supported operational databases (Azure SQL, Cosmos DB, Snowflake), configure Fabric Mirroring for near-real-time CDC replication directly to your Bronze layer.
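Whichever ingestion path you use, Bronze transformations stay minimal: stamp each record with ingestion metadata and nothing else. A pure-Python sketch of that stamping step (the `_source_system` / `_ingested_at` / `_ingest_date` column names are an assumed convention, not a Fabric standard):

```python
from datetime import datetime, timezone

def stamp_bronze(record: dict, source_system: str) -> dict:
    """Add the minimal ingestion metadata Bronze keeps: the source system,
    an ingestion timestamp, and the date used for partitioning."""
    now = datetime.now(timezone.utc)
    return {
        **record,
        "_source_system": source_system,
        "_ingested_at": now.isoformat(),
        "_ingest_date": now.date().isoformat(),  # partition column
    }

row = stamp_bronze({"order_id": 42, "amount": 19.99}, "erp")
```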
Step 3: Build the Silver Transformation Layer
Use Spark notebooks to transform Bronze data into Silver:
- Schema enforcement: Define the expected schema and validate incoming data. Route schema violations to a quarantine table.
- Deduplication: Use window functions to identify and remove duplicate records based on business keys and timestamps.
- Data quality gates: Check null rates, range validations, referential integrity. Log quality metrics to a monitoring table.
- Slowly Changing Dimensions: Implement SCD Type 2 using Delta Lake merge operations with effective dating.
- Standardization: Normalize codes, formats, time zones, and units across all source systems.
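The SCD Type 2 step is the trickiest of these. In a Fabric notebook it is a Delta Lake MERGE; the pure-Python sketch below shows the merge semantics that MERGE implements (column names like `is_current` and `valid_from` are an assumed convention):

```python
from datetime import date

def scd2_merge(dimension: list, incoming: dict, key: str, today: date) -> list:
    """SCD Type 2 sketch: close the current row when a tracked attribute
    changed, then insert the new version with effective dating."""
    current = next(
        (r for r in dimension if r[key] == incoming[key] and r["is_current"]),
        None,
    )
    tracked = [c for c in incoming if c != key]
    if current and all(current[c] == incoming[c] for c in tracked):
        return dimension  # no attribute changed: keep the current version
    if current:
        current["is_current"] = False
        current["valid_to"] = today  # close out the superseded version
    dimension.append({
        **incoming,
        "valid_from": today,
        "valid_to": None,
        "is_current": True,
    })
    return dimension

dim = []
scd2_merge(dim, {"customer_id": "C1", "city": "Oslo"}, "customer_id", date(2024, 1, 1))
scd2_merge(dim, {"customer_id": "C1", "city": "Bergen"}, "customer_id", date(2024, 6, 1))
# dim now holds the closed Oslo row and the current Bergen row
```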
Step 4: Build the Gold Consumption Layer
Transform Silver data into business-ready datasets:
- Star schema: Build fact tables (transactions, events) and dimension tables (customer, product, date, geography) with integer surrogate keys.
- Aggregation tables: Pre-compute daily, weekly, and monthly summaries for high-performance dashboard queries.
- V-Order optimization: Ensure all Gold tables are written with V-Order enabled for optimal Direct Lake performance.
- Semantic model alignment: Design Gold table names, column names, and relationships to match your planned Power BI semantic model. This simplifies model creation—Gold tables map directly to model tables.
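Surrogate key assignment is the core mechanical step in building those dimension tables. A sketch of the logic (in practice this runs as a Spark join against the existing dimension; the `sk` column name is an assumption):

```python
def assign_surrogate_keys(dim_rows, existing=None, key="customer_id"):
    """Assign stable integer surrogate keys: known business keys keep
    their existing key, new business keys get the next integer."""
    mapping = dict(existing or {})
    next_sk = max(mapping.values(), default=0) + 1
    out = []
    for row in dim_rows:
        bk = row[key]
        if bk not in mapping:
            mapping[bk] = next_sk
            next_sk += 1
        out.append({**row, "sk": mapping[bk]})
    return out, mapping

rows, keymap = assign_surrogate_keys(
    [{"customer_id": "C1"}, {"customer_id": "C2"}, {"customer_id": "C1"}]
)
# keymap maps C1 -> 1 and C2 -> 2; both C1 rows share surrogate key 1
```

Keeping the mapping stable across loads matters: fact tables reference the surrogate keys, so reassigning them would silently break historical joins.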
Step 5: Connect Power BI with Direct Lake
Create a semantic model from your Gold Lakehouse tables using Direct Lake storage mode:
- Navigate to the Lakehouse, click "New semantic model"
- Select Gold layer tables
- Define relationships, hierarchies, and measures
- Build reports on top of the model
Direct Lake reads Delta tables directly from OneLake with near-import performance—no data copying, no scheduled refresh (metadata framing takes seconds). When Gold tables update, reports automatically reflect the latest data.
Orchestrating the Pipeline
Use Fabric Data Pipelines to orchestrate the entire flow:
- Bronze activities run first: Copy Activity from sources, Eventstream for streaming
- Silver notebooks run after Bronze completes: Data quality, deduplication, SCD processing
- Gold notebooks run after Silver completes: Star schema building, aggregation, V-Order optimization
- Semantic model framing triggers automatically when Gold tables update
Add error handling at each stage: if one source's Bronze ingestion fails, Silver processing for other sources can still proceed. Log pipeline status and duration to a monitoring table for operational visibility.
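The per-source isolation described above can be sketched as a simple loop that catches each source's failure independently and records a status log (the `ingest_fn` callable and source names are hypothetical stand-ins for pipeline activities):

```python
def run_bronze_ingestion(sources, ingest_fn):
    """Run each source's Bronze ingestion independently so one failure
    does not block Silver processing for the others."""
    statuses = {}
    for source in sources:
        try:
            ingest_fn(source)
            statuses[source] = "succeeded"
        except Exception as exc:  # record the failure, keep going
            statuses[source] = f"failed: {exc}"
    return statuses

def flaky(source):
    """Simulated ingestion where one source is down."""
    if source == "crm":
        raise RuntimeError("source unavailable")

log = run_bronze_ingestion(["erp", "crm", "web"], flaky)
# log: erp and web succeeded; crm failed with "source unavailable"
```

Downstream, the Silver stage would consult this log and process only the sources that succeeded, retrying the failed ones on the next run.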
Best Practices
- Use Delta format for all tables: Parquet files without Delta transaction logs lack ACID transactions, time travel, and schema evolution support
- Partition large tables by date: Tables over 100 million rows benefit from date-based partitioning for efficient query pruning
- Run OPTIMIZE regularly: Compact small files into larger ones (100MB-1GB target) for optimal read performance
- Monitor with Capacity Metrics: Track CU consumption per Lakehouse to right-size your Fabric capacity
- Implement Git integration: Version control your notebook code and pipeline definitions for change tracking and rollback
- Document table lineage: Maintain a data dictionary mapping each Gold table to its Silver sources and Bronze origins
- Plan for disaster recovery: OneLake provides built-in redundancy, but maintain runbook documentation for restoring pipeline state after failures. Test recovery procedures quarterly
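To decide when a table is due for OPTIMIZE, a simple small-file heuristic works well. A sketch, assuming file sizes gathered from the table's Parquet files (the 100 MB threshold and 50% ratio are rules of thumb, not Delta Lake defaults):

```python
def needs_optimize(file_sizes_mb, small_mb=100, ratio=0.5):
    """Flag a Delta table for compaction when more than `ratio` of its
    Parquet files fall below the small-file threshold."""
    if not file_sizes_mb:
        return False
    small = sum(1 for s in file_sizes_mb if s < small_mb)
    return small / len(file_sizes_mb) > ratio

assert needs_optimize([5, 8, 12, 300]) is True   # 3 of 4 files are small
assert needs_optimize([250, 400, 800]) is False  # already well compacted
```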
Sizing and Capacity Planning
Proper capacity sizing prevents the most common lakehouse deployment failures:
| Workload Profile | Recommended Starting SKU | Typical Monthly Cost | Supports |
|---|---|---|---|
| Small team (5-10 users, <100GB) | F2 | ~$260/month | Basic medallion architecture |
| Mid-size (20-50 users, 100GB-1TB) | F8 | ~$1,000/month | Full medallion + Direct Lake reporting |
| Enterprise (100+ users, 1-10TB) | F32-F64 | ~$4,000-8,000/month | Multiple Lakehouses, concurrent Spark + reporting |
| Large enterprise (500+ users, 10TB+) | F128+ | ~$16,000+/month | Full platform with real-time, ML, and heavy ETL |
Monitor capacity utilization weekly for the first month after deployment. I typically see organizations right-size within 2-3 SKU adjustments during the first quarter. The Fabric Capacity Metrics app provides the data needed for informed scaling decisions.
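The sizing table above can be encoded as a starting-point helper; the thresholds below mirror that table and are rough planning guides, not hard Fabric limits:

```python
def starting_sku(users: int, data_tb: float) -> str:
    """Pick a starting Fabric SKU from the workload profile, per the
    sizing table above; whichever dimension is larger drives the choice."""
    if users >= 500 or data_tb >= 10:
        return "F128"
    if users >= 100 or data_tb >= 1:
        return "F32"
    if users >= 20 or data_tb >= 0.1:  # 0.1 TB = 100 GB
        return "F8"
    return "F2"
```

Treat the result as the first guess to validate against the Capacity Metrics app, not a final answer.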
Frequently Asked Questions
What is the difference between a data lake and a lakehouse?
A data lake stores raw data in various formats without enforcing structure. A lakehouse adds data warehouse capabilities like ACID transactions, schema enforcement, and SQL query support while maintaining the cost benefits of lake storage.
Why use medallion architecture in Microsoft Fabric?
Medallion architecture (Bronze, Silver, Gold layers) provides clear data lineage, separates concerns, enables incremental processing, and ensures data quality improves at each stage before reaching business users.