
Deep Dive into Medallion Architecture
Advanced implementation patterns for Bronze, Silver, and Gold medallion architecture layers in Microsoft Fabric. Data quality, SCD handling, and optimization.
The medallion architecture (Bronze → Silver → Gold) provides a structured framework for progressively refining raw data into business-ready analytics assets. While the concept is simple—ingest raw data, cleanse and standardize it, then curate it for consumption—the implementation details determine whether your data platform is maintainable, performant, and trustworthy at scale. This guide covers advanced implementation patterns for each layer in Microsoft Fabric, including specific techniques for handling slowly changing dimensions, data quality enforcement, and optimization for Direct Lake semantic models. Our data analytics consulting team implements medallion architectures with automated quality gates and monitoring.
Architecture Overview
Each medallion layer serves a distinct purpose with different quality guarantees, schemas, and access patterns:
| Layer | Purpose | Data Quality | Schema | Primary Users | Storage Pattern |
|---|---|---|---|---|---|
| Bronze | Raw ingestion landing zone | As-is from source (no quality guarantees) | Schema-on-read, preserves source format | Data engineers (debugging, reprocessing) | Append-only, partitioned by ingestion date |
| Silver | Cleansed, conformed, validated | Quality gates enforced, nulls handled, deduped | Schema-on-write, standardized types | Data engineers, analysts | Upsert with merge, SCD Type 2 for dimensions |
| Gold | Business-ready, optimized | Fully validated, aggregated, documented | Star schema for BI, wide tables for ML | Analysts, BI tools, data scientists | Optimized for query (V-Order, partitioned by business key) |
In Fabric, each layer typically maps to a separate Lakehouse (Bronze Lakehouse, Silver Lakehouse, Gold Lakehouse) within the same workspace or across workspaces based on your governance model. Separating layers into distinct Lakehouses enables independent access control, capacity management, and lifecycle policies.
Bronze Layer: Raw Ingestion
Design Principles
The Bronze layer is your safety net. Its job is to capture an exact copy of source data with minimal transformation so that you can always reprocess from raw data if downstream logic changes.
Rules for Bronze:
- Never modify source data structure or content
- Add metadata columns: ingestion timestamp, source system identifier, batch ID
- Preserve source data types (do not convert or parse)
- Store in Delta format for ACID transactions and time travel
- Partition by ingestion date for efficient reprocessing
Ingestion Patterns
Full Snapshot Pattern: Copy the entire source table on each run. Simple but storage-intensive. Best for small reference tables (< 1 million rows) or sources without reliable change tracking.
Incremental Append Pattern: Capture only new rows since the last ingestion using a watermark column (last modified timestamp, auto-increment ID). Efficient for large transaction tables. Implement using Fabric Data Pipelines with a lookup activity to get the last watermark and a copy activity filtered to rows after that watermark.
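The watermark logic above can be sketched in plain Python. This is a minimal illustration of the pattern, not Fabric pipeline code; in practice the lookup and filter run as pipeline activities or PySpark, and the `last_modified` column name and `incremental_rows` function are illustrative assumptions.

```python
from datetime import datetime

def incremental_rows(source_rows, last_watermark):
    """Return rows modified after the stored watermark, plus the new watermark.

    source_rows: iterable of dicts with a 'last_modified' datetime column.
    last_watermark: datetime of the most recent previously ingested row.
    """
    new_rows = [r for r in source_rows if r["last_modified"] > last_watermark]
    # Advance the watermark only if new rows arrived
    new_watermark = max((r["last_modified"] for r in new_rows), default=last_watermark)
    return new_rows, new_watermark

rows = [
    {"id": 1, "last_modified": datetime(2024, 1, 1)},
    {"id": 2, "last_modified": datetime(2024, 1, 3)},
]
batch, wm = incremental_rows(rows, datetime(2024, 1, 2))
# batch contains only id=2; wm advances to 2024-01-03
```

Persist the returned watermark (e.g., in a control table) so the next run resumes exactly where this one stopped.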
CDC (Change Data Capture) Pattern: Capture inserts, updates, and deletes as individual change events. Use Fabric Mirroring for supported databases (Azure SQL, Cosmos DB, Snowflake) or configure CDC on the source and stream changes through Eventstreams. This is the most storage-efficient and lowest-latency pattern for large operational databases.
File-Based Ingestion: For file drops (CSV, JSON, Parquet from SFTP, S3, or Azure Blob), use a Data Pipeline with a ForEach activity to process each file. Move processed files to an archive folder. Track file-level metadata (filename, size, row count) in a Bronze control table.
Bronze Table Structure
Every Bronze table should include standard metadata columns alongside source columns:
- `_ingestion_timestamp`: When the row was ingested (TIMESTAMP)
- `_source_system`: Source identifier (STRING, e.g., "SAP_ERP", "Salesforce")
- `_batch_id`: Unique identifier for the ingestion batch (STRING/GUID)
- `_source_file`: For file-based ingestion, the source filename (STRING)
These columns enable lineage tracking, debugging, and selective reprocessing without modifying source data.
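As a sketch of how these metadata columns get attached without altering source fields, here is a small pure-Python helper. In a real Fabric notebook this would be a set of `withColumn` calls on a Spark DataFrame; the function name is an assumption for illustration.

```python
import uuid
from datetime import datetime, timezone

def with_bronze_metadata(record, source_system, batch_id, source_file=None):
    """Attach standard Bronze metadata columns, leaving source fields untouched."""
    return {
        **record,
        "_ingestion_timestamp": datetime.now(timezone.utc),
        "_source_system": source_system,
        "_batch_id": batch_id,
        "_source_file": source_file,
    }

batch_id = str(uuid.uuid4())  # one batch ID shared by every row in the run
row = with_bronze_metadata({"order_id": 42}, "SAP_ERP", batch_id, "orders_2024.csv")
```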
Silver Layer: Cleansing and Conforming
Design Principles
The Silver layer transforms raw Bronze data into a clean, standardized, trusted dataset. This is where you enforce data quality, handle slowly changing attributes, deduplicate records, and conform data types and codes across source systems.
Data Quality Gates
Implement quality checks as explicit steps in your Silver layer notebooks or dataflows. When records fail quality checks, route them to a quarantine table rather than dropping them silently:
| Quality Check | Implementation | Failure Action |
|---|---|---|
| Null checks on required fields | `df.filter(col("customer_id").isNull())` → quarantine | Route to quarantine, alert data steward |
| Range validations | Filter amounts outside expected bounds (e.g., negative revenue) | Route to quarantine with violation reason |
| Referential integrity | Left anti-join between fact FK and dimension PK | Route orphan records to quarantine |
| Duplicate detection | Window function ranking by business key, ordered by timestamp | Keep latest record, archive duplicates |
| Data type conformity | Try-parse with fallback to null | Route unparseable records to quarantine |
Track quarantine metrics over time. A sudden spike in quarantine volume indicates a source system issue that needs investigation.
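The split-and-tag pattern behind the table can be sketched as follows. This is a simplified pure-Python version (real Silver notebooks would express the same gates as Spark filters); the two checks and the `_violation` column name are illustrative assumptions.

```python
def apply_quality_gates(rows):
    """Split rows into valid and quarantined, tagging each failure with a reason."""
    valid, quarantine = [], []
    for r in rows:
        if r.get("customer_id") is None:
            # Null check on a required field
            quarantine.append({**r, "_violation": "null_customer_id"})
        elif r.get("amount", 0) < 0:
            # Range validation: negative revenue is out of bounds
            quarantine.append({**r, "_violation": "negative_amount"})
        else:
            valid.append(r)
    return valid, quarantine

rows = [
    {"customer_id": "C1", "amount": 100.0},
    {"customer_id": None, "amount": 50.0},
    {"customer_id": "C2", "amount": -10.0},
]
valid, quarantined = apply_quality_gates(rows)
```

Writing the quarantined rows to their own Delta table, with the violation reason, is what makes the later trend monitoring possible.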
Slowly Changing Dimension (SCD) Type 2
Silver layer dimension tables should implement SCD Type 2 for attributes where historical accuracy matters (customer region, product category, employee department):
- New record arrives: If no existing record with this business key exists, insert with EffectiveStartDate = current date, EffectiveEndDate = null, IsCurrent = true
- Changed record arrives: Close the existing record (set EffectiveEndDate = current date - 1, IsCurrent = false), then insert the new version with EffectiveStartDate = current date
- No change: Skip—do not create duplicate records for unchanged data
Implement in Spark notebooks using a merge (upsert) operation against the Silver dimension table. The merge condition checks the business key plus attribute equality to detect changes.
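The three SCD Type 2 cases can be sketched in plain Python to make the merge condition concrete. This models the logic only; a production implementation would use a Delta `MERGE` as described above, and the column names (`effective_start`, `is_current`, etc.) are illustrative assumptions.

```python
from datetime import date, timedelta

def scd2_upsert(dim, incoming, business_key, tracked_attrs, today):
    """Apply SCD Type 2 rules: insert new keys, close-and-insert on change, skip no-ops."""
    current = {r[business_key]: r for r in dim if r["is_current"]}
    for row in incoming:
        existing = current.get(row[business_key])
        if existing and all(existing[a] == row[a] for a in tracked_attrs):
            continue  # no change: do not create a duplicate version
        if existing:
            # Changed record: close the existing version
            existing["effective_end"] = today - timedelta(days=1)
            existing["is_current"] = False
        # New key or new version: insert as the current record
        dim.append({**row, "effective_start": today,
                    "effective_end": None, "is_current": True})
    return dim

dim = [{"customer_id": "C1", "region": "EMEA",
        "effective_start": date(2023, 1, 1), "effective_end": None, "is_current": True}]
dim = scd2_upsert(dim, [{"customer_id": "C1", "region": "APAC"}],
                  "customer_id", ["region"], date(2024, 6, 1))
```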
Cross-Source Conforming
When multiple source systems contribute data to the same Silver table, resolve conflicts:
- Code standardization: Map source-specific codes to a common code table (e.g., "US", "USA", "United States" → "US")
- Currency conversion: Convert all monetary amounts to a standard currency using exchange rates from a reference table
- Date normalization: Parse all date formats to a standard YYYY-MM-DD format in UTC timezone
- Entity resolution: Match customer records across systems using fuzzy matching on name + address or deterministic matching on email/phone
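Two of the conforming steps above can be sketched briefly. The mapping table and rates here are hypothetical stand-ins for the reference tables the text describes; in Fabric these would be broadcast-joined lookup tables, not hardcoded dicts.

```python
# Hypothetical snapshot of a code-standardization reference table
COUNTRY_CODE_MAP = {"US": "US", "USA": "US", "United States": "US", "U.S.": "US"}

# Hypothetical snapshot of an exchange-rate reference table
RATES_TO_USD = {"USD": 1.0, "EUR": 1.08}

def conform_country(raw):
    """Map a source-specific country value to the conformed code, or None if unknown."""
    return COUNTRY_CODE_MAP.get(raw.strip())

def conform_amount(amount, currency):
    """Convert a monetary amount to the standard currency (USD here)."""
    return round(amount * RATES_TO_USD[currency], 2)
```

Returning `None` for unknown codes (rather than passing the raw value through) lets the quality gates quarantine unmapped values instead of letting them pollute Silver.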
Gold Layer: Business-Ready Consumption
Design Principles
The Gold layer produces the final datasets consumed by Power BI semantic models, machine learning pipelines, and operational applications. Gold tables should be optimized for their primary consumer—typically star schema for BI and wide denormalized tables for ML.
Star Schema for BI
Build Gold layer star schema tables specifically for Power BI Direct Lake consumption:
- Fact tables: Aggregate to the grain needed for reporting. If transaction-level detail is rarely needed, aggregate to daily grain and store transaction detail in Silver for drill-through.
- Dimension tables: Fully denormalized with all hierarchy levels. Include the SCD Type 2 current-state view (IsCurrent = true) for standard reporting and the full history view for point-in-time analysis.
- V-Order optimization: Ensure Gold tables are written with V-Order enabled for optimal Direct Lake compression. Run OPTIMIZE after bulk writes.
- Integer surrogate keys: Replace business keys with sequential integer surrogate keys in Gold. This maximizes VertiPaq compression and relationship performance.
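The surrogate-key step can be illustrated with a minimal sketch: assign sequential integers to the dimension, then use the resulting lookup to stamp the keys onto fact rows. Column names like `customer_sk` are assumptions; a Spark implementation would typically use a join against the dimension rather than an in-memory dict.

```python
def assign_surrogate_keys(dim_rows, business_key, sk_name="sk", start=1):
    """Assign sequential integer surrogate keys; return rows plus a business->SK lookup."""
    lookup = {}
    for i, row in enumerate(dim_rows, start=start):
        row[sk_name] = i
        lookup[row[business_key]] = i
    return dim_rows, lookup

dims, lookup = assign_surrogate_keys(
    [{"customer_id": "C9"}, {"customer_id": "C3"}], "customer_id", "customer_sk")

# Replace the business key on facts with the compact integer surrogate key
facts = [{"customer_id": "C3", "amount": 10.0}]
facts = [{**f, "customer_sk": lookup[f["customer_id"]]} for f in facts]
```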
Aggregation Tables
For datasets exceeding 100 million rows, create pre-aggregated Gold tables alongside the detail tables:
- Daily summary: Aggregate hourly or transaction data to daily grain with SUM, COUNT, AVG measures
- Monthly summary: Further aggregate for dashboard-level KPIs
- Dimensional aggregation: Pre-compute totals by common groupings (region, product category, customer segment)
Power BI's aggregation feature automatically routes queries to the appropriate aggregation table based on the columns and measures requested, delivering sub-second performance on billion-row datasets.
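A daily-summary rollup like the one described can be sketched as a simple group-by. In Gold this would be a Spark `groupBy(...).agg(...)` writing a Delta table; the flat-dict shape and key names here are illustrative.

```python
from collections import defaultdict

def daily_summary(transactions):
    """Aggregate transaction-grain rows to (date, region) grain with SUM and COUNT."""
    agg = defaultdict(lambda: {"total_amount": 0.0, "txn_count": 0})
    for t in transactions:
        day = agg[(t["date"], t["region"])]
        day["total_amount"] += t["amount"]
        day["txn_count"] += 1
    return {k: dict(v) for k, v in agg.items()}

txns = [
    {"date": "2024-06-01", "region": "EMEA", "amount": 10.0},
    {"date": "2024-06-01", "region": "EMEA", "amount": 5.0},
    {"date": "2024-06-02", "region": "APAC", "amount": 7.0},
]
summary = daily_summary(txns)
```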
Refresh Orchestration
Orchestrate the Bronze → Silver → Gold pipeline using Fabric Data Factory pipelines:
- Bronze notebooks/copy activities run first, ingesting raw data
- Silver notebooks run after Bronze completes, applying quality gates and conforming
- Gold notebooks run after Silver completes, building star schema and aggregations
- Semantic model framing triggers automatically when Gold tables update (Direct Lake detects Delta log changes)
Use pipeline dependencies to ensure correct execution order. Add error handling at each stage: if Bronze fails for one source, Silver and Gold can still process other sources that succeeded.
Frequently Asked Questions
Is medallion architecture required in Fabric?
No. Medallion is a recommended pattern, not a requirement. It provides clear data lineage and progressive quality improvement. Smaller projects might simplify to fewer layers, but enterprise implementations benefit from the structure.
How do I handle data quality failures in medallion architecture?
Implement quarantine tables in the Bronze or Silver layer. Records failing quality checks go to quarantine for review while valid data progresses. Track and resolve quarantined records through a defined process.