
Fabric Medallion Architecture: Bronze, Silver, Gold Best Practices for 2026
Implement medallion architecture (Bronze, Silver, Gold layers) in Microsoft Fabric with data quality patterns, incremental processing, and governance.
Medallion architecture—organizing data into Bronze, Silver, and Gold layers—has become the industry standard for lakehouse data engineering. Microsoft Fabric natively supports this pattern with OneLake, Delta Lake, and integrated governance. This comprehensive guide covers design patterns, implementation steps, and lessons learned from enterprise deployments. Our Microsoft Fabric consulting team specializes in medallion architecture implementations.
What is Medallion Architecture?
The Three-Layer Pattern
Medallion architecture organizes data processing into three progressive refinement layers:
- Bronze Layer - Raw data landing zone (as-is from source systems)
- Silver Layer - Cleansed and conformed data (validated, standardized)
- Gold Layer - Business-ready aggregated data (optimized for consumption)
Analogy: Think of ore refining in metallurgy:
- Bronze: Raw ore from mines (unrefined)
- Silver: Smelted and purified metal (high quality)
- Gold: Final jewelry product (ready for market)
For architectural context, see our Fabric lakehouse guide.
Why Medallion Architecture?
Problem Without Medallion:
- Data engineers overwrite source data (no audit trail)
- Multiple teams create conflicting transformation logic
- Reports break when source schemas change
- No data lineage or quality tracking
Benefits of Medallion:
- Immutable bronze - Preserve original data for compliance and debugging
- Incremental processing - Only transform new/changed data (cost savings)
- Data quality enforcement - Validate at the silver layer before reports consume the data
- Clear ownership - Data engineers own bronze/silver, analysts own gold
- Reusability - Multiple gold datasets derive from the same silver layer
Bronze Layer: Raw Data Ingestion
Purpose and Characteristics
Bronze Layer Goals:
- Ingest data from source systems with minimal transformation
- Preserve original structure, data types, and values
- Support full historical reload if needed
- Enable auditing and compliance (immutable, append-only)
Storage Format: Delta Lake (ACID transactions, time travel)
Schema Strategy: Schema-on-read (flexible, accommodates source changes)
Partitioning: By ingestion date for lifecycle management
- Example: /bronze/sales/year=2026/month=01/day=27/
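To illustrate why this layout helps, here is a minimal sketch (assuming a Delta table at Tables/bronze_sales_transactions partitioned by an _ingestion_date column, as in the ingestion notebook later in this section) that reloads a single ingestion day and uses Delta time travel to inspect an earlier version:

from pyspark.sql.functions import col

# Read only one ingestion day from bronze (partition pruning keeps the scan small)
df_day = spark.read.format("delta") \
    .load("Tables/bronze_sales_transactions") \
    .filter(col("_ingestion_date") == "2026-01-27")

# Delta time travel: inspect the table as it looked at an earlier version
df_previous = spark.read.format("delta") \
    .option("versionAsOf", 0) \
    .load("Tables/bronze_sales_transactions")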
Bronze Layer Implementation
Step 1: Create the Bronze Lakehouse
Create a lakehouse named bronze_erp in your Fabric workspace (via the workspace UI or the Fabric REST API).
Step 2: Ingest with Data Pipeline
- Source: SQL Server (production ERP database)
- Destination: bronze_erp.sales_transactions
- Format: Delta
- Mode: Append (incremental daily loads)
- Metadata Columns: _ingestion_timestamp, _source_file, _pipeline_run_id
Step 3: Add Audit Columns
ALTER TABLE bronze_erp.sales_transactions
ADD COLUMNS (_ingestion_timestamp TIMESTAMP, _source_system STRING)
Bronze Layer Best Practices
✅ Never delete bronze data - Archive to cold storage instead
✅ Partition by ingestion date - year=2026/month=01/day=27
✅ Capture full audit trail - Who, what, when for every row
✅ Handle schema evolution - Use permissive schema merge mode
✅ Validate source connection - Log failures, do not skip data
Anti-Pattern: Transforming data in bronze layer (defer to silver)
Example: Bronze Sales Data
Source: ERP system exports CSV to Azure Blob Storage daily
Bronze Ingestion Notebook (PySpark):
from pyspark.sql.functions import current_timestamp, to_date, lit

df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("abfss://raw-data/sales/2026-01-27.csv")

df_with_audit = df \
    .withColumn("_ingestion_timestamp", current_timestamp()) \
    .withColumn("_ingestion_date", to_date(current_timestamp())) \
    .withColumn("_source_system", lit("ERP_PROD")) \
    .withColumn("_source_file", lit("2026-01-27.csv"))

df_with_audit.write.format("delta") \
    .mode("append") \
    .partitionBy("_ingestion_date") \
    .save("Tables/bronze_sales_transactions")
Result: Bronze layer contains exact copy of CSV with audit metadata.
Silver Layer: Cleansed and Standardized Data
Purpose and Characteristics
Silver Layer Goals:
- Validate data quality (reject invalid rows or fix errors)
- Standardize formats (dates, currencies, naming conventions)
- Conform schemas (align column names across sources)
- Apply business rules (derive calculated fields)
- De-duplicate records
Storage Format: Delta Lake (optimized with ZORDER)
Schema Strategy: Schema-on-write (enforced schema for consistency)
Partitioning: By business date (order_date, transaction_date)
- Example: /silver/sales/year=2026/month=01/
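A minimal sketch of schema-on-write enforcement (assuming a df_bronze DataFrame read from the bronze table, as in the validation example below, and the silver column names used later in this article): cast columns to the target types before writing so Delta's default schema enforcement rejects anything that does not conform.

from pyspark.sql.functions import col

# Cast to the enforced silver schema before writing; Delta rejects mismatched writes by default
df_silver_ready = df_bronze.select(
    col("order_id").cast("string"),
    col("customer_id").cast("string"),
    col("order_date").cast("date"),
    col("amount").cast("decimal(18,2)"),
    col("currency").cast("string"),
)

df_silver_ready.write.format("delta") \
    .mode("append") \
    .save("Tables/silver_sales_transactions")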
Silver Layer Transformation Patterns
Pattern 1: Data Quality Checks
Validate that:
- Required fields are not null
- Dates are in valid range (not future dates for historical data)
- Amounts are positive (where business rules require)
- Foreign keys exist in dimension tables
Example Validation:
from pyspark.sql.functions import col, current_date

df_bronze = spark.read.format("delta").load("Tables/bronze_sales_transactions")

# valid_customers is assumed to be a list of customer IDs loaded from the customer dimension
df_validated = df_bronze \
    .filter(col("order_id").isNotNull()) \
    .filter(col("order_date") <= current_date()) \
    .filter(col("amount") > 0) \
    .filter(col("customer_id").isin(valid_customers))

df_validated.write.format("delta") \
    .mode("overwrite") \
    .option("mergeSchema", "true") \
    .save("Tables/silver_sales_transactions")
Invalid Records: Log to silver_quality_exceptions table for review.
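One hedged way to capture those rejects (a sketch, assuming the df_bronze and df_validated DataFrames from the example above and a silver_quality_exceptions Delta table):

from pyspark.sql.functions import current_timestamp, lit

# Rows present in bronze but filtered out by the validation rules above
df_rejected = df_bronze.join(df_validated, on="order_id", how="left_anti")

df_rejected \
    .withColumn("_rejected_at", current_timestamp()) \
    .withColumn("_reason", lit("failed silver validation rules")) \
    .write.format("delta") \
    .mode("append") \
    .save("Tables/silver_quality_exceptions")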
Pattern 2: Standardization
Example: Standardize country codes
- Bronze: "USA", "US", "United States"
- Silver: "US" (ISO 3166 standard)
Implementation:
from pyspark.sql.functions import col, when

df_standardized = df_validated.withColumn(
    "country_code",
    when(col("country").isin("USA", "United States"), "US")
    .when(col("country") == "Canada", "CA")
    .otherwise(col("country"))
)
Pattern 3: De-Duplication
Example: Remove duplicate orders (same order_id, different ingestion times)
df_deduped = df_standardized.dropDuplicates(["order_id"])
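dropDuplicates keeps an arbitrary row per order_id. If you want to keep the most recently ingested version instead, a sketch (assuming the _ingestion_timestamp audit column from bronze is still present):

from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

# Keep the latest ingested row per order_id
w = Window.partitionBy("order_id").orderBy(col("_ingestion_timestamp").desc())

df_deduped = df_standardized \
    .withColumn("_rn", row_number().over(w)) \
    .filter(col("_rn") == 1) \
    .drop("_rn")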
Pattern 4: Slowly Changing Dimensions (SCD Type 2)
Example: Track customer address changes over time
from pyspark.sql.functions import col, lit

df_customer_history = df_customer \
    .withColumn("valid_from", col("_ingestion_timestamp")) \
    .withColumn("valid_to", lit("9999-12-31").cast("timestamp")) \
    .withColumn("is_current", lit(True))
Use Delta Lake MERGE to handle updates while preserving history.
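A minimal sketch of that two-step MERGE, assuming a silver_customers_history Delta table, an address column as the tracked attribute, and the df_customer_history DataFrame built above (expire the old current row, then append the new version):

from delta.tables import DeltaTable
from pyspark.sql.functions import col

history_path = "Tables/silver_customers_history"
history = DeltaTable.forPath(spark, history_path)

# Current snapshot of the dimension history
current_rows = history.toDF().filter(col("is_current") == True)

# Incoming rows that are brand-new customers or have a changed address
changed_or_new = df_customer_history.join(
    current_rows.select("customer_id", col("address").alias("current_address")),
    on="customer_id",
    how="left",
).filter(
    col("current_address").isNull() | (col("current_address") != col("address"))
).drop("current_address")

# Step 1: expire the old current row for customers whose address changed
history.alias("t").merge(
    changed_or_new.alias("s"),
    "t.customer_id = s.customer_id AND t.is_current = true"
).whenMatchedUpdate(
    set={"valid_to": "s.valid_from", "is_current": "false"}
).execute()

# Step 2: append the new current versions
changed_or_new.write.format("delta").mode("append").save(history_path)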
For SCD patterns, see our Power BI row-level security guide.
Silver Layer Best Practices
✅ Document data quality rules - What validations are applied
✅ Log rejected records - Do not silently drop data
✅ Use standardized column names - snake_case, descriptive
✅ Apply business logic - Calculate derived fields (profit = revenue - cost)
✅ Optimize with ZORDER - Cluster by frequently filtered columns
Anti-Pattern: Aggregating data in silver (defer to gold)
Example: Silver Sales Transformation
Objective: Cleanse bronze sales data, standardize currencies, validate customers
Silver Transformation Notebook:
from pyspark.sql.functions import col, when, lit, current_date
from delta.tables import DeltaTable

# Read bronze
df_bronze = spark.read.format("delta").load("Tables/bronze_sales_transactions")

# Validate
df_valid = df_bronze \
    .filter(col("order_id").isNotNull()) \
    .filter(col("order_date") <= current_date()) \
    .filter(col("amount") > 0)

# Standardize currency (hard-coded example rates)
df_standardized = df_valid \
    .withColumn(
        "amount_usd",
        when(col("currency") == "EUR", col("amount") * 1.10)
        .when(col("currency") == "GBP", col("amount") * 1.27)
        .otherwise(col("amount"))
    ) \
    .withColumn("currency", lit("USD"))

# Write to silver (incremental merge)
silver_table = DeltaTable.forPath(spark, "Tables/silver_sales_transactions")

silver_table.alias("target").merge(
    df_standardized.alias("source"),
    "target.order_id = source.order_id"
).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()
Result: Silver layer contains validated, standardized sales data.
Gold Layer: Business-Ready Analytics
Purpose and Characteristics
Gold Layer Goals:
- Aggregate data for specific business questions
- Pre-join dimension tables (star schema)
- Optimize for Power BI consumption (fast queries)
- Apply final business logic (KPIs, metrics)
Storage Format: Delta Lake or Fabric Warehouse (based on use case)
Schema Strategy: Strict schema (star/snowflake for BI)
Partitioning: Minimal (gold tables are aggregated, smaller size)
Gold Layer Aggregation Patterns
Pattern 1: Time-Series Aggregations
Example: Daily sales summary (for trending dashboards)
df_daily_sales = spark.sql("""
    SELECT
        order_date,
        product_category,
        SUM(amount_usd) AS total_sales,
        COUNT(DISTINCT order_id) AS order_count,
        COUNT(DISTINCT customer_id) AS unique_customers
    FROM silver_sales_transactions
    GROUP BY order_date, product_category
""")

df_daily_sales.write.format("delta") \
    .mode("overwrite") \
    .partitionBy("order_date") \
    .save("Tables/gold_daily_sales_summary")
Power BI Consumption: DirectQuery or Import mode (fast aggregations)
Pattern 2: Star Schema for BI
Fact Table: gold_fact_sales
Dimension Tables: gold_dim_customer, gold_dim_product, gold_dim_date
Example: Join sales with dimensions in gold layer
df_gold_sales = spark.sql("""
    SELECT
        s.order_id,
        s.order_date,
        c.customer_name,
        c.customer_segment,
        p.product_name,
        p.product_category,
        s.amount_usd,
        s.quantity
    FROM silver_sales_transactions s
    INNER JOIN silver_customers c ON s.customer_id = c.customer_id
    INNER JOIN silver_products p ON s.product_id = p.product_id
""")

df_gold_sales.write.format("delta") \
    .mode("overwrite") \
    .save("Tables/gold_fact_sales")
Power BI Model: Import gold tables, define relationships, create measures.
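The dimension list above includes gold_dim_date; if you do not already have one, here is a minimal sketch (Spark SQL, hypothetical date range) for generating it:

# Generate a simple date dimension covering 2020-2030
df_dim_date = spark.sql("""
    SELECT
        d AS date_key,
        year(d) AS year,
        month(d) AS month,
        day(d) AS day,
        date_format(d, 'EEEE') AS day_name,
        quarter(d) AS quarter
    FROM (
        SELECT explode(sequence(to_date('2020-01-01'), to_date('2030-12-31'), interval 1 day)) AS d
    ) AS dates
""")

df_dim_date.write.format("delta") \
    .mode("overwrite") \
    .save("Tables/gold_dim_date")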
For BI optimization, see Power BI performance tuning.
Pattern 3: Business KPIs
Example: Customer lifetime value (CLV)
df_clv = spark.sql("""
    SELECT
        customer_id,
        customer_name,
        SUM(amount_usd) AS total_spent,
        COUNT(DISTINCT order_id) AS order_count,
        DATEDIFF(MAX(order_date), MIN(order_date)) AS customer_tenure_days,
        SUM(amount_usd) / COUNT(DISTINCT order_id) AS avg_order_value
    FROM gold_fact_sales
    GROUP BY customer_id, customer_name
""")

df_clv.write.format("delta") \
    .mode("overwrite") \
    .save("Tables/gold_customer_lifetime_value")
Power BI Report: "Top 100 Customers by CLV"
Gold Layer Best Practices
✅ Optimize for consumption - Pre-aggregate, pre-join
✅ Document business logic - How KPIs are calculated
✅ Version gold datasets - Track schema changes with Delta time travel
✅ Test with Power BI - Ensure query performance meets SLAs
✅ Use Warehouse for high-concurrency - If 100+ users query simultaneously
Anti-Pattern: Querying bronze/silver directly from Power BI (slow, unoptimized)
Example: Gold Customer Analytics
Objective: Create customer segmentation dataset for marketing dashboards
Gold Aggregation Notebook:
df_customer_segmentation = spark.sql("""
    SELECT
        c.customer_id,
        c.customer_name,
        c.customer_segment,
        COUNT(DISTINCT s.order_id) AS lifetime_orders,
        SUM(s.amount_usd) AS lifetime_value,
        AVG(s.amount_usd) AS avg_order_value,
        DATEDIFF(CURRENT_DATE, MAX(s.order_date)) AS days_since_last_order,
        CASE
            WHEN SUM(s.amount_usd) > 10000 THEN 'VIP'
            WHEN SUM(s.amount_usd) > 5000 THEN 'High Value'
            WHEN SUM(s.amount_usd) > 1000 THEN 'Medium Value'
            ELSE 'Low Value'
        END AS value_tier
    FROM silver_customers c
    LEFT JOIN silver_sales_transactions s ON c.customer_id = s.customer_id
    GROUP BY c.customer_id, c.customer_name, c.customer_segment
""")

df_customer_segmentation.write.format("delta") \
    .mode("overwrite") \
    .save("Tables/gold_customer_segmentation")
Power BI Semantic Model: Import gold_customer_segmentation, create "Customer Segmentation Dashboard"
Incremental Processing: Bronze → Silver → Gold
Challenge: Processing Only New Data
Problem: A full refresh of a 10 TB dataset takes 8 hours and wastes compute
Solution: Incremental processing (process only new/changed rows)
Bronze to Silver Incremental Pattern
Step 1: Track last processed watermark
CREATE TABLE silver_watermarks (
    table_name STRING,
    last_processed_timestamp TIMESTAMP
)
Step 2: Read only new bronze data
last_watermark = spark.sql("""
    SELECT last_processed_timestamp
    FROM silver_watermarks
    WHERE table_name = 'sales_transactions'
""").first()[0]

df_new_bronze = spark.read.format("delta") \
    .load("Tables/bronze_sales_transactions") \
    .filter(col("_ingestion_timestamp") > last_watermark)

Step 3: Transform and merge to silver
df_transformed = transform_bronze_to_silver(df_new_bronze)

silver_table = DeltaTable.forPath(spark, "Tables/silver_sales_transactions")
silver_table.alias("target").merge(
    df_transformed.alias("source"),
    "target.order_id = source.order_id"
).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()

Step 4: Update watermark
# Highest ingestion timestamp processed in this run
max_timestamp = df_new_bronze.agg({"_ingestion_timestamp": "max"}).first()[0]

spark.sql(f"""
    UPDATE silver_watermarks
    SET last_processed_timestamp = '{max_timestamp}'
    WHERE table_name = 'sales_transactions'
""")
Performance Improvement: 8 hours → 15 minutes (process only today's data)
Silver to Gold Incremental Pattern
Approach 1: Incremental Aggregation
For additive metrics (SUM, COUNT):
df_new_aggregates = spark.sql("""
    SELECT
        order_date,
        SUM(amount_usd) AS daily_sales
    FROM silver_sales_transactions
    WHERE order_date >= CURRENT_DATE - INTERVAL 7 DAYS
    GROUP BY order_date
""")
gold_table = DeltaTable.forPath(spark, "Tables/gold_daily_sales")

# The query above recomputes complete daily totals for the last 7 days,
# so matched rows are replaced (adding them again would double count)
gold_table.alias("target").merge(
    df_new_aggregates.alias("source"),
    "target.order_date = source.order_date"
).whenMatchedUpdate(set={"daily_sales": "source.daily_sales"}) \
 .whenNotMatchedInsertAll().execute()
Approach 2: Full Refresh for Non-Additive Metrics
For AVG, MEDIAN, percentiles: Re-calculate from silver (cannot incrementally update)
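A sketch of that full recompute (assuming the silver table above; gold_daily_order_stats and the metric names are hypothetical):

# Non-additive metrics are recomputed from silver and the gold table is overwritten
df_daily_stats = spark.sql("""
    SELECT
        order_date,
        AVG(amount_usd) AS avg_order_value,
        PERCENTILE_APPROX(amount_usd, 0.5) AS median_order_value
    FROM silver_sales_transactions
    GROUP BY order_date
""")

df_daily_stats.write.format("delta") \
    .mode("overwrite") \
    .save("Tables/gold_daily_order_stats")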
Data Quality and Validation
Data Quality Framework
Quality Dimensions:
1. Completeness - No missing required fields
2. Accuracy - Values match source system
3. Consistency - Same data in different tables matches
4. Timeliness - Data available within SLA
5. Validity - Values conform to business rules
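The completeness dimension, for example, can be measured directly rather than just checked pass/fail. A sketch (assuming the silver sales table; gold_data_quality_metrics is a hypothetical metrics table):

from pyspark.sql.functions import col, count, when, current_timestamp, lit

df = spark.read.format("delta").load("Tables/silver_sales_transactions")
total = df.count()

# Percentage of non-null values for each required field
completeness = df.select([
    (count(when(col(c).isNotNull(), lit(1))) / total * 100).alias(f"{c}_pct_complete")
    for c in ["order_id", "customer_id", "order_date", "amount"]
])

completeness \
    .withColumn("measured_at", current_timestamp()) \
    .withColumn("table_name", lit("silver_sales_transactions")) \
    .write.format("delta").mode("append").save("Tables/gold_data_quality_metrics")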
Implementing Data Quality Checks
Bronze Quality: Minimal (check file arrival, row count)
row_count = df_bronze.count()
if row_count == 0:
    raise Exception("Bronze load failed: zero rows ingested")
Silver Quality: Comprehensive validation
from pyspark.sql.functions import col, current_date

# log_quality_issue is a helper (defined elsewhere) that writes offending rows to an exceptions table
def validate_silver_sales(df):
    # Rule 1: No nulls in key fields
    null_checks = df.filter(col("order_id").isNull() | col("customer_id").isNull())
    if null_checks.count() > 0:
        log_quality_issue("Null key fields", null_checks)

    # Rule 2: Amounts must be positive
    negative_amounts = df.filter(col("amount") <= 0)
    if negative_amounts.count() > 0:
        log_quality_issue("Negative amounts", negative_amounts)

    # Rule 3: Dates in valid range
    invalid_dates = df.filter((col("order_date") < '2020-01-01') | (col("order_date") > current_date()))
    if invalid_dates.count() > 0:
        log_quality_issue("Invalid dates", invalid_dates)

validate_silver_sales(df_silver)
Gold Quality: Business rule validation
# Rule: Total sales should not decrease month-over-month by more than 20%
df_monthly_sales = spark.sql("""
    SELECT month, SUM(amount_usd) AS total_sales
    FROM gold_daily_sales_summary
    GROUP BY month
    ORDER BY month
""")

# Collect the (small) monthly summary to the driver before comparing consecutive months
monthly = df_monthly_sales.collect()
for current, previous in zip(monthly[1:], monthly[:-1]):
    if (current.total_sales / previous.total_sales) < 0.8:
        alert("Anomaly detected: 20%+ drop in monthly sales")  # alert is a notification helper defined elsewhere
For governance integration, see Fabric data governance.
Orchestration with Data Pipelines
End-to-End Medallion Pipeline
Pipeline Steps:
1. Bronze Ingestion - Copy from source systems to bronze layer
2. Silver Transformation - Notebook: bronze_to_silver.py
3. Gold Aggregation - Notebook: silver_to_gold.py
4. Data Quality Validation - Notebook: validate_quality.py
5. Power BI Refresh - Trigger semantic model refresh
Fabric Data Pipeline (JSON):
{
  "name": "MedallionPipeline_Sales",
  "activities": [
    {
      "name": "IngestToBronze",
      "type": "Copy",
      "source": "SQL_ERP",
      "sink": "bronze_sales"
    },
    {
      "name": "TransformToSilver",
      "type": "Notebook",
      "notebookPath": "bronze_to_silver",
      "dependsOn": ["IngestToBronze"]
    },
    {
      "name": "AggregateToGold",
      "type": "Notebook",
      "notebookPath": "silver_to_gold",
      "dependsOn": ["TransformToSilver"]
    },
    {
      "name": "ValidateQuality",
      "type": "Notebook",
      "notebookPath": "validate_quality",
      "dependsOn": ["AggregateToGold"]
    },
    {
      "name": "RefreshPowerBI",
      "type": "PowerBIRefresh",
      "datasetId": "sales_dataset_id",
      "dependsOn": ["ValidateQuality"]
    }
  ],
  "triggers": [
    {
      "name": "DailySchedule",
      "type": "Schedule",
      "recurrence": {
        "frequency": "Day",
        "interval": 1,
        "startTime": "06:00:00"
      }
    }
  ]
}
Trigger: Daily at 6:00 AM (process previous day's data)
Monitoring and Observability
Key Metrics to Monitor
Bronze Layer:
- Ingestion success rate (99.9% target)
- Row count by source system
- File arrival time (SLA: within 30 minutes of source export)

Silver Layer:
- Data quality rule pass rate (95% target)
- Transformation duration (SLA: < 30 minutes)
- Rejected row count (alert if > 5%)

Gold Layer:
- Aggregation duration (SLA: < 15 minutes)
- Power BI refresh success rate (99% target)
- Query performance (P95 < 2 seconds)
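One lightweight way to track metrics like these is to log a row per pipeline run into a monitoring Delta table. A sketch (the pipeline_monitoring table and the recorded values are hypothetical):

from pyspark.sql.functions import current_timestamp, lit

# Row count of the bronze table after this run's load
bronze_rows = spark.read.format("delta").load("Tables/bronze_sales_transactions").count()

metrics = spark.createDataFrame(
    [("bronze", "sales_transactions", bronze_rows, "success")],
    ["layer", "table_name", "row_count", "status"],
)

metrics \
    .withColumn("run_timestamp", current_timestamp()) \
    .withColumn("pipeline_name", lit("MedallionPipeline_Sales")) \
    .write.format("delta") \
    .mode("append") \
    .save("Tables/pipeline_monitoring")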
Fabric Capacity Metrics Integration
Dashboard: Track medallion pipeline CU consumption
import matplotlib.pyplot as plt

# Sample data
layers = ['Bronze Ingest', 'Silver Transform', 'Gold Aggregate']
cu_consumption = [150, 500, 200]

plt.bar(layers, cu_consumption)
plt.ylabel('CU-Seconds')
plt.title('Medallion Pipeline CU Consumption')
plt.show()
Cost Optimization: If silver transformation consumes 60% of CUs, optimize Spark jobs or use smaller capacity.
For capacity planning, see Fabric SKU sizing.
Common Challenges and Solutions
Challenge 1: Schema Evolution in Bronze
Problem: Source system adds a new column, breaking the silver transformation
Solution: Use permissive schema merge in Delta Lake

df.write.format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .save("Tables/bronze_sales")
Challenge 2: Late-Arriving Data
Problem: Yesterday's transaction arrives today (missed in silver processing)
Solution: Re-process the last 7 days of bronze data daily (window processing)

from pyspark.sql.functions import col, current_date, date_sub

df_bronze_window = spark.read.format("delta") \
    .load("Tables/bronze_sales") \
    .filter(col("order_date") >= date_sub(current_date(), 7))
Challenge 3: Gold Layer Explosion
Problem: Creating 100+ gold tables for different reports (management overhead)
Solution: Consolidate into fewer, reusable gold datasets (star schema)
Anti-Pattern: One gold table per Power BI report
Best Practice: Shared gold fact/dimension tables for multiple reports
Challenge 4: Performance Degradation Over Time
Problem: Silver transformation takes 2 hours (used to take 30 minutes)
Solution: Optimize Delta tables regularly

OPTIMIZE silver_sales_transactions ZORDER BY (customer_id, order_date)
VACUUM silver_sales_transactions RETAIN 168 HOURS
Run weekly via scheduled notebook.
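A sketch of such a maintenance notebook (the table list is illustrative; adjust retention to your compliance needs):

# Weekly Delta maintenance across silver and gold tables
tables_to_maintain = [
    "silver_sales_transactions",
    "silver_customers",
    "gold_fact_sales",
    "gold_daily_sales_summary",
]

for table in tables_to_maintain:
    # Compact small files to improve scan performance
    spark.sql(f"OPTIMIZE {table}")
    # Remove unreferenced files older than 7 days (168 hours)
    spark.sql(f"VACUUM {table} RETAIN 168 HOURS")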
Conclusion
Medallion architecture is the foundation of modern lakehouse data engineering. Key takeaways:
Bronze Layer:
- Preserve raw data immutably
- Partition by ingestion date
- Add audit columns for lineage

Silver Layer:
- Validate data quality rigorously
- Standardize formats and schemas
- De-duplicate and conform data

Gold Layer:
- Aggregate for specific business needs
- Optimize for Power BI consumption
- Pre-join dimensions (star schema)

Incremental Processing:
- Process only new/changed data
- Track watermarks for each layer
- Reduce costs by 70-90%
Organizations implementing medallion architecture achieve:
- 50% faster data pipelines (incremental processing)
- 90% fewer data quality incidents (validation at silver)
- 30% lower infrastructure costs (optimized processing)
- Higher analyst productivity (clean, trusted gold data)
Ready to build production-grade medallion pipelines? Contact our data engineering team for implementation services.
Frequently Asked Questions
Do I need all three layers (Bronze, Silver, Gold) or can I skip one?
While you can technically skip layers, we recommend implementing all three for production systems.
You might skip Bronze if: source systems provide pre-validated data (rare), you do not need audit trails or compliance (risky), or storage cost is critical (short-sighted).
You might skip Silver if: source data is already clean (unlikely for real-world systems), no standardization is needed (different sources rarely align), or you do not need reusable transformations (limits scalability).
You might skip Gold if: analysts can query Silver directly (slower for complex BI), user counts are low (fewer than 10 users), or reporting needs are simple (no aggregations or star schema).
Best practice: Implement all three layers. Use Bronze for immutability and audit, Silver for quality and standardization, and Gold for performance and business logic. The upfront investment pays off in maintainability, scalability, and data trust.
Should I use Lakehouse or Warehouse for Gold layer?
The choice depends on consumption patterns.
Use Lakehouse for Gold if: you have fewer than 50 concurrent users, workloads are exploratory analytics and data science, cost optimization is the priority, and you need schema flexibility.
Use Warehouse for Gold if: you have more than 50 concurrent users, production BI dashboards require sub-second queries, your analyst team is SQL-centric, and you need frequent updates/deletes.
Hybrid approach (recommended for large enterprises): store Gold aggregated fact tables in Lakehouse (cost-effective), mirror them to Warehouse using shortcuts for high-concurrency BI, use Warehouse for small dimension tables (customer, product), and keep large historical data in Lakehouse.
Our architecture team designs optimal Gold layer storage based on your specific workload patterns and user requirements.
How often should I run OPTIMIZE and VACUUM on Delta tables?
Recommended schedules: OPTIMIZE: Daily for active tables with frequent writes (Bronze, Silver incremental loads), weekly for stable tables with infrequent updates (Gold, dimensions), after large batch loads (millions of rows). VACUUM: Weekly for all layers (removes old versions, reduces storage costs), retain 7 days minimum (allows time travel for debugging), retain 30 days for compliance-heavy industries (audit requirements). Performance impact: OPTIMIZE improves query performance 2-10x by compacting small files and clustering data, VACUUM reduces storage costs 10-30% by deleting unused files. Automation: Schedule notebooks via Fabric Data Pipeline to run OPTIMIZE and VACUUM during off-peak hours (overnight). Monitor Fabric Capacity Metrics to ensure optimization jobs do not interfere with business hours workloads.