What is Medallion Architecture?
Medallion architecture is a data design pattern used to logically organize data in a lakehouse, with the goal of incrementally improving the structure and quality of data as it flows through each layer (Bronze → Silver → Gold).
Bronze Layer (Raw)
Raw data ingested from source systems with minimal transformation. Preserves original data for auditing and reprocessing.
- Purpose: Data ingestion and historical archive
- Format: As-is from source (JSON, CSV, Parquet, Avro)
- Schema: Schema-on-read, flexible structure
- Quality: No validation, may contain duplicates/errors
- Users: Data engineers, data scientists (exploratory)
Silver Layer (Refined)
Cleaned, validated, and enriched data. Standardized formats with quality checks applied.
- Purpose: Data quality and standardization
- Format: Standardized (Delta Lake, Iceberg, Hudi)
- Schema: Enforced schema with data types
- Quality: Validated, deduplicated, null handling
- Users: Data scientists, ML engineers, analysts
Gold Layer (Curated)
Business-ready aggregates, features, and metrics optimized for consumption by BI tools and applications.
- Purpose: Business analytics and reporting
- Format: Optimized tables (star schema, denormalized)
- Schema: Business-friendly column names and structure
- Quality: Production-grade, SLA-backed
- Users: Business analysts, executives, BI tools
4-Phase Implementation Process
Design Layer Structure
Days 1-3
- Map source systems to Bronze tables
- Define Silver layer transformations and quality rules
- Design Gold layer aggregates and business metrics
- Plan incremental processing strategy
Build Bronze Layer
Week 1
- Set up ingestion pipelines from source systems
- Implement change data capture (CDC) where needed
- Store raw data with metadata (ingestion timestamp, source)
- Validate data arrival and completeness
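The Bronze steps above can be sketched in plain Python. This is a minimal illustration, not a production ingestion pipeline: the function name `to_bronze`, the metadata column names `_source` and `_ingested_at`, and the sample payloads are all assumptions made for the example. The point it shows is that the raw payload is stored untouched, even when it is malformed or duplicated, and only metadata is added.

```python
from datetime import datetime, timezone

def to_bronze(raw_lines, source, ingested_at=None):
    """Wrap each raw payload with ingestion metadata, leaving the payload as-is."""
    # Hypothetical metadata columns: _source and _ingested_at.
    ingested_at = ingested_at or datetime.now(timezone.utc).isoformat()
    return [
        {"raw": line, "_source": source, "_ingested_at": ingested_at}
        for line in raw_lines
    ]

# Sample input: one duplicate and one malformed record, both kept in Bronze.
raw_lines = [
    '{"order_id": 1, "amount": "19.90"}',
    '{"order_id": 1, "amount": "19.90"}',
    "not-json",
]
bronze_orders = to_bronze(
    raw_lines, source="orders_api", ingested_at="2024-01-01T00:00:00+00:00"
)

# A simple arrival check: every input line produced exactly one Bronze row.
arrived_complete = len(bronze_orders) == len(raw_lines)
```

Parsing and validation are deliberately left to the Silver layer, so a bad record never blocks ingestion.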
Build Silver Layer
Weeks 2-3
- Apply data quality rules (deduplication, null handling)
- Standardize data types and formats
- Enrich data with lookups and joins
- Implement incremental processing (merge/upsert)
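As a rough sketch of the Silver steps above, the following plain-Python example standardizes types, fills a null with a default, and merges rows by key so that a later version of a record replaces the earlier one. The column names (`order_id`, `amount`, `country`), the `"UNKNOWN"` default, and the in-memory dict standing in for a Silver table are assumptions for illustration; on a lakehouse the same logic would normally be expressed as a merge against a Delta Lake, Iceberg, or Hudi table.

```python
from decimal import Decimal, InvalidOperation

def clean_order(row):
    """Return a typed Silver row, or None if the row cannot be standardized."""
    try:
        return {
            "order_id": int(row["order_id"]),
            "amount": Decimal(str(row["amount"])),
            # Null handling: a missing country becomes an explicit default.
            "country": (row.get("country") or "UNKNOWN").strip().upper(),
        }
    except (KeyError, TypeError, ValueError, InvalidOperation):
        return None

def upsert(silver, rows, key="order_id"):
    """Merge rows into the silver dict: update on key match, insert otherwise."""
    for row in rows:
        cleaned = clean_order(row)
        if cleaned is not None:
            silver[cleaned[key]] = cleaned
    return silver

silver_orders = {}
batch = [
    {"order_id": "1", "amount": "19.90", "country": " us "},
    {"order_id": "1", "amount": "21.00", "country": "us"},  # later version of order 1
    {"order_id": "2", "amount": "5.00", "country": None},   # null country
    {"order_id": "x", "amount": "oops"},                    # cannot be standardized
]
upsert(silver_orders, batch)
```

Keying the table on `order_id` gives deduplication and upsert in one step; rows that fail standardization are simply skipped here, though in practice they would be routed to a quarantine table.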
Build Gold Layer
Week 4
- Create business aggregates and metrics
- Build star schema or denormalized tables
- Optimize for BI tool performance (partitioning, indexing)
- Connect BI tools and validate reports
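A Gold aggregate of the kind described above might look like the following sketch, which pre-computes daily revenue and order counts per country from Silver rows. The table name `gold_daily_revenue`, the column names, and the sample rows are hypothetical; the sketch only illustrates the shape of a pre-aggregated, denormalized table.

```python
from collections import defaultdict

def build_gold_daily_revenue(silver_rows):
    """Aggregate Silver order rows into one Gold row per (order_date, country)."""
    totals = defaultdict(lambda: {"total_revenue": 0.0, "order_count": 0})
    for row in silver_rows:
        bucket = totals[(row["order_date"], row["country"])]
        bucket["total_revenue"] += row["amount"]
        bucket["order_count"] += 1
    # Sort by the grouping key so the output order is deterministic.
    return [
        {"order_date": d, "country": c, **metrics}
        for (d, c), metrics in sorted(totals.items())
    ]

silver_rows = [
    {"order_date": "2024-01-01", "country": "US", "amount": 10.0},
    {"order_date": "2024-01-01", "country": "US", "amount": 5.5},
    {"order_date": "2024-01-02", "country": "DE", "amount": 7.0},
]
gold_daily_revenue = build_gold_daily_revenue(silver_rows)
```

BI tools then read the small aggregated table instead of scanning every order.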
Medallion Architecture Best Practices
Use Incremental Processing
Process only new or changed data in each layer using watermarks, CDC, or merge operations. On large tables this can cut processing time substantially, often from hours to minutes, and it enables near-real-time analytics.
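A watermark is the simplest of these techniques to illustrate. The sketch below assumes each row carries an `updated_at` timestamp in a consistent ISO 8601 format (so string comparison matches time order); the function and field names are hypothetical.

```python
def process_incremental(rows, watermark):
    """Return rows changed after the watermark, plus the advanced watermark."""
    new_rows = [r for r in rows if r["updated_at"] > watermark]
    # If nothing is new, the watermark stays where it was.
    new_watermark = max((r["updated_at"] for r in new_rows), default=watermark)
    return new_rows, new_watermark

source_rows = [
    {"id": 1, "updated_at": "2024-01-01T00:00:00"},
    {"id": 2, "updated_at": "2024-01-02T00:00:00"},
    {"id": 3, "updated_at": "2024-01-03T00:00:00"},
]

# First run picks up everything after the stored watermark.
first_batch, watermark = process_incremental(source_rows, "2024-01-01T00:00:00")
# Second run with the advanced watermark finds nothing new to process.
second_batch, watermark_after = process_incremental(source_rows, watermark)
```

The watermark would normally be persisted between runs (for example in a small control table) so that each run resumes where the previous one stopped.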
Preserve Raw Data in Bronze
Treat Bronze as append-only: do not update or delete records except where your retention or compliance policy requires it. Bronze serves as your source of truth for reprocessing if Silver/Gold logic changes or data quality issues are discovered.
Implement Data Quality Checks
Add quality checks at the Silver layer for completeness, accuracy, consistency, uniqueness, validity, and timeliness. Quarantine bad records instead of failing the whole pipeline.
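The quarantine pattern can be sketched as a function that splits a batch into valid rows and rejected rows, recording why each rejected row failed. Only three of the listed dimensions are shown (completeness, uniqueness, validity), and the rules, column names, and `_failures` field are assumptions chosen for the example.

```python
def split_by_quality(rows):
    """Split rows into (valid, quarantined); quarantined rows list their failures."""
    valid, quarantined = [], []
    seen_ids = set()
    for row in rows:
        failures = []
        if row.get("order_id") is None:
            failures.append("completeness: order_id missing")
        elif row["order_id"] in seen_ids:
            failures.append("uniqueness: duplicate order_id")
        amount = row.get("amount")
        if not isinstance(amount, (int, float)) or amount < 0:
            failures.append("validity: amount must be a non-negative number")
        if failures:
            # Keep the bad row and the reasons so it can be inspected and replayed.
            quarantined.append({**row, "_failures": failures})
        else:
            seen_ids.add(row["order_id"])
            valid.append(row)
    return valid, quarantined

batch = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 1, "amount": 10.0},     # duplicate
    {"order_id": None, "amount": 3.0},   # missing key
    {"order_id": 2, "amount": -4.0},     # invalid amount
]
valid_rows, quarantined_rows = split_by_quality(batch)
```

The pipeline keeps running on the valid rows, while the quarantined rows remain available for review.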
Use Consistent Naming Conventions
Prefix tables with the layer name (bronze_*, silver_*, gold_*) and use descriptive names. This makes data lineage clear and helps avoid accidental cross-layer queries.
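A convention like this is easy to check automatically. The sketch below assumes lowercase snake_case names after the layer prefix, which is one possible rule rather than a requirement of the pattern; the helper name `layer_of` is hypothetical.

```python
import re

# Assumed convention: <layer>_<lowercase snake_case name>.
LAYER_TABLE = re.compile(r"(bronze|silver|gold)_[a-z][a-z0-9_]*")

def layer_of(table_name):
    """Return the layer prefix of a table name, or None if it breaks the convention."""
    match = LAYER_TABLE.fullmatch(table_name)
    return match.group(1) if match else None
```

For example, `layer_of("silver_orders")` yields `"silver"`, while `layer_of("orders_clean")` yields `None` and could be flagged in a CI check.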
Optimize Gold for Consumption
Denormalize Gold tables for BI tool performance. Pre-aggregate common metrics. Use partitioning and, where the table format supports it, data clustering such as Z-ordering for fast queries. Gold should be optimized for reads, not writes.
Track Data Lineage
Document transformations between layers. Use metadata tables to track source → Bronze → Silver → Gold lineage. This enables impact analysis and troubleshooting.
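One lightweight way to hold such metadata is a list of (source, target, transformation) edges, which is enough to answer impact-analysis questions such as "what is affected if this table changes?". The table names and transformation descriptions below are hypothetical examples.

```python
# Hypothetical lineage edges: (source, target, transformation summary).
LINEAGE = [
    ("orders_api", "bronze_orders", "raw ingest"),
    ("bronze_orders", "silver_orders", "deduplicate, cast types"),
    ("silver_orders", "gold_daily_revenue", "aggregate by day"),
    ("silver_orders", "gold_customer_ltv", "aggregate by customer"),
]

def downstream(table, edges=LINEAGE):
    """Return every table that depends on `table`, directly or indirectly."""
    found, frontier = [], [table]
    while frontier:
        current = frontier.pop()
        for src, dst, _ in edges:
            if src == current and dst not in found:
                found.append(dst)
                frontier.append(dst)
    return found

impacted = downstream("bronze_orders")
```

Here a change to `bronze_orders` is reported as affecting `silver_orders` and both Gold tables.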
People Also Ask
Do I need all three layers (Bronze, Silver, Gold)?
Not always. For simple use cases, you might skip Bronze and ingest directly to Silver. However, Bronze provides valuable benefits: data recovery if transformations fail, ability to reprocess with new logic, and audit trail of raw data. For production systems, all three layers are recommended.
Can I have multiple Gold layers for different use cases?
Yes. It's common to have multiple Gold layers optimized for different consumers: gold_bi for BI tools, gold_ml for machine learning features, gold_api for application APIs. Each can have different aggregation levels and optimization strategies.
How do I handle slowly changing dimensions (SCD) in medallion architecture?
Implement SCD in the Silver layer. For SCD Type 2 (historical tracking), add effective_date, end_date, and is_current columns. Bronze stores raw snapshots, Silver maintains history, and Gold can present either the current state or historical views depending on business needs.
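A minimal sketch of the Type 2 logic, using the effective_date, end_date, and is_current columns named above: when a tracked attribute changes, the current row is closed and a new current row is appended. The function name, the in-memory list standing in for the Silver table, and the customer example are assumptions for illustration.

```python
def apply_scd2(history, incoming, key, effective_date):
    """Close the current version of a changed record and append the new version."""
    current = next(
        (r for r in history if r[key] == incoming[key] and r["is_current"]), None
    )
    if current is not None:
        unchanged = all(current.get(k) == v for k, v in incoming.items())
        if unchanged:
            return history  # nothing to record
        # Close the old version as of the change date.
        current["end_date"] = effective_date
        current["is_current"] = False
    history.append(
        {**incoming, "effective_date": effective_date, "end_date": None, "is_current": True}
    )
    return history

silver_customers = []
apply_scd2(silver_customers, {"customer_id": 1, "city": "Oslo"}, "customer_id", "2024-01-01")
apply_scd2(silver_customers, {"customer_id": 1, "city": "Oslo"}, "customer_id", "2024-02-01")
apply_scd2(silver_customers, {"customer_id": 1, "city": "Bergen"}, "customer_id", "2024-03-01")
```

After these calls the table holds two rows for customer 1: the closed Oslo version and the current Bergen version. A Gold view of current state would filter on is_current.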
What's the difference between medallion and lambda architecture?
Lambda architecture separates batch and streaming into different paths (batch layer + speed layer + serving layer). Medallion architecture organizes data by quality layer instead, and on table formats such as Delta Lake that support both batch and streaming, both kinds of workload can flow through a single path (Bronze → Silver → Gold). This makes Medallion simpler to maintain, with one codebase instead of two.
How long should I retain data in each layer?
Typical starting points, to be adjusted to your compliance and cost requirements: Bronze: Retain indefinitely or per compliance requirements (for example, 7+ years). Silver: Retain 1-3 years for analysis and ML training. Gold: Retain 6-12 months for active reporting and archive older data. Use lifecycle policies to automatically move cold data to cheaper storage tiers.
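The retention guidance above can be expressed as a small policy table plus a function that decides what to do with a partition of a given age. The specific day counts pick the upper end of each range, and the action names `"keep"` and `"archive"` are assumptions for the sketch; real lifecycle rules are usually configured in the storage service itself.

```python
from datetime import date

# Maximum partition age in days before archiving; None means keep indefinitely.
RETENTION_DAYS = {"bronze": None, "silver": 3 * 365, "gold": 365}

def lifecycle_action(layer, partition_date, today):
    """Return "archive" if the partition is older than the layer's limit, else "keep"."""
    limit = RETENTION_DAYS[layer]
    if limit is None:
        return "keep"
    age_days = (today - partition_date).days
    return "archive" if age_days > limit else "keep"

today = date(2025, 1, 1)
gold_action = lifecycle_action("gold", date(2023, 6, 1), today)      # older than 365 days
silver_action = lifecycle_action("silver", date(2023, 6, 1), today)  # within 3 years
bronze_action = lifecycle_action("bronze", date(2010, 1, 1), today)  # no limit
```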