Data quality lifecycle
The data quality lifecycle has rarely been achievable in a single tool due to the runtime constraints of traditional ETL vendors.
One library for end-to-end ingestion and transformation, with data quality and lineage built in
dlt is an open source, Pythonic ingestion library; dltHub is a commercial addition that extends dlt into transformation and other areas of the data stack.
Because dlt and dltHub together span the entire pipeline, from ingestion through a portable staging layer and into transformation, they uniquely bridge these gaps.
Instead of stitching together four or five separate tools, you write Python code that works across the entire pipeline: no glue scripts, no context lost between systems, and end-to-end lineage and metadata throughout.

The three checkpoints for data quality:
- In-flight: Check individual records as data is extracted, before loading it.
- Staging: Optionally load the data into a (possibly transient) staging area where it can be tested without breaking production.
- Destination: Check properties of the full dataset currently written to the destination.
The five pillars of data quality
dlt addresses quality across five core dimensions, offering support for implementing these checks across the entire data lifecycle.
- Structural Integrity: Does the data fit the destination schema and types?
- Semantic Validity: Does the data make business sense?
- Uniqueness & Relations: Is the dataset consistent with itself?
- Privacy & Governance: Is the data safe and compliant?
- Operational Health: Is the pipeline running correctly?

1. Structural Integrity
Does the data fit the destination schema and types?
These checks ensure incoming data conforms to the expected shape and technical types before loading, preventing broken pipelines and "garbage" tables.
| Job to be Done | dlt Solution | Learn More | Availability |
|---|---|---|---|
| Prevent unexpected columns | Schema Contracts (Frozen Mode): Set your schema to frozen to raise an immediate error if the source API adds an undocumented field. | Schema Contracts | dlt |
| Enforce data types | Type Coercion: dlt automatically coerces compatible types (e.g., string "100" to int 100) and rejects non-coercible values to ensure column consistency. | Schema | dlt |
| Fix naming errors | Normalization: dlt automatically cleans table and column names (converting to snake_case) to prevent SQL syntax errors in the destination. | Naming Convention | dlt |
| Enforce required fields | Nullability Constraints: Mark fields as nullable=False in your resource hints to drop or error on records missing critical keys. | Resource | dlt |
2. Semantic Validity
Does the data make business sense?
These checks verify the content of the data against business logic. While structural checks handle types (is it a number?), semantic checks handle meaning (is it a valid age?).
| Job to be Done | dlt Solution | Learn More | Availability |
|---|---|---|---|
| Validate logic & ranges | Pydantic Models: Attach Pydantic models to your resources to enforce logic like age > 0 or email format validation in-stream. | Schema Contracts | dlt |
| Filter bad rows | add_filter: Apply a predicate function to exclude records that don't meet criteria (e.g., lambda x: x["status"] != "deleted"). | Transform with add_map | dlt |
| Check batch anomalies | Staging Tests: Use the portable runtime (e.g., Ibis/DuckDB) to query the staging buffer. Example: "Alert if the average order value in this batch is > $10k." | Staging | dlt |
| Built-in data checks | Data Quality Checks: Use built-in checks like is_in(), is_unique(), is_primary_key() with pre-load or post-load execution, plus actions on failure (drop, quarantine, alert). | Data Quality | dlthub |
3. Uniqueness & Relations
Is the dataset consistent with itself?
These checks manage duplication and preserve relationships between different tables in your dataset.
| Job to be Done | dlt Solution | Learn More | Availability |
|---|---|---|---|
| Prevent duplicates | Merge Disposition: Define primary_key and write_disposition='merge' to automatically upsert records. dlt handles the deduping logic for you. | Incremental Loading | dlt |
| Track historical changes | SCD2 Strategy: Use write_disposition={"disposition": "merge", "strategy": "scd2"} to automatically maintain validity windows (_dlt_valid_from, _dlt_valid_to) for dimension tables. | Merge Loading | dlt |
| Link parent/child data | Automatic Lineage: dlt automatically generates foreign keys (_dlt_parent_id) when unnesting complex JSON, preserving the link between parent and child tables. | Destination Tables | dlt |
| Find orphan keys | Post-Load Assertions: Run SQL tests on the destination to identify child records missing a valid parent (e.g., Orders without Customers). | SQL Transformations | dlt |
4. Privacy & Governance
Is the data safe and compliant?
Data quality also means compliance. These features ensure sensitive data is handled correctly before it becomes a liability in your warehouse.
| Job to be Done | dlt Solution | Learn More | Availability |
|---|---|---|---|
| Mask/Hash PII | Transformation Hooks: Use add_map to hash emails or redact names in-stream. Data is sanitized in memory before it ever touches the disk. | Pseudonymizing Columns | dlt |
| Drop sensitive columns | Column Removal: Use add_map to completely remove columns (e.g., ssn, credit_card) before they ever reach the destination. | Removing Columns | dlt |
| Enforce PII Contracts | Pydantic Models: Use Pydantic schemas to strictly define and detect sensitive fields (e.g., EmailStr), ensuring they are caught and hashed before loading. | Schema Contracts | dlt |
| Join on private data | Deterministic Hashing: Use a secret salt via dlt.secrets to deterministically hash IDs, allowing you to join tables on "User ID" without exposing the actual user identity. | Credentials Setup | dlt |
| Track PII through transformations | Column-Level Hint Forwarding: PII hints (e.g., x-annotation-pii) are automatically propagated through SQL transformations, so downstream tables retain knowledge of sensitive origins. | Transformations | dlthub |