Preventing Data Nightmares: Top 5 Data Quality Checks Every ETL Pipeline Needs
Data is the lifeblood of any organization, but dirty or inconsistent data can lead to bad decisions, compliance issues, and missed opportunities. That’s why data quality checks are essential for every ETL (Extract, Transform, Load) pipeline.
In this article, we’ll explore five critical checks you must implement and show how to automate them using custom Python scripts.
1. Schema Validation: Guard Against Unexpected Changes
Ever had a pipeline break because someone added an unexpected column? Schema validation ensures the incoming data conforms to the expected structure.
- What to Check: Column names, data types, and constraints (e.g., NOT NULL).
- Sample Implementation: Use pandas to compare column metadata:
import pandas as pd
expected_columns = ["id", "name", "timestamp", "value"]
actual_columns = list(df.columns)
assert expected_columns == actual_columns, "Schema mismatch!"
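The assert above only covers column names, while the bullet also calls out data types. A minimal sketch that extends it (assuming df is the incoming pandas DataFrame; the expected dtypes below are purely illustrative) could look like this:

```python
import pandas as pd

# Illustrative expectations -- replace with your pipeline's real schema.
EXPECTED_DTYPES = {
    "id": "int64",
    "name": "object",
    "timestamp": "datetime64[ns]",
    "value": "float64",
}

def validate_schema(df: pd.DataFrame) -> None:
    # Missing or renamed columns
    missing = set(EXPECTED_DTYPES) - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {sorted(missing)}")
    # Data type drift
    for col, expected in EXPECTED_DTYPES.items():
        actual = str(df[col].dtype)
        if actual != expected:
            raise TypeError(f"Column '{col}' is {actual}, expected {expected}")
```

Raising explicit exceptions rather than relying on assert also means the check still runs when Python is invoked with -O, which strips assertions.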
2. Null Value Detection: Catch Missing Data Early
Nulls in critical fields can wreak havoc downstream. Ensure essential fields are populated.
- What to Check: Missing or null values in key columns.
- Sample Implementation: A quick pandas check for nulls in a key column:
if df["id"].isnull().sum() > 0:
    raise ValueError("Null values found in 'id' column!")
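For pipelines with several mandatory fields, a small extension of the same idea (the required_columns list below is illustrative) reports every offending column at once instead of failing on the first:

```python
# Illustrative list of columns that must always be populated.
required_columns = ["id", "timestamp", "value"]

# Count nulls per required column and surface all offenders together.
null_counts = df[required_columns].isnull().sum()
offenders = null_counts[null_counts > 0]
if not offenders.empty:
    raise ValueError(f"Null values found: {offenders.to_dict()}")
```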
3. Range Checks: Keep Outliers in Check
Data outside expected ranges can signal upstream issues or anomalies.
- What to Check: Numeric values within valid thresholds (e.g., dates should not be in the future).
- Sample Implementation: A custom check against known valid thresholds:
if (df["value"] < 0).any() or (df["value"] > 1000).any():
    raise ValueError("Out-of-range values detected!")
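The same pattern covers the future-dated records mentioned in the bullet above; this sketch assumes the timestamp column can be parsed as UTC datetimes:

```python
import pandas as pd

# A record dated in the future usually signals a clock or parsing issue upstream.
timestamps = pd.to_datetime(df["timestamp"], utc=True)
if (timestamps > pd.Timestamp.now(tz="UTC")).any():
    raise ValueError("Future-dated timestamps detected!")
```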
4. Uniqueness Constraints: Eliminate Duplicate Records
Duplicate entries can inflate metrics and cause errors. Ensure uniqueness where needed.
- What to Check: Unique constraints on IDs or combinations of fields.
- Sample Implementation: Write a simple script to catch duplicates:
if df.duplicated(subset=["id"]).any():
    raise ValueError("Duplicate IDs found!")
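When uniqueness applies to a combination of fields rather than a single ID, the same call accepts a composite key; the key below is purely illustrative:

```python
# Illustrative composite key: each id should appear at most once per timestamp.
key_columns = ["id", "timestamp"]

# keep=False marks every row involved in a duplicate group, not just the repeats.
duplicate_rows = df[df.duplicated(subset=key_columns, keep=False)]
if not duplicate_rows.empty:
    raise ValueError(f"{len(duplicate_rows)} rows violate the uniqueness constraint on {key_columns}")
```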
5. Referential Integrity: Ensure Data Relationships Hold
For pipelines pulling from multiple sources, ensuring relationships between datasets is critical.
- What to Check: Foreign key relationships between tables or datasets.
- Sample Implementation: Use SQL queries to validate relationships, or leverage Python:
# Rows marked "left_only" have a foreign_key with no match in reference_table
merged_df = df.merge(reference_table, on="foreign_key", how="left", indicator=True)
if (merged_df["_merge"] == "left_only").any():
    raise ValueError("Referential integrity violation!")
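If you only need a pass/fail answer, an equivalent check with pandas' isin avoids building the merged frame at all:

```python
# Flag foreign keys that have no match in the reference data.
orphaned = ~df["foreign_key"].isin(reference_table["foreign_key"])
if orphaned.any():
    raise ValueError(f"{int(orphaned.sum())} rows reference keys missing from reference_table")
```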
Automating the Quality Checks
While the scripts above get the job done, automating data quality checks ensures they run consistently on every load. Integrate them into your ETL pipelines using an orchestrator such as Airflow or Prefect:
- Airflow Example: Create a custom task in an Airflow DAG for data validation:
from airflow.operators.python import PythonOperator  # Airflow 2+ import path

def validate_data(**kwargs):
    # Run the schema, null, range, uniqueness, and referential checks here
    pass

# Assumes `dag` is a DAG object defined elsewhere in the file
validation_task = PythonOperator(
    task_id="data_validation",
    python_callable=validate_data,
    dag=dag,
)
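Prefect, mentioned above, follows the same idea with decorators instead of operators. A minimal, self-contained sketch in Prefect 2 style (the extract task and its dummy data below are placeholders, not a real pipeline):

```python
import pandas as pd
from prefect import flow, task

@task
def extract() -> pd.DataFrame:
    # Placeholder extract step -- swap in your real source.
    return pd.DataFrame({"id": [1, 2], "value": [10.0, 20.0]})

@task
def validate_data(df: pd.DataFrame) -> pd.DataFrame:
    # Run the schema, null, range, uniqueness, and referential checks here.
    if df["id"].isnull().any():
        raise ValueError("Null values found in 'id' column!")
    return df

@flow
def etl_pipeline():
    df = extract()
    validate_data(df)  # a raised exception fails the flow run before any load step

if __name__ == "__main__":
    etl_pipeline()
```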
Conclusion
Quality checks are your first line of defense against bad data infiltrating your systems. By implementing schema validation, null checks, range checks, uniqueness constraints, and referential integrity, you can ensure your pipelines remain robust and reliable.
Whether you use out-of-the-box tools or roll your own Python scripts, these checks will save you time, headaches, and costly mistakes.