Ensuring Data Integrity with Great Expectations: A Developer’s Practical Guide
A single data quality issue can cascade into significant financial and reputational damage.
Imagine a retail giant such as Amazon whose product recommendation engine misidentifies a popular shoe as a “men’s sock.” An error like that would confuse customers and depress sales of the affected item, which shows how tangible the impact of poor data can be.
The cost of poor data quality is staggering. An estimate attributed to IBM puts the cost of poor data quality to the U.S. economy at $3.1 trillion annually.
For developers, proactively addressing data quality is not just good practice; it’s a critical component of building reliable and trustworthy AI and data systems.
This guide provides developers with a comprehensive, hands-on approach to implementing robust data quality testing using Great Expectations, a powerful open-source Python library.
Foundations of Data Quality Assurance in Development
Before diving into the specifics of Great Expectations, it’s essential to understand the foundational principles that underpin effective data quality assurance within a development lifecycle. Building high-quality data pipelines and machine learning models requires a shift-left approach, where data quality checks are integrated early and often, rather than being an afterthought. This proactive stance prevents costly rework and ensures the integrity of downstream processes.
The Business Imperative for Data Quality
The business impact of data quality issues extends far beyond simple errors. Inaccurate customer data can lead to flawed marketing campaigns, resulting in wasted ad spend and a diminished customer experience.
For instance, incorrect demographic information could lead Starbucks to target promotions to the wrong customer segments, reducing the efficacy of their campaigns and potentially alienating customers.
Similarly, financial institutions rely on precise data for risk assessment and regulatory compliance.
A misclassification of a financial transaction could trigger regulatory penalties or lead to incorrect financial reporting, as seen in instances where firms like Wells Fargo have faced scrutiny for data-related compliance failures.
The stakes are incredibly high, underscoring the need for developers to prioritize data quality as a core development responsibility.
Integrating Data Quality into the CI/CD Pipeline
Modern software development practices emphasize continuous integration and continuous delivery (CI/CD). Integrating data quality checks into this pipeline is paramount. Instead of manually verifying data, automated checks can be triggered at various stages.
For example, upon merging new code that affects data processing, automated tests can run to ensure the transformed data adheres to predefined expectations. This prevents “bad” data from entering production environments.
Tools like Zapier can orchestrate these workflows, triggering data validation pipelines when new data arrives or when code changes are committed.
This automation ensures that data quality is consistently monitored without manual intervention, a crucial step towards scalable and reliable data operations.
Implementing Great Expectations for Data Validation
Great Expectations is an open-source library for validating, documenting, and profiling data at scale. You declare expectations about your data and then check that the data meets them. The result is a “data contract” that keeps data consistent and trustworthy over time.
Setting Up Your First Expectations
The journey with Great Expectations begins with defining your data’s expected characteristics. This involves connecting to your data source and creating an “expectation suite,” which is a collection of assertions about your data.
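First, install the library. The code in this guide uses the legacy 0.x API (DataContext, block-style datasource configuration, and the great_expectations command-line tool), which later releases replaced, so pin a 0.15.x release if you want to run the examples as written:
pip install "great_expectations~=0.15.0"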
Next, initialize Great Expectations in your project directory. This will create a great_expectations folder containing configuration files and directories for your data sources and expectation suites.
great_expectations init
Once initialized, you can create a new “data context” which acts as the central hub for managing your Great Expectations configuration.
from great_expectations import DataContext
context = DataContext()
To create an expectation suite, you’ll need to connect to your data. Let’s assume you have a CSV file named sales_data.csv located in your project. You can add it as a datasource.
datasource_config = {
    "name": "my_pandas_datasource",
    "class_name": "Datasource",
    "module_name": "great_expectations.datasource",
    # Validate in-memory pandas DataFrames
    "execution_engine": {
        "class_name": "PandasExecutionEngine",
        "module_name": "great_expectations.execution_engine",
    },
    "data_connectors": {
        # A runtime connector receives the data at validation time
        "default_runtime_connector": {
            "class_name": "RuntimeDataConnector",
            "module_name": "great_expectations.datasource.data_connector",
            # Keys that every batch request must supply to label its batch
            "batch_identifiers": ["default_identifier_name"],
        },
    },
}
context.add_datasource(**datasource_config)
Now, you can create an expectation suite for this datasource. Let’s define a few basic expectations for a hypothetical sales table.
from great_expectations.core.expectation_configuration import ExpectationConfiguration

# Create a new expectation suite (replacing any existing suite with this name)
suite_name = "sales_expectations"
suite = context.create_expectation_suite(
    expectation_suite_name=suite_name, overwrite_existing=True
)

# Each expectation is an ExpectationConfiguration: a type plus its keyword arguments
expectations = [
    # The table must contain exactly these columns
    ExpectationConfiguration(
        expectation_type="expect_table_columns_to_match_set",
        kwargs={
            "column_set": ["order_id", "customer_id", "product_id", "quantity", "price", "order_date"],
            "exact_match": True,
        },
    ),
    # Every order needs an identifier
    ExpectationConfiguration(
        expectation_type="expect_column_values_to_not_be_null",
        kwargs={"column": "order_id"},
    ),
    # Quantities must fall between 1 and 100
    ExpectationConfiguration(
        expectation_type="expect_column_values_to_be_between",
        kwargs={"column": "quantity", "min_value": 1, "max_value": 100},
    ),
    # order_date arrives from the CSV as strings, so check that each value parses as a date
    ExpectationConfiguration(
        expectation_type="expect_column_values_to_be_dateutil_parseable",
        kwargs={"column": "order_date"},
    ),
]

# Add expectations
for expectation in expectations:
    suite.add_expectation(expectation_configuration=expectation)

# Save the expectation suite
context.save_expectation_suite(expectation_suite=suite, expectation_suite_name=suite_name)
Running Validation Checks and Generating Data Docs
Once your expectation suite is defined, you can run a validation against your data. This generates a “validation result,” which details whether each expectation passed or failed.
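# Load your data into a pandas DataFrame
import pandas as pd
from great_expectations.core.batch import RuntimeBatchRequest

df = pd.read_csv("sales_data.csv")

# Create a batch request that hands the DataFrame to the runtime data connector
batch_request = RuntimeBatchRequest(
    datasource_name="my_pandas_datasource",
    data_connector_name="default_runtime_connector",
    data_asset_name="my_asset",  # label for this data in results and Data Docs
    runtime_parameters={"batch_data": df},
    # Must supply the identifier keys declared on the data connector
    batch_identifiers={"default_identifier_name": "sales_data_csv"},
)

# Run the validation
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name=suite_name,
)
validation_result = validator.validate()

# Print the validation result
print(validation_result)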
The validation_result will show a detailed JSON output of each expectation and its status. To make these results easily digestible and shareable, Great Expectations can generate “Data Docs”—HTML documentation of your data quality.
great_expectations docs build
This command will build HTML files in the great_expectations/uncommitted/data_docs directory, providing a visual report of your data quality. You can also generate these docs programmatically within your Python script.
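To act on a validation result in code rather than reading the printed JSON, it helps to reduce it to the expectations that failed. The sketch below assumes the result has been converted to a plain dictionary (in the 0.x API, validation_result.to_json_dict() does this) with a top-level results list whose entries carry success and expectation_config. The summarize_failures helper and the sample data are illustrative and not part of Great Expectations.

```python
from typing import Any, Dict, List


def summarize_failures(result: Dict[str, Any]) -> List[str]:
    """Return one line per failed expectation in a validation-result dict."""
    lines = []
    for item in result.get("results", []):
        if item.get("success"):
            continue
        config = item.get("expectation_config", {})
        kwargs = config.get("kwargs", {})
        # Table-level expectations have no "column" keyword argument.
        target = kwargs.get("column", "(table)")
        lines.append(f"{config.get('expectation_type', 'unknown')} failed on {target}")
    return lines


# Hand-written sample in the assumed shape (not real library output).
sample_result = {
    "success": False,
    "results": [
        {
            "success": True,
            "expectation_config": {
                "expectation_type": "expect_column_values_to_not_be_null",
                "kwargs": {"column": "order_id"},
            },
        },
        {
            "success": False,
            "expectation_config": {
                "expectation_type": "expect_column_values_to_be_between",
                "kwargs": {"column": "quantity", "min_value": 1, "max_value": 100},
            },
        },
    ],
}

for line in summarize_failures(sample_result):
    print(line)
```

A short summary like this is easier to drop into a log line or chat notification than the full result object.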
Advanced Expectations and Profiling
Great Expectations supports a vast array of expectation types, catering to complex data validation needs. Beyond basic checks like nullability and type, you can define expectations for:
- Uniqueness: Ensuring specific columns contain unique values.
- Value Distribution: Verifying that values fall within a certain statistical distribution.
- Set Membership: Checking if column values are present in a predefined set.
- Regex Matching: Validating string formats using regular expressions.
- Custom SQL/Python Expectations: Allowing for highly specific, custom validation logic.
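As a rough illustration of the first four categories, each check can be written as an expectation type plus keyword arguments, which is also the form stored in a suite's JSON file. The expectation names below are built-in ones; the currency column, the product ID pattern, and the price bounds are assumptions made up for this example.

```python
import json

advanced_expectations = [
    # Uniqueness: no two orders share an identifier.
    {"expectation_type": "expect_column_values_to_be_unique",
     "kwargs": {"column": "order_id"}},
    # Value distribution: the column mean stays within assumed bounds.
    {"expectation_type": "expect_column_mean_to_be_between",
     "kwargs": {"column": "price", "min_value": 5, "max_value": 500}},
    # Set membership: only known currency codes (hypothetical column).
    {"expectation_type": "expect_column_values_to_be_in_set",
     "kwargs": {"column": "currency", "value_set": ["USD", "EUR", "GBP"]}},
    # Regex matching: product IDs follow an assumed "P-" plus six digits format.
    {"expectation_type": "expect_column_values_to_match_regex",
     "kwargs": {"column": "product_id", "regex": r"^P-\d{6}$"}},
]

print(json.dumps(advanced_expectations, indent=2))
```

In the 0.x API, each dictionary can be unpacked into ExpectationConfiguration(**item) and added to a suite with suite.add_expectation.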
You can explore these in detail within the Great Expectations documentation. Furthermore, the library offers data profiling capabilities. Profiling analyzes your data and automatically suggests expectations based on its characteristics. This is incredibly useful for initial data exploration and for discovering potential data quality issues you might not have anticipated.
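To profile your data and generate initial expectations in the 0.x releases, use the UserConfigurableProfiler class:
# Assuming you have already loaded df and defined batch_request as above
from great_expectations.profile.user_configurable_profiler import UserConfigurableProfiler

# Create an empty suite for the profiler to fill
context.create_expectation_suite(
    expectation_suite_name="my_profiling_suite", overwrite_existing=True
)
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="my_profiling_suite",
)

# Profile the data; build_suite() returns a suite of suggested expectations
profiler = UserConfigurableProfiler(profile_dataset=validator)
profiled_suite = profiler.build_suite()

# Save the suite after profiling
context.save_expectation_suite(
    expectation_suite=profiled_suite,
    expectation_suite_name="my_profiling_suite",
)
Here profiled_suite holds the automatically generated expectations, which are saved under the name my_profiling_suite. Treat them as a draft: the profiler infers them from the batch it sees, so review and adjust them before relying on them.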
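To make the idea of profiling concrete, the sketch below derives draft expectations from a handful of rows using only the standard library. It is a toy stand-in for the concept and not how Great Expectations' own profiler works: suggest_expectations and the sample rows are hypothetical, and it proposes only not-null and min/max checks.

```python
from typing import Any, Dict, List


def suggest_expectations(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Propose draft expectations from observed rows (toy profiler)."""
    suggestions = []
    columns = sorted({key for row in rows for key in row})
    for column in columns:
        values = [row.get(column) for row in rows]
        present = [v for v in values if v is not None]
        # Suggest not-null only when no missing values were observed.
        if len(present) == len(values):
            suggestions.append({
                "expectation_type": "expect_column_values_to_not_be_null",
                "kwargs": {"column": column},
            })
        # Suggest a range when every observed value is numeric (bools excluded).
        numeric = [
            v for v in present
            if isinstance(v, (int, float)) and not isinstance(v, bool)
        ]
        if present and len(numeric) == len(present):
            suggestions.append({
                "expectation_type": "expect_column_values_to_be_between",
                "kwargs": {
                    "column": column,
                    "min_value": min(numeric),
                    "max_value": max(numeric),
                },
            })
    return suggestions


# Made-up sample rows; price has one missing value.
sample_rows = [
    {"order_id": "A1", "quantity": 2, "price": 19.99},
    {"order_id": "A2", "quantity": 5, "price": None},
    {"order_id": "A3", "quantity": 1, "price": 4.50},
]

for suggestion in suggest_expectations(sample_rows):
    print(suggestion["expectation_type"], suggestion["kwargs"])
```

Ranges inferred from a small sample are usually too tight for production data, which is one reason generated expectations are best treated as a starting point for review.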
Integrating Data Quality into Your Development Workflow
Simply defining expectations is only half the battle. For true data quality assurance, these checks must be embedded within your development and operational workflows. This ensures that data quality is continuously monitored and that issues are caught before they impact production systems or end-users.
Automating Validations in CI/CD
The most effective way to ensure data quality is to automate validation checks within your CI/CD pipeline. When a new branch is merged into your main development branch, or when a new deployment is staged, automated data quality checks can be triggered. This ensures that no flawed data enters your production environment.
Tools like Repo Ranger can help monitor code repositories for changes that might impact data pipelines, potentially triggering Great Expectations validation runs. Services like Jenkins, GitHub Actions, or GitLab CI/CD can be configured to execute Great Expectations validation commands as part of their job execution.
For instance, a GitHub Action could be set up to run a Python script that executes Great Expectations validations. If any expectation fails, the Action can be configured to fail the build, preventing the deployment of code that might introduce data quality problems. This proactive approach significantly reduces the risk of data-related production incidents.
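A minimal sketch of such a gating script follows. It assumes an earlier step has written the validation result to a JSON file with a top-level success flag and a statistics block; the script name, the ci_exit_code helper, and main are hypothetical. Its only job is to turn the result into a process exit code that a CI runner understands.

```python
import json
import sys
from typing import Any, Dict, List


def ci_exit_code(result: Dict[str, Any]) -> int:
    """Map a validation-result dict to an exit code (0 = pass, 1 = fail)."""
    # A missing or non-True "success" flag is treated as a failure.
    return 0 if result.get("success") is True else 1


def main(argv: List[str]) -> int:
    """Read the result file named by the first argument and report its status."""
    if not argv:
        print("usage: check_validation.py RESULT_JSON", file=sys.stderr)
        return 2
    with open(argv[0], encoding="utf-8") as handle:
        result = json.load(handle)
    stats = result.get("statistics", {})
    print(
        f"{stats.get('successful_expectations', '?')} of "
        f"{stats.get('evaluated_expectations', '?')} expectations passed"
    )
    return ci_exit_code(result)
```

A CI job would end the script with sys.exit(main(sys.argv[1:])). GitHub Actions, Jenkins, and GitLab CI all mark a step as failed when its command exits with a non-zero status.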
Monitoring Data Quality in Production
Beyond CI/CD, ongoing monitoring of data quality in production environments is critical. Data can drift over time, meaning the statistical properties of your data can change, leading to unexpected model behavior or incorrect insights. Great Expectations can be scheduled to run against production data sources regularly.
This can be achieved using task schedulers like cron on Linux or by integrating with cloud-based scheduling services like Amazon EventBridge (formerly CloudWatch Events) or Google Cloud Scheduler. The results of these scheduled validations can be logged, alerted on, and visualized. For example, if a key metric like the average order value suddenly deviates significantly from its historical norms, an alert can be triggered to notify the data engineering team.
This continuous monitoring allows for early detection of data drift or corruption, enabling timely intervention. Furthermore, integrating this monitoring with dashboarding tools like Grafana or Tableau can provide a clear, visual overview of data quality trends over time.
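The average-order-value example can be sketched as a simple threshold check against history. This is one possible heuristic, assumed here for illustration: flag the current value when it lies more than a chosen number of standard deviations from the historical mean. The function name, the default threshold of 3, and the sample figures are all hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence


def deviates_from_history(
    history: Sequence[float], current: float, max_sigmas: float = 3.0
) -> bool:
    """Return True when `current` is more than `max_sigmas` sample standard
    deviations away from the mean of `history`."""
    if len(history) < 2:
        raise ValueError("need at least two historical values")
    center = mean(history)
    spread = stdev(history)
    if spread == 0:
        # Constant history: any change at all counts as a deviation.
        return current != center
    return abs(current - center) / spread > max_sigmas


# Made-up daily averages for the past week.
daily_average_order_value = [52.0, 49.5, 51.2, 50.3, 48.9, 50.8, 51.5]

print(deviates_from_history(daily_average_order_value, 50.1))
print(deviates_from_history(daily_average_order_value, 75.0))
```

The same bounds could instead be written into an expectation such as expect_column_mean_to_be_between, so that the check runs alongside the rest of the suite.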
Leveraging Observability Platforms
Modern data stacks increasingly rely on observability platforms to provide end-to-end visibility into data pipelines. Great Expectations can integrate with these platforms to enrich data observability. By pushing validation results and data quality metrics to systems like Datadog or Honeycomb, you can correlate data quality issues with other system performance metrics.
This holistic view is invaluable for debugging complex issues. For example, if a machine learning model (say, an LLM deployed with Ansible) starts performing poorly, examining the data quality reports alongside application logs and infrastructure metrics can quickly show whether the root cause lies in the data itself. This interconnectedness allows for faster root cause analysis and resolution of data-related problems.
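Before validation results can be correlated with other telemetry, they need to be flattened into named metrics. The sketch below does only that step, assuming the result dictionary carries a statistics block with evaluated_expectations and successful_expectations. The metric names and the data_quality prefix are made up here, and actually sending the values is left to whichever client your observability platform provides.

```python
from typing import Any, Dict


def validation_metrics(
    result: Dict[str, Any], suite_name: str, prefix: str = "data_quality"
) -> Dict[str, float]:
    """Flatten a validation-result dict into metric-name/value pairs."""
    stats = result.get("statistics", {})
    evaluated = stats.get("evaluated_expectations", 0)
    successful = stats.get("successful_expectations", 0)
    base = f"{prefix}.{suite_name}"
    return {
        f"{base}.evaluated": float(evaluated),
        f"{base}.failed": float(evaluated - successful),
        # With nothing evaluated there is nothing failing, so report 1.0.
        f"{base}.success_ratio": (successful / evaluated) if evaluated else 1.0,
    }


# Hand-written sample in the assumed shape.
sample = {
    "success": False,
    "statistics": {"evaluated_expectations": 4, "successful_expectations": 3},
}

for name, value in validation_metrics(sample, "sales_expectations").items():
    print(name, value)
```

Keeping the suite name in the metric name makes it possible to chart each suite separately next to pipeline latency or error rates.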
Real-World Applications and Case Studies
The principles and tools discussed are not theoretical; they are actively used by leading organizations to maintain data integrity. Companies like Netflix, which manages vast amounts of user data, rely heavily on automated data validation to ensure the accuracy of their recommendations and personalization engines. Imagine the impact of incorrect viewing history data on their sophisticated algorithms; it would quickly degrade the user experience.
Similarly, financial technology firms such as Square must maintain extremely high standards of data accuracy for transaction processing, compliance reporting, and fraud detection. A single data anomaly could have significant financial repercussions. By implementing data quality checks using frameworks like Great Expectations, they can ensure that their systems are processing accurate and reliable financial data.
Even in the rapidly evolving field of artificial intelligence, data quality is paramount. Projects leveraging large language models (LLMs) are highly sensitive to the quality of their training and inference data.
For instance, a team building a customer service chatbot on an LLM like GPT-3.5 Turbo, accessed through an API client written in Go, must ensure that the data used for fine-tuning and for processing user queries is clean, relevant, and free from biases.
Poor data quality can lead to nonsensical responses, reputational damage, and a failure to meet business objectives.
The Stanford HAI (Human-Centered Artificial Intelligence) institute emphasizes the critical role of data quality in building responsible and effective AI systems, underscoring the importance of tools like Great Expectations in this domain.
Practical Recommendations for Developers
To effectively implement and maintain data quality using Great Expectations, consider these actionable recommendations:
- Start Simple, Iterate Often: Don’t try to define every possible expectation from day one. Begin with critical data elements and essential checks (e.g., non-null columns, expected data types). As you gain experience and understand your data better, incrementally add more complex expectations. This iterative approach makes the process manageable and allows for continuous improvement.
- Integrate Early and Continuously: Embed data quality checks into your development workflow as early as possible. This means integrating Great Expectations validations into your local development setup, your CI pipelines, and your automated testing frameworks. The earlier you catch an issue, the cheaper and easier it is to fix.
- Treat Expectations as Code: Store your expectation suites in version control alongside your application code. This allows you to track changes, collaborate with team members, and roll back to previous versions if necessary. Treating data quality definitions as code reinforces their importance in the development process.
- Automate Documentation Generation: Regularly build and publish your Data Docs. This serves as living documentation of your data’s quality and is invaluable for onboarding new team members, communicating data standards to stakeholders, and maintaining transparency.
- Establish Alerting Mechanisms: Configure your data quality monitoring to trigger alerts when expectations fail. This ensures that your team is immediately notified of any data quality regressions, allowing for prompt investigation and resolution. Without active alerting, automated checks can become invisible and ineffective.
Common Questions About Data Quality Testing
How can I automatically generate initial Great Expectations based on my data?
Great Expectations provides a data profiling feature that analyzes a dataset and generates a draft expectation suite. You can run it from the command line or through Python code. In the 0.x releases, the Python entry point is the UserConfigurableProfiler class, whose build_suite() method returns the suggested suite.
It examines column types, value distributions, null counts, and more, creating a set of suggested expectations. This is an excellent starting point for establishing baseline data quality checks, especially when dealing with new datasets or unfamiliar data sources.
What are the best practices for version controlling Great Expectations suites?
Treat your expectation suites just like any other code artifact. Store them in a version control system (like Git) within your project repository. This allows you to track changes, revert to previous versions, and collaborate effectively with your team.
Ensure that your CI/CD pipeline pulls the correct version of the expectation suite when running validations. Some teams also choose to store expectation suites in a central repository that multiple projects can reference, promoting consistency across an organization.
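When suites live in Git, reviewers benefit from seeing which expectations a change adds or removes. The following sketch compares two suites loaded as dictionaries, assuming each has an expectations list of expectation_type and kwargs entries as in the saved JSON files. diff_suites is a hypothetical helper, not a Great Expectations function.

```python
import json
from typing import Any, Dict, List, Set, Tuple


def _canonical(suite: Dict[str, Any]) -> Set[str]:
    """One canonical JSON string per expectation, so suites compare as sets."""
    return {
        json.dumps(
            {"type": e["expectation_type"], "kwargs": e.get("kwargs", {})},
            sort_keys=True,
        )
        for e in suite.get("expectations", [])
    }


def diff_suites(
    old: Dict[str, Any], new: Dict[str, Any]
) -> Tuple[List[str], List[str]]:
    """Return (added, removed) expectations as sorted canonical strings."""
    old_keys, new_keys = _canonical(old), _canonical(new)
    return sorted(new_keys - old_keys), sorted(old_keys - new_keys)


# Two made-up versions of the same suite: the quantity limit was raised.
not_null = {"expectation_type": "expect_column_values_to_not_be_null",
            "kwargs": {"column": "order_id"}}
old_suite = {"expectations": [
    not_null,
    {"expectation_type": "expect_column_values_to_be_between",
     "kwargs": {"column": "quantity", "min_value": 1, "max_value": 100}},
]}
new_suite = {"expectations": [
    not_null,
    {"expectation_type": "expect_column_values_to_be_between",
     "kwargs": {"column": "quantity", "min_value": 1, "max_value": 500}},
]}

added, removed = diff_suites(old_suite, new_suite)
print("added:", added)
print("removed:", removed)
```

A changed keyword argument shows up as one removal plus one addition, which makes loosened thresholds easy to spot in review.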
How does Great Expectations help prevent data drift in machine learning models?
Data drift occurs when the statistical properties of your production data change over time, diverging from the data used to train your machine learning model. Great Expectations helps detect this by enabling you to define expectations about expected data distributions, ranges, and cardinalities.
By running these validations regularly against production data and comparing the results over time, you can identify gradual or sudden shifts. When a significant deviation is detected, it serves as an early warning that your model’s performance may degrade, prompting you to retrain or update it.
The ability to generate data quality reports over time is crucial for this monitoring.
Can Great Expectations be used with cloud-based data warehouses like Snowflake or BigQuery?
Absolutely. Great Expectations has excellent integration capabilities with major cloud data warehouses and data platforms.
The SqlAlchemyExecutionEngine connects to SQL backends such as Snowflake, BigQuery, Redshift, and PostgreSQL, while PandasExecutionEngine and SparkDFExecutionEngine validate pandas and Spark DataFrames.
This allows you to validate data directly within your data warehouse, ensuring that data quality is maintained at its source or during its ingestion into these platforms. Companies like Databricks also support running Great Expectations on their platform for large-scale data validation.
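For a SQL backend, a 0.x-style datasource configuration might look like the sketch below. It mirrors the block-config pattern used earlier in this guide. The datasource name, the GX_DB_URL environment variable, and the placeholder connection string are assumptions, and the exact URL format depends on the SQLAlchemy dialect installed for your warehouse.

```python
import os

# Hypothetical environment variable; reading the URL from the environment
# keeps credentials out of version control. The fallback is a placeholder.
connection_string = os.environ.get(
    "GX_DB_URL", "postgresql+psycopg2://user:password@localhost:5432/sales"
)

sql_datasource_config = {
    "name": "my_sql_datasource",
    "class_name": "Datasource",
    # Expectations are evaluated by the database through SQLAlchemy.
    "execution_engine": {
        "class_name": "SqlAlchemyExecutionEngine",
        "connection_string": connection_string,
    },
    "data_connectors": {
        "default_runtime_connector": {
            "class_name": "RuntimeDataConnector",
            "batch_identifiers": ["default_identifier_name"],
        },
    },
}

# With a 0.x data context available, register it with:
#   context.add_datasource(**sql_datasource_config)
print(sql_datasource_config["execution_engine"]["class_name"])
```

A RuntimeBatchRequest against this datasource would pass a SQL query in runtime_parameters (for example {"query": "SELECT * FROM orders"}, with a table name of your own) in place of the batch_data DataFrame.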
By incorporating data quality testing with Great Expectations into your development lifecycle, you build more reliable systems, reduce technical debt, and foster greater trust in your data assets. The investment in proactive data validation pays significant dividends in the long run, preventing costly errors and ensuring the integrity of your data-driven initiatives.