Python Data Validation Pipeline
Build data quality validation systems.
Act as a data engineer specializing in data quality and validation for data pipelines serving machine learning models, business intelligence dashboards, and operational reporting, with experience building validation frameworks that handle millions of records daily. Generate a comprehensive data validation pipeline in Python using Pydantic, Great Expectations, or custom validators for a specific data source type (API, database, CSV, JSON, Parquet, streaming), including schema validation, business rule enforcement, anomaly detection, and alerting. Begin with data quality dimensions including completeness (null counts per column, null percentage thresholds, required field validation, default value filling for optional fields, partial record handling), accuracy (minimum/maximum value range checks, enumerated sets of allowed values, reference data matching, cross-field consistency, precision and rounding), consistency (regex validation of format patterns, unit conversion correctness, cross-system reconciliation, historical trend alignment, referential integrity), uniqueness (duplicate detection by key, composite key uniqueness, timestamp deduplication, fuzzy matching for near-duplicates, window-based deduplication), timeliness (data freshness monitoring via ingestion timestamp, processing delay SLA, out-of-order handling, late-arriving data, batch vs micro-batch latency), and validity (data type validation, constraint satisfaction, foreign key existence, business rule application, conditional validation logic).
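As a reference point for the completeness dimension above, here is a minimal sketch of per-column null counting with percentage thresholds. It assumes records arrive as a list of dicts and that a missing key counts as null; the function name completeness_report and its parameters are hypothetical, not part of any library.

```python
from typing import Any


def completeness_report(
    records: list[dict[str, Any]],
    required: list[str],
    max_null_pct: dict[str, float],
) -> dict[str, dict[str, Any]]:
    """Return null count, null percentage, and pass/fail for each column.

    Assumptions (illustrative only): a missing key is treated as null,
    required columns tolerate 0% nulls, and columns without an explicit
    threshold tolerate any null percentage.
    """
    total = len(records)
    columns = set(required) | set(max_null_pct)
    for rec in records:
        columns |= rec.keys()

    report: dict[str, dict[str, Any]] = {}
    for col in sorted(columns):
        nulls = sum(1 for rec in records if rec.get(col) is None)
        pct = 100.0 * nulls / total if total else 0.0
        limit = 0.0 if col in required else max_null_pct.get(col, 100.0)
        report[col] = {"null_count": nulls, "null_pct": pct, "passed": pct <= limit}
    return report


sample = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 3},  # "email" key missing entirely
    {"id": None, "email": "b@example.com"},
]
result = completeness_report(sample, required=["id"], max_null_pct={"email": 50.0})
```

In this sample the required id column fails because one of four values is null, while email passes because its null share sits exactly at the 50% threshold.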
Develop schema validation using Pydantic models (field type coercion, required vs optional fields, field aliases for renaming, default values, custom validators with the @field_validator decorator, @model_validator for cross-field checks, model_config for handling extra fields; in Pydantic v1 the equivalents are @validator, @root_validator, and the inner Config class), JSON Schema validation (schema generation from Pydantic, the jsonschema library, custom format checks, patternProperties for dynamic fields, anyOf/oneOf composition), and Avro or Protobuf schema registry integration (schema versioning, backward compatibility, evolution rules, schema ID header). Create business rule validators including range rules (numeric between, date before/after, string length between, array size bounds, overlapping intervals), set membership (value in allowed list, value from reference dataset, code lookup existence, category assignment validity, status transition rules), cross-field rules (field A < field B, field C implies field D, mutual exclusivity, if-then conditions, formula validation, aggregation consistency), temporal rules (timestamp monotonicity, date effective ranges, no future dates, logical ordering, duration reasonableness, gap detection), and statistical rules (distribution consistency, z-score outlier detection, the 1.5x IQR fence rule, rolling mean deviation, category proportion stability).
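To make the statistical rules above concrete, the following is a minimal sketch of the 1.5x IQR fence and the z-score rule using only the standard library. The function names are hypothetical, and the choice of the "inclusive" quartile method and the population standard deviation are assumptions; other conventions give slightly different fences.

```python
import statistics


def iqr_outliers(values: list[float], k: float = 1.5) -> list[float]:
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def zscore_outliers(values: list[float], threshold: float = 3.0) -> list[float]:
    """Return values whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population standard deviation (assumption)
    if sd == 0:
        return []  # all values identical: nothing can be an outlier
    return [v for v in values if abs(v - mean) / sd > threshold]


small_sample = [10, 11, 12, 13, 14, 15, 16, 17, 18, 100]
larger_sample = [10] * 20 + [100]

iqr_flagged = iqr_outliers(small_sample)
z_flagged_small = zscore_outliers(small_sample)
z_flagged_large = zscore_outliers(larger_sample)
```

The two rules can disagree on small samples: a single extreme value inflates the standard deviation, so the z-score rule at threshold 3 misses the 100 in the ten-element sample while the IQR fence flags it. This is one reason to ask for both rules rather than relying on z-scores alone.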
Implement anomaly detection including univariate detection (Z-score with a threshold of 3, modified Z-score with MAD, the 1.5x IQR rule, percentile-based capping, seasonal decomposition), multivariate detection (Mahalanobis distance, isolation forest, DBSCAN clustering, autoencoder reconstruction error, PCA projection), time series detection (seasonal-trend decomposition residuals, rolling window statistics, change point detection, thresholds on derivatives, forecast error monitoring), categorical drift (population stability index, chi-square test, proportion shift detection, category emergence/disappearance, mutual information change), and data drift monitoring (Kolmogorov-Smirnov test for continuous features, Jensen-Shannon divergence for categorical features, Wasserstein distance, feature importance change, prediction drift). Add validation reporting including Great Expectations integration (expectation suite, data docs generation, validation results JSON, batch request configuration, checkpoint execution), custom validation summary (pass/fail counts by rule, records affected, severity levels of error/warning/info, examples of violations, rule execution time), alerting configuration (email for critical failures, Slack webhook notifications, PagerDuty integration, Jira ticket creation, dashboard update triggers), and data quality scorecard (weighted score across DQ dimensions, trend over time, SLA attainment, per-source reliability, per-attribute confidence). Provide action rules including soft fail (log a warning but continue, mark the record with a quality flag, quarantine to a dead letter queue, send to human review), hard fail (stop the pipeline, reject the batch, raise an exception, trigger a rollback, alert the on-call engineer), and auto-remediation (impute missing values, correct data types, normalize formats, filter outliers, retry ingestion).
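The soft fail and hard fail action rules above could be wired to severity levels roughly as follows. This is a minimal sketch under stated assumptions: the Severity, HardFailError, BatchOutcome, and apply_actions names are hypothetical, an in-memory list stands in for the dead letter queue, and each rule is a (name, severity, predicate) tuple.

```python
import logging
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable

logger = logging.getLogger("dq")


class Severity(Enum):
    INFO = "info"
    WARNING = "warning"
    ERROR = "error"


class HardFailError(Exception):
    """Raised to reject the whole batch (hard fail)."""


@dataclass
class BatchOutcome:
    accepted: list[dict] = field(default_factory=list)
    quarantined: list[dict] = field(default_factory=list)  # dead letter queue stand-in


Rule = tuple[str, Severity, Callable[[dict], bool]]


def apply_actions(records: list[dict], rules: list[Rule]) -> BatchOutcome:
    """Route each record by the worst severity among its failed rules."""
    outcome = BatchOutcome()
    for rec in records:
        failed = [(name, sev) for name, sev, check in rules if not check(rec)]
        errors = [name for name, sev in failed if sev is Severity.ERROR]
        warnings = [name for name, sev in failed if sev is Severity.WARNING]
        if errors:
            # Hard fail: stop processing and reject the batch.
            raise HardFailError(f"rules failed: {errors} on record {rec}")
        if warnings:
            # Soft fail: flag the record and quarantine it, then continue.
            logger.warning("quarantining record, failed rules: %s", warnings)
            outcome.quarantined.append({**rec, "quality_flag": warnings})
        else:
            for name, _ in failed:  # only INFO-level failures remain
                logger.info("rule %s failed on accepted record", name)
            outcome.accepted.append(rec)
    return outcome


rules: list[Rule] = [
    ("amount_non_negative", Severity.ERROR, lambda r: r["amount"] >= 0),
    ("known_currency", Severity.WARNING, lambda r: r["currency"] in {"USD", "EUR"}),
]
outcome = apply_actions(
    [{"amount": 10, "currency": "USD"}, {"amount": 5, "currency": "XXX"}], rules
)
```

Raising on the first error-level failure is one design choice among several; a pipeline that needs a full violation summary before rejecting the batch would collect all failures first and raise at the end.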
Include CI/CD integration for validation rules (unit tests for validators, version controlled expectations, staging vs production rule sets, canary deployment for new rules, A/B testing of thresholds).
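For the unit-testing part of the CI/CD requirement, a validator and its tests might look like the sketch below. The validate_order function, its field names, and its violation codes are hypothetical; the tests are plain functions with bare asserts, assuming a pytest-style runner discovers them in CI.

```python
from datetime import date


def validate_order(record: dict, today: date) -> list[str]:
    """Return violation codes for a single order record (empty list means valid)."""
    created, shipped = record.get("created_at"), record.get("shipped_at")
    if created is None:
        return ["missing_created_at"]
    violations = []
    if created > today:
        violations.append("created_in_future")  # temporal rule: no future dates
    if shipped is not None and shipped < created:
        violations.append("shipped_before_created")  # cross-field ordering rule
    return violations


TODAY = date(2024, 6, 1)  # fixed date keeps the tests deterministic


def test_valid_order_passes():
    record = {"created_at": date(2024, 5, 1), "shipped_at": date(2024, 5, 3)}
    assert validate_order(record, TODAY) == []


def test_unshipped_order_passes():
    assert validate_order({"created_at": date(2024, 5, 1)}, TODAY) == []


def test_shipped_before_created_is_flagged():
    record = {"created_at": date(2024, 5, 3), "shipped_at": date(2024, 5, 1)}
    assert validate_order(record, TODAY) == ["shipped_before_created"]


def test_future_created_date_is_flagged():
    assert validate_order({"created_at": date(2024, 7, 1)}, TODAY) == ["created_in_future"]


def test_missing_created_at_is_flagged():
    assert validate_order({}, TODAY) == ["missing_created_at"]
```

Passing today as a parameter rather than calling date.today() inside the validator is what makes these tests repeatable across CI runs.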