
Python Data Cleaning Automation

Generate pandas data cleaning pipelines.

Act as a data engineer specializing in data cleaning and preprocessing for machine learning, business intelligence, and reporting applications, having cleaned over 50 million records across ecommerce, healthcare, finance, and marketing datasets. Generate a complete Python data cleaning pipeline using pandas for a specific dataset type, covering common data quality issues, transformation steps, validation rules, and error logging.
Begin with data loading and initial inspection, including the appropriate read function for the file type (CSV, Excel, JSON, Parquet), encoding detection and handling, dtype specification for memory optimization and type safety, identification of missing value patterns using visualizations and summary statistics, and confirmation of the shape against expected row and column counts.
Develop column cleaning, including whitespace stripping from string columns, case standardization for categorical values (upper, lower, title), special character removal from text fields, renaming columns to lowercase with underscores for consistency, and dropping irrelevant or redundant columns.
Create missing value handling, including per-column missing value counts and percentages, threshold-based column dropping for high missing rates, threshold-based row dropping for incomplete records, imputation strategies (mean, median, mode, forward fill, backward fill, or model-based), and missing indicator columns that preserve the missingness signal.
Add outlier detection and treatment, including statistical methods using the IQR (1.5*IQR rule) and Z-score (more than 3 standard deviations from the mean), domain-specific thresholds based on business rules, capping and flooring to reasonable bounds, winsorization at percentiles, and transformation for normalization.
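For illustration only, the missing value and outlier steps described above might produce code along these lines. This is a minimal sketch, not the expected output of the prompt: the function names `missing_summary` and `cap_outliers_iqr`, the `price` column, and the sample values are all hypothetical, and median imputation is just one of the strategies listed.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages."""
    counts = df.isna().sum()
    return pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(df) * 100).round(2),
    })


def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR] to those bounds."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Hypothetical sample data: one missing value and one extreme value.
df = pd.DataFrame({"price": [10.0, 12.0, 11.0, 13.0, 500.0, None]})

# Summarize missingness before changing anything.
report = missing_summary(df)

# Keep the missingness signal, then impute with the median.
df["price_missing"] = df["price"].isna()
df["price"] = df["price"].fillna(df["price"].median())

# Cap the extreme value using the 1.5*IQR rule.
df["price"] = cap_outliers_iqr(df["price"])
```

The indicator column is created before imputation so the information about which rows were originally missing is not lost.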
Implement data type conversion, including date parsing from mixed formats with error handling, numeric conversion from strings with thousands separators and currency symbols, categorical encoding for memory efficiency and analysis, and boolean conversion from yes/no, true/false, and 1/0.
Add duplicate handling, including identification of exact duplicates across all columns, duplicate checks on a subset of key fields, deduplication that keeps the first record, the last record, or a record chosen by custom logic, and fuzzy matching for near-duplicate text with similarity thresholds.
Include validation rules, including schema validation for expected columns and types, constraint checking for business rules (positive prices, future dates), referential integrity for ID fields, and custom assertion tests for data quality thresholds.
The pipeline should be modular, with a function for each cleaning stage, logging to track transformations, configurable parameters for different data sources, and output consisting of a cleaned file plus a cleaning report.
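As a rough sketch of what the type conversion, deduplication, and validation stages could look like in pandas, assuming a hypothetical orders table: the helper names `parse_currency` and `parse_bool`, the `BOOL_MAP` dictionary, and all column names and values are illustrative and not part of the prompt. Dates that cannot be parsed are coerced to `NaT` rather than raising; genuinely mixed formats may need extra handling that this sketch does not cover.

```python
import pandas as pd

# Hypothetical mapping for the yes/no, true/false, 1/0 conversions.
BOOL_MAP = {"yes": True, "no": False, "true": True,
            "false": False, "1": True, "0": False}


def parse_currency(series: pd.Series) -> pd.Series:
    """Strip currency symbols and thousands separators, then convert to numbers."""
    cleaned = series.astype(str).str.replace(r"[^0-9.\-]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")


def parse_bool(series: pd.Series) -> pd.Series:
    """Map common truthy/falsy strings to booleans; unknown values become NaN."""
    return series.astype(str).str.strip().str.lower().map(BOOL_MAP)


# Hypothetical raw data with a duplicated order and an unparseable date.
raw = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "price": ["$1,200.50", "€30", "€30", "15.75"],
    "paid": ["Yes", "no", "no", "TRUE"],
    "order_date": ["2024-01-05", "2024-02-10", "2024-02-10", "not a date"],
})

clean = raw.copy()
clean["price"] = parse_currency(clean["price"])
clean["paid"] = parse_bool(clean["paid"])
clean["order_date"] = pd.to_datetime(clean["order_date"], errors="coerce")

# Deduplicate on the key field, keeping the first occurrence.
clean = clean.drop_duplicates(subset=["order_id"], keep="first").reset_index(drop=True)

# Simple validation rules: expected columns present and prices positive.
expected_columns = {"order_id", "price", "paid", "order_date"}
assert expected_columns.issubset(clean.columns)
assert (clean["price"] > 0).all()
```

Using `errors="coerce"` turns bad values into missing values that a later stage can count and report, instead of stopping the pipeline at the first malformed record.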