
Python Async Web Scraper

Build high-speed concurrent web scrapers.

Act as a senior data engineer specializing in large-scale web scraping with Python async frameworks, who has built scrapers that extract millions of records daily from thousands of concurrent sources while respecting rate limits and avoiding blocks. Generate a complete asynchronous web scraper using asyncio, aiohttp, and BeautifulSoup for a specific target website type (ecommerce, news, social media, directories, or real estate) with concurrency control, rate limiting, error handling, and data extraction logic.

Begin with the project structure: main.py for orchestration, scraper.py for individual page logic, parser.py for HTML extraction, storage.py for output handling, utils.py for helpers, requirements.txt for dependencies, and .env for configuration.

Implement concurrency control using asyncio.Semaphore to limit simultaneous requests (10-50 concurrent connections is typical), asyncio.Queue (or asyncio.PriorityQueue) for URL management with priority levels, and asyncio.gather with return_exceptions=True for task orchestration with exception isolation. Implement rate limiting with a token bucket algorithm for requests per second, domain-specific delays that respect the robots.txt crawl-delay, random jitter between requests (1-5 seconds) to avoid pattern detection, and exponential backoff for rate limit errors (429, 503).

Add session management: a shared aiohttp.ClientSession with connection pooling, reuse of TCP connections for performance, header rotation from a User-Agent pool, cookie management for session persistence, timeout configuration (total, connect, sock_read), and SSL verification handling for trusted sources. Implement error handling: a retry decorator with configurable attempts and delays, specific exception handling for network issues (asyncio.TimeoutError, aiohttp.ClientError), HTTP status code handling (200: success; 301-303: follow redirects; 404: skip without retrying; 429 and 503: back off and retry), a circuit breaker pattern for failing domains, and comprehensive logging with request/response tracking.
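The concurrency and backoff requirements above could be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: `RetryableError`, `backoff_delay`, `fetch_with_retry`, and `crawl` are hypothetical names, and `fetch` stands in for whatever coroutine performs the real aiohttp request and raises `RetryableError` on a 429 or 503. The "full jitter" backoff variant is one common choice among several.

```python
import asyncio
import random


class RetryableError(Exception):
    """Hypothetical marker for responses worth retrying (e.g. HTTP 429 or 503)."""


def backoff_delay(attempt, base=1.0, cap=60.0):
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


async def fetch_with_retry(fetch, url, semaphore, max_attempts=4, base=1.0):
    """Call `fetch(url)` under the semaphore, retrying RetryableError with backoff."""
    for attempt in range(max_attempts):
        try:
            # Hold a semaphore slot only while the request is in flight,
            # not while sleeping between attempts.
            async with semaphore:
                return await fetch(url)
        except RetryableError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            await asyncio.sleep(backoff_delay(attempt, base))


async def crawl(fetch, urls, max_concurrency=10, base=1.0):
    """Fetch all URLs concurrently; failures come back as exception objects."""
    semaphore = asyncio.Semaphore(max_concurrency)
    tasks = [fetch_with_retry(fetch, u, semaphore, base=base) for u in urls]
    # return_exceptions=True places each failure in the result list instead of
    # raising the first one, so results for the other URLs are still returned.
    return await asyncio.gather(*tasks, return_exceptions=True)
```

Results come back in the same order as `urls`, so a caller can pair each URL with either its payload or its exception and log or re-queue the failures.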
Create a parsing strategy: BeautifulSoup CSS selectors optimized for the target structure, XPath alternatives (via lxml, since BeautifulSoup itself does not support XPath) for complex navigation, regex for pattern extraction from text, JSON parsing for API responses, and data validation with schema checking. Add data storage: CSV streaming for large datasets, JSON Lines format for structured data, database insertion with asyncpg or aiosqlite, batch writing for reduced I/O, and checkpoint saving for resume capability. Include robots.txt compliance, polite crawling with identifying headers, distributed scraping with a message queue, proxy rotation integration, CAPTCHA detection and alerting, and a monitoring dashboard for rate, success, and error metrics.
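The JSON Lines, batch writing, and resume requirements above might look something like the sketch below. It assumes every record carries a `url` key and treats the output file itself as the checkpoint, which is only one possible design; `JsonlBatchWriter` and `load_done_urls` are hypothetical names. The writes are blocking, so in an async scraper they would typically be kept small or moved off the event loop.

```python
import json
from pathlib import Path


class JsonlBatchWriter:
    """Buffers records and appends them to a JSON Lines file in batches."""

    def __init__(self, path, batch_size=100):
        self.path = Path(path)
        self.batch_size = batch_size
        self._buffer = []

    def add(self, record):
        """Queue one record; write to disk once the batch is full."""
        self._buffer.append(record)
        if len(self._buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Append all buffered records to the file, one JSON object per line."""
        if not self._buffer:
            return
        with self.path.open("a", encoding="utf-8") as fh:
            fh.writelines(
                json.dumps(r, ensure_ascii=False) + "\n" for r in self._buffer
            )
        self._buffer.clear()


def load_done_urls(path):
    """Return the set of URLs already written, so a restart can skip them.

    Assumes each stored record has a "url" key (an assumption of this sketch).
    """
    path = Path(path)
    if not path.exists():
        return set()
    with path.open(encoding="utf-8") as fh:
        return {json.loads(line)["url"] for line in fh if line.strip()}
```

On startup, the orchestrator could call `load_done_urls` once and filter the URL queue against the returned set; calling `flush()` in a `finally` block keeps a partial batch from being lost on shutdown.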