Orchestration: Airflow/Prefect with quotas
Fetcher: Scrapy/httpx for static, Playwright for JS
Proxy layer: Ping session per task
- Rotate on 403/429
- Sticky sessions for login/scroll
- Geo targeting per country/city
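
The rotation and sticky-session rules above could be sketched roughly as follows. This is a minimal illustration, not a provider's API: `fetch_with_rotation`, the `send` callback, and the proxy labels are all hypothetical names, and the proxy list is assumed to be non-empty and already filtered to the target country or city.

```python
import itertools
from typing import Callable, Optional, Sequence, Tuple

ROTATE_ON = {403, 429}  # status codes that trigger a proxy switch


def fetch_with_rotation(
    url: str,
    proxies: Sequence[str],
    send: Callable[[str, str], int],
    max_attempts: int = 4,
    sticky_proxy: Optional[str] = None,
) -> Tuple[int, str]:
    """Return (last_status, proxy_used_for_that_status).

    `send(url, proxy)` is a hypothetical callback that performs the request
    and returns the HTTP status code. When `sticky_proxy` is set (login or
    scroll flows), the same proxy is kept and no rotation happens.
    """
    pool = itertools.cycle(proxies)
    status, proxy = 0, sticky_proxy or ""
    for _ in range(max_attempts):
        proxy = sticky_proxy or next(pool)
        status = send(url, proxy)
        # Stop on any non-blocking status, or immediately for sticky sessions.
        if status not in ROTATE_ON or sticky_proxy:
            break
    return status, proxy
```

In a real fetcher, `send` would wrap the httpx or Playwright call for the task; keeping it as a callback makes the rotation policy easy to test on its own.
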
Parser: lxml/BeautifulSoup, trafilatura for clean text
Quality: dedupe, language ID, PII filters
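
As a rough sketch of the quality stage, the snippet below shows exact-match deduplication on normalized text and a very simple email redaction. The function names and the regex are illustrative assumptions; real PII filtering covers far more than email addresses, and language ID is left out because it would need a dedicated library.

```python
import hashlib
import re
from typing import Iterable, List

# Deliberately simple pattern; an assumption for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return " ".join(text.lower().split())


def dedupe(docs: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each document, compared after normalization."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def redact_emails(text: str) -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub("[EMAIL]", text)
```

Hash-based dedupe only catches exact duplicates after normalization; near-duplicate detection would need a different technique such as shingling.
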
Storage: raw JSONL + Parquet with provenance
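
One possible shape for a raw JSONL record with provenance is sketched below. The field names (`url`, `fetched_at`, `sha256`, `html`) are assumptions chosen for illustration, not a fixed schema; the Parquet side is omitted because it would need a library such as pyarrow.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Optional


def make_record(url: str, html: str, fetched_at: Optional[datetime] = None) -> dict:
    """Bundle the raw payload with provenance: source URL, fetch time, content hash."""
    fetched_at = fetched_at or datetime.now(timezone.utc)
    return {
        "url": url,
        "fetched_at": fetched_at.isoformat(),
        "sha256": hashlib.sha256(html.encode("utf-8")).hexdigest(),
        "html": html,
    }


def to_jsonl_line(record: dict) -> str:
    """Serialize one record as a single JSONL line (newline-terminated)."""
    return json.dumps(record, ensure_ascii=False) + "\n"
```

Appending `to_jsonl_line(make_record(url, html))` to a file opened in `"a"` mode gives an append-only raw log that downstream jobs can re-parse.
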
Training path:
- Fine-tuning → JSONL
- RAG → chunked vectors with TTL
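
For the RAG path, chunking with a TTL might look like the sketch below. It splits on whitespace into overlapping word windows and stamps each chunk with an expiry time; the `Chunk` class, the default sizes, and the one-day TTL are all assumptions for illustration, and the embedding step that turns chunks into vectors is not shown.

```python
import time
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Chunk:
    text: str
    source_url: str
    expires_at: float  # Unix timestamp after which the chunk should be refreshed


def chunk_text(
    text: str,
    source_url: str,
    size: int = 200,
    overlap: int = 50,
    ttl_seconds: float = 86400,
    now: Optional[float] = None,
) -> List[Chunk]:
    """Split text into overlapping word windows, each tagged with an expiry."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    now = time.time() if now is None else now
    words = text.split()
    chunks = []
    for start in range(0, len(words), size - overlap):
        window = words[start:start + size]
        chunks.append(Chunk(" ".join(window), source_url, now + ttl_seconds))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def is_expired(chunk: Chunk, now: Optional[float] = None) -> bool:
    """True once the chunk's TTL has elapsed."""
    now = time.time() if now is None else now
    return now >= chunk.expires_at
```

Expired chunks would typically be dropped from the index or re-fetched, so stale pages do not keep surfacing in retrieval.
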
Monitoring: success rate, 429 %, render time, unique hosts
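
The four monitoring metrics could be computed from per-request logs along these lines. The record keys (`url`, `status`, `render_ms`) are an assumed log format, and "success" is taken here to mean any 2xx status.

```python
from typing import Dict, List, Optional
from urllib.parse import urlsplit


def summarize(results: List[dict]) -> Dict[str, Optional[float]]:
    """Aggregate success rate, 429 percentage, mean render time, and unique hosts."""
    total = len(results)
    if total == 0:
        return {"success_rate": None, "pct_429": None,
                "avg_render_ms": None, "unique_hosts": 0}
    ok = sum(1 for r in results if 200 <= r["status"] < 300)
    throttled = sum(1 for r in results if r["status"] == 429)
    # render_ms is only present for pages that went through a browser render.
    render_times = [r["render_ms"] for r in results if r.get("render_ms") is not None]
    hosts = {urlsplit(r["url"]).hostname for r in results}
    return {
        "success_rate": ok / total,
        "pct_429": 100.0 * throttled / total,
        "avg_render_ms": sum(render_times) / len(render_times) if render_times else None,
        "unique_hosts": len(hosts),
    }
```

A rising 429 percentage alongside a falling success rate is usually the signal to slow the crawl or widen the proxy pool.
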