National Physician License Aggregator
We built a production-grade data collection system that aggregates physician licensing records from all 50 US states, the US territories, and the NPPES national provider database. The pipeline handles three fundamentally different data access patterns (REST APIs, bulk file downloads, and browser automation) and feeds a unified, standardized dataset for license verification and compliance workflows.
Key Features
- ✓ 50 states + US territories covered
- ✓ Socrata REST API integration (6 states)
- ✓ Direct bulk file download (20+ states)
- ✓ Playwright browser automation (20+ states)
- ✓ Multi-process async scraping (200 concurrent requests)
- ✓ CAPTCHA detection & resume/checkpoint capability
- ✓ A–Z name iteration with subdivision fallback
- ✓ iFrame & JavaScript SPA navigation
- ✓ Pandas data normalization & CSV export
- ✓ Mage pipeline orchestration with Supabase load
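The A–Z name iteration with subdivision fallback, combined with checkpoint/resume, can be sketched roughly as follows. This is a minimal illustration, not the production code: `crawl_by_prefix`, the `search` callable, `result_cap`, and `max_depth` are hypothetical names, and the sketch assumes a state portal silently truncates a result list once it reaches a fixed cap.

```python
import string
from typing import Callable, List, Optional, Set


def crawl_by_prefix(
    search: Callable[[str], List[str]],
    result_cap: int,
    max_depth: int = 3,
    done: Optional[Set[str]] = None,
) -> List[str]:
    """Collect records by last-name prefix, subdividing truncated prefixes.

    `search` is a hypothetical callable that returns the records a portal
    shows for a prefix. `done` acts as the checkpoint: prefixes already
    completed in an earlier run are skipped on resume.
    """
    done = set() if done is None else done
    records: List[str] = []
    # Use a stack; reversing makes pop() hand back "A" first.
    pending = list(reversed(string.ascii_uppercase))
    while pending:
        prefix = pending.pop()
        if prefix in done:
            continue  # finished in a previous run
        rows = search(prefix)
        if len(rows) >= result_cap and len(prefix) < max_depth:
            # Assumed truncation: fall back to AA..AZ (then AAA..AAZ, ...).
            pending.extend(prefix + c for c in reversed(string.ascii_uppercase))
            continue
        # At max_depth the rows are kept even if they may be truncated.
        records.extend(rows)
        done.add(prefix)
    return records
```

Persisting `done` (for example as a JSON file) after each prefix would let an interrupted run, such as one stopped by a CAPTCHA, pick up where it left off. A subdivided prefix is never marked done itself, so a resumed run re-queries it once and then skips its completed children.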
Data Access Methods
Each state uses the most efficient available access pattern:
- Socrata API (6 states): DE, IL, TX, CO, CT, WA
- Direct Download (20+ states): AL, CA, FL, NV, WI, etc.
- Web Scraping (20+ states): AZ, KY, PA, OH, MI, etc.
- NPPES Federal (27M+ records): CMS national provider data
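For the Socrata-backed states, one plausible way to pull a full dataset is to page through the SODA endpoint with the `$limit`, `$offset`, and `$order` query parameters. The sketch below is an assumption about how that could look, not the pipeline's actual code: `iter_soda_rows` and `make_http_fetcher` are hypothetical helpers, and the resource URL must be supplied by the caller.

```python
from typing import Callable, Dict, Iterator, List

Row = Dict[str, str]


def iter_soda_rows(
    fetch_page: Callable[[int, int], List[Row]],
    page_size: int = 1000,
) -> Iterator[Row]:
    """Yield every row by requesting pages until a short page appears."""
    offset = 0
    while True:
        page = fetch_page(page_size, offset)
        yield from page
        if len(page) < page_size:
            return  # a short (or empty) page means the dataset is exhausted
        offset += page_size


def make_http_fetcher(resource_url: str, order_by: str = ":id") -> Callable[[int, int], List[Row]]:
    """Build a fetch_page callable for a SODA JSON resource URL.

    A stable $order is needed so that offset paging does not skip or
    repeat rows between requests.
    """
    import requests  # imported lazily so the paging logic works offline

    def fetch_page(limit: int, offset: int) -> List[Row]:
        resp = requests.get(
            resource_url,
            params={"$limit": limit, "$offset": offset, "$order": order_by},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()

    return fetch_page
```

Keeping the paging loop separate from the HTTP call means the same loop can be exercised against a fake in-memory fetcher, which is convenient when a state's endpoint is rate limited.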

Tech Stack
Python, Playwright, asyncio, requests, pandas, Supabase (PostgreSQL), AWS S3, Mage (Pipeline Orchestration), Socrata SODA API, GitHub Actions
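As an illustration of how asyncio can cap in-flight work at a figure such as the 200 concurrent requests listed under Key Features, here is a minimal sketch using `asyncio.Semaphore`. The helper name `gather_bounded` and the `worker` callable are hypothetical, and the real scraper may bound concurrency differently (for example per process).

```python
import asyncio
from typing import Any, Awaitable, Callable, Iterable, List


async def gather_bounded(
    items: Iterable[Any],
    worker: Callable[[Any], Awaitable[Any]],
    limit: int = 200,
) -> List[Any]:
    """Run worker(item) for every item with at most `limit` in flight."""
    sem = asyncio.Semaphore(limit)

    async def run_one(item: Any) -> Any:
        async with sem:  # waits here while `limit` workers are already running
            return await worker(item)

    # gather() returns results in the same order as the input items.
    return await asyncio.gather(*(run_one(item) for item in items))
```

A caller would pass a coroutine function that fetches one page or one license record, for example `asyncio.run(gather_bounded(urls, fetch_one))`, where `fetch_one` is whatever request coroutine the scraper uses.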