National Physician License Aggregator

We built a production-grade data collection system that aggregates physician licensing records from all 50 US states, the US territories, and the NPPES national provider database. The pipeline handles three fundamentally different data access patterns (REST APIs, bulk file downloads, and complex browser automation) and feeds a unified, standardized dataset for license verification and compliance workflows.

Key Features

  • 50 states + US territories covered
  • Socrata REST API integration (6 states)
  • Direct bulk file download (20+ states)
  • Playwright browser automation (20+ states)
  • Multi-process async scraping (200 concurrent requests)
  • CAPTCHA detection & resume/checkpoint capability
  • A–Z name iteration with subdivision fallback
  • iFrame & JavaScript SPA navigation
  • Pandas data normalization & CSV export
  • Mage pipeline orchestration with Supabase load
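
One way the A–Z name iteration with subdivision fallback and the resume/checkpoint capability could fit together is sketched below. It assumes a search portal that truncates result sets at a fixed cap. The names `crawl_prefixes`, `search`, `RESULT_CAP`, and the JSON checkpoint layout are hypothetical illustrations, not taken from the project's code.

```python
import json
import string
from collections import deque
from pathlib import Path
from typing import Callable, Iterator

RESULT_CAP = 100  # assumption: the portal returns at most this many rows per search


def crawl_prefixes(
    search: Callable[[str], list],
    checkpoint: Path,
    cap: int = RESULT_CAP,
    max_depth: int = 4,
) -> Iterator[dict]:
    """Yield records for name prefixes A..Z, subdividing any prefix that hits the cap.

    `search` is a hypothetical callable that returns the rows for one prefix.
    The checkpoint file stores the queue of prefixes still to be processed.
    """
    if checkpoint.exists():
        # Resume: pick up the pending queue saved by an earlier run.
        queue = deque(json.loads(checkpoint.read_text()))
    else:
        queue = deque(string.ascii_uppercase)

    while queue:
        prefix = queue[0]  # stays at the head until it is fully handled
        rows = search(prefix)  # may raise (e.g. CAPTCHA page); checkpoint stays valid
        if len(rows) >= cap and len(prefix) < max_depth:
            # Probably truncated: fall back to the 26 longer prefixes (S -> SA..SZ).
            queue.extend(prefix + letter for letter in string.ascii_uppercase)
        else:
            # Below the cap (or at max depth, where truncation is accepted).
            yield from rows
        queue.popleft()
        checkpoint.write_text(json.dumps(list(queue)))
```

Because the checkpoint is written only after a prefix has been fully handled, an interrupted run retries that prefix on restart. Records may therefore be emitted more than once, so this sketch assumes deduplication happens downstream.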

Data Access Methods

Each state uses the most efficient available access pattern:

  • Socrata API (6 states): DE, IL, TX, CO, CT, WA
  • Direct Download (20+ states): AL, CA, FL, NV, WI, etc.
  • Web Scraping (20+ states): AZ, KY, PA, OH, MI, etc.
  • NPPES Federal (27M+ records): CMS national provider data
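
For the Socrata-backed states, a paged pull through the SODA API might look like the minimal sketch below. It uses the SODA `$limit`, `$offset`, and `$order` query parameters. The domain and dataset ID in the usage note are placeholders, the helper names `fetch_page` and `iter_rows` are hypothetical, and the standard library is used instead of `requests` only to keep the sketch dependency-free.

```python
import json
from typing import Callable, Iterator
from urllib.parse import urlencode
from urllib.request import urlopen


def fetch_page(domain: str, dataset_id: str, limit: int, offset: int) -> list:
    """Fetch one page of rows from a SODA endpoint (hypothetical helper)."""
    query = urlencode({"$limit": limit, "$offset": offset, "$order": ":id"})
    url = f"https://{domain}/resource/{dataset_id}.json?{query}"
    with urlopen(url, timeout=30) as resp:
        return json.load(resp)


def iter_rows(get_page: Callable[[int, int], list], page_size: int = 1000) -> Iterator[dict]:
    """Yield every row by advancing the offset until a short page is returned."""
    offset = 0
    while True:
        page = get_page(page_size, offset)
        yield from page
        if len(page) < page_size:
            break  # last page reached
        offset += page_size
```

A call for one state could then be written as `iter_rows(lambda limit, offset: fetch_page("data.example.gov", "abcd-1234", limit, offset))`, where both arguments are placeholders for a real portal domain and dataset ID. Passing the page fetcher in as a callable keeps the paging loop testable without network access.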

Tech Stack

  • Python
  • Playwright
  • asyncio
  • requests
  • pandas
  • Supabase (PostgreSQL)
  • AWS S3
  • Mage (Pipeline Orchestration)
  • Socrata SODA API
  • GitHub Actions