Charlie Krug · The Build Log


Substack has no public API. I built the next best thing.

A scraper for public Substack data: post archives, full content, nested comment threads, author profiles, and category leaderboards with subscriber estimates. Six modes, no login.

Substack is where an enormous slice of independent writing lives now, and if you want to study it (which newsletters grow, what topics a writer returns to, how comment sections behave), you hit a wall fast: there's no public API. The data is right there in your browser, yet it's thoroughly inconvenient to work with at any scale.

Substack Scraper is my answer: a tool that pulls clean, structured data from any public Substack. No login, no cookies borrowed from your browser session, no credentials. If a logged-out visitor can see it, the scraper can collect it as JSON.

Six modes, because "scrape it" means six different jobs

The core design decision was admitting that people who say "scrape this newsletter" want different things. So the tool has six modes instead of one giant crawl:

  • Archive: every post in a publication's back catalog, with dates, titles, and links. The skeleton of a newsletter's history.
  • Posts: full content for the posts you point it at, ready for text analysis.
  • Comments: whole threads, with the nested reply structure preserved, not flattened into soup. Comment sections are conversations; the tree is the data.
  • Author profiles: who writes this thing, what else they write.
  • Category leaderboards: Substack's own rankings by category, snapshotted into data you can actually sort and compare.
  • Subscriber estimates: the leaderboards pair rankings with subscriber-count signals, so you can watch relative sizes, not just positions.
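
To make "the tree is the data" concrete, here is a minimal sketch of walking a nested comment thread while keeping each reply's depth. The record shape is an assumption for illustration: the field names `id`, `author`, `body`, and `children` are hypothetical, and the scraper's real output may name things differently.

```python
from typing import Iterator

# Hypothetical comment records: field names are assumed, not the scraper's schema.
thread = [
    {"id": 1, "author": "ana", "body": "Great post", "children": [
        {"id": 2, "author": "ben", "body": "Agreed", "children": [
            {"id": 3, "author": "ana", "body": "Thanks", "children": []},
        ]},
    ]},
    {"id": 4, "author": "cy", "body": "Disagree", "children": []},
]


def walk(comments: list[dict], depth: int = 0) -> Iterator[tuple[int, dict]]:
    """Yield (depth, comment) pairs depth-first, which is reading order."""
    for comment in comments:
        yield depth, comment
        yield from walk(comment.get("children", []), depth + 1)


def max_depth(comments: list[dict]) -> int:
    """Deepest reply level: 0 means top-level only, -1 means no comments."""
    return max((depth for depth, _ in walk(comments)), default=-1)


# Print the thread indented by reply depth.
for depth, comment in walk(thread):
    print("  " * depth + f'{comment["author"]}: {comment["body"]}')
```

Flattening would throw away exactly what `depth` carries here: who was replying to whom, and how long a back-and-forth ran.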

Each mode returns predictable structured records, which means the output plugs straight into a spreadsheet, a notebook, or whatever analysis pipeline you already have. It runs on Apify, the scraping platform, so scheduling, retries, and storage come with the platform instead of being my problem to reinvent.
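
As a rough illustration of "plugs straight into a spreadsheet," the sketch below turns a JSON list of records into CSV using only the standard library. The field names (`title`, `post_date`, `url`) and the example URLs are assumptions made up for this example, not the scraper's documented schema.

```python
import csv
import io
import json

# Hypothetical archive-mode output: field names and URLs are illustrative only.
raw = """[
  {"title": "Hello", "post_date": "2024-01-05T10:00:00Z",
   "url": "https://example.substack.com/p/hello"},
  {"title": "Second", "post_date": "2024-01-19T10:00:00Z",
   "url": "https://example.substack.com/p/second"}
]"""


def records_to_csv(records: list[dict], columns: list[str]) -> str:
    """Write the chosen columns of each record as CSV text, header first."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns)
    writer.writeheader()
    for record in records:
        # Missing fields become empty cells instead of raising an error.
        writer.writerow({column: record.get(column, "") for column in columns})
    return buffer.getvalue()


csv_text = records_to_csv(json.loads(raw), ["post_date", "title", "url"])
print(csv_text)
```

Picking the columns explicitly keeps the spreadsheet stable even if a record carries extra fields you don't care about.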

The polite-scraper rules

Scrapers have a reputation, some of it earned, so this one has house rules worth stating plainly. It touches public pages only: no paywalled content, no login flow, nothing a regular logged-out reader couldn't see. Subscriber numbers are estimates derived from public signals, labeled as estimates. And it's rate-limited to be a polite guest rather than a load test. The use case is research and market analysis, not wholesale content lifting: think "map the finance-newsletter landscape before launching mine," not "repost someone's archive."
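
For a sense of what "polite guest" can mean in code, here is one simple way to enforce a minimum gap between requests. This is a generic sketch, not the scraper's actual implementation: the `Throttle` class, its interval, and the example URLs are all hypothetical.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Enforce a minimum gap between consecutive requests (illustrative only)."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock  # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> float:
        """Block until the next request is allowed; return the seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
        if delay > 0:
            self._sleep(delay)
        self._last = self._clock()
        return delay


# Tiny interval so the demo finishes quickly; a real crawl would use seconds.
throttle = Throttle(min_interval=0.05)
for url in ["https://example.substack.com/archive", "https://example.substack.com/about"]:
    throttle.wait()
    print("would fetch", url)  # placeholder: no network request is made here
```

The first call never sleeps; every later call waits only for whatever is left of the interval, so slow responses don't get penalized twice.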

Of everything in it, the nested comments were the fussiest to get right and are probably the most underrated output. Public conversation threads with structure intact are hard to come by, and they're where a newsletter's actual community lives.

Try it

The landing page below has the details and links to the actor. Point it at a publication you read, pull the archive, and see how the posting cadence changed over a year. Newsletters have growth stories written in their own timestamps.
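
If you want to try the cadence idea, a sketch like this counts posts per calendar month from archive records. It assumes each record has an ISO 8601 timestamp in a field called `post_date`; that field name and the sample data are hypothetical, so adjust them to whatever the archive output actually contains.

```python
from collections import Counter
from datetime import datetime

# Hypothetical archive records: the "post_date" field name is an assumption.
archive = [
    {"title": "A", "post_date": "2024-01-05T10:00:00Z"},
    {"title": "B", "post_date": "2024-01-19T10:00:00Z"},
    {"title": "C", "post_date": "2024-02-02T10:00:00Z"},
    {"title": "D", "post_date": "2024-03-01T10:00:00Z"},
    {"title": "E", "post_date": "2024-03-08T10:00:00Z"},
    {"title": "F", "post_date": "2024-03-15T10:00:00Z"},
]


def posts_per_month(records: list[dict], date_field: str = "post_date") -> dict[str, int]:
    """Count posts per YYYY-MM bucket, returned in chronological order."""
    counts: Counter = Counter()
    for record in records:
        # Replace a trailing "Z" so fromisoformat also works before Python 3.11.
        stamp = datetime.fromisoformat(record[date_field].replace("Z", "+00:00"))
        counts[stamp.strftime("%Y-%m")] += 1
    return dict(sorted(counts.items()))


for month, count in posts_per_month(archive).items():
    print(month, "#" * count)
```

Months with no posts simply don't appear in the result, so a gap in the output is itself a signal worth noticing.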

Substack Scraper is live. Free, in your browser, no signup.

This post is part of the build log: every app my automated factory ships gets written up here, honestly. Browse everything at apps.charliekrug.com, or subscribe via RSS. Comments are open below.
