common crawl, explained for SEOs

the 4.4B-edge open dataset that quietly powers half the AI training runs - and how we turn it into rank intelligence.

pete the seo wizard
April 8, 2026 · 6 min read · 640 words

every month, a non-profit in san francisco publishes a snapshot of the public web - billions of pages, every link between them, free to anyone with disk space. it's the same dataset half the world's language models train on. and it's what crawlgraph runs on.

what common crawl actually is

common crawl is a 501(c)(3) that has been crawling the web since 2008. each roughly monthly release is named CC-MAIN-YYYY-WW (year plus week number, not month) and contains roughly 3 billion pages and 4.4 billion domain-to-domain link edges. the raw crawl is published on s3 as WARC, WAT and WET files, with a columnar url index in parquet, under permissive terms of use. you don't have to ask anyone for access. you don't have to sign anything.

the catch is that the raw archive is huge: about 90 TiB compressed per release. you can't open it in excel. you need a query engine that can read parquet at scale - duckdb, athena, bigquery, clickhouse - and a schema that turns the raw page-level data into a graph: domains, edges, host counts.
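as a rough sketch of what "domains, edges, host counts" can look like as tables, here is a toy version using python's built-in sqlite3 as a stand-in for a real columnar engine. the table names, column names and sample rows are all assumptions made for illustration, not crawlgraph's actual schema.

```python
import sqlite3

# in-memory sqlite stands in for duckdb/athena here; only the table shapes matter.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE hosts (
    host   TEXT PRIMARY KEY,   -- e.g. blog.a.example
    domain TEXT NOT NULL       -- e.g. a.example
);
CREATE TABLE edges (
    src_domain TEXT NOT NULL,
    dst_domain TEXT NOT NULL,
    PRIMARY KEY (src_domain, dst_domain)  -- one edge per source-destination pair
);
""")

# hypothetical sample rows
con.executemany("INSERT INTO hosts VALUES (?, ?)", [
    ("blog.a.example", "a.example"),
    ("shop.a.example", "a.example"),
    ("b.example", "b.example"),
    ("c.example", "c.example"),
])
con.executemany("INSERT INTO edges VALUES (?, ?)", [
    ("a.example", "c.example"),
    ("b.example", "c.example"),
    ("a.example", "b.example"),
])

# host counts: how many hosts were seen under each domain
host_counts = dict(con.execute(
    "SELECT domain, COUNT(*) FROM hosts GROUP BY domain"
))

# referring-domain counts: how many distinct domains link to each destination
referring = con.execute(
    "SELECT dst_domain, COUNT(*) AS n FROM edges "
    "GROUP BY dst_domain ORDER BY n DESC, dst_domain"
).fetchall()

print(host_counts)
print(referring)
```

the primary key on the edge table is what keeps the graph at one row per source-destination pair; the same two queries should carry over to a larger engine with little change.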

why a graph and not a list

a list of backlinks tells you who links to you. a graph tells you who links to them, what their topical neighbourhood looks like, and whether the authority you're inheriting is real or recursive. the difference is the difference between a vanity metric and a ranking decision.
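a minimal sketch of that difference, assuming a toy edge list of (source, destination) domain pairs. the domain names and the `backlink_report` helper are hypothetical. the first hop is the plain backlink list; the second hop, plus a reciprocal flag, is the part only a graph gives you.

```python
from collections import defaultdict

# hypothetical toy edge list: (source_domain, destination_domain)
EDGES = [
    ("blog-a.example", "you.example"),
    ("blog-b.example", "you.example"),
    ("news.example", "blog-a.example"),
    ("you.example", "blog-b.example"),   # you link back to blog-b
    ("forum.example", "blog-b.example"),
]

def build_inbound(edges):
    """Map each domain to the set of domains that link to it."""
    inbound = defaultdict(set)
    for src, dst in edges:
        inbound[dst].add(src)
    return inbound

def backlink_report(target, edges):
    """For each referrer of `target`, list who links to that referrer
    and flag links that `target` itself reciprocates."""
    inbound = build_inbound(edges)
    report = {}
    for referrer in sorted(inbound[target]):
        second_hop = inbound[referrer] - {target}
        report[referrer] = {
            "linked_by": sorted(second_hop),
            "reciprocal": target in inbound[referrer],
        }
    return report

report = backlink_report("you.example", EDGES)
for referrer, info in report.items():
    print(referrer, info)
```

here blog-b.example comes back flagged as reciprocal, which is the simplest case of authority that loops back on itself. a real check would follow longer cycles too.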

the numbers

the april 2026 release covers 120 million unique domains and 4.4 billion edges (counting one edge per source-destination pair, not per page). that's smaller than what ahrefs or semrush report - they run their own proprietary crawlers, so they pick up more long-tail and ephemeral pages.

but the structure is similar. when we benchmark crawlgraph's domain-level reports against ahrefs's, the top-50 referring domains overlap by ~94%. the bottom of the long tail differs more, but if you're filtering for domains worth caring about, you're filtering them out anyway.
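one way to compute an overlap figure like that, as a sketch: take the top-n referring domains from each tool and measure what share of the reference tool's list also appears in ours. the exact definition behind the ~94% isn't spelled out here, so this formula is an assumption, and the domain lists are made up.

```python
def top_n_overlap(ours, theirs, n=50):
    """Share of the reference tool's top-n domains that also appear in our top-n.

    Both inputs are lists of domains, already ranked best-first.
    """
    ours_top = set(ours[:n])
    theirs_top = set(theirs[:n])
    if not theirs_top:
        return 0.0
    return len(ours_top & theirs_top) / len(theirs_top)

# hypothetical ranked lists of referring domains for the same target
ours = ["a.example", "b.example", "c.example", "d.example", "e.example"]
theirs = ["a.example", "b.example", "d.example", "e.example", "f.example"]

overlap = top_n_overlap(ours, theirs, n=5)
print(f"{overlap:.0%}")  # 4 of 5 shared -> 80%
```

this ignores rank order inside the top-n; if order matters for your audit, a rank correlation would be the stricter test.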

how we use it

common crawl publishes roughly monthly. crawlgraph currently indexes a quarterly composite of those releases - the live index reads CC-MAIN-2026-jan-feb-mar at the time of writing. on each refresh we ingest into a partitioned columnar store, deduplicate edges to canonical hostnames, and compute a few derived columns: host-level authority, link counts per source-destination pair, redirect chains. customers reach the result through the web ui or the public api at /api/v1. there is no sql console - the api keeps the surface small.
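a minimal sketch of the dedupe step, assuming "canonical hostname" means lowercased, with no trailing dot and no leading www. that rule, the helper names and the sample urls are assumptions for illustration; redirect chains and authority scores are left out.

```python
from collections import Counter
from urllib.parse import urlsplit

def canonical_host(url):
    """Lowercase the hostname, drop a trailing dot and a leading 'www.'."""
    host = (urlsplit(url).hostname or "").rstrip(".")
    if host.startswith("www."):
        host = host[4:]
    return host

def host_edges(page_links):
    """Collapse page-level links into host-level edges.

    Returns a Counter keyed by (source_host, destination_host); the value is
    the number of page-level links behind that single edge.
    """
    counts = Counter()
    for src_url, dst_url in page_links:
        src, dst = canonical_host(src_url), canonical_host(dst_url)
        if src and dst and src != dst:   # skip unparseable and internal links
            counts[(src, dst)] += 1
    return counts

# hypothetical page-level link records: (source_url, destination_url)
PAGE_LINKS = [
    ("https://www.blog.example/post-1", "https://shop.example/pricing"),
    ("https://blog.example/post-2", "http://WWW.shop.example/"),
    ("https://blog.example/post-2", "https://blog.example/about"),  # internal
    ("https://news.example/story", "https://shop.example/"),
]

edges = host_edges(PAGE_LINKS)
print(len(edges), "edges from", sum(edges.values()), "page-level links")
```

the keys of the counter are the deduplicated edges; the values are the link counts per source-destination pair.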

the open methodology angle

because the source dataset is public, anyone can verify our numbers. point your own duckdb or athena at the same parquet release we read from and you can reproduce a domain-level report from first principles. that's not a marketing claim - it's a property of using open data.

what common crawl is not good at

freshness. common crawl publishes roughly monthly and crawlgraph re-ingests a quarterly composite, so a link that appeared yesterday can take weeks, sometimes months, to show up. for fast-moving link-building campaigns where you need 15-minute granularity, this is the wrong tool. for everything slower than that - the 95% of audits - it's the right one.

it's also not exhaustive. very small or very new domains won't be in every release. if your site went live last week, give it a month before you query.
