every month, a non-profit in san francisco publishes a snapshot of the public web — billions of pages, every link between them, free to anyone with disk space. it's the same dataset half the world's language models train on. and it's what crawlgraph runs on.
what common crawl actually is
common crawl is a 501(c)(3) that has been crawling the web since 2008. each monthly release — called cc-main-yyyy-mm — contains roughly 3 billion pages and 4.4 billion hyperlinks. the data is published as parquet files on s3 under a permissive license. you don't have to ask anyone for access. you don't have to sign anything.
the catch is that the raw archive is huge: about 90 TiB compressed per release. you can't open it in excel. you need a query engine that can read parquet at scale — duckdb, athena, bigquery, clickhouse — and a schema that turns the raw page-level data into a graph: domains, edges, host counts.
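to get a feel for what that looks like in practice, here's a minimal duckdb sketch in python. the s3 path and column names follow common crawl's public columnar url index, but treat the crawl id, partition layout, and column names as assumptions to verify against the release you actually query:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")   # lets duckdb read parquet straight from s3/https
con.execute("LOAD httpfs")
con.execute("SET s3_region = 'us-east-1'")  # common crawl's bucket lives in us-east-1

# the columnar url index for one release. start with a single file rather than the
# full glob -- the whole release is hundreds of GB even before you touch the warcs.
path = "s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2025-08/subset=warc/*.parquet"

# pages per registered domain; duckdb only fetches the columns the query touches
top = con.execute(f"""
    SELECT url_host_registered_domain AS domain, count(*) AS pages
    FROM read_parquet('{path}')
    GROUP BY 1
    ORDER BY pages DESC
    LIMIT 20
""").fetchall()

for domain, pages in top:
    print(domain, pages)

# if your duckdb build insists on s3 credentials, the same files are mirrored
# over https at https://data.commoncrawl.org/ under the identical key
```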
a list of backlinks tells you who links to you. a graph tells you who links to them, what their topical neighbourhood looks like, and whether the authority you're inheriting is real or recursive. the difference is the difference between a vanity metric and a ranking decision.
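to make that concrete: the second hop is just one more self-join over the edge table. a minimal sketch, assuming a hand-rolled edges(src_host, dst_host) table rather than crawlgraph's actual schema:

```python
import duckdb

con = duckdb.connect("graph.db")   # assumes an edges(src_host, dst_host) table is already loaded
target = "example.com"

# hop 1: hosts linking to the target. hop 2: hosts linking to those hosts.
# if your referrers are mostly linked by each other, the authority you
# inherit is recursive rather than real.
rows = con.execute("""
    SELECT e2.src_host AS second_hop_host,
           count(DISTINCT e1.src_host) AS referrers_reached
    FROM edges e1                                -- e1.src_host -> target
    JOIN edges e2 ON e2.dst_host = e1.src_host   -- e2.src_host -> one of the referrers
    WHERE e1.dst_host = ?
    GROUP BY 1
    ORDER BY referrers_reached DESC
    LIMIT 50
""", [target]).fetchall()
```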
the numbers
the april 2026 release covers 120 million unique domains and 4.4 billion edges (counting one edge per source-destination pair, not per page). that's smaller than what ahrefs or semrush report — they recrawl on top of common crawl with their own bots, so they pick up more long-tail and ephemeral pages.
but the structure is similar. when we benchmark crawlgraph's domain-level reports against ahrefs's, the top-50 referring domains overlap by ~94%. the bottom of the long tail differs more, but if you're filtering for domains worth caring about, the long tail is what you're filtering out anyway.
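for what it's worth, the overlap number is nothing fancier than set intersection over the two top-50 lists. a toy version, with made-up inputs standing in for the two tools' exports:

```python
def top_n_overlap(ours: list[str], theirs: list[str], n: int = 50) -> float:
    """share of the top-n referring domains that appear in both reports."""
    return len(set(ours[:n]) & set(theirs[:n])) / n

# stand-in lists; in practice these come from the two tools' exports
ours = ["a.com", "b.org", "c.net", "d.io"]
theirs = ["a.com", "c.net", "d.io", "e.dev"]
print(top_n_overlap(ours, theirs, n=4))  # 0.75 on this toy data
```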
how we use it
common crawl publishes roughly monthly. crawlgraph currently indexes a quarterly composite of those releases; at the time of writing, the live index combines the january, february and march 2026 crawls. on each refresh we ingest into a partitioned columnar store, deduplicate edges to canonical hostnames, and compute a few derived columns: host-level authority, link counts per source-destination pair, redirect chains. customers reach the result through the web ui or the public api at /api/v1. there is no sql console; the api keeps the surface small.
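a rough sketch of the dedup step, with deliberately simplified canonicalisation (lowercase, strip a leading www.) and a hypothetical input path and column names rather than our production pipeline:

```python
import duckdb

con = duckdb.connect("graph.db")

# collapse page-level links into one row per source -> destination host pair,
# keeping the per-pair link count as a derived column
con.execute(r"""
    CREATE OR REPLACE TABLE edges AS
    SELECT
        lower(regexp_replace(src_host, '^www\.', '')) AS src_host,
        lower(regexp_replace(dst_host, '^www\.', '')) AS dst_host,
        count(*) AS link_count
    FROM read_parquet('raw_links/*.parquet')   -- hypothetical page-level link dump
    WHERE src_host IS NOT NULL
      AND dst_host IS NOT NULL
      AND src_host <> dst_host                 -- drop self-links at host level
    GROUP BY 1, 2
""")
```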
because the source dataset is public, anyone can verify our numbers. point your own duckdb or athena at the same parquet release we read from and you can reproduce a domain-level report from first principles. that's not a marketing claim — it's a property of using open data.
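the verification query itself is a handful of lines, and the same sql runs on athena with minor changes. again a sketch over the assumed edges table from above, not the exact query behind the api:

```python
import duckdb

con = duckdb.connect("graph.db")
target = "example.com"   # the domain whose report you want to check

# referring hosts ranked by how many deduplicated links they send to the target;
# the top of this list should line up with the report crawlgraph returns
report = con.execute("""
    SELECT src_host, sum(link_count) AS links
    FROM edges
    WHERE dst_host = ?
    GROUP BY src_host
    ORDER BY links DESC
    LIMIT 50
""", [target]).fetchall()

for host, links in report:
    print(host, links)
```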
what common crawl is not good at
freshness. common crawl publishes monthly and crawlgraph refreshes its quarterly composite on a slower cadence, so a link that appeared yesterday won't show up for weeks, sometimes months. for fast-moving link-building campaigns where you need 15-minute granularity, this is the wrong tool. for everything slower than that — the 95% of audits — it's the right one.
it's also not exhaustive. very small or very new domains won't be in every release. if your site went live last week, give it a month before you query.