ask most seos how ahrefs finds backlinks and you'll get a shrug or "they crawl the web." that's half right. ahrefs, moz, semrush, and majestic each run their own proprietary crawler, build their own index, and ship a curated database - and the differences in methodology explain why no two backlink tools ever agree. here's how each of the big four actually builds its index, and why common crawl (the open data source crawlgraph uses) is the audit trail none of them publish. it's also why five free ways to find backlinks exist alongside the paid tools at all.
the short answer: each vendor runs its own crawler
none of the major backlink tools share a data pipeline. they all run their own web crawl, store their own index, score links with their own algorithms, and decide their own dedup rules. that's why your ahrefs number and your semrush number for the same domain are never identical - they're measuring slightly different webs.
the only open, auditable backlink data source is common crawl. everything else is a proprietary index you trust on the vendor's word.
ahrefs: ahrefsbot and the 3.4 trillion url index
ahrefs runs ahrefsbot, a user-agent string you'll see in any web server log. ahrefs claims the bot crawls roughly 8 billion pages per day and maintains a live index of about 3.4 trillion urls - both numbers come from ahrefs's own ahrefsbot documentation.
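to see how often these crawlers actually visit your own site, you can count their user-agent tokens in an access log. the sketch below is a minimal example, assuming a combined-format log where the user-agent appears somewhere in each line; the function name and the sample log lines are made up for illustration.

```python
from collections import Counter

# user-agent tokens for crawlers named in this article
CRAWLER_TOKENS = ["AhrefsBot", "SemrushBot", "MJ12bot", "CCBot"]


def count_crawler_hits(log_lines):
    """Count log lines that mention each crawler token (case-insensitive)."""
    hits = Counter()
    for line in log_lines:
        lowered = line.lower()
        for token in CRAWLER_TOKENS:
            if token.lower() in lowered:
                hits[token] += 1
                break  # attribute each request line to one crawler at most
    return hits


# made-up sample lines, not real traffic
sample_log = [
    '203.0.113.5 - - [01/Jan/2025:00:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)"',
    '203.0.113.5 - - [01/Jan/2025:00:00:09 +0000] "GET /blog HTTP/1.1" 200 900 "-" '
    '"Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)"',
    '198.51.100.7 - - [01/Jan/2025:00:01:00 +0000] "GET / HTTP/1.1" 200 512 "-" '
    '"CCBot/2.0 (https://commoncrawl.org/faq/)"',
    '192.0.2.44 - - [01/Jan/2025:00:02:00 +0000] "GET / HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0"',
]

hits = count_crawler_hits(sample_log)
for token in CRAWLER_TOKENS:
    print(token, hits[token])
```

user-agent strings can be spoofed, so treat these counts as a rough signal rather than proof of who crawled what.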
the index refresh window is short. ahrefs publishes a fresh index updated every 15 minutes for recent links, plus a live index that lags by hours. sites that block ahrefsbot in robots.txt - common among news publishers and large ecommerce - are absent from the index entirely, which is one source of the ahrefs-vs-real-web gap.
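you can check which crawlers a given robots.txt shuts out with python's standard-library parser. this is a minimal sketch: the robots.txt content is an invented example and `blocked_bots` is a hypothetical helper name, not part of any vendor's tooling.

```python
from urllib.robotparser import RobotFileParser


def blocked_bots(robots_txt, bots, path="/"):
    """Return the bots that this robots.txt disallows from fetching `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in bots if not parser.can_fetch(bot, path)]


# invented example: ahrefsbot is blocked site-wide, everyone else only from /private/
example_robots = """\
User-agent: AhrefsBot
Disallow: /

User-agent: *
Disallow: /private/
"""

print(blocked_bots(example_robots, ["AhrefsBot", "SemrushBot", "CCBot"]))
# ['AhrefsBot']
```

a site configured like this would be missing from an index built by ahrefsbot while still being reachable by the other crawlers.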
what gets dropped: pages with noindex directives, urls that return 404/410 across multiple crawl passes, and pages whose canonical points elsewhere (the canonical target counts, not the redirector). see ahrefs vs crawlgraph for what that crawl scale buys you in practice.
moz: mozscape and link explorer
moz is on its second backlink index. the original was mozscape, retired in 2022. the current product is link explorer, which moz rebuilt from scratch with a new crawler architecture and (per moz) a far larger index - they claim 40 trillion urls in the link explorer index.
the crawler is mozbot (sometimes still tagged as rogerbot in legacy headers). refresh cadence is slower than ahrefs - link explorer updates on a 4-8 week cycle, which is part of why moz's backlink counts often look lower for very recent links and higher for older ones (the index accumulates rather than expires).
moz's scoring is built around domain authority (da) and page authority (pa), which are derived metrics applied to the index, not properties of the underlying crawl. for the methodology gap between moz's da and equivalent metrics see the comparison table below.
semrush: the backlink database
semrush's crawler is semrushbot, documented at semrush.com/bot. semrush positions its backlink product as the backlink database, claiming about 43 trillion urls indexed - the largest of the major vendors' public claims.
semrush splits its index into a live index (active backlinks) and an archive of lost and found backlinks (links that previously existed but no longer do, or new links discovered since the last sync). the lost-and-found view is a semrush strength - most competitors fold lost links into the main index or drop them entirely.
refresh cadence: live index refreshes daily-to-weekly per domain, depending on crawl priority. for the methodology spec side-by-side see crawlgraph vs semrush.
majestic: mj12bot and distributed crawling
majestic is the architectural dark horse. instead of running a centralized crawler on owned infrastructure, majestic uses mj12bot - a distributed crawl that runs on volunteer machines. anyone can install mj12bot on a spare server and contribute crawl capacity in exchange for majestic-pro credit. the full bot policy is at mj12bot.com.
majestic exposes two indexes:
- fresh index - rolling 90-day window. updated daily. ~1.4 trillion urls.
- historic index - 5-year accumulated archive. updated monthly. larger than the fresh index by an order of magnitude.
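the difference between a rolling window and an accumulated archive is easy to see in code. the sketch below assumes index membership is decided by the date a link was last seen; that rule, the function name, and the sample links are illustrative assumptions, not majestic's published logic.

```python
from datetime import date, timedelta

FRESH_WINDOW = timedelta(days=90)         # rolling 90-day window
HISTORIC_WINDOW = timedelta(days=5 * 365)  # roughly five years


def split_indexes(last_seen_by_link, today):
    """Return (fresh, historic) sets of links based on last-seen date."""
    fresh = {
        link for link, seen in last_seen_by_link.items()
        if today - seen <= FRESH_WINDOW
    }
    historic = {
        link for link, seen in last_seen_by_link.items()
        if today - seen <= HISTORIC_WINDOW
    }
    return fresh, historic


# invented sample data
links = {
    "a.example -> you": date(2024, 12, 15),  # seen 17 days ago
    "b.example -> you": date(2024, 6, 1),    # seen about 7 months ago
    "c.example -> you": date(2018, 1, 1),    # seen about 7 years ago
}

fresh, historic = split_indexes(links, today=date(2025, 1, 1))
print(sorted(fresh))     # ['a.example -> you']
print(sorted(historic))  # ['a.example -> you', 'b.example -> you']
```

under this assumption a link that disappears falls out of the fresh view within 90 days but keeps counting in the historic view for years, which is one reason the two totals differ so much.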
majestic's scoring is built around trust flow and citation flow - both proprietary scores derived from the link graph topology, computed differently from moz's da or ahrefs's dr but conceptually similar.
common crawl: ccbot and the open alternative
common crawl is the only major web-scale crawl with a public dataset. its crawler is ccbot. unlike the proprietary vendors, common crawl publishes:
- the full warc files (web archive format) - the raw responses ccbot received from every url it crawled. roughly 3-5 billion pages per monthly crawl.
- the derived hyperlink graph - one file with every vertex (registered domain) and one with every edge (source-domain → destination-domain pair). this is the data layer crawlgraph queries against.
- the full methodology - exactly how the crawler is configured, what robots.txt rules it honors, what gets dropped.
crawlgraph runs on this. when you query a domain, we're running a sql query against duckdb-loaded common crawl edges. see what common crawl is and how it's structured for the underlying architecture, and the common crawl release schedule for the data-freshness cadence.
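a referring-domain lookup against vertex and edge tables can be sketched as below. this is not crawlgraph's actual query: it uses python's built-in sqlite3 as a stand-in for duckdb so it runs anywhere, the table and column names are hypothetical, and it assumes domains are stored in reversed form (com.example), which is how the common crawl domain graph is usually described. check the release notes for the exact file layout.

```python
import sqlite3


def reverse_domain(domain):
    """example.com <-> com.example (the function is its own inverse)."""
    return ".".join(reversed(domain.split(".")))


def referring_domains(conn, target):
    """Return the sorted domains that have an edge pointing at `target`."""
    rows = conn.execute(
        """
        SELECT DISTINCT src.domain
        FROM edges
        JOIN vertices AS dst ON dst.id = edges.dst_id
        JOIN vertices AS src ON src.id = edges.src_id
        WHERE dst.domain = ?
        """,
        (reverse_domain(target),),
    ).fetchall()
    return sorted(reverse_domain(row[0]) for row in rows)


# tiny invented graph standing in for the real vertex and edge files
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vertices (id INTEGER PRIMARY KEY, domain TEXT)")
conn.execute("CREATE TABLE edges (src_id INTEGER, dst_id INTEGER)")
conn.executemany(
    "INSERT INTO vertices VALUES (?, ?)",
    [(0, "com.example"), (1, "org.blog-a"), (2, "net.news-b"), (3, "com.other")],
)
conn.executemany(
    "INSERT INTO edges VALUES (?, ?)",
    [(1, 0), (2, 0), (1, 3)],
)

print(referring_domains(conn, "example.com"))
# ['blog-a.org', 'news-b.net']
```

the same join works in any sql engine; what matters is that every row in the result traces back to a line in a published file.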
when ahrefs says your competitor has 12,000 backlinks, you have to take that number on trust - the index isn't inspectable. when crawlgraph says the same site has 8,500, you can download the underlying common crawl edges file yourself and verify the count. the proprietary vendors are bigger; the open dataset is the only one anyone can audit.
methodology comparison: the five crawlers side-by-side
| vendor | crawler | claimed index size | refresh cadence | transparency | cost |
|---|---|---|---|---|---|
| ahrefs | ahrefsbot | 3.4T urls | fresh: 15 min · live: hours | proprietary | $129+/mo |
| moz | mozbot / rogerbot | 40T urls (link explorer) | 4-8 weeks | proprietary | $99+/mo |
| semrush | semrushbot | 43T urls | daily-weekly per domain | proprietary | $140+/mo |
| majestic | mj12bot (distributed) | 1.4T fresh · ~10T historic | fresh: daily · historic: monthly | proprietary, distributed crawl | $50+/mo |
| common crawl | ccbot | 3-5B pages per crawl | monthly (1-2 mo cadence) | fully open - warc + graph published | free |
the index-size numbers aren't directly comparable. ahrefs and moz count urls (every distinct page), while common crawl counts pages crawled per release. moz's 40 trillion is a cumulative figure across years; ahrefs's 3.4 trillion is a live snapshot. but the order-of-magnitude story holds: proprietary crawls are bigger, common crawl is smaller, and only the open dataset lets you check the work.
why tools disagree about your backlink count
plug the same domain into ahrefs, moz, semrush, and majestic and you'll get four different referring-domain counts. the spread is rarely under 20%, often over 50%. four mechanical reasons:
- crawl frontier differences - each crawler discovers different urls because each starts from a different seed list and follows different priority heuristics. ahrefsbot will crawl a low-traffic blog semrushbot won't bother with, and vice versa.
- dedup choices - is blog.example.com the same referring domain as example.com? ahrefs and crawlgraph say yes; moz and majestic sometimes split them. one or two splits per domain swings the count.
- dofollow detection - ahrefs and semrush flag rel=nofollow, rel=ugc, and rel=sponsored individually. moz treats ugc and sponsored as nofollow. majestic treats them as dofollow. if you filter by "dofollow only" the numbers diverge.
- javascript rendering - ahrefs renders js. common crawl doesn't (mostly). a backlink that only appears after react hydration may show in ahrefs and be invisible in crawlgraph.
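the dedup and dofollow points above can be shown with a few lines of code. this is a rough sketch, not any vendor's implementation: the helper names and sample data are invented, and `registered_domain` naively keeps the last two labels, whereas real tools consult the public suffix list (so it gets example.co.uk wrong).

```python
def registered_domain(host):
    """Naive registered domain: last two labels only (wrong for co.uk etc.)."""
    return ".".join(host.lower().split(".")[-2:])


def count_referring(hosts, merge_subdomains):
    """Count referring domains with or without folding subdomains together."""
    if merge_subdomains:
        return len({registered_domain(h) for h in hosts})
    return len({h.lower() for h in hosts})


def is_followed(rel_value, treat_as_nofollow):
    """A link is followed unless its rel tokens hit the nofollow policy."""
    tokens = set(rel_value.lower().split())
    return not (tokens & treat_as_nofollow)


# two policies matching the behaviours described in the list above
STRICT = frozenset({"nofollow", "ugc", "sponsored"})  # ugc/sponsored count as nofollow
LENIENT = frozenset({"nofollow"})                     # ugc/sponsored count as dofollow

hosts = ["blog.example.com", "example.com", "shop.example.com", "other.org"]
rels = ["", "nofollow", "ugc", "sponsored noopener"]

print(count_referring(hosts, merge_subdomains=True))   # 2
print(count_referring(hosts, merge_subdomains=False))  # 4
print(sum(is_followed(r, STRICT) for r in rels))       # 1
print(sum(is_followed(r, LENIENT) for r in rels))      # 3
```

same four hosts and same four links, yet the referring-domain count doubles and the dofollow count triples depending on which policy is applied.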
what this means for your backlink research
the honest take: ahrefs is more thorough for live-link discovery on a tight deadline. common crawl is more transparent and free. for most research workflows you want both - ahrefs (or a free workaround) for the live picture, common crawl for the audit trail and the unverified-site analysis.
the practical move:
- if you're monitoring a single live link-building campaign with sub-week freshness, a proprietary tool earns its keep.
- if you're doing competitor backlink analysis or quarterly audits, common crawl matches the requirement and costs nothing.
- if you want to verify a specific claim ("ahrefs says my competitor has 12,000 backlinks - is that real?"), download the relevant common crawl edges file and count. it's the only data source where you can.
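the "download and count" step can be done without loading the whole file into memory. the sketch below assumes a gzip-compressed edges file with one tab-separated source-id and destination-id pair per line, and that you have already looked up your domain's numeric id in the matching vertices file; the function name and the in-memory sample are illustrative.

```python
import gzip
import io


def count_inbound_sources(edge_file, target_id):
    """Stream a gzipped 'src<TAB>dst' file and count distinct sources for target_id.

    edge_file may be a path or an open binary file object.
    """
    sources = set()
    with gzip.open(edge_file, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            src, dst = line.rstrip("\n").split("\t")
            if dst == target_id:
                sources.add(src)
    return len(sources)


# invented in-memory sample standing in for a downloaded edges file
raw = "1\t0\n2\t0\n1\t3\n2\t0\n".encode("utf-8")
sample_file = io.BytesIO(gzip.compress(raw))

print(count_inbound_sources(sample_file, "0"))  # 2
```

on a real release you would pass the path of the downloaded file instead of the in-memory buffer; the set only holds source ids for the one target, so memory use stays small.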
when an seo client asks "why does ahrefs show 8,500 and crawlgraph shows 6,200?" the answer is always one of the four reasons above - crawl frontier, dedup, dofollow detection, or js rendering. for a deeper dive on which one applies when, see referring domain vs backlink for the dedup half and the common crawl release schedule for the freshness half.
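to put a number on the gap in that example, here is one way to compute it. the definition is an assumption: this article doesn't specify how divergence is measured, so the sketch takes the difference relative to the larger count.

```python
def relative_gap(count_a, count_b):
    """Difference between two counts as a fraction of the larger one."""
    high, low = max(count_a, count_b), min(count_a, count_b)
    if high == 0:
        return 0.0  # both counts are zero, so there is no gap
    return (high - low) / high


print(f"{relative_gap(8500, 6200):.1%}")  # 27.1%
```

measured against the smaller count instead, the same pair gives about 37%, so state which baseline you used when reporting a gap to a client.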
faq
does ahrefs crawl the entire web?
no. ahrefs claims a 3.4 trillion url index, but the open web is estimated at 50+ trillion urls and ahrefsbot is blocked by a meaningful fraction of large sites (most news publishers, many ecommerce platforms). no proprietary crawler covers more than a slice of the indexable web.
is common crawl as accurate as ahrefs?
not for live-link discovery - common crawl releases monthly, ahrefs in minutes. for the long-tail referring-domain picture on stable sites, common crawl is within 20% of ahrefs's numbers and sometimes catches links ahrefs misses (a site that blocks ahrefsbot in robots.txt may still allow ccbot, and vice versa).
why do ahrefs and semrush show different backlink counts?
four reasons: their crawl frontiers cover different slices of the web, they dedup subdomains differently, they classify rel=nofollow and rel=ugc differently, and one renders javascript more aggressively than the other. expect 20-50% divergence on almost any domain you query.
can i see what's in ahrefs's index without paying for ahrefs?
no - ahrefs's index is proprietary and isn't exposed via any free endpoint. but you can query common crawl directly (or via a tool like crawlgraph) to see the open-data equivalent. for any site that ccbot indexes, the data is free and auditable.