The 5% That Should Have Been 95%
We built a site with over 1,300 programmatically generated location pages — and then discovered Google had indexed barely 5% of them. Of 1,418 URLs in our sitemap, only 68 had made it into the index. Sixty of those were blog posts. Eight were core site pages. Not a single location page had been indexed.
That is a brutal number to look at when you have spent weeks building a hub-and-spoke matrix designed to scale to thousands of targeted landing pages. The tempting diagnosis was technical: a broken sitemap, a crawl-budget ceiling, a canonical mismatch, a stray noindex. Those are the comfortable answers, because they are the ones you can fix with a config change.
The real answer was none of those. The technical infrastructure was correct. The problem was content quality, and it was hiding behind a perfectly green technical audit. This post is the story of how five AI agents working in parallel found it in a single pass — and why a traditional sequential audit would have cleared the site as healthy.
This is Part 2 of our Building JJM series. Part 1 covered the hub-and-spoke architecture we used to design the matrix. This is the cautionary sequel: what happens when you build the matrix without differentiating the content inside it.
Why a Sequential Audit Would Have Cleared Us
A standard technical SEO audit walks a checklist top to bottom. Is the sitemap valid? Yes. Is robots.txt allowing the crawler? Yes. Are canonicals self-referential? Yes. Is there a noindex tag leaking? No. Are pages returning 200? Yes. Is structured data present? Yes.
Every one of those checks passes on our location pages. A consultant running that checklist would reach the bottom, find nothing red, and conclude the site is technically healthy, then blame "Google being slow to crawl new pages" and recommend waiting. We would have waited months for an index rate that was never going to improve.
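To make that concrete, here is a minimal sketch of three of those checklist items as code. The function and class names, the example URL, and the sample HTML are all hypothetical stand-ins, not our audit tooling. The point is only that a near-clone page passes every check.

```python
from html.parser import HTMLParser


class HeadSignals(HTMLParser):
    """Collects the robots meta directive and the canonical URL from a page."""

    def __init__(self):
        super().__init__()
        self.robots = ""
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            self.robots = (attr.get("content") or "").lower()
        elif tag == "link" and (attr.get("rel") or "").lower() == "canonical":
            self.canonical = attr.get("href")


def technical_checks(url, status, html):
    """Return the pass/fail result of each infrastructure check for one page."""
    signals = HeadSignals()
    signals.feed(html)
    return {
        "returns_200": status == 200,
        "no_noindex": "noindex" not in signals.robots,
        "self_canonical": signals.canonical == url,
    }


# Hypothetical near-clone location page (assumed URL and markup).
page_url = "https://example.com/locations/wollongong"
page_html = (
    "<html><head><title>Services</title>"
    '<link rel="canonical" href="https://example.com/locations/wollongong">'
    "</head><body><p>Same copy as every other suburb.</p></body></html>"
)
result = technical_checks(page_url, 200, page_html)
print(result)
```

Nothing in these checks looks at the body copy, which is why a checklist built from them cannot see duplication.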
The failure mode of a sequential audit is that it answers questions in the order they are easiest to ask, not in the order most likely to find the root cause. Content quality sits at the very bottom of most checklists because it is the hardest to measure objectively. By the time a tired auditor gets there, confirmation bias has already set in: everything above was green, so this probably is too.
The fix is not a better checklist. The fix is to stop asking the questions sequentially.
The 5-Agent Parallel Audit
Instead of one auditor walking a list, we dispatched five specialist AI agents at once, each owning a single dimension of site health and each returning an independent verdict:
- An indexing and crawl-health agent: sitemaps, robots.txt, canonical integrity, coverage.
- An on-page metadata agent: titles, descriptions, headings, keyword targeting.
- A schema and AI-citation agent: structured data, llms.txt, machine readability.
- A performance agent: Core Web Vitals, render path, page weight.
- A content-quality and authority agent: uniqueness, depth, E-E-A-T signals.
Each agent ran against the live site with no knowledge of what the others were finding. That isolation is the whole point. When five specialists examine the same patient independently and four of them report "healthy" while one reports a specific, severe problem, you have a high-confidence signal about where the root cause lives. There is no cross-contamination, no anchoring on the first plausible answer, no checklist fatigue.
This is the Claude Code agent-teams pattern applied to diagnosis rather than construction. It is the same reason a hospital runs blood work, imaging, and a physical exam as parallel tracks instead of making one doctor do everything in sequence: parallelism converts a long serial investigation into a single round, and independence makes the result trustworthy.
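The dispatch pattern itself can be sketched in a few lines. This is an illustration under assumptions, not our orchestration code: the agents here are placeholder functions that simply echo the verdicts described in this post, and `make_agent`, `run_parallel_audit`, and the example URL are hypothetical names. What the sketch shows is the shape: every agent starts at once, each receives only the site URL, and no agent can read another's verdict.

```python
from concurrent.futures import ThreadPoolExecutor


def make_agent(dimension, check):
    """Wrap a check function as an agent that reports on one dimension only."""
    def agent(site_url):
        finding = check(site_url)  # None means "nothing to report"
        return {"dimension": dimension, "clean": finding is None, "finding": finding}
    return agent


# Placeholder checks. Real agents would examine the live site; these stubs
# only echo the verdicts from this post so the control flow can be run.
AGENTS = [
    make_agent("indexing", lambda url: None),
    make_agent("metadata", lambda url: "generic titles on service-location pages"),
    make_agent("schema", lambda url: None),
    make_agent("performance", lambda url: None),
    make_agent("content_quality", lambda url: "93-99% overlap across location hubs"),
]


def run_parallel_audit(site_url):
    """Dispatch every agent at once; each sees only the URL, never another verdict."""
    with ThreadPoolExecutor(max_workers=len(AGENTS)) as pool:
        futures = [pool.submit(agent, site_url) for agent in AGENTS]
        return [future.result() for future in futures]


verdicts = run_parallel_audit("https://example.com")
fired = [v["dimension"] for v in verdicts if not v["clean"]]
print(fired)
```

Results are collected in dispatch order, so the report is stable from run to run even though the agents execute concurrently.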
What Each Agent Found
The agents came back, and the pattern was immediate and unambiguous.
The indexing agent confirmed the infrastructure was clean: sitemaps valid and submitted, robots.txt permissive, canonicals self-referential. Every location page was technically eligible for indexing. The performance agent found Core Web Vitals within acceptable ranges — nothing that would suppress indexing. The schema agent found structured data present and AI-citation readiness in progress, with a low but explainable AI-visibility score of 14 out of 100.
Three green lights. Then two agents fired.
The metadata agent found a title-tag leak: a single-character error in the templating logic was shipping homepage-generic keywords across 661 service-location pages, so hundreds of pages that should have targeted distinct "service in suburb" queries were instead competing for the same generic terms. And the content-quality agent surfaced the finding that explained everything: 93 to 99 percent content overlap across the location hub pages. The pages were near-identical to one another, differentiated by little more than a swapped suburb name.
When the indexing, performance, and schema agents all return clean and only the content and metadata agents fire, the diagnosis writes itself. This is not a crawl problem. This is a content problem.
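Reading the pattern can be expressed as a small rule. The sketch below is a simplification we are assuming for illustration: the grouping of dimensions into an infrastructure set and the `diagnose` function are hypothetical, and a real triage would weigh severity as well as which agent fired.

```python
# Dimensions that describe whether a page is eligible to be crawled and indexed.
INFRASTRUCTURE = {"indexing", "performance", "schema"}


def diagnose(verdicts):
    """verdicts maps a dimension name to True (clean) or False (fired)."""
    fired = {name for name, clean in verdicts.items() if not clean}
    if not fired:
        return "no finding"
    infra_fired = fired & INFRASTRUCTURE
    if infra_fired:
        # Rule out eligibility problems before blaming the content.
        return "infrastructure problem: " + ", ".join(sorted(infra_fired))
    return "content problem: " + ", ".join(sorted(fired))


# The verdict pattern described in this post.
audit = {
    "indexing": True,
    "performance": True,
    "schema": True,
    "metadata": False,
    "content_quality": False,
}
print(diagnose(audit))
```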
The Single-Character Title Leak
Before we get to the headline finding, the title-tag leak deserves its own moment, because it is the kind of bug that only a dedicated metadata pass catches. A single character in the title template — the difference between interpolating the page's own location variable and falling back to a site-wide default — meant 661 pages were all shipping effectively the same title.
In programmatic SEO, the title tag is the single strongest on-page signal you control at scale. If 661 pages declare nearly the same title, you have told Google that 661 pages are about the same thing. That is not a typo; at scale it is a structural duplication signal that compounds the underlying content-overlap problem. One character, 661 pages, one very bad message to the crawler.
This is why "it builds and deploys fine" is never the same as "it is correct." The template compiled. The pages rendered. The titles were present in the HTML. Every automated build check was green — and the output was still wrong in a way that only a human reading five sample pages side by side, or an agent specifically tasked with metadata uniqueness, would ever notice.
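A metadata-uniqueness check of this kind is cheap to automate. The following is a minimal sketch, assuming you already have a mapping of URL to rendered title; the function name and the sample pages are hypothetical, and it does not reproduce our template bug, only the check that would have caught it.

```python
from collections import defaultdict


def duplicate_titles(titles_by_url):
    """Group URLs by normalized title and return only titles shared by 2+ pages."""
    groups = defaultdict(list)
    for url, title in titles_by_url.items():
        groups[" ".join(title.split()).lower()].append(url)
    return {title: urls for title, urls in groups.items() if len(urls) > 1}


# Hypothetical rendered titles: two pages fell back to the generic default.
titles = {
    "/plumbing/wollongong": "Plumbing Services | Example Co",
    "/plumbing/shellharbour": "Plumbing Services | Example Co",
    "/plumbing/kiama": "Plumbing in Kiama | Example Co",
}
dupes = duplicate_titles(titles)
for title, urls in dupes.items():
    print(f"{len(urls)} pages share the title: {title!r}")
```

Run against rendered output rather than the template, a check like this fails the build whenever a location variable silently falls back to a default.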
The Real Finding: 93-99% Content Overlap
Here is the heart of it. Our location pages were 93 to 99 percent identical to each other. The hub-and-spoke matrix had produced the right number of pages and the right URL structure, but the content inside each cell was a near-clone of every other cell. Swap "Wollongong" for "Shellharbour" and the page was otherwise the same paragraphs, the same headings, the same value propositions, the same calls to action.
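One way to put a number on that kind of overlap is a word-level similarity ratio between two pages. The sketch below uses Python's standard `difflib.SequenceMatcher`; it is an assumption about method, not a description of how the content-quality agent computed its figure, and the template sentence is invented for the example.

```python
from difflib import SequenceMatcher


def word_overlap(text_a, text_b):
    """Similarity between two texts, from 0.0 to 1.0, compared word by word."""
    words_a = text_a.lower().split()
    words_b = text_b.lower().split()
    return SequenceMatcher(None, words_a, words_b).ratio()


# Hypothetical template: the suburb name is the only thing that changes.
TEMPLATE = (
    "We provide fast reliable plumbing in {suburb} with upfront pricing "
    "licensed local tradespeople and a satisfaction guarantee on every job"
)
page_a = TEMPLATE.format(suburb="Wollongong")
page_b = TEMPLATE.format(suburb="Shellharbour")

overlap = word_overlap(page_a, page_b)
print(f"{overlap:.0%} overlap")
```

Longer pages with a single swapped name score even closer to 100 percent, because the one differing word becomes a smaller share of the total.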
That is the textbook definition of a doorway page: a large set of near-duplicate pages created primarily to rank for many location or keyword variations, funnelling everyone to the same destination, with no genuinely unique value on each individual page.
We did not set out to build doorway pages. We set out to build a programmatic location-page system, and we built it well — technically. But "technically well-built" and "worth indexing" are different bars, and Google measures the second one. The matrix was the skeleton; we never put unique muscle on the bones.
The Doorway-Page Penalty in Slow Motion
The most important thing we learned: there was no manual action. No penalty notice in Search Console, no message, no warning. Google's quality systems do not need a human reviewer to flag near-duplicate programmatic content. They simply stop crawling and indexing it.
This is the doorway-page penalty in slow motion. It does not arrive as an event you can point to. It arrives as an absence — pages that get discovered, maybe crawled once, and then quietly never indexed. The algorithm decides the marginal page adds nothing the index does not already have, and it moves on. From the outside it looks exactly like "Google is slow," which is precisely why it fools a technical audit.
The signals that would normally reassure you are all present and correct: the page is in the sitemap, robots.txt allows it, the canonical points to itself, the server returns 200. The site looks healthy by every infrastructure metric. The suppression is happening one layer up, at the quality-evaluation layer, where your config files have no vote.
If your programmatic content has an index rate under 10 percent and your technical checks are clean, do not reach for crawl-budget theories. You are almost certainly looking at quality-based suppression, and the fix lives in the content, not the config.
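That rule of thumb is simple enough to write down. A minimal sketch, using the counts from this post (68 indexed of 1,418 submitted); the function names, the wording of the returned messages, and the 10 percent default are illustrative choices, not a standard.

```python
def index_rate(indexed, submitted):
    """Fraction of submitted URLs that made it into the index."""
    if submitted <= 0:
        raise ValueError("submitted must be a positive count")
    return indexed / submitted


def triage(indexed, submitted, technical_checks_clean, threshold=0.10):
    """Apply the rule of thumb: low index rate plus clean checks points at content."""
    rate = index_rate(indexed, submitted)
    if rate >= threshold:
        return "index rate above threshold"
    if technical_checks_clean:
        return "suspect quality-based suppression: audit the content"
    return "fix the failing technical checks first"


rate = index_rate(68, 1418)
print(f"index rate: {rate:.1%}")
print(triage(68, 1418, technical_checks_clean=True))
```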
The Tier-1 Fix and the Liability We Removed
We split the response into tiers. Tier 1 was the set of fixes that were unambiguously correct and shippable immediately.
First, we fixed the title-tag leak so all 661 service-location pages carried genuinely distinct, location-specific titles. Second, and this came directly from the authority agent, we removed three fabricated case studies. The site had been seeded with case studies containing invented client names and hallucinated performance figures: numbers that read well but described work that never happened. That is not just embarrassing; it is a direct E-E-A-T liability. Experience, Expertise, Authoritativeness, and Trustworthiness are the qualities Google's quality guidelines use to describe content that deserves to rank, and fabricated proof actively undermines all four. An AI-assisted content pipeline will happily invent a glowing case study if you let it; the authority agent's job was to catch exactly that, and it did.
Tier 2 is the harder, slower decision still open: what to do with the thin location pages themselves. The two defensible options are to prune them back to a canonical set of locations we can genuinely differentiate, or to enrich each surviving page with a real, suburb-specific content block that would survive a quality review. Both are real work. Neither is a config change. That is the honest cost of having built the matrix before the content — and it is the lesson worth carrying into the next build.
How to Run This Audit on Your Own Site
You do not need our exact stack to use this approach. The transferable method is: replace one sequential audit with five parallel specialists, give each one a single dimension and no visibility into the others, and read the pattern of their verdicts rather than any single result. When four come back clean and one fires, trust the one.
If you run a programmatic or location-based site and your index rate is under 10 percent, start with the content-quality agent, not the technical one. Pull a handful of your pages side by side and ask the blunt question: if I swap the location name, is anything else different? If the answer is no, you have found your problem before you have run a single crawl report.
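The swap test can also be scripted for a first pass. This is a rough sketch under assumptions: it masks the location name in each page and compares what is left, the function names and sample copy are hypothetical, and it only detects exact clones after masking. Near-clones with small wording changes still need a similarity score or a human read.

```python
import re


def mask_location(text, location):
    """Replace every occurrence of the location name with a neutral placeholder."""
    return re.sub(re.escape(location), "{location}", text, flags=re.IGNORECASE)


def normalize(text):
    """Collapse whitespace and case so only wording differences remain."""
    return " ".join(text.split()).lower()


def differs_beyond_location(page_a, location_a, page_b, location_b):
    """True if the two pages differ in anything other than the location name."""
    masked_a = normalize(mask_location(page_a, location_a))
    masked_b = normalize(mask_location(page_b, location_b))
    return masked_a != masked_b


# Hypothetical page copy for two suburbs.
wollongong = "Need a plumber in Wollongong? Our Wollongong team is ready."
shellharbour = "Need a plumber in Shellharbour? Our Shellharbour team is ready."
enriched = "Need a plumber in Shellharbour? We cover older homes near the harbour."

print(differs_beyond_location(wollongong, "Wollongong", shellharbour, "Shellharbour"))
print(differs_beyond_location(wollongong, "Wollongong", enriched, "Shellharbour"))
```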
Want help diagnosing why your pages are not indexing — or building programmatic pages that actually earn their place in the index? That is exactly what our SEO service does. And if your growth depends on ranking across many suburbs or regions, our local SEO work is built around the differentiation these pages were missing.
What's Next
Part 3 of Building JJM will document the Tier-2 decision in full: whether we pruned or enriched the location matrix, and what the index rate did in response. The throughline of this series is that infrastructure is necessary but never sufficient — the matrix gets you eligible to rank, and only genuinely differentiated content gets you indexed. Build the skeleton, yes. But do not skip the muscle.