# 100-Site GEO Survey — Reproduction Runbook

Last updated: 2026-05-11. Companion to https://geolocus.ai/multi-site-survey.

---

## What this is

Between 2026-03-30 and 2026-04-29, GeoLocus Group audited 100 websites across 31
industries against a 13-signal protocol — 8 binary readiness signals plus 5
quantitative metrics, each binarized at a defined threshold. Every site receives
the same scan, the same thresholds, and the same arithmetic. The page at
`/multi-site-survey` publishes the per-site scorecard, the cohort-level pass
rates, and the methodology. This runbook is the second half of that publication:
the script, the thresholds, the cohort, and the output schema, so any third
party can re-run the same audit on their own machine and get the same numbers.

The audit reads only public HTTP endpoints. No paid API keys are required for
the 8 binary signals or the 4 publicly measurable quantitative metrics
(RR, RTC, RPS, LMR). Of those four, only RPS is a pure-infrastructure metric;
RR and RTC are data-structure metrics (page-template efficiency) and LMR is a
content-discipline metric (editorial freshness). The fifth metric, Source
Grounding Ratio (SGR), is the second content-discipline metric; it uses an
LLM (Claude Sonnet) to extract verifiable claims from the bot-UA HTML and is
the only optional cost. SGR is treated as a separate quantitative add-on
rather than a hard prerequisite for the 8-signal binary pass/fail.

This runbook documents the v3.4 protocol used for the 2026-04-29 cohort run.
The reference implementation in `audit-v3.js` is included in abridged form
below in the [Reference Implementation](#reference-implementation) section.

---

## Prerequisites

- Node.js 20.x or later
- `curl` 7.x or later (system curl is fine; no special build flags)
- ~2 GB free RAM during run (parallel sitemap crawls can spike memory)
- ~30 minutes wall-clock for a 100-site cohort at concurrency=5
- No paid API keys for the 8 binary signals + RR/RTC/RPS/LMR
- For SGR (the moat metric): an Anthropic API key (`ANTHROPIC_API_KEY` env var). To skip SGR, set `SKIP_SGR=1`; the binary score is still produced.

---

## How to run

```bash
# Clone or download the script (see Reference Implementation below)
mkdir geo-audit && cd geo-audit
# ... save audit-v3.js into this directory ...

# Optional: SGR via Anthropic
export ANTHROPIC_API_KEY=sk-ant-...

# Run the full cohort (100 sites, concurrency 5)
node audit-v3.js

# Run a slice of the cohort (sites 1-10)
node audit-v3.js --start 1 --end 10

# Crank concurrency for a faster (less polite) run
node audit-v3.js --concurrency 10

# Audit your own site against the same 13 signals (single-site mode)
node audit-v3.js --single https://your-site.com
```

Output is written to `audit-receipts-v3/<rank>_<domain_slug>.json` — one file
per site. Run-level summary is written to `audit-receipts-v3/MANIFEST.json`.

---

## The 8 binary signals

| # | Signal | Test | Pass criteria |
|---|---|---|---|
| S1 | Robots AI bots allowed | Parse `https://<site>/robots.txt` | No `Disallow: /` rule matching `GPTBot`, `ClaudeBot`, or `PerplexityBot` UA |
| S2 | llms.txt present | `curl -sL https://<site>/llms.txt` | HTTP 200 + body starts with `#` or `##` (markdown, not HTML or empty) |
| S3 | llms-full.txt present | `curl -sL https://<site>/llms-full.txt` | HTTP 200 + non-empty body |
| S4 | Sitemap fresh | Walk sitemap.xml tree, extract `<lastmod>` from each URL | Median `lastmod` across all URLs ≤ 30 days from run timestamp |
| S5 | JSON-LD structured data | `curl -sL https://<site>/` and grep `application/ld+json` | At least one valid JSON-LD `<script>` block |
| S6 | Pre-rendered HTML | Two curls: default UA vs `GPTBot/1.0` UA | Both responses > 5,000 bytes AND ratio in [0.5, 2.0] (no SPA shell, no cloaking) |
| S7 | MCP server live | `curl -sIL https://<site>/.well-known/mcp.json` | HTTP 200 + `content-type: application/json` |
| S8 | AI content feed | Try `/.well-known/ai-content-index.json`, `/ai-content-index.json`, `/for-ai`, `/for-ai.txt` in order | Any returns HTTP 200 + JSON or plaintext |

**Scoring discipline:**
- WAF-blocking all requests = 0/8. Blocking is itself a failure, not a skip.
- HTTP 200 with wrong content-type or empty body = FAIL for that signal.
- Cloaking (different content for GPTBot vs default UA, ratio outside [0.5, 2.0]) = FAIL S6.
- Redirect following: use `-L`, but verify final URL's content-type still matches the signal definition.
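
The redirect rule above can be sketched as a small parser over raw `curl -sIL` output. This is an illustrative sketch, not part of `audit-v3.js`: the helper name `finalResponse` is hypothetical, and it assumes curl prints one header block per hop, separated by blank lines. It reads the status and content-type from the last block only, so a redirect hop's headers cannot satisfy the signal.

```javascript
// Hypothetical helper (not in audit-v3.js): given raw `curl -sIL` output,
// return the status code and content-type of the FINAL response only.
function finalResponse(rawHeaders) {
  // curl prints one header block per redirect hop, separated by a blank line.
  const blocks = rawHeaders
    .split(/\r?\n\r?\n/)
    .map(b => b.trim())
    .filter(b => /^HTTP\/[\d.]+ \d{3}/.test(b));
  if (!blocks.length) return { status: 0, contentType: '' };
  const last = blocks[blocks.length - 1];
  const status = parseInt(last.match(/^HTTP\/[\d.]+ (\d{3})/)[1], 10);
  // Drop parameters such as "; charset=utf-8" so the media type can be compared directly.
  const ct = (last.match(/^content-type:\s*([^\r\n;]+)/im) || [])[1] || '';
  return { status, contentType: ct.trim().toLowerCase() };
}

// Made-up sample: a 301 hop that serves text/html, then the final JSON response.
const sample =
  'HTTP/1.1 301 Moved Permanently\r\ncontent-type: text/html\r\n' +
  'location: https://www.example.com/.well-known/mcp.json\r\n\r\n' +
  'HTTP/2 200 \r\ncontent-type: application/json; charset=utf-8\r\n\r\n';

console.log(finalResponse(sample)); // { status: 200, contentType: 'application/json' }
```

A signal check would then compare `contentType` with the type its definition requires, for example `application/json` for S7.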

---

## The 5 quantitative metrics

Each metric is computed as a continuous score and then binarized at the threshold below. A site earns +1 toward its 13-signal score for each threshold cleared. The 13-signal score is the sum of 8 binary signals plus 5 binarized metrics; the maximum possible is 13.

| # | Metric | Definition | Threshold |
|---|---|---|---|
| RR | Relevance Ratio | Clean text characters ÷ total response characters (after stripping `<script>`, `<style>`, attribute noise). Measured against the bot-UA HTML. | ≥ 0.85 |
| SGR | Source Grounding Ratio | Verifiable claims ÷ total claims. Claims extracted via Sonnet (`claim-extraction-v1`) over up to 2 sample pages. Tier-weighted: T1 (primary source URL) and T3 (named expert) count as cited; T5 (uncited) counts toward total. | ≥ 0.30 |
| RTC | Retrieval Token Cost | Response tokens ÷ useful characters × 4. Lower = more efficient retrieval per byte. | ≤ 0.50 |
| RPS | Sitemap Throughput | URLs indexable per wall-clock second via parallel sitemap-tree crawl at concurrency 10. | ≥ 1,000 |
| LMR | Last-Modified Recency | Median days since `<lastmod>` across all sitemap URLs. Lower = fresher. | ≤ 30 days |
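
As a rough illustration of the LMR arithmetic (S4 uses the same 30-day threshold), the sketch below computes the median age in days from the `<lastmod>` values of a sitemap fragment. The helper names `medianOf` and `lmrDays` are hypothetical and the sample XML is made up. The sketch assumes every `<lastmod>` value can be parsed by `Date.parse`, skips values that cannot, and returns `null` when no usable value is found. The production handler, including the recursive sitemap-index walk, is not reproduced here.

```javascript
// Hypothetical sketch of the LMR arithmetic; not the production handler.
function medianOf(values) {
  const s = [...values].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

// Median days between each <lastmod> and the run timestamp, or null if none parse.
function lmrDays(sitemapXml, runTimestamp) {
  const runMs = Date.parse(runTimestamp);
  const ages = [];
  for (const m of sitemapXml.matchAll(/<lastmod>\s*([^<\s]+)\s*<\/lastmod>/gi)) {
    const t = Date.parse(m[1]);
    if (!Number.isNaN(t)) ages.push((runMs - t) / 86400000); // ms per day
  }
  return ages.length ? medianOf(ages) : null;
}

// Made-up sitemap fragment: ages of 1, 10, and 120 days at the run timestamp.
const xml = `
<urlset>
  <url><loc>https://example.com/a</loc><lastmod>2026-04-28</lastmod></url>
  <url><loc>https://example.com/b</loc><lastmod>2026-04-19</lastmod></url>
  <url><loc>https://example.com/c</loc><lastmod>2025-12-30</lastmod></url>
</urlset>`;

const lmrScore = lmrDays(xml, '2026-04-29T00:00:00Z');
console.log({ lmrScore, pass: lmrScore !== null && lmrScore <= 30 }); // { lmrScore: 10, pass: true }
```

Using the median rather than the mean means a single stale URL (120 days in the sample) does not push an otherwise fresh sitemap over the threshold.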

**Cohort 2026-04-29 pass rates:**
- RR ≥ 0.85: **44.6%**
- SGR ≥ 0.30: **11.1%** ← the moat signal
- RTC ≤ 0.50: **29.2%**
- RPS ≥ 1,000: **74.1%**
- LMR ≤ 30 days: **43.2%**

**SGR is the differentiator.** At an 11.1% cohort pass rate, Source Grounding Ratio is the hardest metric to clear and the strongest competitive moat in the 13-signal rubric. The five quantitative metrics break across three layers:

- **Infrastructure** (RPS only): sitemap pipeline and hosting.
- **Data-structure** (RR, RTC): how the page is laid out, text-to-byte ratio, template vs content.
- **Content discipline** (SGR, LMR): what the page actually says, attribution and freshness.

Most of the metrics are *not* infrastructure problems. SGR is the moat not because it is the only non-infrastructure signal (most are) but because of which non-infrastructure layer it lives in: it requires editorial discipline (attributed, verifiable claims), and that discipline cannot be retrofitted by configuration. Two government domains (data.gov 0.9762, nasa.gov 0.5000) clear SGR despite missing several binary infrastructure signals, which is consistent with SGR measuring genuine AI-legible content quality rather than plumbing.

---

## Output format

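Each per-site receipt is a JSON document with this shape (matches the live `/api/audit` endpoint at `https://staging.geolocus.ai/api/audit?url=...`). The `//` comments and the `[...]` / `{...}` placeholders are annotations for this runbook and are not part of the actual payload: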

```json
{
  "result": {
    "url": "https://www.example.com",
    "audit_v": "v3.4",
    "outcome": "audited",          // audited | blocked | unreachable | error
    "retry_attempted": false,
    "signals": {
      "robots_ai_bots_allowed":  true,
      "llms_txt_present":        true,
      "llms_full_txt_present":   true,
      "sitemap_fresh":           true,
      "jsonld_structured_data":  true,
      "prerendered_html":        true,
      "mcp_server_live":         true,
      "ai_content_feed":         true
    },
    "score": 8,                     // sum of 8 binary signals
    "max_score": 8,
    "passed": [...],                // signal names passed
    "failed": [...],                // signal names failed
    "blocked_bots": [],             // bots blocked at WAF
    "perf": {
      "ttfb_p50": 24,               // ms, header-arrival
      "ttfb_p95": 35,
      "ttlb_p50": 24,               // time-to-last-byte (KPI)
      "ttlb_p95": 35,
      "ttlb_mean": 27,
      "cold_spike_pct": 0,
      "samples": [...]              // 5 raw samples
    },
    "metrics": {
      "rr":  { "score": 1.0,  "chars": {...}, "method": "regex-strip" },
      "rtc": { "score": 0.043, "components": {...}, "method": "chars-div-4" },
      "rps": { "score": 591327.32, "components": {...}, "sitemap_source": "sitemap.xml" },
      "sgr": { "score": 0.4412, "components": {...}, "method": "sonnet-claim-extraction-v1" },
      "lmr": { "score": 0.53, "components": {...}, "sitemap_source": "sitemap.xml" }
    }
  }
}
```

Any metric where the inputs are unavailable (no sitemap, content-light page, etc.) returns `score: null` with a `note` field explaining why. Null is treated as fail when computing the 13-signal score.
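
The null-as-fail rule and the 13-signal arithmetic can be sketched as follows. This is a minimal illustration, not the canonical scorer: the function names `binarize` and `thirteenSignalScore` are hypothetical, the thresholds are copied from the metrics table in this runbook, and the receipt is assumed to use the shape documented above (boolean `signals`, `metrics.<name>.score`).

```javascript
// Thresholds copied from the metrics table in this runbook.
const THRESHOLDS = {
  rr:  { op: '>=', val: 0.85 },
  sgr: { op: '>=', val: 0.30 },
  rtc: { op: '<=', val: 0.50 },
  rps: { op: '>=', val: 1000 },
  lmr: { op: '<=', val: 30 },
};

// Hypothetical helper: 1 if the metric clears its threshold, 0 otherwise.
// A null, missing, or NaN score counts as a fail.
function binarize(name, score) {
  if (score === null || score === undefined || Number.isNaN(score)) return 0;
  const { op, val } = THRESHOLDS[name];
  return (op === '>=' ? score >= val : score <= val) ? 1 : 0;
}

// Sum of passed binary signals plus thresholds cleared (maximum 13).
function thirteenSignalScore(receipt) {
  const binary = Object.values(receipt.signals).filter(Boolean).length;
  const quant = Object.keys(THRESHOLDS).reduce(
    (sum, k) => sum + binarize(k, receipt.metrics[k]?.score ?? null), 0);
  return binary + quant;
}

// Illustrative receipt: all 8 signals pass and SGR is null (for example, skipped).
const receipt = {
  signals: {
    robots_ai_bots_allowed: true, llms_txt_present: true, llms_full_txt_present: true,
    sitemap_fresh: true, jsonld_structured_data: true, prerendered_html: true,
    mcp_server_live: true, ai_content_feed: true,
  },
  metrics: {
    rr: { score: 1.0 }, rtc: { score: 0.043 }, rps: { score: 591327.32 },
    sgr: { score: null, note: 'illustrative: inputs unavailable' }, lmr: { score: 0.53 },
  },
};

console.log(thirteenSignalScore(receipt)); // 12
```

With the sample SGR score of 0.4412 from the receipt above in place of `null`, the same receipt would score 13.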

---

## Cohort definition

The 2026-04-29 cohort spans **31 industries** and 100 sites. Selection rules:

- One site per `(industry, sub-segment)` pair where possible.
- Mix of public-sector (`.gov`, `.org`), commercial (`.com`), and reference (`wikipedia.org`, `wikidata.org`).
- Mix of high-traffic incumbents (Apple, Google, Amazon, Wikipedia) and verticalized leaders (Zillow, Realtor, Redfin, Coursera, edX).
- Anchor: `top10lists.us` rank 1 (the property GeoLocus operates as the methodology's reference site).
- Real estate / proptech is the largest single industry bucket (16 sites), weighted deliberately because GeoLocus's primary go-to-market is real estate. No other vertical is deliberately over-represented.

Industries represented: Real Estate, Education, Technology, AI/Tech, Proptech, Government, Reference, News, News/Sports, Business Ratings, Nonprofit/Health, Nonprofit/Tech, Publishing, Academic, Social/Tech, Finance, Retail/Tech, Travel/Tech, Entertainment, Healthcare, Reviews, Travel, Business Data, Legal, FinTech, Productivity/SaaS, eCommerce, Retail, News/Business, Finance/News, Employment.

Of 100 sites targeted, **98 were measured** (**69 fully audited**, **28 blocked** because the WAF returned 403/429 to the bot UA, and **1 unreachable**) and **2 ended in error**.

---

## Methodology — design decisions

### Why 13 signals instead of one

The 13 signals decompose AI-citation readiness into testable units across three layers:
- **8 binary** signals (S1–S8) test *infrastructure presence/absence* — the things AI crawlers actively look for and the things they actively avoid.
- **5 quantitative** metrics (RR, SGR, RTC, RPS, LMR) test the *quality of what the infrastructure delivers* — and they break across the data-structure layer (RR, RTC — page-template efficiency), the content-discipline layer (SGR, LMR — attribution and freshness), and the one remaining infrastructure layer (RPS — sitemap throughput at scale).

A binary-only score (8/8) overweights checklist completion; a continuous-only score buries the readiness signal. The 13-signal protocol rewards both.

### Why SGR is the moat

The 13 signals sort across three layers: **infrastructure** (the 8 binary signals plus RPS — install llms.txt, install MCP, generate sitemap, configure CDN, set up sitemap throughput), **data-structure** (RR and RTC — how the bot-served HTML is laid out, text-to-byte ratio, template vs content), and **content discipline** (SGR and LMR — what the page actually says: attribution, verifiable claims, editorial freshness). Of the five quantitative metrics, only RPS is purely an infrastructure problem; the other four require either page-template work (RR, RTC) or editorial work (SGR, LMR).

SGR is the moat because of which layer it lives in. It measures whether the page contains claims an LLM can verify against an external source (a primary-source URL, a named expert, attributed data) — that demands editorial discipline: factual writing, proper attribution, primary-source linking. You can ship a perfectly configured site (RPS green, all 8 binary signals green, even RR and RTC green) and still fail SGR, because configuration cannot manufacture cited claims. That's why only 11.1% of the cohort clears it, and why the sites that do (data.gov, nasa.gov, top10lists.us) are over-represented in AI citations relative to their domain authority.
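
The ratio itself is simple once claims have been extracted. The sketch below assumes the extraction step has already produced a list of claims, each carrying a `tier` label. The claim shape and the function name `sourceGroundingRatio` are hypothetical, and the `claim-extraction-v1` prompt is not reproduced. Following the metrics table, T1 and T3 count as cited and everything else counts toward the total only. How the remaining tiers are weighted in the live handler is not specified in this runbook, so treating them as uncited is an assumption.

```javascript
// Hypothetical sketch: SGR computed from already-extracted claims.
// Assumption: tiers other than T1 and T3 count toward the total only.
const CITED_TIERS = new Set(['T1', 'T3']);

function sourceGroundingRatio(claims) {
  if (!claims.length) return null; // content-light page: no score
  const cited = claims.filter(c => CITED_TIERS.has(c.tier)).length;
  return cited / claims.length;
}

// Made-up claims for illustration only.
const claims = [
  { text: 'Figure linked to a primary source URL', tier: 'T1' },
  { text: 'Statement attributed to a named expert', tier: 'T3' },
  { text: 'Unattributed superlative', tier: 'T5' },
  { text: 'Unattributed statistic', tier: 'T5' },
];

const sgr = sourceGroundingRatio(claims);
console.log({ sgr, pass: sgr !== null && sgr >= 0.30 }); // { sgr: 0.5, pass: true }
```

No configuration change moves this number: only adding attributed claims, or removing unattributed ones, changes the ratio.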

### Why bot-UA HTML, not default-UA HTML

The page-level quantitative metrics (RR, RTC, SGR) are measured against what the bot actually sees; RPS and LMR are computed from the sitemap rather than from page HTML. A site can deliver 50 KB of marketing-rich HTML to Chrome and a 2 KB SPA shell to GPTBot; that is the cloaking failure mode. Measuring against the bot-UA response forces the page to be honest about what it gives the AI.

### Why we measure TTLB, not just TTFB

Cloudflare Workers resolve the `Response` once headers arrive, so TTFB measured there approximates header arrival rather than the first body byte. Time-to-last-byte (TTLB) is what the AI ingestion pipeline actually waits for. We report p50 and p95 for both TTFB and TTLB; TTLB p50 is the headline KPI on the perf card.
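
A minimal sketch of the two timings, assuming Node.js 18+ (global `fetch` and `performance`). The function names are hypothetical. The nearest-rank percentile method is an assumption, since the runbook does not name the method used for the perf card. The default of 5 samples mirrors the `samples` field in the output format.

```javascript
// Hypothetical sketch; requires Node.js 18+ for the global fetch().
async function sampleTimings(url) {
  const t0 = performance.now();
  const res = await fetch(url);   // resolves once the response headers have arrived
  const ttfb = performance.now() - t0;
  await res.arrayBuffer();        // drains the body to the last byte
  const ttlb = performance.now() - t0;
  return { ttfb, ttlb };
}

// Nearest-rank percentile (an assumption; the runbook does not name the method).
function percentile(samples, p) {
  const s = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * s.length));
  return s[rank - 1];
}

// Take n sequential samples and summarize them in the perf-card field names.
async function perfCard(url, n = 5) {
  const runs = [];
  for (let i = 0; i < n; i++) runs.push(await sampleTimings(url));
  const ttfb = runs.map(r => r.ttfb);
  const ttlb = runs.map(r => r.ttlb);
  return {
    ttfb_p50: percentile(ttfb, 50), ttfb_p95: percentile(ttfb, 95),
    ttlb_p50: percentile(ttlb, 50), ttlb_p95: percentile(ttlb, 95),
  };
}

// Offline check of the percentile helper on five made-up samples (ms).
console.log(percentile([24, 35, 27, 25, 24], 50)); // 25
console.log(percentile([24, 35, 27, 25, 24], 95)); // 35
```

With only five samples, nearest-rank p95 is simply the slowest sample, so a single cold start dominates it; that is one reason to read p50 as the headline figure.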

---

## Reference implementation

Save the following as `audit-v3.js` in your working directory and run with `node audit-v3.js`. This is an abridged copy of the script that produced the 2026-04-29 cohort receipts, with comments preserved; the elided parts are marked in the code.

> **Note on reproduction:** The script in this runbook is the v3.0 reference implementation that produced the receipts in `audit-receipts-v3/`. The live `/api/audit` endpoint runs v3.3+ on Cloudflare Workers, which adds the textual-attribution rule for SGR (counts named-source attributions even without a hyperlink). The 8 binary signals, the 5 quantitative metric definitions, and the 13-signal score arithmetic are identical between v3.0 and v3.4.

```javascript
#!/usr/bin/env node
/**
 * Global AI Citation Infrastructure Audit v3.4
 *
 * Methodology:
 *   Phase 1 — 8-Signal Binary Audit (per site):
 *     S1 Robots AI bots allowed   robots.txt does not block GPTBot/ClaudeBot/PerplexityBot
 *     S2 llms.txt present         /llms.txt HTTP 200 + body starts with '#'
 *     S3 llms-full.txt present    /llms-full.txt HTTP 200 + non-empty body
 *     S4 Sitemap fresh            sitemap.xml lastmod median <= 30 days
 *     S5 JSON-LD                  homepage has 1+ application/ld+json blocks
 *     S6 Pre-rendered HTML        default UA + GPTBot UA both >5KB, ratio 0.5-2.0
 *     S7 MCP server live          /.well-known/mcp.json HTTP 200 + application/json
 *     S8 AI content feed          /.well-known/ai-content-index.json or /for-ai HTTP 200
 *
 *   Phase 2 — 5 Quantitative Metrics (per site):
 *     RR   Relevance Ratio             clean-text chars / total chars; threshold >= 0.85
 *     SGR  Source Grounding        cited claims / total claims (Sonnet); threshold >= 0.30
 *     RTC  Retrieval Token Cost    response tokens / useful chars * 4; threshold <= 0.50
 *     RPS  Sitemap Throughput      URLs indexable per wall-clock sec; threshold >= 1000
 *     LMR  Last-Mod Recency        median days since lastmod; threshold <= 30
 *
 *   13-signal score = sum(8 binary signals) + sum(5 metric thresholds cleared)
 *
 * Run: node audit-v3.js [--start N] [--end N] [--concurrency N] [--single URL]
 */

'use strict';

const { execSync } = require('child_process');
const crypto = require('crypto');
const fs = require('fs');
const path = require('path');

// ─── CONFIG ───────────────────────────────────────────────────────────────────
const OUTPUT_DIR = path.join(__dirname, 'audit-receipts-v3');
const CURL_TIMEOUT = 15;
const DELAY_MS = 500;
const RUN_TIMESTAMP = new Date().toISOString();
const SCRIPT_VERSION = '3.4.0';
const SHELL = process.env.SHELL || 'bash';

// CLI args
const arg = (name, def) => {
  const i = process.argv.indexOf(name);
  return i > -1 ? process.argv[i + 1] : def;
};
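// Guard against 0 or NaN, which would stall or skip the batch loop in main()
const CONCURRENCY = Math.max(1, parseInt(arg('--concurrency', '5'), 10) || 5);
const START_RANK = parseInt(arg('--start', '1'), 10);
const END_RANK = parseInt(arg('--end', '100'), 10);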
const SINGLE_URL = arg('--single', null);

// Quantitative metric thresholds (binarized into 13-signal score)
const THRESH = {
  RR:  { op: '>=', val: 0.85 },
  SGR: { op: '>=', val: 0.30 },
  RTC: { op: '<=', val: 0.50 },
  RPS: { op: '>=', val: 1000 },
  LMR: { op: '<=', val: 30 },
};

const AI_BOTS_BLOCKLIST = ['GPTBot', 'ClaudeBot', 'PerplexityBot'];

// ─── COHORT (100 sites, 31 industries) ───────────────────────────────────────
// See https://geolocus.ai/multi-site-survey for full cohort + per-site results.
const SITES = [
  /* 1*/ { rank:   1, domain: 'top10lists.us',     industry: 'Real Estate' },
  /* 2*/ { rank:   2, domain: 'edx.org',           industry: 'Education' },
  /* ... full 100-site list available in audit-v3.js source ... */
  // To reproduce the full cohort, see https://github.com/rjmjr1962831/geoai
  // or contact GeoLocus for the canonical SITES[] array.
];

// ─── HELPERS ──────────────────────────────────────────────────────────────────
function runCurl(cmd) {
  try {
    const output = execSync(cmd, { shell: SHELL, timeout: (CURL_TIMEOUT + 5) * 1000, encoding: 'utf8', maxBuffer: 10 * 1024 * 1024 });
    return { cmd, output, error: null };
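  } catch (e) {
    const output = (e.stdout || '') + (e.stderr || '');
    const msg = e.message || '';
    // execSync signals its own timeout with code ETIMEDOUT; curl --max-time exits with status 28
    const timedOut = e.code === 'ETIMEDOUT' || e.status === 28 || msg.includes('ETIMEDOUT');
    const error = timedOut ? 'TIMEOUT' : (msg.slice(0, 120) || 'ERROR');
    return { cmd, output, error };
  }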
}

function extractHttpCode(raw) {
  const m = raw.match(/HTTP\/[\d.]+ (\d{3})/g);
  return m ? parseInt(m[m.length - 1].match(/\d{3}/)[0]) : 0;
}

function median(arr) {
  if (!arr.length) return 0;
  const s = [...arr].sort((a, b) => a - b);
  const m = Math.floor(s.length / 2);
  return s.length % 2 ? s[m] : (s[m - 1] + s[m]) / 2;
}

// ─── 8 BINARY SIGNAL TESTS ────────────────────────────────────────────────────
// (Full implementations at:
//   https://github.com/rjmjr1962831/geoai/blob/main/functions/api/audit.js
//   for the live endpoint, or the original Node script for offline runs.)

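function testRobots(domain) {
  const r = runCurl(`curl -sL --max-time ${CURL_TIMEOUT} "https://${domain}/robots.txt"`);
  const bots = AI_BOTS_BLOCKLIST.map(b => b.toLowerCase());
  // Walk robots.txt group by group. Only a bare "Disallow: /" inside a group
  // that names an AI bot counts as a block ("Disallow: /private" does not).
  let agents = [], inRules = false, blocked = false;
  for (const raw of (r.output || '').split(/\r?\n/)) {
    const m = raw.replace(/#.*$/, '').trim().match(/^([a-z-]+)\s*:\s*(.*)$/i);
    if (!m) continue;
    const field = m[1].toLowerCase(), value = m[2].trim().toLowerCase();
    if (field === 'user-agent') {
      if (inRules) { agents = []; inRules = false; } // a new group starts here
      agents.push(value);
    } else {
      inRules = true;
      if (field === 'disallow' && value === '/' && agents.some(a => bots.includes(a))) blocked = true;
    }
  }
  return { pass: !blocked, evidence: blocked ? 'AI bot Disallow:/ present' : 'No AI bot block' };
}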

function testLlmsTxt(domain) {
  const r = runCurl(`curl -sL --max-time ${CURL_TIMEOUT} -w "\\n__CODE__:%{http_code}" "https://${domain}/llms.txt"`);
  const code = (r.output.match(/__CODE__:(\d{3})/) || [])[1];
  const body = r.output.replace(/__CODE__:\d{3}/, '').trim();
  const pass = code === '200' && body.startsWith('#') && !body.startsWith('<');
  return { pass, httpCode: parseInt(code) || 0, evidence: pass ? 'HTTP 200 + markdown' : `HTTP ${code}` };
}

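function testLlmsFullTxt(domain) {
  // GET rather than HEAD: the pass criterion also requires a non-empty body.
  // The body is discarded; curl reports the final status code and the downloaded size.
  const r = runCurl(`curl -sL --max-time ${CURL_TIMEOUT} -o /dev/null -w "%{http_code} %{size_download}" "https://${domain}/llms-full.txt"`);
  const [code, size] = r.output.trim().split(/\s+/).map(Number);
  const pass = code === 200 && size > 0;
  return { pass, httpCode: code || 0, evidence: `HTTP ${code || 0}, ${size || 0} bytes` };
}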

function testSitemapFresh(domain) {
  // Fetch sitemap.xml, parse <lastmod> values, compute median days
  // (Full implementation: 80 lines of XML parse + recursive sitemap-index walk)
  // Pass criterion: median days <= 30
  // ... see audit-v3.js source for full impl ...
  return { pass: false, evidence: '(see source)' };
}

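function testJsonLd(domain) {
  const r = runCurl(`curl -sL --max-time ${CURL_TIMEOUT} "https://${domain}/"`);
  // Count only <script type="application/ld+json"> blocks whose body parses as JSON
  const re = /<script[^>]*type=["']?application\/ld\+json["']?[^>]*>([\s\S]*?)<\/script>/gi;
  let count = 0;
  for (const m of (r.output || '').matchAll(re)) {
    try { JSON.parse(m[1]); count++; } catch (_) { /* invalid JSON-LD is not counted */ }
  }
  return { pass: count >= 1, count, evidence: `${count} valid JSON-LD blocks` };
}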

function testPrerenderedHtml(domain) {
  const defaultUA = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36';
  const gptUA = 'GPTBot/1.0 (+https://openai.com/gptbot)';
  const r1 = runCurl(`curl -sL --max-time ${CURL_TIMEOUT} -w "\\n__SIZE__:%{size_download}" -A "${defaultUA}" "https://${domain}/"`);
  const r2 = runCurl(`curl -sL --max-time ${CURL_TIMEOUT} -w "\\n__SIZE__:%{size_download}" -A "${gptUA}" "https://${domain}/"`);
  const sz = (r) => parseInt((r.output.match(/__SIZE__:(\d+)/) || [])[1] || '0');
  const def = sz(r1), bot = sz(r2);
  const ratio = def ? bot / def : 0;
  const pass = def > 5000 && bot > 5000 && ratio >= 0.5 && ratio <= 2.0;
  return { pass, defaultSize: def, botSize: bot, ratio, evidence: `def=${def}B bot=${bot}B ratio=${ratio.toFixed(2)}` };
}

function testMcp(domain) {
  const r = runCurl(`curl -sIL --max-time ${CURL_TIMEOUT} "https://${domain}/.well-known/mcp.json"`);
  const code = extractHttpCode(r.output);
  // With -L, curl prints headers for every hop; use the last content-type seen (the final response)
  const cts = [...r.output.matchAll(/^content-type:\s*([^\r\n]+)/gim)];
  const ct = cts.length ? cts[cts.length - 1][1] : '';
  return { pass: code === 200 && ct.toLowerCase().includes('application/json'), httpCode: code, evidence: `HTTP ${code} ${ct}` };
}

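function testAiContentFeed(domain) {
  const paths = ['/.well-known/ai-content-index.json', '/ai-content-index.json', '/for-ai', '/for-ai.txt'];
  for (const p of paths) {
    const r = runCurl(`curl -sIL --max-time ${CURL_TIMEOUT} "https://${domain}${p}"`);
    // Require JSON or plaintext on the final response, per the S8 pass criteria
    const cts = [...r.output.matchAll(/^content-type:\s*([^\r\n]+)/gim)];
    const ct = cts.length ? cts[cts.length - 1][1].toLowerCase() : '';
    if (extractHttpCode(r.output) === 200 && /json|text\/plain/.test(ct)) return { pass: true, path: p, evidence: `200 ${ct} at ${p}` };
  }
  return { pass: false, evidence: 'no AI feed' };
}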

// ─── 5 QUANTITATIVE METRICS (RR / RTC / RPS / SGR / LMR) ─────────────────────
// Full implementations are ~150 lines each.
// RR, RTC: regex-strip clean text from bot-UA HTML, divide by total chars (RR)
//          or compute response_tokens / useful_chars * 4 (RTC).
// RPS:    walk sitemap.xml + nested sitemap-index files in parallel (concurrency 10),
//          count URLs / wall-clock seconds.
// LMR:    extract <lastmod> from every sitemap URL, compute median days vs RUN_TIMESTAMP.
// SGR:    fetch up to 2 sample pages (homepage + first content page), feed bot-UA HTML
//          to Sonnet via Anthropic API with claim-extraction-v1 prompt, tier-weight
//          T1 (primary URL) + T3 (named expert) as cited.
//
// To reproduce these metrics offline, port the live /api/audit handler at
// https://github.com/rjmjr1962831/geoai/blob/main/functions/api/audit.js
// Or use the live endpoint: curl "https://staging.geolocus.ai/api/audit?url=..."

// ─── MAIN AUDIT (per-site orchestration) ─────────────────────────────────────
async function auditSite(site) {
  const out = {
    rank: site.rank,
    domain: site.domain,
    industry: site.industry,
    auditedAt: new Date().toISOString(),
    scriptVersion: SCRIPT_VERSION,
    signals: {
      robots_ai_bots_allowed:  testRobots(site.domain),
      llms_txt_present:        testLlmsTxt(site.domain),
      llms_full_txt_present:   testLlmsFullTxt(site.domain),
      sitemap_fresh:           testSitemapFresh(site.domain),
      jsonld_structured_data:  testJsonLd(site.domain),
      prerendered_html:        testPrerenderedHtml(site.domain),
      mcp_server_live:         testMcp(site.domain),
      ai_content_feed:         testAiContentFeed(site.domain),
    },
    metrics: {
      // ... see live /api/audit for canonical impl ...
    },
  };
  // Compute binary score
  out.score = Object.values(out.signals).filter(s => s.pass).length;
  out.maxScore = 8;
  // Compute 13-signal score (binary + 5 thresholds cleared)
  out.thirteenScore = out.score; // + binarized metrics; see live endpoint
  return out;
}

// ─── RUNNER ──────────────────────────────────────────────────────────────────
async function main() {
  if (!fs.existsSync(OUTPUT_DIR)) fs.mkdirSync(OUTPUT_DIR, { recursive: true });
  if (SINGLE_URL) {
    const u = new URL(SINGLE_URL);
    const result = await auditSite({ rank: 0, domain: u.hostname, industry: 'single' });
    console.log(JSON.stringify({ result }, null, 2));
    return;
  }
  const slice = SITES.filter(s => s.rank >= START_RANK && s.rank <= END_RANK);
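  // Note: every test calls execSync, which blocks the event loop, so the sites in a
  // batch are audited one after another; CONCURRENCY only sets how many sites run
  // between DELAY_MS pauses.
  for (let i = 0; i < slice.length; i += CONCURRENCY) {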
    const batch = slice.slice(i, i + CONCURRENCY);
    const results = await Promise.all(batch.map(auditSite));
    results.forEach(r => {
      const slug = r.domain.replace(/[^a-z0-9]/g, '_');
      const f = path.join(OUTPUT_DIR, `${String(r.rank).padStart(3, '0')}_${slug}.json`);
      fs.writeFileSync(f, JSON.stringify({ result: r }, null, 2));
    });
    console.log(`[${i + batch.length}/${slice.length}] batch done`);
    await new Promise(r => setTimeout(r, DELAY_MS));
  }
}

if (require.main === module) main().catch(e => { console.error(e); process.exit(1); });
```

The reference implementation above is **abridged** for inclusion in this runbook — quantitative-metric handlers (RR/RTC/RPS/SGR/LMR) and the full 100-site cohort array are referenced rather than inlined. The complete v3.0 baseline script (737 lines) lives at the project root in the GeoLocus toolchain. The simplest way to reproduce a single-site audit against the canonical scoring is to call the live `/api/audit` endpoint directly:

```bash
curl "https://staging.geolocus.ai/api/audit?url=https%3A%2F%2Fyour-site.com" | jq .result
```

That endpoint runs the same protocol on Cloudflare Workers and returns the JSON shape documented in [Output format](#output-format).

---

## License + provenance

GeoLocus Group, a subsidiary of Aryah.ai. Methodology v3.4 (run 2026-04-29).

- Page: https://geolocus.ai/multi-site-survey
- Whitepaper: https://geolocus.ai/research/whitepapers/whitepaper-5-1
- Methodology references: https://geolocus.ai/methodology
- Live audit endpoint: `https://staging.geolocus.ai/api/audit?url=<encoded-url>`

Methodology released under CC-BY-4.0. Reference implementation released under MIT. Cohort receipts (per-site JSON) released under CC0.
