How We Score Your AI Readiness: Full Methodology

Your AI Readiness Score is a composite metric (0–100) that measures how well your website is prepared to appear in AI-generated answers from ChatGPT, Perplexity, Gemini, Claude, and Google AI Overviews. Every point is computed by deterministic code — no AI guesswork, no black box.

1. How Your AI Readiness Score Is Calculated

Your overall score is a weighted sum of six categories whose weights total 95%, plus a small E-E-A-T bonus (up to 5 points) that brings the maximum to 100:

Category	Weight	What It Measures
AI Crawler Access	20%	Can AI platforms actually reach your content?
Structured Data & Schema	10%	Does your site speak the machine-readable language AI systems understand?
Content AI-Citability	25%	Is your content structured so AI can quote it accurately?
Technical SEO Foundations	15%	Are the basics in place — meta tags, SSR, security, semantic HTML?
LLM Discoverability	15%	Can large language models find, navigate, and understand your site?
Brand & Authority Signals	10%	Does the web confirm you are a real, trustworthy entity?

Content citability carries the highest weight because AI engines need quotable passages. Crawler access is the prerequisite — if bots cannot crawl your site, nothing else matters. Brand authority plays a supporting role since AI citation favors passage quality over domain reputation.

2. The 15 AI Crawlers We Check

We parse your robots.txt file (per RFC 9309) and check access rules for each of these crawlers individually:

Crawler	Platform	What It Does	Why Blocking Hurts
GPTBot	ChatGPT / OpenAI	Indexes content for ChatGPT responses	Blocks you from the most-used AI assistant
OAI-SearchBot	OpenAI Search	Powers OpenAI's search feature	Removes you from OpenAI search results
ChatGPT-User	ChatGPT browsing	Real-time page fetching	Prevents live citation of your content
ClaudeBot	Anthropic Claude	Indexes content for Claude	Excludes you from Claude's knowledge base
anthropic-ai	Anthropic training	Training data collection	Prevents inclusion in future Claude models
PerplexityBot	Perplexity AI	AI search engine indexing	Removes you from a fast-growing AI search tool
Google-Extended	Gemini / Google AI	Training for Gemini and AI Overviews	Blocks Google's AI features (not traditional search)
GoogleOther	Google AI	Secondary Google AI crawler	Limits Google's AI-powered features
Bytespider	TikTok / ByteDance AI	ByteDance AI data collection	Blocks visibility in ByteDance's AI ecosystem
Applebot-Extended	Apple Intelligence	Apple's on-device AI features	Excludes you from Siri and Apple Intelligence
CCBot	Common Crawl	Open dataset used by many LLMs	Indirectly blocks training for multiple AI systems
cohere-ai	Cohere AI	Enterprise AI / RAG systems	Limits visibility in enterprise AI tools
Meta-ExternalAgent	Meta AI	Meta's AI across FB, Instagram, WhatsApp	Blocks visibility across Meta's 3B+ users
Amazonbot	Alexa / Amazon AI	Alexa answers and Amazon AI	Removes you from voice and Amazon AI search
FacebookBot	Meta / Facebook	Content indexing for Meta platforms	Limits content appearance on Meta platforms

Each crawler receives one of five statuses: ALLOWED (explicitly permitted), BLOCKED (explicitly disallowed), PARTIALLY_BLOCKED (some paths restricted), BLOCKED_BY_WILDCARD (caught by a blanket User-agent: * / Disallow: / rule), or NOT_MENTIONED (no specific rule, defaults to allowed).
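The five statuses can be illustrated with a small classifier. This is a simplified sketch, not the code in analyzer.py: the names parse_groups and classify are hypothetical, and the logic ignores path wildcards and the longest-match precedence that RFC 9309 defines. It only looks at whether a crawler has its own group and what that group disallows.

```python
def parse_groups(robots_txt: str) -> dict[str, list[tuple[str, str]]]:
    """Map each lowercased user-agent token to its (directive, path) rules."""
    groups: dict[str, list[tuple[str, str]]] = {}
    agents: list[str] = []
    reading_agents = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()      # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not reading_agents:               # first agent line of a new group
                agents = []
            reading_agents = True
            agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            reading_agents = False
            for agent in agents:
                groups[agent].append((field, value))
    return groups


def classify(robots_txt: str, crawler: str) -> str:
    """Return one of the five statuses for a crawler (simplified rules)."""
    groups = parse_groups(robots_txt)
    own = groups.get(crawler.lower())
    if own is None:
        # No dedicated group: the wildcard group decides.
        if ("disallow", "/") in groups.get("*", []):
            return "BLOCKED_BY_WILDCARD"
        return "NOT_MENTIONED"
    disallows = [path for kind, path in own if kind == "disallow" and path]
    allows = [path for kind, path in own if kind == "allow"]
    if "/" in disallows and not allows:
        return "BLOCKED"
    if disallows:
        return "PARTIALLY_BLOCKED"
    return "ALLOWED"
```

An empty Disallow: line is treated as "nothing disallowed", which matches the conventional reading of robots.txt.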

What You Should Do

Allow at minimum GPTBot, ClaudeBot, PerplexityBot, and Google-Extended, which cover the four most influential AI search platforms. If you have a blanket Disallow: / under User-agent: *, add a separate User-agent group with an Allow: / rule for each crawler you want to let in.
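One way to sanity-check such a file before deploying it is Python's standard-library robots.txt parser. The robots.txt content and the helper name allowed_crawlers below are illustrative assumptions. urllib.robotparser is also not a complete RFC 9309 implementation, so treat the result as a quick check rather than a guarantee of how each crawler behaves.

```python
from urllib.robotparser import RobotFileParser

# Example file: block everything by default, then admit four AI crawlers.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: GPTBot
User-agent: ClaudeBot
User-agent: PerplexityBot
User-agent: Google-Extended
Allow: /
"""


def allowed_crawlers(robots_txt: str, crawlers: list[str]) -> dict[str, bool]:
    """Report whether each crawler may fetch the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {name: parser.can_fetch(name, "/") for name in crawlers}


if __name__ == "__main__":
    for name, ok in allowed_crawlers(ROBOTS_TXT, ["GPTBot", "ClaudeBot", "CCBot"]).items():
        print(f"{name}: {'allowed' if ok else 'blocked'}")
```

Crawlers without their own group, such as CCBot in this example, still fall under the wildcard block.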

3. Content Citability Scoring

This is the most distinctive part of our analysis. We break your page into content blocks (grouped by heading), then score each block on five metrics that predict whether an AI system will select that passage as a citation.

The Five Metrics

Answer Block Quality (max 30 points) — Does the passage directly answer a question? We detect definition patterns ("X is a..."), early placement of key facts in the first 60 words, question-answer heading pairs, clear sentence structure (5–25 words per sentence), and attribution signals ("according to," "research shows").

Good: "Generative Engine Optimization (GEO) is the practice of structuring website content so AI search engines can accurately cite it in their responses. According to research from Princeton and Georgia Tech, sites that implement GEO see up to 40% more visibility in AI-generated answers."

Bad: "We have been doing this for a long time and our approach is really comprehensive and covers everything you need."

Self-Containment (max 25 points) — Can the passage stand alone without context from surrounding paragraphs? We measure passage length (134–167 words is optimal), pronoun density (lower is better — AI strips context), and proper noun density (named entities help AI attribute correctly).

Good: "HubSpot's 2024 State of Marketing Report found that 64% of marketers already use AI tools for content creation. The study surveyed 1,460 B2B and B2C marketers across North America, Europe, and Asia-Pacific."

Bad: "They found that most of them already use it. This is higher than the previous year."
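A rough sketch of how the three self-containment signals could be computed, using the point values listed in the Technical Appendix. The pronoun list, the tokenization, and the proper-noun heuristic (capitalized words that do not start a sentence) are assumptions made for illustration; the analyzer's actual rules may differ.

```python
import re

# Assumed pronoun list; the analyzer's real list is not documented here.
PRONOUNS = {"he", "she", "it", "they", "them", "we", "us", "this", "these",
            "those", "his", "her", "its", "their"}


def self_containment_score(passage: str) -> int:
    """Score a passage 0-25 on length, pronoun ratio, and proper nouns."""
    tokens = re.findall(r"[A-Za-z][A-Za-z'-]*", passage)
    if not tokens:
        return 0
    score = 0
    if 134 <= len(passage.split()) <= 167:          # optimal passage length
        score += 10
    pronoun_ratio = sum(t.lower() in PRONOUNS for t in tokens) / len(tokens)
    if pronoun_ratio < 0.02:
        score += 8
    proper_nouns = 0
    for sentence in re.split(r"(?<=[.!?])\s+", passage):
        # Skip the first word: it is capitalized regardless of being a name.
        proper_nouns += sum(1 for w in sentence.split()[1:] if w[0].isupper())
    if proper_nouns >= 3:
        score += 7
    return score
```

Run against the two examples above, the "Good" passage earns the pronoun and proper-noun points but not the length points, while the "Bad" passage earns none.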

Structural Readability (max 20 points) — Is the passage easy for both humans and machines to parse? We check average sentence length (10–20 words is ideal), transition markers ("first," "additionally," "moreover"), numbered lists, and paragraph breaks.

Statistical Density (max 15 points) — Does the passage contain concrete data? We count percentages, dollar amounts, named quantities ("1,460 marketers"), year references, and citations to recognized sources (Gartner, Forrester, Google, etc.).
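The counting behind this metric can be approximated with a few regular expressions. The sketch below uses the per-signal points and caps from the Technical Appendix; the patterns themselves, the short source list, and the final cap at 15 are assumptions. In particular, "named quantity" is read narrowly here as a thousands-separated number followed by a lowercase noun.

```python
import re

SOURCES = ("Gartner", "Forrester", "Google")   # assumed subset of recognized sources


def statistical_density(passage: str) -> int:
    """Score a passage 0-15 on concrete data signals (rough heuristics)."""
    percentages = re.findall(r"\d+(?:\.\d+)?%", passage)
    dollars = re.findall(r"\$\d[\d,]*(?:\.\d+)?", passage)
    # e.g. "1,460 marketers"; numbers preceded by "$" are excluded
    quantities = re.findall(r"(?<![$\d,.])\d{1,3}(?:,\d{3})+\s+[a-z]+", passage)
    years = re.findall(r"\b(?:19|20)\d{2}\b", passage)

    score = min(3 * len(percentages), 6)
    score += min(3 * len(dollars), 5)
    score += min(2 * len(quantities), 4)
    score += 2 if years else 0
    score += 2 if any(name in passage for name in SOURCES) else 0
    return min(score, 15)                       # metric maximum
```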

Uniqueness Signals (max 10 points) — Does the passage contain original insight? We detect first-party research language ("our research found," "we analyzed"), case study references, and specific tool/methodology mentions.

The Optimal Passage Length: 134–167 Words

Passages in the 134–167 word range are cited most often by AI systems. A passage of this length can hold a complete answer with supporting evidence while staying short enough to avoid truncation. Passages under 80 words rarely have enough substance; passages over 250 words tend to get split or summarized, which loses attribution.

Grade Distribution

Grade	Score	Meaning
A	80–100	Highly Citable — AI systems will prefer this passage
B	65–79	Good Citability — solid, with room for improvement
C	50–64	Moderate Citability — needs more specificity or structure
D	35–49	Low Citability — vague, pronoun-heavy, or lacks data
F	0–34	Poor Citability — unlikely to be cited by AI
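The grade bands above translate directly into a lookup. The function name grade is hypothetical, and the handling of fractional scores (79.5 falls into B) is an assumption, since the table lists whole numbers only.

```python
def grade(score: float) -> str:
    """Map a 0-100 passage score to a letter grade using the bands above."""
    for floor, letter in ((80, "A"), (65, "B"), (50, "C"), (35, "D")):
        if score >= floor:
            return letter
    return "F"
```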

Your overall citability score is the average across all passages, with bonuses for multiple A/B passages, optimal passage lengths, and question-style headings, and a penalty for pages under 500 words.

4. Schema Validation

We parse all <script type="application/ld+json"> blocks on your page and check for the schema types that matter most for AI entity recognition, following the schema.org vocabulary.

Organization — The foundation of entity identity. Must include name, url, logo, and sameAs (links to social profiles, Wikipedia, Wikidata). Missing sameAs is the most common gap — it is how AI connects your site to your broader online presence.

Article / BlogPosting / NewsArticle — Signals authored content with a publication date. Must include headline, datePublished, and author. The optional speakable property (per Google Search Central) marks sections suitable for AI reading.

FAQPage — Directly maps question-answer pairs for AI extraction. One of the highest-impact schema types for AI visibility.

BreadcrumbList — Helps AI understand site hierarchy and generate accurate source attributions.

knowsAbout — An underused Organization/Person property that declares your expertise areas. AI systems use this for topical authority.

WebSite — Enables sitelinks search box and helps AI treat your site as a unified entity.
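As an illustration of the parsing step, the sketch below pulls JSON-LD blocks out of a page with the standard library, walks nested structures (including @graph arrays), and reports which of the Organization properties named above are missing. The names JsonLdExtractor, iter_nodes, and missing_org_fields are hypothetical, and malformed JSON is silently skipped here, which a real validator would more likely report.

```python
import json
from html.parser import HTMLParser
from typing import Iterator, Optional

REQUIRED_ORG_FIELDS = ("name", "url", "logo", "sameAs")


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of every application/ld+json script."""

    def __init__(self) -> None:
        super().__init__()
        self._in_jsonld = False
        self._buffer: list[str] = []
        self.blocks: list = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # malformed block: skipped in this sketch


def iter_nodes(value) -> Iterator[dict]:
    """Yield every dict that has an @type, descending into lists and @graph."""
    if isinstance(value, list):
        for item in value:
            yield from iter_nodes(item)
    elif isinstance(value, dict):
        if "@type" in value:
            yield value
        for child in value.values():
            yield from iter_nodes(child)


def missing_org_fields(html: str) -> Optional[list[str]]:
    """Return missing Organization properties, or None if no Organization exists."""
    parser = JsonLdExtractor()
    parser.feed(html)
    for node in iter_nodes(parser.blocks):
        types = node["@type"]
        if "Organization" in ([types] if isinstance(types, str) else types):
            return [field for field in REQUIRED_ORG_FIELDS if field not in node]
    return None
```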

5. Technical Appendix

This section documents the exact scoring formulas implemented in analyzer.py. All scores are deterministic — running the same page twice will produce identical results.

Overall Score Formula

overall = (crawler_score * 0.20) + (schema_score * 0.10) + (citability_score * 0.25)
        + (tech_score * 0.15) + (llm_score * 0.15) + (brand_score * 0.10)
        + min(experience_signals * 1.5 + expertise_signals * 1.0, 5)

Result is clamped to [0, 100].
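Written as a function, the formula looks roughly like this. The function and dictionary names are illustrative rather than taken from analyzer.py; the weights, the bonus term, and the clamp follow the formula above.

```python
WEIGHTS = {
    "crawler": 0.20,
    "schema": 0.10,
    "citability": 0.25,
    "tech": 0.15,
    "llm": 0.15,
    "brand": 0.10,
}


def overall_score(scores: dict[str, float],
                  experience_signals: int,
                  expertise_signals: int) -> float:
    """Weighted category sum plus the E-E-A-T bonus, clamped to [0, 100]."""
    weighted = sum(scores[name] * weight for name, weight in WEIGHTS.items())
    bonus = min(experience_signals * 1.5 + expertise_signals * 1.0, 5)
    return max(0.0, min(100.0, weighted + bonus))
```

Because the six weights add up to 0.95, a site scoring 100 in every category reaches 95 before the bonus, and the bonus supplies the remaining 5 points.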

Category 1: AI Crawler Access (weight 20%)

No robots.txt exists: base score = 60 (accessible but not ideal).
robots.txt exists: allowed_ratio = 1 - (blocked + partial * 0.5) / total_crawlers. Base = allowed_ratio * 70.
Bonuses: robots.txt exists (+10), sitemap referenced in robots (+10), llms.txt exists (+10), llms.txt valid format (+5).
Penalty: -10 for each blocked key crawler (GPTBot, ClaudeBot, PerplexityBot, Google-Extended).
llms.txt validation follows the llmstxt.org spec: must have # Title, > Description, ## Section headings, and - [Link](url) entries.
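A sketch of how these rules might combine. Several details are assumptions because the rules above do not spell them out: BLOCKED and BLOCKED_BY_WILDCARD both count as blocked, the llms.txt bonuses still apply when robots.txt is missing, and the category result is clamped to [0, 100]. The function name crawler_access_score is hypothetical.

```python
from typing import Optional

KEY_CRAWLERS = {"GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended"}
BLOCKED_STATUSES = {"BLOCKED", "BLOCKED_BY_WILDCARD"}   # assumption


def crawler_access_score(statuses: Optional[dict[str, str]],
                         sitemap_in_robots: bool = False,
                         llms_txt: bool = False,
                         llms_txt_valid: bool = False) -> float:
    """statuses maps crawler name to status; None means robots.txt is absent."""
    if statuses is None:
        score = 60.0                                     # no robots.txt
    else:
        blocked = [name for name, s in statuses.items() if s in BLOCKED_STATUSES]
        partial = sum(s == "PARTIALLY_BLOCKED" for s in statuses.values())
        allowed_ratio = 1 - (len(blocked) + partial * 0.5) / len(statuses)
        score = allowed_ratio * 70 + 10                  # base + robots.txt bonus
        score += 10 if sitemap_in_robots else 0
        score -= 10 * len(KEY_CRAWLERS.intersection(blocked))
    score += 10 if llms_txt else 0
    score += 5 if llms_txt and llms_txt_valid else 0
    return max(0.0, min(100.0, score))
```

For example, with four crawlers of which one key crawler is blocked and one is partially blocked, allowed_ratio is 1 - 1.5/4 = 0.625, giving 43.75 + 10 - 10 = 43.75.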

Category 2: Structured Data & Schema (weight 10%)

Points are additive: Organization (+15), sameAs links 5+ (+15) / 2+ (+8) / 1+ (+4), Article (+10), FAQ (+8), Product (+8), BreadcrumbList (+5), WebSite (+5), speakable (+5), any JSON-LD present (+5). If no JSON-LD exists at all, score is forced to 0. Penalties: -5 per schema issue (up to -10 for 5+ issues), -5 if Organization exists but lacks knowsAbout.

Category 3: Content AI-Citability (weight 25%)

Base = average passage score across all content blocks. Bonuses: 3+ optimal-length passages (+10) or 1+ (+5); 3+ grade A/B passages (+10) or 1+ (+5); 3+ question headings (+5) or 1+ (+2). Penalty: word count under 300 (-20) or under 500 (-10).
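The category calculation can be sketched as follows. The function names and inputs are hypothetical; "grade A/B" is taken to mean a passage score of 65 or higher, per the grade table, and clamping the result to [0, 100] is an assumption.

```python
def tiered(count: int, tiers: tuple[tuple[int, int], ...]) -> int:
    """Return the points of the first tier whose threshold the count meets."""
    for threshold, points in tiers:
        if count >= threshold:
            return points
    return 0


def citability_score(passage_scores: list[float],
                     passage_word_counts: list[int],
                     question_headings: int,
                     page_word_count: int) -> float:
    if not passage_scores:
        return 0.0
    score = sum(passage_scores) / len(passage_scores)          # base
    optimal = sum(134 <= n <= 167 for n in passage_word_counts)
    strong = sum(s >= 65 for s in passage_scores)              # grade A or B
    score += tiered(optimal, ((3, 10), (1, 5)))
    score += tiered(strong, ((3, 10), (1, 5)))
    score += tiered(question_headings, ((3, 5), (1, 2)))
    if page_word_count < 300:
        score -= 20
    elif page_word_count < 500:
        score -= 10
    return max(0.0, min(100.0, score))
```

A page with passage scores of 82, 70, and 40, two optimal-length passages, one question heading, and 450 words would score 64 + 5 + 5 + 2 - 10 = 66.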

Passage scoring (0–100 per block):

Metric	Max	Key Signals
Answer Block Quality	30	Definition patterns (+15), facts in first 60 words (+15), question heading (+10), clear sentences (+10), attribution (+10)
Self-Containment	25	134–167 words (+10), pronoun ratio < 2% (+8), 3+ proper nouns (+7)
Structural Readability	20	Avg sentence 10–20 words (+8), transitions (+4), numbered lists (+4), line breaks (+4)
Statistical Density	15	Percentages (+3 each, max 6), dollar amounts (+3 each, max 5), named quantities (+2 each, max 4), year refs (+2), source names (+2)
Uniqueness Signals	10	First-party research (+5), case studies (+3), named tools (+2)
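Note that the signals listed for Answer Block Quality add up to 60 and those for Statistical Density to 19, both above the stated maximums of 30 and 15, so the raw points are presumably capped per metric. The sketch below assumes exactly that: each metric's raw points are clipped to its maximum before the five metrics are summed. The names are illustrative.

```python
METRIC_MAX = {
    "answer_block": 30,
    "self_containment": 25,
    "readability": 20,
    "statistical": 15,
    "uniqueness": 10,
}


def passage_score(raw_points: dict[str, int]) -> int:
    """Sum the five metrics after capping each at its maximum (assumed rule)."""
    return sum(min(raw_points.get(metric, 0), cap)
               for metric, cap in METRIC_MAX.items())
```

Under this assumption, a passage with raw points of 40, 15, 12, 19, and 5 scores 30 + 15 + 12 + 15 + 5 = 77.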

Category 4: Technical SEO Foundations (weight 15%)

Points: title (+8), meta description (+8), canonical (+5), viewport (+5), complete OG tags (+8), sitemap with lastmod (+8), semantic HTML (+5), <main> tag (+3), single H1 (+5), valid hierarchy (+3), image alt ratio (up to +5), SSR detected (+10), security (up to +8), lang (+2). Security points: HTTPS (+4), HSTS (+2), and +1 each for CSP, X-Content-Type-Options, X-Frame-Options, and Referrer-Policy, capped at 8.
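The security portion is simple enough to show in full. This sketch assumes the response headers are available as a plain dictionary and that the content-type entry refers to the X-Content-Type-Options header; the function name security_score is hypothetical. The listed items add up to 10, so the stated limit of 8 acts as a cap.

```python
ONE_POINT_HEADERS = (
    "content-security-policy",
    "x-content-type-options",
    "x-frame-options",
    "referrer-policy",
)


def security_score(url: str, headers: dict[str, str]) -> int:
    """Score HTTPS and security headers, capped at 8 points."""
    present = {name.lower() for name in headers}     # header names are case-insensitive
    score = 4 if url.lower().startswith("https://") else 0
    score += 2 if "strict-transport-security" in present else 0
    score += sum(1 for name in ONE_POINT_HEADERS if name in present)
    return min(score, 8)
```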

Category 5: LLM Discoverability (weight 15%)

Points: llms.txt exists (+20), valid format (+15), llms-full.txt (+5), FAQ schema (+10), 5+ question headings (+10) / 2+ (+5), 3+ high answer-quality passages (+10) / 1+ (+5), Organization + meta description (+10), sitemap (+5), 20+ internal links (+5) / 10+ (+3), 1500+ words (+5) / 800+ (+3).
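The "valid format" check relies on the four llms.txt elements named under Category 1: a # Title, a > Description, at least one ## Section heading, and at least one - [Link](url) entry. A minimal line-based check might look like this; the function name and regular expressions are assumptions, and a fuller validator would also check ordering.

```python
import re


def is_valid_llms_txt(text: str) -> bool:
    """Check that all four required llms.txt elements appear at least once."""
    lines = [line.strip() for line in text.splitlines()]
    has_title = any(re.match(r"#\s+\S", line) for line in lines)
    has_description = any(re.match(r">\s*\S", line) for line in lines)
    has_section = any(re.match(r"##\s+\S", line) for line in lines)
    has_link = any(re.match(r"-\s+\[[^\]]+\]\([^)\s]+\)", line) for line in lines)
    return has_title and has_description and has_section and has_link
```

The title pattern requires whitespace right after a single #, so a ## heading does not count as a title.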

Category 6: Brand & Authority Signals (weight 10%)

Social platforms (capped at 40 total): YouTube (+15), Reddit (+12), LinkedIn (+10), Twitter/X (+8), GitHub (+8), Facebook (+5), Instagram (+5), TikTok (+3). YouTube carries the highest weight per the Ahrefs brand authority study (r = 0.737). Additional: Wikipedia (+15), Wikidata (+10), About/Contact pages (+5 each), author info (+5), sameAs 5+ (+10) / 2+ (+5), testimonials (+5), privacy policy (+3), terms (+2).
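The social-platform cap works as in the sketch below. The dictionary keys and the function name are illustrative; the point values and the cap of 40 come from the list above.

```python
SOCIAL_POINTS = {
    "youtube": 15, "reddit": 12, "linkedin": 10, "twitter": 8,
    "github": 8, "facebook": 5, "instagram": 5, "tiktok": 3,
}


def social_score(platforms: set[str]) -> int:
    """Sum the points for detected platforms, capped at 40."""
    total = sum(SOCIAL_POINTS.get(name.lower(), 0) for name in platforms)
    return min(total, 40)
```

A brand present on YouTube, Reddit, LinkedIn, and GitHub totals 45 raw points and therefore receives the capped 40.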

Edge Cases

SPA / client-side rendering: Near-empty #app, #root, #__next, or #__nuxt containers (< 50 chars) flag SSR failure. AI crawlers generally do not execute JavaScript.
Wildcard robots.txt blocks: User-agent: * / Disallow: / blocks all crawlers including AI bots, flagged separately from per-crawler blocks.
No robots.txt: Scored as permissive (60). Under RFC 9309 Section 2.3.1.3, an unavailable robots.txt means crawlers may access any resource, so absence is treated as all crawlers allowed.
Schema in @graph: CMS platforms (WordPress/Yoast) nest schema in @graph arrays. We parse these recursively.
Minimal content pages: Under 300 words receives a -20 citability penalty.
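The SPA check in the first edge case can be approximated with the standard-library HTML parser. This sketch measures the text inside the first container whose id is app, root, __next, or __nuxt and flags the page when that text is under 50 characters. The class and function names are hypothetical, and inline script text inside the container is counted as content, which a production check would probably exclude.

```python
from html.parser import HTMLParser

CONTAINER_IDS = {"app", "root", "__next", "__nuxt"}
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ContainerText(HTMLParser):
    """Count visible characters inside the first framework mount container."""

    def __init__(self) -> None:
        super().__init__()
        self.depth = 0        # nesting depth inside the container; 0 = outside
        self.found = False
        self.chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:              # void elements never get an end tag
            return
        if self.depth:
            self.depth += 1
        elif not self.found and dict(attrs).get("id") in CONTAINER_IDS:
            self.found = True
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chars += len(data.strip())


def looks_client_rendered(html: str) -> bool:
    """True when a mount container exists but holds fewer than 50 characters."""
    parser = ContainerText()
    parser.feed(html)
    return parser.found and parser.chars < 50
```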

References

Google Search Central: Structured Data — Schema markup guidelines and supported types.
web.dev: Technical SEO — Semantic HTML and meta tag best practices.
schema.org — Full vocabulary reference for Organization, Article, FAQPage, speakable, knowsAbout.
llmstxt.org — Specification for the llms.txt standard.
RFC 9309: Robots Exclusion Protocol — The authoritative standard for robots.txt parsing.
GEO: Generative Engine Optimization — Princeton/Georgia Tech/IIT Delhi research on optimizing content for AI search engines.