CTAIO Labs

Schema Citation Test: 12 Schema.org Variations, 4 LLMs, 14 Days

Twelve schema variations on one identical article. Citation rate measured weekly across ChatGPT, Perplexity, Gemini, and Claude. The methodology, the variants under test, and the per-engine deltas when results land.

Season 3 add-on · Experiment running
Methodology and twelve variations are live. The fourteen-day measurement window is open. Subscribe below for the scorecard when results drop.
Background reading

The category-level explainer for schema in AI search lives at wetheflywheel.com/en/ai-search/schema-for-agentic-search/. That piece is the framework; this experiment is the empirical test.

Key Takeaways

  • The question — Generative engines parse JSON-LD even when Google does not surface rich results. The open question is how much each schema choice actually moves citation rate. Twelve variations on one article, measured against the same baseline, will tell us.
  • What is under test — The 12 variations cover the major axes: Article vs subtypes (NewsArticle, BlogPosting, TechArticle), author as Person vs string, full sameAs vs minimal, dateModified hygiene, FAQPage on/off, HowTo on/off, breadcrumb depth, and a few combinations of these.
  • What we measure — Citation rate on a fixed 40-prompt query set, per engine (ChatGPT, Perplexity, Gemini, Claude with search), per variation, weekly for fourteen days. Same publication date, same canonical answer block, same prose. Only the JSON-LD changes.
  • Why this matters — The category narrative is that schema is important. The category is light on actual numbers. Most published schema advice is heuristic. This experiment is the controlled test the heuristics need.

Why this experiment

Generative engines parse JSON-LD on the pages they cite. That is a confirmed fact in 2026, easily verified by prompting ChatGPT or Perplexity for a structured-data field that exists only in a page's schema. The category narrative — "schema matters for AI search" — is correct.

What the narrative does not have is numbers. The published advice is heuristic. Practitioner posts recommend Article + FAQPage + HowTo without a controlled measurement of how much each choice actually moves citation rate. The Aggarwal et al. paper that named GEO measured nine content strategies on a 10,000-query benchmark, but it did not isolate Schema.org variations as one of the axes. Nobody has, in public.

This experiment closes that gap with the simplest design that can answer it. Twelve schema variations. Same article body. Same query set. Fourteen days of measurement across four engines. The numbers we get out are the first controlled answer to "which schema choices move citation rate, by how much, on which engine."

Methodology

The article under test

One article on a controlled domain. The body is 2,400 words on a technical topic with stable terminology (chosen to minimise engine-side topic-volatility effects). The canonical answer block at the top is identical across all twelve variations. The publication date is identical. The author byline points to the same Person URL. The only thing that differs between the twelve URLs is the embedded JSON-LD.

The twelve variations

| ID | Schema configuration | What this isolates |
|----|----------------------|--------------------|
| V1 | Baseline: Article + Organization + BreadcrumbList + author as plain string | Control. Bare minimum that still validates. |
| V2 | V1 + dateModified bumped weekly | The recency-signal effect. |
| V3 | V1 + author as Person object with profile URL | The author-entity effect. |
| V4 | V3 + author.sameAs to LinkedIn + Wikidata | The entity-disambiguation effect. |
| V5 | V3 + reviewedBy (Person) | The credibility-review signal. |
| V6 | V3 + FAQPage with 6 Q/A | The FAQ-extraction effect. |
| V7 | V3 + HowTo with steps | The procedural-content signal. |
| V8 | V3 with @type=TechArticle instead of Article | The subtype-specificity effect. |
| V9 | V3 with @type=BlogPosting instead of Article | Same as V8 but with a softer subtype. |
| V10 | V3 + speakable schema for the answer block | The voice-extraction signal (relevant for Gemini Live and similar surfaces). |
| V11 | V3 + WebSite + WebPage + nested @id graph | Whether engines reward the fully-linked entity graph vs flat blocks. |
| V12 | Everything in V3..V11 combined | The "kitchen sink" upper bound. Sets the ceiling we can attribute to schema work. |
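For concreteness, a V1-style baseline might look like the block below. The headline, names, dates, and domain are placeholders, not the actual test article's values:

```json
{
  "@context": "https://schema.org",
  "@graph": [
    {
      "@type": "Article",
      "headline": "Placeholder headline for the test article",
      "datePublished": "2026-01-01",
      "author": "Jane Author",
      "publisher": {
        "@type": "Organization",
        "name": "Example Labs",
        "url": "https://example.com/"
      }
    },
    {
      "@type": "BreadcrumbList",
      "itemListElement": [
        {
          "@type": "ListItem",
          "position": 1,
          "name": "Home",
          "item": "https://example.com/"
        }
      ]
    }
  ]
}
```

The V3 layer would swap the string author for an object such as `{"@type": "Person", "name": "Jane Author", "url": "https://example.com/author/jane"}`; every later variation stacks on that.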

The query set

Forty queries, split across four topic clusters that map to the test article. Ten queries per cluster. Mixed informational ("what is...") and procedural ("how do I...") intent. The full list is fixed before the experiment starts and ships with the results.

The measurement loop

Citation rate per engine, per variation, per week. Tools used: Profound (daily refresh, span-level attribution) plus Otterly (per-URL granularity) as a cross-check. Engines covered: ChatGPT with search, Perplexity, Gemini, Claude with web search. Bing Copilot is excluded from this round to keep the data shape manageable; it will be added in a follow-up if the methodology holds.
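Per engine and per variation, the weekly loop reduces to a hit-count over the fixed query set. A minimal sketch of that aggregation follows; the field names are illustrative, not Profound's or Otterly's actual export format:

```python
from collections import defaultdict

def citation_rate(observations):
    """Aggregate raw citation checks into per-(engine, variation) rates.

    `observations` is a list of dicts like
    {"engine": "perplexity", "variation": "V3", "query": "...", "cited": True},
    one entry per (engine, variation, query) run in a given week.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for obs in observations:
        key = (obs["engine"], obs["variation"])
        totals[key] += 1
        hits[key] += bool(obs["cited"])
    # Citation rate = cited runs / total runs for each engine-variation pair.
    return {key: hits[key] / totals[key] for key in totals}
```

Running this weekly per variation yields the twelve-by-four grid of rates the scorecard reports.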

The twelve variants in plain language

The variant axes were chosen to isolate the schema decisions practitioners actually argue about. None of them are exotic; all of them appear in real production pages.

  • Baseline (V1). The bare minimum that validates: Article + Organization + BreadcrumbList + a string author name. Most sites that have schema at all sit here.
  • dateModified hygiene (V2). Same as V1 but with weekly dateModified bumps that match real edits. Tests the recency-signal hypothesis on Perplexity in particular.
  • Author as Person object (V3). Author with profile URL on the same domain. Tests the author-entity hypothesis on ChatGPT specifically.
  • Author with sameAs (V4). V3 plus sameAs to LinkedIn and Wikidata. Tests entity disambiguation.
  • reviewedBy (V5). V3 plus a reviewedBy Person. Tests whether engines reward visible credibility-review signals.
  • FAQPage (V6). V3 plus FAQPage with six Question/Answer pairs. Tests the FAQ-extraction hypothesis directly.
  • HowTo (V7). V3 plus a HowTo with five steps. Tests procedural-content surfacing.
  • TechArticle subtype (V8). Same as V3 but with @type=TechArticle. Tests whether subtype specificity moves the needle.
  • BlogPosting subtype (V9). Same as V3 but @type=BlogPosting. The softer subtype counterpart to V8.
  • Speakable (V10). V3 plus speakable schema on the answer block. Tests voice-extraction surfaces.
  • Full entity graph (V11). V3 plus WebSite + WebPage with nested @id linkage. Tests the connected-graph hypothesis.
  • Kitchen sink (V12). Everything in V3 through V11. Sets the ceiling for what schema work alone can move.
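The FAQPage layer added in V6 has a well-defined shape. A trimmed sketch with placeholder question text (the real variant carries six pairs):

```json
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "Do generative engines read JSON-LD?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Yes. Engines can be prompted for fields that exist only in a page's schema."
      }
    }
  ]
}
```

Each Question/Answer pair mirrors a visible Q/A in the article body, so the schema describes content the engine can also see in the prose.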

What we expect to find (before we have the data)

Pre-registered predictions, so we cannot retrofit the interpretation after the fact:

  1. The dateModified-hygiene variation (V2) will move Perplexity the most and ChatGPT the least.
  2. The author-Person-with-sameAs variation (V4) will move ChatGPT measurably, with smaller effects elsewhere.
  3. The FAQPage variation (V6) will produce the largest extraction-rate increase but a small share-of-voice increase, because the engine will quote the FAQ verbatim rather than cite the rest of the article.
  4. The TechArticle subtype (V8) will move citation rate marginally compared to plain Article. We expect the subtype effect to be smaller than the marketing literature suggests.
  5. The kitchen sink (V12) will set a ceiling around 25–35 percent citation-rate improvement over baseline, with diminishing returns past V6 or V7.

If the data disagrees with any of these predictions, the methodology section above will name which prediction failed and what the data showed. That is the contract.

Caveats and threats to validity

Five known limitations, named in advance so the discussion of the results can stay on the data rather than on the meta-arguments.

  • Single article, single domain. The deltas are valid for one body of text on one site. Generalising to other topics or other domains requires the same experiment, run again. Results are directional, not universal.
  • LLM output is non-deterministic. The same query can produce different answers in the same week. Citation-rate measurements include variance bands; the per-engine sample size has to be large enough to discriminate signal from noise. Forty queries weekly across four engines gives us that, but the bands are real.
  • Engine updates mid-window. One of the four engines may ship an update during the fourteen-day window. If that happens, the data is split into pre-update and post-update buckets in the report.
  • Visibility tracker drift. Profound and Otterly may update their methodology mid-window. Both are cross-checked weekly against manual prompt sampling.
  • The kitchen sink is correlated. V12 stacks every other variation; we cannot attribute its lift cleanly to any single layer. It is included to establish the upper bound, not to demonstrate causality.
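The variance bands on a 40-query citation rate can be computed in the standard way for a binomial proportion. A sketch using the Wilson score interval, which is our assumption for the band construction, not something the methodology above specifies:

```python
from math import sqrt

def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    Gives the band around a citation rate measured on n queries;
    z=1.96 corresponds to roughly 95% confidence.
    """
    if n == 0:
        raise ValueError("need at least one observation")
    p = hits / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - half, centre + half
```

At 10 citations out of 40 queries (a 25 percent rate), the band spans roughly 14 to 40 percent, which is why two variations a few points apart cannot be separated on one week of data.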

How this fits into Season 3

Season 3 of CTAIO Labs has been the agentic-search season. Three experiments have run already: S3E1 (ten visibility tools on three real brands), S3E2 (one article rewritten under three optimisation frameworks), and S3E3 (the llms.txt rollout across three sites).

This schema citation test is the Season 3 add-on that isolates the most-discussed layer underneath. Together the four pieces give the practitioner side of the AI-search optimisation question: which tools to measure with, which framework to optimise under, which site-level files to ship, and which per-page schema choices actually move the metric.

Why a controlled schema A/B test rather than another vendor comparison?

Because the underlying question is unsettled. Vendor tools tell you which pages got cited; they cannot tell you why one page got cited and another did not. The only honest way to attribute the why to a schema choice is to publish the same article twelve times with twelve different schema configurations and measure the delta. That is what this experiment does.

How is variation isolation handled when the prose is identical?

Each variation is published at a different URL on the same site. The article body, the canonical answer block, the date, and the author byline are identical. Only the JSON-LD blocks differ. The query set is the same for each URL. The engines see twelve nearly-identical pages with twelve different machine-readable descriptions and choose what to cite. The delta is attributable to the schema choice with the usual confidence-interval caveats around LLM-output non-determinism.

How long until results are available?

Fourteen days of measurement, then a week of analysis. Methodology is published now (this page). The full scorecard with per-engine, per-variation citation deltas drops as a Season 3 add-on episode of the CTAIO Labs podcast. Subscribe below if you want the numbers when they land.

How does this fit with the other Season 3 experiments?

S3E1 tested ten visibility tools on three real brands. S3E2 tested the same article rewritten under three optimisation frameworks (GEO, AEO, LLM-SEO). S3E3 measured the citation lift from rolling out an llms.txt across three sites. This experiment isolates the schema layer specifically. Together the four pieces give the practitioner side of the AI-search optimisation question on real budget across real engines.

Which Schema.org types matter most for AI-mediated search in 2026?

Based on practitioner reports and the framework essay on We The Flywheel, the seven types worth shipping are Article (and its subtypes), FAQPage, HowTo, Product, Organization, Person, and Review. Most pages need two or three of those. The full reference is at wetheflywheel.com/en/ai-search/schema-for-agentic-search/.

Do generative engines actually read JSON-LD?

Yes. The independent verification is that you can prompt ChatGPT or Perplexity for a specific JSON-LD field (the publication date, the author URL, a sourced statistic stored only in the schema) and the engine returns it within a week of the schema being deployed. The experiment design assumes this; the deltas it measures depend on it; the practitioner literature confirms it.

Will the methodology change as the experiment runs?

Some. If an unexpected confound surfaces (vendor API drift, an engine update mid-window, a measurement anomaly), the methodology section above will be updated and the change logged. The query set will not change mid-experiment. The variations will not change mid-experiment. Anything else, we publish the diff.

How was the 40-prompt query set chosen?

Forty queries split across four topic clusters that map to the underlying test article (technical SEO, AI search optimisation, schema reference, JSON-LD examples). Ten queries per cluster, mixed across informational ("what is...") and procedural ("how do I...") intent. The list is fixed before the experiment starts and is published in full alongside the results.

What is the relationship to Thomas Prommer's case study on prommer.net?

Thomas's case study at prommer.net/en/tech/guides/what-i-changed-to-get-cited-by-chatgpt/ documents the broader six-intervention rollout on his site. This experiment isolates the schema layer specifically and runs it as a controlled comparison. The two are complementary: his is the holistic field test on a real site; this is the controlled A/B that isolates one variable.
