Why this experiment
Generative engines parse JSON-LD on the pages they cite. That is verifiable in 2026: prompt ChatGPT or Perplexity for a structured-data field that exists only in a page's schema and the engine returns it. The category narrative — "schema matters for AI search" — is correct.
What the narrative does not have is numbers. The published advice is heuristic. Practitioner posts recommend Article + FAQPage + HowTo without a controlled measurement of how much each choice actually moves citation rate. The Aggarwal et al. paper that named GEO measured nine content strategies on a 10,000-query benchmark, but it did not isolate Schema.org variations as one of the axes. Nobody has, in public.
This experiment closes that gap with the simplest design that can close it. Twelve schema variations. Same article body. Same query set. Fourteen days of measurement across four engines. The numbers we get out are the first controlled answer to "which schema choices move citation rate, by how much, on which engine."
Methodology
The article under test
One article on a controlled domain. The body is 2,400 words on a technical topic with stable terminology (chosen to minimise engine-side topic-volatility effects). The canonical answer block at the top is identical across all twelve variations. The publication date is identical. The author byline points to the same Person URL. The only thing that differs between the twelve URLs is the embedded JSON-LD.
The twelve variations
| ID | Schema configuration | What this isolates |
|---|---|---|
| V1 | Baseline: Article + Organization + BreadcrumbList + Person author (string only) | Control. Bare minimum that still validates. |
| V2 | V1 + dateModified bumped weekly | Isolates the recency-signal effect. |
| V3 | V1 + author as Person object with profile URL | Isolates the author-entity effect. |
| V4 | V3 + author.sameAs to LinkedIn + Wikidata | Isolates the entity-disambiguation effect. |
| V5 | V3 + reviewedBy (Person) | Isolates the credibility-review signal. |
| V6 | V3 + FAQPage with 6 Q/A | Isolates the FAQ-extraction effect. |
| V7 | V3 + HowTo with steps | Isolates the procedural-content signal. |
| V8 | V3 with @type=TechArticle instead of Article | Isolates the subtype-specificity effect. |
| V9 | V3 with @type=BlogPosting instead of Article | Same as V8 but with a softer subtype. |
| V10 | V3 + speakable schema for the answer block | Isolates voice-extraction signal (relevant for Gemini Live and similar surfaces). |
| V11 | V3 + WebSite + WebPage + nested @id graph | Tests whether engines reward the fully-linked entity graph vs flat blocks. |
| V12 | Everything in V3–V11 combined | The "kitchen sink" upper bound. Sets the ceiling we can attribute to schema work. |
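
To make the table concrete, here is a minimal sketch of how the variation payloads could be generated from a shared baseline. The dict mirrors V1, and the helper applies the V3 and V4 layers on top of it; the domain, names, and build approach are illustrative assumptions, not the experiment's actual templates, and the standalone BreadcrumbList block is omitted for brevity.

```python
import copy
import json

# Illustrative baseline (V1). Domain, names, and dates are placeholders;
# the BreadcrumbList block is omitted for brevity.
V1_BASELINE = {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "Placeholder headline for the test article",
    "datePublished": "2026-01-05",
    "author": "Jane Doe",  # V1: author as a bare string
    "publisher": {"@type": "Organization", "name": "Example Publisher"},
}

def make_v4(baseline: dict) -> dict:
    """V4 = V3 (author as Person with a profile URL) + sameAs disambiguation."""
    variant = copy.deepcopy(baseline)
    variant["author"] = {
        "@type": "Person",
        "name": "Jane Doe",
        "url": "https://example.com/authors/jane-doe",  # the V3 layer
        "sameAs": [                                      # the V4 layer
            "https://www.linkedin.com/in/janedoe",
            "https://www.wikidata.org/wiki/Q00000000",
        ],
    }
    return variant

# Each variant's dict is what gets serialised into the page's
# <script type="application/ld+json"> block; everything else on the page is identical.
print(json.dumps(make_v4(V1_BASELINE), indent=2))
```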
The query set
Forty queries, split across four topic clusters that map to the test article. Ten queries per cluster. Mixed informational ("what is...") and procedural ("how do I...") intent. The full list is fixed before the experiment starts and ships with the results.
The measurement loop
Citation rate per engine, per variation, per week. Tools used: Profound (daily refresh, span-level attribution) plus Otterly (per-URL granularity) as a cross-check. Engines covered: ChatGPT with search, Perplexity, Gemini, Claude with web search. Bing Copilot is excluded from this round to keep the data shape manageable; it will be added in a follow-up if the methodology holds.
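
For concreteness, a minimal sketch of the weekly aggregation step, assuming one boolean observation per engine, variant, week, and query exported from the tracking tools. The record shape and field names are assumptions made for illustration, not either vendor's API.

```python
from collections import defaultdict

# Hypothetical export shape: one row per (engine, variant, week, query) with a
# boolean "cited" flag. Field names are assumptions, not a vendor schema.
observations = [
    {"engine": "perplexity", "variant": "V4", "week": 1, "query": "q01", "cited": True},
    {"engine": "perplexity", "variant": "V1", "week": 1, "query": "q01", "cited": False},
    # ... 40 queries x 4 engines x 12 variants per week
]

def citation_rates(rows):
    """Citation rate per (engine, variant, week): cited queries / total queries."""
    hits, totals = defaultdict(int), defaultdict(int)
    for row in rows:
        key = (row["engine"], row["variant"], row["week"])
        totals[key] += 1
        hits[key] += int(row["cited"])
    return {key: hits[key] / totals[key] for key in totals}

for key, rate in sorted(citation_rates(observations).items()):
    print(key, f"{rate:.0%}")
```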
The twelve variants in plain language
The variant axes were chosen to isolate the schema decisions practitioners actually argue about. None of them are exotic; all of them appear in real production pages.
- Baseline (V1). The bare minimum that validates: Article + Organization + BreadcrumbList + a string author name. Most sites that have schema at all sit here.
- dateModified hygiene (V2). Same as V1 but with weekly dateModified bumps that match real edits. Tests the recency-signal hypothesis on Perplexity in particular.
- Author as Person object (V3). Author with profile URL on the same domain. Tests the author-entity hypothesis on ChatGPT specifically.
- Author with sameAs (V4). V3 plus sameAs to LinkedIn and Wikidata. Tests entity disambiguation.
- reviewedBy (V5). V3 plus a reviewedBy Person. Tests whether engines reward visible credibility-review signals.
- FAQPage (V6). V3 plus FAQPage with six Question/Answer pairs (sketched in schema form after this list). Tests the FAQ-extraction hypothesis directly.
- HowTo (V7). V3 plus a HowTo with five steps. Tests procedural-content surfacing.
- TechArticle subtype (V8). Same as V3 but with @type=TechArticle. Tests whether subtype specificity moves the needle.
- BlogPosting subtype (V9). Same as V3 but @type=BlogPosting. The softer subtype counterpart to V8.
- Speakable (V10). V3 plus speakable schema on the answer block. Tests voice-extraction surfaces.
- Full entity graph (V11). V3 plus WebSite + WebPage with nested @id linkage. Tests the connected-graph hypothesis.
- Kitchen sink (V12). Everything in V3 through V11. Sets the ceiling for what schema work alone can move.
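
Of the layers above, the FAQPage block that V6 stacks on V3 is the heaviest addition, so here is a minimal sketch of its structure. The question and answer text are placeholders; only the nesting is the point.

```python
import json

# Illustrative FAQPage block for V6; question and answer text are placeholders.
faq_block = {
    "@context": "https://schema.org",
    "@type": "FAQPage",
    "mainEntity": [
        {
            "@type": "Question",
            "name": "Placeholder question mirroring one of the on-page FAQs",
            "acceptedAnswer": {
                "@type": "Answer",
                "text": "Placeholder answer, identical to the visible answer text.",
            },
        },
        # ... five more Question/Answer pairs, one per visible FAQ on the page
    ],
}

print(json.dumps(faq_block, indent=2))
```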
What we expect to find (before we have the data)
Pre-registered predictions, so we cannot retrofit the interpretation after the fact:
- The dateModified-hygiene variation (V2) will move Perplexity the most and ChatGPT the least.
- The author-Person-with-sameAs variation (V4) will move ChatGPT measurably, with smaller effects elsewhere.
- The FAQPage variation (V6) will produce the largest extraction-rate increase but a small share-of-voice increase, because the engine will quote the FAQ verbatim rather than cite the rest of the article.
- The TechArticle subtype (V8) will move citation rate marginally compared to plain Article. We expect the subtype effect to be smaller than the marketing literature suggests.
- The kitchen sink (V12) will set a ceiling around 25–35 percent citation-rate improvement over baseline, with diminishing returns past V6 or V7.
If the data disagrees with any of these predictions, the methodology section above will name which prediction failed and what the data showed. That is the contract.
Caveats and threats to validity
Five known limitations, named in advance so the discussion of the results can stay on the data rather than on the meta-arguments.
- Single article, single domain. The deltas are valid for one body of text on one site. Generalising to other topics or other domains requires the same experiment, run again. Results are directional, not universal.
- LLM output is non-deterministic. The same query can produce different answers in the same week. Citation-rate measurements therefore carry variance bands; the per-engine sample size has to be large enough to separate signal from noise. Forty queries per week across four engines gives us that, but the bands are real; the sketch after this list shows how wide they can be for a single engine in a single week.
- Engine updates mid-window. One of the four engines may ship an update during the fourteen-day window. If that happens, the data is split into pre-update and post-update buckets in the report.
- Visibility tracker drift. Profound and Otterly may update their methodology mid-window. Both are cross-checked weekly against manual prompt sampling.
- The kitchen sink is correlated. V12 stacks every other variation; we cannot attribute its lift cleanly to any single layer. It is included to establish the upper bound, not to demonstrate causality.
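
To put a number on the non-determinism caveat, here is a sketch of the variance band on a single engine-week, treating citation rate as a binomial proportion and using a 95 percent Wilson interval. The 14-of-40 figure is illustrative, not a result.

```python
from math import sqrt

def wilson_interval(cited: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion (here, citation rate)."""
    p = cited / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

# Illustrative: 14 of 40 queries cite the variant on one engine in one week.
low, high = wilson_interval(14, 40)
print(f"observed 35%, 95% band roughly {low:.0%} to {high:.0%}")
```

A single engine-week at 35 percent carries a band from roughly the low twenties to about fifty percent, which is why per-engine deltas are read across the full fourteen-day window rather than from any single week.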
How this fits into Season 3
Season 3 of CTAIO Labs has been the agentic-search season. Three experiments have run already:
- S3E1: 10 LLM Visibility Tools on 3 Real Brands. The measurement layer.
- S3E2: GEO vs AEO vs LLM-SEO. The framework-level A/B.
- S3E3: llms.txt 30-Day Citation Experiment. The site-level intervention.
This schema citation test is the Season 3 add-on that isolates the most-discussed layer underneath those three: the per-page structured data. Together the four pieces give the practitioner side of the AI-search optimisation question: which tools to measure with, which framework to optimise under, which site-level files to ship, and which per-page schema choices actually move the metric.
Related on the network
Why a controlled schema A/B test rather than another vendor comparison?
Because the underlying question is unsettled. Vendor tools tell you which pages got cited; they cannot tell you why one page got cited and another did not. The only honest way to attribute the why to a schema choice is to publish the same article twelve times with twelve different schema configurations and measure the delta. That is what this experiment does.
How is variation isolation handled when the prose is identical?
Each variation is published at a different URL on the same site. The article body, the canonical answer block, the date, and the author byline are identical. Only the JSON-LD blocks differ. The query set is the same for each URL. The engines see twelve nearly-identical pages with twelve different machine-readable descriptions and choose what to cite. The delta is attributable to the schema choice with the usual confidence-interval caveats around LLM-output non-determinism.
How long until results are available?
Fourteen days of measurement, then a week of analysis. Methodology is published now (this page). The full scorecard with per-engine, per-variation citation deltas drops as a Season 3 add-on episode of the CTAIO Labs podcast. Subscribe below if you want the numbers when they land.
How does this fit with the other Season 3 experiments?
S3E1 tested ten visibility tools on three real brands. S3E2 tested the same article rewritten under three optimisation frameworks (GEO, AEO, LLM-SEO). S3E3 measured the citation lift from rolling out an llms.txt across three sites. This experiment isolates the schema layer specifically. Together the four pieces give the practitioner side of the AI-search optimisation question on real budget across real engines.
Which Schema.org types matter most for AI-mediated search in 2026?
Based on practitioner reports and the framework essay on We The Flywheel, the seven types worth shipping are Article (and its subtypes), FAQPage, HowTo, Product, Organization, Person, and Review. Most pages need two or three of those. The full reference is at wetheflywheel.com/en/ai-search/schema-for-agentic-search/.
Do generative engines actually read JSON-LD?
Yes. The independent verification is that you can prompt ChatGPT or Perplexity for a specific JSON-LD field (the publication date, the author URL, a sourced statistic stored only in the schema) and the engine returns it within a week of the schema going live. The experiment design assumes this; the deltas it measures depend on it; the practitioner literature confirms it.
Will the methodology change as the experiment runs?
Some. If an unexpected confound surfaces (vendor API drift, an engine update mid-window, a measurement anomaly), the methodology section above will be updated and the change logged. The query set will not change mid-experiment. The variations will not change mid-experiment. Anything else, we publish the diff.
How was the 40-prompt query set chosen?
Forty queries split across four topic clusters that map to the underlying test article (technical SEO, AI search optimisation, schema reference, JSON-LD examples). Ten queries per cluster, mixed across informational ("what is...") and procedural ("how do I...") intent. The list is fixed before the experiment starts and is published in full alongside the results.
What is the relationship to Thomas Prommer's case study on prommer.net?
Thomas's case study at prommer.net/en/tech/guides/what-i-changed-to-get-cited-by-chatgpt/ documents the broader six-intervention rollout on his site. This experiment isolates the schema layer specifically and runs it as a controlled comparison. The two are complementary: his is the holistic field test on a real site; this is the controlled A/B that isolates one variable.