How the test works
The voice (EP01) replicates how I sound. The face (EP02) replicates how I look. The brain — knowledge, opinions, judgement — is the part that actually does the work, and it is the part I have been ignoring while the AI twin season ran. EP03 is the cleanup.
The setup: one shared corpus, three architectures that claim to answer
questions over it, the same seven questions plus one multi-turn
working-memory probe sent to all three. The corpus is the published English
content on ctaio.dev: 961 chunks across 223 pages, ingested by
the Pagefind index and re-embedded for the RAG and long-context tests.
Total experiment spend: $4.30 across all three systems for
the full battery.
The questions span seven failure modes I wanted to expose: specific recall, cross-document synthesis, faithfulness to my actual opinion, recency awareness, cross-cluster reasoning, out-of-scope handling, and hallucination resistance. The eighth prompt — Q7 — is the one that matters: a five-turn conversation with a constraint set in turn one, designed to test whether the system still honours that constraint by turn five.
The framing comes from Robert Sfeir's essay on AI working memory. Long-term memory in AI systems is solved — that is what RAG and vector databases do. Short-term context is solved — that is what the model's window does. The unsolved layer is the working memory in between: the active layer that holds a constraint across a long pipeline. Every second-brain product in 2026 has a different answer to where that layer lives. Most of them are wrong.
Three paradigms, tested head-to-head
The space of "answer questions over my knowledge" splits into three architectures. I tested one production-quality instance of each and ran the same battery against all three.
Ask CTAIO (production RAG)
Embeddings + vector search + small LLM generation
- Stack
- OpenAI text-embedding-3-small (1536d) → sqlite-vec → top-k=6 with MMR → gpt-4.1-mini
- Cost / query
- $0.0009–$0.0032 (median ~$0.0014)
- Latency
- 5–9s per query, 9s total when 7 questions parallelize
- Update friction
- Re-build Pagefind index, run incremental ingest (~$0.008 for 423 new chunks), rsync DB to sin, restart service
- Privacy
- Self-hosted on sin. Corpus never leaves the box. OpenAI sees only question text and returned chunk content
- Working memory
- 3-turn rolling window (server-enforced cap). 4+ turn history returns HTTP 400. Constraints set early in a session evaporate by turn 4
- Integration
- Live at https://ctaio.dev/en/ask-ctaio/. The same backend serves prommer.net via persona config
- Score
- 0 of 7 wins outright
Verdict: Cheapest and fastest. Hallucinates when retrieval is weak. Working memory is structurally absent.
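For readers who want the shape of that stack in code, here is a minimal sketch of the query path, assuming the components listed above (OpenAI embeddings, sqlite-vec, gpt-4.1-mini). The file name, table schema, and prompt are illustrative — this is not the production Fastify service — and the MMR re-ranking step is elided:

```python
# Minimal sketch of the RAG query path: embed the question, KNN-search
# sqlite-vec, generate a grounded answer. MMR diversity is elided.
import sqlite3
import sqlite_vec
from openai import OpenAI

client = OpenAI()
db = sqlite3.connect("corpus.db")  # hypothetical path
db.enable_load_extension(True)
sqlite_vec.load(db)

def ask(question: str, k: int = 6) -> str:
    emb = client.embeddings.create(
        model="text-embedding-3-small", input=question
    ).data[0].embedding
    # Assumes: CREATE VIRTUAL TABLE chunks USING vec0(embedding float[1536])
    # plus a chunk_text table keyed by the same rowid.
    rows = db.execute(
        "SELECT rowid FROM chunks WHERE embedding MATCH ? AND k = ? "
        "ORDER BY distance",
        (sqlite_vec.serialize_float32(emb), k),
    ).fetchall()
    context = "\n---\n".join(
        db.execute("SELECT text FROM chunk_text WHERE rowid = ?", r).fetchone()[0]
        for r in rows
    )
    resp = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[
            {"role": "system", "content": "Answer only from the context. "
             "Cite sources. Refuse if the context is insufficient."},
            {"role": "user", "content": f"Context:\n{context}\n\nQ: {question}"},
        ],
    )
    return resp.choices[0].message.content
```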
Gemini 2.5 Pro long-context dump
Paste 705k tokens of corpus into a 1M-context model. No retrieval, no chunking
- Stack
- gemini-2.5-pro · explicit cached_content (1hr TTL) · max_output_tokens raised from 2048 to 8192 after first run
- Cost / query
- $0.45–$0.50 (cache reads dominate)
- Latency
- 10–26s per query, sequential because cache reads do not parallelize cleanly
- Update friction
- Re-export corpus, recreate cache. No ingest pipeline. The "fix" for stale knowledge is rerunning the script
- Privacy
- Corpus uploaded to Google. Cache lives on their infrastructure for the TTL window
- Working memory
- Per-session full window. No rolling cap. Only constraint is the 1M token ceiling
- Integration
- Script-only. No UI. Could be wrapped as a service but at $0.45/query that is a different product
- Score
- 1 of 7 wins outright (Q4) — the question every other paradigm got wrong
Verdict: Most faithful when it answers. Burns its output token budget on internal "thinking" on hard questions and produces empty responses.
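The long-context setup is a short script. A minimal sketch, assuming the google-genai Python SDK and the settings above (explicit cache, 1-hour TTL, raised output budget); the export file name is a placeholder:

```python
# Sketch: write the corpus into an explicit cache once, then run each
# question against it. Cache reads dominate the per-query cost.
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

corpus = open("corpus_export.txt").read()  # ~705k tokens in this lab

cache = client.caches.create(
    model="gemini-2.5-pro",
    config=types.CreateCachedContentConfig(contents=[corpus], ttl="3600s"),
)

def ask(question: str) -> str:
    resp = client.models.generate_content(
        model="gemini-2.5-pro",
        contents=question,
        config=types.GenerateContentConfig(
            cached_content=cache.name,
            max_output_tokens=8192,  # 2048 starved Q3/Q5 on the first run
        ),
    )
    return resp.text
```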
File-based + Claude Code (Karpathy LLM Wiki)
Many markdown files in a few dozen repos. Agent with Read/Grep/Glob navigates on demand
- Stack
- Claude Sonnet 4.6 sub-agent · /opt/ctaio.dev/src/pages/en/ + /opt/agentic-coding/rules/ + EP02 workfiles
- Cost / query
- ~$0.02–$0.05 (Sonnet sub-agent, 53 tool calls across 7 questions)
- Latency
- ~30s per question on average. 53 tool uses across 7 questions in 3.4 minutes
- Update friction
- Edit a file. That is the entire pipeline
- Privacy
- Files never leave the disk unless the agent explicitly reads them. Corpus is local. Anthropic only sees question + the chunks the agent chose to surface
- Working memory
- Per-session full context window. The agent itself can keep constraints across turns within its window
- Integration
- It is the IDE. The "second brain" is whatever you have in /opt at the moment
- Score
- 5 of 7 wins
Verdict: Most faithful, most thorough, most honest about failure. Loses on lexical mismatch — does not know that "the wrong story" matches the H1 "Is the Wrong Story" if grep is case-sensitive.
Worth flagging: I substituted Gemini 2.5 Pro for Claude Opus 4.7 in the long-context slot because the Anthropic API account I run had insufficient credit on the day I ran the battery. Same architecture, different model. The architectural claim — that you can skip retrieval entirely and dump the corpus into a 1M-context window — holds for either model. The specific numbers in this lab are Gemini's.
Try it: Ask CTAIO live
The production RAG instance from this lab is live at ctaio.dev/en/ask-ctaio/. Same corpus tested here. Open it, ask any of the seven questions from this article, and watch the failure modes show up in real time. Q4 — "which platform did I most recently call the wrong story?" — is the one to watch. The system will identify HeyGen Avatar V correctly and then invent a reason.
The full-page embed has the same chat surface used here. Citation chips link back to the source pages. The 3-turn history cap is the one I describe in the working-memory section below.
Scoreboard
One row per question. Three columns for the three systems. Strongest answer wins per row.
| Question | Tests | Ask CTAIO (RAG) | Long-context (Gemini) | File-based + Claude Code | Winner |
|---|---|---|---|---|---|
| Q1 — What lip-sync score did I give HeyGen Avatar V in EP02, and what scoring dimensions did I use? | Long-term recall · specific factual lookup | 7/10 + four dimensions (basic) | Adds the actual rationale ("mouth opens too wide on plosives") and full dimension definitions | Adds the 1–10 scale anchors (10/7/5/3/1) from the methodology section | File-based |
| Q2 — How does CAIO total compensation compare to a Fractional CTO engagement at a Series-B startup? | Cross-document synthesis · two pages must be assembled | $450–$650K CAIO vs $180–$480K Fractional CTO with summary table | Truncated mid-sentence at MAX_TOKENS — 1731 thinking tokens consumed budget | Series-B-specific $320–$450K CAIO + $180–$360K Fractional CTO + $216–$420K Fractional CAIO. Flagged the synthesis as inferred | File-based |
| Q3 — What is my position on enterprise vendors like Synthesia given the practitioner gap? | Faithfulness to voice · opinion question | Hedgy. Flattens the user's actual take into "nuanced and generally cautious" | After max_tokens fix: comprehensive, faithful framing, names the L&D vs editorial split | Use-case-conditional verdict + the central CTO line: "the gap is your problem to understand, not theirs" | File-based |
| Q4 — Which platform did I most recently call "the wrong story" and why? | Recency awareness · semantic vs lexical retrieval | Identified HeyGen Avatar V correctly but FABRICATED an "ElevenLabs shutdown rumour" that does not exist anywhere | Correctly identified HeyGen Avatar V AND the real reasoning: practitioner gap, the "real CTO story" framing, zero Synthesia mentions in Advise Slack | Honest failure: "the exact phrase 'the wrong story' does not appear anywhere in the files searched". Vocabulary-locked grep missed the H1 match | Long-ctx |
| Q5 — For a Series-B CTO allocating budget between hiring a CAIO and AI tooling, what framework should they use? | Cross-cluster synthesis · hardest of the seven | Strong: 8–15% governance benchmark + fractional CAIO pricing + CTO/CAIO budget split | Names the explicit "buy order" framework from the AI Governance Tools guide (inventory → bias testing → model cards → platform at 20+ models) | Six CAIO hiring signals with thresholds + the fractional cost-threshold heuristic ($10M cutoff) | Tie |
| Q6 — What is the best CRM for a 50-person AI startup? | Out-of-scope handling · the corpus has no CRM coverage | Polite decline + 3 related-resource suggestions (AI ROI, Governance Framework) | Concise decline. No alternative suggestions | Searched broadly, flagged the only incidental CRM mention (Chief Transformation Officer comparison), confirmed clean out-of-scope | File-based |
| Q8 — True or false: I concluded that Synthesia produced the best voice clone in EP01. | Hallucination resistance · planted false premise | Correctly false. Names Cartesia as winner (4/5 blind tests) | Correctly false + caught the deeper inference: Synthesia was not in EP01 at all (it was tested in EP02 with ≤2/10) | Correctly false + Synthesia not in EP01 + cites two source files explicitly | File-based |
Final tally: file-based + Claude Code wins 5 of 7 outright. Long-context (Gemini) wins 1. Q5 is a tie between long-context and file-based. Ask CTAIO (production RAG) does not win any question cleanly.
The result surprised me. I started this lab expecting the production RAG to be the baseline that the other two tried to beat on accuracy at higher cost. The opposite happened. The /opt + Claude Code setup I had been running for months as a coding assistant — and which I never described as a "second brain" — was the most faithful answerer once I pointed it at the same corpus.
The working-memory probe
Q7 was different from the seven above. It was a five-turn conversation designed to test whether a constraint set in turn 1 survives to turn 5. Against Ask CTAIO, turn 1 set a rule, turns 2–4 covered filler topics, and by turn 5 the system broke the rule.
Architectural cause
The /chat endpoint enforces a 3-turn (6-message) history cap server-side. Sending more returns HTTP 400. By turn 5, the constraint set in turn 1 is literally not visible to the LLM because it has fallen out of the rolling window. The model is not "forgetting" — the constraint is no longer in the prompt.
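The mechanism is easy to reproduce in miniature. A sketch of the rolling-window truncation (Python standing in for the production Fastify logic):

```python
# 3-turn cap = 6 messages. The model only ever sees the tail of the
# history, so a rule set in turn 1 is absent from the prompt by turn 4+.
MAX_MESSAGES = 6  # the server rejects longer histories with HTTP 400

def visible_history(history: list[dict]) -> list[dict]:
    return history[-MAX_MESSAGES:]

history = [
    {"role": "user", "content": "Rule: answer in one sentence."},  # turn 1
    {"role": "assistant", "content": "Understood."},
]
for turn in (2, 3, 4):  # filler topics
    history += [{"role": "user", "content": f"filler topic {turn}"},
                {"role": "assistant", "content": "..."}]

window = visible_history(history)
assert not any("Rule:" in m["content"] for m in window)
# Turn 5 is generated from `window`: the constraint is simply not there.
```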
Verdict: FAIL. Working memory in this paradigm is architecturally absent, not tunable. The fix is not better prompting. The fix is the consolidation step Sfeir argues for: a session-scoped working buffer that holds active constraints across the whole conversation, with explicit ADD / UPDATE / DELETE / NOOP operations on each new turn (the framing Mem0 introduced). Postgres-only. Idle-triggered. Hard 48-hour cap. Not shipped on Ask CTAIO yet. EP04 candidate.
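To make that fix concrete, here is a sketch of the consolidation step under Sfeir's framing and Mem0's operation set. Nothing here is shipped code; classify_turn is a stand-in for an LLM call, and the buffer would live in Postgres rather than memory:

```python
# Session-scoped working buffer with explicit ADD/UPDATE/DELETE/NOOP
# consolidation. The buffer, not the raw transcript, is re-injected
# into the prompt on every turn, so constraints survive the window.
from dataclasses import dataclass, field

def classify_turn(turn_text: str, constraints: dict) -> tuple[str, str, str]:
    # Stand-in for the Mem0-style LLM call that compares the new turn
    # against existing entries. Trivial rule here so the sketch runs.
    if turn_text.lower().startswith("rule:"):
        return "ADD", "format", turn_text.split(":", 1)[1].strip()
    return "NOOP", "", ""

@dataclass
class WorkingBuffer:
    constraints: dict[str, str] = field(default_factory=dict)

    def consolidate(self, turn_text: str) -> None:
        op, key, value = classify_turn(turn_text, self.constraints)
        if op in ("ADD", "UPDATE"):
            self.constraints[key] = value
        elif op == "DELETE":
            self.constraints.pop(key, None)
        # NOOP: turn carries no constraint-relevant information

    def as_system_prompt(self) -> str:
        # Prepended on every turn: turn-1 rules are still present at turn 5.
        return "Active session constraints:\n" + "\n".join(
            f"- {k}: {v}" for k, v in self.constraints.items()
        )
```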
Three failure modes
Knowing how each paradigm answers a question is half the picture. Knowing how each one fails is the other half. Each system has a signature failure mode you should be able to spot in the wild.
RAG: confabulation
Q4 asked which platform I most recently called "the wrong story." Ask CTAIO identified HeyGen Avatar V correctly — the page exists, the H1 is literally "HeyGen Avatar V Is the Wrong Story" — and then fabricated the reason. The system invented an "ElevenLabs shutdown rumour" that does not exist anywhere in the corpus, in the news cycle, or in reality. The retrieval score (topScore 0.482) was the lowest of any question that returned context. Below that floor, the LLM stops grounding and starts guessing. Fluent, plausible, wrong.
This is the standard RAG failure mode in 2026. The fix is either (a) raise the no-context threshold and refuse weak-signal questions, accepting more "I don't know" responses, or (b) put a stronger model on the generation side and pay the cost. Most production RAG tools default to gpt-4.1-mini or equivalent. They confabulate at the rate this lab observed.
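Option (a) is a one-line gate in front of generation. A sketch, using the lab's observed floor to pick an illustrative threshold (the exact cutoff is a tuning decision):

```python
NO_CONTEXT_THRESHOLD = 0.5  # Q4 confabulated at topScore 0.482

def answer_or_refuse(question: str, retrieve, generate) -> str:
    chunks, top_score = retrieve(question)
    if top_score < NO_CONTEXT_THRESHOLD:
        # Refuse rather than let a small model guess fluently.
        return "I don't have enough grounded context to answer that."
    return generate(question, chunks)
```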
Long-context: budget exhaustion
Q3 and Q5 came back empty from Gemini 2.5 Pro on the first run. Not refused —
empty. The model burned through its 2048-token output budget on internal
"thinking" before generating a single user-facing word. Raising
max_output_tokens from 2048 to 8192 fixed both questions on the
retry — answers landed at 4262 and 3244 characters respectively, with
finish_reason STOP — but the underlying issue stays. Frontier models with
long thinking budgets can silently fail when the answer requires deep
reasoning over a large context. You will not always notice. The output is
empty, not malformed.
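The defensive move is to check finish_reason instead of trusting a non-empty string. A sketch against the google-genai SDK, with the retry budgets mirroring this lab's before/after values:

```python
from google import genai
from google.genai import types

client = genai.Client()

def ask_with_budget_retry(question: str, cache_name: str) -> str:
    for budget in (2048, 8192):  # the lab's first-run and fixed values
        resp = client.models.generate_content(
            model="gemini-2.5-pro",
            contents=question,
            config=types.GenerateContentConfig(
                cached_content=cache_name, max_output_tokens=budget
            ),
        )
        # MAX_TOKENS with no visible text means "thinking" ate the budget.
        if resp.candidates[0].finish_reason == types.FinishReason.STOP and resp.text:
            return resp.text
    raise RuntimeError("empty answer even at the raised budget")
```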
File-based + Claude Code: agent tooling discipline
Q4 broke the file-based agent in a different way. The agent searched for the
exact phrase "the wrong story" across .astro and .md
files and returned zero matches. The H1 of the EP02 page is "HeyGen Avatar V
Is the Wrong Story" — the words are right there. The agent missed it because
its grep did not use case-insensitive matching. grep -i "the wrong
story" would have hit. The agent did not reach for -i
and did not pivot to a semantic search before declaring "no match."
The good news: the agent flagged the failure honestly with "I cannot
confidently answer this question from the files as phrased" and
grounded: false. RAG on the same question hallucinated. The
file-based agent declined. In a CTO context, "I do not know" is a feature.
The fix is the prompt, not the architecture: tell the agent to grep case-insensitively by default, fall back to semantic matching on no-hits, and surface near-misses rather than declaring a miss. The vector-fallback option works too. Each fix has its own cost.
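The escalation chain is small enough to spell out. A sketch in Python standing in for the agent's tool policy (the grep flags are real; the semantic fallback is left as a stub):

```python
import subprocess

def search_files(phrase: str, root: str) -> list[str]:
    # Escalate: exact match first, then case-insensitive, instead of
    # declaring a miss after one case-sensitive pass.
    for flags in (["-r", "-l"], ["-r", "-l", "-i"]):
        out = subprocess.run(["grep", *flags, phrase, root],
                             capture_output=True, text=True)
        if out.stdout:
            return out.stdout.splitlines()
    # No lexical hit: pivot to semantic / fuzzy matching here (a thin
    # vector layer, or n-gram similarity) and surface near-misses.
    return []

# "the wrong story" misses case-sensitively but hits with -i, because
# the H1 is "HeyGen Avatar V Is the Wrong Story".
print(search_files("the wrong story", "/opt/ctaio.dev/src/pages/en/"))
```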
CTO playbook
Pick the paradigm that matches the question volume and the cost ceiling. The architectures are not interchangeable.
| Use case | Paradigm | Why | Approximate $ / month |
|---|---|---|---|
| Public-facing Q&A on your blog | RAG | Sub-cent cost, sub-10s latency, citation-first surface | $10–$50 (small), $200–$1k (1M+ visitors) |
| Internal research / synthesis on your own corpus | File-based + Claude Code | Highest faithfulness. You already pay for the IDE seat | Bundled in Claude Code subscription |
| One-shot deep query, accuracy matters most | Long-context dump | Most faithful when it answers. Pay $0.50 once per query | $5–$50 / month at low volume |
| Multi-turn assistant with persistent constraints | None of the three (yet) | Working-memory layer is unsolved. Build it or wait | — |
| Team-shared "ask the company knowledge base" | RAG with a hosted UI on top | Volume + ease-of-onboarding. Accept the confabulation risk and add a feedback loop | $50–$500 / month |
What I skipped, and why
Four candidates that did not make the test matrix. Each is here because the skip is itself a finding.
NotebookLM (Google)
Why skipped: Hosted PKM-AI with the audio-overview feature. Skipped because: (a) requires a Google login and a manual upload flow that does not script, (b) at the price point (free for personal, paid via Workspace) it is functionally a fourth instance of the long-context paradigm with a chat skin and a podcast-generator gimmick, (c) the comparison was already 3 paradigms with 3 distinct failure modes — adding NotebookLM would not have added a fourth.
When to use: You want a 10-minute podcast summary of 30 PDFs with zero engineering. The audio-overview is genuinely good. Not a serious second-brain by itself.
Claude Projects
Why skipped: Anthropic's native "upload some files, chat over them" surface. Skipped for the same reason as NotebookLM — a fourth long-context-with-skin instance — and because the headline claim of Projects (Claude knows your context) is the file-based + Claude Code paradigm with a different UI. We tested the architecture; the wrapper is downstream.
When to use: You want a hosted version of file-based + Claude Code that your team can share without managing a repo. Pay for the seat, drop files in, get the same results minus the Read/Grep tools you do not actually need.
Letta / Mem0 / agentic memory products
Why skipped: Letta Code launched April 2026. Mem0 is still niche. The self-hosted infrastructure burden plus integration cost puts both below the bar for a CTO making a 2026 decision. Worth a return visit in 6 months.
When to use: You are building a stateful agent that needs persistent memory across sessions. Different problem from "answer questions over a personal knowledge base."
Fine-tuning a small open model on personal corpus
Why skipped: Karpathy himself called this the wrong default in 2026. Opus 4.7 with 1M context obsoletes most personal-scale fine-tuning. The training cost no longer justifies the marginal quality lift over an in-context dump.
When to use: You have data you cannot legally send to a frontier API and the privacy bar dominates the quality bar. Otherwise: do not.
Competitive landscape
The personal-AI-second-brain category has more product names than architectures. Most of what ships in 2026 — mem.ai, Reflect, Notion AI, Heyday — is a wrapper on one of the three paradigms above. The wrapper changes the UX. It does not change the trade-offs. Pick the wrapper your team will actually use; the underlying failure modes still apply.
The architectures themselves moved in 2026. Anthropic's Skills format is
the file-based paradigm with a packaging contract. Gemini 2.5 Pro brought
1M context out of beta with caching that makes long-context dumping
financially viable. Letta shipped a stateful memory product targeting the
working-memory gap directly. The hosted side is catching up to what
practitioners were already doing in /opt.
What still has not shipped, as of May 2026, is a clean implementation of Sfeir's consolidation step. Mem0 has the ADD/UPDATE/DELETE/NOOP framing. Letta has archival memory. Both are partial. The first vendor that ships a real consolidation layer with idle-triggered promotion and explicit mutation rules will leapfrog this whole comparison.
FAQ
What is a "second brain" in the AI context?
The phrase originated with Tiago Forte in PKM circles — a structured place outside your head to capture, connect, and retrieve everything you have learned. The 2026 AI version replaces "structured place" with "system that answers questions about your knowledge in your voice." That can be a chatbot grounded in your blog (RAG), a frontier model with all your notes pasted in (long-context), or an agent reading your files on demand (file-based + Claude Code). The label is the same. The architectures are not.
Why did production RAG (Ask CTAIO) lose 5 of 7 questions to a file-based agent?
Two reasons. First: chunking is lossy. Top-k=6 with 700-word chunks gives the LLM less to work with than reading the whole article would. Second: the LLM is small. gpt-4.1-mini is fast and cheap and confabulates on weak retrieval — that is what produced the "ElevenLabs shutdown rumour" hallucination on Q4. RAG trades faithfulness for ergonomics; the file-based agent makes the opposite trade.
Did long-context dumping replace RAG in 2026?
Not for production. At $0.45 per query with 10–26s latency and a non-trivial empty-answer rate from "thinking burn," long-context is a different price/latency tier from RAG. RAG is right for high-volume Q&A surfaces. Long-context is right when you need the most faithful answer once and you have $0.50 to spend on it.
What is "Sfeir's working-memory gap"?
Robert Sfeir's essay reframed the second-brain problem: every system has long-term memory (RAG, vector DBs) and a context window (the model's cache), but no consolidation step in between. So constraints set early in a session evaporate. The Q7 probe in this lab demonstrated it reproducibly on production RAG: turn 1 set a rule, turns 2–4 did filler topics, turn 5 broke the rule. The 6-message rolling history window is the architectural cause.
How do I run my own version of Ask CTAIO?
The backend code is the multi-persona ask-tom service at ask.tfw.bz. Pagefind indexes your site, a Python ingest script chunks and embeds with text-embedding-3-small (1536d) into sqlite-vec, a Fastify server retrieves top-k=6 with MMR diversity, gpt-4.1-mini generates with strict citation rules. Cost: under $0.01 per query at the volumes I run. Same pattern as prommer.net/en/ask-tom/.
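A sketch of the ingest half under the same assumptions (the chunker and incremental diffing are elided; file and table names are illustrative):

```python
# Sketch: batch-embed ~700-word chunks and store vector + text side by
# side in sqlite-vec, keyed by the same rowid.
import sqlite3
import sqlite_vec
from openai import OpenAI

client = OpenAI()
db = sqlite3.connect("corpus.db")
db.enable_load_extension(True)
sqlite_vec.load(db)
db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS chunks "
           "USING vec0(embedding float[1536])")
db.execute("CREATE TABLE IF NOT EXISTS chunk_text "
           "(rowid INTEGER PRIMARY KEY, text TEXT)")

def ingest(texts: list[str]) -> None:
    resp = client.embeddings.create(model="text-embedding-3-small",
                                    input=texts)
    start = db.execute("SELECT COALESCE(MAX(rowid), 0) + 1 "
                       "FROM chunk_text").fetchone()[0]
    for i, (chunk, item) in enumerate(zip(texts, resp.data), start=start):
        db.execute("INSERT INTO chunks(rowid, embedding) VALUES (?, ?)",
                   (i, sqlite_vec.serialize_float32(item.embedding)))
        db.execute("INSERT INTO chunk_text(rowid, text) VALUES (?, ?)",
                   (i, chunk))
    db.commit()
```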
How big can the long-context paradigm get before it breaks?
Gemini 2.5 Pro is rated for 1M tokens of input. The corpus in this lab was 705k tokens — about 70% of the window. At full window the cache write cost roughly doubles. The breakage I hit was not the window; it was the output-token budget being eaten by the model's internal "thinking" on harder questions. Raising max_output_tokens from 2048 to 8192 fixed Q3 and Q5.
Is the file-based + Claude Code paradigm just "search your filesystem"?
In mechanism, yes. In effect, no. The agent is the difference. A grep over /opt/agentic-coding/rules/ returns matches. A Claude Code agent reads three files, synthesizes a framework that no single file states, and flags when it had to infer. That is what made Q2 — "compare CAIO comp to Fractional CTO" — a clean answer even though no single page made the comparison directly. The agent did the synthesis a vector DB cannot do.
Where does this paradigm break?
Agent tooling discipline. Q4 asked which platform I most recently called "the wrong story" — and the H1 of the EP02 page is literally "HeyGen Avatar V Is the Wrong Story." The agent's grep was case-sensitive and missed the capitalised match; grep -i would have hit. In this run that was 1 of 7 questions (Q4). When the user phrases the question outside the source's exact wording, the fix is either a smarter agent prompt (case-insensitive by default, semantic fallback on no-hit) or a thin vector layer for re-ranking. Honest about the failure beats RAG's confabulation either way.
Should I use OpenAI Embeddings, Voyage, or Cohere for the RAG version?
I use OpenAI text-embedding-3-small (1536d) for cost reasons — about $0.02 per million tokens at ingest, and the quality is fine for English. Voyage and Cohere are competitive at the high end. The bottleneck is not the embedding model in 2026; it is chunk strategy, retrieval ranker, and generation prompt. Optimize those before swapping the embedder.
How do I fix the working-memory gap in my own RAG?
Sfeir's policy proposal is the one I am implementing for the next ask-tom revision: a session-scoped working buffer (Postgres-only, no Redis) with idle-triggered consolidation, a 15-minute scheduled fallback, and a hard 48-hour cap. Mem0's ADD/UPDATE/DELETE/NOOP framing handles the consolidation step. None of this is shipped on Ask CTAIO yet. Watch for an EP04 follow-up.
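The storage side of that plan fits in one table. A sketch of the schema and the two scheduled queries (psycopg; column names are mine, not a shipped schema):

```python
# Sketch: Postgres-only working buffer. The consolidation writer is the
# ADD/UPDATE/DELETE/NOOP step; these are the schema and the sweeps.
import psycopg

DDL = """
CREATE TABLE IF NOT EXISTS working_buffer (
    session_id TEXT NOT NULL,
    key        TEXT NOT NULL,
    value      TEXT NOT NULL,
    updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    PRIMARY KEY (session_id, key)
);
"""

# 15-minute scheduled fallback: sessions idle past the threshold get
# consolidated; the hard 48-hour cap drops stale buffers outright.
IDLE_SESSIONS = """
SELECT DISTINCT session_id FROM working_buffer
WHERE updated_at < now() - INTERVAL '15 minutes';
"""
HARD_CAP_SWEEP = """
DELETE FROM working_buffer
WHERE updated_at < now() - INTERVAL '48 hours';
"""

with psycopg.connect("dbname=asktom") as conn:  # hypothetical DSN
    conn.execute(DDL)
    conn.execute(HARD_CAP_SWEEP)
```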
What about Anthropic Skills?
Skills are a packaging format for the file-based pattern. A SKILL.md plus its bundled resources. Same paradigm — agent reads files on demand — with a standardized contract. If your knowledge fits the skill format, ship it as a skill; the engine is identical to what this lab tested.
Why did you not test mem.ai or Reflect or Notion AI?
They are products built on top of the three paradigms tested. mem.ai is a hosted RAG with a slick UI. Notion AI is RAG-on-your-Notion-pages. Reflect is hosted long-context. Testing them would have given me three more data points along the same three architectural axes. The labs page is about architectures, not products. Pick the wrapper that fits your team; the underlying trade-offs are the same.