How the test works
The voice (EP01) replicates how I sound. The face (EP02) replicates how I look. The brain — knowledge, opinions, judgement — is the part that actually does the work, and it is the part I have been ignoring while the AI twin season ran. EP03 is the cleanup.
The setup: one shared corpus, three architectures that claim to answer
questions over it, the same seven questions plus one multi-turn
working-memory probe sent to all three. The corpus is the published English
content on ctaio.dev: 961 chunks across 223 pages, ingested by
the Pagefind index and re-embedded for the RAG and long-context tests.
Total experiment spend: $4.30 across all three systems for
the full battery.
The questions span seven failure modes I wanted to expose: specific recall, cross-document synthesis, faithfulness to my actual opinion, recency awareness, cross-cluster reasoning, out-of-scope handling, and hallucination resistance. The eighth prompt — Q7 — is the one that matters: a five-turn conversation with a constraint set in turn one, designed to test whether the system still honours that constraint by turn five.
The framing comes from Robert Sfeir's essay on AI working memory. Long-term memory in AI systems is solved — that is what RAG and vector databases do. Short-term context is solved — that is what the model's window does. The unsolved layer is the working memory in between: the active layer that holds a constraint across a long pipeline. Every second-brain product in 2026 has a different answer to where that layer lives. Most of them are wrong.
Three paradigms, tested head-to-head
The space of "answer questions over my knowledge" splits into three architectures. I tested one production-quality instance of each and ran the same battery against all three.
Ask CTAIO (production RAG)
Embeddings + vector search + small LLM generation
- Stack
- OpenAI text-embedding-3-small (1536d) → sqlite-vec → top-k=6 with MMR → gpt-4.1-mini
- Cost / query
- $0.0009–$0.0032 (median ~$0.0014)
- Latency
- 5–9s per query, 9s total when 7 questions parallelize
- Update friction
- Re-build Pagefind index, run incremental ingest (~$0.008 for 423 new chunks), rsync DB to sin, restart service
- Privacy
- Self-hosted on sin. Corpus never leaves the box. OpenAI sees only question text and returned chunk content
- Working memory
- 3-turn rolling window (server-enforced cap). 4+ turn history returns HTTP 400. Constraints set early in a session evaporate by turn 4
- Integration
- Live at https://ctaio.dev/en/ask-ctaio/. The same backend serves prommer.net via persona config
- Score
- 0 of 7 wins outright
Verdict: Cheapest and fastest. Hallucinates when retrieval is weak. Working memory is structurally absent.
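For readers who want the shape of that stack in code, here is a minimal sketch of the query path, assuming the components listed above (OpenAI embeddings, sqlite-vec, gpt-4.1-mini). The file name, table schema, and prompt are illustrative — this is not the production Fastify service — and the MMR re-ranking step is elided:

```python
# Minimal sketch of the RAG query path: embed the question, KNN-search
# sqlite-vec, generate a grounded answer. MMR diversity is elided.
import sqlite3
import sqlite_vec
from openai import OpenAI

client = OpenAI()
db = sqlite3.connect("corpus.db")  # hypothetical path
db.enable_load_extension(True)
sqlite_vec.load(db)

def ask(question: str, k: int = 6) -> str:
    emb = client.embeddings.create(
        model="text-embedding-3-small", input=question
    ).data[0].embedding
    # Assumes: CREATE VIRTUAL TABLE chunks USING vec0(embedding float[1536])
    # plus a chunk_text table keyed by the same rowid.
    rows = db.execute(
        "SELECT rowid FROM chunks WHERE embedding MATCH ? AND k = ? "
        "ORDER BY distance",
        (sqlite_vec.serialize_float32(emb), k),
    ).fetchall()
    context = "\n---\n".join(
        db.execute("SELECT text FROM chunk_text WHERE rowid = ?", r).fetchone()[0]
        for r in rows
    )
    resp = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[
            {"role": "system", "content": "Answer only from the context. "
             "Cite sources. Refuse if the context is insufficient."},
            {"role": "user", "content": f"Context:\n{context}\n\nQ: {question}"},
        ],
    )
    return resp.choices[0].message.content
```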
Gemini 2.5 Pro long-context dump
Paste 705k tokens of corpus into a 1M-context model. No retrieval, no chunking
- Stack
- gemini-2.5-pro · explicit cached_content (1hr TTL) · max_output_tokens raised from 2048 to 8192 after first run
- Cost / query
- $0.45–$0.50 (cache reads dominate)
- Latency
- 10–26s per query, sequential because cache reads do not parallelize cleanly
- Update friction
- Re-export corpus, recreate cache. No ingest pipeline. The "fix" for stale knowledge is rerunning the script
- Privacy
- Corpus uploaded to Google. Cache lives on their infrastructure for the TTL window
- Working memory
- Per-session full window. No rolling cap. Only constraint is the 1M token ceiling
- Integration
- Script-only. No UI. Could be wrapped as a service but at $0.45/query that is a different product
- Score
- 1 of 7 wins outright (Q4) — the question every other paradigm got wrong
Verdict: Most faithful when it answers. Burns its output token budget on internal "thinking" on hard questions and produces empty responses.
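The long-context setup is a short script. A minimal sketch, assuming the google-genai Python SDK and the settings above (explicit cache, 1-hour TTL, raised output budget); the export file name is a placeholder:

```python
# Sketch: write the corpus into an explicit cache once, then run each
# question against it. Cache reads dominate the per-query cost.
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

corpus = open("corpus_export.txt").read()  # ~705k tokens in this lab

cache = client.caches.create(
    model="gemini-2.5-pro",
    config=types.CreateCachedContentConfig(contents=[corpus], ttl="3600s"),
)

def ask(question: str) -> str:
    resp = client.models.generate_content(
        model="gemini-2.5-pro",
        contents=question,
        config=types.GenerateContentConfig(
            cached_content=cache.name,
            max_output_tokens=8192,  # 2048 starved Q3/Q5 on the first run
        ),
    )
    return resp.text
```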
File-based + Claude Code (Karpathy LLM Wiki)
Many markdown files in a few dozen repos. Agent with Read/Grep/Glob navigates on demand
- Stack
- Claude Sonnet 4.6 sub-agent · /opt/ctaio.dev/src/pages/en/ + /opt/agentic-coding/rules/ + EP02 workfiles
- Cost / query
- ~$0.02–$0.05 (Sonnet sub-agent, 53 tool calls across 7 questions)
- Latency
- ~30s per question on average. 53 tool uses across 7 questions in 3.4 minutes
- Update friction
- Edit a file. That is the entire pipeline
- Privacy
- Files never leave the disk unless the agent explicitly reads them. Corpus is local. Anthropic only sees question + the chunks the agent chose to surface
- Working memory
- Per-session full context window. The agent itself can keep constraints across turns within its window
- Integration
- It is the IDE. The "second brain" is whatever you have in /opt at the moment
- Score
- 5 of 7 wins
Verdict: Most faithful, most thorough, most honest about failure. Loses on lexical mismatch — does not know that "the wrong story" matches the H1 "Is the Wrong Story" if grep is case-sensitive.
Worth flagging: I substituted Gemini 2.5 Pro for Claude Opus 4.7 in the long-context slot because the Anthropic API account I run had insufficient credit on the day I ran the battery. Same architecture, different model. The architectural claim — that you can skip retrieval entirely and dump the corpus into a 1M-context window — holds for either model. The specific numbers in this lab are Gemini's.
Try it: Ask CTAIO live
The production RAG instance from this lab is live at ctaio.dev/en/ask-ctaio/. Same corpus tested here. Open it, ask any of the seven questions from this article, and watch the failure modes show up in real time. Q4 — "which platform did I most recently call the wrong story?" — is the one to watch. The system will identify HeyGen Avatar V correctly and then invent a reason.
The full-page embed has the same chat surface used here. Citation chips link back to the source pages. The 3-turn history cap is the one I describe in the working-memory section below.
Scoreboard
One row per question. Three columns for the three systems. Strongest answer wins per row.
| Question | Tests | Ask CTAIO (RAG) | Long-context (Gemini) | File-based + Claude Code | Winner |
|---|---|---|---|---|---|
| Q1 — What lip-sync score did I give HeyGen Avatar V in EP02, and what scoring dimensions did I use? | Long-term recall · specific factual lookup | 7/10 + four dimensions (basic) | Adds the actual rationale ("mouth opens too wide on plosives") and full dimension definitions | Adds the 1–10 scale anchors (10/7/5/3/1) from the methodology section | File-based |
| Q2 — How does CAIO total compensation compare to a Fractional CTO engagement at a Series-B startup? | Cross-document synthesis · two pages must be assembled | $450–$650K CAIO vs $180–$480K Fractional CTO with summary table | Truncated mid-sentence at MAX_TOKENS — 1731 thinking tokens consumed budget | Series-B-specific $320–$450K CAIO + $180–$360K Fractional CTO + $216–$420K Fractional CAIO. Flagged the synthesis as inferred | File-based |
| Q3 — What is my position on enterprise vendors like Synthesia given the practitioner gap? | Faithfulness to voice · opinion question | Hedgy. Flattens the user's actual take into "nuanced and generally cautious" | After max_tokens fix: comprehensive, faithful framing, names the L&D vs editorial split | Use-case-conditional verdict + the central CTO line: "the gap is your problem to understand, not theirs" | File-based |
| Q4 — Which platform did I most recently call "the wrong story" and why? | Recency awareness · semantic vs lexical retrieval | Identified HeyGen Avatar V correctly but FABRICATED an "ElevenLabs shutdown rumour" that does not exist anywhere | Correctly identified HeyGen Avatar V AND the real reasoning: practitioner gap, the "real CTO story" framing, zero Synthesia mentions in Advise Slack | Honest failure: "the exact phrase 'the wrong story' does not appear anywhere in the files searched". Vocabulary-locked grep missed the H1 match | Long-ctx |
| Q5 — For a Series-B CTO allocating budget between hiring a CAIO and AI tooling, what framework should they use? | Cross-cluster synthesis · hardest of the seven | Strong: 8–15% governance benchmark + fractional CAIO pricing + CTO/CAIO budget split | Names the explicit "buy order" framework from the AI Governance Tools guide (inventory → bias testing → model cards → platform at 20+ models) | Six CAIO hiring signals with thresholds + the fractional cost-threshold heuristic ($10M cutoff) | Tie |
| Q6 — What is the best CRM for a 50-person AI startup? | Out-of-scope handling · the corpus has no CRM coverage | Polite decline + 3 related-resource suggestions (AI ROI, Governance Framework) | Concise decline. No alternative suggestions | Searched broadly, flagged the only incidental CRM mention (Chief Transformation Officer comparison), confirmed clean out-of-scope | File-based |
| Q8 — True or false: I concluded that Synthesia produced the best voice clone in EP01. | Hallucination resistance · planted false premise | Correctly false. Names Cartesia as winner (4/5 blind tests) | Correctly false + caught the deeper inference: Synthesia was not in EP01 at all (it was tested in EP02 with ≤2/10) | Correctly false + Synthesia not in EP01 + cites two source files explicitly | File-based |
Final tally: file-based + Claude Code wins 5 of 7 outright. Long-context (Gemini) wins 1. Q5 is a tie between long-context and file-based. Ask CTAIO (production RAG) does not win any question cleanly.
The result surprised me. I started this lab expecting the production RAG to be the baseline that the other two tried to beat on accuracy at higher cost. The opposite happened. The /opt + Claude Code setup I had been running for months as a coding assistant — and which I never described as a "second brain" — was the most faithful answerer once I pointed it at the same corpus.
The working-memory probe
Q7 was different from the seven above. It was a five-turn conversation designed to test whether a constraint set in turn 1 survives to turn 5. Against Ask CTAIO, turn 1 set a rule, turns 2–4 covered filler topics, and by turn 5 the system broke the rule.
Architectural cause
The /chat endpoint enforces a 3-turn (6-message) history cap server-side. Sending more returns HTTP 400. By turn 5, the constraint set in turn 1 is literally not visible to the LLM because it has fallen out of the rolling window. The model is not "forgetting" — the constraint is no longer in the prompt.
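The mechanism is easy to reproduce in miniature. A sketch of the rolling-window truncation (Python standing in for the production Fastify logic):

```python
# 3-turn cap = 6 messages. The model only ever sees the tail of the
# history, so a rule set in turn 1 is absent from the prompt by turn 4+.
MAX_MESSAGES = 6  # the server rejects longer histories with HTTP 400

def visible_history(history: list[dict]) -> list[dict]:
    return history[-MAX_MESSAGES:]

history = [
    {"role": "user", "content": "Rule: answer in one sentence."},  # turn 1
    {"role": "assistant", "content": "Understood."},
]
for turn in (2, 3, 4):  # filler topics
    history += [{"role": "user", "content": f"filler topic {turn}"},
                {"role": "assistant", "content": "..."}]

window = visible_history(history)
assert not any("Rule:" in m["content"] for m in window)
# Turn 5 is generated from `window`: the constraint is simply not there.
```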
Verdict: FAIL. Working memory in this paradigm is architecturally absent, not tunable. The fix is not better prompting. The fix is the consolidation step Sfeir argues for: a session-scoped working buffer that holds active constraints across the whole conversation, with explicit ADD / UPDATE / DELETE / NOOP operations on each new turn (the framing Mem0 introduced). Postgres-only. Idle-triggered. Hard 48-hour cap. Not shipped on Ask CTAIO yet. EP04 candidate.
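To make that fix concrete, here is a sketch of the consolidation step under Sfeir's framing and Mem0's operation set. Nothing here is shipped code; classify_turn is a stand-in for an LLM call, and the buffer would live in Postgres rather than memory:

```python
# Session-scoped working buffer with explicit ADD/UPDATE/DELETE/NOOP
# consolidation. The buffer, not the raw transcript, is re-injected
# into the prompt on every turn, so constraints survive the window.
from dataclasses import dataclass, field

def classify_turn(turn_text: str, constraints: dict) -> tuple[str, str, str]:
    # Stand-in for the Mem0-style LLM call that compares the new turn
    # against existing entries. Trivial rule here so the sketch runs.
    if turn_text.lower().startswith("rule:"):
        return "ADD", "format", turn_text.split(":", 1)[1].strip()
    return "NOOP", "", ""

@dataclass
class WorkingBuffer:
    constraints: dict[str, str] = field(default_factory=dict)

    def consolidate(self, turn_text: str) -> None:
        op, key, value = classify_turn(turn_text, self.constraints)
        if op in ("ADD", "UPDATE"):
            self.constraints[key] = value
        elif op == "DELETE":
            self.constraints.pop(key, None)
        # NOOP: turn carries no constraint-relevant information

    def as_system_prompt(self) -> str:
        # Prepended on every turn: turn-1 rules are still present at turn 5.
        return "Active session constraints:\n" + "\n".join(
            f"- {k}: {v}" for k, v in self.constraints.items()
        )
```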
Three failure modes
Knowing how each paradigm answers a question is half the picture. Knowing how each one fails is the other half. Each system has a signature failure mode you should be able to spot in the wild.
RAG: confabulation
Q4 asked which platform I most recently called "the wrong story." Ask CTAIO identified HeyGen Avatar V correctly — the page exists, the H1 is literally "HeyGen Avatar V Is the Wrong Story" — and then fabricated the reason. The system invented an "ElevenLabs shutdown rumour" that does not exist anywhere in the corpus, in the news cycle, or in reality. The retrieval score (topScore 0.482) was the lowest of any question that returned context. Below that floor, the LLM stops grounding and starts guessing. Fluent, plausible, wrong.
This is the standard RAG failure mode in 2026. The fix is either (a) raise the no-context threshold and refuse weak-signal questions, accepting more "I don't know" responses, or (b) put a stronger model on the generation side and pay the cost. Most production RAG tools default to gpt-4.1-mini or equivalent. They confabulate at the rate this lab observed.
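Option (a) is a one-line gate in front of generation. A sketch, using the lab's observed floor to pick an illustrative threshold (the exact cutoff is a tuning decision):

```python
NO_CONTEXT_THRESHOLD = 0.5  # Q4 confabulated at topScore 0.482

def answer_or_refuse(question: str, retrieve, generate) -> str:
    chunks, top_score = retrieve(question)
    if top_score < NO_CONTEXT_THRESHOLD:
        # Refuse rather than let a small model guess fluently.
        return "I don't have enough grounded context to answer that."
    return generate(question, chunks)
```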
Long-context: budget exhaustion
Q3 and Q5 came back empty from Gemini 2.5 Pro on the first run. Not refused —
empty. The model burned through its 2048-token output budget on internal
"thinking" before generating a single user-facing word. Raising
max_output_tokens from 2048 to 8192 fixed both questions on the
retry — answers landed at 4262 and 3244 characters respectively, with
finish_reason STOP — but the underlying issue stays. Frontier models with
long thinking budgets can silently fail when the answer requires deep
reasoning over a large context. You will not always notice. The output is
empty, not malformed.
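The defensive move is to check finish_reason instead of trusting a non-empty string. A sketch against the google-genai SDK, with the retry budgets mirroring this lab's before/after values:

```python
from google import genai
from google.genai import types

client = genai.Client()

def ask_with_budget_retry(question: str, cache_name: str) -> str:
    for budget in (2048, 8192):  # the lab's first-run and fixed values
        resp = client.models.generate_content(
            model="gemini-2.5-pro",
            contents=question,
            config=types.GenerateContentConfig(
                cached_content=cache_name, max_output_tokens=budget
            ),
        )
        # MAX_TOKENS with no visible text means "thinking" ate the budget.
        if resp.candidates[0].finish_reason == types.FinishReason.STOP and resp.text:
            return resp.text
    raise RuntimeError("empty answer even at the raised budget")
```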
File-based + Claude Code: agent tooling discipline
Q4 broke the file-based agent in a different way. The agent searched for the
exact phrase "the wrong story" across .astro and .md
files and returned zero matches. The H1 of the EP02 page is "HeyGen Avatar V
Is the Wrong Story" — the words are right there. The agent missed it because
its grep did not use case-insensitive matching. grep -i "the wrong
story" would have hit. The agent did not reach for -i
and did not pivot to a semantic search before declaring "no match."
The good news: the agent flagged the failure honestly with "I cannot
confidently answer this question from the files as phrased" and
grounded: false. RAG on the same question hallucinated. The
file-based agent declined. In a CTO context, "I do not know" is a feature.
The fix is the prompt, not the architecture: tell the agent to grep case-insensitively by default, fall back to semantic matching on no-hits, and surface near-misses rather than declaring a miss. The vector-fallback option works too. Each fix has its own cost.
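The escalation chain is small enough to spell out. A sketch in Python standing in for the agent's tool policy (the grep flags are real; the semantic fallback is left as a stub):

```python
import subprocess

def search_files(phrase: str, root: str) -> list[str]:
    # Escalate: exact match first, then case-insensitive, instead of
    # declaring a miss after one case-sensitive pass.
    for flags in (["-r", "-l"], ["-r", "-l", "-i"]):
        out = subprocess.run(["grep", *flags, phrase, root],
                             capture_output=True, text=True)
        if out.stdout:
            return out.stdout.splitlines()
    # No lexical hit: pivot to semantic / fuzzy matching here (a thin
    # vector layer, or n-gram similarity) and surface near-misses.
    return []

# "the wrong story" misses case-sensitively but hits with -i, because
# the H1 is "HeyGen Avatar V Is the Wrong Story".
print(search_files("the wrong story", "/opt/ctaio.dev/src/pages/en/"))
```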
CTO playbook
Pick the paradigm that matches the question volume and the cost ceiling. The architectures are not interchangeable.
| Use case | Paradigm | Why | Approximate $ / month |
|---|---|---|---|
| Public-facing Q&A on your blog | RAG | Sub-cent cost, sub-10s latency, citation-first surface | $10–$50 (small), $200–$1k (1M+ visitors) |
| Internal research / synthesis on your own corpus | File-based + Claude Code | Highest faithfulness. You already pay for the IDE seat | Bundled in Claude Code subscription |
| One-shot deep query, accuracy matters most | Long-context dump | Most faithful when it answers. Pay $0.50 once per query | $5–$50 / month at low volume |
| Multi-turn assistant with persistent constraints | None of the three (yet) | Working-memory layer is unsolved. Build it or wait | — |
| Team-shared "ask the company knowledge base" | RAG with a hosted UI on top | Volume + ease-of-onboarding. Accept the confabulation risk and add a feedback loop | $50–$500 / month |
What I skipped, and why
Four candidates that did not make the test matrix. Each is here because the skip is itself a finding.
NotebookLM (Google)
Why skipped: Hosted PKM-AI with the audio-overview feature. Skipped because: (a) requires a Google login and a manual upload flow that does not script, (b) at the price point (free for personal, paid via Workspace) it is functionally a fourth instance of the long-context paradigm with a chat skin and a podcast-generator gimmick, (c) the comparison was already 3 paradigms with 3 distinct failure modes — adding NotebookLM would not have added a fourth.
When to use: You want a 10-minute podcast summary of 30 PDFs with zero engineering. The audio-overview is genuinely good. Not a serious second-brain by itself.
Claude Projects
Why skipped: Anthropic's native "upload some files, chat over them" surface. Skipped for the same reason as NotebookLM — a fourth long-context-with-skin instance — and because the headline claim of Projects (Claude knows your context) is the file-based + Claude Code paradigm with a different UI. We tested the architecture; the wrapper is downstream.
When to use: You want a hosted version of file-based + Claude Code that your team can share without managing a repo. Pay for the seat, drop files in, get the same results minus the Read/Grep tools you do not actually need.
Letta / Mem0 / agentic memory products
Why skipped: Letta Code launched April 2026. Mem0 is still niche. The self-hosted infrastructure burden plus integration cost puts both below the bar for a CTO making a 2026 decision. Worth a return visit in 6 months.
When to use: You are building a stateful agent that needs persistent memory across sessions. Different problem from "answer questions over a personal knowledge base."
Fine-tuning a small open model on personal corpus
Why skipped: Karpathy himself called this the wrong default in 2026. Opus 4.7 with 1M context obsoletes most personal-scale fine-tuning. The training cost no longer justifies the marginal quality lift over an in-context dump.
When to use: You have data you cannot legally send to a frontier API and the privacy bar dominates the quality bar. Otherwise: do not.
Competitive landscape
The personal-AI-second-brain category has more product names than architectures. Most of what ships in 2026 — mem.ai, Reflect, Notion AI, Heyday — is a wrapper on one of the three paradigms above. The wrapper changes the UX. It does not change the trade-offs. Pick the wrapper your team will actually use; the underlying failure modes still apply.
The architectures themselves moved in 2026. Anthropic's Skills format is
the file-based paradigm with a packaging contract. Gemini 2.5 Pro brought
1M context out of beta with caching that makes long-context dumping
financially viable. Letta shipped a stateful memory product targeting the
working-memory gap directly. The hosted side is catching up to what
practitioners were already doing in /opt.
What still has not shipped, as of May 2026, is a clean implementation of Sfeir's consolidation step. Mem0 has the ADD/UPDATE/DELETE/NOOP framing. Letta has archival memory. Both are partial. The first vendor that ships a real consolidation layer with idle-triggered promotion and explicit mutation rules will leapfrog this whole comparison.
FAQ
What is a "second brain" in the AI context?
The phrase originated with Tiago Forte in PKM circles — a structured place outside your head to capture, connect, and retrieve everything you have learned. The 2026 AI version replaces "structured place" with "system that answers questions about your knowledge in your voice." That can be a chatbot grounded in your blog (RAG), a frontier model with all your notes pasted in (long-context), or an agent reading your files on demand (file-based + Claude Code). The label is the same. The architectures are not.
Why did production RAG (Ask CTAIO) lose 5 of 7 questions to a file-based agent?
Two reasons. First: chunking is lossy. Top-k=6 with 700-word chunks gives the LLM less to work with than reading the whole article would. Second: the LLM is small. gpt-4.1-mini is fast and cheap and confabulates on weak retrieval — that is what produced the "ElevenLabs shutdown rumour" hallucination on Q4. RAG trades faithfulness for ergonomics; the file-based agent makes the opposite trade.
Did long-context dumping replace RAG in 2026?
Not for production. At $0.45 per query with 10–26s latency and a non-trivial empty-answer rate from "thinking burn," long-context is a different price/latency tier from RAG. RAG is right for high-volume Q&A surfaces. Long-context is right when you need the most faithful answer once and you have $0.50 to spend on it.
What is "Sfeir's working-memory gap"?
Robert Sfeir's essay reframed the second-brain problem: every system has long-term memory (RAG, vector DBs) and a context window (the model's cache), but no consolidation step in between. So constraints set early in a session evaporate. The Q7 probe in this lab demonstrated it reproducibly on production RAG: turn 1 set a rule, turns 2–4 did filler topics, turn 5 broke the rule. The 6-message rolling history window is the architectural cause.
How do I run my own version of Ask CTAIO?
The backend code is the multi-persona ask-tom service at ask.tfw.bz. Pagefind indexes your site, a Python ingest script chunks and embeds with text-embedding-3-small (1536d) into sqlite-vec, a Fastify server retrieves top-k=6 with MMR diversity, gpt-4.1-mini generates with strict citation rules. Cost: under $0.01 per query at the volumes I run. Same pattern as prommer.net/en/ask-tom/.
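A sketch of the ingest half under the same assumptions (the chunker and incremental diffing are elided; file and table names are illustrative):

```python
# Sketch: batch-embed ~700-word chunks and store vector + text side by
# side in sqlite-vec, keyed by the same rowid.
import sqlite3
import sqlite_vec
from openai import OpenAI

client = OpenAI()
db = sqlite3.connect("corpus.db")
db.enable_load_extension(True)
sqlite_vec.load(db)
db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS chunks "
           "USING vec0(embedding float[1536])")
db.execute("CREATE TABLE IF NOT EXISTS chunk_text "
           "(rowid INTEGER PRIMARY KEY, text TEXT)")

def ingest(texts: list[str]) -> None:
    resp = client.embeddings.create(model="text-embedding-3-small",
                                    input=texts)
    start = db.execute("SELECT COALESCE(MAX(rowid), 0) + 1 "
                       "FROM chunk_text").fetchone()[0]
    for i, (chunk, item) in enumerate(zip(texts, resp.data), start=start):
        db.execute("INSERT INTO chunks(rowid, embedding) VALUES (?, ?)",
                   (i, sqlite_vec.serialize_float32(item.embedding)))
        db.execute("INSERT INTO chunk_text(rowid, text) VALUES (?, ?)",
                   (i, chunk))
    db.commit()
```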
How big can the long-context paradigm get before it breaks?
Gemini 2.5 Pro is rated for 1M tokens of input. The corpus in this lab was 705k tokens — about 70% of the window. At full window the cache write cost roughly doubles. The breakage I hit was not the window; it was the output-token budget being eaten by the model's internal "thinking" on harder questions. Raising max_output_tokens from 2048 to 8192 fixed Q3 and Q5.
Is the file-based + Claude Code paradigm just "search your filesystem"?
In mechanism, yes. In effect, no. The agent is the difference. A grep over /opt/agentic-coding/rules/ returns matches. A Claude Code agent reads three files, synthesizes a framework that no single file states, and flags when it had to infer. That is what made Q2 — "compare CAIO comp to Fractional CTO" — a clean answer even though no single page made the comparison directly. The agent did the synthesis a vector DB cannot do.
Where does this paradigm break?
Agent tooling discipline. Q4 asked which platform I most recently called "the wrong story" — and the H1 of the EP02 page is literally "HeyGen Avatar V Is the Wrong Story." The agent's grep was case-sensitive and missed the capitalised match; grep -i would have hit. In this run that was 1 of 7 questions (Q4). When the user phrases the question outside the source's exact wording, the fix is either a smarter agent prompt (case-insensitive by default, semantic fallback on no-hit) or a thin vector layer for re-ranking. Honest about the failure beats RAG's confabulation either way.
Should I use OpenAI Embeddings, Voyage, or Cohere for the RAG version?
I use OpenAI text-embedding-3-small (1536d) for cost reasons — about $0.02 per million tokens at ingest, and the quality is fine for English. Voyage and Cohere are competitive at the high end. The bottleneck is not the embedding model in 2026; it is chunk strategy, retrieval ranker, and generation prompt. Optimize those before swapping the embedder.
How do I fix the working-memory gap in my own RAG?
Sfeir's policy proposal is the one I am implementing for the next ask-tom revision: a session-scoped working buffer (Postgres-only, no Redis) with idle-triggered consolidation, a 15-minute scheduled fallback, and a hard 48-hour cap. Mem0's ADD/UPDATE/DELETE/NOOP framing handles the consolidation step. None of this is shipped on Ask CTAIO yet. Watch for an EP04 follow-up.
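The storage side of that plan fits in one table. A sketch of the schema and the two scheduled queries (psycopg; column names are mine, not a shipped schema):

```python
# Sketch: Postgres-only working buffer. The consolidation writer is the
# ADD/UPDATE/DELETE/NOOP step; these are the schema and the sweeps.
import psycopg

DDL = """
CREATE TABLE IF NOT EXISTS working_buffer (
    session_id TEXT NOT NULL,
    key        TEXT NOT NULL,
    value      TEXT NOT NULL,
    updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    PRIMARY KEY (session_id, key)
);
"""

# 15-minute scheduled fallback: sessions idle past the threshold get
# consolidated; the hard 48-hour cap drops stale buffers outright.
IDLE_SESSIONS = """
SELECT DISTINCT session_id FROM working_buffer
WHERE updated_at < now() - INTERVAL '15 minutes';
"""
HARD_CAP_SWEEP = """
DELETE FROM working_buffer
WHERE updated_at < now() - INTERVAL '48 hours';
"""

with psycopg.connect("dbname=asktom") as conn:  # hypothetical DSN
    conn.execute(DDL)
    conn.execute(HARD_CAP_SWEEP)
```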
What about Anthropic Skills?
Skills are a packaging format for the file-based pattern. A SKILL.md plus its bundled resources. Same paradigm — agent reads files on demand — with a standardized contract. If your knowledge fits the skill format, ship it as a skill; the engine is identical to what this lab tested.
Why did you not test mem.ai or Reflect or Notion AI?
They are products built on top of the three paradigms tested. mem.ai is a hosted RAG with a slick UI. Notion AI is RAG-on-your-Notion-pages. Reflect is hosted long-context. Testing them would have given me three more data points along the same three architectural axes. The labs page is about architectures, not products. Pick the wrapper that fits your team; the underlying trade-offs are the same.