CTAIO Labs · Episode 1

I Tested 5 Voice Cloning Engines for My AI Twin

Five voice cloning engines tested head-to-head: ElevenLabs, Cartesia, LMNT, Fish Audio, Coqui XTTS. Audio demos and real costs.

Key Takeaways

  • Cartesia beat ElevenLabs in blind A/B testing. Pro fine-tuning on 54 minutes of studio audio produced a more natural, consistent voice clone, but requires more upfront investment.
  • LMNT clones your voice for free from a 5-second clip. The best free-tier voice cloning available: 15,000 characters per month, unlimited clones, no credit card required.
  • The ecosystem matters more than the voice engine. Voice generation is layer one. Lip sync, avatars, and distribution integrations determine whether your voice output actually reaches users.

In-depth tech findings

  • Training data volume has a logarithmic quality curve, not linear. Intelligibility gains from 5s → 60s of reference audio are steep. From 60s → 54 min, gains are real but flattening. ElevenLabs IVC hit ceiling quality at 76 seconds; adding more samples via Professional Voice Clone moved the needle less than 5% in blind tests. Cartesia's fine-tune wins not because of volume, but because it runs a dedicated training job on your phoneme distribution rather than in-context inference.
  • Open-source TTS models decouple duration prediction from speaker conditioning. Coqui XTTS v2 uses an autoregressive decoder where speaking rate is controlled by a separate duration predictor module trained on the base corpus — not on your reference clip. The speaker embedding only shapes timbre and prosody, not tempo. The same 18-word script produced 76 s in Coqui vs. 20–22 s from commercial engines that sample your pace from the reference audio.
  • Fine-tuned models absorb prosodic style, making style overrides additive noise. Cartesia Pro fine-tunes on your actual intonation contours; the model's prior is your speaking style. Injecting a +emotion vector at inference time pushes output off the learned manifold, degrading naturalness. ElevenLabs IVC has no style prior — the base model's prosody is neutral, so stability (0–1 range, controls randomness) and clarity sliders meaningfully sculpt output without fighting a learned distribution.
  • Speaker embedding extraction is language-agnostic but accent-bound. Coqui XTTS v2 extracts a 512-dimensional speaker embedding from the reference clip and injects it into every generation — including cross-lingual output. The embedding encodes acoustic features, not phoneme mappings, so German-accented formant transitions transfer directly into Spanish output. Cartesia's fine-tune model learns phone-level adaptation from 54 minutes of labelled audio; zero-shot models never learn what a "neutral accent" sounds like for your voice.
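Because zero-shot engines reduce a voice to a fixed-length embedding, clone fidelity can be checked numerically: re-extract an embedding from the generated audio and compare it against the reference embedding with cosine similarity. A minimal sketch in plain Python, with toy 4-dimensional vectors standing in for the 512-dimensional XTTS embeddings (the extraction step itself would come from the model; the vectors here are hypothetical):

```python
from math import sqrt

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two speaker embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-d stand-ins for real 512-d XTTS speaker embeddings.
ref_clip_embedding = [0.9, 0.1, 0.3, 0.2]   # from the reference clip
generated_embedding = [0.8, 0.2, 0.3, 0.1]  # re-extracted from cloned output
other_speaker = [0.1, 0.9, 0.1, 0.8]        # from a different voice

print(cosine_similarity(ref_clip_embedding, generated_embedding))  # high (same voice)
print(cosine_similarity(ref_clip_embedding, other_speaker))        # low (different voice)
```

Speaker-verification systems threshold exactly this score; the same trick gives you an objective sanity check on a clone before any blind listening test.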
The Test: The reference voice sample was recorded during a professional video shoot at a studio in Munich — clean audio, controlled environment. I ran the same homepage welcome script through five voice cloning engines — three commercial, one free, one open-source. Same text, same speaker (me), different technology. Hit play to hear the difference yourself before reading the analysis.
Original voice — studio recording, January 2026

The CTAIO Labs Experiment

Same script · Same speaker · Compare commercial, free, and open-source voice clones

  • ElevenLabs (Excited profile): Instant Voice Clone · eleven_multilingual_v2
  • Cartesia (Pro fine-tuned V1): Sonic model · 54 min training
  • LMNT (zero-shot clone): Instant Voice Clone · Blizzard model
  • Fish Audio (zero-shot clone): Fish-Speech · 15s ref clip
  • Coqui XTTS (XTTS v2 zero-shot): open source · 5s ref clip

Why Clone Your Voice?

Before the engine comparison: why would you want a digital copy of your voice in the first place? Here's what a production-grade voice clone unlocks.

  • Omni-channel content at any length. Record once, generate unlimited variations — 15-second social clips, 2-minute explainers, 30-minute course modules. Your voice, your pacing, no studio time.
  • Translate content for global audiences. Your English voice clone speaks German, Spanish, French, Portuguese — with natural accent adaptation. One clone, multiple markets, zero re-recording.
  • Scale video production. Pair your voice clone with a lip-synced avatar (HeyGen, Sync Labs) and produce talking-head videos from a text script. No camera, no lighting, no editing.
  • Always-on brand voice. Customer support bots, IVR systems, product tours — all sound like you instead of a generic AI narrator. Consistency across every touchpoint.
  • Iterate without the studio. Rewrite a script at midnight, regenerate the audio in seconds. No need to book studio time for a one-word change.
  • Preserve and extend your voice. Accessibility use cases, voice banking for medical conditions, or simply ensuring your voice outlasts your availability.

The question isn't whether voice cloning is useful — it's which engine gives you the best clone for your budget. That's what I tested.

The Voice AI Stack

Voice AI isn't one tool — it's a stack. Understanding the layers saves you from picking a great voice engine and then discovering it doesn't connect to anything.

Layer 3 (Distribution): Website embeds · Podcast hosting · Phone systems · Video platforms
Layer 2 (Lip Sync & Avatars): HeyGen · Sync Labs · D-ID · Rask AI
Layer 1 (Voice Generation, TTS & Cloning): ElevenLabs · Cartesia · LMNT · Fish Audio · Coqui XTTS

You pick a voice engine first. Everything downstream — lip sync timing, avatar mouth shapes, streaming latency — depends on your voice output format, quality, and API design. I learned this the hard way: I picked HeyGen for avatars before finalizing voice, and had to re-export everything when I switched from ElevenLabs to Cartesia mid-project.

Voice Cloning — I Tested Five Engines

Over three weeks in January 2026, I ran the same production workflow through five voice cloning engines. Same script, same speaker (me), same target: a homepage welcome for prommer.net. Three commercial platforms (ElevenLabs, Cartesia, Fish Audio), one free tier (LMNT), and one open-source option (Coqui XTTS). Here's what each engine delivered.

ElevenLabs

The market default. I used eleven_multilingual_v2 with an Instant Voice Clone built from ~1:16 of studio audio. From the same clone I created two voice profiles: "Excited" (stability 0.35, clarity 0.78) and "Confident" (stability 0.55, clarity 0.85).

What worked: Fast setup (minutes, not hours). Emotion control via stability/clarity sliders is genuinely useful — the "Excited" profile sounds noticeably different from "Confident" without re-cloning. 29-language support out of the box.

What didn't: The clone quality is good but not indistinguishable from my real voice — there's a subtle "AI smoothness" that anyone familiar with my actual voice would notice. Quota management can be tight depending on your tier.

Cartesia Pro

The newcomer that surprised me. Cartesia's Pro fine-tuning requires serious input: I uploaded 54 minutes of studio audio across 32 WAV files (16-bit, 44.1kHz). The training process took ~4 hours and produced a "sonic model" — a custom voice checkpoint.

What worked: The output quality is the best I've tested. My wife couldn't tell the clone from real recordings in a blind test. Five languages (EN, DE, ES, FR, PT) from one English training set, all with natural accent adaptation. Streaming latency under 100ms.

What didn't: Emotion overrides made the clone worse, not better. Cartesia's philosophy is that the fine-tune captures your natural style, and explicit emotion parameters fight against it. This is the opposite of ElevenLabs' approach. There's also a 500-character limit per request on the free tier, so long paragraphs need chunking.
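The 500-character cap means longer scripts have to be chunked client-side before each request. A generic sentence-aware splitter (my own helper, not part of Cartesia's SDK; the limit is the free-tier value mentioned above):

```python
import re

def chunk_text(text: str, limit: int = 500) -> list[str]:
    """Split text into chunks of at most `limit` characters, breaking at
    sentence boundaries so the TTS engine doesn't cut prosody mid-sentence."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > limit:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
        while len(current) > limit:  # hard-split a single oversize sentence
            chunks.append(current[:limit])
            current = current[limit:]
    if current:
        chunks.append(current)
    return chunks

script = "Welcome to prommer.net. " * 40  # ~960 characters of demo text
parts = chunk_text(script, limit=500)
assert all(len(p) <= 500 for p in parts)
```

Each chunk then becomes one API request; concatenating the returned audio segments reassembles the paragraph.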

Coqui XTTS v2 (Open Source)

The free option. XTTS v2 does zero-shot voice cloning from a 5-second reference clip. It supports 16 languages and runs locally on GPU. No API costs, no vendor lock-in.

What worked: It's free. The English output is recognizably my voice at zero cost. 16-language support is impressive for an open-source model. Runs locally on Apple Silicon (MPS backend) — the sample above was generated in 38 seconds on M-series.

What didn't: Three things. First, the German accent bleeds through every language: my German-accented reference clip produces German-accented Spanish and French output. Second, XTTS speaks noticeably slower than the source: the same script that runs 18 seconds in ElevenLabs and Cartesia runs 1:16 in XTTS. The model doesn't preserve speaking pace; it generates phoneme-by-phoneme at its own tempo. Third, the AGPL license requires open-sourcing derivative applications, which rules it out for most commercial deployments.

LMNT

The free-tier surprise. LMNT lets you clone your voice from a 5-second clip with no credit card required. I used a 15-second reference clip and the clone was ready instantly. 15,000 characters per month free — enough for serious experimentation.

What worked: The clone quality is genuinely impressive for zero cost. My voice is clearly recognizable, with natural pacing and intonation. The Blizzard model handles English well, and the API is clean and modern. Sub-200ms streaming latency makes it viable for real-time voice agents.

What didn't: The free tier is English-only. No multilingual support without upgrading. The output has a slight digital warmth that's audible in A/B comparison but probably fine for most use cases. SDK documentation had some v1/v2 inconsistencies that cost me debugging time.

Fish Audio

Fish Audio's Fish-Speech model does zero-shot voice cloning from a reference clip, similar to LMNT and Coqui. I uploaded the same 15-second reference clip used for LMNT. The API is straightforward — upload reference audio, submit text, get output.

What worked: Good clone quality with natural prosody. The voice is recognizably mine with reasonable pacing. The platform supports multiple languages and the API design is clean.

What didn't: No free tier available. The output quality is a step below ElevenLabs and LMNT for English. The platform is newer and the documentation is thinner than competitors.

Engine Comparison

| | ElevenLabs | Cartesia | LMNT | Fish Audio | Coqui XTTS |
| --- | --- | --- | --- | --- | --- |
| Clone quality | 8/10 | 9.5/10 | 8/10 | 7.5/10 | 7.5/10 |
| Training data | ~1 min clip | 54 min (32 WAVs) | 5–15s clip | 15s clip | 5s clip |
| Languages | 29 | 5 | 1 (EN) | Multiple | 16 |
| Availability | Paid tier | Paid tier | Free tier | Paid tier | Free (local) |
| Emotion control | Sliders | None | None | None | None |
| Latency | ~1.5s | <100ms | <200ms | ~2s | 38s local |
| Open source | No | No | No | No | AGPL |

The A/B Test — Same Script, Five Clones

I ran the same homepage welcome script through all five engines. Listeners ranked on "sounds most like Thomas" and rated overall voice quality and naturalness.

Methodology: Same homepage welcome script (~1,100 characters), same target emotion (friendly-professional), no post-processing. Commercial outputs were volume-normalized to -16 LUFS. Open-source output was presented as-is. Five listeners who know my voice ranked the clone outputs.
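The loudness matching in this methodology is ultimately a gain calculation. A simplified sketch using plain RMS in place of true LUFS (real loudness measurement per ITU-R BS.1770 adds K-weighting and gating; libraries such as pyloudnorm implement it properly):

```python
from math import sqrt, log10

def rms_dbfs(samples: list[float]) -> float:
    """RMS level of float samples (range -1..1) in dBFS."""
    rms = sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * log10(rms)

def normalize_to(samples: list[float], target_dbfs: float = -16.0) -> list[float]:
    """Scale samples so their RMS level hits target_dbfs.
    Simplified stand-in for LUFS normalization (no K-weighting, no gating)."""
    gain_db = target_dbfs - rms_dbfs(samples)
    gain = 10 ** (gain_db / 20)
    return [s * gain for s in samples]

quiet = [0.01, -0.02, 0.015, -0.01] * 1000     # a very quiet test signal
levelled = normalize_to(quiet, target_dbfs=-16.0)
print(round(rms_dbfs(levelled), 1))  # -16.0
```

Matching levels this way removes loudness as a confound, so listeners rank voice quality rather than whichever clip happens to be louder.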

Clone ranking: Cartesia Pro won 4/5 blind tests. ElevenLabs and LMNT tied for second — both clearly recognizable as my voice. Fish Audio came fourth. Coqui XTTS was consistently last — recognizable but the synthetic texture gives it away.

The nuance: ElevenLabs' emotion controls mean you can create distinctly different voice profiles (my "Excited" and "Confident" clones sound genuinely different). Cartesia doesn't offer this — you get one high-fidelity clone. LMNT gives you comparable clone quality to ElevenLabs for free.

Training Input — What Went In

Transparency matters when discussing AI clones. Here are actual samples from the training data I fed each engine — the raw material that produced the voices you heard above.

Cartesia Pro — Studio Recording (1 of 32 WAVs)

One take from the 32-file training set (54 minutes total, 16-bit WAV, 44.1kHz). All recordings were captured during a January 2026 studio session. Cartesia's fine-tuning ingested the full set and produced four voice candidates — V1 was selected.

Download WAV (5.6 MB)

ElevenLabs — Instant Voice Clone Input

The single 1:16 studio clip uploaded for ElevenLabs' Instant Voice Clone. No fine-tuning — ElevenLabs extracts voice characteristics from this one sample. The same recording was used for multiple voice profiles (Excited, Confident) by adjusting stability and clarity sliders.

Download MP3 (1.2 MB)

Coqui XTTS v2 — 5-Second Reference Clip

That's all XTTS needs: a 5-second WAV clip. The model extracts a speaker embedding from this tiny sample and uses it for all subsequent generations. No training, no fine-tuning: pure zero-shot voice cloning. The tradeoff shows in the output quality, but the barrier to entry is remarkably low.

Download WAV (431 KB)

LMNT & Fish Audio — Reference Clip

Both engines used reference audio extracted from my studio recordings, uploaded via their respective APIs. Same input, different results — listen to the comparison above to hear how each engine interprets the same reference audio.

Download WAV (1.3 MB)

What Happens After Voice — The Ecosystem

Voice generation produces audio files. Getting those files into usable content requires the next layers of the stack.

HeyGen — Avatars + Lip Sync

HeyGen combines avatar generation with lip sync in one platform. Upload a video of yourself, feed it your TTS audio, and get a lip-synced avatar video.

The gotcha: Captions are NOT burned into the video. You get a clean video output and need to add subtitles separately (SRT export available). I assumed captions were included and had to re-do my workflow.

Sync Labs — Lip Sync Only

Sync Labs does one thing: lip sync. You provide video + audio, it returns video with matched lip movements.

Key finding: Temperature 0.3 produces significantly better results than the default 0.5. Lower temperature means less creative interpretation of mouth shapes, which for professional content means fewer weird lip artifacts. I tested both extensively — 0.3 wins for talking-head content.
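In API terms, temperature is just one field on the lip-sync job, so it's worth pinning the 0.3 value in one place. A hypothetical payload builder (field names are illustrative and do not reflect Sync Labs' actual API schema):

```python
def lipsync_job(video_url: str, audio_url: str, temperature: float = 0.3) -> dict:
    """Build a lip-sync job payload. Field names are hypothetical,
    not the real Sync Labs schema; the 0.3 default reflects the
    setting that produced fewer lip artifacts in my testing."""
    if not 0.0 <= temperature <= 1.0:
        raise ValueError("temperature must be between 0 and 1")
    return {
        "video_url": video_url,
        "audio_url": audio_url,
        "temperature": temperature,
    }

job = lipsync_job("https://example.com/talk.mp4", "https://example.com/voice.wav")
print(job["temperature"])  # 0.3
```

Centralizing the default means every job in a batch gets the tested setting instead of whatever the API's own default happens to be.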

D-ID — Photo to Talking Head

D-ID takes a static photo and animates it with your audio. It produces surprisingly convincing results for social media content. Not broadcast quality, but good enough for LinkedIn posts and internal demos.

Integration Flow

The practical workflow: Generate voice audio (Cartesia) → Lip sync to existing video (Sync Labs) OR generate avatar video (HeyGen) → Add captions (manual/CapCut) → Distribute. Each tool consumes the output of the previous one. API integration between them is minimal — expect to move files manually or build glue scripts.
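Those glue scripts mostly thread one artifact path into the next stage. A skeleton of that handoff, with stub functions standing in for the real tool calls (the actual stages would hit the respective APIs or involve manual exports):

```python
from typing import Callable

# Each stage takes the previous stage's output path and returns its own.
Stage = Callable[[str], str]

def run_pipeline(start: str, stages: list[Stage]) -> str:
    """Thread an artifact through each stage in order."""
    artifact = start
    for stage in stages:
        artifact = stage(artifact)
    return artifact

# Stubs standing in for Cartesia, Sync Labs, and CapCut respectively.
def generate_voice(script_path: str) -> str:
    return script_path.replace(".txt", ".wav")

def lip_sync(audio_path: str) -> str:
    return audio_path.replace(".wav", "_synced.mp4")

def add_captions(video_path: str) -> str:
    return video_path.replace(".mp4", "_captioned.mp4")

final = run_pipeline("homepage_welcome.txt", [generate_voice, lip_sync, add_captions])
print(final)  # homepage_welcome_synced_captioned.mp4
```

Swapping HeyGen for the Sync Labs stage, or CapCut for a subtitle API, means replacing one function without touching the rest of the chain.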

The Competitive Landscape

Beyond the five engines I tested, the Voice AI space is crowded. Here's where other key players fit.

| Platform | Primary Strength | Best For |
| --- | --- | --- |
| Resemble AI | Voice cloning + deepfake detection | Enterprise (brand safety focus) |
| Vapi | Voice agent orchestration | Building AI phone agents |
| Retell AI | Conversational AI voice agents | Call centers, customer support |
| Piper | Lightweight local TTS | Embedded systems, home automation |

The Frontier Labs

The biggest players in AI are all moving into voice — but from different angles and with different levels of maturity.

OpenAI shipped the most visible voice product: ChatGPT's Advanced Voice Mode uses a natively multimodal model that handles speech-to-speech without a separate TTS step. It's conversational, low-latency, and surprisingly expressive. Their standalone TTS API (tts-1, tts-1-hd) offers six preset voices but no voice cloning — you can't use your own voice. For content creation, that's a dealbreaker. For building voice agents, it's a solid foundation.

Google has deep TTS heritage through WaveNet and Cloud TTS, which power Google Assistant and millions of IVR systems. Their latest research (SoundStorm, AudioPaLM) shows zero-shot voice cloning capability, but none of it is publicly available as a cloning API yet. Gemini's multimodal capabilities include audio understanding but not voice generation in the ElevenLabs sense. Google's strength is infrastructure-scale TTS, not creator tools.

Meta has been the most aggressive in open-source voice AI. Voicebox (2023) demonstrated zero-shot TTS and style transfer. MMS (Massively Multilingual Speech) covers 1,100+ languages. They acquired PlayHT in mid-2025 and folded the technology into their research stack. Meta's play is open weights and platform integration — expect voice features inside Meta AI and WhatsApp, not a standalone API for developers.

Anthropic has not entered voice generation. Claude processes audio input but doesn't generate speech. No TTS API, no voice cloning, no plans announced. Their focus remains on text-based reasoning and safety research.

The takeaway: frontier labs are building voice into their chat products and research, but none of them currently offer a voice cloning API that competes with ElevenLabs or Cartesia for content creation. That could change fast — especially from OpenAI and Meta — but today the specialist providers still own this space.

The landscape is splitting into two camps: content creation (ElevenLabs, Cartesia, Fish Audio — generate audio for videos, podcasts, marketing) and real-time conversation (LMNT, Deepgram, Vapi, Retell — power live voice agents and phone systems). Different requirements, different winners. The free/open-source tier (Coqui XTTS, StyleTTS2, Piper) serves prototyping and privacy-sensitive use cases.

The Economics

Here's what I actually spend per month on voice AI tools for content production across prommer.net and client projects.

| Tool | Plan | Monthly Cost | What You Get |
| --- | --- | --- | --- |
| ElevenLabs | Starter | $5 | 30 min audio/mo, 3 custom voices |
| Cartesia | Startup | $21 | Pro fine-tune, 500 char/req, streaming |
| HeyGen | Creator | $29 | 15 min video/mo, avatar + lip sync |
| Sync Labs | Pro | $49 | 30 min lip sync/mo |
| D-ID | Lite | $23 | Photo-to-video, API access |
| Total | | $127 | |

Cost per unit: At this spend level, I produce roughly 8–10 lip-synced videos per month (60–90 seconds each), which works out to about 8–15 finished video-minutes. Against the $127 total, that's roughly $8–16 per finished minute of video, dominated by the lip-sync and avatar subscriptions; voice generation is the cheapest layer (ElevenLabs Starter comes to about $0.17 per minute of generated audio).
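For budgeting, the all-in number (fixed subscriptions divided by finished output) is the figure to track; since these are flat subscriptions, the marginal cost of one extra video is near zero until a quota is hit. A reproducible sketch using the table's figures and midpoints of the stated output volumes:

```python
# Monthly tool spend, taken from the economics table above.
tools = {"ElevenLabs": 5, "Cartesia": 21, "HeyGen": 29, "Sync Labs": 49, "D-ID": 23}

videos_per_month = 9      # midpoint of the stated 8-10 videos
avg_video_minutes = 1.25  # midpoint of 60-90 seconds per video

total = sum(tools.values())
finished_minutes = videos_per_month * avg_video_minutes

print(f"${total}/mo across {len(tools)} tools")                 # $127/mo across 5 tools
print(f"${total / finished_minutes:.2f} per finished minute")   # $11.29 per finished minute
```

Doubling output volume on the same plans roughly halves the per-minute cost, which is the usual economics of fixed-fee tooling.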

How to Choose an Engine

The decision framework, after testing five engines and surveying the wider field:

  • Need your voice, have budget: Cartesia Pro (best quality) or ElevenLabs (best ecosystem).
  • Need your voice, no budget: LMNT free tier (best free cloning) or Coqui XTTS (fully local).
  • Don't need cloning: OpenAI TTS (simplest API) or Deepgram (generous free credits).
  • Privacy-sensitive: StyleTTS2 or Coqui XTTS (audio never leaves your machine).
  • Real-time agents: LMNT (<200ms) or Cartesia (<100ms).

What's Coming

Based on where the technology and market are moving:

Multilingual as table stakes. Cartesia already produces 5 languages from one English voice clone. ElevenLabs does 29. Within 12 months, any serious voice engine will handle 10+ languages from a single clone without accent bleed. The competitive differentiator will shift from "how many languages" to "how natural does each one sound."

Real-time conversational voice. LMNT and Deepgram are pushing sub-200ms latency for live voice interaction. This enables AI phone agents that sound human in real-time conversation — not just pre-recorded TTS playback. LiveKit, Vapi, and Retell are building the orchestration layer on top. The call center industry is the first major disruption target.

Voice training data economics. The quality of your training audio is becoming the moat. 54 minutes of studio-quality audio produces dramatically better clones than 2 hours of phone recordings. Companies are starting to invest in professional voice capture sessions as a strategic asset, not just a technical requirement. Data collection, curation, and licensing are becoming their own market.

From content creation to live interaction. The tools I tested are all content creation tools — you generate audio, then use it in videos or podcasts. The next wave is live: voice agents in customer support, AI co-hosts in podcasts, real-time translation in video calls. The voice clone becomes an always-on digital representative, not a batch-processing tool.

FAQ

Frequently Asked Questions

How much training audio do I need for a voice clone?

It depends on the engine. LMNT and Fish Audio clone from a 5–15 second clip. ElevenLabs' Instant Voice Clone works with ~1 minute. Cartesia Pro fine-tuning wants 30–60 minutes of clean studio audio (I used 54 minutes across 32 WAV files). Coqui XTTS needs just 5 seconds. More data generally means better quality, but LMNT proves you can get impressive results from minimal input.
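Whichever engine you pick, it's worth sanity-checking a reference clip before upload. A stdlib-only validator that reads the WAV header (the thresholds here are illustrative; each engine documents its own minimums):

```python
import math
import struct
import wave

def clip_stats(path: str) -> tuple[float, int]:
    """Return (duration_seconds, sample_rate) of a WAV file."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate(), w.getframerate()

def check_reference(path: str, min_seconds: float = 5.0, min_rate: int = 22050) -> bool:
    """Check a reference clip is long enough for zero-shot cloning and at
    a usable sample rate. Thresholds are illustrative defaults."""
    duration, rate = clip_stats(path)
    return duration >= min_seconds and rate >= min_rate

# Write a 6-second, 44.1 kHz mono test tone to exercise the check.
rate, seconds = 44100, 6
with wave.open("ref.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)  # 16-bit samples
    w.setframerate(rate)
    frames = (int(8000 * math.sin(2 * math.pi * 220 * t / rate))
              for t in range(rate * seconds))
    w.writeframes(b"".join(struct.pack("<h", f) for f in frames))

print(check_reference("ref.wav"))  # True
```

A check like this catches the common failure modes (clipped exports, 8 kHz phone audio) before you burn an upload or a training run on bad input.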

Is it legal to clone my own voice?

Cloning your own voice is legal in most jurisdictions. The legal issues arise when cloning someone else's voice without consent. All platforms require you to confirm you have rights to the voice being cloned. For business use, get explicit written consent from anyone whose voice you clone.

Can I use voice clones for commercial content?

Yes, all commercial platforms tested (ElevenLabs, Cartesia, LMNT, Fish Audio) allow commercial use on their paid tiers. Coqui XTTS is AGPL-licensed — check your specific use case for the open-source option.

How good is multilingual voice cloning in 2026?

Surprisingly good. Cartesia produces 5 languages (EN, DE, ES, FR, PT) from a single English voice clone with consistent quality. ElevenLabs' multilingual_v2 model handles 29 languages. The main gotcha is accent bleed — Coqui XTTS retains source accent across languages, while Cartesia and ElevenLabs handle accent adaptation better.

What's the latency for real-time voice AI?

For pre-generated TTS (content creation), latency is 1–3 seconds per paragraph. For real-time conversational voice (phone agents, live chat), LMNT targets sub-200ms, ElevenLabs Turbo v2.5 runs around 300ms, and Cartesia Sonic is under 100ms for streaming.

Should I use open source or commercial voice AI?

Use open source (Coqui XTTS, Piper) for internal tools, prototyping, and cost-sensitive batch processing. Use free commercial tiers (LMNT) for experimentation and low-volume production. Use paid commercial (ElevenLabs, Cartesia, Fish Audio) for brand voice, customer-facing content, and when you need multilingual consistency. The quality gap is closing but paid commercial still wins for polished output.

What happens after voice generation — do I need more tools?

Usually yes. Voice generation is layer one. For video content, you need lip sync (Sync Labs, HeyGen) or avatar generation (D-ID, HeyGen). For voice agents, you need telephony integration (Twilio, LiveKit, Vapi). For podcasts, you need editing and hosting. Budget for the full stack, not just TTS.

How do ElevenLabs emotion controls compare to Cartesia?

ElevenLabs lets you set stability, clarity, and style parameters per generation — useful for creating distinct "Excited" and "Confident" voice profiles from the same clone. Cartesia takes a different approach: the Pro fine-tune captures your natural speaking style, and explicit emotion overrides actually degraded output quality in my testing. Different philosophies, both valid.

Coming up in this series
Part 2
Video — The Same Experiment, But for Video Avatars

I'll run the same head-to-head test with video generation and lip sync engines. HeyGen, Sync Labs, D-ID, and the open-source alternatives. Same script, same speaker, same methodology. Which engine produces a video avatar that doesn't fall into the uncanny valley?

Part 3
Knowledge — Teaching Your Clone What You Know

Voice and video are the output layer. The harder problem: giving your AI twin a knowledge base that sounds like you — your opinions, your frameworks, your experience. RAG pipelines, fine-tuning, and the architecture behind a clone that doesn't just sound like you but thinks like you.

Part 4
The Full Clone — Putting It All Together

Voice, video, and knowledge brain wired into one system. The complete AI twin pipeline — from raw input to a deployed digital version of yourself that can represent you across channels.

The CTAIO Lab Podcast

Now playing: Building My AI Clone — voice cloning, video avatars, lip sync, and the full production pipeline.


6 comments

Marcus Chen

This is incredible work. I have been experimenting with voice cloning for a podcast intro and the quality difference between the models you compared is staggering. Did you find any significant latency differences when running inference locally vs. the hosted APIs?

Andi Cross

Good question! Local inference on an M3 Max takes about 1.2s for a 10-second clip with XTTS v2. The hosted APIs (ElevenLabs, Play.ht) are faster at ~400ms but obviously you are paying per character. For batch processing I always go local.

Sarah Okonkwo

Really appreciate the honest comparison. Most articles just hype one tool — you actually showed the tradeoffs. The section on ethical considerations was especially important. We need more of that in the AI space.

Lukas Berger

I tried replicating your setup with ElevenLabs and got surprisingly close results with only 3 minutes of training audio. One thing I noticed: the cloned voice struggles with code-switching between English and German. Have you tested multilingual scenarios?

Priya Sharma

Great writeup. What hardware are you running the local models on? I have an M2 Pro and wondering if that is sufficient for real-time inference with the XTTS model you mentioned.

James Whitfield

The voice cloning landscape is moving so fast. When I first tried this 6 months ago the results were robotic at best. Now it is genuinely hard to tell the difference. Slightly terrifying but fascinating.