ElevenLabs voice settings are not cosmetic: what stability, similarity boost, and style exaggeration actually do

Most people treat ElevenLabs settings as sliders to fiddle with until the voice sounds "about right." After running a blind A/B panel across 12 parameter combinations, I have a more mechanical understanding of what each one does.

The four settings

Stability (0–1) controls variance in the voice across the audio. High stability = consistent, rehearsed delivery. Low stability = natural variation, cadence shifts, slight imperfections.

The default is around 0.5. For a sales voicemail, that's slightly too high — the output sounds like someone reading from a script rather than leaving a message. Dropping to 0.40–0.48 introduces the natural cadence variation that makes it sound spontaneous.

At 0.2, the model gets dramatically expressive in ways that feel theatrical rather than warm. The sweet spot for conversational use is 0.40–0.50.

Similarity boost (0–1) controls how closely the output matches the source voice clone. High values = tight fidelity to the training clips. Low values = the model drifts toward its "generic voice" baseline.

The default is around 0.75. For cloned voices, this is often too low — the voice sounds similar but not distinctly that person. 0.82–0.88 is where a clone stops sounding like an impression and starts sounding like the person.

Above 0.9, you start amplifying recording artifacts from the source clips (room noise, mic quality, breath sounds). The fidelity gain isn't worth the artifact amplification.

Style exaggeration (0–1) controls how much stylistic expression the model layers onto the voice. Zero = flat, neutral delivery. High values = emotionally expressive, theatrical.

This is the one most people over-tune. A value of 0.3 already sounds performative for anything that should feel like a natural conversation. For voicemails or CRM replies, 0.15–0.25 is the ceiling. Above 0.3, you're producing audio for a podcast intro, not a message from a person.

Speaker boost (boolean) runs an additional processing pass to tighten the voice match on cloned voices, at a cost of roughly 5% more latency. For cloned voices, it's always worth enabling: the quality improvement is audible, not just measurable.
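
To make the four settings concrete, here is a minimal sketch of a settings object with range checks. It assumes the field names `stability`, `similarity_boost`, `style`, and `use_speaker_boost` that ElevenLabs uses in its `voice_settings` request object. The `VoiceSettings` class and its `to_payload` method are hypothetical helpers for illustration, not part of any SDK, and the `style` and `use_speaker_boost` defaults are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceSettings:
    """Hypothetical container for the four settings described above."""

    stability: float = 0.5           # default cited in this post
    similarity_boost: float = 0.75   # default cited in this post
    style: float = 0.0               # assumed default (neutral delivery)
    use_speaker_boost: bool = True   # assumed default

    def __post_init__(self):
        # The three sliders are described as 0-1 ranges; reject anything else.
        for name in ("stability", "similarity_boost", "style"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be between 0 and 1, got {value}")

    def to_payload(self) -> dict:
        # Keys assumed to match the API's voice_settings object.
        return {
            "stability": self.stability,
            "similarity_boost": self.similarity_boost,
            "style": self.style,
            "use_speaker_boost": self.use_speaker_boost,
        }


voicemail = VoiceSettings(stability=0.45, similarity_boost=0.85, style=0.20)
print(voicemail.to_payload())
```

Validating at construction time means an out-of-range value fails before any audio is generated, rather than surfacing as an API error or an odd-sounding clip.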

How they interact

Stability and style interact non-linearly. A high-stability voice with high style exaggeration sounds stiff and over-performed at the same time, because the two properties conflict. Low stability with high style tips into volatility. Moderately low stability (around 0.45) with restrained style (around 0.2) sounds lively and natural. Four of the combinations I tested:

| Config | Result |
|---|---|
| Stability 0.7, Style 0.3 | Robotic enthusiasm (wrong) |
| Stability 0.3, Style 0.4 | Emotionally volatile (wrong) |
| Stability 0.45, Style 0.2 | Warm, conversational (correct) |
| Stability 0.5, Style 0.0 | Safe but flat |
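
The tested combinations can be turned into a rough pre-flight check. The sketch below is only that: the function name `lint_combo` is hypothetical, and the thresholds are illustrative guesses fitted to the four rows above, not measured boundaries.

```python
def lint_combo(stability: float, style: float) -> str:
    """Return a rough label for a stability/style pair.

    Thresholds are assumptions chosen to reproduce the four tested rows;
    anything outside them is reported as untested rather than guessed at.
    """
    if style == 0.0:
        return "safe but flat"
    if stability >= 0.6 and style >= 0.3:
        return "stiff and over-performed"
    if stability < 0.4 and style >= 0.3:
        return "emotionally volatile"
    if 0.40 <= stability <= 0.50 and style <= 0.25:
        return "warm, conversational"
    return "untested"


for stability, style in [(0.7, 0.3), (0.3, 0.4), (0.45, 0.2), (0.5, 0.0)]:
    print(stability, style, "->", lint_combo(stability, style))
```

A check like this is useful mainly as a guard against accidental settings in config files; it is no substitute for listening to the output.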

Model choice matters too

Comparing eleven_turbo_v2_5 with eleven_multilingual_v2: for English-only output, turbo is 50% cheaper with near-identical perceptual quality. The latency improvement also matters for real-time pipelines, where turbo returns audio roughly 30% faster on average.

Use multilingual_v2 only if you're actually targeting multiple languages or need the highest possible fidelity ceiling on a cloned voice.
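
That rule of thumb is simple enough to encode. A minimal sketch, assuming languages are passed as codes such as "en" or "en-US"; the `choose_model` function and its `need_max_fidelity` flag are hypothetical names, and only the two model IDs come from the comparison above.

```python
TURBO = "eleven_turbo_v2_5"
MULTILINGUAL = "eleven_multilingual_v2"


def choose_model(languages, need_max_fidelity=False):
    """Pick a model ID following the English-only rule of thumb above."""
    # Treat "en", "en-US", "EN-gb" and similar as English.
    non_english = [
        lang for lang in languages if lang.lower().split("-")[0] != "en"
    ]
    if non_english or need_max_fidelity:
        return MULTILINGUAL
    return TURBO


print(choose_model(["en-US"]))
print(choose_model(["en", "de"]))
print(choose_model(["en"], need_max_fidelity=True))
```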

Practical defaults for different use cases

| Use case | Stability | Similarity | Style | Speaker boost |
|---|---|---|---|---|
| Sales voicemail | 0.45 | 0.85 | 0.20 | on |
| Podcast host (conversational) | 0.40 | 0.80 | 0.25 | on |
| IVR / information | 0.65 | 0.75 | 0.05 | off |
| Audiobook narrator | 0.55 | 0.80 | 0.30 | on |
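
The defaults above can live in code as named presets. This sketch builds a JSON request body from a preset, assuming the `text`, `model_id`, and `voice_settings` fields of ElevenLabs' text-to-speech request. The preset keys and the `build_request_body` helper are hypothetical, and no network call is made.

```python
import json

# Values copied from the table above; the key names are illustrative.
PRESETS = {
    "sales_voicemail": {"stability": 0.45, "similarity_boost": 0.85,
                        "style": 0.20, "use_speaker_boost": True},
    "podcast_host": {"stability": 0.40, "similarity_boost": 0.80,
                     "style": 0.25, "use_speaker_boost": True},
    "ivr": {"stability": 0.65, "similarity_boost": 0.75,
            "style": 0.05, "use_speaker_boost": False},
    "audiobook": {"stability": 0.55, "similarity_boost": 0.80,
                  "style": 0.30, "use_speaker_boost": True},
}


def build_request_body(text, use_case, model_id="eleven_turbo_v2_5"):
    """Return a request body dict for the given use case preset."""
    if use_case not in PRESETS:
        raise KeyError(f"unknown use case: {use_case!r}")
    return {
        "text": text,
        "model_id": model_id,
        # Copy so callers can tweak values without mutating the preset.
        "voice_settings": dict(PRESETS[use_case]),
    }


body = build_request_body("Hi, sorry I missed you.", "sales_voicemail")
print(json.dumps(body, indent=2))
```

Keeping presets in one place makes it easier to compare panel results against a known starting point instead of whatever the sliders were last left at.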

These are starting points, not absolutes. The right test is a blind panel with 3–5 people who know what the target voice sounds like — not solo listening sessions where you've tuned your ear to accept what you built.
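
A blind panel needs the clips relabeled so listeners can't tell which configuration they are hearing. The sketch below is one way to do that, under the assumption that each listener simply names the label they preferred; `blind_order` and `tally` are hypothetical helpers, and the votes shown are placeholder input, not results.

```python
import random
from collections import Counter


def blind_order(config_names, seed=None):
    """Shuffle config names and map neutral labels (A, B, C, ...) to them."""
    rng = random.Random(seed)
    names = list(config_names)
    rng.shuffle(names)
    return {chr(ord("A") + i): name for i, name in enumerate(names)}


def tally(votes, key):
    """Count votes per config, given listener picks by label and the key."""
    return Counter(key[label] for label in votes)


# Keep the key hidden from the panel until voting is finished.
key = blind_order(["stability_0.45", "stability_0.50", "stability_0.70"])
placeholder_votes = ["A", "B", "A", "C", "A"]
print(tally(placeholder_votes, key).most_common())
```

Generating a fresh key for each listener also removes ordering effects, at the cost of slightly more bookkeeping.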