n8n, ElevenLabs, GPT-4o, Automation, Voice Synthesis

Building an end-to-end voice outreach pipeline in n8n: GPT-4o scripts, ElevenLabs cloning, and the tuning parameters I got wrong first

How I built an n8n workflow that turns a CRM lead into a personalized voicemail in a cloned voice in ~15 seconds, and what I learned about ElevenLabs stability, similarity boost, and why the defaults sound robotic.

May 31, 2026 · 7 min read
AI Voice Sales Outreach Engine

The idea sounds simple: a lead enters your CRM, and 15 seconds later your founder's cloned voice leaves them a personalized voicemail. The execution involves more decisions than that summary suggests.

The pipeline shape

The n8n workflow is webhook-triggered. A POST to /new-lead kicks off:

  1. Normalize the lead payload — CRMs ship inconsistent shapes. HubSpot uses firstName, Pipedrive uses first_name, Apollo uses first. A single Code node normalizes everything before anything downstream touches it. Fail fast on missing required fields rather than letting a malformed prompt reach GPT-4o.

  2. Enrich via Apollo (soft-fail) — If the lead comes in without a title or company context, Apollo fills the gap. The node uses neverError: true and an 8-second timeout. Enrichment is nice-to-have; the workflow must not block on a third-party rate limit.

  3. Build the prompt in JS — Prompt construction lives in a Code node, not expression syntax. The voicemail needs conditional logic based on seniority and industry that's unreadable as {{ }} expressions. Keeping the prompt next to the data shape that feeds it makes it maintainable.

  4. GPT-4o generates the script — Target: 70–95 words. Not less (too abrupt), not more (ElevenLabs charges per character, and a 300-word script triples cost while breaking the 30-second voicemail format). The prompt enforces this hard.

  5. Parse and validate — Word count check before any TTS spend is committed. A malformed GPT response gets routed to the error branch, not to ElevenLabs.

  6. ElevenLabs synthesizes with a cloned voice — The audio comes back as binary. n8n handles it natively via the binary item channel.

  7. S3 upload → Airtable log → Slack notification — The last three run in parallel branches. The webhook responder waits only on Airtable (source of truth). Slack is fire-and-forget.
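To make steps 1 and 5 concrete, here is a minimal sketch of the two guard rails in plain JavaScript, roughly what could sit inside the respective Code nodes. It is an illustration, not the workflow's actual code: the helper names (`pick`, `normalizeLead`, `validateScript`) are hypothetical, and treating `firstName` and `email` as the required fields is an assumption. Only the three first-name aliases and the 70–95 word target come from the steps above.

```javascript
// Hypothetical sketch of steps 1 and 5; not the workflow's actual Code nodes.

// First-name aliases named above: HubSpot firstName, Pipedrive first_name, Apollo first.
const FIRST_NAME_KEYS = ["firstName", "first_name", "first"];

// Return the first non-empty string found under any of the given keys.
function pick(payload, keys) {
  for (const key of keys) {
    const value = payload[key];
    if (typeof value === "string" && value.trim() !== "") return value.trim();
  }
  return undefined;
}

// Step 1: map a CRM payload onto one shape and fail fast on missing fields.
// Which fields count as required is an assumption made for this sketch.
function normalizeLead(payload) {
  const lead = {
    firstName: pick(payload, FIRST_NAME_KEYS),
    email: pick(payload, ["email"]),
    title: pick(payload, ["title"]), // optional; enrichment may fill it later
    company: pick(payload, ["company"]), // optional
  };
  const missing = ["firstName", "email"].filter((k) => lead[k] === undefined);
  if (missing.length > 0) {
    throw new Error(`Lead payload missing required fields: ${missing.join(", ")}`);
  }
  return lead;
}

// Step 5: check the word count before any TTS spend is committed.
function validateScript(script, minWords = 70, maxWords = 95) {
  const words =
    typeof script === "string" ? script.trim().split(/\s+/).filter(Boolean) : [];
  return {
    ok: words.length >= minWords && words.length <= maxWords,
    wordCount: words.length,
  };
}
```

Throwing in `normalizeLead` stops the execution before GPT-4o is called, and a `validateScript` result with `ok: false` is what would send the item down the error branch instead of to ElevenLabs.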

The voice tuning problem

This is the part I got wrong the first time.

ElevenLabs has four settings that matter: stability, similarity_boost, style, and use_speaker_boost. The defaults feel safe. They produce a voice that sounds like a well-rehearsed podcast intro — which is exactly wrong for a sales voicemail.

The settings I ended up with, and why:

{
  "stability": 0.45,
  "similarity_boost": 0.85,
  "style": 0.20,
  "use_speaker_boost": true
}

Stability 0.45 — At 0.7+, the voice sounds practiced and flat. At 0.2, it gets emotionally exaggerated in a distracting way. 0.45 sits in the warm-but-conversational zone where a voicemail should live.

Similarity boost 0.85 — Below 0.7, a cloned voice starts to drift toward generic. 0.85 keeps it recognizably this person without amplifying recording artifact noise from the source clips.

Style 0.20 — Style exaggeration above 0.3 adds a performative quality. Fine for audiobooks, wrong for a 25-second message. 0.2 keeps it natural.

Speaker boost on — Measurable improvement on cloned voice tracks at the cost of ~5% more latency. Worth it.

I validated these through a blind A/B listener test across 12 parameter combinations with a five-person panel. The losing configurations either sounded like a corporate IVR or an over-excited podcast host.
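For reference, this is roughly how those four settings travel in a text-to-speech request. The sketch only builds the request and sends nothing. The function name `buildTtsRequest` and its argument shape are hypothetical. The endpoint path, the `xi-api-key` header, and the `eleven_turbo_v2_5` model identifier reflect my understanding of the public ElevenLabs API, so check them against the current API reference before relying on them.

```javascript
// Hypothetical request builder; it constructs the HTTP request but does not send it.

// The tuned values from the settings block above.
const VOICE_SETTINGS = {
  stability: 0.45,
  similarity_boost: 0.85,
  style: 0.2,
  use_speaker_boost: true,
};

function buildTtsRequest({ voiceId, apiKey, text, modelId = "eleven_turbo_v2_5" }) {
  if (typeof text !== "string" || text.trim() === "") {
    throw new Error("Refusing to build a TTS request for an empty script");
  }
  // The three numeric settings are expected to lie in the 0..1 range.
  for (const key of ["stability", "similarity_boost", "style"]) {
    const value = VOICE_SETTINGS[key];
    if (value < 0 || value > 1) throw new Error(`${key} out of range: ${value}`);
  }
  return {
    method: "POST",
    url: `https://api.elevenlabs.io/v1/text-to-speech/${encodeURIComponent(voiceId)}`,
    headers: {
      "xi-api-key": apiKey,
      "Content-Type": "application/json",
      Accept: "audio/mpeg",
    },
    body: JSON.stringify({
      text,
      model_id: modelId,
      voice_settings: VOICE_SETTINGS,
    }),
  };
}
```

In n8n the same fields map onto an HTTP Request node: the URL and headers go into the node parameters and the body is sent as JSON, with the response format set to file so the audio arrives as binary.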

Cost per execution

| Component | Per run |
|---|---|
| GPT-4o (~600 in / ~120 out tokens) | ~$0.0036 |
| ElevenLabs turbo_v2_5 (~600 chars) | ~$0.18 |
| Apollo enrichment (1 call) | ~$0.01 |
| S3 + bandwidth | <$0.001 |
| Total | ≈ $0.20 |

Two optimizations baked in: turbo_v2_5 instead of multilingual_v2 (50% cheaper, near-identical quality on English), and Apollo is skipped entirely if the CRM payload already contains a title.
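The arithmetic behind the total can be sketched as a small estimator. The per-component figures are the ones quoted in this section. The per-character ElevenLabs rate is derived from them (~$0.18 for ~600 characters, so about $0.0003 per character) rather than taken from a price list, S3 is counted at its stated upper bound, and the function name `estimateRunCost` is hypothetical.

```javascript
// Hypothetical cost estimator built from the per-run figures quoted in this section.
const COSTS = {
  gpt4o: 0.0036, // ~600 input / ~120 output tokens
  elevenLabsPerChar: 0.18 / 600, // derived: ~$0.18 for ~600 characters
  apollo: 0.01, // one enrichment call
  s3: 0.001, // stated as "< $0.001"; the upper bound is used here
};

// hasTitle mirrors the second optimization: skip Apollo when the CRM payload
// already contains a title.
function estimateRunCost({ scriptChars = 600, hasTitle = false } = {}) {
  const enrichment = hasTitle ? 0 : COSTS.apollo;
  return COSTS.gpt4o + scriptChars * COSTS.elevenLabsPerChar + enrichment + COSTS.s3;
}
```

With the defaults this comes to about $0.195, which rounds to the ≈ $0.20 total, and passing `hasTitle: true` removes the one cent of Apollo spend. ElevenLabs dominates either way, which is why the word limit on the script matters more than any other knob.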

What n8n makes easy and what it doesn't

The Code node is the escape hatch. When an HTTP node's expression syntax gets unreadable, dropping into a Code node with proper JS is always the right call. Readable code beats a 4-line {{ }} mustache nightmare.

What n8n makes harder: binary file handling is not obvious. The ElevenLabs response comes back as audio bytes, and getting those into a named S3 upload required understanding how n8n's binary item channel works. Once it clicked, it was clean — but the documentation is sparse on this path.
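As a rough picture of what the binary item channel carries, here is a sketch of one item with audio attached. It assumes the common in-memory layout, where each item has a `json` part and a `binary` part keyed by property name, with the bytes stored as a base64 string. The function name `toBinaryItem`, the `leadId` field, and the file naming scheme are hypothetical. Inside a real Code node, I would expect the built-in helper `this.helpers.prepareBinaryData` to be the safer route, since n8n can also be configured to store binary data outside memory.

```javascript
// Hypothetical sketch of an n8n-style item carrying audio in its binary channel.
// Assumes the in-memory (base64) representation of binary data.
function toBinaryItem(audioBuffer, leadId) {
  const fileName = `voicemail-${leadId}.mp3`; // naming scheme is an assumption
  return {
    json: { leadId, fileName },
    binary: {
      // "data" is the binary property name a downstream upload node would read.
      data: {
        data: audioBuffer.toString("base64"),
        mimeType: "audio/mpeg",
        fileName,
        fileExtension: "mp3",
      },
    },
  };
}
```

The point that took me a while to internalize is the naming: a downstream upload node is told which binary property to read (`data` in this sketch), and the file name travels with that property rather than in the JSON.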

The error sub-workflow posts every failure to #automation-alerts with the failing node name and a direct execution link. That's table stakes for anything running in production.
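A minimal sketch of the alert formatting, assuming the error sub-workflow starts from n8n's Error Trigger and that its payload exposes `execution.lastNodeExecuted`, `execution.error.message`, `execution.url`, and `workflow.name`. Those field names are my understanding of that node's output and are worth verifying against a real failed execution. The function name `formatAlert` and the message wording are hypothetical.

```javascript
// Hypothetical formatter for the failure alert. Field names are assumed to
// follow the Error Trigger payload and should be verified on a real failure.
function formatAlert(errorData) {
  const execution = errorData.execution ?? {};
  const workflow = errorData.workflow ?? {};
  const node = execution.lastNodeExecuted ?? "unknown node";
  const message = execution.error?.message ?? "no error message";
  const link = execution.url ?? "(no execution link)";
  return {
    channel: "#automation-alerts",
    text: `${workflow.name ?? "Workflow"} failed at "${node}": ${message}\n${link}`,
  };
}
```

The fallbacks matter more than they look: an alert that itself throws on a missing field is the one failure nobody gets told about.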

Nasir Nasir-Ameen