STT → function calling → TTS: what building a voice CRM interface taught me about GPT-4o tool dispatch
How the Voice-Driven CRM Assistant works — Whisper transcription, three GPT-4o function definitions, Switch-node dispatch, Supabase memory, and the defensive edge cases that made it actually usable.
The Voice-Driven CRM Assistant does one thing: you drop a voice note in Slack, and it transcribes it, figures out what you want, runs the right CRM action against Supabase, and replies with a spoken confirmation.
The loop is: Whisper → GPT-4o (function calling) → Supabase → ElevenLabs → Slack.
Simple to describe, less simple to build correctly.
Why function calling and not a prompt-only approach
The naive approach is to ask GPT-4o "what does this transcript mean?" and parse a free-form reply. That breaks the moment the model decides to say "Sure! I'll create a contact for Maria Lopez at Acme Corporation" instead of a JSON object your Switch node can route.
Function calling solves this at the API boundary. I defined three tools:
{
  "name": "create_contact",
  "description": "Create a new contact in the CRM",
  "parameters": {
    "type": "object",
    "properties": {
      "name": { "type": "string" },
      "company": { "type": "string" },
      "email": { "type": "string" },
      "notes": { "type": "string" }
    },
    "required": ["name"]
  }
}
The other two are log_activity (contact_name, type, notes, duration_minutes) and query_contact (search_term). With tool_choice: "auto", GPT-4o picks the right tool or falls back to a direct text reply when the request doesn't map to any of them.
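For orientation, here is a minimal sketch of how a definition like create_contact could be placed into a request body, assuming the call goes to the Chat Completions endpoint. buildChatRequest is a hypothetical helper name, not something from the workflow; the tools envelope and tool_choice are the standard request fields.

```javascript
// Hypothetical helper: wraps bare function definitions in the
// { type: "function", function: {...} } envelope that the Chat Completions
// `tools` parameter expects.
function buildChatRequest(functionDefs, messages) {
  return {
    model: "gpt-4o",
    messages,
    tools: functionDefs.map((def) => ({ type: "function", function: def })),
    tool_choice: "auto", // the model may call a tool or answer in plain text
  };
}

// The create_contact definition from above, as a JavaScript object.
const createContact = {
  name: "create_contact",
  description: "Create a new contact in the CRM",
  parameters: {
    type: "object",
    properties: {
      name: { type: "string" },
      company: { type: "string" },
      email: { type: "string" },
      notes: { type: "string" },
    },
    required: ["name"],
  },
};

const request = buildChatRequest(
  [createContact],
  [{ role: "user", content: "Create a contact for Maria Lopez at Acme" }]
);
```

When the model chooses a tool, the call shows up under choices[0].message.tool_calls in the response; when it does not, the reply text is in choices[0].message.content.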
The arguments arrive as a JSON string per the OpenAI spec — not a parsed object. A Code node handles the JSON.parse and schema validation before the Switch routes downstream.
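A rough sketch of what that parse-and-validate step might look like. parseToolCall and the route names are hypothetical, and the required fields for log_activity and query_contact are assumptions, since only create_contact's schema is spelled out here.

```javascript
// Required string fields per tool. Only create_contact's list comes from the
// schema shown earlier; the other two are assumed for illustration.
const REQUIRED_FIELDS = new Map([
  ["create_contact", ["name"]],
  ["log_activity", ["contact_name", "type"]], // assumed
  ["query_contact", ["search_term"]], // assumed
]);

// Hypothetical helper: turns an assistant message into a routing decision
// that a downstream switch can branch on.
function parseToolCall(message) {
  const call = message.tool_calls && message.tool_calls[0];
  if (!call) {
    // No tool chosen: the model answered in plain text.
    return { route: "direct_reply", text: message.content };
  }

  const name = call.function.name;
  const required = REQUIRED_FIELDS.get(name);
  if (!required) {
    return { route: "error", error: `Unknown tool: ${name}` };
  }

  let args;
  try {
    // `arguments` is a JSON string, not an object.
    args = JSON.parse(call.function.arguments);
  } catch (err) {
    return { route: "error", error: `Invalid JSON arguments for ${name}` };
  }
  if (args === null || typeof args !== "object" || Array.isArray(args)) {
    return { route: "error", error: `Arguments for ${name} are not an object` };
  }

  const missing = required.filter(
    (field) => typeof args[field] !== "string" || args[field].trim() === ""
  );
  if (missing.length > 0) {
    return {
      route: "error",
      error: `Missing fields for ${name}: ${missing.join(", ")}`,
    };
  }
  return { route: name, args };
}
```

Returning an explicit error route, rather than throwing, keeps malformed model output on a path where the workflow can still answer the user.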
The memory problem
"Log a call with him too" — that only works if the model knows who "him" is from the previous turn.
The Supabase voice_conversations table stores every turn with a user_id and channel_id. Before each GPT-4o call, I fetch the last 10 turns for that user and replay them as the message history:
// recentTurns must be ordered oldest first so the replay reads chronologically
const history = recentTurns.map((t) => ({
  role: t.role, // "user" or "assistant"
  content: t.content, // transcript or reply text
}));
// Put the stored turns ahead of the new transcript
const messages = [...history, { role: "user", content: transcript }];
Ten turns was the practical limit for me: beyond that, token cost starts to matter and old context becomes noise. The window slides, so the oldest turns drop as new ones are added.
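The sliding window can be sketched as a pure function over the fetched rows. buildMessages is a hypothetical name, and created_at is an assumed timestamp column; the table description above only mentions user_id and channel_id.

```javascript
const MAX_TURNS = 10;

// Hypothetical helper. `rows` are voice_conversations records for one user,
// in any order; `created_at` is an assumed ISO 8601 timestamp column.
function buildMessages(rows, transcript, maxTurns = MAX_TURNS) {
  const sorted = [...rows].sort(
    (a, b) => Date.parse(a.created_at) - Date.parse(b.created_at) // oldest first
  );
  // Keep only the newest maxTurns; slice(-0) would keep everything, so guard it.
  const recent = maxTurns > 0 ? sorted.slice(-maxTurns) : [];
  const history = recent.map((t) => ({ role: t.role, content: t.content }));
  return [...history, { role: "user", content: transcript }];
}
```

Sorting inside the function means the query can return rows newest first with a limit, which is usually the cheaper way to fetch them.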
The Slack edge cases
The Slack Events API sends a url_verification challenge on first setup that must be echoed back immediately. Without a dedicated route node for this, the workflow processes the challenge as if it were an audio file and Slack never confirms the endpoint.
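The routing decision is small enough to show as a sketch. routeSlackRequest and its return shape are hypothetical; the url_verification and event_callback types and the challenge field are part of the Slack Events API payload.

```javascript
// Hypothetical router for the parsed JSON body of a Slack Events API request.
function routeSlackRequest(body) {
  if (body && body.type === "url_verification") {
    // Echo the challenge back; Slack accepts it as a JSON field.
    return {
      route: "challenge",
      status: 200,
      response: { challenge: body.challenge },
    };
  }
  if (body && body.type === "event_callback" && body.event) {
    // A real event: acknowledge now, process afterwards.
    return { route: "event", status: 200, eventType: body.event.type };
  }
  return { route: "ignore", status: 200 };
}
```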
The file_shared event also fires for non-audio files (screenshots, PDFs, anything). A mime-type check before the Whisper call routes non-audio events to a no-op path.
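A minimal sketch of that check, assuming the file's metadata has already been fetched as a Slack file object with a mimetype field. isAudioFile is a hypothetical name.

```javascript
// Hypothetical guard. `file` is assumed to be a Slack file object, which
// carries a `mimetype` string such as "audio/mp4" or "image/png".
function isAudioFile(file) {
  const mimetype =
    file && typeof file.mimetype === "string" ? file.mimetype : "";
  return mimetype.toLowerCase().startsWith("audio/");
}
```

If voice notes in a given workspace arrive under a different mimetype, the prefix test would need widening; the sketch only covers the audio/* family.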
The immediate 200 response is non-negotiable. Slack expects an acknowledgement within 3 seconds and retries the event if it doesn't get one. That means the audio download, Whisper transcription, and GPT-4o call all happen after the webhook has already responded, via a "Respond to Webhook" node positioned early in the flow.
Whisper hallucination
A 1-second voice clip can produce a full transcript from Whisper — it just makes it up. I gate on duration_seconds > 1.5 and transcript.split(' ').length > 2 before spending GPT-4o tokens on clearly invalid input. This catches most of the noise without blocking legitimate short commands.
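The gate above, sketched as a function. isUsableTranscript is a hypothetical name, and the sketch deviates in one small way: it splits on runs of whitespace rather than a single space, so stray double spaces do not inflate the word count.

```javascript
const MIN_DURATION_SECONDS = 1.5;
const MIN_WORDS = 3; // same threshold as "more than 2 words"

// Hypothetical gate: true only when the clip is long enough and the
// transcript has enough words to be worth sending to GPT-4o.
function isUsableTranscript(durationSeconds, transcript) {
  const words = String(transcript || "")
    .trim()
    .split(/\s+/)
    .filter(Boolean); // drops the empty string produced by empty input
  return durationSeconds > MIN_DURATION_SECONDS && words.length >= MIN_WORDS;
}
```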
What GPT-4o function calling gets right
The tool dispatch is remarkably reliable. "Create a contact for Maria Lopez at Acme" consistently hits create_contact, and "What do we know about the Northwind account" consistently hits query_contact. It handles paraphrasing, filler words, and casual phrasing without needing keyword matching.
What it doesn't handle well: ambiguous multi-action requests ("create a contact and log a meeting with them"). The model picks one. For the use case here (quick voice notes, not multi-step automation), that's acceptable — but it's worth knowing going in.