STT → function calling → TTS: what building a voice CRM interface taught me about GPT-4o tool dispatch

The Voice-Driven CRM Assistant does one thing: you drop a voice note in Slack, and it transcribes it, figures out what you want, runs the right CRM action against Supabase, and replies with a spoken confirmation.

The loop is: Whisper → GPT-4o (function calling) → Supabase → ElevenLabs → Slack.

Simple to describe, less simple to build correctly.

Why function calling and not a prompt-only approach

The naive approach is to ask GPT-4o "what does this transcript mean?" and parse a free-form reply. That breaks the moment the model decides to say "Sure! I'll create a contact for Maria Lopez at Acme Corporation" instead of a JSON object your Switch node can route.

Function calling solves this at the API boundary. I defined three tools:

{
  "name": "create_contact",
  "description": "Create a new contact in the CRM",
  "parameters": {
    "type": "object",
    "properties": {
      "name": { "type": "string" },
      "company": { "type": "string" },
      "email": { "type": "string" },
      "notes": { "type": "string" }
    },
    "required": ["name"]
  }
}

The other two are log_activity (contact_name, type, notes, duration_minutes) and query_contact (search_term). With tool_choice: "auto", GPT-4o picks the right tool or falls back to a direct text reply when the request doesn't map to any of them.

The arguments arrive as a JSON string per the OpenAI spec — not a parsed object. A Code node handles the JSON.parse and schema validation before the Switch routes downstream.
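As a rough illustration of that parse-and-validate step, here is a minimal sketch of what such a Code node might do. The function name parseToolCall and the REQUIRED_FIELDS map are hypothetical. Only create_contact's required list comes from the schema above; the required fields for the other two tools are assumptions.

```javascript
// Hypothetical required-field map. Only create_contact's entry is taken from
// the schema in the text; the other two entries are assumptions.
const REQUIRED_FIELDS = new Map([
  ["create_contact", ["name"]],
  ["log_activity", ["contact_name"]], // assumption
  ["query_contact", ["search_term"]], // assumption
]);

// Takes one entry of message.tool_calls and returns either
// { ok: true, name, args } or { ok: false, error }.
function parseToolCall(toolCall) {
  const name = toolCall?.function?.name;
  const required = REQUIRED_FIELDS.get(name);
  if (!required) {
    return { ok: false, error: `unknown tool: ${name}` };
  }

  let args;
  try {
    // function.arguments is a JSON string and may be malformed
    args = JSON.parse(toolCall.function.arguments);
  } catch (err) {
    return { ok: false, error: `invalid JSON arguments: ${err.message}` };
  }
  if (args === null || typeof args !== "object" || Array.isArray(args)) {
    return { ok: false, error: "arguments must be a JSON object" };
  }

  // Treat a required field as missing if it is absent, non-string, or blank
  const missing = required.filter(
    (key) => typeof args[key] !== "string" || args[key].trim() === ""
  );
  if (missing.length > 0) {
    return { ok: false, error: `missing required fields: ${missing.join(", ")}` };
  }
  return { ok: true, name, args };
}
```

Returning a result object instead of throwing lets the Switch route on ok and send failures to a fallback reply path.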

The memory problem

"Log a call with him too" — that only works if the model knows who "him" is from the previous turn.

The Supabase voice_conversations table stores every turn with a user_id and channel_id. Before each GPT-4o call, I fetch the last 10 turns for that user and replay them as the message history:

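// recentTurns must be ordered oldest first so the model reads the turns in sequence
const history = recentTurns.map((t) => ({
  role: t.role,        // "user" or "assistant"
  content: t.content,  // transcript or reply text
}));
// Replay the history, then append the current transcript as the newest user message
const messages = [...history, { role: "user", content: transcript }];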

10 turns is the practical limit before token cost starts to matter and old context becomes noise. The window slides — oldest turns drop as new ones are added.
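The sliding window can be sketched as a small helper that sorts the stored turns and keeps only the newest ones. This is an illustration, not the workflow's actual code: buildMessages is a hypothetical name, and it assumes each row carries a created_at timestamp, which the text does not mention.

```javascript
// Hypothetical helper: keep the newest `maxTurns` turns, oldest first,
// then append the current transcript as the latest user message.
// Assumes each turn has a `created_at` timestamp (not stated in the text).
function buildMessages(turns, transcript, maxTurns = 10) {
  const recent = [...turns]
    .sort((a, b) => new Date(a.created_at) - new Date(b.created_at)) // oldest first
    .slice(-maxTurns); // drop everything older than the window

  const history = recent.map((t) => ({ role: t.role, content: t.content }));
  return [...history, { role: "user", content: transcript }];
}
```

Copying the array before sorting keeps the caller's rows untouched, which matters if the same rows are reused later in the workflow.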

The Slack edge cases

The Slack Events API sends a url_verification challenge on first setup that must be echoed back immediately. Without a dedicated route node for this, the workflow processes the challenge as if it were an audio file and Slack never confirms the endpoint.
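A minimal sketch of that routing decision, written as a plain function. The name routeSlackRequest and the returned shape are hypothetical; the only thing taken from Slack's protocol is that a url_verification payload carries a challenge value that must be sent back.

```javascript
// Hypothetical router: decide whether an incoming Events API body is the
// one-time verification handshake or a real event.
function routeSlackRequest(body) {
  if (body && body.type === "url_verification") {
    // Echo the challenge straight back; no further processing
    return { route: "verify", status: 200, response: { challenge: body.challenge } };
  }
  // Everything else continues into the normal event pipeline
  return { route: "event", status: 200, response: "" };
}
```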

The file_shared event also fires for non-audio files (screenshots, PDFs, anything). A mime-type check before the Whisper call routes non-audio events to a no-op path.
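The check itself can be very small. The sketch below assumes the file's metadata has already been fetched and exposes a mimetype string, as Slack file objects do. isAudioFile is a hypothetical name, and accepting anything under audio/ is an assumption about which formats the transcription step should see.

```javascript
// Hypothetical guard: true only when the file metadata reports an audio MIME type.
// Assumes `file.mimetype` is present, e.g. "audio/webm" or "image/png".
function isAudioFile(file) {
  const mime = file?.mimetype;
  return typeof mime === "string" && mime.toLowerCase().startsWith("audio/");
}
```

Anything that fails the check, including files with no mimetype at all, goes to the no-op path.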

The immediate 200 response is non-negotiable. Slack expects an acknowledgement within 3 seconds and retries the event if it doesn't get one, which can send the same voice note through the pipeline more than once. That means the audio download, Whisper transcription, and GPT-4o call all happen after the webhook has already responded, via a "Respond to Webhook" node positioned early in the flow.

Whisper hallucination

A 1-second voice clip can produce a full transcript from Whisper — it just makes it up. I gate on duration_seconds > 1.5 and transcript.split(' ').length > 2 before spending GPT-4o tokens on clearly invalid input. This catches most of the noise without blocking legitimate short commands.
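The gate can be sketched as a single predicate. One deliberate difference from the inline expression above: this version splits on any run of whitespace instead of a single space, so double spaces or trailing whitespace do not inflate the word count. The function name shouldProcess is hypothetical.

```javascript
// Hypothetical gate: skip clips that are too short or transcripts with too few words.
// Thresholds (1.5 seconds, more than 2 words) come from the text.
function shouldProcess(durationSeconds, transcript) {
  const words = (transcript ?? "")
    .trim()
    .split(/\s+/) // any whitespace run, not just a single space
    .filter(Boolean); // an empty transcript yields zero words
  return durationSeconds > 1.5 && words.length > 2;
}
```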

What GPT-4o function calling gets right

The tool dispatch is reliable. "Create a contact for Maria Lopez at Acme" hits create_contact. "What do we know about the Northwind account" hits query_contact. It handles paraphrasing, filler words, and casual phrasing without needing keyword matching.

What it doesn't handle well: ambiguous multi-action requests ("create a contact and log a meeting with them"). In my testing, the model picks one. For the use case here (quick voice notes, not multi-step automation), that's acceptable, but it's worth knowing going in.