Call Turn-Taking (Legacy Orchestrator)

This page captures the “why” behind common reports like “EOT/VAD is slow” or “the agent talks over the user”, and maps that behavior back to the legacy orchestrator implementation. Relevant implementation:
  • wonderful-controller/components/orchestrators/call_orchestrator/orchestrator.go (onTranscription)
  • common/transcription/transcription.go (interrupt thresholds + backchannel detection)

Mental model

Turn-taking in the legacy orchestrator has two distinct mechanisms:
  1. Interrupts (barge-in): stops agent playback quickly when we detect the user is speaking while the agent is talking.
  2. End-of-turn (EOT): decides when the user finished a turn so we can send the final user message to the agent/LLM.
These are intentionally decoupled: EOT latency is not interrupt latency, so a slow end-of-turn decision does not by itself explain a slow barge-in.

Interrupts (barge-in)

In onTranscription, we attempt to interrupt as soon as an interim transcript indicates real user speech:
  • Interim transcripts typically have Finished=false and Confidence=0.
  • We only consider an interrupt once there is enough text signal (currently at least 3 characters; see InterruptThreshold in common/transcription/transcription.go).
  • If an interrupt happens, we immediately stop playback and proceed to process the user speech without waiting for a final transcript.
Why you might not see an interrupt:
  • Non-interruptible gate / flow-state: certain agent messages intentionally disable interrupts temporarily.
  • Backchannels / non-speech utterances: we intentionally avoid interrupting on acknowledgements like “si”, “ok”, “yeah”, “כן כן”, etc.
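The interrupt conditions above can be sketched as a single decision function. This is a minimal illustration, not the code in onTranscription: the Transcript struct, shouldInterrupt, its parameters, and the choice to count runes rather than bytes are assumptions made for the example. Only the Finished and Confidence fields, the 3-character threshold, the non-interruptible gate, and the backchannel exclusion come from this page.

```go
package main

import (
	"strings"
	"unicode/utf8"
)

// Transcript is a hypothetical stand-in for a transcriber event.
// Only Finished and Confidence are named on this page; Text is assumed.
type Transcript struct {
	Text       string
	Finished   bool
	Confidence float64
}

// interruptThreshold mirrors the "at least 3 characters" rule. The real value
// is InterruptThreshold in common/transcription/transcription.go.
const interruptThreshold = 3

// shouldInterrupt reports whether a transcript should stop agent playback.
// agentSpeaking and interruptible model the playback state and the
// non-interruptible gate; isBackchannel is injected so the sketch stays
// independent of the real per-language patterns.
func shouldInterrupt(t Transcript, agentSpeaking, interruptible bool, isBackchannel func(string) bool) bool {
	if !agentSpeaking || !interruptible {
		return false // nothing to interrupt, or interrupts are gated off
	}
	text := strings.TrimSpace(t.Text)
	// Assumption: the threshold counts characters (runes), not bytes.
	if utf8.RuneCountInString(text) < interruptThreshold {
		return false // not enough text signal yet
	}
	if isBackchannel != nil && isBackchannel(text) {
		return false // acknowledgements do not interrupt
	}
	return true
}
```

With this shape, an interim transcript of "wait" during interruptible playback returns true, while the same text behind the non-interruptible gate, or a two-character fragment such as "ok", returns false.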

Backchannels (why “si si” usually won’t interrupt)

While the agent is speaking, we treat “irrelevant utterances” as non-interrupting. That includes:
  • very short fragments, and
  • language-specific backchannel patterns (regular expressions per language).
Italian is explicitly covered (examples include “si” / “sì” and common variants), so those acknowledgements generally should not trigger an interrupt. Backchannel patterns live under common/i18n/providers/* and are consumed by common/transcription/transcription.go.
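To make the per-language pattern idea concrete, here is a small sketch of a backchannel matcher. The backchannelPatterns table, the isBackchannel helper, and the specific word lists are hypothetical and deliberately tiny; the real patterns under common/i18n/providers/* are likely broader and may be structured differently.

```go
package main

import (
	"regexp"
	"strings"
)

// backchannelPatterns is a hypothetical per-language table keyed by language
// code. Each pattern matches an utterance made up only of acknowledgements,
// optionally repeated ("si si") and optionally ending in punctuation.
var backchannelPatterns = map[string]*regexp.Regexp{
	"it": regexp.MustCompile(`^(s[iì]|ok|certo)([\s,]+(s[iì]|ok|certo))*[.!]?$`),
	"en": regexp.MustCompile(`^(ok|okay|yeah|yes|mhm)([\s,]+(ok|okay|yeah|yes|mhm))*[.!]?$`),
}

// isBackchannel reports whether text consists solely of backchannel words
// for the given language. Unknown languages never match, so they fall back
// to normal interrupt handling.
func isBackchannel(lang, text string) bool {
	re, ok := backchannelPatterns[lang]
	if !ok {
		return false
	}
	return re.MatchString(strings.ToLower(strings.TrimSpace(text)))
}
```

Anchoring the pattern at both ends matters: "sì, sì" matches, but "si ma aspetta" does not, because the utterance continues with real content after the acknowledgement.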

End-of-turn (EOT)

EOT is evaluated after we receive a final transcript from the transcriber:
  • Final transcripts are Finished=true (typically emitted when the transcriber sends an <end> token).
  • After a final transcript arrives, the orchestrator runs the EOT detector (endOfTurnDetector.DetectEndOfTurn).
  • If EOT is detected, we trigger user speech processing.
There is also a backup timer: when the EOT detector returns “not end of turn”, we schedule a wakeup so the system does not stall if the detector or transcriber misses a turn boundary.

Why it can “feel slow” (and why it’s not always EOT)

If the user audio is degraded, the system can appear to “talk over the user” even when the orchestrator logic is correct:
  • A temporarily bad telephony signal can make real speech look like noise/non-speech.
  • Denoising can sometimes harm the signal in specific segments.
  • The transcriber may only begin producing speech-like transcripts once the audio becomes clearly intelligible.
In those cases we will not interrupt early because we are not receiving enough reliable transcript signal to satisfy the interrupt conditions.

Debug checklist

When investigating “slow interrupt” / “agent talked over user”:
  1. Verify whether the user speech was recognized as a backchannel/non-speech utterance during agent playback.
  2. Compare original vs denoised audio on the problematic segment (some calls are “mixed”: one denoiser model helps one portion and harms another).
  3. Inspect when the transcriber first produced interim text and when it produced the final <end> / Finished=true transcript.
  4. Check whether interrupts were disabled (non-interruptible gate / flow-state).