Call Turn-Taking (Legacy Orchestrator)
This page captures the “why” behind common reports like “EOT/VAD is slow” or “the agent talks over the user”, and maps that behavior back to the legacy orchestrator implementation.
Relevant implementation:
- wonderful-controller/components/orchestrators/call_orchestrator/orchestrator.go (onTranscription)
- common/transcription/transcription.go (interrupt thresholds + backchannel detection)
Mental model
Turn-taking in the legacy orchestrator has two distinct mechanisms:
- Interrupts (barge-in): stops agent playback quickly when we detect the user is speaking while the agent is talking.
- End-of-turn (EOT): decides when the user finished a turn so we can send the final user message to the agent/LLM.
Interrupts (barge-in)
In onTranscription, we attempt to interrupt as soon as an interim transcript indicates real user speech:
- Interim transcripts are typically Finished=false with Confidence==0.
- We only consider an interrupt once there is enough text signal (currently at least 3 characters, see InterruptThreshold in common/transcription/transcription.go).
- If an interrupt happens, we immediately stop playback and proceed to process the user speech without waiting for a final transcript.
- Non-interruptible gate / flow-state: certain agent messages intentionally disable interrupts temporarily.
- Backchannels / non-speech utterances: we intentionally avoid interrupting on acknowledgements like “si”, “ok”, “yeah”, “כן כן”, etc.
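The conditions above can be read as a single gate. The Go sketch below is illustrative only and is not the actual onTranscription code: the Transcript struct, shouldInterrupt, and the isBackchannel callback are hypothetical names, and counting the threshold in runes rather than bytes is an assumption.

```go
package main

import (
	"strings"
	"unicode/utf8"
)

// Transcript is a hypothetical, simplified view of a transcriber event.
type Transcript struct {
	Text       string
	Finished   bool
	Confidence float64
}

// interruptThreshold mirrors the "at least 3 characters" rule. The real value
// is InterruptThreshold in common/transcription/transcription.go.
const interruptThreshold = 3

// shouldInterrupt reports whether a transcript should stop agent playback.
// Assumption: length is measured in runes so that non-Latin scripts count
// per character.
func shouldInterrupt(t Transcript, agentSpeaking, interruptible bool, isBackchannel func(string) bool) bool {
	// Nothing to interrupt, or interrupts are gated off
	// (non-interruptible gate / flow-state).
	if !agentSpeaking || !interruptible {
		return false
	}
	text := strings.TrimSpace(t.Text)
	// Not enough text signal yet.
	if utf8.RuneCountInString(text) < interruptThreshold {
		return false
	}
	// Acknowledgements such as "ok" or "si si" do not interrupt.
	if isBackchannel(text) {
		return false
	}
	return true
}
```

Read this way, a "slow interrupt" report usually traces back to one of the three early returns: the gate was closed, the interim text was still too short, or the utterance was classified as a backchannel.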
Backchannels (why “si si” usually won’t interrupt)
While the agent is speaking, we treat “irrelevant utterances” as non-interrupting. That includes:
- very short fragments, and
- language-specific backchannel patterns (regular expressions per language).
The per-language patterns are defined under common/i18n/providers/* and are consumed by common/transcription/transcription.go.
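As a rough illustration of per-language backchannel matching, here is a minimal Go sketch. The pattern table, the minFragmentRunes value, and the isIrrelevantUtterance name are all assumptions made for this example; the real patterns in common/i18n/providers/* may differ substantially.

```go
package main

import (
	"regexp"
	"strings"
	"unicode/utf8"
)

// minFragmentRunes is a hypothetical cutoff for "very short fragments".
const minFragmentRunes = 3

// backchannelPatterns is a hypothetical per-language table. Each pattern
// matches an utterance made up only of acknowledgement words, optionally
// separated by whitespace or light punctuation.
var backchannelPatterns = map[string]*regexp.Regexp{
	"en": regexp.MustCompile(`(?i)^(?:(?:ok|okay|yeah|yes|mhm)[\s,.!]*)+$`),
	"es": regexp.MustCompile(`(?i)^(?:(?:s[ií]|vale|claro)[\s,.!]*)+$`),
	"he": regexp.MustCompile(`^(?:(?:כן|בסדר)[\s,.!]*)+$`),
}

// isIrrelevantUtterance reports whether text should be ignored for barge-in
// purposes while the agent is speaking.
func isIrrelevantUtterance(lang, text string) bool {
	text = strings.TrimSpace(text)
	// Very short fragments never count as real speech.
	if utf8.RuneCountInString(text) < minFragmentRunes {
		return true
	}
	// Languages without a pattern fall through to "relevant".
	re, ok := backchannelPatterns[lang]
	return ok && re.MatchString(text)
}
```

Because each pattern is anchored at both ends, “si si” is treated as a backchannel while “ok but wait” is not: any word outside the acknowledgement set makes the whole utterance relevant.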
End-of-turn (EOT)
EOT is evaluated after we receive a final transcript from the transcriber:
- Final transcripts are Finished=true (typically emitted when the transcriber sends an <end> token).
- After a final transcript arrives, the orchestrator runs the EOT detector (endOfTurnDetector.DetectEndOfTurn).
- If EOT is detected, we trigger user speech processing.
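The EOT path can be sketched in Go as follows. Only the name DetectEndOfTurn comes from the orchestrator; its signature here, the EndOfTurnDetector interface, handleTranscript, and the punctuation-based toy detector are assumptions used purely for illustration.

```go
package main

import "strings"

// Transcript is a hypothetical, simplified view of a transcriber event.
type Transcript struct {
	Text     string
	Finished bool
}

// EndOfTurnDetector is a hypothetical interface; the real detector's
// signature is not shown in this page.
type EndOfTurnDetector interface {
	DetectEndOfTurn(text string) bool
}

// punctuationDetector is a toy stand-in that treats sentence-final
// punctuation as end of turn. It is not the production detector.
type punctuationDetector struct{}

func (punctuationDetector) DetectEndOfTurn(text string) bool {
	text = strings.TrimSpace(text)
	return strings.HasSuffix(text, ".") ||
		strings.HasSuffix(text, "?") ||
		strings.HasSuffix(text, "!")
}

// handleTranscript runs the EOT check only for final transcripts and calls
// process when the turn is considered complete. It returns true when user
// speech processing was triggered.
func handleTranscript(t Transcript, d EndOfTurnDetector, process func(string)) bool {
	if !t.Finished {
		return false // interim transcripts never trigger EOT
	}
	if !d.DetectEndOfTurn(t.Text) {
		return false // final transcript, but the user may continue
	}
	process(t.Text)
	return true
}
```

The point of the sketch is the ordering: EOT latency is bounded below by the time the transcriber takes to emit a final transcript, so a late final looks like a slow EOT even when the detector itself responds immediately.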
Why it can “feel slow” (and why it’s not always EOT)
If the user audio is degraded, the system can appear to “talk over the user” even when the orchestrator logic is correct:
- A temporarily bad telephony signal can make real speech look like noise/non-speech.
- Denoising can sometimes harm the signal in specific segments.
- The transcriber may only begin producing speech-like transcripts once the audio becomes clearly intelligible.
Debug checklist
When investigating “slow interrupt” / “agent talked over user”:
- Verify whether the user speech was recognized as a backchannel/non-speech utterance during agent playback.
- Compare original vs denoised audio on the problematic segment (some calls are “mixed”: one denoiser model helps one portion and harms another).
- Inspect when the transcriber first produced interim text and when it produced the final <end> / Finished=true transcript.
- Check whether interrupts were disabled (non-interruptible gate / flow-state).