It tracks turn boundaries, builds turn summaries, computes durations, and writes one persisted latency record per communication.
What changed in this model
- VAD events are now treated as a call-level global stream.
- Voice orchestrator turn advancement still supports VAD:speech_started.
- Turn-start derivation for turns after 0 uses the closest global VAD:speech_ended with a 3-second distance guard.
- All duration computation is centralized in duration_calculator.go.
- Dashboard latency exposure is restricted to agent_latency.
Event model
There are two event scopes:
- Turn-scoped events (sanitized timeline used for per-turn breakdowns).
- Call-scoped VAD events (raw, unsanitized stream used for turn-start and human-speech totals).
Call-scoped VAD events are emitted in the breakdown payload as VADEvents.
Turn boundaries
- Turn 0 starts when call_started is marked.
- STT boundaries:
  - interim_transcription always starts a new turn.
  - finished_transcription starts a new turn only when no interim transcription was seen in that turn.
- orchestrator:user_heard_all_data is emitted only from telephony onUserFinishedHearing callbacks. It is not synthesized as part of STT/VAD turn-boundary transitions.
- Recorder stop closure always emits turn_finish.description=recorder_stopped.
- orchestrator:initial_message_completed is a marker event and does not advance turn boundaries.
- Voice orchestrator boundaries:
  - VAD:speech_started remains a valid boundary event for turn advancement.
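The boundary rules above can be read as a small decision function. The following is a minimal sketch, not the recorder's actual code: the function name, its parameters, and the assumption that VAD:speech_started advances turns only under the voice orchestrator are all hypothetical.

```go
package main

// startsNewTurn reports whether an event advances to a new turn.
// Hypothetical helper: the real recorder may track more state than this.
func startsNewTurn(event string, sawInterimInTurn, voiceOrchestrator bool) bool {
	switch event {
	case "interim_transcription":
		// An interim transcription always starts a new turn.
		return true
	case "finished_transcription":
		// Only a boundary when no interim transcription was seen in this turn.
		return !sawInterimInTurn
	case "VAD:speech_started":
		// Assumption: treated as a boundary only for the voice orchestrator.
		return voiceOrchestrator
	default:
		// Marker events such as orchestrator:initial_message_completed
		// and orchestrator:user_heard_all_data do not advance turns.
		return false
	}
}
```

For example, a finished_transcription that follows an interim one in the same turn returns false, so the turn index stays put.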
Turn-start derivation for turns > 0
Turn start is chosen with this priority:
- Build the sanitized in-turn event window (same event window persisted for turn timelines).
- Compute the turn’s first event from that sanitized window.
- Back-search within that same window for the latest VAD:speech_ended at or before the first event.
- Accept it only when absolute distance is <= 1200ms.
- Otherwise, fall back to: latest finished_transcription timestamp - silence_detection_threshold.
Related thresholds:
- Speech detection threshold: VAD speech_start_frames * frame_duration_ms (default 300ms)
- Silence detection threshold: VAD speech_end_frames * frame_duration_ms (default 500ms)
- Closest-silence max distance: 1200ms
Duration calculation ownership
duration_calculator.go owns all duration math:
- Per-turn durations (stt_tail_latency_ms, tts_total_ms, gaps, pipeline total, etc.)
- Call-level durations:
  - total_call_duration
  - agent_speech_duration_ms
  - human_speech_duration_ms
latency_recorder.go is responsible for event recording and turn-boundary behavior, not metric math.
breakdown_writer.go formats payloads from calculator outputs.
Call-level metric definitions
- total_call_duration:
  - Earliest turn start to latest turn stop for the communication.
- agent_speech_duration_ms:
  - Sum, per turn, of Telephony:start to earliest of:
    - user_heard_all_data
    - user-started turn_finish.
- human_speech_duration_ms:
  - Sum of matched global VAD:speech_started -> VAD:speech_ended spans that meet the configured VAD speech-start threshold.
Dashboard exposure
Only agent_latency is exposed in dashboard query entities.
Breakdown payload structure
Top-level fields:
- OrchestratorType
- VADEvents (global call-level VAD list, optional)
- CallDurations (call-level aggregates, optional)
- Turns
Each entry in Turns is a turn summary object that includes turn metadata plus:
- Durations
- Events
Durations and Events are intentionally emitted as the final fields in each turn summary object.
Turn-level Events omit VAD records so VAD appears once at call level.
TurnSummary.StopReason is built from definitive orchestrator events only, using deterministic tokens joined with |:
- turn_finish or turn_finish.description when provided
- user_heard_all_data when present
- idle_timeout_warning when present
- idle_timeout_fired when present
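A minimal sketch of building such a token string is shown below. The stopSignals type and its fields are hypothetical, the token order simply follows the list above, and the reading that the description value replaces the plain turn_finish token when provided is an assumption.

```go
package main

import "strings"

// stopSignals is a hypothetical summary of the definitive events seen in a turn.
type stopSignals struct {
	TurnFinished          bool
	TurnFinishDescription string // e.g. "recorder_stopped"; empty when absent
	UserHeardAllData      bool
	IdleTimeoutWarning    bool
	IdleTimeoutFired      bool
}

// buildStopReason joins the tokens that apply with "|", in a fixed order.
func buildStopReason(s stopSignals) string {
	var tokens []string
	if s.TurnFinished {
		if s.TurnFinishDescription != "" {
			// Assumption: the description stands in for the plain token.
			tokens = append(tokens, s.TurnFinishDescription)
		} else {
			tokens = append(tokens, "turn_finish")
		}
	}
	if s.UserHeardAllData {
		tokens = append(tokens, "user_heard_all_data")
	}
	if s.IdleTimeoutWarning {
		tokens = append(tokens, "idle_timeout_warning")
	}
	if s.IdleTimeoutFired {
		tokens = append(tokens, "idle_timeout_fired")
	}
	return strings.Join(tokens, "|")
}
```

Because the order is fixed, the same set of events always yields the same string, which keeps the value usable as a grouping key.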
CallDurations contains:
- total_call_duration_ms
- agent_speech_duration_ms
- human_speech_duration_ms
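The payload shape described in this section could be modeled roughly as follows. This is a sketch only: the Go type names, the element types of Events and Durations, the TurnID field, and the JSON tags are assumptions. It relies on encoding/json emitting struct fields in declaration order, which is one way to keep Durations and Events last in each turn summary.

```go
package main

import "encoding/json"

// callDurations mirrors the three call-level aggregates.
type callDurations struct {
	TotalCallDurationMs   int64 `json:"total_call_duration_ms"`
	AgentSpeechDurationMs int64 `json:"agent_speech_duration_ms"`
	HumanSpeechDurationMs int64 `json:"human_speech_duration_ms"`
}

// vadEvent is a hypothetical call-level VAD record.
type vadEvent struct {
	Name string
	AtMs int64
}

// turnSummary declares Durations and Events last so they are emitted last.
type turnSummary struct {
	TurnID     int    // hypothetical turn metadata field
	StopReason string
	Durations  map[string]int64
	Events     []string // turn-level events, without VAD records
}

// breakdown is the top-level payload; optional fields are omitted when empty.
type breakdown struct {
	OrchestratorType string
	VADEvents        []vadEvent     `json:",omitempty"`
	CallDurations    *callDurations `json:",omitempty"`
	Turns            []turnSummary
}

// encode serializes the payload to a JSON string.
func encode(b breakdown) (string, error) {
	raw, err := json.Marshal(b)
	return string(raw), err
}
```

When VADEvents is nil and CallDurations is a nil pointer, neither key appears in the output, which matches their optional status.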
EoTQueryTimeout.DurationMs (eot_query_timeout) is measured from EoT:start.
EoTFalseNegativeTimeout.DurationMs (eot_timeout_false_negative) is measured from decision-bearing EoT:finish.
eot_latency_ms is measured from EoT:start to the first terminal EoT outcome in the turn:
- EoT:finish with decision=true
- EoT:eot_query_timeout
- EoT:eot_timeout_false_negative
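The eot_latency_ms rule can be sketched as a single pass over a turn's EoT events. The eotEvent type and the function name are hypothetical, and the sketch assumes events are sorted by time and that measurement starts at the first EoT:start in the turn.

```go
package main

// eotEvent is a hypothetical, minimal view of an EoT event in one turn.
type eotEvent struct {
	Name     string // e.g. "EoT:start", "EoT:finish"
	AtMs     int64  // event timestamp in milliseconds
	Decision bool   // only meaningful for EoT:finish
}

// eotLatencyMs returns the time from EoT:start to the first terminal outcome.
// The second result is false when no start or no terminal outcome exists.
func eotLatencyMs(events []eotEvent) (int64, bool) {
	var startMs int64
	started := false
	for _, e := range events {
		if e.Name == "EoT:start" {
			if !started {
				startMs, started = e.AtMs, true
			}
			continue
		}
		if !started {
			continue // outcomes before any start are ignored
		}
		terminal := (e.Name == "EoT:finish" && e.Decision) ||
			e.Name == "EoT:eot_query_timeout" ||
			e.Name == "EoT:eot_timeout_false_negative"
		if terminal {
			return e.AtMs - startMs, true
		}
	}
	return 0, false
}
```

An EoT:finish with decision=false is skipped, so the clock keeps running until a positive decision or one of the two timeout events arrives.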
Runtime logs
- "Turn latency breakdown" includes per-turn latency fields and flattened summary fields (no nested TurnSummary object).
  - Core fields:
    - turn_start_timestamp
    - turn_stop_timestamp
    - user_heard_all_data
    - tools_called
    - tools (when present)
  - Timeout/idle duration fields (only when present):
    - eot_query_timeout_duration_ms
    - eot_false_negative_timeout_duration_ms
    - idle_timeout_warning_duration_ms
    - idle_timeout_fired_duration_ms
  - Grouped component sections for easier log inspection:
    - stt
      - stt_ms
      - provider, model
    - eot
      - eot_ms
      - provider, model
      - has_query_timeout
      - has_false_negative_timeout
    - llm
      - llm_text_ttft_ms
      - llm_text_total_ms
      - llm_audio_ttft_ms
      - llm_audio_total_ms
      - llm_function_call_request_average_ms
      - provider, model
    - tts
      - tts_ttft_ms
      - tts_total_ms
      - tts_cache_lookup_avg_ms
      - provider, model
    - etc (remaining cross-component/gap durations)
      - agent_latency_ms
      - silence_to_llm_first_token_ms
      - stt_to_llm_gap_ms
      - llm_to_tts_gap_ms
      - tts_to_telephony_gap_ms
      - pipeline_total_ms
      - stt_to_eot_ms
      - eot_to_tts_ms
      - llm_to_tts_ready_ms
  - Top-level duplication policy:
    - Component and cross-component duration fields are emitted only inside the grouped sections above (not duplicated at top level).
    - Top-level fields are reserved for turn summary metadata/state: turn_start_timestamp, turn_stop_timestamp, user_heard_all_data, tools_called, tools, timeout durations, and idle-timeout durations.
- "Speech durations logged" is emitted once per call and carries:
  - total_call_duration_ms
  - agent_speech_duration_ms
  - human_speech_duration_ms
- "Tool duration logged" is emitted once per tool invocation, with:
  - agent_turn_id, customer, agent_id, orchestrator_type
  - tool_name, duration_ms
  - provider, model (when available)
Persistence path
On shutdown, BuildPersistencePayload produces:
- Breakdown JSON
- Aggregated communication averages/totals
UpsertLatencyStats stores these values in comm_latency_stats.
The persisted row also includes agent_id for direct agent-scoped filtering.
Upsert-column resolution is covered by unit tests and no longer depends on integration-only environment setup.