The event model
Every event has a discriminated type string that indicates what it represents: audio, a tool call, and so on.
The client sends events to configure the session and stream audio. You do not trace these as spans; they are requests, and their effects come back as server events:
| Client event | What it does |
|---|---|
| session.update | Configure the session: instructions, voice, audio formats, transcription model, turn detection, tools. |
| input_audio_buffer.append | Stream a base64 PCM16 mic chunk. Sent continuously. |
| conversation.item.create | Add an item; used to return a function_call_output after running a tool. |
| response.create | Ask the model to generate a response (needed explicitly when turn detection uses create_response: false). |
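As a rough illustration of the client events in the table, the sketch below builds each one as a JSON string ready to send over the WebSocket. The helper names are hypothetical, and the contents of the session dictionary are left to the caller because the exact configuration fields depend on the API version you target.

```python
import base64
import json


def session_update(session: dict) -> str:
    """Wrap a caller-supplied session configuration in a session.update event."""
    return json.dumps({"type": "session.update", "session": session})


def audio_append(pcm16_chunk: bytes) -> str:
    """Encode one raw PCM16 mic chunk as an input_audio_buffer.append event."""
    audio_b64 = base64.b64encode(pcm16_chunk).decode("ascii")
    return json.dumps({"type": "input_audio_buffer.append", "audio": audio_b64})


def function_call_output(call_id: str, result: object) -> str:
    """Return a tool result to the model as a function_call_output item."""
    return json.dumps({
        "type": "conversation.item.create",
        "item": {
            "type": "function_call_output",
            "call_id": call_id,
            # The output field is itself a JSON-encoded string.
            "output": json.dumps(result),
        },
    })


def response_create() -> str:
    """Ask the model to generate a response."""
    return json.dumps({"type": "response.create"})
```

Keeping each event behind a small builder makes it easy to log or assert on the exact payload that left the client, which helps when a server event in the trace looks unexpected.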

| Server event | What it carries | Traced? |
|---|---|---|
| session.created / session.updated | Handshake / config acknowledgement. | Yes |
| input_audio_buffer.speech_started | Server VAD heard the user start. This is the barge-in signal; flush the speaker buffer. | Yes |
| input_audio_buffer.speech_stopped | Server VAD heard the user stop. | Yes |
| input_audio_buffer.committed | The audio buffer became a conversation item. | Yes |
| conversation.item.created | An item was added server-side. | Yes |
| conversation.item.input_audio_transcription.completed | The full user transcript for the turn. | Yes |
| response.created | The model started generating. | Yes |
| response.output_audio.delta | One chunk of agent speech (base64 PCM16); hundreds per response. | No (played, never spanned) |
| response.output_audio_transcript.delta | Streaming fragment of the agent’s transcript. | No |
| response.output_audio_transcript.done | The agent’s full transcript for the response. | Yes |
| response.function_call_arguments.delta / .done | Streaming / final tool-call arguments. | .done only |
| response.output_item.*, response.content_part.* | Structural progress of the response. | Yes |
| response.done | The complete response object: all output items (including every function_call), plus token usage. | Yes |
| error | Server-reported error. | Yes |
| rate_limits.updated | Quota bookkeeping. | Yes |
How events map to LangSmith runs
We recommend tracing the whole conversation as a single trace, with one span per traced event in arrival order.
The noise rule: we recommend skipping every event type ending in .delta, because the matching .done event repeats the complete payload. Tracing both records everything twice. response.output_audio.delta in particular is the agent’s voice: hundreds of chunks per response that would bury the trace. Play it to the speaker, but never make it a span.
Installation
This guide uses sounddevice and numpy for mic and speaker I/O and to build the WAV attachment.
Set up your environment
The following steps demonstrate how to trace using the LangSmith SDK. You can also trace using OpenTelemetry directly; see Trace with OpenTelemetry.
Quickstart
This guide focuses on the tracing layer. It assumes you already have a working Realtime app: the WebSocket connection, the session.update configuration, and your microphone and speaker I/O. For a complete, runnable implementation, see the voice demo repository.
Enable input_audio_transcription (and the agent transcript) in your session.update, or the transcription events that make the trace readable never arrive.
Step 1: Open the conversation root at connect time
Use one RunTree per conversation:
A thread_id that you generate per conversation (for example, a UUID) groups the trace into a LangSmith thread; ls_modality="audio" marks it as a voice conversation.
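A minimal sketch of opening that root run is shown here. The function names and the run name realtime_conversation are hypothetical; the sketch assumes the langsmith package is installed and that metadata is passed through the run's extra dictionary. Treat it as one possible shape, not the guide's original code.

```python
import uuid


def root_run_kwargs(thread_id=None) -> dict:
    """Build the arguments for the conversation's root run."""
    thread_id = thread_id or str(uuid.uuid4())
    return {
        "name": "realtime_conversation",  # hypothetical run name
        "run_type": "chain",
        "inputs": {},
        "extra": {
            "metadata": {
                "thread_id": thread_id,    # groups the trace into a thread
                "ls_modality": "audio",    # marks it as a voice conversation
            }
        },
    }


def open_root(thread_id=None):
    """Create and post the root run; call this once at connect time."""
    # Imported lazily so the kwargs helper can be used without the SDK.
    from langsmith.run_trees import RunTree

    root = RunTree(**root_run_kwargs(thread_id))
    root.post()
    return root
```

Separating the argument builder from the SDK call keeps the metadata easy to unit test without network access.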
Step 2: Span each received event, skipping the noise
Define a small helper that opens a child run for one event, records the scrubbed payload as the run’s input, and closes it when the block exits:
Then call it for each event received on the connection, skipping the .delta noise and tracing the rest:
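The helper and receive loop described in this step might look like the following sketch. All names here (should_trace, scrub, event_span, trace_events, AUDIO_KEYS) are hypothetical, as is the choice of which fields to scrub. The parent is assumed to be a LangSmith RunTree, or any object exposing create_child, post, end, and patch.

```python
from contextlib import contextmanager

# Assumption: fields with this name carry base64 audio and should not be stored.
AUDIO_KEYS = {"audio"}


def should_trace(event_type: str) -> bool:
    """The noise rule: skip every event type ending in .delta."""
    return not event_type.endswith(".delta")


def scrub(payload):
    """Recursively replace assumed audio fields with a short placeholder."""
    if isinstance(payload, dict):
        return {
            key: f"<{len(value)} chars omitted>"
            if key in AUDIO_KEYS and isinstance(value, str)
            else scrub(value)
            for key, value in payload.items()
        }
    if isinstance(payload, list):
        return [scrub(item) for item in payload]
    return payload


@contextmanager
def event_span(parent, event: dict):
    """Open a child run for one event and close it when the block exits."""
    run = parent.create_child(
        name=event["type"], run_type="chain", inputs=scrub(event)
    )
    run.post()
    try:
        yield run
    except Exception as exc:
        run.end(error=repr(exc))
        raise
    else:
        run.end()
    finally:
        run.patch()


def trace_events(root, events) -> None:
    """Span each received event in arrival order, skipping the noise."""
    for event in events:
        if not should_trace(event["type"]):
            continue  # still play audio deltas elsewhere; just do not span them
        with event_span(root, event):
            pass  # handle the event here (for example, run a tool)
```

Because the span stays open for the duration of the with block, any work done while handling the event can be recorded as a child of that event's run.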
Step 3: Run tools nested under the announcing event
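One way to nest a tool run is sketched below, under several assumptions: the announcing event is taken to be response.function_call_arguments.done, that event is assumed to carry name, call_id, and arguments fields, and the function name run_tool_under and the tools registry are hypothetical. The event_run argument is the span opened for the announcing event, such as a LangSmith RunTree child.

```python
import json


def run_tool_under(event_run, event: dict, tools: dict) -> list:
    """Run the requested tool as a child of event_run.

    Returns the client events to send back: the function_call_output item,
    followed by a response.create so the model continues the turn.
    """
    args = json.loads(event.get("arguments") or "{}")
    tool_run = event_run.create_child(
        name=event["name"], run_type="tool", inputs=args
    )
    tool_run.post()
    try:
        result = tools[event["name"]](**args)
        tool_run.end(outputs={"output": result})
    except Exception as exc:
        # Report the failure to the model instead of dropping the call.
        result = {"error": str(exc)}
        tool_run.end(error=repr(exc))
    finally:
        tool_run.patch()
    return [
        {
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": event["call_id"],
                "output": json.dumps(result),
            },
        },
        {"type": "response.create"},
    ]
```

Creating the tool run from the event's own span, not from the root, is what places it beneath the announcing event in the trace tree.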
Attach the conversation audio
To listen to a conversation alongside its transcript, attach a single combined recording of the whole conversation to the root run. Record both the user and the agent in one file (for example, a stereo WAV with the user mic on one channel and the agent on the other), captured from what was played to the client so the recording reflects what was actually heard, including speech cut off by a barge-in. The Realtime API streams agent audio as response.output_audio.delta events: decode and write those bytes to your output device, and tap that same output to build the recording.
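The stereo layout described above can be sketched with the standard library alone. The function name stereo_wav is hypothetical, and the sketch assumes both buffers are mono little-endian PCM16 at 24 kHz and already time-aligned (for example, with silence written while a side was quiet). A real recorder would need to handle that alignment itself.

```python
import io
import wave

SAMPLE_RATE = 24_000  # assumption: 24 kHz mono PCM16 on both sides


def stereo_wav(user_pcm16: bytes, agent_pcm16: bytes,
               sample_rate: int = SAMPLE_RATE) -> bytes:
    """Interleave user (left) and agent (right) PCM16 into one stereo WAV."""
    n = max(len(user_pcm16), len(agent_pcm16)) // 2  # samples per channel
    # Pad the shorter side with silence so both channels have n samples.
    left = user_pcm16[: n * 2].ljust(n * 2, b"\x00")
    right = agent_pcm16[: n * 2].ljust(n * 2, b"\x00")

    # Each stereo frame is 4 bytes: 2 for the left sample, 2 for the right.
    frames = bytearray(n * 4)
    frames[0::4] = left[0::2]
    frames[1::4] = left[1::2]
    frames[2::4] = right[0::2]
    frames[3::4] = right[1::2]

    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(2)
        wav.setsampwidth(2)  # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))
    return buf.getvalue()
```

The returned bytes are a complete WAV file that can be handed to whichever attachment mechanism you use for the root run.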
For the underlying attachment API, see Upload files with traces. For the cross-provider rationale, see Record a single combined audio file.
When the conversation ends, finalize the root run:
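Finalizing might look like the following sketch. The function name finalize_root and the transcript output key are hypothetical; the root is assumed to be the RunTree opened at connect time, or any object exposing end and patch.

```python
def finalize_root(root, transcript: list) -> None:
    """Close the conversation's root run and send the final update."""
    # Assumption: the collected turns are stored as the root run's output.
    root.end(outputs={"transcript": transcript})
    root.patch()
```

Call this from the same cleanup path that closes the WebSocket, so the root run is closed even when the conversation ends on an error.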
Next steps
Voice fundamentals
Core conventions for tracing voice agents.
Upload files with traces
Attach the conversation audio recording to your trace.

