These conventions assume you are exporting traces to LangSmith through one of the supported tracing setups. For audio rendering and playback in the UI, see Log multimodal traces and Upload files with traces.
Two architectures, two trace shapes
How you build a voice agent determines what the trace looks like. There are two common architectures, and they produce fundamentally different traces.
Cascade
A cascade chains together separate, single-purpose models: speech-to-text (STT) transcribes the user’s audio, a language model (LLM) reasons over the text and decides what to do, and text-to-speech (TTS) synthesizes the reply. Middleware, tool calls, and retrieval steps sit in between. Because each stage is a discrete model call with a clear input and output, a cascade traces like any other agent pipeline. The trace is a tree of STT, LLM, TTS, tool, and middleware runs: stages can run in parallel, and a new STT → LLM → TTS cycle repeats for each turn of the conversation. These runs have meaningful input/output pairs (audio in → transcript out, prompt in → completion out).
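To make the cascade trace shape concrete, here is a minimal sketch of one STT → LLM → tool → LLM → TTS cycle as a tree under a single root. The Run class, its kind field, and the example payloads are hypothetical stand-ins for illustration; they are not LangSmith SDK types.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Run:
    """Illustrative stand-in for a traced run (hypothetical, not an SDK class)."""
    name: str
    kind: str  # hypothetical label: "root", "stt", "llm", "tool", or "tts"
    inputs: dict[str, Any] = field(default_factory=dict)
    outputs: dict[str, Any] = field(default_factory=dict)
    children: list["Run"] = field(default_factory=list)


def outline(run: Run, depth: int = 0) -> list[str]:
    """Return an indented, depth-first outline of the run tree."""
    lines = ["  " * depth + f"{run.name} ({run.kind})"]
    for child in run.children:
        lines.extend(outline(child, depth + 1))
    return lines


# One cycle of a cascade, nested under the conversation's root run.
conversation = Run(
    name="conversation",
    kind="root",
    children=[
        Run("transcribe", "stt", {"audio": "<user audio>"}, {"transcript": "Weather in Paris?"}),
        Run("plan", "llm", {"prompt": "Weather in Paris?"}, {"tool_call": "get_weather"}),
        Run("get_weather", "tool", {"city": "Paris"}, {"forecast": "sunny"}),
        Run("answer", "llm", {"forecast": "sunny"}, {"completion": "It is sunny in Paris."}),
        Run("synthesize", "tts", {"text": "It is sunny in Paris."}, {"audio": "<agent audio>"}),
    ],
)

print("\n".join(outline(conversation)))
```

Each child carries its own input/output pair, which is what lets a cascade be inspected stage by stage.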
The two most common frameworks for building cascade voice agents are LiveKit and Pipecat.
Speech-to-speech (S2S)
A speech-to-speech model (such as the OpenAI Realtime API or Gemini Live) processes audio natively and replies with audio over a single persistent connection, typically a WebSocket. There is no STT/LLM/TTS decomposition to trace. Instead, the model server and your client exchange a stream of events over the wire: audio chunks, transcription fragments, tool-call requests, turn boundaries, interruptions, and errors. The natural unit to trace is the event payload, not a request/response pair. Each event you record becomes one span whose content is the payload that crossed the wire.
The rest of this page describes conventions that apply to both architectures. The provider guides cover the event-stream specifics for OpenAI Realtime and Gemini Live.
Core conventions
These are the practices we recommend for getting the most out of voice traces in LangSmith. Trace your voice applications however best suits your infrastructure and implementation; following the structure suggested here helps keep your traces consistent and easy to debug and evaluate. We recommend three high-level conventions:
- Trace each conversation as a single trace instead of splitting it into multiple traces.
- Record a single combined audio file and attach it to the root run.
- Mark the trace as audio with ls_modality so it renders and filters as a voice trace.
Trace each conversation as a single trace
A conversation is a single interaction, so we recommend keeping it in a single trace, with the individual model calls or events nested underneath one root run that represents the whole conversation. Do not split a conversation into multiple traces. If you start a new trace for each exchange, you lose the information that lives between exchanges:
- Interruptions: when the user talks over the agent and the agent stops (barge-in).
- Timing and latency: gaps between speakers, and how long the agent took to respond.
- Context: references back to earlier parts of the conversation.
- Conversation-level outcomes: whether the user’s goal was ultimately resolved.
A voice agent has no reliable notion of a “turn”. Speakers overlap, interrupt, and trail off. Do not group runs into synthetic turns. Trace the real units instead: the model calls in a cascade, or the event payloads in a speech-to-speech stream.
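As a rough illustration of what a single trace preserves, the sketch below keeps every recorded unit of one conversation on a shared timeline and derives response latency and barge-in from the timestamps. The Event shape, the field names, and the sample timings are assumptions made for this example, not a LangSmith schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One recorded unit (a model call or wire event); hypothetical shape."""
    name: str
    speaker: str  # "user" or "agent"
    start_s: float
    end_s: float


def response_latencies(events: list[Event]) -> list[float]:
    """Gap between the end of a user event and the start of the next agent event."""
    ordered = sorted(events, key=lambda e: e.start_s)
    gaps = []
    for prev, nxt in zip(ordered, ordered[1:]):
        if prev.speaker == "user" and nxt.speaker == "agent":
            gaps.append(round(nxt.start_s - prev.end_s, 3))
    return gaps


def barge_ins(events: list[Event]) -> list[tuple[str, str]]:
    """Pairs (agent event, user event) where the user started while the agent was speaking."""
    hits = []
    for agent in (e for e in events if e.speaker == "agent"):
        for user in (e for e in events if e.speaker == "user"):
            if agent.start_s < user.start_s < agent.end_s:
                hits.append((agent.name, user.name))
    return hits


# All events belong to one conversation, so their timestamps are comparable.
events = [
    Event("user_audio_1", "user", 0.0, 2.0),
    Event("agent_audio_1", "agent", 2.4, 4.8),  # cut short by the user
    Event("user_audio_2", "user", 4.5, 5.5),
    Event("agent_audio_2", "agent", 6.1, 8.0),
]

print(response_latencies(events))
print(barge_ins(events))
```

If each exchange lived in its own trace, neither function would have the neighbouring events it needs to compare.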
Record a single combined audio file
Attach to the root run one audio file that contains both the user and the agent, recorded from what was actually played at the client rather than from the audio the model generated. Record at the client. A common approach is a stereo WAV with the user’s microphone on one channel and the agent’s speech, captured at the speaker, on the other.
This matters because the generated audio and the heard audio are not the same thing: network delay, dropped or reordered packets, and barge-in all change what the user actually experiences. A barge-in that cuts the agent off mid-sentence should appear truncated in the recording, because that is what happened. Recording what was played, rather than what was generated but possibly never heard, is what makes the trace faithful to the real interaction.
Attach the file using the attachments API.
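The stereo layout described above can be sketched with Python's standard library alone. This example assumes both streams are mono 16-bit little-endian PCM at the same sample rate and start at the same instant; the function name and the channel assignment (user on the left, agent on the right) are choices made for illustration. It only produces the WAV bytes and does not call any LangSmith API.

```python
import io
import struct
import wave


def write_stereo_wav(user_pcm: bytes, agent_pcm: bytes, sample_rate: int = 16000) -> bytes:
    """Interleave two mono 16-bit PCM streams into one stereo WAV.

    Left channel = user microphone, right channel = agent audio as played.
    """
    if len(user_pcm) % 2 or len(agent_pcm) % 2:
        raise ValueError("expected 16-bit PCM (an even number of bytes)")

    # Pad the shorter stream with silence so both channels have equal length.
    n = max(len(user_pcm), len(agent_pcm))
    user_pcm = user_pcm.ljust(n, b"\x00")
    agent_pcm = agent_pcm.ljust(n, b"\x00")

    # A stereo frame is one left sample followed by one right sample.
    frames = bytearray()
    for i in range(0, n, 2):
        frames += user_pcm[i:i + 2] + agent_pcm[i:i + 2]

    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(2)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))
    return buf.getvalue()


# Tiny synthetic example: four user samples, two agent samples.
user = struct.pack("<4h", 100, 200, 300, 400)
agent = struct.pack("<2h", -1, -2)
wav_bytes = write_stereo_wav(user, agent)
```

The resulting bytes are the kind of single combined recording that would be attached to the root run. In a real client, the agent channel should come from the playback path so that truncation from barge-in shows up in the file.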
Mark the trace as audio
Set the ls_modality metadata field to "audio" on the root run. This flags the trace as a voice trace so LangSmith can render it appropriately and so you can filter for voice traces in your project.
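A minimal sketch of building that root-run metadata follows. The helper name and the session_id key are hypothetical; only the ls_modality field and its "audio" value come from the convention above. How the dictionary reaches the root run depends on your tracing setup.

```python
from typing import Any, Optional


def voice_root_metadata(extra: Optional[dict[str, Any]] = None) -> dict[str, Any]:
    """Build metadata for the root run of a voice trace (hypothetical helper)."""
    metadata = dict(extra or {})
    # Flag the trace as a voice trace, per the convention above.
    metadata["ls_modality"] = "audio"
    return metadata


# session_id is an illustrative extra field, not a required key.
root_metadata = voice_root_metadata({"session_id": "demo-session"})

# Assumption: with the LangSmith Python SDK, a dict like this is typically
# passed as the root run's metadata, e.g. traceable(metadata=root_metadata).
# Check the SDK reference for your version before relying on that.
print(root_metadata)
```

Setting the field only on the root run keeps the flag in one place for the whole conversation.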
For other ls_ metadata fields, refer to Metadata parameters reference.
Next steps
Trace OpenAI Realtime
Trace voice agents built on the OpenAI Realtime API.
Trace Gemini Live
Trace voice agents built on the Gemini Live API.
Trace LiveKit
Trace voice agents built with LiveKit Agents.
Trace Pipecat
Trace voice agents built with Pipecat.
Upload files with traces
Attach the conversation audio recording to your trace.
Log multimodal traces
Render audio and other media in the LangSmith UI.