The OpenAI Realtime API powers low-latency speech-to-speech voice agents. The model processes audio natively and exchanges a continuous stream of typed JSON events with your application over a persistent WebSocket connection, rather than making discrete request/response calls. This guide describes those events and shows how to turn them into a LangSmith trace.
For our high-level philosophy on getting the most out of your voice agent traces, see Voice tracing fundamentals. For a complete implementation, see the voice demo repository.

The event model

Every event has a discriminated type string that indicates what it represents: audio, a tool call, and so on. The client sends events to configure the session and stream audio. You do not trace these as spans; they are requests, and their effects come back as server events:
| Client event | What it does |
| --- | --- |
| session.update | Configure the session: instructions, voice, audio formats, transcription model, turn detection, tools. |
| input_audio_buffer.append | Stream a base64 PCM16 mic chunk. Sent continuously. |
| conversation.item.create | Add an item. Used to return a function_call_output after running a tool. |
| response.create | Ask the model to generate a response (needed explicitly when turn detection uses create_response: false). |
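As a rough illustration of what two of these client events look like on the wire, the sketch below serializes them with the standard library only. The helper names `append_audio_event` and `function_output_event` are hypothetical, not part of any SDK; the `type`, `audio`, and `item` fields follow the event names in the table above.

```python
import base64
import json


def append_audio_event(pcm16_chunk: bytes) -> str:
    """Serialize one mic chunk as an input_audio_buffer.append event."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        # Audio travels as base64 text inside the JSON event.
        "audio": base64.b64encode(pcm16_chunk).decode("ascii"),
    })


def function_output_event(call_id: str, result: dict) -> str:
    """Serialize a tool result as a conversation.item.create event."""
    return json.dumps({
        "type": "conversation.item.create",
        "item": {
            "type": "function_call_output",
            "call_id": call_id,
            "output": json.dumps(result),  # the output field is itself a JSON string
        },
    })


chunk = b"\x00\x00\x01\x00"  # two PCM16 samples, for illustration only
wire = append_audio_event(chunk)
tool_wire = function_output_event("call_123", {"temp_c": 21})
```

If you use the OpenAI SDK's connection object, it builds these payloads for you; the sketch only shows their shape.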
The server sends back events:
| Server event | What it carries | Traced? |
| --- | --- | --- |
| session.created / session.updated | Handshake / config acknowledgement. | Yes |
| input_audio_buffer.speech_started | Server VAD heard the user start. This is the barge-in signal; flush the speaker buffer. | Yes |
| input_audio_buffer.speech_stopped | Server VAD heard the user stop. | Yes |
| input_audio_buffer.committed | The audio buffer became a conversation item. | Yes |
| conversation.item.created | An item was added server-side. | Yes |
| conversation.item.input_audio_transcription.completed | The full user transcript for the turn. | Yes |
| response.created | The model started generating. | Yes |
| response.output_audio.delta | One chunk of agent speech (base64 PCM16); hundreds per response. | No (played, never spanned) |
| response.output_audio_transcript.delta | Streaming fragment of the agent's transcript. | No |
| response.output_audio_transcript.done | The agent's full transcript for the response. | Yes |
| response.function_call_arguments.delta / .done | Streaming / final tool-call arguments. | .done only |
| response.output_item.*, response.content_part.* | Structural progress of the response. | Yes |
| response.done | The complete response object: all output items (including every function_call), plus token usage. | Yes |
| error | Server-reported error. | Yes |
| rate_limits.updated | Quota bookkeeping. | Yes |
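The barge-in behaviour noted for input_audio_buffer.speech_started can be sketched as a small playback queue that is emptied when the user starts talking. This is a minimal sketch, assuming events arrive as plain dicts and that the audio chunk sits in a `delta` field; `SpeakerBuffer` and `on_server_event` are hypothetical names, and a real app would hand the queued bytes to an audio output device.

```python
import base64
from collections import deque


class SpeakerBuffer:
    """Queue of decoded agent audio waiting to be played."""

    def __init__(self) -> None:
        self._chunks: deque[bytes] = deque()

    def enqueue(self, b64_pcm16: str) -> None:
        self._chunks.append(base64.b64decode(b64_pcm16))

    def flush(self) -> int:
        """Drop everything not yet played; return how many chunks were dropped."""
        dropped = len(self._chunks)
        self._chunks.clear()
        return dropped

    def pending(self) -> int:
        return len(self._chunks)


def on_server_event(event: dict, speaker: SpeakerBuffer) -> None:
    if event["type"] == "response.output_audio.delta":
        speaker.enqueue(event["delta"])  # play it, never span it
    elif event["type"] == "input_audio_buffer.speech_started":
        speaker.flush()  # barge-in: stop talking over the user


speaker = SpeakerBuffer()
audio = base64.b64encode(b"\x00\x00").decode("ascii")
on_server_event({"type": "response.output_audio.delta", "delta": audio}, speaker)
on_server_event({"type": "response.output_audio.delta", "delta": audio}, speaker)
queued_before = speaker.pending()
on_server_event({"type": "input_audio_buffer.speech_started"}, speaker)
queued_after = speaker.pending()
```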

How events map to LangSmith runs

We recommend tracing the whole conversation as a single trace, with one span per traced event in arrival order:
realtime_session                                 ← root run (chain)
│   metadata: thread_id, model, event_count, duration_s, ls_modality=audio
│   attachments: conversation.wav (stereo: L=user, R=agent)

├─ input_audio_buffer.speech_started
├─ input_audio_buffer.speech_stopped
├─ conversation.item.input_audio_transcription.completed
├─ response.created
├─ response.function_call_arguments.done
├─ response.done
│   └─ lookup_weather × N                         ← tool runs, nested under the event that announced them
├─ response.done                                  ← the spoken follow-up after tools
└─ error                                          ← only if the server sent one
The noise rule: we recommend skipping every event type ending in .delta. For transcripts and tool-call arguments, the matching .done event repeats the complete payload, so tracing both records everything twice. response.output_audio.delta is the agent's voice itself: hundreds of chunks per response that would bury the trace. Play it to the speaker, but never make it a span.
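The noise rule reduces to a one-line predicate on the event type. The sketch below is illustrative only; `should_trace` is a hypothetical helper name, and the sample event types are taken from the server event table.

```python
def should_trace(event_type: str) -> bool:
    """Return False for streaming fragments, True for everything else."""
    return not event_type.endswith(".delta")


arrived = [
    "response.created",
    "response.output_audio.delta",
    "response.output_audio_transcript.delta",
    "response.output_audio_transcript.done",
    "response.function_call_arguments.delta",
    "response.function_call_arguments.done",
    "response.done",
]

# Only these would become child runs, in arrival order.
traced = [t for t in arrived if should_trace(t)]
```

Keeping the rule as a suffix check, rather than a hard-coded list, means newly introduced .delta event types are skipped without further changes.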

Installation

pip install "langsmith>=0.4" "openai>=1.50"
The demo also uses sounddevice and numpy for the mic/speaker and to build the WAV attachment.

Set up your environment

The following steps demonstrate how to trace using the LangSmith SDK. You can also trace using OpenTelemetry directly. See Trace with OpenTelemetry.
export LANGSMITH_API_KEY=...
export LANGSMITH_TRACING=true
export LANGSMITH_PROJECT=my-voice-app
export OPENAI_API_KEY=...

Quickstart

This guide focuses on the tracing layer. It assumes you already have a working Realtime app: the WebSocket connection, the session.update configuration, and your microphone and speaker I/O. For a complete, runnable implementation, see the voice demo repository. Enable input_audio_transcription (and the agent transcript) in your session.update, or the transcription events that make the trace readable never arrive.

Step 1: Open the conversation root at connect time

Use one RunTree per conversation:
from langsmith import RunTree

root = RunTree(
    name="realtime_session",
    run_type="chain",
    inputs={},
    project_name="my-voice-app",
    extra={"metadata": {"thread_id": thread_id, "model": MODEL, "ls_modality": "audio"}},
)
root.post()
A stable thread_id you generate per conversation (for example, a UUID) groups the trace into a LangSmith thread; ls_modality="audio" marks it as a voice conversation.
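One way to produce the metadata dict passed to the root run is sketched below. `new_session_metadata` is a hypothetical helper, and the model name used in the example call is a placeholder for whatever your `MODEL` constant holds.

```python
import uuid


def new_session_metadata(model: str) -> dict:
    """Build root-run metadata with a fresh thread_id for one conversation."""
    return {
        "thread_id": str(uuid.uuid4()),  # one per conversation, reused for every run in it
        "model": model,
        "ls_modality": "audio",
    }


# "gpt-realtime" is a placeholder; substitute the model your session uses.
meta = new_session_metadata("gpt-realtime")
other = new_session_metadata("gpt-realtime")
```

Generate the ID once at connect time and keep it for the life of the WebSocket; generating a new one per event would split the conversation across threads.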

Step 2: Span each received event, skipping the noise

Define a small helper that opens a child run for one event, records the scrubbed payload as the run’s input, and closes it when the block exits:
from contextlib import contextmanager

@contextmanager
def event_span(parent, event, *, name):
    """Trace one event as a child run, with its payload as the run's input."""
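    # Serialize the event to plain JSON types; scrub sensitive or oversized
    # fields here before recording the payload.
    payload = event.model_dump(mode="json")
    child = parent.create_child(name=name, run_type="chain", inputs={"event": payload})
    child.post()  # send the run start to LangSmith
    try:
        yield child
    finally:
        child.end()    # stamp the end time
        child.patch()  # send the completed run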
Then loop over the events from your open Realtime connection, skipping the .delta noise and tracing the rest:
async for event in connection:
    if event.type.endswith(".delta"):
        continue  # the matching .done event repeats the full payload

    with event_span(root, event, name=event.type) as event_run:
        ...  # your handling for this event type
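The helper in this step is described as recording a scrubbed payload, but what scrubbing involves is left to you. One possible reading, sketched here as an assumption rather than the demo's actual logic, is to replace very long strings (such as base64 audio) with a short placeholder before the payload becomes a run input. `scrub` and `MAX_STR` are hypothetical names.

```python
from typing import Any

MAX_STR = 256  # hypothetical threshold; tune to your payloads


def scrub(value: Any, max_len: int = MAX_STR) -> Any:
    """Recursively replace oversized strings in a JSON-like structure."""
    if isinstance(value, dict):
        return {key: scrub(item, max_len) for key, item in value.items()}
    if isinstance(value, list):
        return [scrub(item, max_len) for item in value]
    if isinstance(value, str) and len(value) > max_len:
        return f"<{len(value)} chars omitted>"
    return value


event_payload = {
    "type": "conversation.item.created",
    "item": {"content": [{"type": "input_audio", "audio": "A" * 1000}]},
}
clean = scrub(event_payload)
```

With this approach you would call `scrub(event.model_dump(mode="json"))` before passing the result as `inputs`. Redacting personal data in transcripts is a separate concern that this sketch does not address.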

Step 3: Run tools nested under the announcing event

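response.done lists every function_call the model requested. Run each tool inside tracing_context so its run nests under that event's span, return the result as a function_call_output item, then ask for the spoken follow-up:
import json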

from langsmith.run_helpers import tracing_context

if event.type == "response.done":
    calls = [i for i in (event.response.output or []) if i.type == "function_call"]
    for call in calls:
        with tracing_context(parent=event_run):
            result = await execute_tool(call.name, call.arguments)  # traced child
        await connection.conversation.item.create(item={
            "type": "function_call_output",
            "call_id": call.call_id,
            "output": json.dumps(result),
        })
    if calls:
        await connection.response.create()   # ask for the spoken follow-up
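The snippet above calls `execute_tool` without defining it. A minimal sketch of such a dispatcher follows, assuming tool arguments arrive as a JSON string and tools are plain async functions. The `TOOLS` registry and the `lookup_weather` body are hypothetical. To make each call appear as a child run, you would typically decorate the tool functions with LangSmith's `traceable`; that is omitted here so the example runs with the standard library alone.

```python
import asyncio
import json


async def lookup_weather(city: str) -> dict:
    """Hypothetical tool; a real one would call a weather service."""
    return {"city": city, "forecast": "unknown"}


# Registry mapping the tool names declared in session.update to callables.
TOOLS = {"lookup_weather": lookup_weather}


async def execute_tool(name: str, arguments: str) -> dict:
    """Parse the model's JSON arguments and run the named tool."""
    tool = TOOLS.get(name)
    if tool is None:
        return {"error": f"unknown tool: {name}"}
    try:
        kwargs = json.loads(arguments or "{}")
    except json.JSONDecodeError as exc:
        return {"error": f"invalid arguments: {exc}"}
    return await tool(**kwargs)


result = asyncio.run(execute_tool("lookup_weather", '{"city": "Paris"}'))
missing = asyncio.run(execute_tool("nope", "{}"))
```

Returning an error dict instead of raising lets the model hear about the failure through the function_call_output item and respond to it in speech.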

Attach the conversation audio

To listen to a conversation alongside its transcript, attach a single combined recording of the whole conversation to the root run. Record both the user and the agent in one file (for example, a stereo WAV with the user mic on one channel and the agent on the other), captured from what was played to the client so the recording reflects what was actually heard, including speech cut off by a barge-in.
The Realtime API streams agent audio as response.output_audio.delta events: decode and write those bytes to your output device, and tap that same output to build the recording. For the underlying attachment API, see Upload files with traces. For the cross-provider rationale, see Record a single combined audio file.
When the conversation ends, finalize the root run:
root.end()
root.patch()
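The combined recording described in this section could be assembled as follows. This is a sketch using only the standard library (the demo itself uses numpy), assuming both tracks are mono PCM16 at 24 kHz on a little-endian host; `stereo_wav` and `SAMPLE_RATE` are hypothetical names. Attaching the resulting bytes to the root run is covered by the Upload files with traces guide and is not shown.

```python
import io
import wave
from array import array

SAMPLE_RATE = 24_000  # assumption: match the PCM16 rate your session uses


def stereo_wav(user_pcm16: bytes, agent_pcm16: bytes, rate: int = SAMPLE_RATE) -> bytes:
    """Interleave two mono PCM16 tracks into one stereo WAV (L=user, R=agent)."""
    left = array("h")
    left.frombytes(user_pcm16)
    right = array("h")
    right.frombytes(agent_pcm16)

    # Pad the shorter track with silence so both channels have equal length.
    frames = max(len(left), len(right))
    left.extend([0] * (frames - len(left)))
    right.extend([0] * (frames - len(right)))

    interleaved = array("h", [0]) * (2 * frames)
    interleaved[0::2] = left   # even samples: left channel
    interleaved[1::2] = right  # odd samples: right channel

    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(2)
        wav.setsampwidth(2)  # 16-bit samples
        wav.setframerate(rate)
        wav.writeframes(interleaved.tobytes())
    return buffer.getvalue()


user_track = array("h", [1, 2, 3]).tobytes()
agent_track = array("h", [10]).tobytes()
wav_bytes = stereo_wav(user_track, agent_track)
```

For the two tracks to line up in time, write silence into each one whenever the other party is speaking, so that sample position corresponds to wall-clock time.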

Next steps

Voice fundamentals: Core conventions for tracing voice agents.
Upload files with traces: Attach the conversation audio recording to your trace.