Concepts
There are two common architectures for voice agents:
1. “The Sandwich”
The Sandwich architecture composes three distinct components: speech-to-text (STT), a text-based LangChain agent, and text-to-speech (TTS).
Pros:
- Full control over each component (swap STT/TTS providers as needed)
- Access to latest capabilities from modern text-modality models
- Transparent behavior with clear boundaries between components
Cons:
- Requires orchestrating multiple services
- Additional complexity in managing the pipeline
- Conversion from speech to text loses information (e.g., tone, emotion)
2. Speech-to-Speech Architecture
Speech-to-speech uses a multimodal model that processes audio input and generates audio output natively.
Pros:
- Simpler architecture with fewer moving parts
- Typically lower latency for simple interactions
- Direct audio processing captures tone and other nuances of speech
Cons:
- Limited model options, greater risk of provider lock-in
- Features may lag behind text-modality models
- Less transparency in how audio is processed
- Reduced controllability and customization options
Demo application overview
This guide demonstrates the sandwich architecture because it balances performance, controllability, and access to modern model capabilities. With some STT and TTS providers, the sandwich can achieve sub-700ms latency while keeping each component modular and under your control.

The agent manages orders for a sandwich shop. The application demonstrates all three components of the sandwich architecture, using AssemblyAI for STT and ElevenLabs for TTS (although adapters can be built for most providers). An end-to-end reference application is available in the voice-sandwich-demo repository; we walk through that application here.

The demo uses WebSockets for real-time bidirectional communication between the browser and server. The same architecture can be adapted to other transports, such as telephony systems (Twilio, Vonage) or WebRTC connections.
Architecture
The demo implements a streaming pipeline where each stage processes data asynchronously:
Client (Browser)
- Captures microphone audio and encodes it as PCM
- Establishes WebSocket connection to the backend server
- Streams audio chunks to the server in real-time
- Receives and plays back synthesized speech audio
Server
- Accepts WebSocket connections from clients
- Orchestrates the three-step pipeline:
  - Speech-to-text (STT): Forwards audio to the STT provider (e.g., AssemblyAI), receives transcript events
  - Agent: Processes transcripts with the LangChain agent, streams response tokens
  - Text-to-speech (TTS): Sends agent responses to the TTS provider (e.g., ElevenLabs), receives audio chunks
- Returns synthesized audio to the client for playback
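To make the data flow concrete, here is a minimal sketch of how a server could wire the three stages together over a WebSocket. This is not the demo's actual code: FastAPI is used purely for illustration, and `make_stt_client`, `make_tts_client`, `stt_stream`, `agent_stream`, and `tts_stream` are hypothetical helpers corresponding to the stage sketches later in this guide.

```python
# Illustrative orchestration sketch; helper names below are hypothetical stand-ins
# for the demo's actual STT, agent, and TTS stages.
from fastapi import FastAPI, WebSocket

app = FastAPI()

@app.websocket("/ws")
async def voice_session(ws: WebSocket) -> None:
    await ws.accept()
    stt_client = make_stt_client()  # hypothetical: one streaming STT session per connection
    tts_client = make_tts_client()  # hypothetical: one streaming TTS session per connection

    async def audio_in():
        # Binary PCM frames streamed from the browser.
        while True:
            yield await ws.receive_bytes()

    # STT -> agent -> TTS, streaming at every stage.
    async for event in stt_stream(audio_in(), stt_client):
        if event["type"] != "stt_output":
            continue  # only final transcripts trigger the agent
        tokens = agent_stream(event["text"], thread_id="demo-session")
        async for audio_chunk in tts_stream(tokens, tts_client):
            await ws.send_bytes(audio_chunk)
```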
Setup
For detailed installation instructions and setup, see the repository README.
1. Speech-to-text
The STT stage transforms an incoming audio stream into text transcripts. The implementation uses a producer-consumer pattern to handle audio streaming and transcript reception concurrently.
Key Concepts
Producer-Consumer Pattern: Audio chunks are sent to the STT service concurrently with receiving transcript events. This allows transcription to begin before all audio has arrived.
Event Types:
- stt_chunk: Partial transcripts provided as the STT service processes audio
- stt_output: Final, formatted transcripts that trigger agent processing
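As a rough sketch, these events might be represented like this (the field names are assumptions for illustration, not the repository's actual types):

```python
from typing import Literal, TypedDict

class STTEvent(TypedDict):
    # "stt_chunk" = partial transcript, "stt_output" = final formatted transcript.
    type: Literal["stt_chunk", "stt_output"]
    text: str
```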
Implementation
AssemblyAI Client
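The demo's AssemblyAI client lives in the repository. As a provider-agnostic sketch of the producer-consumer pattern described above, assuming a hypothetical provider wrapper with send_audio, end_stream, and receive_events methods (not AssemblyAI's actual SDK surface):

```python
import asyncio
from typing import AsyncIterator

async def stt_stream(
    audio_chunks: AsyncIterator[bytes],
    provider,  # hypothetical streaming STT client wrapper (e.g., around AssemblyAI)
) -> AsyncIterator[dict]:
    """Producer-consumer STT stage: send audio upstream while yielding transcript events."""

    async def produce() -> None:
        # Producer: forward microphone audio to the STT service as it arrives.
        async for chunk in audio_chunks:
            await provider.send_audio(chunk)
        await provider.end_stream()

    producer = asyncio.create_task(produce())
    try:
        # Consumer: emit partial (stt_chunk) and final (stt_output) transcript events.
        async for event in provider.receive_events():
            yield {"type": event.type, "text": event.text}
    finally:
        producer.cancel()
```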
2. LangChain agent
The agent stage processes text transcripts through a LangChain agent and streams the response tokens. In this case, we stream all text content blocks generated by the agent.
Key Concepts
Streaming Responses: The agent uses stream_mode="messages" to emit response tokens as they’re generated, rather than waiting for the complete response. This enables the TTS stage to begin synthesis immediately.
Conversation Memory: A checkpointer maintains conversation state across turns using a unique thread ID. This allows the agent to reference previous exchanges in the conversation.
Implementation
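The full agent definition is in the repository. A minimal sketch of a streaming agent with a checkpointer might look like the following; the model choice, empty tool list, and agent_stream helper name are assumptions for illustration:

```python
from langchain.chat_models import init_chat_model
from langgraph.checkpoint.memory import MemorySaver
from langgraph.prebuilt import create_react_agent

# Model and tools are placeholders; the demo's actual agent definition may differ.
agent = create_react_agent(
    init_chat_model("openai:gpt-4o-mini"),
    tools=[],  # e.g., order-management tools for the sandwich shop
    checkpointer=MemorySaver(),  # keeps conversation state across turns
)

async def agent_stream(text: str, thread_id: str):
    """Stream the agent's response tokens for one user turn in a given conversation."""
    config = {"configurable": {"thread_id": thread_id}}
    async for token, _metadata in agent.astream(
        {"messages": [{"role": "user", "content": text}]},
        config,
        stream_mode="messages",  # emit message chunks as they are generated
    ):
        # Forward plain text content so the TTS stage receives strings to synthesize.
        # A fuller implementation would also filter out tool-related messages.
        if isinstance(token.content, str) and token.content:
            yield token.content
```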
3. Text-to-speech
The TTS stage synthesizes agent response text into audio and streams it back to the client. Like the STT stage, it uses a producer-consumer pattern to handle concurrent text sending and audio reception.
Key Concepts
Concurrent Processing: The implementation merges two async streams:
- Upstream processing: Passes through all events and sends agent text chunks to the TTS provider
- Audio reception: Receives synthesized audio chunks from the TTS provider
Implementation
ElevenLabs Client
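The demo's ElevenLabs client lives in the repository. Mirroring the STT stage, here is a provider-agnostic sketch of concurrent text sending and audio reception, assuming a hypothetical provider wrapper with send_text, flush, and receive_audio methods (not ElevenLabs' actual SDK surface):

```python
import asyncio
from typing import AsyncIterator

async def tts_stream(
    text_chunks: AsyncIterator[str],
    provider,  # hypothetical streaming TTS client wrapper (e.g., around ElevenLabs)
) -> AsyncIterator[bytes]:
    """Producer-consumer TTS stage: send text upstream while yielding audio downstream."""

    async def produce() -> None:
        # Upstream: forward agent text chunks to the TTS service as they arrive.
        async for text in text_chunks:
            await provider.send_text(text)
        await provider.flush()  # signal that the utterance is complete

    producer = asyncio.create_task(produce())
    try:
        # Downstream: yield synthesized audio chunks for the client to play back.
        async for audio_chunk in provider.receive_audio():
            yield audio_chunk
    finally:
        producer.cancel()
```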