# Building AI Voice Agents with Audio Streaming

> Build conversational AI voice agents using Plivo Voice API and real-time Audio Streaming

Build AI-powered voice agents that hold natural conversations with callers using Plivo's Audio Streaming. Stream live audio to your AI services (STT, LLM, TTS) over a WebSocket and respond in real time.

***

## Prerequisites

Before building your AI voice agent, you'll need:

| Requirement                | Description                                                                                    |
| -------------------------- | ---------------------------------------------------------------------------------------------- |
| **Plivo Account**          | [Sign up](https://cx.plivo.com/signup) and get your Auth ID and Auth Token                     |
| **Phone Number**           | [Purchase a voice-enabled number](/numbers/guides/buy-a-number/) to receive/make calls         |
|                            | - **India:** Requires KYC verification. See [Rent India Numbers](/numbers/rent-india-numbers). |
| **WebSocket Server**       | A publicly accessible server to handle audio streams (use ngrok for development)               |
| **AI Service Credentials** | API keys for your chosen providers:                                                            |
|                            | - Speech-to-Text (STT): Deepgram, Google Speech, AWS Transcribe, etc.                          |
|                            | - LLM: OpenAI, Anthropic, Google Gemini, etc.                                                  |
|                            | - Text-to-Speech (TTS): ElevenLabs, Google TTS, Amazon Polly, etc.                             |

***

## Voice API Basics

Audio Streaming builds on Plivo's Voice API. The core workflow is:

1. **Make or receive a call** using the [Call API](/voice/api/calls/)
2. **Control the call** using [Plivo XML](/voice/xml/overview/) responses
3. **Stream audio** using the `<Stream>` XML element

For complete Voice API documentation, see [Voice API Overview](/voice/concepts/overview/).

***

## What is Audio Streaming?

Audio Streaming gives you access to the raw audio of voice calls in real time via WebSockets.
This enables:

* **AI Voice Assistants** - Natural conversations with speech recognition and synthesis
* **Real-time Transcription** - Live call transcription for analytics
* **Voice Bots** - Automated IVR systems with intelligent responses
* **Sentiment Analysis** - Real-time audio analysis during calls

***

## How It Works

```
┌─────────────────┐
│     Caller      │
│     (Phone)     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Plivo Voice   │
│       API       │
└────────┬────────┘
         │
         │ Audio Stream (WebSocket)
         ▼
┌─────────────────┐
│    Your App     │
│   (WebSocket)   │
└────────┬────────┘
         │
         │ API Calls
         ▼
┌─────────────────┐
│   AI Services   │
│   STT/LLM/TTS   │
└─────────────────┘
```

**Flow:**

1. Caller dials your Plivo number (or you make an outbound call)
2. Plivo connects to your WebSocket endpoint and starts streaming audio
3. Your app sends audio to STT for transcription
4. Transcribed text goes to the LLM for response generation
5. The LLM response is converted to speech via TTS
6. Audio is sent back through the WebSocket to the caller

***

## Stream Directions

### Inbound Stream (Unidirectional)

Audio flows **from the caller to your server**. Use this when you only need to receive audio (e.g., transcription, call analytics).

```xml
<Response>
    <Stream>wss://your-server.com/stream</Stream>
</Response>
```

### Bidirectional Stream

Audio flows **in both directions**: from the caller to your server and from your server back to the caller. Use this for AI voice agents that need to respond.

```xml
<Response>
    <Stream bidirectional="true" keepCallAlive="true">wss://your-server.com/stream</Stream>
</Response>
```

For AI voice agents, always use `bidirectional="true"` and `keepCallAlive="true"` to maintain the call while your agent processes and responds.
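If your answer URL generates this XML dynamically, the two stream directions can be produced from one helper. The following is a minimal sketch using only the Python standard library; the function name `build_stream_xml` is hypothetical, and it sets only the `bidirectional` and `keepCallAlive` attributes described above on Plivo's `<Stream>` element.

```python
import xml.etree.ElementTree as ET


def build_stream_xml(ws_url: str, bidirectional: bool = False) -> str:
    """Return a Plivo XML response that streams call audio to ws_url.

    Hypothetical helper for illustration; it is not part of any Plivo SDK.
    """
    response = ET.Element("Response")
    stream = ET.SubElement(response, "Stream")
    if bidirectional:
        # Attributes recommended above for AI voice agents.
        stream.set("bidirectional", "true")
        stream.set("keepCallAlive", "true")
    stream.text = ws_url
    # encoding="unicode" returns a str without an XML declaration.
    return ET.tostring(response, encoding="unicode")


if __name__ == "__main__":
    print(build_stream_xml("wss://your-server.com/stream", bidirectional=True))
```

Building the document with `ElementTree` rather than string formatting means the WebSocket URL is escaped correctly if it ever contains characters such as `&` in a query string.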
***

## Supported Audio Formats

Choose the audio codec and sample rate based on your use case:

| Content Type              | Codec        | Sample Rate | Description               | Use Case                                                                                          |
| ------------------------- | ------------ | ----------- | ------------------------- | ------------------------------------------------------------------------------------------------- |
| `audio/x-mulaw;rate=8000` | μ-law (PCMU) | 8 kHz       | Compressed 8-bit audio    | **Recommended for Voice AI.** Native telephony format with lowest latency and best compatibility. |
| `audio/x-l16;rate=8000`   | Linear PCM   | 8 kHz       | Uncompressed 16-bit audio | Higher quality audio when bandwidth is not a concern.                                             |
| `audio/x-l16;rate=16000`  | Linear PCM   | 16 kHz      | Uncompressed 16-bit audio | High-fidelity speech recognition requiring wideband audio.                                        |

**Why μ-law 8 kHz?** It's the native telephony codec, so no transcoding is required. This means lower latency, reduced bandwidth (half the size of 16-bit Linear PCM at the same sample rate), and broad compatibility with STT/TTS services.

***

## Latency Considerations

For responsive voice AI, understanding and minimizing latency is critical.
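Payload size is one input to latency that can be estimated up front. The sketch below computes the raw audio size per chunk and the bitrate for the three supported content types. It assumes mono audio and a 20 ms chunk; the chunk duration and the names `FORMATS`, `bytes_per_chunk`, and `kilobits_per_second` are illustrative assumptions, not documented Plivo values.

```python
# Content type -> (sample rate in Hz, bytes per sample), mono audio assumed.
FORMATS = {
    "audio/x-mulaw;rate=8000": (8000, 1),   # 8-bit mu-law
    "audio/x-l16;rate=8000": (8000, 2),     # 16-bit linear PCM
    "audio/x-l16;rate=16000": (16000, 2),   # 16-bit linear PCM, wideband
}


def bytes_per_chunk(content_type: str, chunk_ms: int = 20) -> int:
    """Raw (pre-base64) audio bytes in one chunk of chunk_ms milliseconds."""
    sample_rate, sample_width = FORMATS[content_type]
    return sample_rate * sample_width * chunk_ms // 1000


def kilobits_per_second(content_type: str) -> int:
    """Raw audio bitrate in kbit/s, excluding JSON and base64 overhead."""
    sample_rate, sample_width = FORMATS[content_type]
    return sample_rate * sample_width * 8 // 1000


if __name__ == "__main__":
    for content_type in FORMATS:
        print(
            f"{content_type}: {bytes_per_chunk(content_type)} bytes per 20 ms, "
            f"{kilobits_per_second(content_type)} kbit/s"
        )
```

Under these assumptions, μ-law at 8 kHz comes to 160 bytes per 20 ms (64 kbit/s), against 320 bytes for 16-bit PCM at 8 kHz and 640 bytes at 16 kHz. Base64 encoding on the WebSocket adds roughly a third on top of each figure.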
### Latency Sources

| Component               | Description                                   | Target                                       |
| ----------------------- | --------------------------------------------- | -------------------------------------------- |
| **Codec Processing**    | Audio encoding/decoding overhead              | μ-law has near-zero overhead (native format) |
| **Network (WebSocket)** | Round-trip time between Plivo and your server | \< 100 ms (deploy server near caller regions) |
| **Speech-to-Text**      | Time to transcribe audio to text              | \< 200 ms                                     |
| **LLM Processing**      | Time for AI to generate response              | \< 500 ms                                     |
| **Text-to-Speech**      | Time to convert text to audio                 | \< 200 ms                                     |
| **Total**               | End-to-end response time                      | **\< 1 second**                              |

### Codec Impact on Latency

| Codec                     | Latency Impact | Notes                                                        |
| ------------------------- | -------------- | ------------------------------------------------------------ |
| `audio/x-mulaw;rate=8000` | **Lowest**     | No transcoding required; native telephony format             |
| `audio/x-l16;rate=8000`   | Low            | Minimal processing, but larger payload size                  |
| `audio/x-l16;rate=16000`  | Moderate       | Larger payloads; only use if STT model specifically benefits |

### Best Practices for Low-Latency Voice AI

1. **Use μ-law 8 kHz** - Avoid unnecessary transcoding
2. **Co-locate your server** - Deploy near your expected caller regions (e.g., US East for US traffic)
3. **Use streaming APIs** - Choose STT/TTS providers with streaming support
4. **Implement interruption** - Use `clearAudio` to stop playback when the user speaks
5. **Optimize LLM calls** - Use streaming responses and appropriate model sizes

Plivo routes calls through the edge location closest to the caller. A caller in London connects to Plivo's London edge, so position your WebSocket server near your expected caller locations.

***

## Basic Implementation

### 1. Configure Plivo to Stream Audio

Create an XML application that streams audio to your WebSocket:

```xml
<Response>
    <Speak>Connected to AI Assistant.</Speak>
    <Stream bidirectional="true" keepCallAlive="true">wss://your-domain.com/stream</Stream>
</Response>
```

### 2. Handle WebSocket Connection

Your server receives the WebSocket connection and processes events:

```python
# Simplified example: speech_to_text, get_llm_response, and text_to_speech
# are placeholders for your chosen AI providers.
import base64
import json


async def handle_websocket(websocket):
    stream_id = None
    async for message in websocket:
        event = json.loads(message)

        if event["event"] == "start":
            # Stream started - initialize AI services
            stream_id = event["start"]["streamId"]

        elif event["event"] == "media":
            # Audio received - decode the base64 payload and send to STT
            audio_bytes = base64.b64decode(event["media"]["payload"])
            transcript = await speech_to_text(audio_bytes)

            if transcript:
                # Get AI response
                response = await get_llm_response(transcript)

                # Convert to speech and send back
                audio = await text_to_speech(response)
                await websocket.send(json.dumps({
                    "event": "playAudio",
                    "media": {
                        "contentType": "audio/x-mulaw",
                        "sampleRate": 8000,
                        "payload": base64.b64encode(audio).decode()
                    }
                }))
```

***

## Next Steps

* Complete documentation: XML configuration, WebSocket protocol, APIs, callbacks, signature validation, and code examples
* Troubleshooting tips and optimization recommendations
* Official SDKs for Python, Node.js, and Java with built-in audio handling
* Build with the Pipecat framework for a higher-level abstraction

***

## Related

* [Voice API Overview](/voice/concepts/overview/) - Core voice platform concepts
* [Voice API Reference](/voice/api/overview/) - Complete API documentation
* [XML Reference](/voice/xml/overview/) - All XML elements for call control