Build AI-powered voice agents that have natural conversations with callers using Plivo’s Audio Streaming. Stream live audio to your AI services (STT, LLM, TTS) via WebSocket and respond in real time.

Prerequisites

Before building your AI voice agent, you’ll need:
  • Plivo Account - Sign up and get your Auth ID and Auth Token
  • Phone Number - Purchase a voice-enabled number to receive/make calls. India numbers require KYC verification; see Rent India Numbers.
  • WebSocket Server - A publicly accessible server to handle audio streams (use ngrok for development)
  • AI Service Credentials - API keys for your chosen providers:
    - Speech-to-Text (STT): Deepgram, Google Speech, AWS Transcribe, etc.
    - LLM: OpenAI, Anthropic, Google Gemini, etc.
    - Text-to-Speech (TTS): ElevenLabs, Google TTS, Amazon Polly, etc.

Voice API Basics

Audio Streaming builds on Plivo’s Voice API. The core workflow is:
  1. Make or receive a call using the Call API
  2. Control the call using Plivo XML responses
  3. Stream audio using the <Stream> XML element
For complete Voice API documentation, see Voice API Overview.
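For example, an outbound call can be placed with the Plivo Python SDK. This is a minimal sketch; the phone numbers and answer URL are placeholders, and the answer URL must return Plivo XML such as the <Stream> document shown under Basic Implementation below.
# Minimal sketch: place an outbound call with the Plivo Python SDK.
# The numbers and URLs are placeholders; the answer_url must return
# Plivo XML (see "Basic Implementation" for a <Stream> example).
import plivo

client = plivo.RestClient(auth_id="YOUR_AUTH_ID", auth_token="YOUR_AUTH_TOKEN")

call = client.calls.create(
    from_="+14151234567",
    to_="+14157654321",
    answer_url="https://your-domain.com/answer",
    answer_method="GET")

print(call)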

What is Audio Streaming?

Audio Streaming gives you access to the raw audio of voice calls in real time via WebSockets. This enables:
  • AI Voice Assistants - Natural conversations with speech recognition and synthesis
  • Real-time Transcription - Live call transcription for analytics
  • Voice Bots - Automated IVR systems with intelligent responses
  • Sentiment Analysis - Real-time audio analysis during calls

How It Works

       ┌─────────────────┐
       │     Caller      │
       │    (Phone)      │
       └────────┬────────┘
                │
                ▼
       ┌─────────────────┐
       │   Plivo Voice   │
       │      API        │
       └────────┬────────┘
                │ Audio Stream (WebSocket)
                ▼
       ┌─────────────────┐
       │   Your App      │
       │  (WebSocket)    │
       └────────┬────────┘
                │ API Calls
                ▼
       ┌─────────────────┐
       │  AI Services    │
       │  STT/LLM/TTS    │
       └─────────────────┘
Flow:
  1. Caller dials your Plivo number (or you make an outbound call)
  2. Plivo connects to your WebSocket endpoint and starts streaming audio
  3. Your app sends audio to STT for transcription
  4. Transcribed text goes to LLM for response generation
  5. LLM response is converted to speech via TTS
  6. Audio is sent back through WebSocket to the caller

Stream Directions

Inbound Stream (Unidirectional)

Audio flows from the caller to your server. Use this when you only need to receive audio (e.g., transcription, call analytics).
<Stream bidirectional="false">
    wss://your-server.com/stream
</Stream>

Bidirectional Stream

Audio flows in both directions: from the caller to your server and from your server back to the caller. Use this for AI voice agents that need to respond.
<Stream bidirectional="true" keepCallAlive="true">
    wss://your-server.com/stream
</Stream>
For AI voice agents, always use bidirectional="true" and keepCallAlive="true" to maintain the call while your agent processes and responds.

Basic Implementation

1. Configure Plivo to Stream Audio

Create an XML application that streams audio to your WebSocket:
<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <Speak>Connected to AI Assistant.</Speak>
    <Stream
        keepCallAlive="true"
        bidirectional="true"
        contentType="audio/x-mulaw;rate=8000"
        statusCallbackUrl="https://your-domain.com/stream-status">
        wss://your-domain.com/stream
    </Stream>
</Response>
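Plivo fetches this XML from your application’s answer URL when the call is answered. As a minimal sketch, assuming a Flask app with a hypothetical /answer route, you could serve it like this:
# Minimal sketch (assumes Flask): serve the Stream XML from your answer URL.
from flask import Flask, Response

app = Flask(__name__)

STREAM_XML = """<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <Speak>Connected to AI Assistant.</Speak>
    <Stream
        keepCallAlive="true"
        bidirectional="true"
        contentType="audio/x-mulaw;rate=8000"
        statusCallbackUrl="https://your-domain.com/stream-status">
        wss://your-domain.com/stream
    </Stream>
</Response>"""

@app.route("/answer", methods=["GET", "POST"])
def answer():
    # Plivo requests this URL when the call is answered and executes the XML
    return Response(STREAM_XML, mimetype="text/xml")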

2. Handle WebSocket Connection

Your server receives the WebSocket connection and processes events:
# Simplified example: handle Plivo stream events over a WebSocket connection
import base64
import json

async def handle_websocket(websocket):
    async for message in websocket:
        event = json.loads(message)

        if event["event"] == "start":
            # Stream started - initialize AI services
            stream_id = event["start"]["streamId"]

        elif event["event"] == "media":
            # Audio received - decode the base64 payload and send it to STT
            audio_bytes = base64.b64decode(event["media"]["payload"])
            transcript = await speech_to_text(audio_bytes)  # your STT integration

            if transcript:
                # Get AI response from the LLM
                response = await get_llm_response(transcript)  # your LLM integration

                # Convert the response to speech and play it back to the caller
                audio = await text_to_speech(response)  # your TTS integration
                await websocket.send(json.dumps({
                    "event": "playAudio",
                    "media": {
                        "contentType": "audio/x-mulaw",
                        "sampleRate": 8000,
                        "payload": base64.b64encode(audio).decode()
                    }
                }))
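To accept the connection from Plivo, the handler above needs to run behind a WebSocket server reachable at the wss:// URL configured in your <Stream> element. A minimal sketch, assuming the third-party websockets package (recent versions, where the handler receives only the connection object) and an arbitrary port:
# Minimal sketch (assumes the `websockets` package): run the handler as a server.
# The port is arbitrary; expose it publicly (e.g., via ngrok) at the wss:// URL
# used in your <Stream> element.
import asyncio
import websockets

async def main():
    async with websockets.serve(handle_websocket, "0.0.0.0", 8765):
        await asyncio.Future()  # run forever

if __name__ == "__main__":
    asyncio.run(main())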

Next Steps