Build AI-powered voice agents that have natural conversations with callers using Plivo’s Audio Streaming. Stream live audio to your AI services (STT, LLM, TTS) via WebSocket and respond in real time.

Prerequisites

Before building your AI voice agent, you’ll need:
  • Plivo Account - Sign up and get your Auth ID and Auth Token
  • Phone Number - Purchase a voice-enabled number to receive/make calls. India numbers require KYC verification; see Rent India Numbers.
  • WebSocket Server - A publicly accessible server to handle audio streams (use ngrok for development)
  • AI Service Credentials - API keys for your chosen providers:
    - Speech-to-Text (STT): Deepgram, Google Speech, AWS Transcribe, etc.
    - LLM: OpenAI, Anthropic, Google Gemini, etc.
    - Text-to-Speech (TTS): ElevenLabs, Google TTS, Amazon Polly, etc.

Voice API Basics

Audio Streaming builds on Plivo’s Voice API. The core workflow is:
  1. Make or receive a call using the Call API
  2. Control the call using Plivo XML responses
  3. Stream audio using the <Stream> XML element
For complete Voice API documentation, see Voice API Overview.
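For example, an outbound call can be placed with the Plivo Python SDK. This is a minimal sketch; the phone numbers and answer URL are placeholders, and the answer URL must return Plivo XML such as the <Stream> document shown under Basic Implementation below.
# Minimal sketch: place an outbound call with the Plivo Python SDK.
# The numbers and URLs are placeholders; the answer_url must return
# Plivo XML (see "Basic Implementation" for a <Stream> example).
import plivo

client = plivo.RestClient(auth_id="YOUR_AUTH_ID", auth_token="YOUR_AUTH_TOKEN")

call = client.calls.create(
    from_="+14151234567",
    to_="+14157654321",
    answer_url="https://your-domain.com/answer",
    answer_method="GET")

print(call)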

What is Audio Streaming?

Audio Streaming gives you access to the raw audio of voice calls in real time via WebSockets. This enables:
  • AI Voice Assistants - Natural conversations with speech recognition and synthesis
  • Real-time Transcription - Live call transcription for analytics
  • Voice Bots - Automated IVR systems with intelligent responses
  • Sentiment Analysis - Real-time audio analysis during calls

How It Works

       ┌─────────────────┐
       │     Caller      │
       │    (Phone)      │
       └────────┬────────┘
                │
                ▼
       ┌─────────────────┐
       │   Plivo Voice   │
       │      API        │
       └────────┬────────┘
                │ Audio Stream (WebSocket)
                ▼
       ┌─────────────────┐
       │   Your App      │
       │  (WebSocket)    │
       └────────┬────────┘
                │ API Calls
                ▼
       ┌─────────────────┐
       │  AI Services    │
       │  STT/LLM/TTS    │
       └─────────────────┘
Flow:
  1. Caller dials your Plivo number (or you make an outbound call)
  2. Plivo connects to your WebSocket endpoint and starts streaming audio
  3. Your app sends audio to STT for transcription
  4. Transcribed text goes to LLM for response generation
  5. LLM response is converted to speech via TTS
  6. Audio is sent back through WebSocket to the caller

Stream Directions

Inbound Stream (Unidirectional)

Audio flows from the caller to your server. Use this when you only need to receive audio (e.g., transcription, call analytics).
<Stream bidirectional="false">
    wss://your-server.com/stream
</Stream>

Bidirectional Stream

Audio flows in both directions: from the caller to your server and from your server back to the caller. Use this for AI voice agents that need to respond.
<Stream bidirectional="true" keepCallAlive="true">
    wss://your-server.com/stream
</Stream>
For AI voice agents, always use bidirectional="true" and keepCallAlive="true" to maintain the call while your agent processes and responds.

Basic Implementation

1. Configure Plivo to Stream Audio

Create an XML application that streams audio to your WebSocket:
<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <Speak>Connected to AI Assistant.</Speak>
    <Stream
        keepCallAlive="true"
        bidirectional="true"
        contentType="audio/x-mulaw;rate=8000"
        statusCallbackUrl="https://your-domain.com/stream-status">
        wss://your-domain.com/stream
    </Stream>
</Response>
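Plivo fetches this XML from your application’s answer URL when the call is answered. As a minimal sketch, assuming a Flask app with a hypothetical /answer route, you could serve it like this:
# Minimal sketch (assumes Flask): serve the Stream XML from your answer URL.
from flask import Flask, Response

app = Flask(__name__)

STREAM_XML = """<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <Speak>Connected to AI Assistant.</Speak>
    <Stream
        keepCallAlive="true"
        bidirectional="true"
        contentType="audio/x-mulaw;rate=8000"
        statusCallbackUrl="https://your-domain.com/stream-status">
        wss://your-domain.com/stream
    </Stream>
</Response>"""

@app.route("/answer", methods=["GET", "POST"])
def answer():
    # Plivo requests this URL when the call is answered and executes the XML
    return Response(STREAM_XML, mimetype="text/xml")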

2. Handle WebSocket Connection

Your server receives the WebSocket connection and processes events:
# Simplified example: handle Plivo stream events over a WebSocket connection
import base64
import json

async def handle_websocket(websocket):
    async for message in websocket:
        event = json.loads(message)

        if event["event"] == "start":
            # Stream started - initialize AI services
            stream_id = event["start"]["streamId"]

        elif event["event"] == "media":
            # Audio received - decode the base64 payload and send it to STT
            audio_bytes = base64.b64decode(event["media"]["payload"])
            transcript = await speech_to_text(audio_bytes)  # your STT integration

            if transcript:
                # Get AI response from the LLM
                response = await get_llm_response(transcript)  # your LLM integration

                # Convert the response to speech and play it back to the caller
                audio = await text_to_speech(response)  # your TTS integration
                await websocket.send(json.dumps({
                    "event": "playAudio",
                    "media": {
                        "contentType": "audio/x-mulaw",
                        "sampleRate": 8000,
                        "payload": base64.b64encode(audio).decode()
                    }
                }))
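To accept the connection from Plivo, the handler above needs to run behind a WebSocket server reachable at the wss:// URL configured in your <Stream> element. A minimal sketch, assuming the third-party websockets package (recent versions, where the handler receives only the connection object) and an arbitrary port:
# Minimal sketch (assumes the `websockets` package): run the handler as a server.
# The port is arbitrary; expose it publicly (e.g., via ngrok) at the wss:// URL
# used in your <Stream> element.
import asyncio
import websockets

async def main():
    async with websockets.serve(handle_websocket, "0.0.0.0", 8765):
        await asyncio.Future()  # run forever

if __name__ == "__main__":
    asyncio.run(main())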

Next Steps