Prerequisites
Before building your AI voice agent, you’ll need:

| Requirement | Description |
|---|---|
| Plivo Account | Sign up and get your Auth ID and Auth Token |
| Phone Number | Purchase a voice-enabled number to receive or make calls. India numbers require KYC verification; see Rent India Numbers. |
| WebSocket Server | A publicly accessible server to handle audio streams (use ngrok for development) |
| AI Service Credentials | API keys for your chosen providers: Speech-to-Text (STT) such as Deepgram, Google Speech, or AWS Transcribe; LLM such as OpenAI, Anthropic, or Google Gemini; Text-to-Speech (TTS) such as ElevenLabs, Google TTS, or Amazon Polly |
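Before wiring anything up, it can help to fail fast when a credential is missing. The sketch below assumes credentials are supplied as environment variables. Every variable name in it is hypothetical, so replace them with the names your providers and deployment actually use.

```python
import os

# Hypothetical environment variable names; rename them to match the
# STT, LLM, and TTS providers and deployment tooling you actually use.
REQUIRED_VARS = [
    "PLIVO_AUTH_ID",
    "PLIVO_AUTH_TOKEN",
    "STT_API_KEY",
    "LLM_API_KEY",
    "TTS_API_KEY",
]


def missing_credentials(env=None):
    """Return the required variable names that are unset or empty.

    Reads os.environ by default; pass a dict to check another mapping.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling missing_credentials() at startup and exiting when it returns a non-empty list surfaces configuration mistakes before the first call arrives, rather than midway through a conversation.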
Voice API Basics
Audio Streaming builds on Plivo’s Voice API. The core workflow is:
- Make or receive a call using the Call API
- Control the call using Plivo XML responses
- Stream audio using the <Stream> XML element
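As a rough illustration of the last two steps, the sketch below uses Python's standard-library ElementTree to build the kind of XML response an answer URL might return. The bidirectional and keepCallAlive attributes come from this page; the contentType value and the function name build_stream_response are assumptions, so confirm the supported attributes and values in the XML Reference.

```python
import xml.etree.ElementTree as ET


def build_stream_response(ws_url, bidirectional=True, keep_call_alive=True,
                          content_type="audio/x-mulaw;rate=8000"):
    """Return a Plivo XML document that streams call audio to ws_url.

    The content_type default is an assumed value; check it against the
    audio formats listed in the Audio Streaming guide.
    """
    response = ET.Element("Response")
    stream = ET.SubElement(response, "Stream", {
        # Plivo XML expects lowercase "true"/"false" strings.
        "bidirectional": str(bidirectional).lower(),
        "keepCallAlive": str(keep_call_alive).lower(),
        "contentType": content_type,
    })
    # The WebSocket URL is the text content of the <Stream> element.
    stream.text = ws_url
    return ET.tostring(response, encoding="unicode")
```

For example, build_stream_response("wss://example.com/stream") returns `<Response><Stream bidirectional="true" keepCallAlive="true" contentType="audio/x-mulaw;rate=8000">wss://example.com/stream</Stream></Response>`, where example.com stands in for your own publicly reachable WebSocket host.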
What is Audio Streaming?
Audio Streaming gives you access to the raw audio of voice calls in real time via WebSockets. This enables:
- AI Voice Assistants - Natural conversations with speech recognition and synthesis
- Real-time Transcription - Live call transcription for analytics
- Voice Bots - Automated IVR systems with intelligent responses
- Sentiment Analysis - Real-time audio analysis during calls
How It Works
- A caller dials your Plivo number (or you make an outbound call)
- Plivo connects to your WebSocket endpoint and starts streaming the call audio
- Your app sends the audio to an STT service for transcription
- The transcribed text goes to an LLM, which generates a response
- The LLM response is converted to speech via TTS
- The synthesized audio is sent back over the WebSocket to the caller
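The loop above can be sketched as a small message handler. This is a minimal, synchronous sketch: stt, llm, and tts are hypothetical stand-ins for your provider SDKs, and the event names (media, playAudio) and JSON fields (media.payload, contentType, sampleRate) are assumptions based on Plivo's stream protocol that should be checked against the Audio Streaming Guide. Calling STT on every chunk is also a simplification; real STT services stream partial results and signal when the caller has finished speaking.

```python
import base64
import json
from typing import Callable, List


class VoiceAgent:
    """Glue for the STT -> LLM -> TTS loop over a Plivo audio stream.

    stt, llm, and tts are hypothetical callables standing in for real
    provider clients: stt(bytes) -> str, llm(str) -> str, tts(str) -> bytes.
    """

    def __init__(self, stt: Callable[[bytes], str],
                 llm: Callable[[str], str],
                 tts: Callable[[str], bytes]):
        self.stt = stt
        self.llm = llm
        self.tts = tts

    def handle_message(self, raw: str) -> List[str]:
        """Process one WebSocket text frame; return frames to send back."""
        msg = json.loads(raw)
        if msg.get("event") != "media":
            # "start", "stop", and other events need no reply in this sketch.
            return []
        # Caller audio arrives base64-encoded in the media payload.
        audio = base64.b64decode(msg["media"]["payload"])
        text = self.stt(audio)
        if not text:
            return []  # nothing recognized in this chunk
        reply_audio = self.tts(self.llm(text))
        # Assumed outbound format; it must match the stream's contentType.
        return [json.dumps({
            "event": "playAudio",
            "media": {
                "contentType": "audio/x-mulaw",
                "sampleRate": 8000,
                "payload": base64.b64encode(reply_audio).decode("ascii"),
            },
        })]
```

To run it against a live call, you could accept connections with a WebSocket library such as websockets and, for each incoming text frame, send every frame that handle_message returns. Because the provider calls here are blocking, a production agent would typically use the providers' async clients or move these calls off the event loop.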
Stream Directions
Inbound Stream (Unidirectional)
Audio flows from the caller to your server. Use this when you only need to receive audio (e.g., transcription, call analytics).
Bidirectional Stream
Audio flows in both directions: from the caller to your server, and from your server back to the caller. Use this for AI voice agents that need to respond.
For AI voice agents, always set bidirectional="true" and keepCallAlive="true" so the call stays connected while your agent processes input and responds.
Basic Implementation
1. Configure Plivo to Stream Audio
Create an XML application whose answer URL returns Plivo XML with a <Stream> element pointing at your WebSocket endpoint.
2. Handle WebSocket Connection
Your server receives the WebSocket connection from Plivo and processes the stream events it sends.
Next Steps
Audio Streaming Guide
Complete documentation: XML configuration, WebSocket protocol, APIs, callbacks, signature validation, and code examples
Best Practices
Troubleshooting tips and optimization recommendations
Plivo Stream SDK
Official SDKs for Python, Node.js, and Java with built-in audio handling
Pipecat Integration
Build with the Pipecat framework for a higher-level abstraction
Related
- Voice API Overview - Core voice platform concepts
- Voice API Reference - Complete API documentation
- XML Reference - All XML elements for call control