Architecture
Plivo Audio Streaming enables real-time, bidirectional audio communication between your application and an ongoing phone call via WebSocket.

High-Level Flow
Step-by-Step Flow
- Call Initiation: A caller dials your Plivo number, or your application initiates an outbound call.
- Answer URL Request: Plivo makes an HTTP request to your configured Answer URL.
- Stream XML Response: Your server responds with XML containing the `<Stream>` element, specifying the WebSocket URL and streaming parameters.
- WebSocket Connection: Plivo establishes a WebSocket connection to your specified URL, validating signatures if configured.
- Start Event: Plivo sends a `start` event containing call metadata (call ID, stream ID, media format, etc.).
- Media Streaming:
  - Inbound: Plivo continuously sends `media` events containing base64-encoded audio chunks from the caller.
  - Outbound: Your server sends `playAudio` events with base64-encoded audio to be played to the caller.
- DTMF Events: When the caller presses keys, Plivo sends `dtmf` events with the digit information.
- Control Events: Your server can send `clearAudio` to interrupt playback or `checkpoint` to track playback progress.
- Confirmation Events: Plivo sends `playedStream` when audio finishes playing and `clearedAudio` when the queue is cleared.
- Connection Close: When the call ends or streaming stops, the WebSocket connection closes.
Stream XML
The `<Stream>` XML element initiates audio streaming for a call. Include it in your Answer URL response.
Basic Syntax
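A minimal sketch of the element, wrapped in Plivo's standard `<Response>` document (the WebSocket URL is a placeholder; available attributes are listed in the table below):

```xml
<Response>
    <Stream>wss://yourserver.example.com/stream</Stream>
</Response>
```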
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `bidirectional` | boolean | `false` | Enable two-way audio streaming. When `true`, you can send audio back to the caller. |
| `keepCallAlive` | boolean | `false` | Keep the call active after the stream ends. When `false`, the call ends when streaming stops. |
| `contentType` | string | `audio/x-mulaw;rate=8000` | Audio codec and sample rate. See Supported Content Types. |
| `statusCallbackUrl` | string | — | URL for stream status callbacks (started, stopped, failed). |
| `statusCallbackMethod` | string | `POST` | HTTP method for status callbacks (GET or POST). |
| `extraHeaders` | string | — | Custom headers to include in the start event. Format: `key1=value1;key2=value2` |
Supported Content Types
| Content Type | Description | Use Case |
|---|---|---|
| `audio/x-mulaw;rate=8000` | μ-law codec at 8kHz | Recommended. Standard telephony, lowest latency, best compatibility. |
| `audio/x-l16;rate=8000` | Linear PCM 16-bit at 8kHz | Higher quality for speech processing. |
| `audio/x-l16;rate=16000` | Linear PCM 16-bit at 16kHz | High-quality speech recognition. |
Examples
Basic Unidirectional Stream (Listen Only)
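With all attributes left at their defaults, the stream is one-way: Plivo sends caller audio to your server. The URL is a placeholder:

```xml
<Response>
    <Stream>wss://yourserver.example.com/listen</Stream>
</Response>
```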
Bidirectional Stream with μ-law Codec
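A sketch enabling two-way audio with the recommended codec (URL is a placeholder):

```xml
<Response>
    <Stream bidirectional="true" contentType="audio/x-mulaw;rate=8000">wss://yourserver.example.com/stream</Stream>
</Response>
```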
Stream with Status Callbacks and Extra Headers
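A sketch combining status callbacks and custom headers; all URLs and header values are placeholders:

```xml
<Response>
    <Stream bidirectional="true"
            statusCallbackUrl="https://yourserver.example.com/stream-status"
            statusCallbackMethod="POST"
            extraHeaders="session=abc123;lang=en">wss://yourserver.example.com/stream</Stream>
</Response>
```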
Higher Quality Stream (16kHz)
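A sketch using 16-bit linear PCM at 16kHz (URL is a placeholder):

```xml
<Response>
    <Stream bidirectional="true" contentType="audio/x-l16;rate=16000">wss://yourserver.example.com/stream</Stream>
</Response>
```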
Record After Stream
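One possible arrangement, assuming `keepCallAlive="true"` so the call continues into the next element after streaming stops; the `<Record>` attributes shown are illustrative and should be checked against Plivo's XML reference:

```xml
<Response>
    <Stream keepCallAlive="true">wss://yourserver.example.com/stream</Stream>
    <Record maxLength="60" />
</Response>
```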
Stream APIs
The Plivo Stream API allows you to control active streams programmatically via REST API calls.

Base URL

`https://api.plivo.com/v1/Account/{auth_id}/`
Authentication
Use HTTP Basic Authentication with your Plivo Auth ID and Auth Token.

Stop a Stream
Stop an active stream on a call.

Endpoint: `DELETE /v1/Account/{auth_id}/Call/{call_uuid}/Stream/`
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| `auth_id` | string | Yes | Your Plivo Auth ID |
| `call_uuid` | string | Yes | The UUID of the call |
Get Stream Details
Retrieve information about active streams on a call.

Endpoint: `GET /v1/Account/{auth_id}/Call/{call_uuid}/Stream/`
Example Request:
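A sketch of calling the endpoint above with HTTP Basic auth from Node.js; the Auth ID, token, and call UUID are placeholders:

```javascript
// Build a request for GET .../Call/{call_uuid}/Stream/ using Basic auth.
function buildStreamRequest(authId, authToken, callUuid) {
  const url = `https://api.plivo.com/v1/Account/${authId}/Call/${callUuid}/Stream/`;
  // Basic auth header: base64("auth_id:auth_token")
  const credentials = Buffer.from(`${authId}:${authToken}`).toString("base64");
  return {
    url,
    options: {
      method: "GET",
      headers: { Authorization: `Basic ${credentials}` },
    },
  };
}

// Usage (Node 18+):
// const { url, options } = buildStreamRequest("MA...", "token", "call-uuid");
// const details = await fetch(url, options).then((r) => r.json());
```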
Using the Plivo SDK
Node.js
Python
Stream Status Callback URL
Configure a callback URL to receive notifications about stream lifecycle events.

Configuration
Set the `statusCallbackUrl` attribute in your Stream XML:
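For example (URLs are placeholders):

```xml
<Response>
    <Stream statusCallbackUrl="https://yourserver.example.com/stream-status">wss://yourserver.example.com/stream</Stream>
</Response>
```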
Callback Events
Your callback URL receives POST (or GET) requests with the following parameters:

| Parameter | Type | Description |
|---|---|---|
| `CallUUID` | string | The unique identifier for the call |
| `StreamID` | string | The unique identifier for the stream |
| `Event` | string | The event type: `started`, `stopped`, `failed` |
| `Timestamp` | string | ISO 8601 timestamp of the event |
| `From` | string | The caller’s phone number |
| `To` | string | The called phone number |
| `Direction` | string | Call direction: `inbound` or `outbound` |
| `StatusReason` | string | Reason for status (on `stopped` or `failed`) |
| `Duration` | number | Stream duration in seconds (on `stopped`) |
Event Types
started
Sent when the WebSocket connection is successfully established.
stopped
Sent when the stream ends normally.
failed
Sent when the stream fails to start or encounters an error.
Example Callback Handler
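A minimal handler sketch; parameter names follow the table above, and wiring into an HTTP framework is left out for brevity:

```javascript
// Dispatch on the Event field of a stream status callback.
// Returns a log line; a real handler might update call state or alert on failures.
function handleStreamStatus(params) {
  switch (params.Event) {
    case "started":
      return `stream ${params.StreamID} started for call ${params.CallUUID}`;
    case "stopped":
      return `stream ${params.StreamID} stopped after ${params.Duration}s`;
    case "failed":
      return `stream ${params.StreamID} failed: ${params.StatusReason}`;
    default:
      return `unhandled event: ${params.Event}`;
  }
}
```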
Plivo Signature Validation
Plivo signs WebSocket connection requests to verify authenticity. Validate these signatures to ensure requests originate from Plivo.

V3 Signature Headers
Plivo includes two headers with each WebSocket connection request:

| Header | Description |
|---|---|
| `X-Plivo-Signature-V3` | The HMAC-SHA256 signature |
| `X-Plivo-Signature-V3-Nonce` | A unique nonce for this request |
Validation Process
- Construct the signature base string: `{METHOD}{URI}{NONCE}`
- Compute an HMAC-SHA256 of this string using your Auth Token as the key
- Base64-encode the result
- Compare it with the `X-Plivo-Signature-V3` header
Using the Plivo SDK
The Plivo SDK provides a built-in validation function.

Using the Node.js Stream SDK
The `plivo-stream-sdk-node` handles signature validation automatically. When `validateSignature` is enabled, connections with invalid signatures are automatically rejected with a 1008 WebSocket close code.
Manual Validation Example
The Plivo Stream Event Protocol
All communication over the WebSocket uses JSON messages. Events are categorized as Input Events (from Plivo to your server) and Output Events (from your server to Plivo).

Input Events (Plivo → Your Server)
start
Sent once when the stream begins. Contains call and stream metadata.
| Field | Type | Description |
|---|---|---|
| `event` | string | Always `"start"` |
| `sequenceNumber` | number | Event sequence number (starts at 1) |
| `start.callId` | string (UUID) | Unique identifier for the call |
| `start.streamId` | string (UUID) | Unique identifier for the stream |
| `start.accountId` | string | Your Plivo account ID |
| `start.tracks` | string[] | Audio tracks being streamed (e.g., `["inbound"]`, `["inbound", "outbound"]`) |
| `start.mediaFormat.encoding` | string | Audio codec (e.g., `"audio/x-mulaw"`) |
| `start.mediaFormat.sampleRate` | number | Sample rate in Hz |
| `extra_headers` | string | Custom headers from the Stream XML `extraHeaders` attribute |
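An illustrative `start` payload assembled from the fields above (all IDs and values are placeholders):

```json
{
  "event": "start",
  "sequenceNumber": 1,
  "start": {
    "callId": "12345678-1234-1234-1234-123456789012",
    "streamId": "87654321-4321-4321-4321-210987654321",
    "accountId": "MAXXXXXXXXXXXXXXXXXX",
    "tracks": ["inbound"],
    "mediaFormat": {
      "encoding": "audio/x-mulaw",
      "sampleRate": 8000
    }
  },
  "extra_headers": "session=abc123"
}
```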
media
Sent continuously during the call. Contains audio data from the caller.
| Field | Type | Description |
|---|---|---|
| `event` | string | Always `"media"` |
| `sequenceNumber` | number | Event sequence number |
| `streamId` | string (UUID) | Stream identifier |
| `media.track` | string | Audio track (`"inbound"` = caller audio) |
| `media.timestamp` | string | Unix timestamp in milliseconds |
| `media.chunk` | number | Chunk sequence number for this track |
| `media.payload` | string | Base64-encoded audio data |
| `extra_headers` | string | Custom headers from the Stream XML |
- Each chunk contains approximately 20ms of audio
- At 8kHz with μ-law encoding: ~160 bytes per chunk
- Decode using: `Buffer.from(payload, 'base64')`
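Putting the notes above together, a sketch of decoding an incoming `media` message (the event shape follows the table above):

```javascript
// Parse a raw WebSocket message and return the decoded audio bytes,
// or null if the message is not a media event.
function decodeMediaEvent(rawMessage) {
  const msg = JSON.parse(rawMessage);
  if (msg.event !== "media") return null;
  // base64 → raw μ-law bytes (~160 bytes per 20ms chunk at 8kHz)
  return Buffer.from(msg.media.payload, "base64");
}
```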
dtmf
Sent when the caller presses a key on their phone.
| Field | Type | Description |
|---|---|---|
| `event` | string | Always `"dtmf"` |
| `sequenceNumber` | number | Event sequence number |
| `streamId` | string (UUID) | Stream identifier |
| `dtmf.track` | string | Audio track (`"inbound"`) |
| `dtmf.digit` | string | The DTMF digit pressed (0-9, *, #, A-D) |
| `dtmf.timestamp` | string | Unix timestamp in milliseconds |
| `extra_headers` | string | Custom headers from the Stream XML |
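A sketch of dispatching `dtmf` events to per-digit actions; the menu actions are illustrative, not part of the protocol:

```javascript
// Look up an action for the pressed digit and run it.
// `actions` maps digits ("0"-"9", "*", "#", "A"-"D") to functions.
function handleDtmf(msg, actions) {
  if (msg.event !== "dtmf") return null;
  const handler = actions[msg.dtmf.digit];
  return handler ? handler() : null;
}

// Example menu: 1 → sales, 2 → support
// handleDtmf(event, { "1": () => routeTo("sales"), "2": () => routeTo("support") });
```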
playedStream
Confirmation that audio with a checkpoint has finished playing.
| Field | Type | Description |
|---|---|---|
| `event` | string | Always `"playedStream"` |
| `sequenceNumber` | number | Event sequence number |
| `streamId` | string (UUID) | Stream identifier |
| `name` | string | The checkpoint name you specified |
clearedAudio
Confirmation that the audio queue has been cleared.
| Field | Type | Description |
|---|---|---|
| `event` | string | Always `"clearedAudio"` |
| `sequenceNumber` | number | Event sequence number |
| `streamId` | string (UUID) | Stream identifier |
Output Events (Your Server → Plivo)
playAudio
Send audio to be played to the caller. For bidirectional streams only.
| Field | Type | Description |
|---|---|---|
| `event` | string | Always `"playAudio"` |
| `media.contentType` | string | Audio MIME type (must match the stream’s `contentType`) |
| `media.sampleRate` | number | Sample rate in Hz (must match the stream’s sample rate) |
| `media.payload` | string | Base64-encoded audio data |
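A sketch of sending a `playAudio` event over an open WebSocket `ws`, assuming `audioBuffer` holds raw μ-law audio matching the stream's negotiated format:

```javascript
// Encode raw audio as base64 and wrap it in a playAudio event.
function sendPlayAudio(ws, audioBuffer) {
  ws.send(JSON.stringify({
    event: "playAudio",
    media: {
      contentType: "audio/x-mulaw",
      sampleRate: 8000,
      payload: audioBuffer.toString("base64"),
    },
  }));
}
```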
checkpoint
Mark a point in the audio queue. Receive a playedStream event when playback reaches this point.
| Field | Type | Description |
|---|---|---|
| `event` | string | Always `"checkpoint"` |
| `streamId` | string (UUID) | Stream identifier |
| `name` | string | Unique identifier for this checkpoint |
Use checkpoints to:
- Track when a specific response finishes playing
- Coordinate actions after audio playback
- Measure time from sending audio to playback completion
clearAudio
Clear all queued audio. Use this to implement interruption.
| Field | Type | Description |
|---|---|---|
| `event` | string | Always `"clearAudio"` |
| `streamId` | string (UUID) | Stream identifier |
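The two control events combine naturally for interruption handling, a sketch: drop queued audio when the caller starts speaking, then checkpoint the next response so a `playedStream` event reports when it finishes.

```javascript
// Clear everything queued for playback on this stream.
function clearQueuedAudio(ws, streamId) {
  ws.send(JSON.stringify({ event: "clearAudio", streamId }));
}

// Mark a point in the queue; Plivo replies with playedStream when reached.
function markCheckpoint(ws, streamId, name) {
  ws.send(JSON.stringify({ event: "checkpoint", streamId, name }));
}
```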
X-Headers
X-Headers (Extra Headers) allow you to pass custom metadata from your Stream XML to your WebSocket server.

Configuration
Set the `extraHeaders` attribute in your Stream XML:
Format
- Key-value pairs separated by semicolons: `key1=value1;key2=value2`
- Keys and values are strings
- URL-encode values if they contain special characters
Accessing X-Headers
X-Headers appear in the `extra_headers` field of every event.
Parsing X-Headers
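A small parser for the format above (a sketch; the URL-decoding step assumes values were URL-encoded as recommended):

```javascript
// Parse "key1=value1;key2=value2" into an object.
function parseExtraHeaders(extraHeaders) {
  const out = {};
  if (!extraHeaders) return out;
  for (const pair of extraHeaders.split(";")) {
    const [key, ...rest] = pair.split("=");
    // Rejoin on "=" so values containing "=" survive
    if (key) out[key.trim()] = decodeURIComponent(rest.join("="));
  }
  return out;
}
```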
Why Use X-Headers?
- Session Correlation: Pass session IDs to correlate WebSocket connections with HTTP sessions
- User Context: Include user IDs, account tiers, or language preferences
- Routing: Pass information to route audio to different processing pipelines
- Analytics: Include tracking IDs for analytics and debugging
Example: Dynamic Agent Selection
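A hypothetical sketch: route the connection to a processing pipeline based on an `agent` X-Header. The header key and pipeline names are illustrative, not part of the Plivo API.

```javascript
// Illustrative pipeline registry.
const AGENTS = { sales: "sales-pipeline", support: "support-pipeline" };

// Pick a pipeline from the extra_headers string of the start event.
function selectAgent(extraHeaders) {
  const headers = Object.fromEntries(
    extraHeaders.split(";").map((pair) => pair.split("=")),
  );
  return AGENTS[headers.agent] || "default-pipeline";
}
```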
Limits
WebSocket URL Length
| Limit | Value |
|---|---|
| Maximum WebSocket URL length | 2048 characters |
Stream Limits
| Limit | Value |
|---|---|
| Maximum concurrent streams per call | 1 |
| Maximum stream duration | Same as call duration |
| Audio buffer size (playback queue) | ~60 seconds of audio |
Rate Limits
| Limit | Value |
|---|---|
| Media events per second | ~50 (approximately 20ms chunks) |
| Maximum playAudio events per second | No hard limit, but limited by playback buffer |
Message Size
| Limit | Value |
|---|---|
| Maximum WebSocket message size | 64 KB |
| Recommended audio chunk size | ≤16 KB base64-encoded |
Protocol Schema Reference
JSON Schema
TypeScript Types
Recommendations for an Effective Plivo Stream Experience
Audio Codec and Sample Rate Considerations
Recommended: μ-law 8000Hz
Why μ-law at 8kHz is the best choice for most applications:

- Native Telephony Format: μ-law (PCMU) is the standard codec for telephony networks. Using this format means no transcoding is required, reducing latency.
- Lowest Latency: Because it’s the native format, audio passes through Plivo with minimal processing overhead.
- Bandwidth Efficient: μ-law compresses 16-bit audio to 8-bit, reducing data transfer by 50% while maintaining voice quality.
- Universal Compatibility: Every speech-to-text and text-to-speech service supports μ-law. No conversion needed.
- Sufficient for Voice: Human speech is well-represented at 8kHz. Higher sample rates don’t significantly improve voice AI applications.
When to Use Higher Sample Rates
Consider 16kHz (`audio/x-l16;rate=16000`) only if:
- Your STT model specifically benefits from higher sample rates (verify with benchmarks)
- You’re doing audio analysis beyond speech recognition
- You have abundant bandwidth and can accept slightly higher latency
Minimize Latency for a Better Experience
1. Choose the Right Region for Your WebSocket Server
Key Latency Sources:

- PSTN to Plivo: Fixed, based on caller location
- Plivo to your server: Depends on server location
- Your server to AI services: Depends on AI provider regions
2. Server Location Strategy
| Your Use Case | Recommended Server Location |
|---|---|
| US-focused traffic | US East (Virginia) or US West (Oregon) |
| Europe-focused traffic | Frankfurt or London |
| Asia-Pacific traffic | Singapore or Mumbai |
| Global traffic | Deploy in multiple regions with geographic routing |
3. Latency Budget
For a responsive Voice AI experience, aim for:

| Component | Target Latency |
|---|---|
| Speech-to-Text | < 200ms |
| LLM Processing | < 500ms |
| Text-to-Speech | < 200ms |
| Network (round trip) | < 100ms |
| Total | < 1 second |
Where Is My Call Located? How Does Plivo Select the Location?
Plivo routes calls through the edge location closest to the caller’s location on the PSTN, not your server location.

Edge Locations:
- United States (multiple)
- Europe (London, Frankfurt)
- Asia-Pacific (Singapore, Mumbai, Sydney)
- And more
Example:
- A caller in London connects to Plivo’s London edge
- The WebSocket connects from London to your server
- Position your server close to your expected caller locations
India: Phone Numbers and Regulations
Indian telecommunications regulations require:

- Local Presence: Indian phone numbers require local business registration
- DND Registry: Respect the Do Not Disturb registry for outbound calls
- Content Restrictions: Certain types of automated content may be restricted
Where to Host Your WebSocket Server
Cloud Providers with Low-Latency Options:

| Provider | Best Regions for Voice |
|---|---|
| AWS | us-east-1, eu-west-1, ap-southeast-1 |
| Google Cloud | us-central1, europe-west1, asia-southeast1 |
| Azure | East US, West Europe, Southeast Asia |
| Fly.io | Automatic edge deployment |
| Cloudflare Workers | Global edge (for lightweight processing) |
- Use the same region as your AI services when possible
- Deploy WebSocket servers in multiple regions for global traffic
- Use connection pooling for AI service clients
- Keep WebSocket handlers lightweight—offload heavy processing
What Is Noise Cancellation and Why Do You Need It?
The Problem: Phone calls often include background noise—traffic, coffee shops, offices, wind. This noise degrades:

- Speech recognition accuracy
- Voice AI response quality
- Overall user experience
How It Works
- Real-time Processing: Audio is processed in milliseconds at the edge
- AI-Powered: Uses machine learning models trained on telephony noise patterns
- Voice Preservation: Enhances speech while removing noise
- No Code Changes: Works transparently with existing streams
Benefits
- Higher STT Accuracy: 15-30% reduction in word error rate
- Fewer Misunderstandings: Reduces need for “I didn’t understand that” responses
- Better User Experience: Callers can use your voice AI from anywhere
Enable Noise Cancellation
Noise cancellation is an account-level feature. Contact Plivo to enable it: 📧 [email protected], or reach out to your Plivo account manager.

How-To and Examples
Start a Plivo Stream with Stream XML
Basic Answer URL Handler: your Answer URL response returns Stream XML containing the `<Stream>` element.

Record a Plivo Stream with Stream XML
Record the call while streaming for compliance or training purposes.

Stop a Plivo Stream with the Stream API
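A sketch of the stop request against the DELETE endpoint described in the Stream APIs section; credentials and the call UUID are placeholders:

```javascript
// Build a DELETE .../Call/{call_uuid}/Stream/ request with Basic auth.
function buildStopStreamRequest(authId, authToken, callUuid) {
  return {
    url: `https://api.plivo.com/v1/Account/${authId}/Call/${callUuid}/Stream/`,
    options: {
      method: "DELETE",
      headers: {
        Authorization:
          "Basic " + Buffer.from(`${authId}:${authToken}`).toString("base64"),
      },
    },
  };
}

// Usage (Node 18+):
// const { url, options } = buildStopStreamRequest("MA...", "token", "call-uuid");
// await fetch(url, options);
```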
Example: Node.js Stream SDK with Deepgram, OpenAI, and ElevenLabs
A complete voice AI implementation pipes caller audio to Deepgram for speech-to-text, generates responses with OpenAI, and synthesizes speech with ElevenLabs.

Sending and Receiving DTMFs
Handle DTMF input for menu navigation or in-call controls.

Example with Python Stream SDK
Example with Pipecat
Pipecat is an open-source framework for building voice AI pipelines.

General Considerations for Voice AI Agents
Noise Cancellation
Why it matters: Background noise is the #1 cause of speech recognition errors.

Implementation:
- Enable Plivo’s built-in noise cancellation (contact support)
- Consider client-side noise suppression for high-quality microphones
- For mobile callers, noise is especially prevalent
Voice Activity Detection (VAD) and Turn Detection
The Challenge: Knowing when the user has finished speaking.

Approaches:
- Silence-based VAD: Wait for N milliseconds of silence
  - Pros: Simple
  - Cons: Slow, doesn’t handle pauses well
- STT End-of-Speech Detection: Most STT services provide `speech_final` events
  - Pros: Understands speech patterns
  - Cons: Slight delay
- Semantic Turn Detection: Use an LLM to determine if a response is needed
  - Pros: Handles complex dialogue
  - Cons: Added latency

Recommended: use `speech_final` with a short timeout (300-500ms).
Interruption
Users should be able to interrupt the AI mid-response: when new caller speech is detected, send `clearAudio` to drop any queued playback.

Context Management
Maintain conversation context for coherent multi-turn dialogue: keep a running message history and trim it when it grows too long.

Best Practices Summary
| Aspect | Recommendation |
|---|---|
| Codec | μ-law 8000Hz for lowest latency |
| Response Time | Aim for < 1 second total |
| Interruption | Always support—use clearAudio |
| DTMF | Support * for interrupt, # for repeat |
| Error Handling | Graceful fallbacks, don’t leave user hanging |
| Context | Maintain conversation history, trim when needed |
| Testing | Test on actual phone calls, not just WebSocket clients |
Support
For questions, issues, or feature requests:- Documentation: https://www.plivo.com/docs/
- Support: [email protected]
- GitHub Issues: For SDK-specific issues
Last updated: January 2026