The Plivo Stream SDK provides official libraries for Python, Node.js, and Java to build AI voice agents using Plivo’s Audio Streaming API. These SDKs handle WebSocket connections, audio encoding/decoding, and event management, letting you focus on your AI integration logic.

What You Can Build

  • AI Voice Assistants - Natural conversations powered by speech-to-text, LLMs, and text-to-speech
  • Real-time Transcription - Live call transcription with speech recognition services
  • Voice Bots - Automated IVR systems with intelligent responses
  • Call Analytics - Real-time audio analysis and sentiment detection

Get Started with Plivo

Before building your AI voice agent, sign up for a Plivo account or sign in to your existing one, then purchase a voice-enabled number through the Plivo console.

Prerequisites

Required Accounts

  • Plivo - Account with Auth ID and Auth Token
  • Deepgram - Sign up for speech-to-text
  • OpenAI - Sign up for conversational AI
  • ElevenLabs - Sign up for text-to-speech

Language Requirements

  • Python 3.8 or later
  • pip package manager

Installation

pip install plivo-stream
The Python SDK supports two WebSocket implementations:
  • FastAPI - For production applications using ASGI
  • websockets - Lightweight option for simple use cases (see the raw-frame sketch below)
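
To give a feel for the lightweight route, here is a minimal sketch of a raw websockets server that accepts the stream connection without the SDK's handler class. The frame field names ("event", "media.payload") are illustrative assumptions about the wire format, not an authoritative specification; with the SDK you would use its handler class instead.

# Minimal sketch: a raw websockets endpoint for Plivo audio streaming,
# shown without the SDK's handler class. Frame field names are
# illustrative assumptions, not an authoritative wire format.
import asyncio
import base64
import json
import websockets  # pip install websockets (v11+ handler signature)

async def stream_handler(websocket):
    async for message in websocket:
        frame = json.loads(message)
        if frame.get("event") == "media":
            # Audio arrives base64-encoded; decode before passing to STT
            audio_bytes = base64.b64decode(frame["media"]["payload"])
            # ... forward audio_bytes to your pipeline ...

async def main():
    async with websockets.serve(stream_handler, "0.0.0.0", 5000):
        await asyncio.Future()  # run forever

if __name__ == "__main__":
    asyncio.run(main())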

Core Concepts

Audio Streaming Flow

┌─────────────┐    WebSocket    ┌─────────────┐    API Calls    ┌─────────────┐
│   Plivo     │ ───────────────▶│  Your App   │ ───────────────▶│  AI Services│
│   Call      │ ◀─────────────── │  (SDK)      │ ◀─────────────── │  STT/LLM/TTS│
└─────────────┘   Audio Events   └─────────────┘   Text/Audio    └─────────────┘
  1. Caller dials your Plivo number
  2. Plivo connects to your WebSocket endpoint
  3. SDK receives START event with stream metadata
  4. Audio flows as MEDIA events (base64-encoded mu-law)
  5. Your app processes audio through AI services
  6. SDK sends audio back to the caller

Event Types

Event   Description
START   Stream initialized with call metadata (stream ID, call UUID, from/to numbers)
MEDIA   Audio chunk received (base64-encoded, mu-law at 8kHz or linear PCM at 16kHz)
DTMF    Caller pressed a key on their phone
STOP    Stream ended

Audio Formats

Format          Encoding     Sample Rate   Use Case
audio/x-mulaw   mu-law       8000 Hz       Standard telephony (default)
audio/x-l16     Linear PCM   16000 Hz      Higher quality for STT
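
If your STT service expects 16 kHz linear PCM but the stream delivers 8 kHz mu-law, you can transcode in Python before forwarding the audio. A minimal sketch using the standard-library audioop module (deprecated since Python 3.11 and removed in 3.13, where the audioop-lts backport restores it):

# Sketch: transcode 8 kHz mu-law from the stream into 16 kHz 16-bit PCM.
# audioop is stdlib up to Python 3.12; on 3.13+ install audioop-lts.
import audioop

def mulaw8k_to_pcm16k(mulaw_bytes: bytes) -> bytes:
    pcm8k = audioop.ulaw2lin(mulaw_bytes, 2)  # width 2 = 16-bit samples
    pcm16k, _state = audioop.ratecv(pcm8k, 2, 1, 8000, 16000, None)
    return pcm16k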

Quick Start

Step 1: Create a WebSocket Handler

from fastapi import FastAPI, WebSocket
from plivo_stream import PlivoFastAPIStreamingHandler, StartEvent, MediaEvent

app = FastAPI()

@app.websocket("/stream")
async def websocket_endpoint(websocket: WebSocket):
    handler = PlivoFastAPIStreamingHandler(websocket)

    @handler.on_start
    async def handle_start(event: StartEvent):
        print(f"Stream started: {handler.get_stream_id()}")
        print(f"Call from: {event.start.from_}")
        print(f"Call to: {event.start.to}")

    @handler.on_media
    async def handle_media(event: MediaEvent):
        # Get raw audio bytes from the event
        audio_bytes = event.get_raw_media()

        # Process audio (send to STT, etc.)
        # ...

        # Send audio back to the caller; until you plug in an AI
        # pipeline, echo the caller's own audio as a placeholder
        await handler.send_media(audio_bytes)

    @handler.on_dtmf
    async def handle_dtmf(event):
        print(f"DTMF digit pressed: {event.dtmf.digit}")

    @handler.on_stop
    async def handle_stop(event):
        print("Stream ended")

    await handler.start()

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=5000)

Step 2: Configure Plivo to Stream Audio

Create an XML application that routes calls to your WebSocket endpoint:
<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <Speak>Connected to AI Assistant. You may begin speaking.</Speak>
    <Stream keepCallAlive="true" audioTrack="both" contentType="audio/x-mulaw;rate=8000">
        wss://your-domain.com/stream
    </Stream>
</Response>
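
Plivo fetches this XML from your application's answer URL when a call comes in. Since the quick start already runs FastAPI, one option (a sketch; the /answer route name is just an example) is to serve the XML from the same app that hosts /stream:

# Sketch: serve the answer XML from the same FastAPI app as /stream.
# Point your Plivo application's answer URL at https://your-domain.com/answer.
from fastapi.responses import Response

ANSWER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <Speak>Connected to AI Assistant. You may begin speaking.</Speak>
    <Stream keepCallAlive="true" audioTrack="both" contentType="audio/x-mulaw;rate=8000">
        wss://your-domain.com/stream
    </Stream>
</Response>"""

@app.api_route("/answer", methods=["GET", "POST"])
async def answer():
    return Response(content=ANSWER_XML, media_type="application/xml")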

Step 3: Set Up Local Development

For local testing, use ngrok to expose your WebSocket endpoint:
# Install ngrok
brew install ngrok  # macOS
# or download from https://ngrok.com/download

# Start tunnel
ngrok http 5000
Update your Plivo XML with the ngrok URL:
<Stream keepCallAlive="true" audioTrack="both">
    wss://abc123.ngrok.app/stream
</Stream>

Building an AI Voice Agent

This example shows a complete AI voice agent using Deepgram (STT), OpenAI (LLM), and ElevenLabs (TTS).
import asyncio
import os
from fastapi import FastAPI, WebSocket
from plivo_stream import PlivoFastAPIStreamingHandler, StartEvent, MediaEvent
from deepgram import DeepgramClient, LiveOptions, LiveTranscriptionEvents
from openai import AsyncOpenAI
from elevenlabs import ElevenLabs

app = FastAPI()

# Initialize AI service clients
deepgram = DeepgramClient(os.environ["DEEPGRAM_API_KEY"])
openai_client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
elevenlabs = ElevenLabs(api_key=os.environ["ELEVENLABS_API_KEY"])

SYSTEM_PROMPT = """You are a helpful AI voice assistant. Keep responses
concise and conversational. Respond naturally as if speaking on a phone call."""

@app.websocket("/stream")
async def websocket_endpoint(websocket: WebSocket):
    handler = PlivoFastAPIStreamingHandler(websocket)
    conversation_history = []

    # Set up Deepgram live transcription (async client in SDK v3)
    dg_connection = deepgram.listen.asynclive.v("1")

    async def on_transcript(self, result, **kwargs):
        transcript = result.channel.alternatives[0].transcript
        if transcript and result.is_final:
            # Got a final transcript; process it with the LLM
            await process_with_ai(transcript)

    dg_connection.on(LiveTranscriptionEvents.Transcript, on_transcript)

    async def process_with_ai(user_text: str):
        conversation_history.append({"role": "user", "content": user_text})

        # Get response from OpenAI
        response = await openai_client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                *conversation_history
            ]
        )

        assistant_text = response.choices[0].message.content
        conversation_history.append({"role": "assistant", "content": assistant_text})

        # Convert to speech with ElevenLabs; the SDK call is blocking,
        # so run it in a worker thread to keep the event loop responsive
        def synthesize() -> bytes:
            audio = elevenlabs.text_to_speech.convert(
                text=assistant_text,
                voice_id="21m00Tcm4TlvDq8ikWAM",  # Rachel voice
                model_id="eleven_turbo_v2",
                output_format="ulaw_8000",  # mu-law for Plivo
            )
            return b"".join(audio)  # convert() yields audio chunks

        audio_bytes = await asyncio.to_thread(synthesize)

        # Send audio back to the caller
        await handler.send_media(audio_bytes)

    @handler.on_start
    async def handle_start(event: StartEvent):
        print(f"Call started from {event.start.from_}")
        # Tell Deepgram the wire format so raw mu-law frames decode correctly
        await dg_connection.start(LiveOptions(
            model="nova-2",
            encoding="mulaw",
            sample_rate=8000,
        ))

    @handler.on_media
    async def handle_media(event: MediaEvent):
        # Forward audio to Deepgram for transcription
        audio_bytes = event.get_raw_media()
        await dg_connection.send(audio_bytes)

    @handler.on_stop
    async def handle_stop(event):
        await dg_connection.finish()

    await handler.start()

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=5000)

SDK Reference

Sending Audio to Caller

# Send audio bytes to the caller
await handler.send_media(audio_bytes)

# Send a checkpoint (receive callback when audio finishes playing)
await handler.send_checkpoint(name="greeting_complete")

# Clear any queued audio (useful for interruptions)
await handler.send_clear_audio()

Event Handlers

Event           Handler                             Description
Connection      on_connected / onConnection         WebSocket connected (before START)
Start           on_start / onStart                  Stream initialized, call metadata available
Media           on_media / onMedia                  Audio chunk received
DTMF            on_dtmf / onDtmf                    Keypad digit pressed
Stop            on_stop / onStop                    Stream ended
Checkpoint      on_played_stream / onPlayedStream   Checkpoint reached (audio finished playing)
Audio Cleared   on_cleared_audio / onClearedAudio   Audio queue cleared
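
Checkpoints and audio clearing combine naturally for interruption (barge-in) handling: queue audio, set a checkpoint, and clear the queue if the caller starts talking over the bot. A minimal sketch using the handler names from the table above (the event payload shapes aren't specified on this page, so the handlers only log what they receive):

# Sketch: barge-in handling with checkpoints and clear-audio.
# Handler names come from the table above; event payload shapes are
# assumptions, so these handlers only log what they receive.
@handler.on_played_stream
async def on_checkpoint(event):
    print("Checkpoint reached:", event)  # queued audio finished playing

@handler.on_cleared_audio
async def on_cleared(event):
    print("Audio queue cleared:", event)

async def speak(audio_bytes: bytes):
    await handler.send_media(audio_bytes)
    await handler.send_checkpoint(name="utterance_done")

async def on_caller_interrupt():
    # The caller started speaking mid-response: drop queued audio
    await handler.send_clear_audio()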

Getting Stream Information

@handler.on_start
async def handle_start(event: StartEvent):
    stream_id = handler.get_stream_id()
    call_uuid = event.start.call_id
    from_number = event.start.from_
    to_number = event.start.to
    content_type = event.start.media_format.encoding  # audio/x-mulaw
    sample_rate = event.start.media_format.sample_rate  # 8000

Configuration Options

Environment Variables

Create a .env file with your credentials:
# Plivo credentials
PLIVO_AUTH_ID=your_auth_id
PLIVO_AUTH_TOKEN=your_auth_token

# AI service credentials
DEEPGRAM_API_KEY=your_deepgram_key
OPENAI_API_KEY=your_openai_key
ELEVENLABS_API_KEY=your_elevenlabs_key
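
The examples read these values via os.environ. To load the .env file automatically at startup, one common option is the third-party python-dotenv package:

# Sketch: load .env into os.environ before the service clients read it.
# Requires: pip install python-dotenv
from dotenv import load_dotenv

load_dotenv()  # looks for a .env file in the current working directory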

Plivo Stream XML Options

<Stream
    keepCallAlive="true"
    audioTrack="both"
    contentType="audio/x-mulaw;rate=8000"
    statusCallbackUrl="https://your-domain.com/stream-status"
    statusCallbackMethod="POST">
    wss://your-domain.com/stream
</Stream>
Attribute              Description
keepCallAlive          Keep the call active after the stream ends (true/false)
audioTrack             Audio direction: inbound, outbound, or both
contentType            Audio format: audio/x-mulaw;rate=8000 or audio/x-l16;rate=16000
statusCallbackUrl      URL for stream status webhooks
statusCallbackMethod   HTTP method for status callbacks (POST in the example above)
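
If you set statusCallbackUrl, you'll need an endpoint to receive those webhooks. A minimal sketch on the quick-start FastAPI app (the callback payload fields aren't documented on this page, so it just logs the raw form data):

# Sketch: receive stream status webhooks on the same FastAPI app.
# The payload fields are not documented here, so this endpoint just
# logs whatever Plivo posts.
from fastapi import Request

@app.post("/stream-status")
async def stream_status(request: Request):
    payload = dict(await request.form())
    print("Stream status:", payload)
    return {"received": True}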

Troubleshooting

WebSocket Connection Issues

  1. Verify ngrok is running and the URL matches your XML configuration
  2. Check firewall rules allow WebSocket connections on your server
  3. Validate SSL certificates if using custom domains

Audio Quality Issues

  1. Use correct audio format - mu-law at 8kHz for standard telephony
  2. Check sample rate matches between incoming and outgoing audio
  3. Monitor latency - keep total processing under 200 ms for natural conversation (see the timing sketch below)
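
A quick way to check item 3 is to time each conversational turn. The process_turn coroutine below is a hypothetical stand-in for your STT, LLM, and TTS chain:

# Sketch: measure per-turn latency. process_turn() is a hypothetical
# stand-in for your STT -> LLM -> TTS pipeline.
import time

async def timed_turn(audio_bytes: bytes) -> bytes:
    start = time.perf_counter()
    response_audio = await process_turn(audio_bytes)  # hypothetical
    elapsed_ms = (time.perf_counter() - start) * 1000
    if elapsed_ms > 200:
        print(f"Slow turn: {elapsed_ms:.0f} ms (target < 200 ms)")
    return response_audio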

No Audio Received

  1. Verify audioTrack is set to both or inbound in your XML
  2. Check handler is registered before calling start()
  3. Confirm call is connected - START event should fire first

Clone the Example Repositories

Full working examples are available in the SDK repositories:
git clone https://github.com/plivo/plivo-stream-sdk-python.git
cd plivo-stream-sdk-python/examples/demo
pip install -r requirements.txt
python server.py

Support