Build a voice agent with Google Gemini Live for native speech-to-speech processing. Gemini Live takes audio in and produces audio out with no intermediate text conversion, which lowers latency and makes conversations sound more natural.

Best for: multimodal applications that need low-latency audio, video, and text processing.

How Speech-to-Speech Differs

Standard Pipeline (STT → LLM → TTS):
Audio → Deepgram → Text → OpenAI → Text → Cartesia → Audio
Speech-to-Speech (Direct):
Audio → Gemini Live → Audio
Speech-to-speech models process audio natively, preserving tone, emotion, and context that may be lost in text transcription.

Prerequisites

Service | What You Need
Plivo | Auth ID, Auth Token, voice-enabled phone number
Google | API key from AI Studio with Gemini Live access

Installation

pip install "pipecat-ai[google]"

Environment Variables

# Plivo credentials
PLIVO_AUTH_ID=your_auth_id
PLIVO_AUTH_TOKEN=your_auth_token
PLIVO_PHONE_NUMBER=+1234567890

# Google credentials
GOOGLE_API_KEY=your_google_key
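
A missing credential usually surfaces late, as a failed call or a rejected API request, so it can help to check the variables before the bot starts. The following is a minimal sketch that assumes only the four variable names listed above; `missing_env_vars` is a hypothetical helper, not part of Pipecat or the Plivo SDK.

```python
import os

# Variable names taken from the Environment Variables list above.
REQUIRED_ENV_VARS = (
    "PLIVO_AUTH_ID",
    "PLIVO_AUTH_TOKEN",
    "PLIVO_PHONE_NUMBER",
    "GOOGLE_API_KEY",
)


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty.

    `env` defaults to the process environment; passing a plain dict
    makes the check easy to exercise without touching os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]
```

Calling `missing_env_vars()` at startup and stopping when the returned list is non-empty gives a clearer error than a failure in the middle of a call.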

Pipeline Configuration

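import os

from pipecat.services.google import GeminiLiveLLMService

# Speech-to-speech service: one service replaces the separate STT, LLM, and TTS stages
llm = GeminiLiveLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),  # read from the environment configured above
    # model="gemini-2.0-flash-exp",  # Check available models
)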

Gemini Live Features

Feature | Description
Multimodal processing | Handle audio, video, and text inputs together
Real-time streaming | Low-latency audio and video processing
Voice activity detection | Automatic speech handling
Function calling | Integrate external tools and APIs
Context management | Maintain conversation history

Architecture

With Gemini Live, the pipeline is simplified:
Phone Call ↔ Plivo ↔ WebSocket ↔ Pipecat ↔ Gemini Live
A single service handles:
  • Speech recognition
  • Language understanding
  • Response generation
  • Voice synthesis

Quick Start

Inbound Calls

git clone https://github.com/pipecat-ai/pipecat-examples.git
cd pipecat-examples/plivo-chatbot/inbound

# Configure environment
cp env.example .env
# Edit .env with Plivo and Google credentials

# Modify bot.py to use GeminiLiveLLMService
# Start server
uv sync && uv run server.py

# Expose with ngrok (development)
ngrok http 7860
Set your Plivo number’s Answer URL to the public URL that ngrok prints.
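
When Plivo requests the Answer URL, the server is expected to reply with Plivo XML that tells Plivo to stream the call audio over a WebSocket. The sketch below shows roughly what such a reply could look like, built with the standard library. It rests on assumptions: the `/ws` path, the `answer_xml` helper name, and the chosen attribute values are illustrative, and the example repository's `server.py` may differ.

```python
from urllib.parse import urlparse
import xml.etree.ElementTree as ET


def answer_xml(public_url, ws_path="/ws"):
    """Build a Plivo XML reply that streams call audio to a WebSocket.

    public_url: the public (e.g. ngrok) URL of the server.
    ws_path: hypothetical WebSocket route on that server.
    """
    host = urlparse(public_url).netloc
    response = ET.Element("Response")
    stream = ET.SubElement(
        response,
        "Stream",
        {
            "bidirectional": "true",  # audio flows both to and from the bot
            "keepCallAlive": "true",  # keep the call up while the stream is open
            "contentType": "audio/x-mulaw;rate=8000",
        },
    )
    # Plivo connects to a secure WebSocket, so the scheme becomes wss://
    stream.text = f"wss://{host}{ws_path}"
    return ET.tostring(response, encoding="unicode")
```

For example, `answer_xml("https://example.ngrok.io")` yields a `<Response>` whose `<Stream>` element points at `wss://example.ngrok.io/ws`.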

Outbound Calls

cd pipecat-examples/plivo-chatbot/outbound

cp env.example .env
uv sync && uv run server.py

# Initiate a call
curl -X POST http://localhost:7860/start \
  -H "Content-Type: application/json" \
  -d '{"phone_number": "+1234567890"}'
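
The same request can be sent from Python when calls are triggered by application code rather than a shell. This is a sketch using only the standard library; it assumes the `/start` endpoint and JSON body shown in the curl command, and the helper names `build_start_request` and `start_call` are hypothetical.

```python
import json
import urllib.request

# Assumption: the example server listens locally on port 7860, as in the curl command.
START_URL = "http://localhost:7860/start"


def build_start_request(phone_number, url=START_URL):
    """Build the same POST request as the curl command, without sending it."""
    body = json.dumps({"phone_number": phone_number}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def start_call(phone_number, url=START_URL):
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(build_start_request(phone_number, url)) as resp:
        return resp.read().decode("utf-8")
```

With the server running, `start_call("+1234567890")` would place the call. Building the request in a separate function makes it possible to inspect the payload without dialing a number.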

When to Use Gemini Live

Choose Gemini Live when:
  • You need multimodal capabilities (audio + video + text)
  • Latency is critical
  • You want a simpler architecture
  • You’re already in the Google ecosystem
Choose standard STT → LLM → TTS when:
  • You need specific voice characteristics (e.g., from ElevenLabs or Cartesia)
  • You want to mix providers (e.g., Deepgram STT + OpenAI LLM)
  • You need fine-grained control over each component