How Speech-to-Speech Differs
Standard Pipeline (STT → LLM → TTS):
Prerequisites
| Service | What You Need |
|---|---|
| Plivo | Auth ID, Auth Token, Voice-enabled phone number |
| Google | API key from AI Studio with Gemini Live access |
Installation
Environment Variables
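The credentials listed under Prerequisites are typically supplied through the environment. The sketch below shows one way to validate them at startup. The variable names (`PLIVO_AUTH_ID`, `PLIVO_AUTH_TOKEN`, `PLIVO_PHONE_NUMBER`, `GOOGLE_API_KEY`) are assumptions made for illustration, not names confirmed by this guide; adjust them to match your deployment.

```python
import os

# Hypothetical variable names; match them to whatever your deployment uses.
REQUIRED_VARS = (
    "PLIVO_AUTH_ID",
    "PLIVO_AUTH_TOKEN",
    "PLIVO_PHONE_NUMBER",
    "GOOGLE_API_KEY",
)


def load_settings(env=None):
    """Return the required settings, or raise listing every missing one."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Calling `load_settings()` once before building the pipeline surfaces a missing key as a single clear error instead of a failure in the middle of a call.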
Pipeline Configuration
Gemini Live Features
| Feature | Description |
|---|---|
| Multimodal processing | Handle audio, video, and text inputs together |
| Real-time streaming | Low-latency audio and video processing |
| Voice activity detection | Automatic speech handling |
| Function calling | Integrate external tools and APIs |
| Context management | Maintain conversation history |
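To make the function calling row concrete, here is a minimal sketch of a tool declaration and a local dispatcher. The declaration uses the JSON-schema-style shape commonly used for function calling. The tool name, its fields, and the `dispatch` helper are all hypothetical, and the exact registration format expected by Gemini Live or Pipecat should be checked against their documentation.

```python
import json

# Hypothetical tool: the name, fields, and return value are illustrative only.
GET_ORDER_STATUS = {
    "name": "get_order_status",
    "description": "Look up the shipping status of an order.",
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {"type": "string", "description": "Order identifier"},
        },
        "required": ["order_id"],
    },
}


def get_order_status(order_id):
    # Stand-in for a real lookup against your own backend.
    return {"order_id": order_id, "status": "shipped"}


# Maps the tool name the model requests to the local function that serves it.
HANDLERS = {"get_order_status": get_order_status}


def dispatch(call_name, call_args):
    """Run the handler the model asked for and return a JSON string result."""
    handler = HANDLERS.get(call_name)
    if handler is None:
        return json.dumps({"error": f"unknown tool: {call_name}"})
    return json.dumps(handler(**call_args))
```

Returning an error payload for an unknown tool name, rather than raising, keeps the call alive and lets the model recover in conversation.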
Architecture
With Gemini Live, the pipeline is simplified:
- Speech recognition
- Language understanding
- Response generation
- Voice synthesis
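The difference between the two layouts can be sketched as a mapping from each responsibility to the service that handles it. This is only an illustrative model; the labels are descriptive placeholders, not identifiers from Pipecat or Gemini Live.

```python
RESPONSIBILITIES = (
    "speech recognition",
    "language understanding",
    "response generation",
    "voice synthesis",
)

# Standard pipeline: three separate services share the four responsibilities.
STANDARD = {
    "speech recognition": "STT",
    "language understanding": "LLM",
    "response generation": "LLM",
    "voice synthesis": "TTS",
}

# Speech-to-speech: one service covers all four.
GEMINI_LIVE = {name: "Gemini Live" for name in RESPONSIBILITIES}


def services(layout):
    """Distinct services in a layout, in first-use order."""
    return list(dict.fromkeys(layout[name] for name in RESPONSIBILITIES))
```

Here `services(STANDARD)` yields three entries and `services(GEMINI_LIVE)` yields one, which is why the speech-to-speech layout has fewer network hops and fewer credentials to manage.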
Quick Start
Inbound Calls
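For an inbound call, Plivo requests an answer URL and expects XML telling it what to do with the call. A minimal sketch of generating a response that opens a bidirectional audio stream to your bot is shown below. The `build_answer_xml` helper and the WebSocket URL are hypothetical, and the attribute values are assumptions to verify against the Plivo audio streaming documentation.

```python
import xml.etree.ElementTree as ET


def build_answer_xml(websocket_url):
    """Build answer XML that streams call audio to a WebSocket endpoint."""
    response = ET.Element("Response")
    stream = ET.SubElement(
        response,
        "Stream",
        {
            # Assumed settings: two-way audio, 8 kHz mu-law, call stays up.
            "bidirectional": "true",
            "keepCallAlive": "true",
            "contentType": "audio/x-mulaw;rate=8000",
        },
    )
    stream.text = websocket_url
    return ET.tostring(response, encoding="unicode")
```

Your web framework's answer-URL handler would return `build_answer_xml("wss://your-server.example/ws")` with an XML content type.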
Outbound Calls
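An outbound call starts with a request to Plivo's call API, which then fetches your answer URL once the callee picks up. The sketch below builds that request with the standard library only and does not send it. The `build_call_request` helper is hypothetical, and the phone numbers and answer URL in the usage note are placeholders.

```python
import base64
import json
import urllib.request


def build_call_request(auth_id, auth_token, from_number, to_number, answer_url):
    """Build (but do not send) the HTTP request that places an outbound call."""
    url = f"https://api.plivo.com/v1/Account/{auth_id}/Call/"
    payload = {
        "from": from_number,
        "to": to_number,
        "answer_url": answer_url,  # Plivo fetches call instructions from here
        "answer_method": "POST",
    }
    # Plivo authenticates with HTTP Basic auth: Auth ID as user, token as password.
    credentials = base64.b64encode(f"{auth_id}:{auth_token}".encode()).decode()
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Basic {credentials}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` would place the call. The answer URL should point at the same endpoint that serves the streaming XML used for inbound calls.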
When to Use Gemini Live
Choose Gemini Live when:
- You need multimodal capabilities (audio + video + text)
- Latency is critical
- You want simplified architecture
- You’re already in the Google ecosystem
Choose the standard pipeline (STT → LLM → TTS) when:
- You need specific voice characteristics (ElevenLabs, Cartesia)
- You want to mix providers (e.g., Deepgram STT + OpenAI LLM)
- You need fine-grained control over each component
Related
- Pipecat Overview - Architecture and setup
- Gemini Live Docs - Full configuration
- Google AI Studio - API key management