OmniVoice

A voice abstraction layer for AgentPlexus that supports text-to-speech (TTS), speech-to-text (STT), and voice agents across multiple providers and transport protocols.

Architecture Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                              OmniVoice                                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────────────────┐  │
│  │     TTS     │    │     STT     │    │          Voice Agent            │  │
│  │             │    │             │    │                                 │  │
│  │ Text → Audio│    │ Audio → Text│    │  Real-time bidirectional voice  │  │
│  └──────┬──────┘    └──────┬──────┘    └───────────────┬─────────────────┘  │
│         │                  │                           │                    │
│         ▼                  ▼                           ▼                    │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                         Provider Layer                              │    │
│  ├─────────────┬─────────────┬─────────────┬─────────────┬─────────────┤    │
│  │ ElevenLabs  │  Deepgram   │ Google Cloud│    AWS      │   Azure     │    │
│  │ Cartesia    │  Whisper    │ AssemblyAI  │   Polly     │   Speech    │    │
│  └─────────────┴─────────────┴─────────────┴─────────────┴─────────────┘    │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                         Transport Layer                             │    │
│  ├─────────────┬─────────────┬─────────────┬─────────────┬─────────────┤    │
│  │   WebRTC    │     SIP     │    PSTN     │  WebSocket  │    HTTP     │    │
│  └─────────────┴─────────────┴─────────────┴─────────────┴─────────────┘    │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                      Call System Integration                        │    │
│  ├─────────────┬─────────────┬─────────────┬─────────────┬─────────────┤    │
│  │   Twilio    │ RingCentral │    Zoom     │   LiveKit   │   Daily     │    │
│  └─────────────┴─────────────┴─────────────┴─────────────┴─────────────┘    │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Package Structure

omnivoice/
├── tts/                    # Text-to-Speech
│   ├── tts.go              # Interface definitions
│   ├── elevenlabs/         # ElevenLabs provider
│   ├── polly/              # AWS Polly provider
│   ├── google/             # Google Cloud TTS
│   ├── azure/              # Azure Speech
│   └── cartesia/           # Cartesia provider
│
├── stt/                    # Speech-to-Text
│   ├── stt.go              # Interface definitions
│   ├── whisper/            # OpenAI Whisper
│   ├── deepgram/           # Deepgram provider
│   ├── google/             # Google Speech-to-Text
│   ├── azure/              # Azure Speech
│   └── assemblyai/         # AssemblyAI provider
│
├── agent/                  # Voice Agent orchestration
│   ├── agent.go            # Interface definitions
│   ├── session.go          # Conversation session management
│   ├── elevenlabs/         # ElevenLabs Agents
│   ├── vapi/               # Vapi.ai
│   ├── retell/             # Retell AI
│   └── custom/             # Custom agent (STT + LLM + TTS)
│
├── transport/              # Audio transport protocols
│   ├── transport.go        # Interface definitions
│   ├── webrtc/             # WebRTC transport
│   ├── websocket/          # WebSocket streaming
│   ├── sip/                # SIP protocol
│   └── http/               # HTTP-based (batch)
│
├── callsystem/             # Call system integrations
│   ├── callsystem.go       # Interface definitions
│   ├── twilio/             # Twilio ConversationRelay
│   ├── ringcentral/        # RingCentral Voice API
│   ├── zoom/               # Zoom SDK integration
│   ├── livekit/            # LiveKit rooms
│   └── daily/              # Daily.co
│
└── examples/
    ├── simple-tts/         # Basic TTS example
    ├── voice-agent/        # Voice agent with Twilio
    └── multi-provider/     # Provider fallback example

Call System Integration

How Voice Agents Connect to Phone/Video Calls

Voice AI agents need a transport layer to receive and send audio. The choice depends on the use case:

┌───────────────────────────────────────────────────────────────────────┐
│                        Call System Options                            │
├────────────────┬───────────────┬─────────────────┬────────────────────┤
│    Platform    │   Protocol    │   Best For      │   Complexity       │
├────────────────┼───────────────┼─────────────────┼────────────────────┤
│ Twilio         │ WebRTC/SIP/   │ Phone calls,    │ Medium - managed   │
│ Conversation-  │ PSTN          │ IVR, call       │ infrastructure     │
│ Relay          │               │ centers         │                    │
├────────────────┼───────────────┼─────────────────┼────────────────────┤
│ RingCentral    │ WebRTC/SIP    │ Enterprise PBX, │ Medium - native    │
│ Voice API      │               │ business phones │ AI receptionist    │
├────────────────┼───────────────┼─────────────────┼────────────────────┤
│ Zoom SDK       │ Proprietary   │ Video meetings  │ High - requires    │
│                │ (via SDK)     │ with voice bots │ native SDK         │
├────────────────┼───────────────┼─────────────────┼────────────────────┤
│ LiveKit        │ WebRTC        │ Custom apps,    │ Low - open source  │
│                │               │ real-time AI    │ WebRTC rooms       │
├────────────────┼───────────────┼─────────────────┼────────────────────┤
│ Daily.co       │ WebRTC        │ Embedded video, │ Low - simple API   │
│                │               │ browser-based   │                    │
├────────────────┼───────────────┼─────────────────┼────────────────────┤
│ WebSocket      │ WS/WSS        │ Web apps,       │ Low - direct       │
│ (Direct)       │               │ custom UIs      │ streaming          │
└────────────────┴───────────────┴─────────────────┴────────────────────┘

Wiring Diagram: Voice Agent in a Phone Call

┌────────────────────────────────────────────────────────────────────────────────┐
│                     PSTN/WebRTC Call Flow                                      │
│                                                                                │
│   ┌─────────┐         ┌─────────────┐          ┌───────────────────────────┐   │
│   │  User   │◄───────►│   Twilio    │◄────────►│        OmniVoice          │   │
│   │ (Phone) │  PSTN   │ Conversation│ WebSocket│                           │   │
│   │         │         │   Relay     │          │  ┌─────────────────────┐  │   │
│   └─────────┘         └─────────────┘          │  │   Voice Agent       │  │   │
│                                                │  │                     │  │   │
│                                                │  │  ┌───────┐          │  │   │
│                         Audio In ─────────────►│  │  │  STT  │──┐       │  │   │
│                                                │  │  └───────┘  │       │  │   │
│                                                │  │             ▼       │  │   │
│                                                │  │  ┌───────────────┐  │  │   │
│                                                │  │  │  LLM / Agent  │  │  │   │
│                                                │  │  │  (Eino, etc.) │  │  │   │
│                                                │  │  └───────────────┘  │  │   │
│                                                │  │             │       │  │   │
│                                                │  │             ▼       │  │   │
│                                                │  │  ┌───────┐          │  │   │
│                         Audio Out ◄────────────│  │  │  TTS  │◄─┘       │  │   │
│                                                │  │  └───────┘          │  │   │
│                                                │  └─────────────────────┘  │   │
│                                                └───────────────────────────┘   │
│                                                                                │
└────────────────────────────────────────────────────────────────────────────────┘
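
The STT → LLM → TTS loop inside the Voice Agent box can be sketched as a single function per conversational turn. The `Transcriber`, `Responder`, and `Synthesizer` types and the `HandleTurn` function below are hypothetical names for illustration, not the OmniVoice API, and the stages are stubs rather than real provider calls.

```go
package main

import (
	"fmt"
	"strings"
)

// The three stage types below are hypothetical stand-ins for the STT,
// LLM/agent, and TTS boxes in the diagram.
type (
	Transcriber func(audio []byte) (string, error)
	Responder   func(userText string) (string, error)
	Synthesizer func(replyText string) ([]byte, error)
)

// HandleTurn runs one conversational turn:
// audio in -> transcript -> reply text -> audio out.
// Each error is wrapped with the name of the stage that failed.
func HandleTurn(stt Transcriber, llm Responder, tts Synthesizer, audioIn []byte) ([]byte, error) {
	text, err := stt(audioIn)
	if err != nil {
		return nil, fmt.Errorf("stt: %w", err)
	}
	reply, err := llm(text)
	if err != nil {
		return nil, fmt.Errorf("llm: %w", err)
	}
	audioOut, err := tts(reply)
	if err != nil {
		return nil, fmt.Errorf("tts: %w", err)
	}
	return audioOut, nil
}

func main() {
	// Stub stages: a real deployment would call provider clients here.
	stt := func(audio []byte) (string, error) { return string(audio), nil }
	llm := func(text string) (string, error) {
		return "You said: " + strings.TrimSpace(text), nil
	}
	tts := func(reply string) ([]byte, error) { return []byte(reply), nil }

	out, err := HandleTurn(stt, llm, tts, []byte(" what are your hours? "))
	if err != nil {
		fmt.Println("turn failed:", err)
		return
	}
	fmt.Printf("%s\n", out)
}
```

This version is strictly sequential, which is the simplest form of the diagram; a production agent would more likely stream between stages.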

Wiring Diagram: Voice Agent in a Zoom Meeting

┌────────────────────────────────────────────────────────────────────────────┐
│                     Zoom Meeting Flow                                      │
│                                                                            │
│   ┌────────────────────────────────────────────────────────────────────┐   │
│   │                         Zoom Meeting                               │   │
│   │                                                                    │   │
│   │   ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────────────────┐   │   │
│   │   │ User 1  │  │ User 2  │  │ User 3  │  │     Bot Client      │   │   │
│   │   │ (Human) │  │ (Human) │  │ (Human) │  │   (Zoom SDK)        │   │   │
│   │   └─────────┘  └─────────┘  └─────────┘  └──────────┬──────────┘   │   │
│   │                                                     │              │   │
│   └─────────────────────────────────────────────────────┼──────────────┘   │
│                                                         │                  │
│                                        Raw Audio Stream │                  │
│                                                         ▼                  │
│   ┌────────────────────────────────────────────────────────────────────┐   │
│   │                        OmniVoice Agent                             │   │
│   │                                                                    │   │
│   │   Option A: Use Recall.ai (recommended)                            │   │
│   │   ┌─────────────┐                                                  │   │
│   │   │  Recall.ai  │──► Handles Zoom SDK complexity                   │   │
│   │   │     Bot     │──► Provides audio stream via WebSocket           │   │
│   │   └─────────────┘                                                  │   │
│   │                                                                    │   │
│   │   Option B: Self-hosted Zoom SDK Bot                               │   │
│   │   ┌─────────────┐                                                  │   │
│   │   │ Zoom Linux  │──► Complex: requires native SDK                  │   │
│   │   │   SDK Bot   │──► One instance per meeting                      │   │
│   │   └─────────────┘──► Months of engineering                         │   │
│   │                                                                    │   │
│   └────────────────────────────────────────────────────────────────────┘   │
│                                                                            │
└────────────────────────────────────────────────────────────────────────────┘

Use Case Recommendations

Use Case           Call System               Transport        Notes
IVR / Call Center  Twilio ConversationRelay  PSTN/SIP         Best managed solution
Business Phone     RingCentral               WebRTC/SIP       Native AI Receptionist available
Custom Web App     LiveKit or Daily          WebRTC           Open source, flexible
Zoom Meetings      Recall.ai + Zoom          SDK → WebSocket  Avoid building Zoom bot yourself
Browser Widget     Direct WebSocket          WebSocket        ElevenLabs widget or custom
Mobile App         LiveKit                   WebRTC           Cross-platform support

Latency Considerations

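For natural conversation, aim for a total round-trip latency under 500ms; anything under 1000ms still feels conversational:

User speaks → STT (100-300ms) → LLM (200-500ms) → TTS (100-200ms) → User hears

Run strictly one after another, these stages add up to 400-1000ms, so meeting the 500ms target generally means overlapping them with streaming.

Target: < 500ms total for "instant" feel
Acceptable: < 1000ms for natural conversation
Poor: > 1500ms feels laggy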
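
To make the budget concrete, the sketch below sums the per-stage ranges quoted above and classifies the result against the listed tiers. The `Stage`, `Total`, and `Classify` names are illustrative only. The source does not label the 1000-1500ms range, so this sketch assumes a "borderline" label for it.

```go
package main

import "fmt"

// Stage holds the latency range quoted above for one pipeline stage.
type Stage struct {
	Name  string
	MinMs int
	MaxMs int
}

// Total sums best-case and worst-case latency when stages run back to back.
func Total(stages []Stage) (minMs, maxMs int) {
	for _, s := range stages {
		minMs += s.MinMs
		maxMs += s.MaxMs
	}
	return minMs, maxMs
}

// Classify maps a round-trip latency to the tiers listed above.
// "borderline" is an assumed label for the unlabeled 1000-1500 ms range.
func Classify(ms int) string {
	switch {
	case ms < 500:
		return "instant"
	case ms < 1000:
		return "acceptable"
	case ms <= 1500:
		return "borderline"
	default:
		return "poor"
	}
}

func main() {
	stages := []Stage{
		{Name: "STT", MinMs: 100, MaxMs: 300},
		{Name: "LLM", MinMs: 200, MaxMs: 500},
		{Name: "TTS", MinMs: 100, MaxMs: 200},
	}
	best, worst := Total(stages)
	fmt.Printf("sequential best case:  %d ms (%s)\n", best, Classify(best))
	fmt.Printf("sequential worst case: %d ms (%s)\n", worst, Classify(worst))
}
```

Only the best case of a fully sequential pipeline lands inside the 500ms target, which is why the stages usually need to overlap.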

Optimization Strategies

Streaming STT: Start processing before the user finishes speaking
Streaming TTS: Start playing audio before the full response is generated
Edge inference: Use providers with edge nodes (Deepgram, ElevenLabs)
Turn detection: Use voice activity detection (VAD) for quick turn-taking
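
As an illustration of the turn-detection strategy, here is a minimal energy-based VAD sketch over frames of 16-bit PCM samples. It is a deliberately simple stand-in: the `EndOfTurn` function, the threshold of 500, and the two-frame hangover are assumptions for this example, and production systems typically use more robust, often model-based, detectors.

```go
package main

import (
	"fmt"
	"math"
)

// rms returns the root-mean-square amplitude of one frame of 16-bit PCM samples.
func rms(frame []int16) float64 {
	if len(frame) == 0 {
		return 0
	}
	var sum float64
	for _, s := range frame {
		v := float64(s)
		sum += v * v
	}
	return math.Sqrt(sum / float64(len(frame)))
}

// EndOfTurn returns the index of the frame at which the speaker is considered
// done: speech has been heard, and then `hangover` consecutive frames fall
// below `threshold`. It returns -1 if no end of turn is found.
func EndOfTurn(frames [][]int16, threshold float64, hangover int) int {
	heardSpeech := false
	quiet := 0
	for i, f := range frames {
		if rms(f) >= threshold {
			// Speech frame: reset the silence counter.
			heardSpeech = true
			quiet = 0
			continue
		}
		if heardSpeech {
			quiet++
			if quiet >= hangover {
				return i
			}
		}
	}
	return -1
}

func main() {
	loud := []int16{4000, -4000, 4000, -4000}
	silent := []int16{10, -10, 10, -10}
	frames := [][]int16{silent, loud, loud, silent, silent, silent}

	// Threshold and hangover are illustrative values, not tuned ones.
	fmt.Println("end of turn at frame", EndOfTurn(frames, 500, 2))
}
```

A shorter hangover hands the turn to the agent sooner but risks cutting the user off mid-pause; a longer one is safer but adds directly to round-trip latency.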

Provider Comparison

TTS Providers

Provider      Latency   Quality    Voices  Streaming  Price
ElevenLabs    Low       Excellent  5000+   Yes        $$$
Cartesia      Very Low  Good       100+    Yes        $$
AWS Polly     Low       Good       60+     Yes        $
Google TTS    Low       Good       200+    Yes        $
Azure Speech  Low       Excellent  400+    Yes        $$

STT Providers

Provider          Latency   Accuracy   Streaming  Languages  Price
Deepgram          Very Low  Excellent  Yes        30+        $$
Whisper (OpenAI)  Medium    Excellent  No*        50+        $
Google Speech     Low       Excellent  Yes        125+       $$
AssemblyAI        Low       Excellent  Yes        20+        $$
Azure Speech      Low       Excellent  Yes        100+       $$

*Whisper requires self-hosting for streaming (e.g., faster-whisper)
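
For a real-time voice agent, the STT table above can be narrowed to streaming-capable providers and ordered by latency. The sketch below transcribes three columns of the table into Go and does exactly that. The `STTProvider` type, the numeric latency ranks, and `StreamingByLatency` are illustrative choices for this example, not part of OmniVoice.

```go
package main

import (
	"fmt"
	"sort"
)

// STTProvider mirrors three columns of the STT table above.
type STTProvider struct {
	Name      string
	Latency   int // rank from the table: 0 = Very Low, 1 = Low, 2 = Medium
	Streaming bool
}

// sttTable transcribes the table rows; "No*" for Whisper is recorded as false.
var sttTable = []STTProvider{
	{Name: "Deepgram", Latency: 0, Streaming: true},
	{Name: "Whisper (OpenAI)", Latency: 2, Streaming: false},
	{Name: "Google Speech", Latency: 1, Streaming: true},
	{Name: "AssemblyAI", Latency: 1, Streaming: true},
	{Name: "Azure Speech", Latency: 1, Streaming: true},
}

// StreamingByLatency returns the names of streaming-capable providers,
// lowest latency rank first.
func StreamingByLatency(ps []STTProvider) []string {
	var keep []STTProvider
	for _, p := range ps {
		if p.Streaming {
			keep = append(keep, p)
		}
	}
	// Stable sort keeps the table order among providers with equal rank.
	sort.SliceStable(keep, func(i, j int) bool {
		return keep[i].Latency < keep[j].Latency
	})
	names := make([]string, 0, len(keep))
	for _, p := range keep {
		names = append(names, p.Name)
	}
	return names
}

func main() {
	fmt.Println(StreamingByLatency(sttTable))
}
```

The ranks only encode the coarse labels from the table, so ties between the "Low" providers would still need measurement against real traffic.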

Voice Agent Platforms

Provider            Customization  Latency   Telephony        Price
ElevenLabs Agents   Medium         Low       Via Twilio       $$$
Vapi                High           Low       Built-in         $$
Retell AI           High           Low       Built-in         $$
Custom (OmniVoice)  Full           Variable  Via integration  Variable
