Self-Hosted Realtime Engine
The realtime engine runs live voice sessions in a self-hosted deployment. It provides an OpenAI-compatible WebSocket API for real-time voice AI.
What The Realtime Engine Does
The realtime engine is responsible for:
- Accepting authenticated realtime WebSocket connections
- Creating and managing live voice sessions
- Coordinating speech-to-text, model inference, and text-to-speech
- Executing custom tools and actions
- Streaming audio and text responses back to clients
- Managing conversation history and session state
WebSocket Endpoint
The primary endpoint for voice sessions:
ws://localhost:8787/v1/realtime
In production with TLS:
wss://your-domain.com/v1/realtime
Connection requires:
- Valid ephemeral token in the Authorization header or a query parameter
- WebSocket protocol upgrade support
Example Connection
# Using wscat (install with: npm install -g wscat)
wscat -c "ws://localhost:8787/v1/realtime" \
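-H "Authorization: Bearer ek_your_ephemeral_token"
// Using the ws package in Node.js (install with: npm install ws).
// The browser WebSocket constructor cannot set custom headers, so browser
// clients need to pass the token as a query parameter instead.
import WebSocket from 'ws';

const ws = new WebSocket(
  'ws://localhost:8787/v1/realtime',
  [],
  {
    headers: {
      'Authorization': 'Bearer ek_your_ephemeral_token'
    }
  }
);
Ephemeral Token Flow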
1. Request Token from Core
curl -X POST http://localhost:3000/vowel/api/generateToken \
-H "Content-Type: application/json" \
-H "X-API-Key: vkey_your_bootstrap_key" \
-d '{
"appId": "default",
"provider": "vowel-prime"
}'
Response:
{
"token": "ek_eyJhbGciOiJIUzI1NiIs...",
"expiresAt": "2024-01-15T10:30:00Z",
"wsUrl": "ws://localhost:8787/v1/realtime"
}
2. Connect with Token
wscat -c "ws://localhost:8787/v1/realtime" \
-H "Authorization: Bearer ek_eyJhbGciOiJIUzI1NiIs..."
3. Session Established
Upon successful connection, the engine sends:
{
"type": "session.created",
"session": {
"id": "sess_abc123",
"model": "openai/gpt-oss-20b",
"voice": "aura-2-thalia-en"
}
}
HTTP Endpoints
Health Check
GET http://localhost:8787/health
Response:
{
"status": "ok",
"timestamp": "2024-01-15T10:30:00Z"
}
Runtime Configuration
The engine exposes HTTP endpoints for managing runtime configuration without restarting:
Get current config:
curl http://localhost:8787/config \
-H "Authorization: Bearer ${ENGINE_API_KEY}"
Update config:
curl -X PUT http://localhost:8787/config \
-H "Authorization: Bearer ${ENGINE_API_KEY}" \
-H "Content-Type: application/json" \
-d '{
"llm": {
"provider": "groq",
"model": "openai/gpt-oss-20b"
}
}'
Validate config change:
curl -X POST http://localhost:8787/config/validate \
-H "Authorization: Bearer ${ENGINE_API_KEY}" \
-H "Content-Type: application/json" \
-d '{"llm": {"provider": "openrouter"}}'
Reload config from disk:
curl -X POST http://localhost:8787/config/reload \
-H "Authorization: Bearer ${ENGINE_API_KEY}"
List available presets:
curl http://localhost:8787/presets \
-H "Authorization: Bearer ${ENGINE_API_KEY}"
WebSocket Events
Client → Server Events
| Event | Description |
|---|---|
| session.update | Update session configuration |
| input_audio_buffer.append | Stream audio chunk (PCM16, 24 kHz, mono) |
| input_audio_buffer.commit | Signal end of speech |
| conversation.item.create | Add text message to conversation |
| response.create | Request AI response |
| response.cancel | Cancel in-progress response |
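As a rough sketch of how a Node.js client might build the audio events in the table above: the helper names `floatToPcm16Base64`, `audioAppendEvent`, and `audioCommitEvent` are illustrative and not part of the engine. The sketch assumes the samples are already mono floats in the range [-1, 1] at 24 kHz, and that the engine expects little-endian PCM16, which is the usual layout but is not stated in this document.

```javascript
// Hypothetical helpers (not part of the engine) for building client events.

// Convert float samples in [-1, 1] to little-endian PCM16, then base64.
// Assumes the input is already mono and sampled at 24 kHz.
function floatToPcm16Base64(samples) {
  const buf = Buffer.alloc(samples.length * 2);
  samples.forEach((sample, i) => {
    const clamped = Math.max(-1, Math.min(1, sample));
    // Scale asymmetrically so -1 maps to -32768 and 1 maps to 32767.
    const int16 = clamped < 0
      ? Math.round(clamped * 32768)
      : Math.round(clamped * 32767);
    buf.writeInt16LE(int16, i * 2);
  });
  return buf.toString('base64');
}

// Build an input_audio_buffer.append event for one chunk of samples.
function audioAppendEvent(samples) {
  return {
    type: 'input_audio_buffer.append',
    audio: floatToPcm16Base64(samples),
  };
}

// Build the event that signals end of speech.
function audioCommitEvent() {
  return { type: 'input_audio_buffer.commit' };
}

// Usage, given an open WebSocket `ws` and a chunk of samples:
//   ws.send(JSON.stringify(audioAppendEvent(chunk)));
//   ws.send(JSON.stringify(audioCommitEvent()));
```

Clamping before scaling keeps out-of-range samples from overflowing the 16-bit range, which would otherwise throw in `writeInt16LE`.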
Server → Client Events
| Event | Description |
|---|---|
| session.created | Session initialized |
| input_audio_buffer.speech_started | Speech detected |
| input_audio_buffer.speech_stopped | Speech ended |
| conversation.item.created | New conversation item |
| response.text.delta | Streaming text response |
| response.audio.delta | Streaming audio response (PCM16) |
| response.done | Response complete |
| error | Error occurred |
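One way a client could consume the server events listed above is a small dispatcher that accumulates streamed text and audio. This is a sketch, not the engine's client library: `createEventHandler` is a hypothetical name, and the `delta` and `error.message` field names are assumptions borrowed from the OpenAI Realtime API that the engine aims to be compatible with. Only the `session.created` shape is taken from this document.

```javascript
// Sketch of a server-event dispatcher for a Node.js client.
// Assumption: delta events carry their payload in a `delta` field and
// error events carry `error.message`, as in the OpenAI Realtime API.
function createEventHandler() {
  const state = {
    sessionId: null,
    text: '',
    audioChunks: [],
    done: false,
    errors: [],
  };

  // Parse one raw WebSocket message and fold it into the state.
  function handle(raw) {
    const event = JSON.parse(raw);
    switch (event.type) {
      case 'session.created':
        state.sessionId = event.session.id;
        break;
      case 'response.text.delta':
        state.text += event.delta ?? '';
        break;
      case 'response.audio.delta':
        // Base64-encoded PCM16; decode to bytes for playback.
        state.audioChunks.push(Buffer.from(event.delta ?? '', 'base64'));
        break;
      case 'response.done':
        state.done = true;
        break;
      case 'error':
        state.errors.push(event.error?.message ?? 'unknown error');
        break;
      default:
        // Ignore events this sketch does not track.
        break;
    }
    return state;
  }

  return { handle, state };
}

// Usage, given an open `ws` connection from the ws package:
//   const handler = createEventHandler();
//   ws.on('message', (data) => handler.handle(data.toString()));
```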
Example: Update Session
{
"type": "session.update",
"session": {
"instructions": "You are a helpful assistant.",
"voice": "aura-2-thalia-en",
"turn_detection": {
"type": "server_vad",
"threshold": 0.5,
"silence_duration_ms": 550
}
}
}
Example: Send Audio
{
"type": "input_audio_buffer.append",
"audio": "base64EncodedPcm16AudioData..."
}
How It Fits With Core
Core and the realtime engine have different responsibilities:
- Core prepares and issues short-lived access for session startup (token generation, app management)
- Realtime engine runs the session after the client connects (audio processing, AI inference, TTS)
flowchart LR
A[Browser] -->|Request Token| B[Core]
B -->|JWT Token| A
A -->|WebSocket + Token| C[Realtime Engine]
C -->|Audio/Text Stream| A
What Operators Should Care About
Public WebSocket Endpoint Stability
Keep the WebSocket URL stable between deployments. Clients store this URL in their configuration.
Health and Restart Behavior
The engine includes a health check endpoint (/health) used by Docker and load balancers. The container restarts automatically on failure (restart: unless-stopped).
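For operators who want to probe the endpoint from their own tooling, a minimal sketch in Node.js 18+ (which ships a global `fetch`) might look like the following. The function names `isHealthy` and `checkHealth` are hypothetical, and the check relies only on the `status` field shown in the health response earlier on this page.

```javascript
// Returns true when a /health response body reports status "ok".
function isHealthy(body) {
  return body !== null && typeof body === 'object' && body.status === 'ok';
}

// Probe the health endpoint once. `fetchImpl` is injectable so the
// function can be exercised without a running engine.
async function checkHealth(baseUrl, fetchImpl = fetch) {
  try {
    const res = await fetchImpl(new URL('/health', baseUrl));
    if (!res.ok) return false;
    return isHealthy(await res.json());
  } catch {
    // Network errors and invalid JSON both count as unhealthy.
    return false;
  }
}

// Usage:
//   const ok = await checkHealth('http://localhost:8787');
```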
Upstream Provider Configuration
Provider settings are loaded from:
- Environment variables (bootstrap/fallback)
- Runtime YAML config (/app/data/config/runtime.yaml)
- Runtime config HTTP API
Update providers without restarting:
curl -X PUT http://localhost:8787/config \
-H "Authorization: Bearer ${ENGINE_API_KEY}" \
-H "Content-Type: application/json" \
-d '{"llm": {"provider": "openrouter", "model": "anthropic/claude-3-sonnet"}}'
TLS and Proxy Support
For production deployments behind a reverse proxy:
# nginx configuration
location /v1/realtime {
proxy_pass http://localhost:8787/v1/realtime;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
proxy_read_timeout 86400;
proxy_send_timeout 86400;
}
Logs and Metrics
View engine logs:
# All logs
bun run stack:logs | grep engine
# Errors only
bun run stack:logs | grep engine | grep -i error
# Session events
bun run stack:logs | grep engine | grep -i "session\|speech\|response"
Provider Configuration
LLM Providers
Configure the LLM provider via environment variables or runtime config:
Groq (default, fast):
LLM_PROVIDER=groq
GROQ_API_KEY=gsk_your_key
GROQ_MODEL=openai/gpt-oss-20b
OpenRouter (100+ models):
LLM_PROVIDER=openrouter
OPENROUTER_API_KEY=sk-or-v1-your_key
OPENROUTER_MODEL=anthropic/claude-3-sonnet
Speech-to-Text
Default: Deepgram Nova-3
STT_PROVIDER=deepgram
DEEPGRAM_API_KEY=your_key
DEEPGRAM_STT_MODEL=nova-3
DEEPGRAM_STT_LANGUAGE=en-US
Text-to-Speech
Default: Deepgram Aura-2
TTS_PROVIDER=deepgram
DEEPGRAM_TTS_MODEL=aura-2-thalia-en
Voice Activity Detection
Default: Silero VAD
VAD_PROVIDER=silero
VAD_ENABLED=true
VAD_THRESHOLD=0.5
VAD_MIN_SILENCE_MS=550
Runtime Config Ownership
The engine persists its runtime configuration as YAML at /app/data/config/runtime.yaml on the engine-data Docker volume. Environment variables act as bootstrap defaults and fallbacks.
Config hierarchy (highest priority first):
- Runtime config HTTP API updates
- Runtime YAML file (/app/data/config/runtime.yaml)
- Environment variables
This allows updating configuration without rebuilding or restarting containers.
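To make the precedence order concrete, here is an illustrative merge of the three sources. It is a sketch of the idea only: whether the engine merges nested keys deeply or replaces whole sections is an assumption, and `mergeConfig` and the sample objects are hypothetical names, not engine code.

```javascript
// True for non-null, non-array objects.
function isPlainObject(value) {
  return value !== null && typeof value === 'object' && !Array.isArray(value);
}

// Illustrative deep merge: later sources override earlier ones.
// Inputs are never mutated; nested objects are copied.
function mergeConfig(...sources) {
  const out = {};
  for (const src of sources) {
    for (const [key, value] of Object.entries(src ?? {})) {
      if (isPlainObject(value)) {
        const base = isPlainObject(out[key]) ? out[key] : {};
        out[key] = mergeConfig(base, value);
      } else {
        out[key] = value;
      }
    }
  }
  return out;
}

// Pass sources lowest priority first, so the API update wins.
const envDefaults = { llm: { provider: 'groq', model: 'openai/gpt-oss-20b' } };
const yamlConfig = { llm: { provider: 'openrouter' } };
const apiUpdate = { llm: { model: 'anthropic/claude-3-sonnet' } };

const effective = mergeConfig(envDefaults, yamlConfig, apiUpdate);
console.log(JSON.stringify(effective));
```

In this example the provider comes from the YAML file and the model from the API update, while the environment values only fill in keys nothing else sets.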
Typical Session Flow
- Token Acquisition: Client requests token from Core
- WebSocket Connection: Client connects to engine with token
- Session Creation: Engine validates token and creates session
- Audio Streaming: Client sends audio chunks
- Speech Detection: VAD detects speech start/end
- Transcription: STT converts speech to text
- AI Processing: LLM generates response
- Synthesis: TTS converts text to audio
- Streaming: Audio/text streamed back to client
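The first two steps of the flow above (token acquisition and preparing the connection) can be sketched as follows for Node.js 18+. The endpoint path, headers, request body, and response fields mirror the curl example earlier on this page. The function names `buildTokenRequest`, `isExpired`, and `acquireSession` are hypothetical, and the sketch assumes Core is reachable at the base URL passed in.

```javascript
// Build the token request for Core (step 1 of the flow).
function buildTokenRequest(coreUrl, apiKey, appId = 'default', provider = 'vowel-prime') {
  return {
    url: new URL('/vowel/api/generateToken', coreUrl).toString(),
    init: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json', 'X-API-Key': apiKey },
      body: JSON.stringify({ appId, provider }),
    },
  };
}

// True when the token's expiresAt timestamp is at or before `now`.
function isExpired(expiresAt, now = new Date()) {
  return new Date(expiresAt).getTime() <= now.getTime();
}

// Fetch a token and return what a client needs to open the WebSocket
// (step 2). `fetchImpl` is injectable so this can run without Core.
async function acquireSession(coreUrl, apiKey, fetchImpl = fetch) {
  const { url, init } = buildTokenRequest(coreUrl, apiKey);
  const res = await fetchImpl(url, init);
  if (!res.ok) throw new Error(`token request failed: ${res.status}`);
  const { token, expiresAt, wsUrl } = await res.json();
  if (isExpired(expiresAt)) throw new Error('token already expired');
  return { wsUrl, headers: { Authorization: `Bearer ${token}` } };
}

// Usage with the ws package:
//   const { wsUrl, headers } = await acquireSession('http://localhost:3000', apiKey);
//   const ws = new WebSocket(wsUrl, [], { headers });
```

Because the token is short-lived, requesting it immediately before connecting, rather than caching it, avoids handing the engine an expired credential.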
Source Repository
The realtime engine is open source at github.com/usevowel/engine.
Related Docs
- Architecture - How the engine fits in the stack
- Core - Token service and control plane
- Configuration - Environment variables and config options
- Deployment - Deploying the full stack
- Troubleshooting - Debugging engine issues