Where the room meets the brain.
Avyra Room Bridge is the orchestration layer that makes the rest of the suite act like one room. It listens with a real microphone, knows when to mute itself while the avatar speaks, runs voice through STT, and asks ai-engine what to say next. Every word in the room comes from the brain — nothing is hardcoded here.
What it does in production.
VAD-segmented mic
sounddevice captures at 16 kHz mono; WebRTC VAD slices utterances. Configurable aggressiveness, min speech, and min silence.
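The segmentation rule above can be sketched as a pure function. This is a hypothetical illustration of the commit logic only: the per-frame booleans stand in for what `webrtcvad.Vad.is_speech()` would return over 30 ms frames captured by sounddevice; the threshold names and defaults are assumptions, not the bridge's real config keys.

```python
FRAME_MS = 30  # webrtcvad accepts 10/20/30 ms frames at 16 kHz

def segment(frames, min_speech_ms=90, min_silence_ms=300):
    """frames: list of bools (True = speech frame).
    Returns (start, end) frame spans for committed utterances:
    a speech run commits after min_silence_ms of silence, and is
    kept only if it lasted at least min_speech_ms."""
    utterances, start, silence = [], None, 0
    for i, is_speech in enumerate(frames):
        if is_speech:
            if start is None:
                start = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence * FRAME_MS >= min_silence_ms:
                end = i - silence + 1
                if (end - start) * FRAME_MS >= min_speech_ms:
                    utterances.append((start, end))
                start, silence = None, 0
    # commit a trailing utterance if the stream ends mid-speech
    if start is not None and (len(frames) - start) * FRAME_MS >= min_speech_ms:
        utterances.append((start, len(frames)))
    return utterances
```

Raising `min_silence_ms` makes the bridge wait longer before treating a pause as end-of-utterance; raising `min_speech_ms` filters out coughs and clicks.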
Phase 1 — greetings
On `entered`, asks ai-engine for a greeting, publishes it as an utterance, avatar plays it. Works before any mic is configured.
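A minimal sketch of the Phase 1 wiring, assuming the `/api/room/greeting` endpoint shown in the diagram below and tenant-suffixed channel names. `fetch` and `publish` are injected stand-ins for the real HTTP client and `redis.Redis.publish`, so the flow itself is the only thing shown.

```python
import json

def handle_presence(event, tenant_id, fetch, publish):
    """On an `entered` presence event, ask ai-engine for a greeting
    and publish it as an utterance for the avatar to play.
    Other presence events are ignored in Phase 1."""
    if event.get("event") != "entered":
        return None
    greeting = fetch("/api/room/greeting", {"tenant_id": tenant_id})
    publish(f"avyra:utterance:{tenant_id}", json.dumps(greeting))
    return greeting
```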
Phase 2 — two-way turns
After the greeting: STT → `/api/room/turn` → next utterance. Redis-backed session state, looped until `ended: true`.
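The Phase 2 loop, sketched with injected callables (the real bridge would POST audio to the STT service and text to `/api/room/turn` with Redis-backed session state; those details are elided here):

```python
def run_turns(utterances, transcribe, take_turn, speak):
    """Feed each committed utterance through STT, then ai-engine,
    until ai-engine signals the session is over with ended: true."""
    for audio in utterances:
        text = transcribe(audio)    # POST audio to the STT service
        reply = take_turn(text)     # POST /api/room/turn
        speak(reply["utterance"])   # publish for the avatar to play
        if reply.get("ended"):
            return "ended"
    return "idle"                   # mic went quiet; idle timeout applies
```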
Echo cancellation
Avatar POSTs `speech-state: started/ended`; bridge mutes the mic during avatar audio plus a grace tail. Cheap, robust, no DSP.
Session lifecycle
Ends on vision `left`, silence beyond `SESSION_IDLE_TIMEOUT_SEC`, or ai-engine `ended: true`. Always a clean handoff.
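The three triggers reduce to a first-match check, sketched here (the 120 s default is a placeholder; in production it comes from `SESSION_IDLE_TIMEOUT_SEC`):

```python
def session_should_end(presence_event, last_heard_ago_sec, engine_reply,
                       idle_timeout_sec=120):
    """First trigger wins: vision `left`, idle timeout, or ai-engine
    returning ended: true. Returns the reason, or None to keep going."""
    if presence_event == "left":
        return "left"
    if last_heard_ago_sec > idle_timeout_sec:
        return "idle_timeout"
    if engine_reply.get("ended"):
        return "engine_ended"
    return None
```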
No hardcoded copy
Every greeting, response, and farewell is generated by ai-engine. The bridge is plumbing — the brain is the writer.
Tenant-scoped channels
All Redis channels (`presence`, `utterance`, `speech_state`) are suffixed by tenant_id. One bridge process per room is the simplest deployment.
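Tenant scoping is a naming convention, roughly (the exact prefix and separator are assumptions based on the channel names in the diagram below):

```python
def channel(name, tenant_id):
    """Suffix a Redis channel with the tenant id so two rooms
    never hear each other's events."""
    return f"avyra:{name}:{tenant_id}"
```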
Tunable to the room
Mic device, sample rate, VAD aggressiveness, min/max speech durations, unmute grace, idle timeout — all from `.env`.
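A sketch of how that config might load, with everything defaulted so an empty `.env` still runs. Only `SESSION_IDLE_TIMEOUT_SEC` is named in these docs; the other variable names and defaults are hypothetical stand-ins.

```python
import os
from dataclasses import dataclass, field

def env_int(name, default):
    """Read an integer env var, falling back to a sane default."""
    return int(os.getenv(name, str(default)))

@dataclass
class BridgeConfig:
    mic_device: str = field(default_factory=lambda: os.getenv("MIC_DEVICE", "default"))
    sample_rate: int = field(default_factory=lambda: env_int("SAMPLE_RATE", 16000))
    vad_aggressiveness: int = field(default_factory=lambda: env_int("VAD_AGGRESSIVENESS", 2))
    min_speech_ms: int = field(default_factory=lambda: env_int("MIN_SPEECH_MS", 90))
    min_silence_ms: int = field(default_factory=lambda: env_int("MIN_SILENCE_MS", 300))
    unmute_grace_ms: int = field(default_factory=lambda: env_int("UNMUTE_GRACE_MS", 600))
    idle_timeout_sec: int = field(default_factory=lambda: env_int("SESSION_IDLE_TIMEOUT_SEC", 120))
```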
How it's built.
- Mic → WebRTC VAD → whole-utterance commit → STT call → ai-engine turn.
- Echo cancellation by speech-state pub/sub — no DSP, no aggressive AEC, no false negatives.
- Session state persisted in Redis under tenant_id; replay-safe.
- Ships with sane defaults so an existing Phase 1 `.env` keeps working into Phase 2.
- One process per room — boring, predictable, easy to monitor.
vision ─► avyra:presence ─┐
                          │      ┌─► avyra:utterance ─► avatar
                          ▼      │
                ┌──── room-bridge ───┘
                │          │
                │          ▼
                │     ai-engine /api/room/{greeting,turn,close}
                │
                ▼
mic ─► VAD ─► STT (:8001) ─► /api/room/turn ─► utterance
                ▲
                │
avyra:speech_state (avatar → bridge: mute while speaking)
Who runs it today.
Clinic reception
Greet on entry, take the booking conversationally, close when ai-engine confirms or silence times out.
Restaurant host stand
Walk-up to host stand — bridge orchestrates the conversation; Dine takes the order when ai-engine routes to it.
Concierge / kiosk
Self-serve check-in or info kiosk where the conversation is the UI.
Questions we get asked.
Why two phases?
Phase 1 (greetings on entry) works without a microphone — it's the cheapest way to validate the room and audio chain. Phase 2 (two-way) layers in mic + VAD + STT once Phase 1 is solid.
What does echo cancellation cost?
Effectively nothing. The avatar publishes a speech-state event when it starts and ends speaking; the bridge mutes the mic in between plus a 600 ms grace tail. Simple and reliable.
How does the session end?
Three triggers — vision publishes `left`, silence exceeds `SESSION_IDLE_TIMEOUT_SEC`, or ai-engine returns `ended: true` (e.g. booking confirmed). First one wins.
Got a product to build? Tell us what you have in mind.
We kick off in days, not months. Working software in weeks. If we're not the right fit, we'll tell you up front.