Poised Genie
Room orchestration · Production

Where the room meets the brain.

Avyra Room Bridge is the orchestration layer that makes the rest of the suite act like one room. It listens with a real microphone, knows when to mute itself while the avatar speaks, pipes speech through STT, and asks ai-engine what to say next. Every word in the room comes from the brain — nothing is hardcoded here.

Mic capture · 16 kHz mono
VAD · WebRTC, 0–3
Echo cancel · Speech-state
Idle timeout · 45s default
Features

What it does in production.

VAD-segmented mic

sounddevice captures at 16 kHz mono; WebRTC VAD slices the stream into utterances. Aggressiveness, minimum speech, and minimum silence are all configurable.
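
A minimal sketch of that capture loop, assuming 30 ms frames; the queue wiring, frame counts, and threshold defaults are illustrative rather than the bridge's actual values.

```python
import queue

import sounddevice as sd
import webrtcvad

SAMPLE_RATE = 16_000
FRAME_MS = 30                                   # webrtcvad accepts 10/20/30 ms frames
FRAME_SAMPLES = SAMPLE_RATE * FRAME_MS // 1000  # 480 samples per frame

vad = webrtcvad.Vad(2)                          # aggressiveness 0-3
frames: queue.Queue[bytes] = queue.Queue()

def on_audio(indata, frame_count, time_info, status):
    frames.put(bytes(indata))                   # 16-bit mono PCM frame

def utterances(min_speech_frames=10, min_silence_frames=15):
    """Yield whole utterances as raw PCM once trailing silence is long enough."""
    buffer, silence = [], 0
    while True:
        frame = frames.get()
        if vad.is_speech(frame, SAMPLE_RATE):
            buffer.append(frame)
            silence = 0
        elif buffer:
            silence += 1
            if silence >= min_silence_frames:
                if len(buffer) >= min_speech_frames:
                    yield b"".join(buffer)
                buffer, silence = [], 0

with sd.RawInputStream(samplerate=SAMPLE_RATE, channels=1, dtype="int16",
                       blocksize=FRAME_SAMPLES, callback=on_audio):
    for utterance in utterances():
        ...  # hand the committed utterance to STT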

Phase 1 — greetings

On `entered`, the bridge asks ai-engine for a greeting and publishes it as an utterance; the avatar plays it. Works before any mic is configured.
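
Roughly what the Phase 1 path can look like. The channel suffix format, the `text` payload field, and the ai-engine base URL are assumptions for illustration; only the `entered` event and the `/api/room/greeting` route come from the docs.

```python
import json

import httpx
import redis

TENANT_ID = "demo-tenant"            # assumption: channels are suffixed with the tenant_id
AI_ENGINE = "http://localhost:8000"  # assumed ai-engine base URL

r = redis.Redis()
pubsub = r.pubsub()
pubsub.subscribe(f"avyra:presence:{TENANT_ID}")

for message in pubsub.listen():
    if message["type"] != "message":
        continue
    event = json.loads(message["data"])
    if event.get("event") == "entered":
        # Ask the brain for the greeting, then hand it to the avatar as an utterance.
        resp = httpx.post(f"{AI_ENGINE}/api/room/greeting", json={"tenant_id": TENANT_ID})
        greeting = resp.json()["text"]   # assumed response field
        r.publish(f"avyra:utterance:{TENANT_ID}", json.dumps({"text": greeting}))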

Phase 2 — two-way turns

After the greeting: STT → `/api/room/turn` → next utterance. Redis-backed session state, looped until `ended: true`.
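
One turn of that loop might look like this. The STT route on :8001 and the request/response field names are assumptions; `/api/room/turn` and the `ended: true` flag are from the docs.

```python
import json

import httpx
import redis

r = redis.Redis()

def run_turn(pcm: bytes, tenant_id: str, session_id: str) -> bool:
    """Transcribe one committed utterance, ask ai-engine for the reply, publish it.

    Returns True while the conversation should continue.
    """
    stt = httpx.post("http://localhost:8001/transcribe",      # assumed STT route
                     content=pcm,
                     headers={"Content-Type": "audio/l16; rate=16000"})
    text = stt.json()["text"]

    turn = httpx.post("http://localhost:8000/api/room/turn",  # assumed ai-engine base URL
                      json={"tenant_id": tenant_id,
                            "session_id": session_id,
                            "text": text}).json()

    r.publish(f"avyra:utterance:{tenant_id}", json.dumps({"text": turn["text"]}))
    return not turn.get("ended", False)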

Echo cancellation

Avatar POSTs `speech-state: started/ended`; bridge mutes the mic during avatar audio plus a grace tail. Cheap, robust, no DSP.
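
The mute logic fits in a flag and a timestamp. The 600 ms grace tail is the value quoted in the FAQ below; the function names and locking here are illustrative.

```python
import threading
import time

UNMUTE_GRACE_SEC = 0.6          # grace tail after the avatar stops speaking

_lock = threading.Lock()
_speaking = False
_last_ended = 0.0

def on_speech_state(event: str) -> None:
    """Called when the avatar reports `started` or `ended`."""
    global _speaking, _last_ended
    with _lock:
        if event == "started":
            _speaking = True
        elif event == "ended":
            _speaking = False
            _last_ended = time.monotonic()

def mic_muted() -> bool:
    """Drop mic frames while the avatar speaks and during the grace tail."""
    with _lock:
        if _speaking:
            return True
        return (time.monotonic() - _last_ended) < UNMUTE_GRACE_SEC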

Session lifecycle

Ends on vision `left`, silence beyond `SESSION_IDLE_TIMEOUT_SEC`, or ai-engine `ended: true`. Always a clean handoff.
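
The three triggers reduce to one cheap check per loop, sketched below; only `SESSION_IDLE_TIMEOUT_SEC` and the `/api/room/close` route are documented names, the payload and function shape are assumed.

```python
import os
import time

import httpx

SESSION_IDLE_TIMEOUT_SEC = float(os.getenv("SESSION_IDLE_TIMEOUT_SEC", "45"))

def should_end(person_left: bool, last_speech_at: float, engine_ended: bool) -> bool:
    """True as soon as any of the three triggers fires - first one wins."""
    idle = (time.monotonic() - last_speech_at) > SESSION_IDLE_TIMEOUT_SEC
    return person_left or idle or engine_ended

def close_session(session_id: str) -> None:
    # Assumed payload; the docs only name the /api/room/close route.
    httpx.post("http://localhost:8000/api/room/close", json={"session_id": session_id})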

No hardcoded copy

Every greeting, response, and farewell is generated by ai-engine. The bridge is plumbing — the brain is the writer.

Tenant-scoped channels

All Redis channels (`presence`, `utterance`, `speech_state`) are suffixed by tenant_id. One bridge process per room is the simplest deployment.
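
The suffixing can be a one-line helper; the exact `avyra:<name>:<tenant_id>` format is inferred from the architecture diagram and may differ in the real bridge.

```python
CHANNELS = ("presence", "utterance", "speech_state")

def channel(name: str, tenant_id: str) -> str:
    """e.g. channel("presence", "clinic-01") -> "avyra:presence:clinic-01"."""
    return f"avyra:{name}:{tenant_id}"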

Tunable to the room

Mic device, sample rate, VAD aggressiveness, min/max speech durations, unmute grace, idle timeout — all from `.env`.
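
One way those knobs surface in code, sketched as a frozen settings object; only `SESSION_IDLE_TIMEOUT_SEC` is a variable name taken from the docs, the other env names and defaults are placeholders.

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class BridgeSettings:
    mic_device: str | None
    sample_rate: int
    vad_aggressiveness: int
    min_speech_sec: float
    max_speech_sec: float
    unmute_grace_sec: float
    session_idle_timeout_sec: float

    @classmethod
    def from_env(cls) -> "BridgeSettings":
        # Defaults mirror the figures quoted on this page; names other than
        # SESSION_IDLE_TIMEOUT_SEC are illustrative.
        return cls(
            mic_device=os.getenv("MIC_DEVICE"),
            sample_rate=int(os.getenv("SAMPLE_RATE", "16000")),
            vad_aggressiveness=int(os.getenv("VAD_AGGRESSIVENESS", "2")),
            min_speech_sec=float(os.getenv("MIN_SPEECH_SEC", "0.3")),
            max_speech_sec=float(os.getenv("MAX_SPEECH_SEC", "30")),
            unmute_grace_sec=float(os.getenv("UNMUTE_GRACE_SEC", "0.6")),
            session_idle_timeout_sec=float(os.getenv("SESSION_IDLE_TIMEOUT_SEC", "45")),
        )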

Architecture

How it's built.

  • Mic → WebRTC VAD → whole-utterance commit → STT call → ai-engine turn.
  • Echo cancellation by speech-state pub/sub — no DSP, no aggressive AEC, no false negatives.
  • Session state persisted in Redis under tenant_id; replay-safe.
  • Ships with sane defaults so an existing Phase 1 `.env` keeps working into Phase 2.
  • One process per room — boring, predictable, easy to monitor.
vision ─► avyra:presence ─┐
                          │              ┌─► avyra:utterance ─► avatar
                          ▼              │
                    ┌──── room-bridge ───┘
                    │           │
                    │           ▼
                    │  ai-engine /api/room/{greeting,turn,close}
                    │
                    ▼
            mic ─► VAD ─► STT (:8001) ─► /api/room/turn ─► utterance
                    ▲
                    │
              avyra:speech_state  (avatar → bridge: mute while speaking)
Use cases

Who runs it today.

Clinic reception

Greet on entry, take the booking conversationally, close when ai-engine confirms or silence times out.

Restaurant host stand

Walk-ups at the stand: the bridge orchestrates the conversation; Dine takes the order when ai-engine routes to it.

Concierge / kiosk

Self-serve check-in or info kiosk where the conversation is the UI.

Built with: Python 3.10+ · sounddevice · webrtcvad · Redis · Whisper (faster-whisper) · FastAPI client
FAQ

Questions we get asked.

Why two phases?

Phase 1 (greetings on entry) works without a microphone — it's the cheapest way to validate the room and audio chain. Phase 2 (two-way) layers in mic + VAD + STT once Phase 1 is solid.

What does echo cancellation cost?

Effectively nothing. The avatar publishes a speech-state event when it starts and ends speaking; the bridge mutes the mic in between plus a 600 ms grace tail. Simple and reliable.

How does the session end?

Three triggers — vision publishes `left`, silence exceeds `SESSION_IDLE_TIMEOUT_SEC`, or ai-engine returns `ended: true` (e.g. booking confirmed). First one wins.

Ready when you are

Got a product to build? Tell us what you have in mind.

We kick off in days, not months. Working software in weeks. If we're not the right fit, we'll tell you up front.