CETA Global needed a scalable way for practitioners to practice complex clinical conversations beyond live training sessions.
CETA Global trains frontline mental and behavioral health providers around the world in an evidence-based mental health treatment protocol called CETA (the Common Elements Treatment Approach). The problem was twofold: there were not enough expert trainers to meet growing international demand, and practitioners had few scalable ways to practice difficult client conversations before working with real patients.
The result is EBTSim: a real-time AI simulation platform where psychologists hold live audio or video sessions with an AI-generated patient, receive in-session coaching from a second AI agent, and walk away with timestamped, replayable feedback on their clinical technique. CETA Global calls it their ‘flight simulator’ because it is the same idea as pilot training: building competency in a high-stakes scenario before you’re ever responsible for real lives.
Built in approximately two months for a December 2025 Google.org Accelerator: Generative AI demo, the platform was tested by users in the United States, Chile, and South Africa before launch. It has since been accepted into the accelerator program. WebRTC.ventures continues to work with CETA to integrate EBTSim into the broader CETA practitioner ecosystem.
Read: Bridging the Gap Between Mental Health Training and Real-World Readiness (CETA Global, January 27, 2026)
Watch: CETA Demo at 2025 Google.org Accelerator: GenAI Demo Day

EBTSim was not a standard AI build. It required real-time conversation, clinical roleplay, multi-agent orchestration, live coaching, and structured feedback to work together in one coherent experience. These capabilities are complex on their own, and even harder to bring together in a clinical training context. Working alongside our product, clinical, and engineering team, WebRTC.ventures was invaluable. They provided the specialized, real-time AI engineering capacity we needed to move from concept to a working platform - in just eight weeks!
The Challenge: Making Concurrent AI Feel Like a Single Conversation
Most AI applications operate in request-response cycles. A user submits something; a model processes it; a response comes back. EBTSim couldn’t work that way. A therapy simulation has to feel like a real conversation, which means audio capture, transcription, AI reasoning, response generation, text-to-speech, and avatar rendering all had to happen fast enough that the psychologist never felt like they were waiting on a machine.
At the same time, the session was a structured training experience, so a second AI agent had to monitor the exchange in real time and surface coaching hints when the trainee drifted off protocol. A third had to track which CETA protocol steps had been completed. A fourth had to update a shared whiteboard. All of this had to run concurrently, in the same live session, without collapsing the experience into lag.
The Architecture: WebSockets, Multi-Agent AI, and Push-to-Talk
Building EBTSim required solving two problems simultaneously: keeping latency low enough for a live conversation to feel natural, and coordinating multiple AI agents inside that same session without any one of them disrupting the experience. The architecture reflects both constraints: it was designed from the start with real-time communication as the foundation and multiple AI agents layered in as first-class participants.
Real-Time Communication Layer
The live session was built on WebSockets rather than SSE, because the communication had to be genuinely bidirectional. The browser captured microphone audio and streamed it to the backend. The backend returned transcripts, AI responses, coaching signals, protocol step updates, whiteboard events, generated audio, and avatar actions simultaneously, all in real time.
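One common way to carry several kinds of backend-to-browser traffic over a single WebSocket is a small typed envelope around every message. The sketch below illustrates that idea only; the `SessionEvent` class, the event kind names, and the field names are assumptions made for this example, not EBTSim's actual wire format.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any

# Hypothetical event kinds, named after the streams listed in the paragraph above.
EVENT_KINDS = {
    "transcript", "ai_response", "coaching_hint", "protocol_step",
    "whiteboard", "audio", "avatar_action",
}


@dataclass(frozen=True)
class SessionEvent:
    """One message on the socket; `kind` tells the receiver how to route it."""
    kind: str
    session_id: str
    payload: dict[str, Any]

    def __post_init__(self) -> None:
        # Reject unknown kinds early so a typo never reaches the client.
        if self.kind not in EVENT_KINDS:
            raise ValueError(f"unknown event kind: {self.kind!r}")


def encode(event: SessionEvent) -> str:
    """Serialize an event to the JSON text frame sent over the socket."""
    return json.dumps(asdict(event))


def decode(raw: str) -> SessionEvent:
    """Parse a JSON text frame back into a validated event."""
    data = json.loads(raw)
    return SessionEvent(data["kind"], data["session_id"], data["payload"])
```

With an envelope like this, the frontend can dispatch on `kind` and update the transcript, coaching panel, or whiteboard independently, even though every message arrives on the same connection.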
Google Cloud Speech-to-Text with Chirp 3 handled transcription with the low latency the session required. Google Cloud Text-to-Speech and Gemini 2.5 Flash TTS converted AI responses back to audio. HeyGen Streaming Avatar rendered a lip-synced visual client on screen.
One important UX decision shaped the whole voice pipeline: push-to-talk over automatic turn detection. The team evaluated fully automatic bidirectional streaming, which worked technically, but psychologists often pause for long periods while thinking, and automatic systems cut those pauses off too early. Push-to-talk gave practitioners control over when their turn ended, which turned out to be essential for a training context where deliberate, unhurried thinking is part of good clinical practice.
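The difference from automatic turn detection can be shown with a minimal buffer in which silence never ends a turn and only an explicit release does. This is an illustrative sketch, not the project's code; the `PushToTalkBuffer` class and its method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PushToTalkBuffer:
    """Collects audio chunks only while the talk button is held."""
    _chunks: list[bytes] = field(default_factory=list)
    _held: bool = False

    def press(self) -> None:
        # Start a fresh turn when the button goes down.
        self._chunks.clear()
        self._held = True

    def feed(self, chunk: bytes) -> None:
        # Audio arriving while the button is up is discarded.
        # Empty or silent chunks are kept: a pause does not end the turn.
        if self._held:
            self._chunks.append(chunk)

    def release(self) -> bytes:
        # The turn ends only here, when the practitioner lets go.
        self._held = False
        audio = b"".join(self._chunks)
        self._chunks.clear()
        return audio
```

The bytes returned by `release()` would then be handed to transcription as one complete turn, however long the speaker paused in the middle.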
AI Architecture
Two decisions shaped EBTSim’s AI layer more than any others.
The first was decomposition. The instinct with LLM products is often to write one large prompt and ask the model to do everything. That breaks down quickly when different parts of the experience have different jobs and different failure modes. EBTSim runs six purpose-built Gemini-powered agents through Google ADK, each owning one responsibility:
- Simulated Client Agent — maintains a consistent persona, backstory, and emotional state across the session
- Real-Time Coaching Agent — monitors the conversation and surfaces hints when the trainee needs redirection
- Protocol Tracking Agent — watches for completion of specific CETA protocol steps and updates session state
- Progress Assessment Agent — evaluates overall session trajectory
- Whiteboard Agent — extracts content from therapeutic exercises like the Triangle of Difficult Worry and updates the shared whiteboard
- Evaluation Agent — generates the final feedback report with timestamps and replayable moments
Each agent can be tuned and tested independently without one change breaking something else. This is a meaningful advantage when evaluation criteria have to reflect how CETA experts actually judge practitioner performance.
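The sketch below shows the general shape of single-responsibility agents running concurrently on each turn. It uses plain `asyncio` rather than Google ADK, and the agent functions are trivial stand-ins for model calls; every name and rule in it is an assumption made for illustration.

```python
import asyncio
from typing import Awaitable, Callable

# An "agent" here is any async function from an utterance to a result dict.
Agent = Callable[[str], Awaitable[dict]]


async def coaching_agent(utterance: str) -> dict:
    await asyncio.sleep(0)  # placeholder for a model call
    hint = None if "?" in utterance else "Consider asking a question."
    return {"agent": "coaching", "hint": hint}


async def protocol_agent(utterance: str) -> dict:
    await asyncio.sleep(0)  # placeholder for a model call
    return {"agent": "protocol", "step_done": "worry" in utterance.lower()}


async def run_observers(utterance: str, agents: list[Agent]) -> list[dict]:
    """Run all observer agents concurrently and drop any that failed."""
    results = await asyncio.gather(
        *(agent(utterance) for agent in agents),
        return_exceptions=True,  # one failing agent must not sink the others
    )
    return [r for r in results if not isinstance(r, BaseException)]
```

Because each agent is an independent function, it can be tested or swapped out without touching the others, and a failure in one is contained rather than interrupting the live session.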
The second was making client resistance adaptive rather than fixed. A static instruction like “do not open up” produced rigid, unrealistic sessions. Instead, trust was modeled as a dynamic state that the client agent tracked across turns, responding to the practitioner’s actual technique rather than running a predetermined script.
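A dynamic trust state can be pictured as a bounded score that moves with each practitioner turn and is mapped to how much the simulated client discloses. The sketch below is a toy model under stated assumptions: the technique labels, the point values, and the thresholds are all invented for illustration and do not come from CETA or from EBTSim.

```python
from dataclasses import dataclass

# Hypothetical technique labels and effects, in trust points (assumed values).
TECHNIQUE_EFFECT = {
    "validation": 15,
    "open_question": 10,
    "interruption": -20,
    "advice_too_early": -10,
}


@dataclass
class TrustState:
    level: int = 20  # 0 = fully guarded, 100 = fully open

    def update(self, technique: str) -> None:
        """Shift trust by the technique's effect, clamped to 0..100."""
        delta = TECHNIQUE_EFFECT.get(technique, 0)
        self.level = min(100, max(0, self.level + delta))

    def disclosure(self) -> str:
        """Map the numeric level to a coarse instruction for the client agent."""
        if self.level < 40:
            return "guarded"
        if self.level < 70:
            return "tentative"
        return "open"
```

The value of `disclosure()` would be passed into the client agent's context on each turn, so the persona opens up or withdraws in response to what the practitioner actually does.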
Underpinning both was a structured content layer with scenario definitions and protocol configurations that kept AI behavior grounded in clinical standards without asking the models to carry that knowledge on their own.
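One way to keep clinical knowledge out of the model's head is to hold scenarios and protocol steps as plain data and derive both the client instructions and the step checklist from it. The following is a minimal sketch of that pattern; the `Scenario` and `ProtocolStep` types, their fields, and the helper functions are hypothetical, and real scenario content would come from clinical experts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProtocolStep:
    step_id: str
    description: str


@dataclass(frozen=True)
class Scenario:
    name: str
    persona: str                     # backstory text supplied by clinical experts
    steps: tuple[ProtocolStep, ...]  # ordered protocol steps for this scenario


def build_client_instructions(scenario: Scenario) -> str:
    """Derive the simulated client's instructions from scenario data."""
    return f"You are role-playing a client. Persona: {scenario.persona}"


def remaining_steps(scenario: Scenario, completed: set[str]) -> list[str]:
    """Return the ids of protocol steps not yet completed, in order."""
    return [s.step_id for s in scenario.steps if s.step_id not in completed]
```

Keeping this layer as data means a protocol change is a configuration edit that can be reviewed by clinicians, not a prompt rewrite.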
Tech Stack
| Layer | Technologies |
| --- | --- |
| AI / LLMs | Google ADK, Gemini 2.5 Flash, Gemini 2.0 Flash, Gemini 2.5 Flash TTS, Vertex AI |
| Real-Time Communication | WebSockets, push-to-talk audio pipeline, HeyGen Streaming Avatar |
| Speech | Google Cloud Speech-to-Text v2 (Chirp 3), Google Cloud Text-to-Speech |
| Frontend | Next.js, React, TypeScript, Tailwind CSS, Radix UI, Zustand, WebAudio, Excalidraw |
| Backend | Python 3.12, FastAPI, multi-agent session orchestration via Google ADK |
| Database | PostgreSQL on GCP |
| Infrastructure | Google Cloud Platform, Cloud Run, Cloud Build, Docker, Artifact Registry, Secret Manager, Google Cloud Storage |
| Observability | Arize / OpenInference tracing |
Client: CETA Global
Type of Application:
EBTSim is a real-time AI-powered clinical training simulator that combines voice and video AI roleplay, multi-agent orchestration, live coaching, and automated assessment to help healthcare professionals build and demonstrate competency in realistic patient scenarios.
How We Helped:
As part of a cross-functional team bringing EBTSim to market, WebRTC.ventures helped transform the concept into a production-ready platform, engineering key components including the real-time communication layer, AI orchestration framework, frontend training experience, cloud infrastructure, and voice pipeline.
Project Components:
- Live Audio/Video Simulation. WebSocket-based real-time session with AI client, push-to-talk voice pipeline, lip-synced streaming avatar, and low-latency speech-to-text and text-to-speech
- Multi-Agent AI Orchestration. Six specialized Gemini-powered agents handling client simulation, real-time coaching, protocol tracking, progress assessment, whiteboard interaction, and post-session evaluation
- Adaptive Client Behavior. Dynamic trust state that responds to practitioner technique across the session
- Structured Clinical Content Layer. Scenario definitions and protocol configurations that ground AI behavior in CETA standards
- Post-Session Evaluation. Timestamped, replayable feedback report generated immediately after each session