This project shows how to prototype a real-time voice AI Android app using Gemini 2.0’s Live API over WebSockets as an open-source proof of concept before committing to full production infrastructure. By combining low-level audio control on Android, duplex audio streaming, and multimodal AI, we built an emotional support companion that behaves like a listener instead of a traditional chatbot.

The architecture uses Gemini 2.0 and the Gemini Live API for conversational AI, Android’s AudioRecord and AudioTrack APIs for hardware-level audio control, and a Python WebSocket backend that relays audio in both directions while enforcing safety and behavioral constraints. The Android client continuously captures microphone input, streams raw audio to the backend, and plays AI responses as they arrive, creating a natural, interruptible conversation.

In this tutorial, we walk through the full mobile app implementation from audio capture to AI response using the complete open-source code, plus guidance on upgrading from WebSockets to WebRTC for production voice AI apps. Whether you’re building a mental health companion, mobile voice assistant, or conversational AI on Android, this guide provides the practical foundation to prototype, test, and scale real-time Voice AI.

Voice AI Android Architecture with Gemini 2.0

Voice AI on Android presents unique challenges: hardware echo cancellation, background thread management, and sub-100ms latency requirements.

Duplex Streaming on Android

The system follows a duplex streaming architecture that separates hardware-level audio processing from AI orchestration. Android manages microphone capture, playback, and UI synchronization, while the backend operates as a relay rather than a processor. This design prevents UI blocking and avoids the buffering delays caused by request-response APIs.

Communication between the client and backend uses persistent WebSockets. This choice enables simultaneous uplink and downlink of audio data, which is required for natural voice interaction. The user can interrupt the AI mid-response, and playback stops immediately without waiting for a turn to complete. This conversational behavior is not achievable with REST-based APIs.

Audio flows from the Android device to the backend as raw PCM bytes. The backend forwards the stream directly to Gemini as live input and relays synthesized speech back to the client in small chunks. Android plays these chunks immediately, preserving timing accuracy and conversational rhythm.
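
To make the transport concrete, here is a minimal duplex client sketch in Kotlin. The OkHttp-style listener shown later in WebSocketManager.kt suggests OkHttp, but that is an assumption, and the class name, endpoint handling, and callback below are illustrative rather than taken from the repository.

Kotlin
// Minimal duplex WebSocket client sketch (assumes OkHttp; names are illustrative)
import okhttp3.OkHttpClient
import okhttp3.Request
import okhttp3.WebSocket
import okhttp3.WebSocketListener
import okio.ByteString
import okio.ByteString.Companion.toByteString

class DuplexAudioClient(
    private val onAudioChunk: (ByteArray) -> Unit // downlink: play each chunk as it arrives
) : WebSocketListener() {

    private val client = OkHttpClient()
    private var ws: WebSocket? = null

    fun connect(url: String) {
        ws = client.newWebSocket(Request.Builder().url(url).build(), this)
    }

    // Uplink: raw PCM from the microphone goes out as binary frames
    fun sendAudio(pcm: ByteArray) {
        ws?.send(pcm.toByteString())
    }

    // Downlink: synthesized speech arrives as binary frames
    override fun onMessage(webSocket: WebSocket, bytes: ByteString) {
        onAudioChunk(bytes.toByteArray())
    }
}

Because both directions share one persistent connection, capture and playback can overlap, which is what makes mid-response interruption possible.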

Real-Time Voice Handling on Android

For voice-based AI, audio quality matters as much as model intelligence. To make sure the AI receives clean and usable voice input, the Android app uses the AudioRecord API, configured specifically for voice communication. A configuration sketch follows the list below.

  • Noise Management and Echo Control: To prevent the AI from hearing its own voice during playback, hardware-level echo cancellation is enabled. This is essential for natural conversation and avoids feedback loops.
  • Audio Format and Sampling: Audio is captured as mono PCM data at 16,000 Hz with 16-bit encoding, which matches the expected input format for the Gemini Live API. This avoids unnecessary conversion and reduces processing delay.
  • Background Thread Processing: Audio capture runs on a dedicated background thread. This ensures that recording and streaming do not block the user interface and keeps the app responsive at all times. Only valid audio data is sent over the WebSocket, which helps maintain timing accuracy and avoids audio jitter.
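
As a reference, the sketch below shows one way a recorder matching that description could be configured. The exact constructor in AudioStreamer.kt may differ; the constants simply mirror the format described in the list above.

Kotlin
// Capture configuration sketch matching the format above (not verbatim repo code)
import android.annotation.SuppressLint
import android.media.AudioFormat
import android.media.AudioRecord
import android.media.MediaRecorder

val sampleRate = 16_000                       // 16 kHz, as expected by the Gemini Live API
val channelMask = AudioFormat.CHANNEL_IN_MONO // mono input
val encoding = AudioFormat.ENCODING_PCM_16BIT // 16-bit PCM
val bufferSize = AudioRecord.getMinBufferSize(sampleRate, channelMask, encoding)

@SuppressLint("MissingPermission") // RECORD_AUDIO must already be granted at runtime
fun buildRecorder(): AudioRecord =
    AudioRecord.Builder()
        .setAudioSource(MediaRecorder.AudioSource.VOICE_COMMUNICATION) // voice-tuned source
        .setAudioFormat(
            AudioFormat.Builder()
                .setSampleRate(sampleRate)
                .setChannelMask(channelMask)
                .setEncoding(encoding)
                .build()
        )
        .setBufferSizeInBytes(bufferSize)
        .build()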

Python WebSocket Backend

The Python backend acts as a high-speed relay between the Android app and Gemini 2.0.

Instead of traditional REST endpoints, the system uses FastAPI’s WebSocket support to keep a continuous connection open. This allows audio to be streamed in real time without waiting for full requests or responses.

Two streams run at the same time: one sending user audio to Gemini, and the other receiving AI-generated audio back.

  1. Uplink: Converting raw binary data from the app into types.Blob for Gemini.
  2. Downlink: Receiving binary audio responses from Gemini and pushing them back to the client.

Python
# snippet from main.py
async with client.aio.live.connect(model=model, config=config) as session:
    # Concurrent tasks for bidirectional flow
    await asyncio.gather(
        send_to_gemini(),
        receive_gemini_responses(),
    )

This async design is key to maintaining low latency and smooth conversation flow.

Defining the Emotional Support Persona and AI Safety Guardrails

An emotional support AI must behave responsibly. The system instruction prompt defines the AI’s personality and enforces strict safety rules.

  • The AI is designed to act as a calm and supportive companion. 
  • It focuses on grounding techniques and guided breathing rather than clinical advice. 
  • Any medical or pharmaceutical questions are intentionally redirected toward licensed professionals.

These guardrails ensure the AI remains helpful without crossing ethical or medical boundaries.

Real-Time UI Feedback for Android Voice AI

To make the AI feel responsive rather than mechanical, the Android client uses Lottie animations that are driven directly by live WebSocket events. When audio data is received from the backend, the application transitions the avatar into a speaking state. This visual cue confirms that the AI is actively responding and keeps the user aligned with the conversation flow.

When the backend signals that a response has completed or has been interrupted, the avatar immediately returns to a neutral state. These signals are propagated in real time, ensuring that animation timing always matches audio playback. By tightly coupling UI state with backend events, the interaction remains fluid, attentive, and free from scripted or delayed visual behavior.

In practice, as sketched below:

  • onAudioReceived triggers the speaking animation in MainActivity.kt. 
  • onAnimationStop listens for turn_complete or interrupted signals from the backend to reset the avatar to its idle state.
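
A minimal sketch of that wiring is shown below. The callback names come from the client described above, but it assumes they are assignable lambda properties, and the helper function, Lottie asset names, and view reference are illustrative only.

Kotlin
// Hypothetical wiring for MainActivity.kt (asset names and helper are illustrative)
import android.app.Activity
import com.airbnb.lottie.LottieAnimationView

fun bindAvatar(
    activity: Activity,
    avatarView: LottieAnimationView,
    webSocketManager: WebSocketManager, // exposes onAudioReceived / onAnimationStop
    audioPlayer: AudioPlayer
) {
    webSocketManager.onAudioReceived = { chunk: ByteArray ->
        audioPlayer.play(chunk) // keep playback off the UI thread
        activity.runOnUiThread {
            avatarView.setAnimation("avatar_speaking.json") // assumed asset name
            avatarView.playAnimation()
        }
    }
    webSocketManager.onAnimationStop = { _: Boolean ->
        activity.runOnUiThread {
            avatarView.setAnimation("avatar_idle.json") // assumed asset name
            avatarView.playAnimation()
        }
    }
}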

Low-Latency Design for Android Voice AI Apps

Low latency is achieved by replacing HTTP and REST endpoints with persistent WebSocket connections. This removes request-response overhead and enables the near-instant audio feedback required for sensory grounding and real-time emotional support.

Client-side resource management relies on the AudioTrack API in MODE_STREAM. Audio plays in small chunks as it arrives from the backend, which avoids full-file buffering and keeps playback synchronized with live AI responses.

Android Push-to-Talk Engine for Voice AI

The Android client uses a push-to-talk audio engine built on AudioRecord for capture and AudioTrack for playback. This architecture provides precise control over audio timing, which is required for real-time voice interaction.

Real-Time Audio Capture and Streaming

Audio capture runs in a dedicated background thread inside AudioStreamer.kt. The engine continuously reads raw PCM data from the microphone and streams it to the backend as soon as it becomes available. This design keeps the UI responsive while processing high-frequency audio data.

Android VOICE_COMMUNICATION Source Setup

The application uses MediaRecorder.AudioSource.VOICE_COMMUNICATION instead of MIC. This source enables system-level Automatic Gain Control and Noise Suppression, which stabilize volume and reduce background noise before the audio reaches the AI model.
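
The platform applies these stages automatically for the VOICE_COMMUNICATION source. On devices that also expose them as session-level audio effects, they can be attached explicitly as a defensive measure, mirroring the echo-cancellation pattern in the next section. The sketch below is an optional addition, not code from the repository; the session id would come from the same recorder.audioSessionId used there.

Kotlin
// Optional, defensive sketch: explicitly attach NS/AGC to the capture session when available
import android.media.audiofx.AutomaticGainControl
import android.media.audiofx.NoiseSuppressor

fun attachVoiceEffects(audioSessionId: Int) {
    if (NoiseSuppressor.isAvailable()) {
        NoiseSuppressor.create(audioSessionId)?.setEnabled(true)
    }
    if (AutomaticGainControl.isAvailable()) {
        AutomaticGainControl.create(audioSessionId)?.setEnabled(true)
    }
}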

Hardware Echo Cancellation Code

Hardware echo cancellation is enabled when supported. This prevents the AI from capturing its own synthesized speech during playback, which protects conversational turn accuracy.

// From AudioStreamer.kt
init {
    // Hardware-level Echo Cancellation
    if (AcousticEchoCanceler.isAvailable()) {
        val echoCanceler = AcousticEchoCanceler.create(recorder.audioSessionId)
        echoCanceler?.setEnabled(true) // create() can return null on some devices
    }
}

fun start() {
    streaming = true
    recorder.startRecording()
    val buffer = ByteArray(bufferSize)

    Thread {
        while (streaming) {
            val read = recorder.read(buffer, 0, buffer.size)
            if (read > 0) {
                // Send only the actual bytes read to the WebSocket
                ws.send(buffer.copyOf(read))
            }
        }
    }.start()
}

Precise Audio Byte Transmission Over WebSockets

Only the exact number of bytes read from the microphone buffer is sent over the WebSocket. This reduces unnecessary data transfer and preserves consistent audio timing across the pipeline.

Gemini 2.0 Live API Backend Configuration

The backend operates as a low-latency relay using asyncio.gather to manage concurrent inbound audio and outbound model responses. Gemini 2.0 Flash relies on automatic_activity_detection to detect silence, support barge-ins, and maintain natural conversational turn-taking without explicit state handling.

Python
# From main.py
config = types.LiveConnectConfig(
    response_modalities=[types.Modality.AUDIO],
    realtime_input_config=types.RealtimeInputConfig(
        automatic_activity_detection={
            "prefix_padding_ms": 300, # Buffer before speech
            "silence_duration_ms": 500, # Time to wait before responding
        }
    ),
    speech_config=types.SpeechConfig(
        voice_config=types.VoiceConfig(
            prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Puck")
        )
    )
)

Voice AI Barge-In and Interruption Handling

One of the most complex problems in voice AI is stopping the bot immediately when the user starts speaking again. The Android client solves this by listening for explicit control signals from the backend over the WebSocket. When a barge-in occurs or a response ends, the backend sends interrupted or turn_complete, which forces an immediate avatar stop and clears the audio playback buffer.

Kotlin
// From WebSocketManager.kt
override fun onMessage(webSocket: WebSocket, text: String) {
    super.onMessage(webSocket, text)
    // Listen for control signals embedded in the text stream
    if (text.contains("interrupted") || text.contains("turn_complete")) {
        onAnimationStop?.invoke(true)
    }
}
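
The excerpt above only resets the avatar. For a convincing barge-in, any audio already queued for playback also needs to be discarded. A minimal sketch, assuming AudioPlayer exposes a hypothetical helper like flushPlayback() that the same handler could call:

Kotlin
// Hypothetical helper for AudioPlayer.kt: discard queued speech when the user barges in
fun flushPlayback() {
    audioTrack.pause() // stop output immediately; flush() requires a paused or stopped track
    audioTrack.flush() // drop any PCM still queued in the stream buffer
    // Playback resumes when the track is restarted and the next chunk is written
}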

Low-Latency Audio Playback

Playback uses AudioTrack in MODE_STREAM to achieve minimal latency. This is more efficient than a standard MediaPlayer because it doesn’t need to load a full file; it plays the raw bytes as they arrive from the WebSocket.

Kotlin
// From AudioPlayer.kt
private val audioTrack = AudioTrack.Builder()
    .setAudioFormat(
        AudioFormat.Builder()
            .setEncoding(AudioFormat.ENCODING_PCM_16BIT)
            .setSampleRate(24000) // Gemini sends 24kHz audio back
            .setChannelMask(AudioFormat.CHANNEL_OUT_MONO)
            .build()
    )
    .setTransferMode(AudioTrack.MODE_STREAM)
    .build()

fun play(data: ByteArray) {
    if (audioTrack.playState != AudioTrack.PLAYSTATE_PLAYING) {
        audioTrack.play() // MODE_STREAM only produces sound once the track is started
    }
    audioTrack.write(data, 0, data.size)
}

This behavior allows the AI to feel interruptible and attentive rather than scripted, which is especially important for emotional support applications where users may change direction mid-sentence.

That interruptibility works hand in hand with the system instruction described earlier, which keeps the companion within its calm, supportive persona and its safety guardrails.

Open Source Gemini Voice AI Android Project

The full implementation, including the Android client, Python backend, and Gemini Live configuration, is available as an open-source proof of concept in this GitHub repository. It demonstrates how low-level audio control, duplex WebSocket streaming, and multimodal AI combine to create a responsive emotional support companion that listens, responds, and adapts in real time.

WebSockets vs WebRTC: Choosing the Right Transport for Voice AI

This prototype uses WebSockets, which is what the Gemini Live API supports today for streaming audio to and from the model. For proof-of-concept development on Android, this WebSocket-based approach with Gemini gets you to a working voice AI demo quickly.

However, for consumer-facing or clinically sensitive mobile apps (especially emotional support, mental health, or telehealth experiences where reliability is critical), WebRTC provides a more robust production transport layer. WebRTC offers built-in network adaptation, optimized battery usage, lower end-to-end latency on constrained networks, and media-focused security, making it the same foundation used by platforms like Google Meet and Zoom.

If you are ready to move from a WebSocket prototype to a production-grade Voice AI app, you can pair this kind of Gemini-based conversational logic with WebRTC-based platforms such as LiveKit or Daily/Pipecat, or with OpenAI’s Realtime API over WebRTC, to deliver resilient, always-on mobile experiences.

Ready to build or evolve your Voice AI app? 

WebRTC.ventures has deep experience building production-ready Voice AI systems that span WebSocket prototypes, Gemini and OpenAI-based agents, LiveKit, Daily/Pipecat, and custom WebRTC media pipelines. If you are exploring a Voice AI Android prototype or planning a migration from WebSockets to WebRTC for your mobile app, we can help you design and implement an architecture that balances latency, reliability, and safety for your users.

Reach out to the Voice AI team at WebRTC.ventures to discuss your roadmap.
