Build a Browser Voice Agent: How Gemini Types Give Context to Your App


Abhishek Bahukhandi

5 min read
Learn how to use Gemini's tool calling and session control signals to add context and agentic features to your UI without MCP or LangChain.

🚀 Key Takeaways

  • Direct Tool Calling: no need for LangChain or MCP frameworks.
  • Real-Time Context: Use Gemini's internal types to sync UI and audio perfectly.
  • Agentic Frontend: Execute local functions triggered directly by AI mid-conversation.

Building AI agents shouldn't require bloated frameworks. By leveraging Gemini's native tool calling, we built a browser voice agent for Taqari.com that does more than just talk: it acts.

"An AI agent isn't just a voice; it's a bridge between conversation and computation."

What You Will Learn From This Post

I will show you how to use Gemini's internal types to give deep context and agentic features to your UI. This ensures your frontend stays reactive to the AI's internal state.

The Gemini Signal Types Master Cheat Sheet

Gemini communicates its state through specific types. Here is the definitive guide to handling them in your app:

1. setupComplete

This is your "handshake" signal. Once received, the session is live and bi-directional. Best Practice: Use this to transition your UI from a "Connecting..." state to a live, pulsing "Connected" indicator.
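As a minimal sketch, here is how that handshake can drive a UI state machine. It assumes the Live API is reached over a raw WebSocket and each server message arrives as a parsed JSON object; the `{ setupComplete: {} }` shape follows the Live API's bidirectional stream, but verify it against your SDK version.

```typescript
// Connection-state reducer: flips the UI out of "Connecting..." exactly once,
// when the setupComplete handshake arrives.
type ConnectionState = "connecting" | "connected" | "closed";

interface ServerMessage {
  setupComplete?: Record<string, never>; // empty object in the Live API stream
}

function nextConnectionState(
  state: ConnectionState,
  msg: ServerMessage,
): ConnectionState {
  if (state === "connecting" && msg.setupComplete !== undefined) {
    return "connected"; // show the live, pulsing "Connected" indicator
  }
  return state; // all other messages leave the connection state alone
}
```

Keeping this as a pure function makes the transition trivially testable, independent of the WebSocket plumbing.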

2. modelTurn.parts[].inlineData

This carries the raw audio chunks generated by the model. How to handle: Push these chunks into a playback queue and process them sequentially to ensure smooth, non-jittery voice output.

3. turnComplete

The signal that the AI has finished its current response cycle. UI Tip: Use this to clear loading states or transcripts that show the AI is actively "thinking" or "generating."

4. interrupted

Crucial for a natural flow. If the user interrupts, Gemini sends this signal. Critical Action: You MUST clear your audio playback queue immediately. If you don't, old AI responses will keep playing even after the AI has stopped to listen to the new user input.
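Sketched as code, the interrupt handler does exactly two things: flush the queue and silence the chunk that is mid-playback. The `serverContent.interrupted` path mirrors the Live API message shape, and `stopCurrentSource` stands in for whatever stops your active Web Audio source node; both are assumptions to check against your SDK.

```typescript
interface InterruptibleQueue {
  clear(): void; // drop all not-yet-played audio
}

function handleInterrupt(
  msg: { serverContent?: { interrupted?: boolean } },
  queue: InterruptibleQueue,
  stopCurrentSource: () => void, // hypothetical hook into your audio player
): "flushed" | "ignored" {
  if (msg.serverContent?.interrupted) {
    queue.clear();       // stale responses must never keep playing
    stopCurrentSource(); // also cut off the chunk currently audible
    return "flushed";
  }
  return "ignored";
}
```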

5. outputTranscription.text

The real-time text version of the AI's spoken words. Perfect for generating live captions or populating a chat history sidebar.

6. inputTranscription.text

What the AI believes the user just said. Use this for debugging, accessibility, and user-facing logs to show "What the AI heard."
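Both transcription signals can feed the same UI state. The reducer below is a sketch: it appends `outputTranscription` text to the AI's live caption and `inputTranscription` text to a "what the AI heard" log. The field paths mirror the Live API's transcription messages but should be confirmed against the SDK you use.

```typescript
interface TranscriptState {
  caption: string; // live caption of the AI's spoken words
  heard: string;   // running log of what the model heard the user say
}

interface TranscriptionMessage {
  serverContent?: {
    outputTranscription?: { text: string };
    inputTranscription?: { text: string };
  };
}

function reduceTranscripts(
  state: TranscriptState,
  msg: TranscriptionMessage,
): TranscriptState {
  // Transcription arrives incrementally, so each fragment is appended.
  return {
    caption: state.caption + (msg.serverContent?.outputTranscription?.text ?? ""),
    heard: state.heard + (msg.serverContent?.inputTranscription?.text ?? ""),
  };
}
```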

7. toolCall

The core of the agentic feature. The AI identifies that it lacks information or needs to perform an action (like editing code). It sends the function name and the parameters it wants to use.
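The payload is easiest to reason about as a typed shape. The `functionCalls` array with `id`, `name`, and `args` follows the Live API's tool-call message, but confirm the field names against your SDK version; the `editCode` tool in the example is hypothetical.

```typescript
interface ToolCallMessage {
  toolCall?: {
    functionCalls: Array<{
      id: string;                    // echo this back in the toolResponse
      name: string;                  // which client-side function to run
      args: Record<string, unknown>; // the parameters the model chose
    }>;
  };
}

// Example: what a request to a hypothetical `editCode` tool might look like.
const exampleToolCall: ToolCallMessage = {
  toolCall: {
    functionCalls: [
      { id: "call-1", name: "editCode", args: { file: "app.ts", line: 12 } },
    ],
  },
};
```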

Mastering the Tool Calling Cycle

You define your tools on the client side during initialization. Gemini decides when it needs them. Here is the flow:

  1. Request: Gemini pauses its voice output and sends the toolCall payload with specific arguments.
  2. Execution: Your application catches this call, runs the function (e.g., updates a database or a UI component), and gets the result.
  3. Response: You send back a toolResponse containing the data or success status.
  4. Completion: Gemini processes your response and resumes the conversation naturally, acknowledging the action was completed.
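The four steps above can be sketched as one function. A registry maps tool names to local functions; each incoming call is executed and its result wrapped into the `functionResponses` envelope the client sends back. The `updateTheme` tool is hypothetical, and the response shape follows the Live API's `toolResponse` message but should be checked against your SDK.

```typescript
type ToolFn = (args: Record<string, unknown>) => unknown;

// Registry of client-side tools, defined at session initialization.
const tools: Record<string, ToolFn> = {
  // Hypothetical local tool: flips a UI theme and reports success.
  updateTheme: (args) => ({ ok: true, theme: args.theme }),
};

function runToolCalls(
  calls: Array<{ id: string; name: string; args: Record<string, unknown> }>,
) {
  const functionResponses = calls.map((call) => {
    const fn = tools[call.name];
    const response = fn
      ? fn(call.args)                              // 2. Execution happens locally
      : { error: `unknown tool: ${call.name}` };   // fail gracefully, never crash
    return { id: call.id, name: call.name, response }; // 3. Response payload
  });
  return { toolResponse: { functionResponses } };  // sent back over the socket
}
```

Returning an error object for unknown tools lets Gemini recover conversationally ("I couldn't do that") instead of the session stalling.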

⚠️ A Word of Warning

Gemini does not execute the code itself. It predicts the intent and parameters. The logic and security of the execution reside entirely within your frontend code. Never trust AI input—always validate the tool arguments before processing.
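As a concrete sketch of that validation step: a hypothetical `setVolume` tool only accepts a finite number in [0, 100], and anything else is rejected before any code runs. Treat model-predicted arguments exactly like untrusted user input.

```typescript
// Validate the model-supplied args for a hypothetical `setVolume` tool.
// Returns a discriminated union so the caller cannot use an invalid value.
function validateSetVolumeArgs(
  args: Record<string, unknown>,
): { ok: true; volume: number } | { ok: false; error: string } {
  const v = args.volume;
  if (typeof v !== "number" || !Number.isFinite(v)) {
    return { ok: false, error: "volume must be a finite number" };
  }
  if (v < 0 || v > 100) {
    return { ok: false, error: "volume out of range [0, 100]" };
  }
  return { ok: true, volume: v };
}
```

On failure, send the error string back in the `toolResponse` so the model can correct itself rather than silently executing garbage.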

Did you find this helpful?

Share this guide with your circle.

#gemini-tool-calling #browser-voice-agent #gemini-session-control #ai-agent-context #gemini-types