Building Arqivon: How We Turned a Phone Camera Into a Real-Time AI Agent With the Gemini Live API
A deep dive into designing, building, and shipping a multimodal live agent that sees, hears, and acts — for the Gemini Live Agent Challenge.
What if your AI assistant could see the world through your eyes — in real time? Not a photo you upload five minutes later, but a continuous, living feed of what's happening right now. That question became Arqivon, and this is the story of how we built it.
The Spark: Why We Built a Living Lens
We're a small team, and we've all felt the friction of traditional AI assistants. You're standing in a foreign city, staring at a sign in a language you don't understand. You pull out your phone, open a translation app, take a photo, crop it, wait, and eventually get a translation. By then, you've already walked past the sign.
Or you're helping your kid with homework. They're stuck on a quadratic equation. You try to type the equation into ChatGPT (good luck formatting that on a phone keyboard), wait 15 seconds, read a wall of text, then try to explain it back to a confused 13-year-old. The moment is gone.
When Google announced the Gemini Live Agent Challenge, we saw our opening. The Gemini Live API's native audio support — bidirectional, real-time, with barge-in and function calling — was exactly the missing piece. We could build what we'd always wanted: an AI agent that you don't type at, but one that you look with and talk to.
Architecture: The Three-Legged Stool
Arqivon's architecture comes down to three components, each doing one thing well:
Flutter App
Captures camera at 2fps (JPEG) and microphone at 16kHz (PCM). Ships both streams simultaneously over a single WebSocket. Renders mode-specific UI overlays — translation subtitles, tutor step cards, PDF exports.
FastAPI Relay on Cloud Run
Manages per-user WebSocket connections. Each connection spawns a Gemini Live API session with mode-specific system prompts and tool declarations. Three concurrent asyncio coroutines: client→Gemini, Gemini→client, and heartbeat.
Gemini 2.5 Flash Live API
The brain. Receives audio + vision, responds with native audio, and invokes function calls against our 17-tool registry. Native VAD enables barge-in — users can interrupt mid-sentence.
The key insight was that the backend should be a thin relay, not a processing layer. We don't do any AI inference on our server. Cloud Run receives the raw audio + video from the client, forwards it to Gemini via the google-genai SDK's aio.live.connect(), and routes the responses back. This keeps latency under 200ms and lets us scale to hundreds of concurrent users without GPU infrastructure.
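The thin-relay pattern can be sketched with plain asyncio queues standing in for the real WebSocket and Gemini session. Everything here is illustrative — the fake_gemini echo, the queue names, the frame payloads — not Arqivon's actual code; the heartbeat coroutine is omitted for brevity:

```python
import asyncio

async def fake_gemini(inbox, outbox):
    # Stand-in for the Live API session: echoes each chunk back.
    while (chunk := await inbox.get()) is not None:
        await outbox.put(f"reply:{chunk}")
    await outbox.put(None)

async def run_relay(frames):
    g_in, g_out = asyncio.Queue(), asyncio.Queue()
    replies = []

    async def client_to_gemini():
        # app -> Gemini: forward raw chunks, no server-side processing
        for f in frames:
            await g_in.put(f)
        await g_in.put(None)  # client disconnected

    async def gemini_to_client():
        # Gemini -> app: route responses straight back
        while (msg := await g_out.get()) is not None:
            replies.append(msg)

    await asyncio.gather(fake_gemini(g_in, g_out),
                         client_to_gemini(), gemini_to_client())
    return replies

print(asyncio.run(run_relay(["audio-chunk", "jpeg-frame"])))
# ['reply:audio-chunk', 'reply:jpeg-frame']
```

Because the relay never inspects or transforms the media, the server stays CPU-light — which is exactly what makes GPU-free scaling on Cloud Run possible.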
The Tool Registry: Where the Magic Happens
Here's what separates a "Gemini wrapper" from a genuine agent: tools. Arqivon registers 17 function declarations with the Gemini Live API, organized across four specialized modes.
When a user switches modes, the backend tears down the existing Gemini session and reconnects with a completely different system prompt and tool subset. The Translator agent only sees translation tools; the Tutor only sees tutoring tools. This constraint is what makes each mode feel genuinely specialized rather than a generic "do everything" assistant.
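The teardown-and-reconnect approach boils down to a per-mode config lookup. The sketch below uses placeholder prompts and tool names (not Arqivon's actual 17-tool registry), but shows the shape of the idea — each reconnect gets only one mode's prompt and tool subset:

```python
# Mode names follow the article (Translator, Tutor); prompts and
# tool names here are placeholders, not Arqivon's real registry.
MODES = {
    "translator": {
        "system_prompt": "You are a real-time visual translator...",
        "tools": ["translate_text", "detect_language"],
    },
    "tutor": {
        "system_prompt": "You are a patient, step-by-step tutor...",
        "tools": ["show_step_card", "export_document"],
    },
}

def session_config(mode: str) -> dict:
    # Each new Gemini session sees only this mode's prompt + tools.
    spec = MODES[mode]
    return {
        "system_instruction": spec["system_prompt"],
        "tools": [{"function_declarations": spec["tools"]}],
    }

print(session_config("tutor")["tools"][0]["function_declarations"])
# ['show_step_card', 'export_document']
```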
Lesson 1: The SDK Migration That Almost Broke Everything
Three weeks into development, we hit our first major crisis. We'd built the entire backend around session.send() for sending audio and video to Gemini. Then google-genai version 1.64.0 dropped and deprecated the entire method.
The replacement was three separate methods:
- send_realtime_input() for audio and video streams
- send_client_content() for text messages and mode switches
- send_tool_response() for returning function call results
This broke our multi-turn conversations completely. The old unified send() handled everything, but now we had to carefully route different message types to different methods. Tool responses sent via send_realtime_input() would silently fail. Text sent via send_tool_response() would crash.
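A small dispatcher makes the routing explicit. MockSession below just records which method was called; the keyword arguments are illustrative, and the real google-genai signatures should be checked against the SDK docs before use:

```python
import asyncio

class MockSession:
    # Records calls so the routing logic can be tested without the SDK.
    def __init__(self):
        self.calls = []
    async def send_realtime_input(self, **kw):
        self.calls.append(("send_realtime_input", kw))
    async def send_client_content(self, **kw):
        self.calls.append(("send_client_content", kw))
    async def send_tool_response(self, **kw):
        self.calls.append(("send_tool_response", kw))

async def route(session, msg):
    # One message type -> one Live API method, never mixed.
    kind = msg["type"]
    if kind in ("audio", "video"):
        await session.send_realtime_input(media=msg["data"])
    elif kind == "text":
        await session.send_client_content(turns=msg["data"])
    elif kind == "tool_result":
        await session.send_tool_response(function_responses=msg["data"])
    else:
        raise ValueError(f"unknown message type: {kind}")

s = MockSession()
asyncio.run(route(s, {"type": "audio", "data": b"pcm..."}))
asyncio.run(route(s, {"type": "tool_result", "data": [{"ok": True}]}))
print([name for name, _ in s.calls])
# ['send_realtime_input', 'send_tool_response']
```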
Takeaway: When using a fast-moving SDK like google-genai, pin your version in requirements.txt and test every upgrade in isolation. We now run pip install --upgrade google-genai in a dedicated branch and regression-test all four modes before merging.
Lesson 2: Android Audio Focus Is a Silent Killer
The record Flutter plugin silently dies when another app steals audio focus. No error, no callback, no exception. The recorder just stops producing data, and the WebSocket sends empty frames to the backend, which Gemini interprets as silence.
We discovered this the hard way during a demo — the phone played a notification sound, the recorder died, and the agent went completely silent. The fix was an ensureRecording() function that checks recorder state every audio cycle and force-restarts if needed:
// Simplified version of our fix
Future<void> ensureRecording() async {
  // isRecording() is async in the record plugin
  if (!await _recorder.isRecording()) {
    await _recorder.stop(); // clear any corrupted state first
    await _recorder.start(
      const RecordConfig(
        encoder: AudioEncoder.pcm16bits,
        sampleRate: 16000,
        numChannels: 1,
      ),
      path: '',
    );
  }
}
Lesson 3: Cloud Run WebSocket Timeout — The 5-Minute Wall
Cloud Run defaults to a 5-minute request timeout. For REST APIs, that's generous. For a WebSocket-based voice agent, it means your conversation dies after 5 minutes. Every time.
The fix was a combination of three Cloud Run settings:
- --timeout=3600 — 1-hour maximum connection lifetime
- --no-cpu-throttling — prevent CPU from being throttled between requests (critical for WebSocket keepalives)
- --min-instances=1 — keep at least one instance warm to avoid cold starts during reconnection
Plus a 15-second heartbeat ping from the backend to keep the WebSocket alive through any intermediate proxies or load balancers.
Lesson 4: Camera FPS Is a Balancing Act
We started at 10fps — seemed reasonable for "real-time vision." Gemini immediately rate-limited us. Dropped to 5fps — still too aggressive for multimodal streaming. 3fps was borderline.
We settled on 2fps. Two frames per second sounds absurdly slow, but it turns out that for a vision agent that's identifying objects, reading text, and analyzing scenes, 2fps provides more than enough visual information. The user is typically holding their camera steady anyway. And at 2fps, we never hit Gemini rate limits, even during rapid-fire Q&A sessions.
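The gating logic is trivial but worth stating: drop frames that arrive faster than the target FPS before they are encoded and sent. This Python sketch is illustrative (the real gate lives in the Flutter client), with an injectable clock so the behavior is easy to verify:

```python
import time

class FrameGate:
    """Drop frames arriving faster than the target FPS."""
    def __init__(self, fps: float, clock=time.monotonic):
        self.min_interval = 1.0 / fps
        self.clock = clock
        self._last = float("-inf")

    def allow(self) -> bool:
        now = self.clock()
        if now - self._last >= self.min_interval:
            self._last = now
            return True
        return False

# Simulate a camera delivering 10 fps against a 2 fps gate.
t = [0.0]
gate = FrameGate(2, clock=lambda: t[0])
kept = 0
for _ in range(10):
    if gate.allow():
        kept += 1
    t[0] += 0.1
print(kept)
# 2
```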
Lesson 5: Fresh Audio Player Per Turn Prevents Playback Bugs
Early on, we reused a single AudioPlayer instance across multiple AI responses. This caused bizarre issues on Android — audio from the previous turn would sometimes bleed into the current one, or the player would enter a corrupted state after being interrupted mid-sentence.
The fix was simple but counterintuitive: create a fresh AudioPlayer for every AI response turn, dispose the old one, and let the garbage collector handle cleanup. This added maybe 5ms of overhead but eliminated an entire class of audio bugs.
The PDF Export Pipeline: From Voice to Document
One of our favorite features is the export_document tool. When you're in Tutor mode and the AI just walked you through solving a calculus problem, you can say "export that as a PDF" and Gemini invokes the export tool with the structured content.
The pipeline works like this:
- Gemini calls export_document with title, sections, and content
- Backend routes the structured data as an EXPORT WebSocket message
- Flutter's ExportService generates a formatted PDF using the pdf package
- Native share sheet opens via share_plus — save, email, AirDrop, whatever
The entire flow — from voice command to PDF in your hand — takes under 3 seconds.
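The backend's half of this pipeline amounts to a tool declaration plus a wrapper that turns the tool call into the EXPORT frame. Both are hypothetical shapes here — the field names and schema are illustrative, not Arqivon's exact payload:

```python
import json

# JSON-Schema-style declaration, as used by Gemini function calling.
# Field names are illustrative.
EXPORT_DOCUMENT_TOOL = {
    "name": "export_document",
    "description": "Export the current explanation as a PDF.",
    "parameters": {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "sections": {"type": "array", "items": {"type": "object"}},
        },
        "required": ["title", "sections"],
    },
}

def to_export_message(args: dict) -> str:
    # Backend -> Flutter: route structured content as an EXPORT frame.
    return json.dumps({"type": "EXPORT", "payload": args})

msg = to_export_message({"title": "Quadratics 101", "sections": []})
print(json.loads(msg)["type"])
# EXPORT
```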
What We'd Do Differently
- Start with mode separation earlier. We originally built a single "do everything" agent and retrofitted modes. Starting with isolated agents from day one would have saved weeks.
- Invest in WebSocket testing infrastructure. Testing real-time audio/video WebSocket flows is hard. We relied too much on manual testing on physical devices. A mock WebSocket test harness would have caught the SDK migration bug earlier.
- Ship the landing page earlier. Having a public-facing website earlier would have helped us collect feedback and refine the pitch before the submission deadline.
What's Next
Arqivon is live and deployed. You can try it right now — download the Android APK or explore the architecture. But this is just the beginning. We're exploring:
- Persistent memory across sessions — the agent remembering your learning progress, translation history, and support tickets
- Multi-user collaboration — imagine pointing two phones at the same whiteboard and having a shared tutoring session
- Offline-first mode — caching common translations and support answers for areas with poor connectivity
Try It Yourself
Arqivon is open source. Clone it, fork it, break it, improve it.