Building an AI agent that controls your phone — in Flutter
What if your phone could operate itself?
Not pre-recorded macros. Not fixed scripts. I mean an actual AI that looks at the screen, figures out what's there, and decides what to do next — like a person would.
I built one. In Flutter. Runs on the phone itself, no laptop tether.
The core loop
Four steps, repeated until the task is done:
- Observe — take a screenshot, read the UI tree
- Think — send screenshot + UI tree to an LLM: "what next?"
- Act — tap, swipe, type
- Reflect — check results, update memory
A standard agent loop. The difference: this one runs on a real phone, not in a browser.
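The hand-off from Think to Act can be sketched roughly like this. It assumes the model is prompted to reply with a small JSON object; the field names (`done`, `action`, `target`, `text`) and the `AgentDecision` and `ActionKind` types are illustrative assumptions, not the project's actual schema.

```dart
import 'dart:convert';

/// Hypothetical action kinds for the Act step.
enum ActionKind { tap, swipe, input }

/// Hypothetical shape of one parsed LLM decision.
class AgentDecision {
  final bool isComplete;
  final ActionKind? kind;
  final String? target; // id of the UI element to act on (assumed field)
  final String? text; // text to enter, for `input` actions

  const AgentDecision({
    required this.isComplete,
    this.kind,
    this.target,
    this.text,
  });
}

/// Parses the model's raw reply. Throws FormatException on malformed or
/// unknown output so the caller can retry instead of acting on it.
AgentDecision parseDecision(String raw) {
  final Object? decoded = jsonDecode(raw);
  if (decoded is! Map<String, dynamic>) {
    throw const FormatException('expected a JSON object');
  }
  if (decoded['done'] == true) {
    return const AgentDecision(isComplete: true);
  }
  // Look the action name up among the enum values; null means unknown.
  final kind = ActionKind.values.asNameMap()[decoded['action']];
  if (kind == null) {
    throw FormatException('unknown action: ${decoded['action']}');
  }
  return AgentDecision(
    isComplete: false,
    kind: kind,
    target: decoded['target'] as String?,
    text: decoded['text'] as String?,
  );
}
```

Rejecting anything that doesn't parse keeps a hallucinated or truncated reply from turning into a stray tap on a real device.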
Why Flutter?
Runs natively on the device. No computer needed, no USB cable — it's a standalone app.
Cross-platform from one codebase. The agent logic is shared between iOS and Android, although only the Android automation layer exists today. Most existing phone-agent projects (Alibaba's Mobile-Agent, Tencent's AppAgent) need a desktop driving the phone via ADB. This one doesn't.
Native performance. Flutter compiles to native code. The UI automation layer talks directly to accessibility services, so observing the screen and performing an action involve no network round-trips.
Architecture
LLM provider
One unified interface for many models:
abstract class LLMProvider {
  Future<LLMResponse> chat(List<ChatMessage> messages);
  Stream<String> chatStream(List<ChatMessage> messages);
  bool get supportsVision;
}
OpenAI, Claude, Gemini, and Ollama all implement it. A ModelRouter picks the right model based on the task's difficulty.
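One way such a router could look is sketched below. The post doesn't show `ModelRouter`'s internals, so everything here is an assumption: the `TaskDifficulty` levels, the `pick` method, the fallback rule, and the reduced `Provider` stand-in (only the `supportsVision` getter comes from the interface above).

```dart
/// Hypothetical difficulty levels; the real router's categories are not shown.
enum TaskDifficulty { simple, visual, complex }

/// Stand-in for LLMProvider, reduced to what routing needs.
abstract class Provider {
  String get name;
  bool get supportsVision;
}

/// Trivial provider used only to exercise the router.
class StubProvider implements Provider {
  @override
  final String name;
  @override
  final bool supportsVision;
  const StubProvider(this.name, {required this.supportsVision});
}

class ModelRouter {
  final Map<TaskDifficulty, Provider> _routes;
  final Provider _fallback; // assumed to be the most capable, vision-enabled model

  ModelRouter(this._routes, {required Provider fallback})
      : _fallback = fallback;

  /// Returns the provider registered for [difficulty], or the fallback
  /// when none is registered or the match cannot handle images.
  Provider pick(TaskDifficulty difficulty, {bool needsVision = false}) {
    final candidate = _routes[difficulty] ?? _fallback;
    if (needsVision && !candidate.supportsVision) return _fallback;
    return candidate;
  }
}
```

Keeping the routing table as plain data means the mapping can be changed without touching the agent loop.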
UI automation
On Android, the Accessibility Service surfaces the full UI tree — every element's type, text, and position. The agent can tap, scroll, type, and screenshot.
abstract class UIService {
  Future<UISnapshot> getSnapshot(); // screenshot + UI tree
  Future<void> tap(UIElement element);
  Future<void> input(UIElement element, String text);
  Future<void> scroll(Offset delta);
}
A UISnapshot has two parts:
- Screenshot — gives the LLM visual context
- Structured UI tree — gives the agent precise selectors for actions
Both go to the LLM. Vision + structure together is much stronger than either alone.
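To make the "structured" half concrete, here is a minimal sketch of flattening a UI tree into text the model can reference by element id. The `UIElement` fields and the line format are assumptions for illustration; the project's real classes may carry more (or different) data.

```dart
/// Hypothetical, simplified node of the accessibility tree.
class UIElement {
  final int id;
  final String type;
  final String text;
  final List<int> bounds; // [left, top, right, bottom] in pixels (assumed)
  final List<UIElement> children;

  const UIElement(this.id, this.type, this.text, this.bounds,
      [this.children = const []]);
}

/// Renders the tree as one indented line per element, so the LLM can
/// answer with an id instead of guessing pixel coordinates.
String describeTree(UIElement root, [int depth = 0]) {
  final buffer = StringBuffer()
    ..write('  ' * depth)
    ..writeln(
        '[${root.id}] ${root.type} "${root.text}" @${root.bounds.join(",")}');
  for (final child in root.children) {
    buffer.write(describeTree(child, depth + 1));
  }
  return buffer.toString();
}
```

The resulting text would go into the prompt alongside the screenshot: the image tells the model what the screen looks like, and the ids tell it exactly what it can act on.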
Agent core
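Stream<AgentEvent> execute(String instruction) async* {
  var step = 0;
  while (step < maxSteps) {
    // Observe: capture the screenshot and UI tree.
    final snapshot = await ui.getSnapshot();
    // Think: ask the model for the next action.
    final decision = await think(provider, snapshot, instruction);
    if (decision.isComplete) {
      yield AgentEvent.completed(decision.summary);
      return;
    }
    // Act, then record the step so later decisions can see it.
    await executeAction(decision.action);
    memory.addStep(step, decision);
    step++; // count this iteration toward maxSteps
  }
}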
Stream-based events let the UI show real-time progress: "Observing screen…", "Thinking…", "Tapping login button…". It feels alive instead of frozen.
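A rough sketch of how events might be turned into those status lines. The post only shows `AgentEvent.completed`, so the `AgentPhase` enum, the `detail` field, and both function names below are assumptions.

```dart
/// Hypothetical event phases mirroring the loop's steps.
enum AgentPhase { observing, thinking, acting, completed }

/// Simplified stand-in for the post's AgentEvent.
class AgentEvent {
  final AgentPhase phase;
  final String? detail; // e.g. "Tapping login button"
  const AgentEvent(this.phase, [this.detail]);
}

/// Maps a single event to the status line shown to the user.
String statusLine(AgentEvent event) {
  switch (event.phase) {
    case AgentPhase.observing:
      return 'Observing screen…';
    case AgentPhase.thinking:
      return 'Thinking…';
    case AgentPhase.acting:
      return '${event.detail ?? 'Acting'}…';
    case AgentPhase.completed:
      return 'Done: ${event.detail ?? ''}';
  }
}

/// Turns the agent's event stream into a stream of status lines.
Stream<String> statusLines(Stream<AgentEvent> events) =>
    events.map(statusLine);
```

In a Flutter UI, a stream like this could be fed to a `StreamBuilder` so the status text rebuilds on every event.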
Multi-model strategy
The biggest lesson from this project: don't use the same model for everything.
- Screen observation needs vision — GPT-4o or Claude handle it well
- Simple decisions like "tap this button" can run on cheaper, faster models
- Complex reasoning — navigating nested settings, recovering from unexpected screens — needs a heavyweight model
Task               Model              Approx. cost
Simple ops         Claude Haiku       ~$0.001
Visual analysis    GPT-4o / Claude    ~$0.01
Complex planning   Claude Opus        ~$0.05
Average cost per task: under $0.10.
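As a sanity check on that average, the arithmetic can be sketched as below. It assumes the figures above are per-call costs, and the step mix in the usage note (10 simple calls, 3 visual, 1 planning) is a made-up example, not measured data.

```dart
/// Approximate per-call costs in USD, using the figures quoted above
/// (treated here as per-call, which is an assumption).
const costPerCall = {
  'simple': 0.001,
  'visual': 0.01,
  'planning': 0.05,
};

/// Sums the cost of a task given how many calls of each kind it makes.
double estimateTaskCost(Map<String, int> calls) {
  var total = 0.0;
  calls.forEach((kind, count) {
    final unit = costPerCall[kind];
    if (unit == null) throw ArgumentError('unknown call kind: $kind');
    total += unit * count;
  });
  return total;
}
```

For the hypothetical mix, `estimateTaskCost({'simple': 10, 'visual': 3, 'planning': 1})` works out to 10 × 0.001 + 3 × 0.01 + 1 × 0.05, roughly $0.09. The single planning call accounts for more than half of it, which is why routing most steps to the cheap model matters.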
What's next
- Task templates — preset flows for common things: "post to social media," "order delivery," "DM someone"
- Scheduled tasks — "check email every morning, summarize"
- Task chains — multi-step flows: "scrape stats → write report → send email"
- iOS support — Android runs today; iOS is next
Building this in public. Follow along at @jiusanzhou or watch the repo on GitHub.

Written by
Zoe
AI Infra Engineer · LLM Serving · GPU/RDMA · indie hacker, obsessed with shipping tools