Building an AI agent that controls your phone — in Flutter


What if your phone could operate itself?

Not pre-recorded macros. Not fixed scripts. I mean an actual AI that looks at the screen, figures out what's there, and decides what to do next — like a person would.

I built one. In Flutter. Runs on the phone itself, no laptop tether.

The core loop

Four steps, repeated until the task is done:

  1. Observe — take a screenshot, read the UI tree
  2. Think — send screenshot + UI tree to an LLM: "what next?"
  3. Act — tap, swipe, type
  4. Reflect — check results, update memory

A standard agent loop. The difference: this one runs on a real phone, not in a browser.
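For the loop to work, the Think step has to hand the Act step something it can execute. Here is a minimal sketch of one way to do that, assuming the LLM is asked to reply with a small JSON object. The JSON shape, the Decision class, and its field names are assumptions for illustration; the article does not say how the reply is encoded.

```dart
import 'dart:convert';

// Hypothetical sketch: the JSON shape below is an assumption for illustration;
// the article does not specify how the LLM's reply is encoded.

class Decision {
  final String action; // "tap", "swipe", "type", or "done"
  final int? elementId; // target element, when the action needs one
  final String? text; // text to enter, for "type"

  const Decision(this.action, {this.elementId, this.text});

  bool get isComplete => action == 'done';

  /// Parses a reply such as {"action":"tap","element":2}.
  /// Throws FormatException if the reply is not a JSON object
  /// carrying one of the known actions.
  factory Decision.parse(String reply) {
    final data = jsonDecode(reply);
    if (data is! Map<String, dynamic>) {
      throw const FormatException('Expected a JSON object');
    }
    final action = data['action'];
    const known = {'tap', 'swipe', 'type', 'done'};
    if (action is! String || !known.contains(action)) {
      throw FormatException('Unknown action: $action');
    }
    return Decision(
      action,
      elementId: data['element'] as int?,
      text: data['text'] as String?,
    );
  }
}

void main() {
  final d = Decision.parse('{"action":"tap","element":2}');
  print('${d.action} -> element ${d.elementId}'); // prints "tap -> element 2"
}
```

Rejecting unknown actions up front means a malformed reply fails at the Think step rather than turning into a stray tap on the device.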

Why Flutter?

Runs natively on the device. No computer needed, no USB cable — it's a standalone app.

Cross-platform from one codebase. The agent logic is written once and is meant to be shared between iOS and Android, although only Android is wired up today. Most existing phone-agent projects (Alibaba's Mobile-Agent, Tencent's AppAgent) need a desktop driving the phone via ADB. This one doesn't.

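Native performance. Flutter compiles to native code, and the UI automation layer talks to the accessibility services on the device itself, so reading the screen and performing an action take zero network round-trips. Only the LLM calls go over the network.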

Architecture

LLM provider

One unified interface for many models:

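abstract class LLMProvider {
  // One-shot request: resolves once the model has produced its full reply.
  Future<LLMResponse> chat(List<ChatMessage> messages);
  // Streaming variant: emits the reply in chunks as they arrive.
  Stream<String> chatStream(List<ChatMessage> messages);
  // Whether this model accepts image input (needed for screenshots).
  bool get supportsVision;
}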

OpenAI, Claude, Gemini, and Ollama all implement it. A ModelRouter picks the right model based on the task's difficulty.
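A minimal sketch of what such a router could look like. The article names ModelRouter but does not show it, so TaskDifficulty, FakeProvider, the route method, the placeholder model names, and the trimmed-down LLMProvider used here are all assumptions for illustration.

```dart
// Hypothetical sketch: the article names ModelRouter but does not show it.
// TaskDifficulty, FakeProvider, and the routing rule are illustrative assumptions.

enum TaskDifficulty { simple, visual, complex }

/// Trimmed-down stand-in for the article's LLMProvider interface.
abstract class LLMProvider {
  String get name;
  bool get supportsVision;
}

class FakeProvider implements LLMProvider {
  @override
  final String name;
  @override
  final bool supportsVision;
  const FakeProvider(this.name, {required this.supportsVision});
}

class ModelRouter {
  final Map<TaskDifficulty, LLMProvider> _routes;
  const ModelRouter(this._routes);

  /// Returns the provider registered for [difficulty].
  /// Throws if none is registered, or if the step needs a screenshot
  /// and the registered provider cannot read images.
  LLMProvider route(TaskDifficulty difficulty, {bool needsVision = false}) {
    final provider = _routes[difficulty];
    if (provider == null) {
      throw StateError('No provider registered for $difficulty');
    }
    if (needsVision && !provider.supportsVision) {
      throw StateError('${provider.name} cannot handle screenshots');
    }
    return provider;
  }
}

void main() {
  // Placeholder model names, not real model identifiers.
  final router = ModelRouter({
    TaskDifficulty.simple:
        const FakeProvider('small-text-model', supportsVision: false),
    TaskDifficulty.visual:
        const FakeProvider('vision-model', supportsVision: true),
    TaskDifficulty.complex:
        const FakeProvider('large-model', supportsVision: true),
  });

  final picked = router.route(TaskDifficulty.visual, needsVision: true);
  print(picked.name); // prints "vision-model"
}
```

Checking supportsVision inside the router keeps a screenshot from being sent to a text-only model by mistake.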

UI automation

On Android, the Accessibility Service surfaces the full UI tree — every element's type, text, and position. The agent can tap, scroll, type, and screenshot.

import 'dart:ui' show Offset;

abstract class UIService {
  Future<UISnapshot> getSnapshot();  // screenshot + UI tree
  Future<void> tap(UIElement element);
  Future<void> input(UIElement element, String text);  // type text into the element
  Future<void> scroll(Offset delta);  // scroll by the given x/y distance
}

A UISnapshot has two parts:

  • Screenshot — gives the LLM visual context
  • Structured UI tree — gives the agent precise selectors for actions

Both go to the LLM. The screenshot shows what the screen looks like; the tree says exactly which elements exist and where they are. Together they are much stronger than either alone.
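To make the structured half concrete, here is a sketch of how a UI tree might be flattened into text for the prompt. The UIElement fields (id, type, text, bounds) and the describeTree helper are assumptions for illustration, not the project's actual types.

```dart
// Hypothetical sketch: field names and the text format are assumptions,
// not the project's actual UISnapshot or UIElement.

class UIElement {
  final int id;
  final String type; // e.g. "Button", "EditText"
  final String text;
  final int left, top, right, bottom; // screen bounds in pixels
  final List<UIElement> children;

  const UIElement({
    required this.id,
    required this.type,
    this.text = '',
    required this.left,
    required this.top,
    required this.right,
    required this.bottom,
    this.children = const [],
  });
}

/// Flattens the tree into one indented line per element, so the LLM can
/// refer to an element by its id when it picks an action.
String describeTree(UIElement root, [int depth = 0]) {
  final indent = '  ' * depth;
  final label = root.text.isEmpty ? '' : ' "${root.text}"';
  final line = '${indent}[${root.id}] ${root.type}$label '
      '(${root.left},${root.top})-(${root.right},${root.bottom})';
  return [
    line,
    for (final child in root.children) describeTree(child, depth + 1),
  ].join('\n');
}

void main() {
  const screen = UIElement(
    id: 0, type: 'FrameLayout', left: 0, top: 0, right: 1080, bottom: 2400,
    children: [
      UIElement(
          id: 1, type: 'EditText', text: 'Email',
          left: 80, top: 600, right: 1000, bottom: 720),
      UIElement(
          id: 2, type: 'Button', text: 'Log in',
          left: 80, top: 800, right: 1000, bottom: 920),
    ],
  );
  print(describeTree(screen));
  // [0] FrameLayout (0,0)-(1080,2400)
  //   [1] EditText "Email" (80,600)-(1000,720)
  //   [2] Button "Log in" (80,800)-(1000,920)
}
```

With ids in the prompt, the model can answer "tap element 2" instead of guessing pixel coordinates from the screenshot.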

Agent core

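Stream<AgentEvent> execute(String instruction) async* {
  while (step < maxSteps) {
    // Observe: capture the screenshot and UI tree.
    final snapshot = await ui.getSnapshot();
    // Think: ask the LLM what to do next.
    final decision = await think(provider, snapshot, instruction);

    if (decision.isComplete) {
      yield AgentEvent.completed(decision.summary);
      return;
    }

    // Act, then record the step so later decisions can see it.
    await executeAction(decision.action);
    memory.addStep(step, decision);
    step++;  // advance the counter so the maxSteps limit can stop the loop
  }
  // Reaching this point means the step budget ran out before completion.
}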

Stream-based events let the UI show real-time progress: "Observing screen…", "Thinking…", "Tapping login button…". It feels alive instead of frozen.
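A sketch of how the UI side might consume such a stream. AgentEvent's real shape is not shown in the article, so the EventKind enum, the detail field, statusLine, and the scripted fakeRun stream are stand-ins invented for this example.

```dart
// Hypothetical sketch: AgentEvent's real shape is not shown in the article,
// so this uses a minimal stand-in with a kind and a detail string.

enum EventKind { observing, thinking, acting, completed }

class AgentEvent {
  final EventKind kind;
  final String detail;
  const AgentEvent(this.kind, [this.detail = '']);
}

/// Turns an event into the one-line status a progress widget would display.
String statusLine(AgentEvent event) {
  switch (event.kind) {
    case EventKind.observing:
      return 'Observing screen…';
    case EventKind.thinking:
      return 'Thinking…';
    case EventKind.acting:
      return '${event.detail}…';
    case EventKind.completed:
      return 'Done: ${event.detail}';
  }
}

/// Stand-in for the agent: emits one scripted step, then completes.
Stream<AgentEvent> fakeRun() async* {
  yield const AgentEvent(EventKind.observing);
  yield const AgentEvent(EventKind.thinking);
  yield const AgentEvent(EventKind.acting, 'Tapping login button');
  yield const AgentEvent(EventKind.completed, 'Logged in');
}

Future<void> main() async {
  // Each event is handled as soon as the agent yields it,
  // which is what lets the status text update in real time.
  await for (final event in fakeRun()) {
    print(statusLine(event));
  }
}
```

In a Flutter widget tree the await-for loop would typically be replaced by a StreamBuilder that rebuilds the status text on each event.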

Multi-model strategy

The biggest lesson from this project: don't use the same model for everything.

  • Screen observation needs vision — GPT-4o or Claude handle it well
  • Simple decisions like "tap this button" can run on cheaper, faster models
  • Complex reasoning (navigating nested settings, recovering from unexpected screens) needs a heavyweight model

Task type         → Model              Approx. cost
Simple ops        → Claude Haiku       ~$0.001
Visual analysis   → GPT-4o / Claude    ~$0.01
Complex planning  → Claude Opus        ~$0.05

Average cost per task: under $0.10.
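To see how the prices above can add up to less than $0.10, here is a worked example. The step mix (10 simple, 3 visual, 1 complex) is made up for illustration, and reading each price as the cost of one LLM call is an assumption.

```dart
// Hypothetical sketch: the prices come from the table above (read as cost
// per LLM call, which is an assumption); the step mix is made up.

enum StepKind { simple, visual, complex }

// Prices in mills (thousandths of a dollar) to avoid floating-point drift.
const costInMills = {
  StepKind.simple: 1, // ~$0.001
  StepKind.visual: 10, // ~$0.01
  StepKind.complex: 50, // ~$0.05
};

/// Sums price * count over every kind of step in the task.
int taskCostInMills(Map<StepKind, int> stepCounts) {
  var total = 0;
  stepCounts.forEach((kind, count) {
    total += costInMills[kind]! * count;
  });
  return total;
}

void main() {
  final mills = taskCostInMills({
    StepKind.simple: 10, // 10 * 1  = 10
    StepKind.visual: 3, //  3 * 10 = 30
    StepKind.complex: 1, //  1 * 50 = 50
  });
  print('\$${(mills / 1000).toStringAsFixed(2)}'); // prints "$0.09"
}
```

In this mix the single complex-planning call accounts for more than half the total, which is why routing most steps to the cheaper models matters.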

What's next

  • Task templates — preset flows for common things: "post to social media," "order delivery," "DM someone"
  • Scheduled tasks — "check email every morning, summarize"
  • Task chains — multi-step flows: "scrape stats → write report → send email"
  • iOS support — Android runs today; iOS is next

Building this in public. Follow along at @jiusanzhou or watch the repo on GitHub.

Written by Zoe

AI Infra Engineer · LLM Serving · GPU/RDMA · indie hacker, obsessed with shipping tools
