Building an AI agent that controls your phone — in Flutter

What if your phone could operate itself?

Not pre-recorded macros. Not fixed scripts. I mean an actual AI that looks at the screen, figures out what's there, and decides what to do next — like a person would.

I built one. In Flutter. Runs on the phone itself, no laptop tether.

The core loop

Four steps, repeated until the task is done:

Observe — take a screenshot, read the UI tree
Think — send screenshot + UI tree to an LLM: "what next?"
Act — tap, swipe, type
Reflect — check results, update memory

A standard agent loop. The difference: this one runs on a real phone, not in a browser.
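
The Reflect step is the easiest one to leave vague, so here is a minimal sketch of what "check results, update memory" could look like. It assumes the agent keeps a short history of the action taken and a fingerprint of the screen afterwards; `StepRecord`, `StepMemory`, and `isStuck` are hypothetical names for illustration, not taken from the project.

```dart
/// One executed step: what the agent did and what the screen looked like after.
class StepRecord {
  final int index;
  final String action; // e.g. "tap:login_button"
  final String screenHash; // any stable fingerprint of the UI tree
  const StepRecord(this.index, this.action, this.screenHash);
}

/// Hypothetical memory for the Reflect phase.
class StepMemory {
  final List<StepRecord> _steps = [];

  void add(String action, String screenHash) {
    _steps.add(StepRecord(_steps.length, action, screenHash));
  }

  int get length => _steps.length;

  /// True when the last [window] steps repeated the same action and the
  /// screen never changed, which usually means the action had no effect.
  bool isStuck({int window = 3}) {
    if (_steps.length < window) return false;
    final recent = _steps.sublist(_steps.length - window);
    final first = recent.first;
    return recent.every(
        (s) => s.action == first.action && s.screenHash == first.screenHash);
  }
}
```

When `isStuck()` returns true, the loop could escalate to a stronger model or try a different action instead of repeating the same tap.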

Why Flutter?

Runs natively on the device. No computer needed, no USB cable — it's a standalone app.

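Cross-platform from one codebase. The agent logic (LLM calls, planning, memory) is written once and can be shared between iOS and Android; Android is the platform that runs today. Most existing phone-agent projects (Alibaba's Mobile-Agent, Tencent's AppAgent) need a desktop driving the phone via ADB. This one doesn't.

Native performance. Flutter compiles to native code, and the UI automation layer talks to the accessibility service on the device itself, so observing the screen and performing actions involve no network round-trips. Only the LLM calls go over the network.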

Architecture

LLM provider

One unified interface for many models:

abstract class LLMProvider {
  Future<LLMResponse> chat(List<ChatMessage> messages);
  Stream<String> chatStream(List<ChatMessage> messages);
  bool get supportsVision;
}

OpenAI, Claude, Gemini, and Ollama all implement it. A ModelRouter picks the right model based on the task's difficulty.
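
The post does not show how the ModelRouter decides, so the following is only a sketch under assumptions: it treats difficulty as one of three fixed tiers and looks the model up in a table. `TaskKind`, the `pick` method, and the model names are placeholders for illustration.

```dart
/// Rough difficulty tiers, mirroring the kinds of work the agent does.
enum TaskKind { simpleAction, visualAnalysis, complexPlanning }

/// Hypothetical router: maps a task kind to a model name.
class ModelRouter {
  final Map<TaskKind, String> _routes;
  final String fallback;

  const ModelRouter(this._routes, {required this.fallback});

  /// Returns the configured model, or the fallback when no route exists.
  String pick(TaskKind kind) => _routes[kind] ?? fallback;
}

// Example table; the model identifiers are placeholders, not real API model IDs.
const router = ModelRouter({
  TaskKind.simpleAction: 'small-fast-model',
  TaskKind.visualAnalysis: 'vision-model',
}, fallback: 'large-reasoning-model');
```

Here `router.pick(TaskKind.complexPlanning)` falls through to the fallback, so the most capable model is the default for anything the table does not cover.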

UI automation

On Android, the Accessibility Service surfaces the full UI tree — every element's type, text, and position. The agent can tap, scroll, type, and screenshot.

abstract class UIService {
  Future<UISnapshot> getSnapshot();  // screenshot + UI tree
  Future<void> tap(UIElement element);
  Future<void> input(UIElement element, String text);
  Future<void> scroll(Offset delta);
}

A UISnapshot has two parts:

Screenshot — gives the LLM visual context
Structured UI tree — gives the agent precise selectors for actions

Both go to the LLM. Vision + structure together is much stronger than either alone.
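
To make the "structured UI tree" half concrete, here is one way the tree could be flattened into text for the prompt. This is a sketch, not the project's actual format: `UINode` and `describeTree` are hypothetical, and the numbered-line layout is just one reasonable choice that lets the model answer with an element index.

```dart
/// One node of the UI tree, roughly as an accessibility service reports it.
class UINode {
  final String type; // e.g. "Button"
  final String text;
  final List<int> bounds; // [left, top, right, bottom] in pixels
  final List<UINode> children;
  const UINode(this.type, this.text, this.bounds, [this.children = const []]);
}

/// Flattens the tree depth-first into numbered, indented lines.
/// The number in brackets is the index the LLM can refer back to.
List<String> describeTree(UINode root) {
  final lines = <String>[];
  void visit(UINode node, int depth) {
    final indent = '  ' * depth;
    lines.add(
        '[${lines.length}] $indent${node.type} "${node.text}" ${node.bounds}');
    for (final child in node.children) {
      visit(child, depth + 1);
    }
  }

  visit(root, 0);
  return lines;
}
```

The screenshot tells the model what the screen looks like; a listing like this tells it exactly which element to name in its answer.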

Agent core

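Stream<AgentEvent> execute(String instruction) async* {
  var step = 0;
  while (step < maxSteps) {
    // Observe: capture the screenshot and UI tree.
    final snapshot = await ui.getSnapshot();
    // Think: ask the LLM what to do next.
    final decision = await think(provider, snapshot, instruction);

    if (decision.isComplete) {
      yield AgentEvent.completed(decision.summary);
      return;
    }

    // Act, then record the step so later decisions can see it.
    await executeAction(decision.action);
    memory.addStep(step, decision);
    step++; // without this the loop would never reach maxSteps
  }
  // Falling out of the loop means the step budget ran out before completion.
}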

Stream-based events let the UI show real-time progress: "Observing screen…", "Thinking…", "Tapping login button…". It feels alive instead of frozen.
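
On the consuming side, the pattern is a plain `await for` over the stream. The sketch below is self-contained and uses stand-ins: `DemoEvent` and `fakeExecute` are hypothetical and only imitate the shape of the real `AgentEvent` stream.

```dart
import 'dart:async';

/// Hypothetical event type; the real AgentEvent likely carries more data.
class DemoEvent {
  final String label;
  final bool isFinal;
  const DemoEvent(this.label, {this.isFinal = false});
}

/// Stand-in for the agent: emits one event per phase, then completes.
Stream<DemoEvent> fakeExecute(String instruction) async* {
  yield const DemoEvent('Observing screen');
  yield const DemoEvent('Thinking');
  yield const DemoEvent('Tapping login button');
  yield DemoEvent('Done: $instruction', isFinal: true);
}

/// Collects status lines as they arrive. A Flutter UI would update a
/// widget (for example through a StreamBuilder) instead of filling a list.
Future<List<String>> collectStatus(Stream<DemoEvent> events) async {
  final status = <String>[];
  await for (final event in events) {
    status.add(event.label);
    if (event.isFinal) break;
  }
  return status;
}
```

Because the generator is lazy, each status line reaches the consumer as soon as the corresponding phase starts, which is what makes the progress display feel live.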

Multi-model strategy

The biggest lesson from this project: don't use the same model for everything.

Screen observation needs vision — GPT-4o or Claude handle it well
Simple decisions like "tap this button" can run on cheaper, faster models
Complex reasoning — navigating nested settings, recovering from unexpected screens — needs a heavyweight model

Simple ops        → Claude Haiku       ~$0.001
Visual analysis   → GPT-4o / Claude    ~$0.01
Complex planning  → Claude Opus        ~$0.05

Average cost per task: under $0.10.
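
As a rough check on that number, the arithmetic can be written out. This sketch assumes the figures in the table are per LLM call, and the call mix in the usage note is a made-up example, not measured data.

```dart
/// Approximate per-call cost in thousandths of a dollar, from the table above.
/// Integers avoid floating-point rounding noise in the sum.
const costPerCallMilli = {
  'simple': 1, // ~$0.001
  'visual': 10, // ~$0.01
  'planning': 50, // ~$0.05
};

/// Sums the cost of a task given how many calls of each kind it made.
int taskCostMilli(Map<String, int> calls) {
  var total = 0;
  calls.forEach((kind, count) {
    total += costPerCallMilli[kind]! * count;
  });
  return total;
}
```

For a hypothetical task with 8 simple operations, 3 visual analyses, and 1 planning call, `taskCostMilli({'simple': 8, 'visual': 3, 'planning': 1})` gives 8 + 30 + 50 = 88, or about $0.088. The single planning call accounts for more than half of that, which is why routing matters.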

What's next

Task templates — preset flows for common things: "post to social media," "order delivery," "DM someone"
Scheduled tasks — "check email every morning, summarize"
Task chains — multi-step flows: "scrape stats → write report → send email"
iOS support — Android runs today; iOS is next

Building this in public. Follow along at @jiusanzhou or watch the repo on GitHub.
