
The built-in agent

PhysiClaw isn’t only a set of tools an outside model calls — it ships its own agent, a brain that can drive the phone on its own.

You can run PhysiClaw two ways. As a plain MCP server, it hands the twelve tap/swipe/peek tools to whatever agent you already use — Claude Desktop, an IDE, your own client — and that external model does the deciding. As a built-in agent, PhysiClaw is the model: it runs its own look → decide → act loop, in its own process, with no external client attached. Same robot, same tools — the difference is whose mind is in charge.

What the agent adds over a plain MCP server

A plain MCP server is reactive: it sits still until some external client sends a tool call. The built-in agent closes that gap — it can be the initiator.

It owns the loop

The agent runs the full look → decide → act cycle itself: peek to see, choose a bbox and a gesture, tap, then peek again to check. No external model in the loop.

It runs unattended

It wakes on its own — on a schedule or when the phone screen changes — operates the phone, and goes back to sleep. Nobody has to be sitting at a client.

It remembers

A persistent memory carries facts across wakes, so the agent isn’t starting cold every time. (See Memory & skills.)

It learns routines

Skills are reusable, app-specific playbooks the agent discovers and follows (“how to send a WeChat message,” “how to place a grocery order”), so it doesn’t have to work out each app from scratch every time.
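The two wake triggers described above (a schedule, or a change on the phone screen) can be sketched as a single decision function. This is only an illustration, not PhysiClaw's implementation: `WakeState`, `frame_digest`, `wake_reason`, the one-hour default interval, and the digest-based change check are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class WakeState:
    """Hypothetical bookkeeping kept between wakes (not a PhysiClaw type)."""
    last_wake: float          # timestamp in seconds, e.g. from time.time()
    last_frame_digest: str    # digest of the last camera frame acted on


def frame_digest(frame: bytes) -> str:
    """Fingerprint a frame so two frames can be compared cheaply.

    A real camera feed is noisy, so an exact digest is only a stand-in
    for whatever visual comparison the agent actually performs.
    """
    return hashlib.sha256(frame).hexdigest()


def wake_reason(state: WakeState, now: float, frame: bytes,
                interval_s: float = 3600.0) -> Optional[str]:
    """Return why the agent should wake, or None to keep sleeping."""
    if frame_digest(frame) != state.last_frame_digest:
        return "screen_change"
    if now - state.last_wake >= interval_s:
        return "schedule"
    return None
```

Checking the screen before the clock is an arbitrary choice in this sketch; it simply means that when both triggers fire at once, the wake is reported as a screen change.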

Every wake runs the same loop you already met in How it works — the agent just drives it instead of an external client:

wake ──► LOOK ──► DECIDE ──► ACT ──► LOOK ──► … ──► close
trigger  peek     pick a     tap /   peek           (DONE / WAIT /
fires    (camera) bbox +     swipe   again,          FAIL / IDLE)
                  gesture            re-decide
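The cycle in the diagram can be sketched as a small driver function. This is a rough outline under stated assumptions, not the agent's actual code: `run_session` and its `peek`, `decide`, and `act` callables are hypothetical stand-ins, and treating an exhausted turn budget as FAIL is a choice made for this example.

```python
from typing import Callable, Union

# The four closing verdicts named in the text.
VERDICTS = {"DONE", "WAIT", "FAIL", "IDLE"}


def run_session(peek: Callable[[], str],
                decide: Callable[[str], Union[dict, str]],
                act: Callable[[dict], None],
                max_turns: int = 20) -> str:
    """Drive look -> decide -> act until decide() returns a verdict.

    decide() returns either an action (a dict, e.g. a bbox plus a gesture)
    or one of the verdict strings, which closes the session.
    """
    for _ in range(max_turns):
        screen = peek()               # LOOK: observe the current screen
        choice = decide(screen)       # DECIDE: next action, or a verdict
        if isinstance(choice, str):
            if choice not in VERDICTS:
                raise ValueError(f"unknown verdict: {choice!r}")
            return choice             # close the session
        act(choice)                   # ACT: exactly one gesture, then look again
    # Assumption: running out of turns is reported as a failure.
    return "FAIL"
```

Because every pass starts with a fresh `peek()`, an unexpected popup is simply the next screen handed to `decide()`; nothing in the driver depends on a fixed script of steps.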

A few rules keep the loop honest. Each turn is shaped as exactly [note, one-other] — one running-summary note plus one real action — so the agent takes one step, records why, and never fires a burst of taps blind. Every turn ends by looking at the result, so a popup or a slow load is just the next state to react to, not a script to fall out of. And each session ends with a one-word verdict — DONE, WAIT, FAIL, or IDLE — that says what happened and whether to follow up.
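The [note, one-other] rule is easy to state as a check. The sketch below assumes each turn is a list of entries and that each entry is a mapping with a `kind` field; that encoding and the name `is_valid_turn` are assumptions for illustration, since the page does not say how turns are represented.

```python
from typing import Mapping, Sequence


def is_valid_turn(turn: Sequence[Mapping[str, str]]) -> bool:
    """True only if the turn is exactly [note, one-other].

    The first entry must be the running-summary note; the second must be
    a single real action (anything with a kind other than "note").
    """
    if len(turn) != 2:
        return False
    first, second = turn
    return first.get("kind") == "note" and second.get("kind") not in (None, "note")
```

For example, `[{"kind": "note", "text": "opening chat"}, {"kind": "tap"}]` passes, while a turn with two taps, two notes, or a note followed by a burst of actions is rejected.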

You don’t have to choose once and for all — the same install does both.

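|                 | Plain MCP server                              | Built-in agent                           |
|-----------------|-----------------------------------------------|------------------------------------------|
| Who decides     | your external client (Claude Desktop, an IDE) | PhysiClaw itself                         |
| Starts a task   | you, by prompting the client                  | a trigger: a schedule or a screen change |
| Runs unattended | no (needs a client connected)                 | yes (wakes, acts, sleeps)                |
| Memory & skills | up to your client                             | built in                                 |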

Reach for the plain server when you want to keep an existing agent in the loop and just give it hands. Reach for the built-in agent when you want PhysiClaw to run on its own — a recurring chore, a watch-and-react task, a phone that does things while you’re away.