How it works

Every action PhysiClaw takes is one turn of the same four-phase loop. Keeping the loop fixed is what makes the system reliable: each phase has one job, and every turn ends by checking the result.

   ┌──────────────────────────────────────────────────────────┐
   │                                                          │
   ▼                                                          │
 LOOK ─────────────► DECIDE ─────────► MOVE & TOUCH ──────► CHECK
 overhead camera     agent picks a     arm drives stylus    look again:
 + on-device vision  box + a gesture   to the box, taps     did it change?
 boxes every         (tap / swipe /    then lifts away       │      │
 element             long-press)                          yes│    no│
                                                             │      │
                                              next action ◄──┘      │
                                              retry / re-aim ◄──────┘

The whole system is just one camera and one arm. There is no second camera and no depth sensor — reliability comes from re-looking after every move, not from fancier hardware.

1. Look

PhysiClaw takes a photo of the screen and runs it through on-device vision — OCR for text and a small icon-detection model for buttons and images. The output isn’t raw pixels; it’s a tidy listing where every element already has a box and a label:

id  kind   label              bbox [left,top,right,bottom]   conf
12  icon   "Clock"            [0.41, 0.55, 0.49, 0.63]       0.97
13  icon   "Settings"         [0.51, 0.55, 0.59, 0.63]       0.96
14  text   "Wednesday 14"     [0.30, 0.08, 0.70, 0.13]       0.99

A bbox (“bounding box”) is just a rectangle around one element, given as four numbers from 0 to 1 — fractions of the screen’s width and height. [0.41, 0.55, 0.49, 0.63] means “start 41% across and 55% down, end 49% across and 63% down.” Using fractions instead of pixels is what keeps everything portable: the same listing makes sense on any phone size.
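To make the fraction convention concrete, here is a minimal Python sketch that scales a bbox to pixels for two different screen sizes. The helper name `bbox_to_pixels` and the example resolutions are illustrative assumptions, not part of PhysiClaw.

```python
def bbox_to_pixels(bbox, width_px, height_px):
    """Scale a fractional [left, top, right, bottom] box to whole pixels."""
    left, top, right, bottom = bbox
    return (
        round(left * width_px),
        round(top * height_px),
        round(right * width_px),
        round(bottom * height_px),
    )


clock = [0.41, 0.55, 0.49, 0.63]  # the "Clock" icon from the listing above

# Hypothetical screen sizes: the same fractional box lands in the same
# relative spot on both, only the pixel numbers differ.
small = bbox_to_pixels(clock, 1000, 2000)
large = bbox_to_pixels(clock, 1500, 3000)
```

Only this last conversion step ever needs to know the screen size; the listing itself stays the same.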

2. Decide

The agent — Claude, or any model you point at PhysiClaw — reads that listing and chooses one box plus one gesture. To open the clock, it picks element 12 and calls tap. It never deals in motor coordinates or pixels; it just names a box and an action. That’s the entire vocabulary, and it’s covered in the MCP tools reference.
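One way to picture a decision is as a tiny record: an element id plus a gesture name. The sketch below is hypothetical throughout. The `Element` and `Action` types and the `choose_tap` helper are illustrative names, and the label lookup is a rule-based stand-in for the model's choice, not the actual tool interface (that lives in the MCP tools reference).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    id: int
    kind: str
    label: str
    bbox: tuple
    conf: float


@dataclass(frozen=True)
class Action:
    element_id: int
    gesture: str  # "tap", "swipe", or "long-press"


def choose_tap(listing, label):
    """Return a tap on the highest-confidence element with this label, or None."""
    matches = [e for e in listing if e.label == label]
    if not matches:
        return None
    best = max(matches, key=lambda e: e.conf)
    return Action(element_id=best.id, gesture="tap")


listing = [
    Element(12, "icon", "Clock", (0.41, 0.55, 0.49, 0.63), 0.97),
    Element(13, "icon", "Settings", (0.51, 0.55, 0.59, 0.63), 0.96),
    Element(14, "text", "Wednesday 14", (0.30, 0.08, 0.70, 0.13), 0.99),
]

action = choose_tap(listing, "Clock")
missing = choose_tap(listing, "Camera")  # no such element in the listing
```

Note that the resulting `Action` carries no coordinates at all: it names a box and a gesture, and everything physical is left to the next phase.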

3. Move & touch

PhysiClaw converts the chosen box into arm coordinates (this mapping is what calibration sets up), drives the stylus to the center of the box, and drops the tip to register a touch. The tip is on a fast electromagnet — a solenoid — so a tap is a crisp down-and-up, and a long-press just holds the tip down longer. Then the arm parks the stylus off to the side, out of the camera’s view, so the next photo is unobstructed.
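The box-to-arm conversion can be sketched as two small steps: find the center of the box, then map that point into arm space. The sketch assumes the simplest possible calibration, an axis-aligned linear mapping (an origin plus a span per axis, in millimetres). The real calibration may be more involved, and the `calib` keys and values here are invented for illustration.

```python
def bbox_center(bbox):
    """Center of a fractional [left, top, right, bottom] box."""
    left, top, right, bottom = bbox
    return ((left + right) / 2, (top + bottom) / 2)


def screen_to_arm(point, calib):
    """Map a fractional screen point to arm coordinates in millimetres.

    Assumption: the screen is axis-aligned with the arm, so the mapping is
    just an offset plus a scale on each axis.
    """
    x, y = point
    return (
        calib["origin_x_mm"] + x * calib["width_mm"],
        calib["origin_y_mm"] + y * calib["height_mm"],
    )


# Hypothetical calibration: where the screen's top-left corner sits in arm
# space, and how large the screen is physically.
calib = {
    "origin_x_mm": 20.0,
    "origin_y_mm": 35.0,
    "width_mm": 70.0,
    "height_mm": 150.0,
}

cx, cy = bbox_center([0.41, 0.55, 0.49, 0.63])
arm_x, arm_y = screen_to_arm((cx, cy), calib)
```

With these made-up numbers, the center of the "Clock" box at roughly (0.45, 0.59) maps to about 51.5 mm by 123.5 mm in arm space.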

4. Check

PhysiClaw looks again and produces a fresh listing. The agent compares it to what it expected:

It changed as planned → move on to the next action.
Nothing changed → the tap missed or the box was wrong; re-aim and try again.
Something unexpected appeared (a popup, an ad, a permission prompt) → that’s simply the new state. The agent reads it and decides again — no brittle script to fall out of.

This “observe the result, then decide again” design is the whole reliability story. PhysiClaw recovers from surprises the same way a person would: by looking and trying once more.
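The three outcomes above can be sketched as a short retry loop. Everything here is a simplified assumption: `run_action`, `FakePhone`, and the "expected label appears" check are illustrative stand-ins for the real camera, arm, and agent reasoning.

```python
def run_action(look, act, expected_label, max_attempts=3):
    """One turn of look -> act -> check, retrying while nothing changes."""
    before = look()
    for attempt in range(1, max_attempts + 1):
        act()
        after = look()
        labels = {element["label"] for element in after}
        if expected_label in labels:
            return ("ok", attempt)          # changed as planned
        if after != before:
            return ("unexpected", attempt)  # new state: the agent decides again
        # Nothing changed: the tap missed, so loop around and try again.
    return ("failed", max_attempts)


class FakePhone:
    """Stand-in for camera + arm: the first tap misses, the second lands."""

    def __init__(self):
        self.taps = 0

    def look(self):
        if self.taps >= 2:
            return [{"label": "Alarm"}]
        return [{"label": "Clock"}]

    def tap(self):
        self.taps += 1


phone = FakePhone()
result = run_action(phone.look, phone.tap, "Alarm")
```

In this simulation the first tap leaves the listing unchanged, so the loop retries, and the second attempt produces the expected element.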

Two ways to look: `peek` vs `screenshot`

LOOK actually has two settings, and the agent picks based on what it needs:

              `peek`                         `screenshot`
Source        the overhead camera            the phone’s own capture
Speed         ~4s                            ~12s
Sharpness     good enough for most targets   pixel-perfect
Side effects  none                           apps can notice a screenshot (share sheets, watermarks)

`peek` is the default: fast and invisible. The agent only escalates to `screenshot` when a target is too small for the camera to resolve, or when glare makes the frame unreadable. How it sees explains why this split exists and how the vision pipeline works.
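The escalation rule can be sketched as a small function. The name `choose_look_mode`, the `glare` flag, and the `min_side` threshold are assumptions made up for this example; the source does not say how small is too small or how glare is detected.

```python
def choose_look_mode(target_bbox, glare, min_side=0.03):
    """Return "peek" unless the target is too small or glare is present.

    min_side is a hypothetical threshold: the smallest box side, as a
    fraction of the screen, that the overhead camera is assumed to resolve.
    """
    left, top, right, bottom = target_bbox
    too_small = min(right - left, bottom - top) < min_side
    return "screenshot" if (too_small or glare) else "peek"


normal_icon = choose_look_mode([0.41, 0.55, 0.49, 0.63], glare=False)
tiny_target = choose_look_mode([0.10, 0.10, 0.12, 0.12], glare=False)
glared_icon = choose_look_mode([0.41, 0.55, 0.49, 0.63], glare=True)
```

Defaulting to the cheaper mode and escalating only on a specific failure condition keeps the common case fast and free of side effects.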