How it works
Every action PhysiClaw takes is one turn of the same four-phase loop. Keeping the loop fixed is what makes the system reliable: each phase has one job, and each ends by looking at the result.
LOOK → DECIDE → MOVE & TOUCH → CHECK → back to LOOK

- LOOK: the overhead camera plus on-device vision boxes every element.
- DECIDE: the agent picks a box and a gesture (tap / swipe / long-press).
- MOVE & TOUCH: the arm drives the stylus to the box, taps, then lifts away.
- CHECK: look again: did it change? Yes → next action. No → retry / re-aim.

The whole system is just one camera and one arm. There is no second camera and no depth sensor; reliability comes from re-looking after every move, not from fancier hardware.
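The loop can be sketched in a few lines of Python. This is only an illustration of the control flow, not PhysiClaw's actual code: `run_loop`, `look`, `decide`, and `move_and_touch` are hypothetical names, and the listing is reduced to a plain list of labels.

```python
from typing import Callable, List, Optional, Tuple

Listing = List[str]            # stand-in for the boxed-element listing
Action = Tuple[str, int]       # (gesture, element id), e.g. ("tap", 12)


def run_loop(
    look: Callable[[], Listing],
    decide: Callable[[Listing], Optional[Action]],
    move_and_touch: Callable[[Action], None],
    max_turns: int = 10,
) -> List[Tuple[Action, bool]]:
    """Run LOOK -> DECIDE -> MOVE & TOUCH -> CHECK until decide() returns None."""
    history: List[Tuple[Action, bool]] = []
    listing = look()                       # LOOK
    for _ in range(max_turns):
        action = decide(listing)           # DECIDE
        if action is None:                 # nothing left to do
            break
        move_and_touch(action)             # MOVE & TOUCH
        fresh = look()                     # CHECK: look again
        history.append((action, fresh != listing))
        listing = fresh                    # the fresh listing feeds the next turn
    return history
```

Each history entry records the action taken and whether the screen changed afterwards, so a missed tap shows up as `False` and the next turn simply decides again from the fresh listing.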
1. Look
PhysiClaw takes a photo of the screen and runs it through on-device vision: OCR for text and a small icon-detection model for buttons and images. The output isn’t raw pixels; it’s a tidy listing where every element already has a box and a label:
| id | kind | label | bbox [left, top, right, bottom] | conf |
|---|---|---|---|---|
| 12 | icon | "Clock" | [0.41, 0.55, 0.49, 0.63] | 0.97 |
| 13 | icon | "Settings" | [0.51, 0.55, 0.59, 0.63] | 0.96 |
| 14 | text | "Wednesday 14" | [0.30, 0.08, 0.70, 0.13] | 0.99 |

A bbox (“bounding box”) is just a rectangle around one element, given as four numbers from
0 to 1: fractions of the screen’s width and height. [0.41, 0.55, 0.49, 0.63] means
“start 41% across and 55% down, end 49% across and 63% down.” Using fractions instead of pixels
is what keeps everything portable: the same listing makes sense on any phone size.
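To make the fractions concrete, here is a small sketch that turns a bbox into pixel coordinates for a given screen size. The helper name `bbox_to_pixels` and the example resolutions are assumptions for illustration; they are not part of PhysiClaw.

```python
from typing import Sequence, Tuple


def bbox_to_pixels(
    bbox: Sequence[float], width_px: int, height_px: int
) -> Tuple[int, int, int, int]:
    """Scale a fractional [left, top, right, bottom] box to whole pixels."""
    left, top, right, bottom = bbox
    return (
        round(left * width_px),    # left and right scale with screen width
        round(top * height_px),    # top and bottom scale with screen height
        round(right * width_px),
        round(bottom * height_px),
    )


clock = [0.41, 0.55, 0.49, 0.63]
print(bbox_to_pixels(clock, 1080, 2400))   # (443, 1320, 529, 1512)
print(bbox_to_pixels(clock, 1000, 2000))   # (410, 1100, 490, 1260)
```

The same four fractions land on the same icon at either resolution, which is the portability the listing relies on.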
2. Decide
The agent (Claude, or any model you point at PhysiClaw) reads that listing and chooses one
box plus one gesture. To open the clock, it picks element 12 and calls tap. It never
deals in motor coordinates or pixels; it just names a box and an action. That’s the entire
vocabulary, and it’s covered in the MCP tools reference.
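The shape of that decision can be shown with a tiny stand-in. In PhysiClaw the choice is made by a language model, so the rule-based `decide_open` below is only a placeholder that shows the input (a listing) and the output (one element id plus one gesture). The `Element` and `Action` types and the `min_conf` threshold are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Element:
    id: int
    kind: str                                  # "icon" or "text"
    label: str
    bbox: Tuple[float, float, float, float]    # fractions of the screen
    conf: float


@dataclass(frozen=True)
class Action:
    element_id: int
    gesture: str                               # "tap", "swipe", or "long-press"


LISTING = [
    Element(12, "icon", "Clock", (0.41, 0.55, 0.49, 0.63), 0.97),
    Element(13, "icon", "Settings", (0.51, 0.55, 0.59, 0.63), 0.96),
    Element(14, "text", "Wednesday 14", (0.30, 0.08, 0.70, 0.13), 0.99),
]


def decide_open(
    listing: List[Element], label: str, min_conf: float = 0.9
) -> Optional[Action]:
    """Pick the most confident icon with the given label and tap it."""
    candidates = [
        e for e in listing
        if e.kind == "icon" and e.label == label and e.conf >= min_conf
    ]
    if not candidates:
        return None                            # nothing suitable on screen
    best = max(candidates, key=lambda e: e.conf)
    return Action(element_id=best.id, gesture="tap")
```

Calling `decide_open(LISTING, "Clock")` yields `Action(element_id=12, gesture="tap")`. Note that no pixel or motor coordinate appears anywhere in the result.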
3. Move & touch
PhysiClaw converts the chosen box into arm coordinates (this mapping is what calibration sets up), drives the stylus to the center of the box, and drops the tip to register a touch. The tip is on a fast electromagnet (a solenoid), so a tap is a crisp down-and-up, and a long-press just holds the tip down longer. Then the arm parks the stylus off to the side, out of the camera’s view, so the next photo is unobstructed.
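As a rough sketch of the box-to-arm step, assume calibration yields a simple axis-aligned mapping: the arm position of the screen's top-left corner plus the screen's physical size in millimetres. That model, the `Calibration` type, and the numbers below are assumptions for illustration; the real calibration may also handle rotation or skew.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Calibration:
    origin_x_mm: float    # arm X at the screen's top-left corner (assumed)
    origin_y_mm: float    # arm Y at the screen's top-left corner (assumed)
    width_mm: float       # physical screen width
    height_mm: float      # physical screen height


def bbox_center(bbox: Sequence[float]) -> Tuple[float, float]:
    """Center of a fractional [left, top, right, bottom] box."""
    left, top, right, bottom = bbox
    return ((left + right) / 2, (top + bottom) / 2)


def screen_to_arm(point: Tuple[float, float], cal: Calibration) -> Tuple[float, float]:
    """Map a fractional screen point to arm coordinates in millimetres."""
    u, v = point
    return (cal.origin_x_mm + u * cal.width_mm,
            cal.origin_y_mm + v * cal.height_mm)


cal = Calibration(origin_x_mm=20.0, origin_y_mm=30.0, width_mm=70.0, height_mm=150.0)
target = screen_to_arm(bbox_center([0.41, 0.55, 0.49, 0.63]), cal)
```

With these made-up values, the center of the Clock box (0.45, 0.59) maps to roughly 51.5 mm by 118.5 mm in arm space, which is where the stylus would be driven before the tip drops.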
4. Check
PhysiClaw looks again and produces a fresh listing. The agent compares it to what it expected:
- It changed as planned → move on to the next action.
- Nothing changed → the tap missed or the box was wrong; re-aim and try again.
- Something unexpected appeared (a popup, an ad, a permission prompt) → that’s simply the new state. The agent reads it and decides again — no brittle script to fall out of.
This “observe the result, then decide again” design is the whole reliability story. PhysiClaw recovers from surprises the same way a person would: by looking and trying once more.
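The three outcomes above can be sketched as a small classifier. This is a simplification under stated assumptions: listings are reduced to their labels, "nothing changed" means the set of labels is identical, and the function and outcome names are hypothetical.

```python
from typing import Iterable


def check(before: Iterable[str], after: Iterable[str], expected_label: str) -> str:
    """Classify the result of an action by comparing two listings' labels."""
    before_labels, after_labels = set(before), set(after)
    if after_labels == before_labels:
        return "retry"      # nothing changed: the tap missed or the box was wrong
    if expected_label in after_labels:
        return "next"       # changed as planned: move on to the next action
    return "replan"         # something unexpected: treat it as the new state


home = ["Clock", "Settings", "Wednesday 14"]
print(check(home, home, "Alarm"))                           # retry
print(check(home, ["Alarm", "Stopwatch"], "Alarm"))         # next
print(check(home, ["Allow notifications?", "OK"], "Alarm")) # replan
```

The "replan" branch is not an error path. It just hands the fresh listing back to the agent, which decides again from whatever is on screen.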
Two ways to look: peek vs screenshot
LOOK actually has two settings, and the agent picks based on what it needs:
| | peek | screenshot |
|---|---|---|
| Source | the overhead camera | the phone’s own capture |
| Speed | ~4s | ~12s |
| Sharpness | good enough for most targets | pixel-perfect |
| Side effects | none | apps can notice a screenshot (share sheets, watermarks) |
peek is the default: fast and invisible. The agent only escalates to screenshot when a
target is too small for the camera to resolve, or when glare makes the frame unreadable. Why this
split exists, and how the vision pipeline works, are covered in How it sees.
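The escalation rule could look something like the sketch below. In PhysiClaw the agent makes this call itself, so the function is only a way to spell out the two triggers. The name `choose_look_mode`, the `glare` flag, and the `min_side` threshold are all hypothetical.

```python
from typing import Optional, Sequence


def choose_look_mode(
    target_bbox: Optional[Sequence[float]] = None,
    glare: bool = False,
    min_side: float = 0.03,    # assumed: smallest box side the camera can resolve
) -> str:
    """Return "peek" by default, "screenshot" only when peek is not enough."""
    if glare:
        return "screenshot"            # the camera frame is unreadable
    if target_bbox is not None:
        left, top, right, bottom = target_bbox
        if min(right - left, bottom - top) < min_side:
            return "screenshot"        # target too small for the camera
    return "peek"                      # fast and free of side effects
```

Keeping peek as the fall-through case mirrors the table: the slower, noticeable option is only reached through an explicit trigger.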