How it works

Every action PhysiClaw takes is one turn of the same four-phase loop. Keeping the loop fixed is what makes the system reliable: each phase has one job, and each ends by looking at the result.

  ┌──────────────────────────────────────────────────────────┐
  │                                                          │
  ▼                                                          │
LOOK ─────────────► DECIDE ─────────► MOVE & TOUCH ──────► CHECK
overhead camera     agent picks a     arm drives stylus    look again:
+ on-device vision  box + a gesture   to the box, taps,    did it change?
boxes every         (tap / swipe /    then lifts away         │       │
element             long-press)                            yes│     no│
                                                              │       │
                                        next action ◄─────────┘       │
                                        retry / re-aim ◄──────────────┘

The whole system is just one camera and one arm. There is no second camera and no depth sensor — reliability comes from re-looking after every move, not from fancier hardware.

In LOOK, PhysiClaw takes a photo of the screen and runs it through on-device vision: OCR for text and a small icon-detection model for buttons and images. The output isn’t raw pixels; it’s a tidy listing where every element already has a box and a label:

id  kind  label           bbox [left,top,right,bottom]  conf
12  icon  "Clock"         [0.41, 0.55, 0.49, 0.63]      0.97
13  icon  "Settings"      [0.51, 0.55, 0.59, 0.63]      0.96
14  text  "Wednesday 14"  [0.30, 0.08, 0.70, 0.13]      0.99

A bbox (“bounding box”) is just a rectangle around one element, given as four numbers from 0 to 1 — fractions of the screen’s width and height. [0.41, 0.55, 0.49, 0.63] means “start 41% across and 55% down, end 49% across and 63% down.” Using fractions instead of pixels is what keeps everything portable: the same listing makes sense on any phone size.
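As a minimal sketch of that arithmetic, the snippet below turns a fractional bbox into pixel coordinates for one particular screen and finds its center. The `BBox` type, its method names, and the 1080×2400 resolution are illustrative assumptions, not part of PhysiClaw's API.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    """A rectangle in screen fractions (0 to 1), as in the listing above."""
    left: float
    top: float
    right: float
    bottom: float

    def center(self) -> tuple[float, float]:
        # Midpoint of the box, still expressed as screen fractions.
        return ((self.left + self.right) / 2, (self.top + self.bottom) / 2)

    def to_pixels(self, width: int, height: int) -> tuple[int, int, int, int]:
        # Scale x values by the screen width and y values by its height.
        return (
            round(self.left * width),
            round(self.top * height),
            round(self.right * width),
            round(self.bottom * height),
        )


clock = BBox(0.41, 0.55, 0.49, 0.63)
print(clock.to_pixels(1080, 2400))  # (443, 1320, 529, 1512)
```

The same `clock` box scales to any other resolution by changing only the two arguments, which is the portability the fractional form buys.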

In DECIDE, the agent (Claude, or any model you point at PhysiClaw) reads that listing and chooses one box plus one gesture. To open the clock, it picks element 12 and calls tap. It never deals in motor coordinates or pixels; it just names a box and an action. That’s the entire vocabulary, and it’s covered in the MCP tools reference.
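To make the "one box plus one gesture" vocabulary concrete, here is a rough sketch of what a decision could look like as data. The `Element` and `Action` types and the `decide` helper are hypothetical; in practice the choice is made by the model and sent through the MCP tools, whose real names and signatures are in the MCP tools reference. A plain label lookup stands in for the model here.

```python
from dataclasses import dataclass
from typing import Optional

# The three gestures named in the loop diagram.
GESTURES = {"tap", "swipe", "long-press"}


@dataclass(frozen=True)
class Element:
    id: int
    kind: str
    label: str
    bbox: tuple[float, float, float, float]
    conf: float


@dataclass(frozen=True)
class Action:
    element_id: int  # which box
    gesture: str     # what to do to it


def decide(listing: list[Element], label: str, gesture: str = "tap") -> Optional[Action]:
    """Stand-in for the agent: pick the most confident element with this label."""
    if gesture not in GESTURES:
        raise ValueError(f"unknown gesture: {gesture}")
    matches = [e for e in listing if e.label == label]
    if not matches:
        return None  # nothing to act on; the agent would look again
    best = max(matches, key=lambda e: e.conf)
    return Action(element_id=best.id, gesture=gesture)


listing = [
    Element(12, "icon", "Clock", (0.41, 0.55, 0.49, 0.63), 0.97),
    Element(13, "icon", "Settings", (0.51, 0.55, 0.59, 0.63), 0.96),
    Element(14, "text", "Wednesday 14", (0.30, 0.08, 0.70, 0.13), 0.99),
]
print(decide(listing, "Clock"))  # Action(element_id=12, gesture='tap')
```

Note that the action carries no coordinates at all: only an element id and a gesture name cross the boundary between the agent and the arm.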

In MOVE & TOUCH, PhysiClaw converts the chosen box into arm coordinates (this mapping is what calibration sets up), drives the stylus to the center of the box, and drops the tip to register a touch. The tip is driven by a fast electromagnet (a solenoid), so a tap is a crisp down-and-up, and a long-press just holds the tip down longer. Then the arm parks the stylus off to the side, out of the camera’s view, so the next photo is unobstructed.
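The box-to-arm conversion can be pictured as a small coordinate transform. The sketch below assumes calibration produces a 2D affine map from screen fractions to arm millimetres; that form, the `AffineCalibration` type, and the example numbers are assumptions made for illustration, not a description of how PhysiClaw stores its calibration.

```python
from typing import NamedTuple


class AffineCalibration(NamedTuple):
    """Hypothetical calibration: arm position (mm) as an affine function
    of the screen fractions (u, v)."""
    xu: float  # arm-x change per unit of u (across the screen)
    xv: float  # arm-x change per unit of v (down the screen)
    x0: float  # arm-x at the screen's top-left corner
    yu: float
    yv: float
    y0: float


def box_center(bbox: tuple[float, float, float, float]) -> tuple[float, float]:
    left, top, right, bottom = bbox
    return ((left + right) / 2, (top + bottom) / 2)


def screen_to_arm(cal: AffineCalibration, u: float, v: float) -> tuple[float, float]:
    # Apply the affine map: a linear part plus an offset.
    return (
        cal.xu * u + cal.xv * v + cal.x0,
        cal.yu * u + cal.yv * v + cal.y0,
    )


# Made-up example: a 70 mm x 150 mm screen, square to the arm's axes,
# whose top-left corner sits at (20, 30) mm in arm coordinates.
cal = AffineCalibration(xu=70.0, xv=0.0, x0=20.0, yu=0.0, yv=150.0, y0=30.0)
u, v = box_center((0.41, 0.55, 0.49, 0.63))
arm_x, arm_y = screen_to_arm(cal, u, v)
print(f"move to ({arm_x:.1f}, {arm_y:.1f}) mm")  # move to (51.5, 118.5) mm
```

The off-diagonal terms (`xv`, `yu`) are zero in this example; nonzero values would account for a phone that sits slightly rotated under the arm.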

In CHECK, PhysiClaw looks again and produces a fresh listing. The agent compares it to what it expected:

  • It changed as planned → move on to the next action.
  • Nothing changed → the tap missed or the box was wrong; re-aim and try again.
  • Something unexpected appeared (a popup, an ad, a permission prompt) → that’s simply the new state. The agent reads it and decides again — no brittle script to fall out of.

This “observe the result, then decide again” design is the whole reliability story. PhysiClaw recovers from surprises the same way a person would: by looking and trying once more.
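The three outcomes above can be sketched as a small classifier. This is a simplification that assumes each listing is reduced to a set of labels and that the agent's expectation is a single label it hopes to see; the real comparison is done by the agent reading the full listing, and the names `Outcome` and `check` are hypothetical.

```python
from enum import Enum


class Outcome(Enum):
    AS_PLANNED = "move on to the next action"
    NO_CHANGE = "re-aim and try again"
    UNEXPECTED = "read the new state and decide again"


def check(before: set[str], after: set[str], expected_label: str) -> Outcome:
    """Classify the result of one action from the labels seen before and after."""
    if after == before:
        # Same screen as before: the tap missed or the box was wrong.
        return Outcome.NO_CHANGE
    if expected_label in after:
        return Outcome.AS_PLANNED
    # The screen changed, but not to what was expected (popup, ad, prompt).
    return Outcome.UNEXPECTED


home = {"Clock", "Settings", "Wednesday 14"}
print(check(home, {"Alarm", "Stopwatch", "Timer"}, "Alarm").name)  # AS_PLANNED
print(check(home, home, "Alarm").name)                             # NO_CHANGE
print(check(home, {"Allow notifications?"}, "Alarm").name)         # UNEXPECTED
```

None of the three branches is an error path: each one simply feeds the next turn of the loop.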

LOOK actually has two settings, and the agent picks based on what it needs:

              peek                          screenshot
Source        the overhead camera           the phone’s own capture
Speed         ~4s                           ~12s
Sharpness     good enough for most targets  pixel-perfect
Side effects  none                          apps can notice a screenshot (share sheets, watermarks)

peek is the default — fast and invisible. The agent only escalates to screenshot when a target is too small for the camera to resolve, or glare makes the frame unreadable. Why this split exists, and how the vision pipeline works, is in How it sees.
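The escalation rule can be written down as a one-function sketch. The `choose_capture` helper, the `frame_readable` flag, and the 3% size threshold are assumptions chosen for illustration; the source does not state how small is "too small" or how glare is detected.

```python
def choose_capture(
    target_bbox: tuple[float, float, float, float],
    frame_readable: bool = True,
    min_side: float = 0.03,  # hypothetical threshold, as a screen fraction
) -> str:
    """Return "peek" by default; escalate to "screenshot" only when needed."""
    left, top, right, bottom = target_bbox
    smallest_side = min(right - left, bottom - top)
    if not frame_readable:
        return "screenshot"  # e.g. glare washed out the camera frame
    if smallest_side < min_side:
        return "screenshot"  # target too small for the camera to resolve
    return "peek"


print(choose_capture((0.41, 0.55, 0.49, 0.63)))  # peek
print(choose_capture((0.10, 0.10, 0.12, 0.11)))  # screenshot
```

Keeping peek as the fall-through case mirrors the cost table: the slower, visible capture is only paid for when the cheap one cannot do the job.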