How it sees

PhysiClaw never hands the brain raw pixels to reason over — it hands it a listing, where every button, icon, and line of text has already been found, labelled, and boxed. Two models do that work on every frame: an OCR reader for text and a small neural net for icons. This page is how a frame becomes that listing, and why the listing looks the way it does.

A phone screen has two kinds of tappable thing — words and pictures — and PhysiClaw runs a separate, purpose-built model for each, then merges the results.

Text → OCR

RapidOCR (PaddleOCR models on ONNX Runtime, Chinese + English, no PyTorch). It reads every text region and returns the string plus a pixel box. That gives every text element a real label ("Settings", "$29.9"), so the brain can target by name.
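As a rough illustration of the step from pixel box to listing row, the sketch below normalises one OCR region into screen fractions. It assumes the OCR engine reports each region as four corner points in pixels plus a string and a score. The names `quad_to_bbox` and `ocr_to_rows` and the 1024×1024 sample frame are hypothetical and chosen for illustration; this is not PhysiClaw's actual code.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]


def quad_to_bbox(quad: Sequence[Point], width: int, height: int) -> List[float]:
    """Turn a four-corner pixel quadrilateral into [left, top, right, bottom] fractions."""
    xs = [p[0] for p in quad]
    ys = [p[1] for p in quad]
    raw = (min(xs) / width, min(ys) / height, max(xs) / width, max(ys) / height)
    # Clamp so corners that fall slightly outside the frame still land in 0-1.
    return [round(min(1.0, max(0.0, v)), 3) for v in raw]


def ocr_to_rows(regions, width: int, height: int) -> List[Dict[str, object]]:
    """Assumed region shape: (four corner points, text, score)."""
    return [
        {"kind": "text", "label": text, "bbox": quad_to_bbox(quad, width, height), "conf": float(score)}
        for quad, text, score in regions
    ]


# Hypothetical OCR output for one region on a 1024x1024 frame.
sample = [([(522, 563), (604, 563), (604, 645), (522, 645)], "Settings", 0.96)]
rows = ocr_to_rows(sample, width=1024, height=1024)
print(rows[0])
```

Taking the min and max of the corners gives an axis-aligned box even when the OCR quadrilateral is slightly rotated, which is all a tap target needs.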

Icons → ONNX detector

OmniParser V2 icon detection (a YOLO11m model finetuned by Microsoft, exported to ONNX and run through OpenCV’s DNN module — again no PyTorch). It finds interactable graphics — app icons, image buttons, toggles — that carry no text for OCR to read.

Both run fully on-device. No frame ever leaves your machine to be analysed, which is why install downloads the model locally. The two outputs are merged, cleaned (tiny boxes and low-confidence hits dropped, near-duplicates removed by overlap), sorted top-to-bottom then left-to-right, and renumbered 0, 1, 2, ….
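The merge-and-clean pass described above could look something like the following minimal sketch. The confidence floors (0.7 for text, 0.3 for icons) are the ones this page states; the minimum box side, the overlap threshold, the IoU-based duplicate test, and all names (`Element`, `merge`, `MIN_SIDE`, `MAX_IOU`) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Element:
    kind: str          # "text" or "icon"
    label: str
    bbox: List[float]  # [left, top, right, bottom] as screen fractions
    conf: float


MIN_CONF = {"text": 0.7, "icon": 0.3}  # thresholds stated on this page
MIN_SIDE = 0.01   # assumption: drop boxes thinner than 1% of the screen
MAX_IOU = 0.8     # assumption: overlap above this counts as a duplicate


def iou(a: List[float], b: List[float]) -> float:
    """Intersection over union of two [left, top, right, bottom] boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def merge(text: List[Element], icons: List[Element]) -> List[Element]:
    candidates = [
        e for e in text + icons
        if e.conf >= MIN_CONF[e.kind]
        and e.bbox[2] - e.bbox[0] >= MIN_SIDE
        and e.bbox[3] - e.bbox[1] >= MIN_SIDE
    ]
    kept: List[Element] = []
    # Visit the most confident hits first so a duplicate loses to its better twin.
    for e in sorted(candidates, key=lambda e: e.conf, reverse=True):
        if all(iou(e.bbox, k.bbox) < MAX_IOU for k in kept):
            kept.append(e)
    # Top-to-bottom, then left-to-right; the list index becomes the id.
    kept.sort(key=lambda e: (e.bbox[1], e.bbox[0]))
    return kept


text = [
    Element("text", "Settings", [0.51, 0.55, 0.59, 0.63], 0.96),
    Element("text", "Wednesday 14", [0.30, 0.08, 0.70, 0.13], 0.99),
    Element("text", "smudge", [0.10, 0.90, 0.30, 0.95], 0.50),  # below the text floor
]
icons = [
    Element("icon", "", [0.020, 0.060, 0.110, 0.100], 0.64),
    Element("icon", "", [0.021, 0.061, 0.111, 0.101], 0.40),  # near-duplicate
]
listing = merge(text, icons)
for i, e in enumerate(listing):
    print(i, e.kind, repr(e.label), e.bbox, e.conf)
```

A plain sort on (top, left) is the simplest reading of "top-to-bottom then left-to-right"; a real implementation might group boxes into rows first so that small vertical jitter does not reorder neighbours.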

The merged result is the brain’s entire view of the screen — one row per element:

id [kind] "label" [left,top,right,bottom] conf
0 [icon] "" [0.020,0.060,0.110,0.100] 0.64
1 [text] "Wednesday 14" [0.300,0.080,0.700,0.130] 0.99
2 [text] "Settings" [0.510,0.550,0.590,0.630] 0.96

Five fields, and that’s all the brain gets:

| Field | What it is |
|-------|------------|
| id | a stable handle for that element in this listing (renumbered every frame). |
| kind | text or icon (which detector found it). |
| label | the OCR string for text; empty ("") for an icon, since its picture isn’t named. |
| bbox | the rectangle, four fractions of the screen [left, top, right, bottom]. |
| conf | the detector’s confidence, 0–1. Text below 0.7 and icons below 0.3 are dropped. |
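To make the five fields concrete, here is a small sketch that parses one listing row back into typed values. The regular expression is inferred from the example rows on this page, and `Row`, `ROW_RE`, and `parse_row` are hypothetical names; the real format may allow variations this pattern does not cover.

```python
import re
from typing import NamedTuple, Tuple


class Row(NamedTuple):
    id: int
    kind: str
    label: str
    bbox: Tuple[float, float, float, float]
    conf: float


# Pattern inferred from the example rows: id [kind] "label" [l,t,r,b] conf
ROW_RE = re.compile(
    r'^(\d+) \[(text|icon)\] "(.*)" \[([\d.]+),([\d.]+),([\d.]+),([\d.]+)\] ([\d.]+)$'
)


def parse_row(line: str) -> Row:
    m = ROW_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not a listing row: {line!r}")
    idx, kind, label, left, top, right, bottom, conf = m.groups()
    return Row(int(idx), kind, label,
               (float(left), float(top), float(right), float(bottom)),
               float(conf))


icon = parse_row('0 [icon] "" [0.020,0.060,0.110,0.100] 0.64')
# Hypothetical text row in the same format, for illustration only.
text = parse_row('5 [text] "Sign in" [0.400,0.800,0.600,0.850] 0.91')
print(icon)
print(text)
```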

An icon’s label is empty on purpose — the detector finds that something is tappable here, not what it is. The brain figures out an unlabelled icon from its position and the surrounding text, exactly as a person would glance at a grid of app icons and know which is which.

The bbox: a coordinate system that travels

A bbox (“bounding box”) is the rectangle around one element, given as four numbers each from 0 to 1 — fractions of the screen’s width and height, never pixels.

(0,0) ┌───────────────────────┐
│ [0.51, 0.55, 0.59, 0.63]
│ ┌──────┐ ← left=0.51 (51% across)
│ │ Set… │ top=0.55 (55% down)
│ └──────┘ right=0.59, bottom=0.63
│ center = (0.55, 0.59)
└───────────────────────┘ (1,1)

Using fractions instead of pixels is what makes the listing portable. The same [0.51, 0.55, 0.59, 0.63] points at the same button whether the underlying frame is a 1024-px camera crop or a 1170-px phone screenshot, and whether your phone is a small SE or a large Pro Max. The brain reasons in one clean 0–1 space; the server handles every conversion to and from pixels and millimetres (that’s calibration’s job).
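The pixel half of that conversion is simple arithmetic, sketched below for the two frame widths mentioned above. The frame heights (2216 and 2532) are assumed for illustration, and `to_pixels` and `to_fractions` are hypothetical helpers; the millimetre side depends on calibration and is not shown.

```python
from typing import Sequence, Tuple


def to_pixels(bbox: Sequence[float], width: int, height: int) -> Tuple[int, int, int, int]:
    """Scale a fractional [left, top, right, bottom] box to whole pixels."""
    left, top, right, bottom = bbox
    return (round(left * width), round(top * height),
            round(right * width), round(bottom * height))


def to_fractions(box_px: Sequence[int], width: int, height: int) -> Tuple[float, ...]:
    """Inverse of to_pixels, up to rounding."""
    left, top, right, bottom = box_px
    return (left / width, top / height, right / width, bottom / height)


bbox = [0.51, 0.55, 0.59, 0.63]
camera_crop = to_pixels(bbox, 1024, 2216)   # assumed crop size
screenshot = to_pixels(bbox, 1170, 2532)    # assumed screenshot size
print(camera_crop)
print(screenshot)
```

The two pixel boxes differ, but both map back to the same fractions to within a pixel's worth of rounding, which is why the brain never needs to know which frame it is looking at.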

When the brain wants to act, it names a bbox and a gesture — tap [0.51, 0.55, 0.59, 0.63]. The server takes the center of that box, ((left+right)/2, (top+bottom)/2), and aims there. So a loose box is fine as long as its middle sits on the target; you’re pointing, not tracing.
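A minimal sketch of that aiming rule, using the center formula given above. The helper names `center` and `same_aim` and the "loose" box are made up for illustration.

```python
import math
from typing import Sequence, Tuple


def center(bbox: Sequence[float]) -> Tuple[float, float]:
    """Midpoint of a [left, top, right, bottom] box."""
    left, top, right, bottom = bbox
    return ((left + right) / 2, (top + bottom) / 2)


def same_aim(a: Sequence[float], b: Sequence[float], tol: float = 1e-9) -> bool:
    """True when two boxes resolve to the same tap point."""
    (ax, ay), (bx, by) = center(a), center(b)
    return math.isclose(ax, bx, abs_tol=tol) and math.isclose(ay, by, abs_tol=tol)


tight = [0.51, 0.55, 0.59, 0.63]   # the box from the listing
loose = [0.47, 0.51, 0.63, 0.67]   # hypothetical sloppier box around the same spot
print(center(tight))
print(same_aim(tight, loose))
```

Both boxes share a midpoint, so both would produce the same tap.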

peek vs screenshot: two sources, same listing

The listing format never changes — but it can be built from two different images, and choosing between them is the one real trade-off in seeing.

| | peek | screenshot |
|---|---|---|
| Image source | the overhead camera | the phone’s own capture |
| How | snapshot → crop to screen → detect | tap AssistiveTouch → iOS Shortcut uploads pixels → detect |
| Speed | ~4 s | ~12 s |
| Sharpness | good for most targets | pixel-perfect |
| Side effects | none | apps can notice a screenshot (share sheets, watermarks) |

peek is the default and the workhorse. It parks the stylus out of view, photographs the screen, crops to just the phone-screen region (cropping gives each icon 2–3× more pixels and sharpens detection), and runs the same two detectors. If the first frame comes back blurry — a focus or motion problem — it waits a couple of seconds and grabs one more before giving up.
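The retry behaviour can be sketched as a single second attempt after a short wait. Everything here is hypothetical: `peek_frame`, the injected `capture` and `is_blurry` callables, and the 2-second default stand in for whatever the real camera and blur check are.

```python
import time
from typing import Callable, Optional, TypeVar

Frame = TypeVar("Frame")


def peek_frame(
    capture: Callable[[], Frame],
    is_blurry: Callable[[Frame], bool],
    retry_delay_s: float = 2.0,                    # assumed "couple of seconds"
    sleep: Callable[[float], None] = time.sleep,   # injectable so tests run instantly
) -> Optional[Frame]:
    """Grab a frame; if it is blurry, wait once and try one more time."""
    frame = capture()
    if not is_blurry(frame):
        return frame
    sleep(retry_delay_s)
    frame = capture()
    return None if is_blurry(frame) else frame


# Stand-in camera: the first frame is blurry, the second is sharp.
frames = iter(["blurry", "sharp"])
result = peek_frame(lambda: next(frames), lambda f: f == "blurry", sleep=lambda s: None)
print(result)
```

Returning `None` after the second blurry frame leaves the decision about what to do next (for example, escalating to a screenshot) to the caller.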

screenshot is the escalation. Instead of a camera photo it triggers the phone’s own screenshot through the bridge: the arm taps AssistiveTouch, an iOS Shortcut captures and uploads pixel-perfect bytes, and vision runs on those. The result is sharper but slower, and it’s mutating — taking a screenshot is a real iOS event some apps react to. So the brain only reaches for it when:

  • a target is too small for the camera to resolve (it’s missing from the peek listing),
  • glare or motion blur makes the camera frame unreadable, or
  • it needs to read fine print the camera can’t make out.

Because a screenshot can shift app state, the rule is always to peek again after a screenshot before tapping — read the live screen, then act.
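One way to read the escalation rule is sketched below: peek first, take a screenshot only if the target is missing, then peek again before trusting the result. The function `locate`, the dict-shaped rows, and the "same set of labels" test for an unchanged screen are all assumptions for illustration, not the brain's actual policy.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, object]
Listing = List[Row]


def find(listing: Listing, label: str) -> Optional[Row]:
    return next((row for row in listing if row["label"] == label), None)


def labels(listing: Listing) -> List[str]:
    return sorted(str(row["label"]) for row in listing)


def locate(label: str, peek: Callable[[], Listing],
           screenshot: Callable[[], Listing]) -> Optional[Row]:
    before = peek()
    hit = find(before, label)
    if hit is not None:
        return hit                      # the cheap path: no escalation needed
    sharp = screenshot()                # slower, and a real iOS event
    hit = find(sharp, label)
    after = peek()                      # always read the live screen again
    # Assumption: if the camera sees a different set of labels than before,
    # the screenshot shifted app state and the caller should re-plan.
    if hit is None or labels(after) != labels(before):
        return None
    return hit


calls: List[str] = []
camera_view: Listing = [{"label": "Settings", "bbox": [0.51, 0.55, 0.59, 0.63]}]
fine_print: Row = {"label": "Terms", "bbox": [0.40, 0.95, 0.60, 0.97]}


def fake_peek() -> Listing:
    calls.append("peek")
    return camera_view


def fake_screenshot() -> Listing:
    calls.append("screenshot")
    return camera_view + [fine_print]


target = locate("Terms", fake_peek, fake_screenshot)
print(target, calls)
```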

Handing the brain a parsed listing instead of an image does three things at once: it grounds every target in a real on-screen rectangle (no guessing pixel coordinates), it makes “did the screen change?” a cheap text comparison between two listings, and it keeps the brain’s vocabulary tiny — pick a box, pick a gesture. That small, stable interface is what lets PhysiClaw recover from surprises by simply looking again.
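The "cheap text comparison" can be as small as a set difference over rows. The sketch below ignores the id and confidence columns, which can vary between frames of the same screen; `body`, `diff`, and both sample listings are hypothetical, and a real comparison on camera frames would probably also need some tolerance for small bbox jitter.

```python
from typing import List, Tuple


def body(row: str) -> str:
    """Strip the leading id and trailing conf: keep '[kind] "label" [bbox]'."""
    return row.strip().split(" ", 1)[1].rsplit(" ", 1)[0]


def diff(before: str, after: str) -> Tuple[List[str], List[str]]:
    """Return (added, removed) row bodies between two listings."""
    old = {body(r) for r in before.splitlines() if r.strip()}
    new = {body(r) for r in after.splitlines() if r.strip()}
    return sorted(new - old), sorted(old - new)


# Two hypothetical listings of consecutive frames.
before = (
    '0 [icon] "" [0.020,0.060,0.110,0.100] 0.64\n'
    '1 [text] "Settings" [0.510,0.550,0.590,0.630] 0.96'
)
after = (
    '0 [icon] "" [0.020,0.060,0.110,0.100] 0.66\n'
    '1 [text] "Wi-Fi" [0.100,0.200,0.300,0.240] 0.97'
)
added, removed = diff(before, after)
print("added:", added)
print("removed:", removed)
print("changed:", bool(added or removed))
```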