How it sees

PhysiClaw never hands the brain raw pixels to reason over. It hands it a listing in which every button, icon, and line of text has already been found, labelled, and boxed. Two models do that work on every frame: an OCR reader for text and a small neural net for icons. This page explains how a frame becomes that listing, and why the listing looks the way it does.

Two detectors, one frame

A phone screen has two kinds of tappable thing — words and pictures — and PhysiClaw runs a separate, purpose-built model for each, then merges the results.

Text → OCR

Text is read by RapidOCR (PaddleOCR models on ONNX Runtime, Chinese + English, no PyTorch). It reads every text region and returns the string plus a pixel box. That gives every text element a real label, such as "Settings" or "$29.9", so the brain can target by name.

Icons → ONNX detector

Icons are found by OmniParser V2 icon detection (a YOLO11m model finetuned by Microsoft, exported to ONNX and run through OpenCV’s DNN module, again with no PyTorch). It finds interactable graphics (app icons, image buttons, toggles) that carry no text for OCR to read.

Both run fully on-device. No frame ever leaves your machine to be analysed, which is why install downloads the models locally. The two outputs are merged, cleaned (tiny boxes and low-confidence hits dropped, near-duplicates removed by overlap), sorted top-to-bottom then left-to-right, and renumbered 0, 1, 2, ….
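The merge step described above could look roughly like the following sketch. The confidence floors (0.7 for text, 0.3 for icons) are the ones this page gives for the `conf` field; the `Element` class, the `merge` and `iou` names, the minimum box size, and the overlap threshold are assumptions made for illustration, not PhysiClaw's actual code or values.

```python
from dataclasses import dataclass

# Confidence floors as stated on this page for text and icons.
MIN_CONF = {"text": 0.7, "icon": 0.3}
MIN_SIDE = 0.01        # assumed: drop boxes thinner than 1% of the screen
IOU_DUPLICATE = 0.5    # assumed: overlap above this counts as a duplicate


@dataclass
class Element:
    kind: str      # "text" or "icon"
    label: str     # OCR string, or "" for an icon
    bbox: tuple    # (left, top, right, bottom) as 0-1 fractions
    conf: float


def iou(a, b):
    """Intersection-over-union of two (left, top, right, bottom) boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def merge(text_hits, icon_hits):
    """Merge both detectors' hits into one cleaned, ordered list."""
    kept = []
    # Visit the most confident hits first so duplicates lose to them.
    for el in sorted(text_hits + icon_hits, key=lambda e: -e.conf):
        left, top, right, bottom = el.bbox
        if el.conf < MIN_CONF[el.kind]:
            continue  # low-confidence hit
        if right - left < MIN_SIDE or bottom - top < MIN_SIDE:
            continue  # tiny box
        if any(iou(el.bbox, k.bbox) > IOU_DUPLICATE for k in kept):
            continue  # near-duplicate of something already kept
        kept.append(el)
    # Top-to-bottom, then left-to-right; the list index then serves as the id.
    kept.sort(key=lambda e: (e.bbox[1], e.bbox[0]))
    return kept
```

Sorting on the exact `top` value is the most literal reading of "top-to-bottom then left-to-right". A real implementation may well group elements into rows first so that small vertical jitter does not reorder neighbours.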

The element listing

The merged result is the brain’s entire view of the screen — one row per element:

id [kind] "label" [left,top,right,bottom] conf
0 [icon] "" [0.020,0.060,0.110,0.100] 0.64
1 [text] "Wednesday 14" [0.300,0.080,0.700,0.130] 0.99
2 [text] "Settings" [0.510,0.550,0.590,0.630] 0.96

Five fields, and that’s all the brain gets:

Field	What it is
`id`	a handle for that element, valid only within this listing (ids are renumbered every frame).
`kind`	`text` or `icon` — which detector found it.
`label`	the OCR string for text; empty (`""`) for an icon — its picture isn’t named.
`bbox`	the rectangle, four fractions of the screen `[left, top, right, bottom]`.
`conf`	the detector’s confidence, `0`–`1`. Text below 0.7 and icons below 0.3 are dropped.
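As a rough illustration of the row format, the sketch below renders and parses one row. The function names `format_row` and `parse_row` and the regular expression are hypothetical; only the field order and the three-decimal bbox and two-decimal conf formatting are taken from the sample rows above.

```python
import re

# One row: id [kind] "label" [left,top,right,bottom] conf
ROW = re.compile(r'^(\d+) \[(text|icon)\] "(.*)" \[([\d.,]+)\] ([\d.]+)$')


def format_row(idx, kind, label, bbox, conf):
    """Render one element in the listing's row format."""
    box = ",".join(f"{v:.3f}" for v in bbox)
    return f'{idx} [{kind}] "{label}" [{box}] {conf:.2f}'


def parse_row(line):
    """Split one listing row back into its five fields."""
    m = ROW.match(line)
    if m is None:
        raise ValueError(f"not a listing row: {line!r}")
    idx, kind, label, box, conf = m.groups()
    bbox = tuple(float(v) for v in box.split(","))
    return int(idx), kind, label, bbox, float(conf)


row = format_row(0, "icon", "", (0.02, 0.06, 0.11, 0.10), 0.64)
print(row)
print(parse_row(row))
```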

An icon’s label is empty on purpose — the detector finds that something is tappable here, not what it is. The brain figures out an unlabelled icon from its position and the surrounding text, exactly as a person would glance at a grid of app icons and know which is which.

The bbox: a coordinate system that travels

A bbox (“bounding box”) is the rectangle around one element, given as four numbers each from 0 to 1 — fractions of the screen’s width and height, never pixels.

(0,0) ┌───────────────────────┐
      │   [0.51, 0.55, 0.59, 0.63]
      │        ┌──────┐  ← left=0.51 (51% across)
      │        │ Set… │     top=0.55 (55% down)
      │        └──────┘     right=0.59, bottom=0.63
      │                     center = (0.55, 0.59)
      └───────────────────────┘ (1,1)

Using fractions instead of pixels is what makes the listing portable. The same [0.51, 0.55, 0.59, 0.63] points at the same button whether the underlying frame is a 1024-px camera crop or a 1170-px phone screenshot — and whether your phone is a small SE or a large Pro Max. The brain reasons in one clean 0–1 space; the server handles every conversion to and from pixels and millimetres (that’s calibration’s job).
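To make the portability concrete, here is a small sketch of the fraction-to-pixel conversion. The names `to_pixels` and `to_fractions` are hypothetical, and the frame heights are assumed for the example (the text above only gives the widths); the real conversion lives in the server's calibration code.

```python
def to_pixels(bbox, width, height):
    """Scale a 0-1 bbox to integer pixel coordinates for one frame size."""
    left, top, right, bottom = bbox
    return (round(left * width), round(top * height),
            round(right * width), round(bottom * height))


def to_fractions(px_box, width, height):
    """Scale a pixel bbox back to 0-1 fractions of the frame."""
    left, top, right, bottom = px_box
    return (left / width, top / height, right / width, bottom / height)


bbox = (0.51, 0.55, 0.59, 0.63)
# Assumed frame sizes: a 1024-px-wide camera crop and a 1170-px-wide screenshot.
print(to_pixels(bbox, 1024, 2216))
print(to_pixels(bbox, 1170, 2532))
```

The two pixel boxes differ, but both come from the same fractional bbox, which is the only form the brain ever sees.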

When the brain wants to act, it names a bbox and a gesture — tap [0.51, 0.55, 0.59, 0.63]. The server takes the center of that box, ((left+right)/2, (top+bottom)/2), and aims there. So a loose box is fine as long as its middle sits on the target; you’re pointing, not tracing.
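The centre rule is simple enough to show directly. In this sketch the function name `center` is hypothetical; the formula is the one given above. A tight box and a looser one around the same button resolve to the same aim point.

```python
def center(bbox):
    """Return the point the server would aim at for a given bbox."""
    left, top, right, bottom = bbox
    return ((left + right) / 2, (top + bottom) / 2)


tight = (0.51, 0.55, 0.59, 0.63)   # hugs the button
loose = (0.45, 0.50, 0.65, 0.68)   # sloppier, but centred on the same spot
print(center(tight))
print(center(loose))
```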

`peek` vs `screenshot`: two sources, same listing

The listing format never changes — but it can be built from two different images, and choosing between them is the one real trade-off in seeing.

	`peek`	`screenshot`
Image source	the overhead camera	the phone’s own capture
How	snapshot → crop to screen → detect	tap AssistiveTouch → iOS Shortcut uploads pixels → detect
Speed	~4 s	~12 s
Sharpness	good for most targets	pixel-perfect
Side effects	none	apps can notice a screenshot (share sheets, watermarks)

peek is the default and the workhorse. It parks the stylus out of view, photographs the screen, crops to just the phone-screen region (cropping gives each icon 2–3× more pixels and sharpens detection), and runs the same two detectors. If the first frame comes back blurry — a focus or motion problem — it waits a couple of seconds and grabs one more before giving up.
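The retry-once behaviour might be sketched as below. Everything here is an assumption for illustration: `capture` and `is_blurry` are placeholder callables (this page does not say how blur is measured), the 2-second delay stands in for "a couple of seconds", and returning `None` is just one way to represent giving up.

```python
import time


def peek_frame(capture, is_blurry, retry_delay_s=2.0, sleep=time.sleep):
    """Return a sharp frame, retrying once; None if both frames are blurry."""
    frame = capture()
    if not is_blurry(frame):
        return frame
    sleep(retry_delay_s)          # give focus or motion time to settle
    frame = capture()
    return None if is_blurry(frame) else frame


# Usage with stand-in frames: the first capture is blurry, the second sharp.
frames = iter(["blurry", "sharp"])
result = peek_frame(lambda: next(frames),
                    lambda f: f == "blurry",
                    sleep=lambda s: None)   # skip the real wait in this demo
print(result)
```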

screenshot is the escalation. Instead of a camera photo it triggers the phone’s own screenshot through the bridge: the arm taps AssistiveTouch, an iOS Shortcut captures and uploads pixel-perfect bytes, and vision runs on those. The result is sharper but slower, and it’s mutating — taking a screenshot is a real iOS event some apps react to. So the brain only reaches for it when:

a target is too small for the camera to resolve (it’s missing from the peek listing),
glare or motion blur makes the camera frame unreadable, or
it needs to read fine print the camera can’t make out.

Because a screenshot can shift app state, the rule is always to peek again after a screenshot before tapping — read the live screen, then act.

Why a listing and not pixels

Handing the brain a parsed listing instead of an image does three things at once: it grounds every target in a real on-screen rectangle (no guessing pixel coordinates), it makes “did the screen change?” a cheap text comparison between two listings, and it keeps the brain’s vocabulary tiny — pick a box, pick a gesture. That small, stable interface is what lets PhysiClaw recover from surprises by simply looking again.
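As a minimal sketch of that "cheap text comparison", the code below compares two listings while ignoring the per-frame `id` and the `conf` value. The function names are hypothetical, and the exact-match comparison is a simplification: with camera frames, bbox values can jitter slightly between captures, so a real check would likely need some tolerance.

```python
def signature(listing):
    """Reduce a listing to a set of rows, ignoring id and conf."""
    sig = set()
    for line in listing.strip().splitlines():
        _id, rest = line.split(" ", 1)        # drop the per-frame id
        body, _conf = rest.rsplit(" ", 1)     # drop the confidence
        sig.add(body)
    return sig


def screen_changed(before, after):
    """True if the two listings describe different sets of elements."""
    return signature(before) != signature(after)


before = '''0 [icon] "" [0.020,0.060,0.110,0.100] 0.64
1 [text] "Wednesday 14" [0.300,0.080,0.700,0.130] 0.99'''
same = '''0 [icon] "" [0.020,0.060,0.110,0.100] 0.61
1 [text] "Wednesday 14" [0.300,0.080,0.700,0.130] 0.98'''
other = '''0 [icon] "" [0.020,0.060,0.110,0.100] 0.64
1 [text] "Thursday 15" [0.300,0.080,0.700,0.130] 0.99'''
print(screen_changed(before, same))
print(screen_changed(before, other))
```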