PhysiClaw never hands the brain raw pixels to reason over — it hands it a listing, where every
button, icon, and line of text has already been found, labelled, and boxed. Two models do that
work on every frame: an OCR reader for text and a small neural net for icons. This page explains
how a frame becomes that listing, and why the listing looks the way it does.
A phone screen has two kinds of tappable thing — words and pictures — and PhysiClaw runs a
separate, purpose-built model for each, then merges the results.
Text → OCR
RapidOCR runs PaddleOCR models on ONNX Runtime (Chinese + English, no PyTorch). It reads
every text region and returns the string plus a pixel box. That gives every text element a
real label, such as "Settings" or "$29.9", so the brain can target by name.
Icons → ONNX detector
OmniParser V2 icon detection is a YOLO11m model finetuned by Microsoft, exported to ONNX
and run through OpenCV’s DNN module (again, no PyTorch). It finds interactable graphics,
such as app icons, image buttons, and toggles, that carry no text for OCR to read.
Both run entirely on your own machine. No frame ever leaves it to be analysed, which is why
installation downloads the model locally.
The two outputs are merged, cleaned (tiny boxes and low-confidence hits dropped,
near-duplicates removed by overlap), sorted top-to-bottom then left-to-right, and
renumbered 0, 1, 2, ….
That number is a handle for the element within this listing only, because every frame is renumbered.
kind: text or icon, according to which detector found it.
label: the OCR string for text; empty ("") for an icon, because its picture isn’t named.
bbox: the rectangle, as four fractions of the screen [left, top, right, bottom].
conf: the detector’s confidence, 0–1. Text below 0.7 and icons below 0.3 are dropped.
An icon’s label is empty on purpose — the detector finds that something is tappable here, not
what it is. The brain figures out an unlabelled icon from its position and the surrounding text,
exactly as a person would glance at a grid of app icons and know which is which.
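The merge, clean, sort, and renumber pass can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not PhysiClaw’s actual code: the function names, the `index` field name, the tiny-box floor `MIN_SIDE`, and the overlap cutoff `MAX_IOU` are hypothetical. Only the 0.7 and 0.3 confidence floors come from this page.

```python
# Confidence floors stated on this page: text below 0.7 and icons below 0.3 are dropped.
MIN_CONF = {"text": 0.7, "icon": 0.3}
# Assumed values (not from the source): the smallest box side worth keeping, and
# the overlap (IoU) above which two boxes count as near-duplicates.
MIN_SIDE = 0.01
MAX_IOU = 0.5


def iou(a, b):
    """Intersection-over-union of two [left, top, right, bottom] boxes."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def build_listing(text_hits, icon_hits):
    """Merge both detectors' hits, clean them, sort them, and renumber from 0."""
    hits = [dict(h, kind="text") for h in text_hits]
    hits += [dict(h, kind="icon", label="") for h in icon_hits]  # icons carry no label

    kept = []
    # Visit the most confident hits first so a duplicate loses to its better twin.
    for hit in sorted(hits, key=lambda h: h["conf"], reverse=True):
        left, top, right, bottom = hit["bbox"]
        if hit["conf"] < MIN_CONF[hit["kind"]]:
            continue  # low-confidence hit
        if min(right - left, bottom - top) < MIN_SIDE:
            continue  # tiny box
        if any(iou(hit["bbox"], k["bbox"]) > MAX_IOU for k in kept):
            continue  # near-duplicate of a box already kept
        kept.append(hit)

    # Top-to-bottom, then left-to-right.
    kept.sort(key=lambda h: (h["bbox"][1], h["bbox"][0]))
    # "index" is a hypothetical name for the per-frame number.
    return [dict(h, index=i) for i, h in enumerate(kept)]


text_hits = [
    {"label": "Settings", "bbox": [0.51, 0.55, 0.59, 0.63], "conf": 0.95},
    {"label": "smudge", "bbox": [0.10, 0.10, 0.30, 0.15], "conf": 0.50},  # below 0.7
]
icon_hits = [
    {"bbox": [0.10, 0.20, 0.20, 0.30], "conf": 0.80},
    {"bbox": [0.105, 0.20, 0.20, 0.30], "conf": 0.40},  # near-duplicate of the icon above
    {"bbox": [0.70, 0.20, 0.80, 0.30], "conf": 0.20},   # below 0.3
]
listing = build_listing(text_hits, icon_hits)
for entry in listing:
    print(entry["index"], entry["kind"], repr(entry["label"]), entry["bbox"])
```

Sorting on the exact top edge is the simplest reading of "top-to-bottom then left-to-right"; a real implementation might group boxes into rows first so that elements on the same visual line sort by their left edge.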
A bbox (“bounding box”) is the rectangle around one element, given as four numbers each from
0 to 1 — fractions of the screen’s width and height, never pixels.
(0,0) ┌───────────────────────┐
      │                       │   [0.51, 0.55, 0.59, 0.63]
      │            ┌──────┐   │   ← left=0.51 (51% across)
      │            │ Set… │   │     top=0.55 (55% down)
      │            └──────┘   │     right=0.59, bottom=0.63
      │                       │     center = (0.55, 0.59)
      └───────────────────────┘ (1,1)
Using fractions instead of pixels is what makes the listing portable. The same [0.51, 0.55, 0.59, 0.63] points at the same button whether the underlying frame is a 1024-px camera crop or a
1170-px phone screenshot — and whether your phone is a small SE or a large Pro Max. The brain
reasons in one clean 0–1 space; the server handles every conversion to and from pixels and
millimetres (that’s calibration’s job).
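To make the portability concrete, here is a minimal sketch of the fraction-to-pixel conversion. The function names and the frame heights (2216 and 2532) are illustrative assumptions; the source only mentions a 1024-px camera crop and a 1170-px screenshot, and the real conversion lives in the server.

```python
def bbox_to_pixels(bbox, width, height):
    """Scale a fractional [left, top, right, bottom] box to whole pixels."""
    left, top, right, bottom = bbox
    return (
        round(left * width),
        round(top * height),
        round(right * width),
        round(bottom * height),
    )


def pixels_to_bbox(box, width, height):
    """Inverse: turn a pixel box back into fractions of the frame."""
    left, top, right, bottom = box
    return (left / width, top / height, right / width, bottom / height)


bbox = [0.51, 0.55, 0.59, 0.63]

# The same fractional box lands on different pixels in each frame,
# yet it marks the same region of the screen in both.
print(bbox_to_pixels(bbox, 1024, 2216))  # (522, 1219, 604, 1396)
print(bbox_to_pixels(bbox, 1170, 2532))  # (597, 1393, 690, 1595)
```

Rounding to whole pixels loses a fraction of a pixel on each edge, which is harmless here because the box is used for pointing rather than for exact measurement.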
When the brain wants to act, it names a bbox and a gesture — tap [0.51, 0.55, 0.59, 0.63]. The
server takes the center of that box, ((left+right)/2, (top+bottom)/2), and aims there. So a
loose box is fine as long as its middle sits on the target; you’re pointing, not tracing.
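The aiming rule is small enough to show directly. In this sketch the function name `bbox_center` and the looser box are made up for illustration; the formula is the one given above.

```python
def bbox_center(bbox):
    """Return the (x, y) midpoint of a [left, top, right, bottom] box."""
    left, top, right, bottom = bbox
    return ((left + right) / 2, (top + bottom) / 2)


tight = [0.51, 0.55, 0.59, 0.63]
loose = [0.45, 0.50, 0.65, 0.68]  # a sloppier box around the same button

# Both boxes share a midpoint of roughly (0.55, 0.59), so a tap aimed at
# either one lands on the same spot.
print(bbox_center(tight))
print(bbox_center(loose))
```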
The listing format never changes — but it can be built from two different images, and choosing
between them is the one real trade-off in seeing.
| | peek | screenshot |
|---|---|---|
| Image source | the overhead camera | the phone’s own capture |
| How | snapshot → crop to screen → detect | tap AssistiveTouch → iOS Shortcut uploads pixels → detect |
| Speed | ~4 s | ~12 s |
| Sharpness | good for most targets | pixel-perfect |
| Side effects | none | apps can notice a screenshot (share sheets, watermarks) |
peek is the default and the workhorse. It parks the stylus out of view, photographs the screen,
crops to just the phone-screen region (cropping gives each icon 2–3× more pixels and sharpens
detection), and runs the same two detectors. If the first frame comes back blurry — a focus or
motion problem — it waits a couple of seconds and grabs one more before giving up.
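The retry-once behaviour can be sketched as follows. Everything here is an assumption made for illustration: the function name, the `capture` and `is_sharp` callables, and the 2-second default standing in for "a couple of seconds". How PhysiClaw actually measures blur is not described on this page.

```python
import time


def capture_sharp_frame(capture, is_sharp, retry_delay_s=2.0, sleep=time.sleep):
    """Grab a frame; if it is blurry, wait and try exactly once more.

    capture   -- hypothetical callable that returns a frame
    is_sharp  -- hypothetical callable that returns True for a usable frame
    Returns the first sharp frame, or None if both attempts were blurry.
    """
    frame = capture()
    if is_sharp(frame):
        return frame
    sleep(retry_delay_s)  # give focus or motion time to settle
    frame = capture()
    return frame if is_sharp(frame) else None


# Demo with stand-in frames: the first is blurry, the second is sharp.
# sleep is replaced with a no-op so the demo does not actually wait.
frames = iter(["blurry", "sharp"])
result = capture_sharp_frame(
    capture=lambda: next(frames),
    is_sharp=lambda f: f == "sharp",
    sleep=lambda seconds: None,
)
print(result)  # sharp
```

Passing `sleep` in as a parameter keeps the sketch testable without real delays.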
screenshot is the escalation. Instead of a camera photo it triggers the phone’s own
screenshot through the bridge: the arm taps
AssistiveTouch, an iOS Shortcut captures and uploads pixel-perfect bytes, and vision runs on
those. The result is sharper but slower, and it’s mutating — taking a screenshot is a real
iOS event some apps react to. So the brain only reaches for it when:
- a target is too small for the camera to resolve (it’s missing from the peek listing),
- glare or motion blur makes the camera frame unreadable, or
- it needs to read fine print the camera can’t make out.
Because a screenshot can shift app state, the rule is always to peek again after a screenshot
before tapping — read the live screen, then act.
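The escalation rule and the peek-again rule can be written down as a small decision function. This is a sketch of the logic as described here, with hypothetical function and parameter names; in practice the brain makes this judgment itself rather than calling a helper like this.

```python
def should_escalate(target_listed, frame_readable, needs_fine_print):
    """True when any of the three screenshot conditions holds."""
    return (not target_listed) or (not frame_readable) or needs_fine_print


def plan_looks(target_listed, frame_readable, needs_fine_print):
    """Return the sequence of looks to take before tapping.

    A screenshot is always followed by a fresh peek, because taking
    the screenshot may itself have changed what is on screen.
    """
    looks = ["peek"]
    if should_escalate(target_listed, frame_readable, needs_fine_print):
        looks += ["screenshot", "peek"]
    return looks


# Target found in a readable peek: one look is enough.
print(plan_looks(target_listed=True, frame_readable=True, needs_fine_print=False))
# ['peek']

# Target missing from the peek listing: escalate, then peek again.
print(plan_looks(target_listed=False, frame_readable=True, needs_fine_print=False))
# ['peek', 'screenshot', 'peek']
```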
Handing the brain a parsed listing instead of an image does three things at once: it grounds every
target in a real on-screen rectangle (no guessing pixel coordinates), it makes “did the screen
change?” a cheap text comparison between two listings, and it keeps the brain’s vocabulary tiny —
pick a box, pick a gesture. That small, stable interface is what lets PhysiClaw recover from
surprises by simply looking again.
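As an illustration of the "cheap text comparison", the sketch below renders each listing to plain text and compares the strings. The function names and the rendering format are assumptions, as is the choice to leave `conf` out and round each bbox to two decimals so that small frame-to-frame jitter is not reported as a change.

```python
def listing_to_text(listing):
    """Render a listing as one line per element: kind, label, rounded bbox."""
    lines = []
    for entry in listing:
        box = ", ".join(f"{value:.2f}" for value in entry["bbox"])
        lines.append(f'{entry["kind"]} {entry["label"]!r} [{box}]')
    return "\n".join(lines)


def screen_changed(before, after):
    """Compare two listings as text; conf is ignored on purpose."""
    return listing_to_text(before) != listing_to_text(after)


before = [
    {"kind": "icon", "label": "", "bbox": [0.10, 0.20, 0.20, 0.30], "conf": 0.80},
    {"kind": "text", "label": "Settings", "bbox": [0.51, 0.55, 0.59, 0.63], "conf": 0.95},
]
# Same screen seen again: only the confidence moved.
same = [dict(before[0], conf=0.78), dict(before[1], conf=0.93)]
# A different screen: the text element now reads "Wi-Fi".
different = [before[0], dict(before[1], label="Wi-Fi")]

print(screen_changed(before, same))       # False
print(screen_changed(before, different))  # True
```

Two-decimal rounding is a blunt tolerance: a box edge that sits near a rounding boundary can still flip between frames, so a sturdier version would compare boxes within an explicit margin.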