Your first task

The hardware is built and calibrated; now you hand the arm a goal in plain English and watch it work. This page connects an external agent — Claude Desktop — to the running server, gives it a one-line task, and shows the real tool trace that follows: the agent looks, picks a box, taps, and looks again.

PhysiClaw speaks MCP (Model Context Protocol) — the standard way an AI client discovers and calls external tools. Any MCP client works; here we use Claude Desktop as the brain.
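To make "discovers and calls" concrete, here is a minimal sketch of the two JSON-RPC 2.0 messages an MCP client sends: `tools/list` to discover tools and `tools/call` to invoke one. The tool names `peek` and `tap` come from this page; the argument key `bbox` is an assumption for illustration, since the real parameter names are defined by the server's tool schema. Claude Desktop builds these messages for you, so nothing here needs to be run to follow the tutorial.

```python
import json
from itertools import count

# JSON-RPC requests need a unique id per request; a simple counter is enough here.
_ids = count(1)


def list_tools_request():
    """Build the MCP request a client uses to discover a server's tools."""
    return {"jsonrpc": "2.0", "id": next(_ids), "method": "tools/list"}


def call_tool_request(name, arguments=None):
    """Build the MCP request a client uses to invoke one tool by name."""
    return {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments or {}},
    }


if __name__ == "__main__":
    print(json.dumps(list_tools_request()))
    print(json.dumps(call_tool_request("peek")))
    # "bbox" is a hypothetical argument name; check the server's tool schema.
    print(json.dumps(call_tool_request("tap", {"bbox": [0.41, 0.55, 0.49, 0.63]})))
```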

  1. Make sure the server is up. In the shell where you ran physiclaw server, you should see the endpoint line:

    PhysiClaw MCP server on http://localhost:8048/mcp
  2. Add PhysiClaw to Claude Desktop’s config. Open claude_desktop_config.json (Settings → Developer → Edit Config) and add the physiclaw server:

    {
      "mcpServers": {
        "physiclaw": {
          "type": "http",
          "url": "http://localhost:8048/mcp"
        }
      }
    }

    This is a streamable-HTTP server — the URL points straight at the running process, so there’s no command to launch and nothing to install client-side.

  3. Restart Claude Desktop. PhysiClaw’s twelve tools — peek, tap, swipe, unlock_phone, and the rest — now show up in the client’s tool list. (Full list: MCP tools.)
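If you would rather script step 2 than edit the file by hand, the merge can be sketched as below. The function name `add_physiclaw` is hypothetical, and the sketch works on the config text only; reading and writing the actual file is left to you because its location depends on your install. The entry it adds is the one shown in step 2, and any servers already in the config are kept.

```python
import json

# The server entry from step 2 of this page.
PHYSICLAW_ENTRY = {"type": "http", "url": "http://localhost:8048/mcp"}


def add_physiclaw(config_text):
    """Return config JSON text with the physiclaw entry merged in.

    An empty or whitespace-only input is treated as an empty config.
    Existing entries under "mcpServers" are preserved.
    """
    config = json.loads(config_text) if config_text.strip() else {}
    servers = config.setdefault("mcpServers", {})
    servers["physiclaw"] = dict(PHYSICLAW_ENTRY)
    return json.dumps(config, indent=2)


if __name__ == "__main__":
    # "other" stands in for a server you may already have configured.
    existing = '{"mcpServers": {"other": {"command": "example"}}}'
    print(add_physiclaw(existing))
```

Paste the printed result back into claude_desktop_config.json, then continue with step 3.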

Talk to the agent the way you’d ask a person — a plain sentence, not coordinates:

Set a 10-minute timer on the phone.

That’s the whole instruction. You don’t tell it where the Clock app is or how to reach the timer tab — the agent figures that out by looking at the screen and deciding, one step at a time.

Behind that one sentence, the agent runs the same see → act loop on every move. It never sends pixels or motor coordinates; it peeks to get a listing of on-screen elements, each with a bbox (a [left, top, right, bottom] rectangle in 0 to 1 screen fractions), then taps the box it wants and peeks again to confirm. Here’s a real trace:

▸ peek()
← [home screen]
    id  kind  label          bbox                      conf
    07  icon  "Clock"        [0.41, 0.55, 0.49, 0.63]  0.97
    08  icon  "Settings"     [0.51, 0.55, 0.59, 0.63]  0.96
    09  icon  "Photos"       [0.61, 0.55, 0.69, 0.63]  0.95
▸ tap([0.41, 0.55, 0.49, 0.63]) # the "Clock" box
← tapped — `peek` to verify and plan the next move
▸ peek()
← [Clock app, World Clock tab]
    id  kind  label          bbox                      conf
    11  text  "World Clock"  [0.05, 0.92, 0.25, 0.98]  0.98
    12  text  "Timer"        [0.74, 0.92, 0.95, 0.98]  0.97
▸ tap([0.74, 0.92, 0.95, 0.98]) # the "Timer" tab
← tapped — `peek` to verify and plan the next move
▸ peek()
← [Timer tab — hour / min / sec wheels, "10" already centered on minutes]
    id  kind  label          bbox                      conf
    21  text  "10"           [0.40, 0.40, 0.50, 0.50]  0.95
    22  text  "Start"        [0.55, 0.78, 0.85, 0.88]  0.98
▸ tap([0.55, 0.78, 0.85, 0.88]) # the green "Start" button
← tapped — `peek` to verify and plan the next move
▸ peek()
← [Timer running — "09:58" counting down, "Cancel" / "Pause" shown]
✓ goal reached: a 10-minute timer is running.

Read it top to bottom and the rhythm is clear: look, pick a box, tap, look again. Every tap is grounded in a bbox the previous peek actually returned, and every tap is followed by a fresh peek to check the result — that re-looking is what lets the agent notice if a tap missed and try again, instead of barreling ahead on a wrong assumption.
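The grounding step, picking a box that the last peek actually returned, can be sketched in a few lines. This is an illustration only: it assumes the listing arrives as text laid out like the trace (id, kind, quoted label, bracketed bbox, confidence), which may not match the real tool's return format, and the names `Element`, `parse_listing`, `find_by_label`, and `center` are hypothetical helpers, not part of PhysiClaw.

```python
import re
from dataclasses import dataclass


@dataclass
class Element:
    id: str
    kind: str
    label: str
    bbox: tuple  # (left, top, right, bottom) in 0 to 1 screen fractions
    conf: float


# One listing row: id, kind, "label", [l, t, r, b], conf.
# The header row has no quoted label, so it does not match and is skipped.
ROW = re.compile(r'^\s*(\S+)\s+(\S+)\s+"([^"]*)"\s+\[([^\]]+)\]\s+([\d.]+)\s*$')


def parse_listing(text):
    """Turn a peek listing (assumed text layout) into Element records."""
    elements = []
    for line in text.splitlines():
        m = ROW.match(line)
        if not m:
            continue
        bbox = tuple(float(v) for v in m.group(4).split(","))
        elements.append(
            Element(m.group(1), m.group(2), m.group(3), bbox, float(m.group(5)))
        )
    return elements


def find_by_label(elements, label):
    """Return the highest-confidence element with this exact label, or None."""
    matches = [e for e in elements if e.label == label]
    return max(matches, key=lambda e: e.conf) if matches else None


def center(bbox):
    """Midpoint of a bbox, still in 0 to 1 screen fractions."""
    left, top, right, bottom = bbox
    return ((left + right) / 2, (top + bottom) / 2)


if __name__ == "__main__":
    home = '''id kind label bbox conf
07 icon "Clock" [0.41, 0.55, 0.49, 0.63] 0.97
08 icon "Settings" [0.51, 0.55, 0.59, 0.63] 0.96'''
    clock = find_by_label(parse_listing(home), "Clock")
    print(clock.bbox, center(clock.bbox))
```

Because `find_by_label` returns None when the label is absent, a caller cannot tap a box that the previous peek did not report, which is the property the trace relies on.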

  • Two identical peek listings in a row → the tap landed on empty space or the box was slightly off. The agent re-aims from the new listing; if you’re watching, that’s normal recovery, not a crash.
  • “phone unlock failed” → the phone re-locked. Set the passcode to 111111 (the throwaway code unlock_phone uses) or disable auto-lock, then ask again.
  • No tools in the client → the server wasn’t running when the client started, or the URL is wrong. Confirm the endpoint line, fix the config, restart the client.
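The recovery described in the first bullet, where two identical listings signal a missed tap, can be sketched as a small retry loop. Everything here is hypothetical scaffolding: `peek`, `tap`, and `choose_bbox` are callables you would supply, and the agent's real decision logic lives in the model rather than in code like this.

```python
def tap_until_changed(peek, tap, choose_bbox, max_attempts=3):
    """Tap, re-peek, and retry while the listing is unchanged.

    peek()               -> current listing (any comparable value)
    tap(bbox)            -> performs the tap
    choose_bbox(listing) -> bbox to aim at, picked from that listing

    Returns (new_listing, attempts_used). Raises RuntimeError if the
    listing never changes within max_attempts taps.
    """
    before = peek()
    for attempt in range(1, max_attempts + 1):
        tap(choose_bbox(before))
        after = peek()
        if after != before:
            return after, attempt  # the screen changed, so the tap landed
        before = after  # identical listing: re-aim from the fresh peek
    raise RuntimeError(f"screen unchanged after {max_attempts} taps")


if __name__ == "__main__":
    # Fake phone for illustration: the first tap misses, the second lands.
    state = {"taps": 0}

    def fake_peek():
        return "home screen" if state["taps"] < 2 else "Clock app"

    def fake_tap(bbox):
        state["taps"] += 1

    listing, attempts = tap_until_changed(
        fake_peek, fake_tap, lambda _: (0.41, 0.55, 0.49, 0.63)
    )
    print(listing, attempts)
```

Capping the attempts keeps a persistently missing tap from looping forever; at that point the problem is more likely one of the other failure modes than aim.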

More failure modes and fixes live in Troubleshooting.