AWE eval

Agent Web Evaluation

A computer-use eval harness for AI agents. Your agent drives one of our sub-sites in a browser; when your harness calls finish, we grade the final state deterministically against a per-eval seed.

How the flow fits together

For every eval you want to run: start a session, drive the sub-site in a browser to satisfy the eval's prompt, then finish to grade. That's the whole orchestration loop.
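The loop described above can be sketched as a small Python function. This is only an illustration: `start`, `drive`, and `finish` are hypothetical callables that your harness would supply (for example, thin wrappers around the HTTP calls and your browser agent), and none of these names come from the AWE API itself.

```python
from typing import Callable, Dict, Iterable, Tuple


def run_evals(
    eval_ids: Iterable[str],
    start: Callable[[str], Tuple[str, str]],  # eval_id -> (session_key, launch_url)
    drive: Callable[[str], None],             # launch_url -> None (agent works here)
    finish: Callable[[str], str],             # session_key -> "pass" or "fail"
) -> Dict[str, str]:
    """Run start -> drive -> finish once per eval and collect the statuses."""
    results: Dict[str, str] = {}
    for eval_id in eval_ids:
        session_key, launch_url = start(eval_id)  # fresh session for this eval
        drive(launch_url)                         # agent uses the browser
        results[eval_id] = finish(session_key)    # harness triggers grading
    return results
```

Injecting the three steps as callables keeps the loop itself free of network and browser code, so it can be exercised with fakes before pointing it at real sessions.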

Sessions are completely isolated — each start mints a fresh seeded world backed by its own Durable Object. Nothing is shared between sessions, so you can fan out and run as many in parallel as you have browsers for.
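Because sessions share nothing, fanning out is mostly a matter of bounding concurrency by the number of browsers you have. A minimal sketch using a thread pool follows; `run_one` is a hypothetical harness function (start a session, drive it, finish, return the status), and `max_browsers` is an assumed knob, not an AWE setting.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def run_in_parallel(
    eval_ids: Iterable[str],
    run_one: Callable[[str], str],
    max_browsers: int = 4,
) -> Dict[str, str]:
    """Run each eval in its own session, at most max_browsers at a time."""
    ids = list(eval_ids)
    with ThreadPoolExecutor(max_workers=max_browsers) as pool:
        # pool.map yields results in input order, so zip pairs id with status.
        return dict(zip(ids, pool.map(run_one, ids)))
```

If `run_one` raises for any eval, the exception surfaces when the results are collected, so a real harness would probably catch errors inside `run_one` and return a sentinel status instead.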

The server has no notion of "the agent is done." Your harness decides when to grade (timeout, max actions, agent signal — your call) and triggers it by hitting finish.
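Since the stop decision lives entirely in your harness, it can help to make the policy explicit. The sketch below combines the three triggers mentioned above (timeout, max actions, agent signal); the class name and the default budgets are assumptions chosen for illustration.

```python
import time
from dataclasses import dataclass, field


@dataclass
class StopPolicy:
    """Harness-side rule for when to stop driving and call finish."""

    max_seconds: float = 300.0  # assumed wall-clock budget per eval
    max_actions: int = 50       # assumed cap on browser actions
    started_at: float = field(default_factory=time.monotonic)
    actions: int = 0

    def record_action(self) -> None:
        """Call once per browser action the agent performs."""
        self.actions += 1

    def should_finish(self, agent_says_done: bool = False) -> bool:
        """True once any of the three triggers has fired."""
        if agent_says_done:
            return True
        if self.actions >= self.max_actions:
            return True
        return time.monotonic() - self.started_at >= self.max_seconds
```

A driving loop would call `record_action()` after each step and check `should_finish(...)` before the next one, then hit finish as soon as it returns true.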

Grading is server-side and deterministic. The response is only pass or fail, never the grader's reasoning, a diff against expected state, or partial credit. To debug a failure, start a new session for the same eval and watch the UI during the run.

Step 1 — Discover the eval catalog

List the sub-sites, then list the evals on each one. Eval ids (todo.complete-today, cal.three-blocks, …) are stable handles you'll use when starting a session. No auth required.

curl https://awe.wtf/api/v1/sub-sites
curl https://awe.wtf/api/v1/sub-sites/todo/evals
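The same two requests can be made from Python with only the standard library. This sketch assumes the endpoints return JSON; the shape of that JSON is not shown in this document, so the code deliberately returns the decoded value without interpreting it. The helper names are hypothetical.

```python
import json
import urllib.request
from urllib.parse import quote

BASE_URL = "https://awe.wtf/api/v1"


def sub_sites_url() -> str:
    """URL that lists the sub-sites."""
    return f"{BASE_URL}/sub-sites"


def evals_url(sub_site: str) -> str:
    """URL that lists the evals on one sub-site."""
    # Percent-encode the name so it is safe to use as a path segment.
    return f"{BASE_URL}/sub-sites/{quote(sub_site, safe='')}/evals"


def get_json(url: str, timeout: float = 10.0):
    """GET a URL and decode the body as JSON (assumed format). No auth needed."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

For example, `get_json(evals_url("todo"))` would fetch the evals for the todo sub-site.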

Step 2 — Start a session

Each call creates a fresh isolated world for one eval. The response gives you a session_key (save it — you'll need it to grade) and a launch_url containing a single-use ?lk=… token valid for 10 minutes.

curl -XPOST https://awe.wtf/api/v1/evals/todo.complete-today/start
# {
#   "session_key": "eses_…",
#   "session_id": "",
#   "launch_url": "https://awe.wtf/todo?lk=elaunch_…",
#   "eval_id": "todo.complete-today",
#   "sub_site": "todo"
# }
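A harness will usually want the start response as a typed value rather than a raw dict. The following is a minimal sketch based on the response fields shown above; the `Session` class and function names are hypothetical, and error handling is left out.

```python
import json
import urllib.request
from dataclasses import dataclass
from urllib.parse import quote


@dataclass(frozen=True)
class Session:
    session_key: str  # keep this in the harness; needed to grade
    launch_url: str   # open exactly once in the agent's browser
    eval_id: str
    sub_site: str


def parse_start_response(body: dict) -> Session:
    """Pick the fields the harness needs out of the start response."""
    return Session(
        session_key=body["session_key"],
        launch_url=body["launch_url"],
        eval_id=body["eval_id"],
        sub_site=body["sub_site"],
    )


def start_session(eval_id: str, timeout: float = 10.0) -> Session:
    """POST to the start endpoint for one eval and parse the JSON reply."""
    url = f"https://awe.wtf/api/v1/evals/{quote(eval_id, safe='')}/start"
    req = urllib.request.Request(url, method="POST")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return parse_start_response(json.load(resp))
```

Calling `start_session("todo.complete-today")` once per eval matches the one-session-per-eval rule.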

One session per eval. Don't reuse a session_key across evals — every eval needs its own start call. Sessions are cheap and fully isolated.

Step 3 — Drive the eval in a browser

Point your browser-driving agent at launch_url. On first load the server exchanges the ?lk= token for an HttpOnly s= session cookie and responds with a 302 redirect to the same URL without the token, so the launch key doesn't sit in history or referer headers. From there, drive the sub-site UI like a human: click buttons, fill forms, submit. The prompt the agent must satisfy is the prompt field returned for that eval in step 1.
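To sanity-check that the exchange happened, a harness can compute the token-free URL it expects the browser to land on after the redirect. This sketch only mirrors the behavior described above from the client side; it is not the server's implementation, and the function name is hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_launch_token(url: str, param: str = "lk") -> str:
    """Return url with the launch-token query parameter removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != param
    ]
    # Rebuild the URL with the remaining query parameters, if any.
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

After navigation, comparing the browser's current URL with `strip_launch_token(launch_url)` is one way to confirm the token is no longer in the address bar.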

Launch URLs are single-use. Once exchanged, hitting them again returns 401. Open them once in the agent's browser; don't re-open or share them across tabs.
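A harness can enforce the single-use rule on its own side so that a retry bug shows up as a clear error rather than a 401 in the browser. The wrapper below is a hypothetical helper: it hands the URL out once and also refuses after the 10-minute validity window from step 2, measured from when the harness received the URL.

```python
import time
from typing import Optional

LAUNCH_TTL_SECONDS = 600.0  # launch token validity: 10 minutes


class OneShotLaunchUrl:
    """Hands out a launch URL at most once, and only while it is still valid."""

    def __init__(self, url: str, issued_at: Optional[float] = None) -> None:
        self._url = url
        self._issued_at = time.monotonic() if issued_at is None else issued_at
        self._used = False

    def take(self, now: Optional[float] = None) -> str:
        """Return the URL the first time; raise on reuse or after expiry."""
        now = time.monotonic() if now is None else now
        if self._used:
            raise RuntimeError("launch URL already opened; start a new session")
        if now - self._issued_at > LAUNCH_TTL_SECONDS:
            raise RuntimeError("launch URL expired; start a new session")
        self._used = True
        return self._url
```

The `issued_at` and `now` parameters exist only so the timing can be controlled in tests; in normal use both default to the monotonic clock.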

Key separation. The session_key stays with your harness; only the resulting HttpOnly cookie ever lives in the browser. The agent driving the UI never needs to know — or see — the key.

Step 4 — Finish & grade

When the agent is done, POST to /api/v1/sessions/finish with the session_key as a Bearer token. The server runs the eval's deterministic grader against final state and returns pass or fail. Idempotent: repeat calls return the cached status without re-grading.

curl -XPOST -H "Authorization: Bearer $SESSION_KEY" \
  https://awe.wtf/api/v1/sessions/finish
# {
#   "session_id": "",
#   "eval_id": "todo.complete-today",
#   "sub_site": "todo",
#   "status": "pass"
# }
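The finish call translates to Python along the same lines. This is a sketch using only the standard library; the helper names are hypothetical, and treating any status other than pass or fail as an error is a defensive assumption rather than documented behavior.

```python
import json
import urllib.request

FINISH_URL = "https://awe.wtf/api/v1/sessions/finish"


def build_finish_request(session_key: str) -> urllib.request.Request:
    """POST request carrying the session key as a Bearer token."""
    return urllib.request.Request(
        FINISH_URL,
        method="POST",
        headers={"Authorization": f"Bearer {session_key}"},
    )


def passed(body: dict) -> bool:
    """Map the status field of the finish response to a boolean."""
    status = body["status"]
    if status not in ("pass", "fail"):
        raise ValueError(f"unexpected status: {status!r}")
    return status == "pass"


def finish(session_key: str, timeout: float = 30.0) -> bool:
    """Trigger grading for a session and return True on pass."""
    req = build_finish_request(session_key)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return passed(json.load(resp))
```

Because the endpoint is idempotent, calling `finish(session_key)` again after a network hiccup should return the same cached status rather than grading a second time.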

Errors and retry behavior

Reference — full API surface