AWE eval — agent mode

How the flow fits together

For every eval you want to run: start a session, drive the sub-site in a browser to satisfy the eval's prompt, then call finish to grade. That's the whole orchestration loop.

Sessions are completely isolated — each start mints a fresh seeded world backed by its own Durable Object. Nothing is shared between sessions, so you can fan out and run as many in parallel as you have browsers for.
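
The loop and the parallel fan-out can be sketched in a few lines of Python. This is a minimal sketch, not a reference harness: run_eval, run_many, and the three injected callables (start, drive, finish) are hypothetical names, and the callables are assumed to wrap the HTTP calls and the browser agent described in the steps below.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def run_eval(
    eval_id: str,
    start: Callable[[str], dict],
    drive: Callable[[str], None],
    finish: Callable[[str], dict],
) -> str:
    """Run one eval end to end and return its status ("pass" or "fail")."""
    session = start(eval_id)                  # mint a fresh isolated world
    drive(session["launch_url"])              # agent works in the browser until the harness stops it
    result = finish(session["session_key"])   # server-side grading
    return result["status"]


def run_many(
    eval_ids: Iterable[str],
    start: Callable[[str], dict],
    drive: Callable[[str], None],
    finish: Callable[[str], dict],
    max_workers: int = 4,
) -> Dict[str, str]:
    """Run several evals in parallel, one session each, and map eval id to status."""
    ids = list(eval_ids)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        statuses = pool.map(lambda e: run_eval(e, start, drive, finish), ids)
        return dict(zip(ids, statuses))
```

Because sessions share nothing, max_workers is bounded only by how many browsers the harness can run at once.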

The server has no notion of "the agent is done." Your harness decides when to grade (timeout, max actions, agent signal — your call) and triggers it by hitting finish.
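
One way a harness might combine the three stop signals named above (timeout, max actions, agent signal) is a small policy object that it checks after every agent action. The sketch below is an assumption about harness design, not part of the AWE API; StopPolicy and its fields are hypothetical names.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class StopPolicy:
    """Harness-side rule for deciding when to call finish."""
    max_seconds: float
    max_actions: int
    started_at: float = field(default_factory=time.monotonic)
    actions: int = 0
    agent_done: bool = False  # set by the harness when the agent signals completion

    def record_action(self) -> None:
        self.actions += 1

    def should_finish(self, now: Optional[float] = None) -> bool:
        """True once any of the three stop conditions holds."""
        now = time.monotonic() if now is None else now
        return (
            self.agent_done
            or self.actions >= self.max_actions
            or now - self.started_at >= self.max_seconds
        )
```

The harness would call record_action() after each click or keystroke and trigger finish as soon as should_finish() returns True.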

Grading is server-side and deterministic. The response is only pass or fail; it never includes the grader's reasoning, a diff against expected state, or partial credit. To debug a failure, start a new session for the same eval and watch the UI during the run.

Step 1 — Discover the eval catalog

List the sub-sites, then list the evals on each one. Eval ids (todo.complete-today, cal.three-blocks, …) are stable handles you'll use when starting a session. No auth required.

curl https://awe.wtf/api/v1/sub-sites

curl https://awe.wtf/api/v1/sub-sites/todo/evals
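
The same two calls from a Python harness, as a minimal stdlib sketch. The response shape of these endpoints is not shown in this document, so get_json returns the decoded JSON unmodified rather than assuming field names; BASE_URL, sub_sites_url, evals_url, and get_json are hypothetical helper names.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "https://awe.wtf/api/v1"


def sub_sites_url(base_url: str = BASE_URL) -> str:
    """URL that lists the sub-sites."""
    return f"{base_url}/sub-sites"


def evals_url(sub_site: str, base_url: str = BASE_URL) -> str:
    """URL that lists the evals (with prompts) of one sub-site."""
    return f"{base_url}/sub-sites/{quote(sub_site, safe='')}/evals"


def get_json(url: str, timeout: float = 30.0):
    """GET a catalog URL and decode the JSON body; no auth header is sent."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

For example, get_json(evals_url("todo")) fetches the same listing as the second curl command above.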

Step 2 — Start a session

Each call creates a fresh isolated world for one eval. The response gives you a session_key (save it — you'll need it to grade) and a launch_url containing a single-use ?lk=… token valid for 10 minutes.

curl -XPOST https://awe.wtf/api/v1/evals/todo.complete-today/start

# {
#   "session_key": "eses_…",
#   "session_id":  "",
#   "launch_url":  "https://awe.wtf/todo?lk=elaunch_…",
#   "eval_id":     "todo.complete-today",
#   "sub_site":    "todo"
# }

One session per eval. Don't reuse a session_key across evals — every eval needs its own start call. Sessions are cheap and fully isolated.
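
A harness would typically wrap the start call and keep the parsed response together as one record per eval. The sketch below assumes only the five response fields shown above; Session, parse_start_response, and start_session are hypothetical names, and only the Python standard library is used.

```python
import json
from dataclasses import dataclass
from urllib.parse import quote
from urllib.request import Request, urlopen

BASE_URL = "https://awe.wtf/api/v1"


@dataclass(frozen=True)
class Session:
    session_key: str  # stays in the harness; needed later to grade
    session_id: str
    launch_url: str   # single-use; hand this to the browser once
    eval_id: str
    sub_site: str


def parse_start_response(body: str) -> Session:
    """Turn the JSON body of a start call into a Session record."""
    data = json.loads(body)
    return Session(
        session_key=data["session_key"],
        session_id=data["session_id"],
        launch_url=data["launch_url"],
        eval_id=data["eval_id"],
        sub_site=data["sub_site"],
    )


def start_session(eval_id: str, base_url: str = BASE_URL, timeout: float = 30.0) -> Session:
    """POST to the start endpoint; every call mints a new isolated session."""
    url = f"{base_url}/evals/{quote(eval_id, safe='')}/start"
    with urlopen(Request(url, method="POST"), timeout=timeout) as resp:
        return parse_start_response(resp.read().decode("utf-8"))
```

Making the record immutable (frozen=True) is a small guard against accidentally swapping one eval's session_key for another's.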

Step 3 — Drive the eval in a browser

Point your browser-driving agent at launch_url. On first load the server exchanges the ?lk= token for an HttpOnly s= session cookie and 302s to the same URL without the token, so the launch key doesn't sit in history or referer headers. From there, drive the sub-site UI like a human: click buttons, fill forms, submit. The prompt the agent must satisfy is the eval's prompt field from the evals listing in step 1.

Launch URLs are single-use. Once exchanged, hitting them again returns 401. Open them once in the agent's browser; don't re-open or share them across tabs.

Key separation. The session_key stays with your harness; only the resulting HttpOnly cookie ever lives in the browser. The agent driving the UI never needs to know — or see — the key.
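
If the harness wants to confirm that the token exchange happened, it can compare the launch URL with the URL the browser reports right after the first load. The sketch below assumes only what is described above (the redirect goes to the same URL minus the ?lk= parameter); has_launch_token and exchange_completed are hypothetical helper names, and how the final URL is obtained depends on your browser driver.

```python
from urllib.parse import parse_qs, urlsplit


def has_launch_token(url: str) -> bool:
    """True if the URL still carries an lk query parameter."""
    return "lk" in parse_qs(urlsplit(url).query, keep_blank_values=True)


def exchange_completed(launch_url: str, final_url: str) -> bool:
    """True if final_url is the launch URL's scheme, host, and path with no lk token."""
    a, b = urlsplit(launch_url), urlsplit(final_url)
    same_page = (a.scheme, a.netloc, a.path) == (b.scheme, b.netloc, b.path)
    return same_page and not has_launch_token(final_url)
```

Run this check immediately after the first navigation; once the agent starts clicking around, the path may legitimately change.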

Step 4 — Finish & grade

When your harness decides the run is over, POST to /api/v1/sessions/finish with the session_key as a Bearer token. The server runs the eval's deterministic grader against the final state and returns pass or fail. The call is idempotent: repeat calls return the cached status without re-grading.

curl -XPOST -H "Authorization: Bearer $SESSION_KEY" \
  https://awe.wtf/api/v1/sessions/finish

# {
#   "session_id": "",
#   "eval_id":    "todo.complete-today",
#   "sub_site":   "todo",
#   "status":     "pass"
# }
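
The same call from Python, as a minimal stdlib sketch. It assumes only the endpoint, the Bearer header, and the status field shown above; build_finish_request and finish_session are hypothetical names.

```python
import json
from urllib.request import Request, urlopen

BASE_URL = "https://awe.wtf/api/v1"


def build_finish_request(session_key: str, base_url: str = BASE_URL) -> Request:
    """Build the POST that asks the server to grade this session."""
    return Request(
        f"{base_url}/sessions/finish",
        method="POST",
        headers={"Authorization": f"Bearer {session_key}"},
    )


def finish_session(session_key: str, base_url: str = BASE_URL, timeout: float = 30.0) -> bool:
    """Grade the session and return True on "pass", False otherwise."""
    with urlopen(build_finish_request(session_key, base_url), timeout=timeout) as resp:
        data = json.load(resp)
    return data["status"] == "pass"
```

Since repeat calls return the cached status, a harness can wrap finish_session in a simple retry on network errors without risking a second grading run.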

Errors and retry behavior

start returns 404 if the eval_id is unknown — a typo, or an eval that's been removed from the registry.
Hitting a launch_url twice, or more than 10 minutes after it was minted, returns 401. There is no recovery for a stale launch URL — just call start again to mint a fresh session and URL.
finish returns 401 on an unknown session_key.
finish is idempotent. Safe to retry on a network error. Re-calling it on an already-graded session returns the cached status without re-running the grader.
After finish, any further /ui/* mutation in the browser is rejected with 409. A confused agent that keeps clicking after grading cannot corrupt the result.
All errors share the shape {"error": {"code": "…", "message": "…"}}. The code values are AUTH_INVALID, NOT_FOUND, and INTERNAL.
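
A harness can turn the rules above into a small lookup. The sketch below parses the shared error shape and maps each documented endpoint and status pair to a next step; the step labels, the endpoint labels ("start", "launch", "finish", "ui"), and the function names are hypothetical, and any combination not listed above is reported as "unhandled" rather than guessed at.

```python
import json
from typing import Optional, Tuple


def parse_error(body: str) -> Optional[Tuple[str, str]]:
    """Return (code, message) from an error body, or None if it has another shape."""
    try:
        err = json.loads(body)["error"]
        return err["code"], err["message"]
    except (ValueError, KeyError, TypeError):
        return None


def next_step(endpoint: str, http_status: int) -> str:
    """Map a documented (endpoint, HTTP status) pair to what the harness should do."""
    documented = {
        ("start", 404): "fix-eval-id",         # unknown or removed eval id; retrying won't help
        ("launch", 401): "start-new-session",  # launch URL reused or expired
        ("finish", 401): "check-session-key",  # unknown session_key
        ("ui", 409): "already-graded",         # mutation after finish was rejected
    }
    return documented.get((endpoint, http_status), "unhandled")
```

Network failures on finish never reach this table: they raise before any status is returned and, as noted above, are safe to retry.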

Reference — full API surface

GET /api/v1/sub-sites — list of sub-sites
GET /api/v1/sub-sites/{sub-site}/evals — list of evals with prompts
POST /api/v1/evals/{eval_id}/start → {session_key, session_id, launch_url, eval_id, sub_site}
POST /api/v1/sessions/finish — Bearer SESSION_KEY → {session_id, eval_id, sub_site, status}, idempotent