<SYSTEM>This is the full developer documentation for Autonoma</SYSTEM>

# Introduction

> Learn how to integrate Autonoma's agentic end-to-end testing platform into your application.

Autonoma is an agentic end-to-end testing platform. Users create and run automated tests for web, iOS, and Android applications using natural language. The system executes tests on real devices and emulators, with AI models handling element selection, assertions, self-healing, and agentic decision-making.

## Using these docs with AI

These docs are available as plain text for LLMs. Pass the link below to your coding agent (Claude Code, Cursor, Copilot, etc.) so it can read the full documentation in context:

```plaintext
https://docs.agent.autonoma.app/llms.txt
```

The file links to every page individually, so the model can fetch only what it needs. A single [complete file](/llms-full.txt) with all pages is also available.

## Getting started

There are two paths depending on where you are:

**Analyze and test your app**

Claude Code plugin that reads your codebase, finds where bugs hide, and generates a full E2E test suite. Install once, run `/autonoma-test-planner:generate-tests`.

[Start the Test Planner →](/test-planner/)

**Set up your backend**

Connect your app so tests always start with clean, isolated data. One endpoint, automatic teardown.

[Read the guide →](/guides/environment-factory/)

## How Autonoma runs tests

Before each test run:

1. Autonoma calls your endpoint with `action: "up"` and a scenario name
2. Your endpoint creates isolated test data and returns auth credentials
3. Autonoma uses those credentials to log in and run the test
4. After the test, Autonoma calls `action: "down"` with signed refs
5. Your endpoint verifies the signature and deletes only the data it created

The signed refs mechanism ensures teardown can never delete data it didn’t create, even if someone gains access to your endpoint.
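The signing and verification in steps 4 and 5 can be sketched as follows. This is a minimal illustration that assumes an HMAC-SHA256 signature over a JSON payload; the token layout and the function names `sign_refs` and `verify_refs` are hypothetical and are not the SDK's actual format or API.

```python
import base64
import hashlib
import hmac
import json


def sign_refs(refs: dict, signing_secret: str) -> str:
    """Serialize the created record refs and append an HMAC-SHA256 signature.

    The "<base64 payload>.<hex signature>" layout is an assumption for illustration.
    """
    payload = json.dumps(refs, sort_keys=True, separators=(",", ":")).encode()
    signature = hmac.new(signing_secret.encode(), payload, hashlib.sha256).hexdigest()
    return base64.urlsafe_b64encode(payload).decode() + "." + signature


def verify_refs(token: str, signing_secret: str) -> dict:
    """Return the refs if the signature matches; raise ValueError otherwise."""
    encoded, _, signature = token.rpartition(".")
    payload = base64.urlsafe_b64decode(encoded.encode())
    expected = hmac.new(signing_secret.encode(), payload, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(expected, signature):
        raise ValueError("refs token signature mismatch; refusing teardown")
    return json.loads(payload)
```

Because only the backend holds the signing secret in this sketch, a caller who reaches the endpoint cannot forge a token that names records the `up` call never created.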

## Framework examples

**TypeScript**

Express + Prisma, Next.js + Drizzle.

[See examples →](/examples/)

**Python**

FastAPI, Flask, and Django with SQLAlchemy.

[See examples →](/examples/#python)

**Elixir**

Phoenix + Ecto implementation.

[See examples →](/examples/#elixir)

**More Languages**

Java, Ruby, Rust, Go, and PHP.

[See all examples →](/examples/)

## Contributing

Want to run Autonoma locally or contribute to the project? Start here:

**Development Setup**

Clone the repo, install dependencies, configure environment variables, and get the platform running locally.

[Get started →](/development/setup/)

**Architecture Overview**

Understand the monorepo structure, how apps and packages connect, and the key design decisions behind the platform.

[Read the overview →](/development/architecture/)

**Code Conventions**

ESM-only, strict TypeScript, logging patterns, and all the rules that keep the codebase consistent.

[See conventions →](/development/conventions/)

**Architecture Deep Dives**

Detailed documentation on the execution agent core and AI primitives package.

[Explore →](/architecture/execution-agent/)

# Step 1: Generate Knowledge Base

> Analyze your codebase to produce AUTONOMA.md and features.json. This is the first step of the pipeline and feeds every step that follows.

The knowledge base generator is the **first step** of the pipeline. It reads your frontend codebase and produces a user-perspective guide to every important page, flow, and interaction in your application.

Because this runs first, every subsequent step (entity audit, scenarios, environment factory, validation, and test generation) builds on the understanding captured here. Getting the core flows right in this step is the single highest-leverage thing you can do for the quality of the final suite.

## Prerequisites

* Your application codebase must be available in the workspace.
* These environment variables must be set in the Claude Code session: `AUTONOMA_API_KEY`, `AUTONOMA_PROJECT_ID`, `AUTONOMA_API_URL`.

## What this produces

* `autonoma/AUTONOMA.md`
* `autonoma/features.json`

## What to review

The most important output is the **core flows** table. Core flows are the workflows that receive the heaviest test coverage later in the pipeline.

When reviewing:

* check that the product areas are named the way your team names them
* confirm the true core flows are marked as core
* make sure obvious high-value flows were not missed

If the core flows are wrong, the rest of the suite will be prioritized incorrectly.

## The prompt

# Knowledge Base Generator

You generate a structured knowledge base for a codebase. Your output MUST be written to `autonoma/AUTONOMA.md` with YAML frontmatter.

## Instructions

1. All Autonoma documentation MUST be fetched via `curl` in the Bash tool. Do NOT use WebFetch. Do NOT write any URL yourself. The docs base URL lives only in `autonoma/.docs-url`, written by the orchestrator before any subagent runs.

   ```bash
   curl -sSfL "$(cat autonoma/.docs-url)/llms/<path>"
   ```

   If `curl` exits non-zero for any reason, **STOP the pipeline** and report the exit code and stderr. Do not invent a URL.

2. Fetch the latest knowledge base generation instructions:

   ```bash
   curl -sSfL "$(cat autonoma/.docs-url)/llms/test-planner/step-1-knowledge-base.txt"
   ```

3. Create the output directory:

   ```bash
   mkdir -p autonoma
   ```

4. Follow the fetched instructions to analyze the codebase — discover the application, map pages and flows, identify core workflows.

5. Write `autonoma/AUTONOMA.md`.

6. Write `autonoma/features.json` — a machine-readable inventory of every feature discovered.

## Output format

`autonoma/AUTONOMA.md` MUST start with YAML frontmatter:

```yaml
---
app_name: "Name of the application"
app_description: "2-4 sentences describing what the application does, who uses it, and its primary purpose."
core_flows:
  - feature: "Feature Name"
    description: "What this feature/area does"
    core: true
  - feature: "Settings"
    description: "User and org settings management"
    core: false
feature_count: 12
---
```

### What makes a flow “core”

A flow is core if the answer to this question is yes: “If this flow broke silently, would users immediately notice and stop using the product?” Typically 2-4 flows are core, and together they receive 50-60% of test coverage.

### features.json

```json
{
  "features": [
    { "name": "Login", "type": "page", "path": "/login", "core": true },
    { "name": "Dashboard", "type": "page", "path": "/dashboard", "core": true }
  ],
  "total_features": 2,
  "total_routes": 2,
  "total_api_routes": 0
}
```

`type` is one of `page`, `api`, `flow`, `component`, `modal`, `settings`. `core` must match `core_flows` in the AUTONOMA.md frontmatter.
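A quick consistency check for the inventory could look like the sketch below. It operates on an already-parsed `features.json` document; the function name `check_features` is hypothetical, and the rule that `total_features` equals the length of the `features` list is an assumption drawn from the example above.

```python
ALLOWED_TYPES = {"page", "api", "flow", "component", "modal", "settings"}


def check_features(inventory: dict) -> list[str]:
    """Return a list of problems found in a parsed features.json document."""
    problems = []
    features = inventory.get("features", [])
    for feature in features:
        name = feature.get("name")
        if feature.get("type") not in ALLOWED_TYPES:
            problems.append(f"{name}: unknown type {feature.get('type')!r}")
        if not isinstance(feature.get("core"), bool):
            problems.append(f"{name}: core must be true or false")
    # Assumption: total_features counts the entries in the features list.
    if inventory.get("total_features") != len(features):
        problems.append("total_features does not match the number of features")
    return problems
```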

## Validation

A hook script validates your output on every write. If validation fails, fix the issue and rewrite.

Checks:

* File starts with `---` (YAML frontmatter)
* Frontmatter contains all required fields
* `core_flows` is a non-empty list with feature/description/core fields
* At least one flow has `core: true`
* `feature_count` is a positive integer
* `app_description` is at least 20 characters
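The checks above can be approximated locally before writing the file. The sketch below is not the hook script itself; it assumes the frontmatter has already been parsed into a dict, and it treats the four keys from the example frontmatter as the required fields, which is an assumption.

```python
# Assumption: the "required fields" are the four keys shown in the example frontmatter.
REQUIRED_FIELDS = ("app_name", "app_description", "core_flows", "feature_count")
FLOW_KEYS = {"feature", "description", "core"}


def validate_frontmatter(text: str, meta: dict) -> list[str]:
    """Mirror the listed checks against the raw file text and its parsed frontmatter."""
    errors = []
    if not text.startswith("---"):
        errors.append("file must start with '---'")
    for field in REQUIRED_FIELDS:
        if field not in meta:
            errors.append(f"missing field: {field}")
    flows = meta.get("core_flows")
    if not isinstance(flows, list) or not flows:
        errors.append("core_flows must be a non-empty list")
    else:
        for flow in flows:
            if not isinstance(flow, dict) or not FLOW_KEYS <= set(flow):
                errors.append("each core flow needs feature, description, and core")
        if not any(isinstance(f, dict) and f.get("core") is True for f in flows):
            errors.append("at least one flow must have core: true")
    count = meta.get("feature_count")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(count, bool) or not isinstance(count, int) or count <= 0:
        errors.append("feature_count must be a positive integer")
    if len(str(meta.get("app_description", ""))) < 20:
        errors.append("app_description must be at least 20 characters")
    return errors
```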

## Important

* Use subagents for parallel exploration of the codebase
* Treat README files as hints, not ground truth — the codebase is the source of truth
* Document what you find, don’t invent features
* Use the UI vocabulary — the same names the app uses

# Step 2: Entity Creation Audit

> Describe every way each database model is created so the Environment Factory can plan factories, scenario trees, and teardown correctly.

The entity creation audit agent reads your codebase and, for every database model, answers **two orthogonal questions**:

1. **`independently_created`** — does the codebase have an exported function that creates this model on its own?
2. **`created_by`** — which other models’ creation flows produce this model as a side effect?

Both facts can be true simultaneously. A model can have its own `<Child>Service.create()` **and** be minted inline inside a parent’s `<Root>Service.createRoot()` transaction as a required child. The audit captures both; downstream steps pick the right path per use case.

This step runs **after** the knowledge base is generated (Step 1) and **before** scenario generation (Step 3). Its output feeds directly into Step 4 (Implement & Validate), so the generator knows which models need factories, which come along as byproducts, and how to tear them down.

## Why two fields instead of one

Earlier versions of this audit used a single `has_creation_code` boolean. That was too coarse. Some models are **dual**: they have a standalone creation path *and* are produced inline by a parent’s transaction. A single flag forces the downstream pipeline to pretend one of those truths doesn’t exist — either fabricating a factory for a dependent that was never meant to be created standalone, or ignoring the standalone path when it’s the one a scenario actually wants to exercise.

Two orthogonal fields capture all four states cleanly:

| `independently_created` | `created_by` | Meaning                                                                                       |
| ----------------------- | ------------ | --------------------------------------------------------------------------------------------- |
| `true`                  | `[]`         | Pure root — only standalone creation exists.                                                  |
| `true`                  | non-empty    | Dual — has a standalone path AND is produced by at least one owner.                           |
| `false`                 | non-empty    | Pure dependent — only reachable via an owner’s creation flow.                                 |
| `false`                 | `[]`         | **Invalid** — unreachable model (either you missed the owner, or the model is never created). |
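The four-state matrix maps directly onto a small classification function. The sketch below is illustrative only; `classify_model` is a hypothetical name, not part of the plugin.

```python
def classify_model(independently_created: bool, created_by: list) -> str:
    """Map the two audit fields onto the four states in the table above."""
    if independently_created:
        # A standalone path exists; owners (if any) make the model dual.
        return "dual" if created_by else "pure root"
    if created_by:
        return "pure dependent"
    # No standalone path and no owner: the model cannot be reached at all.
    raise ValueError("invalid: unreachable model (missed owner, or never created)")
```

Raising on the fourth state, rather than returning a label, keeps an unreachable model from slipping through to later steps unnoticed.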

## What the agent records

### `independently_created: true`

These models get their own factory in the Environment Factory handler. The audit records:

* `creation_file` — path to the file with the creation logic
* `creation_function` — exported function name
* `side_effects` — observed side effects (password hashing, slug generation, sibling inserts, external calls)
* `needs_extraction` — optional flag set when the only creation path is inline in a route handler or framework-hook closure; the env-factory agent will lift it into a named export before wiring the factory

### `created_by: [{owner, via, why}]`

For every sibling row an owner mints inline, the dependent gets a `created_by` entry pointing back. The `why` field is prose — it flows verbatim into scenario guidance and env-factory teardown hints, so it needs to be specific:

* ✅ “Every new `<Root>` needs a default child created inline in the same transaction so the UI has something to read from the start.”
* ❌ “Creates a child.”

## Factory vs raw SQL

The SDK creates test data two ways per model:

* **Factory** — calls your application’s creation code, preserving every side effect
* **Raw SQL INSERT** — fast, skips application logic

Rule: every `independently_created: true` model gets a factory. Every pure dependent falls back to raw SQL and is torn down via its owner’s factory (see [Environment Factory guide](/guides/environment-factory/)).

## Prerequisites

* `autonoma/AUTONOMA.md` must exist (output from [Step 1](/test-planner/step-1-knowledge-base/))
* Access to your backend codebase — the agent needs to read service files, repositories, and route handlers

## What this produces

`autonoma/entity-audit.md` — a structured audit of every database model with YAML frontmatter:

```yaml
---
model_count: 5
factory_count: 3
models:
  - name: <Root>
    independently_created: true
    creation_file: src/<domain>/<domain>.service.ts
    creation_function: <Root>Service.create
    side_effects:
      - mints a default <Child> in the same transaction
      - seeds an <OnboardingLike> row
    created_by: []


  - name: <User>
    independently_created: true
    creation_file: src/<auth-module>/<auth-module>.ts
    creation_function: <AuthProvider>.databaseHooks.user.create
    side_effects:
      - hashes password
      - creates default <Tenant> + <Member> rows
    created_by: []


  - name: <Child>
    independently_created: true
    creation_file: src/<child-domain>/<child-domain>.service.ts
    creation_function: <Child>Service.create
    side_effects: []
    created_by:
      - owner: <Root>
        via: <Root>Service.create
        why: "Every new <Root> needs a default <Child>, created inline in the same transaction."


  - name: <PureDependent>
    independently_created: false
    created_by:
      - owner: <Root>
        via: <Root>Service.create
        why: "Minted inside the <Root> transaction so downstream features have a row to read."


  - name: <OnboardingLike>
    independently_created: false
    created_by:
      - owner: <Root>
        via: <Root>Service.create
        why: "Seeded with the <Root> row so the onboarding UI has something to advance through."
---
```

The body contains:

* **Roots** — headings for every `independently_created: true` model with file/function and the siblings it mints.
* **Dependents** — a table of every `independently_created: false` model mapping to its owner(s) and the `why`.
* **Dual-creation models** — a call-out listing every model that is both root and dependent, with guidance on when to use each path.

## Review checkpoint

For each `independently_created: true` model:

* Is the identified file/function the one you’d actually call in production?
* Are important side effects missing from the list?
* If `needs_extraction: true`, is that really the only creation path (vs. the agent missing a named service)?

For each `independently_created: false` model:

* Do the `created_by` entries list every owner that mints it? (Multiple owners are fine.)
* Do the `why` entries actually explain the motivation, or are they restating the code?

For each dual model:

* When would a test want the standalone path vs. the via-owner path? That decision drives scenario shape.

If a dependent has no `created_by` entry, **the audit is broken** — either the agent missed a creation path or the model is orphaned in the schema. The audit validator refuses to ship in that state.

## What happens next

In Step 4 (Implement & Validate), the generator:

1. Reads `autonoma/entity-audit.md`
2. Registers one factory per `independently_created: true` model
3. Lets the SDK fall back to raw SQL for every pure dependent
4. Plans teardown per root using the `created_by` graph (see [Environment Factory guide](/guides/environment-factory/))

You don’t manually split “factory” vs “raw SQL” — the audit + hybrid SDK handle it.
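To make the `created_by` graph concrete, the sketch below inverts the audit entries into an owner-to-dependents map and lists the models that would receive a factory. It assumes the audit frontmatter has been parsed into a list of dicts; both function names are hypothetical and the real generator may organize this differently.

```python
from collections import defaultdict


def dependents_by_owner(models: list[dict]) -> dict[str, list[str]]:
    """Invert created_by entries: for each owner, which models does it mint inline?"""
    graph = defaultdict(list)
    for model in models:
        for entry in model.get("created_by", []):
            graph[entry["owner"]].append(model["name"])
    return dict(graph)


def factory_models(models: list[dict]) -> list[str]:
    """Models with a standalone creation path, i.e. one factory each."""
    return [m["name"] for m in models if m.get("independently_created")]
```

Tearing down a root then means also cleaning up every model listed under that root in the inverted map.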

## The prompt

The live agent prompt lives in the plugin at `agents/entity-audit-generator.md`. It encodes:

* The two orthogonal questions and the four-state matrix above.
* Pass A (find standalone paths) and Pass B (find sibling inserts), both parallelizable.
* Schema rules for the output frontmatter.
* The invariant check: a dependent with empty `created_by` is a bug; fix the audit before writing.
* Instructions for using `curl` + `autonoma/.docs-url` to fetch this page at run time.

# Step 3: Generate Scenarios

> Design the standard, empty, and large test data environments from the knowledge base and the entity audit.

The scenario generator takes your knowledge base and entity audit and produces `scenarios.md` — a description of the test data environments the rest of the pipeline will depend on.

Each scenario is a nested tree of records rooted at your **scope entity** (e.g., `Organization`, `Tenant`, `Workspace`). The tree mirrors the foreign-key structure of your schema and becomes the `create` payload the SDK sends to your Environment Factory in Step 5.

> The JSON file this eventually produces is uploaded to Autonoma’s `/v1/setup/setups/:id/scenario-recipe-versions` endpoint. The exact upload contract (`version`, `source.discoverPath`, `validationMode`, `recipes[]`, and the `variables` tagged union) is documented in the [Scenario Recipe Schema reference](/reference/scenario-recipe-schema/). Read that before writing a recipe generator or debugging an upload rejection.

## Prerequisites

* `autonoma/AUTONOMA.md` must exist (output from [Step 1](/test-planner/step-1-knowledge-base/))
* `autonoma/entity-audit.md` must exist (output from [Step 2](/test-planner/step-2-entity-audit/))
* Access to your backend codebase — the agent reads the ORM schema directly (Prisma schema, Drizzle tables, Ecto schemas, ActiveRecord models, etc.) to understand relationships

## What this produces

`autonoma/scenarios.md` — a markdown file with YAML frontmatter describing:

* `scenario_count` and the list of `scenarios` (at minimum `standard`, `empty`, `large`)
* `entity_types` — every model the scenarios reference
* `variable_fields` — values that must vary across runs, with `{{token}}` placeholders
* `planning_sections` — the agent’s schema summary, relationship map, and variable-data strategy

Each scenario is a fully specified nested tree:

* **`standard`** — realistic day-to-day coverage; the workhorse scenario
* **`empty`** — zero-state and onboarding flows
* **`large`** — pagination, filtering, and high-volume behavior

## Variable fields

Most values in a scenario should be **fixed** — the test can assert against them directly (e.g., “click the project titled ‘Launch Campaign’”). A value should only be marked as a variable `{{token}}` when it genuinely must vary across runs:

* globally unique fields (emails, slugs, usernames)
* time-sensitive fields (timestamps, tokens with TTL)
* backend-generated fields the frontend cannot predict

Tests reference variable fields symbolically: `click the project titled ({{project_title}} variable)`. Marking too many fields as variable makes tests brittle; marking too few causes collisions across parallel runs.
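As an illustration of how `{{token}}` placeholders resolve at run time, the sketch below substitutes tokens in a field value and fails loudly when a token has no value. The `render` helper is hypothetical; how Autonoma actually resolves tokens is not described here.

```python
import re

# Matches double-curly tokens such as {{project_title}} or {{testRunId}}.
TOKEN = re.compile(r"\{\{(\w+)\}\}")


def render(value: str, variables: dict[str, str]) -> str:
    """Replace every {{token}} in value; fixed values pass through unchanged."""
    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name not in variables:
            raise KeyError("no value supplied for {{" + name + "}}")
        return variables[name]

    return TOKEN.sub(replace, value)
```

For example, `render("Acme Project {{testRunId}}", {"testRunId": "run-42"})` yields `Acme Project run-42`, while a fixed value such as `Launch Campaign` is returned as-is.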

## What to review

* **Root scope entity** — every scenario tree roots at the same scope model. If your app is multi-tenant on `Organization`, the tree should start at `Organization` and every descendant must be reachable via FK nesting.
* **Entity coverage** — the important entities from the knowledge base are represented. Missing entities mean core flows won’t have data to run against.
* **Fixed vs variable** — fixed values are realistic (no “asdf” test data). Variable fields are limited to values that genuinely must vary.
* **Scenario differentiation** — `empty` and `large` are meaningfully different from `standard`. Don’t let `large` just be “standard with more rows” — it should exercise pagination, filtering, and overflow.
* **Feasibility** — the tree you see here will be passed to the SDK verbatim in Step 5. If a required FK or unique constraint is missing, Step 5 will fail. Catching it here is cheaper.

Step 4 installs the SDK and Step 5 validates the scenarios against the real database. If scenarios are wrong, Step 5 will either fail or edit this file to match reality — review those edits carefully.

## The prompt

# Scenario Generator

You generate test data scenarios from a knowledge base. Your input is `autonoma/AUTONOMA.md` and `autonoma/entity-audit.md`. Your output MUST be written to `autonoma/scenarios.md` with YAML frontmatter.

## Instructions

1. Fetch Autonoma documentation via `curl` only (not WebFetch). The docs base URL lives in `autonoma/.docs-url`.

   ```bash
   curl -sSfL "$(cat autonoma/.docs-url)/llms/test-planner/step-3-scenarios.txt"
   ```

2. Read `autonoma/AUTONOMA.md` fully — understand the application, core flows, and entity types.

3. Read `autonoma/entity-audit.md` — the authoritative schema map from Step 2. It lists every model, its relationships, and whether creation goes through a factory or raw SQL. Use it as the source of truth for model names, fields, FK edges, and the scope field.

4. Explore the backend codebase only to fill gaps the audit does not cover (enum values, string length limits, constraint details).

5. **Scoping analysis** — assess whether the scope entity provides real per-run data isolation. Does the scope entity parent most other models via required FKs? Can a new scope entity be created per test run? Do most models eventually chain back to the scope entity?

   If yes to all: the app has natural multi-tenant isolation — each test run creates its own scope entity.

   If the scope entity is a singleton, shared across users, or does not meaningfully partition data: the app **lacks natural per-run isolation**. In this case you MUST slug all identifying fields with `{{testRunId}}` so parallel or sequential runs never collide.

6. Design three scenarios: `standard`, `empty`, `large`.

7. **Variable fields.** Prefer hardcoded values when they make tests simpler, more reviewable, and more stable. If a field needs run-level uniqueness but can still be expressed as a concrete literal, prefer a planner-chosen hardcoded value with a discriminator suffix over introducing a variable placeholder (e.g. `Acme Project qa-17` over `{{project_name}}`).

   **Exception — apps without natural per-run isolation:** if your scoping analysis determined the app lacks natural isolation, **reverse the default**. Slug ALL identifying fields — names, titles, descriptions, labels, slugs, emails, usernames — with inline `{{testRunId}}`.

   Only mark a value as variable when at least one of these is true:

   * the field must be globally unique or is highly collision-prone across runs
   * the backend or SDK generates the value at runtime
   * the value is inherently time-based, unstable, or nondeterministic
   * hardcoding it would make later tests misleading or brittle
   * the app lacks natural per-run isolation and the field is used in lookups, searches, or assertions

   Every variable field must have:

   * a double-curly token such as `{{project_title}}`
   * the entity field it belongs to, such as `Project.title`
   * the scenario names that use it
   * a reason explaining why it truly must vary
   * a plain-language test reference such as `({{project_title}} variable)`

8. **Nested tree constraint.** Design scenario entity tables so they can be expressed as a nested tree rooted at the scope entity. Step 4 and Step 5 convert scenarios into nested `create` payloads — flat cross-model structures connected only by `_ref` break when JSON key order is not preserved. Children must nest under their parent using the relation field names from the audit. Use `_ref` only for cross-branch references.

9. **Standalone vs via-owner.** For every model, consult the Step 2 audit:

   * Models with `independently_created: true` may appear as top-level tree nodes when the scenario wants them in isolation.
   * Models whose `created_by` list contains an owner already in the tree must NOT appear as separate nodes — they’re minted inline by the owner’s factory. Quote the `why` from the audit in the scenario prose so the reader knows where they came from.
   * **Dual models** (both `independently_created: true` AND in some owner’s `created_by`) pick per scenario: narratives that create a standalone child use the standalone factory; narratives that spin up a fresh root let the child come in via the owner.

   Never double-create a dependent. If an owner mints a dependent row inline and your scenario already includes that owner, don’t also add the dependent as a sibling node — the factory already creates it, and duplicating it either fails uniqueness checks or produces confusing state.

10. Write `autonoma/scenarios.md`.
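The "never double-create" rule from instruction 9 can be sketched as a check over the set of models that appear as explicit nodes in a scenario tree. The flattened `tree_models` set and the function name are assumptions for illustration; the real tree format is richer than this.

```python
def double_created(tree_models: set[str], audit: list[dict]) -> list[str]:
    """List models that appear as explicit nodes although an owner in the tree mints them."""
    flagged = []
    for model in audit:
        owners = {entry["owner"] for entry in model.get("created_by", [])}
        # Flag when the model and at least one of its owners are both explicit nodes.
        if model["name"] in tree_models and owners & tree_models:
            flagged.append(model["name"])
    return flagged
```

For a dual model, a hit is a prompt to review which path the scenario intends rather than an automatic error, since dual models pick their path per scenario.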

## Output format

```yaml
---
scenario_count: 3
scenarios:
  - name: standard
    description: "Full dataset with realistic variety for core workflow testing"
    entity_types: 8
    total_entities: 45
  - name: empty
    description: "Zero data for empty state and onboarding testing"
    entity_types: 0
    total_entities: 0
  - name: large
    description: "High-volume data exceeding pagination thresholds"
    entity_types: 8
    total_entities: 500
entity_types:
  - name: "User"
  - name: "Project"
variable_fields:
  - token: "{{project_title}}"
    entity: "Project.title"
    scenarios: [standard, large]
    generator: "planner literal plus discriminator"
    reason: "title must be unique per test run"
    test_reference: "({{project_title}} variable)"
planning_sections:
  - schema_summary
  - relationship_map
  - variable_data_strategy
---
```

The body of the file must include:

* `## Schema Summary` — key models and required fields driving the scenarios
* `## Relationship Map` — parent/child and FK relationships
* `## Variable Data Strategy` — which values are generated and how tests reference them
* (Optional) `## Scoping Analysis` — if the app lacks natural per-run isolation
* Scenario sections for `standard`, `empty`, `large` with credentials and entity tables
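A lightweight self-check of the frontmatter, before writing the file, might look like the sketch below. It assumes the frontmatter is already parsed into a dict and that `scenario_count` equals the length of the `scenarios` list, as in the example above; `check_scenarios` is a hypothetical helper, not the plugin's validator.

```python
REQUIRED_SCENARIOS = {"standard", "empty", "large"}


def check_scenarios(meta: dict) -> list[str]:
    """Return problems found in parsed scenarios.md frontmatter."""
    problems = []
    names = [s.get("name") for s in meta.get("scenarios", [])]
    missing = REQUIRED_SCENARIOS - set(names)
    if missing:
        problems.append("missing scenarios: " + ", ".join(sorted(missing)))
    # Assumption: scenario_count counts the entries in the scenarios list.
    if meta.get("scenario_count") != len(names):
        problems.append("scenario_count does not match the scenarios list")
    for field in meta.get("variable_fields", []):
        token = field.get("token", "")
        if not (token.startswith("{{") and token.endswith("}}")):
            problems.append(f"variable token is not double-curly: {token!r}")
    return problems
```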

## Important

* **The scenario data is a contract.** Fixed values are hard assertions; variable fields are explicit placeholders.
* Prefer concrete literals unless the field truly must vary across runs.
* Do not default to `faker`. Prefer deterministic strategies.
* Every value must be concrete — not “some applications” but “3 applications: Marketing Website, Android App, iOS App.”
* Every enum value must be covered in `standard`.
* Only use `{{testRunId}}` as a template token in scenario bodies. Custom tokens like `{{user_email_alice}}` are only valid in `variable_fields` declarations.
* Design scenarios so each entity table can be serialised as a nested tree rooted at the scope entity.

# Step 4: Implement Environment Factory

> Install the Autonoma SDK in your backend, configure the handler, and register a factory for every model with dedicated creation code. Validation of the full up/down lifecycle happens in Step 5.

The Environment Factory implementer takes your `scenarios.md` and `entity-audit.md` and sets up the Environment Factory endpoint using the Autonoma SDK. It installs the SDK packages and registers a factory for **every model that has dedicated creation code** (identified in the audit), so test data created during `up` flows through the same business logic your app uses in production.

This step writes code and runs a `discover` smoke test plus a factory-integrity check. It does **not** run the full `up`/`down` lifecycle — that happens in [Step 5](/test-planner/step-5-validate/), which iteratively validates every scenario, fixes what breaks, and uploads the reconciled recipes.

## Prerequisites

* `autonoma/entity-audit.md` must exist (output from [Step 2](/test-planner/step-2-entity-audit/))
* `autonoma/scenarios.md` must exist (output from [Step 3](/test-planner/step-3-scenarios/))
* Your application’s **backend codebase** must be open in the workspace. The agent will locate it by scanning for manifest files (`package.json`, `pyproject.toml`, `go.mod`, etc.) — it does NOT hardcode the directory name `backend/`, so non-standard names like `core-app-backend/`, `apps/api/`, or `services/core/` are fine. If the backend is in a separate repo, the agent will generate a portable prompt instead of scaffolding a sidecar.
* A backend with a working DB layer (Prisma, Drizzle, SQLAlchemy, Ecto, etc.). The SDK does not require a specific ORM — your factories call whatever services / repositories your app already has.
* Node.js 18+ (TS) or Python 3.11+
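The manifest scan mentioned in the prerequisites can be sketched as a directory walk. This is an illustration rather than the agent's actual logic: the manifest-to-language mapping covers only the three manifests named above, the language labels are assumptions, and `find_backend_candidates` is a hypothetical name.

```python
from pathlib import Path

# Assumption: only the manifests named in the prerequisites; labels are illustrative.
MANIFESTS = {
    "package.json": "TypeScript/JavaScript",
    "pyproject.toml": "Python",
    "go.mod": "Go",
}
SKIP_DIRS = {"node_modules", ".git", ".venv"}


def find_backend_candidates(workspace: str) -> list[tuple[str, str]]:
    """Return (directory, language) pairs for every manifest found under the workspace."""
    root = Path(workspace)
    found = []
    for path in sorted(root.rglob("*")):
        parts = set(path.relative_to(root).parts)
        # Ignore manifests that live inside dependency or tooling directories.
        if path.name in MANIFESTS and not SKIP_DIRS & parts:
            directory = path.parent.relative_to(root).as_posix()
            found.append((directory, MANIFESTS[path.name]))
    return found
```

Because the scan keys on manifest files rather than directory names, a backend in `apps/api/` or `services/core/` is found the same way as one in `backend/`.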

## Generating the secrets

The implementation requires **two separate secrets** with different purposes:

```bash
# 1. Shared secret — you AND Autonoma both know this one.
#    Autonoma uses it to sign every request (HMAC-SHA256).
#    Your endpoint uses it to verify the signature.
#    You paste this into the Autonoma dashboard when connecting your app.
openssl rand -hex 32
# Example output: 4a8f...  → set as AUTONOMA_SHARED_SECRET


# 2. Signing secret — only YOUR backend knows this one.
#    Used to sign the refsToken during up and verify during down.
#    Autonoma stores the token opaquely — it cannot read or modify it.
openssl rand -hex 32
# Example output: 7b3d...  → set as AUTONOMA_SIGNING_SECRET
```

These must be **different values**. The SDK throws an error if they match. For more details on the security model, see the [Security Model](/guides/environment-factory/#security-model) section of the Environment Factory Guide.
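The roles of the two secrets can be illustrated with the standard library alone. The sketch below generates two values equivalent to `openssl rand -hex 32`, rejects identical secrets, and verifies an HMAC-SHA256 request signature with the shared secret. How Autonoma actually transmits the signature is not specified here, so the hex-digest-over-raw-body scheme and all function names are assumptions.

```python
import hashlib
import hmac
import secrets


def generate_secrets() -> tuple[str, str]:
    """Produce two independent 32-byte secrets as hex, like `openssl rand -hex 32` twice."""
    return secrets.token_hex(32), secrets.token_hex(32)


def assert_distinct(shared_secret: str, signing_secret: str) -> None:
    """Refuse to start when the two secrets are the same value."""
    if hmac.compare_digest(shared_secret, signing_secret):
        raise ValueError("shared and signing secrets must be different values")


def verify_request(body: bytes, signature_hex: str, shared_secret: str) -> bool:
    """Check an HMAC-SHA256 hex signature over the raw request body (assumed scheme)."""
    expected = hmac.new(shared_secret.encode(), body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_hex)
```

Keeping the signing secret out of Autonoma's hands is what lets the backend trust a refs token on `down`: the shared secret only proves who sent the request, not which records may be deleted.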

## What this produces

* The Autonoma SDK packages installed in your backend

* A working endpoint handler using `createHandler()` / `createExpressHandler()` / `createHonoHandler()` / `createNodeHandler()` (TypeScript) or `create_fastapi_handler()` / `create_flask_handler()` / `create_django_handler()` (Python) with:

  * `scopeField` (e.g. `"organizationId"`) plus the two secrets (`sharedSecret`, `signingSecret`) on `HandlerConfig`
  * A factory registered for **every model in `entity-audit.md`**. Each factory declares an `inputSchema` (Zod) / `input_model` (Pydantic) plus `create` / `teardown` functions that call your real services
  * Auth callback that creates real, working credentials

* A passing `discover` smoke test and factory-integrity check. Full `up`/`down` lifecycle validation happens in [Step 5](/test-planner/step-5-validate/).

## Review checkpoint

Before writing any code, the agent will present a full implementation plan. This is a standard plan-mode approval gate — review it before the agent proceeds.

**What to check:**

* **SDK packages** — correct packages for your framework (e.g., `@autonoma-ai/sdk` + `@autonoma-ai/server-express` + `zod`, or `autonoma-ai` + the matching `autonoma_*` server adapter)
* **Endpoint location** — fits your existing route structure
* **Factories** — **every** model in `entity-audit.md` has a factory registered. Models marked `independently_created: true` call the audit’s identified `creation_file` / `creation_function`; models marked `independently_created: false` still need a factory, but it can wrap a thin repository call. There is no SQL fallback anymore.
* **Auth strategy** — correctly identifies how your app authenticates users. Session cookies, JWT, or credentials.
* **Environment variables** — `AUTONOMA_SHARED_SECRET` and `AUTONOMA_SIGNING_SECRET` are both listed

> **Tip:**
>
> If you’re unsure about the protocol details, read the [Environment Factory Guide](/guides/environment-factory/) before reviewing the plan.

## The prompt

# Environment Factory: SDK Setup & Validation

You are a backend engineer. Your job is to install the Autonoma SDK, configure the handler with factories, and validate the scenario lifecycle for this application.

***

## CRITICAL: Database Safety

You may be connected to a production database. Follow these rules absolutely:

* **ALL writes go through the SDK endpoint only.** The SDK has production guards, HMAC authentication, and signed refs tokens that prevent accidental damage.
* **You MAY read from the database** using `psql`, database GUIs, or ORM queries for verification purposes (SELECT only).
* **You MUST NEVER** run INSERT, UPDATE, DELETE, DROP, or TRUNCATE directly via psql, raw SQL, ORM write methods, or any other path outside the SDK endpoint.
* **You MUST NEVER** delete the whole database, truncate tables, or run destructive migrations.
* The SDK’s `down` action only deletes records that `up` created, verified by a cryptographically signed token. This is the only safe deletion path.

***

## HARD CONTRACT — READ FIRST

You MUST NOT:

* Create a new server, app, or sidecar process. No new `FastAPI()` / `express()` / `Flask()` / standalone `main.py` / `start-*.py` / `main.go` launcher at the repo root.
* Install a Python SDK into a TypeScript backend (or vice versa). The SDK language MUST match the backend’s language.
* Scaffold files at the repo root when an existing backend directory exists — even if that directory is named `core-app-backend/`, `apps/api/`, `services/core/`, or any other non-standard name.
* Pick an SDK before you have located and identified the backend in Phase 1.

If you cannot locate a backend, or the backend’s language has no matching Autonoma SDK, **STOP and ask the user**. Never fall back to a sidecar.

***

## Phase 0: Locate prerequisites

### 0.1 — Find scenarios.md

1. Check for `autonoma/scenarios.md` at the workspace root.
2. If not found, search broadly for `scenarios.md` anywhere in the workspace.

If not found, tell the user:

> “I need `scenarios.md` to implement the Environment Factory. Please run the Scenario Generator (Step 3) first, then come back and run this prompt.”

Do not proceed without it.

### 0.2 — Read the Environment Factory documentation

Fetch the Autonoma documentation to understand the current SDK setup:

1. Fetch `https://docs.agent.autonoma.app/llms.txt` to get the documentation index
2. Read the **Environment Factory Guide** — understand the SDK packages, factory registration with `inputSchema` / `input_model`, the `scopeField` on `HandlerConfig`, auth callback patterns, and the create tree format
3. Read the **framework example** that matches this project’s stack if one exists

**Always read the live docs.** The SDK may have been updated since this prompt was written.

### 0.3 — Read scenarios.md

Read `scenarios.md` fully. Identify:

* The scenario names and their create trees
* Every model referenced in the create trees
* Cross-branch references (`_alias` / `_ref`)
* Fields that use `testRunId` for uniqueness
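
One way to check cross-branch references while reading the create trees is to walk each tree and compare the aliases it declares with the references it uses. The sketch below is illustrative only: it assumes `_alias` is a string key on a record (as in the recipe examples) and that a reference is written as `{"_ref": "<alias>"}`. The `_ref` shape and both function names are assumptions, so confirm the real format in the Environment Factory Guide.

```python
def collect_aliases_and_refs(node, aliases=None, refs=None):
    """Walk a create tree and gather every `_alias` and `_ref` string value."""
    aliases = set() if aliases is None else aliases
    refs = set() if refs is None else refs
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "_alias" and isinstance(value, str):
                aliases.add(value)
            elif key == "_ref" and isinstance(value, str):
                # Assumed shape: {"_ref": "<alias>"}
                refs.add(value)
            else:
                collect_aliases_and_refs(value, aliases, refs)
    elif isinstance(node, list):
        for item in node:
            collect_aliases_and_refs(item, aliases, refs)
    return aliases, refs


def dangling_refs(create_tree):
    """Return the references that point at an alias nothing declares."""
    aliases, refs = collect_aliases_and_refs(create_tree)
    return sorted(refs - aliases)
```

A non-empty result from `dangling_refs` would suggest a scenario that cannot resolve one of its references.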

***

## Phase 1: Explore the codebase

This exploration builds your understanding of the project — the same understanding that determines factory registration and auth implementation.

### 1.1 — Locate the backend and detect its language (do this BEFORE anything else in Phase 1)

Real projects use many directory conventions. Do NOT hardcode `backend/`. Enumerate candidates with Glob:

* `backend/`, `server/`, `api/`, `service/`, `services/`
* `*-backend/`, `*-api/`, `*-server/`, `core-*/`, `app-*/`, `core-app-backend/`
* Monorepo layouts: `apps/*`, `packages/*`, `services/*`
* Single-repo backends at the workspace root

Detect language per candidate from its manifest file — this determines which Autonoma SDK you install:

| Manifest found                                    | Language              | SDK package                                            |
| ------------------------------------------------- | --------------------- | ------------------------------------------------------ |
| `package.json`                                    | TypeScript/JavaScript | `@autonoma-ai/sdk`                                     |
| `pyproject.toml` / `requirements.txt` / `Pipfile` | Python                | [`autonoma-ai`](https://pypi.org/project/autonoma-ai/) |
| `go.mod`                                          | Go                    | `github.com/autonoma-ai/autonoma-sdk-go`               |
| `Cargo.toml`                                      | Rust                  | `autonoma` crate                                       |
| `pom.xml` / `build.gradle`                        | Java                  | `ai.autonoma:autonoma-sdk`                             |
| `Gemfile` / `*.gemspec`                           | Ruby                  | `autonoma` gem                                         |
| `composer.json`                                   | PHP                   | `autonoma/sdk`                                         |
| `mix.exs`                                         | Elixir                | `autonoma` hex package                                 |

**Pick exactly one backend.** If multiple plausible candidates exist, STOP and ask the user. Do not guess. Do not implement in more than one.
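
The manifest table and the "exactly one backend" rule can be sketched as a small lookup. This is an illustration of the decision, not part of any Autonoma SDK. The helper names `detect_sdk` and `pick_backend` are hypothetical, and the candidate directories are assumed to come from your own Glob results.

```python
from pathlib import Path
from typing import Optional

# Manifest file -> (language, SDK package), copied from the table above.
MANIFEST_TO_SDK = {
    "package.json": ("TypeScript/JavaScript", "@autonoma-ai/sdk"),
    "pyproject.toml": ("Python", "autonoma-ai"),
    "requirements.txt": ("Python", "autonoma-ai"),
    "Pipfile": ("Python", "autonoma-ai"),
    "go.mod": ("Go", "github.com/autonoma-ai/autonoma-sdk-go"),
    "Cargo.toml": ("Rust", "autonoma crate"),
    "pom.xml": ("Java", "ai.autonoma:autonoma-sdk"),
    "build.gradle": ("Java", "ai.autonoma:autonoma-sdk"),
    "Gemfile": ("Ruby", "autonoma gem"),
    "composer.json": ("PHP", "autonoma/sdk"),
    "mix.exs": ("Elixir", "autonoma hex package"),
}


def detect_sdk(candidate: Path) -> Optional[tuple]:
    """Return (language, sdk_package) for a candidate directory, or None."""
    for manifest, result in MANIFEST_TO_SDK.items():
        if (candidate / manifest).is_file():
            return result
    if any(candidate.glob("*.gemspec")):
        return ("Ruby", "autonoma gem")
    return None


def pick_backend(candidates: list) -> Path:
    """Return the single detected backend, or raise so a human can decide."""
    detected = [c for c in candidates if detect_sdk(c) is not None]
    if len(detected) != 1:
        raise RuntimeError(
            f"expected exactly one backend, found {len(detected)}: stop and ask the user"
        )
    return detected[0]
```

Raising instead of picking the first match mirrors the rule above: ambiguity is resolved by the user, never by a guess.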

**State your finding back to the user before writing any code:**

> “I found the backend at `<path>` (language: `<lang>`, framework: `<framework>`). I’ll implement the endpoint there using the `<sdk-package>` SDK. Is that the right location?”

Wait for confirmation.

**If no candidate matches a supported SDK language**: STOP and ask the user. Do NOT build a standalone Python (or any) sidecar as a workaround. Do NOT install a language SDK that doesn’t match the backend.

**If the backend is in a separate repo not open in this workspace**: generate a self-contained prompt the user can run in the backend workspace, including the full `scenarios.md` content, a link to the live docs, and all implementation instructions. Do not create a sidecar in the current workspace.

### 1.2 — Understand the stack

Identify:

* **Framework**: Next.js, Express, Hono, FastAPI, Flask, Django, etc.
* **DB layer**: Whatever ORM/repository pattern the app already uses — your factories will call those services directly. The SDK does not need a connection.
* **Auth mechanism**: How users log in (session cookies, JWT, OAuth, Better Auth, Lucia, etc.)
* **Existing route patterns**: How other endpoints are structured

### 1.3 — Read entity-audit.md

Read `autonoma/entity-audit.md` and parse the frontmatter. The audit tells you exactly which models to wire up:

* Every model in the audit gets a factory. The SDK is factory-driven: there is no SQL fallback. Every factory must declare an `inputSchema` (Zod) / `input_model` (Pydantic) so the SDK can describe the model to the dashboard and validate the create payload before invoking your code.
* Models with `independently_created: true` get a factory that calls the identified `creation_file` / `creation_function`.
* Models with `independently_created: false` still need a factory, but it can be a thin repository call (`db.tag.create({...})` / `repo.tag.create(...)`) since there’s no shared business logic to preserve.

The audit’s `side_effects` field is informational — it helps you understand what each factory will preserve.

### 1.4 — Understand auth creation

Find the code path that creates sessions or tokens for users. Search for `createSession`, `jwt.sign`, `lucia`, `better-auth`, `iron-session`, or similar. You need to replicate this in the auth callback.

***

## Phase 2: Plan — go into plan mode

Present a complete implementation plan:

```plaintext
## Implementation Plan


### SDK packages to install
[Exact packages: `@autonoma-ai/sdk` + `@autonoma-ai/server-<framework>` + `zod` (TS), or `autonoma-ai` (Python). No ORM-specific package — factories use whatever client the app already has.]


### Endpoint location
[Exact file path]


### Scope field
[e.g., organizationId — explain why]


### Environment variables
- `AUTONOMA_SHARED_SECRET` — shared with Autonoma for HMAC request verification
- `AUTONOMA_SIGNING_SECRET` — private, for signing refs tokens


### Factories to register (from entity-audit.md)
For every model the audit lists, register a factory. Each one declares:
- `inputSchema` (Zod) / `input_model` (Pydantic): every dashboard-supplied field, with the right type. Drives discover.
- `create`: invokes the audit's `creation_file` / `creation_function` for `independently_created: true`, or a thin repository call for `independently_created: false`.
- `teardown`: optional but recommended — invoked during `down` to remove what `up` created.


For every `independently_created: true` row, name the function the factory calls and the side effects observed in the audit. For `independently_created: false`, name the table the factory writes to.


### Auth callback strategy
[How sessions/tokens are created — specific code path in the app]
```

**Wait for user approval before proceeding.**

***

## Phase 3: Implement

### 2.5 — Research pass (MANDATORY before writing any factory)

Post-mortems of past runs show a consistent failure mode: the agent makes **one bad decision and applies it uniformly to every model**. The research pass prevents this by forcing a per-model pause and a documented decision before any handler code is written.

Emit `autonoma/.factory-plan.md` with one row per `independently_created: true` model:

```plaintext
| Model | Audit function | File opened? | Import path | DI dependencies observed | Decision (Branch 1/2/3) | Notes |
|-------|----------------|--------------|-------------|--------------------------|-------------------------|-------|
```

Column rules:

* **File opened?** — “yes, lines X-Y” or “no, why”. If “no”, you MUST NOT proceed — you cannot pick Branch 1 vs Branch 2 without reading the source.
* **Import path** — the exact `import ... from "..."` the handler will use. For Branch 1 rows, this is the *new* export you will create during extraction, not the current inline location.
* **DI dependencies observed** — every constructor arg or closed-over variable the function uses (DB client, logger, event bus, Temporal client, analytics client, etc.). The factory has no `ctx.executor` to lean on; it imports the same DB client / repository singletons the rest of the app uses. Listing every dependency makes any silent give-up visible.
* **Decision** — Branch 1 (extract inline → export → call), Branch 2 (import existing export → call), or Branch 3 (`independently_created: false`, plain repository call is fine). “Inline ORM in production code path” is NOT a valid value for Branches 1 or 2.

#### Cross-codebase DI discovery

Run these greps against the backend BEFORE filling the table:

```bash
# Find how each service is actually constructed in production code.
grep -rnE "new ${ServiceName}\(" apps/ --include='*.ts' --include='*.tsx' | head -20
# Find exported singletons and module-level instances.
grep -rnE "^(export )?(const|let) [a-zA-Z]+ = new " apps/ --include='*.ts' | head -40
# Find composition root candidates.
grep -rnlE "(container|registry|services/index|app\.module)" apps/ | head
```

Use the results to fill “DI dependencies observed” honestly. If a service needs `logger, eventBus, temporal, analytics` and you can’t find where the app wires them, STOP and ask the user — do NOT fall back to raw ORM.

#### Hook-level enforcement

When you write `autonoma/.endpoint-implemented` at the end of this step, the plugin’s validator hook parses `entity-audit.md`, opens the handler you named in the sentinel body, and blocks the write if any factory for an `independently_created: true` model contains an inline ORM write (`prisma.<m>.create`, `db.<m>.create`, `tx.insert(<m>Table)`, etc.) or if any such model has no factory at all. This mechanical gate backs up the agent’s self-policed Step A–D check: if you try to ship the anti-pattern, the sentinel write fails with an itemised list of violations and you must fix them before advancing.

***

### 3.0 — Per-model decision tree (run this BEFORE writing any factory)

For every model with `independently_created: true` in `autonoma/entity-audit.md`, walk this tree in order. There is no “give up and use `db.<model>.create()`” escape hatch: `db.<model>.create()` inside a factory body for an `independently_created: true` model is NEVER acceptable.

**Branch 1 — `needs_extraction: true`.** The creation logic is inline in a route handler, a framework hook (Better Auth `databaseHooks`, NextAuth callbacks, Express closures), or an anonymous closure. Extract it first:

1. Move the inline block into a new **named, exported function** in a nearby module (`*.service.ts`, `*.repository.ts`, `create-<model>.ts`, or an existing service). Take a plain input object (no `req`/`res`/`ctx`), return the created record, preserve every side effect the inline block had.
2. Replace the inline block with a call to the new function. Real HTTP callers’ behavior must stay identical. Run typecheck/tests.
3. Update `autonoma/entity-audit.md` in-place: add an `extracted_to: <new-path>` field pointing at the file you created, and keep `creation_file`, `creation_function`, and `needs_extraction: true` exactly as Step 2 recorded them. The fidelity rubric’s framework-hook carve-out (Criterion 1) relies on those fields remaining intact so it can score the factory against the extracted helper rather than the un-callable hook.
4. Import the new function in the factory.

If extraction is genuinely impossible (inline block inseparable from `req`/`res`, or generated code), STOP and ask the user. Do NOT fall back to raw ORM.

Concrete example — Better Auth `databaseHooks`: if the audit flags `User` with `needs_extraction: true` pointing at `src/auth.ts#buildAuth (databaseHooks.user.create)`, the closure body writes `db.user.create`, then `ensureOrgMembership`, then provisions a `BillingCustomer`. Calling `db.user.create()` in the factory silently skips every sibling row. Extract the closure into `export async function createUserWithOnboarding(input)`, call it from the hook (production still works), update the audit, then import it in the factory.

**Branch 2 — `independently_created: true`, no `needs_extraction`.** Import and call the named export. See the DI playbook below for how to invoke it.

**Branch 3 — `independently_created: false`.** Register a factory whose `create` is a thin repository / ORM call. There is no SQL fallback.
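
The three branches reduce to a small decision function over one audit entry. The sketch below assumes the entry has already been parsed from the `entity-audit.md` frontmatter into a dict with the field names used above. The function names are hypothetical and the parsing itself is not shown.

```python
from typing import Optional


def choose_branch(entry: dict) -> int:
    """Map one entity-audit entry to Branch 1, 2, or 3 of the decision tree."""
    if not entry.get("independently_created", False):
        return 3  # thin repository / ORM call is acceptable
    if entry.get("needs_extraction", False):
        return 1  # extract the inline logic into a named export first
    return 2  # import and call the existing named export


def import_path(entry: dict) -> Optional[str]:
    """Return the file the factory should import from, if any."""
    branch = choose_branch(entry)
    if branch == 1:
        # Branch 1 is only complete once the audit records `extracted_to`.
        if not entry.get("extracted_to"):
            raise ValueError("needs_extraction is true but extracted_to is missing: extract first")
        return entry["extracted_to"]
    if branch == 2:
        return entry["creation_file"]
    return None  # Branch 3 has no audited function to import
```

The `ValueError` stands in for the "STOP" rule: a Branch 1 model with no `extracted_to` means the extraction step was skipped.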

### 3.0.1 — DI / constructor-injection playbook

Factories receive `(data, ctx)` where `data` is the value parsed by `inputSchema` / `input_model`. The DB client/transaction is whatever singleton your app already exports — import it directly. Walk this list in order; first match wins:

1. **Top-level exported function** — `import { createX } from "..."; return createX(data);`. Simplest case.
2. **Static method** — `return XService.create(data, db);` where `db` is the app’s exported DB client.
3. **Instance method, needs only a DB client** — `const svc = new XService(db); return svc.create(data);`.
4. **Instance method, needs more dependencies (logger, event bus, config, clients)** — find the app’s composition root (DI container, `container.ts`, `app.module.ts`, `services/index.ts`). Either import the already-constructed singleton (`import { userService } from "@/services"`) or rebuild the service the way the composition root does, importing real singletons for everything (DB client, logger, event bus, temporal client). Do not invent mocks.
5. **Impossible** — STOP and ask the user. Do NOT inline ORM writes that bypass production logic.

Never mock, stub, or fake a dependency. The factory must exercise real code.

### 3.0.2 — External side effects policy

Audited creation functions often perform side effects beyond the DB row: Temporal workflows, GitHub/Stripe/Slack APIs, emails, analytics, LLMs. Your goal is **correct DB state, not production-grade external delivery**. Preserve every DB write (including writes to sibling tables done by ORM hooks, framework hooks, triggers). Order of preference:

1. **Call the real function with real side effects** if the test environment has sandbox keys / local Temporal / mocked SDKs wired.
2. **Use the app’s existing test-mode toggle** (`NODE_ENV=test`, `DISABLE_WORKFLOWS=1`, feature flag, null-object client).
3. **Wrap external-only calls in try/catch** inside the real function (not inside a rewritten factory body) — only for calls whose failure does not affect DB state under test.
4. **Reimplement the DB writes inline.** NEVER. If you’re typing `db.<other_model>.create` inside a factory to replicate what a hook would have done, the function wasn’t truly called — you re-wrote it. Go back to option 1 or 2, or ask the user.

You are NOT allowed to skip: password hashing, slug generation, normalization (pure CPU inside the creation function), DB writes performed by ORM/framework hooks on the created model (e.g. Better Auth’s `databaseHooks.user.create` writing Organization/Member/BillingCustomer), or writes to sibling tables the creation function itself performs (e.g. `createProject` writing a default Folder).

### 3.1 — Install SDK packages

TypeScript:

```bash
pnpm add @autonoma-ai/sdk @autonoma-ai/server-[framework] zod
```

Python:

```bash
pip install autonoma-ai
```

### 3.2 — Create the endpoint handler

Write a single handler file that:

1. Sets `scopeField: "<your scope field>"` plus the two secrets on the handler config. There is no `executor` field anymore.

2. Registers a factory for **every** model in `entity-audit.md`. Each factory:

   * Declares an `inputSchema` (Zod) / `input_model` (Pydantic) covering every field the dashboard sends. The SDK reads it for `discover` and validates payloads through it before invoking `create`.
   * For `independently_created: true`: imports the function from the audit’s `creation_file` and calls it inside `create`. **Never reimplement the creation logic with an inline ORM call** (see WRONG/RIGHT example below). For methods on a class, instantiate the class using the app’s exported DB client.
   * For `independently_created: false`: makes a thin repository / ORM call from inside `create`.
   * Optionally declares a `teardown` to remove the record during `down`.

3. Implements the auth callback using the app’s real session/token creation.

4. Passes both secrets from environment variables.

Follow the project’s existing code patterns — import style, file organization, error handling.

### 3.3 — Register the route

Add the endpoint to the app’s routing (e.g., `app.post('/api/autonoma', handler)`).

### 3.4 — Set up environment variables

Add `AUTONOMA_SHARED_SECRET` and `AUTONOMA_SIGNING_SECRET` to `.env` (or equivalent). If `.env.example` exists, add placeholders there too.

***

### 3.5 — The trap: inline ORM calls inside factories

The most common mistake is writing `db.x.create({...})` inside a factory because calling the real function is inconvenient (constructor args, DI). That silently bypasses every piece of business logic the user has — or will add — and makes the scenario data diverge from what the app itself would produce.

```ts
// entity-audit.md: creation_function = OnboardingManager.getState
import { db } from "@/db";
import { OnboardingManager } from "@/lib/onboarding-manager";

// WRONG: inline ORM, bypasses OnboardingManager entirely.
OnboardingState: defineFactory({
  inputSchema: z.object({ applicationId: z.string() }),
  create: async (data) => {
    return db.onboardingState.create({ data: { applicationId: data.applicationId, step: "welcome" } });
  },
}),

// RIGHT: use the app's real DB client, instantiate the class, call the real method.
// `data` is inferred from `inputSchema`, so no z.infer<...> annotation is needed.
OnboardingState: defineFactory({
  inputSchema: z.object({ applicationId: z.string() }),
  create: async (data) => {
    const manager = new OnboardingManager(db);
    return manager.getState(data.applicationId);
  },
}),
```

The factory imports the same `db` (or `prisma`/`drizzle`/`session`) singleton the rest of the app uses. The SDK does not own a connection — your factory writes through whatever path your app’s services normally take.

`defineFactory` is generic over its `inputSchema` and optional `refSchema`, so `data` and (when set) `record` are typed automatically. Add `refSchema: z.object({ id: z.string() })` whenever you also write a `teardown` and want a typed record.

***

## Phase 4: Smoke test and factory-integrity check

This phase proves the handler was wired correctly. It does **not** run the full `up`/`down` lifecycle — that is Step 5’s job.

### 4.1 — Start the dev server

Check if it’s already running. If not, start it.

### 4.2 — Test discover

```bash
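# Sign the exact request body with HMAC-SHA256 using the shared secret.
BODY='{"action":"discover"}'
# printf avoids the non-portable `echo -n`; a stray trailing newline would change the signature.
SIG=$(printf '%s' "$BODY" | openssl dgst -sha256 -hmac "$AUTONOMA_SHARED_SECRET" | sed 's/.*= //')
# Replace PORT with the port your dev server listens on.
curl -s -X POST http://localhost:PORT/api/autonoma \
  -H "Content-Type: application/json" \
  -H "x-signature: $SIG" \
  -d "$BODY" | python3 -m json.tool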
```

**Expected**: JSON with `schema` containing `models`, `edges`, `relations`, `scopeField`. Every model from `entity-audit.md` must appear under `schema.models`. `edges` and `relations` are emitted as empty arrays — the dashboard accepts that, and the `_alias`/`_ref` graph in the create payload carries equivalent dependency information at request time.
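
For reference, the same signature can be computed without `openssl`. This is a minimal sketch that assumes, as the curl command does, that the `x-signature` header carries the hex HMAC-SHA256 of the raw request body keyed with `AUTONOMA_SHARED_SECRET`. The function names are illustrative and not part of the SDK.

```python
import hashlib
import hmac
import json


def sign_body(body: str, secret: str) -> str:
    """Return the hex HMAC-SHA256 of the exact request body."""
    return hmac.new(secret.encode("utf-8"), body.encode("utf-8"), hashlib.sha256).hexdigest()


def build_discover_request(secret: str):
    """Build the raw body and headers for a discover call."""
    # Serialize once and sign those exact bytes; re-serializing later
    # could change whitespace and invalidate the signature.
    body = json.dumps({"action": "discover"}, separators=(",", ":"))
    headers = {
        "Content-Type": "application/json",
        "x-signature": sign_body(body, secret),
    }
    return body, headers
```

Send `body` exactly as returned. Any change to the bytes after signing produces a signature mismatch.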

### 4.3 — Factory-integrity check

Before completing this step, prove deterministically that every factory you registered actually calls the audit’s `creation_function`:

1. Re-read `entity-audit.md`. List every model with `independently_created: true` and its `creation_file` / `creation_function`. If any entry still has `needs_extraction: true`, make sure it also has `extracted_to: <path>` pointing at the extracted helper you created per Branch 1 — that’s what the factory must import and call. A bare `needs_extraction: true` with no `extracted_to` means you skipped the extraction step; HALT and extract.

2. For each such model, open the handler file and verify BOTH:

   * an `import` line pulls in `creation_function` (or the class that owns it) from a path that resolves to `creation_file`
   * the `defineFactory({ create })` body invokes that symbol (e.g. `manager.getState(...)`, `createUser(...)`, `ProjectService.create(...)`)

3. Spot-check with grep — any inline ORM create inside a factory for a model marked `independently_created: true` is the anti-pattern:

   ```bash
   grep -nE '(prisma|db|tx)\.[a-zA-Z]+\.create\(' <handler-file>
   ```

   Cross-reference each match against the audit; replace inline calls with the real function before continuing.

If any factory fails this check, fix it before reporting success. The full lifecycle validation in Step 5 will otherwise find it the hard way.
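
The grep spot-check can also be run as a small script when line-level output is easier to cross-reference against the audit. The sketch below reuses the same regular expression as the grep command; `find_inline_creates` is a hypothetical helper name. Like the grep, it only flags candidates, so each hit still needs to be compared with `entity-audit.md`.

```python
import re

# Same pattern as the grep above: <client>.<model>.create(
INLINE_CREATE = re.compile(r"(prisma|db|tx)\.[a-zA-Z]+\.create\(")


def find_inline_creates(handler_source: str):
    """Return (line_number, stripped_line) for every inline ORM create call."""
    hits = []
    for number, line in enumerate(handler_source.splitlines(), start=1):
        if INLINE_CREATE.search(line):
            hits.append((number, line.strip()))
    return hits
```

A hit inside a factory for an `independently_created: false` model is acceptable. A hit for an `independently_created: true` model is the anti-pattern.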

***

## Phase 5: Report

Tell the user:

> “Done! I’ve set up the Autonoma SDK at `[endpoint path]`.
>
> **Packages installed**: \[list]
>
> **Factories registered** (from entity-audit.md): \[list each model + the `creation_file#creation_function` it calls (or the repository call for `independently_created: false`) + side effects observed]
>
> **Auth**: \[how sessions/tokens are created]
>
> **Smoke test**: discover returns schema with \[N] models; factory-integrity check passed for \[N] factories.
>
> **Next steps**:
>
> 1. Set your secrets in `.env`:
>
>    ```plaintext
>    AUTONOMA_SHARED_SECRET=<your-value>
>    AUTONOMA_SIGNING_SECRET=<your-value>
>    ```
>
> 2. Proceed to Step 5 to validate the full up/down lifecycle against every scenario.
>
> 3. When ready, paste `AUTONOMA_SHARED_SECRET` into the Autonoma dashboard.”

***

## Important reminders

* **Never create a standalone server or sidecar.** Always integrate into the backend you identified in Phase 1.1. If that’s not possible, stop and ask the user — do not invent a workaround.
* **SDK language must match backend language.** Do not install `autonoma-ai` (Python) into a TypeScript/NestJS project, etc.
* **Do not scaffold at the repo root** when a backend directory exists, including non-standard names like `core-app-backend/`, `apps/api/`, `services/core/`.
* **Always read the live docs** at `https://docs.agent.autonoma.app/llms.txt` before implementing. The SDK may have been updated.
* **ALL database writes go through the SDK endpoint.** Never write directly via psql, raw SQL, or ORM methods.
* **Register a factory for every model in the entity audit** — there is no SQL fallback. For `independently_created: true` rows the factory must call the audit’s identified function; for `independently_created: false` rows a thin repository call is fine. Never reimplement an identified creation function inline.
* **Validation is Step 5’s job.** This step only runs `discover` plus the factory-integrity check. Do not try to run `up`/`down` here.
* **Match existing codebase patterns.** Don’t introduce new conventions. Use the same import style, file organization, and error handling.
* **Use `testRunId`** in all unique fields (emails, slugs, org names) to prevent parallel test collisions.
* **If context compaction occurs**, re-read this prompt and use a TODO list to track progress.

# Step 5: Validate Scenario Lifecycle

> Run discover/up/down against every scenario, fix whatever breaks, emit scenario recipes, and upload them to the dashboard.

The scenario validator is the **gate** between “endpoint exists” and “tests can be written against it.” It drives the full SDK lifecycle against every scenario, iteratively fixes whatever is broken, and records the final, reconciled scenario trees as `scenario-recipes.json` for the Autonoma dashboard.

This step **must pass** before Step 6 (test generation) runs. A PostToolUse validation gate in the plugin blocks test-file writes until the sentinel `autonoma/.endpoint-validated` exists, so you cannot accidentally generate tests against broken scenario data.

## Prerequisites

* `autonoma/entity-audit.md` (output from [Step 2](/test-planner/step-2-entity-audit/))
* `autonoma/scenarios.md` (output from [Step 3](/test-planner/step-3-scenarios/))
* `autonoma/.endpoint-implemented` sentinel (output from [Step 4](/test-planner/step-4-implement/))
* A running dev server that exposes the Environment Factory endpoint
* `AUTONOMA_SHARED_SECRET` and `AUTONOMA_SIGNING_SECRET` set in the server’s environment

## What this produces

* `autonoma/scenario-recipes.json` — the validated create tree for every scenario, keyed by scenario name, with a `variables` block listing every `{{token}}` placeholder
* `autonoma/.scenario-validation.json` — terminal artifact recording validation status, preflight result, and any edits the agent made to `scenarios.md`
* `autonoma/.endpoint-validated` — sentinel that unlocks Step 6
* Uploaded scenario recipes on the Autonoma dashboard, attached to this generation

## What the agent does

### The iteration loop

For each scenario in `scenarios.md`, the agent runs an HMAC-signed `discover` → `up` → `down` loop against the live endpoint.

* **`discover`** — fetches the schema. Every model in the entity audit must appear under `schema.models`. Every model marked `independently_created: true` must have a factory registered on the handler.
* **`up`** — sends the scenario’s create tree. The agent verifies that the response includes a non-empty `auth` block, that every expected record exists in the database (read-only SELECT queries), and that `refsToken` is returned.
* **`down`** — tears down using the signed refs token. The agent verifies that every record created by `up` is gone and that nothing outside the refs was touched.
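
The response-level checks for `up` can be sketched as a pure function. It assumes the response is a JSON object with `auth`, `refs`, and `refsToken` keys, that `refs` is keyed by model name, and that a usable `auth` block has a non-empty value for at least one of `cookies`, `headers`, `token`, or `user` (the keys named in the validator prompt). The function name and the shape of `refs` are assumptions, and the database verification is not covered here.

```python
AUTH_KEYS = ("cookies", "headers", "token", "user")


def check_up_response(create_tree: dict, response: dict):
    """Return a list of problems found in an `up` response (empty means OK)."""
    problems = []
    auth = response.get("auth")
    if not isinstance(auth, dict) or not any(auth.get(key) for key in AUTH_KEYS):
        problems.append("auth is empty: expected one of cookies, headers, token, user")
    if not response.get("refsToken"):
        problems.append("refsToken is missing")
    refs = response.get("refs") or {}
    # Every top-level model in the create tree should be represented in refs.
    for model in create_tree:
        if model not in refs:
            problems.append(f"model {model} is missing from refs")
    return problems
```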

If a scenario fails, the agent decides whether the **handler** is wrong or the **scenario** is wrong:

| Symptom                                                                                                                                 | Fix                                            |
| --------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------- |
| Factory missing, FK unresolved, handler crash, auth callback broken                                                                     | Fix the handler in the backend and retry       |
| Scenario references a model that doesn’t exist, requires an impossible unique constraint, or depends on a field the schema doesn’t have | Edit `scenarios.md` to match reality and retry |

The loop runs up to **5 iterations**. If it still hasn’t converged, the agent stops and surfaces the failure — it does not write the validated sentinel.
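
The bounded loop can be pictured as follows. This sketch is not the plugin's implementation: `run_lifecycle` and `apply_fix` are hypothetical callables standing in for "run discover/up/down for one scenario" and "fix the handler or edit `scenarios.md`".

```python
MAX_ITERATIONS = 5


def validate_all(scenarios, run_lifecycle, apply_fix, max_iterations=MAX_ITERATIONS):
    """Return True once every scenario passes, False if the loop does not converge.

    run_lifecycle(name) returns None on success or an error description.
    apply_fix(name, error) repairs the handler or the scenario.
    """
    for _ in range(max_iterations):
        failure = None
        for name in scenarios:
            error = run_lifecycle(name)
            if error is not None:
                failure = (name, error)
                break  # fix first, then restart from the first scenario
        if failure is None:
            return True  # converged: the caller may write the sentinel
        apply_fix(*failure)
    return False  # not converged: surface the failure, write no sentinel
```

Restarting from the first scenario after each fix keeps a handler change from silently breaking a scenario that passed earlier.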

### Scenario recipes

Once every scenario passes, the agent emits `scenario-recipes.json`. Each recipe is the **exact nested tree** that was proven to work in `up`, plus a `variables` block mapping every `{{token}}` to the concrete value used during validation. The file is validated against `ScenarioRecipesFileSchema` (in `@autonoma/types`) by both the local preflight and the dashboard upload endpoint. The full field-by-field contract (including the `variables` tagged union and all rejection reasons) lives in the [Scenario Recipe Schema reference](/reference/scenario-recipe-schema/). The shape is:

```json
{
  "version": 1,
  "source": {
    "discoverPath": "autonoma/discover.json",
    "scenariosPath": "autonoma/scenarios.md"
  },
  "validationMode": "endpoint-lifecycle",
  "recipes": [
    {
      "name": "standard",
      "description": "Realistic dataset for core flows",
      "create": {
        "Organization": [ { "_alias": "org1", "name": "Acme", "projects": [ { "title": "{{project_title}}" } ] } ]
      },
      "variables": {
        "project_title": { "strategy": "literal", "value": "Launch Campaign" }
      },
      "validation": { "status": "validated", "method": "endpoint-up-down", "phase": "ok", "up_ms": 12, "down_ms": 8 }
    }
  ]
}
```

Required invariants (the upload endpoint rejects otherwise):

* `version` is the integer `1` (not the string `"1.0"`).
* `source` is an object with BOTH `discoverPath` and `scenariosPath` as non-empty strings.
* `validationMode` is `"sdk-check"` or `"endpoint-lifecycle"`.
* `recipes` is an array (not a map) with at least one entry; each entry has `name`, `description`, `create`, and `validation`.
* `variables` values use `strategy: "literal" | "derived" | "faker"`. `derived` additionally requires `source: "testRunId"` and a `format` string. `faker` requires a `generator` id.
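
The invariants above translate fairly directly into a structural check. The sketch below is a local approximation for catching obvious mistakes early. It is not `ScenarioRecipesFileSchema` and covers only the rules listed in this section, so the upload endpoint may reject files for reasons it does not model. The function names are illustrative.

```python
VALIDATION_MODES = ("sdk-check", "endpoint-lifecycle")
REQUIRED_RECIPE_KEYS = ("name", "description", "create", "validation")


def check_variable(path, spec):
    """Check one `variables` entry against the strategy rules listed above."""
    if not isinstance(spec, dict):
        return [f"{path}: must be an object"]
    strategy = spec.get("strategy")
    if strategy == "literal":
        return []
    if strategy == "derived":
        ok = spec.get("source") == "testRunId" and isinstance(spec.get("format"), str)
        return [] if ok else [f"{path}: derived needs source 'testRunId' and a format string"]
    if strategy == "faker":
        return [] if spec.get("generator") else [f"{path}: faker needs a generator id"]
    return [f"{path}: strategy must be literal, derived, or faker"]


def check_recipes_file(doc):
    """Return a list of violated invariants; an empty list means the file passes."""
    errors = []
    version = doc.get("version")
    # bool is a subclass of int in Python, so compare the exact type.
    if type(version) is not int or version != 1:
        errors.append("version must be the integer 1")
    source = doc.get("source")
    if not isinstance(source, dict):
        errors.append("source must be an object")
    else:
        for key in ("discoverPath", "scenariosPath"):
            value = source.get(key)
            if not isinstance(value, str) or not value:
                errors.append(f"source.{key} must be a non-empty string")
    if doc.get("validationMode") not in VALIDATION_MODES:
        errors.append("validationMode must be sdk-check or endpoint-lifecycle")
    recipes = doc.get("recipes")
    if not isinstance(recipes, list) or not recipes:
        errors.append("recipes must be a non-empty array")
        return errors
    for index, recipe in enumerate(recipes):
        if not isinstance(recipe, dict):
            errors.append(f"recipes[{index}] must be an object")
            continue
        for key in REQUIRED_RECIPE_KEYS:
            if key not in recipe:
                errors.append(f"recipes[{index}] is missing {key}")
        variables = recipe.get("variables") or {}
        if isinstance(variables, dict):
            for name, spec in variables.items():
                errors.extend(check_variable(f"recipes[{index}].variables.{name}", spec))
    return errors
```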

### Preflight

Before uploading, the agent runs `preflight_scenario_recipes.py` against the file. Preflight is a deterministic Python check that enforces structural invariants:

* every scenario listed in `scenarios.md` frontmatter appears as a recipe
* every `{{token}}` referenced in a tree is declared in `variables`
* the create tree roots at the scope entity from `discover`
* `variables` values are concrete, not placeholders

If preflight fails, the agent stops — the dashboard never sees a malformed recipe.
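
Two of these checks are easy to approximate by hand when debugging a failed preflight. The sketch below is not `preflight_scenario_recipes.py`. It assumes tokens are written as `{{identifier}}`, as in the example recipe, and that scenario names have already been read from the `scenarios.md` frontmatter. Both function names are hypothetical.

```python
import json
import re

# Matches {{token}} where token looks like an identifier.
TOKEN = re.compile(r"\{\{([A-Za-z_][A-Za-z0-9_]*)\}\}")


def undeclared_tokens(recipe):
    """Return tokens used in recipe['create'] that recipe['variables'] does not declare."""
    # Serializing the tree lets one regex scan every nested string value.
    used = set(TOKEN.findall(json.dumps(recipe.get("create", {}))))
    declared = set(recipe.get("variables", {}))
    return sorted(used - declared)


def missing_recipes(scenario_names, doc):
    """Return scenario names that have no recipe with a matching `name`."""
    present = {recipe.get("name") for recipe in doc.get("recipes", [])}
    return [name for name in scenario_names if name not in present]
```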

### Upload

On success, the plugin orchestrator uploads the recipes to `/v1/setup/setups/:id/scenario-recipe-versions`. The response must be 200 or 201. Upload failures also block Step 6.

## Review checkpoint

After validation completes, review:

* **Scenario edits** — did the agent modify `scenarios.md`? If yes, read the edits carefully. A small edit (correcting a field name) is fine; a large structural change suggests the original scenario design missed something and is worth revisiting before moving on.
* **Auth block** — the `up` response’s `auth` block is what tests use to log in. Confirm it contains usable credentials (session cookie, JWT, etc.) for every role the scenarios define.
* **Clean teardown** — the agent verified `down` leaves no orphans. If your schema has triggers or cascade rules that the ORM doesn’t know about, this is where you’ll catch them.
* **Upload success** — the recipes uploaded successfully and are visible on the Autonoma dashboard for this generation.

## What happens next

Step 6 (E2E Test Generation) consumes `scenarios.md` (possibly edited) as the source of truth for test data. Every `{{token}}` placeholder in the tests corresponds to a variable declared in `scenario-recipes.json`, so the test runner can substitute the real values at execution time.

## Safety

The validator only writes through the SDK endpoint. It never runs INSERT, UPDATE, DELETE, DROP, or TRUNCATE directly, even if validation fails repeatedly. Read-only `SELECT` queries are used for database verification. The SDK’s `down` action is the only deletion path, and it only removes what the matching `up` created (verified by the signed refs token).

## The prompt

# Scenario Validator: iterative fix loop + reality reconciliation

The Environment Factory endpoint exists (Step 4 wrote `autonoma/.endpoint-implemented`). Your job is to prove it actually works and keep iterating until it does. The E2E test generator (Step 6) is gated on your sentinel — if you do not write `autonoma/.endpoint-validated`, no tests get generated.

## Database safety (absolute)

* ALL writes go through the SDK endpoint only. Never INSERT/UPDATE/DELETE/DROP/TRUNCATE via psql or raw SQL.
* You MAY run SELECT via psql / ORM read queries to verify data.
* The SDK’s `down` action deletes only what `up` created (signed refs token).

## Inputs

* `autonoma/entity-audit.md`
* `autonoma/scenarios.md` (may contain mistakes you will correct)
* The handler file created in Step 4
* A running dev server
* `AUTONOMA_SDK_ENDPOINT` and `AUTONOMA_SHARED_SECRET`

## Outputs

* `autonoma/scenario-recipes.json`
* `autonoma/.scenario-validation.json`
* `autonoma/.endpoint-validated`

## The loop

Repeat until all three actions succeed for every scenario OR you exhaust 5 iterations:

1. Fetch protocol docs (first iteration only):

   ```bash
   curl -sSfL "$(cat autonoma/.docs-url)/llms/protocol.txt"
   curl -sSfL "$(cat autonoma/.docs-url)/llms/scenarios.txt"
   curl -sSfL "$(cat autonoma/.docs-url)/llms/test-planner/step-5-validate.txt"
   ```

2. Export working secrets:

   ```bash
   export AUTONOMA_SHARED_SECRET="${AUTONOMA_SHARED_SECRET:-$(openssl rand -hex 32)}"
   export AUTONOMA_SIGNING_SECRET="${AUTONOMA_SIGNING_SECRET:-$(openssl rand -hex 32)}"
   ```

3. Run `discover` via curl with proper HMAC.

   * Response MUST contain `schema.models`, `schema.edges`, `schema.relations`, `schema.scopeField`.
   * **Coverage check**: every model in `entity-audit.md` MUST appear in `schema.models`.
   * **Factory coverage check**: every model with `independently_created: true` MUST be registered on the handler.
   * **Factory-body integrity check (deterministic, MANDATORY)**: grep the handler for raw DB/ORM writes. Any inline ORM/raw-SQL create inside a factory body for a model marked `independently_created: true` is a FAIL — fix the handler to import and call the audited function and restart.

4. For each scenario in `scenarios.md`:

   1. Build `{action:"up", create:..., testRunId:"<scenario>-<iteration>"}` from the scenario.

   2. HMAC-sign and POST.

   3. On failure, pick one of three paths:

      * **Handler bug** → fix the handler and restart.
      * **Scenario bug** (field does not exist, FK target wrong) → edit `scenarios.md` to match reality and restart. Log the change.
      * **Unfeasible scenario** → REMOVE it from `scenarios.md` with justification. Restart.

   4. On 200, parse `auth`, `refs`, `refsToken`.

      * **Auth check**: `auth` MUST be non-null and contain at least one of `{ cookies, headers, token, user }`.
      * **Refs check**: every top-level model in the `create` tree MUST appear in `refs`.

   5. Verify DB state with a read-only `SELECT` for at least one refs id.

   6. POST `{action:"down", refsToken}`. Expect `{ok:true}`.

   7. Verify the refs rows are gone.

5. After every scenario passes cleanly, emit the scenario recipes.

   Write `autonoma/scenario-recipes.json`:

   ```json
   {
     "version": 1,
     "source": {
       "discoverPath": "autonoma/discover.json",
       "scenariosPath": "autonoma/scenarios.md"
     },
     "validationMode": "endpoint-lifecycle",
     "recipes": [
       {
         "name": "standard",
         "description": "Realistic dataset for core flows",
         "create": {
           "Organization": [{ "_alias": "org1", "name": "Acme Corp" }]
         },
         "variables": {
           "testRunId": { "strategy": "derived", "source": "testRunId", "format": "{testRunId}" }
         },
         "validation": { "status": "validated", "method": "endpoint-up-down", "phase": "ok", "up_ms": 12, "down_ms": 8 }
       }
     ]
   }
   ```

   Rules:

   * Top-level keys MUST be exactly `version`, `source`, `validationMode`, `recipes`
   * `version` must be integer `1`
   * `source` MUST be an object with BOTH `discoverPath` (path to `autonoma/discover.json`) and `scenariosPath` (path to `autonoma/scenarios.md`) as non-empty strings. The dashboard `/v1/setup/setups/:id/scenario-recipe-versions` endpoint will reject the upload if either is missing.
   * `validationMode` must be `sdk-check` or `endpoint-lifecycle`
   * `recipes` MUST include `standard`, `empty`, and `large`
   * Every recipe MUST contain `name`, `description`, `create`, and `validation`
   * `create` MUST use a nested tree rooted at the scope entity. Do NOT use flat top-level model keys connected only by `_ref`.
   * If `create` contains `{{token}}` placeholders, include a `variables` object. Every `{{token}}` in `create` must match a key in `variables`; every key in `variables` must be used in `create`.

6. Run preflight on the emitted recipes:

   ```bash
   python3 "$(cat /tmp/autonoma-plugin-root)/hooks/preflight_scenario_recipes.py" \
     autonoma/scenario-recipes.json
   ```

   This resolves tokenized payloads and re-runs signed up/down against the live endpoint. If preflight exits non-zero, fix the failing recipe and re-run.

7. Write `autonoma/.scenario-validation.json`:

   ```json
   {
     "status": "ok",
     "preflightPassed": true,
     "smokeTestPassed": true,
     "validatedScenarios": ["standard", "empty", "large"],
     "failedScenarios": [],
     "blockingIssues": [],
     "recipePath": "autonoma/scenario-recipes.json",
     "validationMode": "endpoint-lifecycle",
     "endpointUrl": "http://localhost:3000/api/autonoma"
   }
   ```

8. Write the sentinel `autonoma/.endpoint-validated` via the `Write` tool (NOT `touch`) with a short plain-text report.
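
The recipe rules in step 5 are mechanical enough to check before running preflight. The sketch below is an illustrative pre-check, not the preflight hook itself: `check_recipes` is a hypothetical helper, and it does not cover the nested-tree rule for `create`.

```python
import json
import re

TOKEN = re.compile(r"\{\{\s*(\w+)\s*\}\}")
TOP_LEVEL_KEYS = {"version", "source", "validationMode", "recipes"}
REQUIRED_RECIPES = {"standard", "empty", "large"}
REQUIRED_FIELDS = {"name", "description", "create", "validation"}
VALIDATION_MODES = ("sdk-check", "endpoint-lifecycle")


def check_recipes(doc: dict) -> list[str]:
    """Return a list of rule violations; an empty list means the checks passed."""
    errors = []
    if set(doc) != TOP_LEVEL_KEYS:
        errors.append("top-level keys must be exactly " + ", ".join(sorted(TOP_LEVEL_KEYS)))
    version = doc.get("version")
    if type(version) is not int or version != 1:  # rejects True and 1.0
        errors.append("version must be integer 1")
    source = doc.get("source")
    for key in ("discoverPath", "scenariosPath"):
        value = source.get(key) if isinstance(source, dict) else None
        if not isinstance(value, str) or not value:
            errors.append(f"source.{key} must be a non-empty string")
    if doc.get("validationMode") not in VALIDATION_MODES:
        errors.append("validationMode must be sdk-check or endpoint-lifecycle")
    recipes = doc.get("recipes")
    if not isinstance(recipes, list):
        return errors + ["recipes must be a list"]
    recipes = [r for r in recipes if isinstance(r, dict)]
    missing = REQUIRED_RECIPES - {r.get("name") for r in recipes}
    if missing:
        errors.append("missing recipes: " + ", ".join(sorted(missing)))
    for recipe in recipes:
        name = recipe.get("name", "<unnamed>")
        absent = REQUIRED_FIELDS - set(recipe)
        if absent:
            errors.append(f"{name}: missing " + ", ".join(sorted(absent)))
        # Tokens used in create must match the declared variables in both directions.
        tokens = set(TOKEN.findall(json.dumps(recipe.get("create", {}))))
        variables = set(recipe.get("variables") or {})
        if tokens != variables:
            errors.append(f"{name}: tokens {sorted(tokens)} do not match variables {sorted(variables)}")
    return errors
```

A check like this only catches shape problems early. The preflight run against the live endpoint remains the authority on whether a recipe actually works.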

## Iteration discipline

* One handler fix per iteration, then re-run everything.
* If the same scenario fails twice in a row with the same error, the scenario itself is probably wrong — prefer editing `scenarios.md`.
* If you have edited `scenarios.md`, re-read it from disk after every edit.
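
The discipline above, combined with the 5-iteration cap, amounts to a small control loop. The sketch below shows one way to express it. The three callbacks and the shape of the returned dictionary are hypothetical stand-ins for the real work of running scenarios and applying fixes.

```python
from typing import Callable, Optional, Tuple

MAX_ITERATIONS = 5
Failure = Tuple[str, str]  # (scenario name, error message)


def validate_with_cap(
    run_all: Callable[[int], Optional[Failure]],
    fix_handler: Callable[[Failure], None],
    fix_scenario: Callable[[Failure], None],
) -> dict:
    """Re-run everything after each single fix, stopping at the iteration cap."""
    previous = None
    for iteration in range(1, MAX_ITERATIONS + 1):
        failure = run_all(iteration)  # None means every scenario passed
        if failure is None:
            return {"passed": True, "iterations": iteration}
        if failure == previous:
            # Same scenario, same error, twice in a row: suspect the scenario.
            fix_scenario(failure)
        else:
            fix_handler(failure)
        previous = failure
    return {"passed": False, "iterations": MAX_ITERATIONS, "lastFailure": previous}
```

When the result reports `passed` as false, the sentinel is not written and the last failure feeds the failure report.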

## When you hit the 5-iteration cap

STOP and write a clear failure report. Do NOT write `.endpoint-validated`. Include the last failing curl body + response, which scenario(s) failed, and which handler file + line range is most likely at fault. The orchestrator surfaces this to the user.

## scenarios.md reconciliation rules

Preserve the frontmatter shape (the validator hook checks it). Allowed:

* Drop a scenario entirely (decrement `scenario_count`, update the `scenarios` summary).
* Remove/rename fields on a model to match what `discover` reports.
* Adjust FK aliases so they reference models that actually exist.
* Flatten cross-branch references that the handler cannot resolve.

Disallowed: silently changing a scenario’s intent (e.g. renaming “admin with one project” to “user with one project” without reflecting that in the description).

# Step 6: E2E Tests

> Generate an exhaustive E2E test suite as markdown files, ready to upload to Autonoma. Runs against scenario data that was validated end-to-end in Step 5.

The test generation agent produces an exhaustive set of E2E test cases as natural language markdown files. Tests are distributed across tiers - core flows get 50-60% of all tests - and span happy paths, input validation, state persistence, navigation, and cross-flow journey tests. An adversarial review agent runs afterward to find gaps.

Step 6 consumes the named scenarios from Step 3 — reconciled with reality in Step 5 — which can contain both:

* **fixed values** that tests should assert directly
* **generated value tokens** that tests should reference symbolically, such as `({{project_title}} variable)`

## Prerequisites

* `autonoma/AUTONOMA.md` must exist (output from [Step 1](/test-planner/step-1-knowledge-base/))
* `autonoma/scenarios.md` must exist and have been validated end-to-end (output from [Step 3](/test-planner/step-3-scenarios/), reconciled in [Step 5](/test-planner/step-5-validate/))

> **Your scenario data is already validated:**
>
> Your scenario data was validated end-to-end in Step 5 — the Environment Factory successfully created and tore down every scenario. This means the data references in your tests are proven to work against a real database.

## What this produces

* `qa-tests/` directory with subdirectories per flow, each containing markdown test files
* `qa-tests/INDEX.md` - summary of test count, tier distribution, and adversarial review results

## Review checkpoint

After the agent finishes generating the tests, it will sample a set of **Journey** and **Critical** priority tests and walk you through them. These are the tests that carry the most signal about overall quality.

**Journey tests** chain multiple flows together into one continuous path (e.g., create app, add test steps, run test, inspect results). If a journey test is poorly written, it means the agent doesn’t understand how your flows connect.

**Critical tests** cover core flow happy paths and the most important validations. If these reference wrong button labels or use vague assertions, the rest of the suite is likely to have the same problems.

For each sampled test, the agent will explain what flow it covers, why it was prioritized, and what bug it’s designed to catch. Here’s what to look for:

* **Do the steps reference actual UI text?** “Click the ‘Save Changes’ button” is correct. “Click the save button” is a red flag - it means the agent may not have read the actual code.
* **Are assertions specific?** “Assert that a green toast appears with the text ‘Test saved successfully’” is a real test. “Assert that the save was successful” will pass regardless of what happens.
* **Are generated values referenced correctly?** If Step 3 marked a field as variable (and Step 5 confirmed it), the test should say something like `({{project_title}} variable)` instead of inventing a literal title.
* **Does the test budget feel right?** If 70% of tests are for settings pages and 30% for your core flow, ask the agent to rebalance.

You don’t need to review every test. Focus on the journey tests and a random sample from each core flow.

## The prompt

# E2E Test Generation Agent

You are a senior QA engineer. Your job is to analyze this codebase and produce an exhaustive set of end-to-end test cases written as natural language step-by-step guides, as if you were writing instructions for a human tester sitting in front of the application.

The goal is not to write “tests.” The goal is to **find bugs**. Be adversarial. Think like a user who does weird things. Think like a developer who forgot an edge case. Every test you write should have a realistic chance of catching a real bug. If a small test has even a chance of catching a bug, include it. You can always trim later - but a bug that ships because you didn’t write the test is a bug that reaches users.

***

## Phase 0: Prerequisites

### 0.1 - Locate the AUTONOMA knowledge base

Look for an `autonoma/` directory in the workspace. It should contain:

* `AUTONOMA.md` - the main application guide

If it doesn’t exist, tell the user:

> “I need the AUTONOMA knowledge base to generate tests that reference it. Please run the AUTONOMA Knowledge Base Generator first, then come back and run this prompt.”

Do not proceed without it.

### 0.2 - Read the knowledge base

Read `AUTONOMA.md` fully. Understand:

* What the application is and does

* The user roles

* The navigation structure

* The **Core flows** section - these are the most important workflows. They will receive the deepest test coverage.

* The **Core flows** table - this is the flat inventory of features/areas and whether each one is core

* The **Preferences** section - respect everything in it:

  * **Skip these areas**: Do not generate tests for skipped areas
  * **Assume these conditions**: Bake these into test preconditions
  * **Ignore these elements**: Don’t interact with or assert against these
  * **Don’t report these as bugs**: Don’t write tests specifically targeting these

### 0.3 - Load test data scenarios (if provided)

Look for a `scenarios.md` file alongside the AUTONOMA knowledge base (in `autonoma/` or provided separately). This file describes named test data scenarios - pre-configured environments with known data.

If scenarios exist:

* Each scenario has a name, credentials, and a detailed inventory of what data exists (entities, counts, relationships).
* **Tests must reference scenarios by name** in their Setup section: `Using scenario: standard`
* **Tests must assert against known data** from the scenario. If the scenario says application “My Web App” exists, the test says `filter by application "My Web App"` - not `filter by an application`.
* **If the scenario marks a field as generated, tests must reference the token instead of a literal.** For example, if `variable_fields` includes `{{project_title}}`, write `click the project titled ({{project_title}} variable)` - not `click the project titled "My Project"`.
* **Never write conditional test steps.** The scenario guarantees the data exists. Don’t write “if X exists, do A; otherwise verify B.” Each test follows exactly one path.

If no scenarios file exists, tests should describe the setup steps inline. Even in this case, never write conditional steps - if a test needs data, the setup must create it.
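
Two of these rules lend themselves to a quick automated check on a generated test file: the scenario reference and the ban on conditional steps. The sketch below is a heuristic lint under the assumption that a scenarios file exists. `lint_test` is a hypothetical helper, and its regular expressions will miss or over-flag some phrasings.

```python
import re

SCENARIO = re.compile(r"Using scenario:\s*([\w-]+)")
# A step whose text begins with "If", optionally after a list marker.
STEP_STARTS_WITH_IF = re.compile(r"^\s*(?:\d+\.|[-*])?\s*if\b", re.IGNORECASE)
OTHERWISE = re.compile(r"\botherwise\b", re.IGNORECASE)


def lint_test(markdown: str, scenario_names: set[str]) -> list[str]:
    """Flag a missing or unknown scenario reference and conditional-looking steps."""
    problems = []
    match = SCENARIO.search(markdown)
    if match is None:
        problems.append("no 'Using scenario: <name>' line found")
    elif match.group(1) not in scenario_names:
        problems.append(f"unknown scenario '{match.group(1)}'")
    for number, line in enumerate(markdown.splitlines(), start=1):
        if STEP_STARTS_WITH_IF.search(line) or OTHERWISE.search(line):
            problems.append(f"line {number}: conditional step")
    return problems
```

Flagged lines still need a human or agent to rewrite them so that the test follows exactly one path.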

***

## Phase 1: Discovery

### 1.1 - Confirm this is a frontend project

Look at the project structure, dependencies, and framework in use. If this is a **backend-only project** (no UI, no pages, no components), stop here and tell the user:

> “This looks like a backend-only project. These E2E tests are designed for frontend applications - I need access to the frontend codebase to generate them. Can you point me to the frontend repo or add it to this workspace?”

Do not proceed until you have a frontend codebase.

### 1.2 - Identify the tech stack

From `package.json`, config files, and the codebase structure, determine:

* Framework (React, Next.js, Vue, Svelte, Angular, etc.)
* Routing approach (file-based, react-router, etc.)
* State management (Redux, Zustand, Context, Pinia, etc.)
* API layer (REST, GraphQL, tRPC, etc.)
* Auth mechanism (session, JWT, OAuth, etc.)
* UI library if any (shadcn, MUI, Ant Design, Chakra, custom, etc.)

You’ll use this context to write smarter tests. For example, if you see React with Context, you know stale state bugs are likely. If you see optimistic updates, you know rollback bugs are likely.

### 1.3 - Page-by-page feature decomposition

This is the most critical phase. You will systematically decompose every page in the application into individual features. The goal is to produce a **complete feature inventory** - a written list of every interactive element and capability in the entire application. This inventory drives everything: the test budget, the folder structure, and the test suite itself.

**Do not skip this phase or do it in your head.** Write the inventory to a scratchpad file as you go.

#### Step 1: Create the scratchpad

Create the file `qa-tests/.scratchpad/feature-inventory.md`. You will append to this file after decomposing each page. This file serves two purposes:

* It survives context compaction (so you don’t lose progress)
* It becomes the source of truth for the folder structure in Phase 7

#### Step 2: List every page/route

Using the codebase (route files, navigation components, sidebar config) AND the AUTONOMA knowledge base (core flows table), produce a flat list of every page a user can visit. Include the URL pattern and a short description.

Write this list to the scratchpad under a `## Pages` heading.

#### Step 3: Decompose each page into features

For **each** page in the list, go through the following checklist mechanically. Do not skip categories - even if you think a category doesn’t apply, check the code to confirm.

**Interactive element checklist** (go through every item):

| Category                         | What to look for                                                 | What to write down                                                                                                            |
| -------------------------------- | ---------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------- |
| **Buttons**                      | Every button on the page                                         | Name, what it triggers (navigation, modal, API call, download, toggle)                                                        |
| **Forms**                        | Every form or input group                                        | Each field name, field type, what submission does, what validation exists                                                     |
| **Tables/lists**                 | Every data collection                                            | Columns, row click behavior, row action buttons, sort options, filter controls, pagination, bulk selection checkbox           |
| **Tabs**                         | Tab bars within the page                                         | Each tab name, what content it shows                                                                                          |
| **Sidebar sections**             | Side panels, secondary navigation                                | Each section, what it contains, whether it collapses                                                                          |
| **Dropdown menus**               | Every “…” menu, context menu, action menu                        | Open each one (read the code), list every option inside                                                                       |
| **Modals/dialogs**               | Every overlay, drawer, popup                                     | What triggers it, every field/button inside it                                                                                |
| **Search/filter controls**       | Search bars, filter dropdowns, date pickers                      | Each individual filter type, clear/reset mechanism                                                                            |
| **Toggles/switches**             | Checkboxes, radio buttons, toggle switches that trigger behavior | What each one controls                                                                                                        |
| **Drag-and-drop**                | Reorderable lists, drag targets                                  | What can be dragged, where it can be dropped                                                                                  |
| **Embedded/interactive content** | Canvas, video, iframe, code editor, device stream                | What it displays, what interactions are possible (click, scroll, type, drag)                                                  |
| **Conditional UI**               | Elements that appear only in certain states                      | Bulk action bars (on selection), hover actions, expandable sections, state-dependent badges, disabled-until-condition buttons |

After going through the checklist, **group the elements into named features**. Ask yourself: “If a user described what they can DO on this page, what verbs/actions would they list?” Each distinct capability = a feature.

**Name each feature** using the UI’s own vocabulary - tab labels, section headings, button text. Do not invent abstract category names.

**Decide nesting**: If a feature has 5+ distinct sub-behaviors, split it into sub-features. For example:

* “Transaction filters” with date range, status dropdown, amount range, text search, saved filters, clear all = 6 sub-behaviors = split into sub-features
* “Delete project” with just a confirmation dialog = 1 behavior = keep as a single feature

**Merge after decomposing, not before.** Do NOT pre-merge features while listing them. First, list every feature granularly. Then, after you’ve finished decomposing a page (or all pages), look for semantic similarities and merge. If you see `workout-videos`, `workout-challenges`, `workout-friends` as separate features, merge them into a single `workout` parent with sub-features. The rule is: decompose first, then look at what you wrote and group by shared prefix or domain. This prevents accidentally hiding features inside a premature abstraction.
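
The "group by shared prefix" step can be pictured with a small sketch. It assumes features are written as hyphenated slugs like the `workout-*` example; `merge_by_prefix` is a hypothetical helper, and merging by domain rather than by literal prefix still needs judgment.

```python
from collections import defaultdict


def merge_by_prefix(features: list[str], sep: str = "-") -> dict[str, list[str]]:
    """Group granular feature names under a parent when they share a prefix."""
    groups: dict[str, list[str]] = defaultdict(list)
    for feature in features:
        prefix, _, rest = feature.partition(sep)
        groups[prefix].append(rest)  # rest is "" when there is no separator

    merged: dict[str, list[str]] = {}
    for prefix, subs in groups.items():
        if len(subs) > 1:
            # Several features share the prefix: make it a parent with sub-features.
            merged[prefix] = sorted(s for s in subs if s)
        else:
            # A lone feature keeps its full original name and has no sub-features.
            name = prefix if not subs[0] else f"{prefix}{sep}{subs[0]}"
            merged[name] = []
    return merged


inventory = ["workout-videos", "workout-challenges", "workout-friends", "delete-project"]
grouped = merge_by_prefix(inventory)
```

Running the merge only after the granular list exists keeps every original feature name visible in the input, which is the point of decomposing first.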

**Write the decomposition to the scratchpad** before moving to the next page. Use this format:

```plaintext
## Page: /payments


### Features:
- **Wire transfer** (form): amount field (number, required), recipient field (text, autocomplete from saved), currency selector (dropdown, 3 options), submit button -> confirmation modal
  - Sub-features: recipient autocomplete, currency conversion preview
- **Transaction history** (table): columns (date, amount, status, recipient), row click -> detail panel
  - Sub-features: filter by date range, filter by status (5 statuses), filter by amount range, sort by date/amount, export CSV button, pagination (20/page)
- **Saved recipients** (sidebar list): add new (modal with name + account fields), edit, delete with confirmation, search, click to pre-fill wire transfer form
```

#### Step 4: Deep-dive on core flows

For features that belong to **core flows** (from AUTONOMA.md’s “Core flows” section), go one level deeper:

* **Every action/command type** within the flow (e.g., if a test builder supports click, assert, scroll, fetch - list each one)
* **Every configuration option** (advanced settings, optional toggles, conditional fields)
* **Every trigger point** (e.g., “run a test” can be triggered from the detail page, from a folder, from a schedule)
* **Every result/outcome state** (passed, failed, running, pending, healing)
* **The end-of-flow experience** - what happens at the final step? If there’s a save/publish/submit dialog, what fields does it have? What options? What validations?
* **Live/interactive content interactions** - if the flow embeds a live device, browser, or canvas, what does the user do with it? What feedback do they get? What states does it go through?

Add this detail to the scratchpad under each relevant feature.

#### Step 5: Cross-page features

After decomposing all individual pages, note any features that span multiple pages:

* Global navigation (sidebar, header, breadcrumbs)
* Toast/notification system
* Global search
* User menu / profile dropdown
* Keyboard shortcuts

Add these under a `## Cross-page features` heading in the scratchpad.

#### Scratchpad checkpoint

After completing all pages, your `qa-tests/.scratchpad/feature-inventory.md` should have: a `## Pages` section listing every route, a `## Page: /path` section for each page with its features and sub-features, and a `## Cross-page features` section.

**Each item in this inventory represents at least one test case.** If your inventory has 80 features, you should expect 80+ tests minimum. If you only wrote 20 tests, you missed features.
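
That comparison can be run as a simple count. The sketch below assumes the scratchpad follows the format shown earlier in this phase (`- **Name**` feature bullets and indented `- Sub-features:` lines with comma-separated entries). The helper names are hypothetical, and the comma split would miscount sub-features whose descriptions contain commas.

```python
import re

FEATURE = re.compile(r"^- \*\*(.+?)\*\*")
SUBFEATURES = re.compile(r"^\s+- Sub-features:\s*(.+)$")


def count_inventory(markdown: str) -> int:
    """Count feature bullets plus the comma-separated sub-features beneath them."""
    total = 0
    for line in markdown.splitlines():
        if FEATURE.match(line):
            total += 1
            continue
        sub = SUBFEATURES.match(line)
        if sub:
            total += len([s for s in sub.group(1).split(",") if s.strip()])
    return total


def coverage_gap(inventory_items: int, tests_written: int) -> int:
    """Number of inventory items that cannot have a test of their own yet."""
    return max(0, inventory_items - tests_written)
```

A non-zero gap means some inventory items cannot yet have their own test. A zero gap does not prove coverage, because several tests may target the same feature.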

**Use subagents to parallelize this decomposition.** Launch multiple agents to explore different parts of the codebase simultaneously - one per major section of the app (e.g., one for the main content area pages, one for settings/admin pages, one for core flow pages). Each agent writes its findings to the scratchpad. This is a large task and serial exploration will take too long.

### 1.4 - Identify and classify flows

Based on the feature inventory from Phase 1.3 and the AUTONOMA knowledge base, classify every flow into one of three tiers:

**Tier 1 - Core flows** (from AUTONOMA.md’s “Core flows” section): The 2-4 workflows that represent the primary reason users use this product. Bugs here directly prevent users from getting value. These flows get the deepest coverage.

**Tier 2 - Important supporting flows**: Flows that users interact with regularly but aren’t the core value proposition. Examples: user management, filtering/searching, folder organization, tagging. Bugs here are annoying but don’t completely block the user.

**Tier 3 - Supporting/administrative flows**: Settings pages, profile management, API keys, integrations, admin panels. Bugs here matter but have the lowest impact on core user value.

**Output the classification as a table before proceeding.** This makes the distribution decision visible and reviewable.

### 1.5 - Plan the test budget

Based on the flow classification, plan how tests will be distributed:

* **Tier 1 (Core flows)**: 50-60% of all tests
* **Tier 2 (Important supporting)**: 25-30% of all tests
* **Tier 3 (Supporting/admin)**: 15-20% of all tests
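
The distribution is easy to verify with arithmetic once a plan exists. The sketch below computes each tier's share of the planned tests and reports tiers outside the ranges listed above. The function names and the counts in `plan` are hypothetical.

```python
# Percent-of-all-tests ranges per tier, taken from the list above.
TIER_RANGES = {1: (50, 60), 2: (25, 30), 3: (15, 20)}


def tier_shares(planned: dict[str, tuple[int, int]]) -> dict[int, float]:
    """planned maps a flow name to (tier, planned test count)."""
    total = sum(count for _tier, count in planned.values())
    if total == 0:
        raise ValueError("no tests planned")
    shares = {tier: 0.0 for tier in TIER_RANGES}
    for tier, count in planned.values():
        shares[tier] += 100.0 * count / total
    return shares


def out_of_range(shares: dict[int, float]) -> dict[int, float]:
    """Return the tiers whose share falls outside the target range."""
    return {
        tier: round(share, 1)
        for tier, share in shares.items()
        if not TIER_RANGES[tier][0] <= share <= TIER_RANGES[tier][1]
    }


# Hypothetical counts: 55 Tier 1, 28 Tier 2, 17 Tier 3 out of 100 planned tests.
plan = {
    "Test Creation & Step Authoring": (1, 30),
    "Running Tests & Results": (1, 25),
    "Folder Management": (2, 28),
    "Settings - Variables": (3, 17),
}
flagged = out_of_range(tier_shares(plan))
```

An empty result means the plan sits inside all three ranges. Any flagged tier is a prompt to revisit the allocation.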

There is **no upper bound** on the total test count. Write as many tests as needed to be confident you’ll catch every bug the application might have. If a core flow has 12 command types, 3 platform variations, 5 trigger points, a publish flow, and interactive canvas interactions - that’s easily 40+ tests for that one flow. That’s correct, not excessive.

Produce a table showing the planned allocation:

```plaintext
| Flow | Tier | Planned Tests | Rationale |
|------|------|---------------|-----------|
| Test Creation & Step Authoring | 1 | ~30 | 10 step types x variations + creation paths |
| Running Tests & Results | 1 | ~25 | Run triggers + result states + inspection |
| Folder Management | 2 | ~12 | CRUD + tree navigation |
| Settings - Variables | 3 | ~5 | Simple CRUD |
| ... | ... | ... | ... |
```

The “Rationale” column should reference the enumeration from Phase 1.3. If a core flow has 10 command types, the rationale should say “10 command types x at least 1 happy path + key validations.”

**Sanity check**: Look at the Tier 1 allocation. Does it feel proportional to the complexity you discovered? If a core flow has 10 sub-types and you’ve only allocated 8 tests, something is wrong. Adjust upward.

***

## Phase 2: Test generation strategy

You will generate tests across these categories for **every** flow. The depth of coverage per category depends on the flow’s tier.

### Category A - Happy paths

The ideal user journey. Everything works. User does exactly what they’re supposed to do. These are your baseline - if these fail, something is very broken.

**For Tier 1 flows**: Write a happy path for **every variation and sub-type**. If “Create Application” has web/Android/iOS paths, write three happy path tests, not one. If there are 10 step types, write a happy path for each. If there’s live/interactive content (device canvas, browser iframe), write happy path tests for interacting with that content - not just the surrounding controls.

**For Tier 2/3 flows**: One happy path per flow is usually sufficient.

### Category B - Input validation and bad data

For **every single input field** in the application, test:

* Empty submission (submit with nothing)
* Whitespace-only input
* Extremely long input (500+ characters)
* Special characters (`<script>alert('xss')</script>`, `'; DROP TABLE`, emojis, unicode)
* Wrong data type (letters in number fields, numbers in email fields)
* Boundary values (0, -1, 99999999, dates in the past, dates far in the future)
* SQL/XSS injection strings (not to actually exploit - just to verify the app doesn’t break)

**For Tier 1 flows**: Test every input field with multiple bad data variations.

**For Tier 2/3 flows**: Test key fields - required fields, fields that affect other behavior.
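
One way to keep this list from being applied unevenly is to treat it as data and expand it per field. The sketch below is illustrative only: the category names, the concrete sample values, and `cases_for_field` are assumptions, and date boundaries are left out because they depend on each field's format.

```python
# Sample bad values per Category B bullet; the concrete strings are examples only.
BAD_INPUTS: dict[str, list[str]] = {
    "empty": [""],
    "whitespace_only": ["   ", "\t\n"],
    "extremely_long": ["x" * 501],
    "special_characters": ["<script>alert('xss')</script>", "'; DROP TABLE", "😀", "ñü漢字"],
    "wrong_type": ["abc", "12345"],
    "boundary": ["0", "-1", "99999999"],
}


def cases_for_field(field: str) -> list[tuple[str, str, str]]:
    """Expand the bad-input table into (field, category, value) test cases."""
    return [
        (field, category, value)
        for category, values in BAD_INPUTS.items()
        for value in values
    ]


cases = cases_for_field("amount")
```

For a Tier 1 field, each tuple could become its own test. For Tier 2/3 fields, a subset of categories is usually enough.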

### Category C - State and data persistence

* Fill a form partially, navigate away, come back - is the data there or correctly gone?
* Open a modal, close it, reopen it - is it in a clean state?
* Edit something, refresh the page - did it save?
* Create something, immediately try to edit it
* Delete something, verify it’s gone from all lists/views that showed it
* Create duplicate entries - does the app handle it?
* Perform an action, hit the back button - is the state consistent?

### Category D - Loading, async, and data display patterns

* Click a submit button - can you click it again while loading? (double submission)
* Trigger a save, immediately navigate away - does it save correctly?
* What happens when a list is empty? Is there an empty state?
* What happens during loading? Is there a loading indicator?
* Apply multiple filters simultaneously - does the combined filtering work correctly?
* Apply filters, then clear/reset all - does the list return to its unfiltered state?
* Navigate to page 2 of a paginated list, apply a filter - does it reset to page 1?
* Sort a list, then filter - does sorting persist through the filter?

### Category E - Navigation and routing

* Use the browser back/forward buttons throughout a flow
* Manually change the URL to skip steps in a flow
* Bookmark a page mid-flow, close the browser, open the bookmark
* Access a page you shouldn’t have access to (if auth exists)
* Navigate to a page with an invalid ID in the URL (e.g., `/users/nonexistent-id`)

### Category F - Responsive and visual

* Check that modals are scrollable if content overflows
* Check that long text doesn’t break layouts (truncation, overflow)
* Verify that error messages appear near the relevant field, not just as a generic toast

### Category G - Multi-entity interaction

* Does updating entity A correctly reflect in entity B if they’re related?
* Delete a parent entity - what happens to its children?
* If there’s a list with filters/search - does creating a new item show up correctly in the filtered view?
* **Bulk operations**: If the app supports selecting multiple items, test every bulk action (bulk delete, bulk tag, bulk export, etc.) - these are different code paths from single-item operations and are a common source of bugs.

### Category H - Auth and permissions (if applicable)

* Login with valid credentials
* Login with wrong password
* Login with nonexistent account
* Access protected routes while logged out
* Session expiry - what happens mid-action?
* If roles exist - verify each role sees only what they should

***

## Phase 3: Core flow deep dive (write these FIRST)

**Do not write any Tier 2 or Tier 3 tests until all Tier 1 tests are written.**

For each Tier 1 (core) flow:

### Step 1: Enumerate all variations

Before writing any test, produce a checklist of everything that needs testing. This checklist should be derived from the enumeration you did in Phase 1.3. Every entity variant, every action type, every trigger point, every configuration option, every result state, and every interactive content interaction should appear as a line item.

### Step 2: Write tests for every checked item

Each variation gets at least one test. High-complexity variations (like “publish test”) may get multiple tests (happy path + validation + state persistence).

### Step 3: Add cross-category tests

After covering all variations with happy paths, go back and add Category B-H tests for the most important inputs and interactions within the core flow.

### Step 4: Verify end-of-flow completeness

For each core flow, check that you have tests covering the **entire flow from start to finish**, including the final step. Common gaps:

* If the flow ends with a “save” or “publish” dialog, do you have a test for that dialog’s fields and validation?
* If the flow produces a result (a run, a report, an export), do you have a test that inspects that result in detail?
* If the flow has a “new version” or “update” variant (not just “create”), do you have a test for that?
* If the flow involves a list/table page, do you have tests for its filters, sorting, pagination, and bulk actions?
* **If the flow involves live/interactive content**, do you have tests for interacting with that content? (e.g., running a step on the device, clicking elements on a canvas, verifying the canvas updates after an action) Don’t just test the surrounding controls - test the interaction with the embedded content itself.

**The most commonly missed tests are at the end of flows** - thorough tests for steps 1-8 of a 10-step flow, then a vague test for step 10 that says “click save and assert it works.” The save/publish/submit step deserves its own dedicated test with field-level assertions.

### Step 5: Verify interactive content coverage

If any core flow embeds live/interactive content (device streams, browser iframes, canvases, video players), verify you have tests for:

* The embedded content loads and is visible
* The user can interact with the embedded content (click, scroll, type on it)
* The embedded content updates in response to actions in the surrounding UI
* The surrounding UI updates in response to actions on the embedded content
* Error states (what happens if the embedded content fails to load or disconnects)
* Any playback controls (play, pause, scrub, fullscreen) if it’s a video/recording player

***

## Phase 4: Supporting flow coverage (write these SECOND)

After all Tier 1 tests are written, write Tier 2 and Tier 3 tests.

For these flows, the approach is breadth-first:

* One happy path per flow
* Key validation tests for required fields and important inputs
* One state persistence test if there’s a form with save
* Any destructive action tests (delete with confirmation)
* If the flow has a list/table: at least one filter test and one empty state test
* If the flow has dropdown/action menus: test the **behavior** of each menu option, not just its presence

This is the level of coverage you’d expect for non-core features. It ensures nothing is completely untested while keeping the focus on core flows.

**Use subagents to parallelize test writing.** Once the tier classification and budget are established, launch multiple agents to write tests concurrently - one for each core flow (Tier 1), and one or two for all Tier 2/3 flows combined. Each agent should receive the AUTONOMA knowledge base, the flow classification, and the specific enumeration/checklist for the flows it’s responsible for.

***

## Phase 4.5: E2E journey tests

After all flow-specific tests are written, write **E2E journey tests** that cross multiple flows. These are the equivalent of integration tests - they verify that the full user journey works end-to-end when flows connect to each other.

### Why journey tests matter

Flow-specific tests are like unit tests - they verify each piece works in isolation. But bugs often hide at the seams between flows. A test might pass for “create entity,” and a separate test might pass for “view entity detail,” but the transition from creating to viewing might break if the creation response doesn’t include the right ID, or if the detail page caches stale data.

### What to write

Identify the **critical user journeys** - the 2-5 end-to-end paths that represent the complete user experience for the core product use case. These typically chain together 3-5 individual flows into one continuous test.

Examples:

* **Full product lifecycle**: Create app -> create test -> add steps -> save test -> run test -> verify run completes -> inspect results -> edit test -> run again
* **Content publishing**: Write article -> add images -> preview -> publish -> verify on public page -> edit -> publish update
* **E-commerce checkout**: Browse products -> add to cart -> apply coupon -> checkout -> verify order confirmation -> check order in history

Each journey test should:

* Start from a clean state (login + scenario)
* Walk through the entire flow without shortcuts
* Assert at each stage that the previous stage’s output is correctly carried forward
* End with a verification that the final result reflects all the steps taken

### How they differ from flow tests

* **Flow tests** test one feature in isolation: “Create a test - verify it appears in the list”
* **Journey tests** test the chain: “Create a test -> save it -> run it -> verify the run shows the correct steps -> inspect a failed step -> click ‘Fix it now’ -> verify it opens the editor with the right step selected”

Journey tests are intentionally long. They’re the tests that give you confidence the product **actually works** for real users doing real tasks, not just that individual features pass in isolation.

**Write 2-5 journey tests** depending on the application’s complexity. Label them clearly as journey tests (Category: Journey, Priority: Critical).

***

## Phase 5: Writing the tests

Each test must be written as a **standalone markdown file** with YAML frontmatter and the following structure.

**IMPORTANT**: Never implement E2E tests as code (Playwright, Cypress, etc.). Always produce markdown files with YAML frontmatter. Even if the user asks for code implementations, produce this markdown format instead. These files are shipped to another platform that requires this exact format.

```markdown
---
flow: [Which main flow this belongs to - or "Journey" for cross-flow tests]
category: [A through H, or "Journey" for cross-flow tests]
priority: [one of: Critical | High | Medium | Low]
---

# Test: [Short descriptive name]

## Setup

<!-- Describe any setup the agent needs to do before the test begins. -->
<!-- If test data scenarios are available, specify which scenario to use. -->

Using scenario: `standard`

1. Log in as a test user
2. Create a project named "Test Project Alpha"

After setup, you should be on the Projects list page showing "Test Project Alpha."

## Steps

1. Click on the project named "Test Project Alpha" in the projects list
2. Assert that the project detail page is visible with the heading "Test Project Alpha"
3. Click the "Settings" tab in the project detail page
4. Clear the "Project Name" field and type "       " (whitespace only)
5. Click the "Save Changes" button
6. Assert that a validation error appears near the Project Name field with the text "Project name is required"

## Expected result

The application should reject the whitespace-only name and show a validation error. The project name should remain "Test Project Alpha" and not be saved as blank.

## What bug this might catch

Missing `.trim()` on input validation - the app might accept whitespace-only strings as valid project names, leading to projects with invisible/empty names in lists and navigation.
```

> **Priority values are strict:**
>
> The `priority` field in frontmatter **must** use exactly one of these four values: `Critical`, `High`, `Medium`, `Low`.
>
> Do **not** use P0/P1/P2/P3, lowercase variants, hyphenated forms, or any other format. Autonoma parses this field programmatically - using any other value will cause the test to be rejected.
>
> | Priority   | When to use                                                           |
> | ---------- | --------------------------------------------------------------------- |
> | `Critical` | Core flow happy paths, Journey tests - product unusable if this fails |
> | `High`     | Core flow variations, important supporting flows - major UX breakage  |
> | `Medium`   | Validation, state persistence, edge cases                             |
> | `Low`      | Visual checks, minor supporting flows                                 |

### Rules for writing steps:

* **Be hyper-specific.** Not “fill in the form” but “type ‘John Doe’ into the Name field, type ‘john@example.com’ into the Email field, select ‘Admin’ from the Role dropdown.”
* **Use only these actions**: click, scroll, type/input text, and assert. These are the only things the test runner can do.
* **Assertions must be concrete and verifiable.** Never write “Assert that an error appears” or “Assert that the page shows a proper error.” Always specify **what text, what element, or what visual state** the agent should look for. For example: “Assert that the text ‘Name is required’ appears below the Name field” or “Assert that a red banner appears at the top with the text ‘Failed to save.’” If you don’t know the exact error text, go back to the codebase and find it. Vague assertions produce tests that always “pass” because the agent interprets them loosely.
* **For success states, be equally specific.** Don’t write “Verify the save was successful.” Find the exact feedback: what toast appears (title and message text), what page redirect happens (what URL or heading), what UI state changes (button text changes, item appears in list). Search for toast/notification calls near form submission handlers and API mutation callbacks in the codebase.
* **Never write conditional steps.** Don’t write “If X exists, do A; otherwise verify B.” If the test needs X to exist, the setup must ensure X exists (via a scenario or explicit setup steps). Each test follows exactly one path.
* **One test, one scenario.** Don’t combine “happy path” and “error case” in the same test. Split them.
* **After setup, confirm position.** Always include an assertion or description of where the agent should be after completing setup, before the Steps section begins. This anchors the agent.
* **Use realistic test data.** Names, emails, addresses that look real. Not “asdf” or “test123” (unless you’re specifically testing bad input).
* **Use real text from the UI.** If a button says “Save Changes” in the code, write “click the ‘Save Changes’ button” - not “click the save button.” If a dialog title is “Create a test” in the code, write exactly that - not “Create Test.”
* **Test the behavior behind every menu option.** When a dropdown/action menu has options, don’t just verify the options are listed - test what happens when you click each one. Each high-value option should get its own test. At minimum, verify the immediate result (dialog opens, action executes, navigation occurs).

***

## Phase 6: Adversarial review

**After all tests are written, launch a separate subagent to perform an adversarial review.** This agent’s job is to find gaps in your test suite - things you missed, flows that are under-covered, and assertions that are too vague. It should NOT rewrite tests - it should produce a gap report that you then act on.

The review agent should receive:

* The complete list of generated tests (filenames + flow/category/priority)
* The AUTONOMA knowledge base
* The flow classification and test budget from Phase 1

### The review agent evaluates against these criteria:

**1. Core flow completeness**

* For each core flow, are ALL variations from the Phase 3 Step 1 enumeration covered by at least one test?
* Is there a test for the **end of the flow** (save/publish/submit dialog)?
* Is there a test for the “update/edit existing” variant, not just “create new”?
* Are bulk operations tested if the flow involves lists with multi-select?

**2. Tier distribution**

* Is Tier 1 at 50-60% of total tests?
* Are any Tier 2/3 flows over-represented relative to their importance?

**3. Category coverage per core flow**

* Does each core flow have tests from at least categories A, B, and C?
* Are there any forms in core flows with no input validation tests?

**4. Assertion quality**

* Spot-check 15-20 tests. Do assertions specify exact text, exact element, or exact visual state? Flag any assertion that says “assert an error appears” or “assert proper behavior” without specifying what to look for.
* **Check success state assertions specifically.** Flag any test that says “verify save was successful” or “verify it works” without specifying the exact toast text, redirect, or UI change.

**5. Conditional UI coverage**

* Are there tests for UI that only appears under specific conditions? (bulk actions on selection, hover states, expanded sections, data-dependent badges)
* Are there tests for every option in dropdown/“more actions” menus in core flows? Not just that the menu lists options, but that clicking each option produces the expected result?

**6. Data display patterns**

* For every major list/table page: is there at least one filter test, one sort test, one empty state test?
* Is pagination tested for any paginated list?
* Are combined/multi-filter scenarios tested?

**7. Interactive/embedded content coverage**

* If any core flow embeds live content (device stream, browser iframe, canvas, video player), are there tests for interacting with that content?
* Are there tests for the content loading, updating after actions, and error states?
* Are there tests for any playback controls?

**8. Journey test coverage**

* Are there E2E journey tests that chain multiple flows together?
* Do the journey tests cover the critical user paths - the ones that, if broken, would mean the product is fundamentally unusable?

**9. High-value missing tests**

* What are the top 5-10 tests that would add the most value if added? Think about: what bugs would be most embarrassing to ship? What would a user complain about first?

### After the review:

The review agent returns a gap report. **You must address every gap rated as high-value.** Write additional tests to fill critical gaps. You may skip gaps that are genuinely low-value or would require test infrastructure you can’t assume exists (e.g., multi-user scenarios, specific environment configuration).

Update the test count and INDEX.md after adding gap-fill tests.

***

## Phase 6.5: Validation checklist

**Before producing any output, run through this checklist. Do not skip it. Do not proceed to Phase 7 until every item passes.**

Maintain a TODO list throughout this step. Before compaction, write your TODO list to a scratchpad so you can pick up where you left off. After compaction, re-read this prompt, re-read AUTONOMA.md, scenarios.md, and your TODO list, then resume.

### Check 1: Test count is proportional to project complexity

Count the total number of test files you’ve generated. Then assess the project’s size:

**Sizing heuristic** (rule of thumb - use judgement, not rigid math):

* **Small project** (weekend/hobby project, 1 repo, <20 routes/pages): 40-70 tests is a good range. More than 100 is probably over-testing.
* **Medium project** (early startup, 1-3 repos, 20-50 routes/pages): 80-150 tests. Fewer than 60 means you’re under-covering core flows.
* **Large project** (mature product, 5-10 repos, 50+ routes/pages): 150-400 tests.
* **Very large project** (enterprise, 10-20+ repos, 100+ routes/pages): 300-1000+ tests.

**If your test count is wildly off** (e.g., 20 repos and only 42 tests), stop and diagnose:

* Did you skip Tier 2/3 flows entirely?
* Did you write only one test per core flow instead of enumerating all variations?
* Did you forget journey tests?
* Did context compaction cause you to lose track of which flows you’ve covered?

Go back to Phase 1.5 (test budget), compare it against what you actually generated, and fill the gaps.

### Check 2: All tests are markdown with YAML frontmatter

Verify that every test file you generated:

* Is a `.md` file (not `.ts`, `.js`, `.py`, or any code file)
* Has YAML frontmatter with `flow`, `category`, and `priority` fields
* Has `priority` set to exactly one of: `Critical`, `High`, `Medium`, `Low`
* Contains the sections: `# Test:`, `## Setup`, `## Steps`, `## Expected result`, `## What bug this might catch`

If ANY test is implemented as code (Playwright, Cypress, etc.) instead of markdown, delete it and rewrite it as markdown. This is non-negotiable - the platform requires markdown format.
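
The checks above lend themselves to a small script. The following is a minimal sketch in Python, assuming each generated file has been read into a string. The function name `validate_test_file` and the simplified line-based frontmatter parsing are illustrative assumptions, not part of Autonoma's tooling. It is a helper for checking the markdown files, not a test implementation.

```python
ALLOWED_PRIORITIES = {"Critical", "High", "Medium", "Low"}
REQUIRED_FIELDS = ("flow", "category", "priority")
REQUIRED_SECTIONS = (
    "# Test:",
    "## Setup",
    "## Steps",
    "## Expected result",
    "## What bug this might catch",
)


def validate_test_file(filename: str, text: str) -> list[str]:
    """Return the problems found in one generated test file (empty list if none)."""
    problems = []
    if not filename.endswith(".md"):
        problems.append("file is not a .md file")

    lines = [line.strip() for line in text.splitlines()]
    # Frontmatter must open on the first line and be closed by a second '---'.
    if not lines or lines[0] != "---" or "---" not in lines[1:]:
        problems.append("missing YAML frontmatter")
        return problems
    end = 1 + lines[1:].index("---")

    # Simplified parsing (assumption): only flat 'key: value' pairs are expected.
    fields = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()

    for name in REQUIRED_FIELDS:
        if not fields.get(name):
            problems.append(f"missing frontmatter field: {name}")
    priority = fields.get("priority")
    if priority and priority not in ALLOWED_PRIORITIES:
        problems.append(f"invalid priority: {priority}")

    # Every required heading must appear somewhere after the frontmatter.
    body = lines[end + 1:]
    for section in REQUIRED_SECTIONS:
        if not any(line.startswith(section) for line in body):
            problems.append(f"missing section: {section}")
    return problems
```

Running this over every file in `qa-tests/` and fixing each reported problem in place covers the mechanical part of Check 2. The priority comparison is case-sensitive on purpose, so `critical` or `P1` is reported as invalid.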

### Check 3: Test budget was followed

Compare your actual test distribution against the budget from Phase 1.5:

* Is Tier 1 at 50-60% of total tests?
* Did every core flow get the number of tests you planned?
* Are journey tests present (2-5)?

If the distribution is off by more than 10%, rebalance before proceeding.
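
The distribution check is simple arithmetic. A minimal sketch in Python follows, assuming the per-tier counts have already been tallied from the frontmatter or the scratchpad. The function names and the dictionary keys are hypothetical, and the sketch only checks the 50-60% band for Tier 1. It does not interpret the 10% tolerance.

```python
def tier_shares(counts: dict[str, int]) -> dict[str, float]:
    """Return each tier's share of the total test count as a percentage."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no tests counted")
    return {tier: 100.0 * n / total for tier, n in counts.items()}


def tier1_within_target(counts: dict[str, int], low: float = 50.0, high: float = 60.0) -> bool:
    """Check whether Tier 1 falls inside the 50-60% target band."""
    return low <= tier_shares(counts)["Tier 1"] <= high
```

For example, 66 Tier 1, 36 Tier 2, and 18 Tier 3 tests give Tier 1 a 55% share, which is inside the band.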

### What to do if checks fail

Fix the issue in place - do not start over. Then re-run the checklist. Only proceed to Phase 7 when all checks pass.

***

## Phase 7: Output

### 7.1 - Create the test files

Create a directory called `qa-tests` in the current working directory.

**The folder structure must mirror the feature inventory from your scratchpad.** Each feature = a folder. Each sub-feature = a sub-folder. The scratchpad is the source of truth - do not invent folders that don’t correspond to features you discovered, and do not skip features that are in the inventory.

Rules for the folder hierarchy:

* **Each feature becomes a folder**, named in kebab-case using the feature’s own name from the UI
* **Sub-features become nested folders** when they have enough tests (3+) to justify it
* **Cross-page features** (navigation, toasts, search) get their own top-level folder
* **Journey tests** go in a `journey/` folder
* There is **no max nesting depth** - nest as deep as the feature structure requires

Example (derived from a feature inventory):

```plaintext
qa-tests/
  payments/
    wire-transfer/
      001-happy-path-send-wire.md
      002-wire-insufficient-funds.md
      003-wire-validation-empty-amount.md
      ...
    transaction-history/
      filters/
        001-filter-by-date-range.md
        002-filter-by-status.md
        003-filter-by-amount-range.md
        004-combined-filters.md
        005-clear-all-filters.md
      001-sort-by-date.md
      002-sort-by-amount.md
      003-pagination.md
      004-export-csv.md
      005-empty-state.md
      ...
    saved-recipients/
      001-add-new-recipient.md
      002-edit-recipient.md
      003-delete-recipient.md
      004-search-recipients.md
      005-select-to-prefill.md
      ...
  settings/
    001-update-profile.md
    002-change-password.md
    ...
  navigation/
    001-sidebar-links.md
    002-breadcrumb-navigation.md
    ...
  journey/
    001-full-payment-lifecycle.md
    002-onboard-and-send-first-wire.md
    ...
  ...
```

Name files with a numeric prefix and a short kebab-case description.
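
As an illustration of the naming rule, here is a minimal Python sketch that builds a filename from a position and a test title. The helper name `make_test_filename` is hypothetical, and the three-digit zero padding is an assumption taken from the example tree above.

```python
import re


def make_test_filename(index: int, title: str) -> str:
    """Build a name like '001-happy-path-send-wire.md' from a position and a title."""
    # Lowercase, then collapse every run of non-alphanumeric characters into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{index:03d}-{slug}.md"
```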

### 7.2 - Create an index

Create `qa-tests/INDEX.md` with:

* Total number of tests generated
* **Tier breakdown** (how many Tier 1, Tier 2, Tier 3 tests, plus journey tests)
* Breakdown by flow
* Breakdown by category (A through H, plus Journey)
* Breakdown by priority
* The flow classification table from Phase 1.4
* The test budget table from Phase 1.5
* **Adversarial review summary** - what gaps were found and which were addressed

### 7.3 - Report completion

After all files are written, tell the user:

> “Done! I’ve generated \[N] E2E test cases across \[M] flows, plus \[J] journey tests. The tests are in `qa-tests/`.
>
> **Test distribution**:
>
> * Tier 1 (Core flows): \[X] tests (\[Y]%)
> * Tier 2 (Important supporting): \[X] tests (\[Y]%)
> * Tier 3 (Supporting/admin): \[X] tests (\[Y]%)
> * Journey tests: \[J] tests
>
> **Core flow coverage**: \[List each core flow with its test count and key variations covered]
>
> **Journey tests**: \[List each journey test name and what flows it chains]
>
> **Adversarial review**: \[X] gaps identified, \[Y] addressed with additional tests. \[Brief summary of what was added.]
>
> Make sure both `autonoma/` and `qa-tests/` are available to the execution agent.”

***

## Important reminders

* **Respect the Preferences section.** If AUTONOMA.md says to skip an area, don’t generate tests for it. If it says to ignore cookie banners, don’t write tests about cookie banners. This is the user’s configuration - honor it.
* **Core flows get the lion’s share.** 50-60% of tests should cover Tier 1 flows. If you find yourself writing more settings tests than core flow tests, stop and rebalance. The core flows are where the product’s value lives - and where the bugs matter most.
* **Enumerate before writing.** For core flows, always produce the variation checklist BEFORE writing tests. This prevents the “I wrote one generic test and moved on” failure mode. If the codebase reveals 10 command types, your checklist should have 10 entries, and your tests should cover all 10.
* **Do not be lazy.** Do not generate 10 tests and call it a day. Your feature inventory is your accountability tool - every feature in the inventory needs at least one test. If your inventory has 80 features and you wrote 30 tests, you skipped 50 features. Go back and cover them.
* **Do not be generic.** Every test must reference actual pages, actual buttons, actual field names from the codebase you analyzed. If you write “click the submit button” and there’s no submit button on that page, the test is useless.
* **Assertions must be specific - for both errors AND successes.** “Assert that a toast appears” is unacceptable. “Assert that a green toast appears with the text ‘Project saved successfully’” is a test. Go back to the code to find the exact text if needed. This applies equally to success states - don’t write “verify save was successful,” find the actual toast text, redirect URL, or UI change.
* **Never write conditional steps.** “If X exists, do A; otherwise do B” is a test that doesn’t test anything - it passes regardless. If the test needs X, the setup must guarantee X exists. Each test follows exactly one deterministic path.
* **Think like a bug hunter, not a checkbox filler.** Ask yourself: “What would a developer forget here?” That’s your test.
* **Test the end of every flow.** Don’t just test steps 1-8 and then write “click save.” The save/publish/submit step is where state bugs, validation gaps, and race conditions hide. Give it a dedicated test.
* **Test live/interactive content directly.** If the app embeds a device stream, browser iframe, canvas, or video player, test the interactions with that content - not just the surrounding controls. The embedded content is often the core product experience.
* **Write journey tests.** After all flow-specific tests, write 2-5 E2E journey tests that chain the core flows together into complete user paths. These catch integration bugs that flow-specific tests miss.
* **There is no hard cap on test count.** Write as many tests as you need to be confident every bug will be caught, staying within the range the sizing heuristic in Phase 6.5 suggests for the project’s size. If a test has even a small chance of catching a real bug, include it. You can trim later, but a shipped bug is worse than an extra test.
* **Write Tier 1 tests first.** Do not write Tier 2 or 3 tests until all Tier 1 tests are complete. This prevents the failure mode where you run out of context/budget before covering core flows deeply enough.
* **Use subagents to parallelize.** Discovery, test writing, and the adversarial review should all use subagents where possible. Launch exploration agents in parallel during discovery. Launch writing agents in parallel for different flows. The adversarial review is a separate agent that runs after all tests are written.
* **If context compaction occurs, re-read this prompt and use a TODO list.** Before compaction happens, write your current TODO list and progress to the scratchpad (`qa-tests/.scratchpad/`). After compaction, immediately re-read this prompt, AUTONOMA.md, scenarios.md, your feature inventory (`qa-tests/.scratchpad/feature-inventory.md`), and your TODO list. The feature inventory is your most important artifact - it tells you which pages you’ve already decomposed and which features need tests. Resume from where you left off. Losing track of progress after compaction is the #1 cause of under-generated test suites, and this routine prevents it.
* **Never implement tests as code.** Always produce markdown files with YAML frontmatter. Never generate Playwright, Cypress, or any other test framework code. The markdown format is required by the platform that consumes these test files.
* **Always run the validation checklist before finishing.** Phase 6.5 is mandatory. Do not skip it. The test count check catches the most common failure mode: generating a tiny test suite for a large project because context was lost during compaction.
* **Test count must match project complexity.** A 20-repo enterprise project with 42 tests is a failure. A weekend project with 300 tests is over-engineering. Use the sizing heuristic in Phase 6.5 as a sanity check. If you’re wildly off, go back and fix it before outputting.

# Environment Factory Guide

> How to set up the Autonoma Environment Factory in your application using the SDK — a single POST endpoint for creating and destroying isolated test environments.

> **Note:**
>
> This guide covers the SDK-based setup for the Environment Factory. For framework-specific examples, see [Examples](/examples/) — covering TypeScript, Python, Elixir, Java, Ruby, Rust, Go, and PHP.

> **Tip:**
>
> For the exact JSON contract of the recipe file that drives this endpoint at runtime, see the [Scenario Recipe Schema reference](/reference/scenario-recipe-schema/).

## The Big Picture

Before Autonoma runs an E2E test, it needs two things:

1. **Data** — a user account, some test records, whatever the test scenario requires
2. **Authentication** — a way to log in as that user (cookies, headers, or credentials)

After the test finishes, everything gets cleaned up so the next test starts fresh.

You set up **one endpoint** that the Autonoma SDK handles for you. It responds to three actions:

| Action       | When it’s called       | What happens                                                             |
| ------------ | ---------------------- | ------------------------------------------------------------------------ |
| **discover** | When Autonoma connects | Returns the schema derived from your registered factories’ input schemas |
| **up**       | Before each test run   | Validates each entity, calls your factory, generates auth credentials    |
| **down**     | After each test run    | Verifies the signed token and calls each factory’s `teardown`            |

The SDK orders entities from the create payload’s `_alias` / `_ref` graph, validates inputs through each factory’s `inputSchema` / `input_model`, signs teardown tokens, and manages the full lifecycle. You configure the adapter, register one factory per model the dashboard can create, and implement an auth callback.

## How the Protocol Works

All communication is a single **POST** request with a JSON body. The `action` field determines the operation. Every request is HMAC-SHA256 signed.

### Discover

Autonoma asks: “What does your database look like?”

**Request:**

```json
{ "action": "discover" }
```

**Response:**

```json
{
  "version": "1.0",
  "sdk": { "language": "typescript", "orm": "unknown", "server": "web" },
  "schema": {
    "models": [
      { "name": "Organization", "tableName": "organization", "fields": [{ "name": "id", "type": "string", "isRequired": false, "isId": true, "hasDefault": true }, { "name": "name", "type": "string", "isRequired": true, "isId": false, "hasDefault": false }] },
      { "name": "User", "tableName": "user", "fields": [{ "name": "id", "type": "string", "isRequired": false, "isId": true, "hasDefault": true }, { "name": "email", "type": "string", "isRequired": true, "isId": false, "hasDefault": false }] }
    ],
    "edges": [],
    "relations": [],
    "scopeField": "organizationId"
  }
}
```

The schema contains:

* **models**: every model the dashboard can create — derived directly from each factory’s `inputSchema` / `input_model`. Field metadata (name, type, required, id, hasDefault) comes from the schema introspection.
* **edges** / **relations**: emitted as empty arrays. The dashboard reads dependencies from each create payload’s `_alias` / `_ref` graph at request time — there is no static FK schema in the discover response anymore.
* **scopeField**: the field name used for test data isolation (e.g., `organizationId`).

### Up

Autonoma says: “Create this data for a test run.”

**Request:**

```json
{
  "action": "up",
  "testRunId": "run-abc123",
  "create": {
    "Organization": [{
      "name": "Acme Corp",
      "slug": "acme-corp",
      "members": [{
        "role": "owner",
        "user": [{ "name": "Alice", "email": "alice-run-abc123@test.com" }]
      }]
    }]
  }
}
```

The `create` field is a flat or nested JSON tree of entities. The SDK:

* Walks the payload to collect every `_alias` declaration and every `_ref` usage.
* Topologically sorts the entities so dependency targets are created before dependents.
* Validates each entity through its factory’s `inputSchema` / `input_model` before invoking `create`.
* Replaces every `{"_ref": "alias"}` placeholder with the real id once the aliased entity exists.

The dashboard now sends a flat map keyed by model name, with `_alias` / `_ref` describing the dependency graph. Nested children, as in the request above, are still supported but no longer required.
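
To make the ordering step concrete, here is a minimal Python sketch of sorting a flat `create` payload by its `_alias` / `_ref` graph. It is not the SDK's implementation. The function names are hypothetical, it handles only top-level entities (not nested children), and it relies on the standard library's `graphlib.TopologicalSorter`.

```python
from graphlib import TopologicalSorter


def collect_refs(value):
    """Yield every alias named by a {"_ref": alias} placeholder inside value."""
    if isinstance(value, dict):
        if set(value) == {"_ref"}:
            yield value["_ref"]
        else:
            for child in value.values():
                yield from collect_refs(child)
    elif isinstance(value, list):
        for child in value:
            yield from collect_refs(child)


def creation_order(create: dict) -> list[tuple[str, int]]:
    """Order (model, index) pairs so every _ref target precedes its dependents."""
    # First pass: remember which entity declares each alias.
    alias_owner = {}
    for model, entities in create.items():
        for i, entity in enumerate(entities):
            if "_alias" in entity:
                alias_owner[entity["_alias"]] = (model, i)

    # Second pass: register each entity with the entities it references.
    sorter = TopologicalSorter()
    for model, entities in create.items():
        for i, entity in enumerate(entities):
            deps = [alias_owner[alias] for alias in collect_refs(entity)]
            sorter.add((model, i), *deps)
    return list(sorter.static_order())
```

With a payload such as `{"Member": [{"organizationId": {"_ref": "org"}}], "Organization": [{"_alias": "org", "name": "Acme Corp"}]}` (the `organizationId` field name is illustrative), the organization is ordered before the member even though it appears later in the map. In this sketch an unknown alias raises `KeyError` and a circular reference raises `graphlib.CycleError`.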

**Response:**

```json
{
  "version": "0.2.0",
  "sdk": { "language": "typescript", "orm": "prisma", "server": "web" },
  "auth": {
    "cookies": [{
      "name": "session",
      "value": "eyJ...",
      "httpOnly": true,
      "sameSite": "lax",
      "path": "/"
    }]
  },
  "refs": {
    "Organization": [{ "id": "org_xyz", "name": "Acme Corp" }],
    "User": [{ "id": "usr_abc", "email": "alice-run-abc123@test.com" }],
    "Member": [{ "id": "mem_123" }]
  },
  "refsToken": "header.payload.signature"
}
```

* **auth**: credentials the test runner uses to authenticate (from your auth callback)
* **refs**: all created records, keyed by model name
* **refsToken**: a signed token encoding the created record IDs, used for safe teardown

### Down

Autonoma says: “I’m done — delete what you created.”

**Request:**

```json
{
  "action": "down",
  "refsToken": "header.payload.signature"
}
```

The `refsToken` is the exact token from the `up` response. The SDK verifies the signature, extracts the record IDs, and deletes them in reverse topological order.

**Response:**

```json
{
  "version": "0.2.0",
  "sdk": { "language": "typescript", "orm": "prisma", "server": "web" },
  "ok": true
}
```

## Security Model

Three layers of security protect your endpoint, using **two separate secrets** with different purposes.

### The Two Secrets

| Secret             | Env Variable              | Who knows it   | Purpose                                                                                                                      |
| ------------------ | ------------------------- | -------------- | ---------------------------------------------------------------------------------------------------------------------------- |
| **Shared secret**  | `AUTONOMA_SHARED_SECRET`  | You + Autonoma | HMAC-SHA256 signature on every request. Autonoma signs; your SDK verifies. You paste this into the Autonoma dashboard.       |
| **Signing secret** | `AUTONOMA_SIGNING_SECRET` | Only you       | Signs the `refsToken` during `up`, verifies during `down`. Autonoma stores the token opaquely — it cannot read or modify it. |

The two secrets **must be different values**. The SDK throws an error at startup if they match.

**Generate with `openssl`:**

```bash
openssl rand -hex 32   # → use as AUTONOMA_SHARED_SECRET
openssl rand -hex 32   # → use as AUTONOMA_SIGNING_SECRET (must be different!)
```
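
If `openssl` is not available, Python's standard `secrets` module produces an equivalent value. The sketch below also shows the kind of startup check described above. The function names are hypothetical and the SDK performs its own check, so this is only an illustration of the rule.

```python
import os
import secrets


def generate_secret() -> str:
    """Return 32 random bytes as 64 hex characters, like `openssl rand -hex 32`."""
    return secrets.token_hex(32)


def load_secrets(env=os.environ) -> tuple[str, str]:
    """Read both secrets from the environment and refuse to continue if they match."""
    shared = env.get("AUTONOMA_SHARED_SECRET", "")
    signing = env.get("AUTONOMA_SIGNING_SECRET", "")
    if not shared or not signing:
        raise ValueError("both AUTONOMA_SHARED_SECRET and AUTONOMA_SIGNING_SECRET must be set")
    if shared == signing:
        raise ValueError("shared secret and signing secret must be different values")
    return shared, signing
```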

### Layer 1: Production Guard

The endpoint returns **404** when the application is running in production mode (`NODE_ENV=production` or equivalent), unless explicitly opted in with `allowProduction: true`. Even if someone discovers the URL, it doesn’t respond in production.

### Layer 2: Request Signing (HMAC-SHA256)

Every request from Autonoma includes a signature header:

```plaintext
x-signature: <hex-digest>
```

The signature is HMAC-SHA256 of the raw request body, keyed with the **shared secret**. The SDK verifies this automatically — unsigned or tampered requests are rejected with 401.
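
The SDK does this for you, but the check itself is small. A minimal Python sketch follows, assuming the shared secret is used as UTF-8 bytes for the HMAC key (the exact key encoding is an assumption). The function names are hypothetical.

```python
import hashlib
import hmac


def sign_body(raw_body: bytes, shared_secret: str) -> str:
    """Compute the hex HMAC-SHA256 digest of the raw request body."""
    return hmac.new(shared_secret.encode(), raw_body, hashlib.sha256).hexdigest()


def verify_signature(raw_body: bytes, signature_header: str, shared_secret: str) -> bool:
    """Return True only if the x-signature value matches the body's digest."""
    expected = sign_body(raw_body, shared_secret)
    # compare_digest runs in constant time, which avoids leaking how many
    # leading characters of a guessed signature were correct.
    return hmac.compare_digest(expected.encode(), (signature_header or "").encode())
```

The digest must be computed over the raw bytes exactly as received. Parsing the JSON and re-serializing it can change whitespace or key order and make a valid signature fail.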

### Layer 3: Signed Refs Token

When `up` creates data, the SDK signs all created record IDs into a token (`refsToken`) using the **signing secret**. During `down`, the SDK verifies this token before deleting anything.

This guarantees that `down` can only delete data that `up` actually created. Even Autonoma cannot forge or modify this token — it just stores the opaque string and passes it back.

| Attack                                            | Why it fails                      |
| ------------------------------------------------- | --------------------------------- |
| Attacker sends fake refs with made-up IDs         | No valid token → rejected         |
| Attacker sends a valid token but changes the refs | Refs don’t match token → rejected |
| Attacker replays a token from a week ago          | Token expired (24h) → rejected    |

### What the SDK Can and Cannot Do

The SDK enforces hard safety constraints:

* **UP can only CREATE.** It invokes the factories you registered, which call your existing services / repositories. The SDK itself never issues UPDATE, DELETE, DROP, or TRUNCATE; any other statements come only from your own factory code.
* **DOWN can only DELETE what UP created.** This is verified by the signed refs token: the SDK calls each factory’s `teardown` for the records listed in the token, in reverse topological order.
* **No raw SQL from the SDK.** The SDK never runs SQL itself. It calls your factories, which invoke whatever services / repositories your app already has.

### Error Codes

| Code                 | HTTP Status | Meaning                                                                |
| -------------------- | ----------- | ---------------------------------------------------------------------- |
| `INVALID_SIGNATURE`  | 401         | HMAC signature missing or does not match                               |
| `INVALID_BODY`       | 400         | Request body is not valid JSON, or missing required fields             |
| `UNKNOWN_ACTION`     | 400         | The action field is not discover, up, or down                          |
| `INVALID_REFS_TOKEN` | 403         | The refs token is missing, malformed, or signature verification failed |
| `PRODUCTION_BLOCKED` | 404         | Endpoint is disabled in production mode                                |
| `SAME_SECRETS`       | 500         | sharedSecret and signingSecret are the same value                      |
| `INTERNAL_ERROR`     | 500         | Unexpected server error                                                |

## Setting Up the SDK

### 0. Integrate into your existing backend — never a sidecar

The endpoint lives inside **your existing backend application**, alongside your other routes. It is not a separate server, sidecar, or standalone process.

Pick the SDK in the **same language as your backend**:

| Your backend language   | Manifest file                         | SDK package                                            |
| ----------------------- | ------------------------------------- | ------------------------------------------------------ |
| TypeScript / JavaScript | `package.json`                        | `@autonoma-ai/sdk`                                     |
| Python                  | `pyproject.toml` / `requirements.txt` | [`autonoma-ai`](https://pypi.org/project/autonoma-ai/) |
| Go                      | `go.mod`                              | `github.com/autonoma-ai/autonoma-sdk-go`               |
| Rust                    | `Cargo.toml`                          | `autonoma` crate                                       |
| Java                    | `pom.xml` / `build.gradle`            | `ai.autonoma:autonoma-sdk`                             |
| Ruby                    | `Gemfile` / `*.gemspec`               | `autonoma` gem                                         |
| PHP                     | `composer.json`                       | `autonoma/sdk`                                         |
| Elixir                  | `mix.exs`                             | `autonoma` hex package                                 |

If your backend is in a language without a matching SDK, open an issue — do not spin up a polyglot sidecar. Running a Python `FastAPI` next to a NestJS app so you can use the Python SDK will silently drift from your production code (auth flows, hashing, hooks, triggers) and create maintenance headaches.

Backend directory detection: scan for the manifest file above. Real projects use many conventions — `backend/`, `server/`, `api/`, `apps/api/`, `services/core/`, `core-app-backend/`, etc. — so don’t assume the directory is named `backend/`.
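
The manifest scan described above could look roughly like this. It is a sketch under stated assumptions: the helper name `findBackendCandidates`, the skipped directory names, and the depth limit are all hypothetical choices, and a match is only a candidate (a `package.json` may just as well belong to a frontend).

```typescript
import { readdirSync } from 'node:fs'

// Manifest files from the table above, mapped to the backend language they indicate.
const MANIFESTS = new Map<string, string>([
  ['package.json', 'TypeScript / JavaScript'],
  ['pyproject.toml', 'Python'],
  ['requirements.txt', 'Python'],
  ['go.mod', 'Go'],
  ['Cargo.toml', 'Rust'],
  ['pom.xml', 'Java'],
  ['build.gradle', 'Java'],
  ['Gemfile', 'Ruby'],
  ['composer.json', 'PHP'],
  ['mix.exs', 'Elixir'],
])
// Assumption: dependency and build output directories are never the backend root.
const SKIP_DIRS = new Set(['node_modules', '.git', 'dist', 'build', 'target', 'vendor'])

type Entry = { name: string; isDir: boolean }
type ListDir = (dir: string) => Entry[]
type Candidate = { dir: string; manifest: string; language: string }

const listFromDisk: ListDir = (dir) =>
  readdirSync(dir, { withFileTypes: true }).map((e) => ({ name: e.name, isDir: e.isDirectory() }))

function languageFor(fileName: string): string | undefined {
  if (fileName.endsWith('.gemspec')) return 'Ruby'
  return MANIFESTS.get(fileName)
}

// Walk the repo to a bounded depth and report every directory that contains a manifest.
function findBackendCandidates(root: string, listDir: ListDir = listFromDisk, maxDepth = 4): Candidate[] {
  const found: Candidate[] = []
  const walk = (dir: string, depth: number): void => {
    for (const entry of listDir(dir)) {
      if (entry.isDir) {
        if (depth < maxDepth && !SKIP_DIRS.has(entry.name)) walk(`${dir}/${entry.name}`, depth + 1)
        continue
      }
      const language = languageFor(entry.name)
      if (language) found.push({ dir, manifest: entry.name, language })
    }
  }
  walk(root, 0)
  return found
}
```

The directory lister is injectable so the walk can be exercised against an in-memory tree; in normal use, call `findBackendCandidates(process.cwd())`.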

### 1. Install

The SDK is **factory-driven**: you register one factory per model and the SDK derives the discover schema from each factory’s input schema (Zod in TypeScript, Pydantic in Python). There is no SQL introspection, no ORM executor, and no SQL fallback. Pick the packages that match your stack:

**Next.js App Router**:

```bash
pnpm add @autonoma-ai/sdk @autonoma-ai/server-web zod
```

**Express**:

```bash
pnpm add @autonoma-ai/sdk @autonoma-ai/server-express zod
```

**Hono**:

```bash
pnpm add @autonoma-ai/sdk @autonoma-ai/server-hono zod
```

**Bun / Deno** (Web standard `Request`/`Response`):

```bash
pnpm add @autonoma-ai/sdk @autonoma-ai/server-web zod
```

**Node.js http**:

```bash
pnpm add @autonoma-ai/sdk @autonoma-ai/server-node zod
```

**Python** ([PyPI](https://pypi.org/project/autonoma-ai/)):

```bash
pip install autonoma-ai
```

The `autonoma-ai` package includes the core SDK plus framework adapters (`autonoma_fastapi`, `autonoma_flask`, `autonoma_django`). Pydantic is a hard dependency.

#### Package reference

| Your Framework                                                    | Package                       | Handler export           |
| ----------------------------------------------------------------- | ----------------------------- | ------------------------ |
| Next.js App Router, Bun, Deno (Web standard `Request`/`Response`) | `@autonoma-ai/server-web`     | `createHandler`          |
| Hono                                                              | `@autonoma-ai/server-hono`    | `createHonoHandler`      |
| Express, Fastify                                                  | `@autonoma-ai/server-express` | `createExpressHandler`   |
| Node.js `http`                                                    | `@autonoma-ai/server-node`    | `createNodeHandler`      |
| FastAPI (Python)                                                  | `autonoma_fastapi`            | `create_fastapi_handler` |
| Flask (Python)                                                    | `autonoma_flask`              | `create_flask_handler`   |
| Django (Python)                                                   | `autonoma_django`             | `create_django_handler`  |

### 2. Find your scope field

Pick the field most of your models use to reference the root tenant entity — usually `organizationId`, `orgId`, `tenantId`, or `workspaceId`. The SDK does not introspect FKs to find this; it just declares the field in the discover response so the dashboard knows how to scope test data. Your factories own the actual writes — including any tenant-scoped FK columns.

### 3. Generate secrets

You need two **different** secrets. The SDK throws an error if they are the same.

```bash
openssl rand -hex 32   # → use as AUTONOMA_SHARED_SECRET
openssl rand -hex 32   # → use as AUTONOMA_SIGNING_SECRET (must be different!)
```

Add to `.env`:

```env
AUTONOMA_SHARED_SECRET=abc123...   # share this with Autonoma
AUTONOMA_SIGNING_SECRET=def456...  # keep this private, never share
```
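
If you prefer to generate the secrets from Node rather than `openssl`, the following sketch does the same thing and adds a startup check that mirrors the rule that the two secrets must differ. The helper names are hypothetical; the SDK performs its own check regardless.

```typescript
import { randomBytes } from 'node:crypto'

// Equivalent of `openssl rand -hex 32`: 32 random bytes rendered as 64 hex characters.
function generateSecret(): string {
  return randomBytes(32).toString('hex')
}

// Hypothetical startup guard: fail fast if a secret is missing or both are identical.
function assertDistinctSecrets(sharedSecret: string | undefined, signingSecret: string | undefined): void {
  if (!sharedSecret || !signingSecret) {
    throw new Error('AUTONOMA_SHARED_SECRET and AUTONOMA_SIGNING_SECRET must both be set')
  }
  if (sharedSecret === signingSecret) {
    throw new Error('AUTONOMA_SHARED_SECRET and AUTONOMA_SIGNING_SECRET must be different values')
  }
}
```

Calling `assertDistinctSecrets(process.env.AUTONOMA_SHARED_SECRET, process.env.AUTONOMA_SIGNING_SECRET)` at boot surfaces a misconfiguration before the first request arrives.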

### 4. Create the endpoint

#### Next.js App Router

app/api/autonoma/route.ts

```typescript
import { createHandler } from '@autonoma-ai/server-web'


export const POST = createHandler({
  scopeField: 'organizationId',
  sharedSecret: process.env.AUTONOMA_SHARED_SECRET!,
  signingSecret: process.env.AUTONOMA_SIGNING_SECRET!,
  factories: { /* see section 6 */ },
  auth: async (user) => {
    // Create a real session for this user — see section 5
    const session = await createSession(user!.id as string)
    return {
      cookies: [{ name: 'session', value: session.token, httpOnly: true, sameSite: 'lax', path: '/' }],
    }
  },
})
```

#### Express

routes/autonoma.ts

```typescript
import { createExpressHandler } from '@autonoma-ai/server-express'
import jwt from 'jsonwebtoken'


app.post('/api/autonoma', createExpressHandler({
  scopeField: 'organizationId',
  sharedSecret: process.env.AUTONOMA_SHARED_SECRET!,
  signingSecret: process.env.AUTONOMA_SIGNING_SECRET!,
  factories: { /* see section 6 */ },
  auth: async (user) => {
    const token = jwt.sign({ sub: user!.id }, process.env.JWT_SECRET!)
    return { headers: { Authorization: `Bearer ${token}` } }
  },
}))
```

#### Hono

src/routes/autonoma.ts

```typescript
import { createHonoHandler } from '@autonoma-ai/server-hono'


app.post('/api/autonoma', createHonoHandler({
  scopeField: 'organizationId',
  sharedSecret: process.env.AUTONOMA_SHARED_SECRET!,
  signingSecret: process.env.AUTONOMA_SIGNING_SECRET!,
  factories: { /* see section 6 */ },
  auth: async (user) => {
    const token = await createToken(user!.id as string)
    return { headers: { Authorization: `Bearer ${token}` } }
  },
}))
```

#### FastAPI (Python)

autonoma_handler.py

```python
import os
from autonoma.types import HandlerConfig
from autonoma_fastapi import create_fastapi_handler


config = HandlerConfig(
    scope_field='organization_id',
    shared_secret=os.environ['AUTONOMA_SHARED_SECRET'],
    signing_secret=os.environ['AUTONOMA_SIGNING_SECRET'],
    factories={},  # placeholder: register your factories here, see section 6
    auth=lambda user, ctx: {'headers': {'Authorization': f'Bearer {issue_token(user)}'}},
)


router = create_fastapi_handler(config)
app.include_router(router, prefix='/api/autonoma')
```

### 5. Implement the auth callback

The `auth` callback receives the first `User` record created during `up`. It must return real, working credentials that the test runner can use to authenticate with your app.

**This is critical.** If the auth callback returns fake or expired tokens, every test will fail at the login step.

#### What the callback receives

```typescript
auth: async (user, context) => {
  // user: the first User record from refs, or `null` if no User model exists.
  //   Always check for null — not every scenario creates a User.
  //   Shape: { id: 'clxyz...', name: 'Admin', email: 'admin-abc123@test.com', ... }
  // context:
  //   - scopeValue: the detected scope value (e.g. organization id) or testRunId fallback
  //   - refs: all created records keyed by model name, for looking up related data
}
```
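
A null-safe callback body might look like the sketch below. The types are local stand-ins rather than SDK exports, `buildAuthResult` is a hypothetical helper, and returning an empty result for scenarios without a `User` is an assumption; adapt it to whatever your tests need in that case.

```typescript
// Local stand-ins for illustration; these are not imported from the SDK.
type UserRecord = Record<string, unknown> & { id: string | number }

interface AuthResult {
  cookies?: Array<{ name: string; value: string }>
  headers?: Record<string, string>
  credentials?: Record<string, string>
}

// Must match the password your User factory sets at creation time.
const KNOWN_TEST_PASSWORD = 'test-password-123'

// Hypothetical helper: build the auth result and handle `user === null` explicitly.
function buildAuthResult(user: UserRecord | null): AuthResult {
  if (user === null) {
    // Assumption: a scenario without a User needs no credentials, so return an empty result.
    return {}
  }
  const email = user['email']
  if (typeof email !== 'string') {
    throw new Error(`User ${String(user.id)} has no email; cannot build login credentials`)
  }
  return { credentials: { email, password: KNOWN_TEST_PASSWORD } }
}

// Wiring it into the handler config would look like:
//   auth: async (user) => buildAuthResult(user)
```

Keeping the logic in a plain function makes it easy to unit test separately from the handler.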

#### What the callback must return

```typescript
interface AuthResult {
  cookies?: Array<{                   // Session cookies
    name: string
    value: string
    httpOnly?: boolean
    sameSite?: 'strict' | 'lax' | 'none'
    path?: string
    domain?: string
    secure?: boolean
    maxAge?: number
  }>
  headers?: Record<string, string>    // Custom auth headers (use for bearer tokens: `Authorization: Bearer …`)
  credentials?: Record<string, string>  // Arbitrary key/value pairs for manual login flows (e.g. { email, password })
}
```

There is no top-level `token` field. To return a bearer token, put it on `headers` as `Authorization: Bearer …`. To return email/password for a native login flow, put them on `credentials`.

#### Pattern 1: Session cookies (most web apps)

```typescript
auth: async (user) => {
  // `user` is null when the scenario creates no User; this pattern assumes one exists.
  const session = await lucia.createSession(user!.id as string, {})
  const cookie = lucia.createSessionCookie(session.id)
  return {
    cookies: [{
      name: cookie.name,
      value: cookie.value,
      httpOnly: true,
      sameSite: 'lax',
      path: '/',
    }],
  }
}
```

#### Pattern 2: JWT bearer token (APIs, SPAs)

```typescript
auth: async (user) => {
  const token = jwt.sign(
    { sub: user!.id, email: user!.email },
    process.env.JWT_SECRET!,
    { expiresIn: '1h' }
  )
  return { headers: { Authorization: `Bearer ${token}` } }
}
```

#### Pattern 3: Email/password (mobile apps)

When the test runner needs to log in through the UI, return credentials instead of a token:

```typescript
auth: async (user) => ({
  credentials: {
    email: user!.email as string, // assumes the scenario created a User
    password: 'test-password-123',
  },
})
```

**Important**: For this to work, the User must be created with a known password. Use a factory to hash the password during creation.

> **Mobile apps: use credentials only**
>
> For **iOS and Android** applications, cookies and headers are **not supported**. Autonoma cannot inject them into native mobile apps. Use **credentials** and return email/password for the agent to log in through your app’s login screen.

#### Common auth mistakes

| Mistake                                          | What happens                    | Fix                                      |
| ------------------------------------------------ | ------------------------------- | ---------------------------------------- |
| Returning a hardcoded string like `"test-token"` | Every test fails at login       | Use your real session/JWT creation       |
| Not setting password on the User record          | Email/password login fails      | Use a factory that hashes passwords      |
| Token expires too quickly                        | Tests fail midway               | Set expiration to at least 1 hour        |
| Wrong cookie name                                | Browser doesn’t send the cookie | Check your app’s cookie name in DevTools |

### 6. Register factories

Register a **factory** for every model the dashboard can create. There is no SQL fallback — every model the SDK writes goes through your factory. The factory’s `inputSchema` (Zod) / `input_model` (Pydantic) drives both the discover schema and validation of the create payload.

**Why factory-by-default?** If you already have `ProjectService.create()` that today just wraps `prisma.project.create()`, wire it up anyway. The day you add an audit log, a Stripe sync, or a cache write to that function, your tests keep working — zero rewiring. The factory always runs the same code path the rest of your app does.

For models without a dedicated create function, register a factory whose body is a thin repository call. The Step 2 audit classifies models with `independently_created: true` (call the audit’s identified function) vs `independently_created: false` (a thin repository call is fine).

```typescript
import { z } from 'zod'
import { defineFactory } from '@autonoma-ai/sdk'
import { createExpressHandler } from '@autonoma-ai/server-express'


const OrganizationInput = z.object({ name: z.string(), slug: z.string() })
const OrganizationRef = z.object({ id: z.string(), name: z.string(), slug: z.string() })
const UserInput = z.object({ email: z.string(), name: z.string() })


const handler = createExpressHandler({
  scopeField: 'organizationId',
  sharedSecret: process.env.AUTONOMA_SHARED_SECRET!,
  signingSecret: process.env.AUTONOMA_SIGNING_SECRET!,
  factories: {
    Organization: defineFactory({
      inputSchema: OrganizationInput,
      // Optional: validates the record on teardown and types `record` for free.
      refSchema: OrganizationRef,
      // `data` is typed `{ name: string; slug: string }` — no z.infer<...> needed.
      create: async (data) => organizationService.create({ name: data.name, slug: data.slug }),
      // `record` is typed `{ id: string; name: string; slug: string }` from refSchema.
      teardown: async (record) => organizationService.delete(record.id),
    }),
    User: defineFactory({
      inputSchema: UserInput,
      create: async (data) =>
        userService.create({
          email: data.email,
          name: data.name,
          password: 'test-password-123', // known password for auth
        }),
      // No teardown: this model is left alone on `down`.
    }),
  },
  auth: async (user) => { /* ... */ },
})
```

The `defineFactory` generics are inferred from the schemas you pass:

* `data` inside `create` is typed as `z.infer<typeof inputSchema>`.
* When `refSchema` is set, `record` inside `teardown` is typed as `z.infer<typeof refSchema>`, and `create`’s return type is constrained to that shape too.
* When `refSchema` is omitted, `record` widens to `Record<string, unknown> & { id: string | number }` so legacy factories keep compiling without it.

#### How factories work

1. The SDK reads the create payload’s `_alias` / `_ref` graph and topologically sorts entities — no FK introspection, no schema needed.
2. For each model in order, the SDK validates the entity through `inputSchema.safeParse(...)` (Zod) / `input_model.model_validate(...)` (Pydantic) and calls your `create` with the typed value.
3. **Factory receives pre-resolved fields** — `_ref` placeholders are already replaced with the real id of the referenced entity. The factory never sees `{"_ref": "..."}` or `__temp_*` values.
4. **Factory must return at least the primary key** (e.g., `{ id: "..." }`). All returned fields are stored in refs and available to subsequent factories via `ctx.refs`.
5. On teardown: if a factory defines `teardown`, it’s called per record in reverse topological order; otherwise that model is left alone.
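
Steps 1 and 3 above can be sketched as a dependency-ordered walk over the `_alias` / `_ref` graph. This is an illustration only: the types and the helpers `sortEntities` and `resolveRefs` are hypothetical and do not reflect the SDK's internals.

```typescript
type Entity = Record<string, unknown>
type CreatePayload = Record<string, Entity[]>
type GraphNode = { model: string; fields: Entity; alias: string | undefined; deps: string[] }

const isRef = (v: unknown): v is { _ref: string } =>
  typeof v === 'object' && v !== null && typeof (v as { _ref?: unknown })._ref === 'string'

// Order entities so that every `_ref` target comes before the entity that references it.
function sortEntities(payload: CreatePayload): GraphNode[] {
  const nodes: GraphNode[] = []
  for (const [model, rows] of Object.entries(payload)) {
    for (const row of rows) {
      const { _alias, ...fields } = row
      const deps = Object.values(fields).filter(isRef).map((r) => r._ref)
      nodes.push({ model, fields, alias: typeof _alias === 'string' ? _alias : undefined, deps })
    }
  }
  const byAlias = new Map<string, GraphNode>()
  for (const n of nodes) if (n.alias) byAlias.set(n.alias, n)

  const sorted: GraphNode[] = []
  const state = new Map<GraphNode, 'visiting' | 'done'>()
  const visit = (n: GraphNode): void => {
    if (state.get(n) === 'done') return
    if (state.get(n) === 'visiting') throw new Error(`Cycle detected at ${n.model}`)
    state.set(n, 'visiting')
    for (const dep of n.deps) {
      const target = byAlias.get(dep)
      if (!target) throw new Error(`Unknown _ref "${dep}" on ${n.model}`)
      visit(target)
    }
    state.set(n, 'done')
    sorted.push(n) // post-order: dependencies are pushed before dependents
  }
  for (const n of nodes) visit(n)
  return sorted
}

// Replace `{ _ref }` placeholders with the id of the already-created record.
function resolveRefs(fields: Entity, ids: Map<string, unknown>): Entity {
  return Object.fromEntries(Object.entries(fields).map(([k, v]) => [k, isRef(v) ? ids.get(v._ref) : v]))
}
```

Creation would iterate `sortEntities(payload)` in order, pass `resolveRefs(node.fields, ids)` to the factory, and record the returned `id` under the node's alias. Teardown walks the same list in reverse.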

#### When to register a factory

Always. The SDK writes through factories only — every model the dashboard can create needs one. The Step 2 entity audit classifies how the factory body should look:

| Audit value                                                                                                      | Factory body                                                                         |
| ---------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------ |
| `independently_created: true` (a `create`/`insert`/`register` function exists in a service or repository)        | Call that function from `create`.                                                    |
| `independently_created: true` and additionally hashes passwords, generates slugs, syncs to Stripe, etc.          | Call the function — your factory inherits the logic for free.                        |
| `independently_created: false` (only inline ORM calls scattered across route handlers, or no create path at all) | Make the same ORM call directly from `create`.                                       |
| Model is never created at all (seed-only lookup table)                                                           | Either omit it from your scenarios, or write a factory that re-creates the seed row. |

See [Dependents, cascades, and teardown](#dependents-cascades-and-teardown) below for how transitively-created rows come and go.

#### Dependents, cascades, and teardown

A root can mint dependent rows inline — e.g. `<Root>Service.create` may insert a root row plus a default child, a grandchild, and an onboarding row, all in one transaction. Step 2 records each dependent with a `created_by: [{owner, via, why}]` pointing back at the owner. The SDK does not automatically know about those rows; you have to tell it how to tear them down. Four options, in preference order:

1. **Schema cascade** — the FK chain from every dependent back to the root is `onDelete: Cascade` (Prisma) / `ON DELETE CASCADE` (raw SQL). Deleting the root row is enough; the DB handles the rest. Nothing to configure on the factory. This is the easiest case and usually the intent when the production code mints everything in one transaction.

2. **Call the app’s delete function** — if your codebase already has a delete method that tears down the same subtree (e.g. a `<Root>Service.delete` that removes the root and every dependent it minted), register `teardown` on the root’s factory to call it:

   ```typescript
   <Root>: defineFactory({
     inputSchema: <Root>Input,
     create: async (data) => <Root>Service.create(data),
     teardown: async (record) => <Root>Service.delete(record.id as string),
   }),
   ```

3. **Forward dependent IDs that the production `create` already returns** — if the production `create` function returns the dependent IDs in its result (e.g. `{ root, child, grandchild }`), surface those IDs from the factory so they land in refs, and write a `teardown` that deletes them in reverse FK order using your app’s existing DB client:

   ```typescript
   import { db } from '@/db'


   <Root>: defineFactory({
     inputSchema: <Root>Input,
     create: async (data) => {
       const { root, child, grandchild } = await <Root>Service.create(data)
       return { id: root.id, childId: child.id, grandchildId: grandchild.id }
     },
     teardown: async (record) => {
       await db.<grandchild>.delete({ where: { id: record.grandchildId } })
       await db.<child>.delete({ where: { id: record.childId } })
       await db.<root>.delete({ where: { id: record.id } })
     },
   }),
   ```

4. **None of the above — STOP.** Do NOT modify your production service to return more IDs than it already does just to satisfy the test harness. Adding test-only return values to production code inverts the relationship we want (tests adapt to production, not the other way around). Instead, report the gap: add a cascade to the schema, add a delete function to the service, or accept orphans between runs (acceptable when the test database is reset periodically).

Pure dependents (`independently_created: false`) typically still get a factory — registered as a thin repository call — unless they are minted transitively by the parent’s `create`. If they are, omit them from the create payload and let the parent’s `teardown` clean them up.

#### Factory context

Both `create` and `teardown` receive a context object. There is no SDK-managed DB client — your factory imports the same client/repository singletons your app’s services use:

```typescript
interface FactoryContext {
  refs: Record<string, Record<string, unknown>[]>  // all records created so far
  scenarioName: string
  testRunId: string
}
```
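
As an illustration of reading `refs` from the context, the sketch below looks up the first record created for a model. The helper `firstRefId` is hypothetical, and the sketch assumes the context arrives as the second argument to `create` and `teardown`; check the SDK typings for the exact position.

```typescript
interface FactoryContext {
  refs: Record<string, Record<string, unknown>[]>
  scenarioName: string
  testRunId: string
}

// Hypothetical helper: id of the first record created so far for `model`.
function firstRefId(ctx: FactoryContext, model: string): string {
  const record = ctx.refs[model]?.[0]
  const id = record?.['id']
  if (typeof id !== 'string' && typeof id !== 'number') {
    throw new Error(`No ${model} has been created yet in scenario "${ctx.scenarioName}"`)
  }
  return String(id)
}

// Inside a factory this could be used as (assumed signature):
//   create: async (data, ctx) =>
//     projectService.create({ ...data, organizationId: firstRefId(ctx, 'Organization') })
```

Throwing with the scenario name in the message makes an ordering mistake in the create payload easy to trace.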

## The Create Tree Format

The `create` field in `up` requests is a nested JSON tree. Top-level keys are model names. Children are nested inside their parents using the relation field name from your ORM schema.

### How nesting works

The SDK reads your ORM schema’s relations to know what each nested key means. The nested key must match the exact relation field name from the parent model.

```json
{
  "create": {
    "Organization": [{
      "name": "Acme Corp",
      "members": [{
        "role": "owner",
        "user": [{ "name": "Alice", "email": "alice@test.com" }]
      }]
    }]
  }
}
```

This creates:

1. One Organization
2. One User (created first because Member holds the FK to it)
3. One Member with `organizationId` set to the Organization’s ID and `userId` set to the User’s ID

The SDK handles both FK directions automatically:

* **FK on child** (most common): `Application.organizationId` → Organization is created first, then Application with `organizationId` set
* **FK on parent** (reverse): `Member.userId` → User is created first, then Member gets `userId` set

### What to include in fields

* **Required fields** without defaults that are not auto-generated
* **Unique fields** with values unique per test run (use `testRunId` in emails, slugs, etc.)
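
One way to keep unique fields unique per run is to derive them from the run id. The helpers below are hypothetical and only show the string handling; how the run id reaches your scenario data depends on your setup.

```typescript
// Normalize the run id so it is safe inside email local parts and slugs.
function runTag(testRunId: string): string {
  return testRunId.toLowerCase().replace(/[^a-z0-9]+/g, '-').replace(/^-+|-+$/g, '')
}

// Hypothetical helper: e.g. ('admin', 'abc123') gives 'admin-abc123@test.com'.
function uniqueEmail(localPart: string, testRunId: string, domain = 'test.com'): string {
  return `${localPart}-${runTag(testRunId)}@${domain}`
}

// Hypothetical helper: e.g. ('acme', 'abc123') gives 'acme-abc123'.
function uniqueSlug(base: string, testRunId: string): string {
  return `${base}-${runTag(testRunId)}`
}
```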

### What to omit

* **id** — auto-generated by the database
* **Fields with defaults** — the database or ORM handles them
* **Auto-updated timestamps** — `updatedAt` is handled by the ORM
* **FK fields handled by nesting** — if you nest Application under Organization, don’t set `organizationId` manually
* **The scope field** — the SDK injects it automatically

### Cross-branch references (`_alias` / `_ref`)

When a record needs a FK to something in a different branch of the tree, use `_alias` to name a node and `_ref` to reference it:

```json
{
  "create": {
    "Organization": [{
      "name": "Acme Corp",
      "applications": [{
        "_alias": "webApp",
        "name": "Marketing Website",
        "architecture": "WEB",


        "testPlans": [{
          "name": "Smoke Plan",
          "plan": "content",
          "testGenerations": [{
            "_alias": "gen1",
            "conversation": "[]",
            "status": "success",
            "applicationId": { "_ref": "webApp" }
          }]
        }],


        "tests": [{
          "name": "Homepage Test",
          "testGenerationId": { "_ref": "gen1" },
          "steps": [
            { "order": 1, "interaction": "click", "params": {} }
          ]
        }]
      }]
    }]
  }
}
```

Rules:

* `_alias` is a string name you choose. It must be unique across the entire scenario.
* `_ref` resolves to the `id` of the aliased node after it’s created.
* The aliased node must appear before the `_ref` in depth-first traversal order.
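
The rules above can be checked before sending a scenario. The following is a sketch of such a check, assuming a pre-order depth-first walk in key order; `validateAliases` is a hypothetical helper, not an SDK function.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json }

// Returns a list of rule violations; an empty list means the tree passes both checks.
function validateAliases(create: { [model: string]: Json }): string[] {
  const errors: string[] = []
  const seen = new Set<string>()

  const visit = (value: Json, path: string): void => {
    if (Array.isArray(value)) {
      value.forEach((item, i) => visit(item, `${path}[${i}]`))
      return
    }
    if (typeof value !== 'object' || value === null) return

    // A `{ "_ref": "name" }` object must point at an alias already seen in traversal order.
    const ref = value['_ref']
    if (typeof ref === 'string') {
      if (!seen.has(ref)) errors.push(`${path}: _ref "${ref}" appears before its _alias`)
      return
    }

    // Register this node's alias before descending, so its own children may reference it.
    const alias = value['_alias']
    if (typeof alias === 'string') {
      if (seen.has(alias)) errors.push(`${path}: duplicate _alias "${alias}"`)
      seen.add(alias)
    }
    for (const [key, child] of Object.entries(value)) {
      if (key !== '_alias') visit(child, `${path}.${key}`)
    }
  }

  for (const [model, rows] of Object.entries(create)) visit(rows, model)
  return errors
}
```

Run against the example above, the walk registers `webApp`, then `gen1` inside `testPlans`, and only afterwards reaches the `tests` entry that references `gen1`, so no violations are reported.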

## Validating the Lifecycle

After setting up the endpoint, validate that `up` creates the correct data and `down` cleans it up completely. **This must happen before writing tests** — it catches bad assumptions about scenario data early.

### Smoke test with curl

```bash
SECRET="your-shared-secret-here"
URL="http://localhost:3000/api/autonoma"
BODY='{"action":"discover"}'
# printf avoids the trailing newline and, unlike `echo -n`, behaves the same across shells
SIG=$(printf '%s' "$BODY" | openssl dgst -sha256 -hmac "$SECRET" | sed 's/.*= //')
curl -s -X POST "$URL" \
  -H "Content-Type: application/json" \
  -H "x-signature: $SIG" \
  -d "$BODY" | jq .
```
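
The same smoke test can be scripted from Node. This sketch only builds the signed request; `buildSignedRequest` is a hypothetical helper, and the URL is the local example used in the curl snippet.

```typescript
import { createHmac } from 'node:crypto'

// Hypothetical helper: serialize once, sign that exact string, and send the same string as the body.
function buildSignedRequest(url: string, payload: unknown, sharedSecret: string) {
  const body = JSON.stringify(payload)
  const signature = createHmac('sha256', sharedSecret).update(body, 'utf8').digest('hex')
  return {
    url,
    init: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json', 'x-signature': signature },
      body,
    },
  }
}

// Usage on Node 18+, where `fetch` is global:
//   const { url, init } = buildSignedRequest(
//     'http://localhost:3000/api/autonoma',
//     { action: 'discover' },
//     process.env.AUTONOMA_SHARED_SECRET!,
//   )
//   const res = await fetch(url, init)
//   console.log(res.status, await res.json())
```

Signing the serialized string rather than the object avoids mismatches caused by key ordering or whitespace differences between the signed and the sent body.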

**Expected**: A JSON response with your full schema — models, fields, edges, relations.

### Integration test with checkScenario

```typescript
import { checkScenario, defineFactory } from '@autonoma-ai/sdk'
import { z } from 'zod'


const factories = {
  Organization: defineFactory({
    inputSchema: z.object({ name: z.string(), slug: z.string() }),
    refSchema: z.object({ id: z.string() }),
    // `data` typed { name: string; slug: string }; `record` typed { id: string }
    create: async (data) => organizationService.create(data),
    teardown: async (record) => organizationService.delete(record.id),
  }),
  User: defineFactory({
    inputSchema: z.object({ name: z.string(), email: z.string(), organizationId: z.string() }),
    create: async (data) => userService.create(data),
  }),
}


const result = await checkScenario(
  factories,
  {
    create: {
      Organization: [{ _alias: 'org', name: 'Test Org', slug: 'test-org' }],
      User: [{ name: 'Admin', email: 'admin@test.com', organizationId: { _ref: 'org' } }],
    },
  },
  { scopeField: 'organizationId' },
)


// result.valid   — true if up + down both succeeded
// result.phase   — 'ok' | 'up' | 'down' (where it failed)
// result.timing  — { upMs, downMs }
// result.errors  — [{ phase, message, fix? }]
```

`checkScenario` runs the full `up` → `down` cycle through your factories — same code path the dashboard would hit.

### What to verify

1. **After `up`**: Query the database (read-only) to confirm all expected records exist with correct field values
2. **After `down`**: Query the database to confirm all created records were deleted — no orphans remain
3. **Auth works**: Use the returned cookies/headers to make an authenticated request to your app

## Enable in Production

The endpoint returns 404 (`PRODUCTION_BLOCKED`) in production by default. To run tests against a production deployment, opt in explicitly with `allowProduction: true`:

```typescript
export const POST = createHandler({
  scopeField: 'organizationId',
  sharedSecret: process.env.AUTONOMA_SHARED_SECRET!,
  signingSecret: process.env.AUTONOMA_SIGNING_SECRET!,
  factories: { /* ... */ },
  allowProduction: true,
  auth: async (user) => { /* ... */ },
})
```

## Connect to Autonoma

Deploy your endpoint and paste `AUTONOMA_SHARED_SECRET` into the Autonoma dashboard when connecting your app. The platform will:

1. Call `discover` to learn your schema
2. Generate scenario data based on your models
3. Send that data in `up` requests before each test
4. Send `down` requests after each test to clean up

## Troubleshooting

| Problem                    | Cause                                          | Fix                                                                                         |
| -------------------------- | ---------------------------------------------- | ------------------------------------------------------------------------------------------- |
| `INVALID_SIGNATURE` (401)  | Shared secret mismatch                         | Check `AUTONOMA_SHARED_SECRET` matches between your server and the Autonoma dashboard       |
| `SAME_SECRETS` (500)       | Both secrets are identical                     | Use two different values from `openssl rand -hex 32`                                        |
| `PRODUCTION_BLOCKED` (404) | Running in production mode                     | Set `allowProduction: true` or ensure `NODE_ENV` is not `production`                        |
| `INVALID_REFS_TOKEN` (403) | Signing secret changed between `up` and `down` | Ensure the same `AUTONOMA_SIGNING_SECRET` is used for both                                  |
| `FACTORY_MISSING_PK`       | Factory `create` didn’t return the primary key | Ensure your factory returns at least `{ id: "..." }`                                        |
| FK violation on `up`       | Missing required FK in scenario data           | Check that all required relationships are nested correctly in the create tree               |
| FK violation on `down`     | Circular FK between tables                     | The SDK handles cycles with deferred updates — if this still fails, check for untracked FKs |
| Parallel tests collide     | Same email/name across runs                    | Use `testRunId` in all unique fields                                                        |

# Scenario Recipe Schema

> Canonical JSON contract for the scenario recipes file uploaded to Autonoma at POST /v1/setup/setups/:id/scenario-recipe-versions.

This page documents the **canonical upload contract** for scenario recipes. It is language-agnostic: the schema is described as JSON with per-field expectations. The source of truth lives in `packages/types/src/schemas/scenarios.ts` (`ScenarioRecipesFileSchema`).

The file is posted as the JSON body of:

```plaintext
POST /v1/setup/setups/:setupId/scenario-recipe-versions
```

## Top-level shape

```json
{
  "version": 1,
  "source": {
    "discoverPath": "string",
    "scenariosPath": "string"
  },
  "validationMode": "sdk-check" | "endpoint-lifecycle",
  "recipes": [ /* at least one ScenarioRecipe */ ]
}
```

| Field                  | Type                                    | Required | Notes                                                                                                                                                                                     |
| ---------------------- | --------------------------------------- | -------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `version`              | integer, must equal `1`                 | yes      | Contract version. Currently only `1` is accepted. Not a string.                                                                                                                           |
| `source`               | object                                  | yes      | Provenance pointers. Additional keys are preserved.                                                                                                                                       |
| `source.discoverPath`  | string                                  | yes      | Path (relative to the application repo) to the discovery output, e.g. `autonoma/discover.json`. **Required** - omitting it causes Zod to fail with `expected string, received undefined`. |
| `source.scenariosPath` | string                                  | yes      | Path to the human-readable scenarios document, e.g. `autonoma/scenarios.md`.                                                                                                              |
| `validationMode`       | `"sdk-check"` \| `"endpoint-lifecycle"` | yes      | How Autonoma validated the recipes before upload. `sdk-check` = `checkScenario`/`checkAllScenarios`. `endpoint-lifecycle` = real HTTP `up`/`down`.                                        |
| `recipes`              | array, minimum length `1`               | yes      | One entry per scenario. See below.                                                                                                                                                        |
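
Before uploading, the top-level rules in the table can be checked locally. The sketch below is a plain TypeScript pre-flight check written from the table alone; it is not the canonical `ScenarioRecipesFileSchema`, and `checkRecipesFile` is a hypothetical helper that does not validate the individual recipes.

```typescript
const isObject = (v: unknown): v is Record<string, unknown> =>
  typeof v === 'object' && v !== null && !Array.isArray(v)

// Returns a list of problems; an empty list means the top-level shape looks right.
function checkRecipesFile(input: unknown): string[] {
  if (!isObject(input)) return ['file must be a JSON object']
  const errors: string[] = []

  if (input['version'] !== 1) errors.push('version must be the integer 1 (not a string)')

  const source = input['source']
  if (!isObject(source)) {
    errors.push('source must be an object')
  } else {
    if (typeof source['discoverPath'] !== 'string') errors.push('source.discoverPath must be a string')
    if (typeof source['scenariosPath'] !== 'string') errors.push('source.scenariosPath must be a string')
  }

  const mode = input['validationMode']
  if (mode !== 'sdk-check' && mode !== 'endpoint-lifecycle') {
    errors.push('validationMode must be "sdk-check" or "endpoint-lifecycle"')
  }

  const recipes = input['recipes']
  if (!Array.isArray(recipes) || recipes.length < 1) errors.push('recipes must contain at least one entry')

  return errors
}
```

Collecting every problem instead of stopping at the first one gives a complete report in a single pass.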

## `ScenarioRecipe` (one entry in `recipes[]`)

```json
{
  "name": "string",
  "description": "string",
  "create": { /* arbitrary model graph, see below */ },
  "variables": { /* optional, see below */ },
  "validation": {
    "status": "validated",
    "method": "checkScenario" | "checkAllScenarios" | "endpoint-up-down",
    "phase": "ok",
    "up_ms": 0,
    "down_ms": 0
  }
}
```

| Field                | Type                                                                  | Required | Notes                                                                                                                                                          |
| -------------------- | --------------------------------------------------------------------- | -------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `name`               | string                                                                | yes      | Stable identifier. Must match the scenario name used in the LLM-facing docs.                                                                                   |
| `description`        | string                                                                | yes      | Human-readable summary of the scenario state.                                                                                                                  |
| `create`             | object                                                                | yes      | The model graph passed to the SDK’s `createScenario` / `up` flow. Keys are model names; values are arrays or objects of seeded rows. Extra keys are preserved. |
| `variables`          | object (map of name → definition)                                     | no       | Per-recipe dynamic values. See **Variable definitions** below.                                                                                                 |
| `validation`         | object                                                                | yes      | Proof that the recipe was validated. `status`, `method`, and `phase` are required; the timing fields are optional.                                             |
| `validation.status`  | literal string `"validated"`                                          | yes      |                                                                                                                                                                |
| `validation.method`  | one of `"checkScenario"`, `"checkAllScenarios"`, `"endpoint-up-down"` | yes      | Which validator produced this result.                                                                                                                          |
| `validation.phase`   | literal string `"ok"`                                                 | yes      |                                                                                                                                                                |
| `validation.up_ms`   | non-negative integer                                                  | no       | Milliseconds the `up` phase took.                                                                                                                              |
| `validation.down_ms` | non-negative integer                                                  | no       | Milliseconds the `down` phase took.                                                                                                                            |
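
As an illustration of how a `validation` object could be assembled for the `endpoint-up-down` method, the sketch below times two callables and records the durations as non-negative integers. `run_up` and `run_down` are hypothetical stand-ins for whatever performs the real `up` and `down` calls; they are not part of any Autonoma SDK.

```python
import time
from typing import Callable


def timed_ms(fn: Callable[[], None]) -> int:
    """Run fn and return the elapsed wall-clock time in whole milliseconds."""
    start = time.perf_counter()
    fn()
    return max(0, round((time.perf_counter() - start) * 1000))


def build_validation(run_up: Callable[[], None], run_down: Callable[[], None]) -> dict:
    """Build a `validation` object after both phases succeed.

    run_up / run_down are hypothetical callables. Any exception they raise
    propagates, so a failed phase never produces a "validated" record.
    """
    up_ms = timed_ms(run_up)
    down_ms = timed_ms(run_down)
    return {
        "status": "validated",
        "method": "endpoint-up-down",
        "phase": "ok",
        "up_ms": up_ms,
        "down_ms": down_ms,
    }


# Stand-in phases: a short sleep for `up`, a no-op for `down`.
validation = build_validation(lambda: time.sleep(0.01), lambda: None)
```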

## Variable definitions

`variables` is a map from variable name to a **tagged union** discriminated by the `strategy` field. Exactly one of the three shapes below is valid per entry. Unknown `strategy` values are rejected.

### `literal`

Emits a fixed scalar on every run.

```json
{
  "strategy": "literal",
  "value": "admin@example.com"
}
```

| Field      | Type                                | Required | Notes                                                    |
| ---------- | ----------------------------------- | -------- | -------------------------------------------------------- |
| `strategy` | literal `"literal"`                 | yes      |                                                          |
| `value`    | string \| number \| boolean \| null | yes      | Any JSON scalar. Objects and arrays are **not** allowed. |

### `derived`

Derives a deterministic value from the test run ID: every resolution within the same run yields the same value, while different runs yield different values.

```json
{
  "strategy": "derived",
  "source": "testRunId",
  "format": "user-{shortId}@example.com"
}
```

| Field      | Type                  | Required | Notes                                                                        |
| ---------- | --------------------- | -------- | ---------------------------------------------------------------------------- |
| `strategy` | literal `"derived"`   | yes      |                                                                              |
| `source`   | literal `"testRunId"` | yes      | Only `testRunId` is supported today.                                         |
| `format`   | string                | yes      | Template. The token `{shortId}` is replaced with a short hash of the run ID. |

### `faker`

Generates a fresh random value per run using Faker.

```json
{
  "strategy": "faker",
  "generator": "internet.email"
}
```

| Field       | Type                     | Required | Notes                                                              |
| ----------- | ------------------------ | -------- | ------------------------------------------------------------------ |
| `strategy`  | literal `"faker"`        | yes      |                                                                    |
| `generator` | dotted Faker method path | yes      | e.g. `internet.email`, `person.firstName`, `commerce.productName`. |
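
To make the tagged union concrete, here is a minimal sketch of a resolver that dispatches on `strategy`. It rests on two assumptions that are not confirmed by this reference: the "short hash" is taken to be the first 8 hex characters of SHA-256, and the `faker` branch looks the generator up in a hypothetical `generators` mapping instead of calling a real Faker instance.

```python
import hashlib
from typing import Any, Callable, Mapping


def short_id(test_run_id: str, length: int = 8) -> str:
    """Assumed hash: first `length` hex chars of SHA-256 of the run ID."""
    return hashlib.sha256(test_run_id.encode("utf-8")).hexdigest()[:length]


def resolve_variable(
    definition: Mapping[str, Any],
    test_run_id: str,
    generators: Mapping[str, Callable[[], Any]],
) -> Any:
    """Resolve one entry of `variables` by dispatching on `strategy`."""
    strategy = definition.get("strategy")
    if strategy == "literal":
        value = definition["value"]
        if isinstance(value, (dict, list)):
            raise ValueError("literal value must be a JSON scalar")
        return value
    if strategy == "derived":
        if definition["source"] != "testRunId":
            raise ValueError("only source 'testRunId' is supported")
        return definition["format"].replace("{shortId}", short_id(test_run_id))
    if strategy == "faker":
        # `generators` is a hypothetical lookup from dotted path to a callable,
        # standing in for a real Faker instance.
        return generators[definition["generator"]]()
    raise ValueError(f"unknown strategy: {strategy!r}")


derived = {
    "strategy": "derived",
    "source": "testRunId",
    "format": "user-{shortId}@example.com",
}
first = resolve_variable(derived, "run-123", {})
again = resolve_variable(derived, "run-123", {})
other = resolve_variable(derived, "run-456", {})
```

Under these assumptions, `first` and `again` are identical because they share a run ID, while `other` differs.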

## Full example

```json
{
  "version": 1,
  "source": {
    "discoverPath": "autonoma/discover.json",
    "scenariosPath": "autonoma/scenarios.md"
  },
  "validationMode": "sdk-check",
  "recipes": [
    {
      "name": "adminWithTwoProjects",
      "description": "Organization with an admin user and two projects.",
      "create": {
        "Organization": [{ "id": "org-1", "name": "Acme" }],
        "User": [
          {
            "id": "user-1",
            "email": "{adminEmail}",
            "role": "admin",
            "organizationId": "org-1"
          }
        ],
        "Project": [
          { "id": "proj-1", "name": "Alpha", "organizationId": "org-1" },
          { "id": "proj-2", "name": "Beta",  "organizationId": "org-1" }
        ]
      },
      "variables": {
        "adminEmail": {
          "strategy": "derived",
          "source": "testRunId",
          "format": "admin-{shortId}@acme.test"
        }
      },
      "validation": {
        "status": "validated",
        "method": "checkScenario",
        "phase": "ok",
        "up_ms": 142,
        "down_ms": 61
      }
    }
  ]
}
```
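
The example above references the `adminEmail` variable as `{adminEmail}` inside `create`. The sketch below shows one way resolved values could be substituted into the model graph. It assumes plain string replacement of `{name}` tokens in every string value; the actual substitution rules are not specified in this reference, and `substitute` is a hypothetical helper.

```python
from typing import Any, Mapping


def substitute(node: Any, values: Mapping[str, Any]) -> Any:
    """Return a copy of `node` with `{name}` placeholders replaced in strings."""
    if isinstance(node, dict):
        return {key: substitute(child, values) for key, child in node.items()}
    if isinstance(node, list):
        return [substitute(child, values) for child in node]
    if isinstance(node, str):
        for name, value in values.items():
            node = node.replace("{" + name + "}", str(value))
        return node
    return node  # numbers, booleans, and null pass through unchanged


create = {
    "User": [{"id": "user-1", "email": "{adminEmail}", "role": "admin"}],
    "Project": [{"id": "proj-1", "name": "Alpha"}],
}
resolved = substitute(create, {"adminEmail": "admin-1a2b3c4d@acme.test"})
```

The function builds new containers rather than mutating its input, so the original `create` graph can be reused across runs.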

## Common rejection reasons

* **`expected string, received undefined` under `source.discoverPath`** - the `source` object is missing `discoverPath`. Both `discoverPath` and `scenariosPath` are required.
* **Discriminated union error under `recipes[n].variables.<name>`** - an unknown or missing `strategy` key. Use exactly one of `"literal"`, `"derived"`, `"faker"`.
* **`version` must be literal `1`** - don’t send `"1"` or `"1.0"`; send the integer `1`.
* **`recipes` must contain at least 1 element** - empty arrays are rejected.
* **`validation.status` / `validation.phase` mismatch** - both are fixed literals (`"validated"` / `"ok"`). Any other value fails.
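
A small pre-upload check can catch these cases locally. The sketch below is an illustrative, stdlib-only approximation of the rules listed above, not the validator Autonoma runs; `find_rejections` and its messages are hypothetical.

```python
from typing import List

STRATEGIES = ("literal", "derived", "faker")


def find_rejections(doc: dict) -> List[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems: List[str] = []

    # `version` must be the integer 1. bool is excluded because True == 1 in Python.
    version = doc.get("version")
    if type(version) is not int or version != 1:
        problems.append("version must be the integer 1")

    source = doc.get("source")
    if not isinstance(source, dict):
        source = {}
    for key in ("discoverPath", "scenariosPath"):
        if not isinstance(source.get(key), str):
            problems.append(f"source.{key} must be a string")

    recipes = doc.get("recipes")
    if not isinstance(recipes, list) or not recipes:
        problems.append("recipes must contain at least 1 element")
        return problems

    for i, recipe in enumerate(recipes):
        if not isinstance(recipe, dict):
            problems.append(f"recipes[{i}] must be an object")
            continue
        validation = recipe.get("validation")
        if not isinstance(validation, dict):
            validation = {}
        if validation.get("status") != "validated" or validation.get("phase") != "ok":
            problems.append(f"recipes[{i}].validation status/phase mismatch")
        variables = recipe.get("variables") or {}
        for name, definition in variables.items():
            strategy = definition.get("strategy") if isinstance(definition, dict) else None
            if strategy not in STRATEGIES:
                problems.append(f"recipes[{i}].variables.{name} has an unknown strategy")
    return problems


bad = {
    "version": "1",
    "source": {"scenariosPath": "autonoma/scenarios.md"},
    "recipes": [],
}
rejections = find_rejections(bad)
```

For the `bad` document, `rejections` reports the string `version`, the missing `discoverPath`, and the empty `recipes` array.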

## Related

* [Scenarios step (test-planner)](/test-planner/step-3-scenarios/) - how scenarios are authored.
* [Validate step (test-planner)](/test-planner/step-5-validate/) - how recipes are validated before upload.
* [Environment Factory guide](/guides/environment-factory/) - the `up` / `down` / `discover` SDK that consumes these recipes at runtime.

# TypeScript

> Autonoma Environment Factory examples with Express, Hono, and Next.js.

The TypeScript SDK is **factory-driven**: you register one factory per model and the SDK derives the discover schema from each factory’s Zod `inputSchema`. There is no database introspection, no ORM executor, and no SQL fallback — your factories own creation, the SDK owns the protocol.

`zod` is a peer dependency: `npm install zod` (any v3.23+ or v4 release works).

## Express

Uses `createExpressHandler` from `@autonoma-ai/server-express`. The factories use whatever Prisma / Drizzle / pg client your app already has — the SDK does not need a connection.

src/index.ts

```typescript
import express from 'express'
import { z } from 'zod'
import { defineFactory } from '@autonoma-ai/sdk'
import { createExpressHandler } from '@autonoma-ai/server-express'
import { PrismaClient } from '@prisma/client'


import { OrganizationRepository } from './repositories/organization'
import { UserRepository } from './repositories/user'


const prisma = new PrismaClient()
const organizationRepo = new OrganizationRepository(prisma)
const userRepo = new UserRepository(prisma)


const OrganizationInput = z.object({ name: z.string() })
const UserInput = z.object({
  email: z.string(),
  name: z.string(),
  organizationId: z.string(),
})


const app = express()
app.use(express.json())


app.post(
  '/api/autonoma',
  createExpressHandler({
    // The column that scopes all models to a tenant
    scopeField: 'organizationId',
    // Shared with Autonoma — verifies incoming requests via HMAC-SHA256
    sharedSecret: process.env.AUTONOMA_SHARED_SECRET!,
    // Private to your server — signs the refs token so teardown only deletes what was created
    signingSecret: process.env.AUTONOMA_SIGNING_SECRET!,


    // Every model the dashboard can create needs a factory.
    // `defineFactory` infers `data`'s type from `inputSchema` — no z.infer<...> needed.
    factories: {
      Organization: defineFactory({
        inputSchema: OrganizationInput,
        create: async (data) => organizationRepo.create({ name: data.name }),
        teardown: async (record) =>
          organizationRepo.delete(record.id as string),
      }),
      User: defineFactory({
        inputSchema: UserInput,
        create: async (data) =>
          userRepo.create({
            email: data.email,
            name: data.name,
            organizationId: data.organizationId,
          }),
      }),
    },


    // Called after `up` — returns credentials so Autonoma can make authenticated requests
    auth: async (user) => ({ headers: { Authorization: 'Bearer test-token' } }),
  }),
)
```

[Full source code on GitHub](https://github.com/Autonoma-AI/sdk/tree/main/examples/typescript/express)

***

## Next.js (App Router)

`createHandler` from `@autonoma-ai/server-web` works with any Web-standard runtime: Next.js App Router, Hono, Bun, Deno.

src/app/api/autonoma/route.ts

```typescript
import { z } from 'zod'
import { defineFactory } from '@autonoma-ai/sdk'
import { createHandler } from '@autonoma-ai/server-web'


import { db } from '@/db'
import { OrganizationRepository } from '@/repositories/organization'
import { UserRepository } from '@/repositories/user'


const organizationRepo = new OrganizationRepository(db)
const userRepo = new UserRepository(db)


const OrganizationInput = z.object({ name: z.string() })
const UserInput = z.object({
  email: z.string(),
  name: z.string(),
  organizationId: z.string(),
})


export const POST = createHandler({
  scopeField: 'organizationId',
  sharedSecret: process.env.AUTONOMA_SHARED_SECRET!,
  signingSecret: process.env.AUTONOMA_SIGNING_SECRET!,


  factories: {
    Organization: defineFactory({
      inputSchema: OrganizationInput,
      create: async (data) => organizationRepo.create({ name: data.name }),
      teardown: async (record) =>
        organizationRepo.delete(record.id as string),
    }),
    User: defineFactory({
      inputSchema: UserInput,
      create: async (data) =>
        userRepo.create({
          email: data.email,
          name: data.name,
          organizationId: data.organizationId,
        }),
    }),
  },


  auth: async () => ({ headers: { Authorization: 'Bearer test-token' } }),
})
```

[Full source code on GitHub](https://github.com/Autonoma-AI/sdk/tree/main/examples/typescript/nextjs)

***

## Hono

Same factories; the `createHonoHandler` adapter unwraps a Hono `Context` into the Web-standard request the SDK expects.

```typescript
import { Hono } from 'hono'
import { createHonoHandler } from '@autonoma-ai/server-hono'


const app = new Hono()
app.post('/api/autonoma', createHonoHandler({ /* same config as above */ }))
```

[Full source code on GitHub](https://github.com/Autonoma-AI/sdk/tree/main/examples/typescript/hono)

***

## What `inputSchema` does

The Zod schema you pass as `inputSchema`:

1. **Drives discover** — the SDK walks the schema’s shape to describe the model to the dashboard (field names, types, required/optional, defaults). No database introspection runs.
2. **Validates the create payload** — before invoking your `create` function, the SDK calls `inputSchema.safeParse(payload)` and passes the parsed value in. Validation failures bubble up as a 500 with the field path the dashboard can show inline.
3. **Drives types** — `defineFactory` is generic over the schemas you pass. `data` inside `create` is automatically typed as `z.infer<typeof inputSchema>` and `record` inside `teardown` is automatically typed as `z.infer<typeof refSchema>` when you set one. No `z.infer<...>` annotations at the call site.
4. **Lets you accept extras** — recipes can carry display-only metadata (e.g. `_alias`) without failing validation; Zod ignores keys that aren’t part of your schema by default.

### Validated teardown with `refSchema`

Adding a `refSchema` lets `teardown` work against a typed record (validated through Zod first). When `refSchema` is set, `create`’s return type is constrained to its input shape — the same record flows from `create` → `down` token → `teardown` with no manual casts.

```typescript
const ProjectInput = z.object({ name: z.string(), organizationId: z.string() })
const ProjectRef = z.object({ id: z.string(), name: z.string() })


defineFactory({
  inputSchema: ProjectInput,
  refSchema: ProjectRef,
  // `data` typed as { name: string; organizationId: string }
  create: async (data) => projectService.create(data),
  // `record` typed as { id: string; name: string }
  teardown: async (record) => projectService.delete(record.id),
})
```

Without `refSchema`, `create`’s return type widens to `Record<string, unknown> & { id: string | number }` and `record` in `teardown` matches that shape — the existing factories above keep compiling.

# Python

> Autonoma Environment Factory examples with FastAPI, Flask, and Django.

The Python SDK is **factory-driven**: you register one factory per model and the SDK derives the discover schema from each factory’s Pydantic `input_model`. There is no database introspection, no ORM executor, and no SQL fallback — your factories own creation, the SDK owns the protocol.

## FastAPI + SQLAlchemy

Uses `create_fastapi_handler` from `autonoma_fastapi`. The factories use whatever SQLAlchemy session your app already has — the SDK does not need a connection.

app.py

```python
import os
from pydantic import BaseModel, ConfigDict
from autonoma.types import HandlerConfig
from autonoma.factory import define_factory
from autonoma_fastapi import create_fastapi_handler


from database import session
from repositories.organization import OrganizationRepository
from repositories.user import UserRepository


organization_repo = OrganizationRepository(session)
user_repo = UserRepository(session)


class OrganizationInput(BaseModel):
    model_config = ConfigDict(extra="ignore")
    name: str


class UserInput(BaseModel):
    model_config = ConfigDict(extra="ignore")
    email: str
    name: str
    organization_id: str


config = HandlerConfig(
    # The column that scopes all models to a tenant — used to isolate test data
    scope_field="organization_id",
    # Shared with Autonoma — verifies incoming requests via HMAC-SHA256
    shared_secret=os.environ["AUTONOMA_SHARED_SECRET"],
    # Private to your server — signs the refs token so teardown only deletes what was created
    signing_secret=os.environ["AUTONOMA_SIGNING_SECRET"],


    # Every model the dashboard can create needs a factory.
    # The factory's input_model drives both validation and discover.
    factories={
        "Organization": define_factory(
            create=lambda data, ctx: organization_repo.create({"name": data.name}),
            teardown=lambda record, ctx: organization_repo.delete(record["id"]),
            input_model=OrganizationInput,
        ),
        "User": define_factory(
            create=lambda data, ctx: user_repo.create({
                "email": data.email,
                "name": data.name,
                "organization_id": data.organization_id,
            }),
            input_model=UserInput,
        ),
    },


    # Called after `up` — returns credentials so Autonoma can make authenticated requests
    auth=lambda user, context: {"headers": {"Authorization": "Bearer test-token"}},
)


router = create_fastapi_handler(config)
# `app` is your existing FastAPI application instance.
app.include_router(router, prefix="/api/autonoma")
```

[Full source code on GitHub](https://github.com/Autonoma-AI/sdk/tree/main/examples/python/fastapi-sqlalchemy)
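
The config comments mention HMAC-SHA256 for both request verification and the signed refs token. The SDK handles this internally, so the sketch below is only a generic illustration of the primitive using Python's standard library. The payload shape, the hex encoding, and the helper names `sign` and `verify` are assumptions made for the example, not the wire format Autonoma uses.

```python
import hashlib
import hmac


def sign(secret: str, body: bytes) -> str:
    """Return the hex HMAC-SHA256 of `body` under `secret`."""
    return hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()


def verify(secret: str, body: bytes, signature: str) -> bool:
    """Compare the expected and received signatures in constant time."""
    return hmac.compare_digest(sign(secret, body), signature)


# Illustrative payload only; the real request body is defined by the protocol.
body = b'{"action":"up","scenario":"adminWithTwoProjects"}'
signature = sign("example-shared-secret", body)

accepted = verify("example-shared-secret", body, signature)
tampered = verify("example-shared-secret", body + b" ", signature)
wrong_key = verify("another-secret", body, signature)
```

The same idea explains why teardown can only delete what `up` created: a refs token signed with a secret that never leaves your server cannot be forged or altered without the signature check failing.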

***

## Flask + SQLAlchemy

Same `HandlerConfig`, different server adapter. `create_flask_handler` returns a Flask Blueprint.

app.py

```python
from autonoma_flask import create_flask_handler


# Same HandlerConfig as FastAPI — scope_field, secrets, factories, auth.
# The only difference is the server adapter.
bp = create_flask_handler(config)
# `app` is your existing Flask application instance.
app.register_blueprint(bp, url_prefix="/api/autonoma")
```

[Full source code on GitHub](https://github.com/Autonoma-AI/sdk/tree/main/examples/python/flask-sqlalchemy)

***

## Django

`create_django_handler` returns a Django view function (already decorated with `@csrf_exempt` + `@require_POST`).

core/autonoma\_config.py

```python
import os
from pydantic import BaseModel, ConfigDict
from autonoma.types import HandlerConfig
from autonoma.factory import define_factory
from autonoma_django import create_django_handler


from core.repositories.organization import OrganizationRepository
from core.repositories.user import UserRepository


organization_repo = OrganizationRepository()
user_repo = UserRepository()


class OrganizationInput(BaseModel):
    model_config = ConfigDict(extra="ignore")
    name: str


class UserInput(BaseModel):
    model_config = ConfigDict(extra="ignore")
    email: str
    name: str
    organization_id: str


config = HandlerConfig(
    scope_field="organization_id",
    shared_secret=os.environ["AUTONOMA_SHARED_SECRET"],
    signing_secret=os.environ["AUTONOMA_SIGNING_SECRET"],
    factories={
        "Organization": define_factory(
            create=lambda data, ctx: organization_repo.create({"name": data.name}),
            teardown=lambda record, ctx: organization_repo.delete(record["id"]),
            input_model=OrganizationInput,
        ),
        "User": define_factory(
            create=lambda data, ctx: user_repo.create({
                "email": data.email,
                "name": data.name,
                "organization_id": data.organization_id,
            }),
            input_model=UserInput,
        ),
    },
    auth=lambda user, context: {"headers": {"Authorization": "Bearer test-token"}},
)


handler = create_django_handler(config)
```

[Full source code on GitHub](https://github.com/Autonoma-AI/sdk/tree/main/examples/python/django)

***

## What `input_model` does

The Pydantic class you pass as `input_model`:

1. **Drives discover** — the SDK introspects `model_fields` to describe the model to the dashboard (field names, types, required/optional, defaults). No database introspection runs.
2. **Validates the create payload** — before invoking your `create` function, the SDK calls `input_model.model_validate(payload)` and passes the typed instance in. Your factory body works on a real Python object, not a `dict`.
3. **Lets you accept extras with `extra="ignore"`** — recipes can carry display-only metadata (e.g. `_alias`) without failing validation.

If you also want validated teardown, declare a `ref_model` (a Pydantic class describing the record returned by `create`) and the SDK will call `ref_model.model_validate(record)` before each `teardown` call.
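
The Pydantic behavior these points rely on can be seen in isolation. This sketch uses only Pydantic v2 calls (`model_fields`, `model_validate`, `ValidationError`) and does not reproduce the SDK's actual discover output; the `described` dictionary is a hypothetical shape chosen for the example.

```python
from pydantic import BaseModel, ConfigDict, ValidationError


class UserInput(BaseModel):
    model_config = ConfigDict(extra="ignore")
    email: str
    name: str
    organization_id: str


# Field metadata a discover step could read; no database is involved.
described = {
    field_name: {"type": info.annotation.__name__, "required": info.is_required()}
    for field_name, info in UserInput.model_fields.items()
}

# Extra keys such as `_alias` are dropped instead of failing validation.
user = UserInput.model_validate(
    {
        "email": "a@example.com",
        "name": "Ada",
        "organization_id": "org-1",
        "_alias": "admin",
    }
)

# A missing required field raises ValidationError carrying the field path.
try:
    UserInput.model_validate({"email": "a@example.com"})
    missing = []
except ValidationError as exc:
    missing = sorted(error["loc"][0] for error in exc.errors())
```

A `ref_model` would be validated the same way, with `model_validate` applied to the record returned by `create`.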

# Elixir

> Autonoma Environment Factory example with Phoenix.

The Elixir SDK is **factory-driven**: you register one factory per model and the SDK derives the discover schema from each factory’s `input_fields`. There is no database introspection, no Ecto executor, and no SQL fallback — your factories own creation, the SDK owns the protocol.

## Phoenix

Uses `Autonoma.Plug.Handler` as a Plug mounted via Phoenix’s `forward` macro. The factories use whatever Ecto Repo or service module your app already has — the SDK does not need a database connection.

lib/autonoma\_example/router.ex

```elixir
defmodule AutonomaExample.Router do
  use Phoenix.Router


  alias AutonomaExample.Repositories


  @autonoma_config %{
    # The column that scopes all models to a tenant — used to isolate test data
    scope_field: "organization_id",
    # Shared with Autonoma — verifies incoming requests via HMAC-SHA256
    shared_secret: System.get_env("AUTONOMA_SHARED_SECRET") || "",
    # Private to your server — signs the refs token so teardown only deletes what was created
    signing_secret: System.get_env("AUTONOMA_SIGNING_SECRET") || "",


    # Every model the dashboard can create needs a factory.
    # The factory's input_fields drives both validation and discover.
    factories: %{
      "Organization" => Autonoma.Factory.define_factory(%{
        input_fields: [
          %{name: "name", type: "string", required: true}
        ],
        create: fn data, _ctx -> Repositories.Organization.create(data) end,
        teardown: fn record, _ctx -> Repositories.Organization.delete(record["id"]) end
      }),
      "User" => Autonoma.Factory.define_factory(%{
        input_fields: [
          %{name: "email", type: "string", required: true},
          %{name: "name", type: "string", required: true},
          %{name: "organization_id", type: "string", required: true}
        ],
        create: fn data, _ctx -> Repositories.User.create(data) end
      })
    },


    # Called after `up` — returns credentials so Autonoma can make authenticated requests
    auth: fn _user, _context ->
      %{"headers" => %{"Authorization" => "Bearer test-token"}}
    end
  }


  forward "/api/autonoma", Autonoma.Plug.Handler, @autonoma_config
end
```

[Full source code on GitHub](https://github.com/Autonoma-AI/sdk/tree/main/examples/elixir/phoenix)

***

## What `input_fields` does

The field list you pass as `input_fields`:

1. **Drives discover** — the SDK uses the field definitions to describe the model to the dashboard (field names, types, required/optional). No database introspection runs.
2. **Validates the create payload** — before invoking your `create` function, the SDK checks that all required fields are present and strips unknown keys. Your factory body works on a clean map.
3. **Keeps it simple** — no external dependencies required. Use `"string"`, `"integer"`, `"number"`, `"boolean"`, `"timestamp"`, `"date"`, `"uuid"`, or `"json"` as the type.

# Java

> Autonoma Environment Factory example with Spring Boot.

The Java SDK is **factory-driven**: you register one factory per model and the SDK derives the discover schema from each factory’s `inputClass` (a Java class). There is no database introspection, no JDBC executor, and no SQL fallback — your factories own creation, the SDK owns the protocol.

## Spring Boot

Uses `AutonomaController` from `ai.autonoma.spring`. Configured as a Spring `@Configuration` bean. The factories use whatever `JdbcTemplate`, JPA repository, or service layer your app already has — the SDK does not need a database connection.

AutonomaConfig.java

```java
@Configuration
public class AutonomaConfig {


    public record OrganizationInput(String name) {}
    public record UserInput(String email, String name, String organizationId) {}


    @Bean
    // Spring injects the application's DataSource bean as a method parameter.
    public AutonomaController autonomaController(DataSource dataSource) {
        OrganizationRepository organizationRepo = new OrganizationRepository(dataSource);
        UserRepository userRepo = new UserRepository(dataSource);


        HandlerConfig config = new HandlerConfig(
            // The column that scopes all models to a tenant — used to isolate test data
            "organization_id",
            // Shared with Autonoma — verifies incoming requests via HMAC-SHA256
            System.getenv("AUTONOMA_SHARED_SECRET"),
            // Private to your server — signs the refs token so teardown only deletes what was created
            System.getenv("AUTONOMA_SIGNING_SECRET"),
            // Called after `up` — returns credentials so Autonoma can make authenticated requests
            (user, context) -> AuthResult.ofHeaders(
                Map.of("Authorization", "Bearer test-token")
            )
        );


        // Every model the dashboard can create needs a factory.
        // The factory's inputClass drives both validation and discover.
        config.setFactories(Map.of(
            "Organization", FactoryUtil.defineFactory(
                OrganizationInput.class,
                (data, ctx) -> organizationRepo.create(data),
                (record, ctx) -> organizationRepo.delete((String) record.get("id"))
            ),
            "User", FactoryUtil.defineFactory(
                UserInput.class,
                (data, ctx) -> userRepo.create(data)
            )
        ));


        return new AutonomaController(config);
    }
}
```

[Full source code on GitHub](https://github.com/Autonoma-AI/sdk/tree/main/examples/java/spring-boot)

***

## What `inputClass` does

The Java class you pass as the first argument to `defineFactory`:

1. **Drives discover** — the SDK uses reflection to walk the class’s declared fields and map Java types to the dashboard’s type system. No database introspection runs.
2. **Validates the create payload** — before invoking your `create` function, the SDK uses Jackson’s `ObjectMapper.convertValue` to deserialize the incoming map into an instance of your class. Type mismatches fail validation.
3. **Uses standard Java conventions.** Field names come from Jackson `@JsonProperty` annotations (or the field name itself), and Java types map to SDK types automatically (`String` → `"string"`, `int`/`long` → `"integer"`, `double` → `"number"`, `boolean` → `"boolean"`, `Instant` → `"timestamp"`, `UUID` → `"uuid"`).

# Ruby

> Autonoma Environment Factory example with Rails.

The Ruby SDK is **factory-driven**: you register one factory per model and the SDK derives the discover schema from each factory’s `input_fields`. There is no database introspection, no ActiveRecord executor, and no SQL fallback — your factories own creation, the SDK owns the protocol.

## Rails

Uses `AutonomaRails::Handler` mixin in a standard Rails controller. The factories use whatever ActiveRecord models, service objects, or repositories your app already has — the SDK does not need a database connection.

app/controllers/autonoma\_controller.rb

```ruby
require "autonoma"
require "autonoma_rails"


class AutonomaController < ApplicationController
  include AutonomaRails::Handler


  def handle
    autonoma_handle(autonoma_config)
  end


  private


  def autonoma_config
    @autonoma_config ||= Autonoma::Types::HandlerConfig.new(
      # The column that scopes all models to a tenant — used to isolate test data
      scope_field: "organization_id",
      # Shared with Autonoma — verifies incoming requests via HMAC-SHA256
      shared_secret: ENV.fetch("AUTONOMA_SHARED_SECRET", ""),
      # Private to your server — signs the refs token so teardown only deletes what was created
      signing_secret: ENV.fetch("AUTONOMA_SIGNING_SECRET", ""),


      # Every model the dashboard can create needs a factory.
      # The factory's input_fields drives both validation and discover.
      factories: {
        "Organization" => Autonoma::Factory.define_factory(
          input_fields: [
            { name: "name", type: "string", required: true }
          ],
          create: ->(data, _ctx) { OrganizationRepository.create(data) },
          teardown: ->(record, _ctx) { OrganizationRepository.delete(record["id"]) }
        ),
        "User" => Autonoma::Factory.define_factory(
          input_fields: [
            { name: "email", type: "string", required: true },
            { name: "name", type: "string", required: true },
            { name: "organization_id", type: "string", required: true }
          ],
          create: ->(data, _ctx) { UserRepository.create(data) }
        ),
      },


      # Called after `up` — returns credentials so Autonoma can make authenticated requests
      auth: ->(_user, _context) {
        { "headers" => { "Authorization" => "Bearer test-token" } }
      }
    )
  end
end
```

[Full source code on GitHub](https://github.com/Autonoma-AI/sdk/tree/main/examples/ruby/rails)

***

## What `input_fields` does

The field definitions you pass as `input_fields`:

1. **Drives discover** — the SDK uses the field definitions to describe the model to the dashboard (field names, types, required/optional). No database introspection runs.
2. **Validates the create payload** — before invoking your `create` function, the SDK checks that all required fields are present and strips unknown keys. Your factory body works on a clean Hash.
3. **Keeps it simple** — no external gems required. Use `"string"`, `"integer"`, `"number"`, `"boolean"`, `"timestamp"`, `"date"`, `"uuid"`, or `"json"` as the type.

# Rust

> Autonoma Environment Factory example with Axum.

The Rust SDK is **factory-driven**: you register one factory per model and the SDK derives the discover schema from each factory’s `input_fields`. There is no database introspection, no SQLx executor, and no SQL fallback — your factories own creation, the SDK owns the protocol.

## Axum

Uses `create_axum_handler` from `autonoma_sdk::axum`. Factories are registered in a `HashMap<String, FactoryDefinition>`. The factories use whatever SQLx pool, Diesel connection, or service layer your app already has — the SDK does not need a database connection.

src/main.rs

```rust
use autonoma_sdk::axum::create_axum_handler;
use autonoma_sdk::factory::define_factory;
use autonoma_sdk::types::{FactoryContext, FactoryRegistry, FieldDef, HandlerConfig};
use axum::{routing::post, Router};
use std::collections::HashMap;


let mut factories: FactoryRegistry = HashMap::new();


factories.insert(
    "Organization".to_string(),
    define_factory(
        vec![FieldDef::required("name", "string")],
        |data, ctx| Box::pin(create_organization(data, ctx)),
        Some(|record, ctx| Box::pin(delete_organization(record, ctx))),
    ),
);


factories.insert(
    "User".to_string(),
    define_factory(
        vec![
            FieldDef::required("email", "string"),
            FieldDef::required("name", "string"),
            FieldDef::required("organization_id", "string"),
        ],
        |data, ctx| Box::pin(create_user(data, ctx)),
        None,
    ),
);


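// Read both secrets from the environment and fail fast if either is missing.
// This assumes the `HandlerConfig` secret fields are plain `String`s.
let shared_secret = std::env::var("AUTONOMA_SHARED_SECRET")
    .expect("AUTONOMA_SHARED_SECRET must be set");
let signing_secret = std::env::var("AUTONOMA_SIGNING_SECRET")
    .expect("AUTONOMA_SIGNING_SECRET must be set");


let config = HandlerConfig {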
    // The column that scopes all models to a tenant — used to isolate test data
    scope_field: "organization_id".to_string(),
    // Shared with Autonoma — verifies incoming requests via HMAC-SHA256
    shared_secret,
    // Private to your server — signs the refs token so teardown only deletes what was created
    signing_secret,
    factories: Some(factories),
    // Called after `up` — returns credentials so Autonoma can make authenticated requests
    auth: Some(Box::new(|_user, _ctx| {
        Box::pin(async {
            Ok(serde_json::json!({"headers": {"Authorization": "Bearer test-token"}}))
        })
    })),
    ..Default::default()
};


let app = Router::new()
    .route("/api/autonoma", post(create_axum_handler(config)));
```

[Full source code on GitHub](https://github.com/Autonoma-AI/sdk/tree/main/examples/rust/axum)

***

## What `input_fields` does

The `Vec<FieldDef>` you pass as the first argument to `define_factory`:

1. **Drives discover** — the SDK uses the field definitions to describe the model to the dashboard (field names, types, required/optional). No database introspection runs.
2. **Validates the create payload** — before invoking your `create` function, the SDK checks that all required fields are present in the `serde_json::Map`. Your factory body works on a validated map.
3. **Keeps it simple** — no external dependencies required beyond `serde_json`. Use `"string"`, `"integer"`, `"number"`, `"boolean"`, `"timestamp"`, `"date"`, `"uuid"`, or `"json"` as the type.

# Go

> Autonoma Environment Factory example with Gin.

The Go SDK is **factory-driven**: you register one factory per model and the SDK derives the discover schema from each factory’s `InputStruct` (a Go struct type). There is no database introspection, no SQL executor, and no SQL fallback — your factories own creation, the SDK owns the protocol.

## Gin

Uses `autonoma.GinHandler` with factories registered in an `autonoma.FactoryRegistry` map. The factories use whatever `*sql.DB`, GORM, or service layer your app already has — the SDK does not need a database connection.

main.go

```go
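import (
    "os"
    "reflect"

    "github.com/autonoma-ai/sdk-go/autonoma"
    "github.com/gin-gonic/gin"
)

// db is your application's existing database handle (for example *sql.DB or a
// GORM instance), initialized elsewhere. The factories below close over it.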

type OrganizationInput struct {
    Name string `json:"name"`
}

type UserInput struct {
    Email          string `json:"email"`
    Name           string `json:"name"`
    OrganizationID string `json:"organization_id"`
}

config := &autonoma.HandlerConfig{
    // The column that scopes all models to a tenant — used to isolate test data
    ScopeField:   "organization_id",
    // Shared with Autonoma — verifies incoming requests via HMAC-SHA256
    SharedSecret:  os.Getenv("AUTONOMA_SHARED_SECRET"),
    // Private to your server — signs the refs token so teardown only deletes what was created
    SigningSecret:  os.Getenv("AUTONOMA_SIGNING_SECRET"),

    // Every model the dashboard can create needs a factory.
    // The factory's InputStruct drives both validation and discover.
    Factories: autonoma.FactoryRegistry{
        "Organization": autonoma.DefineFactory(autonoma.FactoryOpts{
            InputStruct: reflect.TypeOf(OrganizationInput{}),
            Create: func(data any, ctx autonoma.FactoryContext) (map[string]any, error) {
                input := data.(*OrganizationInput)
                return createOrganization(db, input)
            },
            Teardown: func(record map[string]any, ctx autonoma.FactoryContext) error {
                return deleteOrganization(db, record["id"].(string))
            },
        }),
        "User": autonoma.DefineFactory(autonoma.FactoryOpts{
            InputStruct: reflect.TypeOf(UserInput{}),
            Create: func(data any, ctx autonoma.FactoryContext) (map[string]any, error) {
                input := data.(*UserInput)
                return createUser(db, input)
            },
        }),
    },

    // Called after `up` — returns credentials so Autonoma can make authenticated requests
    Auth: func(user map[string]any, ctx autonoma.AuthContext) (*autonoma.AuthResult, error) {
        return &autonoma.AuthResult{
            Extra: map[string]any{"headers": map[string]any{"Authorization": "Bearer test-token"}},
        }, nil
    },
}

r := gin.Default()
r.POST("/api/autonoma", autonoma.GinHandler(config))
```

[Full source code on GitHub](https://github.com/Autonoma-AI/sdk/tree/main/examples/go/gin)

***

## What `InputStruct` does

The Go struct type you pass as `InputStruct`:

1. **Drives discover** — the SDK uses `reflect` to walk the struct’s fields and `json` tags to describe the model to the dashboard (field names, types, required/optional). No database introspection runs.
2. **Validates the create payload** — before invoking your `Create` function, the SDK decodes the payload with `json.Unmarshal` into a new instance of the struct. Type mismatches and missing required fields fail validation. Your factory body receives a typed pointer to the struct, which is why the example asserts `data.(*OrganizationInput)`.
3. **Uses standard Go conventions** — field names come from `json` struct tags; Go types map to SDK types automatically (`string` → `"string"`, `int` → `"integer"`, `float64` → `"number"`, `bool` → `"boolean"`, `time.Time` → `"timestamp"`, `uuid.UUID` → `"uuid"`).

# PHP

> Autonoma Environment Factory example with Laravel.

The PHP SDK is **factory-driven**: you register one factory per model and the SDK derives the discover schema from each factory’s `inputFields`. There is no database introspection, no Eloquent executor, and no SQL fallback — your factories own creation, the SDK owns the protocol.

## Laravel

Uses the auto-discovered service provider from `autonoma/sdk`. The entire setup is configuration-driven via `config/autonoma.php`. The factories use whatever Eloquent models, repositories, or service classes your app already has — the SDK does not need a database connection.

config/autonoma.php

```php
<?php
use App\Repositories\OrganizationRepository;
use App\Repositories\UserRepository;
use Autonoma\Sdk\Factory;
use Autonoma\Sdk\Types\FieldInfo;
use Autonoma\Sdk\Types\FactoryContext;

return [
    // The column that scopes all models to a tenant — used to isolate test data
    'scope_field' => 'organization_id',
    // Shared with Autonoma — verifies incoming requests via HMAC-SHA256
    'shared_secret' => env('AUTONOMA_SHARED_SECRET', ''),
    // Private to your server — signs the refs token so teardown only deletes what was created
    'signing_secret' => env('AUTONOMA_SIGNING_SECRET', ''),
    'path' => 'api/autonoma',

    // Every model the dashboard can create needs a factory.
    // The factory's inputFields drives both validation and discover.
    'factories' => [
        'Organization' => Factory::define(
            inputFields: [
                new FieldInfo('name', 'string', true),
            ],
            create: function (array $data, FactoryContext $ctx) {
                return (new OrganizationRepository())->create(['name' => $data['name']]);
            },
            teardown: function (array $record, FactoryContext $ctx) {
                (new OrganizationRepository())->delete($record['id']);
            }
        ),
        'User' => Factory::define(
            inputFields: [
                new FieldInfo('email', 'string', true),
                new FieldInfo('name', 'string', true),
                new FieldInfo('organization_id', 'string', true),
            ],
            create: function (array $data, FactoryContext $ctx) {
                return (new UserRepository())->create([
                    'email' => $data['email'],
                    'name' => $data['name'],
                    'organization_id' => $data['organization_id'],
                ]);
            }
        ),
    ],

    // Called after `up` — returns credentials so Autonoma can make authenticated requests
    'auth' => function (?array $user, array $context): array {
        return ['headers' => ['Authorization' => 'Bearer test-token']];
    },
];
```

[Full source code on GitHub](https://github.com/Autonoma-AI/sdk/tree/main/examples/php/laravel)

***

## What `inputFields` does

The `FieldInfo` array you pass as `inputFields`:

1. **Drives discover** — the SDK uses the field definitions to describe the model to the dashboard (field names, types, required/optional). No database introspection runs.
2. **Validates the create payload** — before invoking your `create` function, the SDK checks that all required fields are present and strips unknown keys. Your factory body works on a clean associative array.
3. **Keeps it simple** — no external dependencies required. Use `'string'`, `'integer'`, `'number'`, `'boolean'`, `'timestamp'`, `'date'`, `'uuid'`, or `'json'` as the type.
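
The validation step in point 2 can be pictured with a short, language-neutral sketch, written here in TypeScript. `FieldSpec` and `validatePayload` are hypothetical names, not the PHP SDK's API. Treating `null` as missing is an assumption made for the example, and checking values against the declared type is left out.

```typescript
// Hypothetical shapes for illustration; not the SDK's actual types.
type FieldType =
  | "string" | "integer" | "number" | "boolean"
  | "timestamp" | "date" | "uuid" | "json";

interface FieldSpec {
  name: string;
  type: FieldType;
  required: boolean;
}

// Returns a copy of `payload` that contains only declared fields.
// Throws if any required field is absent (undefined or null).
function validatePayload(
  fields: FieldSpec[],
  payload: Record<string, unknown>,
): Record<string, unknown> {
  const clean: Record<string, unknown> = {};
  const missing: string[] = [];
  for (const field of fields) {
    const value = payload[field.name];
    if (value === undefined || value === null) {
      if (field.required) missing.push(field.name);
      continue;
    }
    clean[field.name] = value;
  }
  if (missing.length > 0) {
    throw new Error(`Missing required fields: ${missing.join(", ")}`);
  }
  return clean;
}

// Mirrors the User factory's inputFields from the example above.
const userFields: FieldSpec[] = [
  { name: "email", type: "string", required: true },
  { name: "name", type: "string", required: true },
  { name: "organization_id", type: "string", required: true },
];
```

Passing a payload with an extra `admin` key to `validatePayload(userFields, ...)` returns only the three declared fields. Omitting `name` throws before any factory code would run.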

# Development Setup

> How to get Autonoma AI running locally - from prerequisites through a working dev environment.

## Prerequisites

You need three things installed before starting:

| Tool                              | Version | How to get it                                                   |
| --------------------------------- | ------- | --------------------------------------------------------------- |
| [Node.js](https://nodejs.org/)    | >= 24   | Use [nvm](https://github.com/nvm-sh/nvm) or download directly   |
| [pnpm](https://pnpm.io/)          | 10.x    | Run `corepack enable` - the version is pinned in `package.json` |
| [Docker](https://www.docker.com/) | Latest  | Docker Desktop or Docker Engine                                 |

**Optional tools** (only needed if you’re working on specific engines):

* [Playwright](https://playwright.dev/) - for `engine-web` development
* [Appium](https://appium.io/) - for `engine-mobile` development

## Clone and install

```bash
git clone https://github.com/autonoma-ai/autonoma.git
cd autonoma
pnpm install
```

`pnpm install` handles the entire monorepo - all apps and packages get their dependencies in one pass.

## Start infrastructure

PostgreSQL and Redis run via Docker Compose:

```bash
docker compose up -d
```

This starts:

* **PostgreSQL 18** on `localhost:5432` (user: `postgres`, password: `postgres`)
* **Redis** on `localhost:6379`

Verify they’re running:

```bash
docker compose ps
```

Both containers should show `running` status.

## Environment variables

Copy the example file and fill in the required values:

```bash
cp .env.example .env
```

### Minimum required variables

| Variable               | Description                  | Where to get it                                                                                                                                                                                |
| ---------------------- | ---------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `DATABASE_URL`         | PostgreSQL connection string | Use `postgresql://postgres:postgres@localhost:5432/autonoma` for the Docker Compose setup                                                                                                      |
| `REDIS_URL`            | Redis connection string      | Use `redis://localhost:6379` for the Docker Compose setup                                                                                                                                      |
| `BETTER_AUTH_SECRET`   | Session signing secret       | Generate any random string: `openssl rand -hex 32`                                                                                                                                             |
| `GOOGLE_CLIENT_ID`     | Google OAuth client ID       | Create OAuth credentials in the [Google Cloud Console](https://console.cloud.google.com/apis/credentials). Set the authorized redirect URI to `http://localhost:4000/api/auth/callback/google` |
| `GOOGLE_CLIENT_SECRET` | Google OAuth client secret   | Same Google Cloud Console OAuth credentials page                                                                                                                                               |
| `GEMINI_API_KEY`       | Google Gemini API key        | Get one from [Google AI Studio](https://aistudio.google.com/apikey)                                                                                                                            |

### How environment variables work in the codebase

The project uses `createEnv` from `@t3-oss/env-core` for environment variable validation. Each app has an `env.ts` file that defines its required variables with Zod schemas. Variables are validated at startup - if something is missing, you get a clear error message telling you exactly what to add.

You should never read `process.env` directly in application code. Instead, import from the app’s `env.ts` file.

See `.env.example` for the full list of variables grouped by service. Most optional variables have sensible defaults or are only needed for specific features (S3 storage, Sentry, PostHog, etc.).
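
The validate-at-startup pattern can be sketched without any dependencies. This is not the project's `env.ts`: the real files use `createEnv` with Zod schemas, while `loadEnv`, `EnvSource`, and `REQUIRED_VARS` below are hypothetical and only check that three of the variables from the table above are non-empty.

```typescript
// Hypothetical, dependency-free stand-in for the validate-at-startup pattern.
type EnvSource = Record<string, string | undefined>;

const REQUIRED_VARS = ["DATABASE_URL", "REDIS_URL", "BETTER_AUTH_SECRET"] as const;
type RequiredVar = (typeof REQUIRED_VARS)[number];

// Fails fast with one message that names every missing variable.
function loadEnv(source: EnvSource): Record<RequiredVar, string> {
  const missing = REQUIRED_VARS.filter((key) => !source[key]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variable(s): ${missing.join(", ")}`);
  }
  const env = {} as Record<RequiredVar, string>;
  for (const key of REQUIRED_VARS) {
    env[key] = source[key] as string;
  }
  return env;
}

// In an app, this would run once at startup, for example:
//   export const env = loadEnv(process.env);
// Application code then imports `env` instead of reading process.env.
```

Centralizing the check means a missing variable stops the process at boot with a named cause, rather than surfacing later as an `undefined` deep inside a request handler.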

## Database setup

Generate the Prisma client and run migrations:

```bash
pnpm db:generate
pnpm db:migrate
```

`db:generate` creates the TypeScript client from the Prisma schema. `db:migrate` applies all migrations to create the database tables.

You need to re-run `db:generate` whenever the Prisma schema changes (after pulling new changes or editing the schema yourself).

## Start development servers

```bash
pnpm dev
```

This starts both servers concurrently:

* **UI** at `http://localhost:3000` (Vite + React)
* **API** at `http://localhost:4000` (Hono + tRPC)

To run them individually:

```bash
pnpm api    # API only (port 4000)
pnpm ui     # UI only (port 3000)
```

## Verify everything works

1. Open `http://localhost:3000` in your browser
2. You should see the login page
3. Sign in with Google OAuth
4. If you see the dashboard, everything is working

Run the full check suite to make sure nothing is broken:

```bash
pnpm typecheck    # TypeScript type checking
pnpm lint         # ESLint
pnpm test         # Vitest
pnpm build        # Full build
```

## Other useful commands

| Command            | Description                                      |
| ------------------ | ------------------------------------------------ |
| `pnpm dev`         | Start API + UI in development mode               |
| `pnpm build`       | Build all packages and apps                      |
| `pnpm typecheck`   | Run TypeScript type checking across all packages |
| `pnpm lint`        | Lint all packages                                |
| `pnpm test`        | Run tests across all packages                    |
| `pnpm format`      | Format code with Biome                           |
| `pnpm check`       | Lint and format with Biome                       |
| `pnpm db:generate` | Generate Prisma client from schema               |
| `pnpm db:migrate`  | Run database migrations                          |
| `pnpm docs`        | Start the documentation site (port 4321)         |

## Troubleshooting

### `pnpm install` fails

Make sure you’re using pnpm 10.x. Run `corepack enable` so that Corepack (bundled with Node.js) uses the pnpm version pinned in `package.json`, then try again.

### Database connection refused

Check that Docker Compose is running: `docker compose ps`. If PostgreSQL isn’t up, check logs with `docker compose logs postgres`.

### Prisma generate fails

This usually means dependencies aren’t installed. Run `pnpm install` first, then `pnpm db:generate`.

### Port already in use

Another process is using port 3000 or 4000. Find and kill it:

```bash
lsof -i :3000  # or :4000
kill <PID>
```

### Google OAuth redirect error

Make sure your Google Cloud OAuth credentials have `http://localhost:4000/api/auth/callback/google` as an authorized redirect URI.

### “Missing environment variable” error on startup

The app validates all required environment variables at startup using `createEnv`. Check the error message for which variable is missing, then add it to your `.env` file.

### TypeScript errors after pulling changes

Run `pnpm db:generate` first (the Prisma client may have changed), then `pnpm build` to rebuild all packages. TypeScript errors in the UI or API often come from stale package builds.

# Architecture Overview

> High-level architecture of Autonoma AI - how the monorepo is organized, how data flows, and why each technology was chosen.

## How Autonoma works

Autonoma is an agentic E2E testing platform. Users describe tests in natural language, and an AI agent executes them on real browsers and devices. The core loop is:

1. User writes a test instruction (“Log in, go to settings, verify the avatar is visible”)
2. The execution agent takes a screenshot of the current screen
3. An LLM decides which action to perform (click, type, scroll, assert)
4. Platform drivers execute the action (Playwright for web, Appium for mobile)
5. The agent records the step and repeats until the test is done

Everything else - the API, the UI, the jobs - exists to support this loop.

## Monorepo structure

The codebase is split into **apps** (deployable services) and **packages** (shared libraries). Each package has exactly one concern.

```plaintext
apps/
  api/              Hono + tRPC API server
  ui/               Vite + React 19 SPA
  engine-web/       Playwright web test execution
  engine-mobile/    Appium mobile test execution
  docs/             Astro Starlight documentation site
  jobs/             Background jobs (multiple sub-services)

packages/
  ai/               AI primitives - models, vision, point detection
  analytics/        PostHog server-side event tracking
  billing/          Subscription and billing logic
  blacklight/       Shared UI component library
  db/               Prisma schema + generated client
  diffs/            Test diff computation
  emulator/         Mobile emulator management
  engine/           Platform-agnostic execution agent core
  errors/           Custom error hierarchy
  image/            Image processing utilities
  integration-test/ Test harness with Testcontainers
  k8s/              Kubernetes helpers
  logger/           Sentry-based structured logging
  review/           Post-execution AI review
  scenario/         Environment Factory scenario logic
  storage/          S3 file storage
  test-updates/     Test suite update logic
  types/            Shared Zod schemas and TypeScript types
  utils/            Shared utilities
  workflow/         Temporal workflow definitions
```

### Why apps vs packages?

**Apps** are independently deployable. Each one becomes its own Docker image and runs as its own process. The API, UI, and each engine are separate images - they never share a runtime.

**Packages** are shared code. They’re consumed by apps at build time via pnpm workspaces. A package like `@autonoma/ai` is used by both `engine-web` and `engine-mobile`, but it never runs on its own.

## How the apps connect

```plaintext
Browser
  |
  | HTTP (port 3000)
  v
 UI (Vite + React SPA)
  |
  | tRPC (port 4000)
  v
 API (Hono + tRPC)
  |
  |--- Prisma ---> PostgreSQL
  |--- Redis ----> Device locks, caching
  |
  | (dispatches jobs)
  v
 Engine Web / Engine Mobile
  |
  | Execution Agent (packages/engine)
  |--- Playwright (web) or Appium (mobile)
  |--- AI models (packages/ai)
  v
 Test results, recordings, artifacts
```

**UI to API**: The React SPA communicates with the API exclusively through tRPC. Types flow end-to-end - the frontend never manually defines API response types. Zod schemas in `packages/types` are the single source of truth for both sides.

**API to Database**: The API uses Prisma as its ORM. The schema lives in `packages/db` and is shared across all backend services.

**API to Engines**: When a test run starts, the API dispatches it to the appropriate engine (web or mobile). Engines execute tests independently and report results back.

**Engines to AI**: During execution, engines call into `packages/ai` for element detection, visual assertions, and agent decision-making. AI calls go to external providers (Google Gemini, Groq, OpenRouter).

## Tech stack

| Layer          | Technology                                 | Why                                                                                                  |
| -------------- | ------------------------------------------ | ---------------------------------------------------------------------------------------------------- |
| Runtime        | Node.js 24, ESM-only                       | Latest LTS with native ESM. No CommonJS compatibility issues                                         |
| Monorepo       | pnpm workspaces + Turborepo                | pnpm for fast, disk-efficient installs. Turborepo for cached, parallel builds                        |
| Language       | TypeScript (strictest)                     | Full type safety with `noUncheckedIndexedAccess`, `exactOptionalPropertyTypes`, and all strict flags |
| API            | Hono + tRPC                                | Hono is fast and lightweight. tRPC gives end-to-end type safety without code generation              |
| Frontend       | React 19 + Vite + TanStack Router          | Vite for fast dev builds. TanStack Router for type-safe routing with built-in data loading           |
| Database       | PostgreSQL + Prisma                        | PostgreSQL for reliability. Prisma for type-safe queries and migration management                    |
| Cache/Locking  | Redis                                      | Distributed device locking and caching across engine instances                                       |
| AI             | Gemini, Groq, OpenRouter via Vercel AI SDK | Multiple providers for different tasks. Vercel AI SDK unifies the interface                          |
| Web testing    | Playwright                                 | Most reliable browser automation library. Supports all major browsers                                |
| Mobile testing | Appium                                     | Industry standard for iOS and Android automation on real devices                                     |
| UI components  | Radix UI + Tailwind CSS v4 + CVA           | Accessible primitives (Radix), utility-first styling (Tailwind), type-safe variants (CVA)            |
| Observability  | Sentry                                     | Error tracking, performance monitoring, and structured logging in one tool                           |
| Analytics      | PostHog                                    | Product analytics with server-side event tracking                                                    |
| Deployment     | Kubernetes + Temporal                      | K8s for orchestration. Temporal for workflow-based test execution pipelines                          |

## The execution flow

This is the most important flow in the system - how a test goes from natural language to executed results.

### 1. Test creation

The user writes a test as a natural language instruction, optionally with a URL and configuration. The API stores it in PostgreSQL.

### 2. Test dispatch

When a test run starts, the API dispatches it to the appropriate engine based on the application type (web or mobile). For mobile, Redis-based device locking ensures exclusive access to physical devices.
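
A minimal sketch of the exclusive-access idea, assuming a lock keyed by device ID with an owner and a time-to-live. `DeviceLocks` is a hypothetical in-memory stand-in so the example runs without a Redis server; it is not the platform's locking code. With Redis, `acquire` would typically map to a single `SET key owner NX PX ttl` command, which sets the key only if it does not already exist.

```typescript
interface LockEntry {
  owner: string;
  expiresAt: number;
}

// In-memory stand-in for Redis. Time is passed in so behavior is deterministic.
class DeviceLocks {
  private readonly locks = new Map<string, LockEntry>();

  // Returns true if `owner` now holds the device, false if a live lock exists.
  acquire(deviceId: string, owner: string, ttlMs: number, now: number): boolean {
    const current = this.locks.get(deviceId);
    if (current !== undefined && current.expiresAt > now) return false;
    this.locks.set(deviceId, { owner, expiresAt: now + ttlMs });
    return true;
  }

  // Only the current owner may release. With Redis, this compare-and-delete
  // would need to be atomic (for example via a Lua script).
  release(deviceId: string, owner: string): boolean {
    const current = this.locks.get(deviceId);
    if (current === undefined || current.owner !== owner) return false;
    this.locks.delete(deviceId);
    return true;
  }
}
```

The time-to-live matters because an engine can crash mid-run: without expiry, the device it held would stay locked until someone intervened.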

### 3. Execution agent loop

The execution agent (`packages/engine`) runs a loop powered by the Vercel AI SDK:

```plaintext
Screenshot -> LLM decides action -> Execute command -> Record step -> Repeat
```

The agent has access to these commands:

| Command    | What it does                                                                            |
| ---------- | --------------------------------------------------------------------------------------- |
| **click**  | Uses vision AI to locate an element from a natural language description, then clicks it |
| **type**   | Locates an element, clicks it, then types text                                          |
| **scroll** | Scrolls up or down                                                                      |
| **assert** | Checks visual conditions against the current screenshot                                 |
| **wait**   | Pauses for a specified duration (for loading states)                                    |

The LLM (currently Gemini) sees the screenshot, the test instruction, and the steps taken so far, then decides which command to call next. When it determines the test is complete (or has failed), it calls `execution-finished`.
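
The loop can be sketched in a few lines. Everything here is illustrative: `Action`, `AgentDeps`, `runTest`, and the `maxSteps` cap are assumptions, not the interfaces in `packages/engine`. Real drivers and model calls are asynchronous; the sketch is synchronous to stay short.

```typescript
// Illustrative types; the command names come from the table above.
type Action =
  | { command: "click"; target: string }
  | { command: "type"; target: string; text: string }
  | { command: "scroll"; direction: "up" | "down" }
  | { command: "assert"; condition: string }
  | { command: "wait"; ms: number }
  | { command: "execution-finished"; success: boolean; reasoning: string };

interface Step {
  screenshot: string;
  action: Action;
}

interface AgentDeps {
  takeScreenshot(): string; // stand-in for a screen driver
  decide(instruction: string, screenshot: string, history: Step[]): Action; // stand-in for the LLM call
  execute(action: Action): void; // stand-in for Playwright or Appium
}

interface RunResult {
  success: boolean;
  reasoning: string;
  steps: Step[];
}

function runTest(instruction: string, deps: AgentDeps, maxSteps = 50): RunResult {
  const steps: Step[] = [];
  for (let i = 0; i < maxSteps; i++) {
    const screenshot = deps.takeScreenshot();
    const action = deps.decide(instruction, screenshot, steps);
    if (action.command === "execution-finished") {
      steps.push({ screenshot, action });
      return { success: action.success, reasoning: action.reasoning, steps };
    }
    deps.execute(action);
    steps.push({ screenshot, action });
  }
  // Assumed safeguard so a confused model cannot loop forever.
  return { success: false, reasoning: "Step limit reached", steps };
}
```

Because the model only sees the instruction, the current screenshot, and the step history, swapping the `decide` implementation (or scripting it in a test) does not require touching the drivers.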

### 4. AI-powered element detection

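Instead of CSS selectors or XPath expressions, the agent uses vision models to find UI elements. The `PointDetector` takes a screenshot and a natural language description (“the blue Submit button”) and returns pixel coordinates. This is what makes tests resilient to UI changes: elements are located by how they look rather than by where they sit in the DOM, so renamed classes or restructured markup do not break a test as long as the element still matches its description.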

### 5. Results and artifacts

Every test run produces:

* Step-by-step execution log with before/after screenshots
* Video recording of the entire session
* AI conversation log (what the model “thought” at each step)
* Success/failure status with reasoning

These artifacts are stored in S3 and accessible through the UI.

## Key design decisions

### ESM-only

Every `package.json` has `"type": "module"`. No CommonJS anywhere. This eliminates an entire class of import/export bugs and aligns with the direction of the Node.js ecosystem.

### Strictest TypeScript

All strict flags enabled, including `noUncheckedIndexedAccess` (array/object access returns `T | undefined`) and `exactOptionalPropertyTypes`. This catches real bugs at compile time. It’s more work upfront, but prevents entire categories of runtime errors.
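
A small example of what the two named flags change in practice. The function and interface names are made up for illustration; the code compiles with or without the flags, but with them enabled the compiler rejects the unsafe variants described in the comments.

```typescript
// With noUncheckedIndexedAccess, steps[0] has type `string | undefined`,
// so calling first.toUpperCase() without the check below is a compile error.
function firstStepLabel(steps: string[]): string {
  const first = steps[0];
  if (first === undefined) return "(no steps)";
  return first.toUpperCase();
}

// With exactOptionalPropertyTypes, `timeoutMs?: number` means "absent or a number".
// Passing { timeoutMs: undefined } explicitly is a compile error.
interface RunOptions {
  timeoutMs?: number;
}

function resolveTimeout(options: RunOptions): number {
  return options.timeoutMs ?? 30000;
}
```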

### Constructor injection

All dependencies are passed through constructors. No DI framework, no decorators, no magic. You can read any class and immediately see what it depends on.
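
A minimal sketch of the pattern, using hypothetical names (`Clock`, `RunStore`, `RunStarter`) rather than classes from the codebase:

```typescript
interface Run {
  id: string;
  startedAt: number;
}

interface Clock {
  now(): number;
}

interface RunStore {
  save(run: Run): void;
}

class RunStarter {
  private readonly clock: Clock;
  private readonly store: RunStore;

  // Every dependency is visible in the constructor signature.
  constructor(clock: Clock, store: RunStore) {
    this.clock = clock;
    this.store = store;
  }

  start(id: string): Run {
    const run: Run = { id, startedAt: this.clock.now() };
    this.store.save(run);
    return run;
  }
}
```

In a test, the same class can be built with plain object fakes, for example `new RunStarter({ now: () => 42 }, { save: () => {} })`, with no container or mocking library involved.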

### Separate Docker images

Each engine (web, mobile) and each job type gets its own Docker image. This keeps images small and deployment independent. A change to the web engine doesn’t require redeploying the mobile engine.

### Platform-agnostic agent core

All execution logic lives in `packages/engine`. Platform-specific apps (`engine-web`, `engine-mobile`) only implement driver interfaces (`ScreenDriver`, `MouseDriver`, etc.). The same agent loop, command system, and AI integration works for both Playwright and Appium.

## Deployment model

The platform runs on Kubernetes:

* **API** and **UI** are standard deployments with horizontal scaling
* **Engines** run on device-hosting machines (physical or virtual). Web engines need browsers, mobile engines need connected devices or emulators
* **Jobs** run as Temporal workflows - triggered on demand via Temporal workers
* **Redis** handles distributed device locking across engine instances
* **PostgreSQL** is the single source of truth for all state

# Package Guide

> What each package and app does, what it exports, and when you would modify it.

## Packages

Every package in `packages/` is a shared library consumed by one or more apps. Each has exactly one concern.

### ai

AI primitives used by the execution agent. Contains the model registry (manages LLM instances and providers), visual AI (screenshot analysis, assertion checking, element selection), point detection (locating UI elements from natural language descriptions), object detection (bounding box generation), and structured output generation.

**Key exports:** `ModelRegistry`, `PointDetector`, `ObjectDetector`, `VisualConditionChecker`, `AssertChecker`, `ObjectGenerator`, `AssertionSplitter`

**When to modify:** Adding a new AI model or provider, changing how elements are detected, adjusting assertion logic, or adding a new visual AI capability.

### analytics

PostHog server-side event tracking. Wraps `posthog-node` with Sentry trace linking. No-ops when not initialized, so it’s safe to import in dev and test environments.

**Key exports:** `analytics` (singleton)

**When to modify:** Adding new server-side analytics events, changing event properties, or adjusting the PostHog integration.
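
The no-op behavior can be sketched as follows. `Analytics`, `EventSink`, `init`, and `track` are hypothetical names used for illustration; the real package wraps `posthog-node` and its API may differ.

```typescript
interface EventSink {
  capture(event: string, properties: Record<string, unknown>): void;
}

class Analytics {
  private sink: EventSink | undefined;

  // Called once at startup in environments where tracking is wanted.
  init(sink: EventSink): void {
    this.sink = sink;
  }

  track(event: string, properties: Record<string, unknown> = {}): void {
    // Not initialized (dev, tests): silently do nothing.
    if (this.sink === undefined) return;
    this.sink.capture(event, properties);
  }
}

// Single shared instance, safe to import anywhere.
const analytics = new Analytics();
```

Callers never need to guard their `track` calls, which keeps event code identical across production, development, and tests.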

### billing

Subscription and billing logic. Handles plan management, usage tracking, and payment integration.

**Key exports:** Billing service classes and plan definitions

**When to modify:** Changing pricing plans, adding billing features, or integrating new payment providers.

### blacklight

Shared UI component library built on Radix UI + Tailwind CSS v4 + CVA. This is where all reusable frontend components live - buttons, cards, inputs, dialogs, tables, and more. Follows shadcn/ui patterns.

**Key exports:** `Button`, `Card`, `Input`, `Dialog`, `Table`, `Select`, `cn()`, and many more components

**When to modify:** Adding new UI components, updating component styles, or changing the design system. The path alias `@/*` maps to `packages/blacklight/src/*` inside the package.

### db

Prisma schema and generated client for PostgreSQL. This is the single source of truth for the database structure.

**Key exports:** `PrismaClient`, generated types for all models

**When to modify:** Adding or changing database tables, columns, relations, or indexes. After editing the schema, run `pnpm db:generate` and `pnpm db:migrate`.

### diffs

Test diff computation. Computes differences between test suite versions for change tracking and review.

**Key exports:** Diff computation functions

**When to modify:** Changing how test diffs are calculated or displayed.

### emulator

Mobile emulator management. Handles lifecycle management of iOS simulators and Android emulators.

**Key exports:** Emulator management classes

**When to modify:** Adding support for new device types, changing emulator configuration, or adjusting lifecycle management.

### engine

The core of test execution. This is a platform-agnostic AI agent that web and mobile engines extend. Contains the execution agent loop, command system (click, type, scroll, assert), driver interfaces, runner orchestration, and artifact management.

Everything is parameterized with generics (`TSpec` for command specs, `TContext` for driver context), so the same agent core works for both Playwright and Appium.

**Key exports:** `ExecutionAgent`, `ExecutionAgentRunner`, `AgentCommand`, `CommandRegistry`, driver interfaces (`ScreenDriver`, `MouseDriver`, `KeyboardDriver`, `NavigationDriver`, `ApplicationDriver`)

**When to modify:** Adding new commands to the agent, changing the execution loop, adjusting the system prompt, or modifying how steps are recorded.

### errors

Custom error hierarchy for the project. All errors extend `AutonomaError` with specific subclasses for different failure types.

**Key exports:** `AutonomaError`, `TestError`, `DriverError`, `PreconditionError`, `VerificationError`, `ThirdPartyError`

**When to modify:** Adding new error types or changing how errors are categorized.
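
A sketch of how such a hierarchy is typically used. Only the class names come from the package; the constructor signatures and the `describeFailure` helper are assumptions.

```typescript
class AutonomaError extends Error {
  constructor(message: string) {
    super(message);
    // Report the concrete subclass name (e.g. "DriverError") instead of "Error".
    this.name = new.target.name;
    // Keeps instanceof working if the code is compiled down to ES5.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

class TestError extends AutonomaError {}
class DriverError extends AutonomaError {}

// Callers can branch on the failure type, most specific first.
function describeFailure(error: unknown): string {
  if (error instanceof DriverError) return `driver: ${error.message}`;
  if (error instanceof TestError) return `test: ${error.message}`;
  if (error instanceof AutonomaError) return `autonoma: ${error.message}`;
  return "unknown error";
}
```

A shared base class lets a top-level handler distinguish errors the platform raised deliberately from unexpected ones with a single `instanceof AutonomaError` check.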

### image

Image processing utilities. Handles screenshot manipulation, resizing, and format conversion used throughout the execution pipeline.

**Key exports:** Image processing functions

**When to modify:** Changing how screenshots are processed, adding new image operations, or adjusting compression settings.

### integration-test

Test harness using Testcontainers. Provides `IntegrationHarness` and `integrationTestSuite` for writing integration tests that use real PostgreSQL and Redis containers.

**Key exports:** `IntegrationHarness`, `integrationTestSuite`

**When to modify:** Changing the test harness setup, adding new test utilities, or supporting new infrastructure in tests.

### k8s

Kubernetes helpers. Utilities for interacting with the K8s API, managing pods, and reading cluster state.

**Key exports:** Kubernetes client wrappers and helpers

**When to modify:** Changing how the platform interacts with Kubernetes, or adding new K8s operations.

### logger

Sentry-based structured logging. Provides a logger that integrates with Sentry for error tracking, performance monitoring, and structured context.

**Key exports:** `logger` (root logger), `Logger` type

**When to modify:** Changing the logging format, adjusting Sentry integration, or adding new logging capabilities.

### review

Post-execution AI review. Analyzes test execution recordings and results to validate whether tests passed correctly.

**Key exports:** Review service classes

**When to modify:** Changing how test results are reviewed, adjusting AI review prompts, or adding new review criteria.

### scenario

Environment Factory scenario logic. Handles test scenario definitions, data seeding, and teardown for isolated test environments.

**Key exports:** Scenario classes and types

**When to modify:** Adding new test scenarios, changing how test data is seeded, or adjusting the Environment Factory protocol.

### storage

S3 file storage. Handles uploading and downloading artifacts (screenshots, videos, test results) to S3-compatible storage.

**Key exports:** Storage service classes

**When to modify:** Changing storage providers, adjusting upload/download logic, or adding new artifact types.

### test-updates

Test suite update logic. Handles applying changes to test suites - adding, removing, and modifying test cases.

**Key exports:** Test update service classes

**When to modify:** Changing how test suites are modified, or adding new update operations.

### types

Shared Zod schemas and TypeScript types. This is the contract layer between the API and frontend. Schemas defined here are used for tRPC input validation and frontend type inference.

**Key exports:** Zod schemas for all API inputs/outputs, TypeScript types, constants

**When to modify:** Adding new API endpoints, changing request/response shapes, or adding shared constants.

### utils

Shared utilities that don’t fit into a more specific package.

**Key exports:** Various utility functions

**When to modify:** Adding general-purpose utilities used across multiple packages.

### workflow

Temporal workflow definitions and client. Orchestrates test execution pipelines using Temporal workflows and activities.

**Key exports:** Workflow builder classes

**When to modify:** Changing how test execution is orchestrated, adjusting workflow templates, or adding new pipeline steps.

## Apps

### api

The backend server. Built with Hono (HTTP framework) and tRPC (type-safe API layer). Routers are thin - they wire tRPC procedures to controller files in `controllers/<routerName>/`. One file per procedure.

**When to modify:** Adding new API endpoints, changing business logic, or adjusting authentication.

### ui

The frontend SPA. Built with React 19, Vite, and TanStack Router. Compiled to static files - no SSR. Uses `@autonoma/blacklight` for all UI components.

**When to modify:** Adding new pages, changing the UI, or adjusting frontend behavior.

### engine-web

Playwright-based web test execution. Implements the driver interfaces from `packages/engine` using Playwright’s API. Handles browser lifecycle, screenshot capture, network idle detection, and video recording.

**When to modify:** Changing web-specific test execution behavior, adjusting Playwright configuration, or fixing browser-related issues.

### engine-mobile

Appium-based mobile test execution for iOS and Android. Implements the same driver interfaces using Appium/WebDriver. Uses `@autonoma/device-lock` for Redis-based device allocation.

**When to modify:** Changing mobile-specific test execution behavior, adjusting Appium configuration, or adding support for new device types.

### docs

This documentation site. Built with Astro Starlight and deployed to S3 + CloudFront.

**When to modify:** Adding or updating documentation pages.

### jobs

Background job services, each deployed as a separate Docker image:

| Job                             | Purpose                                           |
| ------------------------------- | ------------------------------------------------- |
| **run-completion-notification** | Slack/email notifications when test runs complete |
| **scenario**                    | Environment Factory scenario execution            |
| **diffs**                       | Computes test suite diffs                         |

## Dependency graph

The general dependency flow (simplified):

```plaintext
apps (api, ui, engines, jobs)
 |
 +-- packages/types        (shared schemas - used by almost everything)
 +-- packages/db           (database - used by api, jobs)
 +-- packages/engine       (execution core - used by engines)
 +-- packages/ai           (AI primitives - used by engine, jobs)
 +-- packages/try          (error handling - used by everything)
 +-- packages/logger       (logging - used by everything)
 +-- packages/errors       (error types - used by engine, api)
 +-- packages/storage      (S3 - used by api, engines, jobs)
 +-- packages/blacklight   (UI components - used by ui only)
 +-- packages/analytics    (PostHog - used by api)
 +-- packages/workflow     (Temporal workflows - used by api, workers)
```

Key relationships:

* `packages/engine` depends on `packages/ai` for all AI operations
* `packages/ai` is self-contained - it only depends on `try`, `logger`, and `image`
* `packages/types` is a leaf dependency - it depends on nothing else in the monorepo
* `packages/try` is a leaf dependency - used everywhere, depends on nothing
* Both `engine-web` and `engine-mobile` depend on `packages/engine` but never on each other

# Code Conventions

> The rules of the Autonoma AI codebase - TypeScript patterns, error handling, logging, testing, and style guidelines.

## ESM-only

Every `package.json` has `"type": "module"`. No CommonJS anywhere in the codebase.

**Never use `.js` extensions in imports.** TypeScript and the bundler resolve modules automatically.

```ts
// Good
import { foo } from "./foo";
import { bar } from "@autonoma/types";


// Bad
import { foo } from "./foo.js";
```

## TypeScript strictness

All strict flags are enabled. Every package extends `tsconfig.base.json`, which includes:

* `strict: true` (enables all strict checks)
* `noUncheckedIndexedAccess` - array and object index access returns `T | undefined`
* `exactOptionalPropertyTypes` - optional properties can’t be assigned `undefined` explicitly unless typed that way
* `verbatimModuleSyntax` - enforces explicit `type` imports

In practice, this means:

* You must check array access results before using them
* You must narrow types before passing them to functions that expect non-nullable values
* You must use `import type { ... }` for type-only imports
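
As a minimal sketch of what the first two points look like in practice, the snippet below narrows an indexed access and omits an optional property instead of assigning `undefined` to it. The names `firstScreenshot`, `RetryOptions`, and `buildRetryOptions` are hypothetical and not taken from the codebase.

```ts
// Hypothetical example; names are illustrative only.
interface RetryOptions {
  retries?: number; // with exactOptionalPropertyTypes, `retries: undefined` is rejected
}

function firstScreenshot(paths: string[]): string {
  // noUncheckedIndexedAccess: paths[0] is typed `string | undefined`
  const first = paths[0];
  if (first == null) throw new Error("No screenshots captured");
  return first; // narrowed to `string`
}

function buildRetryOptions(retries?: number): RetryOptions {
  // Omit the property rather than assigning `undefined` to it
  if (retries == null) return {};
  return { retries };
}
```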

## Classes vs functions

**Needs state or dependencies?** Use a class with constructor injection.

**Pure logic with no state?** Use a function file.

In practice, almost everything is a class because most logic needs a logger, a database client, or some other dependency.

## Dependency injection

Plain constructor injection. No DI framework, no decorators.

```ts
class StepExecutor {
  private readonly logger: Logger;


  constructor(
    private readonly engine: Engine,
    private readonly db: PrismaClient,
  ) {
    this.logger = logger.child({ name: this.constructor.name });
  }
}
```

You can read any class constructor and immediately see all its dependencies. No magic, no hidden state.
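
One practical payoff is testability: a dependency can be swapped for a fake without a mocking library. The sketch below assumes a hypothetical `Clock` interface and `TrialChecker` class (neither exists in the codebase) and leaves out the logger for brevity.

```ts
// Hypothetical dependency and class, for illustration only.
interface Clock {
  now(): Date;
}

class TrialChecker {
  private readonly clock: Clock;

  constructor(clock: Clock) {
    this.clock = clock;
  }

  isExpired(endsAt: Date): boolean {
    return endsAt.getTime() < this.clock.now().getTime();
  }
}

// Production wiring passes a real clock...
const systemClock: Clock = { now: () => new Date() };
const checker = new TrialChecker(systemClock);

// ...while a test passes a fixed one.
const fixedClock: Clock = { now: () => new Date("2025-01-01T00:00:00Z") };
const testChecker = new TrialChecker(fixedClock);
```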

## One export per file

A file exports exactly one thing - a class, a function, or a type. The exported item comes first so the file reads top-to-bottom, and private helpers follow in the order they are called.

This keeps files focused and makes imports predictable.
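
A rough sketch of that layout, using a hypothetical `format-duration.ts` file (logging is omitted to keep the example short):

```ts
// Hypothetical file: format-duration.ts
// The single export comes first and reads like a summary of the file.
export function formatDuration(ms: number): string {
  const totalSeconds = toWholeSeconds(ms);
  return joinParts(Math.floor(totalSeconds / 60), totalSeconds % 60);
}

// Private helpers follow in the order they are called above.
// Function declarations are hoisted, so defining them after use is safe.
function toWholeSeconds(ms: number): number {
  return Math.floor(ms / 1000);
}

function joinParts(minutes: number, seconds: number): string {
  return `${minutes}m ${seconds}s`;
}
```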

## Custom error hierarchy

```plaintext
AutonomaError (base)
  TestError          - test execution failures
  DriverError        - Appium/Playwright driver failures
  PreconditionError  - setup/precondition failures
  VerificationError  - assertion failures
  ThirdPartyError    - external service failures
```
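
A hierarchy like this lets callers branch on failure type with `instanceof`. The sketch below reuses three class names from the list above, but the constructor and the `describeFailure` helper are assumptions made for illustration; the real classes may carry additional fields and context.

```ts
// Illustrative only: the real classes may differ.
class AutonomaError extends Error {
  constructor(message: string) {
    super(message);
    this.name = new.target.name; // e.g. "DriverError"
  }
}

class DriverError extends AutonomaError {}
class VerificationError extends AutonomaError {}

// Hypothetical helper: check the most specific types first, the base type last.
function describeFailure(error: unknown): string {
  if (error instanceof VerificationError) return `assertion failed: ${error.message}`;
  if (error instanceof DriverError) return `driver failed: ${error.message}`;
  if (error instanceof AutonomaError) return `autonoma error: ${error.message}`;
  return "unknown error";
}
```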

## Prefer undefined over null

Always use `undefined` as the absence-of-value sentinel. Use optional properties (`?`) instead of `| null` types. Never initialize to `null`.

```ts
// Good
private timeout?: number


// Bad
private timeout: number | null = null
```

This applies everywhere: class properties, function parameters, return types, object shapes.
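
To make that concrete, here is a small sketch covering an object shape, an optional parameter, and a return type. `Device`, `DeviceFilter`, and `findDeviceId` are hypothetical names used only for this example.

```ts
// Hypothetical names, for illustration only.
type Platform = "ios" | "android";

interface Device {
  id: string;
  platform: Platform;
}

interface DeviceFilter {
  platform?: Platform; // object shape: optional, not `Platform | null`
}

const devices: Device[] = [{ id: "d1", platform: "ios" }];

// Optional parameter in, `undefined` out when nothing matches
function findDeviceId(filter?: DeviceFilter): string | undefined {
  const platform = filter?.platform;
  const match = devices.find((d) => platform == null || d.platform === platform);
  return match?.id;
}
```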

## Nullish checks

Always `??`, never `||`. Always `!= null` / `== null`, never truthy/falsy checks.

```ts
// Good
const timeout = config.timeout ?? 3000;
if (element != null) { /* ... */ }


// Bad - truthy/falsy has unexpected behavior with 0, "", false
const timeout = config.timeout || 3000;  // 0 becomes 3000!
if (element) { /* ... */ }
```

The `!= null` check covers both `null` and `undefined`, which is exactly what you want.

## Early returns

Always prefer early returns to reduce nesting. If a function has deeply nested `if` blocks, extract the inner logic into a separate function with guard clauses.

```ts
// Good
function processOrder(order: Order): Result {
  if (order.status === "cancelled") throw new OrderCancelledError();
  if (order.items.length === 0) throw new EmptyOrderError();


  return calculateTotal(order);
}


// Bad - deeply nested
function processOrder(order: Order): Result {
  if (order.status !== "cancelled") {
    if (order.items.length > 0) {
      return calculateTotal(order);
    }
  }
  // ...
}
```

## No complex destructuring or spread

If constructing an object requires multiple `...` spreads or ternary-based spreads, build the object explicitly instead.

```ts
// Good
const permissions = isAdmin ? allPermissions : readOnly;
return {
  name: baseConfig.name,
  timeout: baseConfig.timeout,
  permissions,
  retries: overrides.retries ?? baseConfig.retries,
};


// Bad
return {
  ...baseConfig,
  ...((isAdmin) ? { permissions: allPermissions } : { permissions: readOnly }),
  ...overrides,
};
```

## Extract complex conditions

If a condition isn’t immediately obvious, extract it into a descriptively named variable.

```ts
// Good
const isTrialExpired = subscription.status === "trial" && subscription.endsAt < now;
const hasNoPaymentMethod = user.paymentMethods.length === 0;
if (isTrialExpired && hasNoPaymentMethod) { /* ... */ }


// Bad - what does this check?
if (subscription.status === "trial" && subscription.endsAt < now && user.paymentMethods.length === 0) { /* ... */ }
```

## Avoid let + conditional assignment

Instead of using `let` and assigning in `if/else` blocks, extract a function with early returns.
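
A minimal sketch of the refactor, using a hypothetical `Plan` type and made-up limits:

```ts
type Plan = "free" | "pro" | "enterprise";

// Good - a small function with early returns
function maxParallelRuns(plan: Plan): number {
  if (plan === "enterprise") return 50;
  if (plan === "pro") return 10;
  return 1;
}

// Bad - mutable variable assigned across branches
function maxParallelRunsWithLet(plan: Plan): number {
  let limit: number;
  if (plan === "enterprise") {
    limit = 50;
  } else if (plan === "pro") {
    limit = 10;
  } else {
    limit = 1;
  }
  return limit;
}
```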

## Logging with Sentry

Every class and every function file must have logging. When in doubt, add a log. Overlogging is always better than underlogging.

### What to log

* Service startup and configuration
* Incoming requests and their resolution (success/failure)
* External API calls (start, success, failure)
* State transitions (agent steps, job status changes)
* Resource acquisition/release (device locks, browser sessions)
* Every public method entry with relevant parameters
* Every method exit with relevant results

Use structured context (Sentry breadcrumbs, tags, extra data) so logs are searchable. Never log sensitive data (credentials, tokens).
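
One way to keep sensitive values out of structured context is to redact known keys before logging. The helper below is a hypothetical sketch, not something the logger package is known to provide, and the key list is an assumption. Matching is exact and case-sensitive.

```ts
// Hypothetical helper; the key list is an assumption, not the codebase's.
const SENSITIVE_KEYS = new Set(["password", "token", "apiKey", "authorization"]);

function redactContext(context: Record<string, unknown>): Record<string, unknown> {
  const safe: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(context)) {
    safe[key] = SENSITIVE_KEYS.has(key) ? "[redacted]" : value;
  }
  return safe;
}

// Usage with any structured logger:
// logger.info("Calling external API", redactContext({ url, token }));
```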

### Class logger pattern

Every class gets a `private readonly logger` instance, created in the constructor as a child of the root logger with the class name and identifying context.

```ts
import { type Logger, logger } from "@autonoma/logger";


export class TestSuiteUpdater {
  private readonly logger: Logger;


  constructor(private readonly snapshotId: string) {
    this.logger = logger.child({ name: this.constructor.name, snapshotId });
  }


  public async apply(change: TestSuiteChange) {
    this.logger.info("Applying test suite change", { type: change.constructor.name });
    // ... do work ...
    this.logger.info("Finished applying change");
  }
}
```

### Function logger pattern - called from classes

If a reusable function is called from a class method, accept a `Logger` parameter to preserve the logging context chain.

```ts
import type { Logger } from "@autonoma/logger";


export function computeChanges(branchId: string, logger: Logger) {
  logger.info("Computing changes", { branchId });
  // ... do work ...
  logger.info("Changes computed", { count: changes.length });
  return changes;
}
```

### Function logger pattern - standalone files

If a file exports independently useful functions (not called from a single class), import the root logger and create a child per function.

```ts
import { logger as rootLogger } from "@autonoma/logger";


export function syncDevices(deviceIds: string[]) {
  const logger = rootLogger.child({ name: "syncDevices" });
  logger.info("Syncing devices", { count: deviceIds.length });
  // ... do work ...
  logger.info("Devices synced");
}
```

## Testing

### Philosophy

* **Vitest** for all tests
* **Prefer integration tests** over unit tests. Test the real thing, not mocks
* **Never mock the database.** Use Testcontainers with a real PostgreSQL container
* Only test what makes sense - don’t test trivial getters

### Setup

Test files go in `test/` directories that mirror the `src/` structure. File naming: `*.test.ts`.

For integration tests that need a database, use the `@autonoma/integration-test` package:

```ts
import { integrationTestSuite } from "@autonoma/integration-test";
import { it } from "vitest";


integrationTestSuite("MyService", (harness) => {
  it("should create a record", async () => {
    const db = harness.db;
    // ... test with a real database
  });
});
```

The harness spins up a real PostgreSQL container via Testcontainers, runs migrations, and gives you a fresh database for each test suite.

### Running tests

```bash
pnpm test              # run all tests
pnpm test --filter=ai  # run tests in a specific package
```

## Database transactions

Wrap sequential database queries in a Prisma `$transaction` when they must be consistent. If a service method reads then writes (or writes to multiple tables), use `$transaction`:

```ts
async createGeneration(userId: string, orgId: string, appId: string) {
  return await this.db.$transaction(async (tx) => {
    const app = await tx.application.findFirst({
      where: { id: appId, organizationId: orgId },
    });
    if (app == null) throw new Error("Application not found");


    const generation = await tx.applicationGeneration.create({
      data: { /* ... */ },
    });


    await tx.onboardingState.upsert({
      where: { applicationId: appId },
      /* ... */
    });


    return { id: generation.id };
  });
}
```

Pass `tx` to all queries inside the transaction - not the original `db` client.

## Adding dependencies

**Always check `pnpm-workspace.yaml` first.** The catalog section defines pinned versions for shared dependencies. When adding a dependency:

1. Check if it already exists in the `catalog:` section
2. If it does, use `"catalog:"` as the version in `package.json`
3. If it doesn’t, consider whether it should be added to the catalog (used by multiple packages) or pinned locally

```jsonc
// Good - uses catalog version
"dependencies": {
  "zod": "catalog:"
}


// Bad - hardcodes a version when a catalog entry exists
"dependencies": {
  "zod": "^3.23.0"
}
```

## Environment variables

Never read `process.env` directly. Define all environment variables in a dedicated `env.ts` file using `createEnv` from `@t3-oss/env-core` with Zod schemas:

```ts
import { createEnv } from "@t3-oss/env-core";
import { z } from "zod";


export const env = createEnv({
  server: {
    DATABASE_URL: z.string().url(),
    REDIS_URL: z.string().url(),
    BETTER_AUTH_SECRET: z.string().min(1),
  },
  runtimeEnv: process.env,
});
```

This gives you type safety, runtime validation, and a single source of truth for all required variables. Pass validated env values as function parameters rather than reading `process.env` in library code.

# Common Workflows

> Step-by-step guides for common development tasks - adding routes, pages, commands, models, tests, and more.

This page covers the most common development tasks you will perform in the Autonoma monorepo. Each workflow is a step-by-step guide with file paths and code patterns.

## Adding a New tRPC Route

Types flow through tRPC from API to frontend. Never manually define API response types on the frontend.

**1. Define Zod schemas** in `packages/types/src/schemas/`:

packages/types/src/schemas/my-feature.ts

```ts
import z from "zod";


export const myFeatureInput = z.object({
  name: z.string(),
  organizationId: z.string(),
});


export const myFeatureOutput = z.object({
  id: z.string(),
  createdAt: z.date(),
});
```

**2. Create a controller** in `apps/api/src/controllers/<routerName>/<procedureName>.ts`. Controllers hold all business logic:

apps/api/src/controllers/myFeature/create.ts

```ts
import type { PrismaClient } from "@autonoma/db";
import type { z } from "zod";
import type { myFeatureInput } from "@autonoma/types";


export async function createMyFeature(
  db: PrismaClient,
  input: z.infer<typeof myFeatureInput>,
) {
  return db.myFeature.create({
    data: { name: input.name, organizationId: input.organizationId },
  });
}
```

**3. Create or update the router** in `apps/api/src/routers/`. Routers are thin wiring - they delegate to controllers:

apps/api/src/routers/my-feature.ts

```ts
import { router, protectedProcedure } from "../trpc";
import { myFeatureInput } from "@autonoma/types";
import { createMyFeature } from "../controllers/myFeature/create";


export const myFeatureRouter = router({
  create: protectedProcedure
    .input(myFeatureInput)
    .mutation(async ({ ctx, input }) => {
      return createMyFeature(ctx.db, input);
    }),
});
```

**4. Add to `appRouter`** in `apps/api/src/router.ts` (if this is a new router):

```ts
export const appRouter = router({
  // ...existing routers
  myFeature: myFeatureRouter,
});
```

**5. Use on the frontend.** For queries, use `useSuspenseQuery` with `queryOptions`:

```ts
const { data } = useSuspenseQuery(
  trpc.myFeature.list.queryOptions({ organizationId }),
);
```

For mutations, use `useAPIMutation` with `mutationOptions`:

```ts
const createMutation = useAPIMutation(
  trpc.myFeature.create.mutationOptions(),
);
```

## Adding a New Page

TanStack Router with file-based routing makes this straightforward.

**1. Create a route file** in `apps/ui/src/routes/`:

apps/ui/src/routes/my-feature.tsx

```tsx
import { createFileRoute } from "@tanstack/react-router";


export const Route = createFileRoute("/my-feature")({
  component: MyFeaturePage,
});


function MyFeaturePage() {
  return <div>My Feature</div>;
}
```

**2. That’s it.** The TanStack Router plugin auto-generates the route tree. The page is immediately accessible at `/my-feature`.

For pages that need data, add a `loader`:

```ts
export const Route = createFileRoute("/my-feature")({
  // Return the promise so the router waits for the data before rendering
  loader: ({ context }) =>
    context.queryClient.ensureQueryData(
      trpc.myFeature.list.queryOptions(),
    ),
  component: MyFeaturePage,
});
```

## Database Schema Changes

**1. Edit the schema** at `packages/db/prisma/schema.prisma`.

**2. Create a migration:**

```bash
pnpm db:migrate
```

This generates a migration file and applies it to your local database.

**3. Regenerate the Prisma client:**

```bash
pnpm db:generate
```

**4. Run typecheck** to catch any type errors from the schema change:

```bash
pnpm typecheck
```

If multiple queries in a service method need to be consistent (read-then-write, or writes to multiple tables), wrap them in a Prisma `$transaction`:

```ts
return await this.db.$transaction(async (tx) => {
  const existing = await tx.myTable.findFirst({ where: { id } });
  if (existing == null) throw new Error("Not found");
  return tx.myTable.update({ where: { id }, data: { /* ... */ } });
});
```

## Adding a New Command to the Execution Agent

See the [Execution Agent](/architecture/execution-agent/#adding-a-new-command) page for a detailed walkthrough. The short version:

**1. Define the spec** with a `CommandSpec` interface and Zod schema in `packages/engine/src/commands/commands/<name>/<name>.def.ts`.

**2. Implement the command** by extending `Command<TSpec, TContext>` in `packages/engine/src/commands/commands/<name>/<name>.command.ts`.

**3. Create the tool wrapper** by extending `CommandTool<TSpec, TContext>` in `packages/engine/src/execution-agent/agent/tools/commands/<name>.tool.ts`.

**4. Add the spec** to the union type in `packages/engine/src/commands/command-defs.ts`.

**5. Register the tool** in the `ExecutionAgentFactory` subclass for the relevant platform(s).

**6. Write tests** in `packages/engine/src/commands/commands/<name>/<name>.test.ts`. Use the test utilities in `packages/engine/src/commands/test-utils/` for fake drivers and model registries.

## Adding a New AI Model

See the [AI Package](/architecture/ai-package/#adding-a-new-model) page for full details. The short version:

**1. Add the model entry** to `MODEL_ENTRIES` in `packages/ai/src/registry/model-entries.ts`:

```ts
MY_MODEL: {
  createModel: () => googleProvider.getModel("my-model-id"),
  pricing: simpleCostFunction({
    inputCostPerM: 0.5,
    outputCostPerM: 1.5,
  }),
},
```

**2. Add a provider** in `packages/ai/src/registry/providers.ts` if the model uses a new provider. Add the API key to `packages/ai/src/env.ts` using `createEnv`.

**3. Use it** via `registry.getModel({ model: "MY_MODEL", tag: "my-use-case" })`.
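
For intuition about the pricing fields, the sketch below assumes `inputCostPerM` and `outputCostPerM` are prices per one million tokens, as the names suggest. `estimateCost` is a hypothetical helper written for this example and is not the implementation of `simpleCostFunction`.

```ts
// Hypothetical illustration of per-million-token pricing.
interface TokenPricing {
  inputCostPerM: number; // assumed: price per 1,000,000 input tokens
  outputCostPerM: number; // assumed: price per 1,000,000 output tokens
}

function estimateCost(pricing: TokenPricing, inputTokens: number, outputTokens: number): number {
  const inputCost = (inputTokens / 1_000_000) * pricing.inputCostPerM;
  const outputCost = (outputTokens / 1_000_000) * pricing.outputCostPerM;
  return inputCost + outputCost;
}

// 2M input tokens and 1M output tokens at the rates from the entry above
const cost = estimateCost({ inputCostPerM: 0.5, outputCostPerM: 1.5 }, 2_000_000, 1_000_000);
// cost === 2.5 (2 * 0.5 + 1 * 1.5)
```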

## Running and Writing Tests

Vitest is used everywhere. Every package has it installed.

### Running Tests

```bash
# Run all tests across the monorepo
pnpm test


# Run tests for a specific package
pnpm --filter @autonoma/engine test


# Run a specific test file
pnpm --filter @autonoma/ai test -- src/visual/assert-checker.test.ts


# Run in watch mode
pnpm --filter @autonoma/engine test -- --watch
```

### Writing Tests

**Prefer integration tests over unit tests.** Only test what provides value - don’t test trivial getters.

Test files go in `test/` directories or alongside source files as `*.test.ts`.

**Never mock the database.** For tests that need a database, use Testcontainers with a real PostgreSQL container via the `@autonoma/integration-test` package:

```ts
import { integrationTestSuite } from "@autonoma/integration-test";
import { expect, it } from "vitest";


integrationTestSuite("MyService", ({ getDb }) => {
  it("creates a record", async () => {
    const db = getDb();
    const result = await myService.create(db, { name: "test" });
    expect(result.name).toBe("test");
  });
});
```

For command tests, use the fake drivers in `packages/engine/src/commands/test-utils/`:

```ts
import { FakeScreenDriver } from "../test-utils/fake-screen.driver";
import { FakeMouseDriver } from "../test-utils/fake-mouse.driver";
```

## Working with the UI Component Library

All frontend components come from `@autonoma/blacklight`, built on Radix UI + Tailwind CSS v4 + CVA.

### Using Components

```tsx
import { Button, Card, Input, cn } from "@autonoma/blacklight";


function MyComponent() {
  return (
    <Card className={cn("p-4")}>
      <Input placeholder="Enter name" />
      <Button variant="default" size="sm">
        Submit
      </Button>
    </Card>
  );
}
```

### Icons

Use Lucide React for all icons:

```tsx
import { Plus, Settings } from "lucide-react";


<Button>
  <Plus className="size-4" />
  Add item
</Button>
```

### Custom Variants

Use CVA (class-variance-authority) for component variants:

```tsx
import { cva } from "class-variance-authority";


const badgeVariants = cva("rounded-full px-2 py-0.5 text-xs font-medium", {
  variants: {
    status: {
      active: "bg-green-100 text-green-800",
      inactive: "bg-gray-100 text-gray-800",
    },
  },
});
```

## Adding Environment Variables

Never read `process.env` directly. Always use `createEnv` from `@t3-oss/env-core`.

**1. Define the variable** in a dedicated `env.ts` file for the package or app:

packages/my-package/src/env.ts

```ts
import { createEnv } from "@t3-oss/env-core";
import z from "zod";


export const env = createEnv({
  server: {
    MY_API_KEY: z.string().min(1),
    MY_TIMEOUT: z.coerce.number().default(5000),
  },
  runtimeEnv: process.env,
});
```

**2. Use the validated env** in your code:

```ts
import { env } from "./env";


const client = new MyClient({ apiKey: env.MY_API_KEY });
```

**3. For library code**, prefer passing values as function parameters rather than reading env directly. This keeps the library testable and reusable:

```ts
// Good - library accepts config
export class MyService {
  constructor(private readonly apiKey: string) {}
}


// App wires it up with env
const service = new MyService(env.MY_API_KEY);
```

**4. Check the catalog** in `pnpm-workspace.yaml` before adding `@t3-oss/env-core` as a dependency. If it is already in the catalog, use `"@t3-oss/env-core": "catalog:"` in your `package.json`.

## Adding Dependencies

Before adding any dependency, check `pnpm-workspace.yaml` for the catalog:

```bash
# Check if the package exists in the catalog
grep "my-package" pnpm-workspace.yaml
```

If the package is in the catalog, use `catalog:` as the version:

```json
{
  "dependencies": {
    "zod": "catalog:"
  }
}
```

If it is not in the catalog but will be shared across multiple packages, consider adding it there first.

Then install:

```bash
pnpm install
```

## Building and Type Checking

```bash
# Build everything (Turborepo handles dependency order)
pnpm build


# Type check all packages
pnpm typecheck


# Lint all packages
pnpm lint


# Run dev servers (web on 3000, API on 4000)
pnpm dev
```

All packages are ESM-only. Never use `.js` extensions in imports - TypeScript resolves modules automatically.

# Environment Variables

> Complete reference for every environment variable used across the Autonoma AI monorepo - API server, frontend, AI services, database, storage, logging, billing, and infrastructure.

## Quick Start - Minimum for Local Development

To get the API and UI running locally, you need only a small set of variables. Copy `.env.example` to `.env` at the repo root and fill in these essentials:

```bash
# Database
DATABASE_URL=postgresql://postgres:postgres@localhost:5432/autonoma


# Redis
REDIS_URL=redis://localhost:6379


# API server
API_PORT=4000
SCENARIO_ENCRYPTION_KEY=any-string-at-least-1-char


# Google OAuth (create credentials at console.cloud.google.com)
GOOGLE_CLIENT_ID=your-google-client-id
GOOGLE_CLIENT_SECRET=your-google-client-secret


# AI model keys (needed for test execution)
GEMINI_API_KEY=your-gemini-key
GROQ_KEY=your-groq-key
OPENROUTER_API_KEY=your-openrouter-key


# S3-compatible storage (can use MinIO locally)
S3_BUCKET=autonoma-local
S3_REGION=us-east-1
S3_ACCESS_KEY_ID=minioadmin
S3_SECRET_ACCESS_KEY=minioadmin
```

Everything else has sensible defaults or is optional for local development. The sections below cover every variable in detail.

## How Environment Variables Work in This Project

Every app and package defines its environment variables in a dedicated `env.ts` file using [`createEnv` from `@t3-oss/env-core`](https://env.t3.gg/). This gives you:

* **Zod validation at startup** - the process crashes immediately if a required variable is missing or malformed, rather than failing mysteriously at runtime.
* **Type safety** - `env.DATABASE_URL` is typed as `string`, not `string | undefined`. No more `process.env.DATABASE_URL!` non-null assertions.
* **Composability** - packages export their `env` object, and apps extend them. For example, the API server’s `env.ts` extends the database, storage, logger, and billing envs, inheriting all their variables.

You should **never read `process.env` directly** in application code. Always import from the nearest `env.ts`:

```ts
// Good
import { env } from "./env";
const port = env.API_PORT;


// Bad - bypasses validation
const port = process.env.API_PORT;
```

The `emptyStringAsUndefined: true` option is enabled everywhere, so setting a variable to an empty string is treated the same as not setting it at all.

For boolean variables, the codebase uses `z.stringbool()`, which accepts strings such as `"true"`, `"false"`, `"1"`, `"0"`, `"yes"`, and `"no"`.
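
The following is a behavioral sketch of those two options, written without the libraries so the effect is easy to see. `dropEmptyStrings` and `parseBooleanString` are hypothetical stand-ins; the real logic lives in `@t3-oss/env-core` and Zod, and details such as case handling are ignored here.

```ts
// Behavioral sketch only; not the library implementations.
type RawEnv = Record<string, string | undefined>;

// Mirrors the effect of `emptyStringAsUndefined: true`
function dropEmptyStrings(raw: RawEnv): RawEnv {
  const result: RawEnv = {};
  for (const [key, value] of Object.entries(raw)) {
    result[key] = value === "" ? undefined : value;
  }
  return result;
}

// Mirrors the boolean strings listed above
function parseBooleanString(value: string): boolean {
  if (["true", "1", "yes"].includes(value)) return true;
  if (["false", "0", "no"].includes(value)) return false;
  throw new Error(`Not a boolean string: ${value}`);
}
```

In practice this means a line like `TESTING=` in `.env` falls back to the variable's default instead of being validated as an empty string.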

***

## Core API Server

**Source:** `apps/api/src/env.ts`

The API server extends the database, storage, logger, and billing environments, so all variables from those sections apply here too.

| Variable                  | Required | Default                    | Description                                                                                              |
| ------------------------- | -------- | -------------------------- | -------------------------------------------------------------------------------------------------------- |
| `API_PORT`                | Yes      | -                          | Port the API server listens on. Typically `4000`.                                                        |
| `INTERNAL_DOMAIN`         | No       | `autonoma.app`             | Internal domain used for routing and service discovery.                                                  |
| `ALLOWED_ORIGINS`         | No       | `http://localhost:3000`    | Comma-separated list of CORS origins. Must include the frontend URL.                                     |
| `SCENARIO_ENCRYPTION_KEY` | Yes      | -                          | Key used to encrypt scenario data. Any non-empty string works for local dev.                             |
| `GOOGLE_CLIENT_ID`        | Yes      | -                          | OAuth 2.0 client ID from Google Cloud Console. Required for user authentication.                         |
| `GOOGLE_CLIENT_SECRET`    | Yes      | -                          | OAuth 2.0 client secret from Google Cloud Console.                                                       |
| `AGENT_VERSION`           | No       | `latest`                   | Version tag for the execution agent. Used when dispatching engine jobs.                                  |
| `POSTHOG_KEY`             | No       | -                          | PostHog project API key for server-side analytics. Omit to disable analytics.                            |
| `POSTHOG_HOST`            | No       | `https://us.i.posthog.com` | PostHog ingestion endpoint. Override for self-hosted PostHog instances.                                  |
| `GEMINI_API_KEY`          | Yes      | -                          | Google Gemini API key. Used by the API for AI features like test generation.                             |
| `REDIS_URL`               | Yes      | -                          | Redis connection string (e.g., `redis://localhost:6379`). Used for device locking, caching, and pub/sub. |
| `TESTING`                 | No       | `false`                    | Set to `true` in test environments. Prevents importing certain modules. Not for general use.             |
| `ENGINE_BILLING_SECRET`   | No       | -                          | Shared secret for authenticating billing calls from the engine.                                          |
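
Because `ALLOWED_ORIGINS` is a comma-separated string, it has to be split before it can be compared against a request's `Origin` header. The helpers below are a hypothetical sketch of that step; the API server's actual parsing may differ.

```ts
// Hypothetical helpers; not taken from apps/api.
function parseAllowedOrigins(value: string): string[] {
  return value
    .split(",")
    .map((origin) => origin.trim())
    .filter((origin) => origin.length > 0);
}

function isOriginAllowed(origin: string, allowed: string[]): boolean {
  return allowed.includes(origin);
}

const allowed = parseAllowedOrigins("http://localhost:3000, https://app.example.com");
```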

***

## Frontend (UI)

**Source:** `apps/ui/src/env.ts`

The frontend uses Vite’s `import.meta.env` and requires the `VITE_` prefix for all variables.

| Variable               | Required | Default                 | Description                                                                                                                                                      |
| ---------------------- | -------- | ----------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `VITE_API_URL`         | No       | `http://localhost:4000` | URL of the API server. The frontend makes all tRPC calls to this address.                                                                                        |
| `VITE_INTERNAL_DOMAIN` | No       | `autonoma.app`          | Internal domain, used for UI routing logic.                                                                                                                      |
| `VITE_TEMPORAL_URL`    | No       | -                       | URL of the Temporal UI. When set, enables links to workflow runs in the dashboard.                                                                               |
| `VITE_SENTRY_DSN`      | No       | -                       | Sentry DSN for frontend error tracking. Omit to disable Sentry in the browser.                                                                                   |
| `VITE_SENTRY_URL`      | No       | -                       | Sentry organization URL. Used for linking to Sentry issues from the UI.                                                                                          |
| `VITE_POSTHOG_KEY`     | No       | -                       | PostHog project API key for frontend analytics. Omit to disable analytics. PostHog events are proxied through the API server at `/ingest` to bypass ad blockers. |

***

## Database

**Source:** `packages/db/src/env.ts`

| Variable       | Required | Default | Description                                                                                                                        |
| -------------- | -------- | ------- | ---------------------------------------------------------------------------------------------------------------------------------- |
| `DATABASE_URL` | Yes      | -       | PostgreSQL connection string. Format: `postgresql://user:password@host:port/database`. Used by Prisma for all database operations. |

> **Note:**
>
> For local development, a typical value is `postgresql://postgres:postgres@localhost:5432/autonoma`. Make sure PostgreSQL is running and the database exists before starting the API.
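
If a connection string is not behaving as expected, it can help to see how its parts break down. This sketch uses the standard `URL` class to split the local example value; `describeDatabaseUrl` is a hypothetical helper, and Prisma performs its own parsing of `DATABASE_URL`.

```ts
// Illustrative check only; not part of packages/db.
function describeDatabaseUrl(value: string) {
  const url = new URL(value);
  return {
    protocol: url.protocol, // includes the trailing colon
    user: url.username,
    host: url.hostname,
    port: url.port,
    database: url.pathname.slice(1), // drop the leading "/"
  };
}

const parts = describeDatabaseUrl("postgresql://postgres:postgres@localhost:5432/autonoma");
```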

***

## AI Services

**Source:** `packages/ai/src/env.ts`

These keys are required by the execution engines (web and mobile) and any service that runs AI inference. The API server only needs `GEMINI_API_KEY` directly - the other keys are consumed by the engine apps.

| Variable             | Required | Default | Description                                                                                                                               |
| -------------------- | -------- | ------- | ----------------------------------------------------------------------------------------------------------------------------------------- |
| `GEMINI_API_KEY`     | Yes      | -       | Google Gemini API key. Used for the primary model (Gemini 3 Flash/Pro), point detection, object detection, and visual condition checking. |
| `GROQ_KEY`           | Yes      | -       | Groq API key. Used for fast inference with open-source models (e.g., GPT-OSS-120B).                                                       |
| `OPENROUTER_API_KEY` | Yes      | -       | OpenRouter API key. Provides access to Ministral-8B and serves as a fallback provider for open-source models.                             |

> **Note:**
>
> Validation is skipped when running in Vitest (`VITEST` env var is set), so you do not need these keys to run unit tests.
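
The skip-under-Vitest behavior can be pictured with the sketch below. It is not the contents of `packages/ai/src/env.ts`: `loadAiEnv` is a hypothetical function, and falling back to empty strings when validation is skipped is an assumption. Only the three variable names and the `VITEST` flag come from the table and note above.

```ts
// Hypothetical sketch; the real validation lives in packages/ai/src/env.ts.
type Env = Record<string, string | undefined>;

const REQUIRED_AI_KEYS = ["GEMINI_API_KEY", "GROQ_KEY", "OPENROUTER_API_KEY"] as const;

type AiEnv = Record<(typeof REQUIRED_AI_KEYS)[number], string>;

function loadAiEnv(env: Env): AiEnv {
  // Vitest sets the VITEST variable while tests run, so its presence disables validation.
  const skipValidation = env.VITEST !== undefined;

  const missing = REQUIRED_AI_KEYS.filter((key) => !env[key]);
  if (missing.length > 0 && !skipValidation) {
    throw new Error(`Missing required AI keys: ${missing.join(", ")}`);
  }

  // Assumption: when validation is skipped, absent keys become empty strings.
  return {
    GEMINI_API_KEY: env.GEMINI_API_KEY ?? "",
    GROQ_KEY: env.GROQ_KEY ?? "",
    OPENROUTER_API_KEY: env.OPENROUTER_API_KEY ?? "",
  };
}
```

Calling `loadAiEnv({})` throws and names all three keys, while `loadAiEnv({ VITEST: "true" })` returns without error.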

***

## Storage (S3)

**Source:** `packages/storage/src/env.ts`

Used for storing screenshots, video recordings, test artifacts, and other binary assets.

| Variable               | Required | Default | Description                                                        |
| ---------------------- | -------- | ------- | ------------------------------------------------------------------ |
| `S3_BUCKET`            | Yes      | -       | S3 bucket name for storing artifacts.                              |
| `S3_REGION`            | Yes      | -       | AWS region of the S3 bucket (e.g., `us-east-1`).                   |
| `S3_ACCESS_KEY_ID`     | Yes      | -       | AWS access key ID (or MinIO equivalent) for S3 authentication.     |
| `S3_SECRET_ACCESS_KEY` | Yes      | -       | AWS secret access key (or MinIO equivalent) for S3 authentication. |

> **Local development with MinIO:**
>
> You can run [MinIO](https://min.io/) locally as an S3-compatible object store. The default credentials are `minioadmin`/`minioadmin`; use them for `S3_ACCESS_KEY_ID` and `S3_SECRET_ACCESS_KEY`. Set `S3_REGION` to any valid region string (e.g., `us-east-1`) and create a bucket matching your `S3_BUCKET` value.

***

## Logging and Observability

**Source:** `packages/logger/src/env.ts`

| Variable         | Required | Default       | Description                                                                                                |
| ---------------- | -------- | ------------- | ---------------------------------------------------------------------------------------------------------- |
| `NODE_ENV`       | No       | `development` | Node environment. Accepts `development`, `production`, or `test`. Affects log formatting and behavior.     |
| `SENTRY_DSN`     | No       | -             | Sentry DSN for backend error tracking and performance monitoring. Omit to disable Sentry.                  |
| `SENTRY_ENV`     | No       | `production`  | Sentry environment tag (e.g., `staging`, `production`).                                                    |
| `SENTRY_RELEASE` | No       | `unknown`     | Sentry release identifier. Typically set to the git SHA or version tag in CI.                              |
| `DEBUG`          | No       | -             | Debug filter string. When set, enables verbose debug logging for matching namespaces (e.g., `autonoma:*`). |

***

## Billing (Stripe)

**Source:** `packages/billing/src/env.ts`

Billing is entirely optional. When `STRIPE_ENABLED` is `false` (the default), all billing features are disabled and no other Stripe variables are needed.

| Variable                       | Required | Default                 | Description                                                                                                    |
| ------------------------------ | -------- | ----------------------- | -------------------------------------------------------------------------------------------------------------- |
| `STRIPE_ENABLED`               | No       | `false`                 | Master switch for billing. Set to `true` to enable Stripe integration.                                         |
| `STRIPE_SECRET_KEY`            | No       | -                       | Stripe secret API key. Required when `STRIPE_ENABLED` is `true`.                                               |
| `STRIPE_WEBHOOK_SECRET`        | No       | -                       | Stripe webhook signing secret for verifying incoming webhook events. Required when `STRIPE_ENABLED` is `true`. |
| `STRIPE_SUBSCRIPTION_PRICE_ID` | No       | -                       | Stripe Price ID for the subscription plan. Required when `STRIPE_ENABLED` is `true`.                           |
| `STRIPE_TOPUP_PRICE_ID`        | No       | -                       | Stripe Price ID for credit top-up purchases. Required when `STRIPE_ENABLED` is `true`.                         |
| `BILLING_GRACE_PERIOD_DAYS`    | No       | `3`                     | Number of days after a subscription lapses before access is revoked.                                           |
| `APP_URL`                      | No       | `http://localhost:3000` | Frontend application URL. Used in Stripe checkout redirect URLs and billing emails.                            |
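
The "required only when enabled" rule in the table can be expressed as a conditional check. The sketch below is illustrative: `checkBillingEnv` is a hypothetical helper, and treating only the literal string `"true"` as enabled is an assumption about how the flag is parsed.

```ts
// Hypothetical sketch of the conditional requirement; not the code in packages/billing.
type Env = Record<string, string | undefined>;

const STRIPE_KEYS = [
  "STRIPE_SECRET_KEY",
  "STRIPE_WEBHOOK_SECRET",
  "STRIPE_SUBSCRIPTION_PRICE_ID",
  "STRIPE_TOPUP_PRICE_ID",
] as const;

function checkBillingEnv(env: Env): { enabled: boolean; missing: string[] } {
  // Assumption: only the literal "true" turns billing on; unset means the default, false.
  const enabled = env.STRIPE_ENABLED === "true";
  if (!enabled) {
    // With billing off, no other Stripe variable is needed.
    return { enabled, missing: [] };
  }
  const missing = STRIPE_KEYS.filter((key) => !env[key]);
  return { enabled, missing };
}
```

With `STRIPE_ENABLED` unset the `missing` list is always empty; with `STRIPE_ENABLED=true` and nothing else set, it lists all four Stripe variables.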

***

## Kubernetes and Workflows

**Source:** `packages/k8s/src/env.ts` and `packages/workflow/src/env.ts`

These variables are only needed in production or when running engine jobs on Kubernetes. Not required for local development.

| Variable    | Required     | Default | Description                                                            |
| ----------- | ------------ | ------- | ---------------------------------------------------------------------- |
| `NAMESPACE` | Yes (in K8s) | -       | Kubernetes namespace where jobs are deployed. Used by `@autonoma/k8s`. |

The workflow package also reads:

| Variable       | Required | Default | Description                                                                                     |
| -------------- | -------- | ------- | ----------------------------------------------------------------------------------------------- |
| `DATABASE_URL` | Yes      | -       | PostgreSQL connection string. The workflow package needs direct DB access for job coordination. |
| `SENTRY_ENV`   | No       | -       | Sentry environment tag for workflow jobs.                                                       |

***

## Engine - Web (Playwright)

**Source:** `apps/engine-web/src/platform/env.ts` and `apps/engine-web/src/execution-agent/env.ts`

The web engine extends the AI, database, logger, and storage environments. All variables from those sections apply.

| Variable             | Required | Default | Description                                                                                                                           |
| -------------------- | -------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------- |
| `REMOTE_BROWSER_URL` | No       | -       | WebSocket URL of a remote browser instance (e.g., Browserless or Playwright remote). When omitted, launches a local Chromium browser. |
| `HEADLESS`           | No       | -       | Set to any value to run Playwright in headless mode. When omitted, the browser window is visible (useful for local debugging).        |
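
Because `HEADLESS` is presence-based, a value such as `false` still enables headless mode. The sketch below illustrates that reading of the table; `resolveBrowserTarget` and `BrowserTarget` are hypothetical names, and treating an empty string as unset is an assumption.

```ts
// Hypothetical sketch of how the two variables could be interpreted.
type Env = Record<string, string | undefined>;

type BrowserTarget =
  | { kind: "remote"; wsEndpoint: string }
  | { kind: "local"; headless: boolean };

function resolveBrowserTarget(env: Env): BrowserTarget {
  // A remote browser takes precedence; otherwise a local Chromium is launched.
  if (env.REMOTE_BROWSER_URL) {
    return { kind: "remote", wsEndpoint: env.REMOTE_BROWSER_URL };
  }
  // Presence-based flag: any non-empty value enables headless mode, including "false" and "0".
  // Assumption: an empty string counts as unset.
  return { kind: "local", headless: Boolean(env.HEADLESS) };
}
```

To see the browser window during local debugging, leave `HEADLESS` out of the environment rather than setting it to `false`.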

***

## Engine - Mobile (Appium)

**Source:** `apps/engine-mobile/src/platform/env.ts`

The mobile engine extends the AI, database, logger, and storage environments. All variables from those sections apply.

| Variable                   | Required | Default | Description                                                                                                     |
| -------------------------- | -------- | ------- | --------------------------------------------------------------------------------------------------------------- |
| `APPIUM_HOST`              | No       | -       | Hostname of the Appium server.                                                                                  |
| `APPIUM_PORT`              | No       | -       | Port of the Appium server.                                                                                      |
| `APPIUM_MJPEG_PORT`        | No       | -       | Port for the Appium MJPEG video stream. Used for live frame capture during test execution.                      |
| `APPIUM_SYSTEM_PORT`       | No       | -       | System port used by Appium’s UiAutomator2 (Android) or WebDriverAgent (iOS).                                    |
| `APPIUM_SKIP_INSTALLATION` | No       | `true`  | When `true`, skips reinstalling the app before each test. Speeds up repeated runs on the same device.           |
| `DEVICE_NAME`              | No       | -       | Name of the target device or emulator (e.g., `iPhone 15 Pro`, `Pixel 7`).                                       |
| `IOS_PLATFORM_VERSION`     | No       | -       | iOS version to target (e.g., `17.2`). Required for iOS testing.                                                 |
| `ANDROID_DAEMON_HOSTS`     | No       | -       | Comma-separated list of Android daemon host addresses for distributed device access.                            |
| `IOS_DAEMON_HOSTS`         | No       | -       | Comma-separated list of iOS daemon host addresses for distributed device access.                                |
| `SKIP_DEVICE_DATE_UPDATE`  | No       | `false` | When `true`, skips updating the device date/time before tests. Useful when the device clock is already correct. |

***

## Jobs

### Execution Agent Runner

**Source:** `packages/engine/src/execution-agent/runner/env.ts`

| Variable       | Required | Default | Description                                                                                                              |
| -------------- | -------- | ------- | ------------------------------------------------------------------------------------------------------------------------ |
| `ARTIFACT_DIR` | No       | -       | Local directory for saving test artifacts (screenshots, videos, step logs). Used by the local runner during development. |

### Run Completion Notification

**Source:** `apps/jobs/run-completion-notification/src/env.ts`

| Variable                | Required | Default | Description                                             |
| ----------------------- | -------- | ------- | ------------------------------------------------------- |
| `DATABASE_URL`          | Yes      | -       | PostgreSQL connection string.                           |
| `API_URL`               | No       | -       | API server URL for callbacks.                           |
| `ENGINE_BILLING_SECRET` | No       | -       | Shared secret for authenticating billing-related calls. |
| `STRIPE_ENABLED`        | No       | `false` | Whether to process billing events on run completion.    |

### Diffs

**Source:** `apps/jobs/diffs/src/env.ts`

| Variable                    | Required | Default  | Description                                                           |
| --------------------------- | -------- | -------- | --------------------------------------------------------------------- |
| `BRANCH_ID`                 | Yes      | -        | Branch identifier for computing diffs.                                |
| `GEMINI_API_KEY`            | Yes      | -        | Gemini API key for AI-powered diff analysis.                          |
| `GITHUB_APP_ID`             | Yes      | -        | GitHub App ID for repository access.                                  |
| `GITHUB_APP_PRIVATE_KEY`    | Yes      | -        | GitHub App private key, base64-encoded PEM (`cat key.pem \| base64`). |
| `GITHUB_APP_WEBHOOK_SECRET` | Yes      | -        | GitHub App webhook secret for verifying events.                       |
| `AGENT_VERSION`             | No       | `latest` | Version tag for the diff agent.                                       |

### Review Jobs (Generation Reviewer, Replay Reviewer)

**Source:** `packages/diffs/src/env.ts`

Both the generation reviewer and replay reviewer jobs re-export from `@autonoma/diffs/env`, which extends the AI, logger, and storage environments. They need no additional variables beyond those listed in the AI, logger, and storage sections.

***

## GitHub App

These variables appear in `.env.example` and are used by the API server and the diffs job for GitHub integration features (repository connections, PR-triggered test runs).

| Variable                    | Required | Default | Description                                                                            |
| --------------------------- | -------- | ------- | -------------------------------------------------------------------------------------- |
| `GITHUB_APP_ID`             | No       | -       | GitHub App ID. Required for GitHub integration features.                               |
| `GITHUB_APP_PRIVATE_KEY`    | No       | -       | GitHub App private key, base64-encoded PEM (`cat key.pem \| base64`). Decoded at boot. |
| `GITHUB_APP_WEBHOOK_SECRET` | No       | -       | Secret for verifying GitHub webhook payloads.                                          |
| `GITHUB_APP_SLUG`           | No       | -       | GitHub App slug (URL-friendly name). Used for generating installation links.           |
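
Decoding the base64-encoded key at boot can look like the sketch below. `decodeGithubPrivateKey` is a hypothetical helper, not the API server's actual code, and the `-----BEGIN` check is an added safeguard assumed for illustration.

```ts
// Hypothetical helper; shows one way to turn GITHUB_APP_PRIVATE_KEY back into a PEM string.
function decodeGithubPrivateKey(encoded: string): string {
  // Node's base64 decoder ignores whitespace, so line-wrapped `base64` output also works.
  const pem = Buffer.from(encoded, "base64").toString("utf8").trim();
  if (!pem.startsWith("-----BEGIN")) {
    throw new Error("GITHUB_APP_PRIVATE_KEY does not decode to a PEM block");
  }
  return pem;
}
```

Failing fast here surfaces a common mistake, pasting the raw PEM instead of its base64 form, at startup rather than on the first GitHub API call.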

***

## Authentication

These variables are referenced in `.env.example` for the Better Auth integration used by the API server.

| Variable             | Required | Default | Description                                                                                        |
| -------------------- | -------- | ------- | -------------------------------------------------------------------------------------------------- |
| `BETTER_AUTH_SECRET` | Yes      | -       | Secret key for Better Auth session signing. Generate with `openssl rand -hex 32`.                  |
| `BETTER_AUTH_URL`    | Yes      | -       | Base URL of the API server (e.g., `http://localhost:4000`). Used by Better Auth for callback URLs. |

***

## Tips for Local Development

**What you can skip entirely:**

* **Billing** - Leave `STRIPE_ENABLED=false` (the default). No Stripe keys needed.
* **Analytics** - Omit `POSTHOG_KEY` and `VITE_POSTHOG_KEY`. Analytics calls become no-ops.
* **Sentry** - Omit `SENTRY_DSN` and `VITE_SENTRY_DSN`. Error tracking is disabled gracefully.
* **Kubernetes** - Omit `NAMESPACE`. Only needed when deploying to K8s.
* **GitHub App** - Omit all `GITHUB_APP_*` variables unless you are working on GitHub integration.
* **Temporal** - Omit `VITE_TEMPORAL_URL`. The UI hides workflow links when this is unset.

**What uses defaults that just work:**

* `ALLOWED_ORIGINS` defaults to `http://localhost:3000` - correct for local dev.
* `VITE_API_URL` defaults to `http://localhost:4000` - correct for local dev.
* `APP_URL` defaults to `http://localhost:3000` - correct for local dev.
* `NODE_ENV` defaults to `development`.
* `AGENT_VERSION` defaults to `latest`.

**What you must provide:**

* `DATABASE_URL` - there is no default. You need a running PostgreSQL instance.
* `REDIS_URL` - there is no default. You need a running Redis instance.
* `GOOGLE_CLIENT_ID` and `GOOGLE_CLIENT_SECRET` - required for authentication. Create OAuth credentials in the [Google Cloud Console](https://console.cloud.google.com/apis/credentials).
* `SCENARIO_ENCRYPTION_KEY` - any non-empty string works locally.
* `BETTER_AUTH_SECRET` - generate one with `openssl rand -hex 32`.
* `BETTER_AUTH_URL` - set to `http://localhost:4000`.
* AI keys (`GEMINI_API_KEY`, `GROQ_KEY`, `OPENROUTER_API_KEY`) - required if you are running test execution. Not needed if you are only working on the UI or API without triggering test runs.
* S3 credentials - required for artifact storage. Use MinIO locally.

# Execution Agent

> Deep dive into the core test execution engine - a platform-agnostic AI agent that powers web and mobile test execution through natural language.

The execution agent is the brain of Autonoma’s test execution. It is a **generic, platform-agnostic AI agent** that takes a natural language test instruction, interacts with a live application through screenshots and commands, and produces a structured test result with recorded steps.

Web (`engine-web`) and mobile (`engine-mobile`) engines both extend this shared core. Everything is parameterized with `TSpec` (command spec) and `TContext` (driver context), so the same agent logic works across Playwright and Appium without code duplication.

## The Agent Loop

Every test execution follows the same cycle:

```plaintext
┌─────────────────────────────────────────────────────┐
│  1. Screenshot  - capture current screen state       │
│  2. Inject context - screenshot + instruction +      │
│     steps-so-far + memory into a user message        │
│  3. LLM decides - model picks a tool/command         │
│     (or calls execution-finished)                    │
│  4. Command executes - the chosen command runs       │
│     against platform drivers                         │
│  5. Record step - save before/after metadata,        │
│     execution output, and screenshots                │
│  6. Wait planning - asynchronously generate a wait   │
│     condition for replay                             │
│  7. Loop or stop - continue until execution-finished │
│     is called or maxSteps is reached                 │
└─────────────────────────────────────────────────────┘
```

The agent wraps the Vercel AI SDK’s `ToolLoopAgent`. Before each step, it captures a screenshot and injects it alongside the test instruction, all previous steps, and any stored memory variables. The LLM then decides which command to call next.
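
The cycle can be sketched as a plain loop. This is a simplified, hand-rolled illustration, not the real agent, which delegates to `ToolLoopAgent`. `LoopDeps`, `Decision`, `Step`, and `runLoop` are hypothetical names, and wait planning (step 6) is omitted.

```ts
// Hypothetical shapes for illustration only.
interface Step { command: string; outcome: string }

type Decision =
  | { kind: "command"; command: string }
  | { kind: "finished"; success: boolean; reasoning: string };

interface LoopDeps {
  screenshot(): Promise<string>;
  decide(input: {
    screenshot: string;
    instruction: string;
    steps: Step[];
    memory: Record<string, string>;
  }): Promise<Decision>;
  execute(command: string): Promise<string>;
}

interface LoopResult { success: boolean; reasoning: string; steps: Step[] }

async function runLoop(instruction: string, deps: LoopDeps, maxSteps: number): Promise<LoopResult> {
  const steps: Step[] = [];
  const memory: Record<string, string> = {};

  while (steps.length < maxSteps) {
    // 1-2. Capture the screen and hand it to the model with the accumulated context.
    const screenshot = await deps.screenshot();
    // 3. The model picks a command or finishes.
    const decision = await deps.decide({ screenshot, instruction, steps, memory });
    if (decision.kind === "finished") {
      return { success: decision.success, reasoning: decision.reasoning, steps };
    }
    // 4-5. Run the command and record the step.
    const outcome = await deps.execute(decision.command);
    steps.push({ command: decision.command, outcome });
  }

  // 7. The step budget ran out before the model finished.
  return { success: false, reasoning: "maxSteps reached", steps };
}
```

Injecting `screenshot`, `decide`, and `execute` keeps the loop itself platform-agnostic, which mirrors how the shared core is reused by the web and mobile engines.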

**Loop detection:** If the model’s reasoning mentions “loop”, “stuck”, “no progress”, or “repeating” in a `success: false` finish, the result is flagged as a loop.

**Success validation:** Even if the model calls `execution-finished` with `success: true`, the agent verifies that at least one command step was executed and at least one `assert` step exists. If either check fails, the result is overridden to `success: false`.
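
The two post-finish checks can be sketched as a single function. `finalizeResult` and its types are hypothetical, and case-insensitive substring matching is an assumption about how the keywords are detected.

```ts
// Hypothetical sketch of the loop-detection and success-validation rules.
interface FinishCall { success: boolean; reasoning: string }
interface RecordedStep { interaction: string }
interface FinalResult { success: boolean; loopDetected: boolean }

const LOOP_KEYWORDS = ["loop", "stuck", "no progress", "repeating"];

function finalizeResult(finish: FinishCall, steps: RecordedStep[]): FinalResult {
  // Loop detection only applies to failed finishes.
  const reasoning = finish.reasoning.toLowerCase();
  const loopDetected =
    !finish.success && LOOP_KEYWORDS.some((keyword) => reasoning.includes(keyword));

  // A claimed success stands only with at least one command step and one assert step.
  const hasCommandStep = steps.length > 0;
  const hasAssertStep = steps.some((step) => step.interaction === "assert");
  const success = finish.success && hasCommandStep && hasAssertStep;

  return { success, loopDetected };
}
```

For example, a `success: true` finish after a single `click` step is overridden to `success: false`, because nothing was asserted.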

## Directory Structure

```plaintext
packages/engine/src/
├── commands/                          # Command abstraction system
│   ├── command-spec.ts                # CommandSpec type definition
│   ├── command.ts                     # Abstract Command base class
│   ├── command-defs.ts                # Union of all command specs
│   ├── step.ts                        # StepData type
│   └── commands/                      # Built-in command implementations
│       ├── click/                     # AI-powered element clicking
│       ├── type/                      # Find element + type text
│       ├── scroll/                    # Scroll with condition checking
│       ├── assert/                    # Visual assertion checking
│       ├── hover/                     # Hover over elements (web only)
│       ├── drag/                      # Drag from one element to another
│       ├── read/                      # Extract text from screen into memory
│       ├── navigate/                  # Navigate to URL (web only, last resort)
│       ├── refresh/                   # Refresh the current page
│       ├── save-clipboard/            # Save clipboard content to memory
│       └── wait-until/                # Wait for visual condition (not LLM-exposed)
├── execution-agent/                   # Core AI agent loop
│   ├── agent/
│   │   ├── execution-agent.ts         # Main agent class
│   │   ├── execution-agent-factory.ts # Abstract factory for building agents
│   │   ├── execution-result.ts        # Result types
│   │   ├── test-case.ts               # TestCase interface
│   │   ├── system-prompt.ts           # Agent system prompt
│   │   ├── memory/                    # Variable memory store
│   │   ├── components/
│   │   │   └── wait-planner.ts        # Generates wait conditions between steps
│   │   └── tools/                     # LLM tools
│   │       ├── command-tool.ts        # Wraps Command as an AI SDK tool
│   │       ├── execution-finished-tool.ts
│   │       ├── ask-user-tool.ts
│   │       └── wait-tool.ts
│   ├── runner/
│   │   ├── execution-agent-runner.ts  # Main runner - ties installer + factory + recording
│   │   ├── artifacts.ts               # Writes screenshots, steps, video to disk
│   │   └── events.ts                  # Event hooks (beforeStep, afterStep, frame)
│   └── local-dev/
│       ├── local-runner.ts            # Local dev runner (loads markdown test files)
│       └── load-test-case.ts          # Parses markdown frontmatter into test cases
└── platform/                          # Platform driver interfaces
    ├── context/
    │   ├── base-context.ts            # BaseCommandContext (screen + application drivers)
    │   ├── installer.ts               # Abstract Installer
    │   ├── image-stream.ts            # Live frame streaming interface
    │   └── video-recorder.ts          # Abstract VideoRecorder with state machine
    └── drivers/
        ├── screen.driver.ts           # screenshot(), getResolution()
        ├── mouse.driver.ts            # click(), hover(), drag(), scroll()
        ├── keyboard.driver.ts         # type(), press(), selectAll(), clear()
        ├── application.driver.ts      # waitUntilStable()
        ├── navigation.driver.ts       # navigate(), getCurrentUrl(), refresh()
        └── clipboard.driver.ts        # read()
```

## CommandSpec - The Command Type System

Every command is defined by a `CommandSpec`:

```ts
interface CommandSpec {
  interaction: string;  // command name (e.g., "click")
  params: object;       // what gets stored for replay
  output: BaseOutput;   // what the command returns (always includes `outcome: string`)
}
```

The `Command<TSpec, TContext>` abstract base class is what all commands extend:

```ts
abstract class Command<TSpec extends CommandSpec, TContext extends BaseCommandContext> {
  abstract readonly interaction: TSpec["interaction"];
  abstract readonly paramsSchema: z.ZodSchema<CommandParams<TSpec>>;
  abstract execute(params: CommandParams<TSpec>, context: TContext): Promise<CommandOutput<TSpec>>;
}
```

The `CommandTool<TSpec, TContext>` class wraps a `Command` to make it compatible with the AI SDK. It adds:

* An `inputSchema()` that defines what the LLM provides (may differ from `paramsSchema`)
* A `description()` shown to the AI model
* An `extractParams()` method that converts LLM input into command parameters

This separation means the LLM can provide a natural language description (“the blue submit button”) while the stored params contain the resolved coordinates and structured data needed for replay.

## Built-in Commands

| Command            | Exposed to LLM | Params                                                      | What it does                                                                                                                                                                                                                                                                                                                                                               |
| ------------------ | -------------- | ----------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **click**          | Yes            | `{ description, options }`                                  | Takes a natural-language element description, uses `PointDetector` AI to locate pixel coordinates, calls `mouse.click(x, y)`                                                                                                                                                                                                                                               |
| **type**           | Yes            | `{ description, text, overwrite }`                          | Uses `PointDetector` to find the input element, clicks it, then types the text. Supports overwrite mode to replace existing content                                                                                                                                                                                                                                        |
| **assert**         | Yes            | `{ instruction }`                                           | Takes an instruction (can contain multiple assertions). Uses `AssertionSplitter` to decompose, takes one screenshot, runs `AssertChecker` on all assertions in parallel                                                                                                                                                                                                    |
| **scroll**         | Yes            | `{ elementDescription?, direction, condition, maxScrolls }` | Scrolls up or down on a specific element or the page, checking a visual condition after each scroll                                                                                                                                                                                                                                                                        |
| **hover**          | Yes            | `{ description }`                                           | Hovers over an element identified by natural language description (web only)                                                                                                                                                                                                                                                                                               |
| **drag**           | Yes            | `{ startDescription, endDescription }`                      | Drags from one element to another, both identified by natural language                                                                                                                                                                                                                                                                                                     |
| **read**           | Yes            | `{ description, variableName }`                             | Extracts text from the screen and stores it in the agent’s memory under `variableName` for use in later steps via `{{variableName}}` syntax                                                                                                                                                                                                                                |
| **navigate**       | Yes            | `{ url }`                                                   | Navigates directly to a URL (web only). Accepts full URLs, URLs without a protocol (`https://` is added), or relative paths (resolved against the current page origin). **Last resort only** - the agent should prefer UI interaction (clicking links, buttons), because that is what surfaces bugs. It is meant for targets that can't be reached through the UI, or for navigation the same test has already exercised through the UI |
| **refresh**        | Yes            | (none)                                                      | Refreshes the current page                                                                                                                                                                                                                                                                                                                                                 |
| **save-clipboard** | Yes            | `{ variableName }`                                          | Reads clipboard content and stores it in memory under `variableName`                                                                                                                                                                                                                                                                                                       |
| **wait-until**     | No             | `{ condition, timeout }`                                    | Polls a visual condition every second up to timeout using `VisualConditionChecker`. Auto-generated by `WaitPlanner`, not callable by the LLM                                                                                                                                                                                                                               |
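
The three URL forms accepted by `navigate` could be normalized along the lines below. `resolveNavigationTarget` is a hypothetical function, not the engine's implementation, and treating only inputs that start with `/` as relative paths is an assumption of this sketch.

```ts
// Hypothetical sketch of the URL handling described for the navigate command.
function resolveNavigationTarget(input: string, currentUrl: string): string {
  const trimmed = input.trim();

  // Full URL: already carries a scheme such as https://.
  if (/^[a-z][a-z0-9+.-]*:\/\//i.test(trimmed)) {
    return new URL(trimmed).href;
  }

  // Relative path: resolve against the origin of the current page.
  if (trimmed.startsWith("/")) {
    return new URL(trimmed, new URL(currentUrl).origin).href;
  }

  // Anything else is treated as a URL without a protocol.
  return new URL(`https://${trimmed}`).href;
}
```

Resolving against the origin rather than the full current URL means `/settings` always lands at the site root, regardless of the path or query string the test is currently on.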

## LLM Tools (Non-Command)

These tools are available to the model but are not recorded as test steps:

| Tool                   | Purpose                                                                                                                  |
| ---------------------- | ------------------------------------------------------------------------------------------------------------------------ |
| **wait**               | Sleeps for N seconds. Useful for loading screens or animations                                                           |
| **ask-user**           | Sends questions to a human via WebSocket. Pauses execution until answered. Only available in frontend-connected sessions |
| **execution-finished** | Called by the model to end the test. Takes `{ success, reasoning }`                                                      |

## Driver Interfaces

Platform-specific apps (`engine-web`, `engine-mobile`) implement these interfaces:

### ScreenDriver

```ts
interface ScreenDriver {
  getResolution(): Promise<ScreenResolution>;
  screenshot(): Promise<Screenshot>;
}
```

### MouseDriver

```ts
interface MouseDriver<TClickOptions extends object = Record<string, never>> {
  click(x: number, y: number, options?: TClickOptions): Promise<void>;
  hover?(x: number, y: number): Promise<void>;
  drag(startX: number, startY: number, endX: number, endY: number): Promise<void>;
  scroll(args: ScrollArgs): Promise<void>;
}
```

### KeyboardDriver

```ts
interface KeyboardDriver {
  selectAll(): Promise<void>;
  clear(): Promise<void>;
  type(text: string, options?: TypeOptions): Promise<void>;
  press(key: string): Promise<void>;
}
```

### ApplicationDriver

```ts
interface ApplicationDriver {
  waitUntilStable(): Promise<void>;
}
```

### NavigationDriver

```ts
interface NavigationDriver {
  navigate(url: string): Promise<void>;
  getCurrentUrl(): Promise<string>;
  refresh(): Promise<void>;
}
```

### ClipboardDriver

```ts
interface ClipboardDriver {
  read(): Promise<string>;
}
```

The `BaseCommandContext` requires only `screen` and `application` drivers. Each platform extends this with additional drivers as needed.
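
How a platform context builds on the base can be sketched with local stand-ins for the interfaces above. `KioskContext`, `createFakeKioskContext`, and `describeScreen` are hypothetical, and the shapes of `ScreenResolution` and `Screenshot` are assumptions, since this page does not define them.

```ts
// Local stand-ins so the sketch compiles on its own; the real types live in packages/engine.
interface ScreenResolution { width: number; height: number } // assumed shape
type Screenshot = Uint8Array; // assumed shape

interface ScreenDriver {
  getResolution(): Promise<ScreenResolution>;
  screenshot(): Promise<Screenshot>;
}
interface ApplicationDriver { waitUntilStable(): Promise<void> }
interface ClipboardDriver { read(): Promise<string> }

// The minimum every platform provides.
interface BaseCommandContext {
  screen: ScreenDriver;
  application: ApplicationDriver;
}

// Hypothetical platform context that adds one extra driver.
interface KioskContext extends BaseCommandContext {
  clipboard: ClipboardDriver;
}

function createFakeKioskContext(): KioskContext {
  return {
    screen: {
      getResolution: async () => ({ width: 1280, height: 720 }),
      screenshot: async () => new Uint8Array(),
    },
    application: { waitUntilStable: async () => {} },
    clipboard: { read: async () => "copied text" },
  };
}

// Code that needs only the base drivers accepts any platform's context.
async function describeScreen(context: BaseCommandContext): Promise<string> {
  await context.application.waitUntilStable();
  const { width, height } = await context.screen.getResolution();
  return `${width}x${height}`;
}
```

Passing a `KioskContext` to `describeScreen` type-checks because the extended context is structurally a `BaseCommandContext`.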

## Memory System

The agent maintains a `MemoryStore` - a key-value store that persists across steps within a single execution. Commands like `read` and `save-clipboard` write values into memory, and any subsequent command can reference stored values using `{{variableName}}` template syntax.

When a command executes, the agent resolves `{{variableName}}` templates in the parameters before passing them to the command. The unresolved params are stored for replay (keeping the template references), while the resolved values are used for actual execution.
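
Template resolution can be sketched as follows. This `MemoryStore` is a minimal stand-in with hypothetical `set`/`get` methods, `resolveTemplates` and `resolveParams` are illustrative names, and leaving unknown variables untouched is an assumption.

```ts
// Minimal stand-in for the agent's memory; the real MemoryStore API may differ.
class MemoryStore {
  private readonly values = new Map<string, string>();
  set(name: string, value: string): void { this.values.set(name, value); }
  get(name: string): string | undefined { return this.values.get(name); }
}

// Replace each {{variableName}} with its stored value.
function resolveTemplates(text: string, memory: MemoryStore): string {
  return text.replace(/\{\{\s*([\w-]+)\s*\}\}/g, (match, name: string) => {
    // Assumption: unknown variables are left as-is rather than raising an error.
    return memory.get(name) ?? match;
  });
}

// Returns a resolved copy; the original params keep their templates for replay.
function resolveParams<T extends Record<string, unknown>>(params: T, memory: MemoryStore): T {
  const resolved: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(params)) {
    resolved[key] = typeof value === "string" ? resolveTemplates(value, memory) : value;
  }
  return resolved as T;
}

const memory = new MemoryStore();
memory.set("orderId", "A-1042");

const storedParams = { description: "search box", text: "{{orderId}}", overwrite: true };
const executedParams = resolveParams(storedParams, memory);
```

Here `storedParams.text` remains `{{orderId}}` for replay, while `executedParams.text` is `A-1042` for the live run.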

## Adding a New Command

1. **Define the spec.** Create a `CommandSpec` type for the command’s interaction, params, and output:

`packages/engine/src/commands/commands/my-command/my-command.def.ts`

```ts
import z from "zod";

// The spec ties together the command name, its stored params, and its output.
export interface MyCommandSpec {
  interaction: "my-command";
  params: { target: string; value: number };
  output: { outcome: string; success: boolean };
}

// Zod schema for the params; keep it in sync with MyCommandSpec["params"].
export const myCommandParamsSchema = z.object({
  target: z.string().describe("Description for the LLM"),
  value: z.number().describe("A numeric value"),
});
```

2. **Implement the command.** Create a class extending `Command`:

packages/engine/src/commands/commands/my-command/my-command.command.ts

```ts
import { Command } from "../../command";
import { type MyCommandSpec, myCommandParamsSchema } from "./my-command.def";

// `YourContext` stands in for your platform's command context type.
export class MyCommand extends Command<MyCommandSpec, YourContext> {
  readonly interaction = "my-command" as const;
  readonly paramsSchema = myCommandParamsSchema;

  async execute(
    params: MyCommandSpec["params"],
    context: YourContext,
  ): Promise<MyCommandSpec["output"]> {
    // Use context drivers to perform the action
    return { outcome: "Did the thing", success: true };
  }
}
```

3. **Create the tool wrapper.** Create a `CommandTool` subclass that defines how the LLM interacts with the command:

packages/engine/src/execution-agent/agent/tools/commands/my-command.tool.ts

```ts
import { CommandTool } from "../command-tool";
import {
  type MyCommandSpec,
  myCommandParamsSchema,
} from "../../../../commands/commands/my-command/my-command.def";

export class MyCommandTool extends CommandTool<MyCommandSpec, YourContext> {
  protected inputSchema() { return myCommandParamsSchema; }
  description() { return "Description shown to the AI model"; }
  protected async extractParams(input, context) { return input; }
}
```

4. **Register it.** Add the tool to the command tools array in your `ExecutionAgentFactory` subclass.

5. **Add the spec to the union type** in `packages/engine/src/commands/command-defs.ts` so TypeScript knows about it.

## Extending for a New Platform

1. **Implement the driver interfaces your platform needs** using your platform’s SDK. At minimum you need `ScreenDriver` and `ApplicationDriver` (the `BaseCommandContext`). Add `MouseDriver`, `KeyboardDriver`, `NavigationDriver`, and `ClipboardDriver` as needed.

2. **Create an `Installer` subclass** that builds the context. The installer receives application data (URL, device config, etc.) and returns the context with all drivers, plus an `ImageStream` and `VideoRecorder`:

```ts
class MyPlatformInstaller extends Installer<MyAppData, MyContext> {
  async install(appData: MyAppData) {
    // Launch browser/device, create driver instances
    return { context, imageStream, videoRecorder };
  }
}
```

3. **Create an `ExecutionAgentFactory` subclass** that builds the agent with platform-specific command tools:

```ts
class MyPlatformAgentFactory extends ExecutionAgentFactory<MySpec, MyContext> {
  async buildAgent(params) {
    return new ExecutionAgent({
      model: this.model,
      systemPrompt: this.systemPrompt,
      maxSteps: 50,
      commandTools: [new ClickTool(...), new TypeTool(...), ...],
      // ...rest of config
      ...params,
    });
  }
}
```

4. **Create a runner entry point** that wires the installer, factory, and event handlers together using `ExecutionAgentRunner`.

## The Runner and Artifacts

`ExecutionAgentRunner` orchestrates a full test run:

1. Calls `Installer.install()` to build the platform context (browser/device + drivers)
2. Registers a frame handler for live streaming
3. Builds the `ExecutionAgent` via the factory
4. Wraps `agent.generate()` in `VideoRecorder.withRecording()`
5. Returns `{ result, videoPath }`

`LocalRunner` extends this for local development - it loads test cases from markdown files and saves artifacts to disk:

```plaintext
artifacts/{timestamp}-{testName}/
├── screenshots/step-0-before.jpeg, step-0-after.jpeg, ...
├── steps.json          # Array of step execution outputs
├── conversation.json   # Sanitized AI turn log
├── instruction.txt     # The test prompt
└── video.{ext}         # Recording
```

## Result Types

**`GeneratedStep<TSpec>`** - one step of execution:

* `executionOutput` - the command’s step data (interaction + params) and result
* `waitCondition` - an optional wait condition for replay
* `beforeMetadata` / `afterMetadata` - screenshots and other metadata from before/after the step

**`ExecutionResult<TSpec>`** - the full test result:

* `generatedSteps` - all steps
* `memory` - final state of extracted variables
* `success` - whether the test passed
* `finishReason` - `"success"`, `"max_steps"`, or `"error"`
* `reasoning` - the model’s explanation for finishing
* `conversation` - the full AI message history

**`LeanExecutionResult<TSpec>`** - a network-safe version that strips large image buffers from step metadata.

## Test Cases as Markdown

Test files use [gray-matter](https://github.com/jonschlinkert/gray-matter) frontmatter for parameters, with the body containing the natural language prompt:

```markdown
---
url: https://example.com
---
Navigate to the login page, enter "user@test.com" and "password123",
click Sign In, and assert the dashboard is visible.
```

The `loadTestCase` function parses the frontmatter against a Zod schema and extracts the prompt from the body.
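
To make the file format concrete, here is a dependency-free sketch of the two steps: split the frontmatter from the body, then validate the parameters. It is not how `loadTestCase` is implemented (that uses gray-matter and a Zod schema); `parseTestCase` and `loadTestCaseSketch` are hypothetical names, the parser only understands flat `key: value` lines, and the single `url` check stands in for real schema validation.

```ts
interface ParsedTestCase {
  params: Record<string, string>;
  prompt: string;
}

/** Split a test file into flat frontmatter params and the prompt body. */
function parseTestCase(source: string): ParsedTestCase {
  const lines = source.split(/\r?\n/);
  if (lines[0]?.trim() !== "---") {
    // No frontmatter block: the whole file is the prompt.
    return { params: {}, prompt: source.trim() };
  }
  const end = lines.findIndex((line, index) => index > 0 && line.trim() === "---");
  if (end === -1) throw new Error("Unterminated frontmatter block");

  const params: Record<string, string> = {};
  for (const line of lines.slice(1, end)) {
    const separator = line.indexOf(":"); // first colon only, so URLs survive
    if (separator === -1) continue;
    params[line.slice(0, separator).trim()] = line.slice(separator + 1).trim();
  }
  return { params, prompt: lines.slice(end + 1).join("\n").trim() };
}

/** Stand-in for schema validation: require a `url` parameter. */
function loadTestCaseSketch(source: string): { url: string; prompt: string } {
  const { params, prompt } = parseTestCase(source);
  const url = params["url"];
  if (url === undefined || url === "") throw new Error("Missing required parameter: url");
  return { url, prompt };
}

const sampleFile = [
  "---",
  "url: https://example.com",
  "---",
  'Navigate to the login page, enter "user@test.com" and "password123",',
  "click Sign In, and assert the dashboard is visible.",
].join("\n");

const sampleTestCase = loadTestCaseSketch(sampleFile);
```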

# AI Package

> Deep dive into the AI primitives that power test execution - model registry, visual checkers, point detection, object detection, and structured output generation.

The `@autonoma/ai` package provides every AI primitive used by the execution agent. It handles model management, visual analysis, element location, structured output generation, and evaluation benchmarking. No AI logic should be duplicated in platform apps - everything lives here.

## Directory Structure

```plaintext
packages/ai/src/
├── index.ts                          # Package re-exports
├── env.ts                            # Environment variables (API keys)
├── registry/                         # Model registry and configuration
│   ├── model-registry.ts             # Core ModelRegistry class
│   ├── model-entries.ts              # Model definitions and pricing
│   ├── providers.ts                  # LLM provider singletons
│   ├── options.ts                    # ModelOptions, reasoning effort levels
│   ├── costs.ts                      # Cost calculation functions
│   ├── cost-collector.ts             # Aggregated cost tracking
│   ├── usage.ts                      # Token usage tracking
│   └── monitoring.ts                 # Logging middleware and telemetry
├── visual/                           # Visual AI primitives
│   ├── visual-condition-checker.ts   # Check if a condition is met on a screenshot
│   ├── assert-checker.ts             # Validate test assertions
│   ├── visual-chooser.ts             # Pick which UI element matches an instruction
│   └── text-extractor.ts             # Extract text from screenshots
├── text/
│   └── assertion-splitter.ts         # Split compound assertions into atomic ones
├── object/                           # Structured output generation
│   ├── object-generator.ts           # Core structured JSON generator
│   ├── retry.ts                      # Retry with exponential backoff
│   ├── user-messages.ts              # Build multimodal messages (text + images + video)
│   └── video/
│       ├── video-processor.ts        # Upload videos to Google GenAI Files API
│       └── video-input.ts            # Video input types and model support
└── freestyle/                        # Point and object detection
    ├── resolution-fallback.ts        # Coordinate resolution management
    ├── point/
    │   ├── point-detector.ts         # Abstract PointDetector base
    │   ├── gemini-computer-use-point-detector.ts
    │   └── object-point-detector.ts  # Adapter: ObjectDetector -> PointDetector
    └── object/
        ├── object-detector.ts        # Abstract ObjectDetector base
        └── gemini-object-detector.ts # Gemini-based bounding box detection
```

## Model Registry

`ModelRegistry<TModel>` manages all LLM instances with middleware for cost tracking and monitoring. It wraps the Vercel AI SDK’s language models with usage tracking and provider-specific configuration.

### How It Works

The registry is constructed with a map of model entries. Each entry knows how to create its model instance and how to calculate costs:

```ts
const registry = new ModelRegistry({
  models: MODEL_ENTRIES,
  defaultSettings: { temperature: 0 },
  monitoring: { onGenerate: (result) => { /* log it */ } },
});
```

When you request a model, the registry wraps it with middleware for usage tracking, monitoring, and default settings:

```ts
const model = registry.getModel({
  model: "GEMINI_3_FLASH_PREVIEW",
  tag: "assert-checker",
  reasoning: "low",
});
```

The `tag` field identifies the use case (e.g., “assert-checker”, “click-detector”) for monitoring and cost attribution. The `reasoning` field sets the thinking effort level.

### Current Models

| Key                      | Model ID                      | Provider   |
| ------------------------ | ----------------------------- | ---------- |
| `GEMINI_3_FLASH_PREVIEW` | `gemini-3-flash-preview`      | Google     |
| `MINISTRAL_8B`           | `mistralai/ministral-8b-2512` | OpenRouter |
| `GPT_OSS_120B`           | `openai/gpt-oss-120b`         | Groq       |

An alternative `OPENROUTER_MODEL_ENTRIES` set routes all models through OpenRouter, including a Gemini variant (`google/gemini-3-flash-preview`) and a Llama variant (`meta-llama/llama-4-maverick`) in place of Ministral.

### Providers

Three LLM provider singletons are available, each lazily initialized with its own API key:

| Provider             | SDK                           | Env Variable         |
| -------------------- | ----------------------------- | -------------------- |
| `googleProvider`     | `@ai-sdk/google`              | `GEMINI_API_KEY`     |
| `groqProvider`       | `@ai-sdk/groq`                | `GROQ_KEY`           |
| `openRouterProvider` | `@openrouter/ai-sdk-provider` | `OPENROUTER_API_KEY` |

The `LLMProvider` class wraps each provider as a singleton - the underlying SDK instance is created on first use.

### Reasoning Effort

The `ModelReasoningEffort` type supports four levels:

| Level      | Groq                        | Google                    |
| ---------- | --------------------------- | ------------------------- |
| `"none"`   | `reasoningEffort: "none"`   | Thinking disabled         |
| `"low"`    | `reasoningEffort: "low"`    | `thinkingLevel: "low"`    |
| `"medium"` | `reasoningEffort: "medium"` | `thinkingLevel: "medium"` |
| `"high"`   | `reasoningEffort: "high"`   | `thinkingLevel: "high"`   |

Reasoning effort is translated to provider-specific options in `buildSettings()`, so callers never need to think about which provider they are targeting.
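
The mapping in the table can be sketched as a single function. This is an illustration of the translation only, not the body of `buildSettings()`: `toProviderReasoning` is a hypothetical name, and the `{ thinkingDisabled: true }` shape used for Google's `"none"` level is an assumption, since the docs only say thinking is disabled.

```ts
type ModelReasoningEffort = "none" | "low" | "medium" | "high";
type ProviderName = "groq" | "google";

// Hypothetical option shapes; the real provider option objects may differ.
type ReasoningOptions =
  | { reasoningEffort: ModelReasoningEffort }
  | { thinkingLevel: "low" | "medium" | "high" }
  | { thinkingDisabled: true };

function toProviderReasoning(
  provider: ProviderName,
  effort: ModelReasoningEffort,
): ReasoningOptions {
  // Groq accepts the level name directly.
  if (provider === "groq") return { reasoningEffort: effort };
  // Google has no "none" level in the table above: thinking is turned off instead.
  if (effort === "none") return { thinkingDisabled: true };
  return { thinkingLevel: effort };
}

const groqLow = toProviderReasoning("groq", "low");
const googleNone = toProviderReasoning("google", "none");
const googleHigh = toProviderReasoning("google", "high");
```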

### Extra Context

The registry supports dynamic context that can be attached during execution:

```ts
registry.addContext({ testRunId: "run-123", stepIndex: 3 });
// Later...
registry.resetContext();
```

This context is passed to monitoring callbacks, making it possible to trace costs and usage back to specific test runs and steps.

## Visual AI Primitives

### VisualConditionChecker

The base class for checking whether a condition is met on a screenshot. It extends `ObjectGenerator` with a predefined schema:

```ts
const checker = new VisualConditionChecker({ model });
const result = await checker.checkCondition(
  "The login form is visible with email and password fields",
  screenshot,
);
// result: { metCondition: true, reason: "The form is visible with both fields" }
```

Returns `{ metCondition: boolean, reason: string }`.

### AssertChecker

Extends `VisualConditionChecker` with a specialized system prompt for test assertions. It handles both positive assertions (“validate there’s a title that says Hello”) and negative assertions (“assert there’s no download button”):

```ts
const checker = new AssertChecker(model);
const result = await checker.checkCondition(
  "The submit button is disabled",
  screenshot,
);
```

Used by the `assert` command to validate each individual assertion against a screenshot.

### VisualChooser

Picks which UI element from a set of options matches a user instruction. It draws numbered bounding boxes on the screenshot and asks the model to choose:

```ts
const chooser = new VisualChooser({ model });
const result = await chooser.chooseOption({
  options: [
    { boundingBox: { x: 10, y: 20, width: 100, height: 30 }, description: "Submit" },
    { boundingBox: { x: 10, y: 60, width: 100, height: 30 }, description: "Cancel" },
  ],
  instruction: "Click the submit button",
  screenshot,
});
// result: { reasoning: "Option 1 is the submit button", option: { ... } }
```

Throws `NoValidOptionFoundError` if no option matches, or `InvalidIndexError` if the model returns an out-of-bounds index.
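
The two failure modes can be illustrated with a small index check. This is a sketch under assumptions, not the `VisualChooser` source: the error classes below are local stand-ins with the same names, `pickOption` is hypothetical, and it assumes the model answers with a 1-based box number (matching the numbered boxes) or `null` when nothing matches.

```ts
// Local stand-ins for the errors named above; the real classes live in @autonoma/ai.
class NoValidOptionFoundError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "NoValidOptionFoundError";
  }
}

class InvalidIndexError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "InvalidIndexError";
  }
}

interface ChoiceOption {
  description: string;
}

/** Map the model's answer (assumed 1-based, or null) back to an option. */
function pickOption<T extends ChoiceOption>(options: T[], modelIndex: number | null): T {
  if (modelIndex === null) {
    throw new NoValidOptionFoundError("The model found no matching option");
  }
  const option = Number.isInteger(modelIndex) ? options[modelIndex - 1] : undefined;
  if (option === undefined) {
    throw new InvalidIndexError(`Index ${modelIndex} is outside 1..${options.length}`);
  }
  return option;
}

const sampleOptions: ChoiceOption[] = [{ description: "Submit" }, { description: "Cancel" }];
const chosen = pickOption(sampleOptions, 1);
```

Callers can then treat the two errors differently, for example retrying on an invalid index but failing the step when no option matches.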

### AssertionSplitter

Splits a compound assertion instruction into individual atomic assertions that can be checked independently:

```ts
const splitter = new AssertionSplitter(model);
const result = await splitter.splitAssertions(
  "validate that the title is visible, the subtitle as well but the button is not",
);
// result.assertions: [
//   "validate that the title is visible",
//   "validate that the subtitle is visible",
//   "validate that the button is not visible"
// ]
```

Importantly, the splitter ensures each split assertion contains enough context to stand alone. It repairs incomplete fragments (e.g., “the subtitle as well” becomes “validate that the subtitle is visible”).

## Point Detection

Point detectors locate where to interact on screen, given a natural language description. They are used by the `click`, `type`, `hover`, and `drag` commands.

### Abstract Base

All point detectors extend `PointDetector`:

```ts
abstract class PointDetector {
  protected abstract detectPointForResolution(
    screenshot: Screenshot,
    prompt: string,
    resolution: ScreenResolution,
  ): Promise<Point>;


  async detectPoint(
    screenshot: Screenshot,
    prompt: string,
    targetResolution?: ScreenResolution,
  ): Promise<Point>;
}
```

The public `detectPoint` method handles resolution fallback automatically - if no target resolution is provided, it defaults to the device resolution (if configured) or the image resolution.

### GeminiComputerUsePointDetector

Uses Google’s Gemini computer-use API with a `click_at` tool. The model returns coordinates in a normalized 0-1000 space, which are then scaled to actual pixel coordinates based on the target resolution.
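
The scaling step is simple arithmetic, sketched below. `scaleNormalizedPoint` is a hypothetical helper, the `width`/`height` shape of `ScreenResolution` is assumed, and rounding to whole pixels is a choice made for this example rather than documented behavior.

```ts
interface Point {
  x: number;
  y: number;
}

// Assumed shape; the real ScreenResolution type is not shown in these docs.
interface ScreenResolution {
  width: number;
  height: number;
}

const NORMALIZED_MAX = 1000;

/** Convert a point in the 0-1000 normalized space to pixel coordinates. */
function scaleNormalizedPoint(normalized: Point, target: ScreenResolution): Point {
  return {
    x: Math.round((normalized.x / NORMALIZED_MAX) * target.width),
    y: Math.round((normalized.y / NORMALIZED_MAX) * target.height),
  };
}

const pixelPoint = scaleNormalizedPoint({ x: 500, y: 250 }, { width: 1920, height: 1080 });
// pixelPoint is { x: 960, y: 270 }
```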

### ObjectPointDetector

An adapter that converts an `ObjectDetector` into a `PointDetector`. It detects the bounding box of an element and returns the center point. Useful when you have an object detector but need point-level precision.
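
The box-to-point conversion reduces to taking the center of the rectangle. A minimal sketch, assuming the `{ x, y, width, height }` bounding box shape used in the `VisualChooser` example (`centerOf` is a hypothetical helper):

```ts
interface BoundingBox {
  x: number; // left edge
  y: number; // top edge
  width: number;
  height: number;
}

/** Return the center of a bounding box as an interaction point. */
function centerOf(box: BoundingBox): { x: number; y: number } {
  return {
    x: box.x + box.width / 2,
    y: box.y + box.height / 2,
  };
}

const submitCenter = centerOf({ x: 10, y: 20, width: 100, height: 30 });
// submitCenter is { x: 60, y: 35 }
```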

## Object Detection

### ObjectDetector (Abstract Base)

Detects objects in an image and returns bounding boxes:

```ts
abstract class ObjectDetector {
  async detectObjects(
    screenshot: Screenshot,
    prompt: string,
    targetResolution?: ScreenResolution,
  ): Promise<DetectedObject[]>;
}
```

Each `DetectedObject` contains a `boundingBox` and an optional `label`.

### GeminiObjectDetector

Uses Gemini’s structured output to return bounding boxes as normalized 0-1000 coordinates. Useful for detecting multiple UI elements at once.

## ObjectGenerator

The core structured output engine used by almost every AI primitive in the package. It wraps the AI SDK’s `generateText` with:

* **Zod schema validation** for structured JSON output
* **Automatic retry** with exponential backoff (default: 5 retries, 100ms initial delay, 2x backoff factor)
* **Multimodal input** via `ObjectGenerationParams` - supports text, images, and video
* **Null byte stripping** from responses for PostgreSQL compatibility
* **Tool support** for agentic generation workflows (stops after 5 tool steps)

```ts
const generator = new ObjectGenerator({
  model,
  systemPrompt: "You are a UI analysis expert.",
  schema: z.object({
    elements: z.array(z.object({
      label: z.string(),
      visible: z.boolean(),
    })),
  }),
});


const result = await generator.generate({
  userPrompt: "List all visible buttons",
  images: [screenshot],
});
```

Video input is supported for models that handle it (checked via `modelSupportsVideo`). Videos are uploaded through the Google GenAI Files API via `VideoProcessor`.

If generation fails after all retries, an `ObjectGenerationFailedError` is thrown wrapping the original error.
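
To make the retry defaults concrete, the sketch below computes the delay schedule they imply and shows the null byte stripping step. Both helpers are hypothetical and only mirror the numbers listed above; it assumes each of the 5 retries waits once, with the delay doubling each time, and does not reproduce `retry.ts`.

```ts
interface RetryOptions {
  maxRetries: number;
  initialDelayMs: number;
  backoffFactor: number;
}

// Defaults taken from the feature list above.
const DEFAULT_RETRY_OPTIONS: RetryOptions = {
  maxRetries: 5,
  initialDelayMs: 100,
  backoffFactor: 2,
};

/** Delay before each retry: initialDelay * factor^attempt. */
function backoffDelays(options: RetryOptions = DEFAULT_RETRY_OPTIONS): number[] {
  return Array.from(
    { length: options.maxRetries },
    (_, attempt) => options.initialDelayMs * options.backoffFactor ** attempt,
  );
}

/** PostgreSQL text columns reject \u0000, so drop it from model output. */
function stripNullBytes(text: string): string {
  return text.replace(/\u0000/g, "");
}

const delays = backoffDelays();
// delays is [100, 200, 400, 800, 1600], about 3.1 seconds of waiting in total
```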

## Adding a New Model

1. **Add the model entry** to `packages/ai/src/registry/model-entries.ts`:

```ts
export const MODEL_ENTRIES = {
  // ...existing entries
  MY_NEW_MODEL: {
    createModel: () => googleProvider.getModel("my-new-model-id"),
    pricing: simpleCostFunction({
      inputCostPerM: 0.5,
      outputCostPerM: 1.5,
    }),
  },
} as const;
```

2. **Choose the right cost function.** Use `simpleCostFunction` for models without cache pricing, or `inputCacheCostFunction` for models that support input caching (adds a `cachedInputCostPerM` field).

3. **Add a provider** if needed. If the model uses a provider not yet configured, add a new `LLMProvider` singleton in `providers.ts` and add the corresponding API key to `env.ts`.

4. **Use the model** by referencing its key when calling `registry.getModel()`:

```ts
const model = registry.getModel({
  model: "MY_NEW_MODEL",
  tag: "my-use-case",
  reasoning: "medium",
});
```
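
For intuition about the pricing fields, here is a sketch of what a per-million-token cost function computes. It is not the `simpleCostFunction` from `costs.ts`: `simpleCostSketch` and the `TokenUsage` shape are hypothetical, and it assumes `inputCostPerM` and `outputCostPerM` are prices per one million tokens.

```ts
// Hypothetical usage shape; the real usage type is defined in the registry.
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
}

interface SimplePricing {
  inputCostPerM: number; // assumed: price per 1,000,000 input tokens
  outputCostPerM: number; // assumed: price per 1,000,000 output tokens
}

const TOKENS_PER_MILLION = 1_000_000;

/** Build a function that prices a single call from its token usage. */
function simpleCostSketch(pricing: SimplePricing): (usage: TokenUsage) => number {
  return (usage) =>
    (usage.inputTokens / TOKENS_PER_MILLION) * pricing.inputCostPerM +
    (usage.outputTokens / TOKENS_PER_MILLION) * pricing.outputCostPerM;
}

const priceCall = simpleCostSketch({ inputCostPerM: 0.5, outputCostPerM: 1.5 });
const callCost = priceCall({ inputTokens: 2_000_000, outputTokens: 1_000_000 });
// callCost is 2.5: 2M input tokens at 0.5 plus 1M output tokens at 1.5
```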

## Adding a New Visual AI Primitive

Most visual primitives follow the same pattern: extend `ObjectGenerator` with a specialized schema and system prompt.

1. **Define the output schema** with Zod:

```ts
import z from "zod";

const myPrimitiveSchema = z.object({
  elements: z.array(z.object({
    name: z.string(),
    confidence: z.number(),
  })),
});
type MyPrimitiveResult = z.infer<typeof myPrimitiveSchema>;
```

2. **Create the class** extending `ObjectGenerator`:

```ts
export class MyPrimitive extends ObjectGenerator<MyPrimitiveResult> {
  constructor(model: LanguageModel) {
    super({
      model,
      systemPrompt: "Your specialized system prompt here.",
      schema: myPrimitiveSchema,
    });
  }


  async analyze(screenshot: Screenshot, instruction: string): Promise<MyPrimitiveResult> {
    return this.generate({ images: [screenshot], userPrompt: instruction });
  }
}
```

3. **Export it** from the package index.

For point or object detection, extend `PointDetector` or `ObjectDetector` instead and implement the `detectPointForResolution` or `detectObjectsForResolution` method.

## Evaluation Framework

The `evals/` directory contains a Vitest-integrated framework for benchmarking AI accuracy:

* **`Evaluation<TTestCase>`** - base class that defines test cases and runs them against models

* **`ModelEvaluation`** - tracks token usage and cost per model across an evaluation run

* **Three eval types:**

  * `assert-condition/` - measures assertion checking accuracy
  * `freestyle-click/` - measures point detection accuracy
  * `wait-for-instruction/` - measures wait condition generation accuracy

Results are saved as JSON with pass rates and per-case breakdowns, making it easy to compare models and track accuracy over time.

# Examples

> Working examples of the Autonoma Environment Factory across 8 languages and 12 framework combinations.

Every example follows the same pattern: install the SDK, configure the handler, register a factory for every model the dashboard can create, and expose a single POST endpoint. Each factory carries an input schema (Pydantic in Python, Zod in TypeScript, Ecto/serde/etc. elsewhere) so the SDK can describe the model to the dashboard and validate the create payload before invoking your code. There is no SQL introspection and no SQL fallback.

> **Prerequisites:**
>
> Read the [Environment Factory Guide](/guides/environment-factory/) first for concepts. These examples are the code.

## Available Examples

All examples live in the [SDK repository](https://github.com/Autonoma-AI/sdk/tree/main/examples). Each one has a README with prerequisites, quick start, project structure, and how it works.

| Language                            | Framework            | Schema lib              | Source                                                                                |
| ----------------------------------- | -------------------- | ----------------------- | ------------------------------------------------------------------------------------- |
| [TypeScript](/examples/typescript/) | Express              | Zod                     | [express](https://github.com/Autonoma-AI/sdk/tree/main/examples/typescript/express)   |
| [TypeScript](/examples/typescript/) | Next.js (App Router) | Zod                     | [nextjs](https://github.com/Autonoma-AI/sdk/tree/main/examples/typescript/nextjs)     |
| [TypeScript](/examples/typescript/) | Hono                 | Zod                     | [hono](https://github.com/Autonoma-AI/sdk/tree/main/examples/typescript/hono)         |
| [Python](/examples/python/)         | FastAPI              | Pydantic                | [fastapi](https://github.com/Autonoma-AI/sdk/tree/main/examples/python/fastapi)       |
| [Python](/examples/python/)         | Flask                | Pydantic                | [flask](https://github.com/Autonoma-AI/sdk/tree/main/examples/python/flask)           |
| [Python](/examples/python/)         | Django               | Pydantic                | [django](https://github.com/Autonoma-AI/sdk/tree/main/examples/python/django)         |
| [Elixir](/examples/elixir/)         | Phoenix              | Ecto schemas            | [phoenix](https://github.com/Autonoma-AI/sdk/tree/main/examples/elixir/phoenix)       |
| [Java](/examples/java/)             | Spring Boot          | Bean Validation         | [spring-boot](https://github.com/Autonoma-AI/sdk/tree/main/examples/java/spring-boot) |
| [Ruby](/examples/ruby/)             | Rails                | dry-validation          | [rails](https://github.com/Autonoma-AI/sdk/tree/main/examples/ruby/rails)             |
| [Rust](/examples/rust/)             | Axum                 | serde + validator       | [axum](https://github.com/Autonoma-AI/sdk/tree/main/examples/rust/axum)               |
| [Go](/examples/go/)                 | Gin                  | go-playground/validator | [gin](https://github.com/Autonoma-AI/sdk/tree/main/examples/go/gin)                   |
| [PHP](/examples/php/)               | Laravel              | Symfony Validator       | [laravel](https://github.com/Autonoma-AI/sdk/tree/main/examples/php/laravel)          |

## Configuration Reference

Every example configures the same handler fields:

| Field           | Description                                                                                                                                                                                                                                                                                                                               |
| --------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `scopeField`    | The column that scopes all models to a tenant (e.g. `organizationId`). The SDK uses this to isolate test data and ensure teardown only removes records belonging to the test run.                                                                                                                                                         |
| `sharedSecret`  | Shared between your server and Autonoma. Used to verify incoming requests via HMAC-SHA256. Generate with `openssl rand -hex 32`.                                                                                                                                                                                                          |
| `signingSecret` | Private to your server only. Used to sign the refs token that tracks which records were created, so teardown can only delete what was created. Generate with `openssl rand -hex 32`. Must be different from `sharedSecret`.                                                                                                               |
| `factories`     | One factory per model the dashboard can create. Each factory declares an `input_model` / `inputSchema` (Pydantic, Zod, etc.) plus a `create` function that calls your real service/repository, and an optional `teardown`. The SDK introspects the schema to drive `discover` and validates payloads through it before invoking `create`. |
| `auth`          | Called after entity creation during `up`. Returns credentials (cookies, headers, tokens) so Autonoma can make authenticated requests as the test user.                                                                                                                                                                                    |
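
The SDK performs request verification for you, but it can help to see what an HMAC-SHA256 check with `sharedSecret` involves. The sketch below uses Node's built-in `crypto` module. `signBody` and `isValidSignature` are hypothetical helpers, and the hex encoding of the signature is an assumption; the actual header name and wire format are defined by the SDK.

```ts
import { createHmac, timingSafeEqual } from "node:crypto";

/** Compute the HMAC-SHA256 of the raw request body, hex encoded (assumed format). */
function signBody(rawBody: string, sharedSecret: string): string {
  return createHmac("sha256", sharedSecret).update(rawBody).digest("hex");
}

/** Check a received signature without leaking timing information. */
function isValidSignature(rawBody: string, signatureHex: string, sharedSecret: string): boolean {
  const expected = Buffer.from(signBody(rawBody, sharedSecret), "hex");
  const received = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return expected.length === received.length && timingSafeEqual(expected, received);
}

const exampleSecret = "0123456789abcdef0123456789abcdef"; // placeholder, not a real secret
const exampleBody = JSON.stringify({ action: "up" });
const exampleSignature = signBody(exampleBody, exampleSecret);
const accepted = isValidSignature(exampleBody, exampleSignature, exampleSecret);
```

Verification must run against the raw request body, before any JSON parsing or re-serialization, because even a whitespace change produces a different digest.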

# Previewkit

> Vercel-style preview environments for every pull request. Drop a .preview.yaml in your repo, open a PR, get a live URL.

Previewkit gives you a fresh, isolated environment for every pull request. You describe your stack once in a `.preview.yaml` at the root of your repo and Previewkit handles the rest: building the containers, provisioning the supporting services, wiring environment variables, and posting the URL back to the PR.

## How it works

Once the Previewkit GitHub App is installed on your repository, every `pull_request` event triggers the pipeline:

1. **Opened / synchronized / reopened** — Previewkit fetches the head commit, builds each app, provisions service recipes (Postgres, Redis, etc.), deploys to a dedicated Kubernetes namespace, and comments the preview URL on the PR.
2. **Closed** — Previewkit deletes the namespace and all resources tied to that PR, then updates the comment.

Each preview lives at `https://<app>-pr-<N>-<repo-slug>.preview.autonoma.app`. One PR may expose many apps under one preview — each app gets its own hostname.

## What you author

A single file: `.preview.yaml` at the repo root. It declares:

* **Apps** to build and deploy (each becomes a public HTTPS URL)
* **Services** the apps depend on (databases, caches, etc.), picked from a curated catalog of recipes
* **Hooks** that run after deploy (typical use: database migrations)
* **Environment variables**, with templates that resolve service hostnames at deploy time

See the [`.preview.yaml` reference](/previewkit/preview-yaml/) for the full schema.

## Minimal example

```yaml
version: 1


apps:
  - name: web
    path: ./apps/web
    port: 3000
    env:
      API_URL: "http://{{api.host}}:{{api.port}}"
    health_check: /health


  - name: api
    path: ./apps/api
    port: 4000
    env:
      DATABASE_URL: "postgresql://preview:preview@{{db.host}}:5432/preview"
    health_check: /health


services:
  - name: db
    recipe: postgres
    version: "16"


hooks:
  post_deploy:
    - app: api
      command: "npx prisma migrate deploy"
```

Open a PR with this file at the repo root and you’ll get a comment back with two URLs (one for `web`, one for `api`) within a few minutes.

## Builds work out of the box

Previewkit auto-detects how to build each app:

* If a `Dockerfile` exists in the app’s `path` (or you specify one explicitly), it builds with [BuildKit](https://github.com/moby/buildkit).
* Otherwise it falls back to [Railpack](https://railpack.com), which detects Node, Python, Go, Ruby, Rust, PHP, Java, and more, and produces a working image without you writing any Dockerfile.

Images are pushed to a private registry and pulled by the preview cluster. You never touch credentials.

## Secrets

Secrets that should NOT live in `.preview.yaml` (API keys, third-party tokens) are managed out-of-band via the REST API. They can be owner-scoped (every PR sees them) or PR-scoped (just this PR — useful for testing prod credentials in isolation). See [Secrets](/previewkit/secrets/).

## What’s next

* [Author your `.preview.yaml`](/previewkit/preview-yaml/) — full schema and examples
* [Manage secrets](/previewkit/secrets/) — REST API reference

# .preview.yaml Reference

> Complete schema for the Previewkit configuration file - apps, services, recipes, environment templating, and post-deploy hooks.

The `.preview.yaml` file at the root of your repository tells Previewkit how to build and deploy your stack for each pull request. This page documents every field.

## Top-level shape

```yaml
version: 1            # required, must be 1
domain: string?       # optional, override preview hostname suffix
registry: string?     # optional, override image registry


apps: [App]           # required, at least one
services: [Service]   # optional
hooks:
  post_deploy: [Hook] # optional
```

| Field      | Type   | Default                | Notes                                                                           |
| ---------- | ------ | ---------------------- | ------------------------------------------------------------------------------- |
| `version`  | `1`    | required               | Schema version. Only `1` is supported today.                                    |
| `domain`   | string | `preview.autonoma.app` | Overrides the hostname suffix. Wildcard DNS must be configured by the operator. |
| `registry` | string | platform default       | Container registry to push built images to. Usually leave unset.                |
| `apps`     | list   | required               | One or more applications. Each gets a public HTTPS URL.                         |
| `services` | list   | `[]`                   | Backing services from the recipe catalog (databases, caches, etc.).             |
| `hooks`    | object | `{}`                   | Lifecycle hooks. Today only `post_deploy` is supported.                         |

## Apps

Each entry under `apps` becomes a Kubernetes Deployment plus a public HTTPS hostname.

```yaml
apps:
  - name: api
    path: ./apps/api
    dockerfile: ./apps/api/Dockerfile   # optional
    build_args:                          # optional
      NODE_VERSION: "20"
    port: 4000
    env:
      DATABASE_URL: "postgresql://preview:preview@{{db.host}}:5432/preview"
    command: "node dist/server.js"      # optional, overrides image CMD
    health_check: /health                # optional
    replicas: 1
    resources:
      cpu: 500m
      memory: 512Mi
```

| Field              | Type                 | Default    | Notes                                                                                                                                                                                     |
| ------------------ | -------------------- | ---------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `name`             | string               | required   | Kubernetes-style name (lowercase letters, digits, hyphens). Also the leftmost label of the hostname.                                                                                      |
| `path`             | string               | `"."`      | Path to the build context, relative to the repo root.                                                                                                                                     |
| `dockerfile`       | string               | autodetect | Path to a Dockerfile (relative to repo root). If omitted, Previewkit looks for `Dockerfile` inside `path`; if none is found, [Railpack](https://railpack.com) auto-detects the framework. |
| `build_args`       | map\<string, string> | `{}`       | Build-time `--build-arg` values.                                                                                                                                                          |
| `port`             | integer              | required   | The container port the app listens on.                                                                                                                                                    |
| `env`              | map\<string, string> | `{}`       | Runtime environment variables. Supports [templating](#environment-templating).                                                                                                            |
| `command`          | string               | image CMD  | Optional shell command to override the image entrypoint. Wrapped in `/bin/sh -c`.                                                                                                         |
| `health_check`     | string               | none       | HTTP path for both readiness and liveness probes.                                                                                                                                         |
| `replicas`         | integer              | `1`        | Number of pod replicas.                                                                                                                                                                   |
| `resources.cpu`    | string               | `"250m"`   | CPU request (no limit).                                                                                                                                                                   |
| `resources.memory` | string               | `"256Mi"`  | Memory request and limit.                                                                                                                                                                 |
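
The defaults in the table can be read as a normalization pass over each `apps` entry. The sketch below is only an illustration of that reading, not Previewkit's implementation; the `AppConfig` type and the `withDefaults` helper are hypothetical names, and only a subset of the fields is modeled.

```typescript
// Hypothetical shape mirroring a subset of the fields in the table above.
interface AppConfig {
    name: string;
    port: number;
    path?: string;
    replicas?: number;
    resources?: { cpu?: string; memory?: string };
}

// Fill in the documented default for every optional field the author omitted.
function withDefaults(app: AppConfig) {
    const resources = app.resources ?? {};
    return {
        name: app.name,
        port: app.port,
        path: app.path ?? ".",
        replicas: app.replicas ?? 1,
        resources: {
            cpu: resources.cpu ?? "250m",
            memory: resources.memory ?? "256Mi",
        },
    };
}

// Only `cpu` is overridden here; everything else falls back to the table defaults.
const normalizedApi = withDefaults({ name: "api", port: 4000, resources: { cpu: "500m" } });
```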

### Resulting URL

For an app named `web` in PR #42 of `acme-corp/storefront`, the URL is:

```plaintext
https://web-pr-42-acme-corp-storefront.preview.autonoma.app
```

(Repo slugs are sanitized to fit DNS-label limits — long owner/repo names are truncated.)
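
To predict a preview URL in CI, something like the following sketch may help. The sanitization shown (lowercasing, replacing other characters with hyphens) is an assumption, and `slugifyPart` and `previewHostname` are hypothetical helpers. Previewkit's actual truncation rule is not documented here, so the sketch reports an over-long label instead of truncating it.

```typescript
// Assumption: slug parts are lowercased and any run of characters outside
// [a-z0-9-] becomes a single hyphen. The real sanitization may differ.
function slugifyPart(value: string): string {
    return value.toLowerCase().replace(/[^a-z0-9-]+/g, "-").replace(/^-+|-+$/g, "");
}

function previewHostname(app: string, pr: number, owner: string, repo: string): string {
    const label = `${app}-pr-${pr}-${slugifyPart(owner)}-${slugifyPart(repo)}`;
    if (label.length > 63) {
        // A DNS label holds at most 63 characters. Previewkit truncates the slug;
        // this sketch only reports the problem.
        throw new Error(`hostname label is ${label.length} characters; the DNS limit is 63`);
    }
    return `${label}.preview.autonoma.app`;
}

const exampleHost = previewHostname("web", 42, "acme-corp", "storefront");
```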

## Services

Services come from a curated recipe catalog. You don’t write Kubernetes manifests — you pick a recipe and Previewkit handles everything (Deployment, Service, persistent volume, readiness probes).

```yaml
services:
  - name: db
    recipe: postgres
    version: "16"
    env:
      POSTGRES_DB: app
    resources:
      cpu: 500m
      memory: 1Gi
```

| Field       | Type                 | Default        | Notes                                                           |
| ----------- | -------------------- | -------------- | --------------------------------------------------------------- |
| `name`      | string               | required       | Used by other apps to address this service (`{{<name>.host}}`). |
| `recipe`    | string               | required       | One of the recipes listed below.                                |
| `version`   | string               | recipe default | Image tag for the underlying service.                           |
| `env`       | map\<string, string> | `{}`           | Extra environment variables for the service container.          |
| `options`   | map                  | `{}`           | Recipe-specific config. See each recipe below.                  |
| `resources` | object               | `250m / 256Mi` | Same shape as app resources.                                    |

### Available recipes

| Recipe        | Default version  | Port | Notes                                                                                              |
| ------------- | ---------------- | ---- | -------------------------------------------------------------------------------------------------- |
| `postgres`    | `16-alpine`      | 5432 | Persistent volume attached. Connection: `postgresql://preview:preview@{{name.host}}:5432/preview`. |
| `redis`       | `7-alpine`       | 6379 | No persistence. Connection: `redis://{{name.host}}:6379`.                                          |
| `valkey`      | `7-alpine`       | 6379 | Open-source Redis fork. Same connection shape.                                                     |
| `temporal`    | (recipe-default) | 7233 | Local Temporal cluster for workflow testing.                                                       |
| `api-gateway` | `1.27-alpine`    | 80   | Nginx-based router that fans requests to multiple apps. Requires `options.routes`.                 |

#### `api-gateway` options

```yaml
services:
  - name: gateway
    recipe: api-gateway
    options:
      client_max_body_size: 25m
      routes:
        - path: /api
          target: api
          strip_prefix: true
        - path: /
          target: web
```

Each route has `path` (URL prefix to match), `target` (app or service name to forward to), `strip_prefix` (boolean), and `rewrite` (optional path rewrite).
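
As a mental model for how the routes above behave, here is a small sketch. It assumes the longest matching `path` prefix wins and that `strip_prefix` removes the matched prefix before forwarding; both are assumptions about the gateway, `matchRoute` is a hypothetical helper, and `rewrite` is left out.

```typescript
interface Route {
    path: string;
    target: string;
    strip_prefix?: boolean;
}

// Pick the route whose `path` is the longest prefix of the request path.
function matchRoute(routes: Route[], requestPath: string) {
    let best: Route | undefined;
    for (const route of routes) {
        if (requestPath.startsWith(route.path) && (!best || route.path.length > best.path.length)) {
            best = route;
        }
    }
    if (!best) return undefined;
    let forwarded = requestPath;
    if (best.strip_prefix) {
        // Drop the matched prefix, keeping a leading slash on what remains.
        forwarded = requestPath.slice(best.path.length) || "/";
        if (!forwarded.startsWith("/")) forwarded = "/" + forwarded;
    }
    return { target: best.target, path: forwarded };
}

const gatewayRoutes: Route[] = [
    { path: "/api", target: "api", strip_prefix: true },
    { path: "/", target: "web" },
];
const apiMatch = matchRoute(gatewayRoutes, "/api/users");
const webMatch = matchRoute(gatewayRoutes, "/about");
```

Under these assumptions, `/api/users` is forwarded to `api` as `/users`, and `/about` is forwarded to `web` unchanged.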

## Hooks

```yaml
hooks:
  post_deploy:
    - app: api
      command: "npx prisma migrate deploy"
    - app: api
      command: "node scripts/seed.js"
```

`post_deploy` runs after every app is deployed and at least one pod for the target app is `Running`. Each step is executed via the Kubernetes API exec subresource (no `kubectl` needed), wrapped in `/bin/sh -c`. Steps run sequentially; the first failure aborts the rest.

| Field     | Type   | Notes                                                            |
| --------- | ------ | ---------------------------------------------------------------- |
| `app`     | string | Name of an app declared in `apps`. The hook runs inside its pod. |
| `command` | string | Shell command to execute.                                        |
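
The ordering and abort behavior described above can be sketched as follows. This is an illustration rather than Previewkit's code: `runPostDeploy` is a hypothetical name, and the `exec` callback stands in for the Kubernetes exec call, returning the command's exit code.

```typescript
interface HookStep {
    app: string;
    command: string;
}

// Run each step in order, wrapped in `/bin/sh -c`. The first non-zero exit
// code aborts the remaining steps.
function runPostDeploy(
    steps: HookStep[],
    exec: (app: string, argv: string[]) => number,
): string[] {
    const completed: string[] = [];
    for (const step of steps) {
        const exitCode = exec(step.app, ["/bin/sh", "-c", step.command]);
        if (exitCode !== 0) {
            throw new Error(`post_deploy step failed on ${step.app}: ${step.command}`);
        }
        completed.push(step.command);
    }
    return completed;
}
```

Because a failed migration stops the seed step from running, order hooks so that later steps depend on earlier ones, not the other way around.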

## Environment templating

The `env` map on apps (and on services) supports two kinds of placeholders that resolve at deploy time:

### Service references

`{{<name>.host}}` and `{{<name>.port}}` resolve to the in-namespace DNS name and port of another app or service.

```yaml
apps:
  - name: web
    env:
      API_URL: "http://{{api.host}}:{{api.port}}"

  - name: api
    env:
      DATABASE_URL: "postgresql://preview:preview@{{db.host}}:5432/preview"
      REDIS_URL: "redis://{{cache.host}}:{{cache.port}}"

services:
  - name: db
    recipe: postgres
  - name: cache
    recipe: redis
```

`{{<name>.host}}` is always the bare app or service name (in-cluster DNS). `{{<name>.port}}` is the recipe’s known port for services, or the declared `port` for apps.

Unknown names raise a deploy error with the list of valid names.
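
A rough sketch of that resolution logic, for readers who want to test their templates locally. `resolveServiceRefs` and the `Endpoint` type are hypothetical, and the exact placeholder grammar Previewkit accepts (whitespace, escaping) is an assumption.

```typescript
interface Endpoint {
    host: string;
    port: number;
}

// Replace every {{<name>.host}} / {{<name>.port}} placeholder in `value`.
// An unknown name raises an error that lists the valid names.
function resolveServiceRefs(value: string, endpoints: Record<string, Endpoint>): string {
    return value.replace(/\{\{([a-z0-9-]+)\.(host|port)\}\}/g, (_match, name: string, field: string) => {
        if (!Object.prototype.hasOwnProperty.call(endpoints, name)) {
            const valid = Object.keys(endpoints).join(", ");
            throw new Error(`unknown name "${name}" in template; valid names: ${valid}`);
        }
        const endpoint = endpoints[name];
        return field === "host" ? endpoint.host : String(endpoint.port);
    });
}

const knownEndpoints: Record<string, Endpoint> = {
    api: { host: "api", port: 4000 },  // app: declared `port`
    db: { host: "db", port: 5432 },    // service: recipe port
};
const resolvedUrl = resolveServiceRefs("http://{{api.host}}:{{api.port}}", knownEndpoints);
```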

### Context variables

Three variables describe the current PR context:

| Variable        | Value                                                  |
| --------------- | ------------------------------------------------------ |
| `{{pr}}`        | PR number (e.g. `42`)                                  |
| `{{namespace}}` | Kubernetes namespace (`preview-<owner>-<repo>-pr-<N>`) |
| `{{owner}}`     | Repo owner (e.g. `acme-corp`)                          |

```yaml
apps:
  - name: api
    env:
      ENV_LABEL: "pr-{{pr}}-{{owner}}"
      S3_PREFIX: "previews/{{namespace}}/"
```

## Full example

```yaml
version: 1

apps:
  - name: web
    path: ./apps/web
    port: 3000
    env:
      API_URL: "http://{{api.host}}:{{api.port}}"
      DATABASE_URL: "postgresql://preview:preview@{{db.host}}:5432/preview"
    health_check: /health

  - name: api
    path: ./apps/api
    dockerfile: ./apps/api/Dockerfile
    port: 4000
    env:
      DATABASE_URL: "postgresql://preview:preview@{{db.host}}:5432/preview"
      REDIS_URL: "redis://{{cache.host}}:6379"
      ENV_LABEL: "pr-{{pr}}"
    health_check: /health
    resources:
      cpu: 500m
      memory: 512Mi

services:
  - name: db
    recipe: postgres
    version: "16"

  - name: cache
    recipe: redis

hooks:
  post_deploy:
    - app: api
      command: "npx prisma migrate deploy"
```

## Validation tips

* Names must match `^[a-z0-9][a-z0-9-]*[a-z0-9]$` (Kubernetes-compatible).
* At least one `app` is required.
* `version` must be exactly `1`.
* Service references in `env` must match a name declared elsewhere in the same file.
* Hostnames combine `<app>-pr-<N>-<repo-slug>` and must fit within the 63-character DNS label limit, so keep app names short.
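
The first four rules can be checked locally before pushing. The sketch below is one possible pre-commit check under stated assumptions: `validateConfig` and `PreviewConfig` are hypothetical names, the config is assumed to be already parsed from YAML, and the hostname-length rule is omitted because it depends on the repo slug.

```typescript
interface PreviewConfig {
    version: number;
    apps: { name: string; env?: Record<string, string> }[];
    services?: { name: string }[];
}

const NAME_PATTERN = /^[a-z0-9][a-z0-9-]*[a-z0-9]$/;

// Return a list of human-readable problems; an empty list means the checks passed.
function validateConfig(config: PreviewConfig): string[] {
    const errors: string[] = [];
    if (config.version !== 1) errors.push("version must be exactly 1");
    if (config.apps.length === 0) errors.push("at least one app is required");

    const names = config.apps.map(a => a.name).concat((config.services ?? []).map(s => s.name));
    for (const candidate of names) {
        if (!NAME_PATTERN.test(candidate)) errors.push(`invalid name: ${candidate}`);
    }

    // Every {{<name>.host}} / {{<name>.port}} must point at a declared app or service.
    for (const app of config.apps) {
        const env = app.env ?? {};
        for (const key of Object.keys(env)) {
            const refs = env[key].match(/\{\{([a-z0-9-]+)\.(?:host|port)\}\}/g) ?? [];
            for (const ref of refs) {
                const target = ref.slice(2, ref.indexOf("."));
                if (names.indexOf(target) === -1) {
                    errors.push(`${app.name}.env.${key} references unknown name: ${target}`);
                }
            }
        }
    }
    return errors;
}
```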

# Secrets

> How to manage credentials, API keys, and other sensitive values for your Previewkit environments.

Anything you wouldn’t commit to your repo - API keys, database URLs, signed tokens - should not live in `.preview.yaml`. Manage it through the Previewkit API instead. Every key you upload is mounted into your running app as an environment variable; your code just reads `process.env.STRIPE_API_KEY` and gets the value.

## Managing secrets

```plaintext
GET    /v1/secrets/:applicationId/:app                # list keys (no values)
PUT    /v1/secrets/:applicationId/:app                # batch upsert; body: {"items":[{"key","value"},...]}
PUT    /v1/secrets/:applicationId/:app/:key           # single upsert; body: {"value":"..."}
DELETE /v1/secrets/:applicationId/:app/:key           # delete one key
```

`applicationId` is your Autonoma Application row id. Look it up once via the dashboard and hardcode it in your CI. `app` matches the `name:` field of an app inside `.preview.yaml`. For a single-app repo it’s just that one name; for a monorepo each app has its own bundle.

### Authentication

Every call needs an `Authorization: Bearer <api-key>` header. Create an API key from the Autonoma dashboard (Settings → API keys); keys are scoped to your organization, so they can only see and modify your own applications’ secrets. Treat each key like a password.

```bash
export PREVIEWKIT_API_KEY="ak_live_..."

# Batch upsert
curl -X PUT "https://previewkit.autonoma.app/v1/secrets/app_abc123/web" \
  -H "Authorization: Bearer $PREVIEWKIT_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"items":[{"key":"STRIPE_API_KEY","value":"sk_live_..."},{"key":"SENTRY_DSN","value":"https://..."}]}'

# Single key upsert
curl -X PUT "https://previewkit.autonoma.app/v1/secrets/app_abc123/web/STRIPE_API_KEY" \
  -H "Authorization: Bearer $PREVIEWKIT_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"value":"sk_live_..."}'

# List keys (names only, never values)
curl "https://previewkit.autonoma.app/v1/secrets/app_abc123/web" \
  -H "Authorization: Bearer $PREVIEWKIT_API_KEY"

# Delete
curl -X DELETE "https://previewkit.autonoma.app/v1/secrets/app_abc123/web/STRIPE_API_KEY" \
  -H "Authorization: Bearer $PREVIEWKIT_API_KEY"
```

Calls without a valid Bearer token get a 401. Calls referencing an `applicationId` your key doesn’t have access to are indistinguishable from “no secrets yet” - the API never reveals whether a foreign application exists.

Updates take effect on the next preview deploy for that app.
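
If you would rather script uploads from Node than from `curl`, the batch upsert can be expressed as a plain request object. This is a sketch only: `buildBatchUpsert` is a hypothetical helper, and it builds the request without sending it so the shape can be inspected first.

```typescript
interface SecretItem {
    key: string;
    value: string;
}

// Build the batch-upsert request described above. Nothing is sent here;
// pass `url` and `init` to fetch() to perform the call.
function buildBatchUpsert(applicationId: string, app: string, items: SecretItem[], apiKey: string) {
    return {
        url: `https://previewkit.autonoma.app/v1/secrets/${encodeURIComponent(applicationId)}/${encodeURIComponent(app)}`,
        init: {
            method: "PUT",
            headers: {
                Authorization: `Bearer ${apiKey}`,
                "Content-Type": "application/json",
            },
            body: JSON.stringify({ items }),
        },
    };
}

const upsertRequest = buildBatchUpsert(
    "app_abc123",
    "web",
    [{ key: "SENTRY_DSN", value: "https://..." }],
    "ak_live_...",
);
// To send it: await fetch(upsertRequest.url, upsertRequest.init)
```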

## Build-time secrets (`build_secrets`)

`NEXT_PUBLIC_*` values for Next.js, `VITE_*` values for Vite, anything else baked into a client bundle at compile time - these need to be present during `next build` / `vite build`, not just at runtime. List them in `build_secrets:` inside `.preview.yaml` and Previewkit will pass them to your builder:

```yaml
apps:
  - name: web
    port: 3000
    build_secrets:
      - NEXT_PUBLIC_FIREBASE_API_KEY
      - NEXT_PUBLIC_FIREBASE_PROJECT_ID
      - NEXT_PUBLIC_STRIPE_PUBLISHABLE_KEY
```

Each name must already be a key you’ve uploaded via the API. The build fails fast with a clear error if a listed key isn’t there.

Server-only secrets (those your running pod reads via `process.env`) do NOT need to be in `build_secrets` - the runtime mount already covers them. Listing them anyway is harmless but verbose.
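
A CI job can catch a missing build secret before the build does by comparing `build_secrets` against the key names returned by the list endpoint. The helpers below are hypothetical and assume both lists have already been loaded into arrays.

```typescript
// Return the build_secrets entries that have not been uploaded yet.
function missingBuildSecrets(buildSecrets: string[], uploadedKeys: string[]): string[] {
    return buildSecrets.filter(name => uploadedKeys.indexOf(name) === -1);
}

// Throw with the full list of missing names, mirroring the fail-fast behavior.
function assertBuildSecrets(buildSecrets: string[], uploadedKeys: string[]): void {
    const missing = missingBuildSecrets(buildSecrets, uploadedKeys);
    if (missing.length > 0) {
        throw new Error(`build_secrets not found among uploaded keys: ${missing.join(", ")}`);
    }
}
```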

## Overrides committed to git

If you also define a key in `.preview.yaml`’s `env:` map, the value there wins over the uploaded one. Use this for behaviour switches that must pass code review:

```yaml
apps:
  - name: api
    port: 4000
    env:
      # A wrong API edit can't silently flip a preview into "talk to live banking".
      PLAID_ENV: "sandbox"
      SEND_EMAILS_LOCALLY: "false"
```

Template substitutions (`{{api.host}}`, `{{pr}}`, etc.) inside `env:` resolve the same way - see [environment templating](/previewkit/preview-yaml/#environment-templating).
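
The precedence rule in this section amounts to a merge where the committed map is applied last. A minimal sketch, with `effectiveEnv` as a hypothetical name and template resolution left out:

```typescript
// Values committed in .preview.yaml win over values uploaded through the API.
function effectiveEnv(
    uploaded: Record<string, string>,
    committed: Record<string, string>,
): Record<string, string> {
    return { ...uploaded, ...committed };
}

const mergedEnv = effectiveEnv(
    { PLAID_ENV: "production", STRIPE_API_KEY: "sk_test_..." }, // uploaded via the API
    { PLAID_ENV: "sandbox" },                                   // committed in .preview.yaml
);
// mergedEnv.PLAID_ENV is "sandbox"; STRIPE_API_KEY is untouched.
```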

## What goes where

| Value type                                                                                         | Where it lives                                                                                                     |
| -------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------ |
| Third-party API keys, database URLs, signed tokens                                                 | Previewkit API                                                                                                     |
| `NEXT_PUBLIC_*` / `VITE_*` baked into a client bundle                                              | Previewkit API, also listed in `build_secrets`                                                                     |
| In-cluster service URLs (`{{db.host}}`, `{{api.host}}`)                                            | `.preview.yaml` env - resolved automatically, no upload needed                                                     |
| PR / owner / namespace metadata (`{{pr}}`, `{{owner}}`, `{{namespace}}`)                           | `.preview.yaml` env - resolved automatically, no upload needed                                                     |
| Behaviour switches that should pass code review (`PLAID_ENV=sandbox`, `SEND_EMAILS_LOCALLY=false`) | `.preview.yaml` env - keep in git so a wrong API edit can’t silently flip a preview into “talk to production” mode |
| Anything non-sensitive that varies between environments                                            | `.preview.yaml` env - reviewable in git                                                                            |

If you’re unsure, default to the Previewkit API. You only need to think about `build_secrets` when a value must be present *during* the build (the client-bundle case above).

# E2E Test Planner

> A Claude Code plugin that analyzes your codebase and produces a complete E2E test suite, plus the infrastructure to run it.

The Test Planner is a **Claude Code plugin** that takes any web application codebase and produces a complete, production-ready E2E test suite in six steps. Install it once, run a single command, and the plugin handles orchestration, deterministic validation gates, and user confirmation checkpoints automatically. The final output is a set of test files ready to upload to Autonoma, backed by a live Environment Factory endpoint that creates isolated test data for every run.

## Installation

Add the plugin marketplace:

```bash
/plugin marketplace add Autonoma-AI/test-planner-plugin
```

Then install the plugin from it:

```bash
/plugin install autonoma-test-planner@autonoma
```

Then run the plugin:

```bash
/autonoma-test-planner:generate-tests
```

Support for OpenAI Codex and OpenCode is coming soon.

The plugin runs each step in an **isolated subagent** and validates every output with **deterministic shell-script validation** (Python + YAML parsing) — not LLM-based checks. If validation fails, the agent sees the exact error and must fix it before proceeding. PostToolUse hooks validate every file write automatically, and cross-file consistency checks ensure outputs like INDEX.md test counts match features.json feature counts. A hard validation gate prevents test files from being written until the scenario lifecycle has been proven end-to-end.

## Before you start

The plugin runs against your **frontend codebase** to discover pages, flows, and UI patterns, and against your **backend codebase** to audit entity creation paths and set up the Environment Factory.

Before you start, make sure you have:

* access to the frontend codebase

* access to the backend codebase (can be the same monorepo)

* these environment variables in the Claude Code session:

  * `AUTONOMA_API_KEY`
  * `AUTONOMA_PROJECT_ID`
  * `AUTONOMA_API_URL`

If your frontend and backend are in the same monorepo, you’re all set. If they’re in separate repositories, make sure the agent has access to both.

## The six steps

### Step 1 — Generate a knowledge base

The agent analyzes your frontend codebase and produces `AUTONOMA.md`: a user-perspective guide to every page, flow, and interaction in your application.

**Consumes:** Your frontend codebase. **Produces:** `autonoma/AUTONOMA.md` + `autonoma/features.json`

### Step 2 — Entity creation audit

The agent inspects your backend codebase to find the canonical creation function for every database model — typically a service, repository, or similar helper. Models with dedicated creation code get factories in the Environment Factory; models without fall back to raw SQL INSERT. The audit records observed side effects (password hashing, slug generation, external API calls) so you can see why each factory matters.

**Consumes:** Knowledge base + your backend codebase. **Produces:** `autonoma/entity-audit.md`

### Step 3 — Generate test data scenarios

The agent reads the database schema directly from your backend and designs three named test data environments: `standard` (realistic variety for most tests), `empty` (for zero-state testing), and `large` (for pagination and performance). `scenarios.md` records concrete values and relationships, plus schema metadata and generated-value placeholders (`{{token}}`) for fields that must vary across runs.

**Consumes:** Knowledge base + entity audit + your backend codebase. **Produces:** `autonoma/scenarios.md`

### Step 4 — Implement the Environment Factory

The agent installs the Autonoma SDK in your backend, configures the handler, and registers a factory for every model marked `independently_created: true` in the audit — calling your real service/repository function so test data flows through the same business logic as production data. Models without dedicated creation code fall back to raw SQL INSERT automatically. A `discover` smoke test plus a factory-integrity check confirm the handler is wired correctly before handing off to Step 5.

**Consumes:** Entity audit + scenarios + your backend codebase. **Produces:** A working Environment Factory endpoint + `autonoma/.endpoint-implemented`

### Step 5 — Validate scenario lifecycle

The agent runs the full `discover` → `up` → `down` lifecycle against every scenario, iterating up to five times to fix handler bugs or reconcile `scenarios.md` with reality. Once every scenario passes, it emits `autonoma/scenario-recipes.json`, runs a deterministic preflight check, and uploads the validated recipes to the Autonoma dashboard. The `.endpoint-validated` sentinel this step writes is what unlocks Step 6.

**Consumes:** Endpoint from Step 4 + scenarios from Step 3. **Produces:** `autonoma/scenario-recipes.json` + `autonoma/.scenario-validation.json` + `autonoma/.endpoint-validated` + uploaded recipes on the dashboard.

The upload contract for `scenario-recipes.json` is documented in the [Scenario Recipe Schema reference](/reference/scenario-recipe-schema/).

### Step 6 — Generate E2E tests

The agent produces an exhaustive set of test cases as natural language markdown files, using the **validated, reconciled scenarios** from Step 5. Tests are distributed across tiers: core flows get 50-60% of coverage, supporting flows get the rest. The suite includes happy paths, input validation, state persistence, navigation, and cross-flow journey tests. Variable-field tokens are referenced symbolically so the test runner can substitute real values at execution time. An adversarial review agent then runs over the generated suite to find gaps.

**Consumes:** Knowledge base + validated scenarios + scenario recipes. **Produces:** `autonoma/qa-tests/` directory with test files + `INDEX.md`

## How the steps connect

* **Step 1** output feeds every later step — the knowledge base tells subsequent agents what the app is and what the core flows are
* **Step 2** output feeds Step 4 — the entity audit decides which models get factories and which fall back to SQL
* **Step 3** output feeds Steps 5 and 6 — scenarios define what data to create and what tests assert against
* **Step 4** produces the endpoint Step 5 validates against; it does NOT run `up`/`down` itself
* **Step 5** is the critical gate — it proves the scenarios actually work against a real database. Its sentinel unlocks Step 6. If validation fails, Step 6 is blocked at the hook level.
* **Step 6** consumes the (possibly reconciled) scenarios as the source of truth for test data

## How validation works

Unlike prompt-based validation, the Test Planner uses **deterministic shell-script validators** at every step:

* **PostToolUse hooks** run after every file write, catching structural issues (missing frontmatter, invalid YAML, wrong file locations) immediately
* **Step-level validators** run Python and YAML parsing scripts to verify the complete output before the next step begins
* **Cross-file consistency checks** ensure inter-file references are correct — for example, INDEX.md test counts must match the actual number of test files, and `features.json` feature counts must align with the knowledge base
* **Preflight** on `scenario-recipes.json` verifies every scenario has a recipe, every token is declared, and the tree roots at the scope entity
* **The validation gate** blocks test-file writes until `autonoma/.endpoint-validated` exists

If any validation fails, the agent receives the exact error message and must fix the issue before the plugin allows it to proceed. You never end up with a broken intermediate output feeding into the next step.
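
The validation gate in particular is simple enough to sketch. The plugin's real hook is a shell script; the TypeScript below only illustrates the decision it makes, with `canWrite` as a hypothetical name and the set of existing files passed in explicitly.

```typescript
// Allow a write unless it targets the test directory before the sentinel exists.
function canWrite(targetPath: string, existingFiles: string[]): boolean {
    const isTestFile = targetPath.startsWith("autonoma/qa-tests/");
    const validated = existingFiles.indexOf("autonoma/.endpoint-validated") !== -1;
    return !isTestFile || validated;
}

const blockedWrite = canWrite("autonoma/qa-tests/login.md", ["autonoma/scenarios.md"]);
const allowedWrite = canWrite("autonoma/qa-tests/login.md", ["autonoma/.endpoint-validated"]);
```

Writes outside `autonoma/qa-tests/` are unaffected by the gate in this sketch, which matches the description of it blocking test-file writes only.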

## Review checkpoints

The plugin pauses after each of Steps 1–5 and asks for your review before the next step consumes the output. Step 6 has no downstream step, but its output deserves the same review. These checkpoints are not optional: getting them right determines the quality of the final test suite.

| After step | What to review                                                                     | Why it matters                                                                                                                                            |
| ---------- | ---------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Step 1     | Core flows identified                                                              | Determines 50-60% of test coverage weight. Wrong core flows = poorly prioritized tests.                                                                   |
| Step 2     | Entity audit — factory vs raw SQL classification and identified creation functions | Decides which models run your real business logic during tests. Wrong function = tests that bypass important side effects.                                |
| Step 3     | Scenario entity data + variable fields                                             | Fixed values become direct assertions; variable values become tokens. Wrong names, counts, relationships, or variable markings = brittle tests.           |
| Step 4     | SDK implementation plan (endpoint location, factories, auth callback)              | Ensures the backend integration, secrets, and factory wiring are correct before code is written.                                                          |
| Step 5     | Validation results + any edits made to `scenarios.md` + uploaded recipes           | Confirms the scenarios are feasible against your real database. Any agent edits to scenarios mean the original design missed something — worth reviewing. |
| Step 6     | Test distribution + journey/critical test samples                                  | Confirms assertions reference real UI text, not vague descriptions. Validates coverage weight across tiers.                                               |

Each step page explains what to look for and why it matters in detail.

# Factory fidelity rubric

> Semantic rubric used by the plugin to verify that each factory faithfully reproduces the side effects the Step 2 audit recorded.

This page is fetched at runtime by the plugin’s `.endpoint-implemented` hook. The hook spawns one `claude -p` subprocess per audited model, passing this rubric + the prompt template below. The model returns a JSON verdict. If any model fails, the sentinel is blocked and the compiled feedback is returned to the env-factory agent so it can fix itself.

Edit this page to tune the rubric — the plugin refetches it on every run, so changes take effect without cutting a new plugin version. **Keep every example in this page generic** (use placeholders like `<Model>`, `<ModelService>`, `src/<domain>/<domain>.service.ts`). The rubric is consumed by projects in many languages and ORMs; codebase-specific names would bias the validator toward projects that happen to share those names.

## Scope — which models this rubric judges

The entity audit classifies every model with two orthogonal fields:

* `independently_created: true` means the codebase has a standalone creation path (service, repository, framework hook) for this model. These get their own factory in the handler and are the only models this rubric judges.
* `created_by: [{owner, via, why}]` lists the other models whose creation flows produce this one as a side effect. Pure dependents (`independently_created: false`) never have factories of their own — they fall back to raw SQL or come along with their owner’s factory — so the rubric skips them entirely.

A **dual** model (`independently_created: true` AND non-empty `created_by`) has both a standalone path and an owner that mints it inline. The rubric judges only the standalone path; the via-owner path is the scenario generator’s concern. A dual model’s factory is faithful iff the standalone path is faithful.
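
The scoping rule can be summarized in a few lines. The field names below come from the audit; the `ModelKind` labels other than `dual`, and the `classify` and `isJudged` helpers, are hypothetical names used for illustration.

```typescript
interface CreatedBy {
    owner: string;
    via: string;
    why: string;
}

interface AuditEntry {
    independently_created: boolean;
    created_by: CreatedBy[];
}

type ModelKind = "standalone" | "dual" | "dependent";

function classify(entry: AuditEntry): ModelKind {
    if (!entry.independently_created) return "dependent";
    return entry.created_by.length > 0 ? "dual" : "standalone";
}

// The rubric judges a model only when it has a standalone creation path,
// which covers both "standalone" and "dual" models.
function isJudged(entry: AuditEntry): boolean {
    return entry.independently_created;
}
```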

## Rubric

A factory for a model is **faithful** if and only if **every** one of the following criteria passes. Any single failure is a hard fail.

### Criterion 1 — Uses the codebase’s creation path, not raw ORM access

The failure mode this catches: the factory body (or a helper it imports) contains a direct database write — `db.<model>.create(...)`, `prisma.<model>.create(...)`, `tx.insert(<table>)`, `<Model>.create(...)` on an ActiveRecord class, `session.add(...)` + `session.commit()`, a raw `INSERT INTO ... VALUES ...`, etc. — **while the application already has a named service / repository / controller / helper function that performs the creation with its full business logic.**

The factory’s call chain, starting at the `create(data, ctx)` body in the handler and following every import one level deep, must reach the `creation_function` named in the **Step 2 audit snapshot** (not the current audit). The Step 2 snapshot is ground truth — it was captured before the factory was written and names the function in the application codebase that performs the real creation.

A factory “uses the codebase’s creation path” if it calls the Step 2 `creation_function` directly, or calls a one-line wrapper that calls it. A factory “uses raw ORM access” if the only write observable in its call chain is a database operation with no business-logic wrapper.

**Framework-hook carve-out (`needs_extraction: true`).** When Step 2 recorded `needs_extraction: true` and `extracted_to: <path>`, the `creation_function` is a framework hook or route closure (Better-Auth `databaseHooks.*`, NextAuth callbacks, Devise callbacks, inline route handlers) that the factory **cannot** call directly — it only runs when the framework’s own entry point is invoked. For these models, Criterion 1 passes iff **(a)** the factory calls the function at `extracted_to` (the lifted named export), AND **(b)** that function’s body reproduces the hook’s call chain (same sibling services / events / analytics). The factory MUST NOT call the hook name itself, and MUST NOT use raw `db.<model>.create(...)`. A raw write in this case fails both Criterion 1 and Criterion 4.

* **PASS** — Factory calls `<Model>Service.create(...)` (from `src/<domain>/<domain>.service.ts`) which is exactly the function Step 2 named and performs whatever hashing / derivation / sibling writes / external calls the service performs in production.
* **PASS (`needs_extraction: true`)** — Factory calls the `extracted_to` function and that function’s body contains the same sibling writes / external calls as the original hook.
* **FAIL** — Factory (or its helper) contains `db.<model>.create({ data })` or equivalent raw write, and the Step 2 audit named a service / repository / controller method that is never invoked.
* **FAIL** — Factory imports a freshly-written helper whose body is just `return db.<model>.create({ data })`. The Step 2 function in the application codebase is bypassed.

### Criterion 2 — Preserves every side effect the audit recorded

The Step 2 audit entry for each model includes a `side_effects:` list. Every item on that list must be reproduced by the call chain, either directly (same function is called) or through an equivalent path (a helper that invokes the same downstream code).

Side effects commonly include:

* Writes to sibling tables (e.g. `Organization` / `Member` / `BillingCustomer` when creating a `User` — actual names vary per project)
* Hashing or cryptographic operations (password hashing, API key hashing, token signing)
* External service calls (analytics events, Slack / email / webhook / GitHub / Stripe)
* State-machine transitions (onboarding advancement, setup status, lifecycle flags)
* Derived field generation (slugs, tokens, refs, search vectors)

The factory MAY omit a side effect only if the audit’s `side_effects` list explicitly marks it as skippable. The older “sibling factory escape hatch” is gone — if a side effect genuinely belongs to another model, the audit must record that via `created_by` on the sibling, not via a comment in this factory.

* **PASS** — Factory’s call chain reaches the Step 2 `creation_function`, which invokes the sibling-write / hashing / external-call helpers named in `side_effects`.
* **FAIL** — Helper file contains a comment admitting missing side effects (“we replicate that logic here without the external side effects”, “no business logic beyond the raw insert”, “skipping the hooks for the test env”) — this is explicit admission that side effects were dropped.
* **FAIL** — A side effect from the audit’s list is missing from the call chain and the audit does not explicitly mark it as skippable.

### Criterion 3 — `creation_file` in the current audit matches the Step 2 snapshot

For every model with `independently_created: true`, the current `creation_file` must equal the Step 2 snapshot’s `creation_file`. The Step 2 audit is a statement about the **existing codebase** — it cannot be repointed at a file the agent wrote for the factory. If Branch 1 extraction is used, the agent should add an `extracted_to:` field; it MUST NOT overwrite `creation_file`.

* **PASS** — `creation_file` unchanged between snapshot and current.
* **FAIL** — `creation_file` changed from a path in the application codebase (e.g. `src/<domain>/<domain>.service.ts`) to a path inside the factory / handler directory (e.g. `src/<handler-dir>/<factories-file>.ts`).
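
Criterion 3 is the most mechanical of the four and can be pictured as a comparison of two audit maps. The sketch below is illustrative: `criterion3Failures` is a hypothetical name, the audits are assumed to be parsed into records keyed by model name, and treating a model that is missing from the current audit as a failure is an assumption.

```typescript
interface AuditRecord {
    independently_created: boolean;
    creation_file: string;
    extracted_to?: string;
}

// Return the models whose creation_file differs between the Step 2 snapshot
// and the current audit. Adding extracted_to alone is not a failure.
function criterion3Failures(
    snapshot: Record<string, AuditRecord>,
    current: Record<string, AuditRecord>,
): string[] {
    const failures: string[] = [];
    for (const model of Object.keys(snapshot)) {
        const before = snapshot[model];
        if (!before.independently_created) continue;
        const after = Object.prototype.hasOwnProperty.call(current, model) ? current[model] : undefined;
        if (!after || after.creation_file !== before.creation_file) {
            failures.push(model);
        }
    }
    return failures;
}
```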

### Criterion 4 — No raw-write helpers masquerading as extractions

If the factory imports a helper (Branch 1 extraction), the helper MUST either:

1. Call the Step 2 `creation_function` directly (thin wrapper), or
2. Be the Step 2 `creation_function` itself (the file was renamed/moved but the function is the original code).

A helper that contains only a raw database write (`db.<model>.<create|insert|upsert>(...)`, `tx.insert(<table>)`, `<Model>.create(...)` on an ORM class, raw `INSERT` SQL, etc.) with no other business logic is a **raw-write helper**, not an extraction, and fails this criterion.

Branch 1 extraction is a refactor of the application codebase — it lifts inline logic out of a closure/hook into a named export and wires the original HTTP caller to call it. The extracted function keeps every side effect. A helper created fresh inside the factory directory that only wraps the ORM has not extracted anything.

* **PASS** — Helper is a thin wrapper, e.g. a one-line `return <realService>.create(data)` that calls the Step 2 function.
* **PASS** — Helper IS the Step 2 function (the file was moved during Branch 1; the body is unchanged and still calls every sibling helper).
* **FAIL** — Helper body is `return db.<model>.create({ data: {...} })` with no call to any service / repository / controller named in the Step 2 audit.

## Reference examples — `defineFactory`

These examples are deliberately generic so they apply to any codebase. Read them as templates for the shape a faithful factory takes versus the shapes that fail each criterion.

### Good — calls the existing service (Branch 2, no extraction needed)

```ts
// handler file — imports the service that already exists in the codebase.
import { UserService } from "../../users/user.service";


export const factories = {
    User: defineFactory({
        async create(data, ctx) {
            // UserService.create is the Step 2 creation_function. It hashes
            // passwords, provisions Org + Member + Billing rows, fires
            // signup analytics, and returns the created user.
            return UserService.create(data, { executor: ctx.executor });
        },
    }),
};
```

### Good — thin wrapper after Branch 1 extraction

```ts
// src/auth/create-user.ts — lifted OUT of the Better-Auth hook closure so the
// factory can reuse the same code path. Original hook now calls this too.
// Extracted from the databaseHooks.user.create closure for Environment Factory
// reuse (preserves Org + Member + billing provisioning). See
// autonoma/entity-audit.md.
export async function createUser(input: CreateUserInput, deps: AuthDeps) {
    const user = await deps.db.user.create({ data: { ...input, password: hash(input.password) } });
    await ensureOrgMembership(user, deps);
    await ensureBillingProvisioning(user, deps);
    await deps.analytics.capture("user_signed_up", { userId: user.id });
    return user;
}


// handler file
import { createUser } from "../../auth/create-user";


export const factories = {
    User: defineFactory({
        async create(data, ctx) {
            return createUser(data, { db: ctx.executor, analytics, billing });
        },
    }),
};
```

### Bad — raw ORM in the factory body (fails Criterion 1)

```ts
import { db } from "../../db";


export const factories = {
    User: defineFactory({
        async create(data, ctx) {
            // WRONG — bypasses UserService.create (Step 2 creation_function).
            // Password is not hashed. Org / Member / Billing rows are never
            // created. Every downstream test that reads them will break.
            return db.user.create({ data });
        },
    }),
};
```

### Bad — raw-write helper masquerading as an extraction (fails Criterion 4)

```ts
// src/<handler-dir>/factories-helpers.ts — created for the factory only.
// The comment is the tell: it documents dropping side effects instead of
// preserving them.
export async function createUser(db, data) {
    // better-auth's internal adapter does the same thing — no business logic
    // beyond the raw insert.
    return db.user.create({ data, select: { id: true } });
}


// handler file
import { createUser } from "./factories-helpers";


export const factories = {
    User: defineFactory({
        async create(data, ctx) {
            return createUser(ctx.executor, data); // fails Criterion 4
        },
    }),
};
```

### Bad — audit rewrite (fails Criterion 3)

The Step 2 snapshot recorded `creation_file: src/auth/auth.ts`. The agent wrote a raw-write helper at `src/<handler-dir>/factories-helpers.ts` and rewrote the current audit’s `creation_file` to point there. Even if every other criterion passed, Criterion 3 fails because the immutable ground-truth field was overwritten.

## Prompt template

The plugin hook substitutes `{{placeholders}}` below before invoking `claude -p`. The hook reads everything between `<!-- prompt:begin -->` and `<!-- prompt:end -->` (exclusive) as the raw template.

<!-- prompt:begin -->

You are a semantic validator for an Autonoma Environment Factory handler. Your job is to answer: does the factory for ONE model faithfully reproduce the creation behaviour that the Step 2 audit recorded? Apply the rubric EXACTLY.

The rubric’s examples use generic placeholders (`<Model>`, `<ModelService>`, `src/<domain>/<domain>.service.ts`). Map them to whatever names the target codebase actually uses — service, repository, controller, helper function, module, etc. The rule is about the SHAPE of the call chain, not specific file paths or class names.

## Rubric (from the Autonoma docs)

{{RUBRIC}}

## Inputs for model: {{MODEL}}

### Step 2 audit entry (ground truth — immutable)

```yaml
{{STEP2_AUDIT_ENTRY}}
```

### Current audit entry (may have drifted)

```yaml
{{CURRENT_AUDIT_ENTRY}}
```

### Factory registration in the handler

File: {{HANDLER_PATH}}

```plaintext
{{FACTORY_BLOCK}}
```

### Extraction status (from Step 2 snapshot)

* `needs_extraction`: {{NEEDS_EXTRACTION}}
* `extracted_to`: {{EXTRACTED_TO}}

When `needs_extraction` is `true`, the Step 2 `creation_function` is a framework hook or inline route closure that cannot be called directly. The factory is expected to call the function at `extracted_to`. Apply the “Framework-hook carve-out” in Criterion 1.

### Helper(s) the factory calls

{{HELPER_SECTION}}

If the section above says the factory helper was “not resolvable”, treat this as missing-context, NOT as evidence of a raw-write factory. In that case return `error` (see Task) instead of `fail` for criteria that depend on inspecting the helper body.

### Original creation_function from Step 2 snapshot

File: {{ORIGINAL_CREATION_FILE}}

```plaintext
{{ORIGINAL_CREATION_SNIPPET}}
```

## Task

Apply Criteria 1–4 above. For each criterion: PASS, FAIL, or ERROR with a one-sentence reason. Use ERROR only when the information needed to judge the criterion is genuinely absent from the inputs (e.g. helper code was not provided and the helper is the only path through which the criterion could be satisfied). Do NOT use ERROR as a substitute for FAIL when the inputs clearly show a violation.

Overall verdict:

* `pass` — every criterion is PASS.
* `fail` — at least one criterion is FAIL.
* `error` — no criterion is FAIL and at least one is ERROR.

Respond with ONLY a JSON object on a single line, no prose, no code fences:

{"model": "{{MODEL}}", "verdict": "pass" | "fail" | "error", "criteria": [{"id": 1, "status": "pass|fail|error", "reason": "..."}, {"id": 2, ...}, {"id": 3, ...}, {"id": 4, ...}], "fix_hint": "one actionable sentence or empty string"}

<!-- prompt:end -->

## How the hook uses this

1. On write of `autonoma/.endpoint-implemented`, the plugin fetches this page.
2. It splits the content at `## Prompt template` — the section above is `{{RUBRIC}}`, the block between `<!-- prompt:begin -->` / `<!-- prompt:end -->` is the prompt template.
3. For every model with `independently_created: true` in the Step 2 snapshot, it fills the placeholders and runs `claude -p --output-format json` in parallel (bounded concurrency).
4. It parses the JSON result from the `result` field of the outer envelope and collects `fail` verdicts. If any exist, it blocks the sentinel with the compiled feedback.

The env-factory agent receives the feedback as stderr from the blocked write and can self-correct. The feedback includes the per-criterion reasons and a `fix_hint` for each failing model.
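
Steps 2 and 3 above can be sketched roughly as follows. This is an illustration under stated assumptions, not the plugin's actual code: `splitPage` and `fillTemplate` are hypothetical names, and trimming surrounding whitespace is a choice made for the sketch. One detail worth noting is that the marker strings also appear inline in this page's prose, so the sketch only matches markers that stand alone on a line.

```ts
// Hypothetical helper: split the fetched page into the rubric (everything
// above the "## Prompt template" heading) and the raw prompt template
// (everything between the two marker lines, exclusive).
function splitPage(page: string): { rubric: string; template: string } {
    const heading = /^## Prompt template[ \t]*\r?$/m.exec(page);
    if (!heading) throw new Error("heading '## Prompt template' not found");

    const afterHeading = page.slice(heading.index);
    // Match markers only when they occupy a whole line, so inline mentions
    // of the markers in prose are skipped.
    const begin = /^<!-- prompt:begin -->[ \t]*\r?$/m.exec(afterHeading);
    const end = /^<!-- prompt:end -->[ \t]*\r?$/m.exec(afterHeading);
    if (!begin || !end || end.index < begin.index) {
        throw new Error("prompt markers missing or out of order");
    }

    return {
        rubric: page.slice(0, heading.index).trim(),
        template: afterHeading.slice(begin.index + begin[0].length, end.index).trim(),
    };
}

// Hypothetical helper: substitute {{PLACEHOLDER}} tokens in a single pass.
// Substituted values are not rescanned, so braces inside a value stay literal.
function fillTemplate(template: string, values: Record<string, string>): string {
    return template.replace(/\{\{([A-Z0-9_]+)\}\}/g, (_match, name: string) => {
        const value = values[name];
        if (value === undefined) {
            throw new Error(`no value supplied for placeholder {{${name}}}`);
        }
        return value;
    });
}
```

Throwing on an unknown placeholder, rather than leaving it in place, keeps a half-filled prompt from reaching the model unnoticed.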

### On “can claude answer?”

Yes — `claude -p --output-format json` returns the assistant’s response in the `result` field of a JSON envelope on stdout. The plugin parses that envelope, then parses the inner JSON the prompt asked for. No intermediate file is needed, which keeps the subprocess stateless and the fan-out cheap. If a future rubric change needs structured artifacts bigger than a single JSON object, the template can be updated to ask the model to write a file at a caller-supplied path — but for the current pass/fail + per-criterion reasoning shape, the return envelope is the right channel.
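
The two-level parse described above can be sketched as follows. This is a minimal illustration, assuming the subprocess stdout has already been captured as a string; `parseVerdict` and `deriveVerdict` are hypothetical names, and the only envelope field relied on is `result`. Recomputing the overall verdict from the per-criterion statuses, instead of trusting the model's own `verdict` field, is a design choice of this sketch rather than documented plugin behaviour.

```ts
type Status = "pass" | "fail" | "error";

interface CriterionResult {
    id: number;
    status: Status;
    reason: string;
}

interface ModelVerdict {
    model: string;
    verdict: Status;
    criteria: CriterionResult[];
    fix_hint: string;
}

// Apply the verdict rules from the Task section: any FAIL means fail,
// otherwise any ERROR means error, otherwise pass.
function deriveVerdict(criteria: CriterionResult[]): Status {
    if (criteria.some((c) => c.status === "fail")) return "fail";
    if (criteria.some((c) => c.status === "error")) return "error";
    return "pass";
}

// Parse the outer JSON envelope, then the inner JSON object that the
// prompt asked the model to return in the `result` field.
function parseVerdict(stdout: string): ModelVerdict {
    const envelope = JSON.parse(stdout) as { result?: unknown };
    if (typeof envelope.result !== "string") {
        throw new Error("envelope has no string `result` field");
    }
    const inner = JSON.parse(envelope.result) as ModelVerdict;
    if (!Array.isArray(inner.criteria)) {
        throw new Error("inner JSON has no `criteria` array");
    }
    // Assumption of this sketch: the recomputed verdict wins over the reported one.
    return { ...inner, verdict: deriveVerdict(inner.criteria) };
}
```

A malformed inner object surfaces as a thrown error, which a caller could treat the same way as an `error` verdict.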