# TerrariaBench

TerrariaBench is a tModLoader benchmark that tests whether native coding-agent harnesses can control a live Terraria client and complete objectively verifiable in-game tasks.

It is intentionally modeled after `Infatoshi/KernelBench-Hard`: a small task deck, a native harness per model family, disposable workspaces, transcript archives, and post-run scoring by an external checker. Here the external checker is the TerrariaBench mod's checkpoint log.

## Structure

- `mod/TerrariaBench/`: tModLoader mod source.
- `problems/`: task prompts and metadata.
- `scripts/run_terraria.sh`: one `(harness, model, task)` run.
- `scripts/sweep.sh`: active model matrix.
- `orchestrator.py`: slot launcher and direct OpenRouter smoke runner.
- `tools/terraria_slot.py`: helper exposed to native agents.

## Quick Checks

```sh
uv run python -m py_compile orchestrator.py tools/terraria_slot.py
bash -n scripts/run_terraria.sh scripts/sweep.sh
```

## Native Harness Run

```sh
BUDGET_SECONDS=120 ./scripts/run_terraria.sh codex gpt-5.5 problems/01_open_inventory xhigh
```

Run outputs go to `outputs/runs/<timestamp>_<harness>_<model>_<problem>/`.
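
To make the directory layout concrete, here is a minimal Python sketch of how a name of that shape could be assembled. It is not taken from `scripts/run_terraria.sh`. The function name `run_dir_name`, the timestamp format, and the replacement of slashes in model IDs are all assumptions made for illustration.

```python
from datetime import datetime
from pathlib import Path


def run_dir_name(harness: str, model: str, problem_path: str, now: datetime) -> str:
    """Build a name shaped like <timestamp>_<harness>_<model>_<problem>."""
    # Assumption: this timestamp format is illustrative; the script may use another.
    timestamp = now.strftime("%Y%m%d_%H%M%S")
    # Assumption: the problem component is the last segment of the problem path.
    problem = Path(problem_path).name
    # Assumption: slashes in model IDs are replaced so the name stays one directory.
    safe_model = model.replace("/", "-")
    return f"{timestamp}_{harness}_{safe_model}_{problem}"


if __name__ == "__main__":
    name = run_dir_name("codex", "gpt-5.5", "problems/01_open_inventory", datetime.now())
    print(Path("outputs/runs") / name)
```

Passing the clock in as `now` keeps the helper deterministic, which makes it easy to check against a known timestamp.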

## Direct Smoke Run

The direct OpenRouter path is for debugging only:

```sh
uv run orchestrator.py --slots 1 --task inventory --model '~google/gemini-flash-latest' --max-steps 1
```

Official comparisons should use `scripts/run_terraria.sh` and `scripts/sweep.sh`.

## Requirements

- Terraria and tModLoader installed.
- The TerrariaBench mod built and enabled.
- `uv`
- `jq`
- macOS helpers for the current local path: `screencapture`, `osascript`, Quartz via `pyobjc-framework-Quartz`
- Native model CLIs for official runs: `claude`, `codex`, `kimi`, `opencode`
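
The command-line requirements above can be checked before a run with a short preflight script. The sketch below is not part of the repository. The names `REQUIRED_TOOLS`, `NATIVE_CLIS`, and `missing_tools` are hypothetical, and the script only checks that each executable is on `PATH`, not that Terraria, tModLoader, or the mod are installed.

```python
import shutil
from typing import Callable, Iterable, List, Optional

# Tool names copied from the requirements list; the grouping is an assumption.
REQUIRED_TOOLS = ["uv", "jq", "screencapture", "osascript"]
NATIVE_CLIS = ["claude", "codex", "kimi", "opencode"]


def missing_tools(
    names: Iterable[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the names that cannot be resolved to an executable on PATH."""
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    for label, tools in (("required", REQUIRED_TOOLS), ("native CLI", NATIVE_CLIS)):
        missing = missing_tools(tools)
        status = "ok" if not missing else "missing: " + ", ".join(missing)
        print(f"{label}: {status}")
```

The `which` parameter exists only so the lookup can be stubbed out; in normal use the default `shutil.which` is enough.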

The project is being moved to `anvil-lan`; update `TMODLOADER_DIR` and `TMODLOADER_SAVE_DIR` there as needed.
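
After changing those two settings on a new machine, a quick sanity check can catch a stale path early. The sketch below assumes `TMODLOADER_DIR` and `TMODLOADER_SAVE_DIR` are environment variables that should each point at an existing directory; the function name `check_tmodloader_env` is hypothetical and is not something the repository provides.

```python
import os
from pathlib import Path
from typing import Dict, Mapping

# Assumption: both settings are environment variables holding directory paths.
TMODLOADER_VARS = ("TMODLOADER_DIR", "TMODLOADER_SAVE_DIR")


def check_tmodloader_env(env: Mapping[str, str]) -> Dict[str, str]:
    """Return a {variable: problem} map; an empty map means both look usable."""
    problems: Dict[str, str] = {}
    for var in TMODLOADER_VARS:
        value = env.get(var)
        if not value:
            problems[var] = "not set"
        elif not Path(value).expanduser().is_dir():
            problems[var] = f"not a directory: {value}"
    return problems


if __name__ == "__main__":
    issues = check_tmodloader_env(os.environ)
    for var, problem in issues.items():
        print(f"{var}: {problem}")
    if not issues:
        print("tModLoader paths look usable")
```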