# TerrariaBench

TerrariaBench is a tModLoader benchmark for testing whether native coding-agent harnesses can control a live Terraria client and complete objective in-game tasks.

It is intentionally modeled after `Infatoshi/KernelBench-Hard`: small task deck, native harness per model family, disposable workspaces, transcript archives, and post-run scoring from an external checker. Here the external checker is the TerrariaBench mod checkpoint log.

## Structure

- `mod/TerrariaBench/`: tModLoader mod source.
- `problems/`: task prompts and metadata.
- `scripts/run_terraria.sh`: one `(harness, model, task)` run.
- `scripts/sweep.sh`: active model matrix.
- `orchestrator.py`: slot launcher and direct OpenRouter smoke runner.
- `tools/terraria_slot.py`: helper exposed to native agents.

## Quick Checks

```sh
uv run python -m py_compile orchestrator.py tools/terraria_slot.py
# bash -n parses only its first file argument, so check each script separately.
bash -n scripts/run_terraria.sh
bash -n scripts/sweep.sh
```

## Native Harness Run

```sh
BUDGET_SECONDS=120 ./scripts/run_terraria.sh codex gpt-5.5 problems/01_open_inventory xhigh
```

Run outputs go to `outputs/runs/___/`.

## Direct Smoke Run

The direct OpenRouter path is for debugging only:

```sh
uv run orchestrator.py --slots 1 --task inventory --model '~google/gemini-flash-latest' --max-steps 1
```

Official comparisons should use `scripts/run_terraria.sh` and `scripts/sweep.sh`.

## Requirements

- Terraria and tModLoader installed.
- The TerrariaBench mod built and enabled.
- `uv`
- `jq`
- macOS helpers for the current local path: `screencapture`, `osascript`, Quartz via `pyobjc-framework-Quartz`
- Native model CLIs for official runs: `claude`, `codex`, `kimi`, `opencode`

The project is being moved to `anvil-lan`; update `TMODLOADER_DIR` and `TMODLOADER_SAVE_DIR` there as needed.
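
The command-line tools listed under Requirements can be checked before a run with a small preflight script. The following is a minimal sketch, not part of the repository: the function name `missing_tools` is hypothetical, the tool list is copied from the Requirements section, and the script only reports problems rather than aborting. It does not verify the Terraria or tModLoader install, the mod build, or the `pyobjc-framework-Quartz` Python package.

```sh
#!/bin/sh
# Preflight sketch (hypothetical helper, not shipped with TerrariaBench).

# Print each argument that cannot be resolved as a command on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Core tools, macOS helpers, and native model CLIs named in the README.
missing="$(missing_tools uv jq screencapture osascript claude codex kimi opencode)"

if [ -n "$missing" ]; then
  printf 'Missing tools:\n%s\n' "$missing" >&2
else
  echo "All required tools found."
fi

# The README names these two variables; warn when either is unset or empty.
[ -n "${TMODLOADER_DIR:-}" ] || echo "TMODLOADER_DIR is not set" >&2
[ -n "${TMODLOADER_SAVE_DIR:-}" ] || echo "TMODLOADER_SAVE_DIR is not set" >&2
```

For a direct smoke run only, the native model CLIs are presumably unnecessary, so missing `claude`, `codex`, `kimi`, or `opencode` entries can be read as warnings in that case.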