Agentic game-control benchmark

TerrariaBench

TerrariaBench evaluates native coding-agent harnesses controlling a live Terraria/tModLoader client. It follows the KernelBench-Hard pattern: native harnesses, archived transcripts, screenshots, videos, and objective scoring from a mod-authored checkpoint log rather than model self-reporting.
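
As a rough illustration of scoring from a checkpoint log rather than from model self-reporting, the sketch below counts distinct checkpoints recorded before a time limit. The log format (one JSON object per line with `t` and `checkpoint` fields), the checkpoint names, and the `score_log` helper are all assumptions made for this example. The actual mod log schema and scoring rule are not described here and may differ, for instance in whether repeated checkpoints count.

```python
import json

# Hypothetical log contents: one JSON object per line, written by the mod
# (not by the agent). Field names and checkpoint names are illustrative only.
SAMPLE_LOG = """\
{"t": 12.5, "checkpoint": "chopped_tree"}
{"t": 40.0, "checkpoint": "crafted_workbench"}
{"t": 41.0, "checkpoint": "crafted_workbench"}
{"t": 310.0, "checkpoint": "built_shelter"}
"""


def score_log(log_text: str, time_limit_s: float) -> int:
    """Count distinct checkpoints logged at or before the time limit.

    Assumption: duplicates are ignored and late events do not count.
    """
    seen = set()
    for line in log_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        event = json.loads(line)
        if event["t"] <= time_limit_s:
            seen.add(event["checkpoint"])
    return len(seen)


score = score_log(SAMPLE_LOG, time_limit_s=300)
print(score)
```

The point of the design is that the agent never writes to the log, so its own claims about progress cannot affect the score.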

Published Runs: 7
Active Task: open ended
Scoring: checkpoints
Assets: frames + video

Active Sweep

Current official lanes: Claude Opus 4.7 Max through Claude Code, GPT-5.5 xhigh through Codex CLI, Kimi K2.6 through Kimi CLI, and the vision-capable Gemini routes through OpenCode. PVE and hostile mobs are disabled for the current sandbox progression sweep.

All Published Runs

Run Harness Model Problem Checkpoints Elapsed Frames Status
20260429_193427_kimi_kimi-k2.6_03_open_ended_progress kimi kimi-k2.6 03_open_ended_progress 8 300s 1 timeout
20260429_185451_claude_claude-opus-4-7_03_open_ended_progress claude claude-opus-4-7 03_open_ended_progress 16 300s 1 timeout
20260429_184419_opencode_openrouter_google_gemini-3.1-pro-preview_03_open_ended_progress opencode openrouter/google/gemini-3.1-pro-preview 03_open_ended_progress 32 301s 2 timeout
20260429_184419_opencode_openrouter_google_gemini-3.1-flash-lite-preview_03_open_ended_progress opencode openrouter/google/gemini-3.1-flash-lite-preview 03_open_ended_progress 5 40s 6 complete
20260429_184419_opencode_openrouter_google_gemini-3-flash-preview_03_open_ended_progress opencode openrouter/google/gemini-3-flash-preview 03_open_ended_progress 31 298s 5 complete
20260429_184419_claude_claude-opus-4-7_03_open_ended_progress claude claude-opus-4-7 03_open_ended_progress 32 300s 1 timeout
20260429_044924_codex_gpt-5.5_03_open_ended_progress codex gpt-5.5 03_open_ended_progress 40 36000s 39 timeout
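
For readers who want to work with these rows programmatically, here is a minimal sketch that parses the whitespace-separated table above and derives two simple summaries: checkpoints per minute and the best run per model. The `Run` class, `parse_row` helper, and the per-minute rate are illustrative choices for this example, not part of the benchmark's official scoring.

```python
from dataclasses import dataclass

# Rows copied from the table above: run, harness, model, problem,
# checkpoints, elapsed, frames, status.
ROWS = """\
20260429_193427_kimi_kimi-k2.6_03_open_ended_progress kimi kimi-k2.6 03_open_ended_progress 8 300s 1 timeout
20260429_185451_claude_claude-opus-4-7_03_open_ended_progress claude claude-opus-4-7 03_open_ended_progress 16 300s 1 timeout
20260429_184419_opencode_openrouter_google_gemini-3.1-pro-preview_03_open_ended_progress opencode openrouter/google/gemini-3.1-pro-preview 03_open_ended_progress 32 301s 2 timeout
20260429_184419_opencode_openrouter_google_gemini-3.1-flash-lite-preview_03_open_ended_progress opencode openrouter/google/gemini-3.1-flash-lite-preview 03_open_ended_progress 5 40s 6 complete
20260429_184419_opencode_openrouter_google_gemini-3-flash-preview_03_open_ended_progress opencode openrouter/google/gemini-3-flash-preview 03_open_ended_progress 31 298s 5 complete
20260429_184419_claude_claude-opus-4-7_03_open_ended_progress claude claude-opus-4-7 03_open_ended_progress 32 300s 1 timeout
20260429_044924_codex_gpt-5.5_03_open_ended_progress codex gpt-5.5 03_open_ended_progress 40 36000s 39 timeout
"""


@dataclass(frozen=True)
class Run:
    run_id: str
    harness: str
    model: str
    problem: str
    checkpoints: int
    elapsed_s: int
    frames: int
    status: str

    @property
    def checkpoints_per_minute(self) -> float:
        # Illustrative rate only; runs with different time budgets
        # are not directly comparable on raw checkpoint counts.
        return 60.0 * self.checkpoints / self.elapsed_s


def parse_row(line: str) -> Run:
    run_id, harness, model, problem, cps, elapsed, frames, status = line.split()
    return Run(run_id, harness, model, problem,
               int(cps), int(elapsed.rstrip("s")), int(frames), status)


runs = [parse_row(line) for line in ROWS.splitlines() if line.strip()]

# Keep the highest-checkpoint run for each model.
best_by_model = {}
for run in runs:
    current = best_by_model.get(run.model)
    if current is None or run.checkpoints > current.checkpoints:
        best_by_model[run.model] = run

for model, run in sorted(best_by_model.items()):
    print(f"{model}: {run.checkpoints} checkpoints in {run.elapsed_s}s "
          f"({run.checkpoints_per_minute:.2f}/min, {run.status})")
```

Normalizing by elapsed time matters here because the listed runs span very different durations, from 40s to 36000s.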