## https://sploitus.com/exploit?id=B119ADEE-94DF-5B44-A30E-35ED87AF168D
# TrustedRouter-ExploitBench
Notes, harness configs, and a runbook for driving
**[ExploitBench](https://github.com/exploitbench/exploitbench)** (the public
V8-exploitation capability benchmark, [exploitbench.ai](https://exploitbench.ai))
through **[TrustedRouter](https://trustedrouter.com)** โ including TrustedRouter
**Fusion** (a multi-model panel + synthesizing judge) as a single "model."
The goal: measure how far frontier models climb the V8 exploitation ladder, and
test whether **agentic Fusion** (synthesis across a panel) beats the best single
model on a benchmark where the model-to-model gap is large.
> This repo is **methodology + configs + notes**, not a fork. ExploitBench
> itself lives upstream at [`exploitbench/exploitbench`](https://github.com/exploitbench/exploitbench);
> we reference it and add only the TrustedRouter integration and our experiment
> record. It is a companion to
> [`Lore-Hex/PrometheusBench`](https://github.com/Lore-Hex/PrometheusBench)
> (refusal) and
> [`Lore-Hex/prometheus-biomysterybench`](https://github.com/Lore-Hex/prometheus-biomysterybench)
> (bioinformatics capability).
## What ExploitBench measures
ExploitBench scores an agent along the Chromium **V8 exploitation ladder** โ 16
capabilities from *reaching* vulnerable code โ *triggering* the bug โ building
*exploit primitives* โ *arbitrary code execution*. `bench-v8` covers 41 real V8
bugs (`v8.yaml`); a 14-bug historical-baseline subset is in `v8-small.yaml`.
Each episode is up to **300 turns** of reasoning inside a per-bug Docker
container that exposes an ExploitBench MCP server (`setup()` / `grade(...)`).
It drives any model via a direct provider API **or an OpenAI-compatible
gateway** โ which is exactly how we point it at TrustedRouter.
## Why this experiment
On ExploitBench the spread between models is large (per the upstream
leaderboard: Mythos โ 9.5/16, while GPT-5.5 โ 3.8, Gemini 3.1 Pro โ 3.7, Kimi
โ 2.4). That wide gap makes it a good probe for whether **Fusion** can lift a
panel above its strongest member โ the question we also asked on BioMystery
(where the gap was small and Fusion did not help). See
[`NOTES.md`](./NOTES.md).
## Status
**Harness designed; full V8 sweep not yet run.** The honest blockers and the
host decision are documented in [`NOTES.md`](./NOTES.md). Short version:
- The real V8 images are ~70 GB each and amd64-only; episodes are 300 turns โ
impractical on an ARM Mac (disk + emulation).
- Model spend is large (~$80โ200 per Opus 300-turn episode), independent of host.
- There is **no published Opus 4.8 number** to "reproduce" โ ExploitBench's
baselines are Opus **4.6 / 4.7**. A 4.8 run would be a *fresh* measurement.
- Chosen host: a native **x86_64** Linux box (16 c / 27 GB / 682 GB free) โ see
[`RUNBOOK.md`](./RUNBOOK.md).
Results will be added here once a cost-bounded run completes.
## Layout
- [`NOTES.md`](./NOTES.md) โ experiment notes: feasibility, published-number
context, the Fusion question, decisions.
- [`RUNBOOK.md`](./RUNBOOK.md) โ how to stand up ExploitBench + TrustedRouter on
an amd64 host and run our configs.
- [`configs/`](./configs) โ TrustedRouter run configs (single-model Opus 4.8,
Fusion panel, and a no-image smoke).
## Ethics / scope
ExploitBench is a published security-*capability* benchmark; this repo only adds
model-routing config and an experiment log. It contains **no exploit code or
payloads** (bug specifics stay inside the upstream containers) and **no API
keys**. Per upstream guidance we do **not** perform reinforcement learning on the
benchmark. Use is limited to authorized capability evaluation.
## License
Apache-2.0.