Share
## https://sploitus.com/exploit?id=B119ADEE-94DF-5B44-A30E-35ED87AF168D
# TrustedRouter-ExploitBench

Notes, harness configs, and a runbook for driving
**[ExploitBench](https://github.com/exploitbench/exploitbench)** (the public
V8-exploitation capability benchmark, [exploitbench.ai](https://exploitbench.ai))
through **[TrustedRouter](https://trustedrouter.com)** โ€” including TrustedRouter
**Fusion** (a multi-model panel + synthesizing judge) as a single "model."

The goal: measure how far frontier models climb the V8 exploitation ladder, and
test whether **agentic Fusion** (synthesis across a panel) beats the best single
model on a benchmark where the model-to-model gap is large.

> This repo is **methodology + configs + notes**, not a fork. ExploitBench
> itself lives upstream at [`exploitbench/exploitbench`](https://github.com/exploitbench/exploitbench);
> we reference it and add only the TrustedRouter integration and our experiment
> record. It is a companion to
> [`Lore-Hex/PrometheusBench`](https://github.com/Lore-Hex/PrometheusBench)
> (refusal) and
> [`Lore-Hex/prometheus-biomysterybench`](https://github.com/Lore-Hex/prometheus-biomysterybench)
> (bioinformatics capability).

## What ExploitBench measures

ExploitBench scores an agent along the Chromium **V8 exploitation ladder** โ€” 16
capabilities from *reaching* vulnerable code โ†’ *triggering* the bug โ†’ building
*exploit primitives* โ†’ *arbitrary code execution*. `bench-v8` covers 41 real V8
bugs (`v8.yaml`); a 14-bug historical-baseline subset is in `v8-small.yaml`.
Each episode is up to **300 turns** of reasoning inside a per-bug Docker
container that exposes an ExploitBench MCP server (`setup()` / `grade(...)`).

It drives any model via a direct provider API **or an OpenAI-compatible
gateway** โ€” which is exactly how we point it at TrustedRouter.

## Why this experiment

On ExploitBench the spread between models is large (per the upstream
leaderboard: Mythos โ‰ˆ 9.5/16, while GPT-5.5 โ‰ˆ 3.8, Gemini 3.1 Pro โ‰ˆ 3.7, Kimi
โ‰ˆ 2.4). That wide gap makes it a good probe for whether **Fusion** can lift a
panel above its strongest member โ€” the question we also asked on BioMystery
(where the gap was small and Fusion did not help). See
[`NOTES.md`](./NOTES.md).

## Status

**Harness designed; full V8 sweep not yet run.** The honest blockers and the
host decision are documented in [`NOTES.md`](./NOTES.md). Short version:

- The real V8 images are ~70 GB each and amd64-only; episodes are 300 turns โ€”
  impractical on an ARM Mac (disk + emulation).
- Model spend is large (~$80โ€“200 per Opus 300-turn episode), independent of host.
- There is **no published Opus 4.8 number** to "reproduce" โ€” ExploitBench's
  baselines are Opus **4.6 / 4.7**. A 4.8 run would be a *fresh* measurement.
- Chosen host: a native **x86_64** Linux box (16 c / 27 GB / 682 GB free) โ€” see
  [`RUNBOOK.md`](./RUNBOOK.md).

Results will be added here once a cost-bounded run completes.

## Layout

- [`NOTES.md`](./NOTES.md) โ€” experiment notes: feasibility, published-number
  context, the Fusion question, decisions.
- [`RUNBOOK.md`](./RUNBOOK.md) โ€” how to stand up ExploitBench + TrustedRouter on
  an amd64 host and run our configs.
- [`configs/`](./configs) โ€” TrustedRouter run configs (single-model Opus 4.8,
  Fusion panel, and a no-image smoke).

## Ethics / scope

ExploitBench is a published security-*capability* benchmark; this repo only adds
model-routing config and an experiment log. It contains **no exploit code or
payloads** (bug specifics stay inside the upstream containers) and **no API
keys**. Per upstream guidance we do **not** perform reinforcement learning on the
benchmark. Use is limited to authorized capability evaluation.

## License

Apache-2.0.