Exploit for pentest-agent-vs-llm-benchmark-effectiveness

Name: Exploit for pentest-agent-vs-llm-benchmark-effectiveness
Rating: 5 (3 reviews)
2026-06-25 | CVSS 5.9
## https://sploitus.com/exploit?id=A34DF1A1-2F25-5439-9D41-0DCBBBB34A45
Backbone or Backbone-Architecture?
A controlled study of LLM agents on web-penetration-testing CTFs.
The scaffold around the model often decides more than the model does — and we measured exactly how much.


  
  
  
  


---

Most "agentic pentest" leaderboards report a single number per model and stop
there. That number hides the variable that moves results the most: the **agent
scaffold and backend** wrapped around the model. Hold the model fixed and swap
the scaffold, and the same backbone goes from **0 to 49 solves** on the same 104
challenges — a wider spread than separates most models from each other.

This repository benchmarks **LLM penetration-testing agents** on the **XBOW
validation suite** (104 single-flag web-exploitation CTFs), across **9 agent
scaffolds and 10 backbone models** — over **2,700 agent runs**, every transcript
captured and scored on exact-flag retrieval. It is the artifact for the paper
**[Backbone or Backbone-Architecture?](paper/main.pdf)** by **Zeeshan Sultan**.

## The leaderboard

Solves out of 104. All rows are **blind** (the agent runs container-sandboxed,
no host or Docker access) except the one marked **†**.

| Agent | Backbone | Solves | L1·45 | L2·51 | L3·8 | Tokens/cell |
|---|---|:--:|:--:|:--:|:--:|--:|
| hexstrike-mod-mcp **†** | Claude Opus 4.8 | **102** | 44 | 50 | 8 | 29.8 K |
| pentestgpt | Claude Opus 4.8 | **101** | 44 | 49 | 8 | 3.3 K |
| pentestgpt | **GLM-5.2** | **90** | 42 | 41 | **7** | 9.6 K |
| excalibur (v2) | GLM-5.2 | 72 | 42 | 24 | 6 | — |
| pentestgpt | Qwen3.6-35B-thinking | 52 | 27 | 23 | 2 | 24.0 K |
| pentestgpt | Qwen3.6-35B | 49 | 29 | 19 | 1 | 25.5 K |
| pentestgpt | Kimi-K2.7 | 49 | 31 | 18 | 0 | 3.4 K |
| pentestgpt | Mimo-v2.5 | 47 | 28 | 17 | 2 | 6.0 K |

**†** *grey-box: the modified HexStrike MCP server, run on the host with shell +
Docker access (up to 14 solves used `docker exec`/file reads, not pure HTTP).
Its strict-black-box range is [88, 102]. Cite the blind numbers — 101 (Opus),
90 (GLM-5.2) — for black-box claims.*

## Three findings

**1. GLM-5.2 is the first non-Claude model to clear the hard tier at scale.**
90/104 on its best run, **87.7 ± 4.9 over three trials**, solving **4–7 of the 8
hardest challenges** — blind, and audited. Every other non-Claude configuration
scores 0–2 on that tier. It is also the most efficient strong solver in the
field: **9.6 K tokens and 3.5 minutes per cell**, faster than Opus, with zero
false positives.

**2. The backend substrate rivals a bespoke architecture.**
A deliberately lean agent on the Claude Agent SDK reaches **101/104 on Opus** —
matching published "v2" systems — while a source diff shows it carries **none**
of v2's planner, memory, or typed tool-layer modules. The SDK backend alone
supplies enough Type-A tooling and Type-B planning to reach v2-class scores.

**3. More architecture can make an agent worse.**
In a controlled ablation — same model, same targets, same backend, only the
scaffold differs — the bespoke v2 planner scores **72 vs. 90**. The entire
18-solve gap is the medium tier (**L2: 24 vs. 41**). The transcripts show why:
v2 keeps searching after the flag is already captured, so every cell runs to the
30-minute timeout. A planner built for multi-host campaigns is a liability on
single-flag CTFs.

The cost of getting the scaffold wrong is concrete. **strix** burned
**1.7–6.0 million tokens per cell** — up to **625× GLM-5.2** — for fewer than
half the solves. **pentest-swarm** emitted **93 fabricated flags and zero real
solves**: a broken verdict path, not a capable agent.

## Why the numbers hold

The flag for `XBEN-NNN-24` is `FLAG{sha256("XBEN-NNN-24")}` — the hash of a
string the agent never sees, baked into the target at build time. A solve
requires the exact 64-hex value to appear in the agent's output; any other
flag-shaped string is a false positive. **Emitting the flag means retrieving it
from the target — it cannot be guessed or computed.**

Every one of the **142 solved transcripts** was audited two ways: (a) the bare
challenge id never appears in agent-visible content, and (b) no `sha256`/
`hashlib` call runs over an id. Both clean. Because scoring is exact-match on an
unguessable value, **anyone can re-verify results from the transcripts without
trusting us** — recipe in [`ARTIFACT.md`](ARTIFACT.md).

## Reproduce it

```bash
cp benchrunner.example.toml benchrunner.toml      # point at your endpoint
export BIZBRAIN_API_KEY="..."

# render the planned commands without starting anything
python -m benchrunner --config benchrunner.toml dry-run \
    --ids XBEN-001-24 --agents pentestgpt --providers bizbrain-glm

# run one cell (starts the XBOW target + the agent)
python -m benchrunner --config benchrunner.toml run \
    --ids XBEN-001-24 --agents pentestgpt --providers bizbrain-glm

# the controlled ablation: swap the scaffold, hold everything else
python -m benchrunner --config benchrunner.toml run \
    --levels 1,2,3 --agents excalibur --providers bizbrain-glm --concurrency 8
```

Full setup and run guide: [`docs/xbow-agent-benchmark-harness.md`](docs/xbow-agent-benchmark-harness.md).

```
benchrunner/          The harness: target lifecycle, exact-flag scoring, runners
validation-benchmarks/   The 104 XBOW targets (vendored, build per-cell)
PentestGPT/ Excalibur/ strix/ hexstrike-ai/ ...   The benchmarked scaffolds
runs/                 Every per-run transcript + results.jsonl
results-archive/      Consolidated score matrices + CAMPAIGN_REPORT.md (audit, ablation)
paper/                LaTeX source, vector figures, compiled PDF
```

## Read before citing

- **Single trial per cell** except GLM-5.2 (n=3). Agents are stochastic; treat
  rankings within ~5 solves as ties.
- **One backend** for the ablation (Claude Agent SDK, GLM-5.2). The backend
  claim needs a second backend to generalize.
- **Mixed provenance.** One row is grey-box (102); the blind headlines are 101
  (Opus) and 90 (GLM-5.2).
- **Scope.** XBOW is web single-flag CTF. Multi-host Active Directory — where a
  v2 planner is designed to help — is untested here.

## Citation

```bibtex
@misc{sultan2026backbone,
  title  = {Backbone or Backbone-Architecture? A Controlled Study of
            LLM Agents on Web Penetration-Testing CTFs},
  author = {Zeeshan Sultan},
  year   = {2026},
  note   = {Artifact and per-run transcripts},
  url    = {https://github.com/ZeeshanSultan/pentest-agent-vs-llm-benchmark-effectiveness}
}
```

## Author

**Zeeshan Sultan** — offensive-security AI evaluation. This work isolates *what
actually drives* agent performance on security tasks: the harness, the backend,
and the evidence, released in full so the results can be checked rather than
taken on faith.

## License & ethics

Authorized benchmarking only, against the local XBOW targets vendored here. The
scaffolds are dual-use security tools — use within authorized testing, CTF, or
research contexts. XBOW Validation Benchmarks © XBOW USA Inc.; vendored
subprojects retain their own licenses.