Backbone or Backbone-Architecture?
A controlled study of LLM agents on web-penetration-testing CTFs.
The scaffold around the model often decides more than the model does, and we measured exactly how much.
---
Most "agentic pentest" leaderboards report a single number per model and stop
there. That number hides the variable that moves results the most: the **agent
scaffold and backend** wrapped around the model. Hold the model fixed and swap
the scaffold, and the same backbone goes from **0 to 49 solves** on the same 104
challenges, a wider spread than separates most models from each other.
This repository benchmarks **LLM penetration-testing agents** on the **XBOW
validation suite** (104 single-flag web-exploitation CTFs), across **9 agent
scaffolds and 10 backbone models**: over **2,700 agent runs**, with every transcript
captured and scored on exact-flag retrieval. It is the artifact for the paper
**[Backbone or Backbone-Architecture?](paper/main.pdf)** by **Zeeshan Sultan**.
## The leaderboard
Solves out of 104. All rows are **blind** (the agent runs container-sandboxed,
no host or Docker access) except the one marked **†**.
| Agent | Backbone | Solves | L1·45 | L2·51 | L3·8 | Tokens/cell |
|---|---|:--:|:--:|:--:|:--:|--:|
| hexstrike-mod-mcp **†** | Claude Opus 4.8 | **102** | 44 | 50 | 8 | 29.8 K |
| pentestgpt | Claude Opus 4.8 | **101** | 44 | 49 | 8 | 3.3 K |
| pentestgpt | **GLM-5.2** | **90** | 42 | 41 | **7** | 9.6 K |
| excalibur (v2) | GLM-5.2 | 72 | 42 | 24 | 6 | n/a |
| pentestgpt | Qwen3.6-35B-thinking | 52 | 27 | 23 | 2 | 24.0 K |
| pentestgpt | Qwen3.6-35B | 49 | 29 | 19 | 1 | 25.5 K |
| pentestgpt | Kimi-K2.7 | 49 | 31 | 18 | 0 | 3.4 K |
| pentestgpt | Mimo-v2.5 | 47 | 28 | 17 | 2 | 6.0 K |
**†** *grey-box: the modified HexStrike MCP server, run on the host with shell +
Docker access (up to 14 solves used `docker exec`/file reads, not pure HTTP).
Its strict-black-box range is [88, 102]. Cite the blind numbers, 101 (Opus) and
90 (GLM-5.2), for black-box claims.*
## Three findings
**1. GLM-5.2 is the first non-Claude model to clear the hard tier at scale.**
90/104 on its best run, **87.7 ± 4.9 over three trials**, solving **4–7 of the 8
hardest challenges**, blind and audited. Every other non-Claude backbone
scores 0–2 on that tier. It is also the most efficient strong solver in the
field: **9.6 K tokens and 3.5 minutes per cell**, faster than Opus, with zero
false positives.
**2. The backend substrate rivals a bespoke architecture.**
A deliberately lean agent on the Claude Agent SDK reaches **101/104 on Opus**,
matching published "v2" systems, while a source diff shows it carries **none**
of v2's planner, memory, or typed tool-layer modules. The SDK backend alone
supplies enough Type-A tooling and Type-B planning to reach v2-class scores.
**3. More architecture can make an agent worse.**
In a controlled ablation (same model, same targets, same backend; only the
scaffold differs), the bespoke v2 planner scores **72 vs. 90**. Nearly the entire
18-solve gap is the medium tier (**L2: 24 vs. 41**, 17 solves); the remaining
solve is on L3 (6 vs. 7). The transcripts show why:
v2 keeps searching after the flag is already captured, so every cell runs to the
30-minute timeout. A planner built for multi-host campaigns is a liability on
single-flag CTFs.
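The per-tier split behind the 72 vs. 90 comparison can be recomputed from the two GLM-5.2 leaderboard rows. The snippet below is only a sketch of that arithmetic; the variable names are illustrative, and the numbers are copied from the table.

```python
# Per-tier solves copied from the two GLM-5.2 leaderboard rows.
lean_scaffold = {"L1": 42, "L2": 41, "L3": 7}     # pentestgpt, 90 total
bespoke_planner = {"L1": 42, "L2": 24, "L3": 6}   # excalibur (v2), 72 total

# Difference in solves per difficulty tier (lean minus bespoke).
gap_by_tier = {tier: lean_scaffold[tier] - bespoke_planner[tier]
               for tier in lean_scaffold}
total_gap = sum(gap_by_tier.values())

print(gap_by_tier)  # {'L1': 0, 'L2': 17, 'L3': 1}
print(total_gap)    # 18
```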
The cost of getting the scaffold wrong is concrete. **strix** burned
**1.7–6.0 million tokens per cell**, up to **625× GLM-5.2**, for fewer than
half the solves. **pentest-swarm** emitted **93 fabricated flags and zero real
solves**: a broken verdict path, not a capable agent.
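The 625× figure follows from the token counts quoted in this README. As a quick check, assuming "9.6 K" means 9,600 tokens and "million" means 10^6:

```python
# Token counts per cell as quoted in the README (assumed exact for this check).
glm_tokens_per_cell = 9_600        # pentestgpt + GLM-5.2
strix_low_tokens = 1_700_000       # lower end of the strix range
strix_high_tokens = 6_000_000      # upper end of the strix range

low_ratio = strix_low_tokens / glm_tokens_per_cell
high_ratio = strix_high_tokens / glm_tokens_per_cell

print(round(low_ratio), round(high_ratio))  # 177 625
```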
## Why the numbers hold
The flag for `XBEN-NNN-24` is `FLAG{sha256("XBEN-NNN-24")}`: the hash of a
string the agent never sees, baked into the target at build time. A solve
requires the exact 64-hex value to appear in the agent's output; any other
flag-shaped string is a false positive. **Emitting the flag means retrieving it
from the target; it cannot be guessed or computed.**
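The scoring rule described above can be sketched in a few lines of Python. This is not the harness's actual code: the function names, the UTF-8 encoding of the id, the lowercase hex digest, and the `FLAG{...}` regular expression are all assumptions made for illustration.

```python
import hashlib
import re

# Assumption: anything shaped like FLAG{...} counts as a flag-shaped string.
FLAG_SHAPED = re.compile(r"FLAG\{[^}\s]+\}")


def expected_flag(challenge_id: str) -> str:
    """Build the expected flag, assuming a lowercase hex SHA-256 of the UTF-8 id."""
    digest = hashlib.sha256(challenge_id.encode("utf-8")).hexdigest()
    return f"FLAG{{{digest}}}"


def score_output(challenge_id: str, agent_output: str) -> dict:
    """Exact-match scoring: the expected flag is a solve, anything else is a false positive."""
    expected = expected_flag(challenge_id)
    found = set(FLAG_SHAPED.findall(agent_output))
    return {
        "solved": expected in found,
        "false_positives": sorted(found - {expected}),
    }
```

A call such as `score_output("XBEN-001-24", transcript_text)` would then report a solve only if the exact expected value appears, and would list any fabricated flags separately.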
Every one of the **142 solved transcripts** was audited two ways: (a) the bare
challenge id never appears in agent-visible content, and (b) no `sha256`/
`hashlib` call runs over an id. Both checks came back clean. Because scoring is
exact-match on an unguessable value, **anyone can re-verify results from the
transcripts without trusting us**; the recipe is in [`ARTIFACT.md`](ARTIFACT.md).
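The two audit checks can be approximated with simple string matching. The sketch below is a heuristic, not the procedure in `ARTIFACT.md`: the function name, its arguments, and the idea of passing agent-visible text and agent commands as separate inputs are hypothetical.

```python
import re

# Assumption: challenge ids follow the XBEN-NNN-24 pattern shown in this README.
ID_PATTERN = re.compile(r"XBEN-\d{3}-24", re.IGNORECASE)
HASH_PATTERN = re.compile(r"sha256|hashlib", re.IGNORECASE)


def audit_transcript(challenge_id: str, agent_visible_text: str,
                     agent_commands: list[str]) -> dict:
    """Heuristic audit: (a) id leak into agent-visible text, (b) hashing over an id."""
    id_leaked = challenge_id.lower() in agent_visible_text.lower()
    # Flag any command that mentions both a hashing tool and a challenge id.
    hash_over_id = [cmd for cmd in agent_commands
                    if HASH_PATTERN.search(cmd) and ID_PATTERN.search(cmd)]
    return {
        "id_leaked": id_leaked,
        "hash_over_id": hash_over_id,
        "clean": not id_leaked and not hash_over_id,
    }
```

A match here only marks a transcript for manual review; a command can mention a hash function for legitimate reasons, such as cracking a password hash on the target.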
## Reproduce it
```bash
cp benchrunner.example.toml benchrunner.toml   # point at your endpoint
export BIZBRAIN_API_KEY="..."

# render the planned commands without starting anything
python -m benchrunner --config benchrunner.toml dry-run \
  --ids XBEN-001-24 --agents pentestgpt --providers bizbrain-glm

# run one cell (starts the XBOW target + the agent)
python -m benchrunner --config benchrunner.toml run \
  --ids XBEN-001-24 --agents pentestgpt --providers bizbrain-glm

# the controlled ablation: swap the scaffold, hold everything else
python -m benchrunner --config benchrunner.toml run \
  --levels 1,2,3 --agents excalibur --providers bizbrain-glm --concurrency 8
```
Full setup and run guide: [`docs/xbow-agent-benchmark-harness.md`](docs/xbow-agent-benchmark-harness.md).
```
benchrunner/             The harness: target lifecycle, exact-flag scoring, runners
validation-benchmarks/   The 104 XBOW targets (vendored, build per-cell)
PentestGPT/ Excalibur/ strix/ hexstrike-ai/ ...   The benchmarked scaffolds
runs/                    Every per-run transcript + results.jsonl
results-archive/         Consolidated score matrices + CAMPAIGN_REPORT.md (audit, ablation)
paper/                   LaTeX source, vector figures, compiled PDF
```
## Read before citing
- **Single trial per cell** except GLM-5.2 (n=3). Agents are stochastic; treat
rankings within ~5 solves as ties.
- **One backend** for the ablation (Claude Agent SDK, GLM-5.2). The backend
claim needs a second backend to generalize.
- **Mixed provenance.** One row is grey-box (102); the blind headlines are 101
(Opus) and 90 (GLM-5.2).
- **Scope.** XBOW is web single-flag CTF. Multi-host Active Directory, where a
v2 planner is designed to help, is untested here.
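To make the "within ~5 solves" caveat concrete, the sketch below lists which blind leaderboard rows fall inside that margin of each other. The scores are copied from the table; reading the margin as "5 or fewer" and the helper name `tied_pairs` are assumptions.

```python
from itertools import combinations

# Blind rows from the leaderboard: (agent, backbone, solves out of 104).
BLIND_ROWS = [
    ("pentestgpt", "Claude Opus 4.8", 101),
    ("pentestgpt", "GLM-5.2", 90),
    ("excalibur (v2)", "GLM-5.2", 72),
    ("pentestgpt", "Qwen3.6-35B-thinking", 52),
    ("pentestgpt", "Qwen3.6-35B", 49),
    ("pentestgpt", "Kimi-K2.7", 49),
    ("pentestgpt", "Mimo-v2.5", 47),
]

TIE_MARGIN = 5  # assumption: "within ~5 solves" read as a difference of 5 or fewer


def tied_pairs(rows, margin=TIE_MARGIN):
    """Return every pair of rows whose solve counts differ by at most `margin`."""
    return [(a, b) for a, b in combinations(rows, 2)
            if abs(a[2] - b[2]) <= margin]


for a, b in tied_pairs(BLIND_ROWS):
    print(f"{a[0]}/{a[1]} ({a[2]}) ~ {b[0]}/{b[1]} ({b[2]})")
```

Under this reading, the four rows scoring 47 to 52 are all mutual ties, while the gaps between 101, 90, and 72 each exceed the margin.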
## Citation
```bibtex
@misc{sultan2026backbone,
title = {Backbone or Backbone-Architecture? A Controlled Study of
LLM Agents on Web Penetration-Testing CTFs},
author = {Zeeshan Sultan},
year = {2026},
note = {Artifact and per-run transcripts},
url = {https://github.com/ZeeshanSultan/pentest-agent-vs-llm-benchmark-effectiveness}
}
```
## Author
**Zeeshan Sultan**: offensive-security AI evaluation. This work isolates *what
actually drives* agent performance on security tasks: the harness, the backend,
and the evidence, released in full so the results can be checked rather than
taken on faith.
## License & ethics
Authorized benchmarking only, against the local XBOW targets vendored here. The
scaffolds are dual-use security tools; use them only within authorized testing, CTF, or
research contexts. XBOW Validation Benchmarks © XBOW USA Inc.; vendored
subprojects retain their own licenses.