## https://sploitus.com/exploit?id=9B329910-D362-5657-8E8B-33CA109ECCD7
# CVE2PoC
A prototype that generates PoCs for real npm vulnerabilities with LLMs and verifies them by differential execution in Docker. The core pipeline:
```
Vulnerability information -> Evidence extraction/root cause analysis -> Runtime probe -> Dynamic Trigger Contract -> LLM-generated PoC -> Execution on both Docker versions -> Oracle determination -> ReAct feedback for repairs -> Result report
```
Currently, the main pipeline evaluates only the LLM/ReAct path. The old case-specific manual templates have been removed, and no template generation components remain.

## Where things are located
| Purpose | Path |
|---|---|
| Vulnerability input | `datasets/benchmark///` |
| Final result | `outputs/validation-llm-only/report.md` |
| Individual vulnerability results | `outputs/validation-llm-only/cases.md` |
| Classification statistics | `outputs/validation-llm-only/type_summary.md` |
| Mechanism tree coverage audit | `outputs/mechanism_tree_audit.md` |
| Single vulnerability PoC | `outputs/validation-llm-only/cases//poc.js` |
| Single vulnerability log | `outputs/validation-llm-only/cases//run.log` |
| Intermediate evidence/analysis/probe | `.work/evidence///` |
## Directory responsibilities
| Directory/File | Function |
|---|---|
| `datasets/benchmark/` | Input data: vulnerability description, patch, oracle, vulnerable/patched version environments |
| `src/cve2poc/` | Core code: evidence extraction, engineer-style process, dynamic Trigger Contract, LLM, Runner, Oracle, ReAct |
| `scripts/` | Execution scripts called by `run.py` |
| `prompts/` | Planner/Executor/ReAct prompts |
| `.work/` | Internal cache: evidence, analysis, engineer-process, LLM requests/responses; not the place to look for results |
| `outputs/` | Final results: report, cases.md, type_summary.md, poc.js, run.log |
Official results are written as Markdown. Legacy CSV files are supported only for reading and cleaning up historical outputs; they are not where current results are reported.

## Input structure
Single vulnerability case:
```text
metadata.json   CVE/GHSA, package name, version, vulnerability type, entry function
report.md       Vulnerability description, used as input for the LLM
patch.diff      Fix clues, used as input for root cause analysis
oracle.json     Success determination signals
vulnerable/     npm environment for the vulnerable version
patched/        npm environment for the patched version
ground_truth/   Used only for the baseline, never involved in LLM generation
```
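
Before running the pipeline, it can help to confirm that a case directory has this layout. The following is a minimal sketch, not the project's own loader: `missing_inputs` is a hypothetical helper, and treating `ground_truth/` as optional is an assumption based on it being used only for the baseline.

```python
from pathlib import Path

# Names taken from the listing above; ground_truth/ is assumed optional here
# because it is only needed for the baseline.
REQUIRED_FILES = ("metadata.json", "report.md", "patch.diff", "oracle.json")
REQUIRED_DIRS = ("vulnerable", "patched")


def missing_inputs(case_dir):
    """Return the required entries that are absent from case_dir (hypothetical helper)."""
    root = Path(case_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    missing += [name + "/" for name in REQUIRED_DIRS if not (root / name).is_dir()]
    return missing
```

An empty list means the case has every input the LLM path needs.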
## How to run
View paths and current status:
```bash
python3 run.py where
python3 run.py status
python3 run.py audit --split validation
```
Verify Docker environment and oracle, without evaluating LLMs:
```bash
python3 run.py baseline --split validation
```
Actually call the LLM to generate PoCs and verify:
```bash
python3 run.py core --split validation
```
Run on the frozen test set:
```bash
python3 run.py core --split test
```
## Core process
Each vulnerability is processed end to end before the pipeline moves on to the next one:
```text
[CASE n/N]
[1/7] Evidence extraction
[2/7] Runtime probe
[3/7] Planner LLM
[4/7] Executor LLM
[5/7] Docker verification of initial PoC
[6/7] ReAct repair
[7/7] Finalization
```
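
Stages 4 through 7 above amount to a generate, verify, repair loop. The sketch below illustrates that control flow only. The callables `generate`, `verify`, and `repair` and the `max_attempts` budget are hypothetical; the actual attempt limit is not stated here.

```python
def run_case(generate, verify, repair, max_attempts=3):
    """Generate a PoC, verify it, and repair it until it passes or attempts run out.

    generate() -> PoC source
    verify(poc) -> (passed, feedback)
    repair(poc, feedback) -> new PoC source
    All three are injected callables (hypothetical names).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    poc = generate()  # initial draft from the executor
    for attempt in range(1, max_attempts + 1):
        passed, feedback = verify(poc)  # differential run on both versions
        if passed or attempt == max_attempts:
            return {"poc": poc, "attempts": attempt, "success": passed}
        poc = repair(poc, feedback)  # ReAct repair using the run feedback
```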
`engineer_process.json` explicitly encodes the steps an engineer would follow when reproducing a vulnerability by hand:
```text
Input facts -> Runtime probe -> Root cause/patch analysis -> Entry API -> Trigger conditions -> Payload assumptions -> Oracle observation -> Repair strategy
```
It does not contain handwritten PoCs and does not read from `ground_truth/poc.js`. Its purpose is to constrain the LLM to verify environmental facts first, then analyze, and only then write code, as an engineer would.

`runtime_probe.json` is the key evidence for improving generalization: the system runs a non-exploiting probe in both the vulnerable and the patched environment to confirm that the package's module format, loadable entry points, export shapes, and metadata entries actually exist. The LLM must prefer this observed evidence over guessing CommonJS/ESM, function names, or subpaths.

`trigger_recipe` is now a dynamic Trigger Contract. It extracts entry API candidates, payload requirements, observable effects, environmental constraints, and cleanup constraints solely from the current case's metadata, report, patch, source delta, observed exports, analysis, and oracle. It no longer uses a fixed trigger template for any specific CVE or package.

`exploit_spec` is the core intermediate representation for improving generation quality: before generating or repairing a PoC, the LLM must explicitly choose the entry API, payload shape, dynamically observable effect, and expected patched behavior for comparison, and explain why that path reaches the vulnerability sink, based on the runtime probe, source delta, and analysis. `poc.js` must implement this specification, which prevents generic code that merely resembles an exploit.

Static checks are no longer a hard gate. The system records JS syntax, target package loading, oracle signal, and dynamic contract findings as `_contract_warnings`, but it does not skip Docker because of these warnings. The real verification criterion is the differential result of running the PoC against the vulnerable and patched versions.

## Verification Mechanism
The same `poc.js` will be executed twice:
```text
Docker + vulnerable/node_modules + poc.js
Docker + patched/node_modules + poc.js
```
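
One way to set up the two runs is to mount one side's `node_modules` and the PoC into a container and run `node poc.js`. The sketch below only builds the `docker run` argument list. The `docker_command` helper, the `node:18` image, and the `/work` mount point are assumptions for illustration, not the project's actual Runner configuration.

```python
from pathlib import Path


def docker_command(case_dir, side, poc_path, image="node:18"):
    """Build a `docker run` argv that runs poc.js against one side's node_modules.

    The image and the /work mount point are illustrative assumptions.
    """
    if side not in ("vulnerable", "patched"):
        raise ValueError("side must be 'vulnerable' or 'patched'")
    modules = Path(case_dir).resolve() / side / "node_modules"
    poc = Path(poc_path).resolve()
    return [
        "docker", "run", "--rm",
        "-v", f"{modules}:/work/node_modules:ro",  # dependencies, read-only
        "-v", f"{poc}:/work/poc.js:ro",            # the same PoC for both sides
        "-w", "/work",
        image, "node", "poc.js",
    ]
```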
Success requires that all the following conditions are met simultaneously:
```text
vulnerable stdout contains the vulnerable signal from the oracle
patched stdout contains the patched signal from the oracle
both vulnerable and patched versions have exit_code == 0
neither vulnerable nor patched versions experience a timeout
```
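
The four conditions above can be expressed as a single check. This is a minimal sketch: the `judge` function, the run fields `stdout`, `exit_code`, and `timed_out`, and the oracle keys `vulnerable_signal` and `patched_signal` are hypothetical names, since the actual `oracle.json` schema is not shown here.

```python
def judge(vuln_run, patched_run, oracle):
    """Apply the differential success conditions to two run records.

    Each run is a dict with 'stdout', 'exit_code', and 'timed_out';
    the oracle holds one expected signal per side (hypothetical schema).
    """
    vulnerable_ok = oracle["vulnerable_signal"] in vuln_run["stdout"]
    patched_ok = oracle["patched_signal"] in patched_run["stdout"]
    no_runtime_error = all(
        run["exit_code"] == 0 and not run["timed_out"]
        for run in (vuln_run, patched_run)
    )
    return {
        "vulnerable_ok": vulnerable_ok,
        "patched_ok": patched_ok,
        "no_runtime_error": no_runtime_error,
        "success": vulnerable_ok and patched_ok and no_runtime_error,
    }
```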
Key logs are located at:
```text
outputs/validation-llm-only/cases//run.log
```
Logs record execution commands, container names, stdout/stderr, crashes/timeouts, and oracle check results. If PoC quality is poor, first check the result of each attempt and its `_contract_warnings`. Warnings indicate that the generated PoC may deviate from the dynamic Trigger Contract, but they do not mean that vulnerability verification failed. What ultimately counts is the Docker differential result judged against the oracle.

## Metric Explanation
| Metric | Meaning |
|---|---|
| `Validation success rate` | Share of cases whose final LLM/ReAct PoC passes differential verification |
| `Success@1` | Whether the initial executor-generated PoC passes on the first attempt |
| `attempts` | Number of attempts, counting the initial draft and each ReAct repair |
| `vulnerable_ok` | Whether the vulnerable signal appears on the vulnerable side |
| `patched_ok` | Whether the patched signal appears on the patched side |
| `no_runtime_error` | Whether both versions execute without crashes or timeouts |
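
The aggregate metrics can be derived from per-case records. This is a minimal sketch that assumes each record carries a boolean `success` and an integer `attempts`; the `summarize` helper and these field names are hypothetical, not the report generator's actual format.

```python
def summarize(cases):
    """Compute the validation success rate and Success@1 over per-case records.

    Each record is a dict with 'success' (bool) and 'attempts' (int >= 1).
    """
    total = len(cases)
    if total == 0:
        return {"validation_success_rate": 0.0, "success_at_1": 0.0}
    passed = sum(1 for c in cases if c["success"])
    # Success@1 counts only cases where the initial draft already passed.
    first_try = sum(1 for c in cases if c["success"] and c["attempts"] == 1)
    return {
        "validation_success_rate": passed / total,
        "success_at_1": first_try / total,
    }
```
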
Types of failures:
```text
trigger_error: No vulnerability signal triggered on the vulnerable side
module_error: Errors related to CommonJS/ESM, require/import, or module path resolution
api_error / syntax_error: API call errors or JS syntax errors
oracle_error: The expected patched signal did not appear on the patched side
timeout_error: The PoC execution timed out
```
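
A failed attempt can be mapped to one of these types from its run results. The sketch below is only one possible heuristic: the `classify_failure` helper, the precedence order, and the stderr markers it matches (standard Node.js error text) are assumptions, not the project's actual classifier.

```python
# Node.js error text that typically indicates a module loading problem.
MODULE_MARKERS = (
    "Cannot find module",
    "ERR_MODULE_NOT_FOUND",
    "ERR_REQUIRE_ESM",
    "Cannot use import statement outside a module",
)


def classify_failure(vulnerable_ok, patched_ok, timed_out, stderr):
    """Return a failure type name, or None if no failure is detected.

    The order of checks is an assumed precedence, not a documented rule.
    """
    if timed_out:
        return "timeout_error"
    if any(marker in stderr for marker in MODULE_MARKERS):
        return "module_error"
    if "SyntaxError" in stderr:
        return "syntax_error"
    if "TypeError" in stderr or "ReferenceError" in stderr:
        return "api_error"
    if not vulnerable_ok:
        return "trigger_error"
    if not patched_ok:
        return "oracle_error"
    return None
```

Module markers are checked before `SyntaxError` so that an ESM `import` in a CommonJS context is reported as a module problem rather than a syntax problem.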
## Method Boundaries
`ground_truth/poc.js` is used only for baseline environment checks. Case-specific `TEMPLATE_BUILDERS` have been removed. The current main pipeline does not use handwritten templates, downloaded PoCs, or fallback PoCs.

Future optimizations should focus on improving information extraction, Trigger Contracts, PoC generation, and feedback repair for the cases where the LLM/ReAct path fails.

The current refactoring prioritizes generalization: validation scores will no longer be raised by adding package-level or CVE-level rules for known-failing cases. The short-term success rate may drop, but the gap between validation and test results will better reflect the actual information extraction, PoC generation, and feedback repair capabilities.