Evaluation Methodology¶

CloneGuard's detection capabilities are evaluated through multiple independent approaches. All evaluation data and scripts are published in the repository.

Approach 1: Pattern-Level Testing¶

Every detection rule has at least one positive and one negative test case. 1,677 automated tests cover rule matching, severity classification, scan mode behavior, and edge cases.

pytest tests/

Approach 2: Pipeline Benchmarks¶

End-to-end benchmarks measure the combined detection pipeline (pattern + semantic + behavioral) against curated sample sets:

185 adversarial samples -- payloads designed to evade detection, sourced from public datasets and generated via adversarial techniques
234 held-out benign samples -- real-world code and documentation that should not trigger detections

Metric	Tier 0 alone	Tier 1.5 alone	Combined
Recall	31.9%	78.4%	80.5%
FPR	9.8%	15.4%	22.2%
False block rate	--	--	3.8%

Approach 3: False Positive Calibration¶

False positive rates were validated against 208,127 real coding-agent sessions mined from three published SWE-bench datasets on HuggingFace (SWE-smith, Nebius, OpenHands). These are benign agent sessions solving GitHub issues.

Rule	FPR	Verdict
SEQ-001 (sensitive read then exfil)	0.0024% (5 / 208,127)	Enforce
SEQ-005 (agent config write)	0.0005% (1 / 208,127)	Enforce
SEQ-004 (write then build)	15.80%	Advisory only

The 5 SEQ-001 matches were manually inspected and confirmed benign -- legitimate test scripts that read config files then make HTTP requests.

Scripts: scripts/mine_trajectories.py, scripts/download_trajectories.py

Approach 4: Adaptive Red Teaming¶

Multi-model adversarial evaluation where AI models attempt to craft payloads that bypass CloneGuard with full knowledge of its detection rules.

Naive baseline: 95.65% detection on 13,597 garak probes (Tier 0 + 1.5)

Adaptive evaluation:

Model	Payloads	Bypassed	Detection Rate
Claude (iterative)	30	5	83.3%
Mistral (iterative)	30	0	100%
Combined (3 models)	118	3	97.5%

The 5 Claude iterative bypasses all used bureaucratic-documentation framing to evade the semantic classifier. This is a known limitation of the MiniLM architecture's mean-pooling approach when attack content is diluted in long institutional-style text.

Full methodology: see docs/ADAPTIVE-RED-TEAM.md in the repository.

Approach 5: Cross-Examination¶

Detection results are independently verified by external AI models (Gemini, Mistral, Qwen) to identify false positives and disputed verdicts. Three of the initial "bypasses" were reclassified as DISPUTED after independent cross-examination found the payloads were not actually functional attacks.

Trajectory Dataset Scope¶

The 208,127 trajectory dataset covers SWE-bench task solving (coding agents fixing GitHub issues). It does not include:

MCP interactions
Browser agent sessions
Financial agent workflows
Autonomous multi-agent sessions

False positive rates for these agent types are not yet validated. Behavioral sequence rules for these domains should be treated as experimental.