
Evaluation Results

Detection Performance

Tier 0 (Pattern Matching)

240 compiled regex rules across 34 categories.

| Metric | Value |
|---|---|
| Precision | 91.33% |
| Recall | 23.26% |
| F1 | 37.08% |
| Speed | <50ms full scan |

High precision, low recall: it catches known patterns reliably but is easily evaded by creative rewording.
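
A minimal sketch of how a Tier 0 pass might work. The rule table below is invented for illustration; the actual 240 rules and 34 categories are not reproduced here:

```python
import re

# Hypothetical rule table: category -> list of regex patterns.
# Illustrative only; not the production rule set.
RULES = {
    "instruction-override": [r"ignore (all )?previous instructions"],
    "exfiltration": [r"send .* to https?://"],
}

# Precompile once at startup so each scan is just re.search calls.
COMPILED = [
    (category, re.compile(pattern, re.IGNORECASE))
    for category, patterns in RULES.items()
    for pattern in patterns
]

def tier0_scan(text: str) -> list[str]:
    """Return the categories whose patterns match `text`."""
    return [cat for cat, rx in COMPILED if rx.search(text)]
```

Precompiling is what keeps the full-scan cost low: the per-document work is linear pattern matching with no regex compilation on the hot path.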

Tier 1.5 (Semantic Classifier)

Fine-tuned MiniLM-L6-v2, 5-fold cross-validated on 6,472 samples.

| Metric | Value |
|---|---|
| F1 | 94.34% ± 0.77% |
| Recall | 93.68% ± 1.77% |
| Precision | 96.23% ± 0.79% |
| Speed | ~16ms per sample |

Catches synonym substitution, social engineering, encoding evasion, and homoglyphs that regex misses.
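
The ± figures above are consistent with reporting the mean and standard deviation of each metric across the five folds. A sketch of that aggregation, using made-up fold scores rather than the actual values:

```python
from statistics import mean, stdev

def summarize_folds(scores: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation over k cross-validation folds."""
    return mean(scores), stdev(scores)

# Illustrative per-fold F1 scores (not the real folds behind 94.34% ± 0.77%).
f1_folds = [0.945, 0.938, 0.951, 0.940, 0.943]
m, s = summarize_folds(f1_folds)
```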

Combined Pipeline

Evaluated on 185 adversarial and 234 benign samples in the production configuration.

| Metric | Value |
|---|---|
| Combined recall | 80.5% |
| False block rate | 3.8% |
| False warning rate | 18.4% |
| Combined FPR | 22.2% |

Tier 0 and Tier 1.5 complement each other: Tier 0 catches truncation and fragmentation attacks (80% recall on those classes), while Tier 1.5 catches semantic evasion (100% recall on those classes).
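
A plausible shape for the combined decision logic, assuming a Tier 0 pattern hit blocks outright while a Tier 1.5 score above a threshold warns. Both the policy ordering and the threshold value here are assumptions, not the documented production configuration:

```python
def combine(tier0_categories: list[str], tier15_score: float,
            warn_threshold: float = 0.5) -> str:
    """Illustrative tiered decision; the real policy may differ.

    - Any Tier 0 pattern hit          -> "block"
    - Tier 1.5 score >= warn_threshold -> "warn"
    - otherwise                        -> "allow"
    """
    if tier0_categories:
        return "block"
    if tier15_score >= warn_threshold:
        return "warn"
    return "allow"
```

Splitting outcomes into block vs. warn is what makes "false block rate" and "false warning rate" separately measurable above.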

Out-of-Distribution

144 MCP tool result samples from an independent source (mcp-guard):

| Metric | Value |
|---|---|
| Recall | 100% |
| FPR | 43% |

Attack patterns generalize across domains, but the benign class is domain-specific, which drives the high FPR on unfamiliar content types.
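
For reference, the two metrics reported here come straight from confusion counts. The example counts below are illustrative only; the document does not give the raw attack/benign split of the 144 mcp-guard samples:

```python
def recall(tp: int, fn: int) -> float:
    """Fraction of attack samples correctly flagged."""
    return tp / (tp + fn)

def false_positive_rate(fp: int, tn: int) -> float:
    """Fraction of benign samples incorrectly flagged."""
    return fp / (fp + tn)

# Illustrative: 43 of 100 benign samples flagged gives FPR = 0.43.
example_fpr = false_positive_rate(fp=43, tn=57)
```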

Behavioral Sequence Detection

Validated against 208,127 real coding-agent sessions:

| Rule | False Positives | FPR | Mode |
|---|---|---|---|
| SEQ-001 | 5 | 0.0024% | Enforce |
| SEQ-002 | 0 | 0% | Enforce |
| SEQ-005 | 1 | 0.0005% | Enforce |
| SEQ-004 | 32,887 | 15.80% | Advisory |
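
A behavioral sequence rule can be sketched as an ordered-subsequence match over a session's events. The rule below is hypothetical; the actual SEQ-001 through SEQ-005 definitions are not reproduced in this document:

```python
# Hypothetical rule: flag a session where a credentials read is later
# followed by an outbound network event (in order, not necessarily adjacent).
SUSPICIOUS_ORDER = ["read_credentials", "network_send"]

def seq_rule_fires(events: list[str],
                   pattern: list[str] = SUSPICIOUS_ORDER) -> bool:
    """True if `pattern` occurs as an ordered subsequence of `events`."""
    it = iter(events)
    # `step in it` advances the iterator past each match, so later steps
    # can only match events that occur after earlier ones.
    return all(step in it for step in pattern)
```

Because each rule is a single forward pass over the event stream, per-event evaluation stays well under a millisecond, consistent with the overhead figures below.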

Adaptive Red Team

| Evaluation | Detection Rate |
|---|---|
| Naive (13,597 garak probes) | 95.65% |
| Adaptive (118 payloads, 3 models) | 97.5% |
| Claude iterative (30 payloads) | 83.3% |
| Mistral iterative (30 payloads) | 100% |

Performance Overhead

| Operation | Time |
|---|---|
| Tier 0 full repo scan | <50ms |
| Tier 1.5 per sample | ~16ms |
| SEQ rule evaluation | <0.5ms per event |
| Hook invocation overhead | ~20ms |
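
Per-operation timings like these can be gathered with a simple wall-clock harness; a minimal sketch using the standard library:

```python
import time

def time_ms(fn, *args, repeats: int = 100) -> float:
    """Average wall-clock time of fn(*args) in milliseconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    return (time.perf_counter() - start) * 1000 / repeats
```

Averaging over repeats smooths out scheduler jitter, which matters when the operation under test is itself sub-millisecond.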

Model and Dataset

Published on HuggingFace: