How We Fixed the 2/100 Problem — C# Scanner + Coverage Gating
We scanned a well-designed C#/.NET agent orchestrator and got 2/100 UNGOVERNED. That wasn't the project's fault — it was ours. This is the inside story of the two bugs we shipped in Warden v1.7.0 to fix it, and how VigIA-Orchestrator went from 2/100 to 61/100 PARTIAL (the first framework in our gallery above the UNGOVERNED threshold).
A few weeks ago we ran Warden against a C#/.NET agent orchestrator called VigIA-Orchestrator. The result was 2/100 UNGOVERNED.
That was wrong. And the way it was wrong taught us something we had to fix before shipping v1.7.0 of the gallery.
This post is the inside story: what the bug was, why it was actually two bugs, how we fixed both, and what the new VigIA score (61/100 PARTIAL — the first framework in the gallery above the UNGOVERNED threshold) actually measures.
The Setup
VigIA is not a Python project. It's a C#/.NET agent orchestrator built on Microsoft.Extensions.AI, using IChatClient, [KernelFunction] registration, explicit Result<T, E> error handling, InvariantEnforcer patterns, AuthorizationPolicyBuilder policies, CancellationToken propagation, ImmutableDictionary state, and FSM-guarded state transitions with an IChatClient abstraction for post-execution verification.
If you read that sentence and thought that sounds well-governed, you're right. VigIA was designed with governance surfaces in mind from day one. Its author has been doing exactly the work that Warden is supposed to reward — explicit invariants, typed errors, immutable state, structured outputs, auth policies.
We scanned it.
Score: 2 / 100 UNGOVERNED
Findings: 1,847
Level: UNGOVERNED (< 33)
Two out of a hundred. For a project that is visibly better-governed than every Python framework in our existing gallery.
The First Bug: The Scanner Couldn't See C#
Warden v1.6.0 had twelve scan layers. All twelve were written against Python — ast parsing for Python files, regex patterns targeting def, async def, @decorator, from x import y, FastAPI / LangChain / LlamaIndex / PydanticAI call patterns, os.environ, python-dotenv, setup.py, pyproject.toml, requirements.txt.
Against a repository that is 100% .cs files, .csproj files, and appsettings.json, every one of those scanners had nothing to match. No Python files meant no tool registrations detected, no credential patterns detected, no auth decorators detected, no input validators detected, no policy hooks detected.
So D1 Tool Inventory: 0/25. D3 Policy Coverage: 0/20. D4 Credential Management: 0/20. D7 Human-in-the-Loop: 0/15. D8 Agent Identity: 0/15. D14 Compliance Maturity: 0/10. D17 Adversarial Resilience: 0/10.
Not because VigIA didn't have these surfaces — it had more of them than most of the Python projects we'd scanned — but because Warden literally could not see them.
This was embarrassingly predictable. We just hadn't hit it until we pointed the scanner at a non-Python project.
The fix for Bug #1: Layer 13, second batch — a C#/.NET scanner.
Layer 13 now detects:
| Governance surface | Detected via |
|---|---|
| Tool / function registration (D1) | [KernelFunction] attributes, IChatClient method registration, Microsoft.Extensions.AI tool declarations |
| Policy coverage (D3) | AuthorizationPolicyBuilder, [Authorize(Policy=...)], InvariantEnforcer.Require(...), Result<T, E>.Ensure(...) |
| Credential management (D4) | DefaultAzureCredential, IHttpClientFactory (to detect that HttpClients aren't being constructed with embedded secrets), managed identity patterns |
| Human-in-the-loop (D7) | Approval gate patterns, CancellationToken propagation, explicit user-confirmation flows |
| Agent identity (D8) | IChatClient abstraction, typed agent contexts, readonly record struct immutable identities |
| Compliance maturity (D14) | ChatResponseFormat.CreateJsonSchemaFormat for structured outputs, ImmutableDictionary state, FSM-guarded state transitions, audit sink wiring |
| Adversarial resilience (D17) | Result<T, E> vs exception-based flow, invariant enforcement, CancellationToken cooperation, schema-validated outputs |
These aren't superficial pattern matches. They correspond to the same governance surfaces the Python scanners look for — a central tool registry, a policy enforcement hook, a credential boundary, an approval gate, a stable agent identity, a compliance-ready output schema, a trap-resistant control flow. The C#/.NET scanner just recognizes them in the idioms a .NET engineer actually writes.
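To make the table concrete, here is a minimal sketch of what idiom detection for a C# file can look like. The pattern set and dimension names are illustrative assumptions, not Warden's actual rule set; the point is only the shape: each .NET idiom maps to the governance dimension it evidences.

```python
import re

# Illustrative patterns only -- not Warden's actual rules.
# Each maps a .NET governance idiom to the dimension it evidences.
CSHARP_SURFACE_PATTERNS = {
    "D1_tool_inventory": re.compile(r"\[KernelFunction[^\]]*\]"),
    "D3_policy_coverage": re.compile(
        r"AuthorizationPolicyBuilder|\[Authorize\(Policy\s*="
    ),
    "D4_credential_management": re.compile(
        r"DefaultAzureCredential|IHttpClientFactory"
    ),
    "D8_agent_identity": re.compile(r"readonly\s+record\s+struct"),
}

def detect_surfaces(source: str) -> set[str]:
    """Return the dimensions for which this C# source shows evidence."""
    return {
        dim for dim, pattern in CSHARP_SURFACE_PATTERNS.items()
        if pattern.search(source)
    }
```

A file containing `[KernelFunction("search")]` would register evidence for D1; a file with no recognized idioms registers nothing, which is exactly the "undetected = 0" behavior discussed below.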
With Layer 13 on, VigIA's dimension scores started moving. D1 went from 0/25 to 12/25 (48%). D3 jumped to 14/20 (70%). D4 went to 13/20 (65%). D14 went to 8/10 (80%). D17 hit 8/10 (80%).
That was progress. The score was still wrong.
The Second Bug: Coverage Failure As Compliance Failure
Even with the C#/.NET scanner detecting real signal, VigIA's score was still absurdly low — somewhere in the low teens. The reason was the second bug, and it was much more interesting than the first.
Warden's scoring model has a core invariant: absence is not compliance. If you don't detect a control, you score 0 for that control. This prevents the failure mode where a vendor claims "we comply with X, Y, Z" based on an auditor never finding any evidence either way. Undetected = 0. Non-negotiable.
But there's a sharp corner to that rule. "We didn't find the control, so you get 0 points" silently assumes the scanner could have found the control. The failure mode is: we didn't find the Python control in a repository that has no Python at all, and we still counted the maximum Python-dimension points in the denominator.
Concretely: D2 Risk Detection looks for Python-specific risk classifiers (risk_score, classify_risk, policy_engine.evaluate()). D9 Threat Detection looks for Python trap defense patterns (detect_prompt_injection, sanitize_tool_result). D10 Prompt Security looks for llm_guard, promptguard, presidio imports. D16 Data Flow Governance looks for Python taint tracking libraries. None of those will ever fire on C#/.NET code, because none of those imports can exist in a C#/.NET codebase. You'd need a separate .NET-idiom scanner for each one — and we didn't have those yet.
For LangChain or PydanticAI, a 0/20 on D2 means Python code that should have a risk classifier doesn't have one. That's a real finding.
For VigIA, a 0/20 on D2 meant the scanner was looking for a Python pattern in code that isn't Python. That's not a finding. That's coverage failure being misreported as compliance failure. Adding those zeros into the denominator was turning "we can't see this language yet" into "this project scored 0 on governance dimensions it was never evaluated on."
The result was the 2/100.
The fix for Bug #2: absence-vs-coverage gating with file_counts.
Two halves:
(a) Finding-emission gating. The scanners that emit absence-based findings — trap_defense_scanner and audit_scanner — now receive a file_counts kwarg passed in via functools.partial at the dispatch layer. Before emitting a Python-specific absence finding, they check whether file_counts["python"] > 0. If the repo has zero Python files, the finding is simply not emitted. No noise. No fake CRITICALs claiming that a C#/.NET project is missing presidio-analyzer.
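In Python terms, the emission gate is roughly this shape. The function and field names here are illustrative assumptions, not Warden's actual API; what matters is that file_counts is bound once at dispatch via functools.partial, and a zero count short-circuits the absence findings entirely.

```python
from functools import partial

def emit_python_absence_findings(findings, *, file_counts):
    """Emit Python-specific absence findings only if Python files exist.

    Sketch of the gating described above; names are assumptions,
    not Warden's real scanner API.
    """
    if file_counts.get("python", 0) == 0:
        # No Python in the repo: absence is a coverage gap, not a finding.
        return []
    return [f for f in findings if f["kind"] == "absence"]

# At the dispatch layer, bind the repo's file counts once:
file_counts = {"python": 0, "csharp": 214}
scanner = partial(emit_python_absence_findings, file_counts=file_counts)

scanner([{"kind": "absence", "rule": "missing presidio-analyzer"}])  # -> []
```

The same bound scanner run against a repo with a nonzero Python count emits its absence findings normally, which is how the invariant survives the gate.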
(b) Denominator exclusion at the scoring layer. scoring/engine.py now reads file_counts when computing the normalized score. If a dimension is exclusively wired to a language whose file count is zero, that dimension's max is excluded from the denominator — not zeroed in the numerator. The effect: a project is never penalized on the maximum for surfaces the scanner cannot see yet.
Crucially, the invariant holds. Absence is still not compliance. If Warden detects a C#/.NET project does contain some Python files — say, a build script or a data-science notebook — then the Python dimensions re-enter the denominator for that project. Coverage gating is not an escape hatch. It's a guarantee that "we don't cover this language" is reported honestly instead of laundered into a compliance score.
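The denominator exclusion can be sketched in a few lines. Again, the function and parameter names are assumptions for illustration, not the real scoring/engine.py interface; the logic is the part that matters: a dimension wired exclusively to a language with zero files is dropped from the maximum, while every reachable dimension still scores 0 when undetected.

```python
def normalized_score(dim_scores, dim_maxes, dim_language, file_counts):
    """Score as a percent of the *reachable* maximum.

    Sketch of the coverage gate: a dimension wired exclusively to a
    language with zero files is excluded from the denominator entirely,
    not zeroed in the numerator. Names are illustrative assumptions.
    """
    raw, effective_max = 0, 0
    for dim, max_pts in dim_maxes.items():
        lang = dim_language.get(dim)            # None = language-agnostic
        if lang is not None and file_counts.get(lang, 0) == 0:
            continue                            # coverage gate: drop from max
        effective_max += max_pts
        raw += dim_scores.get(dim, 0)           # undetected still scores 0
    return round(100 * raw / effective_max) if effective_max else 0
```

Plugging in the VigIA numbers from the rescore below the same way (92 raw over a 150 effective maximum) yields 61, while counting the gated 85 points in the denominator would have dragged the score back toward the old, wrong number.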
Six regression tests across test_trap_defense.py and test_audit.py cover both halves. The fix is in commit 6a6144f.
The Rescore
With both fixes landed, we re-ran VigIA against v1.7.0.
Score: 61 / 100 PARTIAL
Raw: 92 / 150 effective (85 max excluded via coverage gate)
Findings: 1
Level: PARTIAL (60 ≤ score < 80)
(The full threshold table: UNGOVERNED below 33, AT_RISK from 33 to 59, PARTIAL from 60 to 79, GOVERNED at 80 and above. VigIA is the first framework in the gallery above the UNGOVERNED threshold, and the first above AT_RISK as well; it lands in the PARTIAL band at 61.)
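The band mapping is simple enough to state as one function. This is a sketch from the published thresholds, not Warden's actual code:

```python
def governance_level(score: int) -> str:
    """Map a 0-100 Warden score to its governance band (sketch)."""
    if score < 33:
        return "UNGOVERNED"
    if score < 60:
        return "AT_RISK"
    if score < 80:
        return "PARTIAL"
    return "GOVERNED"

governance_level(61)  # -> "PARTIAL"
governance_level(2)   # -> "UNGOVERNED"
```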
Dimension breakdown:
| Dimension | Raw / Max | Percent |
|---|---|---|
| D1 Tool Inventory | 12 / 25 | 48% |
| D2 Risk Detection | — | gated |
| D3 Policy Coverage | 14 / 20 | 70% |
| D4 Credential Management | 13 / 20 | 65% |
| D5 Log Hygiene | 4 / 10 | 40% |
| D6 Framework Coverage | 2 / 5 | 40% |
| D7 Human-in-the-Loop | 10 / 15 | 67% |
| D8 Agent Identity | 10 / 15 | 67% |
| D9 Threat Detection | — | gated |
| D10 Prompt Security | — | gated |
| D11 Cloud / Platform | 6 / 10 | 60% |
| D12 LLM Observability | 5 / 10 | 50% |
| D13 Data Recovery | — | gated |
| D14 Compliance Maturity | 8 / 10 | 80% |
| D15 Post-Exec Verification | — | gated |
| D16 Data Flow Governance | — | gated |
| D17 Adversarial Resilience | 8 / 10 | 80% |
(Gated dimensions are ones where the coverage gate excluded the maximum from the denominator because VigIA contains zero Python files and we don't yet have .NET-idiom scanners for those surfaces. They are not scored as 0 — they are not scored at all.)
Every dimension Warden could see on VigIA scored non-zero. The highest ones are exactly the places the VigIA author invested: explicit auth policies (D3 70%), managed-identity-style credential handling (D4 65%), typed error flow that makes adversarial failure modes visible (D17 80%), and structured outputs with immutable state (D14 80%).
One finding. Compare that to LangChain's 1,516, CrewAI's 2,171, or Semantic Kernel's 1,708. The difference is not that VigIA has 1,515 fewer bugs than LangChain — it's that when Warden actually has a scanner that understands the language, and the coverage gate stops the scanner from screaming about absences it can't see, the honest finding count drops to what's actually wrong.
Why This Matters Beyond VigIA
There's a general lesson here that applies to every static analyzer, not just Warden.
A scanner that can't express what it doesn't cover will lie about what it does cover.
If your scanner has a Python-specific detector for risk classification and you point it at a Go codebase, you have three options:
- Emit 0/20 anyway. This is what Warden v1.6.0 did. It treats every non-Python project as UNGOVERNED by construction. It's wrong in exactly the way VigIA was wrong.
- Pretend the dimension doesn't exist. This is what a lot of tools implicitly do — they run whichever scanners match and report only on those. It's honest, but it destroys comparability across projects, and it makes the top-line score meaningless.
- Gate on coverage: exclude unreachable dimensions from the denominator, don't pretend the dimension doesn't exist, and track the gap as a known coverage hole. This is what Warden v1.7.0 does. The score stays comparable. The honesty of the "undetected = 0" rule is preserved. And the operator of the C#/.NET codebase gets a score that reflects what was actually measured.
Option 3 is the only one that composes. As we add a Go scanner, a TypeScript scanner, a Rust scanner, the coverage gate generalizes — each new language detector expands what Warden can see, and the dimensions that were previously gated for that language re-enter the denominator. The scoring model doesn't have to change. The fix is the same shape for every future language.
The Gallery Rebuild
We rebuilt the full gallery against v1.7.0 and pushed it to sharkrouter.github.io/warden. Every Python target re-ran with identical scoring (the coverage gate only fires when a language's file count is zero, and every Python framework in the gallery has >0 Python files, so their scores are unchanged from the v1.6.0 run).
VigIA was added as the eleventh target and lands at 61/100 PARTIAL — the first framework above the UNGOVERNED threshold, and the first above AT_RISK as well.
That's not an endorsement of .NET over Python. Every Python framework in the gallery could climb into the PARTIAL band — and higher — by investing in the same surfaces VigIA invested in: a central tool registry with inspectable metadata (D1), explicit policy enforcement hooks on every tool call (D3), structured error handling that makes failure modes inspectable (D17), and typed outputs with a verification boundary (D14). Nothing about Python prevents it. The scanner no longer prevents it either.
Try It
Warden v1.7.0 is live on PyPI. The C#/.NET scanner and the coverage gate are both on by default:
```
pip install warden-ai
warden scan ./your-dotnet-agent --format html
```
If your project is C#/.NET, the report now tells you which .NET idioms Warden detected and which dimensions were gated for coverage reasons. If your project is mixed (Python + .NET, say), both scanners run in parallel and only the genuinely absent surfaces score zero.
The full VigIA before/after reports are public at sharkrouter.github.io/warden/vigia-orchestrator/. Every dimension, every finding, every raw number is reproducible by running warden scan against the VigIA-Orchestrator repo yourself.
The commit that landed this is 6a6144f — Layer 13 C#/.NET scanner + absence-vs-coverage gating in one bundle because the two fixes only make sense together. If we'd shipped either half alone, VigIA would still be wrong: Layer 13 without coverage gating would have scored it around 40 (Python absences still dragging the denominator), and coverage gating without Layer 13 would have gated away most of the denominator and reported a score based on nothing. Both halves shipped, or neither.
