Why Every Python Agent Framework We Scanned Scored UNGOVERNED
We ran Warden against 10 of the most popular open-source AI agent frameworks — LangChain, LangGraph, CrewAI, AutoGen, Haystack, LlamaIndex, Semantic Kernel, PydanticAI, MetaGPT, and Langflow. None scored above 24/100. Here is what the gallery actually measures, why that number is what it is, and what it does and does not mean for your production agent.
We built Warden to answer a question that keeps coming up in enterprise AI reviews: how do you measure the governance posture of an agent framework before you commit to running it in production?
Not "is it secure." Not "is it compliant." Those are the wrong questions at the framework layer. The right question is: what governance surfaces does the codebase expose that an operator can plug policy into?
To put our own scanner on the spot, we ran it against the ten most-starred open-source agent frameworks on GitHub and published every report at sharkrouter.ai/gallery. Every score, every finding, every dimension breakdown is public. We didn't cherry-pick.
Here is what we found, and — more importantly — what it means and what it does not.
The Headline Numbers
| Framework | Score | Level | Findings | Top Strength |
|---|---|---|---|---|
| PydanticAI | 24/100 | UNGOVERNED | 387 | D12 Observability (80%) |
| CrewAI | 19/100 | UNGOVERNED | 2,171 | D14 Compliance Maturity (70%) |
| Langflow | 18/100 | UNGOVERNED | 119 | D8 Agent Identity (60%) |
| Haystack | 15/100 | UNGOVERNED | 316 | D7 Human-in-the-Loop (53%) |
| Semantic Kernel | 14/100 | UNGOVERNED | 1,708 | D7 Human-in-the-Loop (40%) |
| LangGraph | 14/100 | UNGOVERNED | 90 | D8 Agent Identity (73%) |
| LangChain | 13/100 | UNGOVERNED | 1,516 | D14 Compliance Maturity (40%) |
| LlamaIndex | 13/100 | UNGOVERNED | 755 | D11 Cloud / Platform (40%) |
| MetaGPT | 11/100 | UNGOVERNED | 447 | D14 Compliance Maturity (40%) |
| AutoGen | 6/100 | UNGOVERNED | 170 | D14 Compliance Maturity (30%) |
UNGOVERNED is the lowest of four levels in the Warden rubric (UNGOVERNED < 33, AT_RISK < 60, PARTIAL < 80, GOVERNED ≥ 80). Every single framework in the gallery is below the first threshold.
Before anyone pattern-matches this as "LangChain is insecure": that is not the claim. Keep reading.
What the Score Actually Measures
Warden scores a codebase across 17 dimensions — things like tool inventory, policy coverage, credential management, human-in-the-loop gates, agent identity, post-execution verification, adversarial resilience. Each dimension has a maximum point value tied to how load-bearing it is for real agent governance. The total caps at 235 raw points, normalized to /100.
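To make the arithmetic concrete, here is a minimal sketch of how a raw rubric total becomes the top-line number and a level. The 235-point cap and the four band thresholds come from this post; the function names are illustrative, not Warden's actual internals.

```python
# Illustrative sketch of Warden-style score normalization and level
# bucketing. RAW_MAX and the thresholds are taken from the post;
# normalize() and level() are hypothetical names, not Warden's API.

RAW_MAX = 235  # sum of all 17 dimension maxima

def normalize(raw_points: int) -> int:
    """Normalize a raw dimension total to a /100 score."""
    return round(100 * raw_points / RAW_MAX)

def level(score: int) -> str:
    """Map a /100 score onto the four-band rubric."""
    if score < 33:
        return "UNGOVERNED"
    if score < 60:
        return "AT_RISK"
    if score < 80:
        return "PARTIAL"
    return "GOVERNED"

# LangChain's dimension table below sums to 31 raw points:
# 31/235 normalizes to 13/100, which lands in UNGOVERNED.
```

Summing LangChain's dimension column later in this post (31 raw points) and normalizing it this way reproduces its published 13/100.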
The critical thing to understand is what a framework is, versus what an application is.
A framework is a toolbox. LangChain gives you Chain, AgentExecutor, BaseTool, CallbackManager. It does not give you a production policy engine, because that is not its job — the application built on top of it is supposed to wire in policy. Warden's score for a raw framework repository reflects what the framework ships with by default, not what you can build on top of it.
So when LangChain scores 13/100, the honest interpretation is: if you pip-install LangChain and run an agent with zero additional governance wiring, the resulting system has very few built-in controls that an external scanner can detect statically. That is true, and it is also not news to the LangChain team. It is the expected shape of a general-purpose framework.
What is news — and what the gallery surfaces — is the distribution of scores across dimensions. That tells you where the framework authors focused, and where operators will need to bring their own controls.
Where the Frameworks Actually Invest
Look at the top-strength column above. A clear pattern emerges:
Dimensions that consistently score above 30%:
- D7 Human-in-the-Loop gates (Haystack 53%, Semantic Kernel 40%, LangChain 33%; Langflow also ships the surface)
- D14 Compliance Maturity (LangChain, CrewAI, LlamaIndex, Langflow all 40%+)
- D12 LLM Observability (PydanticAI 80%, CrewAI 50%, Langflow 40%)
- D11 Cloud / Platform integrations (CrewAI 60%, LlamaIndex 40%)
These are the places the framework authors invested: callback hooks for human approval, logging scaffolding, telemetry, cloud provider clients. They're real signals; the authors knew operators would need them and shipped them.
Dimensions that consistently score zero or near-zero across the board:
- D2 Risk Detection — 0/20 in 8 of 10 frameworks
- D9 Threat Detection — 0/20 in 7 of 10 frameworks
- D16 Data Flow Governance — 0/10 in 10 of 10 frameworks
- D17 Adversarial Resilience — ≤ 2/10 in all 10 frameworks
- D10 Prompt Security — ≤ 3/15 in 9 of 10 frameworks
This is the real finding. Every major Python agent framework we scanned ships essentially nothing for runtime risk classification, threat detection, data flow tracking, or adversarial hardening. These aren't edge-case capabilities. They are exactly the controls that EU AI Act Article 14, NIST AI RMF, and the Google DeepMind "Agents, Traps, and Prompts" paper (SSRN 6372438) identify as load-bearing for agentic risk.
The frameworks leave that layer for you to build. Most applications don't build it.
The LangChain Deep-Dive: 1,516 Findings, But Read Them Carefully
LangChain scores 13/100 with 1,516 findings. It's our most-downloaded gallery target and the one people most want explained.
Here is the dimension breakdown:
| Dimension | Raw / Max | Percent |
|---|---|---|
| D1 Tool Inventory | 0 / 25 | 0% |
| D2 Risk Detection | 0 / 20 | 0% |
| D3 Policy Coverage | 6 / 20 | 30% |
| D4 Credential Management | 4 / 20 | 20% |
| D5 Log Hygiene | 1 / 10 | 10% |
| D6 Framework Coverage | 1 / 5 | 20% |
| D7 Human-in-the-Loop | 5 / 15 | 33% |
| D8 Agent Identity | 1 / 15 | 7% |
| D9 Threat Detection | 0 / 20 | 0% |
| D10 Prompt Security | 1 / 15 | 7% |
| D11 Cloud / Platform | 1 / 10 | 10% |
| D12 LLM Observability | 3 / 10 | 30% |
| D13 Data Recovery | 2 / 10 | 20% |
| D14 Compliance Maturity | 4 / 10 | 40% |
| D15 Post-Execution Verification | 1 / 10 | 10% |
| D16 Data Flow Governance | 0 / 10 | 0% |
| D17 Adversarial Resilience | 1 / 10 | 10% |
The 0/25 on D1 Tool Inventory is the most interesting number here. D1 measures whether the codebase exposes a discoverable catalog of available tools with metadata — names, descriptions, parameter schemas, capability flags. LangChain absolutely has tools, but they're declared per-agent at construction time, not registered into a global inventory that a governance layer can introspect. For an operator who wants to enforce "no agent may call the filesystem tool in production," there is no framework-level hook to attach that policy to. You have to wrap BaseTool yourself.
That is a design choice, not a defect — LangChain optimizes for flexibility, not central policy enforcement. But if you're putting LangChain in front of a customer-facing chatbot, that flexibility is your problem to constrain. The score is pointing at the gap honestly.
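The "wrap it yourself" work the D1 gap implies looks roughly like this. This is a generic sketch of the pattern, not LangChain's actual classes: `Tool`, `PolicyGate`, and `PolicyViolation` are hypothetical names standing in for a real framework tool and an operator-written wrapper.

```python
# Minimal sketch of the operator-side policy wrapper the post
# describes. Tool, PolicyGate, and PolicyViolation are illustrative
# names, not LangChain's API.

class PolicyViolation(Exception):
    pass

class Tool:
    """Stand-in for a framework tool: a name plus a callable."""
    def __init__(self, name, fn):
        self.name = name
        self.fn = fn

    def run(self, *args, **kwargs):
        return self.fn(*args, **kwargs)

class PolicyGate:
    """Wraps a tool and refuses calls the operator's policy denies."""
    def __init__(self, tool, denied_in_production, env):
        self.tool = tool
        self.denied = denied_in_production
        self.env = env

    def run(self, *args, **kwargs):
        if self.env == "production" and self.tool.name in self.denied:
            raise PolicyViolation(
                f"'{self.tool.name}' is denied in production")
        return self.tool.run(*args, **kwargs)

# Operator policy: no agent may call the filesystem tool in production.
fs_tool = Tool("filesystem", lambda path: f"(contents of {path})")
gated = PolicyGate(fs_tool, denied_in_production={"filesystem"},
                   env="production")
```

Because the framework ships no registry or hook, every tool in every agent needs this wrapper applied by hand, which is exactly the maintenance burden the 0/25 score is pointing at.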
The 1,516 findings, meanwhile, are almost entirely absence-based signals across the 0% dimensions plus pattern matches on provider SDK usage without adjacent policy wrappers. They are not "1,516 bugs in LangChain." They are 1,516 places a governance layer would normally attach a hook, where the framework gives you no native attachment point.
The Gallery Is the Point, Not the Score
Everyone fixates on the number. The number is a summary — and every summary lies a little. What the gallery is actually for is:
- Dimension-level comparison. When you're choosing between LangChain and PydanticAI for a regulated workload, the top-line score difference (13 vs 24) matters less than the shape. PydanticAI scored 80% on observability; LangChain scored 30%. If you need observability, PydanticAI shipped more of it by default. Pick based on the dimensions that matter to your workload.
- "What do I have to build?" lists. Every finding in every report names a specific governance surface that the framework doesn't ship. If you're deploying LangChain, D1 (tool inventory), D2 (risk detection), D9 (threat detection), D16 (data flow), and D17 (adversarial resilience) are the gaps you need to fill yourself — or buy a governance layer that fills them. You now have a shopping list.
- Drift detection. We re-run the gallery on each release. When a framework adds policy infrastructure — like LangGraph's growing checkpointer hooks, or PydanticAI's continuing observability investment — the score moves. We've seen PydanticAI climb from 18 to 24 over two releases as they added structured run metadata. That's the signal the score is designed to carry.
A Note on Frameworks We Are Not Trying To Dunk On
This post could easily be read as "Warden says your framework is bad." That is not what we are saying and it is not how we score.
Every framework in this gallery represents years of careful engineering. LangChain democratized agent development. LangGraph's state-machine design is genuinely novel. CrewAI's role abstraction maps cleanly to how humans think about team workflows. PydanticAI's type-safety investments are ahead of the curve. We use several of these in our own internal tooling.
What we are saying is:
Frameworks are not governance layers. They were never meant to be. Treating a 13/100 score as "LangChain is insecure" is a category error. Treating it as "here are the 17 governance surfaces that we, the operator, are responsible for wiring in" is the correct read.
If you're a framework maintainer reading this and you think we scored a specific dimension wrong, open an issue on Warden. The scoring rubric is version-controlled, the regexes are version-controlled, and we will fix false negatives. We've already fixed several in v1.7.0 after reviewing the gallery results.
Where The Gallery Is Going
v1.7.0 of Warden — shipped alongside this post — adds something important: a full C#/.NET scanner and a fix to the absence-vs-coverage scoring that was penalizing non-Python projects for Python-specific absences.
The first non-Python target in the gallery is VigIA-Orchestrator, a C#/.NET agent orchestrator that uses Microsoft.Extensions.AI, Result<T,E>, invariant enforcement, FSM-guarded state transitions, and strict JSON schema outputs. It scores 61/100 PARTIAL, the first target in the gallery to clear both the UNGOVERNED and AT_RISK bands.
That's not an accident. VigIA was designed with governance surfaces in mind from day one: explicit InvariantEnforcer patterns, AuthorizationPolicyBuilder policies, CancellationToken propagation, ImmutableDictionary state, and an IChatClient abstraction that makes post-execution verification straightforward to wire in. The score reflects that design work.
The takeaway is not "use .NET." The takeaway is: when framework authors invest in governance surfaces, the score moves. A Python framework that shipped the equivalent — a central tool registry with metadata, a policy engine hook on every tool call, a structured risk-classification step, data-flow tracking for RAG retrievals — would land in the same PARTIAL band. Nothing about Python prevents it. The frameworks just haven't prioritized it yet.
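What would that equivalent look like in Python? Here is a hypothetical sketch of the first two surfaces named above, a central tool registry with metadata and a policy hook fired on every call. None of these names are a real framework's API; they illustrate the shape a scanner could detect.

```python
# Hypothetical sketch of the governance surfaces a Python framework
# could ship: an introspectable tool registry (the D1 surface) plus a
# policy hook consulted on every tool call (the D3 surface). All names
# here are illustrative, not a real API.

from dataclasses import dataclass
from typing import Callable

@dataclass
class ToolSpec:
    """Enough metadata for a governance layer to inventory the tool."""
    name: str
    description: str
    capabilities: frozenset  # e.g. {"network", "filesystem"}
    fn: Callable

class ToolRegistry:
    def __init__(self):
        self._tools = {}
        self._policy_hooks = []

    def register(self, spec):
        self._tools[spec.name] = spec

    def inventory(self):
        # The introspection surface a scanner can detect statically.
        return [(s.name, s.description, sorted(s.capabilities))
                for s in self._tools.values()]

    def add_policy_hook(self, hook):
        self._policy_hooks.append(hook)

    def call(self, name, *args, **kwargs):
        spec = self._tools[name]
        for hook in self._policy_hooks:
            hook(spec)  # a hook denies the call by raising
        return spec.fn(*args, **kwargs)

registry = ToolRegistry()
registry.register(ToolSpec("search", "Web search",
                           frozenset({"network"}),
                           lambda q: f"results for {q}"))
registry.register(ToolSpec("read_file", "Read a local file",
                           frozenset({"filesystem"}),
                           lambda p: f"bytes of {p}"))

def deny_filesystem(spec):
    """Operator policy attached centrally, once, for every agent."""
    if "filesystem" in spec.capabilities:
        raise PermissionError(f"{spec.name} touches the filesystem")

registry.add_policy_hook(deny_filesystem)
```

With a surface like this, the "no filesystem tool in production" rule from the LangChain deep-dive becomes one hook registered in one place instead of a wrapper applied to every tool by hand.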
Try It Yourself
Warden is open source. You can run it on your own agent application right now:
```shell
pip install warden-ai
warden scan ./my-agent-app --format html
```
You get the same score, the same dimension breakdown, the same findings list, and the same HTML report as the gallery. Everything is local. Nothing leaves your machine. We have no telemetry in the scanner.
If you score 40/100 on your own application — congratulations, you're already doing better than every framework in the gallery, because you've wired in application-level policy that the framework doesn't ship. If you score 13/100, you now have a list of exactly what to fix.
Compare the full reports at sharkrouter.ai/gallery. Every framework, every finding, every dimension — public, reproducible, version-tagged to Warden v1.7.0.
The gallery is built and rebuilt by warden gallery build in the open-source repo — no manual curation. If you'd like your framework included in the next build, open a PR against gallery/targets.toml in SharkRouter/warden.
