Why Every Python Agent Framework We Scanned Scored UNGOVERNED
We ran Warden against 10 of the most popular open-source AI agent frameworks — LangChain, LangGraph, CrewAI, AutoGen, Haystack, LlamaIndex, Semantic Kernel, PydanticAI, MetaGPT, and Langflow. None scored above 24/100. Here is what the gallery actually measures, why that number is what it is, and what it does and does not mean for your production agent.
We built Warden to answer a question that keeps coming up in enterprise AI reviews: how do you measure the governance posture of an agent framework before you commit to running it in production?
Not "is it secure." Not "is it compliant." Those are the wrong questions at the framework layer. The right question is: what governance surfaces does the codebase expose that an operator can plug policy into?
To put our own scanner on the spot, we ran it against the ten most-starred open-source agent frameworks on GitHub and published every report at sharkrouter.ai/gallery. Every score, every finding, every dimension breakdown is public. We didn't cherry-pick.
Here is what we found, and — more importantly — what it means and what it does not.
The Headline Numbers
| Framework | Score | Level | Findings | Top Strength |
|---|---|---|---|---|
| PydanticAI | 24/100 | UNGOVERNED | 387 | D12 Observability (80%) |
| CrewAI | 19/100 | UNGOVERNED | 2,171 | D14 Compliance Maturity (70%) |
| Langflow | 18/100 | UNGOVERNED | 119 | D8 Agent Identity (60%) |
| Haystack | 15/100 | UNGOVERNED | 316 | D7 Human-in-the-Loop (53%) |
| Semantic Kernel | 14/100 | UNGOVERNED | 1,708 | D7 Human-in-the-Loop (40%) |
| LangGraph | 14/100 | UNGOVERNED | 90 | D8 Agent Identity (73%) |
| LangChain | 13/100 | UNGOVERNED | 1,516 | D14 Compliance Maturity (40%) |
| LlamaIndex | 13/100 | UNGOVERNED | 755 | D11 Cloud / Platform (40%) |
| MetaGPT | 11/100 | UNGOVERNED | 447 | D14 Compliance Maturity (40%) |
| AutoGen | 6/100 | UNGOVERNED | 170 | D14 Compliance Maturity (30%) |
UNGOVERNED is the lowest of four levels in the Warden rubric (UNGOVERNED < 33, AT_RISK < 60, PARTIAL < 80, GOVERNED ≥ 80). Every single framework in the gallery is below the first threshold.
Before anyone pattern-matches this as "LangChain is insecure": that is not the claim. Keep reading.
What the Score Actually Measures
Warden scores a codebase across 17 dimensions — things like tool inventory, policy coverage, credential management, human-in-the-loop gates, agent identity, post-execution verification, adversarial resilience. Each dimension has a maximum point value tied to how load-bearing it is for real agent governance. The total caps at 235 raw points, normalized to /100.
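To make the arithmetic concrete, here is a minimal sketch of how a raw rubric total becomes the top-line number and a level. The 235-point cap and the four band thresholds come from this post; the function names are illustrative, not Warden's actual internals.

```python
# Illustrative sketch of Warden-style score normalization and level
# bucketing. RAW_MAX and the thresholds are taken from the post;
# normalize() and level() are hypothetical names, not Warden's API.

RAW_MAX = 235  # sum of all 17 dimension maxima

def normalize(raw_points: int) -> int:
    """Normalize a raw dimension total to a /100 score."""
    return round(100 * raw_points / RAW_MAX)

def level(score: int) -> str:
    """Map a /100 score onto the four-band rubric."""
    if score < 33:
        return "UNGOVERNED"
    if score < 60:
        return "AT_RISK"
    if score < 80:
        return "PARTIAL"
    return "GOVERNED"

# LangChain's dimension table below sums to 31 raw points:
# 31/235 normalizes to 13/100, which lands in UNGOVERNED.
```

Summing LangChain's dimension column later in this post (31 raw points) and normalizing it this way reproduces its published 13/100.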
The critical thing to understand is what a framework is, versus what an application is.
A framework is a toolbox. LangChain gives you Chain, AgentExecutor, BaseTool, CallbackManager. It does not give you a production policy engine, because that is not its job — the application built on top of it is supposed to wire in policy. Warden's score for a raw framework repository reflects what the framework ships with by default, not what you can build on top of it.
So when LangChain scores 13/100, the honest interpretation is: if you pip-install LangChain and run an agent with zero additional governance wiring, the resulting system has very few built-in controls that an external scanner can detect statically. That is true, and it is also not news to the LangChain team. It is the expected shape of a general-purpose framework.
What is news — and what the gallery surfaces — is the distribution of scores across dimensions. That tells you where the framework authors focused, and where operators will need to bring their own controls.
Where the Frameworks Actually Invest
Look at the top-strength column above. A clear pattern emerges:
Dimensions that consistently score above 30%:
- D7 Human-in-the-Loop gates (Haystack 53%, Semantic Kernel 40%, LangChain 33%; Langflow also ships the surface)
- D14 Compliance Maturity (LangChain, CrewAI, LlamaIndex, Langflow all 40%+)
- D12 LLM Observability (PydanticAI 80%, CrewAI 50%, Langflow 40%)
- D11 Cloud / Platform integrations (CrewAI 60%, LlamaIndex 40%)
These are the places the framework authors invested: callback hooks for human approval, logging scaffolding, telemetry, cloud provider clients. They're real signals; the authors knew operators would need them and shipped them.
Dimensions that consistently score zero or near-zero across the board:
- D2 Risk Detection — 0/20 in 8 of 10 frameworks
- D9 Threat Detection — 0/20 in 7 of 10 frameworks
- D16 Data Flow Governance — 0/10 in 10 of 10 frameworks
- D17 Adversarial Resilience — ≤ 2/10 in all 10 frameworks
- D10 Prompt Security — ≤ 3/15 in 9 of 10 frameworks
This is the real finding. Every major Python agent framework we scanned ships essentially nothing for runtime risk classification, threat detection, data flow tracking, or adversarial hardening. These aren't edge-case capabilities. They are exactly the controls that EU AI Act Article 14, NIST AI RMF, and the Google DeepMind "Agents, Traps, and Prompts" paper (SSRN 6372438) identify as load-bearing for agentic risk.
The frameworks leave that layer for you to build. Most applications don't build it.
The LangChain Deep-Dive: 1,516 Findings, But Read Them Carefully
LangChain scores 13/100 with 1,516 findings. It's our most-downloaded gallery target and the one people most want explained.
Here is the dimension breakdown:
| Dimension | Raw / Max | Percent |
|---|---|---|
| D1 Tool Inventory | 0 / 25 | 0% |
| D2 Risk Detection | 0 / 20 | 0% |
| D3 Policy Coverage | 6 / 20 | 30% |
| D4 Credential Management | 4 / 20 | 20% |
| D5 Log Hygiene | 1 / 10 | 10% |
| D6 Framework Coverage | 1 / 5 | 20% |
| D7 Human-in-the-Loop | 5 / 15 | 33% |
| D8 Agent Identity | 1 / 15 | 7% |
| D9 Threat Detection | 0 / 20 | 0% |
| D10 Prompt Security | 1 / 15 | 7% |
| D11 Cloud / Platform | 1 / 10 | 10% |
| D12 LLM Observability | 3 / 10 | 30% |
| D13 Data Recovery | 2 / 10 | 20% |
| D14 Compliance Maturity | 4 / 10 | 40% |
| D15 Post-Execution Verification | 1 / 10 | 10% |
| D16 Data Flow Governance | 0 / 10 | 0% |
| D17 Adversarial Resilience | 1 / 10 | 10% |
The 0/25 on D1 Tool Inventory is the most interesting number here. D1 measures whether the codebase exposes a discoverable catalog of available tools with metadata — names, descriptions, parameter schemas, capability flags. LangChain absolutely has tools, but they're declared per-agent at construction time, not registered into a global inventory that a governance layer can introspect. For an operator who wants to enforce "no agent may call the filesystem tool in production," there is no framework-level hook to attach that policy to. You have to wrap BaseTool yourself.
That is a design choice, not a defect — LangChain optimizes for flexibility, not central policy enforcement. But if you're putting LangChain in front of a customer-facing chatbot, that flexibility is your problem to constrain. The score is pointing at the gap honestly.
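The "wrap it yourself" work the D1 gap implies looks roughly like this. This is a generic sketch of the pattern, not LangChain's actual classes: `Tool`, `PolicyGate`, and `PolicyViolation` are hypothetical names standing in for a real framework tool and an operator-written wrapper.

```python
# Minimal sketch of the operator-side policy wrapper the post
# describes. Tool, PolicyGate, and PolicyViolation are illustrative
# names, not LangChain's API.

class PolicyViolation(Exception):
    pass

class Tool:
    """Stand-in for a framework tool: a name plus a callable."""
    def __init__(self, name, fn):
        self.name = name
        self.fn = fn

    def run(self, *args, **kwargs):
        return self.fn(*args, **kwargs)

class PolicyGate:
    """Wraps a tool and refuses calls the operator's policy denies."""
    def __init__(self, tool, denied_in_production, env):
        self.tool = tool
        self.denied = denied_in_production
        self.env = env

    def run(self, *args, **kwargs):
        if self.env == "production" and self.tool.name in self.denied:
            raise PolicyViolation(
                f"'{self.tool.name}' is denied in production")
        return self.tool.run(*args, **kwargs)

# Operator policy: no agent may call the filesystem tool in production.
fs_tool = Tool("filesystem", lambda path: f"(contents of {path})")
gated = PolicyGate(fs_tool, denied_in_production={"filesystem"},
                   env="production")
```

Because the framework ships no registry or hook, every tool in every agent needs this wrapper applied by hand, which is exactly the maintenance burden the 0/25 score is pointing at.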
The 1,516 findings, meanwhile, are almost entirely absence-based signals across the 0% dimensions plus pattern matches on provider SDK usage without adjacent policy wrappers. They are not "1,516 bugs in LangChain." They are 1,516 places a governance layer would normally attach a hook, where the framework gives you no native attachment point.
The Gallery Is the Point, Not the Score
Everyone fixates on the number. The number is a summary — and every summary lies a little. What the gallery is actually for is:
- Dimension-level comparison. When you're choosing between LangChain and PydanticAI for a regulated workload, the top-line score difference (13 vs 24) matters less than the shape. PydanticAI scored 80% on observability; LangChain scored 30%. If you need observability, PydanticAI shipped more of it by default. Pick based on the dimensions that matter to your workload.
- "What do I have to build?" lists. Every finding in every report names a specific governance surface that the framework doesn't ship. If you're deploying LangChain, D1 (tool inventory), D2 (risk detection), D9 (threat detection), D16 (data flow), and D17 (adversarial resilience) are the gaps you need to fill yourself — or buy a governance layer that fills them. You now have a shopping list.
- Drift detection. We re-run the gallery on each release. When a framework adds policy infrastructure — like LangGraph's growing checkpointer hooks, or PydanticAI's continuing observability investment — the score moves. We've seen PydanticAI climb from 18 to 24 over two releases as they added structured run metadata. That's the signal the score is designed to carry.
A Note on Frameworks We Are Not Trying To Dunk On
This post could easily be read as "Warden says your framework is bad." That is not what we are saying and it is not how we score.
Every framework in this gallery represents years of careful engineering. LangChain democratized agent development. LangGraph's state-machine design is genuinely novel. CrewAI's role abstraction maps cleanly to how humans think about team workflows. PydanticAI's type-safety investments are ahead of the curve. We use several of these in our own internal tooling.
What we are saying is:
Frameworks are not governance layers. They were never meant to be. Treating a 13/100 score as "LangChain is insecure" is a category error. Treating it as "here are the 17 governance surfaces that we, the operator, are responsible for wiring in" is the correct read.
If you're a framework maintainer reading this and you think we scored a specific dimension wrong, open an issue on Warden. The scoring rubric is version-controlled, the regexes are version-controlled, and we will fix false negatives. We've already fixed several in v1.7.0 after reviewing the gallery results.
Where The Gallery Is Going
v1.7.0 of Warden — shipped alongside this post — adds something important: a full C#/.NET scanner and a fix to the absence-vs-coverage scoring that was penalizing non-Python projects for Python-specific absences.
The first non-Python target in the gallery is VigIA-Orchestrator, a C#/.NET agent orchestrator that uses Microsoft.Extensions.AI, Result<T,E>, invariant enforcement, FSM-guarded state transitions, and strict JSON schema outputs. It scores 61/100 PARTIAL, the first target in the gallery to clear both the UNGOVERNED and AT_RISK bands.
That's not an accident. VigIA was designed with governance surfaces in mind from day one: explicit InvariantEnforcer patterns, AuthorizationPolicyBuilder policies, CancellationToken propagation, ImmutableDictionary state, and an IChatClient abstraction that makes post-execution verification straightforward to wire in. The score reflects that design work.
The takeaway is not "use .NET." The takeaway is: when framework authors invest in governance surfaces, the score moves. A Python framework that shipped the equivalent — a central tool registry with metadata, a policy engine hook on every tool call, a structured risk-classification step, data-flow tracking for RAG retrievals — would land in the same PARTIAL band. Nothing about Python prevents it. The frameworks just haven't prioritized it yet.
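What would that equivalent look like in Python? Here is a hypothetical sketch of the first two surfaces named above, a central tool registry with metadata and a policy hook fired on every call. None of these names are a real framework's API; they illustrate the shape a scanner could detect.

```python
# Hypothetical sketch of the governance surfaces a Python framework
# could ship: an introspectable tool registry (the D1 surface) plus a
# policy hook consulted on every tool call (the D3 surface). All names
# here are illustrative, not a real API.

from dataclasses import dataclass
from typing import Callable

@dataclass
class ToolSpec:
    """Enough metadata for a governance layer to inventory the tool."""
    name: str
    description: str
    capabilities: frozenset  # e.g. {"network", "filesystem"}
    fn: Callable

class ToolRegistry:
    def __init__(self):
        self._tools = {}
        self._policy_hooks = []

    def register(self, spec):
        self._tools[spec.name] = spec

    def inventory(self):
        # The introspection surface a scanner can detect statically.
        return [(s.name, s.description, sorted(s.capabilities))
                for s in self._tools.values()]

    def add_policy_hook(self, hook):
        self._policy_hooks.append(hook)

    def call(self, name, *args, **kwargs):
        spec = self._tools[name]
        for hook in self._policy_hooks:
            hook(spec)  # a hook denies the call by raising
        return spec.fn(*args, **kwargs)

registry = ToolRegistry()
registry.register(ToolSpec("search", "Web search",
                           frozenset({"network"}),
                           lambda q: f"results for {q}"))
registry.register(ToolSpec("read_file", "Read a local file",
                           frozenset({"filesystem"}),
                           lambda p: f"bytes of {p}"))

def deny_filesystem(spec):
    """Operator policy attached centrally, once, for every agent."""
    if "filesystem" in spec.capabilities:
        raise PermissionError(f"{spec.name} touches the filesystem")

registry.add_policy_hook(deny_filesystem)
```

With a surface like this, the "no filesystem tool in production" rule from the LangChain deep-dive becomes one hook registered in one place instead of a wrapper applied to every tool by hand.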
Try It Yourself
Warden is open source. You can run it on your own agent application right now:
```shell
pip install warden-ai
warden scan ./my-agent-app --format html
```
You get the same score, the same dimension breakdown, the same findings list, and the same HTML report as the gallery. Everything is local. Nothing leaves your machine. We have no telemetry in the scanner.
If you score 40/100 on your own application — congratulations, you're already doing better than every framework in the gallery, because you've wired in application-level policy that the framework doesn't ship. If you score 13/100, you now have a list of exactly what to fix.
Compare the full reports at sharkrouter.ai/gallery. Every framework, every finding, every dimension — public, reproducible, version-tagged to Warden v1.7.0.
The gallery is built and rebuilt by warden gallery build in the open-source repo — no manual curation. If you'd like your framework included in the next build, open a PR against gallery/targets.toml in SharkRouter/warden.
