How we test

AgentJury runs five automated test suites against every MCP server and agent skill. Each produces structured data, not prose. Scores are weighted averages of measurable outcomes.

Security (25% of score)

Input fuzzing with 50+ payloads: SQL injection, path traversal, command injection, prompt injection. Permission escalation attempts. Secrets scanning in source code.

Gate rule: A single critical vulnerability caps the overall score at 3.0 regardless of other results.

Reliability (25% of score)

100 calls with varied inputs: 60 valid, 20 edge cases, 20 intentionally malformed. We measure success rate, p95 latency, and whether error messages are parseable by an agent (not just readable by a human).

Agent usability (20% of score)

An AI agent receives only the tool's published description and tries to complete 10 tasks. No documentation, no examples beyond what the tool itself provides. First-try success rate and common failure modes are recorded.

Compatibility (15% of score)

The tool is tested against Claude Code, OpenAI Agents SDK, and LangChain. Pass, fail, or partial for each, with notes on what breaks.

Code health (15% of score)

Static checks: days since last commit, open issue count, test coverage, dependency freshness, license type, number of active contributors.

What we test

MCP

MCP servers

Model Context Protocol servers from the official repository, cloud providers (AWS, GCP, Cloudflare), and the community. Tested for transport compatibility, tool schema quality, and runtime behavior.

SKILL

Agent skills

Skills from OpenClaw (ClawHub), skills.sh, and the Anthropic skills repository. Tested for instruction quality, security implications, and whether an agent can follow them on first try.

Score thresholds

8.0 - 10.0

Recommended

Reliable, secure, well-maintained

6.0 - 7.9

Acceptable

Works but has notable issues

4.0 - 5.9

Use with caution

Significant reliability or security concerns

0.0 - 3.9

Not recommended

Critical issues found