{"leaderboard":[{"id":1,"runtime_name":"agent-verifier","vendor":"Workloft","url":"https://gitlab.com/Alfpl/agent-verifier","total_score":5.67,"axis_scores":{"A1":{"score":6,"reason":"Agent-verifier focuses on pre-execution checks for individual agent actions, rather than multi-agent orchestration or control loops."},"A2":{"score":7,"reason":"The runtime provides pre-send verification for tool use, catching hallucinated tool references and validating calls before execution."},"A3":{"score":4,"reason":"While it verifies agent session state, the description doesn't detail advanced RAG, episodic, or semantic memory architectures."},"A4":{"score":5,"reason":"The system performs pre-send verification, which can contribute to an audit trail by preventing invalid actions, but explicit cryptographic provenance is not detailed."},"A5":{"score":7,"reason":"Agent-verifier directly addresses model risk by catching issues like hardcoded secrets, unbounded loops, and refusal calibration before execution."},"A6":{"score":3,"reason":"The description does not provide information on jurisdictional routing, on-prem viability, or sovereign-private model paths."},"A7":{"score":5,"reason":"The system focuses on deterministic and semantic checks, implying a form of evaluation, but specific uncertainty estimation or confidence calibration signals are not explicitly mentioned."},"A8":{"score":5,"reason":"The library is described as having no required dependencies and being BYO LLM, suggesting flexibility and potential for cost efficiency."},"A9":{"score":9,"reason":"Agent-verifier is Apache 2.0 licensed, has no required dependencies, and is described as a reference implementation, indicating high replicability."}},"scored_at":"2026-05-09T19:23:57.494967+00:00"}],"count":1,"axes":[{"id":"A1","name":"Agent infra","category":"substrate","definition":"Multi-agent runtimes, orchestration, control loops, agent OS primitives. Does the paper change how an agent stack is built, not just how a single model is prompted?"},{"id":"A2","name":"Tool use · MCP","category":"substrate","definition":"How agents discover, invoke, and validate tools — including MCP server design, schema typing, and authorisation surfaces."},{"id":"A3","name":"RAG · memory","category":"substrate","definition":"Retrieval, long-context, episodic and semantic memory architectures. Reasoning-intensive retrieval rather than vanilla vector search."},{"id":"A4","name":"Audit · provenance","category":"governance","definition":"Cryptographic evidence of agent action — mandate signing (AP2), append-only chains, verifiable historic state for an audit committee."},{"id":"A5","name":"Governance · model risk","category":"governance","definition":"Alignment, refusal calibration, FCA SS1/23 model risk concerns. Anything a Risk function would point at when reading the methods section."},{"id":"A6","name":"Sovereignty · routing","category":"governance","definition":"Jurisdictional routing, on-prem viability, EU AI Act / UK DPA disclosures, sovereign-private model paths. Does this paper survive a 'no US calls' deployment?"},{"id":"A7","name":"Eval · calibration","category":"research","definition":"Uncertainty estimation, confidence calibration, regulator-readable signals. A model that says '87%' and is right ~87% of the time matters more than a SOTA single number."},{"id":"A8","name":"Cost · efficiency","category":"research","definition":"Distillation, KV-cache discipline, sub-billion-param viability, token economics. Substrate that doesn't scale to a council's budget isn't substrate."},{"id":"A9","name":"Replicability","category":"research","definition":"Open code, open weights, runs on a single machine, README that builds. Claims you cannot reproduce are claims you cannot deploy."}],"scoring_rubric_url":"https://workloft.ai/labs/full.html#axes"}