Jina + Claude Code

SOTA42.10%

Claude Opus 4.7 driving Jina MCP over Playwright across 181 AssistantBench test-split tasks. Metrics below are the official HuggingFace grader scores. The task list shows local run status: whether the agent emitted a FINAL answer within its time budget.

About

assistantbench.github.io ↗arXiv:2407.15711 ↗dataset ↗

AssistantBench evaluates language agents on realistic, time-consuming web-research tasks — the kind that require aggregating across many sites, carrying state between steps, and answering with a concrete value. Example questions: “Which EU countries have the largest trade deficit with China relative to total trade?” or “Nearest store selling canned jackfruit to this address.” Test split is 181 tasks with hidden gold answers.

The benchmark was introduced by Yoran et al. at EMNLP 2024.

HuggingFace graderleaderboard ↗

Accuracy42.1%

Answer rate100%

Precision42.1%

Exact match22.7%

Easy87.6%

Medium62.9%

Hard28.6%

base: claude-opus-4-7 + gpt-5.4-nano-2026-03-17

Tasks (181)

181 results