We’re building expert AI research agents at Grep.ai — think due diligence reports, business research, compliance checks. The kind of work where you need depth, accuracy, and real sources.
Getting the agent to run was the easy part. Making sure it’s actually good? That’s where things get interesting.
The problem with “it works”
Our initial eval setup was basic: does the agent complete the task? Does it return a report? Does it cite sources? Check, check, check.
But that doesn’t tell you if the report is correct. Or if it’s thorough. Or if it’s better than what a human analyst would produce — or what OpenAI’s Deep Research or Gemini would give you for the same query.
We’re building domain-expert agents. “Expert” is a strong claim. How do you prove it?
Single-answer benchmarks don’t fit
The standard approach for evaluating AI is benchmarks like GAIA or Humanity’s Last Exam — give the model a question, check if the answer is correct. Clean and simple.
But our agents don’t produce answers. They produce 10-page research reports. There’s no single “correct” response to compare against. The output is long-form, multi-faceted, and inherently subjective in some dimensions.
So I went looking for how others solve this.
What I found: the decompose-then-verify approach
The most rigorous methodology for evaluating long-form content comes from FActScore and Google’s SAFE framework:
- Break the report into atomic claims — individual facts that can be verified independently
- Verify each claim against authoritative sources (Google Search, domain databases, etc.)
- Calculate precision — what percentage of claims are actually supported?
This is how the FActScore authors found that ChatGPT achieves only ~58% factual precision on biography generation when you actually check the claims. Sobering.
For research reports, VeriScore adds a useful refinement: distinguishing verifiable claims from subjective analysis. You can’t fact-check an opinion, but you can check whether the facts supporting that opinion are accurate.
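To make that loop concrete, here's a minimal sketch of the decompose-then-verify pipeline. It's illustrative rather than production code: it assumes an OpenAI-compatible client as the extractor and verifier, the model name is a placeholder, and `web_search` is a stand-in you'd back with your own search API or domain database.

```python
# Minimal decompose-then-verify sketch (FActScore / SAFE style).
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # placeholder; use whatever extractor/verifier model you trust


def extract_claims(report: str) -> list[str]:
    """Break a report into atomic, independently verifiable claims."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": "List every atomic, verifiable factual claim in the text "
                       "below, one per line. Skip opinions and analysis.\n\n" + report,
        }],
    )
    lines = resp.choices[0].message.content.splitlines()
    return [line.lstrip("- ").strip() for line in lines if line.strip()]


def verify_claim(claim: str, evidence: str) -> bool:
    """Ask the verifier whether the retrieved evidence supports the claim."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": f"Claim: {claim}\n\nEvidence:\n{evidence}\n\n"
                       "Answer with exactly one word: SUPPORTED or NOT_SUPPORTED.",
        }],
    )
    return "NOT_SUPPORTED" not in resp.choices[0].message.content.upper()


def factual_precision(report: str, web_search) -> float:
    """Share of extracted claims the evidence supports (the FActScore-style number)."""
    claims = extract_claims(report)
    if not claims:
        return 0.0
    supported = sum(verify_claim(c, web_search(c)) for c in claims)
    return supported / len(claims)
```

The number that comes out is the precision from step three: what fraction of the report's atomic claims actually survive contact with evidence.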
LLM-as-judge for the subjective stuff
Factuality is necessary but not sufficient. A report can be 100% factually accurate and still be shallow, poorly organized, or miss the point entirely.
For these “softer” dimensions, G-Eval provides a solid framework:
- Define explicit criteria and rubrics for each dimension (depth, coherence, instruction-following)
- Have the LLM reason through its evaluation before scoring
- Use a different model family than your agent to avoid self-preference bias
The key insight: evaluate one dimension per prompt. Asking “rate this report’s quality” gives you noise. Asking “does this report consider multiple perspectives on the topic?” gives you signal.
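Here's what that looks like for a single dimension, sketched with Anthropic's client as the judge so the judge comes from a different model family than the agent under test. The rubric wording, scale, and model name are illustrative, not a fixed recipe.

```python
# Single-dimension LLM-as-judge sketch (G-Eval style): one rubric, one score.
import anthropic

judge = anthropic.Anthropic()

DEPTH_RUBRIC = """You grade exactly one dimension: depth of analysis.
1 = surface-level summary, no real analysis
3 = some analysis, but key perspectives or second-order effects are missing
5 = weighs multiple perspectives, trade-offs, and second-order effects
Reason step by step first, then finish with a line 'SCORE: <1-5>'."""


def judge_depth(report: str) -> int:
    msg = judge.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder judge model, different family than our agent
        max_tokens=1024,
        system=DEPTH_RUBRIC,
        messages=[{"role": "user", "content": report}],
    )
    text = msg.content[0].text
    return int(text.rsplit("SCORE:", 1)[1].strip()[0])
```

Run one of these per dimension and aggregate afterwards. Stuffing every rubric into a single prompt is exactly what produces the noisy "rate this report" scores.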
The benchmarks that actually exist
For comparing against commercial solutions, DeepResearch Bench is the closest to what we need — 100 PhD-level research tasks evaluated on completeness, depth, and citation accuracy. Current leaderboard: Gemini Deep Research at 48.88, OpenAI Deep Research at 46.98.
For domain-specific evaluation:
- Legal: LegalBench — 162 tasks across 6 reasoning types
- Medical: MedXpertQA — 4,460 questions across 17 specialties
- Financial: PIXIU/FinBen — 24 tasks including stock trading
Notable gap: no standardized benchmarks for compliance/KYC/AML. That’s our domain. We’ll probably have to build our own.
What we’re actually going to do
Based on this research, here’s the eval stack I’m planning:
Tier 1 — Automated (every run)
- Task completion and format compliance
- Claim extraction + verification via search (FActScore-style)
- Citation validity checks (do sources exist? do they support the claims?)
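The citation check in particular is cheap to automate. A rough sketch, assuming citations arrive as claim/URL pairs (our own convention) and reusing the `verify_claim` helper from the decompose-then-verify sketch above:

```python
# Tier 1 citation checks: does the cited URL resolve, and does its text
# plausibly support the claim attached to it?
import requests


def citation_checks(citations: list[dict]) -> dict:
    """citations: [{'claim': str, 'url': str}, ...] (our own convention)."""
    results = {"total": len(citations), "resolvable": 0, "supported": 0}
    for c in citations:
        try:
            resp = requests.get(c["url"], timeout=10)
        except requests.RequestException:
            continue  # dead link: neither resolvable nor supported
        if resp.ok:
            results["resolvable"] += 1
            if verify_claim(c["claim"], resp.text[:20000]):  # truncate long pages
                results["supported"] += 1
    return results
```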
Tier 2 — LLM-as-judge (sampled)
- Depth of analysis
- Instruction following
- Reasoning coherence
- Domain-appropriate caveats
Tier 3 — Human expert (calibration + edge cases)
- Build gold-standard dataset for calibrating automated evals (see the agreement sketch below)
- Review cases where Tier 1 and Tier 2 disagree
- Periodic audits on production outputs
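For the calibration piece, the number we care about is how well the automated scores track the human gold labels, plus a queue of reports where they diverge. A minimal sketch, assuming paired per-report scores on the same scale and a disagreement threshold we picked arbitrarily:

```python
# Tier 3 calibration: do the automated judges track the human gold labels?
from scipy.stats import spearmanr


def calibration_report(auto_scores: list[float], human_scores: list[float],
                       gap: float = 1.0) -> dict:
    """Rank correlation plus the indices of reports worth a second human look."""
    rho, p_value = spearmanr(auto_scores, human_scores)
    review_queue = [
        i for i, (a, h) in enumerate(zip(auto_scores, human_scores))
        if abs(a - h) >= gap  # large automated-vs-human gaps go to expert review
    ]
    return {"spearman_rho": rho, "p_value": p_value, "review_queue": review_queue}
```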
Tier 4 — Competitive benchmarking
- Same queries to our agents vs. OpenAI/Gemini/Perplexity
- Track accuracy, depth, cost, and latency
- Pareto frontier analysis (quality vs. cost)
The efficiency question
One thing that surprised me in the research: Anthropic found that token usage explains 80% of performance variance on browsing benchmarks. More tokens = better results, up to a point.
This means you can’t just compare accuracy — you need quality-adjusted cost metrics. A system that’s 10% more accurate but 5x more expensive might not be the right tradeoff.
We’re tracking cost-per-query alongside accuracy. The goal is finding the Pareto frontier: maximum quality for a given budget.
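Computing that frontier is straightforward once every system is reduced to a quality score and a cost per query. A sketch with made-up numbers (any scalar quality metric works, whether factual precision or an averaged judge score):

```python
# Tier 4: the quality-vs-cost Pareto frontier across systems.
def pareto_frontier(systems: list[tuple[str, float, float]]) -> list[tuple[str, float, float]]:
    """systems: (name, quality, cost_per_query). Keep only non-dominated points."""
    frontier = []
    best_quality = float("-inf")
    # Walk from cheapest to most expensive; keep a system only if it strictly
    # improves quality over everything cheaper than it.
    for name, quality, cost in sorted(systems, key=lambda s: s[2]):
        if quality > best_quality:
            frontier.append((name, quality, cost))
            best_quality = quality
    return frontier


# Hypothetical numbers, purely for illustration:
print(pareto_frontier([
    ("ours", 0.78, 0.40),
    ("vendor-a", 0.80, 2.10),
    ("vendor-b", 0.70, 0.90),  # dominated: pricier than "ours" and lower quality
]))
# -> [('ours', 0.78, 0.4), ('vendor-a', 0.8, 2.1)]
```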
Still figuring it out
This is a work in progress. The eval framework will evolve as we learn what actually predicts real-world usefulness.
If you’re working on similar problems — evaluating AI agents that produce long-form outputs, especially in regulated domains — I’d love to compare notes. Reach out on LinkedIn.