AI Reviews

Independent human evaluation of production AIs. Reports built from our data, plus the arenas where the evaluation happens.

Reports

Live · Stanford I4UI 2026 · May 10, 2026

AI in Healthcare — Stanford I4UI 2026

Cancer diagnoses. Suicidal teens. End-of-life decisions. Ten AI models. Real humans rating which ones can be trusted — judged on tone and responsibility, not medical correctness.

Essay Apr 29, 2026

Your Eval Pipeline Has Zero Disagreement

If your automated eval never flags anything, it's not measuring quality — it's confirming your assumptions. 16,668 human evaluations show what LLM-as-judge misses.

Report Apr 2026

Can Grok Analyze Instagram Posts?

We tested Grok on social media tasks — Reel scripts, platform strategy, content creation. See what 154 human reviewers found.

Report Apr 2026

Grok 4.1 Fast Review

1,789 blind evaluations by 154 reviewers. 92.4% pass rate. See specific flags, comparisons, and what reviewers actually said.

Report Apr 2026

Grok for Social Media Marketing

Email copy, social posts, ad scripts — broken down by format with pass rates, flag patterns, and ROI analysis.

Report Apr 2026

Is Grok Good for Marketing?

154 reviewers blind-tested Grok 4 and Grok 4.1 Fast against GPT, Claude, and Gemini. See where Grok ranks and what it gets flagged for.

Report Apr 2026

Grok's Personality & Humor

Does Grok's edgy personality help or hurt? Reviewer data on humor, tone mismatches, and when personality works and when it backfires.

Report Apr 2026

Grok for Marketers: ROI & Quality Data

3,162 human evaluations of Grok's marketing output. Use this data to prove ROI, benchmark quality, and make informed AI decisions.

Report Apr 2026

ChatGPT vs Claude for Marketing

What 147 human reviewers found when blind-testing GPT-5.4, GPT-5.2, Claude Sonnet 4.6, and Claude Opus 4.6 on marketing tasks.

Looking for the open benchmarks?

Browse AI Arenas →