Who's checking if it's right?
We do.
HumanJudge brings holographic insights and continuous human redteaming to production AI through open evaluations.
New flag on SOTA-9.9
df.tail()
| inference_id | verdict | flag_category | created_at |
|---|---|---|---|
| i_2d6c4f | pass | — | 2026-05-13 09:14:22 |
| i_2d6c4f | flag | impractical | 2026-05-14 18:47:09 |
| i_2d6c4f | flag | harmful | — |
Democratize AI Evaluations
Stay Informed
See real concerns in AI outputs. Compare all latest models on your daily tasks.
Earn with Insights
Verified reviewers score AI outputs and share insights, participate in AI research, and build a public reputation.
Build Human-in-the-loop
Python SDK, Claude MCP, ChatGPT GPT. Or, create a custom benchmark for your needs.
Open AI Arenas
AI Marketing & Content Generation
Review AI-generated marketing content — social posts, cold emails, taglines, scripts — and judge: would it actually work?
日本文化のヒーロー | Japanese Culture Hero
How well can AI explain Japanese culture across anime, cinema, J-pop, J-drama, and traditions? Put yourself in the shoes of a Japanese culture expert and evaluate.
AI analyzed Emma Raducanu's career. It has some bold claims.
From US Open champion to 10 coaches in 5 years. AI crunched the data - now it needs your judgment.
AI says Djokovic is the GOAT. Are you buying it?
Tennis fans are shaping how AI learns about the sport. Your judgment on GOAT debates helps AI understand what makes a champion.
Does AI know AP Government?
Constitution, branches, policies — test AI on US government.
Does AI know AP Calculus AB?
Limits, derivatives, integrals — test AI on calculus.
Does AI actually know AP Biology?
Cells, genetics, ecology — test AI on biology concepts.
Does AI know AP English Literature?
Literary analysis, classics, poetry — test AI on AP Lit.
Does AI know AP English Language?
Rhetoric, analysis, composition — test AI on AP Lang.
Does AI actually know AP US History?
Colonial era to modern times — test AI on American history.
Does AI actually know animals?
Pets, wildlife, behavior — test what AI claims about animals.
Does AI know Arab cinema?
Egyptian golden age to modern films — test AI on Arab film.
Does AI actually know Arabic music?
Classic and modern Arab music — test AI on Arabic sounds.
Does AI actually know Arabic?
MSA, dialects, script — test AI on the Arabic language.
Does AI understand Egyptian culture?
Ancient and modern — test AI on Egyptian knowledge.
Does AI understand Levantine culture?
Lebanon, Syria, Jordan, Palestine — test AI on the Levant.
Does AI understand Gulf culture?
UAE, Saudi, Qatar, Kuwait — test AI on Gulf states.
Does AI actually know Latin music?
Reggaeton, salsa, cumbia, and more — test AI on Latin sounds.
Does AI actually know Spanish?
Grammar, dialects, nuance — test AI on the Spanish language.
Does AI understand Spanish culture?
Traditions, history, regional differences — test AI on Spain.
Does AI know Spanish cinema?
From Almodóvar to modern films — test AI on Spanish film.
Does AI actually know Spanish music?
Flamenco, pop, regional styles — see what AI gets right.
Does AI understand Mexican culture?
Traditions, history, food, music — test AI on Mexico.
Does AI know Mexican cinema?
Golden age to modern masterpieces — test AI on Mexican film.
Does AI understand Argentine culture?
Tango, gauchos, food, football — test AI on Argentina.
Does AI understand Brazilian culture?
Carnival, samba, food, football — test AI on Brazil.
Does AI actually know J-pop?
From classic artists to modern idols — see if AI understands Japanese pop music.
Does AI understand Taiwanese culture?
Food, traditions, modern life — test AI on Taiwan.
Does AI actually know C-dramas?
Historical, modern, and wuxia — see what AI gets right about Chinese dramas.
Does AI know Chinese cinema?
From Hong Kong action to mainland dramas — test AI on Chinese film.
Does AI actually know Chinese?
Characters, tones, dialects — see if AI gets Chinese language right.
Does AI actually know C-pop?
Mandopop, Cantopop, and more — test AI on Chinese pop music.
Does AI actually know J-dramas?
Classic and modern Japanese dramas — see what AI gets right.
Does AI actually know Japanese?
Kanji, grammar, keigo — see if AI gets Japanese language nuances right.
Does AI understand Japanese culture?
From traditions to modern life — test AI on Japanese cultural knowledge.
Does AI actually know anime?
From classics to new releases — test what AI claims about anime.
Does AI actually know K-dramas?
Plot twists, actors, iconic scenes — see what AI gets right about Korean dramas.
Does AI understand Chinese culture?
History, traditions, modern life — test AI on Chinese cultural knowledge.
Does AI know Japanese cinema?
From Kurosawa to modern anime films — test AI on Japanese film.
Does AI know Korean cinema?
From Parasite to classics like Oldboy — test AI on Korean film knowledge.
Does AI actually know Korean?
Grammar, vocabulary, nuance — see if AI gets Korean language right.
Does AI actually know K-pop?
BTS, BLACKPINK, NewJeans, and more — test what AI gets right and wrong about your favorite idols.
Does AI understand Korean culture?
Traditions, food, history, modern life — test AI on what it claims to know about Korea.
Don't see your area of expertise? Apply to lead an evaluation →
Latest Reports
AI in Healthcare — Stanford I4UI 2026
Cancer diagnoses. Suicidal teens. End-of-life decisions. Ten AI models. Real humans rating which ones can be trusted — judged on tone and responsibility, not medical correctness.
Essay: Your Eval Pipeline Has Zero Disagreement
If your automated eval never flags anything, it's not measuring quality — it's confirming your assumptions. 16,668 human evaluations show what LLM-as-judge misses.
Common questions
Is HumanJudge free?
Yes, for most use cases. Browsing 16,668+ human evaluations, using the Python SDK (pip install grandjury), the Claude Desktop MCP, and the ChatGPT GPT are all free. Custom arenas (where you pay reviewers to evaluate your AI on your topics) carry per-evaluation costs.
How do I use HumanJudge?
Three paths depending on role: (1) browse public evaluations at humanjudge.com/ai-reviews, (2) install the Python SDK to query data programmatically (pip install grandjury), or (3) sign up at humanjudge.com/for-developers to register your AI and get human reviews.
LLM-as-judge vs human evaluation — which is better?
LLM-as-judge wins on cost and speed; human evaluation wins on accuracy for subjective dimensions (tone, cultural fit, "feel") and on catching errors that the judge model itself would also make. Mixed pipelines work best: LLM for scale, humans for the questions where being "right by the model's standards" isn't enough.
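As a rough illustration of a mixed pipeline, the sketch below routes each evaluation item either to an LLM judge or to a human reviewer. It is not part of the HumanJudge SDK: the `EvalItem` type, the `route` function, the `judge_confidence` field, and the 0.8 threshold are all hypothetical names and values chosen for the example. Only the subjective dimensions (tone, cultural fit, "feel") come from the answer above.

```python
from dataclasses import dataclass

# Subjective dimensions named in the answer above.
# Treating them as a fixed set is an assumption of this sketch.
SUBJECTIVE_DIMENSIONS = {"tone", "cultural_fit", "feel"}


@dataclass
class EvalItem:
    """Hypothetical record for one thing to be judged."""
    output_id: str
    dimension: str           # e.g. "tone" or "factuality"
    judge_confidence: float  # assumed 0..1 confidence reported by an LLM judge


def route(item: EvalItem, min_confidence: float = 0.8) -> str:
    """Return "human" or "llm" for a single evaluation item."""
    # Subjective dimensions always go to a person.
    if item.dimension in SUBJECTIVE_DIMENSIONS:
        return "human"
    # A judge that is unsure of itself is escalated as well.
    if item.judge_confidence < min_confidence:
        return "human"
    return "llm"


items = [
    EvalItem("o1", "factuality", 0.95),
    EvalItem("o2", "tone", 0.99),
    EvalItem("o3", "factuality", 0.40),
]
routes = {item.output_id: route(item) for item in items}
print(routes)  # {'o1': 'llm', 'o2': 'human', 'o3': 'human'}
```

The threshold is a tuning knob, not a recommendation: lowering it sends more items to the LLM judge, raising it sends more to humans.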
Is Grok 4 good for marketing?
Mixed. HumanJudge data shows Grok 4 scored 67% on Instagram marketing tasks across 32 reviewers. For the 33% of outputs that were flagged, reviewers most often cited generic tone. It is strong at hook generation and weaker at emotionally specific copy.
Can I test my own AI on HumanJudge?
Yes — that's Builder mode. Register your model, pick the topics that matter (or create a custom arena), and pay real human reviewers to evaluate it. You see exactly where your AI fails compared to competitors. Sign up at humanjudge.com/for-developers.
How do I add HumanJudge to my AI product?
Two options: (1) for HITL on your own model, add the Python SDK or JavaScript snippet so outputs get sampled and reviewed live; (2) for offline evaluation, query the public corpus via SDK / MCP / GPT for benchmarking insights. Free tier covers most starter use cases.
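To make "outputs get sampled" concrete, here is one common way to pick a stable fraction of outputs for human review. It is a generic sketch, not the HumanJudge SDK or snippet: `should_review`, `output_id`, and `sample_rate` are hypothetical names, and hashing the ID is just one possible sampling strategy.

```python
import hashlib


def should_review(output_id: str, sample_rate: float = 0.1) -> bool:
    """Decide deterministically whether an output is sent for human review.

    The ID is hashed to a number in [0, 1), so the same output always gets
    the same decision and roughly `sample_rate` of all outputs are selected.
    """
    digest = hashlib.sha256(output_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


# Hypothetical output IDs; in a real product these would come from your own logs.
output_ids = [f"out_{n}" for n in range(1000)]
selected = [oid for oid in output_ids if should_review(oid, sample_rate=0.1)]
print(f"{len(selected)} of {len(output_ids)} outputs selected for review")
```

Hashing rather than calling a random number generator keeps the decision reproducible across retries and processes, which makes it easier to audit which outputs were reviewed.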
What does `pip install grandjury` do?
Installs the HumanJudge Python SDK. After installation, you can query model scores, compare AI models, fetch flag patterns, check content against the evaluation corpus, and get the latest reviews — all from Python or Jupyter. Requires a free API token.