Who's checking if it's right?

We do.
HumanJudge brings holographic insights and continuous human redteaming to production AI through open evaluations.

📄 Research Paper (Patent 63/825,484) →

SOTA-9.9 Online
HumanJudge now

New flag on SOTA-9.9

In [1]:
df = gj.results(model='sota-9-9')
df.tail()
Out[1]:
inference_id  verdict  flag_category  created_at
i_2d6c4f      pass                    2026-05-13 09:14:22
i_2d6c4f      flag     impractical    2026-05-14 18:47:09
i_2d6c4f      flag     harmful
0 Evaluations
58 LLMs and Agents
44 Domains

Democratize AI Evaluations

Users

Stay Informed

See real concerns in AI outputs. Compare all latest models on your daily tasks.

Trip Planning AIs:
TripGPT
Travel-LLM
Voyage AI
Travel-LLM by Roam
✓ Verified 2 days ago
📊 127 evaluations
💪 Local tips, Hidden gems
⚠️ Budget planning
See real-time judgments →
Stakeholders

Earn with Insights

Verified reviewers score AI outputs and share insights, participate in AI research, and build a public reputation.

Create Your Challenge: "K-Pop Facts" ✨
Judge AI Responses:
Is this K-Pop fact correct?
Build evaluator profile →
Developers

Build Human-in-the-loop

Use the Python SDK, Claude MCP, or ChatGPT GPT, or create a custom benchmark for your needs.

HumanJudge Verified
92% Positive
127 evaluators
Monitored by HumanJudge
Last eval: Jan 23, 2026
Integrate with your workflow →

Open AI Arenas

AI Marketing & Content Generation

Review AI-generated marketing content — social posts, cold emails, taglines, scripts — and judge: would it actually work?

Feb 27, 2026
Culture & Language

日本文化のヒーロー | Japanese Culture Hero

How well can AI explain Japanese culture across anime, cinema, J-pop, J-drama, and traditions? Put yourself in the shoes of a Japanese culture expert and evaluate.

Feb 26, 2026
Sports & Entertainment

AI analyzed Emma Raducanu's career. It has some bold claims.

From US Open champion to 10 coaches in 5 years. AI crunched the data - now it needs your judgment.

Feb 2, 2026
Sports & Entertainment

AI says Djokovic is the GOAT. Are you buying it?

Tennis fans are shaping how AI learns about the sport. Your judgment on GOAT debates helps AI understand what makes a champion.

Feb 1, 2026
AP Courses

Does AI know AP Government?

Constitution, branches, policies — test AI on US government.

Jan 30, 2026
AP Courses

Does AI know AP Calculus AB?

Limits, derivatives, integrals — test AI on calculus.

Jan 30, 2026
AP Courses

Does AI actually know AP Biology?

Cells, genetics, ecology — test AI on biology concepts.

Jan 30, 2026
AP Courses

Does AI know AP English Literature?

Literary analysis, classics, poetry — test AI on AP Lit.

Jan 30, 2026
AP Courses

Does AI know AP English Language?

Rhetoric, analysis, composition — test AI on AP Lang.

Jan 30, 2026
AP Courses

Does AI actually know AP US History?

Colonial era to modern times — test AI on American history.

Jan 30, 2026

Does AI actually know animals?

Pets, wildlife, behavior — test what AI claims about animals.

Jan 30, 2026
Arabic & Middle Eastern

Does AI know Arab cinema?

Egyptian golden age to modern films — test AI on Arab film.

Jan 30, 2026
Arabic & Middle Eastern

Does AI actually know Arabic music?

Classic and modern Arab music — test AI on Arabic sounds.

Jan 30, 2026
Arabic & Middle Eastern

Does AI actually know Arabic?

MSA, dialects, script — test AI on the Arabic language.

Jan 30, 2026
Arabic & Middle Eastern

Does AI understand Egyptian culture?

Ancient and modern — test AI on Egyptian knowledge.

Jan 30, 2026
Arabic & Middle Eastern

Does AI understand Levantine culture?

Lebanon, Syria, Jordan, Palestine — test AI on the Levant.

Jan 30, 2026
Arabic & Middle Eastern

Does AI understand Gulf culture?

UAE, Saudi, Qatar, Kuwait — test AI on Gulf states.

Jan 30, 2026
Spanish & Latin American

Does AI actually know Latin music?

Reggaeton, salsa, cumbia, and more — test AI on Latin sounds.

Jan 30, 2026
Spanish & Latin American

Does AI actually know Spanish?

Grammar, dialects, nuance — test AI on the Spanish language.

Jan 30, 2026
Spanish & Latin American

Does AI understand Spanish culture?

Traditions, history, regional differences — test AI on Spain.

Jan 30, 2026
Spanish & Latin American

Does AI know Spanish cinema?

From Almodóvar to modern films — test AI on Spanish film.

Jan 30, 2026
Spanish & Latin American

Does AI actually know Spanish music?

Flamenco, pop, regional styles — see what AI gets right.

Jan 30, 2026
Spanish & Latin American

Does AI understand Mexican culture?

Traditions, history, food, music — test AI on Mexico.

Jan 30, 2026
Spanish & Latin American

Does AI know Mexican cinema?

Golden age to modern masterpieces — test AI on Mexican film.

Jan 30, 2026
Spanish & Latin American

Does AI understand Argentine culture?

Tango, gauchos, food, football — test AI on Argentina.

Jan 30, 2026
Spanish & Latin American

Does AI understand Brazilian culture?

Carnival, samba, food, football — test AI on Brazil.

Jan 30, 2026
East Asian Culture

Does AI actually know J-pop?

From classic artists to modern idols — see if AI understands Japanese pop music.

Jan 30, 2026
East Asian Culture

Does AI understand Taiwanese culture?

Food, traditions, modern life — test AI on Taiwan.

Jan 30, 2026
East Asian Culture

Does AI actually know C-dramas?

Historical, modern, and wuxia — see what AI gets right about Chinese dramas.

Jan 30, 2026
East Asian Culture

Does AI know Chinese cinema?

From Hong Kong action to mainland dramas — test AI on Chinese film.

Jan 30, 2026
East Asian Culture

Does AI actually know Chinese?

Characters, tones, dialects — see if AI gets Chinese language right.

Jan 30, 2026
East Asian Culture

Does AI actually know C-pop?

Mandopop, Cantopop, and more — test AI on Chinese pop music.

Jan 30, 2026
East Asian Culture

Does AI actually know J-dramas?

Classic and modern Japanese dramas — see what AI gets right.

Jan 30, 2026
East Asian Culture

Does AI actually know Japanese?

Kanji, grammar, keigo — see if AI gets Japanese language nuances right.

Jan 30, 2026
East Asian Culture

Does AI understand Japanese culture?

From traditions to modern life — test AI on Japanese cultural knowledge.

Jan 30, 2026
East Asian Culture

Does AI actually know anime?

From classics to new releases — test what AI claims about anime.

Jan 30, 2026
East Asian Culture

Does AI actually know K-dramas?

Plot twists, actors, iconic scenes — see what AI gets right about Korean dramas.

Jan 30, 2026
East Asian Culture

Does AI understand Chinese culture?

History, traditions, modern life — test AI on Chinese cultural knowledge.

Jan 30, 2026
East Asian Culture

Does AI know Japanese cinema?

From Kurosawa to modern anime films — test AI on Japanese film.

Jan 30, 2026
East Asian Culture

Does AI know Korean cinema?

From Parasite to classics like Oldboy — test AI on Korean film knowledge.

Jan 30, 2026
East Asian Culture

Does AI actually know Korean?

Grammar, vocabulary, nuance — see if AI gets Korean language right.

Jan 30, 2026
East Asian Culture

Does AI actually know K-pop?

BTS, BLACKPINK, NewJeans, and more — test what AI gets right and wrong about your favorite idols.

Jan 30, 2026
East Asian Culture

Does AI understand Korean culture?

Traditions, food, history, modern life — test AI on what it claims to know about Korea.

Jan 30, 2026

Don't see your area of expertise? Apply to lead an evaluation →

Common questions

Is HumanJudge free?

Yes, for most use cases. Browsing 16,668+ human evaluations, using the Python SDK (pip install grandjury), the Claude Desktop MCP, and the ChatGPT GPT are all free. Custom arenas (where you pay reviewers to evaluate your AI on your topics) carry per-evaluation costs.

How do I use HumanJudge?

Three paths depending on role: (1) browse public evaluations at humanjudge.com/ai-reviews, (2) install the Python SDK to query data programmatically (pip install grandjury), or (3) sign up at humanjudge.com/for-developers to register your AI and get human reviews.

LLM-as-judge vs human evaluation — which is better?

LLM-as-judge wins on cost and speed; human evaluation wins on accuracy for subjective dimensions (tone, cultural fit, "feel") and on catching errors that the judge model would make itself. Mixed pipelines work best: LLMs for scale, humans for the questions where being "right by the model's standards" isn't enough.
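One way to picture a mixed pipeline is a small routing rule that decides which evaluations go to people. The sketch below is illustrative only: the `EvalTask` fields, the `SUBJECTIVE` set, and the confidence threshold are assumptions made for this example, not part of any HumanJudge API.

```python
from dataclasses import dataclass

# Dimensions treated as subjective here; the exact set is an assumption.
SUBJECTIVE = {"tone", "cultural_fit", "feel"}


@dataclass
class EvalTask:
    output_id: str
    dimension: str           # what is being judged, e.g. "tone" or "factuality"
    judge_confidence: float  # hypothetical 0..1 confidence from an LLM judge


def route(task: EvalTask, min_confidence: float = 0.8) -> str:
    """Return "human" for subjective or low-confidence tasks, else "llm"."""
    if task.dimension in SUBJECTIVE or task.judge_confidence < min_confidence:
        return "human"
    return "llm"


tasks = [
    EvalTask("out-1", "tone", 0.99),
    EvalTask("out-2", "factuality", 0.95),
    EvalTask("out-3", "factuality", 0.50),
]
for task in tasks:
    print(task.output_id, "->", route(task))
```

Under these assumptions the LLM judge handles only the clear-cut, objective cases, and everything subjective or uncertain is escalated to human reviewers.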

Is Grok 4 good for marketing?

Mixed. HumanJudge data shows Grok 4 scored 67% on Instagram marketing tasks across 32 reviewers. For the 33% of outputs that were flagged, reviewers commonly cited generic tone as the issue. It is strong at hook generation and weaker at emotionally specific copy.

Can I test my own AI on HumanJudge?

Yes — that's Builder mode. Register your model, pick the topics that matter (or create a custom arena), and pay real human reviewers to evaluate it. You see exactly where your AI fails compared to competitors. Sign up at humanjudge.com/for-developers.

How do I add HumanJudge to my AI product?

Two options: (1) for human-in-the-loop (HITL) review of your own model, add the Python SDK or JavaScript snippet so outputs get sampled and reviewed live; (2) for offline evaluation, query the public corpus via SDK / MCP / GPT for benchmarking insights. The free tier covers most starter use cases.
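To make "outputs get sampled" concrete, here is a minimal sketch of one common sampling approach: hashing each output ID into a stable bucket so the same output always gets the same decision. It does not show how the HumanJudge SDK samples; `should_review` and `sample_rate` are hypothetical names chosen for this example.

```python
import hashlib


def should_review(output_id: str, sample_rate: float = 0.1) -> bool:
    """Deterministically select roughly `sample_rate` of outputs for review."""
    digest = hashlib.sha256(output_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


# Count how many of 10,000 hypothetical output IDs would be sent to reviewers.
selected = sum(should_review(f"output-{i}") for i in range(10_000))
print(f"{selected} of 10000 outputs selected for human review")
```

Hash-based sampling needs no stored state, and retrying the same output never changes whether it is reviewed.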

What does `pip install grandjury` do?

Installs the HumanJudge Python SDK. After installation, you can query model scores, compare AI models, fetch flag patterns, check content against the evaluation corpus, and get the latest reviews — all from Python or Jupyter. Requires a free API token.
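As a rough idea of what working with flag patterns might look like once results are in hand, the sketch below summarizes rows shaped like the notebook preview near the top of this page (`inference_id`, `verdict`, `flag_category`). It does not call the SDK: the rows are hypothetical placeholders, and `flag_summary` is a helper invented for this example.

```python
from collections import Counter

# Hypothetical rows shaped like the results preview; not real HumanJudge data.
rows = [
    {"inference_id": "i_001", "verdict": "pass", "flag_category": None},
    {"inference_id": "i_002", "verdict": "flag", "flag_category": "impractical"},
    {"inference_id": "i_003", "verdict": "flag", "flag_category": "harmful"},
    {"inference_id": "i_004", "verdict": "pass", "flag_category": None},
]


def flag_summary(rows):
    """Return (pass_rate, Counter of flag categories) for verdict rows."""
    if not rows:
        return 0.0, Counter()
    passes = sum(1 for r in rows if r["verdict"] == "pass")
    categories = Counter(
        r["flag_category"] for r in rows if r["verdict"] == "flag"
    )
    return passes / len(rows), categories


pass_rate, categories = flag_summary(rows)
print(f"pass rate: {pass_rate:.0%}")
print("flag categories:", categories.most_common())
```

The same summary could be computed on a DataFrame; plain dictionaries are used here only to keep the example dependency-free.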

Our Mission

Defining the standard for human-AI trust.