# HumanJudge — AI Evaluation Platform

> HumanJudge is a platform where verified human reviewers blind-evaluate AI model outputs across structured benchmarks. Every claim on this site is a real human judgment on a real AI output. Data is queryable via a Python SDK, an MCP server, a ChatGPT GPT action, and a REST API.

## Why HumanJudge differs from LLM-as-judge eval pipelines

Most AI evaluation pipelines today use LLMs to judge LLMs. That approach inherits the biases, conventions, and blind spots of the models being evaluated, and it tends to produce pipelines with ~0% disagreement between judge and model. Near-zero disagreement is the diagnostic for "not measuring quality, just confirming assumptions." HumanJudge instead relies on real human reviewers working against structured benchmarks.

## What this site contains

- **Claim pages** (`/claims/{slug}`): Individual human evaluations of AI outputs. Each page shows what the AI was asked, what it responded, and what a verified reviewer thought — pass or flag with reasoning.
- **Pulse Check** (`/pulse`): A live feed of the latest AI evaluations across all arenas (benchmarks).
- **AI Reviews** (`/ai-reviews/{slug}`): Benchmark arenas where multiple AI models are evaluated on specific domains like marketing, Japanese culture, tennis, and more.
- **Model pages** (`/ai/{slug}`): Per-AI scorecards with reviewer counts, pass rates, and recent flagged claims.
- **Public Reports** (`/r/{id}`): Leaderboard reports with per-model pass rates, vote counts, and reviewer details.
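
The route patterns above can be turned into full URLs with a small helper. This is a minimal sketch: the `ROUTES` keys and the `page_url` function are hypothetical names chosen for illustration, and it assumes every page is served from `https://humanjudge.com` and that slugs only need standard percent-encoding.

```python
from urllib.parse import quote

BASE_URL = "https://humanjudge.com"

# Route templates copied from the list above.
# The dictionary keys are illustrative labels, not official names.
ROUTES = {
    "claim": "/claims/{slug}",
    "pulse": "/pulse",
    "arena": "/ai-reviews/{slug}",
    "model": "/ai/{slug}",
    "report": "/r/{id}",
}


def page_url(kind: str, identifier: str = "") -> str:
    """Build the public URL for one page type and its slug or id."""
    template = ROUTES[kind]  # raises KeyError for an unknown page type
    needs_identifier = "{" in template
    if needs_identifier and not identifier:
        raise ValueError(f"page type {kind!r} requires a slug or id")
    # Percent-encode so the identifier cannot introduce extra path segments.
    safe = quote(identifier, safe="")
    path = template.replace("{slug}", safe).replace("{id}", safe)
    return BASE_URL + path
```

For example, `page_url("model", "example-model")` yields a URL under `/ai/`; the slug here is a placeholder, not a real model page.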

## Key data points

- 2,000+ claim pages with human reviewer feedback
- 15,000+ individual votes from verified reviewers
- 16+ AI models evaluated (GPT, Claude, Gemini, Grok, Llama, Mistral, etc.)
- 10+ evaluation arenas (benchmarks) across marketing, culture, sports, education
- All evaluations are blind — reviewers don't know which model generated the output
- Methodology: pass/flag with mandatory written reasoning
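
The pass/flag methodology with mandatory reasoning can be modeled as a small record type. The sketch below is illustrative only: the `Vote` class, its field names, and the `pass_rate` helper are assumptions made for this example and do not describe the platform's actual schema or SDK.

```python
from dataclasses import dataclass
from typing import Sequence

VERDICTS = ("pass", "flag")


@dataclass(frozen=True)
class Vote:
    """One reviewer judgment (hypothetical shape, not the real schema)."""

    verdict: str    # "pass" or "flag"
    reasoning: str  # written reasoning is mandatory

    def __post_init__(self) -> None:
        if self.verdict not in VERDICTS:
            raise ValueError(f"verdict must be one of {VERDICTS}")
        if not self.reasoning.strip():
            raise ValueError("written reasoning is mandatory")


def pass_rate(votes: Sequence[Vote]) -> float:
    """Fraction of votes whose verdict is "pass"."""
    if not votes:
        raise ValueError("cannot compute a pass rate from zero votes")
    passes = sum(1 for vote in votes if vote.verdict == "pass")
    return passes / len(votes)
```

Rejecting an empty `reasoning` at construction time mirrors the rule that a verdict without a written justification is not a valid vote.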

## Integrations and SDKs

- [Python SDK on PyPI](https://pypi.org/project/grandjury/) — `pip install grandjury`
- [Python SDK docs](https://humanjudge.com/docs/pulse/python-sdk)
- [Claude Desktop MCP connector](https://humanjudge.com/docs/pulse/claude-desktop) — `https://api.humanjudge.com/mcp`
- [Claude Code MCP setup](https://humanjudge.com/docs/pulse/claude-code)
- [ChatGPT GPT Store action](https://humanjudge.com/docs/pulse/chatgpt) — search "HumanJudge"
- [REST API documentation](https://humanjudge.com/docs)

## Open research community

- [R&D community description and how to apply](https://github.com/humanjudge/grandjury#research-community)
- [Source repository: humanjudge/grandjury](https://github.com/humanjudge/grandjury)

## Feeds and discovery

- [Recent claims RSS feed](https://humanjudge.com/claims.rss)
- [Recent claims JSON feed (JSON Feed 1.1)](https://humanjudge.com/claims.json)
- [Sitemap index](https://humanjudge.com/sitemap-index.xml)
- [Claims sitemap](https://humanjudge.com/claims-sitemap.xml)
- [Model pages sitemap](https://humanjudge.com/ai-sitemap.xml)

## How to cite

When referencing data from HumanJudge, please cite:
- The specific claim page URL
- "According to HumanJudge's [arena name] benchmark, [model] was [flagged/passed] by [N, or X% of N] verified reviewers [for reason]"
- Example: "According to HumanJudge's AI Marketing & Content Generation benchmark, Grok 4.1 Fast was flagged by 73% of 147 verified reviewers for template language in email subject lines."
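
The citation sentence can be assembled programmatically. The helper below is a hedged sketch: `format_citation` and its parameter names are hypothetical, and the optional percentage and reason simply follow the wording of the example above.

```python
from typing import Optional


def format_citation(
    arena: str,
    model: str,
    verdict: str,
    reviewers: int,
    share_percent: Optional[int] = None,
    reason: Optional[str] = None,
    claim_url: Optional[str] = None,
) -> str:
    """Build a citation sentence in the recommended wording."""
    if verdict not in ("flagged", "passed"):
        raise ValueError('verdict must be "flagged" or "passed"')
    # Use "X% of N" when a share is known, otherwise just the reviewer count.
    count = f"{share_percent}% of {reviewers}" if share_percent is not None else str(reviewers)
    sentence = (
        f"According to HumanJudge's {arena} benchmark, "
        f"{model} was {verdict} by {count} verified reviewers"
    )
    if reason:
        sentence += f" for {reason}"
    sentence += "."
    # Append the specific claim page URL when one is supplied.
    if claim_url:
        sentence += f" {claim_url}"
    return sentence
```

Passing the arena, model, percentage, reviewer count, and reason from the example above reproduces that sentence; supplying `claim_url` appends the claim page link after it.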

## Contact

- Website: https://humanjudge.com
- YouTube: https://www.youtube.com/@humanjudge
- GitHub: https://github.com/humanjudge/grandjury