Pull human AI evaluation data into Python — for notebooks, scripts, and ML pipelines

pip install grandjury. Query 16,668+ blind human reviews of 15 AI models. Lazy ResultSets, pandas integration, multi-model filtering — built for Jupyter and data analysis.

You're a data scientist or ML engineer trying to figure out which AI model actually performs best for your use case. Most of the "evaluation" you can find online comes from automated benchmarks that agree with themselves.

The grandjury SDK gives you 16,668 blind human evaluations of 15 AI models as pandas DataFrames in 3 lines of code. Built for Jupyter — lazy ResultSets, auto-pagination, slicing, multi-model filtering. No more curling JSON or wrangling pagination tokens.

Who this is for

  • Data scientists doing exploratory analysis on AI model quality in Jupyter
  • ML engineers building eval pipelines that need a human-grounded baseline
  • Researchers studying disagreement patterns in AI evaluation
  • Anyone who'd rather call a Python function than hit a REST API by hand

Install

Terminal
pip install grandjury

Python 3.8+. Package on PyPI →

Quick start (3 lines)

Python
from grandjury import GJClient

gj = GJClient(token="gj_pat_...")

# Leaderboard — pass rates, flags, rankings as pandas
lb = gj.benchmarks.leaderboard("7a97ef08-3d89-42c5-a4db-f6b43b4700ca")
df_lb = lb.to_pandas()
print(df_lb)

Get your API token at Pulse Check → Your API token. Public leaderboards work without a token; vote-level data requires one. See Authentication.
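
If you only need a public leaderboard, you can skip the token entirely. A minimal sketch, assuming this arena's leaderboard is public:

Python
from grandjury import GJClient

# No token: public leaderboards only; vote-level calls require auth
gj_public = GJClient()

lb = gj_public.benchmarks.leaderboard("7a97ef08-3d89-42c5-a4db-f6b43b4700ca")
print(lb.to_pandas().head())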

Common workflows

List arena models

Python
models = gj.arena("7a97ef08-...").models()
df_models = models.to_pandas()
print(models)  # ModelList(15 models)

Fetch votes (with model filter)

Lazy by default. Auto-paginates. Filter by one or many model slugs.

Python
v = gj.arena("7a97ef08-...").votes(
    model=["xai-grok-4-1-fast", "xai-grok-4"]
)

print(v)  # ResultSet(not loaded)

df = v.to_pandas()  # Loads all pages
print(df.shape)     # (3166, 8)

# Or iterate lazily
for vote in v:
    if not vote["verdict"]:
        print(vote["feedback"])

Slicing without loading everything

Python
results = gj.arena("7a97ef08-...").votes()

# Nothing loaded yet
print(results)  # ResultSet(not loaded)

# Just the first 10 rows
df_preview = results.head(10).to_pandas()

# Now eagerly load all pages
results.fetch_all()
df_all = results.to_pandas()

Full Jupyter notebook example

A typical workflow: leaderboard overview → enrolled models → deep-dive on one model → flag pattern analysis.

Jupyter Notebook
from grandjury import GJClient
import pandas as pd

gj = GJClient(token="gj_pat_...")
arena_id = "7a97ef08-3d89-42c5-a4db-f6b43b4700ca"

# 1. Leaderboard overview
lb = gj.benchmarks.leaderboard(arena_id)
df_lb = lb.to_pandas()
print(df_lb[["model_name", "pass_rate", "total_votes", "flag_count"]])

# 2. Which models are enrolled?
models = gj.arena(arena_id).models()
print(models)

# 3. Deep dive — Grok votes
grok_votes = gj.arena(arena_id).votes(
    model=["xai-grok-4-1-fast", "xai-grok-4"]
)
df = grok_votes.to_pandas()

# 4. Flag pattern analysis
flags = df[df["verdict"] == False]
print(f"Flag rate: {len(flags)/len(df)*100:.1f}%")
print(flags["feedback"].value_counts().head(10))
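
As a follow-up, comparing flag rates across the two models is one more groupby away. A sketch against the same DataFrame:

Python
# 5. Flag rate per model, side by side
flag_rate = (df["verdict"] == False).groupby(df["model_slug"]).mean()
print((flag_rate * 100).round(1).sort_values(ascending=False))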

API reference

GJClient(token=None, base_url=None)

Initialize the client. Token is optional for public leaderboard access.
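
A common pattern is to keep the token out of the notebook body, for example in an environment variable. A sketch; the GJ_TOKEN name is our own convention, not something the SDK reads on its own:

Python
import os
from grandjury import GJClient

# GJ_TOKEN is our own env-var name, not read automatically by the SDK
gj = GJClient(token=os.environ.get("GJ_TOKEN"))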

gj.benchmarks.leaderboard(arena_id)

Returns ResultSet with model rankings, pass rates, flag counts.
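
For example, ranking models by pass rate using the columns shown in the notebook example above:

Python
lb = gj.benchmarks.leaderboard("7a97ef08-...")
df_lb = lb.to_pandas()

# Highest human pass rate first
top = df_lb.sort_values("pass_rate", ascending=False)
print(top[["model_name", "pass_rate", "flag_count"]].head())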

gj.arena(arena_id).models()

Returns ModelList with enrolled model metadata.
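
ModelList converts to pandas as well; the exact metadata columns depend on the arena, so inspect them rather than assuming:

Python
models = gj.arena("7a97ef08-...").models()
df_models = models.to_pandas()

# See which metadata columns this arena exposes
print(df_models.columns.tolist())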

gj.arena(arena_id).votes(model=None, from_date=None, to_date=None, limit=1000, offset=0)

Returns ResultSet with vote-level data. model accepts a string or list of slugs.
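
The date filters let you slice a review window. A sketch, assuming from_date/to_date take ISO-8601 date strings (check the API docs for the exact format):

Python
# One week of votes on a single model (ISO date format assumed)
v = gj.arena("7a97ef08-...").votes(
    model="xai-grok-4",   # a single slug works too
    from_date="2026-05-01",
    to_date="2026-05-07",
    limit=500,
)
print(v.to_pandas().shape)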

ResultSet.to_pandas()

Convert to DataFrame. Fetches all pages if not already loaded.

ResultSet.fetch_all()

Load all pages eagerly.

ResultSet.head(n)

Return first n rows as a new ResultSet.

FAQ

How do I install the HumanJudge Python SDK?

pip install grandjury. Requires Python 3.8+. The package is published on PyPI as 'grandjury'.

Do I need an API token?

Get a free token at humanjudge.com/spectator-hub → Your API token. Public leaderboards work without a token; vote-level data requires one.

Does the SDK work with pandas?

Yes. Every ResultSet has a .to_pandas() method. ModelList objects also support .to_pandas(). Designed for Jupyter and data analysis workflows.

Is the SDK lazy or eager?

Lazy by default. Calls like .votes() or .leaderboard() return a ResultSet that doesn't fetch anything until you iterate, slice, or call .to_pandas(). Use .fetch_all() to eagerly load everything.

How do I filter votes by model?

Pass model=['slug-1', 'slug-2'] to .votes(). The SDK handles pagination automatically across the filtered set.

What data is in a vote record?

Each vote record has 8 fields: trace_id, model_slug, model_name, verdict (pass/flag), feedback (reviewer text), flag_category (when flagged), evaluator_id, and timestamp.
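
For example, a quick breakdown of flag categories per model using those fields; a sketch against a votes DataFrame like the one in the workflow above:

Python
df = gj.arena("7a97ef08-...").votes().to_pandas()

# Flags only, grouped by model and flag category
flags = df[df["verdict"] == False]
print(
    flags.groupby(["model_slug", "flag_category"])
         .size()
         .sort_values(ascending=False)
         .head(10)
)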

Can I run this in a notebook?

Yes — the SDK is built for Jupyter. Lazy ResultSets, .head(n) for previews, .to_pandas() for DataFrames, and rich repr for clean notebook output.

How fresh is the data?

Real-time. Each call hits the live API. New reviews show up immediately.

Last updated: 2026-05-14 · SDK on PyPI · Data refreshes live from humanjudge.com