Pull human AI evaluation data into Python — for notebooks, scripts, and ML pipelines

pip install grandjury. Query 16,668+ blind human reviews of 15 AI models. Lazy ResultSets, pandas integration, multi-model filtering — built for Jupyter and data analysis.

You're a data scientist or ML engineer trying to figure out which AI model actually performs best for your use case. Most of the "evaluation" you can find online comes from automated benchmarks that agree with themselves.

The grandjury SDK gives you 16,668 blind human evaluations of 15 AI models as pandas DataFrames in 3 lines of code. Built for Jupyter — lazy ResultSets, auto-pagination, slicing, multi-model filtering. No more curling JSON or wrangling pagination tokens.

Who this is for

  • Data scientists doing exploratory analysis on AI model quality in Jupyter
  • ML engineers building eval pipelines that need a human-grounded baseline
  • Researchers studying disagreement patterns in AI evaluation
  • Anyone who'd rather call a Python function than hit a REST API by hand

Install

Terminal
pip install grandjury

Python 3.8+. Package on PyPI →

Quick start (3 lines)

Python
from grandjury import GJClient

gj = GJClient(token="gj_pat_...")

# Leaderboard — pass rates, flags, rankings as pandas
lb = gj.benchmarks.leaderboard("7a97ef08-3d89-42c5-a4db-f6b43b4700ca")
df_lb = lb.to_pandas()
print(df_lb)

Get your API token at Pulse Check → Your API token. Public leaderboards work without a token; vote-level data requires one. See Authentication.
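
If you only need a public leaderboard, you can skip the token entirely. A minimal sketch, assuming this arena's leaderboard is public:

Python
from grandjury import GJClient

# No token: public leaderboards only; vote-level calls require auth
gj_public = GJClient()

lb = gj_public.benchmarks.leaderboard("7a97ef08-3d89-42c5-a4db-f6b43b4700ca")
print(lb.to_pandas().head())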

Common workflows

List arena models

Python
models = gj.arena("7a97ef08-...").models()
df_models = models.to_pandas()
print(models)  # ModelList(15 models)

Fetch votes (with model filter)

Lazy by default. Auto-paginates. Filter by one or many model slugs.

Python
v = gj.arena("7a97ef08-...").votes(
    model=["xai-grok-4-1-fast", "xai-grok-4"]
)

print(v)  # ResultSet(not loaded)

df = v.to_pandas()  # Loads all pages
print(df.shape)     # (3166, 8)

# Or iterate lazily
for vote in v:
    if not vote["verdict"]:
        print(vote["feedback"])

Slicing without loading everything

Python
results = gj.arena("7a97ef08-...").votes()

# Nothing loaded yet
print(results)  # ResultSet(not loaded)

# Just the first 10 rows
df_preview = results.head(10).to_pandas()

# Now eagerly load all pages
results.fetch_all()
df_all = results.to_pandas()

Full Jupyter notebook example

A typical workflow: leaderboard overview → enrolled models → deep-dive on one model → flag pattern analysis.

Jupyter Notebook
from grandjury import GJClient
import pandas as pd

gj = GJClient(token="gj_pat_...")
arena_id = "7a97ef08-3d89-42c5-a4db-f6b43b4700ca"

# 1. Leaderboard overview
lb = gj.benchmarks.leaderboard(arena_id)
df_lb = lb.to_pandas()
print(df_lb[["model_name", "pass_rate", "total_votes", "flag_count"]])

# 2. Which models are enrolled?
models = gj.arena(arena_id).models()
print(models)

# 3. Deep dive — Grok votes
grok_votes = gj.arena(arena_id).votes(
    model=["xai-grok-4-1-fast", "xai-grok-4"]
)
df = grok_votes.to_pandas()

# 4. Flag pattern analysis
flags = df[df["verdict"] == False]
print(f"Flag rate: {len(flags)/len(df)*100:.1f}%")
print(flags["feedback"].value_counts().head(10))
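
As a follow-up, comparing flag rates across the two models is one more groupby away. A sketch against the same DataFrame:

Python
# 5. Flag rate per model, side by side
flag_rate = (df["verdict"] == False).groupby(df["model_slug"]).mean()
print((flag_rate * 100).round(1).sort_values(ascending=False))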

API reference

GJClient(token=None, base_url=None)

Initialize the client. Token is optional for public leaderboard access.
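
A common pattern is to keep the token out of the notebook body, for example in an environment variable. A sketch; the GJ_TOKEN name is our own convention, not something the SDK reads on its own:

Python
import os
from grandjury import GJClient

# GJ_TOKEN is our own env-var name, not read automatically by the SDK
gj = GJClient(token=os.environ.get("GJ_TOKEN"))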

gj.benchmarks.leaderboard(arena_id)

Returns ResultSet with model rankings, pass rates, flag counts.
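
For example, ranking models by pass rate using the columns shown in the notebook example above:

Python
lb = gj.benchmarks.leaderboard("7a97ef08-...")
df_lb = lb.to_pandas()

# Highest human pass rate first
top = df_lb.sort_values("pass_rate", ascending=False)
print(top[["model_name", "pass_rate", "flag_count"]].head())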

gj.arena(arena_id).models()

Returns ModelList with enrolled model metadata.
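
ModelList converts to pandas as well; the exact metadata columns depend on the arena, so inspect them rather than assuming:

Python
models = gj.arena("7a97ef08-...").models()
df_models = models.to_pandas()

# See which metadata columns this arena exposes
print(df_models.columns.tolist())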

gj.arena(arena_id).votes(model=None, from_date=None, to_date=None, limit=1000, offset=0)

Returns ResultSet with vote-level data. model accepts a string or list of slugs.
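
The date filters let you slice a review window. A sketch, assuming from_date/to_date take ISO-8601 date strings (check the API docs for the exact format):

Python
# One week of votes on a single model (ISO date format assumed)
v = gj.arena("7a97ef08-...").votes(
    model="xai-grok-4",   # a single slug works too
    from_date="2026-05-01",
    to_date="2026-05-07",
    limit=500,
)
print(v.to_pandas().shape)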

ResultSet.to_pandas()

Convert to DataFrame. Fetches all pages if not already loaded.

ResultSet.fetch_all()

Load all pages eagerly.

ResultSet.head(n)

Return first n rows as a new ResultSet.

FAQ

How do I install the HumanJudge Python SDK?

pip install grandjury. Requires Python 3.8+. The package is published on PyPI as 'grandjury'.

Do I need an API token?

Get a free token at humanjudge.com/spectator-hub → Your API token. Public leaderboards work without a token; vote-level data requires one.

Does the SDK work with pandas?

Yes. Every ResultSet has a .to_pandas() method. ModelList objects also support .to_pandas(). Designed for Jupyter and data analysis workflows.

Is the SDK lazy or eager?

Lazy by default. Calls like .votes() or .leaderboard() return a ResultSet that doesn't fetch anything until you iterate, slice, or call .to_pandas(). Use .fetch_all() to eagerly load everything.

How do I filter votes by model?

Pass model=['slug-1', 'slug-2'] to .votes(). The SDK handles pagination automatically across the filtered set.

What data is in a vote record?

Each vote record has 8 fields: trace_id, model_slug, model_name, verdict (pass/flag), feedback (reviewer text), flag_category (when flagged), evaluator_id, and timestamp.
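
For example, a quick breakdown of flag categories per model using those fields; a sketch against a votes DataFrame like the one in the workflow above:

Python
df = gj.arena("7a97ef08-...").votes().to_pandas()

# Flags only, grouped by model and flag category
flags = df[df["verdict"] == False]
print(
    flags.groupby(["model_slug", "flag_category"])
         .size()
         .sort_values(ascending=False)
         .head(10)
)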

Can I run this in a notebook?

Yes — the SDK is built for Jupyter. Lazy ResultSets, .head(n) for previews, .to_pandas() for DataFrames, and rich repr for clean notebook output.

How fresh is the data?

Real-time. Each call hits the live API. New reviews show up immediately.

Last updated: 2026-05-14 · SDK on PyPI · Data refreshes live from humanjudge.com