Scoring Algorithm

How HumanJudge calculates time-weighted quality scores and freshness metrics for your AI evaluation reports.

Time-Weighted Quality Score

The quality score represents a time-weighted average of human evaluations. Recent votes have more influence than older ones, ensuring the score reflects the current state of your AI's performance.

The Formula

// Decay factor (how much old scores are weighted)
α = e^(-λ × Δt)
where λ = 0.01, Δt = seconds since last evaluation
// Updated score
new_score = α × prev_score + (1 - α) × mean_vote
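
The formula above translates almost directly into code. The following is a minimal Python sketch, not the grandjury package's API: the names decay_factor and update_score are hypothetical, and the default λ = 0.01 is taken from the definition above.

```python
import math

DECAY_LAMBDA = 0.01  # λ from the formula above; Δt is measured in seconds


def decay_factor(delta_t: float, decay_lambda: float = DECAY_LAMBDA) -> float:
    """Return α = e^(-λ × Δt), the weight kept by the previous score."""
    if delta_t < 0:
        raise ValueError("delta_t must be non-negative")
    return math.exp(-decay_lambda * delta_t)


def update_score(prev_score: float, mean_vote: float, delta_t: float) -> float:
    """Blend the previous score with the average of the new votes."""
    alpha = decay_factor(delta_t)
    return alpha * prev_score + (1.0 - alpha) * mean_vote
```

With delta_t = 0 the previous score is returned unchanged, and with a very large delta_t the result approaches mean_vote.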

How It Works

1. Calculate Time Decay

The decay factor α ranges from 0 to 1. When Δt is small (the previous evaluation was recent), α is close to 1, so the previous score keeps most of its weight. As Δt grows, α falls toward 0 and the new votes dominate.

2. Weight New Votes

Each evaluator's vote is weighted by their reputation score. The mean_vote is the reputation-weighted average of all new votes (Pass = 1.0, Flag = 0.0).

3. Blend Old and New

The new score blends the decayed previous score with the new vote average. This ensures smooth transitions while giving recent evaluations appropriate influence.
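
The three steps above can be sketched end to end. This is an illustrative Python example under stated assumptions: Vote, weighted_mean_vote, and apply_votes are hypothetical names (not the grandjury API), and the reputation-weighted average is assumed to be sum(reputation × value) / sum(reputation).

```python
import math
from dataclasses import dataclass
from typing import Sequence

DECAY_LAMBDA = 0.01
PASS, FLAG = 1.0, 0.0


@dataclass(frozen=True)
class Vote:
    value: float       # PASS (1.0) or FLAG (0.0)
    reputation: float  # evaluator's reputation score, assumed positive


def weighted_mean_vote(votes: Sequence[Vote]) -> float:
    """Step 2: reputation-weighted average of the new votes (assumed form)."""
    total = sum(v.reputation for v in votes)
    if total <= 0:
        raise ValueError("need at least one vote with positive reputation")
    return sum(v.reputation * v.value for v in votes) / total


def apply_votes(prev_score: float, votes: Sequence[Vote], delta_t: float) -> float:
    alpha = math.exp(-DECAY_LAMBDA * delta_t)                # step 1: time decay
    mean_vote = weighted_mean_vote(votes)                    # step 2: weight new votes
    return alpha * prev_score + (1.0 - alpha) * mean_vote    # step 3: blend


# A Pass from a reputation-3 evaluator and a Flag from a reputation-1 evaluator
votes = [Vote(PASS, 3.0), Vote(FLAG, 1.0)]
print(weighted_mean_vote(votes))  # 0.75
```

Under this assumption, a high-reputation evaluator pulls mean_vote toward their own vote, but the time decay still limits how far any single batch can move the score.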

Example Calculation

Previous score: 0.50
Time since last eval: 7 seconds
New vote: Flag (0.0)
Decay (α = e^(-0.01 × 7)): ≈ 0.93
New score: 0.93 × 0.5 + 0.07 × 0.0 = 0.465 (47%)
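
The arithmetic above rounds α to 0.93 before blending. As a quick check, this Python snippet (an illustrative sketch, with variable names chosen here) repeats the calculation with the unrounded α. The result differs only in the third decimal place and still displays as 47%.

```python
import math

prev_score = 0.50    # previous score
delta_t = 7          # seconds since the last evaluation
mean_vote = 0.0      # a single Flag vote
decay_lambda = 0.01  # λ

alpha = math.exp(-decay_lambda * delta_t)
new_score = alpha * prev_score + (1 - alpha) * mean_vote

print(f"alpha     = {alpha:.4f}")      # alpha     = 0.9324
print(f"new_score = {new_score:.4f}")  # new_score = 0.4662
print(f"displayed = {new_score:.0%}")  # displayed = 47%
```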

Vote Freshness

Freshness indicates how recent the evaluations are. It's the complement of the decay factor (1 - α), showing how much influence new votes had on the current score.

Freshness Formula

freshness = 1 - α
When α ≈ 0.93, freshness = 7%
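
Because freshness depends only on Δt and λ, it can be computed without the score itself, and the formula can be inverted to find the gap that produces a given freshness. A minimal Python sketch, with hypothetical function names freshness and gap_for_freshness:

```python
import math

DECAY_LAMBDA = 0.01  # λ, with Δt in seconds


def freshness(delta_t: float, decay_lambda: float = DECAY_LAMBDA) -> float:
    """Return 1 - α, the share of the new score contributed by new votes."""
    return 1.0 - math.exp(-decay_lambda * delta_t)


def gap_for_freshness(target: float, decay_lambda: float = DECAY_LAMBDA) -> float:
    """Invert freshness = 1 - e^(-λ × Δt) to get Δt in seconds."""
    if not 0.0 <= target < 1.0:
        raise ValueError("target must be in [0, 1)")
    return -math.log(1.0 - target) / decay_lambda


print(f"{freshness(7):.0%}")               # 7%
print(f"{gap_for_freshness(0.80):.0f} s")  # 161 s
```

With λ = 0.01, a 7-second gap gives about 7% freshness, while reaching 80% takes a gap of roughly 161 seconds.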

High Freshness (e.g., 80%)

The latest evaluations arrived after a long gap since the previous score update, so α is small. The current score heavily reflects the latest votes.

Low Freshness (e.g., 7%)

Evaluations came shortly after previous ones. The score is a blend of historical and recent data, providing stability.

Why Time-Weighted Scoring?

Reflects Current Performance

AI systems improve over time. Time-weighting ensures old evaluations don't unfairly drag down scores after improvements are made.

Prevents Gaming

You can't boost your score with a burst of positive reviews and then ignore it. The influence of earlier votes decays over time, so keeping a high score requires consistently good evaluations.

Smooth Transitions

Exponential decay creates gradual score changes rather than jarring jumps, giving you time to identify and address issues.

Technical Details

Parameter          Value   Description
λ (decay_lambda)   0.01    Decay rate constant
Pass vote          1.0     Vote value for approval
Flag vote          0.0     Vote value for flagging
Initial score      0.5     Starting score for new projects
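
To see the parameters working together, the sketch below replays a short sequence of evaluation rounds starting from the initial score. ScoringParams and replay are hypothetical names used for illustration, not part of the grandjury package, and the rounds are invented for the example.

```python
import math
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class ScoringParams:
    decay_lambda: float = 0.01   # λ, decay rate constant
    pass_vote: float = 1.0       # vote value for approval
    flag_vote: float = 0.0       # vote value for flagging
    initial_score: float = 0.5   # starting score for new projects


def replay(rounds: Iterable[Tuple[float, float]],
           params: ScoringParams = ScoringParams()) -> List[float]:
    """Apply (delta_t_seconds, mean_vote) rounds in order.

    Returns the score after each round, starting from params.initial_score.
    """
    score = params.initial_score
    history = []
    for delta_t, mean_vote in rounds:
        alpha = math.exp(-params.decay_lambda * delta_t)
        score = alpha * score + (1.0 - alpha) * mean_vote
        history.append(score)
    return history


p = ScoringParams()
# Made-up rounds: a Pass after 60 s, a Flag 5 s later, a Pass after 300 s
rounds = [(60, p.pass_vote), (5, p.flag_vote), (300, p.pass_vote)]
for score in replay(rounds, p):
    print(f"{score:.3f}")
```

In this sequence the Flag that arrives 5 seconds after the first Pass lowers the score only slightly, while the Pass after a 300-second gap moves it most of the way toward 1.0.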

Open Source

The scoring algorithm is available as an open-source Python package. Install it via pip to use the same algorithms in your own analysis.

$ pip install grandjury

Questions?

Have questions about how scores are calculated? Email support@humanjudge.com