Scoring Algorithm
How HumanJudge calculates time-weighted quality scores and freshness metrics for your AI evaluation reports.
Time-Weighted Quality Score
The quality score represents a time-weighted average of human evaluations. Recent votes have more influence than older ones, ensuring the score reflects the current state of your AI's performance.
The Formula
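A form consistent with the three steps described below, assuming standard exponential decay. The symbols S_old and S_new are labels introduced here for the previous and updated scores; they are not taken from the HumanJudge package.

```latex
\alpha = e^{-\lambda \, \Delta t}, \qquad
S_{\text{new}} = \alpha \, S_{\text{old}} + (1 - \alpha) \cdot \text{mean\_vote}
```

Here λ is the decay rate constant and Δt is the time elapsed since the previous score update.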
How It Works
Calculate Time Decay
The decay factor α ranges from 0 to 1 and depends on Δt, the time elapsed since the previous score update. When Δt is small (a recent update), α is close to 1, so the previous score keeps most of its weight. As Δt grows, α falls toward 0 and new votes dominate.
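A minimal sketch of this step, assuming the decay follows the standard exponential form α = exp(−λ·Δt) with λ = 0.01 from the parameter table. The function name is hypothetical, and the unit of Δt is an assumption because the documentation does not state it.

```python
import math

DECAY_LAMBDA = 0.01  # decay rate constant from the parameter table


def decay_factor(delta_t: float, decay_lambda: float = DECAY_LAMBDA) -> float:
    """Return alpha = exp(-lambda * delta_t), assuming exponential decay.

    delta_t is the time elapsed since the previous score update, in whatever
    unit lambda is defined for (the unit is not specified in the source).
    """
    if delta_t < 0:
        raise ValueError("delta_t must be non-negative")
    return math.exp(-decay_lambda * delta_t)
```

With this form, a gap of zero gives α = 1 (the previous score is kept unchanged), and α shrinks toward 0 as the gap grows.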
Weight New Votes
Each evaluator's vote is weighted by their reputation score. The mean_vote is the reputation-weighted average of all new votes (Pass = 1.0, Flag = 0.0).
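The reputation-weighted average can be sketched as follows. The function name, the (vote, reputation) pair layout, and the reputation values in the usage note are illustrative assumptions, not the package's actual API.

```python
from typing import Iterable, Tuple

PASS_VOTE = 1.0  # vote value for approval (from the parameter table)
FLAG_VOTE = 0.0  # vote value for flagging (from the parameter table)


def mean_vote(votes: Iterable[Tuple[float, float]]) -> float:
    """Reputation-weighted average of (vote_value, reputation) pairs."""
    votes = list(votes)
    total_weight = sum(rep for _, rep in votes)
    if total_weight <= 0:
        raise ValueError("need at least one vote with positive reputation")
    # Each vote contributes in proportion to its evaluator's reputation.
    return sum(value * rep for value, rep in votes) / total_weight
```

For example, `mean_vote([(PASS_VOTE, 0.9), (PASS_VOTE, 0.6), (FLAG_VOTE, 0.5)])` evaluates to 1.5 / 2.0 = 0.75, up to floating-point rounding: the two Pass votes outweigh the single Flag because their combined reputation is higher.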
Blend Old and New
The new score blends the decayed previous score with the new vote average. This ensures smooth transitions while giving recent evaluations appropriate influence.
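One way to express the blend, assuming it is the convex combination α · previous + (1 − α) · mean_vote. This matches the description above but is not quoted from the package source, and the function name is hypothetical.

```python
def blend_score(prev_score: float, mean_vote: float, alpha: float) -> float:
    """Blend the previous score with the new vote average.

    Assumes new = alpha * prev + (1 - alpha) * mean_vote, where alpha is the
    decay factor in [0, 1].
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * prev_score + (1.0 - alpha) * mean_vote
```

Because the result is a weighted average of two values in [0, 1], the score itself stays in [0, 1], and it can never jump past the new vote average in a single update.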
Example Calculation
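An illustrative calculation with hypothetical inputs, assuming the exponential-decay form α = e^(−λΔt) and the blend S_new = α · S_old + (1 − α) · mean_vote. The inputs are a previous score of 0.80, a gap of Δt = 24 time units (the unit is not specified), λ = 0.01, and three new votes: Pass from an evaluator with reputation 0.9, Pass from one with reputation 0.6, and Flag from one with reputation 0.5.

```latex
\begin{aligned}
\alpha &= e^{-0.01 \times 24} = e^{-0.24} \approx 0.787 \\
\text{mean\_vote} &= \frac{0.9 \times 1.0 + 0.6 \times 1.0 + 0.5 \times 0.0}{0.9 + 0.6 + 0.5} = \frac{1.5}{2.0} = 0.75 \\
S_{\text{new}} &\approx 0.787 \times 0.80 + 0.213 \times 0.75 \approx 0.789
\end{aligned}
```

The score moves only slightly, from 0.80 toward 0.75, because about 79% of the weight still sits on the previous score.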
Vote Freshness
Freshness indicates how much of the current score comes from the latest evaluations. It is the complement of the decay factor (1 − α), showing how much influence new votes had on the current score.
Freshness Formula
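Taking freshness as the weight placed on new votes, and assuming exponential decay for α, a consistent form is:

```latex
\text{freshness} = 1 - \alpha = 1 - e^{-\lambda \, \Delta t}
```

The result is a fraction between 0 and 1, shown as a percentage.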
High Freshness (e.g., 80%)
There has been a long gap since the last score update, so the previous score has lost most of its weight. The current score heavily reflects the latest votes.
Low Freshness (e.g., 7%)
Evaluations came shortly after previous ones. The score is a blend of historical and recent data, providing stability.
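To get a feel for these two cases, the sketch below computes freshness for a given gap and inverts it to find the gap that produces a target freshness. It assumes freshness = 1 − exp(−λ·Δt) with λ = 0.01; both function names are hypothetical and the unit of Δt is not specified in the documentation.

```python
import math

DECAY_LAMBDA = 0.01  # decay rate constant from the parameter table


def freshness(delta_t: float, decay_lambda: float = DECAY_LAMBDA) -> float:
    """Freshness as 1 - alpha, assuming alpha = exp(-lambda * delta_t)."""
    return 1.0 - math.exp(-decay_lambda * delta_t)


def gap_for_freshness(target: float, decay_lambda: float = DECAY_LAMBDA) -> float:
    """Invert the formula: the gap delta_t that yields a given freshness."""
    if not 0.0 <= target < 1.0:
        raise ValueError("target freshness must lie in [0, 1)")
    return -math.log(1.0 - target) / decay_lambda
```

Under these assumptions, a freshness of 7% corresponds to a gap of roughly 7.3 time units, while 80% corresponds to roughly 161 time units, which illustrates how much longer the gap must be before new votes dominate the score.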
Why Time-Weighted Scoring?
Reflects Current Performance
AI systems improve over time. Time-weighting ensures old evaluations don't unfairly drag down scores after improvements are made.
Prevents Gaming
A burst of positive reviews cannot lock in a high score. Older votes lose weight as new evaluations arrive, so keeping a high score requires consistent quality.
Smooth Transitions
Exponential decay creates gradual score changes rather than jarring jumps, giving you time to identify and address issues.
Technical Details
| Parameter | Value | Description |
|---|---|---|
| λ (decay_lambda) | 0.01 | Decay rate constant |
| Pass vote | 1.0 | Vote value for approval |
| Flag vote | 0.0 | Vote value for flagging |
| Initial score | 0.5 | Starting score for new projects |
Open Source
The scoring algorithm is available as an open-source Python package. Install it via pip to use the same algorithms in your own analysis.
Questions?
Have questions about how scores are calculated? Email support@humanjudge.com