Do Models Know They're Being Tested? Probing Eval-Awareness Across Scale and Architecture

Motivation

I started looking into eval-awareness because it sits upstream of almost every other trust problem in AI safety: safety evaluations are only trustworthy if models don’t behave differently when they detect they’re being tested. Eval-awareness — a model’s ability to distinguish evaluation contexts from deployment contexts — directly undermines our ability to assess model safety. Chaudhary et al. (2025) claimed this capability scales predictably as a power law across model families, with larger models becoming progressively more eval-aware. But that claim rested on sparse within-family data and excluded the entire Qwen model family, citing contamination concerns that were never empirically tested. This post reports what I found when I tested that claim directly across 11 models spanning 3 families — and why the answer is more complicated than a power law. ...

March 27, 2026 · 16 min