Do Models Know They're Being Tested? Probing Eval-Awareness Across Scale and Architecture
Across 11 models in three families, I find a size-dependent phase transition in how models internally encode eval-awareness: small models show their strongest probe signal in the final layers, while large models flip to the earliest layers, a pattern confirmed independently in Qwen 2.5 and Gemma 2. Probe AUROC does not scale monotonically with model size in either of those families, challenging prior claims of a universal power law. White-box probing consistently reveals stronger signals than black-box testing, and the size of the gap between the two methods is family-dependent.
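The layer-wise probing setup can be sketched as follows. This is a minimal illustration, not the actual pipeline: real residual-stream activations would be extracted from a model (e.g. Qwen 2.5 or Gemma 2) on labeled eval-style vs. deployment-style prompts, whereas here synthetic activations with a planted, depth-varying signal direction stand in for them, and the "peak layer" diagnostic corresponds to the late-peak vs. early-peak pattern described above.

```python
# Hypothetical sketch of layer-wise linear probing for eval-awareness.
# Synthetic activations stand in for real model hidden states so the
# example is self-contained and runnable.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_prompts, n_layers, d_model = 400, 8, 64

# Labels: 1 = eval-style prompt, 0 = deployment-style prompt.
y = rng.integers(0, 2, size=n_prompts)

# Synthetic residual-stream activations with a label-correlated direction
# whose strength grows with depth, mimicking a "late peak" layer profile.
acts = rng.normal(size=(n_prompts, n_layers, d_model))
direction = rng.normal(size=d_model)
layer_strength = np.linspace(0.2, 1.0, n_layers)
for layer in range(n_layers):
    acts[:, layer] += np.outer(y * layer_strength[layer], direction)

def probe_auroc_per_layer(acts, y):
    """Fit a logistic-regression probe per layer; return held-out AUROC."""
    aurocs = []
    for layer in range(acts.shape[1]):
        X_tr, X_te, y_tr, y_te = train_test_split(
            acts[:, layer], y, test_size=0.3, random_state=0, stratify=y)
        probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        aurocs.append(roc_auc_score(y_te, probe.predict_proba(X_te)[:, 1]))
    return np.array(aurocs)

aurocs = probe_auroc_per_layer(acts, y)
# Where the probe signal peaks: final layers vs. first layers is the
# diagnostic contrasted across model sizes in the text above.
peak_layer = int(np.argmax(aurocs))
```

With the planted depth-increasing signal, the probe AUROC rises toward the final layers; on real activations from a large model, the claim above is that this profile inverts, peaking in the earliest layers instead.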