We Should Consider Educating Models Before Training Them
Representational geometry at 1.03B parameters — same data, same architecture, only the order changes
Model: GPT-2 architecture (20L / 16H / 2048E, ~1.03B params) trained from scratch on a 21M-token classical curriculum (113 texts).
Conditions: Educated (sequenced, 7.5K steps) vs Trained (shuffled, 7.5K steps).
Probes: 54 concepts across 3 domains, hidden states at every layer.