Scope and limitations
Single architecture and scale. All results are from a 12-layer, 768-dimension GPT (~92M parameters). The curriculum ordering effect has not yet been tested on different architectures (Mamba, RWKV, AttnRes) or at larger scale (1B+ parameters). The four-architecture 1B scaling experiment is in preparation.
Single corpus. The 113-text classical curriculum is Western-philosophical in content. Whether the effect generalizes to other curricula (scientific, cross-cultural, mathematical) is untested. The cross-cultural concept inventory (MCAB, 200+ concepts) is in development.
No downstream evaluation. Gap, dimensionality, and clustering measure representational structure, not task performance. The educated model has not been evaluated on standard NLP benchmarks. At 92M parameters, benchmark performance would be dominated by scale rather than training order.
SAE dictionary saturation. The extended model saturates the 16x SAE dictionary from L10 onward. A 32x expansion confirms this is a capacity artifact: utilization drops to 12-17% at deeper layers, showing the same progressive expansion as the educated model. The three other conditions do not saturate at 16x.
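As a rough illustration of the saturation check, the sketch below computes dictionary utilization as the fraction of SAE latents that fire on at least one token. This definition, the function name `dictionary_utilization`, and the `threshold` parameter are assumptions made for illustration; the metric used in the experiments may be defined differently.

```python
import numpy as np

def dictionary_utilization(latent_acts, threshold=0.0):
    """Fraction of SAE latents that fire on at least one token.

    latent_acts: array of shape (n_tokens, n_latents) holding SAE activations.
    threshold:   an activation must exceed this value to count as firing
                 (hypothetical parameter, not taken from the paper).
    """
    acts = np.asarray(latent_acts)
    fired = (acts > threshold).any(axis=0)  # True for each latent that ever fires
    return float(fired.mean())

# Toy data: 8 latents, of which only the first 3 are ever active.
rng = np.random.default_rng(0)
toy_acts = np.zeros((100, 8))
toy_acts[:, :3] = rng.random((100, 3)) + 0.1
utilization = dictionary_utilization(toy_acts)
```

Under this definition, a saturated dictionary has utilization close to 1.0 at a given layer, and rerunning the same measurement with a larger expansion factor shows whether the model needs the extra capacity or the smaller dictionary was simply full.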
Weak absolute clustering. Silhouette scores are low in absolute terms (<0.06) for all conditions. The relative comparison (educated > trained at 12-13 of 13 layers) is robust, but the small absolute magnitude means the domain structure is subtle rather than dramatic.
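To make the scale of these scores concrete, the following is a minimal NumPy sketch of the standard mean silhouette coefficient with Euclidean distance. The function name `mean_silhouette` and the toy data are illustrative assumptions, not the evaluation code or data used here, and the distance metric in the actual analysis may differ.

```python
import numpy as np

def mean_silhouette(X, labels):
    """Mean silhouette coefficient (Euclidean), for at least two clusters."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Full pairwise distance matrix; fine for small illustrative inputs.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    classes = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = int(same.sum())
        if n_same == 1:
            continue  # singleton cluster: silhouette is defined as 0
        # Mean distance to the other members of the point's own cluster.
        a = dist[i, same].sum() / (n_same - 1)
        # Mean distance to the nearest other cluster.
        b = min(dist[i, labels == c].mean() for c in classes if c != labels[i])
        denom = max(a, b)
        scores[i] = (b - a) / denom if denom > 0 else 0.0
    return float(scores.mean())

# Well-separated toy clusters give a score near 1.
separated = mean_silhouette([[0, 0], [0, 1], [10, 0], [10, 1]], [0, 0, 1, 1])

# Random points with arbitrary labels give a score near 0.
rng = np.random.default_rng(0)
weak = mean_silhouette(rng.normal(size=(60, 5)), np.arange(60) % 3)
```

The coefficient ranges from -1 to 1, so values below 0.06 sit much closer to the unstructured toy case than to the well-separated one, which is why the per-layer comparison between conditions carries more weight than the absolute values.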