Curriculum Ordering: Mamba 100M

How does Mamba organize knowledge?

These scatter plots show where each of the 54 concept probes lands in the model's representation space at the final layer, at the end of training (step 5000). Each dot is one concept; colors indicate domain.

Does curriculum ordering produce different geometric structure in an SSM?

Each model's 768-dimensional activations are projected down to 2 dimensions with PCA. The shuffled model is projected into the sequenced model's eigenvector space, then Procrustes-rotated, so the two panels are directly comparable. Tight clusters of same-color dots indicate domain organization; scattered mixing indicates none.
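The projection and alignment described above can be sketched roughly as follows. This is a minimal illustration, not the dashboard's actual code: the function names (`pca_basis`, `procrustes_rotation`, `align_panels`) are hypothetical, the inputs are assumed to be 54 × 768 activation matrices with rows in the same concept order, and centering the shuffled activations on their own mean is an assumption.

```python
import numpy as np

def pca_basis(X, k=2):
    """Return the mean of X and its top-k principal directions as columns."""
    mean = X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by singular value.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k].T

def procrustes_rotation(A, B):
    """Orthogonal matrix R minimizing the Frobenius norm of A @ R - B."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

def align_panels(seq_acts, shuf_acts, k=2):
    """Project both conditions into the sequenced PCA space, then align."""
    mean_seq, basis = pca_basis(seq_acts, k)
    seq_2d = (seq_acts - mean_seq) @ basis
    # Assumption: the shuffled activations are centered on their own mean.
    shuf_2d = (shuf_acts - shuf_acts.mean(axis=0)) @ basis
    R = procrustes_rotation(shuf_2d, seq_2d)
    return seq_2d, shuf_2d @ R
```

Note that an orthogonal Procrustes solution may include a reflection as well as a rotation. Calling `align_panels(seq_acts, shuf_acts)` returns two 54 × 2 arrays that can be plotted side by side on shared axes.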

Figure: PCA scatter plots at layer L24, one panel per condition (Mamba Sequenced, Mamba Shuffled). Legend: Physical, Institutional, Moral.

Each dot is one word — a probe concept the model has seen during training. We feed that word to the trained model and record its internal activation at each layer.

Colors are domains. Teal = physical world (mass, velocity, friction). Yellow = institutional (contract, authority, property). Red = moral (fairness, harm, duty).

Clusters indicate organization. If same-color dots group together, the model has learned that those concepts are related. If colors are mixed randomly, the model treats them as unrelated.

Analogy: imagine plotting people by where they live, using GPS. A model with good geographic structure would cluster neighbors together. A model without it would scatter them randomly across the map.

GPT (attention) builds clusters. Mamba (SSM) does not — regardless of training order. Both conditions show negative silhouette throughout training, meaning domains are actively anti-clustered.

What does the internal structure look like?

These charts show per-layer geometry metrics at step 5000. Effective dimensionality measures how many independent directions the model uses to represent concepts. Silhouette score measures whether same-domain concepts cluster.

Does ordering affect representation geometry even when clustering doesn't emerge?

Both conditions show negative silhouette (no domain clustering) — a fundamental difference from GPT. But they diverge in dimensionality: the sequenced model maintains stable high-dimensional representations while the shuffled model collapses in deep layers.

Effective dimensionality per layer

Higher = richer representations. Effective dimensionality (90% variance threshold via SVD) counts how many independent directions the model uses to distinguish 54 concepts at a given layer.
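As a rough sketch of how such a count could be computed, the snippet below takes a concepts × hidden-size activation matrix, runs an SVD, and returns the smallest number of directions whose cumulative variance reaches 90%. The function name `effective_dim` is hypothetical, and mean-centering before the SVD is an assumption rather than something the source states.

```python
import numpy as np

def effective_dim(acts, threshold=0.90):
    """Smallest number of principal directions reaching `threshold` of variance."""
    # Assumption: activations are mean-centered across concepts first.
    centered = acts - acts.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variance = singular_values ** 2
    cumulative = np.cumsum(variance) / variance.sum()
    # Index of the first entry >= threshold, converted to a count.
    return int(np.searchsorted(cumulative, threshold) + 1)
```

With 54 concepts the centered matrix has rank at most 53, so that is the ceiling for this metric at any layer.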

Mamba runs hot. Both Mamba conditions start high (34+), compared to GPT which starts at 14. Mamba's state space doesn't compress early representations the way attention does.

The divergence is in the deep layers. The sequenced model stays flat (37–41 throughout). The shuffled model collapses sharply in the final layers, dropping to ~20. This suggests the shuffled model's late layers are doing less representational work — consistent with its memorization pattern.

Silhouette score per layer

Positive = domains cluster. Negative = domains anti-cluster. Silhouette score measures whether same-domain concepts are more similar to each other than to other domains.
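For illustration, a mean silhouette score over concept activations might be computed as below. This is a sketch under assumptions: it uses Euclidean distance (the distance metric behind the reported numbers is not stated here), and the function name `silhouette` is hypothetical.

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette over all points, using Euclidean distance (an assumption)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise distance matrix of shape (n, n).
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        if not same.any():
            scores.append(0.0)  # usual convention for a singleton domain
            continue
        a = D[i, same].mean()  # mean distance to own domain
        b = min(  # mean distance to the nearest other domain
            D[i, labels == other].mean()
            for other in np.unique(labels)
            if other != labels[i]
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return float(np.mean(scores))
```

A point scores negative when it sits closer, on average, to another domain's concepts than to its own, which is what "anti-clustered" means in these charts.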

Both Mamba conditions are negative throughout. This is the core architectural finding: attention (GPT) builds domain-organized geometry from curriculum ordering; Mamba's selective state space does not, regardless of training order.

But ordering still matters. The sequenced model's silhouette is consistently less negative than the shuffled model's. It is not building structure, but it is less actively disorganized.

Analogy: neither model is sorting books by genre, but the sequenced model has them in a pile while the shuffled model has them face-down in random stacks.

Cross-condition CKA per layer

Mean CKA = 0.569 between sequenced and shuffled Mamba conditions. For reference, Llama and Qwen (trained on completely different data) show CKA ~0.96. Training order produces more geometric divergence in Mamba than training on entirely different data produces between different model families — even without producing clustering.

CKA measures representational similarity between models. 1.0 = identical representations; 0.0 = completely different. We compute it per layer, comparing how both models represent the same 54 concepts.
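One common way to compute this is linear CKA, sketched below. Whether the reported numbers use the linear variant or a kernel variant is not stated, so treat the choice as an assumption; the function name `linear_cka` is likewise hypothetical. Both inputs are concepts × features matrices with rows in the same concept order.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two activation matrices with matching rows."""
    # Center each feature across the concept rows.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return float(cross / (norm_x * norm_y))
```

Linear CKA is invariant to orthogonal rotations and uniform rescaling of either representation, so a low value reflects a difference in how concepts are arranged relative to one another, not merely a different choice of axes.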

Low early, higher late. The first few layers diverge sharply (CKA ~0.14–0.40). Later layers converge somewhat but never fully. This mirrors the GPT pattern: ordering affects how early representations are built, and the effect propagates forward.

How does geometry develop over training?

These charts trace dimensionality and silhouette at each analysis checkpoint, at the final layer. They show how the two conditions diverge as training progresses.

When does the ordering effect appear?

Note: sequenced checkpoints are available at steps 500, 1000, 2500, 5000. Shuffled checkpoints are available at steps 1000–5000.

Effective dimensionality (final layer) over training

Teal = sequenced, gray = shuffled. The sequenced model's final-layer dimensionality is stable throughout training (37–41). The shuffled model starts similarly but collapses after step 3000, reaching ~20 by step 5000.

This is a late-training effect. The two conditions look similar through step 2000. The divergence emerges as the shuffled model's training signal pushes it toward memorization rather than generalization.

Silhouette (final layer) over training

Both conditions remain negative throughout. Neither Mamba model builds domain clustering at any point in training. This is a clean architectural null result: the effect observed in GPT is absent in Mamba.

The gap widens over time. The shuffled model's silhouette deteriorates progressively while the sequenced model's remains closer to zero. By step 5000 the gap is substantial.

Training dynamics and sparse features

This section covers training loss curves and SAE feature analysis, both for the shuffled condition only. The sequenced training log was not retrieved from the pod.

Does the shuffled model memorize? Do SAE features differentiate between conditions?

Training and validation loss (shuffled)

Final val-train gap: 5.35 for the shuffled model — a strong memorization signal. The sequenced training log was not retrieved from the pod; this chart shows shuffled only.

Train loss (lower line) vs val loss (upper line). When they diverge, the model is memorizing training data rather than learning patterns that generalize.

Gap of 5.35 is large. For context, the 1B GPT shuffled condition showed a gap of ~1.56. The Mamba shuffled model memorizes more severely — consistent with Mamba's recurrent architecture having less inherent regularization than attention.

Note on missing sequenced log. The sequenced training log was on the pod and was not retrieved in the results package. The geometry data (npz files) was retrieved separately. A future run with the refactored harness will capture both.

SAE alive features per layer (shuffled, 16× expansion)

Effectively 100% of features are alive across all 25 layers for the shuffled Mamba model (12,286–12,288 of 12,288 features active). The sequenced SAE was run on the pod but not retrieved. This result replicates the earlier finding that Mamba representations are not sparse: a Sparse Autoencoder finds no sparse bottleneck in the activation space.

A Sparse Autoencoder (SAE) learns to reconstruct activations using as few features as possible. "Alive" features are those that activate for at least 1% of inputs. A high alive count means the model's representations are dense — everything is active everywhere.
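The alive-feature count under the 1% definition above could be computed along these lines. This is a sketch: the function name `alive_features` is hypothetical, and it assumes the SAE's feature activations are available as an inputs × features array in which a value greater than zero means the feature fired.

```python
import numpy as np

def alive_features(feature_acts, min_rate=0.01):
    """Count features that fire on at least `min_rate` of inputs."""
    # Assumption: a feature "activates" when its activation is > 0.
    firing_rate = (feature_acts > 0).mean(axis=0)
    alive = firing_rate >= min_rate
    return int(alive.sum()), float(alive.mean())
```

For a 16× SAE on 768-dimensional activations there are 12,288 features, so a returned fraction near 1.0 corresponds to the "100% alive" result.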

GPT shows progressive sparsity. In the GPT sequenced model, alive feature counts vary substantially by layer, and the pattern differs between sequenced and shuffled conditions.

Mamba shows 100% alive everywhere. This held at both 16× and 64× expansion ratios, and now replicates for the shuffled condition. Mamba's state space computes through uniform activation rather than selective feature use. This is an architectural property, not a training artifact.

Analogy: GPT activates specific lights on a switchboard depending on what it is processing. Mamba lights up every switch, all the time.