Word embeddings from a Markov model

Fig. 1 Words of the enabled corpus, placed by the leading singular axes of the order-L transition matrix. Hue marks the source; only the most distinctive words are named.

Legend: Shakespeare, King James Bible, shared.

Methods — what the coordinates mean

The picture is the geometry of one object: the order-L Markov model estimated from whatever books are switched on. Nothing here is a neural network; it is the spectral decomposition of a transition matrix.

1. The order-L transition matrix

Fix the Markov order $L$ (the −/+ control). The model predicts the next word from the previous $L$ words. From the enabled corpus we count every observed context → next-word pair and form, for each current word $w$, its distribution over next words, aggregated over the order-$L$ contexts ending in $w$:

$$ P_{ww'} \;=\; P\!\left(w' \mid \text{last word}=w,\ \text{order } L\right) \;=\; \frac{C_L(w \to w')}{\sum_{u} C_L(w \to u)} . $$

At $L=1$ this is exactly the bigram $P(w'\mid w)$ the walk steps along. Each row of $P$ sums to one. Counts are additive across books, so toggling a book just adds or removes its counts before this normalization.
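
A minimal sketch of the $L=1$ case, assuming whitespace-tokenized toy texts. The helper names `bigram_counts` and `transition_matrix` and the two tiny "books" are illustrative assumptions, not the page's own implementation. It shows per-book counts being summed before row normalization.

```python
from collections import Counter

import numpy as np


def bigram_counts(tokens):
    """Count (current word, next word) pairs in one token list."""
    return Counter(zip(tokens, tokens[1:]))


def transition_matrix(counts, vocab):
    """Row-normalize pair counts into P, one row per current word."""
    index = {w: i for i, w in enumerate(vocab)}
    C = np.zeros((len(vocab), len(vocab)))
    for (w, nxt), n in counts.items():
        C[index[w], index[nxt]] = n
    row_sums = C.sum(axis=1, keepdims=True)
    # Words never seen as a context keep an all-zero row (no division by zero).
    return np.divide(C, row_sums, out=np.zeros_like(C), where=row_sums > 0)


# Two toy "books" (hypothetical data).
book_a = "the king is dead long live the king".split()
book_b = "the lord is my shepherd".split()

# Counts are additive: enabling both books means summing their Counters.
counts = bigram_counts(book_a) + bigram_counts(book_b)
vocab = sorted(set(book_a) | set(book_b))
P = transition_matrix(counts, vocab)
```

Switching a book off corresponds to leaving its `Counter` out of the sum. In this toy data "shepherd" is never followed by anything, so its row stays all zeros; how the real page treats such words is not stated in the text.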

2. From transitions to coordinates

Two words should sit together when the model continues them the same way—when their rows of $P$ are similar. Similarity of rows is measured by the Gram matrix

$$ G \;=\; P\,P^{\top}, \qquad G_{ab} \;=\; \langle P_{a\cdot},\, P_{b\cdot}\rangle . $$

We take its top eigenvectors (equivalently, the left singular vectors of $P$). With $P = U\Sigma V^{\top}$, word $w$'s embedding is row $w$ of $U\Sigma$:

$$ \mathbf{x}_w \;=\; \big(\sigma_1 U_{w1},\ \sigma_2 U_{w2},\ \dots,\ \sigma_k U_{wk}\big)\in\mathbb{R}^{k}. $$

The scatter plots components $2$ and $3$. Component $1$ is dropped on purpose: for a row-stochastic $P$ it mostly encodes raw word frequency and would flatten everything onto a simplex (the triangle).
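
The construction in this section can be sketched with NumPy as follows. The 4×4 matrix is a made-up row-stochastic example, and the variable names are assumptions for illustration. When every component is kept, the embedding rows reproduce the Gram matrix exactly; truncating to the first $k$ components only approximates it.

```python
import numpy as np

# Toy row-stochastic matrix over 4 words (hypothetical values).
P = np.array([
    [0.5, 0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],   # same row as word 0: same continuations
    [0.0, 0.1, 0.9, 0.0],
    [0.1, 0.0, 0.0, 0.9],
])

# Thin SVD: P = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(P, full_matrices=False)

# Row w of X is (sigma_1 U_w1, ..., sigma_k U_wk).
X = U * s

# Inner products of embedding rows equal the Gram matrix G = P P^T.
G = P @ P.T

# Plot coordinates: components 2 and 3 (zero-based columns 1 and 2).
coords = X[:, 1:3]
```

Words 0 and 1 have identical rows in `P`, so their embeddings coincide, which is the sense in which words with the same continuations sit together.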

3. The entropy ladder (Shannon 1951)

The readout $H(L)$ is the conditional entropy of the order-$L$ model on the enabled corpus,

$$ H(L) \;=\; -\sum_{c}\, P(c)\sum_{w'} P(w'\mid c)\,\log_2 P(w'\mid c) \quad\text{bits/word}, $$

where $c$ ranges over the order-$L$ contexts. The model classes nest, $\mathcal{H}_0\subset\mathcal{H}_1\subset\mathcal{H}_2\subset\cdots$ (an order-$L$ model is an order-$(L+1)$ model that ignores its oldest word), so $H$ falls as $L$ grows: more context, less surprise. Pushed far enough on a finite corpus, every long context has a single observed continuation and $H\to 0$: the model has memorized rather than generalized.
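
A small sketch of the readout, assuming $P(c)$ and $P(w'\mid c)$ are plain relative frequencies from the corpus. The function name `conditional_entropy` and the eight-token corpus are illustrative assumptions.

```python
from collections import Counter, defaultdict
from math import log2


def conditional_entropy(tokens, L):
    """Empirical H(L) in bits/word: next-word entropy given the previous L words."""
    by_context = defaultdict(Counter)
    for i in range(L, len(tokens)):
        context = tuple(tokens[i - L:i])   # empty tuple when L == 0
        by_context[context][tokens[i]] += 1

    total = sum(sum(nxt.values()) for nxt in by_context.values())
    H = 0.0
    for nxt in by_context.values():
        n_c = sum(nxt.values())
        p_c = n_c / total                  # P(c)
        for n in nxt.values():
            p = n / n_c                    # P(w' | c)
            H += p_c * p * -log2(p)
    return H


tokens = "a b a c a b a c".split()
ladder = [conditional_entropy(tokens, L) for L in range(3)]
```

On this toy corpus the ladder is 1.5, 4/7 ≈ 0.571, and 0 bits/word for $L = 0, 1, 2$. By $L=2$ every context has a single observed continuation, the memorization regime described above.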

4. Why the points move smoothly

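Toggling a book re-estimates $P$ and re-runs the SVD, whose axes are determined only up to sign flips and, where singular values nearly coincide, rotation. Before drawing, the new embedding $X$ is aligned to the previous one $Y$ by the orthogonal matrix $R=UV^{\top}$, where $U\Sigma V^{\top}$ is the SVD of $X^{\top}Y$ (orthogonal Procrustes; these $U$, $V$ are not the factors of $P$ from section 2). $XR$ is then as close to $Y$ as any orthogonal transform allows, so a word whose next-word behaviour did not change stays put and only real shifts move. Color marks which corpus a word lives in; tf-idf chooses which words get labels.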
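
The alignment step can be sketched as below. It assumes that the rows of `X` and `Y` refer to the same words in the same order; the text does not say how words that appear or disappear with a book are handled. The function name `procrustes_align` and the random test layout are illustrative.

```python
import numpy as np


def procrustes_align(X, Y):
    """Return (X @ R, R), with R the orthogonal matrix minimizing ||X R - Y||_F."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    R = U @ Vt
    return X @ R, R


rng = np.random.default_rng(0)
Y = rng.normal(size=(6, 2))           # previous 2-D layout (hypothetical)

theta = 0.7
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X = Y @ Q                             # same layout, arbitrarily rotated

X_aligned, R = procrustes_align(X, Y)
```

Because `X` differs from `Y` only by a rotation, the alignment recovers `Y` up to floating-point error. With real data the residual `X_aligned - Y` is the movement that remains on screen. SciPy offers the same solve as `scipy.linalg.orthogonal_procrustes`.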