Tree Positional Encodings

Bringing Hierarchical Structure into Transformers

Based on: Shiv & Quirk, "Novel Positional Encodings to Enable Tree-Based Transformers"

NeurIPS 2019

Vimal

The Problem — Structured Text Isn't Flat

Flat sequence

(diagram: tokens at positions 1 2 3 4 5 in a row)

Standard PE handles this.
Position is just a number.
The relationship between two positions depends only on their distance \(k\).

Tree structure

(diagram: a tree with a root node and child nodes A–E)

Code, JSON, XML, math expressions...
Position is a path.

How do we encode "go to parent" or "go to 2nd child" in a way the transformer can use?

What We Want

Sinusoidal PE gives us these guarantees for sequences:

  • Uniqueness: Every position gets a distinct PE vector
  • Relative navigation: \(\text{PE}(\text{pos}+k) = R_k \cdot \text{PE}(\text{pos})\) — shift by \(k\) is a fixed affine transform (rotation)

Can we get the same for trees?

  • Uniqueness: Can every node get a distinct PE vector?
  • Relative navigation: Can we compute a node's PE from another node's PE, using just the path between them?
Why "just the path"? In sinusoidal PE, the shift operation is a rotation matrix: a simple linear map that the transformer's layers (QKV projections, linear layers) can represent natively. We want tree navigation to have the same property, so that the transformer needs no special architecture to use it on tree-structured inputs.

Warm-up — Sinusoidal PE Does This for Sequences

\[ \text{PE}(\text{pos}, 2i) = \sin\!\left(\frac{\text{pos}}{10000^{2i/d}}\right), \;\; \text{PE}(\text{pos}, 2i{+}1) = \cos\!\left(\frac{\text{pos}}{10000^{2i/d}}\right) \]
Key property: \(\text{PE}(\text{pos} + k) = R_k \cdot \text{PE}(\text{pos})\) where \( R_k \) is a rotation matrix (from trig identities): \[ \small \begin{bmatrix} \sin(x+y) \\ \cos(x+y) \end{bmatrix} = \begin{bmatrix} \cos y & \sin y \\ -\sin y & \cos y \end{bmatrix} \begin{bmatrix} \sin x \\ \cos x \end{bmatrix} \]
  • Shifting position by \( k \) = fixed rotation matrix — an affine transform with \(\mathbf{b} = 0\)
  • Can we do the same for tree paths instead of linear shifts?
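This rotation property can be checked numerically; a minimal sketch in NumPy (d = 8 is an arbitrary even dimension chosen for illustration):

```python
import numpy as np

def sinusoidal_pe(pos, d=8):
    """Standard sinusoidal PE: (sin, cos) pairs at geometrically spaced frequencies."""
    i = np.arange(d // 2)
    freqs = 1.0 / 10000 ** (2 * i / d)
    pe = np.empty(d)
    pe[0::2] = np.sin(pos * freqs)
    pe[1::2] = np.cos(pos * freqs)
    return pe

def shift_matrix(k, d=8):
    """Block-diagonal rotation R_k such that PE(pos + k) = R_k @ PE(pos)."""
    i = np.arange(d // 2)
    freqs = 1.0 / 10000 ** (2 * i / d)
    R = np.zeros((d, d))
    for j, w in enumerate(freqs):
        c, s = np.cos(k * w), np.sin(k * w)
        R[2*j:2*j+2, 2*j:2*j+2] = [[c, s], [-s, c]]  # the 2x2 rotation from the trig identity
    return R

pos, k = 7, 5
assert np.allclose(shift_matrix(k) @ sinusoidal_pe(pos), sinusoidal_pe(pos + k))
```

Note that \(R_k\) depends only on the shift \(k\), not on the starting position: exactly the property we want to replicate for tree paths.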

From Distances to Paths

  • In sequences, the relationship between two positions is simply the distance \(k\)
  • In trees, the relationship is a path: a series of steps along branches, each step going up to a parent or down to a child
Desired property: For every path \(\phi\), there is an affine transform such that: \[ \text{PE}_b = A_\phi \, \text{PE}_a + \mathbf{b}_\phi \] This lets the transformer learn path-wise relationships within its linear layers.

Paths as Compositions of Steps

  • Two types of steps in any tree path:
    • \( D_i \) — step down to the \(i\)-th child
    • \( U \) — step up to the parent
  • Key insight: if each step is an affine transform, the whole path is too
    (composition of affines is affine)
  • Constraint: \( U \cdot D_i = I \; \forall \, i \in \{1, \ldots, n\} \): going up undoes going down
For any path \(\phi\), construct the transform as a composition of \(D\)'s and \(U\)'s: \[ A_\phi = D_{j_m} \cdots D_{j_1} \cdot U^s \] ( go up \(s\) steps, then down through children \(j_1, j_2, \ldots, j_m\) )

Examples — Navigating the Tree

(diagram: example tree with root r; x = child 0 of r, w = child 2 of r, y = child 2 of x, z = child 1 of y)
Up and down — path from z to w: up 3 steps to r, then down to 3rd child:   \( \text{PE}_w = D_2 \cdot U^3 \cdot \text{PE}_z \)
Going down — path from x to z (via y): step to 3rd child, then to 2nd child:   \( \text{PE}_z = D_1 \cdot D_2 \cdot \text{PE}_x \)
Going up — path from z to r: 3 steps up:   \( \text{PE}_r = U^3 \cdot \text{PE}_z \)

The Encoding Scheme

  • Two hyperparameters: \(n\) (maximum branching factor), \(k\) (maximum depth)
  • Each positional encoding is a vector of dimension \(n \cdot k\)
  • Each operator \(U, D_1, \ldots, D_n\) preserves this dimensionality
  • The root is encoded as the zero vector
  • Every other node is encoded according to its path from the root

Since paths from root consist only of steps downward \(\langle b_1, \ldots, b_L \rangle\):

\[ \mathbf{x} = D_{b_L} \, D_{b_{L-1}} \cdots D_{b_1} \, \mathbf{0} \]

Inside the Operators

  • Treat the PE vector as a stack of single steps (at most \(k\) entries, one per depth level)
  • Each step is represented as a one-hot \(n\)-vector indicating which child was taken
  • \( D_i \): pushes the one-hot vector for child \(i\) onto the stack (truncating the oldest entry to preserve dimension)
  • \( U \): pops the most recent entry off the stack (padding with zeros)
(figure: nodes r, x, y, z with their PE vectors shown as stacks of one-hots)
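A minimal sketch of these push/pop semantics (Python/NumPy; \(n = 3\), \(k = 3\) and the node names follow the example tree):

```python
import numpy as np

N, K = 3, 3  # branching factor and max depth (illustrative)

def step_down(pe, i, n=N):
    """D_i: push a one-hot for child i onto the front of the stack (oldest chunk falls off)."""
    onehot = np.zeros(n)
    onehot[i] = 1.0
    return np.concatenate([onehot, pe[:-n]])

def step_up(pe, n=N):
    """U: pop the most recent chunk off the stack, padding with zeros at the bottom."""
    return np.concatenate([pe[n:], np.zeros(n)])

root = np.zeros(N * K)        # the root is the zero vector
x = step_down(root, 0)        # r -> x  (child 0)
y = step_down(x, 2)           # x -> y  (child 2)
z = step_down(y, 1)           # y -> z  (child 1)

assert np.array_equal(step_up(z), y)   # pop undoes push while depth < k
```

The most recent step always sits in the first chunk, so ancestry reads front to back.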

Why Are These Operations Affine?

Concrete example: \(n = 3\) (ternary tree), \(k = 3\) (max depth) → PE dimension = 9

\( D_1 \) (push child 1)

\[ D_1 \mathbf{x} = \underbrace{\begin{bmatrix} 0&0&0&0&0&0&0&0&0 \\ 0&0&0&0&0&0&0&0&0 \\ 0&0&0&0&0&0&0&0&0 \\ \color{#42affa}{1}&0&0&0&0&0&0&0&0 \\ 0&\color{#42affa}{1}&0&0&0&0&0&0&0 \\ 0&0&\color{#42affa}{1}&0&0&0&0&0&0 \\ 0&0&0&\color{#42affa}{1}&0&0&0&0&0 \\ 0&0&0&0&\color{#42affa}{1}&0&0&0&0 \\ 0&0&0&0&0&\color{#42affa}{1}&0&0&0 \end{bmatrix}}_{A} \mathbf{x} + \underbrace{\begin{bmatrix} 0\\\color{#ff6b6b}{1}\\0 \\ 0\\0\\0 \\ 0\\0\\0 \end{bmatrix}}_{\mathbf{b}} \]
  • \(A\): shifts elements right by \(n\) positions (selects first 6, places at positions 4–9)
  • \(\mathbf{b}\): one-hot for child 1 in the first chunk

\( U \) (pop / go to parent)

\[ U \mathbf{x} = \underbrace{\begin{bmatrix} 0&0&0&\color{#42affa}{1}&0&0&0&0&0 \\ 0&0&0&0&\color{#42affa}{1}&0&0&0&0 \\ 0&0&0&0&0&\color{#42affa}{1}&0&0&0 \\ 0&0&0&0&0&0&\color{#42affa}{1}&0&0 \\ 0&0&0&0&0&0&0&\color{#42affa}{1}&0 \\ 0&0&0&0&0&0&0&0&\color{#42affa}{1} \\ 0&0&0&0&0&0&0&0&0 \\ 0&0&0&0&0&0&0&0&0 \\ 0&0&0&0&0&0&0&0&0 \end{bmatrix}}_{A} \mathbf{x} + \underbrace{\begin{bmatrix} 0\\0\\0 \\ 0\\0\\0 \\ 0\\0\\0 \end{bmatrix}}_{\mathbf{0}} \]
  • \(A\): shifts elements left by \(n\) positions (selects last 6, places at positions 1–6)
  • No translation needed — purely linear
Push/pop on a fixed-size vector = rearranging elements (matrix \(A\)) + inserting a constant (vector \(\mathbf{b}\)) = affine. Composing affines gives affine → \(\text{PE}_b = A_\phi \cdot \text{PE}_a + \mathbf{b}_\phi\).
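The push and pop can be written out as explicit affine maps; a sketch for the same \(n=3, k=3\) case (NumPy), checking that going up undoes going down:

```python
import numpy as np

N, K = 3, 3
DIM = N * K  # PE dimension n*k = 9

def down_matrix(i, n=N, dim=DIM):
    """Affine pair (A, b) for D_i: A shifts the stack right by n, b injects the one-hot."""
    A = np.zeros((dim, dim))
    A[n:, :-n] = np.eye(dim - n)   # copy the first dim-n entries to positions n..dim-1
    b = np.zeros(dim)
    b[i] = 1.0                     # one-hot for child i in the first chunk
    return A, b

def up_matrix(n=N, dim=DIM):
    """Matrix A for U: shifts the stack left by n; the bias is zero (purely linear)."""
    A = np.zeros((dim, dim))
    A[:-n, n:] = np.eye(dim - n)
    return A

A0, b0 = down_matrix(0)
A1, b1 = down_matrix(1)
A2, b2 = down_matrix(2)
AU = up_matrix()

root = np.zeros(DIM)
y = A2 @ (A0 @ root + b0) + b2   # y = D_2 D_0 0, as in the worked example
z = A1 @ y + b1                  # D_1: y -> z
assert np.array_equal(AU @ z, y) # U (D_1 y + b) == y
```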

Worked Example

Using our ternary tree (\(n=3, k=3\)). Start with node \(y\) from earlier: \(y = D_2 \cdot D_0 \cdot \mathbf{0}\). The most recent step sits in the first chunk of the stack:

\[ \mathbf{y} = [\underbrace{0,0,1}_{\text{child 2}},\underbrace{1,0,0}_{\text{child 0}},\underbrace{0,0,0}_{\text{empty}}] \]

Apply \(D_1\): go to child 1 of \(y\) → node \(z\)

\[ D_1 \mathbf{y} = A\mathbf{y} + \mathbf{b} = [\underbrace{0,0,0,\;0,0,1,\;1,0,0}_{A\mathbf{y}: \text{ shift right by 3}}] + [\underbrace{0,1,0}_{\mathbf{b}},\;0,0,0,\;0,0,0] = [\underbrace{0,1,0}_{\text{child 1}},\underbrace{0,0,1}_{\text{child 2}},\underbrace{1,0,0}_{\text{child 0}}] = \mathbf{z} \]

Apply \(U\): go back up from \(z\) → node \(y\)

\[ U \mathbf{z} = A\mathbf{z} + \mathbf{0} = [\underbrace{0,0,1,\;1,0,0,\;0,0,0}_{A\mathbf{z}: \text{ shift left by 3, pad zeros}}] = \mathbf{y} \]
Because each operation is a fixed matrix plus a constant vector, it is affine. And \( U \cdot D_1 \cdot \mathbf{y} = \mathbf{y} \): the constraint \( U \cdot D_i = I \) holds exactly whenever the push does not truncate a chunk, i.e. for nodes of depth less than \(k\).

Why Design the Embedding Instead of Learning It?

                          Learned (free)                       Designed (sinusoidal PE, tree PE)
  How                     Gradient descent finds embeddings    Engineered for algebraic properties
  Guarantees              None; structure may emerge           Exact by construction
  Flexibility             High                                 Constrained to the design
  Length generalization   No                                   Yes
Word2Vec: Learned embeddings can discover affine structure (\(\text{king} - \text{man} + \text{woman} \approx \text{queen}\)) — but it's approximate, not guaranteed. The tree PE we just saw gives you exactness by construction.

Practical Enhancements

Geometric Weighting

  • Scale each stack level by \( p^i \): \(\; [\;1 \cdot c_1 \;|\; p \cdot c_2 \;|\; p^2 \cdot c_3 \;|\; \cdots \;] \)
  • Learnable \( p \) — controls weight of deep vs shallow ancestry
  • Concatenate versions with different \( p \) (like multiple frequencies in sinusoidal PE)
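A minimal sketch of the weighting, with \(p\) as a fixed scalar for illustration (the paper makes it learnable):

```python
import numpy as np

def weight_pe(pe, p, n=3):
    """Scale the stack chunk at depth-offset i by p**i (chunk 0 = most recent step)."""
    k = len(pe) // n
    weights = np.repeat(p ** np.arange(k), n)   # e.g. [1,1,1, p,p,p, p^2,p^2,p^2]
    return weights * pe

pe = np.array([0, 0, 1, 1, 0, 0, 0, 0, 0], dtype=float)  # an example stack of one-hots
print(weight_pe(pe, 0.5))   # second chunk scaled by 0.5, third by 0.25
```

Concatenating `weight_pe(pe, p)` for several values of `p` plays the same role as the multiple frequencies in sinusoidal PE.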

Binarization

  • Any tree → binary via left-child-right-sibling encoding
  • Reduces branching factor to \( n = 2 \) → compact PE vectors
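A sketch of the left-child-right-sibling transform, using a hypothetical tuple representation ((label, children) in, (label, left, right) out):

```python
def to_lcrs(node, siblings=()):
    """Left-child-right-sibling: first child becomes the left child,
    next sibling becomes the right child."""
    label, children = node
    left = to_lcrs(children[0], children[1:]) if children else None
    right = to_lcrs(siblings[0], siblings[1:]) if siblings else None
    return (label, left, right)

# A small n-ary tree: r has children x and w; x -> y -> z is a chain
tree = ("r", [("x", [("y", [("z", [])])]), ("w", [])])
print(to_lcrs(tree))
# -> ('r', ('x', ('y', ('z', None, None), None), ('w', None, None)), None)
```

Every node now has at most two children, so \(n = 2\) suffices and the PE vectors shrink to dimension \(2k\).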

Takeaway

Tree PE encodes hierarchical structure as a stack of one-hot vectors,
making every tree navigation an exact affine transform —
just as sinusoidal PE makes every position shift a rotation.

References

  • Shiv & Quirk, "Novel Positional Encodings to Enable Tree-Based Transformers", NeurIPS 2019
  • Vaswani et al., "Attention Is All You Need", NeurIPS 2017
  • Mikolov et al., "Efficient Estimation of Word Representations in Vector Space" (Word2Vec), 2013

Thank you!