Tree Positional Encodings

Bringing Hierarchical Structure into Transformers

Based on: Shiv & Quirk, "Novel Positional Encodings to Enable Tree-Based Transformers"

NeurIPS 2019

Vimal

The Problem — Structured Text Isn't Flat

Flat sequence

(diagram: tokens at positions 1 2 3 4 5 in a row)

Standard PE handles this.
Position is just a number.
The relationship between two positions depends only on their distance \(k\).

Tree structure

(diagram: a tree with a root node and child nodes A–E)

Code, JSON, XML, math expressions...
Position is a path.

How do we encode "go to parent" or "go to 2nd child" in a way the transformer can use?

What We Want

Sinusoidal PE gives us these guarantees for sequences:

  • Uniqueness: Every position gets a distinct PE vector
  • Relative navigation: \(\text{PE}(\text{pos}+k) = R_k \cdot \text{PE}(\text{pos})\) — shift by \(k\) is a fixed affine transform (rotation)

Can we get the same for trees?

  • Uniqueness: Can every node get a distinct PE vector?
  • Relative navigation: Can we compute a node's PE from another node's PE, using just the path between them?
Why "just the path"? In sinusoidal PE, the shift operation is a rotation matrix: a simple linear map that the transformer's layers (QKV projections, linear layers) can represent natively. We want tree navigation to have the same property, so that the transformer needs no special architecture to use it on tree-structured inputs.

Warm-up — Sinusoidal PE Does This for Sequences

\[ \text{PE}(\text{pos}, 2i) = \sin\!\left(\frac{\text{pos}}{10000^{2i/d}}\right), \;\; \text{PE}(\text{pos}, 2i{+}1) = \cos\!\left(\frac{\text{pos}}{10000^{2i/d}}\right) \]
Key property: \(\text{PE}(\text{pos} + k) = R_k \cdot \text{PE}(\text{pos})\) where \( R_k \) is a rotation matrix (from trig identities): \[ \small \begin{bmatrix} \sin(x+y) \\ \cos(x+y) \end{bmatrix} = \begin{bmatrix} \cos y & \sin y \\ -\sin y & \cos y \end{bmatrix} \begin{bmatrix} \sin x \\ \cos x \end{bmatrix} \]
  • Shifting position by \( k \) = fixed rotation matrix — an affine transform with \(\mathbf{b} = 0\)
  • Can we do the same for tree paths instead of linear shifts?
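This rotation property can be checked numerically; a minimal sketch in NumPy (d = 8 is an arbitrary even dimension chosen for illustration):

```python
import numpy as np

def sinusoidal_pe(pos, d=8):
    """Standard sinusoidal PE: (sin, cos) pairs at geometrically spaced frequencies."""
    i = np.arange(d // 2)
    freqs = 1.0 / 10000 ** (2 * i / d)
    pe = np.empty(d)
    pe[0::2] = np.sin(pos * freqs)
    pe[1::2] = np.cos(pos * freqs)
    return pe

def shift_matrix(k, d=8):
    """Block-diagonal rotation R_k such that PE(pos + k) = R_k @ PE(pos)."""
    i = np.arange(d // 2)
    freqs = 1.0 / 10000 ** (2 * i / d)
    R = np.zeros((d, d))
    for j, w in enumerate(freqs):
        c, s = np.cos(k * w), np.sin(k * w)
        R[2*j:2*j+2, 2*j:2*j+2] = [[c, s], [-s, c]]  # the 2x2 rotation from the trig identity
    return R

pos, k = 7, 5
assert np.allclose(shift_matrix(k) @ sinusoidal_pe(pos), sinusoidal_pe(pos + k))
```

Note that \(R_k\) depends only on the shift \(k\), not on the starting position: exactly the property we want to replicate for tree paths.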

From Distances to Paths

  • In sequences, the relationship between two positions is simply the distance \(k\)
  • In trees, the relationship is a path: a series of steps along branches, each step going up to a parent or down to a child
Desired property: For every path \(\phi\), there is an affine transform such that: \[ \text{PE}_b = A_\phi \, \text{PE}_a + \mathbf{b}_\phi \] This lets the transformer learn path-wise relationships within its linear layers.

Paths as Compositions of Steps

  • Two types of steps in any tree path:
    • \( D_i \) — step down to the \(i\)-th child
    • \( U \) — step up to the parent
  • Key insight: if each step is an affine transform, the whole path is too
    (composition of affines is affine)
  • Constraint: \( U \cdot D_i = I \; \forall \, i \in \{1, \ldots, n\} \): going up undoes going down
For any path \(\phi\), construct the transform as a composition of \(D\)'s and \(U\)'s: \[ A_\phi = D_{j_m} \cdots D_{j_1} \cdot U^s \] ( go up \(s\) steps, then down through children \(j_1, j_2, \ldots, j_m\) )

Examples — Navigating the Tree

(diagram: example tree with root r; x = child 0 of r, w = child 2 of r, y = child 2 of x, z = child 1 of y)
Up and down — path from z to w: up 3 steps to r, then down to 3rd child:   \( \text{PE}_w = D_2 \cdot U^3 \cdot \text{PE}_z \)
Going down — path from x to z (via y): step to 3rd child, then to 2nd child:   \( \text{PE}_z = D_1 \cdot D_2 \cdot \text{PE}_x \)
Going up — path from z to r: 3 steps up:   \( \text{PE}_r = U^3 \cdot \text{PE}_z \)

The Encoding Scheme

  • Two hyperparameters: \(n\) (maximum branching factor), \(k\) (maximum depth)
  • Each positional encoding is a vector of dimension \(n \cdot k\)
  • Each operator \(U, D_1, \ldots, D_n\) preserves this dimensionality
  • The root is encoded as the zero vector
  • Every other node is encoded according to its path from the root

Since paths from root consist only of steps downward \(\langle b_1, \ldots, b_L \rangle\):

\[ \mathbf{x} = D_{b_L} \, D_{b_{L-1}} \cdots D_{b_1} \, \mathbf{0} \]

Inside the Operators

  • Treat the PE vector as a stack of single steps (at most \(k\) entries, one per depth level)
  • Each step is represented as a one-hot \(n\)-vector indicating which child was taken
  • \( D_i \): pushes the one-hot vector for child \(i\) onto the stack (truncating the oldest entry to preserve dimension)
  • \( U \): pops the most recent entry off the stack (padding with zeros)
(figure: nodes r, x, y, z with their PE vectors shown as stacks of one-hots)
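A minimal sketch of these push/pop semantics (Python/NumPy; \(n = 3\), \(k = 3\) and the node names follow the example tree):

```python
import numpy as np

N, K = 3, 3  # branching factor and max depth (illustrative)

def step_down(pe, i, n=N):
    """D_i: push a one-hot for child i onto the front of the stack (oldest chunk falls off)."""
    onehot = np.zeros(n)
    onehot[i] = 1.0
    return np.concatenate([onehot, pe[:-n]])

def step_up(pe, n=N):
    """U: pop the most recent chunk off the stack, padding with zeros at the bottom."""
    return np.concatenate([pe[n:], np.zeros(n)])

root = np.zeros(N * K)        # the root is the zero vector
x = step_down(root, 0)        # r -> x  (child 0)
y = step_down(x, 2)           # x -> y  (child 2)
z = step_down(y, 1)           # y -> z  (child 1)

assert np.array_equal(step_up(z), y)   # pop undoes push while depth < k
```

The most recent step always sits in the first chunk, so ancestry reads front to back.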

Why Are These Operations Affine?

Concrete example: \(n = 3\) (ternary tree), \(k = 3\) (max depth) → PE dimension = 9

\( D_1 \) (push child 1)

\[ D_1 \mathbf{x} = \underbrace{\begin{bmatrix} 0&0&0&0&0&0&0&0&0 \\ 0&0&0&0&0&0&0&0&0 \\ 0&0&0&0&0&0&0&0&0 \\ \color{#42affa}{1}&0&0&0&0&0&0&0&0 \\ 0&\color{#42affa}{1}&0&0&0&0&0&0&0 \\ 0&0&\color{#42affa}{1}&0&0&0&0&0&0 \\ 0&0&0&\color{#42affa}{1}&0&0&0&0&0 \\ 0&0&0&0&\color{#42affa}{1}&0&0&0&0 \\ 0&0&0&0&0&\color{#42affa}{1}&0&0&0 \end{bmatrix}}_{A} \mathbf{x} + \underbrace{\begin{bmatrix} 0\\\color{#ff6b6b}{1}\\0 \\ 0\\0\\0 \\ 0\\0\\0 \end{bmatrix}}_{\mathbf{b}} \]
  • \(A\): shifts elements right by \(n\) positions (selects first 6, places at positions 4–9)
  • \(\mathbf{b}\): one-hot for child 1 in the first chunk

\( U \) (pop / go to parent)

\[ U \mathbf{x} = \underbrace{\begin{bmatrix} 0&0&0&\color{#42affa}{1}&0&0&0&0&0 \\ 0&0&0&0&\color{#42affa}{1}&0&0&0&0 \\ 0&0&0&0&0&\color{#42affa}{1}&0&0&0 \\ 0&0&0&0&0&0&\color{#42affa}{1}&0&0 \\ 0&0&0&0&0&0&0&\color{#42affa}{1}&0 \\ 0&0&0&0&0&0&0&0&\color{#42affa}{1} \\ 0&0&0&0&0&0&0&0&0 \\ 0&0&0&0&0&0&0&0&0 \\ 0&0&0&0&0&0&0&0&0 \end{bmatrix}}_{A} \mathbf{x} + \underbrace{\begin{bmatrix} 0\\0\\0 \\ 0\\0\\0 \\ 0\\0\\0 \end{bmatrix}}_{\mathbf{0}} \]
  • \(A\): shifts elements left by \(n\) positions (selects last 6, places at positions 1–6)
  • No translation needed — purely linear
Push/pop on a fixed-size vector = rearranging elements (matrix \(A\)) + inserting a constant (vector \(\mathbf{b}\)) = affine. Composing affines gives affine → \(\text{PE}_b = A_\phi \cdot \text{PE}_a + \mathbf{b}_\phi\).
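The push and pop can be written out as explicit affine maps; a sketch for the same \(n=3, k=3\) case (NumPy), checking that going up undoes going down:

```python
import numpy as np

N, K = 3, 3
DIM = N * K  # PE dimension n*k = 9

def down_matrix(i, n=N, dim=DIM):
    """Affine pair (A, b) for D_i: A shifts the stack right by n, b injects the one-hot."""
    A = np.zeros((dim, dim))
    A[n:, :-n] = np.eye(dim - n)   # copy the first dim-n entries to positions n..dim-1
    b = np.zeros(dim)
    b[i] = 1.0                     # one-hot for child i in the first chunk
    return A, b

def up_matrix(n=N, dim=DIM):
    """Matrix A for U: shifts the stack left by n; the bias is zero (purely linear)."""
    A = np.zeros((dim, dim))
    A[:-n, n:] = np.eye(dim - n)
    return A

A0, b0 = down_matrix(0)
A1, b1 = down_matrix(1)
A2, b2 = down_matrix(2)
AU = up_matrix()

root = np.zeros(DIM)
y = A2 @ (A0 @ root + b0) + b2   # y = D_2 D_0 0, as in the worked example
z = A1 @ y + b1                  # D_1: y -> z
assert np.array_equal(AU @ z, y) # U (D_1 y + b) == y
```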

Worked Example

Using our ternary tree (\(n=3, k=3\)). Start with node \(y\) from earlier: \(y = D_2 \cdot D_0 \cdot \mathbf{0}\). The most recent step sits in the first chunk of the stack:

\[ \mathbf{y} = [\underbrace{0,0,1}_{\text{child 2}},\underbrace{1,0,0}_{\text{child 0}},\underbrace{0,0,0}_{\text{empty}}] \]

Apply \(D_1\): go to child 1 of \(y\) → node \(z\)

\[ D_1 \mathbf{y} = A\mathbf{y} + \mathbf{b} = [\underbrace{0,0,0,\;0,0,1,\;1,0,0}_{A\mathbf{y}: \text{ shift right by 3}}] + [\underbrace{0,1,0}_{\mathbf{b}},\;0,0,0,\;0,0,0] = [\underbrace{0,1,0}_{\text{child 1}},\underbrace{0,0,1}_{\text{child 2}},\underbrace{1,0,0}_{\text{child 0}}] = \mathbf{z} \]

Apply \(U\): go back up from \(z\) → node \(y\)

\[ U \mathbf{z} = A\mathbf{z} + \mathbf{0} = [\underbrace{0,0,1,\;1,0,0,\;0,0,0}_{A\mathbf{z}: \text{ shift left by 3, pad zeros}}] = \mathbf{y} \]
Because each operation is a fixed matrix plus a constant vector, it is affine. And \( U \cdot D_1 \cdot \mathbf{y} = \mathbf{y} \): the constraint \( U \cdot D_i = I \) holds exactly whenever the push does not truncate a chunk, i.e. for nodes of depth less than \(k\).

Why Design the Embedding Instead of Learning It?

                          Learned (free)                       Designed (sinusoidal PE, tree PE)
  How                     Gradient descent finds embeddings    Engineered for algebraic properties
  Guarantees              None; structure may emerge           Exact by construction
  Flexibility             High                                 Constrained to the design
  Length generalization   No                                   Yes
Word2Vec: Learned embeddings can discover affine structure (\(\text{king} - \text{man} + \text{woman} \approx \text{queen}\)) — but it's approximate, not guaranteed. The tree PE we just saw gives you exactness by construction.

Practical Enhancements

Geometric Weighting

  • Scale each stack level by \( p^i \): \(\; [\;1 \cdot c_1 \;|\; p \cdot c_2 \;|\; p^2 \cdot c_3 \;|\; \cdots \;] \)
  • Learnable \( p \) — controls weight of deep vs shallow ancestry
  • Concatenate versions with different \( p \) (like multiple frequencies in sinusoidal PE)
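A minimal sketch of the weighting, with \(p\) as a fixed scalar for illustration (the paper makes it learnable):

```python
import numpy as np

def weight_pe(pe, p, n=3):
    """Scale the stack chunk at depth-offset i by p**i (chunk 0 = most recent step)."""
    k = len(pe) // n
    weights = np.repeat(p ** np.arange(k), n)   # e.g. [1,1,1, p,p,p, p^2,p^2,p^2]
    return weights * pe

pe = np.array([0, 0, 1, 1, 0, 0, 0, 0, 0], dtype=float)  # an example stack of one-hots
print(weight_pe(pe, 0.5))   # second chunk scaled by 0.5, third by 0.25
```

Concatenating `weight_pe(pe, p)` for several values of `p` plays the same role as the multiple frequencies in sinusoidal PE.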

Binarization

  • Any tree → binary via left-child-right-sibling encoding
  • Reduces branching factor to \( n = 2 \) → compact PE vectors
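A sketch of the left-child-right-sibling transform, using a hypothetical tuple representation ((label, children) in, (label, left, right) out):

```python
def to_lcrs(node, siblings=()):
    """Left-child-right-sibling: first child becomes the left child,
    next sibling becomes the right child."""
    label, children = node
    left = to_lcrs(children[0], children[1:]) if children else None
    right = to_lcrs(siblings[0], siblings[1:]) if siblings else None
    return (label, left, right)

# A small n-ary tree: r has children x and w; x -> y -> z is a chain
tree = ("r", [("x", [("y", [("z", [])])]), ("w", [])])
print(to_lcrs(tree))
# -> ('r', ('x', ('y', ('z', None, None), None), ('w', None, None)), None)
```

Every node now has at most two children, so \(n = 2\) suffices and the PE vectors shrink to dimension \(2k\).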

Takeaway

Tree PE encodes hierarchical structure as a stack of one-hot vectors,
making every tree navigation an exact affine transform —
just as sinusoidal PE makes every position shift a rotation.

References

  • Shiv & Quirk, "Novel Positional Encodings to Enable Tree-Based Transformers", NeurIPS 2019
  • Vaswani et al., "Attention Is All You Need", NeurIPS 2017
  • Mikolov et al., "Efficient Estimation of Word Representations in Vector Space" (Word2Vec), 2013

Thank you!