Based on: Shiv & Quirk, "Novel Positional Encodings to Enable Tree-Based Transformers"
NeurIPS 2019
Vimal
The Problem — Structured Text Isn't Flat
Flat sequence
Standard PE handles this. Position is just a number, and the relationship between two PEs at distance \(k\) depends only on \(k\).
Tree structure
Code, JSON, XML, math expressions... Position is a path.
How do we encode "go to parent" or "go to 2nd child" in a way the transformer can use?
What We Want
Sinusoidal PE gives us these guarantees for sequences:
Uniqueness: Every position gets a distinct PE vector
Relative navigation: \(\text{PE}(\text{pos}+k) = R_k \cdot \text{PE}(\text{pos})\) — shift by \(k\) is a fixed affine transform (rotation)
Can we get the same for trees?
Uniqueness: Can every node get a distinct PE vector?
Relative navigation: Can we compute a node's PE from another node's PE, using just the path between them?
Why "just the path"? In sinusoidal PE, the shift operation is a rotation matrix — a simple linear operation that the transformer's layers (QKV projections, linear layers) can represent natively. We want tree navigation to have the same property: an operation simple enough that the transformer doesn't need any special architecture to use it on inputs which are structured like trees.
Key property: \(\text{PE}(\text{pos} + k) = R_k \cdot \text{PE}(\text{pos})\)
where \( R_k \) is a rotation matrix (from trig identities):
\[ \small \begin{bmatrix} \sin(x+y) \\ \cos(x+y) \end{bmatrix}
= \begin{bmatrix} \cos y & \sin y \\ -\sin y & \cos y \end{bmatrix}
\begin{bmatrix} \sin x \\ \cos x \end{bmatrix} \]
Shifting position by \( k \) = fixed rotation matrix — an affine transform with \(\mathbf{b} = 0\)
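The rotation identity is easy to check numerically. A minimal sketch for a single \((\sin, \cos)\) frequency pair (function names are illustrative, not from the paper):

```python
import math

def pe_pair(pos, freq=1.0):
    """One (sin, cos) pair of a sinusoidal positional encoding."""
    return (math.sin(pos * freq), math.cos(pos * freq))

def shift(pair, k, freq=1.0):
    """Apply the fixed rotation R_k to a (sin, cos) pair.
    R_k depends only on k, never on the absolute position."""
    s, c = pair
    y = k * freq
    return (math.cos(y) * s + math.sin(y) * c,
            -math.sin(y) * s + math.cos(y) * c)

# PE(pos + k) == R_k @ PE(pos)
pos, k = 7.0, 5.0
assert all(abs(a - b) < 1e-12
           for a, b in zip(shift(pe_pair(pos), k), pe_pair(pos + k)))
```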
Can we do the same for tree paths instead of linear shifts?
From Distances to Paths
In sequences, the relationship between two positions is simply the distance \(k\)
In trees, the relationship is a path: a series of steps along branches, each step going up to a parent or down to a child
Desired property: For every path \(\phi\), there is an affine transform \(A_\phi\) such that:
\[ \text{PE}_b = A_\phi \cdot \text{PE}_a \]
This allows the transformer to learn path-wise relationships within its embedding layers.
Paths as Compositions of Steps
Two types of steps in any tree path:
\( D_i \) — step down to the \(i\)-th child
\( U \) — step up to the parent
Key insight: if each step is an affine transform, the whole path is too (composition of affines is affine)
Constraint: \( U \cdot D_i = I \; \forall \, i \in \{1, \ldots, n\} \) — going up undoes going down
For any path \(\phi\), construct the transform as a composition of \(D\)'s and \(U\)'s:
\[ A_\phi = D_{j_m} \cdots D_{j_1} \cdot U^s \]
( go up \(s\) steps, then down through children \(j_1, j_2, \ldots, j_m\) )
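Because each step is affine, a whole path collapses into a single \((A, b)\) pair. A sketch with explicit shift matrices, assuming the stack-of-one-hot encoding the paper builds (variable names are ours):

```python
import numpy as np

n, k = 2, 4                      # branching factor, max depth
DIM = n * k

def one_hot(i):
    v = np.zeros(DIM)
    v[i - 1] = 1.0               # children are 1-indexed, as in D_i
    return v

# Each step is an affine map z -> A @ z + b.
A_down = np.eye(DIM, k=-n)       # shift the stack right by n entries
A_up   = np.eye(DIM, k=n)        # shift left by n, padding with zeros

def D(i):
    """Step down to the i-th child: shift right, write e_i on top."""
    return (A_down, one_hot(i))

U = (A_up, np.zeros(DIM))        # step up: pure linear map (bias 0)

def compose(f, g):
    """Affine composition: (f o g)(z) = f(g(z))."""
    (Af, bf), (Ag, bg) = f, g
    return (Af @ Ag, Af @ bg + bf)

# A path "up once, then down to child 2" collapses to ONE affine map:
A_phi = compose(D(2), U)

# And U o D_i = identity (on nodes above the maximum depth):
A_id, b_id = compose(U, D(1))
z = np.zeros(DIM); z[0] = 1.0    # a node one level below the root
assert np.allclose(A_id @ z + b_id, z)
```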
Examples — Navigating the Tree
Up and down — path from z to w:
up 3 steps to r, then down to its 2nd child:
\( \text{PE}_w = D_2 \cdot U^3 \cdot \text{PE}_z \)
Going down — path from x to z (via y):
step to the 2nd child (reaching y), then to its 1st child (the rightmost factor applies first):
\( \text{PE}_z = D_1 \cdot D_2 \cdot \text{PE}_x \)
Going up — path from z to r:
3 steps up:
\( \text{PE}_r = U^3 \cdot \text{PE}_z \)
The Encoding Scheme
Two hyperparameters: \(n\) (degree of tree), \(k\) (max depth)
Each positional encoding is a vector of dimension \(n \cdot k\)
Each operator \(U, D_1, \ldots, D_n\) preserves this dimensionality
The root is encoded as the zero vector
Every other node is encoded according to its path from the root
Since paths from the root consist only of downward steps \(\langle b_1, \ldots, b_L \rangle\), each node's PE is the stack of one-hot vectors \(e_{b_L}; \ldots; e_{b_1}\) (most recent step first), zero-padded to length \(n \cdot k\)
\(D_i\) shifts the vector right by \(n\) entries and writes \(e_i\) into the first block; \(U\) shifts left by \(n\) entries and pads with zeros
Example (\(n = k = 3\)): for \(\mathbf{z} = [\,1,0,0,\;1,0,0,\;0,0,1\,]\),
\[
U \mathbf{z} = A\mathbf{z} + \mathbf{0}
= [\underbrace{1,0,0,\;0,0,1,\;0,0,0}_{A\mathbf{z}:\ \text{shift left by } n = 3\text{, pad zeros}}]
= \mathbf{y}
\]
Both shifts are matrix multiplications (and \(D_i\) adds only a constant one-hot block), so every step is affine. And \( U \cdot D_1 \cdot \mathbf{y} = \mathbf{y} \) — the constraint \( U \cdot D_i = I \) holds exactly, as long as the depth stays within \(k\).
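These shift operations fit in a few lines. A minimal sketch of the scheme (the 1-indexed convention and function names are our choice):

```python
n, k = 3, 3                        # branching factor, max depth

def encode(path):
    """PE of the node reached from the root via 1-indexed child
    choices `path`; one-hot blocks stacked, most recent step first."""
    vec = []
    for b in reversed(path):
        vec += [1.0 if j == b else 0.0 for j in range(1, n + 1)]
    return vec + [0.0] * (n * k - len(vec))

def D(i, z):
    """Down to the i-th child: shift right by n, push one-hot e_i."""
    e_i = [1.0 if j == i else 0.0 for j in range(1, n + 1)]
    return e_i + z[:-n]

def U(z):
    """Up to the parent: shift left by n, pad with zeros (bias 0)."""
    return z[n:] + [0.0] * n

assert encode([]) == [0.0] * (n * k)   # root is the zero vector
y = encode([3, 1])                     # [1,0,0, 0,0,1, 0,0,0]
assert U(D(1, y)) == y                 # U . D_i = I, exactly
assert D(1, y) == encode([3, 1, 1])
assert U(y) == encode([3])
```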
Why Design the Embedding Instead of Learning It?
| | Learned (free) | Designed (sinusoidal PE, tree PE) |
|---|---|---|
| How | Gradient descent finds embeddings | Engineered for algebraic properties |
| Guarantees | None — structure may emerge | Exact by construction |
| Flexibility | High | Constrained to the design |
| Length generalization | No | Yes |
Word2Vec: Learned embeddings can discover affine structure
(\(\text{king} - \text{man} + \text{woman} \approx \text{queen}\))
— but it's approximate, not guaranteed.
The tree PE we just saw gives you exactness by construction.
Practical Enhancements
Geometric Weighting
Scale each stack level by \( p^i \):
\(\; [\;1 \cdot c_1 \;|\; p \cdot c_2 \;|\; p^2 \cdot c_3 \;|\; \cdots \;] \)
Learnable \( p \) — controls weight of deep vs shallow ancestry
Concatenate versions with different \( p \) (like multiple frequencies in sinusoidal PE)
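A sketch of the weighting, assuming the PE is laid out as \(k\) blocks of size \(n\) (names are illustrative; in practice \(p\) would be a learnable parameter):

```python
import numpy as np

def weight(pe, p, n):
    """Scale the i-th depth block of a tree PE by p**i."""
    k = len(pe) // n
    scales = np.repeat(p ** np.arange(k), n)   # [1,..,1, p,..,p, p^2,..]
    return scales * pe

pe = np.array([1., 0, 0, 0, 0, 1, 0, 0, 0])    # n = 3, k = 3
# Concatenating several p values mimics the multiple frequencies
# of sinusoidal PE:
multi = np.concatenate([weight(pe, p, n=3) for p in (0.5, 0.9)])
```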
Binarization
Any tree → binary via left-child-right-sibling encoding
Reduces branching factor to \( n = 2 \) → compact PE vectors
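A sketch of the left-child-right-sibling transform on tuple-shaped trees (this representation is our choice, not the paper's):

```python
def to_lcrs(node):
    """node = (label, [children]) -> (label, first_child, next_sibling)."""
    label, children = node
    first = None
    # Build the sibling chain right-to-left so each node links to the next.
    for child in reversed(children):
        converted = to_lcrs(child)
        first = (converted[0], converted[1], first)
    return (label, first, None)

# A ternary node becomes a chain of left/right links, so n = 2 suffices:
tree = ("r", [("a", []), ("b", []), ("c", [])])
lcrs = to_lcrs(tree)
assert lcrs[0] == "r"
assert lcrs[1][0] == "a"          # r's left link: first child a
assert lcrs[1][2][0] == "b"       # a's right link: sibling b
assert lcrs[1][2][2][0] == "c"    # b's right link: sibling c
```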
Takeaway
Tree PE encodes hierarchical structure as a stack of one-hot vectors,
making every tree navigation an exact affine transform —
just as sinusoidal PE makes every position shift a rotation.