Encode — pre-norm Transformer; bidirectional attention
over the whole manifest.
Aggregate — bottom-up pool →
one subtree vector per node + a doc_vec.
Heads — Key Head predicts the unified masked key;
Recon Head rebuilds a subtree from doc_vec.
Bottom-up Aggregation
Each internal node = pool( its children's subtree vectors ) → sᵢ;
the root's pool → doc_vec.
The pool is a simple mean — an attention-weighted
pool was also tried later.
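A minimal sketch of the bottom-up mean pool described above, in plain Python. The `Node` class, the field names, and the choice to use a leaf's own encoder vector as its subtree vector are assumptions for illustration, not the project's actual code.

```python
from dataclasses import dataclass, field
from typing import List

Vector = List[float]

@dataclass
class Node:
    key: str
    vec: Vector                                   # encoder output for this node (assumed)
    children: List["Node"] = field(default_factory=list)
    subtree_vec: Vector = field(default_factory=list)

def mean_pool(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def aggregate(node: Node) -> Vector:
    """Fill subtree_vec bottom-up and return it for this node."""
    if not node.children:
        # Assumption: a leaf's subtree vector is its own encoder vector.
        node.subtree_vec = list(node.vec)
    else:
        # Internal node = pool(children's subtree vectors).
        node.subtree_vec = mean_pool([aggregate(child) for child in node.children])
    return node.subtree_vec

# Usage: the root's pooled vector plays the role of doc_vec.
spec = Node("spec", [0.0, 0.0], [Node("replicas", [0.0, 2.0]), Node("selector", [0.0, 4.0])])
root = Node("root", [0.0, 0.0], [Node("kind", [1.0, 0.0]), spec])
doc_vec = aggregate(root)
```

Because every level averages its children equally, a deep subtree with many leaves counts no more than a single sibling leaf, which is one way a plain mean can dilute content.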
How It's Trained — Self-Supervised, Two Masks
No labels. 276K real manifests; the model learns by predicting
parts of the input that are hidden from it.
Masked-key prediction (MLM) — randomly mask ~15% of keys; the
Key Head predicts each one back from bidirectional
context (+ doc_vec + parent subtree vector).
Subtree reconstruction — separately mask 1–3 whole
subtrees; the Recon Head rebuilds each subtree's
bag of value sub-words (a multi-hot vector over the vocab)
from doc_vec alone — so doc_vec
must encode which content is present, not reproduce it.
The two masks are kept disjoint — a reconstruction subtree never
overlaps an MLM-masked key — so the objectives don't interfere.
MLM teaches local structure ("what key belongs here"); reconstruction
forces doc_vec to summarize a subtree.
loss = L_MLM + λ·L_recon, where λ weights the reconstruction term.
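One way to keep the two masks disjoint is to pick the reconstruction subtrees first and draw MLM keys only from what remains. The sketch below assumes that order; the function names, the index-set representation of subtrees, and the default λ are all hypothetical.

```python
import random

def sample_masks(key_positions, subtrees, mlm_rate=0.15, max_subtrees=3, rng=None):
    """Return (mlm_positions, recon_subtrees) with no overlap.

    key_positions: token indices that hold keys.
    subtrees: list of sets of token indices, one set per candidate subtree.
    """
    rng = rng or random.Random()
    # Mask 1 to max_subtrees whole subtrees (none if there are no candidates).
    n_sub = rng.randint(1, min(max_subtrees, len(subtrees))) if subtrees else 0
    recon_subtrees = rng.sample(subtrees, n_sub)
    covered = set().union(*recon_subtrees)
    # MLM keys come only from positions outside the reconstruction subtrees.
    eligible = [p for p in key_positions if p not in covered]
    n_mlm = min(len(eligible), round(mlm_rate * len(key_positions)))
    mlm_positions = set(rng.sample(eligible, n_mlm))
    return mlm_positions, recon_subtrees

def total_loss(mlm_loss, recon_loss, lam=1.0):
    """Combined objective: MLM loss plus lambda times reconstruction loss."""
    return mlm_loss + lam * recon_loss
```

Sampling the subtrees first makes disjointness hold by construction, at the cost of slightly fewer MLM targets when the chosen subtrees are large.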
How the Architecture Evolved
Three phases — the version numbers don't matter.
Phase A
The token-level model
A basic tree-aware BERT — MLM on keys, tree positional encoding.
Explored how to inject structure into the prediction target
(bigram → trigram), plus vocabulary & data-hygiene fixes.
→
Phase B
Document vectors
Added the bottom-up aggregator: sub-structure vectors +
a doc_vec. The model stops being a token-predictor and starts
producing document & subtree embeddings.
→
Phase C
Content in the vector
Sub-word tokenization for values; then pushing content into
doc_vec (mean-pool → attention-pool).
How Every Version Was Judged
Every test masks a key and
checks the model's top-5 predictions. The capability suite (93 cases) is scored on every version, so the
numbers stay comparable; two harder suites probe edge cases.
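The top-5 check can be sketched as follows. The score dictionary, the case format, and the function names are assumptions for illustration; the real suite may represent predictions differently.

```python
def top_k_hit(scores, expected_key, k=5):
    """True if expected_key is among the k highest-scoring candidate keys."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return expected_key in ranked[:k]

def suite_score(cases, k=5):
    """Number of (scores, expected_key) cases that pass the top-k check."""
    return sum(top_k_hit(scores, expected, k) for scores, expected in cases)

# Usage: one hypothetical case where the masked key sits under a Deployment spec.
scores = {"replicas": 0.41, "selector": 0.22, "template": 0.18,
          "strategy": 0.09, "paused": 0.04, "data": 0.01}
passed = top_k_hit(scores, "replicas")
```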
What's tested | Example
Structural basics | masked key must fit its parent / depth / sibling
Kind conditioning | same spec slot → Deployment replicas, Service type, ConfigMap data
 | key under status → predict status keys, not spec keys
Robustness (edge cases) | rare keys / novel CRDs → sensible guess, no [UNK] or CRD-schema leak
The catch: with enough data, the suite stops discriminating. Very
different architectures all land at ~90–93, and a test that can't tell designs apart can't
justify a change. So we added harder probes on doc_vec:
cosine similarity, retrieval, and version-ordering.
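A retrieval probe on doc_vec could look like the sketch below: for each manifest, find its nearest neighbour by cosine similarity and check whether the two share a group label. The labels (for example "same value" or "same shape") and the top-1 criterion are assumptions; the deck does not spell out the exact protocol.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieval_at_1(vectors, labels):
    """Fraction of items whose nearest other item carries the same label."""
    hits = 0
    for i, query in enumerate(vectors):
        others = [j for j in range(len(vectors)) if j != i]
        best = max(others, key=lambda j: cosine(query, vectors[j]))
        hits += labels[best] == labels[i]
    return hits / len(vectors)
```

Unlike the capability suite, a score like this depends on the geometry of the whole embedding space, so it can still move when key-prediction accuracy has saturated.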
Ablations
Does encoding the tree actually help?
We trained the same model with tree-PE and with a plain sequential PE.
With little data, tree-PE clearly won — passed ~5–6 more tests.
With more data, the measured gap closed — both ~92/93.
Tree-PE got there with 10× fewer positional parameters.
→ tree-PE buys data- &
parameter-efficiency. At scale the benchmark couldn't separate it from plain
PE — but that benchmark saturates, so "no gap" may just mean "can't tell."
Tried but Failed
Two dead-ends — and they failed in different ways.
Tree-bias attention. Add a learned tree-distance bias to the attention scores.
→ trained ~15× slower — abandoned before a single result.
An engineering wall.
Content in the vector. Push values into doc_vec (mean → attention pool).
→ doc_vec retrieved by value, not just by shape
(0 → 67%), and ordered numeric values such as replicas,
but the capability and structural probes both dropped.
A shape-vs-content tradeoff.
Wanted to Try
Directions we didn't get to.
Two vectors, not one. A separate shape-vector and content-vector —
each similarity job gets its own embedding instead of fighting over one.
The direct fix to the tradeoff.
Cross-document attention. Let the model see several manifests at once,
so it can learn relationships across them — does this Service select that Pod?
YAML-T5. A tree-aware encoder–decoder that generates and edits
YAML — without the overhead of a general language model.
What We Learned
An aux loss teaches nothing if its answer is already in the input
— the model reads it off the residual stream; loss → 0, no learning.
A sparse-target loss can silently collapse to all-zeros
— reconstruction loss flatlined until we up-weighted the positives.
Bundle five changes and you can't attribute the result — which is why
the next experiment changed exactly one.
Data hygiene often beats architecture — a data fix can outdo a clever model change.
Mean-pooling washes out minority signal when most of the input is stereotyped.
One vector can't carry two similarity metrics — shape vs. content.
Throughput is part of the architecture — correct but 15× slower is still a no-go.
Meta: attention routes around explicit structure when
a content shortcut is cheaper — kind-awareness came from reading the kind:
value, not our structural embeddings (removing values cost 36.7% key accuracy).
These transfer to any from-scratch model work — which is what makes the project
worth presenting, even though the later attempts never beat the baseline.
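The sparse-target collapse from the lessons above is easy to reproduce on paper. With a multi-hot target that is almost all zeros, predicting "nothing present" already scores a low loss; weighting the positive terms makes that shortcut expensive. The sketch below uses a positive-weighted binary cross-entropy, which is one common way to up-weight positives and an assumption here, not necessarily the scheme the project used.

```python
import math

def weighted_bce(logits, targets, pos_weight=1.0):
    """Mean binary cross-entropy over logits, with positive terms scaled by pos_weight."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Numerically stable log(sigmoid(z)).
        if z >= 0:
            log_p = -math.log1p(math.exp(-z))
        else:
            log_p = z - math.log1p(math.exp(z))
        log_not_p = log_p - z          # log(1 - sigmoid(z))
        total -= pos_weight * y * log_p + (1 - y) * log_not_p
    return total / len(logits)

# One positive among 100 vocabulary slots; the model predicts "absent" everywhere.
targets = [1] + [0] * 99
all_zero_logits = [-4.0] * 100
plain = weighted_bce(all_zero_logits, targets)                    # small: collapse looks fine
weighted = weighted_bce(all_zero_logits, targets, pos_weight=50)  # much larger: collapse is penalized
```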