YAML-BERT

Training a Tree-Aware Encoder for Kubernetes — from scratch

Built to answer one question:
how does attention behave when the input is a tree, not a sentence?

276K real Kubernetes manifests

Vimal

From YAML to Vectors

[Figure: pipeline. YAML manifest → Linearize (tree nodes: depth · sibling · node_type) → BPE tokenizer (subwords) → Embedding: token ⊕ tree-PE (depth + sibling + node_type) → Transformer Encoder (pre-norm) → Bottom-up Tree Aggregator (subtree vecs sᵢ, doc_vec) → Key Head [hᵢ ; doc_vec ; s_parent] and Recon Head (doc_vec → bag-of-subwords)]
  • Linearize — YAML text into a node sequence; each node has depth, sibling index, node type.
  • Tokenize — BPE tokenizer trained on keys and values.
  • Embed — token embedding + tree positional encoding (depth + sibling + node_type).
  • Encode — pre-norm Transformer; bidirectional attention over the whole manifest.
  • Aggregate — bottom-up pool → one subtree vector per node + a doc_vec.
  • Heads — Key Head predicts the unified masked key; Recon Head rebuilds a subtree from doc_vec.
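The Linearize step can be sketched in a few lines. This is a minimal illustration, not the project's code: it starts from an already-parsed manifest (plain dicts, lists and scalars) rather than YAML text, and the `Node` fields, the node-type labels and the `"-"` placeholder token for list items are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class Node:
    token: str              # key name or scalar value, as text
    depth: int              # distance from the top of the document
    sibling: int            # index among the parent's children
    node_type: str          # "key", "value" or "list_item" (hypothetical labels)
    parent: Optional[int]   # position of the parent in the sequence, None at top level


def linearize(obj: Any, depth: int = 0, parent: Optional[int] = None,
              out: Optional[List[Node]] = None) -> List[Node]:
    """Depth-first walk that flattens a parsed manifest into a node sequence."""
    if out is None:
        out = []
    if isinstance(obj, dict):
        for i, (key, value) in enumerate(obj.items()):
            out.append(Node(str(key), depth, i, "key", parent))
            linearize(value, depth + 1, len(out) - 1, out)
    elif isinstance(obj, list):
        for i, item in enumerate(obj):
            out.append(Node("-", depth, i, "list_item", parent))
            linearize(item, depth + 1, len(out) - 1, out)
    else:
        # Scalar leaf: the value attached to the key (or list item) above it.
        out.append(Node(str(obj), depth, 0, "value", parent))
    return out


manifest = {
    "kind": "Deployment",
    "spec": {"replicas": 3, "containers": [{"image": "nginx"}]},
}
nodes = linearize(manifest)
for node in nodes:
    print(node)
```

Each node carries the three fields the tree positional encoding needs (depth, sibling index, node type), plus a parent pointer that a bottom-up aggregator could reuse.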

Bottom-up Aggregation

[Figure: bottom-up pool (mean). Leaves image: nginx, port: 80, replicas: 3 (leaf vectors = encoder hidden states) pool upward into sᵢ for containers and sᵢ for spec (root), whose pool is doc_vec.]
Each internal node = pool( its children's subtree vectors ) → sᵢ; the root's pool → doc_vec.  The pool is a simple mean — an attention-weighted pool was also tried later.
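A small sketch of the mean-pool version, using plain Python lists as vectors. The tree shape (image and port under containers, containers and replicas under spec) and the two-dimensional leaf vectors are assumptions chosen for illustration; in the real model the leaves would be encoder hidden states.

```python
from typing import Dict, List

Vector = List[float]


def mean(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def aggregate(children: Dict[str, List[str]],
              leaf_vecs: Dict[str, Vector],
              root: str) -> Dict[str, Vector]:
    """Return one subtree vector per node; the root's vector plays the role of doc_vec."""
    subtree: Dict[str, Vector] = {}

    def visit(node: str) -> Vector:
        kids = children.get(node, [])
        if not kids:
            subtree[node] = leaf_vecs[node]                   # leaf: stand-in for a hidden state
        else:
            subtree[node] = mean([visit(k) for k in kids])    # internal: pool the children
        return subtree[node]

    visit(root)
    return subtree


children = {"spec": ["containers", "replicas"], "containers": ["image", "port"]}
leaf_vecs = {"image": [1.0, 0.0], "port": [0.0, 1.0], "replicas": [1.0, 1.0]}

subtree = aggregate(children, leaf_vecs, root="spec")
doc_vec = subtree["spec"]
print(subtree["containers"], doc_vec)
```

Because each level pools subtree vectors rather than raw leaves, doc_vec here is a mean of means: a lone leaf such as replicas weighs as much as the whole containers subtree.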

How It's Trained — Self-Supervised, Two Masks

  • No labels. 276K real manifests; the model learns by predicting parts of the input that are hidden from it.
  • Masked-key prediction (MLM) — randomly mask ~15% of keys; the Key Head predicts each one back from bidirectional context (+ doc_vec + parent subtree vector).
  • Subtree reconstruction — separately mask 1–3 whole subtrees; the Recon Head rebuilds each subtree's bag of value sub-words (a multi-hot vector over the vocab) from doc_vec alone — so doc_vec must encode which content is present, not reproduce it.
  • The two masks are kept disjoint — a reconstruction subtree never overlaps an MLM-masked key — so the objectives don't interfere.
MLM teaches local structure ("what key belongs here"); reconstruction forces doc_vec to summarize a subtree.  loss = MLM + λ·recon.
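One way to keep the two masks disjoint is to pick the reconstruction subtrees first and then draw MLM keys only from what is left. The sketch below assumes that order, a parent-pointer representation of the linearized tree, and subtrees rooted at keys; none of these details are confirmed by the slides, and the function names are hypothetical.

```python
import random
from typing import List, Optional, Set, Tuple


def subtree_indices(parents: List[Optional[int]], root: int) -> Set[int]:
    """Root plus all descendants; assumes every parent precedes its children."""
    members = {root}
    for i in range(root + 1, len(parents)):
        if parents[i] in members:
            members.add(i)
    return members


def sample_masks(parents: List[Optional[int]], key_idx: List[int],
                 rng: random.Random, mlm_rate: float = 0.15,
                 max_subtrees: int = 3) -> Tuple[Set[int], Set[int]]:
    """Return (mlm_keys, recon_nodes) with no index in common."""
    candidates = list(key_idx)
    rng.shuffle(candidates)
    n_subtrees = rng.randint(1, max_subtrees)

    recon: Set[int] = set()
    picked = 0
    for root in candidates:
        if picked == n_subtrees:
            break
        members = subtree_indices(parents, root)
        if members & recon:          # skip subtrees that nest inside or around a pick
            continue
        recon |= members
        picked += 1

    # MLM keys are drawn only from keys outside every reconstruction subtree.
    free_keys = [k for k in key_idx if k not in recon]
    n_mlm = min(len(free_keys), max(1, round(mlm_rate * len(key_idx))))
    mlm = set(rng.sample(free_keys, n_mlm))
    return mlm, recon


# Toy tree: kind -> Deployment; spec -> {replicas -> 3, containers -> [ {image -> nginx} ]}
parents = [None, 0, None, 2, 3, 2, 5, 6, 7]
key_idx = [0, 2, 3, 5, 7]
mlm, recon = sample_masks(parents, key_idx, random.Random(0))
print("MLM keys:", sorted(mlm), "recon nodes:", sorted(recon))
```

On a tree this small a single subtree can swallow most of the keys, so the MLM set may come back empty; on real manifests the 15% target would normally be reachable.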

How the Architecture Evolved

Three phases — the version numbers don't matter.

Phase A

The token-level model

A basic tree-aware BERT — MLM on keys, tree positional encoding. Explored how to inject structure into the prediction target (bigram → trigram), plus vocabulary & data-hygiene fixes.

Phase B

Document vectors

Added the bottom-up aggregator: sub-structure vectors + a doc_vec. The model stops being a token-predictor and starts producing document & subtree embeddings.

Phase C

Content in the vector

Sub-word tokenization for values; then pushing content into doc_vec (mean-pool → attention-pool).

How Every Version Was Judged

Every test masks a key and checks the model's top-5. The capability suite (93 cases) is scored on every version, so the numbers stay comparable; two harder suites probe edge cases.

What's tested                  Example
Structural basics              masked key must fit its parent / depth / sibling
Kind conditioning              same spec slot → Deployment replicas, Service type, ConfigMap data
Per-kind schema (15+ kinds)    StatefulSet serviceName, Job backoffLimit, HPA maxReplicas
Negative calibration           containers under metadata → must not be top-1
Spec vs status (structural)    key under status → predict status keys, not spec keys
Robustness (edge cases)        rare keys / novel CRDs → sensible guess, no [UNK] or CRD-schema leak
The catch: past enough data, the suite stops discriminating — very different architectures all land ~90–93. A test that can't tell designs apart can't justify a change. So we added harder probes on doc_vec: cosine, retrieval, version-ordering.
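The two kinds of check in the capability suite (expected key in the top-5, and a wrong key kept out of top-1) reduce to a ranking over the Key Head's scores. A minimal sketch follows; the score values are invented for the example and the function names are hypothetical.

```python
from typing import Dict, List


def top_k(scores: Dict[str, float], k: int = 5) -> List[str]:
    """Keys ranked by score, best first, truncated to k."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [key for key, _ in ranked[:k]]


def passes_top_k(scores: Dict[str, float], expected: str, k: int = 5) -> bool:
    """Positive case: the expected key must appear among the k best guesses."""
    return expected in top_k(scores, k)


def passes_not_top1(scores: Dict[str, float], forbidden: str) -> bool:
    """Negative calibration: the forbidden key must not be the single best guess."""
    return top_k(scores, 1)[0] != forbidden


# Made-up scores for a masked key directly under a Deployment's spec.
scores = {
    "replicas": 0.41, "selector": 0.22, "template": 0.18,
    "strategy": 0.07, "type": 0.05, "data": 0.03, "containers": 0.02,
}

print(top_k(scores))
print(passes_top_k(scores, "replicas"), passes_not_top1(scores, "containers"))
```

A suite score is then just the count of cases whose check returns True, which is also why it saturates: once most cases pass, the count has little room left to separate designs.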

Ablations

Does encoding the tree actually help? We trained the same model with tree-PE and with a plain sequential PE.

  • With little data, tree-PE clearly won — passed ~5–6 more tests.
  • With more data, the measured gap closed — both ~92/93.
  • Tree-PE got there with 10× fewer positional parameters.
→ tree-PE buys data- & parameter-efficiency. At scale the benchmark couldn't separate it from plain PE — but that benchmark saturates, so "no gap" may just mean "can't tell."
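Why tree-PE can be so much smaller is easy to see by counting embedding rows. The sketch assumes both encodings are learned lookup tables and that tree-PE sums three small tables (depth, sibling, node_type); every size below is a placeholder, not the project's configuration, so the printed ratio is illustrative only.

```python
def sequential_pe_params(max_len: int, d_model: int) -> int:
    """One learned vector per absolute position in the sequence."""
    return max_len * d_model


def tree_pe_params(max_depth: int, max_siblings: int,
                   n_node_types: int, d_model: int) -> int:
    """Three small tables whose rows are summed into one positional vector."""
    return (max_depth + max_siblings + n_node_types) * d_model


# Placeholder sizes chosen only to make the comparison concrete.
d_model = 256
seq = sequential_pe_params(max_len=512, d_model=d_model)
tree = tree_pe_params(max_depth=16, max_siblings=32, n_node_types=4, d_model=d_model)
print(f"sequential: {seq}  tree: {tree}  ratio: {seq / tree:.1f}x")
```

The sequential table grows with the longest manifest, while the tree tables grow only with the deepest nesting and the widest sibling list.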

Tried but Failed

Two dead-ends — and they failed in different ways.

Tree-bias attention. Add a learned tree-distance bias to the attention scores.
→ trained ~15× slower — abandoned before a single result. An engineering wall.
Content in the vector. Push values into doc_vec (mean → attention pool).
→ retrieved by value, not just shape (0 → 67%), and ordered numbers like replicas — but capability and structural probes both dropped. A shape-vs-content tradeoff.
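For the tree-bias idea, the slides give no implementation details, so the following is only one plausible reading: compute the path length between every pair of nodes and add a per-distance scalar to the raw attention scores. The bias values are fixed placeholders standing in for learned parameters, and the clipping of large distances is an assumption.

```python
from typing import List, Optional


def path_to_root(parents: List[Optional[int]], i: int) -> List[int]:
    """Node i followed by its ancestors up to the root."""
    path = [i]
    while parents[path[-1]] is not None:
        path.append(parents[path[-1]])
    return path


def tree_distance(parents: List[Optional[int]], a: int, b: int) -> int:
    """Number of edges on the path between a and b through their lowest common ancestor."""
    steps_from_a = {node: d for d, node in enumerate(path_to_root(parents, a))}
    for steps_from_b, node in enumerate(path_to_root(parents, b)):
        if node in steps_from_a:
            return steps_from_a[node] + steps_from_b
    raise ValueError("nodes do not share a root")


def add_tree_bias(scores: List[List[float]], parents: List[Optional[int]],
                  bias: List[float]) -> List[List[float]]:
    """scores[i][j] + bias[distance(i, j)], with distances clipped to the last bias slot."""
    n = len(parents)
    max_d = len(bias) - 1
    return [
        [scores[i][j] + bias[min(tree_distance(parents, i, j), max_d)] for j in range(n)]
        for i in range(n)
    ]


# Node 0 is the root; 1 and 2 are its children; 3 and 4 are children of 1.
parents = [None, 0, 0, 1, 1]
bias = [0.0, -0.1, -0.2, -0.4]            # placeholder for learned per-distance scalars
scores = [[0.0] * 5 for _ in range(5)]     # placeholder raw attention scores
biased = add_tree_bias(scores, parents, bias)
print(biased[3])
```

Note that the bias needs an n × n distance table per manifest on top of the attention scores themselves.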

Wanted to Try

Directions we didn't get to.

Two vectors, not one. A separate shape-vector and content-vector — each similarity job gets its own embedding instead of fighting over one. The direct fix to the tradeoff.
Cross-document attention. Let the model see several manifests at once, so it can learn relationships across them — does this Service select that Pod?
YAML-T5. A tree-aware encoder–decoder that generates and edits YAML — without the overhead of a general language model.

What We Learned

  • An aux loss teaches nothing if its answer is already in the input — the model reads it off the residual stream; loss → 0, no learning.
  • A sparse-target loss can silently collapse to all-zeros — reconstruction loss flatlined until we up-weighted the positives.
  • Bundle five changes and you can't attribute the result — which is why the next experiment changed exactly one.
  • Data hygiene often beats architecture — a data fix can outdo a clever model change.
  • Mean-pool collapses minority signal under stereotyped inputs.
  • One vector can't carry two similarity metrics — shape vs. content.
  • Throughput is part of the architecture — correct but 15× slower is still a no-go.
  • Meta: attention routes around explicit structure when a content shortcut is cheaper — kind-awareness came from reading the kind: value, not our structural embeddings (removing values cost 36.7% key accuracy).
These transfer to any from-scratch model work — which is what makes the project worth presenting, even though the later attempts never beat the baseline.
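The sparse-target collapse is easy to reproduce on paper. The sketch below assumes the reconstruction loss is a per-vocabulary-entry binary cross-entropy over the multi-hot target (the slides do not name the loss) and uses made-up sizes: 1000 vocabulary entries, 5 of them present, and a positive weight of 50.

```python
import math
from typing import List


def bce(probs: List[float], target: List[int], pos_weight: float = 1.0) -> float:
    """Mean binary cross-entropy; pos_weight scales the loss on the positive entries."""
    eps = 1e-7
    total = 0.0
    for p, t in zip(probs, target):
        p = min(max(p, eps), 1.0 - eps)     # keep log() finite
        total -= pos_weight * t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(target)


vocab_size, n_present = 1000, 5
target = [1] * n_present + [0] * (vocab_size - n_present)

collapsed = [0.01] * vocab_size                                   # predicts "nothing present"
informative = [0.9] * n_present + [0.01] * (vocab_size - n_present)

for name, weight in [("unweighted", 1.0), ("pos_weight=50", 50.0)]:
    print(name,
          "collapsed:", round(bce(collapsed, target, weight), 4),
          "informative:", round(bce(informative, target, weight), 4))
```

Without the weight, the all-zeros prediction already scores a small loss, so the gradient pulling toward the informative solution is weak; with the positives up-weighted, the collapsed prediction becomes far more expensive than the informative one.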

Thank you

Try the deployed model

huggingface.co/spaces/vimalk78/yaml-bert

Vimal