YAML-BERT

A tree-aware transformer for Kubernetes manifests

tree positional encoding + structural prediction + an honest evaluation arc

Vimal Kumar

YAML is a tree. Transformers see sequences.

What the encoder sees today

apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:                # ← "spec" at depth 0
  replicas: 3
  template:
    spec:            # ← "spec" at depth 2
      containers:
      - name: web
        image: nginx

Same token, different roles. Standard positional encoding (PE) tells the two apart by sequence position, but says nothing about their different tree positions.

What it actually is

apiVersion
kind
metadata
└─ name
spec
├─ replicas
└─ template
   └─ spec
      └─ containers[]
         ├─ name
         └─ image

Sequential position is 1D.
Tree position is multi-dimensional — depth, sibling, parent path.

Hypothesis: tell the model about the tree directly — encode depth + sibling + node type — and it should learn K8s structure faster and more reliably than inferring it all from context.
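A minimal sketch of how depth, sibling index and node type could be read off a parsed manifest. The `Node` tuple, the `linearize` name and the depth convention (children of a key sit one level below it, list items share their parent list's child level) are assumptions for illustration, not the project's actual linearizer.

```python
from typing import Any, Iterator, NamedTuple


class Node(NamedTuple):
    token: str
    depth: int
    sibling: int
    node_type: str  # KEY, VALUE, LIST_KEY or LIST_VALUE


def linearize(obj: Any, depth: int = 0, in_list: bool = False) -> Iterator[Node]:
    """Walk parsed YAML (dicts, lists, scalars) and yield tree-annotated tokens."""
    if isinstance(obj, dict):
        key_type = "LIST_KEY" if in_list else "KEY"
        value_type = "LIST_VALUE" if in_list else "VALUE"
        for sibling, (key, value) in enumerate(obj.items()):
            yield Node(str(key), depth, sibling, key_type)
            if isinstance(value, (dict, list)):
                # Nested structure: its children live one level deeper.
                yield from linearize(value, depth + 1)
            else:
                yield Node(str(value), depth, sibling, value_type)
    elif isinstance(obj, list):
        for index, item in enumerate(obj):
            if isinstance(item, (dict, list)):
                # Keys of a list item are tagged LIST_KEY (assumed convention).
                yield from linearize(item, depth, in_list=True)
            else:
                yield Node(str(item), depth, index, "LIST_VALUE")


# The Deployment from the first slide, already parsed into Python objects.
manifest = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "app"},
    "spec": {
        "replicas": 3,
        "template": {
            "spec": {"containers": [{"name": "web", "image": "nginx"}]},
        },
    },
}

nodes = list(linearize(manifest))
# Both "spec" tokens, now distinguished by (depth, sibling).
spec_positions = [(n.depth, n.sibling) for n in nodes if n.token == "spec"]
```

The two `spec` tokens come out with different `(depth, sibling)` pairs, which is exactly the information a sequence index does not carry.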

The approach

1. Tree positional encoding (replaces standard learned PE)

tree_pos = depth_emb[depth]      # 16 depth slots
         + sibling_emb[sibling]   # 32 sibling slots
         + node_type_emb[type]    # KEY / VALUE / LIST_KEY / LIST_VALUE
input    = LayerNorm(token_emb + tree_pos)

Architecture-agnostic — could plug into any sequential encoder. No sequence-index encoding at all; position is purely structural.
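To make the sum-then-normalize step concrete, here is a dependency-free toy version with plain Python lists. The table sizes (16, 32, 4) follow the slide; the toy width of 8, the random initialisation, the clamping of out-of-range indices and the LayerNorm without learned scale or bias are all assumptions of this sketch. A real model would use learned embedding tables in a tensor library.

```python
import math
import random
from typing import List

D_MODEL = 8  # toy width; the talk's model uses d_model 256
NODE_TYPES = ["KEY", "VALUE", "LIST_KEY", "LIST_VALUE"]
rng = random.Random(0)


def make_table(rows: int, dim: int = D_MODEL) -> List[List[float]]:
    """Randomly initialised embedding table (these are learned in a real model)."""
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(rows)]


depth_emb = make_table(16)                   # 16 depth slots
sibling_emb = make_table(32)                 # 32 sibling slots
node_type_emb = make_table(len(NODE_TYPES))  # 4 node types


def layer_norm(vec: List[float], eps: float = 1e-5) -> List[float]:
    """LayerNorm without learned scale/bias: zero mean, unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


def encode_input(token_vec: List[float], depth: int, sibling: int,
                 node_type: str) -> List[float]:
    """input = LayerNorm(token_emb + depth_emb + sibling_emb + node_type_emb)."""
    # Clamping deep or wide nodes into the last slot is an assumption.
    d = depth_emb[min(depth, len(depth_emb) - 1)]
    s = sibling_emb[min(sibling, len(sibling_emb) - 1)]
    t = node_type_emb[NODE_TYPES.index(node_type)]
    summed = [tok + a + b + c for tok, a, b, c in zip(token_vec, d, s, t)]
    return layer_norm(summed)


# Same token embedding for "spec", two different tree positions.
spec_vec = [rng.gauss(0.0, 0.02) for _ in range(D_MODEL)]
top_level = encode_input(spec_vec, depth=0, sibling=3, node_type="KEY")
nested = encode_input(spec_vec, depth=2, sibling=0, node_type="KEY")
```

Note there is no sequence index anywhere in `encode_input`: two tokens with the same text and the same tree coordinates get the same input vector, wherever they appear in the sequence.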

2. BERT-style encoder with masked-key prediction

6-layer transformer encoder, d_model 256, 8 heads, 7.8M params. MLM training: mask 15% of KEY tokens, predict each from bidirectional context.
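A sketch of key-only masking, assuming each token carries a node type. Treating LIST_KEY as maskable and always substituting `[MASK]` (no random-token or keep-as-is variants) are assumptions; the slide only states that 15% of KEY tokens are masked.

```python
import random
from typing import Dict, List, Sequence, Tuple

MASK = "[MASK]"
KEY_TYPES = ("KEY", "LIST_KEY")  # assumption: list keys are maskable too


def mask_keys(tokens: Sequence[str], node_types: Sequence[str],
              mask_prob: float = 0.15,
              seed: int = 0) -> Tuple[List[str], Dict[int, str]]:
    """Replace a random subset of key tokens with [MASK].

    Returns the masked token list and {position: original key} labels.
    Value tokens are never masked; they stay visible as context.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    labels: Dict[int, str] = {}
    for i, (token, node_type) in enumerate(zip(tokens, node_types)):
        if node_type in KEY_TYPES and rng.random() < mask_prob:
            masked[i] = MASK
            labels[i] = token
    return masked, labels


tokens = ["kind", "Deployment", "spec", "replicas", "3"]
node_types = ["KEY", "VALUE", "KEY", "KEY", "VALUE"]

# mask_prob=1.0 masks every key, which makes the behaviour easy to inspect.
all_masked, all_labels = mask_keys(tokens, node_types, mask_prob=1.0)
```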

3. Hybrid prediction targets — exploit structural locality

Simple target: parent::child → spec::replicas
Kind-specific target: kind::parent::child → Deployment::spec::replicas
Two prediction heads route by position; same shared encoder body
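The two target formats can be sketched as follows. The `::` separator and the two shapes come from the slide. The routing rule is not spelled out there, so the depth threshold used here (`kind_depth_limit`) is purely a hypothetical stand-in for "route by position", and `build_target` is an illustrative name.

```python
from typing import Tuple


def build_target(kind: str, parent: str, child: str, depth: int,
                 kind_depth_limit: int = 1) -> Tuple[str, str]:
    """Return (target string, head name) for one masked key.

    Shallow keys get the kind-specific trigram, deeper keys the simple
    bigram. The depth-based rule is an assumption made for illustration.
    """
    if depth <= kind_depth_limit:
        return f"{kind}::{parent}::{child}", "kind"
    return f"{parent}::{child}", "simple"


shallow = build_target("Deployment", "spec", "replicas", depth=1)
deep = build_target("Deployment", "resources", "requests", depth=5)
```

Because neither `kind` nor `parent` is part of the input embedding, the only way to get the trigram right is to attend to the `kind:` line and the enclosing key.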

Design principle: the kind (Deployment) and parent key (spec) are NOT in the input embedding. The model must discover them through attention to predict the trigram correctly. Probing: kind recoverability climbs 21% → 51% from embedding to final layer — built entirely by attention.

Trained on 276K K8s manifests from substratusai/the-stack-yaml-k8s, 30 epochs on an L4 GPU.

It works — v5 evaluation

Metric	v5
Pretrain capability tests	93 / 93
Structural tests	6 / 9
Document similarity (off-diagonal cosine)	0.46 (target < 0.70)
Model parameters	7.8 M
Training	30 epochs on L4 GPU
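One plausible reading of the document-similarity row, assuming it is the mean cosine similarity over all distinct pairs of document embeddings (the off-diagonal entries of the similarity matrix). How documents are pooled into a single vector is not stated on the slide, so the inputs here are just lists of floats.

```python
import math
from itertools import combinations
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_off_diagonal_cosine(embeddings: List[Sequence[float]]) -> float:
    """Average cosine over all pairs i < j.

    The matrix is symmetric, so this equals the mean of every
    off-diagonal entry. Lower means documents are better separated.
    """
    pairs = list(combinations(range(len(embeddings)), 2))
    total = sum(cosine(embeddings[i], embeddings[j]) for i, j in pairs)
    return total / len(pairs)


# Toy vectors: two identical documents and one orthogonal to both.
toy_embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
toy_score = mean_off_diagonal_cosine(toy_embeddings)
```

A score near 1.0 would mean every document collapses to the same direction; the 0.70 target on the slide is a ceiling, not a floor.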

Downstream: missing-field suggestion (real K8s YAML, no fine-tuning)

spec.template.spec.containers.0.resources:
    [99.8%] requests (STRONG)
spec:
    [99.4%] strategy (STRONG)
metadata:
    [96.7%] labels (STRONG)
ports.0:
    [95.6%] name (STRONG)

Run on a sample nginx Deployment that omits these fields. Model identifies them as missing and ranks confidence.

But ablations told a more honest story

Compared 4 positional-encoding variants under identical training (same data, same hyperparams, same seed within each column) at two data scales:

Variant	5K docs × 10 ep	25K × 10 ep (seed 42)	25K × 10 ep (seed 7)
FULL (depth + sibling + node_type)	85 / 93	92 / 93	92 / 93
no_depth	80 / 93	93 / 93	90 / 93
no_sibling	79 / 93	91 / 93	92 / 93
sequential (learned pos[seq_idx])	79 / 93	92 / 93	92 / 93

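At 5K docs, full tree-PE beats the other variants by 5–6 tests. At 25K, all variants land between 90 and 93, and the gap between variants is no larger than the gap between seeds (no_depth alone moves from 93 to 90). Tree-PE is a data-efficiency lever, not a permanent edge.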
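The noise argument can be checked directly from the ablation table above. This snippet only re-derives margins and spreads from the reported pass counts; the dictionary layout and variable names are mine.

```python
# Pass counts out of 93, copied from the ablation table.
results = {
    "FULL":       {"5k": 85, "25k_seed42": 92, "25k_seed7": 92},
    "no_depth":   {"5k": 80, "25k_seed42": 93, "25k_seed7": 90},
    "no_sibling": {"5k": 79, "25k_seed42": 91, "25k_seed7": 92},
    "sequential": {"5k": 79, "25k_seed42": 92, "25k_seed7": 92},
}

# How many more tests FULL passes than each ablation at 5K docs.
margin_5k = {
    name: results["FULL"]["5k"] - row["5k"]
    for name, row in results.items()
    if name != "FULL"
}

# Seed-to-seed movement of each variant at 25K docs.
seed_spread = {
    name: abs(row["25k_seed42"] - row["25k_seed7"])
    for name, row in results.items()
}

# Best-minus-worst variant within a single seed at 25K docs.
variant_spread = {
    seed: max(row[seed] for row in results.values())
    - min(row[seed] for row in results.values())
    for seed in ("25k_seed42", "25k_seed7")
}

print(margin_5k)       # {'no_depth': 5, 'no_sibling': 6, 'sequential': 6}
print(seed_spread)     # {'FULL': 0, 'no_depth': 3, 'no_sibling': 1, 'sequential': 0}
print(variant_spread)  # {'25k_seed42': 2, '25k_seed7': 2}
```

With two seeds this is only a sanity check, not a significance test: the largest seed-to-seed swing (3) exceeds the largest variant-to-variant gap (2), so the 25K column cannot rank the variants.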

Bigger issue surfaced: the capability suite saturates. It cannot differentiate trained models. Any architectural claim past 25K data needs harder tests.

I needed a bigger boat — and found a data issue

Bigger-boat test design

6 categories of harder tests, 25 cases total
Status-side completion
API-version awareness
Wrong-level prediction scoring
CRD-instance handling
Adversarial / OOD calibration
Confidence floors on existing tests

Embedding-analysis finding

CRDs are 3% of docs but 46% of training tokens.

The average CRD is 1,219 tokens versus 80 for the average Deployment, about 15× larger. Nearly half of the training signal came from JSON-schema content rather than actual K8s state.

Left bars: share of documents. Right bars: share of training tokens. CRDs jump from 3% to 46%.
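The document-share versus token-share comparison is a two-counter pass over the corpus. In the toy corpus below, the 3% CRD share and the 1,219-token CRD length come from the slide; the 44-token length for every other document is not a measured number, it is a value picked so that the toy corpus lands near the reported 46%.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def shares_by_kind(docs: Iterable[Tuple[str, int]]) -> Dict[str, Tuple[float, float]]:
    """docs yields (kind, n_tokens); returns {kind: (doc_share, token_share)}."""
    doc_counts: Counter = Counter()
    token_counts: Counter = Counter()
    for kind, n_tokens in docs:
        doc_counts[kind] += 1
        token_counts[kind] += n_tokens
    total_docs = sum(doc_counts.values())
    total_tokens = sum(token_counts.values())
    return {
        kind: (doc_counts[kind] / total_docs, token_counts[kind] / total_tokens)
        for kind in doc_counts
    }


# Toy corpus: 3 long CRDs among 100 documents (44 is an illustrative length).
toy_corpus = [("CustomResourceDefinition", 1219)] * 3 + [("other", 44)] * 97
shares = shares_by_kind(toy_corpus)
crd_doc_share, crd_token_share = shares["CustomResourceDefinition"]
```

Any per-token loss weights documents by length, so a check like this is worth running before training rather than after.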

Bigger boat exposed a sharp bug

Masked Deployment.status.replicas in v5 →

top-5 predictions:
  1. [UNK]              (99.63%)   ← model is 99% confident in [UNK]
  2. password           ( 0.07%)
  3. password           ( 0.05%)
  4. loadBalancer       ( 0.05%)
  5. tls.crt            ( 0.03%)

Root cause: status.X bigrams were below min_freq=100 in training → became [UNK] in the target vocab. The training loop was supervising the model to predict [UNK] confidently at 5.5% of all key positions. A masked-LM bug we'd been carrying since v1.

Measured across the whole corpus: 718,725 of 13.1M training key positions (5.5%) had [UNK] targets — all of them training the model to be confident about the wrong answer.
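A sketch of how such a count could be made: build the target vocab with a frequency cutoff, then measure how many supervised positions fall outside it. The function names and the toy targets are assumptions; only `min_freq` and the two corpus totals come from the slides.

```python
from collections import Counter
from typing import List, Set


def build_target_vocab(targets: List[str], min_freq: int) -> Set[str]:
    """Keep only targets seen at least min_freq times; the rest map to [UNK]."""
    counts = Counter(targets)
    return {target for target, count in counts.items() if count >= min_freq}


def unk_fraction(targets: List[str], vocab: Set[str]) -> float:
    """Share of positions whose training label would be [UNK]."""
    return sum(1 for target in targets if target not in vocab) / len(targets)


# Toy targets with a tiny cutoff so the effect is visible.
toy_targets = (
    ["spec::replicas"] * 6 + ["metadata::name"] * 3 + ["status::replicas"]
)
toy_vocab = build_target_vocab(toy_targets, min_freq=3)
toy_fraction = unk_fraction(toy_targets, toy_vocab)

# Sanity check on the reported corpus-wide figure.
reported_fraction = 718_725 / 13_100_000
print(f"{reported_fraction:.1%}")  # 5.5%
```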

v6.1 — fix the bug (5 lines)

# yaml_bert/dataset.py::_getitem_v4 — Lever 1
# BEFORE: any masked position got supervised, even if target was [UNK]
# AFTER: compute target first; if it'd be [UNK], skip this position entirely

target, head_type = compute_target(nodes[i], kind)
target_id = vocab.encode_simple_target(target) if head_type == "simple" \
                                                else vocab.encode_kind_target(target)
if skip_unk_targets and target_id == unk_id:
    continue   # ← the fix: position stays as input context, never becomes a label

Position stays in the input sequence as context. Just doesn't generate a training signal. Same model, same vocab, same data — only the masking policy changes.

v6.1 trained from scratch on the same 276K corpus, same 30 epochs, same hyperparameters. Only difference: this 5-line change.
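For readers without the repo, here is a self-contained toy version of the same policy. Everything apart from the `skip_unk_targets` idea is an assumption: the function name, the integer ids, and the use of -100 as the "no label" marker (the default `ignore_index` of PyTorch's cross-entropy loss).

```python
from typing import List, Sequence, Tuple

IGNORE_INDEX = -100  # positions with this label contribute no loss


def apply_masking(input_ids: Sequence[int], target_ids: Sequence[int],
                  candidates: Sequence[int], mask_id: int, unk_id: int,
                  skip_unk_targets: bool = True) -> Tuple[List[int], List[int]]:
    """Mask candidate positions and build labels, skipping [UNK] targets.

    A skipped position keeps its original input token (it stays visible
    as context) and keeps IGNORE_INDEX as its label.
    """
    masked = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for i in candidates:
        if skip_unk_targets and target_ids[i] == unk_id:
            continue  # never supervise the model toward [UNK]
        masked[i] = mask_id
        labels[i] = target_ids[i]
    return masked, labels


# Position 1 has an out-of-vocab target (id 0 plays the role of [UNK]).
fixed = apply_masking([10, 11, 12], [5, 0, 7], candidates=[0, 1],
                      mask_id=1, unk_id=0)
buggy = apply_masking([10, 11, 12], [5, 0, 7], candidates=[0, 1],
                      mask_id=1, unk_id=0, skip_unk_targets=False)
```

With the flag off, position 1 is both hidden from the model and labeled `[UNK]`, which is the v5 behaviour described above.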

v6.1 results — bug fixed, but vocab gap remains

Metric	v5	v6.1
Pretrain capability tests	93 / 93	92 / 93
Structural tests	6 / 9	9 / 9
Bigger boat — crd_pollution	4 / 4	4 / 4
Bigger boat — annotation_keys	2 / 2	2 / 2
Bigger boat — confidence_calib	3 / 3	3 / 3
Bigger boat — vocab_gap (status)	0 / 4	0 / 4 *

The [UNK] bug is gone. v5 predicted [UNK] at 99% on status fields; v6.1 predicts real keys (wrong ones, but not [UNK]). Structural tests went from 6/9 → 9/9.

* But bigger-boat vocab_gap still fails — because the status targets are still absent from the vocab. The model defaults to predicting spec-side keys. Lever 1 fixed the supervision; v6.2 needs to fix the vocab.

Lessons + what's next

Lessons

Capability test suites saturate. Once they do, they can no longer differentiate models. Plan for that.
Embedding analysis surfaces data-quality issues the test suite can't catch (CRD token dominance).
Some "failures" are supervision bugs, not architectural limits. v5's [UNK] overconfidence was the trainer doing exactly what we asked.
An inductive bias (tree-PE) can be a data-efficiency lever without being a permanent edge at scale.

v6.2 + v7 roadmap

v6.2: add status targets to the vocab (per-parent min_freq); apiVersion-aware kind head; universal depth cap to rebalance CRD content
v7: sub-tokenization for OOV resilience; tree-bias attention for wrong-sibling prediction; bigger-boat suite expansion
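One way the per-parent `min_freq` idea from the v6.2 item could look. This is a hypothetical sketch of a planned change, not code from the project: the function name, the override mechanism and the threshold of 5 for `status` are all made up for illustration.

```python
from collections import Counter
from typing import Dict, List, Optional, Set


def build_vocab(targets: List[str], default_min_freq: int = 100,
                per_parent_min_freq: Optional[Dict[str, int]] = None) -> Set[str]:
    """Frequency-filtered target vocab with optional per-parent thresholds.

    Targets look like "parent::child" or "kind::parent::child", so the
    parent is always the second-to-last component.
    """
    overrides = per_parent_min_freq or {}
    vocab: Set[str] = set()
    for target, count in Counter(targets).items():
        parent = target.split("::")[-2]
        if count >= overrides.get(parent, default_min_freq):
            vocab.add(target)
    return vocab


toy_targets = ["spec::replicas"] * 100 + ["status::replicas"] * 10

# Global cutoff only: the rare status target is dropped.
global_only = build_vocab(toy_targets)
# Lower bar for status-side keys (5 is an illustrative value).
with_override = build_vocab(toy_targets, per_parent_min_freq={"status": 5})
```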

Honest research arc: hypothesis → result → ablation → harder tests → surfaced bug → fix → honest gap analysis. The journey taught more than the headline number.

A few takeaways. One: capability tests are great, but they saturate. Don't trust them as the sole instrument. Two: when capability tests saturate, embedding analysis can surface data-quality issues you'd otherwise miss. The CRD finding was a "look at the data" moment that the test suite never would have caught. Three: not every failure is an architecture problem. Sometimes it's a bug. Lever 1 was a 5-line bug fix that took the status test from 99% [UNK] confidence to predicting real keys; the test still fails only because the status targets are missing from the vocab, which is v6.2's job. Four: inductive biases buy you data efficiency, not permanent advantages. Tree-PE helped at 5K and was a wash at 25K. That's a real result, not a failure. The honest evaluation arc itself is the point of this talk.