apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:                  # ← "spec" at depth 0
  replicas: 3
  template:
    spec:              # ← "spec" at depth 2
      containers:
        - name: web
          image: nginx
Same token, different roles. Standard positional encoding (PE) tells the model
the two spec keys sit at different sequence positions, but not that they sit at
different tree positions.
What it actually is
apiVersion
kind
metadata
└─ name
spec
├─ replicas
└─ template
   └─ spec
      └─ containers[]
         ├─ name
         └─ image
Sequential position is 1D. Tree position is multi-dimensional — depth, sibling, parent path.
Hypothesis: tell the model about the tree directly — encode
depth + sibling + node type — and it should learn K8s structure faster and
more reliably than inferring it all from context.
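To make the three signals concrete, here is a minimal sketch of how depth, sibling index, and node type could be read off a parsed manifest. The `TreePos` and `walk` names are hypothetical and not taken from the project; the choice to let list items add no depth level of their own is also an assumption of this sketch.

```python
from typing import Any, Iterator, NamedTuple


class TreePos(NamedTuple):
    key: str
    depth: int      # 0 for top-level keys
    sibling: int    # index among the keys of the same mapping
    node_type: str  # "map", "list", or "scalar"


def node_type_of(value: Any) -> str:
    if isinstance(value, dict):
        return "map"
    if isinstance(value, list):
        return "list"
    return "scalar"


def walk(node: Any, depth: int = 0) -> Iterator[TreePos]:
    """Yield one TreePos per mapping key, in document order."""
    if isinstance(node, dict):
        for sibling, (key, value) in enumerate(node.items()):
            yield TreePos(key, depth, sibling, node_type_of(value))
            yield from walk(value, depth + 1)
    elif isinstance(node, list):
        # Assumption: the list itself adds no level, so keys inside a list
        # item sit one level below the key that holds the list.
        for item in node:
            yield from walk(item, depth)


manifest = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "app"},
    "spec": {
        "replicas": 3,
        "template": {"spec": {"containers": [{"name": "web", "image": "nginx"}]}},
    },
}

positions = list(walk(manifest))
spec_depths = [p.depth for p in positions if p.key == "spec"]
print(spec_depths)  # [0, 2]
```

The two spec keys come out with identical key text but different (depth, sibling) pairs, which is exactly the distinction a sequence index alone does not expose.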
The approach
1. Tree positional encoding (replaces standard learned PE)
2. Two prediction heads route by position; same shared encoder body
Design principle: the kind (Deployment) and parent
key (spec) are NOT in the input embedding. The model must
discover them through attention to predict the trigram correctly.
Probing: kind recoverability climbs 21% → 51% from embedding to final layer —
built entirely by attention.
Trained on 276K K8s manifests from
substratusai/the-stack-yaml-k8s, 30 epochs on an L4 GPU.
It works — v5 evaluation
Metric                                      v5
Pretrain capability tests                   93 / 93
Structural tests                            6 / 9
Document similarity (off-diagonal cosine)   0.46 (target < 0.70)
Model parameters                            7.8 M
Training                                    30 epochs on L4 GPU
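The slide does not spell out how the document-similarity number in the table is computed. One plausible reading, sketched here as an assumption, is the mean cosine similarity over all pairs of distinct document embeddings (the off-diagonal entries of the similarity matrix). The function names and the toy vectors are illustrative only.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mean_off_diagonal_cosine(embeddings: Sequence[Sequence[float]]) -> float:
    """Average cosine similarity over all pairs i < j.

    The similarity matrix is symmetric, so averaging the upper triangle
    equals averaging every off-diagonal entry.
    """
    total = 0.0
    pairs = 0
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            total += cosine(embeddings[i], embeddings[j])
            pairs += 1
    return total / pairs


# Toy embeddings, not real model output.
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
score = mean_off_diagonal_cosine(docs)
print(round(score, 3))  # 0.471
```

A low value means documents are not collapsing onto one direction in embedding space, which is presumably why the target is an upper bound.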
Downstream: missing-field suggestion (real K8s YAML, no fine-tuning)
Run on a sample nginx Deployment that omits these fields.
The model flags them as missing and ranks them by confidence.
But ablations told a more honest story
Compared 4 positional-encoding variants under identical training
(same data, same hyperparameters, same seed within each column) at two data
scales, with a second seed at the larger scale:
Variant                              5K docs × 10 ep   25K × 10 ep (seed 42)   25K × 10 ep (seed 7)
FULL (depth + sibling + node_type)   85 / 93           92 / 93                 92 / 93
no_depth                             80 / 93           93 / 93                 90 / 93
no_sibling                           79 / 93           91 / 93                 92 / 93
sequential (learned pos[seq_idx])    79 / 93           92 / 93                 92 / 93
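At 5K docs, tree-PE wins by 5 to 6 tests (85 vs. 79 or 80). At 25K, all variants
land between 90 and 93, and the gaps between variants are no larger than the
swing between seeds (no_depth alone moves from 93 to 90). Tree-PE is a
data-efficiency lever, not a permanent edge.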
Bigger issue surfaced: the capability suite saturates. It
cannot differentiate trained models. Any architectural claim past 25K data
needs harder tests.
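The noise argument can be checked with a few lines of arithmetic on the ablation scores. The numbers below are copied from the table; the dictionary layout and variable names are just one convenient way to hold them.

```python
# Scores out of 93, copied from the ablation table.
results = {
    "FULL":       {"5k": 85, "25k_seed42": 92, "25k_seed7": 92},
    "no_depth":   {"5k": 80, "25k_seed42": 93, "25k_seed7": 90},
    "no_sibling": {"5k": 79, "25k_seed42": 91, "25k_seed7": 92},
    "sequential": {"5k": 79, "25k_seed42": 92, "25k_seed7": 92},
}

# How many more tests FULL passes than each ablated variant at 5K docs.
margins_5k = {
    name: results["FULL"]["5k"] - scores["5k"]
    for name, scores in results.items()
    if name != "FULL"
}

# Same variant, same data, different seed: how far does the score move?
seed_swing = {
    name: abs(scores["25k_seed42"] - scores["25k_seed7"])
    for name, scores in results.items()
}
max_seed_swing = max(seed_swing.values())

# Same seed, different variants: best score minus worst score.
variant_gap_25k = {
    seed: max(s[seed] for s in results.values()) - min(s[seed] for s in results.values())
    for seed in ("25k_seed42", "25k_seed7")
}
max_variant_gap_25k = max(variant_gap_25k.values())

print(margins_5k)           # {'no_depth': 5, 'no_sibling': 6, 'sequential': 6}
print(max_seed_swing)       # 3
print(max_variant_gap_25k)  # 2
```

At 25K the largest gap between variants (2 tests) is smaller than the largest swing a single variant shows across seeds (3 tests), so the table cannot rank the variants at that scale.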
I needed a bigger boat — and found a data issue
Bigger-boat test design
6 categories of harder tests, 25 cases total
Status-side completion
API-version awareness
Wrong-level prediction scoring
CRD-instance handling
Adversarial / OOD calibration
Confidence floors on existing tests
Embedding-analysis finding
CRDs are 3% of docs but 46% of training tokens.
The average CRD is 1,219 tokens versus 80 for the average Deployment, about 15× larger.
Nearly half of the training tokens, and so of the gradient signal, were JSON-schema content rather than actual K8s state.
[Figure: share of documents (left bars) vs. share of training tokens (right
bars). CRDs jump from 3% to 46%.]
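The gap between document share and token share is easy to measure per kind. The sketch below is a hypothetical helper, not the project's analysis code. The toy corpus uses the reported 1,219-token CRD average, while the 44-token size for all other documents is an invented value picked only so the toy lands near the reported 3% and 46%.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def shares(docs: Iterable[Tuple[str, int]]) -> Dict[str, Tuple[float, float]]:
    """Map each kind to (share of documents, share of tokens)."""
    doc_counts: Counter = Counter()
    token_counts: Counter = Counter()
    for kind, n_tokens in docs:
        doc_counts[kind] += 1
        token_counts[kind] += n_tokens
    n_docs = sum(doc_counts.values())
    n_tokens_total = sum(token_counts.values())
    return {
        kind: (doc_counts[kind] / n_docs, token_counts[kind] / n_tokens_total)
        for kind in doc_counts
    }


# Toy corpus: 3 large CRDs among 97 small documents (sizes partly hypothetical).
corpus = [("CustomResourceDefinition", 1219)] * 3 + [("Other", 44)] * 97
doc_share, token_share = shares(corpus)["CustomResourceDefinition"]
print(doc_share, round(token_share, 2))  # 0.03 0.46

# Size ratio quoted above: average CRD vs. average Deployment.
print(round(1219 / 80, 1))  # 15.2
```

Any per-token loss weights documents by length, so a check like this is worth running before training rather than after.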
Bigger boat exposed a sharp bug
Masked Deployment.status.replicas in v5 → top-5 predictions:
1. [UNK] (99.63%) ← model is 99% confident in [UNK]
2. password ( 0.07%)
3. password ( 0.05%)
4. loadBalancer ( 0.05%)
5. tls.crt ( 0.03%)
Root cause: status.X bigrams were below min_freq=100
in training → became [UNK] in the target vocab. The
training loop was supervising the model to predict [UNK] confidently at 5.5%
of all key positions. A masked-LM bug we'd been carrying since v1.
Measured across the whole corpus: 718,725 of 13.1M training
key positions (5.5%) had [UNK] targets — all of them training the model to be
confident about the wrong answer.
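A check of this kind can be sketched in a few lines: build the target vocab with a frequency cutoff, then count how many training targets fall outside it. The helper names and the toy target list are hypothetical; only min_freq=100 and the corpus totals come from the text above.

```python
from collections import Counter
from typing import List, Set


def build_vocab(targets: List[str], min_freq: int) -> Set[str]:
    """Keep only targets seen at least min_freq times; the rest map to [UNK]."""
    counts = Counter(targets)
    return {target for target, count in counts.items() if count >= min_freq}


def unk_fraction(targets: List[str], vocab: Set[str]) -> float:
    """Fraction of positions whose target would be encoded as [UNK]."""
    unk = sum(1 for target in targets if target not in vocab)
    return unk / len(targets)


# Toy targets: one rare status-side key among frequent spec-side keys.
targets = (
    ["spec.replicas"] * 150
    + ["spec.template"] * 120
    + ["status.replicas"] * 30
)
vocab = build_vocab(targets, min_freq=100)
frac = unk_fraction(targets, vocab)
print(frac)  # 0.1

# The reported corpus-wide figure, using the rounded 13.1M denominator.
print(round(718_725 / 13_100_000, 3))  # 0.055
```

In the toy, every status.replicas position would be supervised toward [UNK], which is the same failure as the one measured on the real corpus.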
v6.1 — fix the bug (5 lines)
# yaml_bert/dataset.py::_getitem_v4 — Lever 1
# BEFORE: any masked position got supervised, even if target was [UNK]
# AFTER: compute target first; if it'd be [UNK], skip this position entirely
target, head_type = compute_target(nodes[i], kind)
target_id = vocab.encode_simple_target(target) if head_type == "simple" \
    else vocab.encode_kind_target(target)
if skip_unk_targets and target_id == unk_id:
    continue  # ← the fix: position stays as input context, never becomes a label
The position stays in the input sequence as context; it just doesn't generate a
training signal. Same model, same vocab, same data. Only the masking policy
changes.
v6.1 trained from scratch on the same 276K corpus,
same 30 epochs, same hyperparameters. Only difference: this 5-line change.
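The effect of the masking policy can be shown in isolation with a small self-contained sketch. This is not the project's `_getitem_v4`; the `build_labels` function, the `UNK_ID` value, and the use of -100 as the "no loss here" label are assumptions (-100 is the default `ignore_index` of PyTorch's `CrossEntropyLoss`, but the source does not say which convention the trainer uses).

```python
import random
from typing import List

UNK_ID = 0      # assumed id of [UNK] in the target vocab
IGNORE = -100   # assumed label meaning "no loss at this position"


def build_labels(
    target_ids: List[int],
    mask_prob: float,
    rng: random.Random,
    skip_unk_targets: bool = True,
) -> List[int]:
    """Return one label per position; a position is masked iff its label != IGNORE."""
    labels = []
    for target_id in target_ids:
        masked = rng.random() < mask_prob
        if masked and skip_unk_targets and target_id == UNK_ID:
            # The fix: leave the position unmasked so it stays as input
            # context and never becomes a training label.
            masked = False
        labels.append(target_id if masked else IGNORE)
    return labels


target_ids = [5, UNK_ID, 7, UNK_ID]

# mask_prob=1.0 masks every position, which makes the comparison deterministic.
old_policy = build_labels(target_ids, 1.0, random.Random(0), skip_unk_targets=False)
new_policy = build_labels(target_ids, 1.0, random.Random(0), skip_unk_targets=True)
print(old_policy)  # [5, 0, 7, 0]
print(new_policy)  # [5, -100, 7, -100]
```

Under the old policy the two [UNK] positions are supervised like any other key; under the new one they contribute nothing to the loss.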
v6.1 results — bug fixed, but vocab gap remains
Metric                             v5        v6.1
Pretrain capability tests          93 / 93   92 / 93
Structural tests                   6 / 9     9 / 9
Bigger boat: crd_pollution         4 / 4     4 / 4
Bigger boat: annotation_keys       2 / 2     2 / 2
Bigger boat: confidence_calib      3 / 3     3 / 3
Bigger boat: vocab_gap (status)    0 / 4     0 / 4 *
The [UNK] bug is gone. v5 predicted [UNK] at 99% on status
fields; v6.1 predicts real keys (wrong ones, but not [UNK]). Structural tests
went from 6/9 → 9/9.
* But bigger-boat vocab_gap still fails — because the
status targets are still absent from the vocab. The model defaults to
predicting spec-side keys. Lever 1 fixed the supervision; v6.2 needs to fix the
vocab.
Lessons + what's next
Lessons
Capability test suites saturate. Once they do, they can no
longer differentiate models. Plan for that.
Embedding analysis surfaces data-quality issues the test suite
can't catch (CRD token dominance).
Some "failures" are supervision bugs, not architectural limits.
v5's [UNK] overconfidence was the trainer doing exactly what we asked.
An inductive bias (tree-PE) can be a data-efficiency lever
without being a permanent edge at scale.
v6.2 + v7 roadmap
v6.2: add status targets to the vocab (per-parent min_freq);
apiVersion-aware kind head; universal depth cap to rebalance CRD content
v7: sub-tokenization for OOV resilience; tree-bias attention
for wrong-sibling prediction; bigger-boat suite expansion
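As a rough illustration of the per-parent min_freq idea in the v6.2 item, the sketch below lets the frequency cutoff depend on a target's parent key. v6.2 is future work, so everything here is hypothetical: the function name, the "parent.key" target format, and the threshold of 10 for status are all invented for the example.

```python
from collections import Counter
from typing import Dict, List, Set


def build_vocab_per_parent(
    targets: List[str],
    default_min_freq: int,
    per_parent_min_freq: Dict[str, int],
) -> Set[str]:
    """Keep a target if its count meets the cutoff for its parent key."""
    counts = Counter(targets)
    vocab = set()
    for target, count in counts.items():
        parent = target.split(".", 1)[0]  # "status.replicas" -> "status"
        threshold = per_parent_min_freq.get(parent, default_min_freq)
        if count >= threshold:
            vocab.add(target)
    return vocab


# Toy targets: status-side keys are rare compared with spec-side keys.
targets = (
    ["spec.replicas"] * 150
    + ["spec.template"] * 120
    + ["status.replicas"] * 30
)

global_cutoff = build_vocab_per_parent(targets, 100, {})
relaxed_status = build_vocab_per_parent(targets, 100, {"status": 10})
print("status.replicas" in global_cutoff)   # False
print("status.replicas" in relaxed_status)  # True
```

The design choice is to lower the bar only where the data is known to be sparse, so rare spec-side noise stays out of the vocab while status keys get in.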
Honest research arc: hypothesis → result → ablation → harder
tests → surfaced bug → fix → honest gap analysis. The journey
taught more than the headline number.