DECODE-RAPL

Predicting CPU Power from OS-Level Metrics

DElayed Embedding COnvolutional network for DEcoding RAPL

A Deep Learning Journey Through Multiple Architectures

Problem & Challenge

  • RAPL (Running Average Power Limit) — Intel's hardware power meter — is unavailable in VMs
  • Goal: Predict CPU package power (Watts) using only VM-visible OS metrics from /proc
  • Same CPU% ≠ Same Power: 50% CPU on AVX-512 matrix multiply (~45W) vs. scalar sorting (~30W)
  • OS metrics miss microarchitectural details: instruction mix, cache behavior, pipeline utilization
  • Power has temporal dynamics: thermal inertia, frequency scaling, ramp-up/cool-down
Core question: Can we reconstruct enough hidden CPU state from OS-level observables to predict power accurately?
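
To make "VM-visible OS metrics from /proc" concrete, here is a minimal sketch that turns two snapshots of /proc/stat into utilization percentages and a context-switch rate. The four output features match the ones the v1 model uses later in the deck; the function names, the sample snapshot strings, and the use of log1p for the log transform are assumptions made for illustration, not the project's collector.

```python
import math

CPU_FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]

def parse_proc_stat(text):
    """Return (cpu_jiffies_by_field, context_switches_since_boot) from /proc/stat text."""
    cpu, ctxt = {}, 0
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "cpu":        # aggregate line over all cores
            cpu = dict(zip(CPU_FIELDS, map(int, parts[1:1 + len(CPU_FIELDS)])))
        elif parts[0] == "ctxt":     # total context switches since boot
            ctxt = int(parts[1])
    return cpu, ctxt

def os_features(prev_text, curr_text, dt_seconds):
    """Compute utilization shares and context-switch rate between two snapshots."""
    prev_cpu, prev_ctxt = parse_proc_stat(prev_text)
    curr_cpu, curr_ctxt = parse_proc_stat(curr_text)
    delta = {name: curr_cpu[name] - prev_cpu[name] for name in curr_cpu}
    total = sum(delta.values()) or 1          # avoid division by zero
    ctx_rate = (curr_ctxt - prev_ctxt) / dt_seconds
    return {
        "user_pct": 100.0 * delta["user"] / total,
        "system_pct": 100.0 * delta["system"] / total,
        "iowait_pct": 100.0 * delta["iowait"] / total,
        "log_ctx_per_sec": math.log1p(ctx_rate),   # assumed form of log(ctx/sec)
    }

# Hypothetical snapshots taken one second apart (made-up numbers).
PREV = "cpu  1000 0 500 8000 100 0 0 0 0 0\nctxt 50000\n"
CURR = "cpu  1050 0 520 8020 110 0 0 0 0 0\nctxt 51000\n"
features = os_features(PREV, CURR, dt_seconds=1.0)
print(features)
```

On a real guest the two strings would come from reading /proc/stat twice. Nothing in these counters distinguishes an AVX-512 loop from a scalar one, which is the gap the rest of the deck runs into.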
Approach 1

MS-TCN: Multi-Scale Temporal CNN

[Figure: MS-TCN Architecture]
  • 19 system metrics at 62.5 Hz, 740K params
  • Multi-scale convolutions (kernel 3/5/7)
  • 6 dilated temporal blocks (dilation 1→32)
  • Multi-head attention (8 heads)
  • Validation: R² = 0.96, MAE = 3.50W
  • Real workloads: Failed — data 98% skewed to <30W
Lesson: The problem isn't model complexity — it's the information content of the features.
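
For a sense of how much history the dilated blocks can see, the arithmetic below estimates the receptive field. It assumes one dilated convolution per block and dilations that double from 1 to 32 (1, 2, 4, 8, 16, 32); the slides give only the endpoints, and they do not say which kernel size the dilated blocks use, so all three multi-scale sizes are shown.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field, in samples, of stacked dilated 1D convolutions (one per block)."""
    return 1 + (kernel_size - 1) * sum(dilations)

SAMPLE_PERIOD_MS = 16.0                     # 62.5 Hz sampling
dilations = [2 ** i for i in range(6)]      # assumed doubling: 1, 2, 4, 8, 16, 32

for k in (3, 5, 7):
    rf = receptive_field(k, dilations)
    print(f"kernel {k}: {rf} samples, about {rf * SAMPLE_PERIOD_MS / 1000:.2f} s")
```

Under these assumptions a kernel of 3 covers 127 samples, roughly 2 seconds of history. The network already sees a long temporal context, which fits the lesson that model capacity is not the limiting factor.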
Approach 2

v1: Delay Embedding + AE + LSTM

Key Idea: Takens' Theorem

  • Treat CPU power as a dynamical system with hidden state
  • Time-delayed copies reconstruct full state space
  • h(t) = [y(t), y(t-τ), ..., y(t-24τ)]
  • 4 features × 25 delays = 100-D input
  • Features: user%, system%, iowait%, log(ctx/sec)
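
The embedding h(t) above can be sketched in a few lines. This is a toy version under stated assumptions: the feature-by-feature ordering of the flat vector and the function name are illustrative choices, and the input series is synthetic.

```python
def delay_embed(series, t, n_delays=25, tau=1):
    """Build [y(t), y(t - tau), ..., y(t - (n_delays - 1) * tau)] for every feature.

    `series` is a list of per-timestep feature tuples. The result is a flat
    list ordered feature by feature (an assumed layout).
    """
    oldest = t - (n_delays - 1) * tau
    if oldest < 0:
        raise ValueError("not enough history for this t, n_delays and tau")
    n_features = len(series[0])
    vector = []
    for f in range(n_features):
        for d in range(n_delays):
            vector.append(series[t - d * tau][f])
    return vector

# Synthetic 4-feature series: feature f at step i has value 1000 * f + i.
series = [(i, 1000 + i, 2000 + i, 3000 + i) for i in range(200)]

h = delay_embed(series, t=100, n_delays=25, tau=1)
print(len(h))          # 4 features x 25 delays
print(h[:3], h[24])    # newest samples first, oldest last for feature 0
```

Changing `tau` stretches the same 25 samples over a longer window; with a 16 ms sampling interval, tau = 1 spans about 0.4 s of history per feature.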
[Figure: Delay Embedding]
[Figure: v1 Architecture]

Loss = α · L_power + β · L_reconstruction
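
A minimal sketch of this weighted objective follows. The slides do not state which error measure each term uses or the values of α and β, so mean squared error and the weights below are assumptions chosen only to make the example runnable.

```python
def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def combined_loss(power_pred, power_true, recon, embedded_input,
                  alpha=1.0, beta=0.1):
    """alpha * L_power + beta * L_reconstruction (MSE for both terms is assumed)."""
    l_power = mse(power_pred, power_true)       # error of the power head, in W^2
    l_recon = mse(recon, embedded_input)        # autoencoder reconstruction error
    return alpha * l_power + beta * l_recon

# Toy values, not project data.
loss = combined_loss(
    power_pred=[30.0, 50.0], power_true=[32.0, 49.0],
    recon=[0.5, 0.5], embedded_input=[0.0, 1.0],
)
print(loss)
```

Because both terms share one encoder, the ratio of α to β decides how much capacity goes to reconstructing inputs rather than predicting power.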

v1: Results — Great Offline, Bad Live

[Figure: v1 stress-ng results]
stress-ng test: R² = 0.97

[Figure: v1 real workload results]
Real workloads: R² = 0.29–0.40, +10W bias

Diagnosis: Model memorized stress-ng patterns. Training data lacked diversity. LSTM may be redundant given delay embedding already encodes temporal info.
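
To see how a constant offset alone can pull R² down, the sketch below scores a prediction that tracks the true shape exactly but sits 10W too high. The numbers are toy values chosen for easy arithmetic, not measurements from the project.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def mae(y_true, y_pred):
    """Mean absolute error in watts."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_bias(y_true, y_pred):
    """Average signed error; positive means over-prediction."""
    return sum(p - t for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [25.0, 30.0, 35.0, 40.0, 45.0]       # narrow power range, as in light workloads
biased = [v + 10.0 for v in y_true]           # perfect shape, +10W offset

print(r2_score(y_true, biased), mae(y_true, biased), mean_bias(y_true, biased))
```

When the true power varies over a narrow range, SS_tot is small, so a fixed offset dominates the score. Reporting mean bias next to R² separates "wrong level" from "wrong shape".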

Better Data: The "Gold Mine" Dataset

  • Combinatorial workload generator: 6 stressors × multiple levels = 2,025 unique combos
  • Go-based collector: 16ms interval, minimal overhead
  • 5.8M samples, power range 24W–102W
  • Includes true idle + real /proc reader workloads
  • Multi-tau delay embedding (τ = 1, 4, 8)
  • MinMaxScaler on train split; global shuffle
[Figure: Idle run]
Pure idle (all stressors = 0)

[Figure: Mixed workload]
Mixed: cpu=20, sys=40, pipe=20, cache=20
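
The "MinMaxScaler on train split" step can be sketched without any library. The helper names and sample rows are hypothetical; the formula (x - min) / (max - min), fitted on training rows only, is the standard min-max rule that scikit-learn's MinMaxScaler applies by default.

```python
def fit_min_max(rows):
    """Return per-column minima and maxima computed from training rows only."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]

def transform(rows, mins, maxs):
    """Scale each column with the training statistics; constant columns map to 0."""
    scaled = []
    for row in rows:
        scaled.append([
            (x - lo) / (hi - lo) if hi > lo else 0.0
            for x, lo, hi in zip(row, mins, maxs)
        ])
    return scaled

# Toy rows: [user%, some second feature]. Not project data.
train = [[0.0, 10.0], [50.0, 20.0], [100.0, 30.0]]
live = [[25.0, 40.0]]

mins, maxs = fit_min_max(train)
print(transform(train, mins, maxs))
print(transform(live, mins, maxs))   # second value falls outside [0, 1]
```

A live value beyond the training range maps outside [0, 1], so the model is asked to extrapolate. That is one reason the dataset needs to cover true idle as well as heavy load.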

v2 & v3: Simplifying the Architecture

v2: AE + Power Head
[Figure: v2 Architecture]
  • Removed LSTM — delay embedding has temporal info
  • 268K params
  • Overfit at epoch 2 without normalization
  • Reconstruction loss competes with prediction
v3: Encoder + Power Head
[Figure: v3 Architecture]
  • Removed decoder entirely — pure predictor
  • Added MinMaxScaler normalization
  • Offline: R² = 0.96, MAE = 2.16W
  • Live: Still fails on idle & real workloads
Approach 4

v3: Best Offline, Still Fails Live

[Figure: v3 Test Predictions]
Offline: R² = 0.96, MAE = 2.16W

[Figure: v3 Live Predictions]
Live: Significant errors on real data

The offline/online gap persists. Near-perfect on shuffled test set, but fails on sequential live data.
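
One factor that may contribute to this gap is overlap between delay windows. This is offered as an assumption, not as a diagnosis made in these slides. With 25 delays at τ = 1, neighbouring windows share most of their samples, so a global shuffle can place near-duplicates on both sides of the split. The sketch measures that overlap and shows a chronological split with a gap as one possible alternative; the function names are hypothetical.

```python
def window_indices(t, n_delays=25, tau=1):
    """Time indices used by the delay window that ends at step t."""
    return {t - d * tau for d in range(n_delays)}

def shared_fraction(t_a, t_b, n_delays=25, tau=1):
    """Fraction of samples that two delay windows have in common."""
    a = window_indices(t_a, n_delays, tau)
    b = window_indices(t_b, n_delays, tau)
    return len(a & b) / n_delays

def chronological_split(n_samples, train_percent=80, gap=24):
    """Train on the first part, test on the rest, skipping `gap` steps in between."""
    cut = n_samples * train_percent // 100
    return range(0, cut), range(cut + gap, n_samples)

print(shared_fraction(100, 101))        # adjacent windows, tau = 1

train_idx, test_idx = chronological_split(1000)
print(shared_fraction(train_idx[-1], test_idx[0]))   # no shared samples across the gap
```

A gap of (n_delays - 1) · τ steps is the smallest that keeps the last training window and the first test window disjoint.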
Approach 5

v4: 1D-CNN Encoder + Power Head

[Figure: v4 Architecture]
  • Replaced MLP with 1D Conv layers
  • Input: (batch, 4, 25) — per-feature temporal processing
  • Kernels (5 steps ≈ 80ms) learn temporal motifs
  • Switched to MAE loss (L1)
  • Tested across τ = 1, 4, 8
  • Offline: R² ≈ 0.95
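
The per-feature temporal processing can be illustrated with a plain-Python 1D convolution. This is a simplified sketch: it filters each feature row independently with one hand-written kernel, whereas a trained Conv1d layer learns many kernels and normally mixes input channels. The kernel values and helper names are made up for the example.

```python
def reshape_features(flat, n_features=4, n_delays=25):
    """Turn a flat 100-D delay vector into n_features rows of n_delays samples."""
    return [flat[f * n_delays:(f + 1) * n_delays] for f in range(n_features)]

def conv1d_valid(signal, kernel):
    """1D cross-correlation without padding; output length is len(signal) - len(kernel) + 1."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

def l1_loss(pred, target):
    """Mean absolute error, the loss v4 switched to."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

# Toy input: every feature is a ramp 0, 1, ..., 24.
flat = [step for _ in range(4) for step in range(25)]
rows = reshape_features(flat)

slope_kernel = [-1, 0, 0, 0, 1]          # 5 steps, about 80 ms at a 16 ms interval
response = conv1d_valid(rows[0], slope_kernel)
print(len(response), response[:3])

print(l1_loss([30.0, 50.0], [32.0, 49.0]))
```

A kernel like this responds to the change across five consecutive samples, which is the kind of short temporal motif the conv layers are meant to pick up.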

v4: Live Predictions Across τ Values

[Figure: v4 live predictions, τ = 1]
[Figure: v4 live predictions, τ = 4]
[Figure: v4 live predictions, τ = 8]

  • Idle baseline improved by ~3W (MAE loss helped)
  • Spike overreaction persists across all τ values

Summary: The Consistent Pattern

Approach | Architecture                | Offline R²  | Live Performance           | Key Change
MS-TCN   | Multi-scale CNN + Attention | 0.96        | Failed on real workloads   | 19 raw features
v1       | AE + LSTM + Power Head      | 0.97        | R² = 0.29–0.40             | Delay embedding, 4 features
v2       | AE + Power Head             | Overfitting |                            | Removed LSTM
v3       | Encoder + Power Head        | 0.96        | Failed on live             | Removed decoder, added norm
v4       | 1D-CNN + Power Head         | 0.95        | Slightly better (+3W idle) | Conv layers, MAE loss

Root Cause: OS-level features fundamentally cannot capture microarchitectural details (instruction mix, cache behavior, pipeline utilization). The features, not the models, are the bottleneck.

Future Directions

  • Alternative VM-visible metrics: guest CPU frequency, page fault rates, I/O throughput, eBPF tracing
  • Post-processing calibration: per-workload-class correction factors (e.g., subtract +3W idle offset)
  • Adversarial/domain-adaptation training: close the offline–online gap with domain-invariant latent representations
  • Multi-machine generalization: train across same-architecture machines for architecture-level power models
  • Hybrid approach: physics-informed constraints on power bounds + learned representations
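
Two of these directions, per-class calibration and bounds on power, can be combined in a small post-processing step. The sketch is hypothetical: the offset table holds only the 3W idle example from the bullet above, the workload class is assumed to be known, and using the dataset's observed 24W–102W range as hard bounds is an assumption rather than a physical limit.

```python
P_MIN_W = 24.0    # lowest power observed in the dataset
P_MAX_W = 102.0   # highest power observed in the dataset

# Hypothetical per-class offsets in watts; only the idle value comes from the slides.
OFFSETS = {"idle": 3.0}

def calibrate(pred_watts, workload_class, offsets=OFFSETS):
    """Subtract a per-class offset, then clamp to the observed power range."""
    corrected = pred_watts - offsets.get(workload_class, 0.0)
    return min(max(corrected, P_MIN_W), P_MAX_W)

print(calibrate(29.0, "idle"))    # offset removed
print(calibrate(25.0, "idle"))    # clamped at the lower bound
print(calibrate(110.0, "cpu"))    # unknown class, clamped at the upper bound
```

This only corrects level errors. It does nothing for the spike overreaction seen in the live plots, which still needs better input signals.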
The model works. The features don't.
The deep learning pipeline is validated. Progress requires richer input signals that expose more of the CPU's hidden microarchitectural state.