Precise memory estimation for modern language models
Model weights: Parameters (in billions) × Bytes per parameter (quantization) × 10⁹ / 1024³
Activations (rough): Batch × Context Length × Hidden Size × Layers × 2 bytes (FP16) / 1024³
KV cache: Batch × Seq Length × Layers × 2 (K and V) × Head Dim × KV Heads × 2 bytes / 1024³
Hidden size (if unknown): ≈ √(Parameters / (6 × Layers)), or Heads × Head Dim
Total VRAM: (Weights + Activations + KV Cache) × 1.20 (20% overhead)
Note: These are estimates. Actual VRAM usage depends on the implementation, runtime optimizations (e.g. paged attention, quantized KV cache), and framework overhead.
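The formulas above can be sketched as a small Python function. The function and parameter names are my own; the example values approximate a Llama-3-8B-style configuration (32 layers, hidden size 4096, 8 KV heads, head dim 128, FP16) and are illustrative, not authoritative.

```python
GIB = 1024 ** 3  # bytes per GiB

def estimate_vram_gib(params_b, bytes_per_param, batch, seq_len,
                      layers, hidden, kv_heads, head_dim,
                      overhead=1.20):
    """Rough VRAM estimate in GiB, per the formulas above."""
    # Model weights: billions of params × bytes each
    weights = params_b * 1e9 * bytes_per_param / GIB
    # Activations: batch × context × hidden × layers × 2 bytes (FP16)
    activations = batch * seq_len * hidden * layers * 2 / GIB
    # KV cache: 2 tensors (K and V) of head_dim × kv_heads per layer, 2 bytes each
    kv_cache = batch * seq_len * layers * 2 * head_dim * kv_heads * 2 / GIB
    # Apply the 20% overhead factor to the sum
    return (weights + activations + kv_cache) * overhead

# Llama-3-8B-like config, FP16, batch 1, 8K context (assumed values):
total = estimate_vram_gib(params_b=8, bytes_per_param=2, batch=1,
                          seq_len=8192, layers=32, hidden=4096,
                          kv_heads=8, head_dim=128)
print(f"{total:.1f} GiB")  # ≈ 21.5 GiB
```

Swapping `bytes_per_param=2` for `0.5` (4-bit quantization) shows how quantizing the weights alone shrinks the first term while leaving the activation and KV-cache terms untouched.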