415.tech
AI & tech, from the frontlines of Silicon Valley
HuggingFace shows why torch.compile gains nothing on a single nn.Linear — and where fusion starts to matter


HuggingFace's second PyTorch profiling post walks from nn.Linear to a fused MLP kernel. It shows that the bias-add is already folded into the cuBLAS GEMM epilogue, so torch.compile has nothing left to fuse on a single layer. Scripts and annotated traces are included, so a developer can verify whether a layer is memory-bandwidth-bound before moving to custom kernel work.

Source: huggingface.co


A common reflex is to reach for torch.compile whenever a model feels slow. For a single GEMM-with-bias, compile has very little to do.

HuggingFace

Why this matters

  • cuBLAS already fuses the bias-add into the GEMM epilogue, so torch.compile adds nothing for a single layer
  • Compilation removes CPU dispatch overhead by precomputing strides at trace time
  • Knowing what a kernel needs before fusion pays off helps developers avoid premature optimization
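Before reaching for traces, a back-of-the-envelope roofline estimate can hint at whether a single linear layer is likely to be memory-bandwidth-bound. The sketch below is not taken from the HuggingFace post. The helper names, the simplified byte-count model (each tensor read or written exactly once, no cache reuse), and the hardware figures are all assumptions chosen for illustration.

```python
def linear_arithmetic_intensity(m, k, n, bytes_per_elem=2):
    """Approximate FLOPs per byte for y = x @ W.T + b.

    Shapes: x is (m, k), W is (n, k), b is (n,), y is (m, n).
    Assumes each tensor crosses the memory bus exactly once.
    """
    flops = 2 * m * k * n + m * n  # multiply-adds plus the bias add
    elems_moved = m * k + k * n + n + m * n  # read x, W, b; write y
    return flops / (bytes_per_elem * elems_moved)


def is_memory_bound(m, k, n, peak_flops, peak_bandwidth, bytes_per_elem=2):
    """True when the layer's intensity falls below the machine balance."""
    machine_balance = peak_flops / peak_bandwidth  # FLOPs per byte
    return linear_arithmetic_intensity(m, k, n, bytes_per_elem) < machine_balance


# Hypothetical accelerator: 300 TFLOP/s of compute, 2 TB/s of memory bandwidth.
PEAK_FLOPS = 300e12
PEAK_BANDWIDTH = 2e12

small_batch_bound = is_memory_bound(1, 4096, 4096, PEAK_FLOPS, PEAK_BANDWIDTH)
large_batch_bound = is_memory_bound(8192, 4096, 4096, PEAK_FLOPS, PEAK_BANDWIDTH)
```

Under these assumed numbers, the batch-of-one layer lands well below the machine balance and the large-batch layer well above it. An estimate like this only suggests where to look; a profiler trace remains the way to confirm what a given GPU actually does.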