
HuggingFace shows why torch.compile gains nothing on a single nn.Linear — and where fusion starts to matter
HuggingFace's second PyTorch profiling post walks from nn.Linear to a fused MLP kernel, showing the bias-add is already folded into the cuBLAS GEMM epilogue — torch.compile has nothing left to fuse on a single layer. Scripts and annotated traces are included, so a developer can verify whether a layer is memory-bandwidth-bound before moving to custom kernel work.
Source: huggingface.co
A common reflex is to reach for torch.compile whenever a model feels slow. For a single GEMM-with-bias, compile has very little to do.
HuggingFace
Why this matters
- cuBLAS already fuses the bias add into the GEMM epilogue, so torch.compile has no extra fusion to offer on a single layer
- What compilation can still remove is CPU dispatch overhead, by precomputing strides at trace time
- Knowing where fusion starts to matter, such as a multi-layer MLP rather than a lone GEMM, helps developers avoid premature optimization
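A rough way to judge whether a layer is likely memory-bandwidth-bound, before reaching for a profiler, is a back-of-envelope roofline estimate. The sketch below is not taken from the HuggingFace post. The names `LinearShape`, `linear_flops`, `linear_bytes`, `arithmetic_intensity`, and `is_memory_bound` are hypothetical helpers, and the peak throughput and bandwidth figures are made-up placeholders rather than the specification of any real GPU. It also assumes ideal caching, with each tensor crossing the memory bus exactly once.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinearShape:
    batch: int               # rows in the input matrix (M)
    in_features: int         # K
    out_features: int        # N
    bytes_per_elem: int = 2  # assumption: fp16/bf16 storage


def linear_flops(s: LinearShape) -> int:
    # Two FLOPs (multiply + add) per (m, k, n) triple,
    # plus one add per output element for the bias.
    return 2 * s.batch * s.in_features * s.out_features + s.batch * s.out_features


def linear_bytes(s: LinearShape) -> int:
    # Idealized traffic: input, weight, and bias are each read once,
    # and the output is written once.
    elems = (
        s.batch * s.in_features
        + s.in_features * s.out_features
        + s.out_features
        + s.batch * s.out_features
    )
    return elems * s.bytes_per_elem


def arithmetic_intensity(s: LinearShape) -> float:
    # FLOPs performed per byte moved.
    return linear_flops(s) / linear_bytes(s)


def is_memory_bound(s: LinearShape, peak_flops: float, peak_bytes_per_s: float) -> bool:
    # The ridge point is the intensity at which compute and bandwidth
    # limits meet; layers below it are limited by memory traffic.
    ridge = peak_flops / peak_bytes_per_s
    return arithmetic_intensity(s) < ridge


# Hypothetical accelerator figures, chosen only for illustration.
PEAK_FLOPS = 300e12        # FLOP/s
PEAK_BYTES_PER_S = 2e12    # bytes/s

for batch in (1, 4096):
    shape = LinearShape(batch=batch, in_features=4096, out_features=4096)
    print(
        f"batch={batch}: intensity={arithmetic_intensity(shape):.2f} FLOP/byte, "
        f"memory-bound={is_memory_bound(shape, PEAK_FLOPS, PEAK_BYTES_PER_S)}"
    )
```

Under these assumed figures, the batch-1 layer falls far below the ridge point and the batch-4096 layer sits well above it. The estimate only indicates where to look. A profiler trace like the ones shipped with the post remains the actual check.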
Fusion myths debunked