HuggingFace shows why torch.compile gains nothing on a single nn.Linear — and where fusion starts to matter

HuggingFace's second PyTorch profiling post walks from a single nn.Linear to a fused MLP kernel. It shows that the bias-add is already folded into the cuBLAS GEMM epilogue, so torch.compile has nothing left to fuse on a single layer. Scripts and annotated traces are included, so developers can check whether a layer is memory-bandwidth-bound before moving to custom kernel work.

Source: huggingface.co

A common reflex is to reach for torch.compile whenever a model feels slow. For a single GEMM-with-bias, compile has very little to do.

HuggingFace

Why this matters

→ cuBLAS already fuses the bias-add into the GEMM epilogue, so torch.compile has nothing left to fuse on a single layer
→ Compilation still cuts CPU dispatch overhead by precomputing strides at trace time
→ Knowing what fusion actually requires helps developers avoid premature optimization
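
For readers who want a quick check before opening a profiler, here is a rough back-of-the-envelope sketch of the "memory-bandwidth-bound or not" question. It is not taken from the HuggingFace post, which works from real traces. It is a standard roofline-style estimate that assumes each tensor crosses memory exactly once (the fused-epilogue ideal). The LinearShape class, the function names, and the hardware figures are all hypothetical placeholders; substitute the numbers from your own GPU's datasheet.

```python
from dataclasses import dataclass


@dataclass
class LinearShape:
    """Hypothetical helper describing one y = x @ W.T + b layer."""
    batch: int               # rows of the input matrix (M)
    in_features: int         # K
    out_features: int        # N
    bytes_per_elem: int = 2  # assumes fp16/bf16 storage


def linear_flops(s: LinearShape) -> int:
    # 2*M*K*N for the GEMM (one multiply and one add per term),
    # plus M*N for the bias add.
    return 2 * s.batch * s.in_features * s.out_features + s.batch * s.out_features


def linear_bytes(s: LinearShape) -> int:
    # Idealised traffic: read input, weight, and bias once; write output once.
    elems = (
        s.batch * s.in_features
        + s.in_features * s.out_features
        + s.out_features
        + s.batch * s.out_features
    )
    return elems * s.bytes_per_elem


def arithmetic_intensity(s: LinearShape) -> float:
    # FLOPs performed per byte moved.
    return linear_flops(s) / linear_bytes(s)


def is_memory_bound(s: LinearShape, peak_flops: float, peak_bandwidth: float) -> bool:
    # A layer is bandwidth-bound when its intensity falls below the
    # machine balance (peak FLOP/s divided by peak bytes/s).
    return arithmetic_intensity(s) < peak_flops / peak_bandwidth


# Placeholder hardware figures, NOT a real datasheet: 300 TFLOP/s, 2 TB/s.
PEAK_FLOPS = 300e12
PEAK_BANDWIDTH = 2e12

small_batch = LinearShape(batch=1, in_features=4096, out_features=4096)
large_batch = LinearShape(batch=8192, in_features=4096, out_features=4096)

for shape in (small_batch, large_batch):
    bound = "memory" if is_memory_bound(shape, PEAK_FLOPS, PEAK_BANDWIDTH) else "compute"
    print(f"batch={shape.batch}: {arithmetic_intensity(shape):.1f} FLOPs/byte, {bound}-bound")
```

Treat the result as a first guess only. Real kernels rarely hit the ideal byte count, so a profiler trace like the ones in the post remains the way to confirm it.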

Fusion myths debunked

Also in this edition

Apollo leads $35B private financing for Broadcom's AI compute platform, targeting 20GW by 2028
Editorial polish — DiffusionGemma
Google ships Gemini 3.5 Live Translate — continuous speech-to-speech across 2,000+ language pairs