Use standard gradient checkpointing for small sequence lengths

danielhanchen · danielhanchen · commit fe9b2fe48c0a · 2026-01-08T11:56:42.000Z
When max_seq_length < 512, the overhead of gradient offloading in
use_gradient_checkpointing="unsloth" mode (gc="unsloth" below) outweighs
its benefit. Throughput benchmarks on a B200 show:

| seq_len | gc=unsloth | gc=True    | Difference (gc=True vs. gc=unsloth) |
|---------|------------|------------|-------------------------------------|
| 256     | 6,803 t/s  | 6,993 t/s  | +2.8%                               |
| 384     | 9,889 t/s  | 9,963 t/s  | +0.7%                               |
| 512     | 13,151 t/s | 13,092 t/s | -0.4%                               |
| 1024    | 26,662 t/s | 25,094 t/s | -5.9%                               |
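
The Difference column appears to be the throughput of gc=True relative to
gc="unsloth". A minimal sketch that recomputes it from the figures in the
table (the helper name `relative_difference` is hypothetical and not part
of Unsloth):

```python
# Throughput in tokens/s, copied from the benchmark table:
# seq_len -> (gc="unsloth", gc=True)
benchmarks = {
    256: (6803, 6993),
    384: (9889, 9963),
    512: (13151, 13092),
    1024: (26662, 25094),
}


def relative_difference(unsloth_tps, standard_tps):
    """Percent change of gc=True throughput relative to gc="unsloth"."""
    return (standard_tps - unsloth_tps) / unsloth_tps * 100.0


for seq_len, (unsloth_tps, standard_tps) in benchmarks.items():
    diff = relative_difference(unsloth_tps, standard_tps)
    print(f"{seq_len:>5}: {diff:+.1f}%")
```

A positive value means standard checkpointing was faster at that sequence
length; a negative value means the offloading implementation was faster.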

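The crossover point is around seq_len 384-512. When
use_gradient_checkpointing="unsloth" is requested and the model's
max_seq_length is below 512, get_peft_model now automatically falls back
to standard gradient checkpointing (use_gradient_checkpointing=True)
instead of the custom offloading implementation.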
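
The selection rule can be sketched in isolation as follows. This is an
illustration only: `resolve_gradient_checkpointing` and
`SMALL_SEQ_THRESHOLD` are hypothetical names, and passing `None` stands in
for a model that has no `max_seq_length` attribute (the actual change uses
`hasattr` on the model inside `get_peft_model`).

```python
# Hypothetical constant; the commit hard-codes 512 in get_peft_model.
SMALL_SEQ_THRESHOLD = 512


def resolve_gradient_checkpointing(mode, max_seq_length=None):
    """Return the gradient checkpointing mode to use.

    mode: the requested setting (True, False, or "unsloth").
    max_seq_length: the model's maximum sequence length, or None if unknown.
    """
    if (
        mode == "unsloth"
        and max_seq_length is not None
        and max_seq_length < SMALL_SEQ_THRESHOLD
    ):
        # Short sequences: offloading overhead dominates, so fall back
        # to standard gradient checkpointing.
        return True
    # Otherwise keep whatever the caller asked for.
    return mode


print(resolve_gradient_checkpointing("unsloth", 256))
print(resolve_gradient_checkpointing("unsloth", 1024))
```

Note that a length of exactly 512 keeps the "unsloth" mode, which matches
the benchmark row where offloading is already slightly faster.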
diff --git a/unsloth/models/llama.py b/unsloth/models/llama.py
@@ -2641,9 +2641,15 @@ def get_peft_model(
         transformers_set_seed(random_state)
 
         if use_gradient_checkpointing == "unsloth":
-            patch_unsloth_smart_gradient_checkpointing(
-                dtype = model.get_input_embeddings().weight.dtype
-            )
+            # Gradient offloading overhead is not worth it for small sequences.
+            # Benchmarks show crossover point is around seq_len 384-512.
+            # For seq < 512, standard gradient checkpointing is faster.
+            if hasattr(model, "max_seq_length") and model.max_seq_length < 512:
+                use_gradient_checkpointing = True
+            else:
+                patch_unsloth_smart_gradient_checkpointing(
+                    dtype = model.get_input_embeddings().weight.dtype
+                )
 
         if type(r) is not int:
             raise TypeError(f"Unsloth: Rank of {str(r)} must be an integer.")