Clarify explanation of requires_grad in PyTorch #3717
Fixes #3716
Description
It was initially challenging for me to grasp why `requires_grad` was set after the `weights` initialization, but on the same line as `bias`. The existing explanation ("we don't want that step included in the gradient") is technically correct but omits the practical consequence: leaf-node status.
If `requires_grad=True` is set before the initialization math (the division by `sqrt(n)`), the `weights` tensor records that operation and becomes a calculated output (a non-leaf node) rather than a source parameter, which makes it impossible for optimizers to update it. A minimal sketch of the difference follows below.
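As a rough illustration (this sketch is my own, not text from the tutorial; the 784 → 10 shapes assume the tutorial's MNIST linear layer, and the variable names are hypothetical):

```python
import math
import torch

# Variant A: requires_grad=True is set *before* the scaling math.
# Autograd records the division, so the result is a computed (non-leaf)
# tensor that an optimizer cannot update directly.
w_nonleaf = torch.randn(784, 10, requires_grad=True) / math.sqrt(784)
print(w_nonleaf.is_leaf)  # False

# Variant B: the math happens first, then requires_grad is flipped on
# in place. The tensor stays a leaf node, i.e. a trainable source parameter.
w_leaf = torch.randn(784, 10) / math.sqrt(784)
w_leaf.requires_grad_()
print(w_leaf.is_leaf)  # True
```

Passing `w_nonleaf` to an optimizer such as `torch.optim.SGD` raises an error, since optimizers refuse non-leaf tensors; that is the practical failure the clarified wording is meant to prevent.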
This PR clarifies that we set `requires_grad` after the math to ensure the tensor remains a trainable leaf node.

Checklist
P.S. Leaving the 3rd point unchecked since there is no label on the issue as of yet.
cc @svekars @sekyondaMeta @AlannaBurke @albanD @jbschlosser