PyTorch FSDP vs DDP
I observed that PyTorch's FSDP with NO_SHARD is substantially faster and slightly more memory-efficient in a multi-node setting than DDP. Apparently this comes down to differences in how the two wrappers bucket gradients for the all-reduce.
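For reference, a minimal sketch of the two setups (not from the original note; the linear model, process-group setup, and launch via <code>torchrun</code> are assumptions made for illustration):

<syntaxhighlight lang="python">
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

# Assumes launch via torchrun so the usual rank/world-size env vars are set.
dist.init_process_group("nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

# Placeholder model; stands in for whatever is actually being trained.
model = torch.nn.Linear(4096, 4096).cuda()

# In practice you pick one of the two wrappers:

# Option 1: plain DDP (gradients are bucketed and all-reduced per bucket).
ddp_model = DDP(model, device_ids=[local_rank])

# Option 2: FSDP with NO_SHARD -- parameters stay fully replicated like DDP,
# but communication goes through FSDP's flat-parameter machinery.
fsdp_model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.NO_SHARD,
    device_id=local_rank,
)
</syntaxhighlight>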
Another important note regarding gradient accumulation: you need to wrap all but the last micro-batch in <code>no_sync</code> (which is available on both FSDP and DDP), so that the gradient all-reduce only happens once per optimizer step; a sketch of that pattern follows below.
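A minimal sketch of that accumulation loop, assuming <code>model</code> is wrapped in either DDP or FSDP as above; <code>optimizer</code>, <code>micro_batches</code>, and <code>compute_loss</code> are placeholders, not names from the original note:

<syntaxhighlight lang="python">
import contextlib

accum_steps = 4
optimizer.zero_grad()
for i, batch in enumerate(micro_batches):
    is_last = (i + 1) % accum_steps == 0
    # Skip the gradient all-reduce on every micro-batch except the last one;
    # no_sync() works the same way for both DDP and FSDP wrappers.
    sync_ctx = contextlib.nullcontext() if is_last else model.no_sync()
    with sync_ctx:
        loss = compute_loss(model, batch) / accum_steps
        loss.backward()
    if is_last:
        # Gradients were synchronized during this last backward pass.
        optimizer.step()
        optimizer.zero_grad()
</syntaxhighlight>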