PyTorch FSDP vs DDP
I observed that PyTorch's FSDP with NO_SHARD is substantially faster and slightly more memory-efficient in a multi-node setting than DDP. Apparently this comes down to differences in how the two wrappers bucket gradients for the all-reduce.
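For reference, a minimal sketch of the two setups (not from the original note; the linear model, process-group setup, and launch via <code>torchrun</code> are assumptions made for illustration):

<syntaxhighlight lang="python">
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

# Assumes launch via torchrun so the usual rank/world-size env vars are set.
dist.init_process_group("nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

# Placeholder model; stands in for whatever is actually being trained.
model = torch.nn.Linear(4096, 4096).cuda()

# In practice you pick one of the two wrappers:

# Option 1: plain DDP (gradients are bucketed and all-reduced per bucket).
ddp_model = DDP(model, device_ids=[local_rank])

# Option 2: FSDP with NO_SHARD -- parameters stay fully replicated like DDP,
# but communication goes through FSDP's flat-parameter machinery.
fsdp_model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.NO_SHARD,
    device_id=local_rank,
)
</syntaxhighlight>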
Another important note regarding gradient accumulation: you need to wrap all but the last micro-batch in <code>no_sync</code> (which is available on both FSDP and DDP), so that the gradient all-reduce only happens once per optimizer step; a sketch of that pattern follows below.
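A minimal sketch of that accumulation loop, assuming <code>model</code> is wrapped in either DDP or FSDP as above; <code>optimizer</code>, <code>micro_batches</code>, and <code>compute_loss</code> are placeholders, not names from the original note:

<syntaxhighlight lang="python">
import contextlib

accum_steps = 4
optimizer.zero_grad()
for i, batch in enumerate(micro_batches):
    is_last = (i + 1) % accum_steps == 0
    # Skip the gradient all-reduce on every micro-batch except the last one;
    # no_sync() works the same way for both DDP and FSDP wrappers.
    sync_ctx = contextlib.nullcontext() if is_last else model.no_sync()
    with sync_ctx:
        loss = compute_loss(model, batch) / accum_steps
        loss.backward()
    if is_last:
        # Gradients were synchronized during this last backward pass.
        optimizer.step()
        optimizer.zero_grad()
</syntaxhighlight>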