PyTorch FSDP vs DDP

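In a multi-node setting, I observed that PyTorch's FSDP with the NO_SHARD sharding strategy is substantially faster and slightly more memory-efficient than DDP, even though both keep a full replica of the model on every rank. Apparently this comes down to differences in how the two bucket gradients for communication.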
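For reference, here is a minimal sketch of how the two setups being compared might be constructed. The helper name wrap_replicated, its backend strings, and the local_rank argument are hypothetical and only for illustration. The sketch assumes a process group has already been initialized (for example via torchrun) and that each rank owns one CUDA device.

```python
import torch
from torch import nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy
from torch.nn.parallel import DistributedDataParallel as DDP


def wrap_replicated(model: nn.Module, backend: str, local_rank: int) -> nn.Module:
    """Wrap `model` for data-parallel training with a full replica per rank.

    Hypothetical helper: `backend` is either "ddp" or "fsdp_no_shard".
    Assumes torch.distributed.init_process_group() has already been called.
    """
    if backend not in ("ddp", "fsdp_no_shard"):
        raise ValueError(f"unknown backend: {backend!r}")

    device = torch.device("cuda", local_rank)
    model = model.to(device)

    if backend == "ddp":
        # DDP all-reduces gradients in buckets (size set by bucket_cap_mb).
        return DDP(model, device_ids=[local_rank])

    # NO_SHARD keeps parameters, gradients, and optimizer state unsharded,
    # so each rank holds a full replica, as with DDP.
    return FSDP(
        model,
        sharding_strategy=ShardingStrategy.NO_SHARD,
        device_id=local_rank,
    )
```

Swapping only the backend argument while keeping the model, data, and launch configuration fixed is one way to compare the two fairly. Note that recent PyTorch releases may emit a deprecation warning for NO_SHARD.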

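Another important note, regarding gradient accumulation: run every micro-batch except the last one of each accumulation window inside the no_sync() context manager, which both FSDP and DDP provide. Otherwise gradients are synchronized across ranks on every backward pass instead of once per optimizer step.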
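The accumulation pattern can be sketched roughly as follows. The function name accumulation_step and its arguments are hypothetical. It assumes model is already wrapped in DDP or FSDP and that the micro-batches are of equal size, so dividing each loss by their count reproduces the full-batch mean. For a plain nn.Module without no_sync() it falls back to a no-op context, which only serves to make the sketch runnable on a single process.

```python
import contextlib

import torch
from torch import nn


def accumulation_step(model, optimizer, micro_batches, loss_fn):
    """Run one optimizer step accumulated over several micro-batches.

    `micro_batches` is a list of (inputs, targets) pairs. Gradient
    synchronization is skipped for all but the last micro-batch.
    """
    optimizer.zero_grad(set_to_none=True)
    num_micro = len(micro_batches)

    for i, (inputs, targets) in enumerate(micro_batches):
        is_last = i == num_micro - 1
        if is_last or not hasattr(model, "no_sync"):
            # Last micro-batch: let the wrapper synchronize gradients.
            ctx = contextlib.nullcontext()
        else:
            # Earlier micro-batches: accumulate locally, no communication.
            ctx = model.no_sync()

        # The forward pass goes inside the context as well as backward.
        with ctx:
            loss = loss_fn(model(inputs), targets) / num_micro
            loss.backward()

    optimizer.step()
```

Calling this once per accumulation window with, say, four micro-batches should trigger a single gradient synchronization per optimizer step rather than four.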