Distributed PyTorch
Using ProcessGroup for PyTorch Distributed Training
Any time you do distributed training on GPUs with PyTorch, the PyTorch backend requires some configuration. Doing this with ProcessGroup is straightforward, albeit unsightly, as is usually the case for distributed training. Future work will provide helper classes that handle most standard
configurations. In the meantime, given some PyTorch function designed for distributed training, training_fn,
the following code snippet shows how to use ProcessGroup with a CUDA (NCCL) backend.
from dragon.native.machine import System
from dragon.ai.collective_group import CollectiveGroup, RankInfo

import torch
import torch.distributed as dist


def train():
    # Rendezvous information that CollectiveGroup makes available to each worker
    rank_info = RankInfo()
    rank = rank_info.my_rank
    master_addr = rank_info.master_addr
    master_port = rank_info.master_port
    world_size = rank_info.world_size

    dist.init_process_group(
        backend="nccl",
        init_method=f"tcp://{master_addr}:{master_port}",
        world_size=world_size,
        rank=rank,
    )

    device = torch.device("cuda")  # the provided Policy already sets which GPU id to use
    tensor = torch.ones(1, device=device) * rank
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)

    print(f"Rank {rank}: Tensor after all_reduce = {tensor.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":

    # One Policy per GPU in the allocation; each Policy sets which GPU its worker uses
    gpu_policies = System().gpu_policies()
    pg = CollectiveGroup(
        training_fn=train,
        training_args=None,
        training_kwargs=None,
        policies=gpu_policies,
        hide_stderr=False,
        port=29500,
    )
    pg.init()
    pg.start()
    pg.join()
    pg.close()
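The snippet above only verifies the rendezvous with an all_reduce. In practice, training_fn will wrap a model in PyTorch's DistributedDataParallel on top of the same setup. The sketch below is illustrative only: the model, optimizer, loss, and random data are placeholders and not part of the Dragon API, and it assumes, as above, that the provided Policy has already restricted each worker to a single GPU so torch.device("cuda") resolves to the right device.

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

from dragon.ai.collective_group import RankInfo


def train():
    # Same rendezvous as in the snippet above
    rank_info = RankInfo()
    dist.init_process_group(
        backend="nccl",
        init_method=f"tcp://{rank_info.master_addr}:{rank_info.master_port}",
        world_size=rank_info.world_size,
        rank=rank_info.my_rank,
    )

    device = torch.device("cuda")                     # Policy pins this worker to one GPU
    model = torch.nn.Linear(16, 1).to(device)         # placeholder model
    ddp_model = DDP(model)                            # gradients are all-reduced across ranks
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    for _ in range(10):                               # placeholder training loop
        inputs = torch.randn(32, 16, device=device)   # placeholder batch
        targets = torch.randn(32, 1, device=device)
        loss = torch.nn.functional.mse_loss(ddp_model(inputs), targets)
        optimizer.zero_grad()
        loss.backward()                               # DDP synchronizes gradients here
        optimizer.step()

    dist.destroy_process_group()

The driver portion (System().gpu_policies(), CollectiveGroup(...), and the init/start/join/close calls) is unchanged. As with any Dragon program, the script is typically launched with the dragon launcher, for example dragon my_training_script.py, rather than with python directly.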
Loading Training Data with PyTorch
Coming soon…
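In the meantime, standard PyTorch data loading works unchanged inside training_fn. A minimal sketch, assuming rank and world_size come from RankInfo as in the snippets above; make_loader, the in-memory dataset, and the batch size are illustrative placeholders, not part of the Dragon API.

import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def make_loader(rank, world_size):
    # Placeholder in-memory dataset; any torch Dataset works the same way
    dataset = TensorDataset(torch.randn(1024, 16), torch.randn(1024, 1))
    # Each rank draws from a disjoint shard of the dataset
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    return DataLoader(dataset, batch_size=32, sampler=sampler), sampler

Inside the epoch loop, call sampler.set_epoch(epoch) before iterating the loader so that shuffling stays consistent across ranks.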