Debugging torch.distributed
I recently added distributed inference and tensor-parallel optimizations to a project, which meant working extensively with torch.distributed operations such as send, recv, and broadcast. A few details worth writing down:
NCCL is the usual torch.distributed backend, so the first thing to set is the NCCL_DEBUG environment variable, typically to INFO or WARN. NCCL_DEBUG_SUBSYS can then narrow the output to specific subsystems:

- COLL: collectives
- P2P: peer-to-peer
- SHM: shared memory
- NET: network
- GRAPH: topology detection and graph search
- TUNING: algorithm/protocol tuning
- ENV: environment settings
- ALLOC: memory allocations
- ALL: every subsystem
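These variables are read by NCCL at initialization, so they must be set before the first NCCL call. One way to do that from Python (the specific subsystem combination below is just an example):

```python
import os

# Enable NCCL debug logging. These must be set before the first NCCL
# call, e.g. before torch.distributed.init_process_group("nccl").
os.environ["NCCL_DEBUG"] = "INFO"  # or "WARN" for warnings/errors only

# Comma-separated subsystem filter; here collectives, peer-to-peer,
# and network are chosen as an illustrative combination.
os.environ["NCCL_DEBUG_SUBSYS"] = "COLL,P2P,NET"
```

Exporting the same variables in the shell (or the launcher's environment) works equally well, and is often more convenient with torchrun since every rank inherits them.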