speechbrain.utils.distributed module

Guards for running certain operations only on the main process.

Authors

Abdel Heba 2020
Aku Rouhe 2020
Peter Plantinga 2023
Adel Moumen 2024

Summary

Classes

MainProcessContext

Context manager to ensure code runs only on the main process.

Functions

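`ddp_all_reduce`	In DDP mode, performs an all_reduce operation with the specified torch operator.
`ddp_barrier`	Synchronizes all processes in distributed data parallel (DDP) mode.
`ddp_broadcast`	In DDP mode, broadcasts an object to all processes.
`ddp_init_group`	Initializes the DDP group if the distributed_launch bool is given on the Python command line.
`get_rank`	Gets the rank of the current process.
`if_main_process`	Returns whether the current process is the main process.
`is_distributed_initialized`	Returns whether the current system is distributed.
`main_process_only`	Function decorator to ensure the function runs only on the main process.
`rank_prefixed_message`	Prefixes a message with the rank of the process.
`run_on_main`	Runs a function with DDP (multi-GPU) support.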

Reference

speechbrain.utils.distributed.rank_prefixed_message(message: str) → str

Prefixes a message with the rank of the process.

Parameters: message (str) – The message to prefix.
Returns: The message prefixed with the rank, if known.
Return type: str
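
The behaviour described above can be sketched as follows. This is a minimal illustration, not the library's code: `prefix_with_rank` is a hypothetical helper that takes the rank as an explicit argument, and the `[rank: N]` prefix format is an assumption.

```python
from typing import Optional


def prefix_with_rank(message: str, rank: Optional[int]) -> str:
    """Hypothetical helper: prepend the rank to a message when it is known."""
    if rank is None:
        # Rank unknown (e.g. a plain single-process run): leave the message as is.
        return message
    # The exact prefix format is an assumption made for this sketch.
    return f"[rank: {rank}] {message}"


print(prefix_with_rank("loading data", 1))
print(prefix_with_rank("loading data", None))
```

Prefixing log lines this way makes it easier to tell which process emitted a message when several processes write to the same console.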

speechbrain.utils.distributed.get_rank() → int | None

Gets the rank of the current process.

This code is taken from the PyTorch Lightning library: https://github.com/Lightning-AI/pytorch-lightning/blob/bc3c9c536dc88bfa9a46f63fbce22b382a86a9cb/src/lightning/fabric/utilities/rank_zero.py#L39-L48

Returns: The rank of the current process, or None if the rank could not be determined.
Return type: int or None
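
One common way to determine a rank is to read environment variables set by the launcher. The sketch below illustrates that approach; `rank_from_env` and the list of variable names are assumptions made for illustration, and the referenced source should be consulted for the actual lookup.

```python
import os
from typing import Mapping, Optional

# Environment variables commonly set by launchers (assumed list, checked in order).
RANK_KEYS = ("RANK", "LOCAL_RANK", "SLURM_PROCID", "JSM_NAMESPACE_RANK")


def rank_from_env(env: Optional[Mapping[str, str]] = None) -> Optional[int]:
    """Hypothetical helper: return the first rank found in `env`, else None."""
    env = os.environ if env is None else env
    for key in RANK_KEYS:
        value = env.get(key)
        if value is not None:
            return int(value)
    # None signals that no rank variable was set at all.
    return None


print(rank_from_env({"RANK": "3"}))
print(rank_from_env({}))
```

Returning None rather than 0 lets callers distinguish "rank 0" from "no launcher information available".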

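speechbrain.utils.distributed.run_on_main(func, args=None, kwargs=None, post_func=None, post_args=None, post_kwargs=None, run_post_on_main=False)

Runs a function with DDP (multi-GPU) support.

The main function runs only on the main process. A post_func can be specified to run on the non-main processes after the main function completes, so that whatever the main function produces can then be loaded on the other processes.

Parameters:

func (callable) – Function to run on the main process.
args (list, None) – Positional arguments to pass to func.
kwargs (dict, None) – Keyword arguments to pass to func.
post_func (callable, None) – Function to run after func has finished on the main process. By default it runs only on non-main processes.
post_args (list, None) – Positional arguments to pass to post_func.
post_kwargs (dict, None) – Keyword arguments to pass to post_func.
run_post_on_main (bool) – Whether to also run post_func on the main process. (default: False)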
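
The control flow described by these parameters can be sketched in plain Python. This is a simplified model, not the library's implementation: `run_on_main_sketch` is a hypothetical name, and the `is_main` flag and `barrier` callable are stand-ins for the process check and the DDP synchronization point.

```python
from typing import Any, Callable, Dict, List, Optional


def run_on_main_sketch(
    func: Callable[..., Any],
    args: Optional[List[Any]] = None,
    kwargs: Optional[Dict[str, Any]] = None,
    post_func: Optional[Callable[..., Any]] = None,
    post_args: Optional[List[Any]] = None,
    post_kwargs: Optional[Dict[str, Any]] = None,
    run_post_on_main: bool = False,
    *,
    is_main: bool,  # assumption: the caller tells us whether this is the main process
    barrier: Callable[[], None] = lambda: None,  # stand-in for a DDP barrier
) -> None:
    args = [] if args is None else args
    kwargs = {} if kwargs is None else kwargs
    post_args = [] if post_args is None else post_args
    post_kwargs = {} if post_kwargs is None else post_kwargs

    # Only the main process runs func; the others skip it.
    if is_main:
        func(*args, **kwargs)
    # Every process waits here, so func's output exists before anyone continues.
    barrier()

    # post_func runs on non-main processes, and on main only if requested.
    if post_func is not None and (run_post_on_main or not is_main):
        post_func(*post_args, **post_kwargs)


# Simulate one main process and one worker process.
main_log: List[str] = []
worker_log: List[str] = []
run_on_main_sketch(main_log.append, args=["prepare"],
                   post_func=main_log.append, post_args=["load"], is_main=True)
run_on_main_sketch(worker_log.append, args=["prepare"],
                   post_func=worker_log.append, post_args=["load"], is_main=False)
print(main_log, worker_log)
```

A typical use of this pattern is data preparation: the main process writes the prepared files once, and the other processes only load them afterwards.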

speechbrain.utils.distributed.is_distributed_initialized() → bool: Returns whether the current system is distributed.

speechbrain.utils.distributed.if_main_process() → bool: Returns whether the current process is the main process.

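class speechbrain.utils.distributed.MainProcessContext

Bases: object

Context manager to ensure code runs only on the main process. It guarantees that the MAIN_PROC_ONLY global variable is decremented even if an exception is raised inside the main_proc_wrapped_func function.

__enter__(): Enters the context. Increments the counter.

__exit__(exc_type, exc_value, traceback): Exits the context. Decrements the counter.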
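
The reason a context manager helps here is that `__exit__` runs even when the body raises. The sketch below demonstrates that property with a hypothetical `CounterContext` class whose `depth` attribute stands in for a global counter; it is an illustration of the pattern, not the library's class.

```python
class CounterContext:
    """Hypothetical counter-based context manager (illustration only)."""

    depth = 0  # stands in for a module-level counter

    def __enter__(self):
        CounterContext.depth += 1
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Runs on normal exit and on exceptions alike.
        CounterContext.depth -= 1
        return False  # do not swallow the exception


try:
    with CounterContext():
        raise ValueError("failure inside the guarded block")
except ValueError:
    pass

# The counter is back to its initial value despite the exception.
print(CounterContext.depth)
```

Using a counter rather than a boolean flag also keeps nested uses consistent, since each exit undoes exactly one enter.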

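speechbrain.utils.distributed.main_process_only(function): Function decorator to ensure the function runs only on the main process. This is useful for operations that only need to happen on a single process, such as saving to the filesystem or logging to a web address.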
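
A decorator of this kind can be sketched as below. The names `main_only`, `is_main`, and `save` are hypothetical, and treating a process as main when the RANK environment variable is unset or "0" is an assumption made for this sketch.

```python
import functools
import os


def is_main() -> bool:
    # Assumption: the main process is the one whose RANK is unset or "0".
    return os.environ.get("RANK", "0") == "0"


def main_only(function):
    """Hypothetical decorator: call `function` only on the main process."""

    @functools.wraps(function)
    def wrapper(*args, **kwargs):
        if is_main():
            return function(*args, **kwargs)
        # Non-main processes skip the call and get None back.
        return None

    return wrapper


@main_only
def save(name: str) -> str:
    return f"saved {name}"


print(save("checkpoint"))
```

Because non-main processes receive None, callers should not rely on the return value of a function guarded this way unless they know they are on the main process.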

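speechbrain.utils.distributed.ddp_barrier()

Synchronizes all processes in distributed data parallel (DDP) mode.

This function blocks the current process until all processes in the distributed group have reached the same point. It ensures that no process moves ahead until every other process has also reached this barrier. If DDP is not being used (i.e., only one process is running), this function has no effect and returns immediately.

Return type: None

Example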

>>> ddp_barrier()
>>> print("hello world")
hello world

speechbrain.utils.distributed.ddp_broadcast(communication_object, src=0)

In DDP mode, this function broadcasts an object to all processes.

Parameters:

communication_object (Any) – The object to send to all processes. Must be picklable. See the documentation for torch.distributed.broadcast_object_list().
src (int) – The rank that holds the object to be sent.

Returns:

The communication_object passed on rank src.

speechbrain.utils.distributed.ddp_all_reduce(communication_object, reduce_op)

In DDP mode, this function performs an all_reduce operation with the specified torch operator.

See: https://pytorch.ac.cn/docs/stable/distributed.html#torch.distributed.all_reduce

Parameters:

communication_object (Any) – The object to be reduced across processes.
reduce_op (torch.distributed.ReduceOp) – The operation to perform, for example torch.distributed.ReduceOp.AVG or torch.distributed.ReduceOp.SUM. See the Torch documentation for details.

Returns:

The communication_object once reduced (or the object itself if DDP is not initialized).
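
To make the all_reduce semantics concrete, the following pure-Python simulation shows what each rank ends up holding after a reduction. It does not use torch or any real inter-process communication; `simulate_all_reduce` and the string operation names are hypothetical and exist only for this illustration.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical reducers standing in for operations such as SUM and AVG.
REDUCERS: Dict[str, Callable[[Sequence[float]], float]] = {
    "sum": lambda values: sum(values),
    "avg": lambda values: sum(values) / len(values),
}


def simulate_all_reduce(per_rank_values: Sequence[float], op: str) -> List[float]:
    """Return the value each rank holds after the reduction (simulation only)."""
    reduced = REDUCERS[op](per_rank_values)
    # After an all_reduce, every rank holds the same reduced result.
    return [reduced for _ in per_rank_values]


# Three simulated ranks, each contributing one local value.
print(simulate_all_reduce([1.0, 2.0, 3.0], "sum"))
print(simulate_all_reduce([1.0, 2.0, 3.0], "avg"))
# A single process reduces to its own value.
print(simulate_all_reduce([5.0], "sum"))
```

The single-process case mirrors the documented fallback: with nothing to reduce against, the input comes back unchanged.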

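speechbrain.utils.distributed.ddp_init_group(run_opts)

This function initializes the DDP group if the distributed_launch bool is given on the Python command line.

The DDP group uses the distributed_backend argument to set the DDP communication protocol. The RANK Unix variable is used to register the subprocess with the DDP group.

Parameters: run_opts (list) – A list of arguments to parse, most often from sys.argv[1:].
Return type: None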