speechbrain.lobes.models.ECAPA_TDNN module

A popular speaker recognition and diarization model.

Authors

Hwidong Na 2020

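Summary

Classes:

`AttentiveStatisticsPooling`	Implements an attentive statistics pooling layer for each channel.
`BatchNorm1d`	1D batch normalization.
`Classifier`	Implements cosine similarity on top of the features.
`Conv1d`	1D convolution.
`ECAPA_TDNN`	An implementation of the speaker embedding model from the ECAPA-TDNN paper.
`Res2NetBlock`	An implementation of Res2NetBlock with dilation.
`SEBlock`	An implementation of the squeeze-and-excitation block.
`SERes2NetBlock`	An implementation of the building block in ECAPA-TDNN, i.e., TDNN-Res2Net-TDNN-SEBlock.
`TDNNBlock`	An implementation of TDNN.

Reference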

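class speechbrain.lobes.models.ECAPA_TDNN.Conv1d(*args, **kwargs)

Bases: Conv1d

1D convolution. The transpose step is skipped to improve efficiency, so inputs are expected in channel-first (batch, channel, time) layout.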

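class speechbrain.lobes.models.ECAPA_TDNN.BatchNorm1d(*args, **kwargs)

Bases: BatchNorm1d

1D batch normalization. The transpose step is skipped to improve efficiency, so inputs are expected in channel-first (batch, channel, time) layout.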

class speechbrain.lobes.models.ECAPA_TDNN.TDNNBlock(in_channels, out_channels, kernel_size, dilation, activation=<class 'torch.nn.modules.activation.ReLU'>, groups=1, dropout=0.0)

Bases: Module

An implementation of TDNN.

Parameters:

in_channels (int) – Number of input channels.
out_channels (int) – Number of output channels.
kernel_size (int) – Kernel size of the TDNN block.
dilation (int) – Dilation of the TDNN block.
activation (torch class) – Class used to construct the activation layer.
groups (int) – Number of convolution groups in the TDNN block.
dropout (float) – Rate of channel dropout during training.

Example

>>> import torch
>>> from speechbrain.lobes.models.ECAPA_TDNN import TDNNBlock
>>> inp_tensor = torch.rand([8, 120, 64]).transpose(1, 2)
>>> layer = TDNNBlock(64, 64, kernel_size=3, dilation=1)
>>> out_tensor = layer(inp_tensor).transpose(1, 2)
>>> out_tensor.shape
torch.Size([8, 120, 64])
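
The example above keeps the time axis at 120 frames. One way to see why is the standard output-length formula for a 1D convolution, sketched below in plain Python. It assumes stride 1, an odd kernel size, and symmetric "same"-style padding; the helper names `same_padding` and `conv1d_out_len` are made up for this illustration and are not part of SpeechBrain.

```python
def same_padding(kernel_size, dilation=1):
    """Per-side padding that preserves length (stride 1, odd kernel assumed)."""
    return dilation * (kernel_size - 1) // 2


def conv1d_out_len(length, kernel_size, dilation=1, padding=0, stride=1):
    """Standard 1D convolution output length."""
    span = dilation * (kernel_size - 1) + 1  # frames covered by one kernel
    return (length + 2 * padding - span) // stride + 1


# Settings from the TDNNBlock example: 120 frames, kernel_size=3, dilation=1.
pad = same_padding(3, dilation=1)
out_len = conv1d_out_len(120, 3, dilation=1, padding=pad)
print(pad, out_len)  # 1 120
```

Without any padding the same call would shrink the sequence to 118 frames, which is why a length-preserving padding scheme matters when blocks are stacked.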

forward(x): Processes the input tensor x and returns an output tensor.

class speechbrain.lobes.models.ECAPA_TDNN.Res2NetBlock(in_channels, out_channels, scale=8, kernel_size=3, dilation=1, dropout=0.0)

Bases: Module

An implementation of Res2NetBlock with dilation.

Parameters:

in_channels (int) – Number of channels expected in the input.
out_channels (int) – Number of output channels.
scale (int) – Scale of the Res2Net block.
kernel_size (int) – Kernel size of the Res2Net block.
dilation (int) – Dilation of the Res2Net block.
dropout (float) – Rate of channel dropout during training.

Example

>>> import torch
>>> from speechbrain.lobes.models.ECAPA_TDNN import Res2NetBlock
>>> inp_tensor = torch.rand([8, 120, 64]).transpose(1, 2)
>>> layer = Res2NetBlock(64, 64, scale=4, dilation=3)
>>> out_tensor = layer(inp_tensor).transpose(1, 2)
>>> out_tensor.shape
torch.Size([8, 120, 64])
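
To give a feel for what `scale` controls, here is a plain-Python sketch of the hierarchical pattern used in Res2Net-style blocks: the channels are split into `scale` groups, the first group passes through unchanged, and each later group is processed together with the previous group's output. This is an assumption based on the general Res2Net design, not a copy of the library code; `split_channels`, `res2net_forward`, and the toy `double` transform are hypothetical names.

```python
def split_channels(channels, scale):
    """Channels per group; the channel count must divide evenly by scale."""
    if channels % scale != 0:
        raise ValueError("channels must be divisible by scale")
    return channels // scale


def res2net_forward(chunks, blocks):
    """Hierarchical residual pattern over channel groups.

    chunks: list of `scale` channel groups (each a list of floats here).
    blocks: list of `scale - 1` callables standing in for per-group layers.
    """
    outputs = []
    prev = None
    for i, chunk in enumerate(chunks):
        if i == 0:
            out = chunk  # first group passes through unchanged
        elif i == 1:
            out = blocks[i - 1](chunk)
        else:
            # later groups also receive the previous group's output
            mixed = [a + b for a, b in zip(chunk, prev)]
            out = blocks[i - 1](mixed)
        outputs.append(out)
        prev = out
    return outputs


def double(group):
    """Toy stand-in for a learned layer."""
    return [2.0 * v for v in group]


group_width = split_channels(64, 4)  # 16 channels per group for 64 channels, scale=4
outs = res2net_forward([[1.0], [1.0], [1.0], [1.0]], [double, double, double])
print(group_width, outs)  # 16 [[1.0], [2.0], [6.0], [14.0]]
```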

forward(x): Processes the input tensor x and returns an output tensor.

class speechbrain.lobes.models.ECAPA_TDNN.SEBlock(in_channels, se_channels, out_channels)

Bases: Module

An implementation of the squeeze-and-excitation block.

Parameters:

in_channels (int) – Number of input channels.
se_channels (int) – Number of output channels after the squeeze.
out_channels (int) – Number of output channels.

Example

>>> import torch
>>> from speechbrain.lobes.models.ECAPA_TDNN import SEBlock
>>> inp_tensor = torch.rand([8, 120, 64]).transpose(1, 2)
>>> se_layer = SEBlock(64, 16, 64)
>>> lengths = torch.rand((8,))
>>> out_tensor = se_layer(inp_tensor, lengths).transpose(1, 2)
>>> out_tensor.shape
torch.Size([8, 120, 64])
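
The squeeze-and-excitation idea can be sketched in a few lines of plain Python: average each channel over time (squeeze), pass the result through a small bottleneck, and use a sigmoid gate to rescale each channel (excitation). The sketch below ignores biases and length masking, and the weights `w1` and `w2` are hypothetical values chosen for readability; it illustrates the general technique rather than the library's exact implementation.

```python
import math


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def se_gate(x, w1, w2):
    """x: channels x frames. Returns x rescaled by a per-channel gate in (0, 1)."""
    squeezed = [sum(ch) / len(ch) for ch in x]            # squeeze: mean over time
    hidden = [max(0.0, h) for h in matvec(w1, squeezed)]  # bottleneck + ReLU
    gates = [1.0 / (1.0 + math.exp(-g)) for g in matvec(w2, hidden)]  # sigmoid
    return [[g * v for v in ch] for g, ch in zip(gates, x)]


x = [[1.0, 3.0], [2.0, 2.0]]  # 2 channels, 2 frames
w1 = [[0.5, 0.5]]             # 2 channels -> 1 bottleneck unit
w2 = [[0.0], [1.0]]           # 1 bottleneck unit -> 2 channel gates
out = se_gate(x, w1, w2)
print(out)
```

The output keeps the input's shape, which matches the example above where a [8, 120, 64] input yields a [8, 120, 64] output.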

forward(x, lengths=None): Processes the input tensor x and returns an output tensor.

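class speechbrain.lobes.models.ECAPA_TDNN.AttentiveStatisticsPooling(channels, attention_channels=128, global_context=True)

Bases: Module

This class implements an attentive statistics pooling layer for each channel. It returns the concatenated mean and standard deviation of the input tensor.

Parameters:

channels (int) – Number of input channels.
attention_channels (int) – Number of attention channels.
global_context (bool) – Whether to use global context.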

Example

>>> import torch
>>> from speechbrain.lobes.models.ECAPA_TDNN import AttentiveStatisticsPooling
>>> inp_tensor = torch.rand([8, 120, 64]).transpose(1, 2)
>>> asp_layer = AttentiveStatisticsPooling(64)
>>> lengths = torch.rand((8,))
>>> out_tensor = asp_layer(inp_tensor, lengths).transpose(1, 2)
>>> out_tensor.shape
torch.Size([8, 1, 128])

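forward(x, lengths=None)

Calculates the mean and standard deviation for a batch (input tensor).

Parameters:

x (torch.Tensor) – Tensor of shape [N, C, L].
lengths (torch.Tensor) – The corresponding relative lengths of the inputs.

Returns:

pooled_stats – Mean and standard deviation for the batch.

Return type:

torch.Tensor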
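
As a rough illustration of what "attentive statistics" means for a single channel, the sketch below computes an attention-weighted mean and standard deviation over the valid frames only. It is plain Python rather than the library's code; the helper names (`valid_frames`, `attentive_stats`), the softmax normalization, the rounding rule, and the `eps` floor are all assumptions made for this example.

```python
import math


def valid_frames(rel_length, max_len):
    """Number of valid frames for a relative length in [0, 1] (assumed rounding)."""
    return max(1, min(max_len, int(round(rel_length * max_len))))


def attentive_stats(frames, scores, rel_length=1.0, eps=1e-12):
    """Attention-weighted mean and std of one channel over its valid frames."""
    n = valid_frames(rel_length, len(frames))
    # Softmax over valid frames only, so padded frames get zero weight.
    peak = max(scores[:n])
    exps = [math.exp(s - peak) for s in scores[:n]]
    total = sum(exps)
    weights = [e / total for e in exps]
    # zip stops at the shorter list, so only the first n frames are used.
    mean = sum(w * x for w, x in zip(weights, frames))
    var = sum(w * (x - mean) ** 2 for w, x in zip(weights, frames))
    std = math.sqrt(max(var, eps))
    return mean, std


# Four frames; the last two are padding (relative length 0.5).
mean, std = attentive_stats(
    [1.0, 3.0, 100.0, 100.0], [0.0, 0.0, 5.0, 5.0], rel_length=0.5
)
print(mean, std)  # 2.0 1.0
```

With uniform scores this reduces to the ordinary mean and standard deviation. Because the layer concatenates the two statistics per channel, 64 input channels give 128 outputs in the example above.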

class speechbrain.lobes.models.ECAPA_TDNN.SERes2NetBlock(in_channels, out_channels, res2net_scale=8, se_channels=128, kernel_size=1, dilation=1, activation=<class 'torch.nn.modules.activation.ReLU'>, groups=1, dropout=0.0)

Bases: Module

An implementation of the building block in ECAPA-TDNN, i.e., TDNN-Res2Net-TDNN-SEBlock.

Parameters:

in_channels (int) – Expected size of the input channels.
out_channels (int) – Number of output channels.
res2net_scale (int) – Scale of the Res2Net block.
se_channels (int) – Number of output channels after the squeeze.
kernel_size (int) – Kernel size of the TDNN blocks.
dilation (int) – Dilation of the Res2Net block.
activation (torch class) – Class used to construct the activation layer.
groups (int) – Number of blocked connections from input channels to output channels.
dropout (float) – Rate of channel dropout during training.

Example

>>> import torch
>>> from speechbrain.lobes.models.ECAPA_TDNN import SERes2NetBlock
>>> x = torch.rand(8, 120, 64).transpose(1, 2)
>>> conv = SERes2NetBlock(64, 64, res2net_scale=4)
>>> out = conv(x).transpose(1, 2)
>>> out.shape
torch.Size([8, 120, 64])

forward(x, lengths=None): Processes the input tensor x and returns an output tensor.

class speechbrain.lobes.models.ECAPA_TDNN.ECAPA_TDNN(input_size, device='cpu', lin_neurons=192, activation=<class 'torch.nn.modules.activation.ReLU'>, channels=[512, 512, 512, 512, 1536], kernel_sizes=[5, 3, 3, 3, 1], dilations=[1, 2, 3, 4, 1], attention_channels=128, res2net_scale=8, se_channels=128, global_context=True, groups=[1, 1, 1, 1, 1], dropout=0.0)

Bases: Module

An implementation of the speaker embedding model described in the paper "ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification" (https://arxiv.org/abs/2005.07143).

Parameters:

input_size (int) – Expected size of the input dimension.
device (str) – Device to use, e.g., "cpu" or "cuda".
lin_neurons (int) – Number of neurons in the linear layers.
activation (torch class) – Class used to construct the activation layers.
channels (list of ints) – Output channels of the TDNN/SERes2Net layers.
kernel_sizes (list of ints) – List of kernel sizes, one per layer.
dilations (list of ints) – List of kernel dilations, one per layer.
attention_channels (int) – Number of attention channels.
res2net_scale (int) – Scale of the Res2Net block.
se_channels (int) – Number of output channels after the squeeze.
global_context (bool) – Whether to use global context.
groups (list of ints) – List of kernel groups, one per layer.
dropout (float) – Rate of channel dropout during training.

Example

>>> import torch
>>> from speechbrain.lobes.models.ECAPA_TDNN import ECAPA_TDNN
>>> input_feats = torch.rand([5, 120, 80])
>>> compute_embedding = ECAPA_TDNN(80, lin_neurons=192)
>>> outputs = compute_embedding(input_feats)
>>> outputs.shape
torch.Size([5, 1, 192])

forward(x, lengths=None)

Returns the embedding vector.

Parameters:

x (torch.Tensor) – Tensor of shape (batch, time, channel).
lengths (torch.Tensor) – The corresponding relative lengths of the inputs.

Returns:

x – Embedding vector.

Return type:

torch.Tensor
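
The `lengths` argument holds relative rather than absolute lengths. A minimal sketch of producing them from frame counts follows, assuming the convention that each length is expressed as a fraction of the longest sequence in the padded batch. The helper name `relative_lengths` is hypothetical and not part of the library.

```python
def relative_lengths(frame_counts):
    """Convert absolute frame counts to fractions of the longest sequence."""
    longest = max(frame_counts)
    if longest <= 0:
        raise ValueError("at least one sequence must be non-empty")
    return [count / longest for count in frame_counts]


# A batch padded to 120 frames with three utterances of different lengths.
rel = relative_lengths([120, 90, 60])
print(rel)  # [1.0, 0.75, 0.5]
```

In practice these values would be wrapped in a tensor with one entry per batch item before being passed as `lengths`.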

class speechbrain.lobes.models.ECAPA_TDNN.Classifier(input_size, device='cpu', lin_blocks=0, lin_neurons=192, out_neurons=1211)

Bases: Module

This class implements cosine similarity on top of the features.

Parameters:

input_size (int) – Expected size of the input dimension.
device (str) – Device to use, e.g., "cpu" or "cuda".
lin_blocks (int) – Number of linear layers.
lin_neurons (int) – Number of neurons in the linear layers.
out_neurons (int) – Number of classes.

Example

>>> import torch
>>> from speechbrain.lobes.models.ECAPA_TDNN import Classifier
>>> classify = Classifier(input_size=2, lin_neurons=2, out_neurons=2)
>>> outputs = torch.tensor([ [1., -1.], [-9., 1.], [0.9, 0.1], [0.1, 0.9] ])
>>> outputs = outputs.unsqueeze(1)
>>> cos = classify(outputs)
>>> (cos < -1.0).long().sum()
tensor(0)
>>> (cos > 1.0).long().sum()
tensor(0)
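
The bounds checked in the example follow from how cosine similarity works: both the embedding and each class weight vector are scaled to unit length before taking the dot product, so every score lies in [-1, 1]. The plain-Python sketch below illustrates this; the function names and the weight rows are hypothetical, and it is meant as an illustration of the technique rather than the library's implementation.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit length (eps guards against division by zero)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]


def cosine_scores(embedding, class_weights):
    """Cosine similarity between one embedding and each class weight row."""
    e = l2_normalize(embedding)
    return [
        sum(a * b for a, b in zip(e, l2_normalize(w)))
        for w in class_weights
    ]


weights = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical weight rows, one per speaker
scores = cosine_scores([-9.0, 1.0], weights)
print(scores)
```

Scaling the embedding by any positive factor leaves the scores unchanged, which is one reason cosine-based classifiers pair naturally with margin-based losses.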

forward(x)

Returns the output scores over speakers.

Parameters:

x (torch.Tensor) – Torch tensor.

Returns:

out – Output scores over speakers (cosine similarities).

Return type:

torch.Tensor