speechbrain.lobes.models.transformer.Transformer module

Transformer implementation in the SpeechBrain style.

Authors: Jianyuan Zhong (2020), Samuele Cornell (2021), Shucong Zhang (2024)

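Summary

Classes

`NormalizedEmbedding`	Implements the normalized embedding layer for the Transformer.
`PositionalEncoding`	Implements the absolute sinusoidal positional encoding function.
`TransformerDecoder`	Implements the Transformer decoder.
`TransformerDecoderLayer`	Implements the self-attention decoder layer.
`TransformerEncoder`	Implements the Transformer encoder.
`TransformerEncoderLayer`	Implements the self-attention encoder layer.
`TransformerInterface`	Interface for Transformer models.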

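Functions

`get_key_padding_mask`	Creates a binary mask to prevent attention to padded positions.
`get_lookahead_mask`	Creates a binary mask for each sequence that masks future frames.
`get_mask_from_lengths`	Creates a binary mask from sequence lengths.

Reference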

class speechbrain.lobes.models.transformer.Transformer.TransformerInterface(d_model=512, nhead=8, num_encoder_layers=6, num_decoder_layers=6, d_ffn=2048, dropout=0.1, activation: type = <class 'torch.nn.modules.activation.ReLU'>, custom_src_module=None, custom_tgt_module=None, positional_encoding='fixed_abs_sine', normalize_before=True, kernel_size: int = 31, bias: bool = True, encoder_module: str = 'transformer', conformer_activation: type = <class 'speechbrain.nnet.activations.Swish'>, branchformer_activation: type = <class 'torch.nn.modules.activation.GELU'>, attention_type: str = 'regularMHA', max_length: int = 2500, causal: bool = False, encoder_kdim: int | None = None, encoder_vdim: int | None = None, decoder_kdim: int | None = None, decoder_vdim: int | None = None, csgu_linear_units: int = 3072, gate_activation: type = <class 'torch.nn.modules.linear.Identity'>, use_linear_after_conv: bool = False, output_hidden_states=False, layerdrop_prob=0.0)[source]

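Bases: Module

This is an interface for Transformer models. Users can modify the attributes and define the forward function to suit their own tasks. The architecture is based on the paper "Attention Is All You Need": https://arxiv.org/pdf/1706.03762.pdf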

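Parameters:

d_model (int) – The number of expected features in the encoder/decoder inputs (default=512).
nhead (int) – The number of heads in the multi-head attention models (default=8).
num_encoder_layers (int, optional) – The number of encoder layers in the encoder.
num_decoder_layers (int, optional) – The number of decoder layers in the decoder.
d_ffn (int, optional) – The dimension of the hidden layer of the feed-forward network.
dropout (float, optional) – The dropout value.
activation (torch.nn.Module, optional) – The activation function for the feed-forward network layer, e.g. relu, gelu, or swish.
custom_src_module (torch.nn.Module, optional) – Module that processes the source features to the expected feature dimension.
custom_tgt_module (torch.nn.Module, optional) – Module that processes the target features to the expected feature dimension.
positional_encoding (str, optional) – Type of positional encoding to use, e.g. 'fixed_abs_sine' for fixed absolute positional encodings.
normalize_before (bool, optional) – Whether normalization is applied before (True) or after (False) the MHA and FFN blocks in Transformer layers. Defaults to True, as this has been shown to lead to better performance and training stability.
kernel_size (int, optional) – Kernel size of the convolutional layers when Conformer is used.
bias (bool, optional) – Whether to use bias in the Conformer convolutional layers.
encoder_module (str, optional) – Selects the encoder among Branchformer, Conformer, and Transformer. The decoder is fixed to Transformer.
conformer_activation (torch.nn.Module, optional) – Activation module used after the Conformer convolutional layers, e.g. Swish or ReLU. It must be a torch Module.
branchformer_activation (torch.nn.Module, optional) – Activation module used inside the Branchformer encoder, e.g. Swish or ReLU. It must be a torch Module.
attention_type (str, optional) – Type of attention layer used in all Transformer or Conformer layers, e.g. regularMHA or RelPosMHA.
max_length (int, optional) – Maximum length of the target and source sequences in the input. Used for positional encodings.
causal (bool, optional) – Whether the encoder should be causal (the decoder is always causal). If causal, the Conformer convolutional layer is causal.
encoder_kdim (int, optional) – Dimension of the encoder keys.
encoder_vdim (int, optional) – Dimension of the encoder values.
decoder_kdim (int, optional) – Dimension of the decoder keys.
decoder_vdim (int, optional) – Dimension of the decoder values.
csgu_linear_units (int, optional) – Number of neurons in the hidden linear units of the CSGU module (Branchformer).
gate_activation (torch.nn.Module, optional) – Activation function used at the gate of the CSGU module (Branchformer).
use_linear_after_conv (bool, optional) – If True, a linear transformation of size input_size//2 is applied (Branchformer).
output_hidden_states (bool, optional) – Whether the model should output the hidden states as a list of tensors.
layerdrop_prob (float) – The probability of dropping an entire layer.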

forward(**kwags)[source]: Users should modify this function according to their own tasks.

class speechbrain.lobes.models.transformer.Transformer.PositionalEncoding(input_size, max_len=2500)[source]

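Bases: Module

This class implements the absolute sinusoidal positional encoding function:

PE(pos, 2i) = sin(pos/(10000^(2i/dmodel)))
PE(pos, 2i+1) = cos(pos/(10000^(2i/dmodel)))

Parameters:

input_size (int) – Embedding dimension.
max_len (int, optional) – Maximum length of the input sequences (default 2500).

Example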

>>> import torch
>>> a = torch.rand((8, 120, 512))
>>> enc = PositionalEncoding(input_size=a.shape[-1])
>>> b = enc(a)
>>> b.shape
torch.Size([1, 120, 512])

forward(x)[source]

Parameters: x (torch.Tensor) – Input features of shape (batch, time, fea).
Returns: The positional encoding.
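The sinusoidal formula above can be worked through in plain Python. This is a minimal sketch for illustration only, not SpeechBrain's implementation: the function name `sinusoidal_positional_encoding` is hypothetical, `d_model` plays the role of `input_size`, and rejecting odd dimensions is an assumption of the sketch.

```python
import math


def sinusoidal_positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of absolute sinusoidal encodings."""
    # Assumption of this sketch: sine/cosine pairs need an even dimension.
    if d_model % 2 != 0:
        raise ValueError("d_model must be even for sine/cosine pairs")
    table = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        # The loop index i steps over even channels, so it equals 2i in the formula.
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            table[pos][i] = math.sin(angle)      # PE(pos, 2i)
            table[pos][i + 1] = math.cos(angle)  # PE(pos, 2i+1)
    return table


pe = sinusoidal_positional_encoding(max_len=120, d_model=512)
```

Position 0 encodes to alternating 0 and 1 (sin 0 and cos 0), and every later position gets a distinct pattern that depends only on its index, not on the input values.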

class speechbrain.lobes.models.transformer.Transformer.TransformerEncoderLayer(d_ffn, nhead, d_model, kdim=None, vdim=None, dropout=0.0, activation: type = <class 'torch.nn.modules.activation.ReLU'>, normalize_before=False, attention_type='regularMHA', ffn_type='regularFFN', ffn_cnn_kernel_size_list=[3, 3], causal=False)[source]

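Bases: Module

This is an implementation of the self-attention encoder layer.

Parameters:

d_ffn (int) – The dimension of the hidden layer of the feed-forward network.
nhead (int) – The number of heads in the multi-head attention models.
d_model (int) – The number of expected features in the encoder/decoder inputs.
kdim (int, optional) – Dimension of the keys.
vdim (int, optional) – Dimension of the values.
dropout (float, optional) – The dropout value.
activation (torch.nn.Module, optional) – The activation function for the feed-forward network layer, e.g. relu, gelu, or swish.
normalize_before (bool, optional) – Whether normalization is applied before (True) or after (False) the MHA and FFN blocks in Transformer layers. Setting it to True has been shown to lead to better performance and training stability; the signature above defaults to False.
attention_type (str, optional) – Type of attention layer used in all Transformer or Conformer layers, e.g. regularMHA or RelPosMHA.
ffn_type (str) – Type of FFN: regularFFN or 1dcnn.
ffn_cnn_kernel_size_list (list of int) – Kernel sizes of the two 1D convolutions, used when ffn_type is 1dcnn.
causal (bool, optional) – Whether the encoder should be causal (the decoder is always causal). If causal, the Conformer convolutional layer is causal.

Example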

>>> import torch
>>> x = torch.rand((8, 60, 512))
>>> net = TransformerEncoderLayer(512, 8, d_model=512)
>>> output = net(x)
>>> output[0].shape
torch.Size([8, 60, 512])

forward(src, src_mask: Tensor | None = None, src_key_padding_mask: Tensor | None = None, pos_embs: Tensor | None = None)[source]

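Parameters:

src (torch.Tensor) – The input sequence to the encoder layer.
src_mask (torch.Tensor) – The mask for the src query for each example in the batch.
src_key_padding_mask (torch.Tensor, optional) – The mask for the src keys for each example in the batch.
pos_embs (torch.Tensor, optional) – The positional embeddings tensor.

Returns:

output – The output of the transformer encoder layer.

Return type:

torch.Tensor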
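The normalize_before flag selects between pre-norm and post-norm residual blocks. The following pure-Python sketch illustrates only the ordering of the operations; `layer_norm`, `residual_block`, and `double` are hypothetical helpers, the normalization has no learnable parameters, and dropout is omitted.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def residual_block(x, sublayer, normalize_before):
    """Apply one residual sublayer in pre-norm or post-norm order."""
    if normalize_before:
        # Pre-norm: x + sublayer(norm(x)); the residual path stays unnormalized.
        y = sublayer(layer_norm(x))
        return [a + b for a, b in zip(x, y)]
    # Post-norm: norm(x + sublayer(x)); the block output is always normalized.
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])


def double(v):
    # Hypothetical stand-in for an attention or feed-forward sublayer.
    return [2.0 * t for t in v]


x = [1.0, 2.0, 3.0, 4.0]
pre = residual_block(x, double, normalize_before=True)
post = residual_block(x, double, normalize_before=False)
```

In the pre-norm case the input passes through the residual connection untouched, which is one common explanation for its better training stability in deep stacks.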

class speechbrain.lobes.models.transformer.Transformer.TransformerEncoder(num_layers, nhead, d_ffn, input_shape=None, d_model=None, kdim=None, vdim=None, dropout=0.0, activation=<class 'torch.nn.modules.activation.ReLU'>, normalize_before=False, causal=False, layerdrop_prob=0.0, attention_type='regularMHA', ffn_type='regularFFN', ffn_cnn_kernel_size_list=[3, 3], output_hidden_states=False)[source]

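Bases: Module

This class implements the transformer encoder.

Parameters:

num_layers (int) – Number of transformer layers to include.
nhead (int) – Number of attention heads.
d_ffn (int) – Hidden size of the self-attention feed-forward layer.
input_shape (tuple) – Expected shape of the input.
d_model (int) – The dimension of the input embedding.
kdim (int, optional) – Dimension of the keys.
vdim (int, optional) – Dimension of the values.
dropout (float, optional) – Dropout for the encoder.
activation (torch.nn.Module, optional) – The activation function for the feed-forward network layer, e.g. relu, gelu, or swish.
normalize_before (bool, optional) – Whether normalization is applied before (True) or after (False) the MHA and FFN blocks in Transformer layers. Setting it to True has been shown to lead to better performance and training stability; the signature above defaults to False.
causal (bool, optional) – Whether the encoder should be causal (the decoder is always causal). If causal, the Conformer convolutional layer is causal.
layerdrop_prob (float) – The probability of dropping an entire layer.
attention_type (str, optional) – Type of attention layer used in all Transformer or Conformer layers, e.g. regularMHA or RelPosMHA.
ffn_type (str) – Type of FFN: regularFFN or 1dcnn.
ffn_cnn_kernel_size_list (list of int) – Kernel sizes of the two 1D convolutions, used when ffn_type is 1dcnn.
output_hidden_states (bool, optional) – Whether the model should output the hidden states as a list of tensors.

Example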

>>> import torch
>>> x = torch.rand((8, 60, 512))
>>> net = TransformerEncoder(1, 8, 512, d_model=512)
>>> output, _ = net(x)
>>> output.shape
torch.Size([8, 60, 512])

>>> import torch
>>> x = torch.rand((8, 60, 512))
>>> net = TransformerEncoder(1, 8, 512, d_model=512, output_hidden_states=True)
>>> output, attn_list, hidden_list = net(x)
>>> hidden_list[0].shape
torch.Size([8, 60, 512])
>>> len(hidden_list)
2

forward(src, src_mask: Tensor | None = None, src_key_padding_mask: Tensor | None = None, pos_embs: Tensor | None = None, dynchunktrain_config=None)[source]

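Parameters:

src (torch.Tensor) – The input sequence to the encoder layer (required).
src_mask (torch.Tensor) – The mask for the src sequence (optional).
src_key_padding_mask (torch.Tensor) – The mask for the src keys per batch (optional).
pos_embs (torch.Tensor) – The positional embeddings tensor.
dynchunktrain_config (config) – Not supported for this encoder.

Returns:

output (torch.Tensor) – The output of the transformer.
attention_lst (list) – The list of attention values.
hidden_state_lst (list, optional) – The outputs of the encoder's hidden layers. Only available when output_hidden_states is set to True.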
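The layerdrop_prob parameter is described as the probability of dropping an entire layer. A minimal sketch of that idea in plain Python follows; `run_layers` is a hypothetical helper, the layers are toy functions, and skipping layers only in training mode is an assumption of this sketch rather than a statement about the library's behavior.

```python
import random


def run_layers(x, layers, layerdrop_prob=0.0, training=True, rng=None):
    """Apply layers in order, randomly skipping whole layers while training."""
    if rng is None:
        rng = random.Random()
    for layer in layers:
        # Assumption: layers are only dropped in training mode.
        if training and layerdrop_prob > 0.0 and rng.random() < layerdrop_prob:
            continue  # skip this layer entirely; x passes through unchanged
        x = layer(x)
    return x


# Six toy "layers" that each add 1, so the output counts the layers applied.
layers = [lambda v: v + 1 for _ in range(6)]
eval_out = run_layers(0, layers, layerdrop_prob=0.5, training=False)
train_out = run_layers(0, layers, layerdrop_prob=0.5, rng=random.Random(0))
```

With training=False all six layers run, so eval_out is 6; in training mode the result is the number of layers that survived the random draw.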

class speechbrain.lobes.models.transformer.Transformer.TransformerDecoderLayer(d_ffn, nhead, d_model, kdim=None, vdim=None, dropout=0.0, activation=<class 'torch.nn.modules.activation.ReLU'>, normalize_before=False, attention_type='regularMHA', causal=None)[source]

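Bases: Module

This class implements the self-attention decoder layer.

Parameters:

d_ffn (int) – Hidden size of the self-attention feed-forward layer.
nhead (int) – Number of attention heads.
d_model (int) – Dimension of the model.
kdim (int, optional) – Dimension of the keys.
vdim (int, optional) – Dimension of the values.
dropout (float, optional) – Dropout for the decoder.
activation (Callable) – Function to use between layers; defaults to nn.ReLU.
normalize_before (bool) – Whether to normalize before the layers.
attention_type (str) – Type of attention to use, "regularMHA" or "RelPosMHAXL".
causal (bool) – Whether to mask future positions.

Example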

>>> import torch
>>> src = torch.rand((8, 60, 512))
>>> tgt = torch.rand((8, 60, 512))
>>> net = TransformerDecoderLayer(1024, 8, d_model=512)
>>> output, self_attn, multihead_attn = net(src, tgt)
>>> output.shape
torch.Size([8, 60, 512])

forward(tgt, memory, tgt_mask=None, memory_mask=None, tgt_key_padding_mask=None, memory_key_padding_mask=None, pos_embs_tgt=None, pos_embs_src=None)[source]

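Parameters:

tgt (torch.Tensor) – The input sequence to the decoder layer (required).
memory (torch.Tensor) – The sequence from the last layer of the encoder (required).
tgt_mask (torch.Tensor) – The mask for the tgt sequence (optional).
memory_mask (torch.Tensor) – The mask for the memory sequence (optional).
tgt_key_padding_mask (torch.Tensor) – The mask for the tgt keys per batch (optional).
memory_key_padding_mask (torch.Tensor) – The mask for the memory keys per batch (optional).
pos_embs_tgt (torch.Tensor) – The positional embeddings for the target (optional).
pos_embs_src (torch.Tensor) – The positional embeddings for the source (optional).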

class speechbrain.lobes.models.transformer.Transformer.TransformerDecoder(num_layers, nhead, d_ffn, d_model, kdim=None, vdim=None, dropout=0.0, activation=<class 'torch.nn.modules.activation.ReLU'>, normalize_before=False, causal=False, attention_type='regularMHA')[source]

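Bases: Module

This class implements the Transformer decoder.

Parameters:

num_layers (int) – Number of transformer layers for the decoder.
nhead (int) – Number of attention heads.
d_ffn (int) – Hidden size of the self-attention feed-forward layer.
d_model (int) – Dimension of the model.
kdim (int, optional) – Dimension of the keys.
vdim (int, optional) – Dimension of the values.
dropout (float, optional) – Dropout for the decoder.
activation (Callable) – Function to apply between layers; defaults to nn.ReLU.
normalize_before (bool) – Whether to normalize before the layers.
causal (bool) – Whether to allow future information in decoding.
attention_type (str) – Type of attention to use, "regularMHA" or "RelPosMHAXL".

Example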

>>> import torch
>>> src = torch.rand((8, 60, 512))
>>> tgt = torch.rand((8, 60, 512))
>>> net = TransformerDecoder(1, 8, 1024, d_model=512)
>>> output, _, _ = net(src, tgt)
>>> output.shape
torch.Size([8, 60, 512])

forward(tgt, memory, tgt_mask=None, memory_mask=None, tgt_key_padding_mask=None, memory_key_padding_mask=None, pos_embs_tgt=None, pos_embs_src=None)[source]

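Parameters:

tgt (torch.Tensor) – The input sequence to the decoder layer (required).
memory (torch.Tensor) – The sequence from the last layer of the encoder (required).
tgt_mask (torch.Tensor) – The mask for the tgt sequence (optional).
memory_mask (torch.Tensor) – The mask for the memory sequence (optional).
tgt_key_padding_mask (torch.Tensor) – The mask for the tgt keys per batch (optional).
memory_key_padding_mask (torch.Tensor) – The mask for the memory keys per batch (optional).
pos_embs_tgt (torch.Tensor) – The positional embeddings for the target (optional).
pos_embs_src (torch.Tensor) – The positional embeddings for the source (optional).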

class speechbrain.lobes.models.transformer.Transformer.NormalizedEmbedding(d_model, vocab)[source]

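Bases: Module

This class implements the normalized embedding layer for the transformer. Since the dot product of the self-attention is always normalized by sqrt(d_model), and the final linear projection for prediction shares its weights with the embedding layer, the output of the embedding layer is multiplied by sqrt(d_model).

Parameters:

d_model (int) – The number of expected features in the encoder/decoder inputs.
vocab (int) – The vocabulary size.

Example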

>>> import torch
>>> emb = NormalizedEmbedding(512, 1000)
>>> trg = torch.randint(0, 999, (8, 50))
>>> emb_fea = emb(trg)

forward(x)[source]: Processes the input tensor x and returns an output tensor.
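The scaling described for NormalizedEmbedding amounts to an embedding lookup followed by a multiplication by sqrt(d_model). Below is a minimal pure-Python sketch of that idea; the class `ScaledEmbedding`, its random initialization, and the `seed` argument are hypothetical and only stand in for a learned embedding table.

```python
import math
import random


class ScaledEmbedding:
    """Toy embedding table whose lookups are scaled by sqrt(d_model)."""

    def __init__(self, d_model, vocab, seed=0):
        rng = random.Random(seed)
        self.d_model = d_model
        # One row of d_model random values per vocabulary entry.
        self.weight = [
            [rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(vocab)
        ]

    def __call__(self, token_ids):
        scale = math.sqrt(self.d_model)
        # Look up each token's row, then multiply it by sqrt(d_model).
        return [[w * scale for w in self.weight[t]] for t in token_ids]


emb = ScaledEmbedding(d_model=16, vocab=10)
out = emb([3, 7, 3])  # three tokens, each mapped to a 16-dimensional vector
```

Repeated token ids map to identical vectors, and every entry is exactly sqrt(16) = 4 times the corresponding table entry.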

speechbrain.lobes.models.transformer.Transformer.get_key_padding_mask(padded_input, pad_idx)[source]

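Creates a binary mask to prevent attention to padded positions. We suggest using get_mask_from_lengths instead of this function.

Parameters:

padded_input (torch.Tensor) – The padded input.
pad_idx (int) – Index of the padding element.

Returns:

key_padded_mask – Binary mask that prevents attention to padding.

Return type:

torch.Tensor

Example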

>>> import torch
>>> a = torch.LongTensor([[1,1,0], [2,3,0], [4,5,0]])
>>> get_key_padding_mask(a, pad_idx=0)
tensor([[False, False,  True],
        [False, False,  True],
        [False, False,  True]])
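To show what such a mask is for, here is a pure-Python sketch that builds the same mask and uses it to exclude padded keys from a softmax over attention scores. The helpers `key_padding_mask` and `masked_softmax` are hypothetical names for this illustration and assume at least one unmasked position per row.

```python
import math


def key_padding_mask(padded_input, pad_idx):
    """True marks padded positions, mirroring the example output above."""
    return [[token == pad_idx for token in seq] for seq in padded_input]


def masked_softmax(scores, mask):
    """Softmax over scores, giving zero weight to masked (True) positions."""
    masked = [-math.inf if m else s for s, m in zip(scores, mask)]
    peak = max(masked)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


mask = key_padding_mask([[1, 1, 0], [2, 3, 0], [4, 5, 0]], pad_idx=0)
weights = masked_softmax([1.0, 1.0, 5.0], mask[0])
```

Even though the padded key has the largest raw score, it receives zero attention weight and the two real keys share the weight equally.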

speechbrain.lobes.models.transformer.Transformer.get_lookahead_mask(padded_input)[source]

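Creates a binary mask for each sequence that masks future frames.

Parameters: padded_input (torch.Tensor) – The padded input tensor.
Returns: mask – Binary mask for masking future frames.
Return type: torch.Tensor

Example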

>>> import torch
>>> a = torch.LongTensor([[1,1,0], [2,3,0], [4,5,0]])
>>> get_lookahead_mask(a)
tensor([[0., -inf, -inf],
        [0., 0., -inf],
        [0., 0., 0.]])
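The pattern in this output can be reproduced in a few lines of plain Python: position i may attend to positions j <= i (value 0) and is blocked from later ones (value -inf), so the mask can be added to attention scores before the softmax. The function `lookahead_mask` is a hypothetical helper for illustration, not the library routine.

```python
import math


def lookahead_mask(seq_len):
    """Square additive mask: 0 on and below the diagonal, -inf above it."""
    return [
        [0.0 if j <= i else -math.inf for j in range(seq_len)]
        for i in range(seq_len)
    ]


mask = lookahead_mask(3)
for row in mask:
    print(row)
```

The mask depends only on the sequence length (3 in the example above), not on the token values.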

speechbrain.lobes.models.transformer.Transformer.get_mask_from_lengths(lengths, max_len=None)[source]

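Creates a binary mask from sequence lengths.

Parameters:

lengths (torch.Tensor) – A tensor of sequence lengths.
max_len (int, optional) – Maximum sequence length; defaults to None.

Returns:

mask – The mask, with padded elements set to True. It can then be applied with tensor.masked_fill_(mask, 0).

Return type:

torch.Tensor

Example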

>>> import torch
>>> lengths = torch.tensor([3, 2, 4])
>>> get_mask_from_lengths(lengths)
tensor([[False, False, False,  True],
        [False, False,  True,  True],
        [False, False, False, False]])
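The same mask can be derived by hand: a position is padding when its index is at or beyond the sequence length. The pure-Python sketch below uses the hypothetical helpers `mask_from_lengths` and `zero_padded`; falling back to the longest sequence when max_len is None is an assumption that matches the example output above.

```python
def mask_from_lengths(lengths, max_len=None):
    """True marks padded positions (index >= sequence length)."""
    if max_len is None:
        # Assumption: default to the longest sequence in the batch.
        max_len = max(lengths)
    return [[pos >= n for pos in range(max_len)] for n in lengths]


def zero_padded(values, mask):
    """Plain-Python analogue of masked_fill_(mask, 0) for a 2-D list."""
    return [
        [0 if m else v for v, m in zip(row, mask_row)]
        for row, mask_row in zip(values, mask)
    ]


mask = mask_from_lengths([3, 2, 4])
cleaned = zero_padded([[9, 9, 9, 9], [8, 8, 8, 8], [7, 7, 7, 7]], mask)
```

After the fill, each row keeps its real values and has zeros in the padded tail.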