speechbrain.lobes.models.g2p.model module

An attention-based RNN model for grapheme-to-phoneme (G2P) conversion

Authors

Mirco Ravanelli 2021
Artem Ploujnikov 2021

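Summary

Classes

`AttentionSeq2Seq`	The attention-based RNN encoder-decoder model
`TransformerG2P`	A Transformer-based grapheme-to-phoneme model
`WordEmbeddingEncoder`	A small encoder module that reduces the dimensionality of word embeddings and normalizes them

Functions

`get_dummy_phonemes`	Creates a dummy phoneme sequence
`input_dim`	Computes the input dimension (intended for hparam files)

Reference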

class speechbrain.lobes.models.g2p.model.AttentionSeq2Seq(enc, encoder_emb, emb, dec, lin, out, bos_token=0, use_word_emb=False, word_emb_enc=None)[source]

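Bases: Module

The attention-based RNN encoder-decoder model

Parameters:

enc (torch.nn.Module) – the encoder module
encoder_emb (torch.nn.Module) – the encoder embedding module
emb (torch.nn.Module) – the embedding module
dec (torch.nn.Module) – the decoder module
lin (torch.nn.Module) – the linear module
out (torch.nn.Module) – the output layer (typically log_softmax)
bos_token (int) – the index of the Beginning-of-Sentence (BOS) token
use_word_emb (bool) – whether to use word embeddings
word_emb_enc (nn.Module) – the module used to encode word embeddings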

forward(grapheme_encoded, phn_encoded=None, word_emb=None)[source]

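Computes the forward pass

Parameters:

grapheme_encoded (torch.Tensor) – graphemes encoded as a Torch tensor
phn_encoded (torch.Tensor) – the encoded phonemes
word_emb (torch.Tensor) – word embeddings (optional)

Returns:

p_seq (torch.Tensor) – a (batch x position x token) tensor of token probabilities at each position
char_lens (torch.Tensor) – a tensor of character sequence lengths
encoder_out – the raw output of the encoder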
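To make the shape of `p_seq` concrete, the sketch below builds a random stand-in tensor with the (batch x position x token) layout and picks the most likely token at each position. It does not call `AttentionSeq2Seq`; the sizes (2 sequences, 6 positions, 40 tokens) are arbitrary assumptions, and the values are log-probabilities only because `log_softmax` is assumed as the output layer.

```python
import torch

torch.manual_seed(0)

# Stand-in for a model output: (batch=2, position=6, token=40).
# Assumption: the output layer is log_softmax, so values are log-probabilities.
p_seq = torch.log_softmax(torch.randn(2, 6, 40), dim=-1)

# Greedy choice: the highest-scoring token index at every position.
pred = p_seq.argmax(dim=-1)  # shape: (batch, position)

print(pred.shape)
```

Greedy argmax is only the simplest way to read such a tensor; a beam search over the same scores is a common alternative.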

class speechbrain.lobes.models.g2p.model.WordEmbeddingEncoder(word_emb_dim, word_emb_enc_dim, norm=None, norm_type=None)[source]

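Bases: Module

A small encoder module that reduces the dimensionality of word embeddings and normalizes them

Parameters:

word_emb_dim (int) – the dimension of the original word embeddings
word_emb_enc_dim (int) – the dimension of the encoded word embeddings
norm (torch.nn.Module) – the normalization to be used (e.g. speechbrain.nnet.normalization.LayerNorm)
norm_type (str) – the type of normalization to be used (the keys of NORMS are 'batch', 'instance', and 'layer')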

forward(emb)[source]

Computes the forward pass of the embedding

Parameters: emb (torch.Tensor) – the original word embeddings
Returns: emb_enc – the encoded word embeddings
Return type: torch.Tensor

NORMS = {'batch': <class 'speechbrain.nnet.normalization.BatchNorm1d'>, 'instance': <class 'speechbrain.nnet.normalization.InstanceNorm1d'>, 'layer': <class 'speechbrain.nnet.normalization.LayerNorm'>}
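The idea of "reduce the dimension, then normalize" can be sketched with plain `torch.nn` layers. This is not the SpeechBrain implementation: the class name `TinyWordEmbEncoder`, the choice of `nn.LayerNorm`, the `Tanh` activation, and the sizes 768 and 256 are all assumptions made for illustration.

```python
import torch
from torch import nn


class TinyWordEmbEncoder(nn.Module):
    """Hypothetical stand-in: project word embeddings down, then normalize."""

    def __init__(self, word_emb_dim, word_emb_enc_dim):
        super().__init__()
        self.lin = nn.Linear(word_emb_dim, word_emb_enc_dim)  # dimension reduction
        self.norm = nn.LayerNorm(word_emb_enc_dim)            # assumed normalization
        self.act = nn.Tanh()                                  # assumed activation

    def forward(self, emb):
        # emb: (batch, length, word_emb_dim)
        return self.act(self.norm(self.lin(emb)))


emb = torch.randn(2, 5, 768)           # e.g. contextual word embeddings
encoder = TinyWordEmbEncoder(768, 256)
emb_enc = encoder(emb)                 # (batch, length, word_emb_enc_dim)
print(emb_enc.shape)
```

Only the last dimension changes; the batch and length dimensions pass through untouched, which is what allows the encoded word embeddings to be combined with per-character features later.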

class speechbrain.lobes.models.g2p.model.TransformerG2P(emb, encoder_emb, char_lin, phn_lin, lin, out, d_model=512, nhead=8, num_encoder_layers=6, num_decoder_layers=6, d_ffn=2048, dropout=0.1, activation=<class 'torch.nn.modules.activation.ReLU'>, custom_src_module=None, custom_tgt_module=None, positional_encoding='fixed_abs_sine', normalize_before=True, kernel_size=15, bias=True, encoder_module='transformer', attention_type='regularMHA', max_length=2500, causal=False, pad_idx=0, encoder_kdim=None, encoder_vdim=None, decoder_kdim=None, decoder_vdim=None, use_word_emb=False, word_emb_enc=None)[source]

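Bases: TransformerInterface

A Transformer-based grapheme-to-phoneme model

Parameters:

emb (torch.nn.Module) – the embedding module
encoder_emb (torch.nn.Module) – the encoder embedding module
char_lin (torch.nn.Module) – a linear module that connects the inputs to the transformer
phn_lin (torch.nn.Module) – a linear module that connects the outputs to the transformer
lin (torch.nn.Module) – the linear module for the outputs
out (torch.nn.Module) – the output layer (usually Softmax)
d_model (int) – the number of expected features in the encoder/decoder inputs (default=512).
nhead (int) – the number of heads in the multi-head attention models (default=8).
num_encoder_layers (int, optional) – the number of encoder layers in the encoder.
num_decoder_layers (int, optional) – the number of decoder layers in the decoder.
d_ffn (int, optional) – the dimension of the hidden layer of the feedforward network.
dropout (float, optional) – the dropout value.
activation (torch.nn.Module, optional) – the activation function of the feedforward layers, e.g. relu, gelu or swish.
custom_src_module (torch.nn.Module, optional) – a module that processes the source features to the expected feature dimension.
custom_tgt_module (torch.nn.Module, optional) – a module that processes the target features to the expected feature dimension.
positional_encoding (str, optional) – the type of positional encoding to use, e.g. 'fixed_abs_sine' for fixed absolute positional encodings.
normalize_before (bool, optional) – whether normalization is applied before (True) or after (False) the MHA and FFN sublayers of each Transformer layer. Defaults to True, as this has been shown to give better performance and training stability.
kernel_size (int, optional) – the kernel size of the convolutional layers when Conformer is used.
bias (bool, optional) – whether to use a bias in the Conformer convolutional layers.
encoder_module (str, optional) – selects Conformer or Transformer as the encoder. The decoder is always a Transformer.
conformer_activation (torch.nn.Module, optional) – the activation module used after the Conformer convolutional layers, e.g. Swish or ReLU. It must be a torch Module.
attention_type (str, optional) – the type of attention layer used in all Transformer or Conformer layers, e.g. regularMHA or RelPosMHA.
max_length (int, optional) – the maximum length of the target and source sequences in the input. Used for positional encodings.
causal (bool, optional) – whether the encoder should be causal (the decoder is always causal). If causal, the Conformer convolutional layers are causal as well.
pad_idx (int) – the padding index (used for masks)
encoder_kdim (int, optional) – the dimension of the encoder keys.
encoder_vdim (int, optional) – the dimension of the encoder values.
decoder_kdim (int, optional) – the dimension of the decoder keys.
decoder_vdim (int, optional) – the dimension of the decoder values.
use_word_emb (bool) – whether to use word embeddings
word_emb_enc (nn.Module) – the module used to encode word embeddings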

forward(grapheme_encoded, phn_encoded=None, word_emb=None)[source]

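Computes the forward pass

Parameters:

grapheme_encoded (torch.Tensor) – graphemes encoded as a Torch tensor
phn_encoded (torch.Tensor) – the encoded phonemes
word_emb (torch.Tensor) – word embeddings (if applicable)

Returns:

p_seq (torch.Tensor) – the log probabilities of the individual tokens in the sequence
char_lens (torch.Tensor) – the character lengths
encoder_out (torch.Tensor) – the encoder state
attention (torch.Tensor) – the attention state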

make_masks(src, tgt, src_len=None, pad_idx=0)[source]

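This method generates the masks used for training the transformer model.

Parameters:

src (torch.Tensor) – the sequence input to the encoder (required).
tgt (torch.Tensor) – the sequence input to the decoder (required).
src_len (torch.Tensor) – the lengths corresponding to the source tensor.
pad_idx (int) – the index of the <pad> token (default=0).

Returns:

src_key_padding_mask (torch.Tensor) – the source key padding mask
tgt_key_padding_mask (torch.Tensor) – the target key padding mask
src_mask (torch.Tensor) – the source mask
tgt_mask (torch.Tensor) – the target mask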
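The two kinds of mask named above can be illustrated in plain PyTorch. This is a minimal sketch of the general technique, not the body of `make_masks`: the helper names `padding_mask` and `causal_mask` are hypothetical, and boolean masks (True meaning "ignore this position") are an assumed convention that may differ from the one SpeechBrain uses internally.

```python
import torch


def padding_mask(seq, pad_idx=0):
    """True wherever the sequence holds the padding token."""
    return seq == pad_idx


def causal_mask(size):
    """True above the diagonal: position i must not attend to any j > i."""
    return torch.triu(torch.ones(size, size), diagonal=1).bool()


# Two target sequences padded with index 0 to a common length of 4.
tgt = torch.tensor([[5, 7, 9, 0],
                    [3, 4, 0, 0]])

tgt_key_padding_mask = padding_mask(tgt)   # shape: (batch, length)
tgt_mask = causal_mask(tgt.shape[1])       # shape: (length, length)

print(tgt_key_padding_mask)
print(tgt_mask)
```

The padding mask depends on the batch contents, while the causal mask depends only on the sequence length, which is why they are returned as separate tensors.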

decode(tgt, encoder_out, enc_lens)[source]

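This method implements the decoding step of the transformer model.

Parameters:

tgt (torch.Tensor) – the sequence input to the decoder.
encoder_out (torch.Tensor) – the hidden output of the encoder.
enc_lens (torch.Tensor) – the lengths corresponding to the encoder inputs.

Returns:

prediction (torch.Tensor) – the predicted sequence
attention (torch.Tensor) – the attention matrix corresponding to the last attention head (useful for plotting attention maps)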

speechbrain.lobes.models.g2p.model.input_dim(use_word_emb, embedding_dim, word_emb_enc_dim)[source]

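Computes the input dimension (intended for hparam files)

Parameters:

use_word_emb (bool) – whether to use word embeddings
embedding_dim (int) – the embedding dimension
word_emb_enc_dim (int) – the dimension of the encoded word embeddings

Returns:

input_dim – the input dimension

Return type:

int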
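A minimal sketch of what such a helper might compute is shown below. It assumes that, when word embeddings are enabled, the encoded word embedding is concatenated to the grapheme embedding, so the two dimensions add up. The function name `input_dim_sketch` is hypothetical and the logic is an assumption, not a copy of the library code.

```python
def input_dim_sketch(use_word_emb, embedding_dim, word_emb_enc_dim):
    """Assumed behaviour: add the word-embedding width only when it is used."""
    return embedding_dim + (word_emb_enc_dim if use_word_emb else 0)


# Without word embeddings the encoder sees only the grapheme embedding.
print(input_dim_sketch(False, 512, 256))
# With word embeddings the two feature vectors are assumed to be concatenated.
print(input_dim_sketch(True, 512, 256))
```

Keeping this arithmetic in a function lets a hyperparameter file derive the encoder input size from a single `use_word_emb` switch instead of hard-coding two values.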

speechbrain.lobes.models.g2p.model.get_dummy_phonemes(batch_size, device)[source]

Creates a dummy phoneme sequence

Parameters:

batch_size (int) – the batch size
device (str) – the target device

Returns:

result – the dummy phoneme sequence

Return type:

torch.Tensor
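One plausible shape for such a placeholder is a single token per batch item, as sketched below. The function name `dummy_phonemes_sketch`, the (batch_size, 1) shape, and the token index 0 are all assumptions made for illustration; the reference above only states that a tensor is returned.

```python
import torch


def dummy_phonemes_sketch(batch_size, device="cpu", token=0):
    """Assumed layout: one placeholder token index per batch item."""
    return torch.full((batch_size, 1), token, dtype=torch.long, device=device)


dummy = dummy_phonemes_sketch(4)
print(dummy.shape)
```

A placeholder like this is useful at inference time, when no reference phonemes exist but a forward method still expects a phoneme tensor argument.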