speechbrain.alignment.aligner module

Alignment code

Authors

Elena Rastorgueva 2020
Loren Lugosch 2020

Summary

Classes

HMMAligner

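This class calculates Viterbi alignments in the forward method.

Functions

`batch_log_matvecmul`	For each "matrix" and "vector" pair in the batch, performs matrix-vector multiplication in the log domain, i.e. logsumexp replaces addition and addition replaces multiplication.
`batch_log_maxvecmul`	Similar to batch_log_matvecmul, but takes a maximum instead of logsumexp.
`map_inds_to_intersect`	Converts two lists of phoneme indices drawn from different phoneme sets to a single shared phoneme set, so that comparing the resulting indices for equality yields the correct accuracy.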

Reference

class speechbrain.alignment.aligner.HMMAligner(states_per_phoneme=1, output_folder='', neg_inf=-100000.0, batch_reduction='none', input_len_norm=False, target_len_norm=False, lexicon_path=None)[source]

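Bases: Module

This class calculates Viterbi alignments in the forward method.

It also records alignments and creates batches of them for use in Viterbi training.

Parameters:

states_per_phoneme (int) – Number of hidden states to use per phoneme.
output_folder (str) – Folder in which alignments are stored when saved to disk. Not yet implemented.
neg_inf (float) – The float used to represent a negative-infinity log probability. Using -float("Inf") tends to cause numerical instability. A number more negative than -1e5 also sometimes gave errors when the genbmm library was used (currently not in use). (default: -1e5)
batch_reduction (string) – One of "none", "sum" or "mean". The batch-level reduction to apply to the loss calculated in the forward method.
input_len_norm (bool) – Whether to normalize the loss in the forward method by the length of the inputs.
target_len_norm (bool) – Whether to normalize the loss in the forward method by the length of the targets.
lexicon_path (string) – The location of the lexicon.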
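The note on neg_inf can be illustrated with a small pure-Python sketch. The helper naive_logsumexp below is hypothetical and hand-written for illustration; it is not the routine the library uses. It only shows one common way a true negative infinity can turn into NaN, while a large finite value such as -1e5 does not.

```python
import math

def naive_logsumexp(values):
    """Log-sum-exp with the usual max-subtraction trick (illustrative only)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

# Two "impossible" transitions, represented in two different ways.
with_true_inf = naive_logsumexp([-math.inf, -math.inf])
with_finite_floor = naive_logsumexp([-1e5, -1e5])

print(with_true_inf)      # nan, because (-inf) - (-inf) is undefined
print(with_finite_floor)  # roughly -1e5 + log(2), still a finite number
```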

Example

>>> log_posteriors = torch.tensor([[[ -1., -10., -10.],
...                                 [-10.,  -1., -10.],
...                                 [-10., -10.,  -1.]],
...
...                                [[ -1., -10., -10.],
...                                 [-10.,  -1., -10.],
...                                 [-10., -10., -10.]]])
>>> lens = torch.tensor([1., 0.66])
>>> phns = torch.tensor([[0, 1, 2],
...                      [0, 1, 0]])
>>> phn_lens = torch.tensor([1., 0.66])
>>> aligner = HMMAligner()
>>> forward_scores = aligner(
...        log_posteriors, lens, phns, phn_lens, 'forward'
... )
>>> forward_scores.shape
torch.Size([2])
>>> viterbi_scores, alignments = aligner(
...        log_posteriors, lens, phns, phn_lens, 'viterbi'
... )
>>> alignments
[[0, 1, 2], [0, 1]]
>>> viterbi_scores.shape
torch.Size([2])
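In the example above, lens and phn_lens are relative lengths, i.e. fractions of the padded dimension. The sketch below shows how such fractions could map to absolute counts; the helper name and the use of rounding are assumptions made for illustration. With this reading, 0.66 of 3 frames gives 2 frames, which is consistent with the second alignment having two entries.

```python
def relative_to_absolute(rel_lens, max_len):
    """Convert relative lengths (fractions of the padded length) to counts."""
    # Rounding to the nearest integer is an assumption of this sketch.
    return [int(round(r * max_len)) for r in rel_lens]

frames = relative_to_absolute([1.0, 0.66], max_len=3)
print(frames)  # [3, 2]
```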

use_lexicon(words, interword_sils=True, sample_pron=False)[source]

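Uses the lexicon to return a sequence of the possible phonemes, the transition/pi probabilities, and the possible final states. Processing is done utterance by utterance: each utterance in the batch is handled by the helper method _use_lexicon.

Parameters:

words (list) – List of the words in the transcript.
interword_sils (bool) – If True, optional silences are inserted between every pair of words. If False, optional silences are placed only at the beginning and end of each utterance.
sample_pron (bool) – If True, samples a single possible sequence of phonemes. If False, returns statistics for all possible sequences of phonemes.

Returns:

poss_phns (torch.Tensor (batch, phoneme in possible phn sequence)) – The phonemes that are thought to be in each utterance.
poss_phn_lens (torch.Tensor (batch)) – The relative length of each possible phoneme sequence in the batch.
trans_prob (torch.Tensor (batch, from, to)) – Tensor containing transition (log) probabilities.
pi_prob (torch.Tensor (batch, state)) – Tensor containing initial (log) probabilities.
final_state (list of lists of ints) – A list of lists of possible final states for each utterance.

Example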

>>> aligner = HMMAligner()
>>> aligner.lexicon = {
...                     "a": {0: "a"},
...                     "b": {0: "b", 1: "c"}
...                   }
>>> words = [["a", "b"]]
>>> aligner.lex_lab2ind = {
...                   "sil": 0,
...                   "a":  1,
...                   "b":  2,
...                   "c":  3,
...                 }
>>> poss_phns, poss_phn_lens, trans_prob, pi_prob, final_states = aligner.use_lexicon(
...     words,
...     interword_sils = True
... )
>>> poss_phns
tensor([[0, 1, 0, 2, 3, 0]])
>>> poss_phn_lens
tensor([1.])
>>> trans_prob
tensor([[[-6.9315e-01, -6.9315e-01, -1.0000e+05, -1.0000e+05, -1.0000e+05,
          -1.0000e+05],
         [-1.0000e+05, -1.3863e+00, -1.3863e+00, -1.3863e+00, -1.3863e+00,
          -1.0000e+05],
         [-1.0000e+05, -1.0000e+05, -1.0986e+00, -1.0986e+00, -1.0986e+00,
          -1.0000e+05],
         [-1.0000e+05, -1.0000e+05, -1.0000e+05, -6.9315e-01, -1.0000e+05,
          -6.9315e-01],
         [-1.0000e+05, -1.0000e+05, -1.0000e+05, -1.0000e+05, -6.9315e-01,
          -6.9315e-01],
         [-1.0000e+05, -1.0000e+05, -1.0000e+05, -1.0000e+05, -1.0000e+05,
           0.0000e+00]]])
>>> pi_prob
tensor([[-6.9315e-01, -6.9315e-01, -1.0000e+05, -1.0000e+05, -1.0000e+05,
         -1.0000e+05]])
>>> final_states
[[3, 4, 5]]
>>> # With no optional silences between words
>>> poss_phns_, _, trans_prob_, pi_prob_, final_states_ = aligner.use_lexicon(
...     words,
...     interword_sils = False
... )
>>> poss_phns_
tensor([[0, 1, 2, 3, 0]])
>>> trans_prob_
tensor([[[-6.9315e-01, -6.9315e-01, -1.0000e+05, -1.0000e+05, -1.0000e+05],
         [-1.0000e+05, -1.0986e+00, -1.0986e+00, -1.0986e+00, -1.0000e+05],
         [-1.0000e+05, -1.0000e+05, -6.9315e-01, -1.0000e+05, -6.9315e-01],
         [-1.0000e+05, -1.0000e+05, -1.0000e+05, -6.9315e-01, -6.9315e-01],
         [-1.0000e+05, -1.0000e+05, -1.0000e+05, -1.0000e+05,  0.0000e+00]]])
>>> pi_prob_
tensor([[-6.9315e-01, -6.9315e-01, -1.0000e+05, -1.0000e+05, -1.0000e+05]])
>>> final_states_
[[2, 3, 4]]
>>> # With sampling of a single possible pronunciation
>>> import random
>>> random.seed(0)
>>> poss_phns_, _, trans_prob_, pi_prob_, final_states_ = aligner.use_lexicon(
...     words,
...     sample_pron = True
... )
>>> poss_phns_
tensor([[0, 1, 0, 2, 0]])
>>> trans_prob_
tensor([[[-6.9315e-01, -6.9315e-01, -1.0000e+05, -1.0000e+05, -1.0000e+05],
         [-1.0000e+05, -1.0986e+00, -1.0986e+00, -1.0986e+00, -1.0000e+05],
         [-1.0000e+05, -1.0000e+05, -6.9315e-01, -6.9315e-01, -1.0000e+05],
         [-1.0000e+05, -1.0000e+05, -1.0000e+05, -6.9315e-01, -6.9315e-01],
         [-1.0000e+05, -1.0000e+05, -1.0000e+05, -1.0000e+05,  0.0000e+00]]])
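The finite entries in the trans_prob printouts above are consistent with a uniform distribution over the transitions allowed from each state: -0.6931 is log(1/2), -1.0986 is log(1/3) and -1.3863 is log(1/4). The sketch below rebuilds the first matrix under that assumption. The helper log_uniform_rows and the allowed list are hypothetical, read off the printed output rather than taken from the library code.

```python
import math

NEG_INF = -1e5  # finite stand-in for log(0), matching the default neg_inf

def log_uniform_rows(allowed):
    """allowed[i] lists the states reachable from state i (self-loop included)."""
    n = len(allowed)
    rows = []
    for targets in allowed:
        row = [NEG_INF] * n
        for j in targets:
            row[j] = math.log(1.0 / len(targets))
        rows.append(row)
    return rows

# Allowed transitions read off the first trans_prob printout
# (states: sil, a, sil, b, c, sil).
allowed = [[0, 1], [1, 2, 3, 4], [2, 3, 4], [3, 5], [4, 5], [5]]
trans = log_uniform_rows(allowed)
print([round(v, 4) for v in trans[1]])
```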

forward(emission_pred, lens, phns, phn_lens, dp_algorithm, prob_matrices=None)[source]

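Prepares the relevant (log) probability tensors and runs dynamic programming: either the forward algorithm or the Viterbi algorithm. Applies the reduction specified when the object was initialized.

Parameters:

emission_pred (torch.Tensor (batch, time, phoneme in vocabulary)) – Posterior probabilities from the acoustic model.
lens (torch.Tensor (batch)) – The relative duration of each utterance's sound file.
phns (torch.Tensor (batch, phoneme in phn sequence)) – The phonemes that are known/thought to be in each utterance.
phn_lens (torch.Tensor (batch)) – The relative length of each phoneme sequence in the batch.
dp_algorithm (string) – Either "forward" or "viterbi".
prob_matrices (dict) – (Optional) Must contain the keys 'trans_prob', 'pi_prob' and 'final_states'. Used to override the default forward and Viterbi operations, which force traversal of all the states in the phns sequence.

Returns:

If dp_algorithm == "forward":

forward_scores : torch.Tensor (batch, or scalar)

The (log) likelihood of each utterance in the batch, with reduction applied if specified.

If dp_algorithm == "viterbi":

viterbi_scores : torch.Tensor (batch, or scalar)

The (log) likelihood of the Viterbi path for each utterance, with reduction applied if specified.

alignments : list of lists of int

Viterbi alignments for the files in the batch.

Return type:

tensor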
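The difference between the two dp_algorithm settings can be sketched in pure Python for a single utterance. This is a minimal illustration under stated assumptions, not the library's batched implementation: the function forward_and_viterbi is hypothetical, emissions are assumed to be already gathered per state, and the left-to-right transition matrix with equal stay/advance probabilities is made up for the example.

```python
import math

NEG_INF = -1e5  # finite stand-in for log(0)

def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def forward_and_viterbi(log_pi, log_trans, log_emit, final_states):
    """log_emit[t][s] is the log posterior of state s at frame t."""
    n = len(log_pi)
    alpha = [log_pi[s] + log_emit[0][s] for s in range(n)]  # sums over paths
    delta = list(alpha)                                      # best single path
    backpointers = []
    for frame in log_emit[1:]:
        alpha = [
            logsumexp([alpha[i] + log_trans[i][j] for i in range(n)]) + frame[j]
            for j in range(n)
        ]
        best_prev = [
            max(range(n), key=lambda i, j=j: delta[i] + log_trans[i][j])
            for j in range(n)
        ]
        delta = [
            delta[best_prev[j]] + log_trans[best_prev[j]][j] + frame[j]
            for j in range(n)
        ]
        backpointers.append(best_prev)
    forward_score = logsumexp([alpha[s] for s in final_states])
    state = max(final_states, key=lambda s: delta[s])
    viterbi_score = delta[state]
    path = [state]
    for best_prev in reversed(backpointers):  # trace the best path backwards
        state = best_prev[state]
        path.append(state)
    return forward_score, viterbi_score, path[::-1]

half = math.log(0.5)
log_pi = [0.0, NEG_INF, NEG_INF]
log_trans = [[half, half, NEG_INF], [NEG_INF, half, half], [NEG_INF, NEG_INF, 0.0]]
log_emit = [[-1.0, -10.0, -10.0], [-10.0, -1.0, -10.0], [-10.0, -10.0, -1.0]]
fwd, vit, path = forward_and_viterbi(log_pi, log_trans, log_emit, final_states=[2])
print(path)  # [0, 1, 2]
```

The forward score sums over every path that ends in a final state, so it can never be lower than the Viterbi score, which keeps only the single best path.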

expand_phns_by_states_per_phoneme(phns, phn_lens)[source]

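Expands each phoneme in the phn sequence by the number of hidden states per phoneme defined in the HMM.

Parameters:

phns (torch.Tensor (batch, phoneme in phn sequence)) – The phonemes that are known/thought to be in each utterance.
phn_lens (torch.Tensor (batch)) – The relative length of each phoneme sequence in the batch.

Returns:

expanded_phns

Return type:

torch.Tensor (batch, phoneme in expanded phn sequence)

Example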

>>> phns = torch.tensor([[0., 3., 5., 0.],
...                      [0., 2., 0., 0.]])
>>> phn_lens = torch.tensor([1., 0.75])
>>> aligner = HMMAligner(states_per_phoneme = 3)
>>> expanded_phns = aligner.expand_phns_by_states_per_phoneme(
...         phns, phn_lens
... )
>>> expanded_phns
tensor([[ 0.,  1.,  2.,  9., 10., 11., 15., 16., 17.,  0.,  1.,  2.],
        [ 0.,  1.,  2.,  6.,  7.,  8.,  0.,  1.,  2.,  0.,  0.,  0.]])
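The output above suggests a simple rule: phoneme p becomes the states p * S, p * S + 1, ..., p * S + S - 1, where S is states_per_phoneme, and padded positions stay at 0. The sketch below reproduces both rows of the example for one sequence at a time. The helper expand_phns and its n_phns argument (the unpadded length) are hypothetical, and the rule is inferred from the printed output rather than from the library source.

```python
def expand_phns(phns, n_phns, states_per_phoneme):
    """Expand the first n_phns entries; the rest is padding and stays 0."""
    expanded = []
    for position, phn in enumerate(phns):
        for k in range(states_per_phoneme):
            if position < n_phns:
                expanded.append(int(phn) * states_per_phoneme + k)
            else:
                expanded.append(0)
    return expanded

print(expand_phns([0, 3, 5, 0], n_phns=4, states_per_phoneme=3))
print(expand_phns([0, 2, 0, 0], n_phns=3, states_per_phoneme=3))
```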

store_alignments(ids, alignments)[source]

Records Viterbi alignments in self.align_dict.

Parameters:

ids (list of str) – IDs of the files in the batch.
alignments (list of lists of int) – Viterbi alignments for the files in the batch, without padding.

Example

>>> aligner = HMMAligner()
>>> ids = ['id1', 'id2']
>>> alignments = [[0, 2, 4], [1, 2, 3, 4]]
>>> aligner.store_alignments(ids, alignments)
>>> aligner.align_dict.keys()
dict_keys(['id1', 'id2'])
>>> aligner.align_dict['id1']
tensor([0, 2, 4], dtype=torch.int16)

get_prev_alignments(ids, emission_pred, lens, phns, phn_lens)[source]

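Fetches previously recorded Viterbi alignments if they are available. If not, fetches flat start alignments. Currently assumes that if no Viterbi alignment is available for the first utterance in the batch, none is available for the remaining utterances either.

Parameters:

ids (list of str) – IDs of the files in the batch.
emission_pred (torch.Tensor (batch, time, phoneme in vocabulary)) – Posterior probabilities from the acoustic model. Used to infer the duration of the longest utterance in the batch.
lens (torch.Tensor (batch)) – The relative duration of each utterance's sound file.
phns (torch.Tensor (batch, phoneme in phn sequence)) – The phonemes that are known/thought to be in each utterance.
phn_lens (torch.Tensor (batch)) – The relative length of each phoneme sequence in the batch.

Returns:

Zero-padded alignments.

Return type:

torch.Tensor (batch, time)

Example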

>>> ids = ['id1', 'id2']
>>> emission_pred = torch.tensor([[[ -1., -10., -10.],
...                                [-10.,  -1., -10.],
...                                [-10., -10.,  -1.]],
...
...                               [[ -1., -10., -10.],
...                                [-10.,  -1., -10.],
...                                [-10., -10., -10.]]])
>>> lens = torch.tensor([1., 0.66])
>>> phns = torch.tensor([[0, 1, 2],
...                      [0, 1, 0]])
>>> phn_lens = torch.tensor([1., 0.66])
>>> aligner = HMMAligner()
>>> alignment_batch = aligner.get_prev_alignments(
...        ids, emission_pred, lens, phns, phn_lens
... )
>>> alignment_batch
tensor([[0, 1, 2],
        [0, 1, 0]])
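A flat start alignment divides the frames of an utterance roughly evenly among its states when no better alignment is known yet. The sketch below is one plausible scheme, assuming an even integer split and treating alignment entries as positions in the state sequence; the library may distribute frames differently. The helper flat_start is hypothetical. It does reproduce the two rows printed above.

```python
def flat_start(n_frames, n_states, padded_len):
    """Spread n_states evenly over n_frames, then zero-pad to padded_len."""
    # Even integer split: an assumption of this sketch, not the library's rule.
    alignment = [(t * n_states) // n_frames for t in range(n_frames)]
    return alignment + [0] * (padded_len - n_frames)

print(flat_start(n_frames=3, n_states=3, padded_len=3))  # [0, 1, 2]
print(flat_start(n_frames=2, n_states=2, padded_len=3))  # [0, 1, 0]
print(flat_start(n_frames=6, n_states=3, padded_len=6))  # [0, 0, 1, 1, 2, 2]
```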

calc_accuracy(alignments, ends, phns, ind2labs=None)[source]

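Calculates the mean accuracy between predicted alignments and ground truth alignments. Ground truth alignments are derived from the ground truth phonemes and their end positions in the audio sample.

Parameters:

alignments (list of lists of ints/floats) – The predicted alignments for each utterance in the batch.
ends (list of lists of ints) – A list of lists of sample indices at which each ground truth phoneme ends, according to the transcription. Note: the current implementation assumes that "ends" mark the index where the next phoneme begins.
phns (list of lists of ints/floats) – The unpadded list of lists of ground truth phonemes in the batch.
ind2labs (tuple) – (Optional) Contains the original index-to-label dicts for the first and second sequence of phonemes.

Returns:

mean_acc – The mean percentage of times that the upsampled predicted alignment matches the ground truth alignment.

Return type:

float

Example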

>>> aligner = HMMAligner()
>>> alignments = [[0., 0., 0., 1.]]
>>> phns = [[0., 1.]]
>>> ends = [[2, 4]]
>>> mean_acc = aligner.calc_accuracy(alignments, ends, phns)
>>> mean_acc.item()
75.0
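The 75.0 in the example can be traced by hand: with phns [0, 1] and ends [2, 4], the ground truth alignment is [0, 0, 1, 1], and the prediction [0, 0, 0, 1] agrees with it at three of four positions. The sketch below follows that calculation for a single utterance. The helper alignment_accuracy is hypothetical, and the nearest-neighbour upsampling step is an assumption about how a shorter prediction would be stretched to the ground truth length.

```python
def alignment_accuracy(alignment, ends, phns):
    """Percentage of samples where the prediction matches the ground truth."""
    # Ground truth: phoneme k covers samples [previous end, ends[k]).
    truth, start = [], 0
    for phn, end in zip(phns, ends):
        truth.extend([phn] * (end - start))
        start = end
    # Nearest-neighbour upsampling of the prediction to the truth's length
    # (an assumption of this sketch).
    n_pred, n_true = len(alignment), len(truth)
    upsampled = [alignment[(i * n_pred) // n_true] for i in range(n_true)]
    matches = sum(p == t for p, t in zip(upsampled, truth))
    return 100.0 * matches / n_true

print(alignment_accuracy([0, 0, 0, 1], ends=[2, 4], phns=[0, 1]))  # 75.0
```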

collapse_alignments(alignments)[source]

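Converts alignments to a one-state-per-phoneme style.

Parameters:

alignments (list of ints) – Predicted alignments for a single utterance.

Returns:

sequence – The predicted alignments converted to a one-state-per-phoneme style.

Return type:

list of ints

Example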

>>> aligner = HMMAligner(states_per_phoneme = 3)
>>> alignments = [0, 1, 2, 3, 4, 5, 3, 4, 5, 0, 1, 2]
>>> sequence = aligner.collapse_alignments(alignments)
>>> sequence
[0, 1, 1, 0]
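One reading that reproduces the output above takes three steps: drop consecutive repeats of the same state, keep only the first state of each phoneme (indices divisible by states_per_phoneme), and divide by states_per_phoneme. The sketch below implements that reading. The helper collapse is hypothetical and may differ from the library's code on inputs not covered by the example.

```python
def collapse(alignment, states_per_phoneme):
    # 1. Drop consecutive repeats of the same state.
    deduped = [
        s for i, s in enumerate(alignment) if i == 0 or s != alignment[i - 1]
    ]
    # 2. Keep only the first state of each phoneme.
    firsts = [s for s in deduped if s % states_per_phoneme == 0]
    # 3. Map state indices back to phoneme indices.
    return [s // states_per_phoneme for s in firsts]

print(collapse([0, 1, 2, 3, 4, 5, 3, 4, 5, 0, 1, 2], 3))  # [0, 1, 1, 0]
```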

speechbrain.alignment.aligner.map_inds_to_intersect(lists1, lists2, ind2labs)[source]

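Converts two lists of phoneme indices drawn from different phoneme sets to a single shared phoneme set, so that comparing the resulting indices for equality yields the correct accuracy.

Parameters:

lists1 (list of lists of ints) – Contains the indices of the first sequence of phonemes.
lists2 (list of lists of ints) – Contains the indices of the second sequence of phonemes.
ind2labs (tuple (dict, dict)) – Contains the original index-to-label dicts for the first and second sequence of phonemes.

Returns:

lists1_new (list of lists of ints) – Contains the indices of the first sequence of phonemes, mapped to the new phoneme set.
lists2_new (list of lists of ints) – Contains the indices of the second sequence of phonemes, mapped to the new phoneme set.

Example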

>>> lists1 = [[0, 1]]
>>> lists2 = [[0, 1]]
>>> ind2lab1 = {
...        0: "a",
...        1: "b",
...        }
>>> ind2lab2 = {
...        0: "a",
...        1: "c",
...        }
>>> ind2labs = (ind2lab1, ind2lab2)
>>> out1, out2 = map_inds_to_intersect(lists1, lists2, ind2labs)
>>> out1
[[0, 1]]
>>> out2
[[0, 2]]
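In the example, index 1 means "b" in the first set and "c" in the second, so the raw indices cannot be compared directly. After remapping, "a" is 0 in both lists while "b" and "c" get distinct indices. The sketch below shows one way to build such a shared numbering. The helper map_to_shared_set is hypothetical, and the ordering (shared labels first, then labels unique to each set, each group sorted) is an assumption chosen because it reproduces the output above.

```python
def map_to_shared_set(lists1, lists2, ind2labs):
    ind2lab1, ind2lab2 = ind2labs
    labs1, labs2 = set(ind2lab1.values()), set(ind2lab2.values())
    # Shared labels first, then labels unique to each set; sorted so the
    # numbering is deterministic (the ordering is an assumption).
    ordered = sorted(labs1 & labs2) + sorted(labs1 - labs2) + sorted(labs2 - labs1)
    lab2new = {lab: i for i, lab in enumerate(ordered)}
    new1 = [[lab2new[ind2lab1[i]] for i in seq] for seq in lists1]
    new2 = [[lab2new[ind2lab2[i]] for i in seq] for seq in lists2]
    return new1, new2

out1, out2 = map_to_shared_set(
    [[0, 1]], [[0, 1]], ({0: "a", 1: "b"}, {0: "a", 1: "c"})
)
print(out1, out2)  # [[0, 1]] [[0, 2]]
```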

speechbrain.alignment.aligner.batch_log_matvecmul(A, b)[source]

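For each "matrix" and "vector" pair in the batch, performs matrix-vector multiplication in the log domain, i.e. logsumexp replaces addition and addition replaces multiplication.

Parameters:

A (torch.Tensor (batch, dim1, dim2)) – Batch of matrices with entries in the log domain.
b (torch.Tensor (batch, dim1)) – Batch of vectors with entries in the log domain.

Returns:

x

Return type:

torch.Tensor (batch, dim1)

Example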

>>> A = torch.tensor([[[   0., 0.],
...                    [ -1e5, 0.]]])
>>> b = torch.tensor([[0., 0.,]])
>>> x = batch_log_matvecmul(A, b)
>>> x
tensor([[0.6931, 0.0000]])
>>>
>>> # non-log domain equivalent without batching functionality
>>> A_ = torch.tensor([[1., 1.],
...                    [0., 1.]])
>>> b_ = torch.tensor([1., 1.,])
>>> x_ = torch.matmul(A_, b_)
>>> x_
tensor([2., 1.])
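For a single matrix and vector, the log-domain product can be written out in pure Python as x[i] = logsumexp over j of (A[i][j] + b[j]). The sketch below is an unbatched illustration with a hypothetical helper log_matvecmul, not the library's tensor implementation. It reproduces the values of the example above.

```python
import math

def log_matvecmul(A, b):
    """x[i] = logsumexp_j(A[i][j] + b[j]) for one matrix/vector pair."""
    out = []
    for row in A:
        terms = [a + v for a, v in zip(row, b)]  # add replaces multiply
        m = max(terms)
        # logsumexp replaces the sum over j
        out.append(m + math.log(sum(math.exp(t - m) for t in terms)))
    return out

x = log_matvecmul([[0.0, 0.0], [-1e5, 0.0]], [0.0, 0.0])
print([round(v, 4) for v in x])  # [0.6931, 0.0]
```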

speechbrain.alignment.aligner.batch_log_maxvecmul(A, b)[source]

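Similar to batch_log_matvecmul, but takes a maximum instead of logsumexp. Returns both the max and the argmax.

Parameters:

A (torch.Tensor (batch, dim1, dim2)) – Batch of matrices with entries in the log domain.
b (torch.Tensor (batch, dim1)) – Batch of vectors with entries in the log domain.

Returns:

x (torch.Tensor (batch, dim1)) – The maximum for each row.
argmax (torch.Tensor (batch, dim1)) – The index at which each maximum is attained.

Example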

>>> A = torch.tensor([[[   0., -1.],
...                    [ -1e5,  0.]]])
>>> b = torch.tensor([[0., 0.,]])
>>> x, argmax = batch_log_maxvecmul(A, b)
>>> x
tensor([[0., 0.]])
>>> argmax
tensor([[0, 1]])
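The max variant is the building block of the Viterbi recursion: the max gives the best score and the argmax gives the backpointer. The unbatched sketch below, with a hypothetical helper log_maxvecmul, computes x[i] = max over j of (A[i][j] + b[j]) together with the index j that attains it. It reproduces the values of the example above.

```python
def log_maxvecmul(A, b):
    """x[i] = max_j(A[i][j] + b[j]), plus the j that attains it."""
    maxima, argmaxima = [], []
    for row in A:
        terms = [a + v for a, v in zip(row, b)]
        best = max(range(len(terms)), key=terms.__getitem__)
        maxima.append(terms[best])
        argmaxima.append(best)
    return maxima, argmaxima

x, argmax = log_maxvecmul([[0.0, -1.0], [-1e5, 0.0]], [0.0, 0.0])
print(x, argmax)  # [0.0, 0.0] [0, 1]
```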