speechbrain.dataio.encoder module

Encoding categorical data as integers

Authors

Samuele Cornell 2020
Aku Rouhe 2020

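Summary

Classes

`CTCTextEncoder`	Subclass of TextEncoder that also provides methods to handle the CTC blank token.
`CategoricalEncoder`	Encode labels of a discrete set.
`TextEncoder`	CategoricalEncoder subclass which offers specific methods for encoding text and handling special tokens for training sequence-to-sequence models.

Functions

`load_text_encoder_tokens`	Loads the encoder tokens from a pretrained model.

Reference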

class speechbrain.dataio.encoder.CategoricalEncoder(starting_index=0, **special_labels)[source]

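Bases: object

Encode labels of a discrete set.

Used for encoding, e.g., speaker identities in speaker recognition. Given a collection of hashables (e.g. strings), it encodes every unique item to an integer value: ["spk0", "spk1"] --> [0, 1]. Internally, the correspondence between each label and its index is handled by two dictionaries: lab2ind and ind2lab.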
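
The two-dictionary bookkeeping described above can be sketched in plain Python. This is only a minimal illustration of the idea, not SpeechBrain's implementation; the helper name `build_mappings` is hypothetical.

```python
def build_mappings(labels, starting_index=0):
    """Assign consecutive integers to unique labels, in order of first appearance."""
    lab2ind = {}  # label -> integer index
    ind2lab = {}  # integer index -> label
    for label in labels:
        if label not in lab2ind:
            index = starting_index + len(lab2ind)
            lab2ind[label] = index
            ind2lab[index] = label
    return lab2ind, ind2lab


# The repeated "spk0" does not create a second entry.
lab2ind, ind2lab = build_mappings(["spk0", "spk1", "spk0"])
print(lab2ind)  # {'spk0': 0, 'spk1': 1}
print(ind2lab)  # {0: 'spk0', 1: 'spk1'}
```

Keeping both directions makes encoding and decoding a single dictionary lookup each.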

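The label integer encoding can be generated automatically from a SpeechBrain DynamicItemDataset by specifying the desired entry (e.g., spkid) in the annotation and calling the update_from_didataset method: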

>>> from speechbrain.dataio.encoder import CategoricalEncoder
>>> from speechbrain.dataio.dataset import DynamicItemDataset
>>> dataset = {"ex_{}".format(x) : {"spkid" : "spk{}".format(x)} for x in range(20)}
>>> dataset = DynamicItemDataset(dataset)
>>> encoder = CategoricalEncoder()
>>> encoder.update_from_didataset(dataset, "spkid")
>>> assert len(encoder) == len(dataset) # different speaker for each utterance

However, it can also be updated from an iterable:

>>> from speechbrain.dataio.encoder import CategoricalEncoder
>>> from speechbrain.dataio.dataset import DynamicItemDataset
>>> dataset = ["spk{}".format(x) for x in range(20)]
>>> encoder = CategoricalEncoder()
>>> encoder.update_from_iterable(dataset)
>>> assert len(encoder) == len(dataset)

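Note

In both methods, you can specify whether each single element of the iterable or dataset should be treated as a sequence (default False). If it is a sequence, each element of the sequence is encoded.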

>>> from speechbrain.dataio.encoder import CategoricalEncoder
>>> from speechbrain.dataio.dataset import DynamicItemDataset
>>> dataset = [[x+1, x+2] for x in range(20)]
>>> encoder = CategoricalEncoder()
>>> encoder.ignore_len()
>>> encoder.update_from_iterable(dataset, sequence_input=True)
>>> assert len(encoder) == 21 # there are only 21 unique elements 1-21

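This class offers 4 different methods to explicitly add a label to the internal dicts: add_label, ensure_label, insert_label, enforce_label. add_label and insert_label raise an error if the label is already present in the internal dicts. insert_label and enforce_label also allow specifying the integer value to which the label is encoded.

Encoding can be performed using 4 different methods: encode_label, encode_sequence, encode_label_torch and encode_sequence_torch. encode_label operates on a single label and simply returns the corresponding integer encoding: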

>>> from speechbrain.dataio.encoder import CategoricalEncoder
>>> from speechbrain.dataio.dataset import DynamicItemDataset
>>> dataset = ["spk{}".format(x) for x in range(20)]
>>> # encoder is reused from the previous example and already holds 21 labels
>>> encoder.update_from_iterable(dataset)
>>> encoder.encode_label("spk1")
22
>>> # encode_sequence on sequences of labels:
>>> encoder.encode_sequence(["spk1", "spk19"])
[22, 40]
>>> # encode_label_torch and encode_sequence_torch return torch tensors
>>> encoder.encode_sequence_torch(["spk1", "spk19"])
tensor([22, 40])
>>> # Decoding can be performed using decode_torch and decode_ndim methods.
>>> encoded = encoder.encode_sequence_torch(["spk1", "spk19"])
>>> encoder.decode_torch(encoded)
['spk1', 'spk19']
>>> # decode_ndim handles multidimensional lists or torch tensors;
>>> # decode_torch also accepts multidimensional tensors:
>>> encoded = encoded.unsqueeze(0).repeat(3, 1)
>>> encoder.decode_torch(encoded)
[['spk1', 'spk19'], ['spk1', 'spk19'], ['spk1', 'spk19']]

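In some applications, labels that were not encountered during training may appear at test time. To handle this out-of-vocabulary problem, add_unk can be used. Every out-of-vocabulary label is mapped to this special <unk> label and its corresponding integer encoding.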

>>> import torch
>>> try:
...     encoder.encode_label("spk42")
... except KeyError:
...     print("spk42 is not in the encoder this raises an error!")
spk42 is not in the encoder this raises an error!
>>> encoder.add_unk()
41
>>> encoder.encode_label("spk42")
41
>>> # returns the <unk> encoding

This class also offers methods to save and load the internal mappings between labels and tokens: save and load, as well as load_or_create.

VALUE_SEPARATOR = ' => '

EXTRAS_SEPARATOR = '================\n'

handle_special_labels(special_labels)[source]: 处理特殊标签，例如 unk_label。

classmethod from_saved(path)[source]: 直接重新创建先前保存的编码器

update_from_iterable(iterable, sequence_input=False)[source]

从迭代器更新

参数:

iterable (iterable) – 要操作的输入序列。
sequence_input (bool) – 可迭代对象是生成标签序列还是直接生成单个标签。（默认为 False）

update_from_didataset(didataset, output_key, sequence_input=False)[source]

从 DynamicItemDataset 更新。

参数:

didataset (DynamicItemDataset) – 要操作的数据集。
output_key (str) – 数据集（数据或动态项中）中要编码的键。
sequence_input (bool) – 指定键产生的数据是由标签序列组成还是直接由单个标签组成。

limited_labelset_from_iterable(iterable, sequence_input=False, n_most_common=None, min_count=1)[source]

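Produce a label mapping from an iterable based on label counts.

Used to limit the label set size.

Parameters:

iterable (iterable) – Input sequence on which to operate.
sequence_input (bool) – Whether the iterable yields sequences of labels or individual labels directly. False by default.
n_most_common (int, None) – Take at most this many labels as the label set, keeping the most common ones. If None (the default), take all.
min_count (int) – Don't take labels that appear fewer than this many times.

Returns:

The counts of the different labels (unfiltered).

Return type: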

collections.Counter
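
The interplay of n_most_common and min_count can be sketched with collections.Counter. This is a minimal illustration under the assumption that both filters are applied to the same frequency-sorted list; the helper `limited_labelset` is hypothetical and returns the kept labels alongside the unfiltered counts.

```python
from collections import Counter


def limited_labelset(labels, n_most_common=None, min_count=1):
    """Return (kept_labels, counts): the filtered label set and the unfiltered counts."""
    counts = Counter(labels)
    kept = [
        label
        for label, count in counts.most_common(n_most_common)  # None means "all"
        if count >= min_count
    ]
    return kept, counts


kept, counts = limited_labelset(["a", "b", "a", "c", "a", "b"], n_most_common=2)
print(kept)         # ['a', 'b']
print(counts["c"])  # 1
```

The returned counts still include the rare label "c", which matches the "unfiltered" wording of the return value above.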

load_or_create(path, from_iterables=[], from_didatasets=[], sequence_input=False, output_key=None, special_labels={})[source]

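Convenient syntax for creating the encoder conditionally.

This pattern would be repeated in so many experiments that we decided to add a convenient shortcut for it here. The current version is multi-GPU (DDP) safe.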
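
The "load if it exists, otherwise create and save" pattern that this shortcut wraps can be sketched generically. The example below uses a plain dictionary and a JSON file instead of the encoder, does not reproduce the DDP handling, and the helper name `load_or_create_mapping` is hypothetical.

```python
import json
import os
import tempfile


def load_or_create_mapping(path, labels):
    """Load a label -> index mapping from `path` if it exists, else build and save it."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    # dict.fromkeys removes duplicates while keeping first-appearance order.
    mapping = {label: index for index, label in enumerate(dict.fromkeys(labels))}
    with open(path, "w") as f:
        json.dump(mapping, f)
    return mapping


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "mapping.json")
    first = load_or_create_mapping(path, ["spk0", "spk1"])  # created and saved
    second = load_or_create_mapping(path, ["ignored"])      # loaded from disk
print(first == second)  # True
```

On the second call the labels argument is never consulted, which is the point of the pattern: the first run fixes the encoding and later runs reuse it.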

add_label(label)[source]

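Add a new label to the encoder at the next free position.

Parameters: label (hashable) – Most often labels are str, but anything that can act as a dict key is supported. Note that default save/load only supports Python literals.
Returns: The index that was used to encode this label.
Return type: int

ensure_label(label)[source]

Add a label if it is not already present.

Parameters: label (hashable) – Most often labels are str, but anything that can act as a dict key is supported. Note that default save/load only supports Python literals.
Returns: The index that was used to encode this label.
Return type: int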

insert_label(label, index)[source]

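Add a new label, forcing its index to a specific value.

If a label already has the specified index, it is moved to the end of the mapping.

Parameters:

label (hashable) – Most often labels are str, but anything that can act as a dict key is supported. Note that default save/load only supports Python literals.
index (int) – The specific index to use.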
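
The displacement behaviour described above can be sketched on two plain dictionaries. This is an illustration only: it assumes that "end of the mapping" means one past the largest index currently in use, and the standalone function below is hypothetical rather than the library method.

```python
def insert_label(lab2ind, ind2lab, label, index):
    """Place `label` at `index`; a label already there moves to a new last index."""
    if label in lab2ind:
        raise KeyError(f"Label already present: {label!r}")
    if index in ind2lab:
        displaced = ind2lab[index]
        free_index = max(ind2lab) + 1  # assumed meaning of "end of the mapping"
        lab2ind[displaced] = free_index
        ind2lab[free_index] = displaced
    lab2ind[label] = index
    ind2lab[index] = label


lab2ind = {"a": 0, "b": 1, "c": 2}
ind2lab = {0: "a", 1: "b", 2: "c"}
insert_label(lab2ind, ind2lab, "<bos>", 0)
print(lab2ind)  # {'a': 3, 'b': 1, 'c': 2, '<bos>': 0}
```

Only the displaced label changes its encoding; every other label keeps the index it already had.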

enforce_label(label, index)[source]

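Make sure the label is present and encoded to a particular index.

If the label is present but encoded to some other index, it is moved to the given index.

If there is already another label at the given index, that label is moved to the next free position.

add_unk(unk_label='<unk>')[source]

Add a label for unknown (out-of-vocabulary) tokens.

When asked to encode unknown labels, they can be mapped to this.

Parameters: unk_label (hashable, optional) – Most often labels are str, but anything that can act as a dict key is supported. Note that default save/load only supports Python literals. Default: <unk>. It can be None as well!
Returns: The index that was used to encode this.
Return type: int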

is_continuous()[source]

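Check that the set of indices doesn't have gaps.

For example, if starting index = 1:

Continuous: [1,2,3,4]
Continuous: [0,1,2]
Non-continuous: [2,3,4]
Non-continuous: [1,2,4]

Returns: True if continuous.
Return type: bool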
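
One reading that is consistent with all four examples above is: the starting index must be among the indices, and neighbouring indices must differ by exactly one. The sketch below implements that reading as a standalone function; it is an assumption about the rule, not a copy of the library code.

```python
def is_continuous(indices, starting_index=0):
    """True if `indices` contain `starting_index` and have no gaps between neighbours."""
    ordered = sorted(indices)
    has_start = starting_index in ordered
    no_gaps = all(b - a == 1 for a, b in zip(ordered, ordered[1:]))
    return has_start and no_gaps


for indices in ([1, 2, 3, 4], [0, 1, 2], [2, 3, 4], [1, 2, 4]):
    print(indices, is_continuous(indices, starting_index=1))
```

Under this reading [0,1,2] counts as continuous with starting index 1 because 1 is present and there are no gaps, while [2,3,4] fails because the starting index itself is missing.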

encode_label(label, allow_unk=True)[source]

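Encode a label to an int.

Parameters:

label (hashable) – Label to encode; must exist in the mapping unless the unk fallback below applies.
allow_unk (bool) – If True, and the label is not in the label set, and unk_label has been added with add_unk(), the label is encoded to unk_label's index.

Returns:

Corresponding encoded int value.

Return type: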

int
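
The allow_unk fallback can be sketched as a dictionary lookup with a guarded default. The function below is a hypothetical standalone illustration; it takes the unk index as an explicit argument (None meaning that no unk label was registered) instead of tracking it on an encoder object.

```python
def encode_label(lab2ind, label, unk_index=None, allow_unk=True):
    """Look up `label`; fall back to `unk_index` only if it is set and allowed."""
    try:
        return lab2ind[label]
    except KeyError:
        if allow_unk and unk_index is not None:
            return unk_index
        raise  # no fallback available: propagate the KeyError


lab2ind = {"spk0": 0, "spk1": 1, "<unk>": 2}
print(encode_label(lab2ind, "spk1", unk_index=2))   # 1
print(encode_label(lab2ind, "spk42", unk_index=2))  # 2
```

With allow_unk=False, or without a registered unk index, the unknown label raises KeyError, as in the spk42 example earlier in this reference.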

encode_label_torch(label, allow_unk=True)[source]

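Encode a label to a torch.LongTensor.

Parameters:

label (hashable) – Label to encode; must exist in the mapping unless the unk fallback below applies.
allow_unk (bool) – If True, and the label is not in the label set, and unk_label has been added with add_unk(), the label is encoded to unk_label's index.

Returns:

Corresponding encoded int value. Tensor shape [1].

Return type:

torch.LongTensor

encode_sequence(sequence, allow_unk=True)[source]

Encode a sequence of labels to a list.

Parameters:

sequence (iterable) – Labels to encode; each must exist in the mapping unless the unk fallback below applies.
allow_unk (bool) – If True, and a label is not in the label set, and unk_label has been added with add_unk(), the label is encoded to unk_label's index.

Returns:

Corresponding integer labels.

Return type:

list

encode_sequence_torch(sequence, allow_unk=True)[source]

Encode a sequence of labels to a torch.LongTensor.

Parameters:

sequence (iterable) – Labels to encode; each must exist in the mapping unless the unk fallback below applies.
allow_unk (bool) – If True, and a label is not in the label set, and unk_label has been added with add_unk(), the label is encoded to unk_label's index.

Returns:

Corresponding integer labels. Tensor shape [len(sequence)].

Return type: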

torch.LongTensor

decode_torch(x)[source]

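Decode an arbitrarily nested torch.Tensor to a list of labels.

Provided separately because Torch provides clearer introspection, and so doesn't require try-except.

Parameters: x (torch.Tensor) – Torch tensor of some integer dtype (Long, int) and any shape to decode.
Returns: List of original labels.
Return type: list

decode_ndim(x)[source]

Decode an arbitrarily nested iterable to a list of labels.

This works for essentially any Pythonic iterable (including torch), and also for single elements.

Parameters: x (Any) – Python list or other iterable, torch.Tensor, or a single integer element.
Returns: ndim list of original labels; if the input is a single element, the output is a single element as well.
Return type: list, Any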
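
Decoding an arbitrarily nested structure lends itself to recursion: try to iterate, and treat anything that cannot be iterated as a single index. The sketch below shows this on plain Python lists; it is a hypothetical standalone function that takes the ind2lab dictionary explicitly and is not tested against tensors.

```python
def decode_ndim(ind2lab, x):
    """Recursively decode nested iterables of integer indices into labels."""
    try:
        return [decode_ndim(ind2lab, item) for item in x]
    except TypeError:  # not iterable: a single index at the bottom level
        return ind2lab[int(x)]


ind2lab = {0: "a", 1: "b", 2: "c"}
print(decode_ndim(ind2lab, [[0, 1], [2]]))  # [['a', 'b'], ['c']]
print(decode_ndim(ind2lab, 1))              # b
```

The try-except is what lets a single element pass through unchanged in shape, matching the return description above.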

save(path)[source]

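Save the categorical encoding for later use and recovery.

Saving uses a Python literal format, which supports things like tuple labels, but is considered safe to load (unlike e.g. pickle).

Parameters: path (str, Path) – Where to save. Will overwrite.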
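
Why a Python literal format is both flexible and safe can be illustrated with repr and ast.literal_eval. The sketch below reuses the VALUE_SEPARATOR constant listed in this reference, but the one-pair-per-line layout and the helper names `dump_mapping` and `parse_mapping` are assumptions made for illustration, not the documented file format.

```python
import ast

VALUE_SEPARATOR = " => "  # separator constant listed in the class reference


def dump_mapping(lab2ind):
    """Serialize label -> index pairs, one per line, as Python literals."""
    return "\n".join(
        repr(label) + VALUE_SEPARATOR + repr(index)
        for label, index in lab2ind.items()
    )


def parse_mapping(text):
    """Parse lines written by dump_mapping without executing arbitrary code."""
    lab2ind = {}
    for line in text.splitlines():
        # Split on the last separator so labels containing " => " still parse.
        literal, index = line.rsplit(VALUE_SEPARATOR, 1)
        lab2ind[ast.literal_eval(literal)] = int(index)
    return lab2ind


original = {"spk0": 0, ("first", "second"): 1}
text = dump_mapping(original)
print(text)
print(parse_mapping(text) == original)  # True
```

ast.literal_eval only accepts literals such as strings, numbers and tuples, so a tuple label survives the round trip while arbitrary expressions in the file are rejected rather than executed.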

load(path)[source]

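Load from the given path.

CategoricalEncoder uses a Python literal format, which supports things like tuple labels, but is considered safe to load (unlike e.g. pickle).

Parameters: path (str, Path) – Where to load from.

load_if_possible(path, end_of_epoch=False)[source]

Load if possible, returning a bool indicating whether loading succeeded.

Parameters:

path (str, Path) – Where to load from.
end_of_epoch (bool) – Whether the checkpoint was end-of-epoch or not.

Returns:

True if the load succeeded.

Return type:

bool

Example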

>>> encoding_file = getfixture('tmpdir') / "encoding.txt"
>>> encoder = CategoricalEncoder()
>>> # The idea is in an experiment script to have something like this:
>>> if not encoder.load_if_possible(encoding_file):
...     encoder.update_from_iterable("abcd")
...     encoder.save(encoding_file)
>>> # So the first time you run the experiment, the encoding is created.
>>> # However, later, the encoding exists:
>>> encoder = CategoricalEncoder()
>>> encoder.expect_len(4)
>>> if not encoder.load_if_possible(encoding_file):
...     assert False  # We won't get here!
>>> encoder.decode_ndim(range(4))
['a', 'b', 'c', 'd']

expect_len(expected_len)[source]

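Specify the expected category count. If the category count observed during encoding/decoding does not match this, an error is raised.

This can be useful for detecting bugs in scenarios where the encoder is built dynamically from a dataset, but downstream code expects a specific category count (and may silently break otherwise).

This can be called at any time; the category count check is only performed during an actual encoding/decoding task.

Parameters: expected_len (int) – The expected final category count, i.e. len(encoder).

Example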

>>> encoder = CategoricalEncoder()
>>> encoder.update_from_iterable("abcd")
>>> encoder.expect_len(3)
>>> encoder.encode_label("a")
Traceback (most recent call last):
  ...
RuntimeError: .expect_len(3) was called, but 4 categories found
>>> encoder.expect_len(4)
>>> encoder.encode_label("a")
0

ignore_len()[source]

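Specify that the category count should be ignored at encoding/decoding time.

Effectively suppresses the ".expect_len was never called" warning. Prefer expect_len() when the category count is known.

class speechbrain.dataio.encoder.TextEncoder(starting_index=0, **special_labels)[source]

Bases: CategoricalEncoder

CategoricalEncoder subclass which offers specific methods for encoding text and handling special tokens for training sequence-to-sequence models. In detail, in addition to the special <unk> token already present in CategoricalEncoder for handling out-of-vocabulary tokens, special methods are defined here to handle the beginning-of-sequence <bos> and end-of-sequence <eos> tokens.

Note: update_from_iterable and update_from_didataset here have sequence_input=True as the default, because this encoder is assumed to be used on iterables of strings, e.g.: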

>>> from speechbrain.dataio.encoder import TextEncoder
>>> dataset = [["encode", "this", "textencoder"], ["foo", "bar"]]
>>> encoder = TextEncoder()
>>> encoder.update_from_iterable(dataset)
>>> encoder.expect_len(5)
>>> encoder.encode_label("this")
1
>>> encoder.add_unk()
5
>>> encoder.expect_len(6)
>>> encoder.encode_sequence(["this", "out-of-vocab"])
[1, 5]

Two methods can be used to add <bos> and <eos> to the internal dicts: insert_bos_eos, add_bos_eos.

>>> encoder.add_bos_eos()
>>> encoder.expect_len(8)
>>> encoder.lab2ind[encoder.eos_label]
7
>>> # add_bos_eos adds the special tokens at the end of the dict indexes
>>> encoder = TextEncoder()
>>> encoder.update_from_iterable(dataset)
>>> encoder.insert_bos_eos(bos_index=0, eos_index=1)
>>> encoder.expect_len(7)
>>> encoder.lab2ind[encoder.eos_label]
1
>>> # insert_bos_eos allows specifying the index that each of them will correspond to.
>>> # Note that you can also specify the same integer encoding for both.

Four methods can be used to prepend <bos> and append <eos>. prepend_bos_label and append_eos_label add the <bos> and <eos> string tokens, respectively, to the input sequence:

>>> words = ["foo", "bar"]
>>> encoder.prepend_bos_label(words)
['<bos>', 'foo', 'bar']
>>> encoder.append_eos_label(words)
['foo', 'bar', '<eos>']

prepend_bos_index and append_eos_index add the <bos> and <eos> indices, respectively, to the encoded input sequence.

>>> words = ["foo", "bar"]
>>> encoded = encoder.encode_sequence(words)
>>> encoder.prepend_bos_index(encoded)
[0, 3, 4]
>>> encoder.append_eos_index(encoded)
[3, 4, 1]

handle_special_labels(special_labels)[source]: 处理特殊标记，如 bos 和 eos。

update_from_iterable(iterable, sequence_input=True)[source]: 将 sequence_input 的默认值更改为 True。

update_from_didataset(didataset, output_key, sequence_input=True)[source]: 将 sequence_input 的默认值更改为 True。

limited_labelset_from_iterable(iterable, sequence_input=True, n_most_common=None, min_count=1)[source]: 将 sequence_input 的默认值更改为 True。

add_bos_eos(bos_label='<bos>', eos_label='<eos>')[source]

在标记集中添加句子边界标记。

如果句子开始标记和句子结束标记相同，将只使用一个句子边界标记。

此方法将标记添加到索引的末尾，而不是像 insert_bos_eos 那样添加到开头。

参数:

bos_label (hashable) – 句子开始标记，可以是任何标记。
eos_label (hashable) – 句子结束标记，可以是任何标记。如果与 bos_label 设置为相同标记，将只使用一个句子边界标记。

insert_bos_eos(bos_label='<bos>', eos_label='<eos>', bos_index=0, eos_index=None)[source]

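Insert sentence boundary markers into the label set.

If the beginning-of-sentence and end-of-sentence markers are the same, only one sentence-boundary label is used.

Parameters:

bos_label (hashable) – Beginning-of-sentence label, any label.
eos_label (hashable) – End-of-sentence label, any label. If set to the same label as bos_label, only one sentence-boundary label is used.
bos_index (int) – Where to insert bos_label.
eos_index (optional, int) – Where to insert eos_label. Default: eos_index = bos_index + 1

get_bos_index()[source]: Returns the index to which the BOS label encodes.

get_eos_index()[source]: Returns the index to which the EOS label encodes.

prepend_bos_label(x)[source]: Returns a list version of x, with BOS prepended.

prepend_bos_index(x)[source]: Returns a list version of x, with the BOS index prepended. If the input is a tensor, a tensor is returned.

append_eos_label(x)[source]: Returns a list version of x, with EOS appended.

append_eos_index(x)[source]: Returns a list version of x, with the EOS index appended. If the input is a tensor, a tensor is returned.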

class speechbrain.dataio.encoder.CTCTextEncoder(starting_index=0, **special_labels)[source]

Bases: TextEncoder

Subclass of TextEncoder that also provides methods to handle the CTC blank token.

add_blank and insert_blank can be used to add the <blank> special token to the encoder state.

>>> from speechbrain.dataio.encoder import CTCTextEncoder
>>> chars = ["a", "b", "c", "d"]
>>> encoder = CTCTextEncoder()
>>> encoder.update_from_iterable(chars)
>>> encoder.add_blank()
>>> encoder.expect_len(5)
>>> encoder.encode_sequence(chars)
[0, 1, 2, 3]
>>> encoder.get_blank_index()
4
>>> encoder.decode_ndim([0, 1, 2, 3, 4])
['a', 'b', 'c', 'd', '<blank>']

collapse_labels and collapse_indices_ndim can be used to apply CTC collapsing rules:

>>> encoder.collapse_labels(["a", "a", "b", "c", "d"])
['a', 'b', 'c', 'd']
>>> encoder.collapse_indices_ndim([4, 4, 0, 1, 2, 3, 4, 4]) # 4 is <blank>
[0, 1, 2, 3]

handle_special_labels(special_labels)[source]: Handles special labels such as blank.

add_blank(blank_label='<blank>')[source]: Add the blank symbol to the label set.

insert_blank(blank_label='<blank>', index=0)[source]: Insert the blank symbol at a given index in the label set.

get_blank_index()[source]: Returns the index to which blank encodes.

collapse_labels(x, merge_repeats=True)[source]

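Applies the CTC collapsing rules on one label sequence.

Parameters:

x (iterable) – Label sequence on which to operate.
merge_repeats (bool) – Whether to merge repeated labels before removing blanks. In the basic CTC label topology, repeated labels are merged. However, in RNN-T, they are not.

Returns:

List of labels with collapsing rules applied.

Return type: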

list
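
The collapsing rule itself is short: optionally merge consecutive repeats, then drop blanks. The sketch below is a standalone illustration on string labels, with the blank label passed explicitly; it is not the library method.

```python
from itertools import groupby


def collapse_labels(labels, blank_label="<blank>", merge_repeats=True):
    """Merge consecutive repeats (optionally), then drop blank labels."""
    if merge_repeats:
        labels = [label for label, _ in groupby(labels)]
    return [label for label in labels if label != blank_label]


frames = ["a", "a", "<blank>", "a", "b", "b", "<blank>"]
print(collapse_labels(frames))                       # ['a', 'a', 'b']
print(collapse_labels(frames, merge_repeats=False))  # ['a', 'a', 'a', 'b', 'b']
```

Because repeats are merged before blanks are removed, the blank between the two "a" runs keeps them as two separate output labels.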

collapse_indices_ndim(x, merge_repeats=True)[source]

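Applies the CTC collapsing rules on an arbitrary label sequence.

Parameters:

x (iterable) – Label sequence on which to operate.
merge_repeats (bool) – Whether to merge repeated labels before removing blanks. In the basic CTC label topology, repeated labels are merged. However, in RNN-T, they are not.

Returns:

List of labels with collapsing rules applied.

Return type: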

list
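
Assuming, as the _ndim suffix suggests, that nested input is collapsed one innermost sequence at a time, the behaviour can be sketched recursively on integer indices. The function below is a hypothetical standalone version that takes the blank index explicitly and only handles nested lists or tuples, not tensors.

```python
from itertools import groupby


def collapse_indices_ndim(x, blank_index, merge_repeats=True):
    """Collapse each innermost sequence of indices; keep the outer nesting."""
    x = list(x)
    if x and isinstance(x[0], (list, tuple)):
        return [collapse_indices_ndim(row, blank_index, merge_repeats) for row in x]
    if merge_repeats:
        x = [index for index, _ in groupby(x)]
    return [index for index in x if index != blank_index]


batch = [[4, 4, 0, 1, 2, 3, 4, 4], [0, 0, 4, 0, 1, 1]]
print(collapse_indices_ndim(batch, blank_index=4))  # [[0, 1, 2, 3], [0, 0, 1]]
```

Rows can collapse to different lengths, so the result is a list of lists rather than a rectangular array.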

speechbrain.dataio.encoder.load_text_encoder_tokens(model_path)[source]

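Loads the encoder tokens from a pretrained model.

This method is useful when working with a pretrained HF model. It loads the tokens listed in the YAML file, after which any CTCBaseSearcher can be instantiated directly in the YAML file.

Parameters: model_path (str, Path) – Path to the pretrained model.
Returns: List of tokens.
Return type: list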