在 GitHub 上执行、查看或下载此笔记本

在您训练好的 SpeechBrain 模型上进行推理

在本教程中，我们将学习在训练好的模型上进行推理的不同方法。请注意，这与加载预训练模型进行进一步训练或迁移学习无关。如果您对这些主题感兴趣，请参考相应的教程。

先决条件

背景

在本例中，我们将考虑一个用户，他希望使用由他自己训练的自定义预训练语音识别器来转录一些音频文件。如果您对使用在线可用的预训练模型感兴趣，请参考预训练教程。以下内容可以扩展到 SpeechBrain 支持的任何任务，因为我们提供了一种统一的处理方式。

可用选项

目前，您有三个选项可用

在您的 ASR 类（继承自 Brain）中定义一个自定义 Python 函数。这会在训练 recipe 和您的转录结果之间引入强耦合。对于原型开发和获取数据集的简单转录结果非常方便。但是，不推荐用于部署。
使用已有的接口（例如 EncoderDecoderASR，在预训练教程中有介绍）。这可能是最优雅、最方便的方式。但是，您的模型应符合一些约束才能适配所提议的接口。
构建完全适合您的自定义 ASR 模型的接口。

重要提示：所有这些解决方案也适用于其他任务（说话人识别、源分离等）

1. 训练脚本中的自定义函数

这种方法的目的是使用户能够在 train.py 结束时调用一个函数来转录给定数据集

    # Trainer initialization
    asr_brain = ASR(
        modules=hparams["modules"],
        opt_class=hparams["opt_class"],
        hparams=hparams,
        run_opts=run_opts,
        checkpointer=hparams["checkpointer"],
    )

    # Training
    asr_brain.fit(
        asr_brain.hparams.epoch_counter,
        datasets["train"],
        datasets["valid"],
        train_loader_kwargs=hparams["train_dataloader_opts"],
        valid_loader_kwargs=hparams["valid_dataloader_opts"],
    )

    # Load best checkpoint for evaluation
    test_stats = asr_brain.evaluate(
        test_set=datasets["test"],
        min_key="WER",
        test_loader_kwargs=hparams["test_dataloader_opts"],
    )

    # Load best checkpoint for transcription !!!!!!
    # You need to create this function w.r.t your system architecture !!!!!!
    transcripts = asr_brain.transcribe_dataset(
        dataset=datasets["your_dataset"], # Must be obtained from the dataio_function
        min_key="WER", # We load the model with the lowest WER
        loader_kwargs=hparams["transcribe_dataloader_opts"], # opts for the dataloading
    )

如您所见，由于需要实例化 Brain 类，这与训练 recipe 存在强耦合。

注意 1： 如果您不想调用 .fit() 和 .evaluate()，可以将其移除。这只是一个示例，旨在更好地突出如何使用它。

注意 2： 在这里，.transcribe_dataset() 函数接受一个 dataset 对象进行转录。您也可以直接使用路径。如何实现此函数完全取决于您。

现在：在这个函数中应该放什么？这里，我们将给出一个基于模板的示例，但您需要根据您自己的系统进行调整。

def transcribe_dataset(
        self,
        dataset, # Must be obtained from the dataio_function
        min_key, # We load the model with the lowest WER
        loader_kwargs # opts for the dataloading
    ):
  
    # If dataset isn't a Dataloader, we create it.
    if not isinstance(dataset, DataLoader):
        loader_kwargs["ckpt_prefix"] = None
        dataset = self.make_dataloader(
            dataset, Stage.TEST, **loader_kwargs
        )
    
    
    self.on_evaluate_start(min_key=min_key) # We call the on_evaluate_start that will load the best model
    self.modules.eval() # We set the model to eval mode (remove dropout etc)

    # Now we iterate over the dataset and we simply compute_forward and decode
    with torch.no_grad():

        transcripts = []
        for batch in tqdm(dataset, dynamic_ncols=True):
            
            # Make sure that your compute_forward returns the predictions !!!
            # In the case of the template, when stage = TEST, a beam search is applied
            # in compute_forward().
            out = self.compute_forward(batch, stage=sb.Stage.TEST)
            p_seq, wav_lens, predicted_tokens = out
            
            # We go from tokens to words.
            predicted_words = self.tokenizer(
                predicted_tokens, task="decode_from_list"
            )
            transcripts.append(predicted_words)
            
    return transcripts

管道很简单：加载模型 -> 执行 compute_forward -> 解除分词。

2. 使用 `EncoderDecoderASR` 接口

EncoderDecoderASR 类接口允许您将训练好的模型与训练 recipe 解耦，并通过少量代码对任何新的音频文件进行推理（或编码）。如果您对 ASR 不感兴趣，可以在 interfaces.py 文件中找到许多其他适合您目的的接口。如果您打算以生产方式部署模型，即如果您计划大量稳定地使用模型，则应首选此解决方案。当然，这将需要您稍微修改 yaml 文件。

该类具有以下方法

encode_batch：将编码器应用于输入批次并返回一些编码特征。
transcribe_file：转录输入的单个音频文件。
transcribe_batch：转录输入的批次。

实际上，如果您满足我们将在下一段中详细介绍的一些约束，您可以简单地执行以下操作

from speechbrain.inference.ASR import EncoderDecoderASR

asr_model = EncoderDecoderASR.from_hparams(source="your_folder", hparams_file='your_file.yaml', savedir="pretrained_model")
asr_model.transcribe_file('your_file.wav')

然而，为了实现对所有可能的 Encoder-Decoder ASR 管道的泛化，在部署您的系统时必须考虑一些约束

必要的模块。如您在 EncoderDecoderASR 类中看到的，您的 yaml 文件中定义的模块必须包含具有特定名称的某些元素。实际上，您需要一个分词器、一个编码器和一个解码器。编码器可以简单地是一个由特征计算、归一化和模型编码序列组成的 speechbrain.nnet.containers.LengthsCapableSequential。

    HPARAMS_NEEDED = ["tokenizer"]
    MODULES_NEEDED = [
        "encoder",
        "decoder",
    ]

您还需要在 YAML 文件中声明这些实体，并创建以下名为 modules 的字典

encoder: !new:speechbrain.nnet.containers.LengthsCapableSequential
    input_shape: [null, null, !ref <n_mels>]
    compute_features: !ref <compute_features>
    normalize: !ref <normalize>
    model: !ref <enc>

ctc_scorer: !new:speechbrain.decoders.scorer.CTCScorer
    eos_index: !ref <eos_index>
    blank_index: !ref <blank_index>
    ctc_fc: !ref <ctc_lin>

coverage_scorer: !new:speechbrain.decoders.scorer.CoverageScorer
    vocab_size: !ref <output_neurons>

rnnlm_scorer: !new:speechbrain.decoders.scorer.RNNLMScorer
    language_model: !ref <lm_model>
    temperature: !ref <temperature_lm>

scorer: !new:speechbrain.decoders.scorer.ScorerBuilder
    scorer_beam_scale: 1.5
    full_scorers: [
        !ref <rnnlm_scorer>,
        !ref <coverage_scorer>]
    partial_scorers: [!ref <ctc_scorer>]
    weights:
        rnnlm: !ref <lm_weight>
        coverage: !ref <coverage_penalty>
        ctc: !ref <ctc_weight_decode>

decoder: !new:speechbrain.decoders.S2SRNNBeamSearcher
    embedding: !ref <emb>
    decoder: !ref <dec>
    linear: !ref <seq_lin>
    bos_index: !ref <bos_index>
    eos_index: !ref <eos_index>
    min_decode_ratio: !ref <min_decode_ratio>
    max_decode_ratio: !ref <max_decode_ratio>
    beam_size: !ref <test_beam_size>
    eos_threshold: !ref <eos_threshold>
    using_max_attn_shift: !ref <using_max_attn_shift>
    max_attn_shift: !ref <max_attn_shift>
    temperature: !ref <temperature>
    scorer: !ref <scorer>

modules:
    encoder: !ref <encoder>
    decoder: !ref <decoder>
    lm_model: !ref <lm_model>

在此示例中，enc 是 CRDNN，但也可以是任何自定义神经网络。

为什么需要确保这一点？原因很简单，因为这些是在 EncoderDecoderASR 类上进行推理时我们会调用的模块。以下是 encode_batch() 函数的示例。

[...]
  wavs = wavs.float()
  wavs, wav_lens = wavs.to(self.device), wav_lens.to(self.device)
  encoder_out = self.modules.encoder(wavs, wav_lens)
return encoder_out

如果我的 asr_encoder 结构很复杂，包含多个深度神经网络等怎么办？只需将所有内容放入 yaml 文件中的 torch.nn.ModuleList 中即可

asr_encoder: !new:torch.nn.ModuleList
    - [!ref <enc>, my_different_blocks ... ]

调用预训练器加载检查点。最后，您需要定义一个对预训练器的调用，它将把训练模型的不同检查点加载到相应的 SpeechBrain 模块中。简而言之，它会加载您的编码器、语言模型的权重，甚至只是加载分词器。

pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
    loadables:
        asr: !ref <asr_model>
        lm: !ref <lm_model>
        tokenizer: !ref <tokenizer>
    paths:
      asr: !ref <asr_model_ptfile>
      lm: !ref <lm_model_ptfile>
      tokenizer: !ref <tokenizer_ptfile>

loadable 字段在文件（例如 lm，与 <lm_model_ptfile> 中的检查点相关）与 yaml 实例（例如 <lm_model>，即您的语言模型）之间创建链接。

如果您遵守这两个约束，它应该就能工作！这里，我们提供一个仅用于推理的完整 yaml 示例

# ############################################################################
# Model: E2E ASR with attention-based ASR
# Encoder: CRDNN model
# Decoder: GRU + beamsearch + RNNLM
# Tokens: BPE with unigram
# Authors:  Ju-Chieh Chou, Mirco Ravanelli, Abdel Heba, Peter Plantinga 2020
# ############################################################################


# Feature parameters
sample_rate: 16000
n_fft: 400
n_mels: 40

# Model parameters
activation: !name:torch.nn.LeakyReLU
dropout: 0.15
cnn_blocks: 2
cnn_channels: (128, 256)
inter_layer_pooling_size: (2, 2)
cnn_kernelsize: (3, 3)
time_pooling_size: 4
rnn_class: !name:speechbrain.nnet.RNN.LSTM
rnn_layers: 4
rnn_neurons: 1024
rnn_bidirectional: True
dnn_blocks: 2
dnn_neurons: 512
emb_size: 128
dec_neurons: 1024
output_neurons: 1000  # index(blank/eos/bos) = 0
blank_index: 0

# Decoding parameters
bos_index: 0
eos_index: 0
min_decode_ratio: 0.0
max_decode_ratio: 1.0
beam_size: 80
eos_threshold: 1.5
using_max_attn_shift: True
max_attn_shift: 240
lm_weight: 0.50
coverage_penalty: 1.5
temperature: 1.25
temperature_lm: 1.25

normalize: !new:speechbrain.processing.features.InputNormalization
    norm_type: global

compute_features: !new:speechbrain.lobes.features.Fbank
    sample_rate: !ref <sample_rate>
    n_fft: !ref <n_fft>
    n_mels: !ref <n_mels>

enc: !new:speechbrain.lobes.models.CRDNN.CRDNN
    input_shape: [null, null, !ref <n_mels>]
    activation: !ref <activation>
    dropout: !ref <dropout>
    cnn_blocks: !ref <cnn_blocks>
    cnn_channels: !ref <cnn_channels>
    cnn_kernelsize: !ref <cnn_kernelsize>
    inter_layer_pooling_size: !ref <inter_layer_pooling_size>
    time_pooling: True
    using_2d_pooling: False
    time_pooling_size: !ref <time_pooling_size>
    rnn_class: !ref <rnn_class>
    rnn_layers: !ref <rnn_layers>
    rnn_neurons: !ref <rnn_neurons>
    rnn_bidirectional: !ref <rnn_bidirectional>
    rnn_re_init: True
    dnn_blocks: !ref <dnn_blocks>
    dnn_neurons: !ref <dnn_neurons>

emb: !new:speechbrain.nnet.embedding.Embedding
    num_embeddings: !ref <output_neurons>
    embedding_dim: !ref <emb_size>

dec: !new:speechbrain.nnet.RNN.AttentionalRNNDecoder
    enc_dim: !ref <dnn_neurons>
    input_size: !ref <emb_size>
    rnn_type: gru
    attn_type: location
    hidden_size: !ref <dec_neurons>
    attn_dim: 1024
    num_layers: 1
    scaling: 1.0
    channels: 10
    kernel_size: 100
    re_init: True
    dropout: !ref <dropout>

ctc_lin: !new:speechbrain.nnet.linear.Linear
    input_size: !ref <dnn_neurons>
    n_neurons: !ref <output_neurons>

seq_lin: !new:speechbrain.nnet.linear.Linear
    input_size: !ref <dec_neurons>
    n_neurons: !ref <output_neurons>

log_softmax: !new:speechbrain.nnet.activations.Softmax
    apply_log: True

lm_model: !new:speechbrain.lobes.models.RNNLM.RNNLM
    output_neurons: !ref <output_neurons>
    embedding_dim: !ref <emb_size>
    activation: !name:torch.nn.LeakyReLU
    dropout: 0.0
    rnn_layers: 2
    rnn_neurons: 2048
    dnn_blocks: 1
    dnn_neurons: 512
    return_hidden: True  # For inference

tokenizer: !new:sentencepiece.SentencePieceProcessor

asr_model: !new:torch.nn.ModuleList
    - [!ref <enc>, !ref <emb>, !ref <dec>, !ref <ctc_lin>, !ref <seq_lin>]

# We compose the inference (encoder) pipeline.
encoder: !new:speechbrain.nnet.containers.LengthsCapableSequential
    input_shape: [null, null, !ref <n_mels>]
    compute_features: !ref <compute_features>
    normalize: !ref <normalize>
    model: !ref <enc>


ctc_scorer: !new:speechbrain.decoders.scorer.CTCScorer
    eos_index: !ref <eos_index>
    blank_index: !ref <blank_index>
    ctc_fc: !ref <ctc_lin>

coverage_scorer: !new:speechbrain.decoders.scorer.CoverageScorer
    vocab_size: !ref <output_neurons>

rnnlm_scorer: !new:speechbrain.decoders.scorer.RNNLMScorer
    language_model: !ref <lm_model>
    temperature: !ref <temperature_lm>

scorer: !new:speechbrain.decoders.scorer.ScorerBuilder
    scorer_beam_scale: 1.5
    full_scorers: [
        !ref <rnnlm_scorer>,
        !ref <coverage_scorer>]
    partial_scorers: [!ref <ctc_scorer>]
    weights:
        rnnlm: !ref <lm_weight>
        coverage: !ref <coverage_penalty>
        ctc: !ref <ctc_weight_decode>

decoder: !new:speechbrain.decoders.S2SRNNBeamSearcher
    embedding: !ref <emb>
    decoder: !ref <dec>
    linear: !ref <seq_lin>
    bos_index: !ref <bos_index>
    eos_index: !ref <eos_index>
    min_decode_ratio: !ref <min_decode_ratio>
    max_decode_ratio: !ref <max_decode_ratio>
    beam_size: !ref <test_beam_size>
    eos_threshold: !ref <eos_threshold>
    using_max_attn_shift: !ref <using_max_attn_shift>
    max_attn_shift: !ref <max_attn_shift>
    temperature: !ref <temperature>
    scorer: !ref <scorer>
    

modules:
    encoder: !ref <encoder>
    decoder: !ref <decoder>
    lm_model: !ref <lm_model>

pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
    loadables:
        asr: !ref <asr_model>
        lm: !ref <lm_model>
        tokenizer: !ref <tokenizer>

如您所见，这是一个标准的 YAML 文件，但带有一个加载模型的预训练器。它与用于训练的 yaml 文件相似。我们只需移除所有训练特定的部分（例如，训练参数、优化器、检查点等），并添加预训练器以及将所需模块与其预训练文件链接起来的 encoder、decoder 元素。

3. 开发您自己的推理接口

虽然 EncoderDecoderASR 类被设计得尽可能通用，但您可能需要一个更复杂的、更适合您需求的推理方案。在这种情况下，您必须开发自己的接口。请按照以下步骤操作

创建继承自 Pretrained 的自定义接口（代码在此文件中）

class MySuperTask(Pretrained):
  # Here, do not hesitate to also add some required modules
  # for further transparency.
  HPARAMS_NEEDED = ["mymodule1", "mymodule2"]
  MODULES_NEEDED = [
        "mytask_enc",
        "my_searcher",
  ]
  def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Do whatever is needed here w.r.t your system

这将使您的类能够调用有用的函数，例如基于 HyperPyYAML 文件获取和加载的 .from_hparams()，以及加载给定音频文件的 load_audio()。我们编写在 Pretrained 类中的大多数方法可能都适合您的需求。如果不是，您可以覆盖它们以实现自定义功能。

开发您的接口和不同功能。遗憾的是，我们无法在此提供一个足够通用的示例。您可以向此类添加您认为可以使在数据/模型上进行推理更轻松自然的任何函数。例如，我们可以在此处创建一个函数，该函数只需使用 mytask_enc 模块对 wav 文件进行编码。

class MySuperTask(Pretrained):
  # Here, do not hesitate to also add some required modules
  # for further transparency.
  HPARAMS_NEEDED = ["mymodule1", "mymodule2"]
  MODULES_NEEDED = [
        "mytask_enc",
        "my_searcher",
  ]
  def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Do whatever is needed here w.r.t your system
  
  def encode_file(self, path):
        waveform = self.load_audio(path)
        # Fake a batch:
        batch = waveform.unsqueeze(0)
        rel_length = torch.tensor([1.0])
        with torch.no_grad():
          rel_lens = rel_length.to(self.device)
          encoder_out = self.encode_batch(waveform, rel_lens)
        
        return encode_file

现在，我们可以按以下方式使用您的接口

from speechbrain.inference.my_super_task import MySuperTask

my_model = MySuperTask.from_hparams(source="your_local_folder", hparams_file='your_file.yaml', savedir="pretrained_model")
audio_file = 'your_file.wav'
encoded = my_model.encode_file(audio_file)

如您所见，这种形式非常灵活，使您能够创建一个整体接口，用于使用预训练模型执行任何您想要的操作。

我们为端到端 ASR、说话人识别、源分离、语音增强等提供不同的通用接口。如果您感兴趣，请此处查看！

通用预训练推理

在某些情况下，用户可能希望在外部文件开发推理接口。这可以使用foreign class 来完成。您可以查看此处报告的示例

from speechbrain.inference.interfaces import foreign_class
classifier = foreign_class(source="speechbrain/emotion-recognition-wav2vec2-IEMOCAP", pymodule_file="custom_interface.py", classname="CustomEncoderWav2vec2Classifier")
out_prob, score, index, text_lab = classifier.classify_file("speechbrain/emotion-recognition-wav2vec2-IEMOCAP/anger.wav")
print(text_lab)

在这种情况下，推理接口不是写在 speechbrain.pretrained.interfaces 中的类，而是编码在外部文件（custom_interface.py）中。

如果您需要的接口在 speechbrain.pretrained.interfaces 中不可用，这可能很有用。如果愿意，您可以将其添加到那里。但是，如果您使用 foreign_class，我们还为您提供了从任何其他路径获取推理代码的可能性。

引用 SpeechBrain

如果您在研究或商业中使用 SpeechBrain，请使用以下 BibTeX 条目进行引用

@misc{speechbrainV1,
  title={Open-Source Conversational AI with {SpeechBrain} 1.0},
  author={Mirco Ravanelli and Titouan Parcollet and Adel Moumen and Sylvain de Langen and Cem Subakan and Peter Plantinga and Yingzhi Wang and Pooneh Mousavi and Luca Della Libera and Artem Ploujnikov and Francesco Paissan and Davide Borra and Salah Zaiem and Zeyu Zhao and Shucong Zhang and Georgios Karakasidis and Sung-Lin Yeh and Pierre Champion and Aku Rouhe and Rudolf Braun and Florian Mai and Juan Zuluaga-Gomez and Seyed Mahed Mousavi and Andreas Nautsch and Xuechen Liu and Sangeet Sagar and Jarod Duret and Salima Mdhaffar and Gaelle Laperriere and Mickael Rouvier and Renato De Mori and Yannick Esteve},
  year={2024},
  eprint={2407.00463},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2407.00463},
}
@misc{speechbrain,
  title={{SpeechBrain}: A General-Purpose Speech Toolkit},
  author={Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},
  year={2021},
  eprint={2106.04624},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  note={arXiv:2106.04624}
}