
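Fine-tuning or using Whisper, wav2vec2, HuBERT and other models with SpeechBrain and HuggingFace

This tutorial describes how to combine (use and fine-tune) pretrained models from the HuggingFace Transformers library, such as Whisper, wav2vec 2.0, HuBERT and WavLM, with SpeechBrain. These models can easily be plugged into SpeechBrain to approach speech or audio related tasks: automatic speech recognition, speaker recognition, spoken language understanding, and so on.

What about pre-training? Pre-training large SSL models is complex for many reasons, ranging from the necessary resources (dozens of GPUs for hundreds of hours) to replicability issues caused by the training pipeline. For now, SpeechBrain only offers pre-training for wav2vec 2.0 models.

Why SpeechBrain? Many reasons could be given for using SpeechBrain. In the specific context of pretrained models, however, SpeechBrain lets researchers and users connect these architectures to state-of-the-art speech and audio technologies. For instance, SpeechBrain lets you easily fine-tune a pretrained wav2vec2 model and couple it with a transformer decoder, a beam search algorithm and a transformer language model to build a SOTA speech recognizer. It also lets you simply use the encoder of a pretrained Whisper to perform emotion recognition. To the best of our knowledge, most other toolkits do not make this possible.

Architectures of interest for this tutorial. We will only consider two of the most recent pretrained models: wav2vec 2.0 and Whisper. However, SpeechBrain supports many others: WavLM, HuBERT, etc.

wav2vec 2.0 is a transformer-based encoder architecture that performs self-supervised representation learning of speech. For more details, please refer to the official paper: wav2vec2.

Illustration of wav2vec2, source.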

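Whisper is a full transformer (encoder-decoder) trained on a large amount of weakly supervised data (680,000 hours of speech according to the paper). For more details, please refer to the official paper: whisper.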

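Illustration of Whisper, source.

With this tutorial, you will learn how to:

Instantiate wav2vec2 or Whisper to extract features from an audio file.
Use the wav2vec2 and Whisper encoders as a block of your pipeline (ASR, TIMIT).
Use Whisper as an encoder-decoder architecture for fine-tuning (ASR, LibriSpeech).
Understand the limitations of the current integration.

Prerequisites

Wav2Vec 2.0 and Whisper from HuggingFace

The wav2vec 2.0 models were initially shared through the Fairseq GitHub repository and have since been integrated into the HuggingFace Transformers API and published on HuggingFace. The same happened with the Whisper models, which moved from their original repository to the HuggingFace Transformers API. Hence, if you want to use a pretrained Transformer model in SpeechBrain, all you need is a HuggingFace repository (e.g. "facebook/wav2vec2-large-lv60", "openai/whisper-large" or "microsoft/wavlm-large").

But first, let's install all the required packages...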

%%capture
# Installing SpeechBrain
BRANCH = 'develop'
!git clone https://github.com/speechbrain/speechbrain.git -b $BRANCH
%cd /content/speechbrain/
!python -m pip install .

Install the HuggingFace Transformers interface.

Finally, let's download and load an audio file to play with.

%%capture
!wget https://www.dropbox.com/s/u8qyvuyie2op286/spk1_snt1.wav

import speechbrain as sb

source = sb.dataio.dataio.read_audio('spk1_snt1.wav').squeeze()
print(source.shape)

This is the imported signal:

import matplotlib.pyplot as plt
plt.figure(1)
plt.plot(source)
plt.show()

from IPython.display import Audio
Audio('spk1_snt1.wav')

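Wav2vec2, HuBERT, WavLM and Whisper models are provided in SpeechBrain as lobes. Matching the imports used in the next cell, the wav2vec2 and Whisper implementations can be found in:

speechbrain.integrations.huggingface.wav2vec2
speechbrain.integrations.huggingface.whisper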

Now, we instantiate each of them. Note that, in the example below, the returned objects are standard PyTorch Modules, as is almost always the case in SpeechBrain.

# BE CAREFUL, IF YOU ARE NOT CONNECTED TO A GPU RUNTIME, THIS WILL CRASH
# This only happens on Colab; elsewhere you can of course load models on CPU
from speechbrain.integrations.huggingface.wav2vec2 import Wav2Vec2
from speechbrain.integrations.huggingface.whisper import Whisper

# HuggingFace model hub
model_hub_w2v2 = "facebook/wav2vec2-base-960h"
model_hub_whisper = "openai/whisper-tiny"

model_w2v2 = Wav2Vec2(model_hub_w2v2, save_path='/content/pretrained/')
model_whisper = Whisper(model_hub_whisper, save_path='/content/pretrained/')

Here, we can explore the model...

print(model_whisper)

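Now, we can try to extract audio features from these models! However, in our example we have two different models, which require different forward operations if our goal is only to retrieve a latent representation of the audio input. Wav2vec 2.0 is a transformer encoder, so we just need the output of its last layer. Whisper, on the other hand, is packaged as a fully trained encoder-decoder. Hence, we must make sure that we only retrieve the output of the encoder!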

source = source.unsqueeze(0)
print(source.shape)

fea_w2v2 = model_w2v2(source)
print(fea_w2v2.shape)

# This can be given as an argument when we instantiate the model as well
model_whisper.encoder_only=True
fea_whisper = model_whisper(source)
print(fea_whisper.shape)

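What am I looking at?

These features correspond to the contextualized representations obtained after the transformer (see C in the initial wav2vec2 illustration). For the base model, the output dimension is therefore 768 (as stated in the paper). The output rate of wav2vec2 is 50 Hz and the audio file is 2.87 seconds long, which explains the 143 we obtain in the time dimension. Indeed, the shape is [batch, time, features]. The same logic applies to Whisper, since we obtain the last hidden state of the transformer encoder.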
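The 143 frames can be sanity-checked by hand. The sketch below assumes that the audio is sampled at 16 kHz and that the feature encoder uses the default convolution kernels and strides of the HuggingFace wav2vec 2.0 configuration. Neither value is stated in this tutorial, and `num_frames` is a hypothetical helper, not a SpeechBrain function.

```python
# Assumed defaults of the HuggingFace wav2vec 2.0 feature encoder.
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_frames(num_samples: int) -> int:
    """Output length of a stack of unpadded 1-D convolutions."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        length = (length - kernel) // stride + 1
    return length


sample_rate = 16000  # assumption: 16 kHz audio
duration = 2.87      # seconds, as quoted above
print(num_frames(round(duration * sample_rate)))  # 143
```

The strides multiply to 320 samples, i.e. one frame every 20 ms at 16 kHz, which is where the roughly 50 Hz output rate comes from.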

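Wav2Vec 2.0 and Whisper encoders as a block of your pipeline (ASR, TIMIT)

So far, we have only seen how to run inference on a single audio file with pretrained wav2vec2 and Whisper models. Of course, if you only want to extract features, you can simply loop over your dataset and store everything... or you can use SpeechBrain to plug these models directly into your pipeline and compute the features on the fly (and fine-tune them!).

In fact, if you are familiar with our YAML formalism (if not, please check our tutorial first), the Wav2Vec2 and Whisper classes can simply be added as blocks to your hyperparams file:

For wav2vec 2.0: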

wav2vec2: !new:speechbrain.integrations.huggingface.wav2vec2.Wav2Vec2
    source: !ref <wav2vec2_hub>
    freeze: True
    save_path: !ref <save_folder>/wav2vec2_checkpoint

For Whisper:

whisper: !new:speechbrain.integrations.huggingface.whisper.Whisper
    source: !ref <whisper_hub>
    freeze: True
    encoder_only: True
    save_path: !ref <save_folder>/whisper_checkpoint

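freeze lets you either fine-tune (False) or freeze (True) the neural parameters. Note that you can also freeze only the encoder of Whisper or only the feature extractor of wav2vec 2.0. This leaves two PyTorch module objects in your pipeline that can be used as standard layers to propagate your data!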
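As a rough picture of what freezing means in plain PyTorch, the sketch below disables gradients on part of a stand-in module. `TinyEncoder` and `set_frozen` are hypothetical names used only for illustration; the actual SpeechBrain wrappers may implement freezing differently.

```python
import torch
from torch import nn


class TinyEncoder(nn.Module):
    """Hypothetical stand-in for a pretrained encoder."""

    def __init__(self):
        super().__init__()
        self.feature_extractor = nn.Linear(16, 8)
        self.transformer = nn.Linear(8, 8)

    def forward(self, x):
        return self.transformer(torch.relu(self.feature_extractor(x)))


def set_frozen(module: nn.Module, frozen: bool) -> None:
    """Enable or disable gradient computation for every parameter."""
    for param in module.parameters():
        param.requires_grad_(not frozen)


encoder = TinyEncoder()
# Freeze only the feature extractor and keep fine-tuning the rest.
set_frozen(encoder.feature_extractor, True)

trainable = [n for n, p in encoder.named_parameters() if p.requires_grad]
print(trainable)  # ['transformer.weight', 'transformer.bias']
```

Only parameters that still require gradients need to be handed to an optimizer.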

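From here on, basic knowledge of SpeechBrain is required. If anything is unclear, please refer to the prerequisites (at the beginning of this tutorial).

Now, we will dig a bit deeper into the LibriSpeech ASR (CTC) recipe, which can be found here.

If you are not familiar with CTC ASR, please refer to our simplified and heavily commented template.

In the next section, we only highlight the parts of the code that are required to use a Whisper or wav2vec2 model in your recipe!

Understanding the YAML parameters.

In this setup, we want to fine-tune the Whisper or wav2vec2 model on our downstream task. More precisely, the architecture of the model is: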

[ wav -> wav2vec2 or whisper -> Dense ] = encoder

To achieve this, our YAML file contains several key components (if you are interested in Whisper, remove the w2v2 references, and vice versa):

  [...]

  # URL for the biggest and already fine-tuned english wav2vec2 model and parameters.
  # URL for the medium whisper as well.
  wav2vec2_hub: "facebook/wav2vec2-large-960h-lv60-self"
  whisper_hub: "openai/whisper-medium"
  freeze_pretrained: False
  lr_pretrained: 0.0001

  [...]

  # The instantiation of the SpeechBrain lobe
  wav2vec2: !new:speechbrain.integrations.huggingface.wav2vec2.Wav2Vec2
    source: !ref <wav2vec2_hub>
    freeze: !ref <freeze_pretrained>
    save_path: !ref <save_folder>/wav2vec2_checkpoint

  # The instantiation of the SpeechBrain lobe
  whisper: !new:speechbrain.integrations.huggingface.whisper.Whisper
    source: !ref <whisper_hub>
    freeze: !ref <freeze_pretrained>
    encoder_only: True
    save_path: !ref <save_folder>/whisper_checkpoint
  
  # A simple DNN that receives the output of the pretrained model as input
  # Here the output dimensionality of the LARGE wav2vec2 and MEDIUM whisper are 1024.
  enc: !new:speechbrain.lobes.models.VanillaNN.VanillaNN
    input_shape: [null, null, 1024]
    activation: !ref <activation>
    dnn_blocks: !ref <dnn_layers>
    dnn_neurons: !ref <dnn_neurons>

  [...]

  # Two optimizers and schedulers to allow:
  # 1. The learning of the encoder and the decoders.
  # 2. Slowly fine-tune only the pretrained (w2v2 or whisper) parts.
  adam_opt_class: !name:torch.optim.AdamW
    lr: !ref <lr>

  pretrained_opt_class: !name:torch.optim.AdamW
    lr: !ref <lr_pretrained>
    
  lr_annealing_adam: !new:speechbrain.nnet.schedulers.NewBobScheduler
    initial_value: !ref <lr>
    improvement_threshold: 0.0025
    annealing_factor: 0.8
    patient: 0

  lr_annealing_pretrained: !new:speechbrain.nnet.schedulers.NewBobScheduler
    initial_value: !ref <lr_pretrained>
    improvement_threshold: 0.0025
    annealing_factor: 0.9

  # We add the wav2vec2 / whisper to the modules list so it is uploaded on the GPUs.
  # Remove the one that is not used!
  modules:
    wav2vec2: !ref <wav2vec2>
    whisper: !ref <whisper>
    enc: !ref <enc>
    emb: !ref <emb>
    dec: !ref <dec>
    ctc_lin: !ref <ctc_lin>
    seq_lin: !ref <seq_lin>

  # We do not add the wav2vec2 / whisper to the model list, so we can apply one optimizer
  # to the randomly initialized model and the other to the pretrained model.
  model: !new:torch.nn.ModuleList
    - [!ref <enc>, !ref <emb>, !ref <dec>, !ref <ctc_lin>, !ref <seq_lin>]

  # We add the wav2vec2 /whisper to our checkpointer so the model can be saved!
  checkpointer: !new:speechbrain.utils.checkpoints.Checkpointer
    checkpoints_dir: !ref <save_folder>
    recoverables:
        model: !ref <model>
        wav2vec2: !ref <wav2vec2>
        whisper: !ref <whisper>
        lr_annealing_adam: !ref <lr_annealing_adam>
        lr_annealing_pretrained: !ref <lr_annealing_pretrained>
        counter: !ref <epoch_counter>
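The `input_shape` given to `enc` has to match the hidden size of the chosen pretrained model. The lookup below lists the hidden sizes published for a few common checkpoints; `HIDDEN_SIZES` and `enc_input_shape` are hypothetical helpers written for illustration and are not part of SpeechBrain.

```python
# Hidden sizes as published for these checkpoints.
HIDDEN_SIZES = {
    "facebook/wav2vec2-base-960h": 768,
    "facebook/wav2vec2-large-960h-lv60-self": 1024,
    "openai/whisper-tiny": 384,
    "openai/whisper-base": 512,
    "openai/whisper-small": 768,
    "openai/whisper-medium": 1024,
}


def enc_input_shape(hub_name: str) -> list:
    """Shape [batch, time, features] expected by the DNN placed on top."""
    return [None, None, HIDDEN_SIZES[hub_name]]


print(enc_input_shape("openai/whisper-medium"))  # [None, None, 1024]
```

Switching to a smaller checkpoint therefore also means changing the last entry of `input_shape` in the YAML file.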

Then, we put everything together in the Python recipe file:

  class ASR(sb.Brain):
    def compute_forward(self, batch, stage):
      [...]
      # The compute forward is strictly identical to any compute_forward method
      # for ASR, except that we just call the wav2vec2 / whisper on the wavs instead of computing acoustic features (FBANKs, MFCCs ...).
      # Use one model or the other, not both:
      # feats = self.modules.wav2vec2(wavs)
      feats = self.modules.whisper(wavs)
      x = self.modules.enc(feats)
      [...]
    
    def init_optimizers(self):
        # Initializes the whisper optimizer and model optimizer. The same can be done for wav2vec2.
        self.pretrained_optimizer = self.hparams.pretrained_opt_class(
            self.modules.whisper.parameters()
        )
        self.adam_optimizer = self.hparams.adam_opt_class(
            self.hparams.model.parameters()
        )
        [...]
    
    def on_stage_end(self, stage, stage_loss, epoch):
        # Gets called at the end of an epoch.
        [...]
        if stage == sb.Stage.VALID:

            # Here we apply our learning_rate annealing on both optimizers
            old_lr_adam, new_lr_adam = self.hparams.lr_annealing_adam(wer)
            old_lr_pretrained, new_lr_pretrained = self.hparams.lr_annealing_pretrained(wer)
            sb.nnet.schedulers.update_learning_rate(
                self.adam_optimizer, new_lr_adam
            )
            sb.nnet.schedulers.update_learning_rate(
                self.pretrained_optimizer, new_lr_pretrained
            )

    def fit_batch(self, batch):
        # Override of the Brain Class fit_batch function.
        # Managing automatic mixed precision
        [...]
        outputs = self.compute_forward(batch, sb.Stage.TRAIN)

        loss = self.compute_objectives(outputs, batch, sb.Stage.TRAIN)
        loss.backward()

        # Here we manage both optimizers
        # (Learning enc+dec and Fine-tuning wav2vec2).
        if self.check_gradients(loss):
            self.pretrained_optimizer.step()
            self.adam_optimizer.step()

        self.pretrained_optimizer.zero_grad()
        self.adam_optimizer.zero_grad()

        return loss.detach().cpu()
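The annealing called in `on_stage_end` follows the "new bob" idea: the learning rate is multiplied by `annealing_factor` whenever the relative improvement of the validation metric falls below `improvement_threshold`. The class below is a simplified re-implementation written for this tutorial under that assumption (it ignores the `patient` argument); it is not the SpeechBrain source code.

```python
class SimpleNewBob:
    """Simplified sketch of a 'new bob' learning rate scheduler."""

    def __init__(self, initial_value, annealing_factor=0.5,
                 improvement_threshold=0.0025):
        self.value = initial_value
        self.annealing_factor = annealing_factor
        self.improvement_threshold = improvement_threshold
        self.previous_metric = None

    def __call__(self, metric):
        old_value = new_value = self.value
        if self.previous_metric is not None:
            if self.previous_metric == 0:
                improvement = 0.0  # avoid dividing by zero
            else:
                improvement = (
                    self.previous_metric - metric
                ) / self.previous_metric
            # Anneal when the metric did not improve enough.
            if improvement < self.improvement_threshold:
                new_value = old_value * self.annealing_factor
        self.previous_metric = metric
        self.value = new_value
        return old_value, new_value


scheduler = SimpleNewBob(1e-4, annealing_factor=0.9)
for wer in (20.0, 15.0, 14.99):
    old_lr, new_lr = scheduler(wer)
    print(f"WER={wer}: lr {old_lr:.2e} -> {new_lr:.2e}")
```

In this run the learning rate only changes on the last call, where the WER barely moves.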

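Note: of course, if you are using a frozen wav2vec2 model, there is no need for two different optimizers ;-) And that's it! If you run your recipe like this, your Whisper / wav2vec 2.0 pretrained encoder will be part of your architecture and will be fine-tuned (or not), depending on your settings.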
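The two-optimizer pattern can also be tried outside SpeechBrain on toy modules. In the sketch below, `pretrained` and `head` are hypothetical stand-ins for the wav2vec2 / Whisper encoder and the randomly initialized layers, and the learning rates are arbitrary.

```python
import torch
from torch import nn

torch.manual_seed(0)
pretrained = nn.Linear(4, 4)  # stand-in for the pretrained encoder
head = nn.Linear(4, 2)        # stand-in for the randomly initialized layers

# A small learning rate for the pretrained part, a larger one for the rest.
pretrained_optimizer = torch.optim.AdamW(pretrained.parameters(), lr=1e-4)
model_optimizer = torch.optim.AdamW(head.parameters(), lr=1e-2)

x = torch.randn(8, 4)
targets = torch.randint(0, 2, (8,))

before = head.weight.detach().clone()
loss = nn.functional.cross_entropy(head(pretrained(x)), targets)
loss.backward()

# Step both optimizers, then clear both sets of gradients.
pretrained_optimizer.step()
model_optimizer.step()
pretrained_optimizer.zero_grad()
model_optimizer.zero_grad()

print(loss.item())
```

With a frozen encoder, the first optimizer and its scheduler can simply be dropped.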

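Whisper as a fully pretrained encoder-decoder

Whisper is a full transformer. In theory, this means that you can use it for zero-shot speech recognition or speech translation. In practice, you will most likely want to fine-tune it on your in-house dataset. Both options are possible in SpeechBrain; we only need to slightly change the YAML file and the script accordingly. Indeed, we no longer need a DNN decoder, because Whisper comes with its own decoder. We no longer rely on the CTC loss either, because the Transformer decoder can be trained with the negative log-likelihood. Finally, we must decide whether to connect the model to a greedy search decoder or to a more sophisticated beam searcher with or without language model scoring! SpeechBrain's support for Whisper can be summarized as follows:

Feature extraction
Encoder fine-tuning
Encoder-decoder zero-shot ASR or ST
Encoder-decoder fine-tuning
Greedy decoding
Beam search decoding with or without LM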
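To make the negative log-likelihood objective mentioned above more concrete, here is a dependency-free sketch of teacher forcing: the decoder reads the transcript prefixed with a start token and is trained to predict the same transcript followed by an end token. The token indices, the vocabulary size and both helper functions are hypothetical; the loss used in SpeechBrain recipes may differ in details such as padding or label smoothing.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of scores."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_norm for v in logits]


def nll_loss(step_logits, targets):
    """Average negative log-likelihood of the target token at each step."""
    total = 0.0
    for logits, target in zip(step_logits, targets):
        total -= log_softmax(logits)[target]
    return total / len(targets)


BOS, EOS = 1, 2              # hypothetical special-token indices
tokens = [7, 3, 5]           # hypothetical transcript token ids
tokens_bos = [BOS] + tokens  # what the decoder reads
tokens_eos = tokens + [EOS]  # what the decoder must predict

# Toy logits: one uniform score vector per decoding step (vocabulary of 8).
step_logits = [[0.0] * 8 for _ in tokens_bos]
print(nll_loss(step_logits, tokens_eos))  # log(8), about 2.079
```

A trained decoder would assign higher scores to the correct tokens and drive this value towards zero.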

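Here, we will focus on fine-tuning a basic Whisper model on LibriSpeech with greedy decoding.

To do so, we first need to modify the previous YAML file and Python script. Here, encoder_only must be set to False, since we want to keep the decoder. We also need to integrate a search function that takes the most probable token predicted by the decoder and feeds it back (concatenated with the previous tokens) to the decoder in an autoregressive manner. Unlike in the previous example, we do not need to add a language modeling head on top of the Whisper decoder, because it is already created for you when you fetch the Whisper model. You now have everything you need to fine-tune the Whisper encoder-decoder!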
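The autoregressive loop behind greedy search can be sketched in a few lines. This is a toy illustration, not the SpeechBrain searcher: `greedy_decode` and `toy_step` are hypothetical names, and the token indices are made up.

```python
def greedy_decode(step_fn, bos_index, eos_index, max_steps):
    """Repeatedly feed the growing token prefix back to the decoder."""
    tokens = [bos_index]
    for _ in range(max_steps):
        scores = step_fn(tokens)
        # Keep the single most probable next token.
        next_token = max(range(len(scores)), key=scores.__getitem__)
        if next_token == eos_index:
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the start token


def toy_step(tokens):
    """Hypothetical decoder: emits 3, 1, 2 and then EOS (index 0)."""
    script = [3, 1, 2, 0]
    target = script[min(len(tokens) - 1, len(script) - 1)]
    return [1.0 if i == target else 0.0 for i in range(5)]


print(greedy_decode(toy_step, bos_index=4, eos_index=0, max_steps=10))
# [3, 1, 2]
```

A beam search keeps several candidate prefixes at each step instead of only the best one, at a higher computational cost.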

Let's see how this works in practice:

  [...]

  whisper_hub: "openai/whisper-medium"
  freeze_pretrained: False
  lr_pretrained: 0.0001

  # We need to specify the language of the input audio.
  language: english

  # These values are used during decoding.
  # The first one designates the first token to be added during the search.
  # The second one is the token that stops the expansion of hypotheses that have reached eos.
  timestamp_index: 50363
  eos_index: 50257

  # This value is the ratio of steps during the decoding.
  # e.g, encoded speech is [B, T, F], then the maximal number of steps will be T * max_decode_ratio.
  max_decode_ratio: 0.5

  [...]

  # The instantiation of the SpeechBrain lobe
  whisper: !new:speechbrain.integrations.huggingface.whisper.Whisper
    source: !ref <whisper_hub>
    freeze: !ref <freeze_pretrained>
    encoder_only: False # :)
    save_path: !ref <save_folder>/whisper_checkpoint

  [...]

  pretrained_opt_class: !name:torch.optim.AdamW
    lr: !ref <lr_pretrained>
    
  lr_annealing_whisper: !new:speechbrain.nnet.schedulers.NewBobScheduler
    initial_value: !ref <lr_pretrained>
    improvement_threshold: 0.0025
    annealing_factor: 0.9

  # We add Whisper to the modules list so it is moved to the GPUs.
  modules:
    whisper: !ref <whisper>

  # We create the searcher used to decode the Whisper model.
  valid_greedy_searcher: !new:speechbrain.decoders.seq2seq.S2SWhisperGreedySearch
    model: !ref <whisper>
    bos_index: !ref <timestamp_index>
    eos_index: !ref <eos_index>
    min_decode_ratio: 0
    max_decode_ratio: !ref <max_decode_ratio>

  # We add the whisper to our checkpointer so the model can be saved!
  checkpointer: !new:speechbrain.utils.checkpoints.Checkpointer
    checkpoints_dir: !ref <save_folder>
    recoverables:
        whisper: !ref <whisper>
        scheduler_whisper: !ref <lr_annealing_whisper>
        counter: !ref <epoch_counter>

Then, we put everything together in the Python recipe file:

  class ASR(sb.Brain):
    def compute_forward(self, batch, stage):
      wavs, wav_lens = batch.sig
      bos_tokens, bos_tokens_lens = batch.tokens_bos
      
      [...]

      # The compute forward is similar to any compute_forward method for ASR   
      # with Transformers in SpeechBrain.

      # Forward encoder + decoder
      enc_out, logits, _ = self.modules.whisper(wavs, bos_tokens)

      log_probs = self.hparams.log_softmax(logits)

      hyps = None
      if stage != sb.Stage.TRAIN:
          # perform greedy searcher and return the hypotheses found
          hyps, _ = self.hparams.valid_greedy_searcher(enc_out, wav_lens)

      [...]

      return log_probs, hyps, wav_lens

    def compute_objectives(self, predictions, batch, stage):
      log_probs, hyps, wav_lens = predictions

      tokens_eos, tokens_eos_lens = batch.tokens_eos

      [...]

      # compute the NLL loss
      loss = self.hparams.nll_loss(
            log_probs, tokens_eos, tokens_eos_lens,
        )

      if stage != sb.Stage.TRAIN:
        tokens, tokens_lens = batch.tokens

        # Decode token terms to words
        predicted_words = self.tokenizer.batch_decode(
            hyps, skip_special_tokens=True
        )

        # Convert indices to words
        target_words = undo_padding(tokens, tokens_lens)
        target_words = self.tokenizer.batch_decode(
            target_words, skip_special_tokens=True
        )

        # Compute our metrics
        self.wer_metric.append(ids, predicted_words, target_words)
        self.cer_metric.append(ids, predicted_words, target_words)

        [...]

      return loss
    
    def on_stage_end(self, stage, stage_loss, epoch):
        """Gets called at the end of an epoch."""
        # Compute/store important stats
        stage_stats = {"loss": stage_loss}
        if stage == sb.Stage.TRAIN:
            self.train_stats = stage_stats
        else:
            stage_stats["CER"] = self.cer_metric.summarize("error_rate")
            stage_stats["WER"] = self.wer_metric.summarize("error_rate")

        # Perform end-of-iteration things, like annealing, logging, etc.
        if stage == sb.Stage.VALID:

            old_lr_whisper, new_lr_whisper = self.hparams.lr_annealing_whisper(
                stage_stats["loss"]
            )

            sb.nnet.schedulers.update_learning_rate(
                self.optimizer, new_lr_whisper
            )
            self.hparams.train_logger.log_stats(
                stats_meta={"epoch": epoch, "lr_whisper": old_lr_whisper},
                train_stats=self.train_stats,
                valid_stats=stage_stats,
            )
            self.checkpointer.save_and_keep_only(
                meta={"WER": stage_stats["WER"]}, min_keys=["WER"],
            )
        elif stage == sb.Stage.TEST:
            self.hparams.train_logger.log_stats(
                stats_meta={"Epoch loaded": self.hparams.epoch_counter.current},
                test_stats=stage_stats,
            )
            with open(self.hparams.wer_file, "w") as w:
                self.wer_metric.write_stats(w)
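The WER and CER reported by `wer_metric` and `cer_metric` are both edit-distance based error rates, computed over words and characters respectively. The dependency-free sketch below illustrates the idea on single sentences; `edit_distance` and `error_rate` are hypothetical helpers, and the SpeechBrain metrics additionally aggregate statistics over the whole dataset.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + (ref_item != hyp_item),  # substitution
                )
            )
        previous = current
    return previous[-1]


def error_rate(reference, hypothesis, unit="word"):
    """Edit distance divided by the reference length, as a percentage."""
    split = str.split if unit == "word" else list
    ref, hyp = split(reference), split(hypothesis)
    return 100.0 * edit_distance(ref, hyp) / len(ref)


wer = error_rate("the cat sat on the mat", "the cat sat on mat")
cer = error_rate("abc", "abd", unit="char")
print(round(wer, 2), round(cer, 2))  # 16.67 33.33
```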

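With this, you are ready to fine-tune the latest Whisper model on the dataset of your choice!

You can play with this model and try to improve it with beam search decoding instead of greedy search, or simply scale up and use the largest available Whisper model... all of this can be done with SpeechBrain!

Citing SpeechBrain

If you use SpeechBrain in your research or business, please cite it using the following BibTeX entries: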

@misc{speechbrainV1,
  title={Open-Source Conversational AI with {SpeechBrain} 1.0},
  author={Mirco Ravanelli and Titouan Parcollet and Adel Moumen and Sylvain de Langen and Cem Subakan and Peter Plantinga and Yingzhi Wang and Pooneh Mousavi and Luca Della Libera and Artem Ploujnikov and Francesco Paissan and Davide Borra and Salah Zaiem and Zeyu Zhao and Shucong Zhang and Georgios Karakasidis and Sung-Lin Yeh and Pierre Champion and Aku Rouhe and Rudolf Braun and Florian Mai and Juan Zuluaga-Gomez and Seyed Mahed Mousavi and Andreas Nautsch and Xuechen Liu and Sangeet Sagar and Jarod Duret and Salima Mdhaffar and Gaelle Laperriere and Mickael Rouvier and Renato De Mori and Yannick Esteve},
  year={2024},
  eprint={2407.00463},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2407.00463},
}
@misc{speechbrain,
  title={{SpeechBrain}: A General-Purpose Speech Toolkit},
  author={Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},
  year={2021},
  eprint={2106.04624},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  note={arXiv:2106.04624}
}