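Speech Enhancement From Scratch

Do you want to do regression tasks with speech? Look no further, you're in the right place. This tutorial walks you through a basic SpeechBrain speech enhancement template to show all the components needed to build a new recipe.

Before diving into the code, let's briefly introduce the speech enhancement problem. The goal of speech enhancement is to remove the noise from an input recording: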

SpeechBrain-Page-5 (1).png

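This problem is very hard because a wide variety of disturbances can corrupt a speech signal.

There are different ways to approach it. Nowadays, one of the most popular techniques is mask-based speech enhancement: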

SpeechBrain-Page-5 (2).png

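In masking approaches, rather than estimating the enhanced signal directly, we estimate a soft mask. We then obtain the enhanced signal by multiplying the noisy signal by the soft mask.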
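As a rough illustration of the multiplication step, here is a minimal sketch on three toy values. The numbers are hypothetical, and the mask is computed from the known clean signal under the simplifying assumption that clean and noise magnitudes add; a real system predicts the mask with a neural network instead.

```python
# Toy magnitudes for three bins (hypothetical values).
clean = [4.0, 1.0, 0.0]
noise = [1.0, 1.0, 2.0]

# Simplifying assumption: magnitudes add, so noisy = clean + noise.
noisy = [c + n for c, n in zip(clean, noise)]

# A ratio-style soft mask with values in [0, 1].
# A real system would predict this with a network.
mask = [c / x if x > 0 else 0.0 for c, x in zip(clean, noisy)]

# Masking: multiply the noisy input by the mask, element by element.
enhanced = [m * x for m, x in zip(mask, noisy)]

print(mask)
print(enhanced)
```

Values close to 1 keep a bin, values close to 0 suppress it, which is why the mask is called "soft".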

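Depending on the type of input/output, we can have:

Waveform masking (as shown in the figure above)
Spectral masking (as shown in the figure below)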

SpeechBrain-Page-5 (3).png

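In spectral masking, the system maps noisy spectrograms to clean ones. This mapping is generally considered easier than a waveform-to-waveform mapping. However, recovering the signal in the time domain requires phase information. A common solution (reasonable but not ideal) is to reuse the phase of the noisy signal. Waveform masking approaches do not suffer from this limitation and are gradually gaining popularity in the community.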
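To make the noisy-phase trick concrete, here is a minimal sketch on a handful of complex STFT bins. The bin values and mask values are hypothetical; the point is only that the mask scales the magnitude while the phase is copied from the noisy input.

```python
import cmath

# Hypothetical complex STFT bins of a noisy signal (one frame, three bins).
noisy_stft = [complex(3.0, 4.0), complex(0.0, -2.0), complex(-1.0, 1.0)]

# Hypothetical soft mask values predicted for the same bins.
mask = [0.5, 1.0, 0.0]

enhanced_stft = []
for bin_value, m in zip(noisy_stft, mask):
    magnitude = abs(bin_value)      # the quantity the mask operates on
    phase = cmath.phase(bin_value)  # borrowed unchanged from the noisy input
    enhanced_stft.append(cmath.rect(m * magnitude, phase))

print(enhanced_stft)
```

An inverse STFT would then turn the enhanced bins back into a waveform. Because the phase still belongs to the noisy signal, the reconstruction is only approximate even when the magnitudes are estimated perfectly.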

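It is worth mentioning that SpeechBrain currently also supports more advanced speech enhancement solutions, such as MetricGAN+ (which learns the PESQ metric within an adversarial training framework) and MimicLoss (which uses information derived from a speech recognizer to achieve better enhancement).

In this tutorial, we guide you through creating a simple speech enhancement system based on spectral masking.

In particular, we refer to the example reported here: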

https://github.com/speechbrain/speechbrain/blob/develop/templates/enhancement/

The README provides a good introduction, so it is reproduced here:

==========================

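This folder provides a working example of training a speech enhancement model from scratch, based on a small amount of data. The data we use comes from Mini Librispeech + OpenRIR.

There are four files here:

train.py: the main code file, which outlines the entire training process.
train.yaml: the hyperparameters file, which sets all parameters of execution.
custom_model.py: a file containing the definition of a PyTorch module.
mini_librispeech_prepare.py: if necessary, downloads and prepares the data manifests.

To train an enhancement model, just execute the following on the command line: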

python train.py train.yaml --data_folder /path/to/save/mini_librispeech

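This will automatically download and prepare the data manifest for Mini Librispeech, and then train a model on noisy samples generated dynamically with noise, reverberation, and babble.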

=========================
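The README mentions noisy samples generated on the fly. As a minimal sketch of the additive-noise part of that idea (not the template's actual augmentation code, which is configured in train.yaml), the hypothetical helper `mix_at_snr` below scales a noise signal to reach a requested signal-to-noise ratio before adding it. Reverberation and babble are not covered here.

```python
import math


def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the mixture has the requested SNR, then add it."""
    clean_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum(x * x for x in noise) / len(noise)
    # Gain that brings the noise power to clean_power / 10^(snr_db / 10).
    gain = math.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return [c + gain * n for c, n in zip(clean, noise)]


# Hypothetical toy signals of equal length.
clean = [0.5, -0.5, 0.5, -0.5]
noise = [0.1, 0.1, -0.1, -0.1]
noisy = mix_at_snr(clean, noise, snr_db=10.0)
print(noisy)
```

Drawing a different noise clip and SNR for every batch means the model rarely sees the same noisy sample twice, even with a small clean dataset.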

So first, let's make sure we can run the template as is, without modifications.

%%capture
# Installing SpeechBrain via pip
BRANCH = 'develop'
!python -m pip install git+https://github.com/speechbrain/speechbrain.git@$BRANCH

# Clone SpeechBrain repository
!git clone https://github.com/speechbrain/speechbrain/

import speechbrain as sb

%cd speechbrain/templates/enhancement
!python train.py train.yaml --device='cpu' --debug

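Recipe overview in train.py

Let's start with the highest-level view of the recipe and work our way down. To do so, we look at the bottom of the recipe, where the if __name__ == "__main__": block defines the recipe structure. The basic process is:

Load hyperparameters and command-line overrides.
Prepare the data manifests and loading objects.
Instantiate the SEBrain subclass as se_brain.
Call se_brain.fit() to perform training.
Call se_brain.evaluate() to check the final performance.

And that's it! Before we actually run this code, let's manually define the SEBrain subclass of the Brain class. If you want a more in-depth look at how the Brain class works, check out the Brain tutorial.

For simplicity, we first define the subclass with only the first method override, and then add the other overrides one by one. The first method is compute_forward, which simply defines how the model uses the data to make predictions. The return value should include any predictions the model makes. Specifically, this method computes the relevant features, predicts the mask, then applies the mask and resynthesizes the time-domain signal.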

import torch


class SEBrain(sb.Brain):
    """Class that manages the training loop. See speechbrain.core.Brain."""

    def compute_forward(self, batch, stage):
        """Apply masking to convert from noisy waveforms to enhanced signals.

        Arguments
        ---------
        batch : PaddedBatch
            This batch object contains all the relevant tensors for computation.
        stage : sb.Stage
            One of sb.Stage.TRAIN, sb.Stage.VALID, or sb.Stage.TEST.

        Returns
        -------
        predictions : dict
            A dictionary with keys {"spec", "wav"} with predicted features.
        """

        # We first move the batch to the appropriate device, and
        # compute the features necessary for masking.
        batch = batch.to(self.device)
        self.clean_wavs, self.lens = batch.clean_sig

        noisy_wavs, self.lens = self.hparams.wav_augment(
            self.clean_wavs, self.lens
        )

        noisy_feats = self.compute_feats(noisy_wavs)

        # Masking is done here with the "signal approximation (SA)" algorithm.
        # The masked input is compared directly with clean speech targets.
        mask = self.modules.model(noisy_feats)
        predict_spec = torch.mul(mask, noisy_feats)

        # Also return predicted wav, for evaluation. Note that this could
        # also be used for a time-domain loss term.
        predict_wav = self.hparams.resynth(
            torch.expm1(predict_spec), noisy_wavs
        )

        # Return a dictionary so we don't have to remember the order
        return {"spec": predict_spec, "wav": predict_wav}

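If you are wondering what the self.modules and self.hparams objects are, you are asking the right question. These objects are constructed when the SEBrain class is instantiated, and come directly from the dict arguments to the initializer: modules and hparams. The keys of each dictionary provide the names you use to reference the objects; for example, passing {"model": model} for modules lets you access the model as self.modules.model.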
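The key-to-attribute mapping can be illustrated with a small stand-in class. `TinyBrain` and `double` below are hypothetical names used only for this sketch; it shows the general idea with `types.SimpleNamespace` and is not meant as a copy of the SpeechBrain implementation.

```python
from types import SimpleNamespace


class TinyBrain:
    """Hypothetical stand-in for a Brain-like class; not SpeechBrain code."""

    def __init__(self, modules, hparams):
        # Dictionary keys become attribute names.
        self.modules = SimpleNamespace(**modules)
        self.hparams = SimpleNamespace(**hparams)


def double(x):
    """Placeholder for a model: any callable works for this illustration."""
    return 2 * x


brain = TinyBrain(modules={"model": double}, hparams={"lr": 0.001})
print(brain.modules.model(21))
print(brain.hparams.lr)
```

The same pattern explains why `self.hparams.compute_STFT` and `self.hparams.resynth` work in the methods of this tutorial: they are keys of the hyperparameter dictionary.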

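The other method that must be defined in a Brain subclass is compute_objectives. We subclass SEBrain itself only as a convenient way to split up the class definition; do not use this technique in production code!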

class SEBrain(SEBrain):
    def compute_objectives(self, predictions, batch, stage):
        """Computes the loss given the predicted and targeted outputs.

        Arguments
        ---------
        predictions : dict
            The output dict from `compute_forward`.
        batch : PaddedBatch
            This batch object contains all the relevant tensors for computation.
        stage : sb.Stage
            One of sb.Stage.TRAIN, sb.Stage.VALID, or sb.Stage.TEST.

        Returns
        -------
        loss : torch.Tensor
            A one-element tensor used for backpropagating the gradient.
        """

        # Prepare clean targets for comparison
        clean_spec = self.compute_feats(self.clean_wavs)

        # Directly compare the masked spectrograms with the clean targets
        loss = sb.nnet.losses.mse_loss(
            predictions["spec"], clean_spec, self.lens
        )

        # Append this batch of losses to the loss metric for easy
        # summarization.
        self.loss_metric.append(
            batch.id,
            predictions["spec"],
            clean_spec,
            self.lens,
            reduction="batch",
        )

        # Some evaluations are slower, and we only want to perform them
        # on the validation set.
        if stage != sb.Stage.TRAIN:

            # Evaluate speech intelligibility as an additional metric
            self.stoi_metric.append(
                batch.id,
                predictions["wav"],
                self.clean_wavs,
                self.lens,
                reduction="batch",
            )

        return loss
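The loss call above receives `self.lens` so that padded frames do not contribute to the error. Here is a minimal sketch of that idea, assuming relative lengths between 0 and 1 and one scalar feature per frame. The hypothetical `masked_mse` helper is a simplification; the exact averaging used by the library may differ.

```python
def masked_mse(predictions, targets, rel_lens):
    """Mean squared error over valid (non-padded) frames only."""
    total, count = 0.0, 0
    for pred, target, rel_len in zip(predictions, targets, rel_lens):
        # Relative length -> number of valid frames in this padded sequence.
        n_valid = round(rel_len * len(pred))
        for p, t in zip(pred[:n_valid], target[:n_valid]):
            total += (p - t) ** 2
            count += 1
    return total / count


# Two padded sequences of four frames; the second is only half valid.
predictions = [[1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 9.0, 9.0]]
targets = [[1.0, 2.0, 3.0, 2.0], [0.0, 1.0, 0.0, 0.0]]
rel_lens = [1.0, 0.5]

loss = masked_mse(predictions, targets, rel_lens)
print(loss)
```

The large mismatches in the padded half of the second sequence are ignored, so padding does not distort the loss.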

Both of these methods use a third method that is not an override, called compute_feats, which we quickly define here:

class SEBrain(SEBrain):
    def compute_feats(self, wavs):
        """Returns corresponding log-spectral features of the input waveforms.

        Arguments
        ---------
        wavs : torch.Tensor
            The batch of waveforms to convert to log-spectral features.
        """

        # Log-spectral features
        feats = self.hparams.compute_STFT(wavs)
        feats = sb.processing.features.spectral_magnitude(feats, power=0.5)

        # Log1p reduces the emphasis on small differences
        feats = torch.log1p(feats)

        return feats
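The feature chain can be checked by hand on a single bin. The sketch below assumes that `power=0.5` yields the magnitude, that is (re^2 + im^2) ** 0.5, and shows that `expm1` undoes `log1p`, which is why compute_forward applies `torch.expm1` before resynthesis. The bin value is hypothetical.

```python
import math

# One hypothetical complex STFT bin.
real, imag = 3.0, 4.0

# (re^2 + im^2) ** 0.5 is the spectral magnitude.
magnitude = (real ** 2 + imag ** 2) ** 0.5

# log1p compresses the dynamic range and is well-defined at zero.
feat = math.log1p(magnitude)

# expm1 inverts log1p, recovering the magnitude.
recovered = math.expm1(feat)

print(magnitude, feat, recovered)
```

Using log(1 + x) instead of log(x) avoids negative infinity for silent bins, since log1p(0) is exactly 0.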

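Only two other methods are defined, to keep track of statistics and to save checkpoints. These are the on_stage_start and on_stage_end methods, which fit() calls before and after iterating over each dataset. Before each stage begins, we set up the metric trackers: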

class SEBrain(SEBrain):
    def on_stage_start(self, stage, epoch=None):
        """Gets called at the beginning of each epoch.

        Arguments
        ---------
        stage : sb.Stage
            One of sb.Stage.TRAIN, sb.Stage.VALID, or sb.Stage.TEST.
        epoch : int
            The currently-starting epoch. This is passed
            `None` during the test stage.
        """

        # Set up statistics trackers for this stage
        self.loss_metric = sb.utils.metric_stats.MetricStats(
            metric=sb.nnet.losses.mse_loss
        )

        # Set up evaluation-only statistics trackers
        if stage != sb.Stage.TRAIN:
            self.stoi_metric = sb.utils.metric_stats.MetricStats(
                metric=sb.nnet.loss.stoi_loss.stoi_loss
            )

After the validation stage, we use the trackers to summarize the statistics and save a checkpoint.

class SEBrain(SEBrain):
    def on_stage_end(self, stage, stage_loss, epoch=None):
        """Gets called at the end of an epoch.

        Arguments
        ---------
        stage : sb.Stage
            One of sb.Stage.TRAIN, sb.Stage.VALID, sb.Stage.TEST
        stage_loss : float
            The average loss for all of the data processed in this stage.
        epoch : int
            The currently-starting epoch. This is passed
            `None` during the test stage.
        """

        # Store the train loss until the validation stage.
        if stage == sb.Stage.TRAIN:
            self.train_loss = stage_loss

        # Summarize the statistics from the stage for record-keeping.
        else:
            stats = {
                "loss": stage_loss,
                "stoi": -self.stoi_metric.summarize("average"),
            }

        # At the end of validation, we can write stats and checkpoints
        if stage == sb.Stage.VALID:
            # The train_logger writes a summary to stdout and to the logfile.
            self.hparams.train_logger.log_stats(
                {"Epoch": epoch},
                train_stats={"loss": self.train_loss},
                valid_stats=stats,
            )

            # Save the current checkpoint and delete previous checkpoints,
            # unless they have the current best STOI score.
            self.checkpointer.save_and_keep_only(meta=stats, max_keys=["stoi"])

        # We also write statistics about test data to stdout and to the logfile.
        if stage == sb.Stage.TEST:
            self.hparams.train_logger.log_stats(
                {"Epoch loaded": self.hparams.epoch_counter.current},
                test_stats=stats,
            )
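To see how the overrides fit together, here is a hypothetical miniature of a training loop. `ToyBrain` is an illustration only: it leaves out devices, gradients, the optimizer, and checkpointing, and it is not the SpeechBrain source. It only records the order in which the hooks are called.

```python
class ToyBrain:
    """Hypothetical miniature of a fit() loop; not the SpeechBrain source."""

    def __init__(self):
        self.calls = []

    def on_stage_start(self, stage, epoch=None):
        self.calls.append(f"start:{stage}")

    def compute_forward(self, batch, stage):
        return batch * 2

    def compute_objectives(self, predictions, batch, stage):
        return float(predictions - batch)

    def on_stage_end(self, stage, stage_loss, epoch=None):
        self.calls.append(f"end:{stage}:{stage_loss}")

    def run_stage(self, stage, dataset, epoch=None):
        self.on_stage_start(stage, epoch)
        losses = []
        for batch in dataset:
            predictions = self.compute_forward(batch, stage)
            losses.append(self.compute_objectives(predictions, batch, stage))
        # The average loss of the stage is handed to on_stage_end.
        self.on_stage_end(stage, sum(losses) / len(losses), epoch)

    def fit(self, epochs, train_set, valid_set):
        for epoch in range(1, epochs + 1):
            self.run_stage("TRAIN", train_set, epoch)
            self.run_stage("VALID", valid_set, epoch)


brain = ToyBrain()
brain.fit(epochs=1, train_set=[1, 3], valid_set=[2])
print(brain.calls)
```

Each stage is bracketed by the two hooks, which is why the metric trackers are created in on_stage_start and summarized in on_stage_end.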

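OK, that's everything needed to define the SEBrain class! The only thing left before we can actually run this is the data loading function. We'll use DynamicItemDatasets, which you can learn more about in the data loading tutorial. We only need to define the function that loads the audio data, and then we can use it to create all of the datasets!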

def dataio_prep(hparams):
    """This function prepares the datasets to be used in the brain class.
    It also defines the data processing pipeline through user-defined functions.

    We expect `prepare_mini_librispeech` to have been called before this,
    so that the `train.json` and `valid.json` manifest files are available.

    Arguments
    ---------
    hparams : dict
        This dictionary is loaded from the `train.yaml` file, and it includes
        all the hyperparameters needed for dataset construction and loading.

    Returns
    -------
    datasets : dict
        Contains three keys, "train", "valid", and "test", that correspond
        to the appropriate DynamicItemDataset object.
    """

    # Define audio pipeline. It only loads the clean signal; noise, reverb,
    # and babble are added on-the-fly in `compute_forward` via `wav_augment`.
    # Of course for a real enhancement dataset, you'd want a fixed valid set.
    @sb.utils.data_pipeline.takes("wav")
    @sb.utils.data_pipeline.provides("clean_sig")
    def audio_pipeline(wav):
        """Load the clean signal from disk. Corruption is applied later,
        on the whole batch, in `compute_forward`."""
        clean_sig = sb.dataio.dataio.read_audio(wav)
        return clean_sig

    # Define datasets sorted by ascending lengths for efficiency
    datasets = {}
    data_info = {
        "train": hparams["train_annotation"],
        "valid": hparams["valid_annotation"],
        "test": hparams["test_annotation"],
    }
    hparams["dataloader_options"]["shuffle"] = False
    for dataset in data_info:
        datasets[dataset] = sb.dataio.dataset.DynamicItemDataset.from_json(
            json_path=data_info[dataset],
            replacements={"data_root": hparams["data_folder"]},
            dynamic_items=[audio_pipeline],
            output_keys=["id", "clean_sig"],
        ).filtered_sorted(sort_key="length")
    return datasets
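The dataset construction relies on a JSON manifest, a placeholder replacement, and sorting by length. The sketch below mimics those three steps in plain Python. The entries, the `load_manifest` helper, and the `/data/mini_librispeech` path are hypothetical; the real manifests are written by the preparation script and may contain other fields.

```python
import json

# Hypothetical manifest entries; real ones are written by the prepare script.
manifest_text = json.dumps({
    "utt_b": {"wav": "{data_root}/b.flac", "length": 3.2},
    "utt_a": {"wav": "{data_root}/a.flac", "length": 1.5},
})


def load_manifest(text, replacements):
    """Parse the JSON and substitute placeholders such as {data_root}."""
    entries = json.loads(text)
    for entry in entries.values():
        for key, value in replacements.items():
            entry["wav"] = entry["wav"].replace("{" + key + "}", value)
    return entries


entries = load_manifest(manifest_text, {"data_root": "/data/mini_librispeech"})

# Sort utterance ids by ascending length, mirroring sort_key="length".
sorted_ids = sorted(entries, key=lambda utt_id: entries[utt_id]["length"])
print(sorted_ids)
```

Sorting by length groups utterances of similar duration into the same batch, which reduces padding. This is also why shuffling is disabled in the dataloader options above.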

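Now that we've defined all of the code in train.py other than the __main__ block, we can start running our recipe! This code has been slightly modified to simplify the parts that aren't necessarily relevant to running the code in Colab. The first step is to load the hyperparameters, which automatically creates many of the required objects. You can find more information about how HyperPyYAML works in our HyperPyYAML tutorial. In addition, we create the folder used to store experiment data, checkpoints, and statistics.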

from hyperpyyaml import load_hyperpyyaml
with open("train.yaml") as fin:
  hparams = load_hyperpyyaml(fin)
sb.create_experiment_directory(hparams["output_folder"])

Just like that, we have access to our PyTorch model along with many other hyperparameters. Feel free to explore the hparams object; here are a few examples:

# Already-applied random seed
hparams["seed"]

# STFT function
hparams["compute_STFT"]

# Masking model
hparams["model"]

Prepare the data manifests, and create the dataset objects using the function we defined earlier:

from mini_librispeech_prepare import prepare_mini_librispeech
prepare_mini_librispeech(
  data_folder=hparams["data_folder"],
  save_json_train=hparams["train_annotation"],
  save_json_valid=hparams["valid_annotation"],
  save_json_test=hparams["test_annotation"],
)
datasets = dataio_prep(hparams)

We can check that the data is loaded correctly by looking at the first item:

import torch
datasets["train"][0]

datasets["valid"][0]

Instantiate the SEBrain object to prepare for training:

se_brain = SEBrain(
  modules=hparams["modules"],
  opt_class=hparams["opt_class"],
  hparams=hparams,
  checkpointer=hparams["checkpointer"],
)

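Then call fit() to train! The fit() method iterates the training loop, calling the methods necessary to update the model parameters. Since all objects with changing state are managed by the Checkpointer, training can be stopped at any point and will resume on the next call.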

se_brain.fit(
  epoch_counter=se_brain.hparams.epoch_counter,
  train_set=datasets["train"],
  valid_set=datasets["valid"],
  train_loader_kwargs=hparams["dataloader_options"],
  valid_loader_kwargs=hparams["dataloader_options"],
)

Once training is complete, we can load the checkpoint that performed best on the validation data (as measured by STOI) for evaluation.

se_brain.evaluate(
  test_set=datasets["test"],
  max_key="stoi",
  test_loader_kwargs=hparams["dataloader_options"],
)
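Selecting a checkpoint with `max_key="stoi"` amounts to picking the saved state whose recorded metadata has the highest value for that key. A minimal sketch of that selection rule follows. The checkpoint names and scores are made up for illustration and are not results from this recipe, and `best_checkpoint` is a hypothetical helper rather than a Checkpointer method.

```python
# Hypothetical metadata stored with each checkpoint (one per epoch).
checkpoints = [
    {"name": "epoch_1", "meta": {"loss": 0.40, "stoi": 0.71}},
    {"name": "epoch_2", "meta": {"loss": 0.31, "stoi": 0.78}},
    {"name": "epoch_3", "meta": {"loss": 0.29, "stoi": 0.75}},
]


def best_checkpoint(checkpoints, max_key):
    """Return the checkpoint whose metadata has the largest `max_key`."""
    return max(checkpoints, key=lambda ckpt: ckpt["meta"][max_key])


best = best_checkpoint(checkpoints, max_key="stoi")
print(best["name"])
```

Note that the checkpoint with the lowest loss is not necessarily the one with the best STOI, so the choice of key matters.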

Citing SpeechBrain

If you use SpeechBrain in your research or business, please cite it using the following BibTeX entries:

@misc{speechbrainV1,
  title={Open-Source Conversational AI with {SpeechBrain} 1.0},
  author={Mirco Ravanelli and Titouan Parcollet and Adel Moumen and Sylvain de Langen and Cem Subakan and Peter Plantinga and Yingzhi Wang and Pooneh Mousavi and Luca Della Libera and Artem Ploujnikov and Francesco Paissan and Davide Borra and Salah Zaiem and Zeyu Zhao and Shucong Zhang and Georgios Karakasidis and Sung-Lin Yeh and Pierre Champion and Aku Rouhe and Rudolf Braun and Florian Mai and Juan Zuluaga-Gomez and Seyed Mahed Mousavi and Andreas Nautsch and Xuechen Liu and Sangeet Sagar and Jarod Duret and Salima Mdhaffar and Gaelle Laperriere and Mickael Rouvier and Renato De Mori and Yannick Esteve},
  year={2024},
  eprint={2407.00463},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2407.00463},
}
@misc{speechbrain,
  title={{SpeechBrain}: A General-Purpose Speech Toolkit},
  author={Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},
  year={2021},
  eprint={2106.04624},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  note={arXiv:2106.04624}
}