Speech Enhancement From Scratch
Do you want to do regression tasks with speech? Look no further, you're in the right place. This tutorial will walk you through all the components needed to build a new recipe, based on a basic SpeechBrain speech enhancement template.

Before diving into the code, let's briefly introduce the speech enhancement problem. The goal of speech enhancement is to remove the noise that corrupts an input recording. The problem is very challenging because the interference that can corrupt a speech signal comes in many varieties.

There are different ways to tackle this problem. Today, one of the most popular techniques is mask-based speech enhancement. In the masking approach, instead of estimating the enhanced signal directly, we estimate a soft mask. We then obtain the enhanced signal by multiplying the noisy signal by the soft mask.
Depending on the type of input/output, we can have:

- Waveform masking (as shown in the figure above)
- Spectral masking (as shown in the figure below)
With spectral masking, the system maps the noisy spectrogram into a clean one. This mapping is often considered easier than the waveform-to-waveform one. However, recovering the signal in the time domain requires adding phase information. A common solution (reasonable but not ideal) is to use the phase of the noisy signal. Waveform masking approaches do not suffer from this limitation and are growing in popularity in the community.
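To make this concrete, here is a minimal sketch of spectral masking in plain PyTorch. The model below is a stand-in for any network producing a soft mask in [0, 1], and the STFT settings are illustrative rather than the ones used later in the template:

import torch

def enhance(noisy_wav, model, n_fft=512, hop=256):
    """Illustrative spectral masking: mask the noisy magnitude,
    then resynthesize with the noisy phase."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(
        noisy_wav, n_fft, hop, window=window, return_complex=True
    )
    mag, phase = spec.abs(), spec.angle()
    mask = model(mag)            # predicted soft mask, same shape as mag
    enhanced_mag = mask * mag    # apply the mask to the noisy magnitude
    # Reuse the noisy phase: reasonable but suboptimal, as noted above
    enhanced_spec = torch.polar(enhanced_mag, phase)
    return torch.istft(enhanced_spec, n_fft, hop, window=window)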
It is worth mentioning that SpeechBrain currently supports more advanced speech enhancement solutions as well, such as MetricGAN+ (which learns the PESQ metric within an adversarial training framework) and MimicLoss (which uses information derived from a speech recognizer to achieve better enhancement).

In this tutorial, we will walk you through the creation of a simple speech enhancement system based on spectral masking. In particular, we will refer to the example reported here:
https://github.com/speechbrain/speechbrain/blob/develop/templates/enhancement/
The README provides a good introduction, so it is reproduced here:
==========================
This folder provides a working example for training a speech enhancement model from scratch, based on a small amount of data. The data we use comes from Mini Librispeech + OpenRIR.

There are four files here:

- train.py: the main code file, outlining the entire training process.
- train.yaml: the hyperparameters file, setting all the execution parameters.
- custom_model.py: a file containing the definitions of PyTorch modules.
- mini_librispeech_prepare.py: downloads and prepares the data manifests, if necessary.
To train an enhancement model, just execute the following on the command line:
python train.py train.yaml --data_folder /path/to/save/mini_librispeech
This will automatically download and prepare the data manifests for Mini Librispeech, and then train a model with dynamically generated noisy samples, using noise, reverberation, and babble.
=========================
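As an aside, the core of the on-the-fly corruption mentioned above is simply mixing interference into the clean signal at a target SNR. Here is a minimal standalone sketch of that idea (the template's wav_augment object handles this, plus reverberation and babble, for you):

import torch

def add_noise(clean, noise, snr_db):
    """Scale `noise` so the clean-to-noise power ratio equals `snr_db`,
    then mix it with the clean signal (assumes equal lengths)."""
    clean_power = clean.pow(2).mean()
    noise_power = noise.pow(2).mean()
    scale = torch.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise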
So, first of all, let's make sure we can run the template directly, without modifications.
%%capture
# Installing SpeechBrain via pip
BRANCH = 'develop'
!python -m pip install git+https://github.com/speechbrain/speechbrain.git@$BRANCH
# Clone SpeechBrain repository
!git clone https://github.com/speechbrain/speechbrain/
import speechbrain as sb
%cd speechbrain/templates/enhancement
!python train.py train.yaml --device='cpu' --debug
Overview of the recipe in train.py
Let's start with the highest-level view of the recipe and then work our way down. For this, we should look at the bottom of the recipe file, where the if __name__ == "__main__": block defines the recipe structure. The basic process is:

1. Load the hyperparameters and the command-line overrides.
2. Prepare the data manifests and the loading objects.
3. Instantiate the SEBrain subclass as se_brain.
4. Call se_brain.fit() to perform the training.
5. Call se_brain.evaluate() to check the final performance.
That's it! Before we actually run this code, let's manually define the SEBrain subclass of the Brain class. If you want a deeper look at how the Brain class works, check out the Brain tutorial.

For simplicity, we define the subclass with only the first method override, and then add the other overrides one at a time. The first method is compute_forward, which simply defines how the model uses the data to make predictions. The return value should include any predictions made by the model. Specifically, this method computes the relevant features, computes the predicted mask, then applies the mask and recomputes the time-domain signal.
import torch

class SEBrain(sb.Brain):
"""Class that manages the training loop. See speechbrain.core.Brain."""
def compute_forward(self, batch, stage):
"""Apply masking to convert from noisy waveforms to enhanced signals.
Arguments
---------
batch : PaddedBatch
This batch object contains all the relevant tensors for computation.
stage : sb.Stage
One of sb.Stage.TRAIN, sb.Stage.VALID, or sb.Stage.TEST.
Returns
-------
predictions : dict
A dictionary with keys {"spec", "wav"} with predicted features.
"""
# We first move the batch to the appropriate device, and
# compute the features necessary for masking.
batch = batch.to(self.device)
self.clean_wavs, self.lens = batch.clean_sig
noisy_wavs, self.lens = self.hparams.wav_augment(
self.clean_wavs, self.lens
)
noisy_feats = self.compute_feats(noisy_wavs)
# Masking is done here with the "signal approximation (SA)" algorithm.
# The masked input is compared directly with clean speech targets.
mask = self.modules.model(noisy_feats)
predict_spec = torch.mul(mask, noisy_feats)
# Also return predicted wav, for evaluation. Note that this could
# also be used for a time-domain loss term.
predict_wav = self.hparams.resynth(
torch.expm1(predict_spec), noisy_wavs
)
# Return a dictionary so we don't have to remember the order
return {"spec": predict_spec, "wav": predict_wav}
If you are wondering what the self.modules and self.hparams objects here are, you're asking the right question. These objects are constructed when the SEBrain class is instantiated, and they come directly from the dict arguments of the initializer: modules and hparams. The keys of the dictionaries provide the names you use to refer to the objects; for example, passing {"model": model} for modules lets you access the model with self.modules.model.
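For instance, here is a toy illustration of that mechanism (the real instantiation, with the objects from train.yaml, happens later in this tutorial):

# Toy example: the Brain class wraps `modules` in a torch.nn.ModuleDict
# and exposes the entries of `hparams` as attributes.
toy_brain = SEBrain(
    modules={"model": torch.nn.Linear(10, 10)},
    hparams={"lr": 0.01},
)
print(toy_brain.modules.model)  # the Linear module passed above
print(toy_brain.hparams.lr)     # 0.01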
The other method that needs to be defined in a Brain subclass is the compute_objectives function. We subclass SEBrain itself only as a convenient way of splitting up the class definition; don't use this technique in production code!
class SEBrain(SEBrain):
def compute_objectives(self, predictions, batch, stage):
"""Computes the loss given the predicted and targeted outputs.
Arguments
---------
predictions : dict
The output dict from `compute_forward`.
batch : PaddedBatch
This batch object contains all the relevant tensors for computation.
stage : sb.Stage
One of sb.Stage.TRAIN, sb.Stage.VALID, or sb.Stage.TEST.
Returns
-------
loss : torch.Tensor
A one-element tensor used for backpropagating the gradient.
"""
# Prepare clean targets for comparison
clean_spec = self.compute_feats(self.clean_wavs)
# Directly compare the masked spectrograms with the clean targets
loss = sb.nnet.losses.mse_loss(
predictions["spec"], clean_spec, self.lens
)
        # Append this batch of losses to the loss metric for easy summarization
self.loss_metric.append(
batch.id,
predictions["spec"],
clean_spec,
self.lens,
reduction="batch",
)
# Some evaluations are slower, and we only want to perform them
# on the validation set.
if stage != sb.Stage.TRAIN:
# Evaluate speech intelligibility as an additional metric
self.stoi_metric.append(
batch.id,
predictions["wav"],
self.clean_wavs,
self.lens,
reduction="batch",
)
return loss
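If MetricStats is new to you, here is an illustrative round-trip, independent of the Brain class; the tensor shapes here are arbitrary:

tracker = sb.utils.metric_stats.MetricStats(metric=sb.nnet.losses.mse_loss)
predictions = torch.rand(2, 100, 40)
targets = torch.rand(2, 100, 40)
lengths = torch.tensor([1.0, 0.8])  # relative lengths, as in the code above
tracker.append(
    ["utt1", "utt2"], predictions, targets, lengths, reduction="batch"
)
print(tracker.summarize("average"))  # mean MSE over the two utterances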
Both of these methods use a third method, which is not an override, called compute_feats. We quickly define it here:
class SEBrain(SEBrain):
def compute_feats(self, wavs):
"""Returns corresponding log-spectral features of the input waveforms.
Arguments
---------
wavs : torch.Tensor
The batch of waveforms to convert to log-spectral features.
"""
# Log-spectral features
feats = self.hparams.compute_STFT(wavs)
feats = sb.processing.features.spectral_magnitude(feats, power=0.5)
# Log1p reduces the emphasis on small differences
feats = torch.log1p(feats)
return feats
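For intuition, here is the same log-compressed magnitude pipeline written with plain torch.stft. The STFT settings below are illustrative, not the ones from train.yaml, and note that torch.stft lays out its output as [batch, freq, frames] rather than SpeechBrain's [batch, frames, freq]:

wav = torch.rand(1, 16000)  # one second of random audio at 16 kHz
spec = torch.stft(
    wav, n_fft=400, hop_length=160,
    window=torch.hann_window(400), return_complex=True,
)
# With power=0.5, spectral_magnitude reduces to the plain magnitude
feats = torch.log1p(spec.abs())
print(feats.shape)  # torch.Size([1, 201, 101]): [batch, freq, frames]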
Only two more methods need to be defined, for tracking statistics and saving checkpoints. These are the on_stage_start and on_stage_end methods, called by fit() before and after iterating over each dataset. Before each stage starts, we set up the metric trackers:
class SEBrain(SEBrain):
def on_stage_start(self, stage, epoch=None):
"""Gets called at the beginning of each epoch.
Arguments
---------
stage : sb.Stage
One of sb.Stage.TRAIN, sb.Stage.VALID, or sb.Stage.TEST.
epoch : int
The currently-starting epoch. This is passed
`None` during the test stage.
"""
# Set up statistics trackers for this stage
self.loss_metric = sb.utils.metric_stats.MetricStats(
metric=sb.nnet.losses.mse_loss
)
# Set up evaluation-only statistics trackers
if stage != sb.Stage.TRAIN:
self.stoi_metric = sb.utils.metric_stats.MetricStats(
metric=sb.nnet.loss.stoi_loss.stoi_loss
)
After the validation stage, we use the trackers to summarize the statistics, and we save a checkpoint.
class SEBrain(SEBrain):
def on_stage_end(self, stage, stage_loss, epoch=None):
"""Gets called at the end of an epoch.
Arguments
---------
stage : sb.Stage
One of sb.Stage.TRAIN, sb.Stage.VALID, sb.Stage.TEST
stage_loss : float
The average loss for all of the data processed in this stage.
epoch : int
The currently-starting epoch. This is passed
`None` during the test stage.
"""
# Store the train loss until the validation stage.
if stage == sb.Stage.TRAIN:
self.train_loss = stage_loss
# Summarize the statistics from the stage for record-keeping.
else:
stats = {
"loss": stage_loss,
"stoi": -self.stoi_metric.summarize("average"),
}
# At the end of validation, we can write stats and checkpoints
if stage == sb.Stage.VALID:
# The train_logger writes a summary to stdout and to the logfile.
self.hparams.train_logger.log_stats(
{"Epoch": epoch},
train_stats={"loss": self.train_loss},
valid_stats=stats,
)
# Save the current checkpoint and delete previous checkpoints,
# unless they have the current best STOI score.
self.checkpointer.save_and_keep_only(meta=stats, max_keys=["stoi"])
# We also write statistics about test data to stdout and to the logfile.
if stage == sb.Stage.TEST:
self.hparams.train_logger.log_stats(
{"Epoch loaded": self.hparams.epoch_counter.current},
test_stats=stats,
)
OK, that's everything needed to define the SEBrain class! The only thing left before we can actually run this thing is the data-loading functions. We will use DynamicItemDatasets, which you can read more about in the data loading tutorial. We just have to define the function that loads the audio data, and then we can create all the datasets from it!
def dataio_prep(hparams):
"""This function prepares the datasets to be used in the brain class.
It also defines the data processing pipeline through user-defined functions.
We expect `prepare_mini_librispeech` to have been called before this,
so that the `train.json` and `valid.json` manifest files are available.
Arguments
---------
hparams : dict
This dictionary is loaded from the `train.yaml` file, and it includes
all the hyperparameters needed for dataset construction and loading.
Returns
-------
datasets : dict
Contains two keys, "train" and "valid" that correspond
to the appropriate DynamicItemDataset object.
"""
# Define audio pipeline. Adds noise, reverb, and babble on-the-fly.
# Of course for a real enhancement dataset, you'd want a fixed valid set.
@sb.utils.data_pipeline.takes("wav")
@sb.utils.data_pipeline.provides("clean_sig")
def audio_pipeline(wav):
"""Load the signal, and pass it and its length to the corruption class.
This is done on the CPU in the `collate_fn`."""
clean_sig = sb.dataio.dataio.read_audio(wav)
return clean_sig
# Define datasets sorted by ascending lengths for efficiency
datasets = {}
data_info = {
"train": hparams["train_annotation"],
"valid": hparams["valid_annotation"],
"test": hparams["test_annotation"],
}
hparams["dataloader_options"]["shuffle"] = False
for dataset in data_info:
datasets[dataset] = sb.dataio.dataset.DynamicItemDataset.from_json(
json_path=data_info[dataset],
replacements={"data_root": hparams["data_folder"]},
dynamic_items=[audio_pipeline],
output_keys=["id", "clean_sig"],
).filtered_sorted(sort_key="length")
return datasets
Now that we have defined all of the code in train.py other than the __main__ block, we can start running our recipe! This code has been slightly modified to simplify parts that are not necessarily relevant to running the code in Colab. The first step is loading the hyperparameters, which automatically creates a number of the required objects. You can find more about how HyperPyYAML works in our HyperPyYAML tutorial. In addition, we create the folders for storing experimental data, checkpoints, and statistics.
from hyperpyyaml import load_hyperpyyaml
with open("train.yaml") as fin:
hparams = load_hyperpyyaml(fin)
sb.create_experiment_directory(hparams["output_folder"])
Just like that, we have access to our pytorch model, as well as many other hyperparameters. Feel free to explore the hparams object; here are some examples:
# Already-applied random seed
hparams["seed"]
# STFT function
hparams["compute_STFT"]
# Masking model
hparams["model"]
Prepare the data manifests, and create the dataset objects using the function we defined earlier:
from mini_librispeech_prepare import prepare_mini_librispeech
prepare_mini_librispeech(
data_folder=hparams["data_folder"],
save_json_train=hparams["train_annotation"],
save_json_valid=hparams["valid_annotation"],
save_json_test=hparams["test_annotation"],
)
datasets = dataio_prep(hparams)
We can check that the data loaded correctly by looking at the first item:
import torch
datasets["train"][0]
datasets["valid"][0]
Instantiate the SEBrain object in preparation for training:
se_brain = SEBrain(
modules=hparams["modules"],
opt_class=hparams["opt_class"],
hparams=hparams,
checkpointer=hparams["checkpointer"],
)
Then call fit() to do the training! The fit() method iterates the training loop, calling the methods necessary to update the parameters of the model. Since all objects with changing state are managed by the Checkpointer, training can be stopped at any point, and it will resume on the next call.
se_brain.fit(
epoch_counter=se_brain.hparams.epoch_counter,
train_set=datasets["train"],
valid_set=datasets["valid"],
train_loader_kwargs=hparams["dataloader_options"],
valid_loader_kwargs=hparams["dataloader_options"],
)
After training is complete, we can load the checkpoint that performed best on the validation data, as measured by STOI, and use it for evaluation.
se_brain.evaluate(
test_set=datasets["test"],
max_key="stoi",
test_loader_kwargs=hparams["dataloader_options"],
)
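Once training and evaluation are done, the same pieces can be reused to enhance a single recording. Here is a hedged sketch: the file path is hypothetical, and it assumes a sample_rate entry in the loaded hyperparameters:

noisy = sb.dataio.dataio.read_audio("path/to/noisy.wav")  # hypothetical path
noisy = noisy.unsqueeze(0).to(se_brain.device)            # shape [1, time]
se_brain.modules.eval()
with torch.no_grad():
    feats = se_brain.compute_feats(noisy)
    mask = se_brain.modules.model(feats)
    enhanced_spec = torch.mul(mask, feats)
    enhanced_wav = se_brain.hparams.resynth(
        torch.expm1(enhanced_spec), noisy
    )
sb.dataio.dataio.write_audio(
    "enhanced.wav",
    enhanced_wav.squeeze(0).cpu(),
    hparams["sample_rate"],  # assumed to be defined in train.yaml
)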
Citing SpeechBrain
If you use SpeechBrain in your research or business, please cite it using the following BibTeX entries:
@misc{speechbrainV1,
title={Open-Source Conversational AI with {SpeechBrain} 1.0},
author={Mirco Ravanelli and Titouan Parcollet and Adel Moumen and Sylvain de Langen and Cem Subakan and Peter Plantinga and Yingzhi Wang and Pooneh Mousavi and Luca Della Libera and Artem Ploujnikov and Francesco Paissan and Davide Borra and Salah Zaiem and Zeyu Zhao and Shucong Zhang and Georgios Karakasidis and Sung-Lin Yeh and Pierre Champion and Aku Rouhe and Rudolf Braun and Florian Mai and Juan Zuluaga-Gomez and Seyed Mahed Mousavi and Andreas Nautsch and Xuechen Liu and Sangeet Sagar and Jarod Duret and Salima Mdhaffar and Gaelle Laperriere and Mickael Rouvier and Renato De Mori and Yannick Esteve},
year={2024},
eprint={2407.00463},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2407.00463},
}
@misc{speechbrain,
title={{SpeechBrain}: A General-Purpose Speech Toolkit},
author={Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},
year={2021},
eprint={2106.04624},
archivePrefix={arXiv},
primaryClass={eess.AS},
note={arXiv:2106.04624}
}