在 GitHub 上执行或查看/下载此 notebook

SpeechBrain 能做什么？

SpeechBrain 已经可以做很多很棒的事情了。您可以使用 SpeechBrain 解决以下类型的问题：

语音分类 (多对一，例如说话人识别)
语音回归 (语音到语音映射，例如语音增强)
序列到序列 (语音到语音映射，例如语音识别)

更准确地说，SpeechBrain 支持许多对话式 AI 任务（请参阅我们的 README）。另请参阅所有不同的教程。

对于所有这些任务，我们提供了允许用户从零开始训练模型的 recipe。我们提供预训练模型和实验日志。

使用 SpeechBrain 从零开始训练模型的通常方法如下：

cd recipe/dataset_name/task_name
python train.py train.yaml --data_folder=/path/to/the/dataset

有关训练的更多信息，请参阅上述教程。

在这个简短的教程中，我们只展示如何使用 HuggingFace 上可用的一些预训练模型。首先，让我们安装 SpeechBrain

%%capture
# Installing SpeechBrain via pip
BRANCH = 'develop'
!python -m pip install git+https://github.com/speechbrain/speechbrain.git@$BRANCH

%%capture
%cd /content
!wget -O example_mandarin.wav "https://www.dropbox.com/scl/fi/7jn7jg9ea2u6d9d70657z/example_mandarin.wav?rlkey=eh220qallihxp9yppm2kx7a2i&dl=1"
!wget -O example_rw.mp3 "https://www.dropbox.com/scl/fi/iplkymn8c8mbc6oclxem3/example_rw.mp3?rlkey=yhmqfsn8q43pmvd1uvjo3yl0s&dl=1"
!wget -O example_whamr.wav "https://www.dropbox.com/scl/fi/gxbtbf3c3hxr0y9dbf0nw/example_whamr.wav?rlkey=1wt5d49kjl36h0zypwrmsy8nz&dl=1"
!wget -O example-fr.wav "https://www.dropbox.com/scl/fi/vjn98vu8e3i2mvsw17msh/example-fr.wav?rlkey=vabmu4fgqp60oken8aosg75i0&dl=1"
!wget -O example-it.wav "https://www.dropbox.com/scl/fi/o3t7j53s7czaob8yq73rz/example-it.wav?rlkey=x9u6bkbcp6lh3602fb9uai5h3&dl=1"
!wget -O example.wav "https://www.dropbox.com/scl/fi/uws97livpeta7rowb7q7g/example.wav?rlkey=swppq2so15jibmpmihenrktbt&dl=1"
!wget -O example1.wav "https://www.dropbox.com/scl/fi/mu1tdejny4cbgxczwm944/example1.wav?rlkey=8pi7hjz15syvav80u1xzfbfhn&dl=1"
!wget -O example2.flac "https://www.dropbox.com/scl/fi/k9ouk6ec1q1fkevamodrn/example2.flac?rlkey=vtbyc6bzp9hknzvn9rb63z3yf&dl=1"
!wget -O test_mixture.wav "https://www.dropbox.com/scl/fi/4327g66ajs8aq3dck0fzn/test_mixture.wav?rlkey=bjdcw3msxw3armpelxuayug5i&dl=1"

安装完成后，您应该能够使用 python 导入 speechbrain 项目

import speechbrain as sb
from speechbrain.dataio.dataio import read_audio
from IPython.display import Audio

不同语言的语音识别

英语

from speechbrain.inference.ASR import EncoderDecoderASR

asr_model = EncoderDecoderASR.from_hparams(source="speechbrain/asr-crdnn-rnnlm-librispeech", savedir="pretrained_models/asr-crdnn-rnnlm-librispeech")
asr_model.transcribe_file('/content/example.wav')

signal = read_audio("/content/example.wav").squeeze()
Audio(signal, rate=16000)

法语

from speechbrain.inference.ASR import EncoderDecoderASR

asr_model = EncoderDecoderASR.from_hparams(source="speechbrain/asr-crdnn-commonvoice-fr", savedir="pretrained_models/asr-crdnn-commonvoice-fr")
asr_model.transcribe_file("/content/example-fr.wav")

signal = read_audio("/content/example-fr.wav").squeeze()
Audio(signal, rate=44100)

意大利语

from speechbrain.inference.ASR import EncoderDecoderASR

asr_model = EncoderDecoderASR.from_hparams(source="speechbrain/asr-crdnn-commonvoice-it", savedir="pretrained_models/asr-crdnn-commonvoice-it")
asr_model.transcribe_file("/content/example-it.wav")

signal = read_audio("/content/example-it.wav").squeeze()
Audio(signal, rate=16000)

普通话

from speechbrain.inference.interfaces import foreign_class

asr_model = foreign_class(source="speechbrain/asr-wav2vec2-ctc-aishell",  pymodule_file="custom_interface.py", classname="CustomEncoderDecoderASR")
asr_model.transcribe_file("/content/example_mandarin.wav")

/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:88: UserWarning: 
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://hugging-face.cn/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
  warnings.warn(

/usr/local/lib/python3.10/dist-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
Some weights of Wav2Vec2Model were not initialized from the model checkpoint at TencentGameMate/chinese-wav2vec2-large and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

['她', '应', '该', '也', '是', '喜', '欢']

signal = read_audio("/content/example_mandarin.wav").squeeze()
Audio(signal, rate=16000)

基尼亚卢旺达语

from speechbrain.inference.ASR import EncoderDecoderASR

asr_model = EncoderDecoderASR.from_hparams(source="speechbrain/asr-wav2vec2-commonvoice-rw", savedir="pretrained_models/asr-wav2vec2-commonvoice-rw")
asr_model.transcribe_file("/content/example_rw.mp3")

signal = read_audio("/content/example_rw.mp3").squeeze()
Audio(signal, rate=44100)

语音分离

这里展示的是包含 2 个说话人的混合语音，但我们也有用于分离 3 个说话人混合语音的最新系统。我们还有处理噪声和混响的模型。请参阅我们的 HuggingFace 仓库

from speechbrain.inference.separation import SepformerSeparation as separator

model = separator.from_hparams(source="speechbrain/sepformer-wsj02mix", savedir='pretrained_models/sepformer-wsj02mix')
est_sources = model.separate_file(path='/content/test_mixture.wav')

signal = read_audio("/content/test_mixture.wav").squeeze()
Audio(signal, rate=8000)

Audio(est_sources[:, :, 0].detach().cpu().squeeze(), rate=8000)

Audio(est_sources[:, :, 1].detach().cpu().squeeze(), rate=8000)

语音增强

语音增强的目标是去除影响录音的噪声。SpeechBrain 有几种语音增强系统。在下文中，您可以找到一个由 SepFormer (经过增强训练的版本) 处理的示例

from speechbrain.inference.separation import SepformerSeparation as separator
import torchaudio

model = separator.from_hparams(source="speechbrain/sepformer-whamr-enhancement", savedir='pretrained_models/sepformer-whamr-enhancement4')
enhanced_speech = model.separate_file(path='/content/example_whamr.wav')

signal = read_audio("/content/example_whamr.wav").squeeze()
Audio(signal, rate=8000)

Audio(enhanced_speech[:, :].detach().cpu().squeeze(), rate=8000)

说话人确认

这里的任务是判断两个句子是否属于同一个说话人。

from speechbrain.inference.speaker import SpeakerRecognition
verification = SpeakerRecognition.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb", savedir="pretrained_models/spkrec-ecapa-voxceleb")
score, prediction = verification.verify_files("/content/example1.wav", "/content/example2.flac")

print(prediction, score)

signal = read_audio("/content/example1.wav").squeeze()
Audio(signal, rate=16000)

signal = read_audio("/content/example2.flac").squeeze()
Audio(signal, rate=16000)

语音合成 (Text-to-Speech)

语音合成的目标是从输入文本创建语音信号。在下文中，您可以找到一个使用流行的 Tacotron2 模型与 HiFiGAN（作为声码器）结合的示例

import torchaudio
from speechbrain.inference.TTS import Tacotron2
from speechbrain.inference.vocoders import HIFIGAN

# Initialize TTS (tacotron2) and Vocoder (HiFIGAN)
tacotron2 = Tacotron2.from_hparams(source="speechbrain/tts-tacotron2-ljspeech", savedir="tmpdir_tts")
hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-ljspeech", savedir="tmpdir_vocoder")

# Running the TTS
mel_output, mel_length, alignment = tacotron2.encode_text("This is an open-source toolkit for the development of speech technologies.")

# Running Vocoder (spectrogram-to-waveform)
waveforms = hifi_gan.decode_batch(mel_output)

Audio(waveforms.detach().cpu().squeeze(), rate=22050)

引用 SpeechBrain

如果您在研究或商业中使用 SpeechBrain，请使用以下 BibTeX 条目引用它：

@misc{speechbrainV1,
  title={Open-Source Conversational AI with {SpeechBrain} 1.0},
  author={Mirco Ravanelli and Titouan Parcollet and Adel Moumen and Sylvain de Langen and Cem Subakan and Peter Plantinga and Yingzhi Wang and Pooneh Mousavi and Luca Della Libera and Artem Ploujnikov and Francesco Paissan and Davide Borra and Salah Zaiem and Zeyu Zhao and Shucong Zhang and Georgios Karakasidis and Sung-Lin Yeh and Pierre Champion and Aku Rouhe and Rudolf Braun and Florian Mai and Juan Zuluaga-Gomez and Seyed Mahed Mousavi and Andreas Nautsch and Xuechen Liu and Sangeet Sagar and Jarod Duret and Salima Mdhaffar and Gaelle Laperriere and Mickael Rouvier and Renato De Mori and Yannick Esteve},
  year={2024},
  eprint={2407.00463},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2407.00463},
}
@misc{speechbrain,
  title={{SpeechBrain}: A General-Purpose Speech Toolkit},
  author={Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},
  year={2021},
  eprint={2106.04624},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  note={arXiv:2106.04624}
}