speechbrain.inference.TTS module

Specifies the inference interfaces for Text-To-Speech (TTS) modules.

Authors
  • Aku Rouhe 2021

  • Peter Plantinga 2021

  • Loren Lugosch 2020

  • Mirco Ravanelli 2020

  • Titouan Parcollet 2021

  • Abdel Heba 2021

  • Andreas Nautsch 2022, 2023

  • Pooneh Mousavi 2023

  • Sylvain de Langen 2023

  • Adel Moumen 2023

  • Pradnya Kandarkar 2023

Summary

FastSpeech2

A ready-to-use wrapper for Fastspeech2 (text -> mel_spec).

FastSpeech2InternalAlignment

A ready-to-use wrapper for Fastspeech2 with internal alignment (text -> mel_spec).

MSTacotron2

A ready-to-use wrapper for Zero-Shot Multi-Speaker Tacotron2.

Tacotron2

A ready-to-use wrapper for Tacotron2 (text -> mel_spec).

Reference

class speechbrain.inference.TTS.Tacotron2(*args, **kwargs)[source]

Bases: Pretrained

A ready-to-use wrapper for Tacotron2 (text -> mel_spec).

Parameters:
  • *args (tuple)

  • **kwargs (dict) – Arguments forwarded to the Pretrained parent class.

Example

>>> tmpdir_tts = getfixture('tmpdir') / "tts"
>>> tacotron2 = Tacotron2.from_hparams(source="speechbrain/tts-tacotron2-ljspeech", savedir=tmpdir_tts)
>>> mel_output, mel_length, alignment = tacotron2.encode_text("Mary had a little lamb")
>>> items = [
...   "A quick brown fox jumped over the lazy dog",
...   "How much wood would a woodchuck chuck?",
...   "Never odd or even"
... ]
>>> mel_outputs, mel_lengths, alignments = tacotron2.encode_batch(items)
>>> # One can combine the TTS model with a vocoder (that generates the final waveform)
>>> # Initialize the Vocoder (HiFIGAN)
>>> tmpdir_vocoder = getfixture('tmpdir') / "vocoder"
>>> from speechbrain.inference.vocoders import HIFIGAN
>>> hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-ljspeech", savedir=tmpdir_vocoder)
>>> # Running the TTS
>>> mel_output, mel_length, alignment = tacotron2.encode_text("Mary had a little lamb")
>>> # Running Vocoder (spectrogram-to-waveform)
>>> waveforms = hifi_gan.decode_batch(mel_output)
HPARAMS_NEEDED = ['model', 'text_to_sequence']
text_to_seq(txt)[source]

Encodes raw text into a tensor with a custom text-to-sequence function

encode_batch(texts)[source]

Computes mel-spectrograms for a list of texts

The texts must be sorted in decreasing order of length

Parameters:

texts (List[str]) – texts to be encoded into spectrograms

Return type:

tensors of output spectrograms, output lengths and alignments

encode_text(text)[source]

Runs inference for a single text string

forward(texts)[source]

Encodes the input texts.
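Because encode_batch expects its inputs sorted by length, longest first, batches usually need a preparation step whose permutation is remembered so outputs can be mapped back. A minimal pure-Python sketch follows; the tacotron2 call itself is commented out because it would download pretrained weights.

```python
# encode_batch requires texts sorted in decreasing order of length.
items = [
    "Never odd or even",
    "A quick brown fox jumped over the lazy dog",
    "How much wood would a woodchuck chuck?",
]

# Remember the original positions so outputs can be mapped back later.
order = sorted(range(len(items)), key=lambda i: len(items[i]), reverse=True)
sorted_items = [items[i] for i in order]

# mel_outputs, mel_lengths, alignments = tacotron2.encode_batch(sorted_items)
# mel_outputs[k] then corresponds to the original items[order[k]].
```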

class speechbrain.inference.TTS.MSTacotron2(*args, **kwargs)[source]

Bases: Pretrained

A ready-to-use wrapper for Zero-Shot Multi-Speaker Tacotron2. For voice cloning: (text, reference_audio) -> (mel_spec). For generating a random speaker voice: (text) -> (mel_spec).

Parameters:
  • *args (tuple)

  • **kwargs (dict) – Arguments forwarded to the Pretrained parent class.

Example

>>> tmpdir_tts = getfixture('tmpdir') / "tts"
>>> mstacotron2 = MSTacotron2.from_hparams(source="speechbrain/tts-mstacotron2-libritts", savedir=tmpdir_tts)
>>> # Sample rate of the reference audio must be greater or equal to the sample rate of the speaker embedding model
>>> reference_audio_path = "tests/samples/single-mic/example1.wav"
>>> input_text = "Mary had a little lamb."
>>> mel_output, mel_length, alignment = mstacotron2.clone_voice(input_text, reference_audio_path)
>>> # One can combine the TTS model with a vocoder (that generates the final waveform)
>>> # Initialize the Vocoder (HiFIGAN)
>>> tmpdir_vocoder = getfixture('tmpdir') / "vocoder"
>>> from speechbrain.inference.vocoders import HIFIGAN
>>> hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-libritts-22050Hz", savedir=tmpdir_vocoder)
>>> # Running the TTS
>>> mel_output, mel_length, alignment = mstacotron2.clone_voice(input_text, reference_audio_path)
>>> # Running Vocoder (spectrogram-to-waveform)
>>> waveforms = hifi_gan.decode_batch(mel_output)
>>> # For generating a random speaker voice, use the following
>>> mel_output, mel_length, alignment = mstacotron2.generate_random_voice(input_text)
HPARAMS_NEEDED = ['model']
clone_voice(texts, audio_path)[source]

Generates mel-spectrograms using the input texts and a reference audio

Parameters:
  • texts (str or list) – Input text

  • audio_path (str) – Reference audio

Return type:

tensors of output spectrograms, output lengths and alignments

generate_random_voice(texts)[source]

Generates mel-spectrograms using the input texts and a random speaker voice

Parameters:

texts (str or list) – Input text

Return type:

tensors of output spectrograms, output lengths and alignments
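The example above stops at the waveform tensor. The steps can be bundled into a small helper that writes the cloned speech to disk. This is a sketch, not part of the SpeechBrain API: `clone_to_file` is a hypothetical name, the 22050 Hz rate is assumed from the `tts-hifigan-libritts-22050Hz` vocoder used in the example, and the `(batch, 1, samples)` shape of `decode_batch`'s output is an assumption to verify against your vocoder.

```python
SAMPLE_RATE = 22050  # assumed to match the tts-hifigan-libritts-22050Hz vocoder

def clone_to_file(mstacotron2, hifi_gan, text, reference_audio_path, out_path):
    """Synthesize `text` in the voice of `reference_audio_path` and save it.

    `mstacotron2` and `hifi_gan` are the pretrained models produced by the
    from_hparams calls in the example above. Hypothetical helper, not part
    of the SpeechBrain API.
    """
    import torchaudio  # local import so the sketch loads without torchaudio

    mel_output, mel_length, alignment = mstacotron2.clone_voice(
        text, reference_audio_path
    )
    # decode_batch is assumed to return a (batch, 1, samples) tensor
    waveforms = hifi_gan.decode_batch(mel_output)
    torchaudio.save(out_path, waveforms.squeeze(1).cpu(), SAMPLE_RATE)
```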

class speechbrain.inference.TTS.FastSpeech2(*args, **kwargs)[source]

Bases: Pretrained

A ready-to-use wrapper for Fastspeech2 (text -> mel_spec).

Parameters:
  • *args (tuple)

  • **kwargs (dict) – Arguments forwarded to the Pretrained parent class.

Example

>>> tmpdir_tts = getfixture('tmpdir') / "tts"
>>> fastspeech2 = FastSpeech2.from_hparams(source="speechbrain/tts-fastspeech2-ljspeech", savedir=tmpdir_tts)
>>> mel_outputs, durations, pitch, energy = fastspeech2.encode_text(["Mary had a little lamb."])
>>> items = [
...   "A quick brown fox jumped over the lazy dog",
...   "How much wood would a woodchuck chuck?",
...   "Never odd or even"
... ]
>>> mel_outputs, durations, pitch, energy = fastspeech2.encode_text(items)
>>>
>>> # One can combine the TTS model with a vocoder (that generates the final waveform)
>>> # Initialize the Vocoder (HiFIGAN)
>>> tmpdir_vocoder = getfixture('tmpdir') / "vocoder"
>>> from speechbrain.inference.vocoders import HIFIGAN
>>> hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-ljspeech", savedir=tmpdir_vocoder)
>>> # Running the TTS
>>> mel_outputs, durations, pitch, energy = fastspeech2.encode_text(["Mary had a little lamb."])
>>> # Running Vocoder (spectrogram-to-waveform)
>>> waveforms = hifi_gan.decode_batch(mel_outputs)
HPARAMS_NEEDED = ['spn_predictor', 'model', 'input_encoder']
encode_text(texts, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]

Computes mel-spectrograms for a list of texts

Parameters:
  • texts (List[str]) – texts to be converted to spectrograms

  • pace (float) – pace for the speech synthesis

  • pitch_rate (float) – scaling factor for phoneme pitches

  • energy_rate (float) – scaling factor for phoneme energies

Return type:

tensors of output spectrograms, output lengths and alignments

encode_phoneme(phonemes, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]

Computes mel-spectrograms for a list of phoneme sequences

Parameters:
  • phonemes (List[List[str]]) – phonemes to be converted to spectrograms

  • pace (float) – pace for the speech synthesis

  • pitch_rate (float) – scaling factor for phoneme pitches

  • energy_rate (float) – scaling factor for phoneme energies

Return type:

tensors of output spectrograms, output lengths and alignments

encode_batch(tokens_padded, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]

Batch inference for a tensor of phoneme sequences

Parameters:
  • tokens_padded (torch.Tensor) – a tensor of encoded phoneme sequences to be converted to spectrograms

  • pace (float) – pace for the speech synthesis

  • pitch_rate (float) – scaling factor for phoneme pitches

  • energy_rate (float) – scaling factor for phoneme energies

返回:

  • post_mel_outputs (torch.Tensor)

  • durations (torch.Tensor)

  • pitch (torch.Tensor)

  • energy (torch.Tensor)

forward(text, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]

Batch inference for a tensor of phoneme sequences

Parameters:
  • text (str) – text to be converted to a spectrogram

  • pace (float) – pace for the speech synthesis

  • pitch_rate (float) – scaling factor for phoneme pitches

  • energy_rate (float) – scaling factor for phoneme energies

Return type:

Encoded text
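The running example only exercises encode_text, so a sketch of encode_phoneme's input format and the shared prosody controls may help. The phoneme labels below are ARPABET-style, an assumption about the label set the LJSpeech model was trained on (check the model's input_encoder); the fastspeech2 call is commented out because it requires the pretrained model.

```python
# encode_phoneme takes a list of phoneme sequences (List[List[str]]).
# ARPABET-style labels for "Mary had a little lamb" -- illustrative only,
# the exact label set is an assumption about the trained model.
phonemes = [
    ["M", "EH", "R", "IY", "HH", "AE", "D", "AH",
     "L", "IH", "T", "AH", "L", "L", "AE", "M"],
]

# Prosody controls shared by encode_text / encode_phoneme / encode_batch:
#   pace        -- pace for the speech synthesis
#   pitch_rate  -- scales the predicted phoneme pitches
#   energy_rate -- scales the predicted phoneme energies
synthesis_args = {"pace": 0.9, "pitch_rate": 1.1, "energy_rate": 1.0}

# mel_outputs, durations, pitch, energy = fastspeech2.encode_phoneme(
#     phonemes, **synthesis_args
# )
```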

class speechbrain.inference.TTS.FastSpeech2InternalAlignment(*args, **kwargs)[source]

Bases: Pretrained

A ready-to-use wrapper for Fastspeech2 with internal alignment (text -> mel_spec).

Parameters:
  • *args (tuple)

  • **kwargs (dict) – Arguments forwarded to the Pretrained parent class.

Example

>>> tmpdir_tts = getfixture('tmpdir') / "tts"
>>> fastspeech2 = FastSpeech2InternalAlignment.from_hparams(source="speechbrain/tts-fastspeech2-internal-alignment-ljspeech", savedir=tmpdir_tts)
>>> mel_outputs, durations, pitch, energy = fastspeech2.encode_text(["Mary had a little lamb."])
>>> items = [
...   "A quick brown fox jumped over the lazy dog",
...   "How much wood would a woodchuck chuck?",
...   "Never odd or even"
... ]
>>> mel_outputs, durations, pitch, energy = fastspeech2.encode_text(items)
>>> # One can combine the TTS model with a vocoder (that generates the final waveform)
>>> # Initialize the Vocoder (HiFIGAN)
>>> tmpdir_vocoder = getfixture('tmpdir') / "vocoder"
>>> from speechbrain.inference.vocoders import HIFIGAN
>>> hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-ljspeech", savedir=tmpdir_vocoder)
>>> # Running the TTS
>>> mel_outputs, durations, pitch, energy = fastspeech2.encode_text(["Mary had a little lamb."])
>>> # Running Vocoder (spectrogram-to-waveform)
>>> waveforms = hifi_gan.decode_batch(mel_outputs)
HPARAMS_NEEDED = ['model', 'input_encoder']
encode_text(texts, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]

Computes mel-spectrograms for a list of texts

Parameters:
  • texts (List[str]) – texts to be converted to spectrograms

  • pace (float) – pace for the speech synthesis

  • pitch_rate (float) – scaling factor for phoneme pitches

  • energy_rate (float) – scaling factor for phoneme energies

Return type:

tensors of output spectrograms, output lengths and alignments

encode_phoneme(phonemes, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]

Computes mel-spectrograms for a list of phoneme sequences

Parameters:
  • phonemes (List[List[str]]) – phonemes to be converted to spectrograms

  • pace (float) – pace for the speech synthesis

  • pitch_rate (float) – scaling factor for phoneme pitches

  • energy_rate (float) – scaling factor for phoneme energies

Return type:

tensors of output spectrograms, output lengths and alignments

encode_batch(tokens_padded, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]

Batch inference for a tensor of phoneme sequences

Parameters:
  • tokens_padded (torch.Tensor) – a tensor of encoded phoneme sequences to be converted to spectrograms

  • pace (float) – pace for the speech synthesis

  • pitch_rate (float) – scaling factor for phoneme pitches

  • energy_rate (float) – scaling factor for phoneme energies

返回:

  • post_mel_outputs (torch.Tensor)

  • durations (torch.Tensor)

  • pitch (torch.Tensor)

  • energy (torch.Tensor)

forward(text, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]

Batch inference for a tensor of phoneme sequences

Parameters:
  • text (str) – text to be converted to a spectrogram

  • pace (float) – pace for the speech synthesis

  • pitch_rate (float) – scaling factor for phoneme pitches

  • energy_rate (float) – scaling factor for phoneme energies

Return type:

Encoded text
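Both FastSpeech2 wrappers return per-phoneme `durations` measured in spectrogram frames. Under LJSpeech front-end assumptions (hop length 256 at a 22050 Hz sample rate; check your recipe's hyperparameters before relying on these numbers), the frames convert to audio time as sketched below; the duration values here are fake stand-ins for what the model would return as a torch.Tensor.

```python
# Assumed LJSpeech spectrogram front-end settings -- verify in the recipe.
HOP_LENGTH = 256
SAMPLE_RATE = 22050

# Fake per-phoneme frame counts standing in for the model's `durations`
# output (which is a torch.Tensor of shape (batch, n_phonemes)).
durations = [4, 9, 7, 12]

frames = sum(durations)                       # total spectrogram frames
seconds = frames * HOP_LENGTH / SAMPLE_RATE   # approximate audio length
```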