speechbrain.inference.TTS module

Specifies the inference interfaces for Text-To-Speech (TTS) modules.

Authors
  • Aku Rouhe 2021

  • Peter Plantinga 2021

  • Loren Lugosch 2020

  • Mirco Ravanelli 2020

  • Titouan Parcollet 2021

  • Abdel Heba 2021

  • Andreas Nautsch 2022, 2023

  • Pooneh Mousavi 2023

  • Sylvain de Langen 2023

  • Adel Moumen 2023

  • Pradnya Kandarkar 2023

Summary

FastSpeech2

A ready-to-use wrapper for Fastspeech2 (text -> mel_spec).

FastSpeech2InternalAlignment

A ready-to-use wrapper for Fastspeech2 with internal alignment (text -> mel_spec).

MSTacotron2

A ready-to-use wrapper for Zero-Shot Multi-Speaker Tacotron2.

Tacotron2

A ready-to-use wrapper for Tacotron2 (text -> mel_spec).

Reference

class speechbrain.inference.TTS.Tacotron2(*args, **kwargs)[source]

Bases: Pretrained

A ready-to-use wrapper for Tacotron2 (text -> mel_spec).

Parameters:
  • *args (tuple)

  • **kwargs (dict) – Arguments forwarded to the Pretrained parent class.

Example

>>> tmpdir_tts = getfixture('tmpdir') / "tts"
>>> tacotron2 = Tacotron2.from_hparams(source="speechbrain/tts-tacotron2-ljspeech", savedir=tmpdir_tts)
>>> mel_output, mel_length, alignment = tacotron2.encode_text("Mary had a little lamb")
>>> items = [
...   "A quick brown fox jumped over the lazy dog",
...   "How much wood would a woodchuck chuck?",
...   "Never odd or even"
... ]
>>> mel_outputs, mel_lengths, alignments = tacotron2.encode_batch(items)
>>> # One can combine the TTS model with a vocoder (that generates the final waveform)
>>> # Initialize the Vocoder (HiFIGAN)
>>> tmpdir_vocoder = getfixture('tmpdir') / "vocoder"
>>> from speechbrain.inference.vocoders import HIFIGAN
>>> hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-ljspeech", savedir=tmpdir_vocoder)
>>> # Running the TTS
>>> mel_output, mel_length, alignment = tacotron2.encode_text("Mary had a little lamb")
>>> # Running Vocoder (spectrogram-to-waveform)
>>> waveforms = hifi_gan.decode_batch(mel_output)
HPARAMS_NEEDED = ['model', 'text_to_sequence']
text_to_seq(txt)[source]

Encodes raw text into a tensor with a custom text-to-sequence function

encode_batch(texts)[source]

Computes mel-spectrograms for a list of texts

The texts must be sorted in decreasing order of length

Parameters:

texts (List[str]) – texts to be encoded into spectrograms

Return type:

tensors of output spectrograms, output lengths and alignments

encode_text(text)[source]

Runs inference for a single text string

forward(texts)[source]

Encodes the input texts.
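Because encode_batch expects its inputs sorted by length, longest first, batches usually need a preparation step whose permutation is remembered so outputs can be mapped back. A minimal pure-Python sketch follows; the tacotron2 call itself is commented out because it would download pretrained weights.

```python
# encode_batch requires texts sorted in decreasing order of length.
items = [
    "Never odd or even",
    "A quick brown fox jumped over the lazy dog",
    "How much wood would a woodchuck chuck?",
]

# Remember the original positions so outputs can be mapped back later.
order = sorted(range(len(items)), key=lambda i: len(items[i]), reverse=True)
sorted_items = [items[i] for i in order]

# mel_outputs, mel_lengths, alignments = tacotron2.encode_batch(sorted_items)
# mel_outputs[k] then corresponds to the original items[order[k]].
```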

class speechbrain.inference.TTS.MSTacotron2(*args, **kwargs)[source]

Bases: Pretrained

A ready-to-use wrapper for Zero-Shot Multi-Speaker Tacotron2. For voice cloning: (text, reference_audio) -> (mel_spec). For generating a random speaker voice: (text) -> (mel_spec).

Parameters:
  • *args (tuple)

  • **kwargs (dict) – Arguments forwarded to the Pretrained parent class.

Example

>>> tmpdir_tts = getfixture('tmpdir') / "tts"
>>> mstacotron2 = MSTacotron2.from_hparams(source="speechbrain/tts-mstacotron2-libritts", savedir=tmpdir_tts)
>>> # Sample rate of the reference audio must be greater or equal to the sample rate of the speaker embedding model
>>> reference_audio_path = "tests/samples/single-mic/example1.wav"
>>> input_text = "Mary had a little lamb."
>>> mel_output, mel_length, alignment = mstacotron2.clone_voice(input_text, reference_audio_path)
>>> # One can combine the TTS model with a vocoder (that generates the final waveform)
>>> # Initialize the Vocoder (HiFIGAN)
>>> tmpdir_vocoder = getfixture('tmpdir') / "vocoder"
>>> from speechbrain.inference.vocoders import HIFIGAN
>>> hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-libritts-22050Hz", savedir=tmpdir_vocoder)
>>> # Running the TTS
>>> mel_output, mel_length, alignment = mstacotron2.clone_voice(input_text, reference_audio_path)
>>> # Running Vocoder (spectrogram-to-waveform)
>>> waveforms = hifi_gan.decode_batch(mel_output)
>>> # For generating a random speaker voice, use the following
>>> mel_output, mel_length, alignment = mstacotron2.generate_random_voice(input_text)
HPARAMS_NEEDED = ['model']
clone_voice(texts, audio_path)[source]

Generates mel-spectrograms using the input texts and a reference audio

Parameters:
  • texts (str or list) – Input text

  • audio_path (str) – Reference audio

Return type:

tensors of output spectrograms, output lengths and alignments

generate_random_voice(texts)[source]

Generates mel-spectrograms using the input texts and a random speaker voice

Parameters:

texts (str or list) – Input text

Return type:

tensors of output spectrograms, output lengths and alignments
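The example above stops at the waveform tensor. The steps can be bundled into a small helper that writes the cloned speech to disk. This is a sketch, not part of the SpeechBrain API: `clone_to_file` is a hypothetical name, the 22050 Hz rate is assumed from the `tts-hifigan-libritts-22050Hz` vocoder used in the example, and the `(batch, 1, samples)` shape of `decode_batch`'s output is an assumption to verify against your vocoder.

```python
SAMPLE_RATE = 22050  # assumed to match the tts-hifigan-libritts-22050Hz vocoder

def clone_to_file(mstacotron2, hifi_gan, text, reference_audio_path, out_path):
    """Synthesize `text` in the voice of `reference_audio_path` and save it.

    `mstacotron2` and `hifi_gan` are the pretrained models produced by the
    from_hparams calls in the example above. Hypothetical helper, not part
    of the SpeechBrain API.
    """
    import torchaudio  # local import so the sketch loads without torchaudio

    mel_output, mel_length, alignment = mstacotron2.clone_voice(
        text, reference_audio_path
    )
    # decode_batch is assumed to return a (batch, 1, samples) tensor
    waveforms = hifi_gan.decode_batch(mel_output)
    torchaudio.save(out_path, waveforms.squeeze(1).cpu(), SAMPLE_RATE)
```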

class speechbrain.inference.TTS.FastSpeech2(*args, **kwargs)[source]

Bases: Pretrained

A ready-to-use wrapper for Fastspeech2 (text -> mel_spec).

Parameters:
  • *args (tuple)

  • **kwargs (dict) – Arguments forwarded to the Pretrained parent class.

Example

>>> tmpdir_tts = getfixture('tmpdir') / "tts"
>>> fastspeech2 = FastSpeech2.from_hparams(source="speechbrain/tts-fastspeech2-ljspeech", savedir=tmpdir_tts)
>>> mel_outputs, durations, pitch, energy = fastspeech2.encode_text(["Mary had a little lamb."])
>>> items = [
...   "A quick brown fox jumped over the lazy dog",
...   "How much wood would a woodchuck chuck?",
...   "Never odd or even"
... ]
>>> mel_outputs, durations, pitch, energy = fastspeech2.encode_text(items)
>>>
>>> # One can combine the TTS model with a vocoder (that generates the final waveform)
>>> # Initialize the Vocoder (HiFIGAN)
>>> tmpdir_vocoder = getfixture('tmpdir') / "vocoder"
>>> from speechbrain.inference.vocoders import HIFIGAN
>>> hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-ljspeech", savedir=tmpdir_vocoder)
>>> # Running the TTS
>>> mel_outputs, durations, pitch, energy = fastspeech2.encode_text(["Mary had a little lamb."])
>>> # Running Vocoder (spectrogram-to-waveform)
>>> waveforms = hifi_gan.decode_batch(mel_outputs)
HPARAMS_NEEDED = ['spn_predictor', 'model', 'input_encoder']
encode_text(texts, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]

Computes mel-spectrograms for a list of texts

Parameters:
  • texts (List[str]) – texts to be converted to spectrograms

  • pace (float) – pace for the speech synthesis

  • pitch_rate (float) – scaling factor for phoneme pitches

  • energy_rate (float) – scaling factor for phoneme energies

Return type:

tensors of output spectrograms, output lengths and alignments

encode_phoneme(phonemes, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]

Computes mel-spectrograms for a list of phoneme sequences

Parameters:
  • phonemes (List[List[str]]) – phonemes to be converted to spectrograms

  • pace (float) – pace for the speech synthesis

  • pitch_rate (float) – scaling factor for phoneme pitches

  • energy_rate (float) – scaling factor for phoneme energies

Return type:

tensors of output spectrograms, output lengths and alignments

encode_batch(tokens_padded, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]

Batch inference for a tensor of phoneme sequences

Parameters:
  • tokens_padded (torch.Tensor) – a tensor of encoded phoneme sequences to be converted to spectrograms

  • pace (float) – pace for the speech synthesis

  • pitch_rate (float) – scaling factor for phoneme pitches

  • energy_rate (float) – scaling factor for phoneme energies

返回:

  • post_mel_outputs (torch.Tensor)

  • durations (torch.Tensor)

  • pitch (torch.Tensor)

  • energy (torch.Tensor)

forward(text, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]

Batch inference for a tensor of phoneme sequences

Parameters:
  • text (str) – text to be converted to a spectrogram

  • pace (float) – pace for the speech synthesis

  • pitch_rate (float) – scaling factor for phoneme pitches

  • energy_rate (float) – scaling factor for phoneme energies

Return type:

Encoded text
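The running example only exercises encode_text, so a sketch of encode_phoneme's input format and the shared prosody controls may help. The phoneme labels below are ARPABET-style, an assumption about the label set the LJSpeech model was trained on (check the model's input_encoder); the fastspeech2 call is commented out because it requires the pretrained model.

```python
# encode_phoneme takes a list of phoneme sequences (List[List[str]]).
# ARPABET-style labels for "Mary had a little lamb" -- illustrative only,
# the exact label set is an assumption about the trained model.
phonemes = [
    ["M", "EH", "R", "IY", "HH", "AE", "D", "AH",
     "L", "IH", "T", "AH", "L", "L", "AE", "M"],
]

# Prosody controls shared by encode_text / encode_phoneme / encode_batch:
#   pace        -- pace for the speech synthesis
#   pitch_rate  -- scales the predicted phoneme pitches
#   energy_rate -- scales the predicted phoneme energies
synthesis_args = {"pace": 0.9, "pitch_rate": 1.1, "energy_rate": 1.0}

# mel_outputs, durations, pitch, energy = fastspeech2.encode_phoneme(
#     phonemes, **synthesis_args
# )
```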

class speechbrain.inference.TTS.FastSpeech2InternalAlignment(*args, **kwargs)[source]

Bases: Pretrained

A ready-to-use wrapper for Fastspeech2 with internal alignment (text -> mel_spec).

Parameters:
  • *args (tuple)

  • **kwargs (dict) – Arguments forwarded to the Pretrained parent class.

Example

>>> tmpdir_tts = getfixture('tmpdir') / "tts"
>>> fastspeech2 = FastSpeech2InternalAlignment.from_hparams(source="speechbrain/tts-fastspeech2-internal-alignment-ljspeech", savedir=tmpdir_tts)
>>> mel_outputs, durations, pitch, energy = fastspeech2.encode_text(["Mary had a little lamb."])
>>> items = [
...   "A quick brown fox jumped over the lazy dog",
...   "How much wood would a woodchuck chuck?",
...   "Never odd or even"
... ]
>>> mel_outputs, durations, pitch, energy = fastspeech2.encode_text(items)
>>> # One can combine the TTS model with a vocoder (that generates the final waveform)
>>> # Initialize the Vocoder (HiFIGAN)
>>> tmpdir_vocoder = getfixture('tmpdir') / "vocoder"
>>> from speechbrain.inference.vocoders import HIFIGAN
>>> hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-ljspeech", savedir=tmpdir_vocoder)
>>> # Running the TTS
>>> mel_outputs, durations, pitch, energy = fastspeech2.encode_text(["Mary had a little lamb."])
>>> # Running Vocoder (spectrogram-to-waveform)
>>> waveforms = hifi_gan.decode_batch(mel_outputs)
HPARAMS_NEEDED = ['model', 'input_encoder']
encode_text(texts, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]

Computes mel-spectrograms for a list of texts

Parameters:
  • texts (List[str]) – texts to be converted to spectrograms

  • pace (float) – pace for the speech synthesis

  • pitch_rate (float) – scaling factor for phoneme pitches

  • energy_rate (float) – scaling factor for phoneme energies

Return type:

tensors of output spectrograms, output lengths and alignments

encode_phoneme(phonemes, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]

Computes mel-spectrograms for a list of phoneme sequences

Parameters:
  • phonemes (List[List[str]]) – phonemes to be converted to spectrograms

  • pace (float) – pace for the speech synthesis

  • pitch_rate (float) – scaling factor for phoneme pitches

  • energy_rate (float) – scaling factor for phoneme energies

Return type:

tensors of output spectrograms, output lengths and alignments

encode_batch(tokens_padded, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]

Batch inference for a tensor of phoneme sequences

Parameters:
  • tokens_padded (torch.Tensor) – a tensor of encoded phoneme sequences to be converted to spectrograms

  • pace (float) – pace for the speech synthesis

  • pitch_rate (float) – scaling factor for phoneme pitches

  • energy_rate (float) – scaling factor for phoneme energies

返回:

  • post_mel_outputs (torch.Tensor)

  • durations (torch.Tensor)

  • pitch (torch.Tensor)

  • energy (torch.Tensor)

forward(text, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]

Batch inference for a tensor of phoneme sequences

Parameters:
  • text (str) – text to be converted to a spectrogram

  • pace (float) – pace for the speech synthesis

  • pitch_rate (float) – scaling factor for phoneme pitches

  • energy_rate (float) – scaling factor for phoneme energies

Return type:

Encoded text
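Both FastSpeech2 wrappers return per-phoneme `durations` measured in spectrogram frames. Under LJSpeech front-end assumptions (hop length 256 at a 22050 Hz sample rate; check your recipe's hyperparameters before relying on these numbers), the frames convert to audio time as sketched below; the duration values here are fake stand-ins for what the model would return as a torch.Tensor.

```python
# Assumed LJSpeech spectrogram front-end settings -- verify in the recipe.
HOP_LENGTH = 256
SAMPLE_RATE = 22050

# Fake per-phoneme frame counts standing in for the model's `durations`
# output (which is a torch.Tensor of shape (batch, n_phonemes)).
durations = [4, 9, 7, 12]

frames = sum(durations)                       # total spectrogram frames
seconds = frames * HOP_LENGTH / SAMPLE_RATE   # approximate audio length
```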