speechbrain.lobes.models.DiffWave module

Neural network modules for DiffWave: A Versatile Diffusion Model for Audio Synthesis.

For more details, see: https://arxiv.org/pdf/2009.09761.pdf

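Authors

Yingzhi WANG 2022

Summary

Classes

`DiffWave`	DiffWave model with dilated residual blocks.
`DiffWaveDiffusion`	An enhanced diffusion implementation with DiffWave-specific inference.
`DiffusionEmbedding`	Embeds the diffusion step into an input vector of DiffWave.
`ResidualBlock`	Residual block with dilated convolution.
`SpectrogramUpsampler`	Upsamples spectrograms with transposed convolutions. Only the upsampling is done here; the layer-specific convolution that maps the mel bands to 2× residual channels is found in the residual block.

Functions

`diffwave_mel_spectogram`	Computes the mel spectrogram of a raw audio signal and preprocesses it for DiffWave training.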

Reference

speechbrain.lobes.models.DiffWave.diffwave_mel_spectogram(sample_rate, hop_length, win_length, n_fft, n_mels, f_min, f_max, power, normalized, norm, mel_scale, audio)

Computes the mel spectrogram of a raw audio signal and preprocesses it for DiffWave training.

Parameters:

sample_rate (int) – Sample rate of the audio signal.
hop_length (int) – Length of the hop between STFT windows.
win_length (int) – Window size.
n_fft (int) – Size of the FFT.
n_mels (int) – Number of mel filterbanks.
f_min (float) – Minimum frequency.
f_max (float) – Maximum frequency.
power (float) – Exponent for the magnitude spectrogram.
normalized (bool) – Whether to normalize by magnitude after the STFT.
norm (str or None) – If "slaney", divide the triangular mel weights by the width of the mel band.
mel_scale (str) – Scale to use: "htk" or "slaney".
audio (torch.Tensor) – Input audio signal.

Returns:

mel

Return type:

torch.Tensor
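The `mel_scale` argument selects between two common frequency-to-mel mappings. As a rough illustration, the two conventions can be compared in plain Python. The helper `hz_to_mel` below is hypothetical and not part of SpeechBrain; it only reproduces the standard HTK and Slaney formulas.

```python
import math


def hz_to_mel(freq_hz, mel_scale="htk"):
    """Convert a frequency in Hz to mels (hypothetical helper, for illustration)."""
    if mel_scale == "htk":
        # HTK convention: logarithmic over the whole frequency range.
        return 2595.0 * math.log10(1.0 + freq_hz / 700.0)
    if mel_scale == "slaney":
        # Slaney convention: linear below 1 kHz, logarithmic above.
        f_sp = 200.0 / 3.0
        min_log_hz = 1000.0
        min_log_mel = min_log_hz / f_sp
        log_step = math.log(6.4) / 27.0
        if freq_hz < min_log_hz:
            return freq_hz / f_sp
        return min_log_mel + math.log(freq_hz / min_log_hz) / log_step
    raise ValueError("mel_scale must be 'htk' or 'slaney'")


for f in (100.0, 1000.0, 8000.0):
    print(f, hz_to_mel(f, "htk"), hz_to_mel(f, "slaney"))
```

The two scales use different units, so values of `f_min` and `f_max` in Hz map to different mel ranges depending on which one is chosen.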

class speechbrain.lobes.models.DiffWave.DiffusionEmbedding(max_steps)

Bases: Module

Embeds the diffusion step into an input vector of DiffWave.

Parameters: max_steps (int) – Total number of diffusion steps.

Example

>>> import torch
>>> from speechbrain.lobes.models.DiffWave import DiffusionEmbedding
>>> diffusion_embedding = DiffusionEmbedding(max_steps=50)
>>> time_step = torch.randint(50, (1,))
>>> step_embedding = diffusion_embedding(time_step)
>>> step_embedding.shape
torch.Size([1, 512])

forward(diffusion_step)

Forward function of the diffusion step embedding.

Parameters: diffusion_step (torch.Tensor) – The diffusion step to execute.
Returns: The diffusion step embedding.
Return type: tensor [bs, 512]
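How the 512-dimensional embedding is produced is not spelled out here. One plausible construction, modeled on the reference DiffWave implementation and labeled here as an assumption, is a fixed table of sinusoidal features (128 per step) that learned linear layers then project to 512 dimensions. The sketch below builds only the fixed table, in plain Python; `sinusoidal_step_table` and `half_dim` are hypothetical names.

```python
import math


def sinusoidal_step_table(max_steps, half_dim=64):
    """Build a [max_steps][2 * half_dim] table of sin/cos step features.

    Assumption: frequencies grow geometrically from 1 to 10**4, as in the
    reference DiffWave code; the actual module may differ.
    """
    table = []
    for step in range(max_steps):
        angles = [
            step * 10.0 ** (d * 4.0 / (half_dim - 1)) for d in range(half_dim)
        ]
        sines = [math.sin(a) for a in angles]
        cosines = [math.cos(a) for a in angles]
        table.append(sines + cosines)
    return table


table = sinusoidal_step_table(max_steps=50)
row = table[7]  # fixed features for diffusion step 7
print(len(table), len(row))
```

In such a design, the forward pass would look up the row for each step in the batch and pass it through the learned projection, which is what would yield the `[bs, 512]` output documented above.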

class speechbrain.lobes.models.DiffWave.SpectrogramUpsampler

Bases: Module

Upsamples spectrograms with transposed convolutions. Only the upsampling is done here; the layer-specific convolution that maps the mel bands to 2× residual channels is found in the residual block.

Example

>>> import torch
>>> from speechbrain.lobes.models.DiffWave import SpectrogramUpsampler
>>> spec_upsampler = SpectrogramUpsampler()
>>> mel_input = torch.rand(3, 80, 100)
>>> upsampled_mel = spec_upsampler(mel_input)
>>> upsampled_mel.shape
torch.Size([3, 80, 25600])

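forward(x)

Upsamples the spectrogram by a factor of 256 to match the audio length. The hop length should be 256 when extracting the mel spectrogram.

Parameters: x (torch.Tensor) – Input mel spectrogram [bs, 80, mel_len]
Returns: Upsampled spectrogram [bs, 80, mel_len*256]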
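The factor of 256 can be checked with the standard output-length formula for a transposed convolution. The sketch below assumes two stacked layers along the time axis, each with kernel size 32, stride 16, and padding 8 (values borrowed from the reference DiffWave implementation, not confirmed by this page); both function names are hypothetical.

```python
def conv_transpose_out_len(length, kernel_size, stride, padding,
                           dilation=1, output_padding=0):
    """Output length of a transposed convolution along one axis."""
    return (
        (length - 1) * stride
        - 2 * padding
        + dilation * (kernel_size - 1)
        + output_padding
        + 1
    )


def upsampled_len(mel_len, layers=2, kernel_size=32, stride=16, padding=8):
    """Length after stacking `layers` transposed convolutions (assumed config)."""
    length = mel_len
    for _ in range(layers):
        length = conv_transpose_out_len(length, kernel_size, stride, padding)
    return length


print(upsampled_len(100))  # 25600, i.e. 100 * 16 * 16
```

With these assumed settings each layer multiplies the time axis by exactly 16, so a mel spectrogram extracted with a hop length of 256 lines up sample-for-sample with the waveform.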

class speechbrain.lobes.models.DiffWave.ResidualBlock(n_mels, residual_channels, dilation, uncond=False)

Bases: Module

Residual block with dilated convolution.

Parameters:

n_mels (int) – Number of input mel channels of the conv1x1 used for the conditional vocoding task.
residual_channels (int) – Number of channels of the audio convolution.
dilation (int) – Dilation of the audio convolution.
uncond (bool) – Whether generation is unconditional (True) or conditional (False).

Example

>>> import torch
>>> from speechbrain.lobes.models.DiffWave import ResidualBlock
>>> res_block = ResidualBlock(n_mels=80, residual_channels=64, dilation=3)
>>> noisy_audio = torch.randn(1, 1, 22050)
>>> timestep_embedding = torch.rand(1, 512)
>>> upsampled_mel = torch.rand(1, 80, 22050)
>>> output = res_block(noisy_audio, timestep_embedding, upsampled_mel)
>>> output[0].shape
torch.Size([1, 64, 22050])

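forward(x, diffusion_step, conditioner=None)

Forward function of the residual block.

Parameters:

x (torch.Tensor) – Input sample [bs, 1, time]
diffusion_step (torch.Tensor) – Embedding of the diffusion step to execute.
conditioner (torch.Tensor) – Condition used for conditional generation.

Returns:

Residual output [bs, residual_channels, time]
Skip connection of the residual branch [bs, residual_channels, time]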
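The two return values correspond to the usual WaveNet-style gated residual unit. The sketch below is a simplified, single-channel illustration in plain Python, not the module's code: it assumes the gate and filter pre-activations are already computed (in a real block they come from the dilated convolution plus the step and conditioner projections), and it replaces the learned output projection with the identity. All names are hypothetical.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def gated_residual(x, gate, filt):
    """Single-channel sketch of a gated residual unit.

    x, gate, filt: equal-length lists of floats over time.
    Returns (residual_output, skip).
    """
    # Gated activation: sigmoid(gate) * tanh(filter), element by element.
    activated = [sigmoid(g) * math.tanh(f) for g, f in zip(gate, filt)]
    # Assumption: a real block applies a learned 1x1 convolution here and
    # splits its output in two; the identity is used to keep the sketch short.
    residual_part, skip = activated, activated
    # Scale by 1/sqrt(2) so the variance stays stable when stacking blocks.
    residual_output = [
        (xi + ri) / math.sqrt(2.0) for xi, ri in zip(x, residual_part)
    ]
    return residual_output, skip


res, skip = gated_residual([0.5, -0.5], [2.0, -2.0], [1.0, 1.0])
print(res, skip)
```

The residual output feeds the next block, while the skip outputs of all blocks are typically summed to form the final prediction.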

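class speechbrain.lobes.models.DiffWave.DiffWave(input_channels, residual_layers, residual_channels, dilation_cycle_length, total_steps, unconditional=False)

Bases: Module

DiffWave model with dilated residual blocks.

Parameters:

input_channels (int) – Number of input mel channels of the conv1x1 used for the conditional vocoding task.
residual_layers (int) – Number of residual blocks.
residual_channels (int) – Number of channels of the audio convolution.
dilation_cycle_length (int) – Length of the dilation cycle of the audio convolutions.
total_steps (int) – Total number of diffusion steps.
unconditional (bool) – Whether generation is unconditional (True) or conditional (False).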
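To make `residual_layers` and `dilation_cycle_length` concrete, the sketch below assumes the common DiffWave convention in which block `i` uses dilation `2 ** (i % dilation_cycle_length)` and a kernel size of 3. Both are assumptions taken from the reference implementation, and the function names are hypothetical.

```python
def dilation_schedule(residual_layers, dilation_cycle_length):
    """Per-block dilations, assuming powers of two that restart every cycle."""
    return [2 ** (i % dilation_cycle_length) for i in range(residual_layers)]


def receptive_field(dilations, kernel_size=3):
    """Receptive field, in samples, of a stack of dilated convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


dilations = dilation_schedule(residual_layers=30, dilation_cycle_length=10)
print(dilations[:10])              # [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
print(receptive_field(dilations))  # 6139
```

Under these assumptions, 30 layers with a cycle length of 10 give three cycles of dilations from 1 to 512.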

Example

>>> import torch
>>> from speechbrain.lobes.models.DiffWave import DiffWave
>>> diffwave = DiffWave(
...     input_channels=80,
...     residual_layers=30,
...     residual_channels=64,
...     dilation_cycle_length=10,
...     total_steps=50,
... )
>>> noisy_audio = torch.randn(1, 1, 25600)
>>> timestep = torch.randint(50, (1,))
>>> input_mel = torch.rand(1, 80, 100)
>>> predicted_noise = diffwave(noisy_audio, timestep, input_mel)
>>> predicted_noise.shape
torch.Size([1, 1, 25600])

forward(audio, diffusion_step, spectrogram=None, length=None)

DiffWave forward function.

Parameters:

audio (torch.Tensor) – Input Gaussian sample [bs, 1, time]
diffusion_step (torch.Tensor) – The diffusion time step to execute [bs, 1]
spectrogram (torch.Tensor) – Spectrogram data [bs, 80, mel_len]
length (torch.Tensor) – Sample lengths; not used, provided only for compatibility.

Returns:

Predicted noise [bs, 1, time]

diffusion_forward(x, timesteps, cond_emb=None, length=None, out_mask_value=None, latent_mask_value=None)

Forward function suitable for wrapping by diffusion. For this model, out_mask_value and latent_mask_value are unused and discarded. See forward() for details.

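class speechbrain.lobes.models.DiffWave.DiffWaveDiffusion(model, timesteps=None, noise=None, beta_start=None, beta_end=None, sample_min=None, sample_max=None, show_progress=False)

Bases: DenoisingDiffusion

An enhanced diffusion implementation with DiffWave-specific inference.

Parameters:

model (nn.Module) – The underlying model.
timesteps (int) – Total number of time steps.
noise (str|nn.Module) – The type of noise to use; "gaussian" produces standard Gaussian noise.
beta_start (float) – Value of the "beta" parameter at the beginning of the process (see the DiffWave paper).
beta_end (float) – Value of the "beta" parameter at the end of the process.
sample_min (float) – Minimum value used to clip the output.
sample_max (float) – Maximum value used to clip the output.
show_progress (bool) – Whether to show progress during inference.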
Example

>>> import torch
>>> from speechbrain.lobes.models.DiffWave import DiffWave
>>> diffwave = DiffWave(
...     input_channels=80,
...     residual_layers=30,
...     residual_channels=64,
...     dilation_cycle_length=10,
...     total_steps=50,
... )
>>> from speechbrain.lobes.models.DiffWave import DiffWaveDiffusion
>>> from speechbrain.nnet.diffusion import GaussianNoise
>>> diffusion = DiffWaveDiffusion(
...     model=diffwave,
...     beta_start=0.0001,
...     beta_end=0.05,
...     timesteps=50,
...     noise=GaussianNoise,
... )
>>> input_mel = torch.rand(1, 80, 100)
>>> output = diffusion.inference(
...     unconditional=False,
...     scale=256,
...     condition=input_mel,
...     fast_sampling=True,
...     fast_sampling_noise_schedule=[0.0001, 0.001, 0.01, 0.05, 0.2, 0.5],
... )
>>> output.shape
torch.Size([1, 25600])

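inference(unconditional, scale, condition=None, fast_sampling=False, fast_sampling_noise_schedule=None, device=None)

Handles inference for DiffWave. A single inference function covers all local/global conditional generation and unconditional generation tasks.

Parameters:

unconditional (bool) – If True, perform unconditional generation; otherwise perform conditional generation.
scale (int) – For conditional generation, the scale used to obtain the final output waveform length: the output length is scale * condition.shape[-1]. For example, if the condition is a spectrogram (bs, n_mel, time), scale should be the hop length. For unconditional generation, scale should be the desired audio length.
condition (torch.Tensor) – Input spectrogram for vocoding, or another condition for conditional generation. Should be None for unconditional generation.
fast_sampling (bool) – Whether to perform fast sampling.
fast_sampling_noise_schedule (list) – The noise schedule used for fast sampling.
device (str|torch.device) – Inference device.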
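The rule for `scale` can be restated as a small helper. `output_length` is a hypothetical function written for this illustration; it only encodes the arithmetic described above and does not call the library.

```python
def output_length(unconditional, scale, condition_frames=None):
    """Number of waveform samples implied by the `scale` argument."""
    if unconditional:
        # Unconditional generation: scale is the desired audio length.
        return scale
    if condition_frames is None:
        raise ValueError("conditional generation needs a condition")
    # Conditional generation: scale times the condition's last dimension.
    return scale * condition_frames


# Spectrogram with 100 frames and a hop length of 256.
print(output_length(unconditional=False, scale=256, condition_frames=100))  # 25600
# One second of unconditional audio at an assumed 16 kHz sample rate.
print(output_length(unconditional=True, scale=16000))  # 16000
```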

Returns:

predicted_sample – The predicted audio (bs, 1, t)

Return type:

torch.Tensor
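How `fast_sampling_noise_schedule` relates to the training schedule is not described on this page. In the reference DiffWave implementation, each fast-sampling noise level is mapped to a fractional training step by interpolating between cumulative alpha values; the sketch below reproduces that idea in plain Python as an assumption about what fast sampling involves, not as this module's code. All names are hypothetical, and the training schedule is assumed to be linear.

```python
import math


def cumulative_alphas(betas):
    """Running product of (1 - beta_t)."""
    result, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        result.append(running)
    return result


def align_fast_schedule(training_betas, fast_betas):
    """Map each fast-sampling noise level to a fractional training step."""
    train_bar = cumulative_alphas(training_betas)
    fast_bar = cumulative_alphas(fast_betas)
    aligned = []
    for level in fast_bar:
        for t in range(len(train_bar) - 1):
            upper, lower = train_bar[t], train_bar[t + 1]
            if lower <= level <= upper:
                # Interpolate in the square-root domain between steps t, t+1.
                frac = (math.sqrt(upper) - math.sqrt(level)) / (
                    math.sqrt(upper) - math.sqrt(lower)
                )
                aligned.append(t + frac)
                break
    return aligned


# Assumed linear training schedule matching the example above.
training_betas = [0.0001 + i * (0.05 - 0.0001) / 49 for i in range(50)]
fast_betas = [0.0001, 0.001, 0.01, 0.05, 0.2, 0.5]
steps = align_fast_schedule(training_betas, fast_betas)
print(steps)
```

With this mapping, six denoising iterations stand in for the full 50-step reverse process, which is the point of fast sampling.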