speechbrain.lobes.models.DiffWave module
Neural network modules for DiffWave: A Versatile Diffusion Model for Audio Synthesis
For more details, see: https://arxiv.org/pdf/2009.09761.pdf
- Authors
Yingzhi WANG 2022
Summary
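Classes
DiffWave | DiffWave model with dilated residual blocks
DiffWaveDiffusion | An enhanced diffusion implementation with DiffWave-specific inference
DiffusionEmbedding | Embeds the diffusion step into an input vector of DiffWave
ResidualBlock | Residual block with dilated convolution
SpectrogramUpsampler | Upsamples the spectrogram with transposed convolution. Only the upsampling is done here; the layer-specific convolution that maps the mel bands into 2× residual channels can be found in the residual block
Functions
diffwave_mel_spectogram | Calculates the MelSpectrogram for a raw audio signal and preprocesses it for DiffWave training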
Reference
- speechbrain.lobes.models.DiffWave.diffwave_mel_spectogram(sample_rate, hop_length, win_length, n_fft, n_mels, f_min, f_max, power, normalized, norm, mel_scale, audio)
Calculates the MelSpectrogram for a raw audio signal and preprocesses it for DiffWave training.
- Parameters:
sample_rate (int) – Sample rate of the audio signal.
hop_length (int) – Length of the hop between STFT windows.
win_length (int) – Window size.
n_fft (int) – Size of the FFT.
n_mels (int) – Number of mel filterbanks.
f_min (float) – Minimum frequency.
f_max (float) – Maximum frequency.
power (float) – Exponent for the magnitude spectrogram.
normalized (bool) – Whether to normalize by magnitude after the STFT.
norm (str or None) – If "slaney", divide the triangular mel weights by the width of the mel band.
mel_scale (str) – Scale to use: "htk" or "slaney".
audio (torch.Tensor) – Input audio signal.
- Returns:
mel
- Return type:
torch.Tensor
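To make the f_min, f_max, n_mels, and mel_scale parameters more concrete, here is a minimal sketch of how the band edges of an "htk" mel filterbank could be laid out. It uses the standard HTK formula mel = 2595 * log10(1 + f / 700). The helper names hz_to_mel_htk, mel_to_hz_htk, and mel_band_edges are hypothetical, the numeric values are illustrative only, and this is not the code that diffwave_mel_spectogram runs internally.

```python
import math


def hz_to_mel_htk(freq_hz: float) -> float:
    """Convert a frequency in Hz to mel using the HTK formula."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz_htk(mel: float) -> float:
    """Inverse of hz_to_mel_htk."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(f_min: float, f_max: float, n_mels: int) -> list:
    """Return n_mels + 2 edge frequencies (Hz), evenly spaced on the mel scale.

    Triangular filter i spans edges[i] .. edges[i + 2] and peaks at edges[i + 1].
    """
    mel_lo, mel_hi = hz_to_mel_htk(f_min), hz_to_mel_htk(f_max)
    step = (mel_hi - mel_lo) / (n_mels + 1)
    return [mel_to_hz_htk(mel_lo + i * step) for i in range(n_mels + 2)]


# Illustrative values only (not defaults taken from the library).
edges = mel_band_edges(f_min=0.0, f_max=8000.0, n_mels=80)
```

Because the edges are evenly spaced in mel rather than in Hz, the bands are narrow at low frequencies and widen toward f_max.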
- class speechbrain.lobes.models.DiffWave.DiffusionEmbedding(max_steps)
Bases:
Module
Embeds the diffusion step into an input vector of DiffWave.
- Parameters:
max_steps (int) – Total number of diffusion steps.
Example
>>> import torch
>>> from speechbrain.lobes.models.DiffWave import DiffusionEmbedding
>>> diffusion_embedding = DiffusionEmbedding(max_steps=50)
>>> time_step = torch.randint(50, (1,))
>>> step_embedding = diffusion_embedding(time_step)
>>> step_embedding.shape
torch.Size([1, 512])
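As a rough illustration of what a diffusion-step embedding can look like, the sketch below implements the 128-dimensional sinusoidal encoding described in the DiffWave paper. That this module uses exactly this encoding is an assumption, and the function name sinusoidal_step_encoding is hypothetical. The 512-dimensional output in the example above presumably comes from learned projection layers applied afterwards, which the sketch omits.

```python
import math


def sinusoidal_step_encoding(step: int, half_dim: int = 64) -> list:
    """Encode an integer diffusion step as half_dim sines followed by half_dim cosines.

    Frequencies are 10 ** (4 * j / (half_dim - 1)) for j = 0 .. half_dim - 1,
    as in the DiffWave paper (an assumption about this module's internals).
    """
    freqs = [10.0 ** (4.0 * j / (half_dim - 1)) for j in range(half_dim)]
    sines = [math.sin(step * f) for f in freqs]
    cosines = [math.cos(step * f) for f in freqs]
    return sines + cosines


encoding = sinusoidal_step_encoding(step=7)
```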
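- class speechbrain.lobes.models.DiffWave.SpectrogramUpsampler
Bases:
Module
Upsamples the spectrogram with transposed convolution. Only the upsampling is done here; the layer-specific convolution that maps the mel bands into 2× residual channels can be found in the residual block.
Example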
>>> import torch
>>> from speechbrain.lobes.models.DiffWave import SpectrogramUpsampler
>>> spec_upsampler = SpectrogramUpsampler()
>>> mel_input = torch.rand(3, 80, 100)
>>> upsampled_mel = spec_upsampler(mel_input)
>>> upsampled_mel.shape
torch.Size([3, 80, 25600])
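The example above turns 100 mel frames into 25600 samples, a factor of 256. One way to arrive at that factor is two transposed convolutions that each upsample by 16 along the time axis. The sketch below checks the arithmetic using the standard PyTorch output-length formula for transposed convolutions. The layer settings (kernel 32, stride 16, padding 8) are an assumption borrowed from the public reference DiffWave implementation and are not confirmed by this page; conv_transpose_out_len is a hypothetical helper.

```python
def conv_transpose_out_len(length_in, kernel_size, stride, padding,
                           dilation=1, output_padding=0):
    """Output length of a transposed convolution along one axis (PyTorch formula)."""
    return ((length_in - 1) * stride - 2 * padding
            + dilation * (kernel_size - 1) + output_padding + 1)


# Assumed layer settings on the time axis; see the note above.
frames = 100
after_first = conv_transpose_out_len(frames, kernel_size=32, stride=16, padding=8)
after_second = conv_transpose_out_len(after_first, kernel_size=32, stride=16, padding=8)
```

With these settings each layer multiplies the length by exactly 16, so the mel sequence ends up aligned sample by sample with the waveform.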
- class speechbrain.lobes.models.DiffWave.ResidualBlock(n_mels, residual_channels, dilation, uncond=False)
Bases:
Module
Residual block with dilated convolution.
- Parameters:
Example
>>> import torch
>>> from speechbrain.lobes.models.DiffWave import ResidualBlock
>>> res_block = ResidualBlock(n_mels=80, residual_channels=64, dilation=3)
>>> noisy_audio = torch.randn(1, 1, 22050)
>>> timestep_embedding = torch.rand(1, 512)
>>> upsampled_mel = torch.rand(1, 80, 22050)
>>> output = res_block(noisy_audio, timestep_embedding, upsampled_mel)
>>> output[0].shape
torch.Size([1, 64, 22050])
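The example above keeps the time dimension at 22050 samples even with dilation=3. A dilated convolution preserves length when the zero padding equals the dilation and the kernel has size 3. The sketch below shows this on a plain Python list for a single channel. The kernel size of 3 is an assumption about the block, dilated_conv1d is a hypothetical helper, and the gating and skip connections of the real block are left out.

```python
def dilated_conv1d(signal, weights, dilation):
    """Kernel-size-3 dilated 1-D convolution with zero padding equal to dilation.

    Output sample t combines signal[t - dilation], signal[t], and
    signal[t + dilation], so the output has the same length as the input.
    """
    if len(weights) != 3:
        raise ValueError("this sketch assumes a kernel of size 3")
    n = len(signal)
    out = []
    for t in range(n):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t + (k - 1) * dilation
            if 0 <= idx < n:  # positions outside the signal count as zero padding
                acc += w * signal[idx]
        out.append(acc)
    return out


signal = [float(i) for i in range(10)]
summed = dilated_conv1d(signal, weights=[1.0, 0.0, 1.0], dilation=3)
```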
- class speechbrain.lobes.models.DiffWave.DiffWave(input_channels, residual_layers, residual_channels, dilation_cycle_length, total_steps, unconditional=False)
Bases:
Module
DiffWave model with dilated residual blocks.
- Parameters:
Example
>>> import torch
>>> from speechbrain.lobes.models.DiffWave import DiffWave
>>> diffwave = DiffWave(
...     input_channels=80,
...     residual_layers=30,
...     residual_channels=64,
...     dilation_cycle_length=10,
...     total_steps=50,
... )
>>> noisy_audio = torch.randn(1, 1, 25600)
>>> timestep = torch.randint(50, (1,))
>>> input_mel = torch.rand(1, 80, 100)
>>> predicted_noise = diffwave(noisy_audio, timestep, input_mel)
>>> predicted_noise.shape
torch.Size([1, 1, 25600])
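To illustrate how residual_layers and dilation_cycle_length interact, the sketch below follows the scheme from the DiffWave paper, in which the dilation doubles at every layer and resets at the start of each cycle. That this class assigns dilations in exactly this way is an assumption, as is the kernel size of 3 used for the receptive-field estimate. Both helper names are hypothetical.

```python
def dilation_schedule(residual_layers, dilation_cycle_length):
    """Dilation of each residual layer: doubles per layer, resets every cycle."""
    return [2 ** (i % dilation_cycle_length) for i in range(residual_layers)]


def receptive_field(dilations, kernel_size=3):
    """Receptive field (in samples) of a stack of dilated convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Same layer settings as the example above.
dilations = dilation_schedule(residual_layers=30, dilation_cycle_length=10)
field = receptive_field(dilations)
```

Under these assumptions, the 30 layers form three cycles with dilations 1 to 512, for a receptive field of 6139 samples.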
- forward(audio, diffusion_step, spectrogram=None, length=None)
DiffWave forward function.
- Parameters:
audio (torch.Tensor) – Input Gaussian sample [bs, 1, time].
diffusion_step (torch.Tensor) – Diffusion timestep to perform [bs, 1].
spectrogram (torch.Tensor) – Spectrogram data [bs, 80, mel_len].
length (torch.Tensor) – Sample lengths. Not used; provided only for compatibility.
- Returns:
Predicted noise [bs, 1, time]
- class speechbrain.lobes.models.DiffWave.DiffWaveDiffusion(model, timesteps=None, noise=None, beta_start=None, beta_end=None, sample_min=None, sample_max=None, show_progress=False)
An enhanced diffusion implementation with DiffWave-specific inference.
- Parameters:
Example
>>> import torch
>>> from speechbrain.lobes.models.DiffWave import DiffWave
>>> diffwave = DiffWave(
...     input_channels=80,
...     residual_layers=30,
...     residual_channels=64,
...     dilation_cycle_length=10,
...     total_steps=50,
... )
>>> from speechbrain.lobes.models.DiffWave import DiffWaveDiffusion
>>> from speechbrain.nnet.diffusion import GaussianNoise
>>> diffusion = DiffWaveDiffusion(
...     model=diffwave,
...     beta_start=0.0001,
...     beta_end=0.05,
...     timesteps=50,
...     noise=GaussianNoise,
... )
>>> input_mel = torch.rand(1, 80, 100)
>>> output = diffusion.inference(
...     unconditional=False,
...     scale=256,
...     condition=input_mel,
...     fast_sampling=True,
...     fast_sampling_noise_schedule=[0.0001, 0.001, 0.01, 0.05, 0.2, 0.5],
... )
>>> output.shape
torch.Size([1, 25600])
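To show what beta_start, beta_end, and timesteps control, the sketch below builds a noise schedule and applies the standard forward-diffusion step x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps. Linear spacing between beta_start and beta_end is an assumption, all helper names are hypothetical, and the code is a pure-Python illustration rather than what DiffWaveDiffusion executes.

```python
import math
import random


def linear_betas(beta_start, beta_end, timesteps):
    """Linearly spaced noise variances beta_1 .. beta_T (assumed spacing)."""
    if timesteps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (timesteps - 1)
    return [beta_start + i * step for i in range(timesteps)]


def cumulative_alphas(betas):
    """alpha_bar_t, the running product of (1 - beta_s) for s <= t."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


def add_noise(clean, alpha_bar_t, rng):
    """Sample x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in clean]
    signal_scale = math.sqrt(alpha_bar_t)
    noise_scale = math.sqrt(1.0 - alpha_bar_t)
    noisy = [signal_scale * x + noise_scale * e for x, e in zip(clean, eps)]
    return noisy, eps


# Same schedule settings as the example above.
betas = linear_betas(beta_start=0.0001, beta_end=0.05, timesteps=50)
alpha_bars = cumulative_alphas(betas)
rng = random.Random(0)
noisy, eps = add_noise([0.0, 0.5, -0.5, 1.0], alpha_bars[-1], rng)
```

During training the model is asked to predict eps from the noisy input and the step index; inference runs the process in reverse, starting from pure noise.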
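- inference(unconditional, scale, condition=None, fast_sampling=False, fast_sampling_noise_schedule=None, device=None)
Handles inference for DiffWave. A single inference function serves all locally/globally conditioned generation and unconditional generation tasks.
- Parameters:
unconditional (bool) – If True, perform unconditional generation; otherwise perform conditional generation.
scale (int) – Scale used to obtain the final output waveform length. For conditional generation, the output length is scale * condition.shape[-1]; for example, if the condition is a spectrogram (bs, n_mel, time), scale should be the hop length. For unconditional generation, scale should be the desired audio length.
condition (torch.Tensor) – Input spectrogram for vocoding, or another condition for conditional generation. Should be None for unconditional generation.
fast_sampling (bool) – Whether to use fast sampling.
fast_sampling_noise_schedule (list) – Noise schedule used for fast sampling.
device (str|torch.device) – Inference device.
- Returns:
predicted_sample – The predicted audio (bs, 1, t).
- Return type:
torch.Tensor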
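The sketch below illustrates two points from the parameter list: how scale determines the conditional output length, and how a short fast-sampling noise schedule can be aligned to the longer training schedule. The alignment follows the approach of the public reference DiffWave implementation, which matches cumulative noise levels and interpolates a fractional training step. Whether inference does exactly this is an assumption, as is the linear spacing of the training schedule; the helper names are hypothetical.

```python
def cumulative_alphas(betas):
    """Running product of (1 - beta) over the schedule."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


def align_fast_steps(train_betas, fast_betas):
    """Map each fast-sampling step to a fractional training step.

    For each fast step, find t such that
    train_alpha_bar[t + 1] <= fast_alpha_bar <= train_alpha_bar[t]
    and interpolate between t and t + 1 on the sqrt(alpha_bar) scale.
    Fast steps whose noise level falls outside the training range are skipped.
    """
    train_bar = cumulative_alphas(train_betas)
    fast_bar = cumulative_alphas(fast_betas)
    aligned = []
    for level in fast_bar:
        for t in range(len(train_bar) - 1):
            hi, lo = train_bar[t], train_bar[t + 1]
            if lo <= level <= hi:
                frac = (hi ** 0.5 - level ** 0.5) / (hi ** 0.5 - lo ** 0.5)
                aligned.append(t + frac)
                break
    return aligned


# Schedule values mirror the class example; linear spacing is assumed.
train_betas = [0.0001 + i * (0.05 - 0.0001) / 49 for i in range(50)]
fast_betas = [0.0001, 0.001, 0.01, 0.05, 0.2, 0.5]
aligned_steps = align_fast_steps(train_betas, fast_betas)

# Conditional output length: scale (the hop length) times the number of mel frames.
scale, mel_frames = 256, 100
output_length = scale * mel_frames
```

The aligned fractional steps are what a model would receive as its diffusion-step input during fast sampling, so that each of the few inference steps is conditioned on a noise level it saw in training.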