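Speech Features

Speech is a very **high-dimensional** signal. For instance, with a sampling frequency of 16 kHz there are 16000 samples every second. From a machine learning perspective, working with such high-dimensional data can be difficult. The goal of feature extraction is to find a more **compact** way to represent speech.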
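To make the size reduction concrete, the short calculation below compares the number of raw samples in one second of 16 kHz audio with the number of values in a frame-based feature representation. The 10 ms frame shift and the 40 features per frame are assumptions chosen to match settings used later in this tutorial.

```python
sample_rate = 16000   # samples per second (16 kHz)
hop_ms = 10           # assumed frame shift in milliseconds
n_features = 40       # assumed number of features per frame

frames_per_second = 1000 // hop_ms
raw_values = sample_rate
feature_values = frames_per_second * n_features

print(raw_values)                   # 16000
print(feature_values)               # 4000
print(raw_values / feature_values)  # 4.0
```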

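Some years ago, the search for good speech features was a very active research area. With the advent of deep learning, however, the trend is to feed neural networks with **simple features** and let the network discover higher-level representations by itself.

In this tutorial, we describe the two most popular speech features:

Filter banks (FBANKs)
Mel-frequency cepstral coefficients (MFCCs)

We then mention some common techniques for adding context information.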

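1. Filter Banks (FBANKs)

FBANKs are time-frequency representations computed by applying **a set of filters** to the spectrogram of a speech signal. See the tutorial on Fourier transforms and spectrograms for more details.

First, let's install SpeechBrain and download a speech signal: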

%%capture
# Installing SpeechBrain via pip
BRANCH = 'develop'
!python -m pip install git+https://github.com/speechbrain/speechbrain.git@$BRANCH

# Clone SpeechBrain repository
!git clone https://github.com/speechbrain/speechbrain/
%cd /content/speechbrain/

%%capture
!wget https://www.dropbox.com/s/u8qyvuyie2op286/spk1_snt1.wav

Now let's compute the spectrogram of the speech signal:

import torch
import matplotlib.pyplot as plt
from speechbrain.dataio.dataio import read_audio
from speechbrain.processing.features import STFT

signal = read_audio('spk1_snt1.wav').unsqueeze(0) # [batch, time]

compute_STFT = STFT(sample_rate=16000, win_length=25, hop_length=10, n_fft=400)
signal_STFT = compute_STFT(signal)

spectrogram = signal_STFT.pow(2).sum(-1) # Power spectrogram
spectrogram = spectrogram.squeeze(0).transpose(0,1)
spectrogram = torch.log(spectrogram)

plt.imshow(spectrogram.squeeze(0), cmap='hot', interpolation='nearest', origin='lower')
plt.xlabel('Time')
plt.ylabel('Frequency')
plt.show()
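The STFT arguments in the cell above are given in milliseconds. The sketch below converts them to samples and derives the number of frequency bins of a one-sided spectrum. The rounding used here is an assumption about how the conversion is done and may differ in detail from the library's internals.

```python
sample_rate = 16000
win_length_ms = 25
hop_length_ms = 10
n_fft = 400

# Convert window and hop durations from milliseconds to samples
win_samples = int(round(sample_rate / 1000.0 * win_length_ms))
hop_samples = int(round(sample_rate / 1000.0 * hop_length_ms))

# A real-valued signal yields n_fft // 2 + 1 non-redundant frequency bins
n_freq_bins = n_fft // 2 + 1

print(win_samples, hop_samples, n_freq_bins)  # 400 160 201
```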

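One way to compress the signal is to average the spectrogram along the frequency axis. This can be done with a set of filters.

From the spectrogram, we can see that most of the energy is concentrated in the **lower part of the spectrum**. It is therefore sensible to allocate more filters to the lower part of the spectrum and fewer to the higher frequencies. This is exactly what mel filters do.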

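Each filter is **triangular**, with a response of 1 at its center frequency. The response decreases linearly to 0 at the center frequencies of the two adjacent filters (see figure). There is therefore some **overlap** between adjacent filters.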
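The triangular shape described above can be sketched in a few lines. The function name `triangular_response` and the three example frequencies are hypothetical and chosen only for illustration; this is not the SpeechBrain implementation.

```python
def triangular_response(f, left, center, right):
    """Response of one triangular filter at frequency f (all values in Hz).

    left and right are the center frequencies of the two adjacent filters.
    """
    if left < f <= center:
        return (f - left) / (center - left)      # rising edge
    if center < f < right:
        return (right - f) / (right - center)    # falling edge
    return 0.0

# Hypothetical center frequencies of three adjacent filters
left, center, right = 300.0, 400.0, 520.0
for f in (300.0, 350.0, 400.0, 460.0, 520.0):
    print(f, triangular_response(f, left, center, right))
```

The response is 1 at 400 Hz, 0.5 halfway along each edge, and 0 at the neighboring centers, so each frequency between two centers contributes to both adjacent filters.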

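The filters are designed to be equally spaced in the mel-frequency domain. The following non-linear transformation converts from the linear frequency domain to the mel-frequency domain, and its inverse converts back: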

\( m = 2595 \log_{10}\left(1 + \frac{f}{700}\right) \)

\( f = 700 \left(10^{m/2595} - 1\right) \),

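where \(m\) is the mel-frequency component and \(f\) is the standard frequency component in hertz. The mel-frequency domain is compressed logarithmically. As a result, filters that are equally spaced in the mel domain are not equally spaced in the linear frequency domain: as desired, there are more filters in the lower part of the spectrum and fewer in the higher part.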
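The two conversion formulas translate directly into code. The sketch below also places a few filter centers at equal mel spacing and prints the resulting gaps in hertz. The helper `mel_center_frequencies`, the choice of 10 filters, and the 0 to 8000 Hz range are assumptions made for illustration.

```python
import math

def hz_to_mel(f):
    """Convert a frequency in Hz to the mel scale."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Convert a mel value back to a frequency in Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_center_frequencies(n_filters, f_min=0.0, f_max=8000.0):
    """Center frequencies (Hz) of filters equally spaced on the mel scale."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_filters + 1)
    return [mel_to_hz(m_min + step * (i + 1)) for i in range(n_filters)]

centers = mel_center_frequencies(10)
gaps = [b - a for a, b in zip(centers, centers[1:])]
print([round(c) for c in centers])  # center frequencies in Hz
print([round(g) for g in gaps])     # spacing grows with frequency
```

The gaps between consecutive centers increase with frequency, which is the denser coverage of the low frequencies described above.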

Now let's compute FBANKs with SpeechBrain:

from speechbrain.processing.features import spectral_magnitude
from speechbrain.processing.features import Filterbank

compute_fbanks = Filterbank(n_mels=40)

stft = compute_STFT(signal)  # lowercase name so the imported STFT class is not shadowed
mag = spectral_magnitude(stft)
fbanks = compute_fbanks(mag)

print(stft.shape)
print(mag.shape)
print(fbanks.shape)

plt.imshow(fbanks.squeeze(0).t(), cmap='hot', interpolation='nearest', origin='lower')
plt.xlabel('Time')
plt.ylabel('Frequency')
plt.show()

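Usually, 40 or 80 FBANKs are computed. As the shapes show, the time axis keeps the same dimensionality, while the frequency axis is reduced. You can think of FBANKs as a simple way to **compress** the rich information embedded in the spectrogram.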
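The shape change can be reproduced on toy data. In the sketch below, each output value is a weighted sum of the frequency bins of one frame, so the number of frames is preserved while the frequency dimension shrinks to the number of filters. The function `apply_filterbank`, the spectrogram values, and the filter weights are made up for illustration.

```python
def apply_filterbank(spectrogram, filters):
    """spectrogram: [time][freq], filters: [n_filters][freq] -> [time][n_filters]."""
    return [
        [sum(w * x for w, x in zip(filt, frame)) for filt in filters]
        for frame in spectrogram
    ]

# Toy data: 3 frames with 6 frequency bins each
spectrogram = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [0.0, 1.0, 0.0, 1.0, 0.0, 1.0],
    [2.0, 2.0, 2.0, 2.0, 2.0, 2.0],
]
# Two hypothetical overlapping triangular filters over the 6 bins
filters = [
    [0.5, 1.0, 0.5, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.5, 1.0, 0.5, 0.0],
]

fbanks = apply_filterbank(spectrogram, filters)
print(len(spectrogram), len(spectrogram[0]))  # 3 6
print(len(fbanks), len(fbanks[0]))            # 3 2
print(fbanks)  # [[4.0, 8.0], [1.0, 1.0], [4.0, 4.0]]
```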

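The SpeechBrain implementation of the filter banks is designed to support different filter shapes (triangular, rectangular, Gaussian). Moreover, when freeze=False, the filters are not frozen and can be tuned during training.

To simplify the computation of FBANKs, we created a lobe that performs all the necessary steps in a single function: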

from speechbrain.lobes.features import Fbank
fbank_maker = Fbank()
fbanks = fbank_maker(signal)

plt.imshow(fbanks.squeeze(0).t(), cmap='hot', interpolation='nearest', origin='lower')
plt.xlabel('Time')
plt.ylabel('Frequency')
plt.show()

# Zoom of first 80 steps
plt.imshow(fbanks.squeeze(0).t()[:,0:80], cmap='hot', interpolation='nearest', origin='lower')
plt.xlabel('Time')
plt.ylabel('Frequency')
plt.show()

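2. Mel-Frequency Cepstral Coefficients (MFCCs)

MFCCs are computed by applying a Discrete Cosine Transform (DCT) on top of the FBANKs. The DCT is a transformation that decorrelates the features and can be used to compress them further.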
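As a rough sketch of this step, the code below applies an unnormalized DCT-II to one frame of log filter bank energies and keeps only the first few coefficients. The function names, the frame values, and the choice of an unnormalized DCT-II are assumptions; a library implementation may use a different normalization.

```python
import math

def dct_ii(x):
    """Unnormalized DCT-II of a list of numbers."""
    n = len(x)
    return [
        sum(x[i] * math.cos(math.pi / n * (i + 0.5) * k) for i in range(n))
        for k in range(n)
    ]

def mfcc_from_log_fbanks(log_fbanks, n_mfcc):
    """Keep the first n_mfcc DCT coefficients of one frame."""
    return dct_ii(log_fbanks)[:n_mfcc]

# Toy frame with 8 log filter bank energies (made-up values)
frame = [2.0, 2.1, 1.9, 1.5, 1.0, 0.8, 0.7, 0.6]
coeffs = mfcc_from_log_fbanks(frame, n_mfcc=4)
print([round(c, 3) for c in coeffs])
```

Truncating the DCT output is what provides the extra compression: the low-order coefficients capture the smooth shape of the spectrum, and the higher-order ones are discarded.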

To simplify the computation of MFCCs, we created a lobe for that as well:

from speechbrain.lobes.features import MFCC
mfcc_maker = MFCC(n_mfcc=13, deltas=False, context=False)
mfccs = mfcc_maker(signal)

plt.imshow(mfccs.squeeze(0).t(), cmap='hot', interpolation='nearest', origin='lower')
plt.xlabel('Time')
plt.ylabel('Frequency')
plt.show()

#Zoom of the first 25 steps
plt.imshow(mfccs.squeeze(0).t()[:,0:25], cmap='hot', interpolation='nearest', origin='lower')
plt.xlabel('Time')
plt.ylabel('Frequency')
plt.show()

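In the past, working with decorrelated features was crucial. Earlier machine learning techniques such as Gaussian Mixture Models (GMMs) were not well suited to modeling correlated data. Deep neural networks, in contrast, work well even with **correlated data**, so FBANKs are now the preferred choice.

3. Context Information

Proper management of local context is crucial for most speech processing tasks. In the past, the dominant solution was to build a "hand-crafted" context with the following approaches:

Derivatives
Context windows

3.1 Derivatives

The idea behind derivatives is to introduce local context by simply computing the **difference** with adjacent features. Derivatives are usually computed on MFCC coefficients: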

from speechbrain.lobes.features import MFCC
mfcc_maker = MFCC(n_mfcc=13, deltas=True, context=False)
mfccs_with_deltas = mfcc_maker(signal)

print(mfccs.shape)
print(mfccs_with_deltas.shape)

plt.imshow(mfccs_with_deltas.squeeze(0).t(), cmap='hot', interpolation='nearest', origin='lower')
plt.xlabel('Time')
plt.ylabel('Frequency')
plt.show()

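The first and second-order derivatives are called delta and delta-delta coefficients, respectively, and are concatenated with the static coefficients. In the example, the dimensionality is therefore 39 (13 static coefficients, 13 deltas, 13 delta-deltas).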
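The concatenation can be illustrated with a simple central difference along the time axis. This is only a sketch under stated assumptions: the helper names are hypothetical, and SpeechBrain's own delta computation may use a different formula (for example a regression over a wider window).

```python
def simple_delta(features):
    """Central difference along time; edge frames reuse their nearest neighbor."""
    n = len(features)
    deltas = []
    for t in range(n):
        prev_frame = features[max(t - 1, 0)]
        next_frame = features[min(t + 1, n - 1)]
        deltas.append([(b - a) / 2.0 for a, b in zip(prev_frame, next_frame)])
    return deltas

def add_deltas(features):
    """Concatenate static, delta and delta-delta coefficients per frame."""
    d1 = simple_delta(features)
    d2 = simple_delta(d1)
    return [s + a + b for s, a, b in zip(features, d1, d2)]

# Toy input: 5 frames with 13 static coefficients each
static = [[float(t * c) for c in range(13)] for t in range(5)]
with_deltas = add_deltas(static)
print(len(static[0]), len(with_deltas[0]))  # 13 39
```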

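3.2 Context Windows

A context window adds local context by simply **concatenating** several consecutive features. The result is a larger feature vector that is more "aware" of the local information.

Let's see an example: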

from speechbrain.lobes.features import MFCC
mfcc_maker = MFCC(n_mfcc=13,
                  deltas=True,
                  context=True,
                  left_frames=5,
                  right_frames=5)
mfccs_with_context = mfcc_maker(signal)

print(mfccs.shape)
print(mfccs_with_deltas.shape)
print(mfccs_with_context.shape)

In this case, we concatenate the current frame with the 5 past and 5 future frames. The total dimensionality is therefore \(39 \times (5+5+1) = 429\).
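The same arithmetic can be checked with a small stand-alone sketch of a context window. The function `add_context`, the zero-padding at the sequence edges, and the left-to-right ordering of the stacked frames are assumptions made for illustration and may not match the library's behavior.

```python
def add_context(features, left_frames, right_frames):
    """Concatenate each frame with its neighbors; missing frames are zero-padded."""
    n = len(features)
    zero = [0.0] * len(features[0])
    out = []
    for t in range(n):
        stacked = []
        for offset in range(-left_frames, right_frames + 1):
            idx = t + offset
            stacked.extend(features[idx] if 0 <= idx < n else zero)
        out.append(stacked)
    return out

# Toy input: 20 frames of 39 features, where frame t holds the value t
frames = [[float(t)] * 39 for t in range(20)]
with_context = add_context(frames, left_frames=5, right_frames=5)
print(len(with_context), len(with_context[0]))  # 20 429
```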

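Rather than using the solutions above, the current trend is to use static features and progressively add **learnable context** through the **receptive field** of a **Convolutional Neural Network** (CNN). CNNs are often used in the early layers of a neural speech processing system to derive robust, context-aware representations.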
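To see how the context grows with depth, the sketch below computes the receptive field of a stack of 1D convolutions over the time axis using the standard recurrence. The function name and the three-layer configuration are hypothetical and do not correspond to any particular SpeechBrain model.

```python
def receptive_field(layers):
    """Receptive field, in input frames, of stacked 1D convolutions.

    layers: list of (kernel_size, stride, dilation) tuples, input to output.
    """
    rf, jump = 1, 1
    for kernel_size, stride, dilation in layers:
        rf += (kernel_size - 1) * dilation * jump
        jump *= stride  # distance between neighboring outputs, in input frames
    return rf

# Hypothetical stack: two kernel-5 layers followed by a dilated kernel-3 layer
layers = [(5, 1, 1), (5, 1, 1), (3, 1, 2)]
print(receptive_field(layers))  # 13
```

With a 10 ms frame shift, each output of this hypothetical stack depends on 13 input frames, a context comparable to the hand-crafted window of 11 frames used above, but with learnable weights.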

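4. Other Features

A more recent trend is to feed neural networks with **raw data**. Feeding neural networks directly with **spectrograms** or even the **STFT** has become very common. It is also possible to feed neural networks directly with **raw time-domain samples**. This is made easier by properly designed networks such as SincNet. SincNet uses a parametrized convolutional layer, called SincConv, that can learn from raw samples. SincNet is described in a separate tutorial.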

Citing SpeechBrain

If you use SpeechBrain in your research or business, please cite it using the following BibTeX entries:

@misc{speechbrainV1,
  title={Open-Source Conversational AI with {SpeechBrain} 1.0},
  author={Mirco Ravanelli and Titouan Parcollet and Adel Moumen and Sylvain de Langen and Cem Subakan and Peter Plantinga and Yingzhi Wang and Pooneh Mousavi and Luca Della Libera and Artem Ploujnikov and Francesco Paissan and Davide Borra and Salah Zaiem and Zeyu Zhao and Shucong Zhang and Georgios Karakasidis and Sung-Lin Yeh and Pierre Champion and Aku Rouhe and Rudolf Braun and Florian Mai and Juan Zuluaga-Gomez and Seyed Mahed Mousavi and Andreas Nautsch and Xuechen Liu and Sangeet Sagar and Jarod Duret and Salima Mdhaffar and Gaelle Laperriere and Mickael Rouvier and Renato De Mori and Yannick Esteve},
  year={2024},
  eprint={2407.00463},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2407.00463},
}
@misc{speechbrain,
  title={{SpeechBrain}: A General-Purpose Speech Toolkit},
  author={Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},
  year={2021},
  eprint={2106.04624},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  note={arXiv:2106.04624}
}