Applying Quantization to Speech Recognition Models
Introduction to Quantization
Quantization is often necessary for low-latency applications of SpeechBrain automatic speech recognition models, such as real-time speech recognition.
Quantization works by converting a model's weights and activations from floating-point values to lower-resolution values, such as 8-bit integers. This not only reduces the model's memory footprint but also lowers inference latency, since integer arithmetic is generally faster than floating-point arithmetic.
The conversion maps values from a given range into the quantized range and "snaps" each value to the nearest representable value at the chosen resolution. Two key concepts describe this mapping: the zero point and the scale factor.
Zero point: the quantized value that 0 is mapped to during quantization.
Scale factor: the factor by which the data range is scaled so that it fits into the quantized range.
Together, the zero point and scale factor describe how the mapping works.
In other words,
\(y = \mathrm{round}\left(\frac{x}{S} + Z\right)\)
where \(x\) is the original value, \(y\) is the quantized value, \(S\) is the scale factor, and \(Z\) is the zero point.
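To make this mapping concrete, here is a minimal sketch (not part of the original notebook) using PyTorch's per-tensor affine quantization; the scale and zero point below are picked by hand purely for illustration.
# Illustrative example of the mapping above: q = round(x / S + Z), clamped to
# the 8-bit unsigned range. S and Z are hypothetical values chosen for this demo.
import torch
x = torch.tensor([-1.0, 0.0, 0.5, 1.0])
S, Z = 0.01, 128
q = torch.quantize_per_tensor(x, scale=S, zero_point=Z, dtype=torch.quint8)
print(q.int_repr())    # integer representation: round(x / S + Z) -> [28, 128, 178, 228]
print(q.dequantize())  # approximate reconstruction: (q - Z) * S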
Quantization Methods
Quantization can be categorized by when it happens: in quantization-aware training (QAT), quantization takes place during training, whereas in post-training quantization (PTQ), quantization is applied only after the model has been trained. This tutorial is about quantizing pretrained models, which means it will focus on the latter.
PTQ can be further divided into two approaches based on when the model's activations are quantized. Dynamic quantization does this during model inference, while static quantization does it before inference takes place.
For all types of quantization, the weights can be quantized ahead of time, because they depend on the model itself rather than on the input data. This means information about the range of weight values is already available at quantization time, so the weights can be quantized without any extra information.
The model's activations, i.e. the values after the activation functions are applied, depend on the input data, however. This means the range of activation values can change at runtime, which motivates the different quantization approaches.
Dynamic Quantization
In dynamic quantization, submodules are converted to quantized versions during a preparation step so that the weights are quantized appropriately. Then, during inference, each quantized layer observes the data fed into it and adjusts its quantization parameters accordingly. This happens repeatedly as inference runs, hence the name "dynamic" quantization.
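For reference, here is a minimal, self-contained sketch of dynamic quantization on a toy module (the toy network and its sizes are made up for illustration); the same quantize_dynamic call is applied later in this tutorial to selected SpeechBrain submodules.
# Dynamic quantization of a toy network: only the nn.Linear weights are
# converted ahead of time; activation ranges are observed on the fly at inference.
import torch
import torch.nn as nn
toy = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))
toy_q = torch.ao.quantization.quantize_dynamic(toy, {nn.Linear}, dtype=torch.qint8)
print(toy_q)  # the Linear layers are now DynamicQuantizedLinear
print(toy_q(torch.randn(4, 16)).shape)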
Static Quantization
In contrast to dynamic quantization, static quantization makes no adjustments at runtime. Instead, observer modules are inserted at selected points in the layers to be quantized, and the model is run over a set of representative data samples. The observer modules then choose quantization parameters based on the data fed through the model, and these parameters remain fixed at runtime.
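For orientation, here is a minimal sketch of the standard PyTorch eager-mode static quantization workflow on a made-up toy module; later in the tutorial, SpeechBrain submodules are wrapped with the same stubs and calibrated in the same way.
# Static quantization of a toy module: insert quant/dequant stubs, attach a
# qconfig, calibrate on representative inputs, then convert.
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.ao.quantization.QuantStub()
        self.fc = nn.Linear(16, 8)
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))

toy = ToyModel().eval()
toy.qconfig = torch.ao.quantization.default_qconfig
torch.ao.quantization.prepare(toy, inplace=True)
for _ in range(8):
    toy(torch.randn(4, 16))  # calibration passes recorded by the observers
torch.ao.quantization.convert(toy, inplace=True)
print(toy)  # fc is now a QuantizedLinear with a fixed scale and zero point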
Dynamic vs. Static Quantization
Dynamic quantization does not fix the zero point and scale factor; it adjusts them at runtime based on the observed data. Static quantization, by contrast, requires an initial calibration phase. During calibration, observer modules record the range of the activation values and use it to determine the zero point and scale factor for quantization.
The advantage of dynamic quantization is that it needs no calibration and suits modules whose input data ranges can vary widely. Static quantization, on the other hand, does not have to perform on-the-fly quantization adjustments at runtime, which can reduce latency, though possibly at the cost of accuracy.
Purpose of This Tutorial
This tutorial shows how to adapt PyTorch quantization functions so they can be applied to SpeechBrain models, and how to benchmark the quantized models.
It focuses on pretrained automatic speech recognition (ASR) models, which can be easily loaded and used via the library's speechbrain.inference.ASR module.
Prerequisites
Install SpeechBrain
%%capture
# Installing SpeechBrain via pip
BRANCH = 'develop'
!python -m pip install git+https://github.com/speechbrain/speechbrain.git@$BRANCH
Install Other Dependencies
kenlm and pygtrie are external libraries that our chosen model relies on for its n-gram related functionality. If your model does not use them, you may not need these. Replace these installs with whatever external libraries your own model requires.
%%capture
!pip install https://github.com/kpu/kenlm/archive/master.zip
!pip install pygtrie
Imports
import gc
import numpy as np
import os
import sentencepiece
import speechbrain
import time
import torch
import torch.nn as nn
import tqdm
from collections import Counter
from copy import deepcopy
Model Selection
For the purposes of this tutorial, we will use a Wav2Vec 2.0 model with CTC trained on CommonVoice English.
The Wav2Vec 2.0 model is transformer-based. It is also an EncoderASR model, meaning it has no decoder layers and instead uses a decoding function. While the encoder does not use a language model, the decoding function can optionally use one for n-gram rescoring, which is why kenlm needs to be installed.
from speechbrain.inference.ASR import EncoderASR
asr_model = EncoderASR.from_hparams(
source="speechbrain/asr-wav2vec2-commonvoice-14-en",
savedir="/content/pretrained_ASR/asr-wav2vec2-commonvoice-14-en",
)
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:88: UserWarning:
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
warnings.warn(
Some weights of Wav2Vec2Model were not initialized from the model checkpoint at facebook/wav2vec2-large-lv60 and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
WARNING:speechbrain.lobes.models.huggingface_transformers.huggingface:speechbrain.lobes.models.huggingface_transformers.huggingface - Wav2Vec2Model is frozen.
Let's take a closer look at the model's submodules.
asr_model
EncoderASR(
(mods): ModuleDict(
(encoder): LengthsCapableSequential(
(wav2vec2): Wav2Vec2(
(model): Wav2Vec2Model(
(feature_extractor): Wav2Vec2FeatureEncoder(
(conv_layers): ModuleList(
(0): Wav2Vec2LayerNormConvLayer(
(conv): Conv1d(1, 512, kernel_size=(10,), stride=(5,))
(layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(activation): GELUActivation()
)
(1-4): 4 x Wav2Vec2LayerNormConvLayer(
(conv): Conv1d(512, 512, kernel_size=(3,), stride=(2,))
(layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(activation): GELUActivation()
)
(5-6): 2 x Wav2Vec2LayerNormConvLayer(
(conv): Conv1d(512, 512, kernel_size=(2,), stride=(2,))
(layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(activation): GELUActivation()
)
)
)
(feature_projection): Wav2Vec2FeatureProjection(
(layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(projection): Linear(in_features=512, out_features=1024, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(encoder): Wav2Vec2EncoderStableLayerNorm(
(pos_conv_embed): Wav2Vec2PositionalConvEmbedding(
(conv): ParametrizedConv1d(
1024, 1024, kernel_size=(128,), stride=(1,), padding=(64,), groups=16
(parametrizations): ModuleDict(
(weight): ParametrizationList(
(0): _WeightNorm()
)
)
)
(padding): Wav2Vec2SamePadLayer()
(activation): GELUActivation()
)
(layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
(layers): ModuleList(
(0-23): 24 x Wav2Vec2EncoderLayerStableLayerNorm(
(attention): Wav2Vec2Attention(
(k_proj): Linear(in_features=1024, out_features=1024, bias=True)
(v_proj): Linear(in_features=1024, out_features=1024, bias=True)
(q_proj): Linear(in_features=1024, out_features=1024, bias=True)
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(dropout): Dropout(p=0.1, inplace=False)
(layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(feed_forward): Wav2Vec2FeedForward(
(intermediate_dropout): Dropout(p=0.1, inplace=False)
(intermediate_dense): Linear(in_features=1024, out_features=4096, bias=True)
(intermediate_act_fn): GELUActivation()
(output_dense): Linear(in_features=4096, out_features=1024, bias=True)
(output_dropout): Dropout(p=0.1, inplace=False)
)
(final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
)
)
)
)
)
(enc): Sequential(
(linear1): Linear(
(w): Linear(in_features=1024, out_features=1024, bias=True)
)
(bn1): BatchNorm1d(
(norm): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(activation): LeakyReLU(negative_slope=0.01)
(drop): Dropout(p=0.15, inplace=False)
(linear2): Linear(
(w): Linear(in_features=1024, out_features=1024, bias=True)
)
(bn2): BatchNorm1d(
(norm): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(activation2): LeakyReLU(negative_slope=0.01)
(drop2): Dropout(p=0.15, inplace=False)
(linear3): Linear(
(w): Linear(in_features=1024, out_features=1024, bias=True)
)
(bn3): BatchNorm1d(
(norm): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(activation3): LeakyReLU(negative_slope=0.01)
)
(ctc_lin): Linear(
(w): Linear(in_features=1024, out_features=1000, bias=True)
)
(log_softmax): Softmax()
)
)
(decoding_function): CTCBeamSearcher()
)
Note that not all modules can be quantized, and some modules cannot be quantized with a particular method. In particular, note the following lists of modules that can be quantized without custom modifications to work around PyTorch's limitations:
Modules that can be dynamically quantized
nn.Linear
nn.LSTM
nn.GRU
nn.RNNCell
nn.GRUCell
nn.LSTMCell
nn.EmbeddingBag
nn.Embedding
Modules that can be statically quantized
nn.Linear
nn.Conv1d/2d/3d
nn.EmbeddingBag
nn.Embedding
With this information, we can start determining our quantization scheme. From our chosen model, we can identify the following submodules:
encoder.wav2vec2.model.feature_extractor: contains 7 nn.Conv1d layers and must be statically quantized.
encoder.wav2vec2.model.feature_projection: contains 1 nn.Linear layer and can be quantized dynamically or statically.
encoder.wav2vec2.model.encoder.pos_conv_embed: contains a ParametrizedConv1d layer, for which quantization is not yet implemented in PyTorch.
encoder.wav2vec2.model.encoder.layers: static quantization is not yet properly implemented for attention-based modules such as this submodule, which contains the transformer layers, so only dynamic quantization can be applied.
encoder.enc: a sequence of nn.Linear and nn.BatchNorm1d layers. Unfortunately, PyTorch does not allow BatchNorm layers to be statically quantized unless they follow a convolutional layer, so this submodule must be dynamically quantized.
encoder.ctc_lin: contains 1 nn.Linear layer and can be quantized dynamically or statically.
Note that we have only separated out the "major" submodules of the model; quantization can be applied at a finer granularity by using different quantization strategies on specific layers inside the submodules we singled out. (For example, we could statically quantize individual nn.Linear layers inside encoder.wav2vec2.model.encoder.layers, even though the submodule as a whole cannot be quantized that way.)
However, quantization carries overhead, since inputs must be quantized and outputs dequantized, so quantizing at too fine a granularity is not recommended. For example, statically quantizing several layers together requires only one quantization and one dequantization, whereas quantizing them separately means repeatedly dequantizing and re-quantizing as data flows from one layer to the next.
Given these quantization constraints and empirically collected data, for this model we will dynamically quantize encoder.wav2vec2.model.encoder.layers and encoder.enc, and statically quantize encoder.wav2vec2.model.feature_extractor and encoder.wav2vec2.model.feature_projection.
encoder.ctc_lin will not be quantized, because experiments showed that quantizing it has a large impact on WER (word error rate, a measure of accuracy).
Since submodules respond differently to different quantization methods, you may need to experiment with combinations of dynamic and static quantization to find what works best for your model.
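As a starting point for such experiments, one quick (illustrative, not from the original notebook) way to survey which layer types each candidate submodule contains is to count its module classes and compare them with the lists above.
# Count the layer types inside each candidate submodule to see which
# quantization methods are applicable.
from collections import Counter

candidates = [
    "encoder.wav2vec2.model.feature_extractor",
    "encoder.wav2vec2.model.feature_projection",
    "encoder.enc",
    "encoder.ctc_lin",
]
for name in candidates:
    submodule = asr_model.mods
    for attr in name.split("."):
        submodule = getattr(submodule, attr)
    counts = Counter(type(m).__name__ for m in submodule.modules())
    print(name, dict(counts))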
Data Download and Preprocessing
Download the LibriSpeech dev-clean dataset, which contains audio samples and their corresponding transcriptions. This is the dataset we will use to evaluate the model's performance before and after quantization. It was chosen because it is relatively small (a large dataset is not needed to evaluate the model's performance) and because it is "clean", i.e. free of background noise or audio artifacts that could unnecessarily interfere with the model's accuracy.
Some additional preprocessing is needed to convert the dataset into a format suitable for running our model and for comparing its output against the reference transcriptions. We want a list of audio-reference pairs so that the model's output on each audio sample can be compared with the correct reference transcription.
%%capture
!mkdir librispeech_dev_clean
!wget https://www.openslr.org/resources/12/dev-clean.tar.gz -P /content
!tar -xvf dev-clean.tar.gz -C librispeech_dev_clean
from speechbrain.dataio.dataio import read_audio
# Retrieve the downloaded speech data as a list of audio-reference pairs
def get_samples(root):
audios = []
references = []
for book in os.listdir(root):
for chapter in os.listdir(f"{root}/{book}"):
for file in os.listdir(f"{root}/{book}/{chapter}"):
if file.endswith("txt"):
with open(f"{root}/{book}/{chapter}/{file}", "r") as f:
for line in f.readlines():
audio_path, reference = line.split(" ", 1)
full_audio_path = f"{root}/{book}/{chapter}/{audio_path}.flac"
audios.append(read_audio(full_audio_path))
references.append(reference)
return audios, references
audios, references = get_samples("/content/librispeech_dev_clean/LibriSpeech/dev-clean")
assert len(audios) == len(references)
Quantization Setup
Helper Functions
Here we define get_module and set_module, helper functions for retrieving and setting submodules inside a module by providing a string. This is needed for localized quantization, i.e. replacing a single submodule with a quantized version without quantizing anything else.
These helpers build on the getattr and setattr functions but allow nested attributes, e.g.
module_string = "encoder.wav2vec2.model.feature_projection"
which makes it possible to retrieve and set nested submodules.
def get_module(model, module_string):
curr = model.mods
for attr in module_string.split("."):
if attr.isnumeric():
curr = curr[int(attr)]
else:
curr = getattr(curr, attr)
return curr
def set_module(model, module_string, new_module):
curr = model.mods
attrs = module_string.split(".")
for attr in attrs[:-1]:
if attr.isnumeric():
curr = curr[int(attr)]
else:
curr = getattr(curr, attr)
if attrs[-1].isnumeric():
curr[int(attrs[-1])] = new_module
else:
setattr(curr, attrs[-1], new_module)
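As a quick sanity check (not part of the original notebook), the helpers can be exercised on the loaded, not-yet-quantized model.
# Retrieve a nested submodule by its dotted path (relative to asr_model.mods)
projection = get_module(asr_model, "encoder.wav2vec2.model.feature_projection")
print(type(projection).__name__)  # expected: Wav2Vec2FeatureProjection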
Static Quantization Wrapper
Static quantization requires QuantStub and DeQuantStub modules to mark the boundaries between quantized and non-quantized modules, and to indicate where quantization observers should be placed for calibration.
During calibration, the quantization observers record the data ranges used to determine the scale factors and zero points for quantization, leading to better quantization results.
Furthermore, after static quantization, the QuantStub and DeQuantStub are converted into layers that quantize and dequantize the input tensors respectively, allowing quantized modules to interact smoothly with non-quantized ones.
Note that __getattr__ is overridden below so that attribute lookups are forwarded to the model wrapped inside.
In addition, the DeQuantStub must be able to handle tuples returned by the model, i.e. multiple return values, since DeQuantStub's forward function does not handle tuples on its own.
from torch.ao.quantization import QuantStub, DeQuantStub
class StaticQuant(nn.Module):
def __init__(self, model):
super().__init__()
self.quant = QuantStub()
self.model = model
self.dequant = DeQuantStub()
def __getattr__(self, name):
if name in self.__dict__:
return self.__dict__[name]
elif name in self.__dict__['_modules']:
return self.__dict__['_modules'][name]
else:
return getattr(self.__dict__['_modules']['model'], name)
def forward(self, x, *args, **kwargs):
x = self.quant(x)
x = self.model(x, *args, **kwargs)
if isinstance(x, tuple):
return tuple(self.dequant(output) for output in x)
else:
return self.dequant(x)
Quantization Function
This is a custom quantization function that allows submodules to be quantized both dynamically and statically. It also provides extra flexibility, allowing hyperparameters such as the quantization resolution and other quantization configurations to be applied. This makes it simpler to apply a combination of quantization strategies to our model.
See the docstring for details.
def custom_quantize(
model,
dynamic_modules=None,
static_modules=None,
calibration_samples=None,
dynamic_targets=None,
dynamic_dtype=torch.qint8,
static_qconfig=torch.ao.quantization.default_qconfig,
):
"""Performs in-place quantization of an ASR model
The quantization is customizable. A combination of dynamic and static
quantization can be performed on specific submodules that are passed into
this function.
Names of submodules passed into this class are implicitly assumed to be
nested fields of ``model.mods``. For example, the ``model.mods.encoder.enc``
submodule should be passed in as ``encoder.enc``.
Reference https://pytorch.org/docs/stable/quantization.html for
what torch modules can and cannot be dynamically/statically quantized.
Arguments
---------
model : torch.nn.Module
Model to be quantized.
dynamic_modules : list[str]
Names of the submodules to be dynamically quantized. They should be
formatted as stated above.
static_modules : list[str]
Names of the submodules to be statically quantized. They should be
formatted as stated above.
calibration_samples : list[torch.Tensor]
Sample inputs used for calibration during static quantization.
dynamic_targets : set[torch.nn.Module]
Torch modules to be quantized during dynamic quantization.
dynamic_dtype : torch.dtype
The torch datatype that values will be converted to during dynamic
quantization. This should be a quantized datatype, such as
``torch.quint8``, ``torch.qint8``, ``torch.qint32``
static_qconfig : torch.ao.quantization.qconfig.QConfig
The quantization config for static quantization, which, among other
things, specifies the observer modules that will be inserted
and the resolution of quantization.
Returns
-------
None
"""
##################################################
# Dynamic Quantization #
##################################################
if dynamic_modules is not None and len(dynamic_modules) > 0:
if dynamic_targets is None:
dynamic_targets = {
torch.nn.LSTM,
torch.nn.GRU,
torch.nn.RNNCell,
torch.nn.GRUCell,
torch.nn.LSTMCell,
torch.nn.Linear
}
for module in dynamic_modules:
torch.quantization.quantize_dynamic(
get_module(model, module),
dynamic_targets,
dtype=dynamic_dtype,
inplace=True,
)
##################################################
# Static Quantization #
##################################################
if static_modules is not None and len(static_modules) > 0:
if calibration_samples is None or len(calibration_samples) == 0:
raise Exception("No calibration samples provided for static quantization.")
for module in static_modules:
set_module(
model,
module,
StaticQuant(get_module(model, module)),
)
get_module(model, module).qconfig = static_qconfig
torch.ao.quantization.prepare(model, inplace=True)
for sample in calibration_samples:
model.transcribe_batch(sample.unsqueeze(0), torch.tensor([1.0]))
torch.ao.quantization.convert(model, inplace=True)
Benchmarking Setup
We will focus on two key ASR performance metrics: real-time factor (RTF) and word error rate (WER).
RTF is the ratio of total inference time to the total length of the input audio. This matters because an RTF below 1 means inference takes less time than playing back the audio, which can allow real-time speech recognition (excluding other sources of latency).
WER is the ratio of the number of word-level errors (substitutions, deletions, insertions) produced by the model to the number of words in the reference text.
Together, these two metrics let us evaluate the model's latency and accuracy before and after quantization.
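As a toy illustration of the RTF definition (made-up numbers, not measurements from this tutorial):
# 60 seconds of compute spent on 100 seconds of audio gives RTF = 0.6 < 1,
# i.e. faster than real time.
total_inference_seconds = 60.0
total_audio_seconds = 100.0
print(total_inference_seconds / total_audio_seconds)  # 0.6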
WER
The Levenshtein distance (or edit distance) is at the core of the WER metric. It measures the number of substitutions, deletions, and/or insertions needed to transform one string into another, and can be computed with a dynamic programming approach.
The main difference between the Levenshtein distance and WER is that the former treats strings at the character level, while the latter considers substitutions/deletions/insertions of whole words.
SpeechBrain provides helper functions for measuring WER and other related metrics.
from speechbrain.utils.edit_distance import accumulatable_wer_stats
def compute_wer(references, hypotheses):
if isinstance(references, str):
references = [references.split()]
else:
references = [ref.split() for ref in references]
if isinstance(hypotheses, str):
hypotheses = [hypotheses.split()]
else:
hypotheses = [hyp.split() for hyp in hypotheses]
if len(references) != len(hypotheses):
raise Exception("Number of references is not equal to the number of hypotheses")
stats = accumulatable_wer_stats(references, hypotheses, Counter())
return stats['WER']
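A quick sanity check of compute_wer on a toy pair (not in the original notebook): one substituted word out of four reference words should give a WER of roughly 25%.
# "sat" -> "sit" is one substitution out of four reference words
print(compute_wer("the cat sat down", "the cat sit down"))  # expected: ~25.0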
Modifying EncoderASR transcribe_batch
We adapt the existing transcribe_batch method so that the encoder's forward pass can be timed.
Different ASR types have different transcribe_batch implementations, so small adjustments may be needed for your own model.
import functools
# Functions necessary for preprocessing the input and generating transcriptions
def preprocess_input(model: EncoderASR, input):
with torch.no_grad():
wavs = input.unsqueeze(0)
wav_lens = torch.tensor([1.0])
wavs = wavs.float()
wavs, wav_lens = wavs.to(model.device), wav_lens.to(model.device)
return wavs, wav_lens
def generate(model, predictions):
is_ctc_text_encoder_tokenizer = isinstance(
model.tokenizer, speechbrain.dataio.encoder.CTCTextEncoder
)
if isinstance(model.hparams.decoding_function, functools.partial):
if is_ctc_text_encoder_tokenizer:
predicted_words = [
"".join(model.tokenizer.decode_ndim(token_seq))
for token_seq in predictions
]
else:
predicted_words = [
model.tokenizer.decode_ids(token_seq)
for token_seq in predictions
]
else:
predicted_words = [hyp[0].text for hyp in predictions]
return predicted_words
Note that we only care about the change in inference time due to quantization, not the overhead of input preprocessing or word generation. That is why only the duration of the encoder's forward pass is recorded.
def timed_transcribe(model: EncoderASR, input):
with torch.no_grad():
wavs, wav_lens = preprocess_input(model, input)
start = time.time()
encoder_out = model.mods.encoder(wavs, wav_lens)
end = time.time()
duration = end - start
predictions = model.decoding_function(encoder_out, wav_lens)
predicted_words = generate(model, predictions)
return predicted_words[0], duration
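As an optional single-sample check (not in the original notebook), the helper can be run on one audio clip; the exact transcription and timing will depend on your hardware and the sample chosen.
# Transcribe the first sample and report how long the encoder forward pass took
text, duration = timed_transcribe(asr_model, audios[0])
print(text)
print(f"encoder forward time: {duration:.2f}s")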
Benchmarking Model Performance
Latency measurements tend to be unstable at first, so a warm-up phase is included to ensure a more accurate performance evaluation.
def benchmark(model, samples, references):
total_audio_length = sum([sample.shape[0] / 16000 for sample in samples])
total_cpu_time = 0
outputs = []
for sample in tqdm.tqdm(samples[:10], desc="warming up"):
timed_transcribe(model, sample)
for sample in tqdm.tqdm(samples, desc="evaluating"):
output, duration = timed_transcribe(model, sample)
outputs.append(output)
total_cpu_time += duration
wer = compute_wer(references, outputs)
rtf = total_cpu_time / total_audio_length
return wer, rtf
Quantization and Benchmarking
With the setup code for quantization and benchmarking in place, we can start benchmarking the model before and after quantization.
Selecting Data
To save time, select a subset of the audio data for benchmarking the models.
n = 100
audio_subset = audios[:n]
ref_subset = references[:n]
Original Model
# Deepcopy the original model to avoid propagating unwanted changes
original_model = deepcopy(asr_model)
original_model.eval()
wer, rtf = benchmark(original_model, audio_subset, ref_subset)
warming up: 100%|██████████| 10/10 [01:40<00:00, 10.01s/it]
evaluating: 100%|██████████| 100/100 [09:32<00:00, 5.73s/it]
print(f"Original Model\nWER(%): {wer}\nRTF: {rtf}")
Original Model
WER(%): 6.067291781577496
RTF: 0.7967449480673793
To avoid exceeding the session's RAM limit, delete the model after benchmarking.
del original_model
gc.collect()
0
Quantized Model
First, let's review the model architecture.
asr_model
EncoderASR(
(mods): ModuleDict(
(encoder): LengthsCapableSequential(
(wav2vec2): Wav2Vec2(
(model): Wav2Vec2Model(
(feature_extractor): Wav2Vec2FeatureEncoder(
(conv_layers): ModuleList(
(0): Wav2Vec2LayerNormConvLayer(
(conv): Conv1d(1, 512, kernel_size=(10,), stride=(5,))
(layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(activation): GELUActivation()
)
(1-4): 4 x Wav2Vec2LayerNormConvLayer(
(conv): Conv1d(512, 512, kernel_size=(3,), stride=(2,))
(layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(activation): GELUActivation()
)
(5-6): 2 x Wav2Vec2LayerNormConvLayer(
(conv): Conv1d(512, 512, kernel_size=(2,), stride=(2,))
(layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(activation): GELUActivation()
)
)
)
(feature_projection): Wav2Vec2FeatureProjection(
(layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(projection): Linear(in_features=512, out_features=1024, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(encoder): Wav2Vec2EncoderStableLayerNorm(
(pos_conv_embed): Wav2Vec2PositionalConvEmbedding(
(conv): ParametrizedConv1d(
1024, 1024, kernel_size=(128,), stride=(1,), padding=(64,), groups=16
(parametrizations): ModuleDict(
(weight): ParametrizationList(
(0): _WeightNorm()
)
)
)
(padding): Wav2Vec2SamePadLayer()
(activation): GELUActivation()
)
(layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
(layers): ModuleList(
(0-23): 24 x Wav2Vec2EncoderLayerStableLayerNorm(
(attention): Wav2Vec2Attention(
(k_proj): Linear(in_features=1024, out_features=1024, bias=True)
(v_proj): Linear(in_features=1024, out_features=1024, bias=True)
(q_proj): Linear(in_features=1024, out_features=1024, bias=True)
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(dropout): Dropout(p=0.1, inplace=False)
(layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(feed_forward): Wav2Vec2FeedForward(
(intermediate_dropout): Dropout(p=0.1, inplace=False)
(intermediate_dense): Linear(in_features=1024, out_features=4096, bias=True)
(intermediate_act_fn): GELUActivation()
(output_dense): Linear(in_features=4096, out_features=1024, bias=True)
(output_dropout): Dropout(p=0.1, inplace=False)
)
(final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
)
)
)
)
)
(enc): Sequential(
(linear1): Linear(
(w): Linear(in_features=1024, out_features=1024, bias=True)
)
(bn1): BatchNorm1d(
(norm): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(activation): LeakyReLU(negative_slope=0.01)
(drop): Dropout(p=0.15, inplace=False)
(linear2): Linear(
(w): Linear(in_features=1024, out_features=1024, bias=True)
)
(bn2): BatchNorm1d(
(norm): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(activation2): LeakyReLU(negative_slope=0.01)
(drop2): Dropout(p=0.15, inplace=False)
(linear3): Linear(
(w): Linear(in_features=1024, out_features=1024, bias=True)
)
(bn3): BatchNorm1d(
(norm): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(activation3): LeakyReLU(negative_slope=0.01)
)
(ctc_lin): Linear(
(w): Linear(in_features=1024, out_features=1000, bias=True)
)
(log_softmax): Softmax()
)
)
(decoding_function): CTCBeamSearcher()
)
As mentioned earlier, in this tutorial we apply dynamic quantization to the attention layers and the sequential linear layers, and static quantization to the other quantizable layers (excluding ctc_lin, which was empirically observed to respond poorly to quantization).
Recall that not all PyTorch layers can be quantized, and some can only be quantized dynamically or only statically, so there are constraints on which modules you can quantize and with which method.
For your own model, feel free to experiment to find what works best.
dynamic_modules = [
"encoder.wav2vec2.model.encoder.layers",
"encoder.enc"
]
static_modules = [
"encoder.wav2vec2.model.feature_projection",
"encoder.wav2vec2.model.feature_extractor",
]
Randomly select calibration samples for static quantization.
from operator import itemgetter
np.random.seed(1337)
indices = np.random.choice(len(audios), 10)
calibration_samples = list(itemgetter(*indices)(audios))
We now have everything needed to quantize the model.
# Deepcopy the original model to avoid propagating unwanted changes
quantized_model = deepcopy(asr_model)
custom_quantize(
model=quantized_model,
dynamic_modules=dynamic_modules,
static_modules=static_modules,
calibration_samples=calibration_samples,
)
Here is the quantized model. Notice that the specified submodules have been replaced by their quantized versions.
quantized_model
EncoderASR(
(mods): ModuleDict(
(encoder): LengthsCapableSequential(
(wav2vec2): Wav2Vec2(
(model): Wav2Vec2Model(
(feature_extractor): StaticQuant(
(quant): Quantize(scale=tensor([0.1671]), zero_point=tensor([60]), dtype=torch.quint8)
(model): Wav2Vec2FeatureEncoder(
(conv_layers): ModuleList(
(0): Wav2Vec2LayerNormConvLayer(
(conv): QuantizedConv1d(1, 512, kernel_size=(10,), stride=(5,), scale=0.23443543910980225, zero_point=67)
(layer_norm): QuantizedLayerNorm((512,), eps=1e-05, elementwise_affine=True)
(activation): GELUActivation()
)
(1): Wav2Vec2LayerNormConvLayer(
(conv): QuantizedConv1d(512, 512, kernel_size=(3,), stride=(2,), scale=0.8026854991912842, zero_point=62)
(layer_norm): QuantizedLayerNorm((512,), eps=1e-05, elementwise_affine=True)
(activation): GELUActivation()
)
(2): Wav2Vec2LayerNormConvLayer(
(conv): QuantizedConv1d(512, 512, kernel_size=(3,), stride=(2,), scale=1.169354796409607, zero_point=89)
(layer_norm): QuantizedLayerNorm((512,), eps=1e-05, elementwise_affine=True)
(activation): GELUActivation()
)
(3): Wav2Vec2LayerNormConvLayer(
(conv): QuantizedConv1d(512, 512, kernel_size=(3,), stride=(2,), scale=0.8424969911575317, zero_point=66)
(layer_norm): QuantizedLayerNorm((512,), eps=1e-05, elementwise_affine=True)
(activation): GELUActivation()
)
(4): Wav2Vec2LayerNormConvLayer(
(conv): QuantizedConv1d(512, 512, kernel_size=(3,), stride=(2,), scale=0.592667818069458, zero_point=54)
(layer_norm): QuantizedLayerNorm((512,), eps=1e-05, elementwise_affine=True)
(activation): GELUActivation()
)
(5): Wav2Vec2LayerNormConvLayer(
(conv): QuantizedConv1d(512, 512, kernel_size=(2,), stride=(2,), scale=0.4864558279514313, zero_point=68)
(layer_norm): QuantizedLayerNorm((512,), eps=1e-05, elementwise_affine=True)
(activation): GELUActivation()
)
(6): Wav2Vec2LayerNormConvLayer(
(conv): QuantizedConv1d(512, 512, kernel_size=(2,), stride=(2,), scale=0.4137037694454193, zero_point=41)
(layer_norm): QuantizedLayerNorm((512,), eps=1e-05, elementwise_affine=True)
(activation): GELUActivation()
)
)
)
(dequant): DeQuantize()
)
(feature_projection): StaticQuant(
(quant): Quantize(scale=tensor([0.0369]), zero_point=tensor([5]), dtype=torch.quint8)
(model): Wav2Vec2FeatureProjection(
(layer_norm): QuantizedLayerNorm((512,), eps=1e-05, elementwise_affine=True)
(projection): QuantizedLinear(in_features=512, out_features=1024, scale=0.7401247620582581, zero_point=64, qscheme=torch.per_tensor_affine)
(dropout): QuantizedDropout(p=0.1, inplace=False)
)
(dequant): DeQuantize()
)
(encoder): Wav2Vec2EncoderStableLayerNorm(
(pos_conv_embed): Wav2Vec2PositionalConvEmbedding(
(conv): ParametrizedConv1d(
1024, 1024, kernel_size=(128,), stride=(1,), padding=(64,), groups=16
(parametrizations): ModuleDict(
(weight): ParametrizationList(
(0): _WeightNorm()
)
)
)
(padding): Wav2Vec2SamePadLayer()
(activation): GELUActivation()
)
(layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
(layers): ModuleList(
(0-23): 24 x Wav2Vec2EncoderLayerStableLayerNorm(
(attention): Wav2Vec2Attention(
(k_proj): DynamicQuantizedLinear(in_features=1024, out_features=1024, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
(v_proj): DynamicQuantizedLinear(in_features=1024, out_features=1024, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
(q_proj): DynamicQuantizedLinear(in_features=1024, out_features=1024, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
(out_proj): DynamicQuantizedLinear(in_features=1024, out_features=1024, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
)
(dropout): Dropout(p=0.1, inplace=False)
(layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(feed_forward): Wav2Vec2FeedForward(
(intermediate_dropout): Dropout(p=0.1, inplace=False)
(intermediate_dense): DynamicQuantizedLinear(in_features=1024, out_features=4096, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
(intermediate_act_fn): GELUActivation()
(output_dense): DynamicQuantizedLinear(in_features=4096, out_features=1024, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
(output_dropout): Dropout(p=0.1, inplace=False)
)
(final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
)
)
)
)
)
(enc): Sequential(
(linear1): Linear(
(w): DynamicQuantizedLinear(in_features=1024, out_features=1024, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
)
(bn1): BatchNorm1d(
(norm): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(activation): LeakyReLU(negative_slope=0.01)
(drop): Dropout(p=0.15, inplace=False)
(linear2): Linear(
(w): DynamicQuantizedLinear(in_features=1024, out_features=1024, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
)
(bn2): BatchNorm1d(
(norm): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(activation2): LeakyReLU(negative_slope=0.01)
(drop2): Dropout(p=0.15, inplace=False)
(linear3): Linear(
(w): DynamicQuantizedLinear(in_features=1024, out_features=1024, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
)
(bn3): BatchNorm1d(
(norm): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(activation3): LeakyReLU(negative_slope=0.01)
)
(ctc_lin): Linear(
(w): Linear(in_features=1024, out_features=1000, bias=True)
)
(log_softmax): Softmax()
)
)
(decoding_function): CTCBeamSearcher()
)
Next, we benchmark the quantized model.
quantized_model.eval()
wer, rtf = benchmark(quantized_model, audio_subset, ref_subset)
warming up: 100%|██████████| 10/10 [01:16<00:00, 7.61s/it]
evaluating: 100%|██████████| 100/100 [07:12<00:00, 4.32s/it]
print(f"Quantized Model\nWER(%): {wer}\nRTF: {rtf}")
Quantized Model
WER(%): 7.335907335907336
RTF: 0.6004914075674289
We can observe a significant drop in RTF along with a reasonable increase in WER, which indicates that the quantization was successful.
Finally, if you need to run more quantization benchmarks with other models, you can delete this model to free up RAM.
del quantized_model
gc.collect()
4479
Citing SpeechBrain
If you use SpeechBrain in your research or business, please cite it using the following BibTeX entries:
@misc{speechbrainV1,
title={Open-Source Conversational AI with {SpeechBrain} 1.0},
author={Mirco Ravanelli and Titouan Parcollet and Adel Moumen and Sylvain de Langen and Cem Subakan and Peter Plantinga and Yingzhi Wang and Pooneh Mousavi and Luca Della Libera and Artem Ploujnikov and Francesco Paissan and Davide Borra and Salah Zaiem and Zeyu Zhao and Shucong Zhang and Georgios Karakasidis and Sung-Lin Yeh and Pierre Champion and Aku Rouhe and Rudolf Braun and Florian Mai and Juan Zuluaga-Gomez and Seyed Mahed Mousavi and Andreas Nautsch and Xuechen Liu and Sangeet Sagar and Jarod Duret and Salima Mdhaffar and Gaelle Laperriere and Mickael Rouvier and Renato De Mori and Yannick Esteve},
year={2024},
eprint={2407.00463},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2407.00463},
}
@misc{speechbrain,
title={{SpeechBrain}: A General-Purpose Speech Toolkit},
author={Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},
year={2021},
eprint={2106.04624},
archivePrefix={arXiv},
primaryClass={eess.AS},
note={arXiv:2106.04624}
}