speechbrain.lm.arpa module

Tools for working with ARPA format N-gram models

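Expects the ARPA format to have:
- a \data\ header
- counts of ngrams in the order in which they are later listed
- line breaks between the \data\ and \N-grams: sections
- \end\

E.g.

```
\data\
ngram 1=2
ngram 2=1

\1-grams:
-1.0000 Hello -0.23
-0.6990 world -0.2553

\2-grams:
-0.2553 Hello world

\end\
```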
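
The layout described above can be checked with a few lines of plain Python. The sketch below is not part of speechbrain: `count_arpa_sections` is a hypothetical helper that assumes one n-gram per line, and it simply compares the counts declared in the `\data\` header with the number of entries listed in each `\N-grams:` section.

```python
import re

# The small model from the format description, written as a Python string.
ARPA_TEXT = """\\data\\
ngram 1=2
ngram 2=1

\\1-grams:
-1.0000 Hello -0.23
-0.6990 world -0.2553

\\2-grams:
-0.2553 Hello world

\\end\\
"""


def count_arpa_sections(text):
    """Return (declared, listed) n-gram counts per order for an ARPA string.

    Hypothetical helper for illustration only; it is not the parser used
    by speechbrain.lm.arpa.read_arpa.
    """
    declared, listed = {}, {}
    current_order = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line == "\\data\\":
            continue
        if line == "\\end\\":
            break
        header = re.fullmatch(r"ngram (\d+)=(\d+)", line)
        section = re.fullmatch(r"\\(\d+)-grams:", line)
        if header:
            # "ngram N=COUNT" line inside the \data\ header
            declared[int(header.group(1))] = int(header.group(2))
        elif section:
            # "\N-grams:" starts a new section
            current_order = int(section.group(1))
            listed[current_order] = 0
        elif current_order is not None:
            # Any other line inside a section is one n-gram entry
            listed[current_order] += 1
    return declared, listed


declared, listed = count_arpa_sections(ARPA_TEXT)
print(declared == listed)
```

A mismatch between the two dictionaries would point to a truncated or hand-edited file.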

Example

>>> # This example loads an ARPA model and queries it with BackoffNgramLM
>>> import io
>>> from speechbrain.lm.arpa import read_arpa
>>> from speechbrain.lm.ngram import BackoffNgramLM
>>> # First we'll put an ARPA format model in TextIO and load it:
>>> with io.StringIO() as f:
...     print("Anything can be here", file=f)
...     print("", file=f)
...     print("\\data\\", file=f)
...     print("ngram 1=2", file=f)
...     print("ngram 2=3", file=f)
...     print("", file=f)  # Ends data section
...     print("\\1-grams:", file=f)
...     print("-0.6931 a", file=f)
...     print("-0.6931 b 0.", file=f)
...     print("", file=f)  # Ends unigram section
...     print("\\2-grams:", file=f)
...     print("-0.6931 a a", file=f)
...     print("-0.6931 a b", file=f)
...     print("-0.6931 b a", file=f)
...     print("", file=f)  # Ends bigram section
...     print("\\end\\", file=f)  # Ends whole file
...     _ = f.seek(0)
...     num_grams, ngrams, backoffs = read_arpa(f)
>>> # The output of read arpa is already formatted right for the query class:
>>> lm = BackoffNgramLM(ngrams, backoffs)
>>> lm.logprob("a", context = tuple())
-0.6931
>>> # Query that requires a backoff:
>>> lm.logprob("b", context = ("b",))
-0.6931

Authors

Aku Rouhe 2020
Pierre Champion 2023

Summary

Functions

`arpa_to_fst`	Uses kaldilm to convert an ARPA language model to an FST.
`read_arpa`	Reads an ARPA format N-gram language model from a stream.

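Reference

speechbrain.lm.arpa.read_arpa(fstream)

Reads an ARPA format N-gram language model from a stream.

Parameters:

fstream (TextIO) – Text file stream (as commonly returned by open()) to read the model from.

Returns:

dict – Maps N-gram orders to the number of ngrams of that order. Essentially the \data\ section of an ARPA format file.
dict – The log probabilities (first column) in the ARPA file. This is a triply nested dict. The first layer is indexed by N-gram order (integer). The second layer is indexed by the context (tuple of tokens). The third layer is indexed by token and maps to the log probability. This format is compatible with speechbrain.lm.ngram.BackoffNgramLM. Example: in ARPA format, log(P(fox|a quick red)) = -5.3 is expressed as

-5.3 a quick red fox

To access that probability, use
ngrams_by_order[4][('a', 'quick', 'red')]['fox']
dict – The log backoff weights (last column) in the ARPA file. This is a doubly nested dict. The first layer is indexed by N-gram order (integer). The second layer is indexed by the backoff history (tuple of tokens), i.e. the context on which the probability distribution is conditioned. This maps to the log weight. This format is compatible with speechbrain.lm.ngram.BackoffNgramLM. Example: if log(P(fox|a quick red)) is not listed, we look up log(backoff(a quick red)) = -23.4, which in ARPA format is

<logp> a quick red -23.4

To access that value here, use
backoffs_by_order[3][('a', 'quick', 'red')]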
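
To make the two nested layouts concrete, here is a minimal sketch that builds them by hand and applies the usual back-off rule. The dictionaries mirror the two-token model from the module example. `backoff_logprob` is a hypothetical helper, not the implementation of speechbrain.lm.ngram.BackoffNgramLM. Treating a missing back-off weight as 0.0 is an assumption that follows the common ARPA convention.

```python
# Hand-built dictionaries in the layout described above (illustrative values).
ngrams_by_order = {
    1: {(): {"a": -0.6931, "b": -0.6931}},
    2: {("a",): {"a": -0.6931, "b": -0.6931}, ("b",): {"a": -0.6931}},
}
backoffs_by_order = {1: {("b",): 0.0}}


def backoff_logprob(token, context):
    """Look up log P(token | context), backing off to shorter contexts.

    Hypothetical helper sketching the standard back-off rule.
    """
    order = len(context) + 1
    listed = ngrams_by_order.get(order, {}).get(context, {})
    if token in listed:
        return listed[token]
    if not context:
        raise KeyError(f"token {token!r} is not in the model")
    # Assumption: a missing back-off weight counts as log(1) = 0.0.
    weight = backoffs_by_order.get(len(context), {}).get(context, 0.0)
    # Drop the oldest context token and retry with the shorter history.
    return weight + backoff_logprob(token, context[1:])


print(backoff_logprob("a", ()))      # unigram "a" is listed directly
print(backoff_logprob("b", ("b",)))  # bigram "b b" is missing, so back off
```

Note that the back-off weight is looked up under the order of the history itself (`len(context)`), while the probability is looked up under the order of the full n-gram (`len(context) + 1`).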

Raises:

ValueError – If no language model is found or the file is badly formatted.

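speechbrain.lm.arpa.arpa_to_fst(words_txt: str | Path, in_arpa: str | Path, out_fst: str | Path, ngram_order: int, disambig_symbol: str = '#0', cache: bool = True)

Uses kaldilm to convert an ARPA language model to an FST. For example, you could use speechbrain.lm.train_ngram to create an ARPA language model and then use this function to convert it to an FST.

Note that if the FST already exists at the output location, it is not converted again (so you may need to delete it by hand if you change your ARPA model at any point).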
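
The consequence of this caching is easy to miss, so here is a minimal sketch of the skip-if-present pattern. `convert_with_cache` and `fake_convert` are hypothetical stand-ins written for this illustration; they are an assumption about the behaviour described above, not the body of `arpa_to_fst`, and no kaldilm call is made.

```python
import tempfile
from pathlib import Path


def convert_with_cache(in_path, out_path, convert, cache=True):
    """Run convert(in_path, out_path) unless a cached output can be reused.

    Hypothetical helper; returns True if the conversion ran and False if
    it was skipped because the output file already existed.
    """
    out_path = Path(out_path)
    if cache and out_path.exists():
        return False
    convert(Path(in_path), out_path)
    return True


def fake_convert(src, dst):
    # Stand-in for the real conversion: copy the text in upper case.
    dst.write_text(src.read_text().upper())


with tempfile.TemporaryDirectory() as tmp:
    arpa = Path(tmp) / "model.arpa"
    fst = Path(tmp) / "model.fst.txt"
    arpa.write_text("first version")
    ran_first = convert_with_cache(arpa, fst, fake_convert)
    arpa.write_text("second version")  # the model changes ...
    ran_second = convert_with_cache(arpa, fst, fake_convert)
    stale_output = fst.read_text()  # ... but the cached output is reused
    fst.unlink()  # deleting the stale file forces a fresh conversion
    ran_third = convert_with_cache(arpa, fst, fake_convert)
    fresh_output = fst.read_text()

print(ran_first, ran_second, ran_third)
```

In this sketch the second call leaves the stale output in place, which is why the output file has to be removed after the input model changes.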

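Parameters:

words_txt (str | Path) – Path to the words.txt file created by prepare_lang.
in_arpa (str | Path) – Path to the ARPA language model to convert to an FST.
out_fst (str | Path) – Path where the FST will be saved.
ngram_order (int) – ngram order of the ARPA language model (and of the FST).
disambig_symbol (str) – The disambiguation symbol to use.
cache (bool) – Whether to reuse an already existing FST file rather than re-create it (the default, True, matches the skip-if-present behavior described above).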
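
Judging from the example at the end of this entry, words.txt holds one `symbol id` pair per line and includes the disambiguation symbol. The sketch below rests on that assumption. `read_symbol_table` is a hypothetical helper, not a speechbrain or kaldilm function, and it only checks that the chosen `disambig_symbol` appears in the table before a conversion is attempted.

```python
def read_symbol_table(text):
    """Parse 'symbol id' lines into a dict (hypothetical helper)."""
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        symbol, index = line.split()
        table[symbol] = int(index)
    return table


# Same content as the words.txt written in the example for arpa_to_fst.
WORDS_TXT = "a 1\nb 2\n<s> 3\n#0 4"

symbols = read_symbol_table(WORDS_TXT)
disambig_symbol = "#0"
has_disambig = disambig_symbol in symbols
print(symbols, has_disambig)
```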

Raises:

ImportError – If kaldilm is not installed.

Return type:

None

Example

>>> from speechbrain.lm.arpa import arpa_to_fst

>>> # Create a small arpa model
>>> # (getfixture is provided by pytest when it runs doctests)
>>> arpa_file = getfixture('tmpdir').join("bigram.arpa")
>>> arpa_file.write(
...     "Anything can be here\n"
...     + "\n"
...     + "\\data\\\n"
...     + "ngram 1=3\n"
...     + "ngram 2=4\n"
...     + "\n"
...     + "\\1-grams:\n"
...     + "0 <s>\n"
...     + "-0.6931 a\n"
...     + "-0.6931 b 0.\n"
...     + "\n"  # Ends unigram section
...     + "\\2-grams:\n"
...     + "-0.6931 <s> a\n"
...     + "-0.6931 a a\n"
...     + "-0.6931 a b\n"
...     + "-0.6931 b a\n"
...     + "\n"  # Ends bigram section
...     + "\\end\\\n")  # Ends whole file
>>> # Create words vocab
>>> vocab = getfixture('tmpdir').join("words.txt")
>>> vocab.write(
...     "a 1\n"
...     + "b 2\n"
...     + "<s> 3\n"
...     + "#0 4")  # Ends whole file
>>> out = getfixture('tmpdir').join("bigram.txt.fst")
>>> arpa_to_fst(vocab, arpa_file, out, 2)