PHONETISAURUS(1)

User Commands

NAME

phonetisaurus-align - Dictionary aligner

SYNOPSIS

phonetisaurus-align --input=dictionary.bsf --ofile=training.corpus [OPTIONS]

DESCRIPTION

phonetisaurus-align reads an input dictionary and produces an aligned corpus that can be used to train a model for Grapheme-to-Phoneme conversion.
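When the aligner is driven from a script, the command line in the SYNOPSIS can be assembled programmatically. The following is a minimal sketch in Python, not part of phonetisaurus itself: the helper names build_align_command and run_align are hypothetical, and only flags documented on this page are passed. The file names are the placeholders from the SYNOPSIS.

```python
import shutil
import subprocess


def build_align_command(input_path, output_path, **options):
    """Build the argument list for phonetisaurus-align (hypothetical helper).

    Extra keyword arguments are turned into --name=value flags; booleans
    are written as true/false, matching the <bool> style used on this page.
    """
    cmd = [
        "phonetisaurus-align",
        f"--input={input_path}",
        f"--ofile={output_path}",
    ]
    for name, value in options.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        cmd.append(f"--{name}={value}")
    return cmd


def run_align(cmd):
    """Run the command, but only if the binary is actually installed."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} not found on PATH")
    subprocess.run(cmd, check=True)


# Assemble (but do not run) a command using the documented defaults.
cmd = build_align_command(
    "dictionary.bsf", "training.corpus", seq1_max=2, seq2_max=2, iter=11
)
print(" ".join(cmd))
```

Passing the arguments as a list avoids shell quoting problems when a separator option contains characters such as "|" or "}".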

INPUT FORMAT

The input is a two-column plain-text file. The first column contains a grapheme sequence (e.g., the orthographic form of a word). The second column contains the corresponding phoneme sequence.

By default, the two columns are separated by a TAB character; the --delim option changes this separator. Each character of the first column is treated as one grapheme unless a grapheme delimiter is given with --s1_char_delim. Phonemes in the second column are separated by spaces; the --s2_char_delim option changes this delimiter.

Input example:

ABBREVIATE AH B R IY V IY EY T
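To make the column and separator rules concrete, here is a minimal sketch in Python of how one input line could be split into graphemes and phonemes. It illustrates the format only and is not the tool's own parser; the function name parse_entry and its parameter names are hypothetical, and the defaults mirror the ones described above (TAB between columns, one grapheme per character, space-separated phonemes).

```python
def parse_entry(line, delim="\t", grapheme_sep="", phoneme_sep=" "):
    """Split one dictionary line into (graphemes, phonemes).

    An empty grapheme_sep means every character of the first column
    is treated as a single grapheme.
    """
    word, pronunciation = line.rstrip("\n").split(delim, 1)
    if grapheme_sep == "":
        graphemes = list(word)
    else:
        graphemes = word.split(grapheme_sep)
    phonemes = pronunciation.split(phoneme_sep)
    return graphemes, phonemes


# The example entry from this page, with an explicit TAB between columns.
graphemes, phonemes = parse_entry("ABBREVIATE\tAH B R IY V IY EY T")
print(graphemes)
print(phonemes)
```

Note that the grapheme and phoneme sequences generally differ in length (10 graphemes and 8 phonemes here), which is why an alignment step is needed before training.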

OPTIONS

--help=<bool> (default: false)

: show usage information

--helpshort=<bool> (default: false)

: show brief usage information

--tmpdir=<string> (default: "/tmp/")

: temporary directory

--v=<int32> (default: 0)

: verbose level

--fst_align=<bool> (default: false)

: Write FST data aligned where appropriate

--fst_default_cache_gc=<bool> (default: true)

: Enable garbage collection of cache

--fst_default_cache_gc_limit=<int64> (default: 1048576)

: Cache byte size that triggers garbage collection

--fst_verify_properties=<bool> (default: false)

: Verify fst properties queried by TestProperties

--fst_weight_parentheses=<string> (default: "")

: Characters enclosing the first weight of a printed composite weight (e.g. pair weight, tuple weight and derived classes) to ensure proper I/O of nested composite weights; must have size 0 (none) or 2 (open and close parenthesis)

--fst_weight_separator=<string> (default: "")

: Character separator between printed composite weights; must be a single character

--save_relabel_ipairs=<string> (default: "")

: Save input relabel pairs to file

--save_relabel_opairs=<string> (default: "")

: Save output relabel pairs to file

--delim=<string> (default: "\t")

: Delimiter used to separate input and output tokens.

--eps=<string> (default: "<eps>")

: Epsilon symbol.

--fb=<bool> (default: false)

: Use forward-backward pruning for the alignment lattices.

--input=<string> (default: "")

: Two-column input file to align.

--iter=<int32> (default: 11)

: Maximum number of EM iterations to perform.

--lattice=<bool> (default: false)

: Write out the alignment lattices as an fst archive (.far).

--model=<bool> (default: true)

: Load a pre-trained model for use.

--mbr=<bool> (default: false)

: Use the LMBR decoder (not yet implemented).

--model_file=<string> (default: "")

: FST-format alignment model to load.

--nbest=<int32> (default: 1)

: Output the N-best alignments given the model.

--ofile=<string> (default: "")

: Output file to write the aligned dictionary to.

--penalize=<bool> (default: true)

: Penalize scores.

--penalize_em=<bool> (default: false)

: Penalize links during EM training.

--pthresh=<double> (default: -99)

: Pruning threshold. Use to prune unlikely N-best candidates when using multiple alignments.

--restrict=<bool> (default: true)

: Restrict links to M-1, 1-N during initialization.

--s1_char_delim=<string> (default: "")

: Sequence one input delimiter.

--s1s2_sep=<string> (default: "}")

: Token used to separate input-output subsequences in the g2p model.

--s2_char_delim=<string> (default: " ")

: Sequence two input delimiter.

--seq1_del=<bool> (default: true)

: Allow deletions in sequence one.

--seq1_max=<int32> (default: 2)

: Maximum subsequence length for sequence one.

--seq1_sep=<string> (default: "|")

: Multi-token separator for input tokens.

--seq2_del=<bool> (default: true)

: Allow deletions in sequence two.

--seq2_max=<int32> (default: 2)

: Maximum subsequence length for sequence two.

--seq2_sep=<string> (default: "|")

: Multi-token separator for output tokens.

--skip=<string> (default: "_")

: Skip token used to represent null transitions. Distinct from epsilon.

--thresh=<double> (default: 1e-10)

: Delta threshold for EM training termination.

--write_model=<string> (default: "")

: Write out the alignment model in OpenFst format to filename.

--fst_compat_symbols=<bool> (default: true)

: Require symbol tables to match when appropriate

--fst_field_separator=<string> (default: " ")

: Set of characters used as a separator between printed fields

--fst_error_fatal=<bool> (default: true)

: FST errors are fatal; otherwise, return objects flagged as bad: e.g., FSTs with the kError property set to true, FST weights that are not a Member()
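The separator options above (--s1s2_sep, --seq1_sep, --seq2_sep and --skip) suggest how a token of the aligned corpus could be taken apart. The sketch below, in Python, assumes that each output line is a space-separated list of tokens of the form graphemes}phonemes, that multi-symbol chunks are joined with "|", and that "_" marks a skipped side. This layout is inferred from the option descriptions and is not spelled out on this page; the function name parse_aligned_line and the sample line are made up for illustration.

```python
def parse_aligned_line(line, s1s2_sep="}", seq1_sep="|", seq2_sep="|", skip="_"):
    """Split an aligned line into (grapheme chunk, phoneme chunk) pairs.

    Skip tokens are dropped, so a skipped side becomes an empty list.
    """
    pairs = []
    for token in line.split():
        grapheme_part, phoneme_part = token.split(s1s2_sep, 1)
        graphemes = [s for s in grapheme_part.split(seq1_sep) if s != skip]
        phonemes = [s for s in phoneme_part.split(seq2_sep) if s != skip]
        pairs.append((graphemes, phonemes))
    return pairs


# Hypothetical aligned line: "PH" maps to one phoneme, the final "E" is silent.
sample = "P|H}F O}OW N}N E}_"
pairs = parse_aligned_line(sample)
for graphemes, phonemes in pairs:
    print(graphemes, "->", phonemes)
```

If the defaults are changed on the command line, the same values would need to be passed to the parser, since the separators are not recorded in the corpus itself under the assumption made here.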

February 2013

phonetisaurus 0.7.8

Source file:	phonetisaurus-align.1.en.gz (from phonetisaurus 0.7.8-6+b1)
Source last updated:	2016-06-04T17:05:17Z
Converted to HTML:	2022-09-07T21:03:03Z