seq : str
The input sequence that shall be tokenized.
diacritics : {str, None} (default=None)
A string containing all diacritics which shall be considered in the
respective analysis. When set to None, the default diacritic string
will be used.
vowels : {str, None} (default=None)
A string containing all vowel symbols which shall be considered in the
respective analysis. When set to None, the default vowel string will
be used.
tones : {str, None} (default=None)
A string containing all tone letter symbols which shall be considered
in the respective analysis. When set to None, the default tone string
will be used.
combiners : str (default=”͜͡”)
A string with characters that are used to combine two separate
characters (compare affricates such as t͡s).
breaks : str (default=”-.”)
A string containing the characters that force a new token to start
right after them. They can be used to keep two consecutive vowels from
being treated as a diphthong, or to mark diacritics that are placed
before the letter they modify.
merge_vowels : bool (default=True)
Indicates whether vowels should be merged into diphthongs (True), or
whether each vowel symbol should be considered separately (False).
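
The interplay of these parameters can be illustrated with a small, self-contained sketch. It is not the library's implementation: the function name `sketch_tokenize` and the short fallback strings used when an argument is None are assumptions made for illustration only, and the real default strings are likely far more extensive.

```python
# Hypothetical fallback inventories (assumed for this sketch only).
_DIACRITICS = "ʰʲʷː"
_VOWELS = "aeiouəɛɔ"
_TONES = "¹²³⁴⁵"


def sketch_tokenize(seq, diacritics=None, vowels=None, tones=None,
                    combiners="\u035c\u0361", breaks="-.", merge_vowels=True):
    """Simplified illustration of how the parameters could interact."""
    diacritics = _DIACRITICS if diacritics is None else diacritics
    vowels = _VOWELS if vowels is None else vowels
    tones = _TONES if tones is None else tones

    tokens = []
    start_new = True    # True at the start and right after a break character
    join_next = False   # True right after a combiner such as U+0361
    for char in seq:
        if char in breaks:
            # Breaks are dropped and force the next character into a new token.
            start_new, join_next = True, False
            continue
        if start_new:
            tokens.append(char)
            start_new = False
            continue
        last = tokens[-1]
        if join_next or all(c in diacritics for c in last):
            # Attach to a combiner, or to diacritics that precede their letter.
            tokens[-1] += char
            join_next = False
        elif char in combiners:
            tokens[-1] += char
            join_next = True
        elif char in diacritics:
            tokens[-1] += char
        elif merge_vowels and char in vowels and last[0] in vowels:
            tokens[-1] += char          # build a diphthong
        elif char in tones and last[0] in tones:
            tokens[-1] += char          # group consecutive tone letters
        else:
            tokens.append(char)
    return tokens


print(sketch_tokenize("t\u0361sai"))                      # affricate + diphthong
print(sketch_tokenize("t\u0361sai", merge_vowels=False))  # vowels kept apart
print(sketch_tokenize("t\u0361sa.i"))                     # break splits the vowels
print(sketch_tokenize("a.ʰta"))                           # preposed diacritic
```

In this sketch a break character is consumed rather than emitted, which is one possible reading of "a new token starts right after them"; whether the actual function keeps or drops break characters should be checked against its behavior.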