LingPy

Multiple Sequence Comparison (Multiple)

class lingpy.compare.Multiple(infile, merge_vowels=True, comment='#')

Basic class for carrying out multiple sequence alignment analyses.

Parameters :

infile : file

A file in msq-format or msa-format.

merge_vowels : bool (default=True)

Indicates whether neighboring vowels should be merged into diphthongs or kept separate during the analysis.

comment : char (default='#')

The comment character. A line that begins with this character is not read.

Notes

In order to read in data from text files, two different file formats can be used along with this class:

msq-format

The msq-format is a specific format for text files containing unaligned sequences. Files in this format should have the extension msq. The first line of an msq-file contains information regarding the dataset. The second line contains information regarding the sequence (meaning, identifier). The following lines contain the names of the taxa in the first column and the sequences in IPA format in the second column, separated by a tab. As an example, consider the file test.msq:

Harry Potter Testset
Woldemort (in different languages)
German  waldemar
English woldemort
Russian vladimir
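
The layout described above can be read with a few lines of plain Python. The following is a minimal sketch and not LingPy's own parser: the function name read_msq and the keys of the returned dictionary are assumptions made for illustration. Lines that begin with the comment character are skipped, in line with the description of the comment parameter.

```python
import io


def read_msq(handle, comment="#"):
    """Parse msq-formatted text from a file-like object (illustrative only)."""
    # Drop empty lines and lines starting with the comment character.
    lines = [
        line.rstrip("\n")
        for line in handle
        if line.strip() and not line.startswith(comment)
    ]
    # Line 1: dataset information, line 2: sequence information.
    dataset, seq_id = lines[0], lines[1]
    taxa, seqs = [], []
    for line in lines[2:]:
        # Taxon and sequence are separated by a tab.
        taxon, sequence = line.split("\t", 1)
        taxa.append(taxon.strip())
        seqs.append(sequence.strip())
    return {"dataset": dataset, "seq_id": seq_id, "taxa": taxa, "seqs": seqs}


# The content of test.msq as shown above, with tab-separated columns.
sample = io.StringIO(
    "Harry Potter Testset\n"
    "Woldemort (in different languages)\n"
    "German\twaldemar\n"
    "English\twoldemort\n"
    "Russian\tvladimir\n"
)
msq = read_msq(sample)
print(msq["taxa"])
print(msq["seqs"])
```
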

msa-format

The msa-format is a specific format for text files containing already aligned sequences. Files in this format should have the extension msa.

The first line of an msa-file contains information regarding the dataset. The second line contains information regarding the sequence (its meaning, the protoform corresponding to the cognate set, etc.). The aligned sequences are given in the following lines, with the taxa in the first column and the aligned segments in the following columns. Additionally, there may be a specific line indicating the presence of swaps and a specific line indicating highly consistent sites (local peaks) in the MSA. The line for swaps starts with the headword SWAPS; a plus character (+) marks the beginning of a swapped region, a dash character (-) its center, and another plus character its end. All sites which are not affected by swaps contain a dot (.). The line for local peaks starts with the headword LOCAL. All sites which are highly consistent are marked with an asterisk (*); all other sites are marked with a dot (.). As an example, consider the file test.msa:

Harry Potter Testset
Woldemort (in different languages)
English     w    o    l    -    d    e    m    o    r    t
German.     w    a    l    -    d    e    m    a    r    -
Russian     v    -    l    a    d    i    m    i    r    -
SWAPS..     .    +    -    +    .    .    .    .    .    .
LOCAL..     *    *    *    .    *    *    *    *    *    .
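
To make the roles of the SWAPS and LOCAL lines concrete, the following sketch parses the example above with plain Python. It is not LingPy's own reader: the function name parse_msa and the dictionary keys are hypothetical, columns are assumed to be separated by whitespace, and the trailing dots on names such as "German." are assumed to be padding.

```python
def parse_msa(text, comment="#"):
    """Parse msa-formatted text into a dictionary (illustrative only)."""
    # Drop empty lines and lines starting with the comment character.
    lines = [
        line for line in text.splitlines()
        if line.strip() and not line.startswith(comment)
    ]
    msa = {
        "dataset": lines[0],
        "seq_id": lines[1],
        "taxa": [],
        "alignment": [],
        "swaps": None,
        "local": None,
    }
    for line in lines[2:]:
        head, *cells = line.split()
        head = head.rstrip(".")  # assumption: trailing dots are padding
        if head == "SWAPS":
            msa["swaps"] = cells
        elif head == "LOCAL":
            msa["local"] = cells
        else:
            msa["taxa"].append(head)
            msa["alignment"].append(cells)
    return msa


sample = """
Harry Potter Testset
Woldemort (in different languages)
English     w    o    l    -    d    e    m    o    r    t
German.     w    a    l    -    d    e    m    a    r    -
Russian     v    -    l    a    d    i    m    i    r    -
SWAPS..     .    +    -    +    .    .    .    .    .    .
LOCAL..     *    *    *    .    *    *    *    *    *    .
"""
msa = parse_msa(sample)

# Column indices (0-based) that belong to a swapped region.
swapped = [i for i, cell in enumerate(msa["swaps"]) if cell != "."]
# Column indices (0-based) that are marked as local peaks.
peaks = [i for i, cell in enumerate(msa["local"]) if cell == "*"]
print(msa["taxa"], swapped, peaks)
```
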

Examples

Get the path to a file from the testset.

>>> from lingpy import *
>>> seq_file = get_file('test.seq')

Load the file into the Multiple class.

>>> mult = Multiple(seq_file)

Carry out a progressive alignment analysis of the sequences.

>>> mult.prog_align()

Print the result to the screen:

>>> print(mult)
w    o    l    -    d    e    m    o    r    t
w    a    l    -    d    e    m    a    r    -
v    -    l    a    d    i    m    i    r    -

Methods

get_pairwise_alignments([new_calc, model, ...]) Create a dictionary of all pairwise alignment scores.
get_peaks([gap_weight]) Calculate the profile score for each column of the alignment.
get_pid([mode]) Return the Percentage Identity (PID) score of the calculated MSA.
ipa2cls([model]) Retrieve sound-class strings from aligned IPA sequences.
iterate_all_sequences([check, mode, gop, ...]) Iterative refinement based on a complete realignment of all sequences.
iterate_clusters(threshold[, check, mode, ...]) Iterative refinement based on a flat cluster analysis of the data.
iterate_orphans([check, mode, gop, ...]) Iterate over the most divergent sequences in the sample.
iterate_similar_gap_sites([check, mode, ...]) Iterative refinement based on the Similar Gap Sites heuristic.
lib_align([model, mode, modes, scale, ...]) Carry out a library-based progressive alignment analysis of the sequences.
output([fileformat, filename, sorted_seqs, ...]) Write data to file.
prog_align([model, mode, gop, gep_scale, ...]) Carry out a progressive alignment analysis of the input sequences.
sum_of_pairs([alm_matrix, mat, gap_weight]) Calculate the sum-of-pairs score for a given alignment analysis.
swap_check([swap_penalty, score_mode]) Check for possibly swapped sites in the alignment.
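
As a rough illustration of what a percentage identity score measures, the sketch below computes pairwise identity for the alignment printed in the example above. It is a stand-alone approximation under one common definition (identical segments divided by the number of columns in which at least one sequence has no gap) and is not claimed to match what get_pid computes in any of its modes; the helper name pairwise_identity is hypothetical.

```python
from itertools import combinations


def pairwise_identity(a, b, gap="-"):
    """Share of identical segments among columns not gapped in both rows."""
    matches = 0
    compared = 0
    for x, y in zip(a, b):
        if x == gap and y == gap:
            continue  # columns gapped in both sequences are ignored
        compared += 1
        if x == y:
            matches += 1
    return matches / compared if compared else 0.0


# Rows of the example alignment: English, German, Russian.
alignment = [
    list("wol-demort"),
    list("wal-demar-"),
    list("v-ladimir-"),
]

# One score per sequence pair, in the order produced by combinations().
scores = [pairwise_identity(a, b) for a, b in combinations(alignment, 2)]
mean_identity = sum(scores) / len(scores)
print(scores, mean_identity)
```
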