LingPy

Pairwise Sequence Comparison (Pairwise)

class lingpy.compare.Pairwise(infile, merge_vowels=True, comment='#')

Basic class for dealing with the pairwise alignment of sequences.

Parameters :

infile : file

A file in psq-format.

merge_vowels : bool (default=True)

Indicates whether neighboring vowels should be merged into diphthongs or kept separate during the analysis.

comment : char (default='#')

The comment character which, inserted in the beginning of a line, prevents that line from being read.
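As a rough illustration of what these two options mean, the sketch below filters out comment lines and merges neighboring vowels while splitting a sequence into segments. It is not LingPy's implementation: the helper names `strip_comments` and `tokenize` and the small `VOWELS` inventory are assumptions made for this example only.

```python
# Assumption: a small illustrative vowel inventory, not LingPy's sound classes.
VOWELS = set("aeiouyɛɔəɪʊæɑ")


def strip_comments(lines, comment="#"):
    """Drop every line that begins with the comment character."""
    return [line for line in lines if not line.startswith(comment)]


def tokenize(sequence, merge_vowels=True):
    """Split a sequence into segments, one character per segment.

    With merge_vowels=True, a vowel that directly follows another vowel
    is appended to the previous segment instead of starting a new one.
    """
    tokens = []
    for char in sequence:
        prev_is_vowel = bool(tokens) and tokens[-1][-1] in VOWELS
        if merge_vowels and char in VOWELS and prev_is_vowel:
            tokens[-1] += char
        else:
            tokens.append(char)
    return tokens


merged = tokenize("kaiser")                        # vowels "a" and "i" form one segment
separate = tokenize("kaiser", merge_vowels=False)  # every character is its own segment
kept = strip_comments(["# a note", "German  waldemar"])
```

Real IPA tokenization also has to handle diacritics and multi-character segments, which this character-based sketch ignores.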

Notes

In order to read in data from text files, two different file formats can be used along with this class:

psq-format

The psq-format is a specific format for text files containing unaligned sequence pairs. Files in this format should have the extension psq.

The first line of a psq-file contains information regarding the dataset. The sequence pairs are given in triplets: the first line of a triplet holds a sequence identifier (the meaning, or orthographic information), and the second and third lines hold the two sequences, with the name of the taxon in the first column and the sequence in IPA format in the second column. Triplets are separated by one empty line. As an example, consider the file test.psq:

Harry Potter Testset
Woldemort in German and Russian
German  waldemar
Russian vladimir

Woldemort in English and Russian
English woldemort
Russian vladimir

Woldemort in English and German
English woldemort
German  waldemar
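
The layout described above (one header line, then triplets separated by empty lines) can be read with a few lines of plain Python. The following is a minimal sketch under stated assumptions, not the parser LingPy uses: the function name `parse_psq` and the tuple layout of its result are hypothetical, and the comment handling simply mirrors the `comment` parameter described earlier.

```python
SAMPLE = """\
Harry Potter Testset
Woldemort in German and Russian
German  waldemar
Russian vladimir

Woldemort in English and Russian
English woldemort
Russian vladimir

Woldemort in English and German
English woldemort
German  waldemar
"""


def parse_psq(text, comment="#"):
    """Return (header, pairs) for text in the psq layout described above.

    Each pair is a tuple (identifier, (taxon_a, taxon_b), (seq_a, seq_b)).
    """
    lines = [line.rstrip() for line in text.splitlines()
             if not line.startswith(comment)]
    if not lines:
        raise ValueError("empty psq input")
    header, body = lines[0], lines[1:]

    pairs, block = [], []
    # The trailing "" makes sure the last triplet is flushed as well.
    for line in body + [""]:
        if line:
            block.append(line)
            continue
        if not block:
            continue  # tolerate consecutive empty lines
        if len(block) != 3:
            raise ValueError("expected 3 lines per triplet, got %d" % len(block))
        identifier, first, second = block
        # First column: taxon name; remainder of the line: the sequence.
        taxon_a, seq_a = first.split(None, 1)
        taxon_b, seq_b = second.split(None, 1)
        pairs.append((identifier, (taxon_a, taxon_b), (seq_a, seq_b)))
        block = []
    return header, pairs


header, pairs = parse_psq(SAMPLE)
taxa = [pair[1] for pair in pairs]  # list of taxon tuples
seqs = [pair[2] for pair in pairs]  # list of sequence tuples
```

The `taxa` and `seqs` lists built at the end only imitate the "list of tuples" shape that the class attributes are documented to have.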

psa-format

The psa-format is a specific format for text files containing already aligned sequence pairs. Files in this format should have the extension psa.

The first line of a psa-file contains information regarding the dataset. The sequence pairs are given in triplets: the first line of a triplet holds a sequence identifier (the meaning, or orthographic information), and the second and third lines hold the aligned sequences, with the name of the taxon in the first column and all aligned segments in the following columns, separated by tabstops. Triplets are separated by one empty line. As an example, consider the file test.psa:

Harry Potter Testset
Woldemort in German and Russian
German.    w    a    l    -    d    e    m    a    r
Russian    v    -    l    a    d    i    m    i    r

Woldemort in English and Russian
English    w    o    l    -    d    e    m    o    r    t
Russian    v    -    l    a    d    i    m    i    r    -

Woldemort in English and German
English    w    o    l    d    e    m    o    r    t
German.    w    a    l    d    e    m    a    r    -
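
Reading the psa layout differs from psq mainly in that each sequence line is split on tabstops into a taxon name and a row of aligned segments. The sketch below is illustrative only: `parse_psa`, `ungap`, and the helper `row` are hypothetical names, and the gap symbol `-` is taken from the example above rather than from any documented constant.

```python
GAP = "-"  # assumption: gap symbol as it appears in the example file


def row(taxon, segments):
    """Build one tab-separated sequence line for the sample data."""
    return "\t".join([taxon] + segments.split())


SAMPLE = "\n".join([
    "Harry Potter Testset",
    "Woldemort in English and Russian",
    row("English", "w o l - d e m o r t"),
    row("Russian", "v - l a d i m i r -"),
])


def parse_psa(text):
    """Return (header, alignments) for text in the psa layout described above.

    Each alignment is (identifier, (taxon_a, taxon_b), (row_a, row_b)),
    where row_a and row_b are lists of aligned segments.
    """
    lines = [line.rstrip() for line in text.splitlines()]
    if not lines:
        raise ValueError("empty psa input")
    header, body = lines[0], lines[1:]

    alignments, block = [], []
    for line in body + [""]:  # trailing "" flushes the last triplet
        if line:
            block.append(line)
            continue
        if not block:
            continue
        if len(block) != 3:
            raise ValueError("expected 3 lines per triplet, got %d" % len(block))
        identifier, first, second = block
        taxon_a, *row_a = first.split("\t")
        taxon_b, *row_b = second.split("\t")
        # Both rows of an alignment must have the same number of columns.
        if len(row_a) != len(row_b):
            raise ValueError("aligned rows differ in length: %r" % identifier)
        alignments.append((identifier, (taxon_a, taxon_b), (row_a, row_b)))
        block = []
    return header, alignments


def ungap(segments):
    """Recover the unaligned sequence by dropping gap symbols."""
    return "".join(seg for seg in segments if seg != GAP)


header, alignments = parse_psa(SAMPLE)
identifier, taxa_pair, (english, russian) = alignments[0]
```

Dropping the gaps from an aligned row, as `ungap` does, gives back the sequence as it would appear in the corresponding psq file.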

Attributes

taxa : list

A list of tuples containing the taxa of all sequence pairs.

seqs : list

A list of tuples containing all sequence pairs.

tokens : list

A list of tuples containing all sequence pairs in a tokenized form.

Methods

align([model, mode, gop, gep_scale, scale, ...]) Align two sequences or a list of sequence pairs pairwise.
output([fileformat, filename]) Write the results of the analyses to a text file.