Basic class for dealing with the pairwise alignment of sequences.
Parameters
infile : file
merge_vowels : bool (default=True)
comment : char (default='#')
Notes
In order to read in data from text files, two different file formats can be used along with this class:
The psq-format is a specific format for text files containing unaligned sequence pairs. Files in this format should have the extension psq.
The first line of a psq-file contains information regarding the dataset. The sequence pairs are given in triplets: the first line of a triplet holds a sequence identifier (the meaning, or orthographical information), and the second and third lines hold the two sequences. In each sequence line, the first column contains the name of the taxon and the second column the sequence in IPA format. Triplets are separated from each other by one empty line. As an example, consider the file test.psq:
Harry Potter Testset
Woldemort in German and Russian
German waldemar
Russian vladimir

Woldemort in English and Russian
English woldemort
Russian vladimir

Woldemort in English and German
English woldemort
German waldemar
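The psq layout described above can be read with a few lines of plain Python. The following is a minimal sketch, not the reader used by this class: the function name parse_psq and the returned dictionary keys are hypothetical, and it assumes that lines starting with the comment character are skipped and that taxon and sequence are separated by whitespace.

```python
def parse_psq(text, comment="#"):
    """Return (header, pairs) for psq-formatted text (illustrative sketch)."""
    # Assumption: lines starting with the comment character are ignored.
    lines = [ln.rstrip() for ln in text.splitlines()
             if not ln.startswith(comment)]
    # Drop leading blank lines so the first remaining line is the header.
    while lines and not lines[0]:
        lines.pop(0)
    if not lines:
        raise ValueError("empty psq input")
    header, body = lines[0], lines[1:]

    # Group the remaining lines into blocks separated by empty lines.
    blocks, current = [], []
    for ln in body:
        if ln:
            current.append(ln)
        elif current:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)

    pairs = []
    for block in blocks:
        if len(block) != 3:
            raise ValueError("expected a triplet, got %d lines" % len(block))
        ident, first, second = block
        # First column: taxon name; second column: the sequence.
        taxon_a, seq_a = first.split(None, 1)
        taxon_b, seq_b = second.split(None, 1)
        pairs.append({"id": ident,
                      "taxa": (taxon_a, taxon_b),
                      "seqs": (seq_a, seq_b)})
    return header, pairs


example = (
    "Harry Potter Testset\n"
    "Woldemort in German and Russian\n"
    "German\twaldemar\n"
    "Russian\tvladimir\n"
    "\n"
    "Woldemort in English and Russian\n"
    "English\twoldemort\n"
    "Russian\tvladimir\n"
)
header, pairs = parse_psq(example)
for pair in pairs:
    print(pair["id"], pair["taxa"], pair["seqs"])
```

The tuples collected under "taxa" and "seqs" mirror the shape of the taxa and seqs attributes described for this class, although the class itself may build them differently.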
The psa-format is a specific format for text files containing already aligned sequence pairs. Files in this format should have the extension psa.
The first line of a psa-file contains information regarding the dataset. The sequence pairs are given in triplets: the first line of a triplet holds a sequence identifier (the meaning, or orthographical information), and the second and third lines hold the aligned sequences, with the name of the taxon in the first column and all aligned segments in the following columns, separated by tabstops. Triplets are separated from each other by one empty line. As an example, consider the file test.psa:
Harry Potter Testset
Woldemort in German and Russian
German. w a l - d e m a r
Russian v - l a d i m i r

Woldemort in English and Russian
English w o l - d e m o r t
Russian v - l a d i m i r -

Woldemort in English and German
English w o l d e m o r t
German. w a l d e m a r -
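To illustrate how a single psa triplet breaks down into taxa and aligned segments, here is a small sketch in plain Python. The helper names parse_psa_triplet and strip_gaps are hypothetical and not part of this class; the sketch assumes tab-separated columns, as described above, and treats "-" as the gap symbol used in the example file.

```python
def parse_psa_triplet(block):
    """Split one psa triplet into identifier, taxa and aligned segments."""
    ident, first, second = block
    # First column is the taxon name, the remaining columns are segments.
    taxon_a, *alm_a = first.split("\t")
    taxon_b, *alm_b = second.split("\t")
    # Two aligned sequences must have the same number of columns.
    if len(alm_a) != len(alm_b):
        raise ValueError("aligned sequences differ in length")
    return ident, (taxon_a, taxon_b), (alm_a, alm_b)


def strip_gaps(alignment, gap="-"):
    """Recover the unaligned segments by removing gap symbols."""
    return [seg for seg in alignment if seg != gap]


triplet = [
    "Woldemort in German and Russian",
    "German.\tw\ta\tl\t-\td\te\tm\ta\tr",
    "Russian\tv\t-\tl\ta\td\ti\tm\ti\tr",
]
ident, taxa, alignments = parse_psa_triplet(triplet)
print(ident, taxa)
print(strip_gaps(alignments[0]), strip_gaps(alignments[1]))
```

Checking that both rows have the same number of columns is a cheap way to catch lines where a tabstop was lost or replaced by spaces.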
Attributes
taxa | list | A list of tuples containing the taxa of all sequence pairs. |
seqs | list | A list of tuples containing all sequence pairs. |
tokens | list | A list of tuples containing all sequence pairs in a tokenized form. |
Methods
align([model, mode, gop, gep_scale, scale, ...]) | Align two sequences or a list of sequence pairs pairwise. |
output([fileformat, filename]) | Write the results of the analyses to a text file. |