Coverage for lingpy/basic/wordlist.py : 99%

# *-* coding: utf-8 *-*
""" This module provides a basic class for the handling of word lists. """
wl2dst, wl2dict, renumber, calculate_data, wl2qlc, tsv2triple, wl2multistate, coverage, iter_rows )
""" Basic class for the handling of multilingual word lists.
Parameters ---------- filename : { string, dict } The input file that contains the data. Otherwise a dictionary with consecutive integers as keys and lists as values with the key 0 specifying the header.
row : str (default = "concept") A string indicating the name of the row that shall be taken as the basis for the tabular representation of the word list.
col : str (default = "doculect") A string indicating the name of the column that shall be taken as the basis for the tabular representation of the word list.
conf : string (default='') A string defining the path to the configuration file (more information in the notes).
Notes ----- A word list is created from a dictionary containing the data. The idea is a three-dimensional representation of (linguistic) data. The first dimension is called **col** (*column*, usually "language"), the second one is called **row** (*row*, usually "concept"), the third is called **entry**, and in contrast to the first two dimensions, which have to consist of unique items, it contains flexible values, such as "ipa" (phonetic sequence), "cogid" (identifier for cognate sets), "tokens" (tokenized representation of phonetic sequences). The LingPy website offers some tutorials for word lists, which we recommend reading if you are looking for more information.
A couple of methods are provided along with the word list class in order to access the multi-dimensional input data. The main idea is to provide an easy way to access two-dimensional slices of the data by specifying which entry type should be returned. Thus, if a word list consists not only of simple orthographical entries but also of IPA-encoded phonetic transcriptions, both the orthographical source and the IPA transcriptions can be easily accessed as two separate two-dimensional lists.
""" self, filename, row, col, conf or util.data_path('conf', 'wordlist.rc'))
# setup other local temporary storage
# check for taxa in meta
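The dictionary input format and the idea of a two-dimensional slice per entry-type can be illustrated without LingPy itself. The following is only a minimal sketch in plain Python: the helper name `slice_by_entry` and the sample rows are hypothetical and are not part of the class above.

```python
# Input format as described in the docstring: consecutive integer keys,
# with key 0 holding the header. The sample rows are illustrative only.
data = {
    0: ["doculect", "concept", "ipa"],
    1: ["German", "hand", "hant"],
    2: ["English", "hand", "hænd"],
    3: ["German", "leg", "bain"],
}


def slice_by_entry(data, entry, row="concept", col="doculect"):
    """Return {row: {col: [values]}} for one entry-type (hypothetical helper)."""
    header = data[0]
    r, c, e = header.index(row), header.index(col), header.index(entry)
    table = {}
    for key in sorted(k for k in data if k != 0):
        line = data[key]
        # values are collected in lists, so synonyms can share one cell
        table.setdefault(line[r], {}).setdefault(line[c], []).append(line[e])
    return table


ipa_table = slice_by_entry(data, "ipa")
```

Here `ipa_table` maps each concept to the doculects that have a word for it, which is one way to picture the row-by-column slice for a single entry-type.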
self, entry, source, function, override=False, **keywords): """ Add new entry-types to the word list by modifying given ones.
Parameters ---------- entry : string A string specifying the name of the new entry-type to be added to the word list.
source : string A string specifying the basic entry-type that shall be modified. If multiple entry-types shall be used to create a new entry, they should be passed in a simple string separated by a comma.
function : function A function which is used to convert the source into the target value.
keywords : {dict} A dictionary of keywords that are passed as parameters to the function.
Notes ----- This method can be used to add new entry-types to the data by converting given ones. There are a lot of possibilities for adding new entries, but the most basic procedure is to use an existing entry-type and to modify it with help of a function.
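The basic procedure described in the notes (take an existing entry-type and convert it with a function) might look roughly like the sketch below. It works on the plain dictionary format rather than on a Wordlist object; the function `add_entry` and the way source values are passed to the converter as positional arguments are assumptions made for illustration.

```python
def add_entry(data, entry, source, function, **keywords):
    """Return a copy of `data` with a new column `entry` (hypothetical helper).

    `source` may name several columns separated by commas; their values are
    passed to `function` as positional arguments (an assumption of this sketch).
    """
    header = data[0]
    idxs = [header.index(s.strip()) for s in source.split(",")]
    new = {0: header + [entry]}
    for key, line in data.items():
        if key == 0:
            continue
        value = function(*[line[i] for i in idxs], **keywords)
        new[key] = line + [value]
    return new


data = {
    0: ["doculect", "concept", "ipa"],
    1: ["German", "hand", "hant"],
    2: ["German", "leg", "bain"],
}

# derive a naive character-by-character segmentation from the "ipa" column
with_tokens = add_entry(data, "tokens", "ipa", list)
```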
"""
self, col='', row='', entry='', **keywords): """ Function returns dictionaries of the cells matched by the indices.
Parameters ---------- col : string (default="") The column index evaluated by the method. It should contain one of the values in the columns of the :py:class:`~lingpy.basic.wordlist.Wordlist` instance, usually a taxon (language) name.
row : string (default="") The row index evaluated by the method. It should contain one of the values in the rows of the :py:class:`~lingpy.basic.wordlist.Wordlist` instance, usually a concept.
entry : string (default="") The index for the entry evaluated by the method. It can be used to specify the datatype of the rows or columns selected. As a default, the indices of the entries are returned.
Returns ------- entries : dict A dictionary of keys and values specifying the selected part of the data. Typically, this can be a dictionary of a given language with keys for the concept and values as specified in the "entry" keyword.
Notes ----- The "col" and "row" keywords in the function are all aliased according to the description in the ``wordlist.rc`` file. Thus, instead of using these attributes, the aliases can also be taken. For selecting a language, one may type something like::
>>> Wordlist.get_dict(language='LANGUAGE')
and for the selection of a concept, one may type something like::
>>> Wordlist.get_dict(concept='CONCEPT')
See the examples below for details.
Examples -------- Load the ``harry_potter.csv`` file::
>>> wl = Wordlist('harry_potter.csv')
Select all IPA-entries for the language "German"::
>>> wl.get_dict(language='German', entry='ipa')
{'Harry': ['haralt'], 'hand': ['hant'], 'leg': ['bain']}
Select all words (orthographical representation) for the concept "Harry"::
>>> wl.get_dict(concept="Harry", entry="words")
{'English': ['hæri'], 'German': ['haralt'], 'Russian': ['gari'], 'Ukrainian': ['gari']}
Note that the values of the dictionary that is returned are always lists, since it is possible that the original file contains synonyms (multiple words corresponding to the same concept).
See also -------- Wordlist.get_list Wordlist.add_entries
"""
for key, value in entries.items()}
for i in self._array[:, self.cols.index(col)] if i != 0]: for key, value in entries.items()}
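The selection logic documented for `get_dict` (choose either a language or a concept and receive a dictionary whose values are lists) can be sketched independently of the class. The function `get_dict_sketch` and its sample rows are hypothetical; the real method works on the internal array and resolves aliases from ``wordlist.rc``.

```python
# (doculect, concept, ipa) triples; sample data for illustration only
rows = [
    ("German", "Harry", "haralt"),
    ("German", "hand", "hant"),
    ("German", "leg", "bain"),
    ("English", "Harry", "hæri"),
]


def get_dict_sketch(rows, language=None, concept=None):
    """Select by exactly one of language or concept (hypothetical helper)."""
    if (language is None) == (concept is None):
        raise ValueError("specify exactly one of language or concept")
    out = {}
    for lang, conc, ipa in rows:
        if language is not None and lang == language:
            out.setdefault(conc, []).append(ipa)   # keyed by concept
        elif concept is not None and conc == concept:
            out.setdefault(lang, []).append(ipa)   # keyed by language
    return out


german = get_dict_sketch(rows, language="German")
harry = get_dict_sketch(rows, concept="Harry")
```

The values stay lists even for single words, mirroring the remark in the docstring that synonyms may occur.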
self, row='', col='', entry='', flat=False, **keywords): """ Function returns lists of rows and columns specified by their name.
Parameters ---------- row: string (default = '') The row name whose entries are selected from the data.
col : string (default = '') The column name whose entries are selected from the data.
entry: string (default = '') The entry-type which is selected from the data.
flat : bool (default = False) Specify whether the returned list should be one- or two-dimensional, or whether it should contain gaps or not.
Returns ------- data : list A list representing the selected part of the data.
Notes ----- The 'col' and 'row' keywords in the function are all aliased according to the description in the ``wordlist.rc`` file. Thus, instead of using these attributes, the aliases can also be taken. For selecting a language, one may type something like::
>>> Wordlist.get_list(language='LANGUAGE')
and for the selection of a concept, one may type something like::
>>> Wordlist.get_list(concept='CONCEPT')
See the examples below for details.
Examples -------- Load the ``harry_potter.csv`` file::
>>> wl = Wordlist('harry_potter.csv')
Select all IPA-entries for the language "German"::
>>> wl.get_list(language='German', entry='ipa')
['bain', 'hant', 'haralt']
Note that this function returns 0 for missing values (concepts that don't have a word in the given language). If one wants to avoid this, the 'flat' keyword should be set to *True*.
Select all words (orthographical representation) for the concept "Harry"::
>>> wl.get_list(concept="Harry", entry="words")
[['Harry', 'Harald', 'Гари', 'Гарi']]
Note that the values of the list that is returned are always two-dimensional lists, since it is possible that the original file contains synonyms (multiple words corresponding to the same concept). If one wants to have a flat representation of the entries, the 'flat' keyword should be set to *True*::
>>> wl.get_list(concept="Harry", entry="words", flat=True)
['hæri', 'haralt', 'gari', 'hari']
See also -------- Wordlist.get_dict Wordlist.add_entries
""" # if row is chosen # otherwise, start searching else: # first, get the row ids
# if only row is chosen, return the ids # check for flat representation else:
# if row and entry-type is chosen, return the entry-type else: # get the index for the entry in the data dictionary # get the entries else: # get the entries
# if column is chosen "The column {0} you selected is not available!".format(col)) else:
else: else:
else: else: "You should specify only one value for either row or for col!") else: col=keywords[key], entry=entry, flat=flat) row=keywords[key], entry=entry, flat=flat)
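The difference between the default two-dimensional result (with 0 marking gaps) and the `flat` result can be pictured with a small stand-alone sketch. The layout below, one inner list per synonym "layer", is an assumption made for illustration and is not taken from the method body; `get_list_sketch` and the placeholder words are hypothetical.

```python
TAXA = ["English", "German", "Russian"]

# words for one concept; English has a synonym, Russian has no word at all
CELLS = {"English": ["a1", "a2"], "German": ["b1"]}


def get_list_sketch(cells, taxa, flat=False):
    """Return the words of one concept as a 2D or flat list (hypothetical)."""
    if flat:
        # flat: drop the gaps and concatenate all words in taxon order
        return [w for t in taxa for w in cells.get(t, [])]
    depth = max((len(v) for v in cells.values()), default=0)
    # 2D: one inner list per synonym layer, 0 marks a missing value
    return [
        [cells[t][i] if i < len(cells.get(t, [])) else 0 for t in taxa]
        for i in range(depth)
    ]


two_d = get_list_sketch(CELLS, TAXA)
flat = get_list_sketch(CELLS, TAXA, flat=True)
```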
self, ref="cogid", entry='', modify_ref=False ): """ Return an etymological dictionary representation of the word list.
Parameters ---------- ref : string (default = "cogid") The reference entry which is used to store the cognate ids.
entry : string (default = '') The entry-type which shall be selected.
modify_ref : function (default=False) Use a function to modify the reference. If your cognate identifiers are numerical, for example, and negative values mark loans, but you want to suppress this behaviour, pass the built-in function ``abs`` for this keyword, and all cognate IDs will be converted to their absolute value.
Returns ------- etym_dict : dict An etymological dictionary representation of the data.
Notes ----- In contrast to the word-list representation of the data, an etymological dictionary representation sorts the counterparts according to the cognate sets of which they are reflexes. If more than one cognate ID is assigned to an entry, for example in cases of fuzzy cognate IDs or partial cognate IDs, the etymological dictionary will return one cognate set for each of the IDs.
"""
# create an etymdict object
# get the index for the cognate id
# iterate over all data # check if data is not a list or tuple, if this is the case, # make it a fake-list, so we can treat it just as all the other # instances of fuzzy cognates (output is the same, though) # we initialize with zero here, since this corresponds to a # missing entry in our data # assign new values for the current session # create the output # get the index of the header # retrieve the values else:
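The steps named in the comments above (treat a single cognate ID as a one-element list, initialize each cognate set with zeros for missing entries, optionally modify the reference) can be sketched as follows. This is only an approximation in plain Python; `etymdict_sketch` and its input triples are hypothetical.

```python
def etymdict_sketch(rows, taxa, modify_ref=None):
    """Group word IDs by cognate set, one slot per taxon (hypothetical).

    `rows` holds (word_id, taxon, cogid) triples; `cogid` may be a single
    value or a list/tuple of values (fuzzy or partial cognates).
    """
    etd = {}
    for idx, taxon, cogids in rows:
        if not isinstance(cogids, (list, tuple)):
            cogids = [cogids]  # "fake list", so all cases are handled alike
        for cogid in cogids:
            if modify_ref:
                cogid = modify_ref(cogid)
            # zero corresponds to a missing entry for that taxon
            slots = etd.setdefault(cogid, [0] * len(taxa))
            pos = taxa.index(taxon)
            if slots[pos] == 0:
                slots[pos] = [idx]
            else:
                slots[pos].append(idx)
    return etd


taxa = ["English", "German", "Russian"]
rows = [(1, "German", 1), (2, "English", 1), (3, "Russian", -2), (4, "German", [2, 3])]
etd = etymdict_sketch(rows, taxa, modify_ref=abs)
```

With `modify_ref=abs`, the negative ID -2 falls into the same set as 2, and word 4 appears in both set 2 and set 3.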
self, ref='cogid', entry='concept', missing=0, modify_ref=False ): """ Function returns a list of present-absent-patterns of a given word list.
Parameters ---------- ref : string (default = "cogid") The reference entry which is used to store the cognate ids. entry : string (default = "concept") The field which is used to check for missing data. missing : string,int (default = 0) The marker for missing items. """ modify_ref=modify_ref)
# retrieve the values
# check for missing data
# get the sum of the list in the wordlist of self
# append all languages which are zero to missing else:
else: else: else:
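A present-absent pattern distinguishes three cases per taxon and cognate set: a reflex is present, the concept is attested but the reflex is absent, or the concept is missing altogether. The sketch below illustrates that distinction under assumed input structures (`cogsets` and `attested` are hypothetical and do not mirror the internal data of the class).

```python
def paps_sketch(cogsets, attested, taxa, missing=0):
    """Return {cogid: pattern}, one value per taxon (hypothetical helper).

    cogsets:  {cogid: (concept, set of taxa having a reflex)}
    attested: {taxon: set of concepts for which the taxon has any word}
    """
    paps = {}
    for cogid, (concept, reflexes) in cogsets.items():
        pattern = []
        for taxon in taxa:
            if taxon in reflexes:
                pattern.append(1)          # present
            elif concept in attested[taxon]:
                pattern.append(0)          # absent, but concept is covered
            else:
                pattern.append(missing)    # no data for this concept
        paps[cogid] = pattern
    return paps


taxa = ["English", "German", "Russian"]
cogsets = {1: ("hand", {"English", "German"}), 2: ("leg", {"German"})}
attested = {
    "English": {"hand", "leg"},
    "German": {"hand", "leg"},
    "Russian": {"hand"},
}
paps = paps_sketch(cogsets, attested, taxa, missing="?")
```

With the default `missing=0`, absent and missing cells are indistinguishable, which is why the example passes an explicit marker.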
"""Iterate over the columns in a wordlist.
Parameters ---------- entries : list The names of the columns which shall be iterated.
Returns ------- iterator : iterator An iterator yielding lists in which the first entry is the ID of the wordlist row and the following entries are the content of the columns as specified.
Examples -------- Load a wordlist from LingPy's test data::
>>> from lingpy.tests.util import test_data
>>> from lingpy import Wordlist
>>> wl = Wordlist(test_data("KSL.qlc"))
>>> list(wl.iter_rows('ipa'))[:10]
[[1, 'ɟiθ'], [2, 'ɔl'], [3, 'tut'], [4, 'al'], [5, 'apa.u'], [6, 'ʔayɬʦo'], [7, 'bytyn'], [8, 'e'], [9, 'and'], [10, 'e']]
So as you can see, the function returns the key of the wordlist as well as the specified entry.
"""
self, data, taxa='taxa', concepts='concepts', ref='cogid', **keywords): """ Function calculates specific data.
Parameters ---------- data : str The type of data that shall be calculated. Currently supports
* "tree": calculate a reference tree based on shared cognates * "dst": get distances between taxa based on shared cognates * "cluster": cluster the taxa into groups using different methods
"""
self, source, target='', override=False): """ Renumber a given set of string identifiers by replacing the ids by integers.
Parameters ---------- source : str The source column to be manipulated.
target : str (default='') The name of the target column. If no name is chosen, the name of the target column is formed by adding "id" to the name of the source column.
override : bool (default=False) Force to overwrite the data if the target column already exists.
Notes ----- In addition to a new column, a further entry is added to the "_meta" attribute of the wordlist, by which the newly coined ids can be retrieved from the former string attributes. This attribute is called "source2target" and can be accessed either via the "_meta" dictionary or directly as an attribute of the wordlist.
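Renumbering string identifiers and keeping a lookup from the old strings to the new integers can be sketched in a few lines. The function `renumber_sketch` is hypothetical, and assigning integers in order of first appearance is an assumption; the method above may order the identifiers differently.

```python
def renumber_sketch(values):
    """Replace string identifiers by integers (hypothetical helper).

    Returns the list of integer ids and the string-to-integer mapping,
    which plays the role of the "source2target" lookup described above.
    """
    mapping = {}
    ids = []
    for value in values:
        if value not in mapping:
            # ids start at 1 and follow order of first appearance (assumption)
            mapping[value] = len(mapping) + 1
        ids.append(mapping[value])
    return ids, mapping


ids, source2target = renumber_sketch(["hand-a", "leg-a", "hand-a", "leg-b"])
```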
"""
""" Internal function that eases its modification by daughter classes. """ # check for stamp attribute
# add the default parameters, they will be checked against the keywords keywords, cols=False, distances=False, entries=("concept", "counterpart"), entry='concept', fileformat=fileformat, filename=rcParams['filename'], formatter='concept', modify_ref=False, meta=self._meta, missing=0, prettify='false', ignore='all', ref='cogid', rows=False, subset=False, # setup a subset of the data, taxa='taxa', threshold=0.6, # threshold for flat clustering tree_calc='neighbor')
ref=keywords['ref'], entry=keywords['entry'], missing=keywords['missing'])
# simple printing of taxa
# csv-output
# get the header line [s for s in set(self._alias.values()) if s in self._header], key=lambda x: self._header[x])
# get the data, in case a subset is chosen # write stuff to file
# check for chosen header # get indices for header else:
else:
# get the data
else:
# output dst-format (phylip) # check for distances as keyword
stamp=keywords['stamp'], taxlen=keywords.get('taxlen', 0))
# output tre-format (newick) # check for distances # we look up a function to calculate a tree in the cluster module: self._meta['distances'], self.cols, distances=keywords['distances']) else:
keywords['threshold'], self._meta['distances'], self.taxa)
# make lambda inline for data-check
str(i + 1) + '\t' + concept + '\t' + '\t'.join( [l(t) for t in line])) else: 'ID\tConcept\t' + '\t'.join( ['{0}\t COG'.format(t) for t in self.taxa])) self.get_list(row=concept, entry=keywords['entry'])): '{0}\t{1}'.format(l(a), b) for a, b in zip(line, cogs[j]))
keywords['filename'], lines, 'starling_' + keywords['entry'] + '.csv')
""" Write wordlist to file.
Parameters ---------- fileformat : {"tsv","tre","nwk","dst", "taxa", "starling", "paps.nex", "paps.csv"} The format that is written to file. This corresponds to the file extension, thus 'tsv' creates a file in extended tsv-format, 'dst' creates a file in Phylip-distance format, etc.
filename : str Specify the name of the output file (defaults to a filename that indicates the creation date).
subset : bool (default=False) If set to *True*, return only a subset of the data. The subset is specified by the keywords 'cols' and 'rows'.
cols : list If *subset* is set to *True*, specify the columns that shall be written to the csv-file.
rows : dict If *subset* is set to *True*, use a dictionary consisting of keys that specify a column and values that give a Python-statement in raw text, such as, e.g., "== 'hand'". The content of the specified column will then be checked against the statement passed in the dictionary, and if it evaluates to *True*, the respective row will be written to file.
ref : str Name of the column that contains the cognate IDs if 'starling' is chosen as an output format.
missing : { str, int } (default=0) If 'paps.nex' or 'paps.csv' is chosen as fileformat, this character will be inserted as an indicator of missing data.
tree_calc : {'neighbor', 'upgma'} If no tree has been calculated and 'tre' or 'nwk' is chosen as output format, the method that is used to calculate the tree.
threshold : float (default=0.6) The threshold that is used to carry out a flat cluster analysis if 'groups' or 'cluster' is chosen as output format.
ignore : { list, "all" (default='all')} Modifies the output format in "tsv" output and makes it possible to ignore certain blocks in extended "tsv", like "msa", "taxa", "json", etc., which should be passed as a list. If you choose "all" as a plain string and not a list, this will ignore all additional blocks and output only plain "tsv".
prettify : bool (default=False) Inserts comment characters between concepts in the "tsv" file output format, which makes it easier to see blocks of words denoting the same concept. Switching this off will output the file in plain "tsv".
See also -------- ~lingpy.compare.lexstat.LexStat.output ~lingpy.align.sca.Alignments.output
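The `rows` keyword pairs a column name with a Python statement in raw text such as "== 'hand'". One way such a check could work is sketched below; evaluating the cell's `repr` joined with the statement is an assumption of this sketch, not a description of the method body, and `filter_rows` is a hypothetical helper.

```python
def filter_rows(table, header, rows):
    """Keep the lines whose cells satisfy every statement in `rows`.

    Note: eval() executes arbitrary code, so the statements must come
    from a trusted source. This is an illustrative sketch only.
    """
    keep = []
    for line in table:
        checks = (
            eval(repr(line[header.index(col)]) + " " + statement)
            for col, statement in rows.items()
        )
        if all(checks):
            keep.append(line)
    return keep


header = ["doculect", "concept", "ipa"]
table = [
    ["German", "hand", "hant"],
    ["English", "hand", "hænd"],
    ["German", "leg", "bain"],
]
subset = filter_rows(
    table, header, {"concept": "== 'hand'", "doculect": "!= 'English'"}
)
```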
"""
self, fileformat, sections=None, entries=None, entry_sep='', item_sep='', template='', exclude=None, entry_start='', entry_close='', **keywords): """ Export a wordlist to various file formats. """ h1=('concept', '\n# Concept: {0}\n'), h2=('cogid', '## Cognate-ID: {0}\n')) h1=('concept', r'\section{{Concept: ``{0}"}}' + '\n'), h2=('cogid', r'\subsection{{Cognate Set: ``{0}"}}' + '\n')) h1=('concept', '<h1>Concept: {0}</h1>'), h2=('cogid', '<h2>Cognate Set: {0}</h2>'))
# get the temporary dictionary
# assign the output string
# iterate over the dictionary and start to fill the string # write key to file
# reassign tmp
# set the pointer and the index
# check for type of current point else: else: else:
self, fileformat, sections=None, entries=None, entry_sep='', item_sep='', template='', **keywords): """ Export the wordlist to specific fileformats.
Notes ----- The difference between export and output is that the latter mostly serves for internal purposes and formats, while the former serves for publication of data, using specific, nested statements to create, for example, HTML or LaTeX files from the wordlist data. """
fileformat, sections, entries, entry_sep, item_sep, template, **keywords)
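The nested section statements visible in the export code, for example `h1=('concept', '\n# Concept: {0}\n')` and `h2=('cogid', '## Cognate-ID: {0}\n')`, suggest output grouped first by concept and then by cognate set. A rough sketch of that nesting follows; `export_sketch`, the row dictionaries, and the meaning given to `template` are assumptions for illustration.

```python
SECTIONS = dict(
    h1=("concept", "\n# Concept: {0}\n"),
    h2=("cogid", "## Cognate-ID: {0}\n"),
)


def export_sketch(rows, sections, template="{doculect}\t{ipa}\n"):
    """Render rows under nested headings (hypothetical helper)."""
    levels = [sections[k] for k in sorted(sections)]  # h1 before h2
    out, last = [], {}
    ordered = sorted(rows, key=lambda r: [str(r[f]) for f, _ in levels])
    for row in ordered:
        changed = False
        for field, heading in levels:
            # a new heading is due if this level or any level above changed
            if changed or last.get(field) != row[field]:
                out.append(heading.format(row[field]))
                changed = True
            last[field] = row[field]
        out.append(template.format(**row))
    return "".join(out)


rows = [
    {"concept": "hand", "cogid": 1, "doculect": "German", "ipa": "hant"},
    {"concept": "hand", "cogid": 1, "doculect": "English", "ipa": "hænd"},
    {"concept": "leg", "cogid": 2, "doculect": "German", "ipa": "bain"},
]
text = export_sketch(rows, SECTIONS)
```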
""" Function determines the coverage of a wordlist. """
""" Load a wordlist from a normal CSV file.
Parameters ---------- path : str The path to your CSV file.
delimiter : str The delimiter in the CSV file.
quotechar : str The quote character in your data.
row : str (default = "concept") A string indicating the name of the row that shall be taken as the basis for the tabular representation of the word list.
col : str (default = "doculect") A string indicating the name of the column that shall be taken as the basis for the tabular representation of the word list.
conf : string (default='') A string defining the path to the configuration file.
Notes ----- This function returns a :py:class:`~lingpy.basic.wordlist.Wordlist` object. In contrast to the normal way of loading a wordlist from a tab-separated file, however, it lets you load a wordlist directly from any "normal" csv-file, with your own delimiters and quote characters. If the first cell in the first row of your CSV file is not named "ID", the integer identifiers required by LingPy will be created automatically.
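The ID handling described in the notes can be sketched with the standard `csv` module: if the first header cell is "ID", that column supplies the keys; otherwise consecutive integers are generated. The function `csv_to_dict` is hypothetical, and converting existing IDs to `int` is an assumption of this sketch.

```python
import csv
import io


def csv_to_dict(text, delimiter=",", quotechar='"'):
    """Turn CSV text into the {0: header, 1: row, ...} format (hypothetical)."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter, quotechar=quotechar)
    lines = list(reader)
    header, body = lines[0], lines[1:]
    D = {}
    if header[0] == "ID":
        # the file brings its own identifiers
        D[0] = header[1:]
        for row in body:
            D[int(row[0])] = row[1:]
    else:
        # create consecutive integer identifiers starting at 1
        D[0] = header
        for i, row in enumerate(body, start=1):
            D[i] = row
    return D


no_ids = csv_to_dict("doculect;concept;ipa\nGerman;hand;hant\n", delimiter=";")
with_ids = csv_to_dict("ID,doculect,concept\n7,German,hand\n")
```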
""" D[0] = header[1:] for row in data: D[row[0]] = [normalize(normalization_form, n) for n in row[1:]] else:
""" Load data from CLDF into a LingPy Wordlist object or similar.
Parameters ---------- path : str The path to the metadata-file of your CLDF dataset.
to : ~lingpy.basic.wordlist.Wordlist A ~lingpy.basic.wordlist.Wordlist object or one of the descendants (LexStat, Alignments).
Note ---- So far, this function does not offer full flexibility regarding the data you can input. However, it can read regular CLDF-formatted data into LingPy and thus allows you to use CLDF data in LingPy analyses.
Todo ---- Add support for partial cognates. """
# obtain the dictionaries to convert ids to values tbg.tabledict['languages.csv']} tbg.tabledict['parameters.csv']}
# create dictionary # check for numeric ID else: idx = i+1
'') or '' for f in ['form_in_source', 'Form', 'Segments', 'Comment', 'Source']] # add the header 'form', 'tokens', 'note', 'source']
# convert to wordlist (simplifies handling)
# add cognates if they are needed row['Alignment']) for row in tbg.tabledict['cognates.csv']}