nidaba.algorithms package

Submodules

nidaba.algorithms.otsu module

nidaba.algorithms.otsu

Module implementing variants of Otsu’s method.

nidaba.algorithms.otsu.otsu(im)

A naive pure-Python implementation of Otsu thresholding (or at least of the algorithm that Wikipedia describes as Otsu’s method).

Parameters: im (PIL.Image) – A PIL Image object in mode ‘L’ (8bpp grayscale)
Returns: PIL.Image in mode ‘1’ (1bpp b/w) containing the binarized image
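
The thresholding step can be illustrated on a plain 256-bin histogram. The following is a minimal sketch of the textbook formulation (maximizing between-class variance), not nidaba's own code; the names otsu_threshold and binarize are hypothetical, and the convention that pixels above the threshold become 1 is an assumption.

```python
def otsu_threshold(histogram):
    """Return the gray level t that maximizes between-class variance.

    histogram: sequence of pixel counts, one per gray level.
    Pixels with value <= t are treated as one class, the rest as the other.
    (Hypothetical helper for illustration; not part of nidaba.)
    """
    total = sum(histogram)
    sum_all = sum(level * count for level, count in enumerate(histogram))
    weight_low = 0      # number of pixels at or below the candidate level
    sum_low = 0         # sum of gray values at or below the candidate level
    best_t, best_var = 0, -1.0
    for t, count in enumerate(histogram):
        weight_low += count
        sum_low += t * count
        if weight_low == 0:
            continue            # no pixels in the lower class yet
        weight_high = total - weight_low
        if weight_high == 0:
            break               # all pixels are in the lower class
        mean_low = sum_low / weight_low
        mean_high = (sum_all - sum_low) / weight_high
        var_between = weight_low * weight_high * (mean_low - mean_high) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, threshold):
    """Map each gray value to 1 if it exceeds the threshold, else 0."""
    return [1 if p > threshold else 0 for p in pixels]
```

With Pillow, a histogram of this shape can be obtained from a mode ‘L’ image via im.histogram().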

nidaba.algorithms.string module

nidaba.algorithms.string

Implementation of various algorithms operating on strings and unicode objects, e.g. alignment, edit distances, and symmetric deletion searches.

nidaba.algorithms.string.compare_strings(u1, u2)
nidaba.algorithms.string.edit_distance(str1, str2, substitutionscore=1, insertscore=1, deletescore=1, charmatrix={}, alignment_type=u'global')
nidaba.algorithms.string.full_edit_distance(str1, str2, substitutionscore=1, insertscore=1, deletescore=1, ins_func=None, iargs=[], ikwargs={}, del_func=None, dargs=[], dkwargs={}, sub_func=None, sargs=[], skwargs={}, charmatrix={}, alignment_type=u'global')

A version of the modified Wagner-Fischer algorithm that accepts user-defined scoring functions. These functions should be of the form fname(token1, token2, *args, **kwargs) and return an integer >= 0. A return value of 0 indicates an optimal score; the larger the integer, the worse the score. Here charmatrix is used only to calculate the default delete and insert scores for the initial matrix.
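
To show how a user-supplied scoring function plugs into the recurrence, here is a minimal sketch of plain Wagner-Fischer with a pluggable substitution score. It is an independent illustration, not nidaba's implementation: the name edit_distance_sketch is hypothetical, and the simplified two-argument sub_func (without extra args or kwargs) is an assumption.

```python
def edit_distance_sketch(s1, s2, sub_func=None, ins_score=1, del_score=1):
    """Return the edit distance between s1 and s2.

    sub_func(a, b) must return an integer >= 0, where 0 means the two
    tokens match perfectly. Defaults to unit cost for any mismatch.
    """
    if sub_func is None:
        sub_func = lambda a, b: 0 if a == b else 1
    rows, cols = len(s1) + 1, len(s2) + 1
    matrix = [[0] * cols for _ in range(rows)]
    # First column: delete every character of s1.
    for i in range(1, rows):
        matrix[i][0] = matrix[i - 1][0] + del_score
    # First row: insert every character of s2.
    for j in range(1, cols):
        matrix[0][j] = matrix[0][j - 1] + ins_score
    for i in range(1, rows):
        for j in range(1, cols):
            matrix[i][j] = min(
                matrix[i - 1][j] + del_score,
                matrix[i][j - 1] + ins_score,
                matrix[i - 1][j - 1] + sub_func(s1[i - 1], s2[j - 1]),
            )
    return matrix[-1][-1]


def ignore_case(a, b):
    """Example scoring function: case differences cost nothing."""
    return 0 if a.lower() == b.lower() else 1
```

Passing ignore_case as sub_func makes "ABC" and "abc" score 0, whereas the default scoring gives 3.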

nidaba.algorithms.string.greek_chars()

Return a list containing all the characters from the Greek and Coptic, Greek Extended, and Combining Diacritical Marks unicode blocks.

nidaba.algorithms.string.greek_filter(string)

Remove all non-Greek characters from a string.

nidaba.algorithms.string.identify(string, unicode_blocks)

Determine the proportion of characters in the given string that belong to each given unicode block. Ranges may be user-defined and need not be official unicode ranges. It is assumed that ranges do not overlap. unicode_blocks is an iterable of 3-tuples of the form (<name of block>, <first unichar in block>, <last unichar in the block>).
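
The per-block counting described above might look roughly like the sketch below. The name identify_sketch and the return type (a dict mapping block name to a fraction between 0 and 1) are assumptions made for illustration; the real function's return format is not documented here.

```python
def identify_sketch(string, unicode_blocks):
    """Return {block name: fraction of characters falling in that block}.

    unicode_blocks: iterable of (name, first_char, last_char) tuples with
    inclusive, non-overlapping bounds.
    """
    blocks = list(unicode_blocks)
    counts = {name: 0 for name, _, _ in blocks}
    for ch in string:
        for name, first, last in blocks:
            # Single characters compare by code point.
            if first <= ch <= last:
                counts[name] += 1
                break  # ranges are assumed not to overlap
    total = len(string)
    return {name: (n / total if total else 0.0) for name, n in counts.items()}
```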

nidaba.algorithms.string.inblock(c, bounds)

Check that the character c is equal to or between the two bounding characters in the unicode table.

nidaba.algorithms.string.initmatrix(rows, columns, defaultval=0)

Initializes a 2d list to the desired dimensions.
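
A short sketch of why a helper like this is useful: building the rows with a comprehension gives independent row lists, whereas multiplying a nested list repeats the same row object. The name initmatrix_sketch is hypothetical.

```python
def initmatrix_sketch(rows, columns, defaultval=0):
    """Return a rows x columns list of lists filled with defaultval."""
    # Each row is a fresh list, so writing to one row leaves the others alone.
    return [[defaultval] * columns for _ in range(rows)]


# Pitfall for comparison: this repeats one row object `rows` times.
aliased = [[0] * 3] * 2
aliased[0][0] = 9   # also changes aliased[1][0]
```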

nidaba.algorithms.string.isgreek(ustr)
nidaba.algorithms.string.islang(unistr, unicode_blocks, threshold=1.0)

Determine if a given (unicode) string belongs to a certain language. This calls the identify function to determine what fraction of the string’s characters are in the specified blocks. It returns True if the fraction of characters in those blocks is >= threshold. Threshold is a float between 0 and 1.
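
The threshold test can be sketched as follows. This is an illustrative stand-in that counts matching characters directly rather than calling identify; the name islang_sketch is hypothetical, and returning False for an empty string is an assumption.

```python
def islang_sketch(text, unicode_blocks, threshold=1.0):
    """Return True if at least `threshold` of the characters of text
    fall inside any of the given (name, first, last) blocks."""
    blocks = list(unicode_blocks)
    if not text:
        return False  # assumption: an empty string belongs to no language
    hits = sum(
        1 for ch in text
        if any(first <= ch <= last for _, first, last in blocks)
    )
    return hits / len(text) >= threshold
```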

nidaba.algorithms.string.key_for_del_dict_entry(entry)

Parse a line from a symmetric delete dictionary. Returns a tuple of the form (key, list of values).

nidaba.algorithms.string.key_for_single_word(entry)

Parse a line from a simple “one word per line” dictionary.

nidaba.algorithms.string.list_to_uni(l, encoding=u'utf-8')

Return a human readable string representation of a list of unicode strings using the specified encoding.

nidaba.algorithms.string.mapped_sym_suggest(ustr, del_dic_path, dic, depth, ret_count=0)

Generate a list of spelling suggestions using the memory mapped dictionary search/symmetric delete algorithm. Return only suggestions at the specified depth, not up to and including that depth.

Perform a binary search on a memory mapped dictionary file and return the parsed entry, or None if the specified entry cannot be found. This function assumes that the dictionary is well-formed; otherwise the behavior is undefined. Entries may be any strings that do not contain newlines (newlines delimit entries). The entryparser_fn should be of the form fn_name(unicodestr), be decorated with @unibarrier, and return a tuple of the form (key to sort by, val). By default, the function for parsing symmetric deletion dictionary entries is used. The line_buffer_size argument must be >= the length of the longest line in the dictionary, or the behavior is undefined.
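
A binary search over a memory mapped, line-oriented file can be sketched with the standard mmap module. Everything specific here is an assumption made for illustration: the tab-separated "key<TAB>value" line format, byte-wise key ordering, and the names line_start, mmap_lookup and lookup_in_file. It does not use a fixed line buffer, so it has no line_buffer_size constraint.

```python
import mmap


def line_start(mm, pos):
    """Return the offset just after the nearest newline left of pos,
    or 0 if there is none (rfind returns -1 in that case)."""
    return mm.rfind(b"\n", 0, pos) + 1


def mmap_lookup(mm, key):
    """Binary-search a sorted, newline-delimited mapping for key (bytes).

    Returns the value bytes, or None if the key is absent.
    """
    lo, hi = 0, len(mm)          # search window, both ends at line starts
    while lo < hi:
        mid = (lo + hi) // 2
        start = line_start(mm, mid)
        end = mm.find(b"\n", start)
        if end == -1:
            end = len(mm)        # last line without trailing newline
        entry_key, _, value = mm[start:end].partition(b"\t")
        if entry_key == key:
            return value
        if entry_key < key:
            lo = end + 1         # continue after this line
        else:
            hi = start           # continue before this line
    return None


def lookup_in_file(path, key):
    """Open path read-only, map it into memory and look up key."""
    with open(path, "rb") as fh:
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mmap_lookup(mm, key)
```

Because only the lines touched by the search are read, the lookup stays cheap even for dictionaries far larger than available memory.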

nidaba.algorithms.string.mr(matrix)

Return a string representation of a 2d list formatted as a matrix. Useful for debugging.

nidaba.algorithms.string.native_align(str1, str2, substitutionscore=1, insertscore=1, deletescore=1, charmatrix={})

Calculate the edit distance of two strings, then backtrace to find a valid edit sequence.
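
The two phases (fill the scoring matrix, then walk it backward to index 0,0) can be sketched as below. This is a generic unit-cost illustration, not nidaba's code; the name align_sketch and the operation labels 'm' (match), 's' (substitute), 'i' (insert) and 'd' (delete) are assumptions.

```python
def align_sketch(s1, s2):
    """Return (distance, ops) where ops is one valid edit sequence
    turning s1 into s2, read left to right."""
    rows, cols = len(s1) + 1, len(s2) + 1
    m = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        m[i][0] = i
    for j in range(cols):
        m[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            m[i][j] = min(m[i - 1][j] + 1,         # delete from s1
                          m[i][j - 1] + 1,         # insert from s2
                          m[i - 1][j - 1] + cost)  # match or substitute

    # Backtrace from the bottom-right cell until index 0,0 is reached.
    ops = []
    i, j = rows - 1, cols - 1
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            if m[i][j] == m[i - 1][j - 1] + cost:
                ops.append("m" if cost == 0 else "s")
                i, j = i - 1, j - 1
                continue
        if i > 0 and m[i][j] == m[i - 1][j] + 1:
            ops.append("d")
            i -= 1
        else:
            ops.append("i")
            j -= 1
    ops.reverse()
    return m[-1][-1], ops
```

Several edit sequences can be equally good; this sketch prefers the diagonal step whenever it is consistent with the scores.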

nidaba.algorithms.string.native_backtrace(matrix, start=None)

Trace edit steps backward to find an edit sequence for an alignment. Starts at the provided ‘start’ index, or in the i,j’th index if none is provided. The backtrace always ends at the index 0,0.

nidaba.algorithms.string.native_full_edit_distance(str1, str2, substitutionscore=1, insertscore=1, deletescore=1, charmatrix={}, alignment_type=u'global')
nidaba.algorithms.string.native_global_matrix(str1, str2, substitutionscore, insertscore, deletescore, charmatrix)

An initial matrix for a global sequence alignment.

nidaba.algorithms.string.native_semi_global_align(shortseq, longseq, substitutionscore=1, insertscore=1, deletescore=1, charmatrix={})
nidaba.algorithms.string.native_semi_global_matrix(str1, str2, substitutionscore, insertscore, deletescore, charmatrix)
nidaba.algorithms.string.np_align(str1, str2, substitutionscore=1, insertscore=1, deletescore=1, charmatrix={})

Calculate the edit distance of two strings, then backtrace to find a valid edit sequence.

nidaba.algorithms.string.np_backtrace(matrix, start=None)

Trace edit steps backward to find an edit sequence for an alignment. Starts at the provided ‘start’ index, or in the i,j’th index if none is provided. The backtrace always ends at the index 0,0.

nidaba.algorithms.string.np_full_edit_distance(str1, str2, substitutionscore=1, insertscore=1, deletescore=1, charmatrix={}, alignment_type=u'global')

A modified implementation of the Wagner-Fischer algorithm using numpy. Unlike the minimal and optimized version in the “edit_distance” function, this returns the entire scoring matrix and an operation matrix for backtracing and reconstructing the edit operations. Use it when an alignment is desired, not only the edit distance.

nidaba.algorithms.string.np_global_matrix(str1, str2, substitutionscore, insertscore, deletescore, charmatrix)

An initial matrix for a global sequence alignment.

nidaba.algorithms.string.np_semi_global_align(shortseq, longseq, substitutionscore=1, insertscore=1, deletescore=1, charmatrix={})

Find a semi-global alignment between two strings.

nidaba.algorithms.string.np_semi_global_matrix(str1, str2, substitutionscore, insertscore, deletescore, charmatrix)

An initial matrix for a semi-global sequence alignment.
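
For comparison with the global case, one common textbook formulation of semi-global alignment leaves the first row of the initial matrix at zero, so that skipping a prefix of the longer sequence costs nothing, and reads the result from the minimum of the last row. Whether nidaba uses exactly this layout is an assumption; the name semi_global_distance is hypothetical.

```python
def semi_global_distance(shortseq, longseq):
    """Return the lowest unit-cost edit distance between shortseq and
    any substring of longseq."""
    rows, cols = len(shortseq) + 1, len(longseq) + 1
    m = [[0] * cols for _ in range(rows)]
    # First column: unmatched characters of shortseq still cost 1 each.
    for i in range(1, rows):
        m[i][0] = i
    # First row stays 0: leading characters of longseq are free to skip.
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if shortseq[i - 1] == longseq[j - 1] else 1
            m[i][j] = min(m[i - 1][j] + 1,
                          m[i][j - 1] + 1,
                          m[i - 1][j - 1] + cost)
    # Trailing characters of longseq are free too: take the best end column.
    return min(m[-1])
```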

nidaba.algorithms.string.parse_del_dict_entry(entry)
nidaba.algorithms.string.prev_newline(mm, line_buffer_size=100)

Return the pointer position immediately after the closest newline to the left, or the beginning of the file if no such newline exists.

nidaba.algorithms.string.sanitize(string, encoding=u'utf-8', normalization=u'NFD')

Strip leading and trailing whitespace and convert to the given normalization form (NFD by default). If the passed string is a str rather than a unicode object, decode it with the specified encoding.
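
The same steps can be sketched with the standard unicodedata module. This version assumes Python 3, where the str/unicode distinction of the documented signature becomes bytes/str; the name sanitize_sketch is hypothetical.

```python
import unicodedata


def sanitize_sketch(text, encoding="utf-8", normalization="NFD"):
    """Decode bytes if necessary, strip surrounding whitespace and
    apply the requested unicode normalization form."""
    if isinstance(text, bytes):
        text = text.decode(encoding)
    return unicodedata.normalize(normalization, text.strip())
```

In NFD a precomposed character such as U+00E9 is split into its base letter followed by a combining mark, which is the form strip_diacritics expects.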

nidaba.algorithms.string.strings_by_deletion(unistr, dels)

Compute the unique strings which can be formed from a string by deleting the specified number of characters from it. The results are sorted in ascending order.
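
Deleting a fixed number of characters is the same as keeping an ordered subset of the remaining positions, which itertools.combinations enumerates directly. A minimal sketch, with the hypothetical name deletions_sketch:

```python
from itertools import combinations


def deletions_sketch(text, dels):
    """Return the sorted unique strings obtained by deleting exactly
    `dels` characters from text."""
    keep = len(text) - dels
    if dels < 0 or keep < 0:
        return []  # cannot delete a negative count or more than exist
    # combinations preserves the original character order.
    return sorted({"".join(chars) for chars in combinations(text, keep)})
```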

nidaba.algorithms.string.strip_diacritics(ustr)

Remove all Greek diacritics from the specified string. Expects the string to be in NFD.
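
The reason NFD input is required becomes clear in a sketch: once diacritics are separate combining characters, they can be filtered out one by one. Note that this illustration drops every combining mark, not only Greek ones, which is a simplification; the name strip_marks is hypothetical.

```python
import unicodedata


def strip_marks(nfd_text):
    """Remove all combining characters from a string already in NFD."""
    return "".join(ch for ch in nfd_text if not unicodedata.combining(ch))
```

A precomposed character passed without prior NFD normalization is left unchanged, because it contains no separate combining mark to remove.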

nidaba.algorithms.string.suggestions(ustr, sugs, freq=None)

Call mapped_sym_suggest, and return the suggestions as a sorted list. Python’s built in sort is stable, so we can simply sort repeatedly, from least important aspect to most important.
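
The repeated stable sort can be illustrated as follows. The candidate layout (word, edit distance, frequency) and the ordering of criteria are assumptions chosen for the example, and rank_sketch is a hypothetical name.

```python
def rank_sketch(candidates):
    """Order (word, distance, frequency) tuples by distance, then by
    descending frequency, then alphabetically."""
    ranked = sorted(candidates, key=lambda c: c[0])   # least important
    ranked.sort(key=lambda c: c[2], reverse=True)     # then frequency
    ranked.sort(key=lambda c: c[1])                   # most important last
    return ranked
```

Each later sort keeps the relative order that earlier sorts established among items with equal keys, so the final pass decides the primary order and earlier passes break ties.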

nidaba.algorithms.string.sym_suggest(ustr, dic, delete_dic, depth, ret_count=0)

Return a list of “spelling” corrections using a symmetric deletion search. Dic is a set of correct words. Delete_dic is of the form {edit_term:[(candidate1, edit_distance), (candidate2, edit_distance), ...]}.
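
A simplified sketch of the idea: deletions of every dictionary word are precomputed into a mapping of the documented form, and a query is matched by looking up its own deletions. This is an illustration only. The helper names are hypothetical, it returns candidates for all depths up to and including the given one, and it does not verify the true edit distance of each candidate.

```python
from itertools import combinations


def deletes(word, depth):
    """Return the set of strings formed by deleting `depth` characters."""
    keep = len(word) - depth
    if keep < 0:
        return set()
    return {"".join(chars) for chars in combinations(word, keep)}


def build_delete_dic(words, max_depth):
    """Return {edit_term: [(candidate, edit_distance), ...]}."""
    delete_dic = {}
    for word in sorted(words):
        for depth in range(1, max_depth + 1):
            for term in deletes(word, depth):
                delete_dic.setdefault(term, []).append((word, depth))
    return delete_dic


def sym_suggest_sketch(query, dic, delete_dic, depth):
    """Return sorted candidate corrections for query."""
    found = set()
    for d in range(depth + 1):
        for term in deletes(query, d):
            if term in dic:
                found.add(term)          # query minus d chars is a word
            for candidate, _ in delete_dic.get(term, []):
                found.add(candidate)     # a word minus some chars matches
    return sorted(found)
```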

nidaba.algorithms.string.todec(ustr)
nidaba.algorithms.string.truestring(unicode)
nidaba.algorithms.string.uniblock(start, stop)

Return a list containing all the characters in the unicode table starting with ‘start’ (inclusive) and ending with ‘stop’ (inclusive).
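
An inclusive character range of this kind can be sketched with ord and chr. The version below assumes Python 3 (where chr covers the full unicode range) and uses the hypothetical name uniblock_sketch.

```python
def uniblock_sketch(start, stop):
    """Return all characters from start to stop, both inclusive."""
    return [chr(code) for code in range(ord(start), ord(stop) + 1)]
```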

Module contents