nidaba package

Submodules

nidaba.api module

nidaba.api

Exposes the functionality of the SimpleBatch class using a RESTful interface.

For documentation of the interface, see the API docs.

class nidaba.api.Batch

Bases: flask_restful.Resource

endpoint = 'batch'
get(batch_id)

Retrieves the state of batch batch_id.

** Request **

GET /batch/:batch_id

** Response **

HTTP/1.1 200 OK
Parameters:batch_id (string) – batch identifier
Status 200:No error
Status 404:No such batch
mediatypes(resource_cls)
methods = ['GET', 'POST']
post(batch_id)

Executes batch with identifier batch_id

** Request **

POST /batch/:batch_id

** Response **

HTTP/1.1 202 ACCEPTED
Parameters:batch_id (string) – batch’s unique id
Status 202:Successfully executed
Status 400:Batch could not be executed
Status 404:No such batch
Status 409:Trying to reexecute an already executed batch
class nidaba.api.BatchCreator

Bases: flask_restful.Resource

endpoint = 'batchcreator'
mediatypes(resource_cls)
methods = ['POST']
post()

Creates a new batch and returns its identifier.

** Request **

POST /batch

** Response **

HTTP/1.1 201 CREATED

{
    "id": "78a1f1e4-cc76-40ce-8a98-77b54362a00e",
    "url": "/batch/78a1f1e4-cc76-40ce-8a98-77b54362a00e"
}
Status 201:Successfully created
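
As a rough illustration of how a client might consume the creation response shown above, the following sketch parses the JSON body and derives the page and task URLs from the route patterns documented in this section. The helper name `parse_batch_created` and the layout of the returned dictionary are hypothetical and not part of nidaba.

```python
import json


def parse_batch_created(body):
    """Parse the JSON body of a 201 response to POST /batch.

    Hypothetical helper, not part of nidaba. Returns the batch id together
    with the URLs of the related resources.
    """
    data = json.loads(body)
    batch_url = data["url"]
    return {
        "id": data["id"],
        "batch": batch_url,
        # Route patterns as documented for BatchPages and BatchTasks.
        "pages": batch_url + "/pages",
        "tasks": batch_url + "/tasks",
    }


body = ('{"id": "78a1f1e4-cc76-40ce-8a98-77b54362a00e", '
        '"url": "/batch/78a1f1e4-cc76-40ce-8a98-77b54362a00e"}')
urls = parse_batch_created(body)
print(urls["pages"])
```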
class nidaba.api.BatchPages

Bases: flask_restful.Resource

endpoint = 'batchpages'
get(batch_id)

Returns the list of pages associated with the batch with batch_id.

** Request **

GET /batch/:batch/pages

** Response **

HTTP/1.1 200 OK

[
    {
        "name": "0033.tif",
        "url": "/pages/63ca3ec7-2592-4c7d-9009-913aac42535d/0033.tif"
    },
    {
        "name": "0072.tif",
        "url": "/pages/63ca3ec7-2592-4c7d-9009-913aac42535d/0072.tif"
    },
    {
        "name": "0014.tif",
        "url": "/pages/63ca3ec7-2592-4c7d-9009-913aac42535d/0014.tif"
    }
]
Status 200:success
Status 404:batch not found
mediatypes(resource_cls)
methods = ['GET', 'POST']
parser = <flask_restful.reqparse.RequestParser object>
post(batch_id)

Adds a page (really any type of file) to the batch identified by batch_id.

** Request **

POST /batch/:batch/pages

** Response **

HTTP/1.1 201 CREATED

[
    {
        "name": "0033.tif",
        "url": "/pages/63ca3ec7-2592-4c7d-9009-913aac42535d/0033.tif"
    }
]

Form scans:file(s) to add to the batch
Status 201:task created
Status 403:file couldn’t be created
Status 404:batch not found
class nidaba.api.BatchTasks

Bases: flask_restful.Resource

endpoint = 'batchtasks'
get(batch_id, group=None, task=None)

Retrieves the list of tasks and their argument values associated with a batch, optionally limited to a specific group.

** Request **

GET /batch/:batch_id/tasks

** Response **

HTTP/1.1 200 OK

{
    "segmentation": [
        ["tesseract", {}]
    ],
    "ocr": [
        ["kraken",
            {
                "model": "teubner"
            }
        ]
    ]
}

To limit output to a specific group of tasks, e.g. segmentation or binarization, append the group to the URL:

** Request **

GET /batch/:batch_id/tasks/:group

** Response **

HTTP/1.1 200 OK

{
    "group": [
        ["tesseract", {}],
        ["kraken", {}]
    ]
}
Status 200:success
Status 404:batch, group, or task not found.
mediatypes(resource_cls)
methods = ['GET', 'POST']
post(batch_id, group=None, task=None)

Adds a particular configuration of a task to the batch identified by batch_id.

** Request **

POST /batch/:batch_id/:group/:task

{
    "kwarg_1": "value",
    "kwarg_2": 10,
    "kwarg_3": "true",
    "kwarg_4": ["a", "b"],
    "kwarg_5": "/pages/:batch_id/path"
}

** Response **

HTTP/1.1 201 CREATED

To post files as arguments, use the URL returned by the call that created them on the batch. Booleans are strings containing either the values ‘True’/‘true’ or ‘False’/‘false’.

Status 201:task created
Status 404:batch, group, or task not found.
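
The encoding rules above (booleans as strings, files as URLs) can be sketched as a small serializer for the request body. This is only an illustration: `encode_task_args` is a hypothetical helper, not part of nidaba, and it assumes the lowercase spellings are sent.

```python
import json


def encode_task_args(**kwargs):
    """Serialize task arguments for the body of a task POST request.

    Hypothetical helper, not part of nidaba. Booleans become the strings
    'true'/'false'; all other values (including file URLs) pass through.
    """
    encoded = {}
    for key, value in kwargs.items():
        # bool is handled explicitly, otherwise it would be emitted as a
        # JSON boolean instead of the string form described above.
        if isinstance(value, bool):
            encoded[key] = "true" if value else "false"
        else:
            encoded[key] = value
    return json.dumps(encoded, sort_keys=True)


payload = encode_task_args(extended=True, languages=["chr", "ceb"], whsize=10)
print(payload)
```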
class nidaba.api.Page

Bases: flask_restful.Resource

endpoint = 'page'
get(batch, file)

Retrieves the file at file in batch batch.

** Request **

GET /pages/:batch/:path

** Response **

HTTP/1.1 200 OK
Content-Type: application/octet-stream

...
Parameters:
  • batch (str) – batch’s unique id
  • file (path) – path to the batch’s file
Status 200:

No error

Status 404:

File not found

mediatypes(resource_cls)
methods = ['GET']
class nidaba.api.Task

Bases: flask_restful.Resource

endpoint = 'task'
get(group=None, task=None)

Retrieves the list of available tasks, their arguments and valid values for those arguments.

** Request **

GET /tasks

** Response **

HTTP/1.1 200 OK

{
    "img": {
        "deskew": {},
        "dewarp": {},
        "rgb_to_gray": {}
    },
    "binarize": {
        "nlbin": {
            "border": "float",
            "escale": "float",
            "high": [
                0,
                100
            ],
            "low": [
                0,
                100
            ]
        },
        "otsu": {},
        "sauvola": {
            "factor": [
                0.0,
                1.0
            ],
            "whsize": "int"
        }
    },
    "segmentation": {
        "kraken": {},
        "tesseract": {}
    },
    "ocr": {
        "kraken": {
            "model": [
                "fraktur.pyrnn.gz",
                "default",
                "teubner"
            ]
        },
        "tesseract": {
            "extended": [
                false,
                true
            ],
            "languages": [
                "chr",
                "chi_tra",
                "ita_old",
                "ceb"
            ]
        }
    },
    "postprocessing": {
        "spell_check": {
            "filter_punctuation": [
                true,
                false
            ],
            "language": [
                "latin",
                "polytonic_greek"
            ]
        }
    },
    "output": {
        "metadata": {
            "metadata": "file",
            "validate": [
                true,
                false
            ]
        },
        "tei2hocr": {},
        "tei2simplexml": {},
        "tei2txt": {}
    }
}

It is also possible to retrieve only a subset of task definitions by adding to the request a task group and/or the task name:

** Request **

GET /tasks/segmentation

** Response **

HTTP/1.1 200 OK

{
    "segmentation": {
        "kraken": {},
        "tesseract": {}
    }
}

Currently there are 4 different argument types:

  • “int”: An integer
  • “float”: A float (floats serialized to integers, i.e. 1.0 to 1, are also accepted)
  • “str”: A UTF-8 encoded string
  • “file”: A file on the storage medium, referenced by its URL

Finally, there are lists of valid argument values, from which one or more values may be picked, and value ranges.
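
A client could check an argument against these specifications before posting it. The sketch below is a minimal illustration under stated assumptions: `check_argument` is a hypothetical helper, and because the text does not say how value ranges are told apart from lists of valid values, it treats every list as a set of allowed values.

```python
def check_argument(spec, value):
    """Check a value against one argument spec as returned by GET /tasks.

    Hypothetical helper. `spec` is either one of the type names listed
    above or a list, which this sketch always treats as allowed values.
    """
    if isinstance(spec, list):
        # One or more values out of the list may be picked.
        values = value if isinstance(value, list) else [value]
        return all(v in spec for v in values)
    if spec == "int":
        return isinstance(value, int) and not isinstance(value, bool)
    if spec == "float":
        # Floats serialized to integers (1.0 as 1) are also accepted.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if spec in ("str", "file"):
        # Files are referenced by their URL, which is also a string.
        return isinstance(value, str)
    raise ValueError("unknown argument type: %r" % (spec,))


print(check_argument("float", 1))
print(check_argument(["latin", "polytonic_greek"], "klingon"))
```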

mediatypes(resource_cls)
methods = ['GET']
nidaba.api.create_app()
nidaba.api.get_blueprint()

nidaba.celery module

nidaba.cli module

This module encapsulates all shell callable functions of nidaba.

nidaba.cli.conv_arg_string(s)

A small helper function intended to coerce an input string to types in the order int -> float -> bool -> unicode -> input. Also resolves lists of these values.

Parameters:s (unicode) –
Returns:Input variable coerced to the highest data type in the ordering.
Return type:int or float or unicode or original input type
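
The coercion order described above can be sketched as follows. This is not nidaba's implementation: `coerce` is a hypothetical name, the code targets Python 3 (so `str` stands in for `unicode`), list handling is left out, and the accepted boolean spellings are an assumption.

```python
def coerce(s):
    """Coerce a string to int, float or bool, falling back to the string.

    Hypothetical sketch of the ordering int -> float -> bool -> string.
    """
    for convert in (int, float):
        try:
            return convert(s)
        except ValueError:
            pass
    # Assumption: booleans are spelled 'true'/'false' in any letter case.
    if s.lower() in ("true", "false"):
        return s.lower() == "true"
    return s


print([coerce(v) for v in ("10", "1.5", "True", "teubner")])
```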
nidaba.cli.help_tasks(ctx, param, value)
nidaba.cli.move_to_storage(batch, kwargs)

Takes a dictionary of kwargs and moves the suffix of all keys starting with the string ‘file:’ to the storage medium, prepending a unique identifier. The path components are rewritten in storage tuple form and the modified dictionary is returned.

It is assumed that the filestore is already created.

nidaba.cli.spin(msg)
nidaba.cli.validate_definition(ctx, param, value)

Validates all task definitions of a group and returns them as a list.

nidaba.config module

nidaba.config.reload_config()

Triggers a global reloading of the configuration files and reinitializes the redis connection pool.

As of now configuration files are only read from sys.prefix/etc/nidaba/.

nidaba.image module

nidaba.image

Common image processing functions encapsulating the PIL or pythonica image interface to absolute file paths.

nidaba.image.any_to_png(imagepath, resultpath)

Converts an image in any format recognized by pillow to PNG.

Parameters:
  • imagepath – Path of the input image
  • resultpath – Path of the output image
Returns:

Path of the actual output file

Return type:

unicode

nidaba.image.otsu(imagepath, resultpath)

Binarizes a grayscale image using Otsu’s algorithm.

Parameters:
  • imagepath – Path of the input image
  • resultpath – Path of the output image
Returns:

Path of the actual output file

Return type:

unicode
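
For readers unfamiliar with Otsu’s algorithm, here is a minimal pure-Python sketch that works on a flat list of 8-bit gray values instead of image files. It picks the threshold that maximizes the between-class variance. The function names are hypothetical and the code makes no claim about how nidaba.image.otsu is implemented internally.

```python
def otsu_threshold(pixels):
    """Return the Otsu threshold for an iterable of 8-bit gray values.

    Pixels <= threshold form one class, all remaining pixels the other.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the lower class
    sum0 = 0  # sum of gray values in the lower class
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance, up to a constant factor of total ** 2.
        var = w0 * w1 * (mean0 - mean1) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t


def binarize(pixels, threshold):
    """Map gray values to 0 (at or below threshold) or 255 (above)."""
    return [255 if p > threshold else 0 for p in pixels]


pixels = [10] * 50 + [200] * 50
t = otsu_threshold(pixels)
print(t, binarize(pixels, t)[:3], binarize(pixels, t)[-3:])
```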

nidaba.image.rgb_to_gray(imagepath, resultpath)

Converts an RGB or CMYK image into an 8bpp grayscale image.

Parameters:
  • imagepath – Path of the input image
  • resultpath – Path of the output image
Returns:

Path of the actual output file

Return type:

unicode

nidaba.lex module

nidaba.lex

This module contains functions for dealing with words and dictionaries, such as extracting words from texts, normalizing encodings, building symmetric deletion dictionaries, etc.

nidaba.lex.cleanlines(path, encoding=u'utf-8', normalization=u'NFD')

Read in lines from a file and return them as a sanitized list. Non-unique lines will be repeated.

Parameters:
  • path (unicode) – Absolute path of the file to be read
  • encoding (unicode) – Encoding to use for decoding the file
  • normalization (unicode) – Normalization format to use
Returns:

List of lines containing the sanitized output, i.e. normalized unicode objects.

Return type:

list

nidaba.lex.cleanuniquewords(path, encoding=u'utf-8', normalization=u'NFD')

Read in lines from a file as separated by lines and spaces, convert them to the specified normalization, and return a set of all unique words.

Parameters:
  • path (unicode) – Absolute path of the file to be read
  • encoding (unicode) – Encoding to use for decoding the file
  • normalization (unicode) – Normalization format to use
Returns:

Set of unique tokens

Return type:

set

nidaba.lex.cleanwords(path, encoding=u'utf-8', normalization=u'NFD')

Read in every word from a file as separated by lines and spaces. Non-unique words will be repeated as they are read in. Detects only words divided by a standard space.

Parameters:
  • path (unicode) – Absolute path of the file to be read
  • encoding (unicode) – Encoding to use for decoding the file
  • normalization (unicode) – Normalization format to use
Returns:

List of words containing the sanitized output, i.e. normalized unicode objects.

Return type:

list

nidaba.lex.make_deldict(outpath, words, depth)

Creates a symmetric deletion dictionary from the specified word list.

WARNING! This is a naive approach, which requires all the variants to be stored in memory. For large dictionaries at higher depth, this can easily use all available memory on most machines.

Parameters:
  • outpath (unicode) – File path to write to
  • words (iterable) – An iterable returning a single word per iteration
  • depth (int) – Maximum edit distance to calculate
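
To make the idea of a symmetric deletion dictionary concrete, the sketch below generates every variant obtained by deleting up to `depth` characters from a word and maps each variant back to the words that produce it. It keeps everything in memory and returns a dict instead of writing to `outpath`. The function names are hypothetical and the on-disk format used by nidaba is not reproduced here.

```python
from itertools import combinations


def deletions(word, depth):
    """Return all strings obtained by deleting 1..depth characters of word."""
    variants = set()
    for k in range(1, min(depth, len(word)) + 1):
        # Choose the character positions that are kept.
        for keep in combinations(range(len(word)), len(word) - k):
            variants.add("".join(word[i] for i in keep))
    return variants


def build_deletion_dict(words, depth):
    """Map each deletion variant to the set of words producing it."""
    deldict = {}
    for word in words:
        for variant in deletions(word, depth):
            deldict.setdefault(variant, set()).add(word)
    return deldict


deldict = build_deletion_dict(["cat", "cart"], 1)
print(sorted(deletions("abc", 1)))
print(deldict["cat"])
```

The number of variants grows quickly with word length and depth, which is the memory concern raised in the warning above.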
nidaba.lex.make_dict(outpath, iterable, encoding=u'utf-8')

Create a file at outpath and write every object in iterable to its own line. The file is opened in append mode.

Parameters:
  • outpath (unicode) – File path to write to
  • iterable (iterable) – An iterable used as a data source
  • encoding (unicode) – Encoding to use for the output file
nidaba.lex.spellcheck(tokens, dictionary, deletion_dictionary)

Performs a spell check on a sequence of tokens.

The spelling of each sequence of characters is compared against a dictionary containing deletions of valid words and a dictionary of correct words.

Parameters:
  • tokens (iterable) – An iterable returning sequences of unicode characters.
  • dictionary (unicode) – Path to a base dictionary.
  • deletion_dictionary (unicode) – Path to a deletion dictionary.
Returns:

A dictionary containing a sorted (lowest to highest edit distance) list of suggestions for each character sequence that is not contained verbatim in the dictionary. Tokens that are not recognized as valid words but don’t have spelling suggestions either will be contained in the result dictionary.

nidaba.lex.tei_spellcheck(facsimile, dictionary, deletion_dictionary, filter_punctuation=False)

Performs a spell check on a TEI XML document.

Each seg element is treated as a single word and spelling corrections will be inserted using a choice tag. Correct words will be untouched and correction candidates will be sorted by edit distance.

Parameters:
  • facsimile (nidaba.tei.TEIFacsimile) – TEIFacsimile object.
  • dictionary (unicode) – Path to a base dictionary.
  • deletion_dictionary (unicode) – Path to a deletion dictionary.
  • filter_punctuation (bool) – Switch to filter punctuation inside segments.
Returns:

A TEIFacsimile object containing the spelling corrections.

nidaba.lex.unique_words_from_files(dirpath, encoding=u'utf-8', normalization=u'NFD')

Create a set of unique words from a directory of text files. All files in the given directory will be parsed.

Parameters:
  • dirpath (unicode) – Absolute path of the directory to enter
  • encoding (unicode) – Encoding to use for decoding the files
  • normalization (unicode) – Normalization format to use
Returns:

Set of words of all files in the directory

Return type:

set

nidaba.lex.uniquewords_with_freq(path, encoding=u'utf-8', normalization=u'NFD')

Read in every word from a file as separated by lines and spaces. Return a counter (behaves like a dictionary) of unique words along with the number of times they occurred.

Parameters:
  • path (unicode) – Absolute path of the file to be read
  • encoding (unicode) – Encoding to use for decoding the file
  • normalization (unicode) – Normalization format to use
Returns:

Contains the frequency of each token

Return type:

Counter
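
The counting behaviour described above can be approximated with `unicodedata.normalize` and `collections.Counter`. For brevity this sketch takes an iterable of already decoded lines rather than a path and an encoding. `word_frequencies` is a hypothetical name and the code is written for Python 3.

```python
import unicodedata
from collections import Counter


def word_frequencies(lines, normalization="NFD"):
    """Count normalized words in an iterable of text lines."""
    counts = Counter()
    for line in lines:
        normalized = unicodedata.normalize(normalization, line)
        # Split on a standard space only and drop empty tokens.
        counts.update(w for w in normalized.strip().split(" ") if w)
    return counts


freq = word_frequencies(["caf\u00e9 caf\u00e9\n", "tea\n"])
print(freq.most_common(1))
```

With NFD normalization the precomposed "é" is stored as "e" followed by a combining acute accent, so lookups have to use the decomposed form.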

nidaba.lex.words_from_files(dirpath, encoding=u'utf-8', normalization=u'NFD')

Create a dictionary from a directory of text files. All files in the given directory will be parsed.

Parameters:
  • dirpath (unicode) – Absolute path of the directory to enter
  • encoding (unicode) – Encoding to use for decoding the files
  • normalization (unicode) – Normalization format to use
Returns:

List of words of all files in the directory

Return type:

list

nidaba.lock module

nidaba.lock

This module contains an NFS-safe locking method that is hopefully interoperable with anything anybody is going to encounter out there.

class nidaba.lock.lock(locked_file)

Bases: object

A global lock implementation for files on the common storage medium. It is intended to be mainly used on NFS although it should work on any network file system.

acquire()

Acquires a lock on the selected file. Waits until the lock can be acquired except when the directory components of the path do not exist.

release()

Releases the lock on the selected file.

Returns:True if the lock has been released, False otherwise
Return type:bool
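
One common way to build such a lock is to rely on directory creation, which either succeeds or fails as a single operation. The sketch below illustrates that idea only. The source does not say which primitive nidaba.lock uses, `DirLock` is a hypothetical class, it tries once instead of waiting, and its suitability for a particular NFS setup is an assumption.

```python
import os
import tempfile


class DirLock:
    """Hypothetical lock based on creating a directory next to the file."""

    def __init__(self, locked_file):
        self.lock_dir = locked_file + ".lock"

    def try_acquire(self):
        """Try once to take the lock; True on success, False if it is held."""
        try:
            os.mkdir(self.lock_dir)
            return True
        except FileExistsError:
            return False

    def release(self):
        """Release the lock; True if it was released, False otherwise."""
        try:
            os.rmdir(self.lock_dir)
            return True
        except FileNotFoundError:
            return False


target = os.path.join(tempfile.mkdtemp(), "page.tif")
first, second = DirLock(target), DirLock(target)
```

If the parent directory of the locked file does not exist, `os.mkdir` raises `FileNotFoundError` instead of waiting, which mirrors the exception to waiting mentioned for acquire().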

nidaba.merge_hocr module

nidaba.merge_hocr

A naive algorithm merging multiple hOCR output documents into one.

class nidaba.merge_hocr.Rect(ul=(0, 0), lr=(0, 0))

Bases: object

Native Python replacement for Gamera’s C++ Rect object.

nidaba.merge_hocr.close_enough(bbox1, bbox2, fudge=0.1)

Roughly matches two bboxes using a fudge factor.

Parameters:
  • bbox1 (Rect) – Rect object of a bounding box.
  • bbox2 (Rect) – Rect object of a bounding box.
  • fudge (float) – Fudge factor to account for slight variations in word boundary detection between segmentation engines.
Returns:

True if the bounding boxes are sufficiently aligned, False otherwise.

Return type:

bool
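
The exact matching rule is not spelled out here, so the following is only one plausible reading: two boxes count as aligned when each edge differs by no more than `fudge` times the width or height of the first box. Boxes are plain (x0, y0, x1, y1) tuples rather than Rect objects, and `roughly_aligned` is a hypothetical name.

```python
def roughly_aligned(bbox1, bbox2, fudge=0.1):
    """Compare two (x0, y0, x1, y1) boxes with a relative tolerance.

    Assumption: the tolerance is fudge times the size of the first box.
    """
    width = bbox1[2] - bbox1[0]
    height = bbox1[3] - bbox1[1]
    tol_x, tol_y = fudge * width, fudge * height
    return (abs(bbox1[0] - bbox2[0]) <= tol_x and
            abs(bbox1[2] - bbox2[2]) <= tol_x and
            abs(bbox1[1] - bbox2[1]) <= tol_y and
            abs(bbox1[3] - bbox2[3]) <= tol_y)


print(roughly_aligned((0, 0, 100, 50), (5, 2, 104, 52)))
print(roughly_aligned((0, 0, 100, 50), (30, 0, 130, 50)))
```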

nidaba.merge_hocr.get_hocr_lines_for_tree(treeIn)
class nidaba.merge_hocr.hocrLine

Bases: object

Dummy class associating lines, words with their text and bboxes.

class nidaba.merge_hocr.hocrWord

Bases: object

Dummy class associating word text with a bbox.

nidaba.merge_hocr.merge(docs, lang, output)

Merges multiple hOCR documents into a single one.

First bboxes from all documents are roughly matched, then all matching bboxes are scored using a spell checker. If no spell checker is available all matches will be merged without ranking.

The matching is naive, i.e. we just grab the first input document and assume that all other documents have similar segmentation results. Issues like high variance in segmentation, especially word boundaries are not accounted for.

Parameters:
  • docs (iterable) – A list of storage tuples of input documents
  • lang (unicode) – A language identifier for the spell checker
  • output (tuple) – Storage tuple for the result
Returns:

The output storage tuple. Should be the same as `output`.

Return type:

tuple

nidaba.merge_hocr.parse_bbox(prop_str)

Parses the microformat property string in the hOCR title field.

Parameters:prop_str (unicode) – The verbatim property string
Returns:A rectangular area described in a bbox field
Return type:Rect
Raises:ValueError – The property string either did not contain a bbox definition or this definition was malformed.
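
In hOCR, the title attribute holds semicolon-separated properties such as `bbox x0 y0 x1 y1`. A minimal sketch of the parsing step might look like the following. It returns a plain tuple instead of a Rect, and `parse_bbox_tuple` is a hypothetical name.

```python
import re

# A bbox property consists of the keyword followed by four integers.
BBOX_RE = re.compile(r"bbox\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)")


def parse_bbox_tuple(prop_str):
    """Return (x0, y0, x1, y1) from an hOCR title property string."""
    for prop in prop_str.split(";"):
        match = BBOX_RE.fullmatch(prop.strip())
        if match:
            return tuple(int(v) for v in match.groups())
    raise ValueError("no valid bbox in %r" % (prop_str,))


print(parse_bbox_tuple("bbox 10 20 110 220; x_wconf 95"))
```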
nidaba.merge_hocr.score_word(lang, word)

A simple token scoring function similar to the one used in Bruce Robertson’s rigaudon. FIXME: Actually score input tokens.

Parameters:
  • lang (unicode) – Language to use for scoring.
  • word (unicode) – Input token to score
Returns:

Value representing the input token’s score. Higher values are closer to native language words.

Return type:

int

nidaba.merge_hocr.sort_words_bbox(words)

Sorts word bboxes of a document in European reading order (upper left to lower right). The list is sorted in place.

Parameters:words (list) – List of hocrWord objects containing Rects in the field bbox and the recognized text in the text field.
Returns:The sorted word list.
Return type:list
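
A naive version of such an ordering sorts by the upper edge first and the left edge second. The sketch below uses (text, (x0, y0, x1, y1)) pairs in place of hocrWord objects. `sort_reading_order` is a hypothetical name, and words on the same line whose upper edges differ slightly may come out in the wrong order with this simple key.

```python
def sort_reading_order(words):
    """Sort (text, bbox) pairs in place: top to bottom, then left to right."""
    words.sort(key=lambda w: (w[1][1], w[1][0]))
    return words


words = [
    ("world", (60, 0, 110, 20)),
    ("again", (0, 30, 50, 50)),
    ("hello", (0, 0, 50, 20)),
]
print([text for text, _ in sort_reading_order(words)])
```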

nidaba.nidaba module

nidaba.nidaba

The public API of nidaba. External applications should exclusively use the objects and methods defined here.

class nidaba.nidaba.Batch(id)

Bases: object

Creates a series of celery tasks OCRing a set of documents (among other things).

A batch contains three levels of definitions: tasks, ticks, and steps. A task is a singular operation on an input document, creating a single output document, e.g. binarization using a particular configuration of an algorithm or OCR using a particular engine. Multiple tasks executing in parallel are grouped into a tick and multiple ticks (running sequentially) are grouped into steps which again are executed sequentially.

The need for steps and ticks arises from two different execution orders required by a pipeline. Take the following example:

step 1:
    tick a: task 1
    tick b: task 2, task 3
    tick c: task 4, task 5

The pipeline expands this example to the following sequences run in parallel (Cartesian product of all ticks):

task 1 -> task 2 -> task 4
task 1 -> task 2 -> task 5
task 1 -> task 3 -> task 4
task 1 -> task 3 -> task 5

It is not guaranteed that any particular task in another sequence has executed successfully before a task in a sequence is run, i.e. it is not ensured that all instances of task 1 have finished before task 2 of the first sequence is executed. Only the task(s) further up the same sequence are guaranteed to have finished.

Steps on the other hand ensure that all tasks of the previous step have finished successfully. The output(s) of the expanded ticks is aggregated into a single list and used as the input of the first tick of the step. Expanding on the example the following step is added:

step 2:
    tick d: task 6

After the 4 sequences have finished, their output is aggregated into a list [d1, d2, d3, d4] and used as the input of task 6. The final output of task 6 is the output of the pipeline.

The call order to create this example is:

Batch.add_step()
Batch.add_tick()
Batch.add_task(task_1)
Batch.add_tick()
Batch.add_task(task_2)
Batch.add_task(task_3)
Batch.add_tick()
Batch.add_task(task_4)
Batch.add_task(task_5)
Batch.add_step()
Batch.add_tick()
Batch.add_task(task_6)
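
The expansion of a step into parallel sequences amounts to a Cartesian product over its ticks, which can be sketched with `itertools.product`. This is an illustration of the combinatorics only: `expand_step` is a hypothetical helper working on plain strings and does not build celery chains.

```python
from itertools import product


def expand_step(ticks):
    """Expand a step (a list of ticks, each a list of tasks) into sequences."""
    return [list(seq) for seq in product(*ticks)]


step_1 = [["task 1"], ["task 2", "task 3"], ["task 4", "task 5"]]
sequences = expand_step(step_1)
for sequence in sequences:
    print(" -> ".join(sequence))
```

The number of sequences is the product of the tick sizes, here 1 × 2 × 2 = 4, matching the four sequences listed for step 1.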
add_document(doc)

Add a document to the batch.

Adds a document tuple to the batch and checks if it exists.

Parameters:doc (tuple) – A standard document tuple.
Raises:NidabaInputException – The document tuple does not refer to a file.
add_step()

Add a new step to the batch definition.

Adds a step, a list of sequentially run ticks to the batch definition. The output(s) of the last tick of a step is aggregated into a single list and used as the input of the first tick of the following step.

add_task(method, **kwargs)

Add a task to the current tick.

Adds a task, a single executable task gathering one or more input documents and returning a single output document, to the current tick. Multiple jobs are run in parallel.

Parameters:
  • method (unicode) – A task identifier
  • **kwargs – Arguments to the task
Raises:
  • NidabaTickException – There is no tick to add a task to.
  • NidabaNoSuchAlgorithmException – Invalid method given.
add_tick()

Add a new tick to the current step.

Adds a tick, a set of tasks running in parallel and sharing common input documents, to the current step.

Raises:NidabaStepException – There is no step to add a tick to.
get_errors()

Retrieves all errors of the batch.

Returns:A list of tuples containing keyword arguments to the task, a dictionary containing debug tracking information (i.e. variables which are given to the tasks as keyword arguments but aren’t arguments to the underlying function), and the exception message of the failure.
Return type:list
get_extended_state()

Returns extended batch state information.

Returns:A dictionary containing an entry for each subtask.
get_results()

Retrieves the storage tuples of a successful batch.

Returns:

get_state()

Retrieves the current state of a batch.

Returns:A string containing one of the following states:

  • NONE: The batch ID is not registered in the backend.
  • FAILURE: Batch execution has failed.
  • PENDING: The batch is currently running.
  • SUCCESS: The batch has completed successfully.

Return type:(unicode)
run()

Executes the current batch definition.

Expands the current batch definition to a series of celery chains and executes them asynchronously. Additionally a batch record is written to the celery result backend.

Returns:Batch identifier.
Return type:(unicode)
class nidaba.nidaba.NetworkSimpleBatch(host, id=None)

Bases: object

A batch object that talks to a remote API server while providing the same interface as a SimpleBatch.

It does some basic error checking to minimize network traffic but it won’t catch all errors before issuing API requests, especially if the batch is modified by another process. In these cases exceptions will be raised by the requests module.

add_document(path, callback, auxiliary=False)

Add a document to the batch.

Uploads a document to the API server and adds it to the batch.

Note: This function accepts a standard file system path and NOT a storage tuple, as a client using the web API is not expected to keep a separate, local storage medium.
Parameters:
  • path (unicode) – Path to the document
  • callback (function) – A function that is called with a requests_toolbelt.multipart.encoder.MultipartEncoderMonitor instance.
  • auxiliary (bool) – Switch to disable setting the file as an input document. May be used to upload ground truths, metadata, and other ancillary files.
Raises:

NidabaInputException – The document does not refer to a file or the batch is locked because the run() method has been called.

add_task(group, method, *args, **kwargs)

Add a particular task configuration to a task group.

Parameters:
  • group (unicode) – Group the task belongs to
  • method (unicode) – Name of the task
  • kwargs – Arguments to the task
create_batch()

Creates a batch on the server. Also synchronizes the list of available tasks and their parameters.

get_available_tasks()

Synchronizes the local task/parameter list with the remote server.

get_documents()

Returns the list of input documents for this task.

get_extended_state()

Returns the extended batch state.

Raises:NidabaInputException if the batch hasn’t been executed yet.
get_results()

Retrieves the storage tuples of a successful batch.

Returns:

get_state()

Retrieves the current state of a batch.

Returns:A string containing one of the following states:

  • NONE: The batch ID is not registered in the backend.
  • FAILURE: Batch execution has failed.
  • PENDING: The batch is currently running.
  • SUCCESS: The batch has completed successfully.

Return type:(unicode)
get_tasks()

Returns the task tree either from the scratchpad or from the pipeline when already in execution.

is_running()

Returns True if the batch’s run() method has been successfully called, otherwise False.

run()

Executes the current batch definition.

Expands the current batch definition to a series of celery chains and executes them asynchronously. Additionally a batch record is written to the celery result backend.

Returns:Batch identifier.
Return type:(unicode)
class nidaba.nidaba.SimpleBatch(id=None)

Bases: nidaba.nidaba.Batch

A simpler interface to the batch functionality that is more amenable to RESTful task assembly and prevents some incidences of bullet-in-foot-syndrome.

A SimpleBatch contains only a list of input documents and a series of tasks. The order of task execution depends on a predefined order, similar to the nidaba command-line util.

If no batch identifier is given a new batch will be created.

SimpleBatches always contain a scratchpad (which will be restored automatically).

add_document(doc)

Add a document to the batch.

Adds a document tuple to the batch and checks if it exists.

Parameters:doc (tuple) – A standard document tuple.
Raises:NidabaInputException – The document tuple does not refer to a file or the batch is locked because the run() method has been called.
add_task(group, method, **kwargs)

Add a particular task configuration to a task group.

Parameters:
  • group (unicode) – Group the task belongs to
  • method (unicode) – Name of the task
  • kwargs – Arguments to the task
static get_available_tasks()

Returns all available tasks and their valid argument values.

The return value is an ordered dictionary containing an entry for each group with a sub-dictionary containing the task identifiers and valid argument values.

get_documents()

Returns the list of input documents for this task.

get_tasks()

Returns the simplified task definition either from the scratchpad or from the pipeline when already in execution.

is_running()

Returns True if the batch’s run() method has been successfully called, otherwise False.

run()

Executes the current batch definition.

Expands the current batch definition to a series of celery chains and executes them asynchronously. Additionally a batch record is written to the celery result backend.

Returns:Batch identifier.
Return type:(unicode)
nidaba.nidaba.task_arg_validator(arg_values, **kwargs)

Validates keyword arguments against the list of valid argument values contained in the task definition.

Raises:NidabaInputException if validation failed.

nidaba.nidabaexceptions module

nidaba.nidabaexceptions

All custom exceptions raised by various nidaba modules and packages. Packages should always define their exceptions here.

exception nidaba.nidabaexceptions.NidabaAlgorithmException(message=None)

Bases: exceptions.Exception

exception nidaba.nidabaexceptions.NidabaConfigException(message=None)

Bases: exceptions.Exception

exception nidaba.nidabaexceptions.NidabaInputException(status_code)

Bases: exceptions.Exception

exception nidaba.nidabaexceptions.NidabaInvalidParameterException(status_code)

Bases: exceptions.Exception

exception nidaba.nidabaexceptions.NidabaLeptonicaException(message=None)

Bases: exceptions.Exception

exception nidaba.nidabaexceptions.NidabaNoSuchAlgorithmException(status_code)

Bases: exceptions.Exception

exception nidaba.nidabaexceptions.NidabaNoSuchStorageBin(status_code)

Bases: exceptions.Exception

exception nidaba.nidabaexceptions.NidabaOcropusException(status_code)

Bases: exceptions.Exception

exception nidaba.nidabaexceptions.NidabaPluginException(message=None)

Bases: exceptions.Exception

exception nidaba.nidabaexceptions.NidabaStepException(status_code)

Bases: exceptions.Exception

exception nidaba.nidabaexceptions.NidabaStorageViolationException(status_code)

Bases: exceptions.Exception

exception nidaba.nidabaexceptions.NidabaTEIException(status_code)

Bases: exceptions.Exception

exception nidaba.nidabaexceptions.NidabaTaskException(status_code)

Bases: exceptions.Exception

exception nidaba.nidabaexceptions.NidabaTesseractException(status_code)

Bases: exceptions.Exception

exception nidaba.nidabaexceptions.NidabaTickException(status_code)

Bases: exceptions.Exception

nidaba.storage module

nidaba.storage

This module contains all file handling/storage management/ID mapping methods.

class nidaba.storage.StorageFile(jobID, path, *args, **kwargs)

Bases: io.IOBase

A file-like interface to a file on the storage medium.

abs_path
close()
closed
flush()
isatty()
read(size=-1)
readable()
readall()
readinto(b)
readline(limit=-1)
readlines(hint=-1)
seek(offset)
seekable()
storage_path
tell()
writable()
write(msg)
writelines(lines)
nidaba.storage.get_abs_path(jobID, *path)

Returns the absolute path of a file.

Takes a job ID and a sequence of path components and checks if their absolute path is in the directory of that particular job ID.

Parameters:
  • jobID (unicode) – A unique job ID
  • *path (unicode) – A list of path components that are concatenated to calculate the absolute path.
Returns:

A string containing the absolute path of the storage tuple.

Return type:

(unicode)

Raises:

NidabaStorageViolationException – The resulting absolute path is either not in the storage_path of the nidaba configuration or not in its job directory.
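
The containment check described above can be sketched with `os.path` alone. Everything here is illustrative: `abs_path_in_job` and its `storage_root` parameter are hypothetical, `ValueError` stands in for NidabaStorageViolationException, and symbolic links are not resolved.

```python
import os


def abs_path_in_job(storage_root, job_id, *path):
    """Join path components and ensure the result stays in the job directory.

    Hypothetical sketch; raises ValueError on a storage violation.
    """
    root = os.path.abspath(storage_root)
    job_dir = os.path.abspath(os.path.join(root, job_id))
    candidate = os.path.abspath(os.path.join(job_dir, *path))
    # commonpath compares whole components, so '/store/job1-evil' is not
    # mistaken for a child of '/store/job1'.
    if os.path.commonpath([root, job_dir]) != root:
        raise ValueError("job directory outside storage: %r" % (job_dir,))
    if os.path.commonpath([job_dir, candidate]) != job_dir:
        raise ValueError("path escapes job directory: %r" % (candidate,))
    return candidate


print(abs_path_in_job("/srv/store", "job1", "scans", "0033.tif"))
```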

nidaba.storage.get_storage_path(path)

Converts an absolute path to a storage tuple of the form (id, path).

Parameters:

path (unicode) – A unicode string of the absolute path.

Returns:

(id, path)

Return type:

tuple

Raises:
  • NidabaStorageViolationException – The given path can not be converted into a storage tuple.
  • NidabaNoSuchStorageBin – The given path is not beneath a valid job ID.
nidaba.storage.insert_suffix(orig_path, *suffix)

Inserts one or more suffixes just before the file extension.
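
As a sketch of what inserting suffixes before the extension can look like, the following uses `os.path.splitext`. The underscore separator is an assumption, since the text does not state which separator nidaba uses, and `insert_suffix_sketch` is a hypothetical name.

```python
import os


def insert_suffix_sketch(orig_path, *suffix):
    """Insert suffixes before the file extension, joined by underscores."""
    root, ext = os.path.splitext(orig_path)
    return "_".join((root,) + suffix) + ext


print(insert_suffix_sketch("0033.tif", "rgb_to_gray", "otsu"))
```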

nidaba.storage.is_file(jobID, path)

Checks if a storage tuple is a regular file.

Parameters:
  • jobID (unicode) – A unique ID associated with a particular job.
  • path (unicode) – A path of a file beneath jobID.
Returns:

Either True or False depending on the existence of the file.

Return type:

bool

Raises:

Exception – Unspecified; the Python standard library does not document which exceptions the underlying calls may raise.

nidaba.storage.is_valid_job(jobID)

Checks if filestore has been prepared for a job.

Parameters:jobID (unicode) – An identifier of a job.
Returns:True if job is already in the system, False otherwise.
Return type:bool
Raises:Standard python library caveats apply.
nidaba.storage.list_content(jobID, pattern=u'*')

Lists all files belonging to a job ID, optionally applying a glob-like filter.

Parameters:
  • jobID (unicode) – Identifier of the bin
  • pattern (unicode) – glob-like filter to match files
Returns:

A list of unicode strings of the matching files.

Return type:

list

Raises:

NidabaNoSuchStorageBin if the job identifier is not known.

nidaba.storage.prepare_filestore(jobID)

Prepares the default filestore to accept files for a job.

Parameters:jobID (unicode) – Identifier of the bin to be created.
Raises:NidabaStorageViolationException if the job ID already exists.
nidaba.storage.retrieve_content(jobID, documents=None)

Retrieves data from a single or a list of documents. Returns binary data, for retrieving unicode text use retrieve_text().

Parameters:
  • jobID (unicode) – Identifier of the bin
  • documents (tuple or list of tuples) – Documents to read in
Returns:

A dictionary mapping file identifiers to their contents.

Return type:

Dictionary

Raises:

NidabaNoSuchStorageBin if the job identifier is not known.

nidaba.storage.retrieve_text(jobID, documents=None)

Retrieves UTF-8 encoded text from a single document or a list of documents.

Parameters:
  • jobID (unicode) – Identifier of the bin
  • documents (tuple or list of tuples) – Documents to read in
Returns:

A dictionary mapping file identifiers to their contents.

Return type:

Dictionary

Raises:

NidabaNoSuchStorageBin if the job identifier is not known.

nidaba.storage.write_content(jobID, dest, data)

Writes data to a document at a destination beneath a jobID. Writes bytes and does not accept unicode objects; use write_text() for those.

Parameters:
  • jobID (unicode) – Identifier of the bin.
  • dest (tuple) – Documents to write to.
  • data (str) – Data to write.
Returns:

Length of data written

Return type:

int

nidaba.storage.write_text(jobID, dest, text)

Writes text data encoded as UTF-8 to a file beneath a jobID.

Parameters:
  • jobID (unicode) – Identifier of the bin.
  • dest (tuple) – Documents to write to.
  • text (unicode) – Data to write.
Returns:

Length of data written

Return type:

int
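The split between write_content()/retrieve_content() (bytes) and write_text()/retrieve_text() (unicode, stored as UTF-8) matters because the two representations differ in length. The sketch below uses only built-in string methods and does not call nidaba. The sample word is arbitrary, and whether the documented return length counts characters or bytes is not specified here.

```python
# Five Greek letters, written with escapes to avoid source-encoding issues.
text = u'\u03bb\u03cc\u03b3\u03bf\u03c2'

# Encoding to UTF-8 yields the byte string a bytes-oriented writer expects.
data = text.encode('utf-8')

print(len(text))  # number of code points
print(len(data))  # number of bytes after UTF-8 encoding

# Decoding restores the original unicode text.
assert data.decode('utf-8') == text
```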

nidaba.tei module

nidaba.tei

A module for interfacing with TEI OCR output

class nidaba.tei.TEIFacsimile

Bases: object

A class encapsulating a TEI XML document following the TEI digital facsimile guidelines for embedded transcriptions.

add_choices(id, it)

Adds alternative interpretations to an element.

Parameters:
  • id (unicode) – Globally unique XML id of the element.
  • it (iterable) – An iterable returning a tuple containing an alternative reading and an optional confidence value in the range between 0 and 100.
add_graphemes(it)

Adds a number of graphemes to the current scope (either line or word). A line or segment has to be created beforehand.

Parameters:it (iterable) – An iterable returning a tuple containing a glyph (unicode), and optionally the bounding box of this glyph (x0, y0, x1, y1) and a recognition confidence value in the range between 0 and 100.
add_line(dim)

Marks the beginning of a new topographical line and scopes it.

Parameters:dim (tuple) – A tuple containing the bounding box (x0, y0, x1, y1)
add_respstmt(name, resp)

Adds a responsibility statement and treats all subsequently added text as a responsibility of this statement.

Parameters:
  • name (unicode) – Identifier of the process that generated the output.
  • resp (unicode) – Phrase describing the nature of the process generating the output.
Returns:

A unicode string corresponding to the responsibility identifier.

add_segment(dim, lang=None, confidence=None)

Marks the beginning of a new topographical segment in the current scope. Most often this corresponds to a word recognized by an engine.

Parameters:
  • dim (tuple) – A tuple containing the bounding box (x0, y0, x1, y1)
  • lang (unicode) – Optional identifier of the segment language.
  • confidence (float) – Optional confidence value between 0 and 100.
author

partial(func, *args, **keywords) - new function with partial application of the given arguments and keywords.

authority

partial(func, *args, **keywords) - new function with partial application of the given arguments and keywords.

clear_graphemes()

Deletes all grapheme zone nodes from the document. Mainly used when combining page segmentation algorithms extracting graphemes and OCR engines operating on lexemes. Also resets the current scope to the first line (and if applicable its first segment).

clear_lines()

Deletes all <line> nodes from the document.

clear_segment()

Marks the end of the current topographical segment.

clear_segments()

Deletes all word zone nodes from the document. Mainly used when combining page segmentation algorithms extracting lexemes (and graphemes) and OCR engines operating on lines. Also resets the current scope to the first line.

description

Returns a tuple containing a source document’s path and its dimensions.

distributor

partial(func, *args, **keywords) - new function with partial application of the given arguments and keywords.

document(dim, image_url)
edition

partial(func, *args, **keywords) - new function with partial application of the given arguments and keywords.

editor

partial(func, *args, **keywords) - new function with partial application of the given arguments and keywords.

fields = OrderedDict([
    ('title', ('titleStmt', '/{http://www.tei-c.org/ns/1.0}title')),
    ('author', ('titleStmt', '/{http://www.tei-c.org/ns/1.0}author', 'ref')),
    ('editor', ('titleStmt', '/{http://www.tei-c.org/ns/1.0}editor', 'ref')),
    ('funder', ('titleStmt', '/{http://www.tei-c.org/ns/1.0}funder', 'ref')),
    ('principal', ('titleStmt', '/{http://www.tei-c.org/ns/1.0}principal', 'ref')),
    ('sponsor', ('titleStmt', '/{http://www.tei-c.org/ns/1.0}sponsor', 'ref')),
    ('meeting', ('titleStmt', '/{http://www.tei-c.org/ns/1.0}meeting', 'meeting')),
    ('edition', ('editionStmt', '/{http://www.tei-c.org/ns/1.0}edition')),
    ('publisher', ('publicationStmt', '/{http://www.tei-c.org/ns/1.0}publisher', 'target')),
    ('distributor', ('publicationStmt', '/{http://www.tei-c.org/ns/1.0}distributor', 'target')),
    ('authority', ('publicationStmt', '/{http://www.tei-c.org/ns/1.0}authority', 'target')),
    ('idno', ('publicationStmt', '/{http://www.tei-c.org/ns/1.0}idno', 'type')),
    ('pub_place', ('publicationStmt', '/{http://www.tei-c.org/ns/1.0}pubPlace')),
    ('licence', ('publicationStmt', '/{http://www.tei-c.org/ns/1.0}availability/{http://www.tei-c.org/ns/1.0}licence', 'target')),
    ('series_title', ('seriesStmt', '/{http://www.tei-c.org/ns/1.0}p')),
    ('note', ('notesStmt', '/{http://www.tei-c.org/ns/1.0}notes')),
    ('source_desc', ('sourceDesc', '/{http://www.tei-c.org/ns/1.0}p'))
])
fileDesc = ['titleStmt', 'editionStmt', 'publicationStmt', 'seriesStmt', 'notesStmt', 'sourceDesc']
funder

partial(func, *args, **keywords) - new function with partial application of the given arguments and keywords.

graphemes

Returns a reading order sorted list of tuples in the format (x0, y0, x1, y1, id, text).

idno

partial(func, *args, **keywords) - new function with partial application of the given arguments and keywords.

lang

The language value of the teiHeader

licence

partial(func, *args, **keywords) - new function with partial application of the given arguments and keywords.

lines

Returns a reading order sorted list of tuples in the format (x0, y0, x1, y1, xml id, text).

load_hocr(fp)

Extracts as much information as possible from an hOCR file and converts it to TEI.

TODO: Write a robust XSL transformation.

Parameters:fp (file) – File descriptor to read data from.
meeting

partial(func, *args, **keywords) - new function with partial application of the given arguments and keywords.

note

partial(func, *args, **keywords) - new function with partial application of the given arguments and keywords.

principal

partial(func, *args, **keywords) - new function with partial application of the given arguments and keywords.

pub_place

partial(func, *args, **keywords) - new function with partial application of the given arguments and keywords.

publisher

partial(func, *args, **keywords) - new function with partial application of the given arguments and keywords.

read(fp)

Reads an XML document from a file object and populates all recognized attributes. Also sets the scope to the first line (and if applicable segment) of the document.

Parameters:fp (file) – file object to read from
respstmt

Returns an ordered dictionary of responsibility statements from the XML document.

scope_line(id)

Scopes a particular line for addition of segments/graphemes. Also disables the current segment scope.

Parameters:id (unicode) – XML id of the line tag
Raises:NidabaTEIException if the identifier is unknown
scope_respstmt(id)

Scopes a respStmt for subsequent addition of graphemes/segments.

Parameters:id (unicode) – XML id of the responsibility statement
Raises:NidabaTEIException if the identifier is unknown
segments

Returns a reading order sorted list of tuples in the format (x0, y0, x1, y1, confidence, id, text).

series_title

partial(func, *args, **keywords) - new function with partial application of the given arguments and keywords.

source_desc

partial(func, *args, **keywords) - new function with partial application of the given arguments and keywords.

sponsor

partial(func, *args, **keywords) - new function with partial application of the given arguments and keywords.

tei_ns = '{http://www.tei-c.org/ns/1.0}'
title

partial(func, *args, **keywords) - new function with partial application of the given arguments and keywords.

write(fp)

Writes the TEI XML document to a file object.

Parameters:fp (file) – file object to write to
write_abbyyxml(fp)

Writes the TEI document in a format reminiscent of Abbyy FineReader’s XML output. Its basic format is:

<text>
    <line l="0" r="111" t="6" b="89">
        <charParams l="0" r="78" t="6" b="89" charConfidence="76" wordStart="true">D</charParams>
        <charParams l="86" r="111" t="24" b="89" charConfidence="76" wordStart="false">e</charParams>
    </line>
    ....
</text>

Parameters:fp (file) – File descriptor to write to.
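The element layout shown above can be reproduced with the standard library's xml.etree.ElementTree. The sketch below is independent of TEIFacsimile: abbyy_like_xml and the shape of its input are assumptions made for illustration, and only the element and attribute names are taken from the format description.

```python
import xml.etree.ElementTree as ET


def abbyy_like_xml(lines):
    """Hypothetical helper building <text>/<line>/<charParams> elements.

    lines: list of ((x0, y0, x1, y1), chars), where chars is a list of
    (glyph, (x0, y0, x1, y1), confidence, word_start) tuples.
    """
    text = ET.Element('text')
    for (x0, y0, x1, y1), chars in lines:
        line = ET.SubElement(text, 'line', {
            'l': str(x0), 'r': str(x1), 't': str(y0), 'b': str(y1)})
        for glyph, (cx0, cy0, cx1, cy1), conf, word_start in chars:
            cp = ET.SubElement(line, 'charParams', {
                'l': str(cx0), 'r': str(cx1), 't': str(cy0), 'b': str(cy1),
                'charConfidence': str(conf),
                'wordStart': 'true' if word_start else 'false'})
            cp.text = glyph
    # tostring() returns ASCII bytes by default; decode for printing.
    return ET.tostring(text).decode('ascii')


demo_lines = [((0, 6, 111, 89), [
    (u'D', (0, 6, 78, 89), 76, True),
    (u'e', (86, 24, 111, 89), 76, False),
])]
xml = abbyy_like_xml(demo_lines)
print(xml)
```

The demo data mirrors the two characters from the format example. Attribute order in the serialized output depends on the Python version.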
write_hocr(fp)

Writes the TEI document as an hOCR file.

Parameters:fp (file) – File descriptor to write to.
write_text(fp)

Writes the TEI document as plain text.

Parameters:fp (file) – File descriptor to write to.
xml_ns = '{http://www.w3.org/XML/1998/namespace}'

nidaba.uzn module

nidaba.uzn

A simple writer/reader interface for UNLV-style zone files.

class nidaba.uzn.UZNReader(f, **kwds)

Bases: object

A reader parsing a UNLV zone file from a file object ‘f’.

next()
class nidaba.uzn.UZNWriter(f, **kwds)

Bases: object

A class writing a UNLV zone file.

writerow(x0, y0, x1, y1, descriptor=u'Text')
writerows(rows)
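For orientation, a zone file of this kind is commonly laid out as one zone per line, giving left, top, width, height and a label separated by spaces. The sketch below writes that layout to an in-memory stream. It does not use UZNWriter, zone_line is a hypothetical helper, and the conversion from (x0, y0, x1, y1) to a width/height record is an assumption about the format rather than documented behaviour of this module.

```python
import io


def zone_line(x0, y0, x1, y1, descriptor=u'Text'):
    """Hypothetical helper: format one zone as 'left top width height label'."""
    return u'{} {} {} {} {}\n'.format(x0, y0, x1 - x0, y1 - y0, descriptor)


stream = io.StringIO()
# Two made-up zones given as (x0, y0, x1, y1) bounding boxes.
for box in [(10, 20, 100, 50), (10, 60, 100, 90)]:
    stream.write(zone_line(*box))

print(stream.getvalue())
```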

nidaba.web module

nidaba.web

A web interface for the REST API.

For documentation of the interface see the API docs.

nidaba.web.get_blueprint()
nidaba.web.indexRoute(path)

Module contents