Option Groups

Preprocessing

Several useful preprocessing tasks are grouped here.

$ nidaba batch ... -i any_to_png ... -- *.tif

Options and Syntax

nidaba.tasks.img.any_to_png(doc, method)

Converts an image (color or otherwise) in any format recognized by pillow to PNG.

The pillow image library relies on external libraries for loading and saving image data. To recognize the most common image formats used in digital archiving you'll need:

  • libtiff
  • zlib
  • libjpeg
  • openjpeg (version 2.0 +)
  • libwebp

To have access to all formats run (on Debian/Ubuntu):

# apt-get -y install libtiff5-dev libjpeg62-turbo-dev zlib1g-dev libwebp-dev libopenjp2-dev
Parameters:
  • doc (unicode, unicode) – The input document tuple
  • method (unicode) – The suffix string appended to all output files.
Returns:

Storage tuple of the output file

Return type:

(unicode, unicode)

nidaba.tasks.img.rgb_to_gray(doc, method)

Converts an arbitrary bit depth image to grayscale and writes it back appending a suffix.

Parameters:
  • doc (unicode, unicode) – The input document tuple
  • method (unicode) – The suffix string appended to all output files.
Returns:

Storage tuple of the output file

Return type:

(unicode, unicode)

Binarization

Binarization is the process of converting a grayscale image to a bi-level image by selecting one or more thresholds separating foreground (usually the text to be recognized) from background (usually the white page) pixels. As all character recognition methods implemented in nidaba operate only on bi-level images, it is paramount to create properly binarized images as a preprocessing step.

Binarization is its own group of tasks; its functions can be accessed using the --binarization/-b switch:

$ nidaba batch ... -b otsu -b sauvola:... -- *.tif

Options and Syntax

nidaba.tasks.binarize.otsu(doc, method)

Binarizes an input document utilizing a naive implementation of Otsu’s thresholding.

Parameters:
  • doc (unicode, unicode) – The input document tuple.
  • method (unicode) – The suffix string appended to all output files.
Returns:

Storage tuple of the output file

Return type:

(unicode, unicode)
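Otsu's method selects the threshold that maximizes the between-class variance of the grayscale histogram. The following is a minimal illustrative sketch of the algorithm, not nidaba's actual implementation:

```python
def otsu_threshold(histogram):
    """Return the threshold maximizing between-class variance
    for a 256-bin grayscale histogram (Otsu's method)."""
    total = sum(histogram)
    sum_all = sum(i * h for i, h in enumerate(histogram))
    sum_bg = 0.0        # weighted sum of background bins
    weight_bg = 0       # pixel count in background
    best_thresh, best_var = 0, 0.0
    for t in range(256):
        weight_bg += histogram[t]
        if weight_bg == 0:
            continue
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break
        sum_bg += t * histogram[t]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        # between-class variance for threshold t
        var = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if var > best_var:
            best_var, best_thresh = var, t
    return best_thresh
```

Pixels at or below the returned threshold are treated as one class (typically background), the rest as the other.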

There are also additional, more advanced binarization algorithms available in the leptonica and kraken plugins.

Page Segmentation

A prerequisite to the actual OCR is the extraction of textual elements (columns, paragraphs, and lines) from the page. Page segmentation is a separate group of tasks and functions can be accessed using the --segmentation/-l switch:

$ nidaba batch ... -l tesseract ... -- *.tif

Options and Syntax

Segmentation is usually an integral part of an OCR engine, so different implementations are situated in their respective plugins. See tesseract and kraken for additional information.

Optical Character Recognition

OCR is arguably the main part of nidaba. Currently three OCR engines are implemented; they can be accessed using the --ocr/-o group of tasks:

$ nidaba batch ... -o tesseract:languages=\[eng\] -o kraken:model=en-default ... -- *.tif

Options and Syntax

As OCR engines are usually quite large and sometimes hard to install, all functionality is contained in plugins. See tesseract, kraken, and ocropus for additional information, configuration keys, etc.

Spell Checking

Nidaba includes support for an edit distance based spell checker out of the box. Particular configurations of the spell checking algorithm have to be predefined in the nidaba.yaml configuration file under the lang_dicts section:

lang_dicts:
  polytonic_greek: {dictionary: [dicts, greek.dic],
                    deletion_dictionary: [dicts, del_greek.dic]}
  latin: {dictionary: [dicts, latin.dic],
          deletion_dictionary: [dicts, del_latin.dic]}

The spell checker is part of the postprocessing group of tasks and can be accessed by the name spell_check, e.g.:

$ nidaba batch ... -p spell_check:language=polytonic_greek,filter_punctuation=False ... -- *.tif

Creating Dictionaries

The spell checker requires two dictionaries on the common storage medium: a dictionary of valid word forms and a corresponding file containing a mapping between variants and those valid word forms. Both are best created using the nidaba_mkdict tool installed by the default distribution. It takes an arbitrary text document, extracts all unique character sequences, and calculates the dictionaries in a normalized format. For example:

$ nidaba_mkdict --input greek.txt --del_dict del_greek.dic --dictionary greek.dic
Reading input file      [✓]
Writing dictionary      [✓]
Writing deletions       [✓]

Be aware that calculating the deletion dictionary is a memory-intensive process: for a 31MB word list, nidaba_mkdict utilizes around 8GB of memory and the resulting deletion dictionary will be around 750MB.
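The deletion dictionary exists because edit-distance candidates can be found cheaply by precomputing character deletions of every dictionary word (the symmetric-delete scheme). The sketch below illustrates the idea with hypothetical helper names; nidaba's on-disk format and lookup code differ:

```python
def deletions(word, depth=1):
    """Return word plus all variants with up to `depth` characters deleted."""
    variants = {word}
    frontier = {word}
    for _ in range(depth):
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        variants |= frontier
    return variants

def build_deletion_dict(words, depth=1):
    """Map each deletion variant back to the dictionary words producing it.
    This is the memory-hungry step mentioned above."""
    index = {}
    for word in words:
        for var in deletions(word, depth):
            index.setdefault(var, set()).add(word)
    return index

def candidates(term, index, depth=1):
    """Correction candidates: dictionary words whose deletion variants
    intersect the deletion variants of the query term."""
    found = set()
    for var in deletions(term, depth):
        found |= index.get(var, set())
    return found
```

Because both sides only ever delete characters, insertions, deletions, and substitutions within the chosen depth are all covered without enumerating the full alphabet.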

Options and Syntax

nidaba.tasks.postprocessing.spell_check(doc, method, language, filter_punctuation, no_ocrx_words)

Adds spelling suggestions to a TEI XML document.

Alternative spellings for each segment will be included in a choice tag containing a series of corr tags with the original segment appearing beneath a sic element. Correct words, i.e. words appearing verbatim in the dictionary, are left untouched.

Parameters:
  • doc (unicode, unicode) – The input document tuple.
  • method (unicode) – The suffix string appended to the output file.
  • language (unicode) – Identifier defined in the nidaba configuration as a valid dictionary.
  • filter_punctuation (bool) – Switch to filter punctuation inside seg elements.
Returns:

Storage tuple of the output document

Return type:

(unicode, unicode)

Output Merging

There is a rudimentary merging algorithm able to combine multiple recognition results into a single document if certain conditions are met. The combined output can then be used for further postprocessing, e.g. manual correction or lexicality-based weighting. It has been ported from Bruce Robertson's rigaudon, an OCR engine for polytonic Greek.

Currently, its basic operation is as follows. First, (word) bboxes from all documents are roughly matched; then all matching bboxes are scored using a spell checker. If no spell checker is available, all matches are merged without ranking.

Note

The matching is naive, i.e. we just grab the first input document and assume that all other documents have similar segmentation results. Issues like high variance in segmentation, especially word boundaries, are not accounted for.
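The rough bbox matching described above can be sketched as a greedy overlap search. This is an illustrative approximation with hypothetical helper names, not the ported rigaudon code:

```python
def overlap(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def match_boxes(reference, other, threshold=0.5):
    """For each bbox of the first (reference) document, greedily pick the
    best-overlapping bbox from another recognition result."""
    matches = []
    for ref in reference:
        best = max(other, key=lambda b: overlap(ref, b), default=None)
        if best is not None and overlap(ref, best) >= threshold:
            matches.append((ref, best))
    return matches
```

Matched boxes would then be scored with the spell checker; unmatched boxes fall through unmerged, which is exactly where diverging segmentations cause trouble.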

Options and Syntax

nidaba.tasks.postprocessing.blend_hocr(doc, method, language)

Blends multiple hOCR files using the algorithm I cooked up in between thesis procrastination sessions. It is language independent and parameterless.

Parameters:
  • doc (list) – A list of storage module tuples that will be merged into a single output document.
  • method (unicode) – The suffix string appended to the output file.
Returns:

Storage tuple of the output document

Return type:

(unicode, unicode)

Output Layer

The output layer handles conversion and extension of nidaba’s native TEI format. It can be used to distill data from the TEI document into plain text, hOCR, and a simple XML format. It can also use an external metadata file to complete raw TEI output to a valid TEI document.

An example adding metadata from an external file using the file: syntax to copy it to the storage medium on task invocation:

$ nidaba batch ...  -f metadata:metadata=file:openphilology_meta.yaml,validate=False -- *.tif

Options and Syntax

nidaba.tasks.output.tei_metadata(doc, method, metadata, validate)

Enriches a TEI-XML document with various metadata from a user-supplied YAML file.

The following fields may be contained in the metadata file, with the bolded subset mandatory for a valid TEI-XML file. They are grouped by their place in the header. Unknown fields are ignored and input is escaped so as to prevent injection.

Some elements may also be extended by increasing their arity; the second value is then usually used as a global identifier/locator, e.g. a URL or authority control ID.
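For example, a metadata fragment mixing plain and extended (two-valued) fields might look like the following; all values and the authority URL are placeholders:

```yaml
title: Example Edition
author: [Jane Doe, http://example.org/authority/12345]
licence: CC-BY-4.0
pub_place: Leipzig
date: 1885
```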

titleStmt:

  • title: Title of the resource
  • author: Name of the author of the resource (may be extended)
  • editor: Name of the editor, compiler, translator, etc. of the resource (may be extended)
  • funder: Institution responsible for the funding of the text (may be extended)
  • principal: PI responsible for the creation of the text (may be extended)
  • sponsor: Name of the sponsoring institution (may be extended)
  • meeting: Conference/meeting resulting in the text (may be extended)

editionStmt:

  • edition: Peculiarities of the underlying edition of the text

publicationStmt:

  • licence: Licence of the content (may be extended)
  • publisher: Person or agency responsible for the publication of the text (may be extended)
  • distributor: Person or agency responsible for the text's distribution (may be extended)
  • authority: Authority responsible for making the work available
  • idno: Identifier of the publication (may be extended with the type of identifier)
  • pub_place: Place of publication
  • date: Date of publication

seriesStmt:

  • series_title: Title of the series to which the publication belongs

notesStmt:

  • note: Misc. notes about the text

sourceDesc:

  • source_desc: Description of the source document

other:

  • lang: Abbreviation of the language used in the header

There is a sample file from the OpenPhilology project in the example directory.

Parameters:
  • doc (unicode, unicode) – Storage tuple of the input document
  • method (unicode) – The suffix string appended to the output file.
  • metadata (unicode, unicode) – Storage tuple of the metadata YAML file
Returns:

Storage tuple of the output document

Return type:

(unicode, unicode)

Raises:

NidabaTEIException – if the resulting document is not TEI compatible and validation is enabled.

nidaba.tasks.output.tei2abbyyxml(doc, method)

Convert a TEI Facsimile to a format similar to Abbyy FineReader’s XML output.

Parameters:
  • doc (unicode, unicode) – Storage tuple of the input document
Returns:

Storage tuple of the output document

Return type:

(unicode, unicode)

nidaba.tasks.output.tei2hocr(doc, method)

Convert a TEI Facsimile to hOCR preserving as much metadata as possible.

Parameters:
  • doc (unicode, unicode) – Storage tuple of the input document
Returns:

Storage tuple of the output document

Return type:

(unicode, unicode)

nidaba.tasks.output.tei2txt(doc, method)

Convert a TEI Facsimile to a plain text file.

Parameters:
  • doc (unicode, unicode) – Storage tuple of the input document
Returns:

Storage tuple of the output document

Return type:

(unicode, unicode)

Metrics

It is possible to calculate metrics on the textual output to assess its deviation from a given ground truth. The ground truth may be in one of several supported formats, including plain text, hOCR, and TEI XML. Currently two schemes for calculating character edit distances are included: one using a variant of the well-known diff algorithm and one calculating the global minimum edit distance:

$ nidaba batch ... -s text_edit_ratio:ground_truth=file:'*.gt.txt',xml_in=True,clean_gt=True,gt_format=text,clean_in=True,divert=True -- *.tif

Note that we didn't associate a single ground truth with the batch but a pattern matching one or more ground truth files. At runtime the task automatically selects the needed ground truth for each OCR result based on the filename prefix.
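The prefix-based selection can be sketched as follows; this is an illustrative reconstruction, not nidaba's actual matching code:

```python
import os

def pick_ground_truth(ocr_path, gt_paths):
    """Select the ground truth file sharing the longest filename
    prefix with the OCR result."""
    name = os.path.basename(ocr_path)

    def shared_prefix(gt):
        gt_name = os.path.basename(gt)
        n = 0
        while n < min(len(name), len(gt_name)) and name[n] == gt_name[n]:
            n += 1
        return n

    return max(gt_paths, key=shared_prefix)
```

With files named as in the example above, `0016_ocr.kraken_teubner.xml` would select `0016.gt.txt` because the shared prefix `0016` is longer than that of any other candidate.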

Access to the metric is provided by the usual status command:

$ nidaba status 481964c3-fe5d-487a-9b73-a12869678ab3
Status: success (final)

3/3 tasks completed. 0 running.

Output files:

0016.tif → 0016_ocr.kraken_teubner.xml (94.0% / 0016.gt.txt)

As you can see, we got an accuracy of 94% on our scans, which is good but not great.

Options and Syntax

nidaba.tasks.stats.text_diff_ratio(doc, method, ground_truth, xml_in, gt_format, clean_in, clean_gt, divert)

Calculates the similarity of the input documents and a given ground truth using Python's difflib.SequenceMatcher algorithm. The result is a value between 0.0 (no commonality) and 1.0 (identical strings).

Parameters:
  • doc (unicode, unicode) – The input document tuple
  • method (unicode) – The suffix string appended to the output file.
  • ground_truth (unicode) – Ground truth location tuple or a list of ground truths to choose from. When more than one is given, the file sharing the longest prefix with the input document is chosen.
  • xml_in (bool) – Switch to treat input as a TEI-XML document.
  • gt_format (unicode) – Switch to select ground truth format. Valid values are ‘tei’, ‘hocr’, and ‘text’.
  • clean_in (bool) – Normalize to NFD and strip input data. (DO NOT DISABLE!)
  • clean_gt (bool) – Normalize to NFD and strip ground truth. (DO NOT DISABLE!)
  • divert (bool) – Switch selecting output diversion. If enabled the output will be added to the tracking arguments and the input document will be returned as the result of the task. Use this to insert a statistical measure into a chain without affecting the results.
Returns:

Storage tuple of the output document

Return type:

(unicode, unicode)
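The core of the ratio computation can be illustrated with difflib directly; TEI parsing and the normalization controlled by clean_in/clean_gt are omitted here for brevity:

```python
import difflib

def diff_ratio(ocr_text, gt_text):
    """Similarity between 0.0 and 1.0 via difflib's SequenceMatcher:
    2 * (number of matched characters) / (total characters)."""
    return difflib.SequenceMatcher(None, ocr_text, gt_text).ratio()
```

For example, 'kitten' vs. 'sitten' share the five-character block 'itten', giving a ratio of 2 * 5 / 12 ≈ 0.833.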

nidaba.tasks.stats.text_edit_ratio(doc, method, ground_truth, xml_in, gt_format, clean_in, clean_gt, divert)

Calculates the similarity of the input documents and a given ground truth using the Damerau-Levenshtein distance. The result is a value between 0.0 (no commonality) and 1.0 (identical strings).

Parameters:
  • doc (unicode, unicode) – The input document tuple
  • method (unicode) – The suffix string appended to the output file.
  • ground_truth (unicode) – Ground truth location tuple or a list of ground truths to choose from. When more than one is given, the file sharing the longest prefix with the input document is chosen.
  • xml_in (bool) – Switch to treat input as a TEI-XML document.
  • gt_format (unicode) – Switch to select ground truth format. Valid values are ‘tei’, ‘hocr’, and ‘text’.
  • clean_in (bool) – Normalize to NFD and strip input data. (DO NOT DISABLE!)
  • clean_gt (bool) – Normalize to NFD and strip ground truth. (DO NOT DISABLE!)
  • divert (bool) – Switch selecting output diversion. If enabled the output will be added to the tracking arguments and the input document will be returned as the result of the task. Use this to insert a statistical measure into a chain without affecting the results.
Returns:

Storage tuple of the output document

Return type:

(unicode, unicode)
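The underlying distance can be sketched as the restricted (optimal string alignment) variant of Damerau-Levenshtein, normalized to a similarity; this is an illustrative implementation, not necessarily the one nidaba ships:

```python
def edit_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance:
    insertions, deletions, substitutions, and adjacent transpositions."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def edit_ratio(a, b):
    """Normalize the distance into the task's 0.0-1.0 similarity range."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))
```

Transpositions matter for OCR evaluation because a swapped character pair counts as one error rather than two.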