Option Groups¶
Preprocessing¶
Several useful preprocessing tasks are grouped here and can be accessed using the -i switch:
$ nidaba batch ... -i any_to_png ... -- *.tif
Options and Syntax¶
nidaba.tasks.img.any_to_png(doc, method)¶
Converts an image (color or otherwise) in any format recognized by pillow to PNG.
The pillow image library relies on external libraries for loading and saving image data. To recognize the most common image formats used for digital archival you’ll need:
- libtiff
- zlib
- libjpeg
- openjpeg (version 2.0 +)
- libwebp
To have access to all formats, run (on Debian/Ubuntu):
# apt-get -y install libtiff5-dev libjpeg62-turbo-dev zlib1g-dev libwebp-dev libopenjp2-dev
Parameters:
- doc (unicode, unicode) – The input document tuple
- method (unicode) – The suffix string appended to all output files.
Returns: Storage tuple of the output file
Return type: (unicode, unicode)
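Internally the conversion amounts to a Pillow open/save round-trip; the following is a minimal sketch of the operation, not the task’s actual code, with illustrative file names:

from PIL import Image

# Open an image in any Pillow-recognized format and re-encode it as PNG.
Image.open('0016.tif').save('0016_any_to_png.png', format='PNG')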
nidaba.tasks.img.rgb_to_gray(doc, method)¶
Converts an arbitrary bit depth image to grayscale and writes it back appending a suffix.
Parameters:
- doc (unicode, unicode) – The input document tuple
- method (unicode) – The suffix string appended to all output files.
Returns: Storage tuple of the output file
Return type: (unicode, unicode)
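A comparable Pillow sketch of the underlying grayscale conversion (illustrative only):

from PIL import Image

# Pillow's 'L' mode is 8-bit grayscale; convert() handles arbitrary input depths.
Image.open('0016.png').convert('L').save('0016_rgb_to_gray.png')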
Binarization¶
Binarization is the process of converting a grayscale image to a bi-level image by selecting one or more thresholds separating foreground (usually the text to be recognized) from background (usually the white page) pixels. As all character recognition methods implemented in nidaba operate only on bi-level images, it is paramount to create properly binarized images as a preprocessing step.
Binarization is its own group of tasks whose functions can be accessed using the --binarization/-b switch:
$ nidaba batch ... -b otsu -b sauvola:... -- *.tif
Options and Syntax¶
nidaba.tasks.binarize.otsu(doc, method)¶
Binarizes an input document utilizing a naive implementation of Otsu’s thresholding.
Parameters:
- doc (unicode, unicode) – The input document tuple.
- method (unicode) – The suffix string appended to all output files.
Returns: Storage tuple of the output file
Return type: (unicode, unicode)
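For illustration, Otsu’s method picks the threshold that maximizes the between-class variance of foreground and background pixels. A minimal numpy sketch of the idea, not nidaba’s actual implementation:

import numpy as np

def otsu_threshold(gray):
    """Return Otsu's threshold for an 8-bit grayscale image given as an array."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = hist.sum()
    sum_all = np.dot(np.arange(256), hist)
    w_b = sum_b = 0.0
    best_t, best_var = 0, 0.0
    for t in range(256):
        w_b += hist[t]                 # weight of the background class
        if w_b == 0:
            continue
        w_f = total - w_b              # weight of the foreground class
        if w_f == 0:
            break
        sum_b += t * hist[t]
        m_b = sum_b / w_b              # background mean
        m_f = (sum_all - sum_b) / w_f  # foreground mean
        var_between = w_b * w_f * (m_b - m_f) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

Binarizing then reduces to comparing each pixel against otsu_threshold(gray).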
Additional, more advanced binarization algorithms are available in the leptonica and kraken plugins.
Page Segmentation¶
A prerequisite to the actual OCR is the extraction of textual elements (columns, paragraphs, and lines) from the page. Page segmentation is a separate group of tasks whose functions can be accessed using the --segmentation/-l switch:
$ nidaba batch ... -l tesseract ... -- *.tif
Optical Character Recognition¶
OCR is arguably the main part of nidaba. Currently three OCR engines are implemented; they can be accessed using the --ocr/-o group of tasks:
$ nidaba batch ... -o tesseract:languages=\[eng\] -o kraken:model=en-default ... -- *.tif
Spell Checking¶
Nidaba includes support for an edit distance based spell checker out of the
box. Particular configurations of the spell checking algorithm have to be
predefined in the nidaba.yaml
configuration file under the lang_dicts
section:
lang_dicts:
  polytonic_greek: {dictionary: [dicts, greek.dic],
                    deletion_dictionary: [dicts, del_greek.dic]}
  latin: {dictionary: [dicts, latin.dic],
          deletion_dictionary: [dicts, del_latin.dic]}
The spell checker is part of the postprocessing group of tasks and can be accessed by the name spell_check, e.g.:
$ nidaba batch ... -p spell_check:language=polytonic_greek,filter_punctuation=False ... -- *.tif
Creating Dictionaries¶
The spell checker requires two dictionaries on the common storage medium: a
dictionary of valid word forms and a corresponding file containing a mapping
between variants and those valid word forms. Both are best created using the
nidaba_mkdict tool installed by the default distribution. It takes an
arbitrary text document, extracts all unique character sequences, and
calculates the dictionaries in a normalized format. For example:
$ nidaba_mkdict --input greek.txt --del_dict del_greek.dic --dictionary greek.dic
Reading input file [✓]
Writing dictionary [✓]
Writing deletions [✓]
Be aware that calculating the deletion dictionary requires a lot of memory: for a 31MB word list, nidaba_mkdict uses around 8GB of memory and the resulting deletion dictionary will be around 750MB in size.
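The deletion dictionary maps character sequences with letters removed back to the full word forms they were derived from, in the manner of a symmetric-delete scheme. A hypothetical sketch of the construction (nidaba_mkdict’s actual on-disk format differs), which also shows why memory consumption explodes combinatorially:

from itertools import combinations

def deletions(word, depth=2):
    """All variants of word with up to `depth` characters deleted."""
    variants = set()
    for d in range(1, depth + 1):
        for idx in combinations(range(len(word)), d):
            variants.add(''.join(c for i, c in enumerate(word) if i not in idx))
    return variants

def build_del_dict(words, depth=2):
    """Map each deletion variant back to the set of word forms producing it."""
    del_dict = {}
    for word in words:
        for variant in deletions(word, depth):
            del_dict.setdefault(variant, set()).add(word)
    return del_dict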
Options and Syntax¶
nidaba.tasks.postprocessing.spell_check(doc, method, language, filter_punctuation, no_ocrx_words)¶
Adds spelling suggestions to a TEI XML document.
Alternative spellings for each segment will be included in a choice tag containing a series of corr tags, with the original segment appearing beneath a sic element. Correct words, i.e. words appearing verbatim in the dictionary, are left untouched.
Parameters:
- doc (unicode, unicode) – The input document tuple.
- method (unicode) – The suffix string appended to the output file.
- language (unicode) – Identifier defined in the nidaba configuration as a valid dictionary.
- filter_punctuation (bool) – Switch to filter punctuation inside seg elements.
Returns: Storage tuple of the output document
Return type: (unicode, unicode)
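Schematically, a flagged segment in the output looks like this (a sketch with made-up suggestions, not verbatim nidaba output):

<seg>
  <choice>
    <sic>tbe</sic>
    <corr>the</corr>
    <corr>tube</corr>
  </choice>
</seg>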
Output Merging¶
There is a rudimentary merging algorithm able to combine multiple recognition results into a single document if certain conditions are met. The combined output can then be used for further postprocessing, e.g. manual correction or lexicality-based weighting. It has been ported from Bruce Robertson’s rigaudon, an OCR engine for polytonic Greek.
Currently, its basic operation is as follows. First, word bounding boxes (bboxes) from all documents are roughly matched; then all matching bboxes are scored using a spell checker. If no spell checker is available, all matches are merged without ranking.
Note
The matching is naive, i.e. we just grab the first input document and assume that all other documents have similar segmentation results. Issues like high variance in segmentation, especially in word boundaries, are not accounted for.
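To make the matching step concrete, here is a rough overlap-based sketch of word bbox matching under the naive assumption above; the function names and IoU threshold are illustrative, not nidaba’s actual code:

def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) bboxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def match_words(first_doc, other_doc, threshold=0.5):
    """Pair each (bbox, text) word of the first document with its best overlap."""
    matches = []
    for box, text in first_doc:
        best = max(other_doc, key=lambda w: iou(box, w[0]), default=None)
        if best is not None and iou(box, best[0]) >= threshold:
            matches.append((text, best[1]))
    return matches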
Options and Syntax¶
nidaba.tasks.postprocessing.blend_hocr(doc, method, language)¶
Blends multiple hOCR files using the algorithm I cooked up in between thesis procrastination sessions. It is language independent and parameterless.
Parameters:
- doc (list) – A list of storage module tuples that will be merged into a single output document.
- method (unicode) – The suffix string appended to the output file.
Returns: Storage tuple of the output document
Return type: (unicode, unicode)
Output Layer¶
The output layer handles conversion and extension of nidaba’s native TEI format. It can be used to distill data from the TEI document into plain text, hOCR, and a simple XML format. It can also use an external metadata file to complete raw TEI output into a valid TEI document.
An example adding metadata from an external file, using the file: syntax to copy it to the storage medium on task invocation:
$ nidaba batch ... -f metadata:metadata=file:openphilology_meta.yaml,validate=False -- *.tif
Options and Syntax¶
nidaba.tasks.output.tei_metadata(doc, method, metadata, validate)¶
Enriches a TEI-XML document with various metadata from a user-supplied YAML file.
The following fields may be contained in the metadata file, a subset of which is mandatory for a valid TEI-XML file. They are grouped by their place in the header. Unknown fields are ignored and input is escaped so as to prevent injection.
Some elements may also be extended by increasing their arity; the second value is then usually used as a global identifier/locator, i.e. a URL or authority control ID.
titleStmt:
- title: Title of the resource
- author: Name of the author of the resource (may be extended)
- editor: Name of the editor, compiler, translator, etc. of the resource (may be extended)
- funder: Institution responsible for the funding of the text (may be extended)
- principal: PI responsible for the creation of the text (may be extended)
- sponsor: Name of the sponsoring institution (may be extended)
- meeting: Conference/meeting resulting in the text (may be extended)
editionStmt:
- edition: Peculiarities of the underlying edition of the text
publicationStmt:
- licence: Licence of the content (may be extended)
- publisher: Person or agency responsible for the publication of the text (may be extended)
- distributor: Person or agency responsible for the text’s distribution (may be extended)
- authority: Authority responsible for making the work available
- idno: Identifier of the publication (may be extended with the type of identifier)
- pub_place: Place of publication
- date: Date of publication
seriesStmt:
- series_title: Title of the series to which the publication belongs
notesStmt:
- note: Misc. notes about the text
sourceDesc:
- source_desc: Description of the source document
other:
- lang: Abbreviation of the language used in the header
There is a sample file from the OpenPhilology project in the example directory.
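A minimal, hypothetical metadata file using a few of the fields above might look like:

title: Histories
author: Herodotus
licence: CC-BY-SA 4.0
source_desc: Digitized from a 19th century Teubner edition
lang: en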
Parameters:
- doc (unicode, unicode) – Storage tuple of the input document
- method (unicode) – The suffix string appended to the output file.
- metadata (unicode, unicode) – Storage tuple of the metadata YAML file
Returns: Storage tuple of the output document
Return type: (unicode, unicode)
Raises: NidabaTEIException – if the resulting document is not TEI compatible and validation is enabled.
nidaba.tasks.output.tei2abbyyxml(doc, method)¶
Converts a TEI facsimile to a format similar to Abbyy FineReader’s XML output.
Parameters: doc (unicode, unicode) – Storage tuple of the input document
Returns: Storage tuple of the output document
Return type: (unicode, unicode)
nidaba.tasks.output.tei2hocr(doc, method)¶
Converts a TEI facsimile to hOCR, preserving as much metadata as possible.
Parameters: doc (unicode, unicode) – Storage tuple of the input document
Returns: Storage tuple of the output document
Return type: (unicode, unicode)
nidaba.tasks.output.tei2txt(doc, method)¶
Converts a TEI facsimile to a plain text file.
Parameters: doc (unicode, unicode) – Storage tuple of the input document
Returns: Storage tuple of the output document
Return type: (unicode, unicode)
Metrics¶
It is possible to calculate metrics on the textual output to assess its
deviation from a given ground truth. The ground truth may be in one of several
supported formats, including plain text, hOCR, and TEI XML. Currently two
schemes for calculating character edit distances are included: one using a
variant of the well-known diff algorithm and one calculating the global
minimum edit distance:
$ nidaba batch ... -s text_edit_ratio:ground_truth=file:'*.gt.txt',xml_in=True,clean_gt=True,gt_format=text,clean_in=True,divert=True -- *.tif
Note that we didn’t associate a single ground truth with the batch but a pattern matching one or more ground truth files. At runtime the task automatically selects the needed ground truth for its OCR result based on the filename prefix.
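The selection amounts to longest-common-prefix matching on file names; a sketch using a hypothetical helper, not nidaba’s API:

import os

def pick_ground_truth(doc_path, candidates):
    """Return the candidate sharing the longest filename prefix with doc_path."""
    name = os.path.basename(doc_path)
    def shared(candidate):
        return len(os.path.commonprefix([name, os.path.basename(candidate)]))
    return max(candidates, key=shared)

# pick_ground_truth('0016.tif', ['0016.gt.txt', '0017.gt.txt']) -> '0016.gt.txt'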
Access to the metric is provided by the usual status
command:
$ nidaba status 481964c3-fe5d-487a-9b73-a12869678ab3
Status: success (final)
3/3 tasks completed. 0 running.
Output files:
0016.tif → 0016_ocr.kraken_teubner.xml (94.0% / 0016.gt.txt)
As you can see, we got an accuracy of 94% on our scans, which is good but not great.
Options and Syntax¶
nidaba.tasks.stats.text_diff_ratio(doc, method, ground_truth, xml_in, gt_format, clean_in, clean_gt, divert)¶
Calculates the similarity of the input documents and a given ground truth using the algorithm of python’s difflib SequenceMatcher. The result is a value between 0.0 (no commonality) and 1.0 (identical strings).
Parameters:
- doc (unicode, unicode) – The input document tuple
- method (unicode) – The suffix string appended to the output file.
- ground_truth (unicode) – Ground truth location tuple or a list of ground truths to choose from. When more than one is given, the file sharing the longest prefix with the input document is chosen.
- xml_in (bool) – Switch to treat input as a TEI-XML document.
- gt_format (unicode) – Switch to select ground truth format. Valid values are ‘tei’, ‘hocr’, and ‘text’.
- clean_in (bool) – Normalize to NFD and strip input data. (DO NOT DISABLE!)
- clean_gt (bool) – Normalize to NFD and strip ground truth. (DO NOT DISABLE!)
- divert (bool) – Switch selecting output diversion. If enabled, the output will be added to the tracking arguments and the input document will be returned as the result of the task. Use this to insert a statistical measure into a chain without affecting the results.
Returns: Storage tuple of the output document
Return type: (unicode, unicode)
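The heart of this scheme reduces to difflib’s SequenceMatcher.ratio(); a sketch assuming plain-text input, with the NFD normalization and stripping that the clean_* switches perform:

import difflib
import unicodedata

def diff_ratio(ocr_text, gt_text):
    """Similarity in [0.0, 1.0] between OCR output and ground truth."""
    ocr = unicodedata.normalize('NFD', ocr_text).strip()
    gt = unicodedata.normalize('NFD', gt_text).strip()
    return difflib.SequenceMatcher(None, gt, ocr).ratio()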
nidaba.tasks.stats.text_edit_ratio(doc, method, ground_truth, xml_in, gt_format, clean_in, clean_gt, divert)¶
Calculates the similarity of the input documents and a given ground truth using the Damerau-Levenshtein distance. The result is a value between 0.0 (no commonality) and 1.0 (identical strings).
Parameters:
- doc (unicode, unicode) – The input document tuple
- method (unicode) – The suffix string appended to the output file.
- ground_truth (unicode) – Ground truth location tuple or a list of ground truths to choose from. When more than one is given, the file sharing the longest prefix with the input document is chosen.
- xml_in (bool) – Switch to treat input as a TEI-XML document.
- gt_format (unicode) – Switch to select ground truth format. Valid values are ‘tei’, ‘hocr’, and ‘text’.
- clean_in (bool) – Normalize to NFD and strip input data. (DO NOT DISABLE!)
- clean_gt (bool) – Normalize to NFD and strip ground truth. (DO NOT DISABLE!)
- divert (bool) – Switch selecting output diversion. If enabled, the output will be added to the tracking arguments and the input document will be returned as the result of the task. Use this to insert a statistical measure into a chain without affecting the results.
Returns: Storage tuple of the output document
Return type: (unicode, unicode)
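For illustration, such a ratio can be derived by normalizing the edit distance with the longer string’s length. The sketch below uses the restricted (optimal string alignment) variant of Damerau-Levenshtein and may differ from nidaba’s exact normalization:

def osa_distance(a, b):
    """Damerau-Levenshtein distance, restricted to non-overlapping transpositions."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def edit_ratio(gt, ocr):
    """Similarity in [0.0, 1.0]: 1.0 for identical strings."""
    if not gt and not ocr:
        return 1.0
    return 1.0 - osa_distance(gt, ocr) / max(len(gt), len(ocr))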