Plugins¶
Tasks requiring extensive external or hard-to-install dependencies are contained in plugins. Some plugins are shipped with the standard distribution while others may be available from third parties. Prerequisites needed for using each plugin are usually stated in its docstring.
Architecture¶
Nidaba uses the stevedore Python package for dynamic plugin management. stevedore builds on top of setuptools entry points, so plugins from any source can be used as long as they have been installed using setuptools.
Plugins are located in the nidaba.plugins namespace and configured in the plugins_load section of the nidaba.yaml configuration file:
    plugins_load:
      tesseract: {implementation: capi,
                  tessdata: /usr/share/tesseract-ocr}
      ocropus: {}
      kraken: {}
      leptonica: {}
Configuration data required by plugins can be stored in the dictionary beneath the plugin name; after importing the module the setup function of the module will be called with the corresponding configuration data.
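The hand-over of configuration data described above can be sketched as follows. This is a minimal illustration, not nidaba's loader: ExamplePlugin, load_plugins and the keyword-argument calling convention of setup are assumptions made for the example, and the dictionary stands in for the parsed plugins_load section.

```python
# Stand-in for the parsed plugins_load section of nidaba.yaml.
PLUGINS_LOAD = {
    "tesseract": {"implementation": "capi",
                  "tessdata": "/usr/share/tesseract-ocr"},
    "kraken": {},
}


class ExamplePlugin(object):
    """Hypothetical plugin; a real plugin would be a Python module."""

    def __init__(self):
        self.settings = {}

    def setup(self, **config):
        # Assumption: the dictionary beneath the plugin name is passed
        # to setup as keyword arguments.
        self.settings.update(config)


def load_plugins(plugins_load, available):
    """Calls setup on every configured plugin with its configuration."""
    loaded = {}
    for name, config in plugins_load.items():
        plugin = available[name]
        plugin.setup(**(config or {}))
        loaded[name] = plugin
    return loaded


available = {"tesseract": ExamplePlugin(), "kraken": ExamplePlugin()}
loaded = load_plugins(PLUGINS_LOAD, available)
print(loaded["tesseract"].settings["implementation"])  # prints "capi"
```

A plugin without configuration data, such as kraken above, simply receives an empty set of options.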
Registering tasks requires access to the global celery application object. After importing it from nidaba.celery, your tasks can be decorated as usual. Remember that all tasks should derive from the nidaba.tasks.helper.NidabaTask class.
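A task registration along these lines might look like the sketch below. The name app for the application object, the task name nidaba.example.noop and the noop function are assumptions made for illustration; the fallback branch only provides stand-ins so that the sketch also runs where nidaba is not installed.

```python
try:
    # Assumption: the celery application object is exposed as ``app``.
    from nidaba.celery import app
    from nidaba.tasks.helper import NidabaTask
except ImportError:
    # Stand-ins used only when nidaba is not importable.
    class NidabaTask(object):
        pass

    class _StandInApp(object):
        def __init__(self):
            self.tasks = {}

        def task(self, base=None, name=None):
            # Mimics the shape of celery's task decorator: records the
            # function under its task name and returns it unchanged.
            def decorator(func):
                self.tasks[name] = func
                return func
            return decorator

    app = _StandInApp()


@app.task(base=NidabaTask, name=u'nidaba.example.noop')
def noop(doc, method=u'noop'):
    """Hypothetical task returning the input storage tuple unchanged."""
    return doc
```

Passing base=NidabaTask to the decorator is what makes the task derive from NidabaTask; base and name are regular options of celery's task decorator.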
Builtin Plugins¶
nidaba.plugins.kraken¶
Plugin implementing access to kraken functions.
kraken is a fork of OCRopus implementing sane interfaces while preserving (mostly) functional equivalence. To use this plugin, kraken has to be installed into the current Python path, e.g. the current virtualenv. It is available from PyPI:
$ pip install kraken
It should be able to utilize any model trained for ocropus and is configured using the same global configuration options.
nidaba.plugins.kraken.nlbin(doc, method, threshold, zoom, escale, border, perc, range, low, high)¶
Binarizes an input document utilizing ocropus’/kraken’s nlbin algorithm.
Parameters: - doc (unicode, unicode) – The input document tuple.
- method (unicode) – The suffix string appended to all output files.
- threshold (float) –
- zoom (float) –
- escale (float) –
- border (float) –
- perc (int) –
- range (int) –
- low (int) –
- high (int) –
Returns: Storage tuple of the output file
Return type: (unicode, unicode)
Raises: NidabaInvalidParameterException – Input parameters are outside the valid range.
nidaba.plugins.kraken.segmentation_kraken(doc, method)¶
Performs page segmentation using kraken’s built-in algorithm and writes a skeleton TEI file.
Parameters: - doc (unicode, unicode) – The input document tuple
- method (unicode) – The suffix string appended to all output files
- black_colseps (bool) – Assume black column separators instead of white ones.
Returns: Two storage tuples with the first one containing the segmentation and the second one being the file the segmentation was calculated upon.
nidaba.plugins.kraken.ocr_kraken(doc, method, model)¶
Runs kraken on an input document and writes a TEI file.
Parameters: - doc (unicode, unicode) – The input document tuple
- method (unicode) – The suffix string appended to all output files
- model (unicode) – Identifier for the font model to use
Returns: Storage tuple for the output file
Return type: (unicode, unicode)
nidaba.plugins.ocropus¶
Plugin implementing an interface to the ocropus OCR engine.
It requires working ocropus-* tools in your execution path. Please have a look at the website for installation instructions.
Important
If you do not require specific functionality of ocropus, please consider using the kraken plugin. Kraken does not require working around oddities in input argument acceptance and is generally better behaved than ocropus.
nidaba.plugins.ocropus.ocr_ocropus(doc, method, model)¶
Runs ocropus on an input document.
Parameters: - doc (unicode, unicode) – The input document tuple
- method (unicode) – The suffix string appended to all output files
- model (unicode) – Identifier for the font model to use
Returns: Storage tuple for the output file
Return type: (unicode, unicode)
nidaba.plugins.tesseract¶
Plugin implementing an interface to tesseract.
This plugin exposes tesseract’s functionality as a task. It implements two ways of calling tesseract, a direct method calling the tesseract executable and one utilizing the C-API available from tesseract 3.02 and upwards.
The C-API requires a libtesseract shared object in the current library path and training data in the configured tessdata directory:
# apt-get install libtesseract3 tesseract-ocr-$lang
Using the direct call method requires the tesseract binary installable by executing:
# apt-get install tesseract-ocr
Note
It is strongly encouraged to use the C-API whenever possible. It is supposedly stable while hOCR output file names change between tesseract versions.
Note
Parameters in configuration files supersede command line parameters. Modular page segmentation utilizing zone files requires that the page segmentation mode can be set freely, so it must not be fixed by a configuration file. Comment out the line:
tessedit_pageseg_mode 1
in the default hocr configuration (in TESSDATA/configs/).
Configuration¶
- implementation (default=’capi’)
- Selector for the call method. May either be capi, direct (tesseract hOCR output with .hocr extension), or legacy (tesseract hOCR output with .html extension).
- tessdata (default=’/usr/share/tesseract-ocr/’)
- Path to load tesseract training data and configuration from. Has to be one directory level above the actual tessdata directory.
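The two options can be illustrated with a small check. This is a sketch under stated assumptions: resolve_tessdata_dir is a hypothetical helper and not part of the plugin, and the name tessdata for the subdirectory is inferred from the description of the tessdata option above.

```python
import os.path

# Call methods named in the implementation option.
VALID_IMPLEMENTATIONS = ("capi", "direct", "legacy")


def resolve_tessdata_dir(implementation="capi",
                         tessdata="/usr/share/tesseract-ocr/"):
    """Validates the call method and returns the directory that is
    expected to hold the training data (hypothetical helper)."""
    if implementation not in VALID_IMPLEMENTATIONS:
        raise ValueError("unknown implementation: %r" % (implementation,))
    # The configured path is the parent of the actual tessdata directory.
    return os.path.join(tessdata, "tessdata")


print(resolve_tessdata_dir())
```

With the default configuration the training data would therefore be looked up in a tessdata directory directly beneath /usr/share/tesseract-ocr/.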
nidaba.plugins.tesseract.segmentation_tesseract(doc, method)¶
Performs page segmentation using tesseract’s built-in algorithm and writes a TEI XML segmentation file.
Parameters: - doc (unicode, unicode) – The input document tuple
- method (unicode) – The suffix string appended to all output files.
Returns: Two storage tuples with the first one containing the segmentation and the second one being the file the segmentation was calculated upon.
nidaba.plugins.tesseract.ocr_tesseract(doc, method, languages)¶
Runs tesseract on an input document.
Parameters: - doc (unicode, unicode) – The input document tuple
- method (unicode) – The suffix string appended to all output files
- languages (list) – A list of tesseract classifier identifiers
- extended (bool) – Switch to enable extended hOCR generation containing character cuts and confidences. Has no effect when direct or legacy implementation is used.
Returns: Storage tuple for the output file
Return type: (unicode, unicode)
nidaba.plugins.leptonica¶
Plugin accessing leptonica functions.
This plugin requires a liblept shared object in the current library search path. On Debian-based systems it can be installed using apt-get:
# apt-get install libleptonica-dev
Leptonica’s APIs are rather unstable and may differ significantly between versions. If this plugin fails with unexpected error messages or workers die without discernible cause, please submit a bug report including your leptonica version.
nidaba.plugins.leptonica.sauvola(doc, method, whsize, factor)¶
Binarizes an input document utilizing Sauvola thresholding as described in [0]. Expects 8bpp grayscale images as input.
[0] Sauvola, Jaakko, and Matti Pietikäinen. “Adaptive document image binarization.” Pattern recognition 33.2 (2000): 225-236.
Parameters: - doc (unicode, unicode) – The input document tuple.
- method (unicode) – The suffix string appended to all output files
- whsize (int) – The window width and height that local statistics are calculated on are twice the value of whsize. The minimal value is 2.
- factor (float) – The threshold reduction factor due to variance. 0 <= factor < 1.
Returns: Storage tuple of the output file
Return type: (unicode, unicode)
Raises: NidabaInvalidParameterException – Input parameters are outside the valid range.
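The role of factor can be illustrated with the local threshold from [0], T = m * (1 + k * (s / R - 1)), where m and s are the mean and standard deviation of the window, k is the factor and R = 128 for 8bpp grayscale input. The sketch below is a plain Python illustration of that formula together with the parameter ranges stated above; it is not the plugin's code, and ValueError stands in for NidabaInvalidParameterException.

```python
def check_sauvola_parameters(whsize, factor):
    """Rejects parameters outside the ranges documented for sauvola."""
    if whsize < 2:
        raise ValueError("whsize must be at least 2")
    if not 0 <= factor < 1:
        raise ValueError("factor must satisfy 0 <= factor < 1")


def sauvola_threshold(mean, stddev, factor, dynamic_range=128.0):
    """Local Sauvola threshold for one window of an 8bpp image.

    Equivalent to T = m * (1 + k * (s / R - 1)).
    """
    return mean * (1.0 - factor * (1.0 - stddev / dynamic_range))


check_sauvola_parameters(whsize=10, factor=0.3)
# A flat window (no variance) lowers the threshold by the full factor.
print(sauvola_threshold(mean=100.0, stddev=0.0, factor=0.5))    # prints 50.0
# A window whose deviation equals R leaves the mean unchanged.
print(sauvola_threshold(mean=100.0, stddev=128.0, factor=0.3))  # prints 100.0
```

A pixel is set to foreground when its gray value lies below the threshold of its window, so larger values of factor suppress more of the low-contrast background.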
nidaba.plugins.leptonica.dewarp(doc, method)¶
Removes perspective distortion (as commonly exhibited by overhead scans) from a 1bpp input image.
Parameters: - doc (unicode, unicode) – The input document tuple.
- method (unicode) – The suffix string appended to all output files.
Returns: Storage tuple of the output file
Return type: (unicode, unicode)
nidaba.plugins.leptonica.deskew(doc, method)¶
Removes skew (rotational distortion) from a 1bpp input image.
Parameters: - doc (unicode, unicode) – The input document tuple.
- method (unicode) – The suffix string appended to all output files.
Returns: Storage tuple of the output file
Return type: (unicode, unicode)