nidaba.plugins package

Submodules

nidaba.plugins.kraken module

nidaba.plugins.kraken

Plugin implementing access to kraken functions.

kraken is a fork of OCRopus implementing sane interfaces while preserving (mostly) functional equivalence. To use this plugin, kraken has to be installed into the current Python path, e.g. the current virtualenv. It is available from PyPI:

$ pip install kraken

It should be able to utilize any model trained for ocropus and is configured using the same global configuration options.

nidaba.plugins.kraken.kraken_nlbin(input_path, output_path, threshold=0.5, zoom=0.5, escale=1.0, border=0.1, perc=80, range=20, low=5, high=90)

Binarizes an input document utilizing ocropus’/kraken’s nlbin algorithm.

Parameters:
  • input_path (unicode) – Path to the input image
  • output_path (unicode) – Path to the output image
  • threshold (float) –
  • zoom (float) –
  • escale (float) –
  • border (float) –
  • perc (int) –
  • range (int) –
  • low (int) –
  • high (int) –
Raises:

NidabaInvalidParameterException – Input parameters are outside the valid range.

nidaba.plugins.kraken.max_bbox(boxes)

Calculates the minimal bounding box containing all boxes yielded by an iterator.

Parameters: boxes (iterator) – An iterator returning tuples of the format (x0, y0, x1, y1)
Returns: A box covering all bounding boxes in the input argument
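
The computation described above can be sketched as follows. This is a minimal illustration and not the plugin's actual code; the name max_bbox_sketch is hypothetical, and the sketch assumes the iterator yields at least one box.

```python
def max_bbox_sketch(boxes):
    """Return the smallest (x0, y0, x1, y1) box enclosing every input box.

    Hypothetical helper for illustration; raises ValueError on empty input.
    """
    # Transpose the boxes into four coordinate columns.
    x0s, y0s, x1s, y1s = zip(*boxes)
    # The enclosing box spans the smallest origin and the largest extent.
    return (min(x0s), min(y0s), max(x1s), max(y1s))


lines = iter([(10, 20, 50, 60), (5, 30, 40, 80)])
print(max_bbox_sketch(lines))
```

Because the boxes are consumed in a single pass, the sketch works with one-shot iterators as well as lists.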
nidaba.plugins.kraken.setup(*args, **kwargs)

nidaba.plugins.leptonica module

nidaba.plugins.leptonica

Plugin accessing leptonica functions.

This plugin requires a liblept shared object in the current library search path. On Debian-based systems it can be installed using apt-get

# apt-get install libleptonica-dev

Leptonica’s APIs are rather unstable and may differ significantly between versions. If this plugin fails with weird error messages or workers are just dying without discernible cause, please submit a bug report including your leptonica version.
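
Whether a liblept shared object can be located may be checked before enabling the plugin. The sketch below uses the standard library's ctypes.util.find_library; the function name leptonica_available is hypothetical, and the lookup only approximates the search the dynamic loader performs, so a negative result is a hint rather than proof.

```python
from ctypes.util import find_library


def leptonica_available():
    """Return the name of the liblept library if one is found, else None.

    Hypothetical helper; not part of nidaba.
    """
    return find_library('lept')


name = leptonica_available()
if name:
    print('liblept found: {}'.format(name))
else:
    print('liblept not found in the library search path')
```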

nidaba.plugins.leptonica.lept_deskew(image_path, output_path)

Removes skew (rotational distortion) from a 1bpp input image.

Parameters:
  • image_path (unicode) – Input image
  • output_path (unicode) – Path to the output document
Raises:

NidabaLeptonicaException – One of leptonica’s functions failed.

nidaba.plugins.leptonica.lept_dewarp(image_path, output_path)

Removes perspective distortion from a 1bpp input image.

Parameters:
  • image_path (unicode) – Path to the input image
  • output_path (unicode) – Path to the output image
Raises:

NidabaLeptonicaException – One of leptonica’s functions failed.

nidaba.plugins.leptonica.lept_sauvola(image_path, output_path, whsize=10, factor=0.35)

Binarizes an input document utilizing Sauvola thresholding as described in [0]. Expects 8bpp grayscale images as input.

[0] Sauvola, Jaakko, and Matti Pietikäinen. “Adaptive document image binarization.” Pattern recognition 33.2 (2000): 225-236.

Parameters:
  • image_path (unicode) – Input image path
  • output_path (unicode) – Output image path
  • whsize (int) – The width and height of the window on which local statistics are calculated are twice the value of whsize. The minimal value is 2.
  • factor (float) – The threshold reduction factor due to variance. 0 <= factor < 1.
Raises:

NidabaInvalidParameterException – Input parameters are outside the valid range.
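
To illustrate how factor acts on the local statistics, the sketch below evaluates the Sauvola threshold T = m * (1 - k * (1 - s / R)) for a single window, where m is the window mean, s the window standard deviation, and k the factor. It is not leptonica's implementation: the function name is hypothetical, and the dynamic range R = 128 is an assumption commonly made for 8bpp grayscale input.

```python
def sauvola_threshold(mean, stddev, factor=0.35, dynamic_range=128.0):
    """Return the Sauvola threshold for one window (hypothetical helper).

    mean and stddev are the gray-value statistics of the window;
    dynamic_range (R) is assumed to be 128 for 8bpp images.
    """
    if not 0 <= factor < 1:
        raise ValueError('factor must satisfy 0 <= factor < 1')
    # Low local variance pulls the threshold below the window mean.
    return mean * (1.0 - factor * (1.0 - stddev / dynamic_range))


# A flat window (no variance) lowers the threshold the most ...
print(sauvola_threshold(100.0, 0.0))
# ... while a high-contrast window keeps it close to the mean.
print(sauvola_threshold(100.0, 128.0))
```

A pixel darker than the threshold of its window is treated as foreground; a larger factor therefore classifies fewer pixels in low-contrast regions as foreground.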

nidaba.plugins.leptonica.setup(*args, **kwargs)

nidaba.plugins.ocropus module

nidaba.plugins.ocropus

Plugin implementing an interface to the ocropus OCR engine.

It requires working ocropus-* tools in your execution path. Please have a look at the website for installation instructions.

Important

If you do not require specific functionality of ocropus, please consider using the kraken plugin. Kraken does not require working around oddities in input argument acceptance and is generally more well-behaved than ocropus.

class nidaba.plugins.ocropus.micro_hocr

Bases: object

A simple class encapsulating hOCR attributes

add(*args)
nidaba.plugins.ocropus.ocr(image_path, segmentation_path, output_path, model_path)

Scan a single image with ocropus.

Reads a single image file from `image_path` and writes the recognized text as a TEI document into `output_path`.

Parameters:
  • image_path (unicode) – Path of the input file
  • segmentation_path (unicode) – Path of the segmentation XML file.
  • output_path (unicode) – Path of the output file
  • model_path (unicode) – Path of the recognition model. Must be a pyrnn.gz pickle dump interoperable with ocropus-rpred.
Returns:

The path of the output file that is actually written. As Ocropus rewrites output file paths without notice, it may differ from the `output_path` argument.

Return type:

(unicode)

Raises:

NidabaOcropusException – Ocropus somehow failed. The error output is contained in the message, but as ocropus is de facto unusable as a library it is impossible to deduce the nature of the problem.

nidaba.plugins.ocropus.setup(*args, **kwargs)

nidaba.plugins.tesseract module

nidaba.plugins.tesseract

Plugin implementing an interface to tesseract.

This plugin exposes tesseract’s functionality as a task. It implements two ways of calling tesseract, a direct method calling the tesseract executable and one utilizing the C-API available from tesseract 3.02 and upwards.

The C-API requires a libtesseract shared object in the current library path and training data in the configured tessdata directory:

# apt-get install libtesseract3 tesseract-ocr-$lang

Using the direct call method requires the tesseract binary, which can be installed by executing:

# apt-get install tesseract-ocr

Note

It is strongly encouraged to use the C-API whenever possible. It is supposedly stable while hOCR output file names change between tesseract versions.

Note

Parameters in configuration files supersede command line parameters. Modular page segmentation utilizing zone files requires that the page segmentation mode may be set freely. Uncomment the line:

tessedit_pageseg_mode 1

in the default hocr configuration (in TESSDATA/configs/).

Configuration

implementation (default='capi')
Selector for the call method. May either be capi, direct (tesseract hOCR output with .hocr extension), or legacy (tesseract hOCR output with .html extension).
tessdata (default='/usr/share/tesseract-ocr/')
Path to load tesseract training data and configuration from. Has to be one directory level upwards from the actual tessdata directory.
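
The relationship between the two options and the files involved can be sketched as follows. The helper names are hypothetical and not part of nidaba; the sketch only restates the rules above, namely that tessdata points one level above the actual tessdata directory and that direct and legacy differ in the hOCR file extension.

```python
import os

# Accepted values of the implementation option, as listed above.
IMPLEMENTATIONS = ('capi', 'direct', 'legacy')

# hOCR file extension written by the tesseract executable per call method.
HOCR_EXTENSION = {'direct': '.hocr', 'legacy': '.html'}


def training_data_dir(tessdata='/usr/share/tesseract-ocr/'):
    """Return the directory expected to hold the training data."""
    # The configured path is the parent of the actual tessdata directory.
    return os.path.join(tessdata, 'tessdata')


def check_implementation(implementation='capi'):
    """Validate the call method selector (hypothetical helper)."""
    if implementation not in IMPLEMENTATIONS:
        raise ValueError('unknown implementation: {}'.format(implementation))
    return implementation


print(training_data_dir())
print(HOCR_EXTENSION[check_implementation('legacy')])
```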
class nidaba.plugins.tesseract.Pix

Bases: _ctypes.Structure

class nidaba.plugins.tesseract.TessBaseAPI

Bases: _ctypes.Structure

class nidaba.plugins.tesseract.TessPageIterator

Bases: _ctypes.Structure

class nidaba.plugins.tesseract.TessResultIterator

Bases: _ctypes.Structure

class nidaba.plugins.tesseract.TessResultRenderer

Bases: _ctypes.Structure

nidaba.plugins.tesseract.ocr_capi(image_path, output_path, facsimile, languages, extended=False)

OCRs an image using the C API provided by tesseract versions 3.02 and higher.

Parameters:
  • image_path (unicode) – Path to the input image
  • facsimile (nidaba.tei.TEIFacsimile) – Facsimile object of the segmentation
  • output_path (unicode) – Path to the hOCR output
  • languages (list) – List of valid tesseract language identifiers
  • extended (bool) – Switch to select extended hOCR output containing character cuts and confidence values
nidaba.plugins.tesseract.ocr_direct(image_path, output_path, languages)

OCRs an image by calling the tesseract executable directly. Images are read using the linked leptonica library and the given output_path WILL be modified by tesseract.

Parameters:
  • image_path (unicode) – Path to the input image
  • output_path (unicode) – Path to the hOCR output
  • languages (list) – List of valid tesseract language identifiers
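
A direct call of this kind might assemble the tesseract command line as sketched below. This is an illustration under assumptions, not the plugin's code: the function name tesseract_command is hypothetical, and the sketch relies only on the documented CLI form `tesseract imagename outputbase -l lang hocr`, where several languages are joined with `+`.

```python
def tesseract_command(image_path, output_path, languages):
    """Build the argument list for a direct tesseract call.

    Hypothetical helper. tesseract treats output_path as a base name and
    appends its own extension, so the file written differs from output_path.
    """
    return ['tesseract', image_path, output_path,
            '-l', '+'.join(languages),
            'hocr']  # config file selecting hOCR output


cmd = tesseract_command('page.png', 'page_ocr', ['eng', 'deu'])
print(' '.join(cmd))
# The command could then be executed with subprocess.check_call(cmd),
# provided the tesseract binary is installed.
```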
nidaba.plugins.tesseract.setup(*args, **kwargs)

Module contents

nidaba.plugins

nidaba.plugins.setup(ext, data)