.. _tei_output:

==========
TEI Output
==========

The Text Encoding Initiative (TEI) is a consortium that maintains standards
for representing texts in digital form; these standards are widely used in the
humanities. Nidaba can encode OCR results and their metadata as XML documents
following the most recent P5 guidelines. The output is designed to facilitate
further manual annotation.

Format
======

Output is generated following the TEI embedded transcription scheme: a single
surface contains all recognized line elements together with their textual
representation.

The (body) skeleton of a TEI-OCR file will look like this:

.. code-block:: xml

    5
    4
    ...
    ...

.. note::

    The ``g`` tag does NOT encode single characters but the entities fed into
    the character recognition engine. These entities are called grapheme
    clusters and may correspond to a single character/codepoint, to multiple
    codepoints, or, in the case of ligatures decomposed by the engine, even to
    multiple characters (œ to oe).

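
As a rough illustration of how such a file might be consumed, the following
sketch joins the ``g`` grapheme clusters of each ``line`` back into plain text
using only the Python standard library. The sample fragment is hypothetical:
its element nesting is an assumption based on the TEI embedded transcription
scheme and is not taken from actual Nidaba output, which may wrap clusters in
additional elements.

```python
import xml.etree.ElementTree as ET

# Namespace used by TEI P5 documents.
NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, heavily simplified fragment (an assumption for illustration,
# not real Nidaba output).
SAMPLE = """\
<sourceDoc xmlns="http://www.tei-c.org/ns/1.0">
  <surface>
    <line>
      <g>5</g><g>4</g>
    </line>
    <line>
      <g>o</g><g>e</g>
    </line>
  </surface>
</sourceDoc>
"""


def line_texts(tei_xml):
    """Return one string per <line>, built from its <g> grapheme clusters."""
    root = ET.fromstring(tei_xml)
    lines = []
    for line in root.iterfind(".//tei:line", NS):
        # Each <g> holds one grapheme cluster, which may be more than one
        # codepoint, so the clusters are joined rather than counted.
        clusters = [g.text or "" for g in line.iterfind(".//tei:g", NS)]
        lines.append("".join(clusters))
    return lines


print(line_texts(SAMPLE))  # ['54', 'oe']
```

Searching with ``.//tei:g`` rather than for direct children keeps the sketch
working if clusters are nested inside intermediate elements.
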
Header
======
Most of the TEI header is filled using the
:py:func:`nidaba.tasks.output.tei_metadata` task.
Attribution
===========

The source of a particular element is usually attributed using a series of
``respStmt`` blocks in the header of the document. A common example, encoding
a page segmentation and a character recognition as two sources of data in the
document, resembles these two statements:

.. code-block:: xml

    <respStmt>
        <resp>page segmentation</resp>
        <name>tesseract</name>
    </respStmt>
    <respStmt>
        <resp>character recognition</resp>
        <name>kraken</name>
    </respStmt>

Elements themselves are linked to these statements using the ``resp``
attribute:

.. code-block:: xml

    x

When the output of multiple OCR engines is merged, diverging ``readings`` are
also attributed to their origin using the ``respStmt`` tag. Alternative
spellings provided by a spell checker are likewise attributed.
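
The attribution mechanism described above can be followed programmatically.
The sketch below builds an index of ``respStmt`` blocks and resolves each
``resp`` attribute against it. The identifiers ``resp_0`` and ``resp_1``, the
``xml:id`` attributes, and the placement of ``resp`` on ``line`` and ``g`` are
assumptions made for illustration; the actual identifiers and elements in
Nidaba output may differ.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}
# ElementTree expands the predeclared xml: prefix to this qualified name.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical document; identifiers and nesting are illustrative assumptions.
SAMPLE = """\
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <respStmt xml:id="resp_0">
      <resp>page segmentation</resp>
      <name>tesseract</name>
    </respStmt>
    <respStmt xml:id="resp_1">
      <resp>character recognition</resp>
      <name>kraken</name>
    </respStmt>
  </teiHeader>
  <sourceDoc>
    <surface>
      <line resp="#resp_0"><g resp="#resp_1">x</g></line>
    </surface>
  </sourceDoc>
</TEI>
"""


def responsibility_index(root):
    """Map each respStmt identifier to a (responsibility, name) pair."""
    index = {}
    for stmt in root.iterfind(".//tei:respStmt", NS):
        index[stmt.get(XML_ID)] = (
            stmt.findtext("tei:resp", namespaces=NS),
            stmt.findtext("tei:name", namespaces=NS),
        )
    return index


def attributed_elements(root):
    """Return (local tag, (responsibility, name)) for elements with resp."""
    index = responsibility_index(root)
    result = []
    for elem in root.iter():
        ref = elem.get("resp")
        if ref is None:
            continue
        local_tag = elem.tag.rsplit("}", 1)[-1]
        # Only a single "#id" pointer is handled in this sketch.
        result.append((local_tag, index.get(ref.lstrip("#"))))
    return result


root = ET.fromstring(SAMPLE)
for tag, source in attributed_elements(root):
    print(tag, source)
```

TEI permits several whitespace-separated pointers in one ``resp`` value; a
fuller version would split the attribute before looking up each identifier.
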
Certainty
=========

Some recognition results have a confidence value associated with them, encoded
using the ``certainty`` tag:

.. code-block:: xml

    Μ

Each ``certainty`` element refers to the identifier of the element it targets
through its ``target`` attribute. The probability is a float value between 0
and 1, with higher values indicating higher confidence in the result.
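
A consumer interested in uncertain results could collect these values and
filter on a threshold, for example to decide what to review manually. The
following sketch assumes that the probability is stored in the ``degree``
attribute of ``certainty`` (the attribute TEI defines for this purpose) and
that targets carry an ``xml:id``; the sample document, its identifiers and its
confidence values are invented for illustration.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical fragment; identifiers, nesting and degree values are assumptions.
SAMPLE = """\
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <sourceDoc>
    <surface>
      <line>
        <g xml:id="g_0">Μ</g>
        <certainty target="#g_0" degree="0.87"/>
        <g xml:id="g_1">ε</g>
        <certainty target="#g_1" degree="0.42"/>
      </line>
    </surface>
  </sourceDoc>
</TEI>
"""


def confidences(root):
    """Map target identifiers to (text, confidence) pairs."""
    by_id = {el.get(XML_ID): el for el in root.iter() if el.get(XML_ID)}
    result = {}
    for cert in root.iterfind(".//tei:certainty", NS):
        target = cert.get("target", "").lstrip("#")
        degree = cert.get("degree")
        # Skip statements without a usable target or confidence value.
        if target in by_id and degree is not None:
            result[target] = (by_id[target].text, float(degree))
    return result


def below_threshold(root, threshold=0.5):
    """Return identifiers whose confidence is lower than the threshold."""
    return [
        ident
        for ident, (_, degree) in confidences(root).items()
        if degree < threshold
    ]


root = ET.fromstring(SAMPLE)
print(confidences(root))
print(below_threshold(root))
```

The threshold of 0.5 is arbitrary; a sensible cut-off depends on the
recognition engine and the material.
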