.. _tei_output:

==========
TEI Output
==========

TEI is a consortium that maintains standards for the representation of texts
in digital form; these standards are widely used in the humanities. Nidaba can
encode OCR results and their metadata into XML documents following the most
recent P5 guidelines. The output is designed to facilitate further manual
annotation.

Format
======

Output is generated following the embedded transcription scheme, with a single
surface containing all recognized line elements together with their textual
representation. The (body) skeleton of a TEI-OCR file will look like this:

.. code-block:: xml

    5 4 ... ...

.. note::

    The ``g`` tag does NOT encode single characters but entities fed into the
    character recognition engine. These entities are called grapheme clusters
    and may correspond to a single character/codepoint, to multiple
    codepoints, or, in the case of ligatures decomposed by the engine, even to
    multiple characters (œ to oe).

Header
======

Most of the TEI header is filled using the
:py:func:`nidaba.tasks.output.tei_metadata` task.

Attribution
===========

The source of a particular element is usually attributed using a series of
``respStmt`` blocks in the header of the document. A common example, encoding
a page segmentation and a character recognition as two sources of data in the
document, will resemble these two statements:

.. code-block:: xml

    page segmentation tesseract character recognition kraken

Elements themselves are linked to these statements using the ``resp``
attribute:

.. code-block:: xml

    x

When merging the output of multiple OCR engines, diverging ``readings`` will
also be attributed to their origin using the ``respStmt`` tag. Alternative
spellings provided by a spell checker are attributed in the same way.

Certainty
=========

Some recognition results have an associated confidence value, encoded with the
``certainty`` tag:

.. code-block:: xml

    Μ

Each ``certainty`` element refers to the identifier of the element it
describes through its ``target`` attribute. The probability is a float value
between 0 and 1, with higher values indicating higher confidence in the
result.
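
To illustrate how the ``resp`` and ``target`` links described above can be
followed by a consumer of the output, here is a minimal Python sketch using
only the standard library. The XML fragment in it is hypothetical and not
schema-valid TEI: the identifiers ``resp_1`` and ``g_1``, the ``degree``
attribute name (taken from the general TEI ``certainty`` element), and the
helper names are assumptions made for this example and are not taken from
actual Nidaba output.

.. code-block:: python

    import xml.etree.ElementTree as ET

    TEI_NS = "http://www.tei-c.org/ns/1.0"
    XML_NS = "http://www.w3.org/XML/1998/namespace"

    # Hypothetical fragment: ids, attribute values and nesting are illustrative.
    SAMPLE = """
    <TEI xmlns="http://www.tei-c.org/ns/1.0">
      <respStmt xml:id="resp_1">
        <resp>character recognition</resp>
        <name>kraken</name>
      </respStmt>
      <line>
        <g xml:id="g_1" resp="#resp_1">x</g>
        <certainty target="#g_1" locus="value" degree="0.87"/>
      </line>
    </TEI>
    """


    def tag(name):
        """Return the namespace-qualified form of a TEI element name."""
        return "{%s}%s" % (TEI_NS, name)


    def index_by_id(root):
        """Map every xml:id in the tree to the element carrying it."""
        key = "{%s}id" % XML_NS
        return {el.get(key): el for el in root.iter() if el.get(key) is not None}


    def confidences(root):
        """Map each targeted identifier to its confidence value."""
        result = {}
        for cert in root.iter(tag("certainty")):
            target = cert.get("target")
            degree = cert.get("degree")
            if target is None or degree is None:
                continue  # nothing to link or no value recorded
            value = float(degree)
            if not 0.0 <= value <= 1.0:
                raise ValueError("confidence out of range: %r" % degree)
            result[target.lstrip("#")] = value
        return result


    def responsible(element, ids):
        """Return the name in the respStmt that `element` points to, if any."""
        ref = element.get("resp")
        if ref is None:
            return None
        stmt = ids.get(ref.lstrip("#"))
        if stmt is None:
            return None
        return stmt.findtext(tag("name"))


    root = ET.fromstring(SAMPLE)
    ids = index_by_id(root)
    for target, value in confidences(root).items():
        element = ids[target]
        print(target, element.text, responsible(element, ids), value)

Both lookups strip the leading ``#`` from the pointer before resolving it
against the ``xml:id`` index, so one index serves attribution and certainty
alike.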