For this publication, a Creative Commons Attribution 4.0 International license has been granted by the author(s), who retain full copyright.
Hugoye: Journal of Syriac Studies is an electronic journal dedicated to the study of the Syriac tradition, published semi-annually (in January and July) by Beth Mardutho: The Syriac Institute. Published since 1998, Hugoye seeks to offer the best scholarship available in the field of Syriac studies.
This paper summarizes the results of an extensive test of Tesseract 4.0, an open-source Optical Character Recognition (OCR) engine with Syriac capabilities, and ascertains the current state of Syriac OCR technology. Three popular print types (S14, W64, and E22) representing the Syriac type styles Estrangela, Serto, and East Syriac were OCRed using Tesseract’s two different OCR modes (Syriac Language and Syriac Script). Handwritten manuscripts were also preliminarily tested for OCR. The tests confirm that Tesseract 4.0 may be relied upon for printed Estrangela texts but should be used with caution and human revision for Serto and East Syriac printed texts. Consonantal accuracy lies around 99% for Estrangela, between 89% and 94% for Serto, and around 89% for East Syriac. Scholars may use Tesseract to OCR Estrangela texts with a high degree of confidence, but further training of the engine will be required before Serto and East Syriac texts can be smoothly OCRed. In all type styles, human revision of the OCRed text is recommended when scholars desire an exact, error-free corpus.
This research was conducted under digital humanities fellowships offered in 2018 at Beth Mardutho: The Syriac Institute. The authors wish to thank Dr. George A. Kiraz for his guidance, refining thought, and bestowal of the research opportunity. They also wish to thank the three reviewers for their incisive feedback on a previous draft.
Digital humanities scholars are constantly searching for automatable processes that will free up time for deeper analysis. One such process, optical character recognition (OCR), has dramatically expanded scholars’ ability to search, cross-reference, and analyze large corpora. But while OCR capabilities for Western languages have been in advanced stages for some time, Syriac OCR remains in development, and Syriac scholars have not yet analyzed the accuracy of available OCR engines. To address this lacuna, a team of analysts carried out extensive tests on Tesseract 4.0, an open-source OCR engine with Syriac capabilities. This paper summarizes the test results in order to provide data-based recommendations for current Syriac OCR possibilities.
This project evaluated three popular print types (S14, W64, and
E22) representing the Syriac type styles Estrangela, Serto, and East Syriac and
tested a representative sample of each type style against both modes available
for Syriac in Tesseract, Syriac Language and Syriac Script. A brief note on terminology
used throughout the paper is in order: the term “font” denotes a
computer font such as the Meltho fonts. The term “print type” is used to
denote the physical type that corresponds to its code in J. F. Coakley’s
Typography of Syriac. For instance, the Meltho font Serto Jerusalem is
based on the print type W11B. When the Tesseract OCR engine is invoked,
it takes the “-l” (for language) command-line argument. The language
value typically matches the ISO 639-2 three-letter code (e.g., “syr” for
Syriac) but is in reality the name given to the trained data in
Tesseract’s “tessdata” subfolder.
The project’s extensive tests of Tesseract 4.0 confirmed that this
OCR engine may be relied upon for printed Estrangela texts but should be used
with caution and human revision for Serto and East Syriac printed texts.
Consonantal accuracy lies around 99% for Estrangela, between 89% and 94% for
Serto, and around 89% for East Syriac. When diacritics, punctuation, and
non-Syriac characters are taken into consideration, accuracy rates drop
dramatically: around 95% for Estrangela, approximately 86% for Serto, and around
77% for East Syriac. As will later be explained, every type style was
OCRed with two modes, hence the results averaged here. Precise
calculations follow.
In order to comprehend the tests that were conducted, it is
essential to clarify at the outset a technical distinction within Tesseract’s
computation and the terms used to describe it. Tesseract can be invoked in two
Syriac modes: Language and Script. The former makes use of language-specific
training data for OCR. The user can use this mode to recognize texts in English,
German, or French, for example. The latter is script wide (e.g., the Latin
script) and is not sensitive to a specific language. Having said that, according
to the Tesseract user documentation, when one invokes it in Script mode, English
is always added to the mix. “TESSERACT (1) Manual Page,” GitHub, last updated
July 2, 2018, https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#languages.
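The distinction between the two modes comes down to which trained data the “-l” argument names: a language-specific model (“syr”) or a script-wide one (“script/Syriac”). The following minimal sketch shows how the two invocations might be built; the image and output file names are placeholders.

```python
def build_tesseract_command(image_path, output_base, mode):
    """Return the command-line arguments for one of the two Syriac modes.

    Language mode uses the Syriac-specific trained data ("syr");
    Script mode uses the script-wide data ("script/Syriac"), to which
    English is always added, per the Tesseract documentation.
    """
    lang = {"language": "syr", "script": "script/Syriac"}[mode]
    return ["tesseract", image_path, output_base, "-l", lang]

# Language mode: tesseract page1.png page1_out -l syr
cmd_language = build_tesseract_command("page1.png", "page1_out", "language")
# Script mode:  tesseract page1.png page1_out -l script/Syriac
cmd_script = build_tesseract_command("page1.png", "page1_out", "script")
# Passing either list to subprocess.run() would perform the actual OCR,
# assuming Tesseract 4.0 and the Syriac trained data are installed.
```
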
This lengthy paper contains overall results, results split by type style, analysis of the significant error trends, manuscripts results, practical tips, and concluding remarks. Reading the work in full will give the best understanding of the project and its conclusions; however, readers can still gain valuable information from focusing on certain parts of the paper. For those readers who prefer data to prose, tables and graphs are included throughout the paper and especially in Appendix 2. For scholars who have a specific Syriac text or project in mind, this paper highlights the results by type style (Estrangela, Serto, and East Syriac) and analyzes the most significant accuracy difficulties in each; these results will help those scholars understand if and how Tesseract can best assist them in their task. For those readers who are interested in handwritten texts, there is a section on manuscript results. The methodology of the project as well as the tips for usage will provide some practical advice for OCRing Syriac, though there are practical insights throughout the paper as well.
Syriac optical character recognition (OCR) has been sought
since the early 1990s. OCR is the electronic conversion of typeset or
handwritten text images into machine-encoded texts; the process turns an
unsearchable image of a text into a searchable text file. As one can
imagine, effective OCR drastically speeds up the process of searching and
analyzing large sets of text. Syriac OCR will thus allow for the creation of
virtual text corpora on par with, for example, the Thesaurus Linguae Graecae.
The primary hindrance to creating an OCR engine for Syriac has
always been the cursive nature of its script, which prevents precise
letter differentiation. This same concern has applied to Arabic
OCR. Maxim Romanov, Matthew Thomas Miller, Sarah
Bowen Savant, and Benjamin Kiessling, “Important New Developments in
Arabographic Optical Character Recognition (OCR),” March 2017. https://arxiv.org/abs/1703.09550
William F. Clocksin and Prem Fernando,
“Towards Automatic Transcription of Estrangelo Script,” Hugoye: Journal of Syriac Studies 6, no. 2
(2003): 249–68. Clocksin has returned to the Syriac OCR problem in
recent months, and his new OCR software, which is in development,
should be evaluated in the future. Elizabeth Tse and Josef Bigun,
“A base-line character recognition for Syriac-Aramaic,” 2007 IEEE
International Conference on Systems, Man and Cybernetics (Montreal,
Quebec: IEEE, 2007), 1048–55, doi: 10.1109/ICSMC.2007.4414012,
https://ieeexplore.ieee.org/document/4414012/authors.
In 2017, Grigory Kessel noticed that Syriac PDFs uploaded to
Google Drive became searchable files. Kessel correctly assumed that there
must be an OCR engine running in the background and notified the Syriac
studies community through a Syriac studies listserv. Grigory Kessel, April 21,
2017, message to hugoye-list, https://groups.yahoo.com/neo/groups/hugoye-list/conversations/messages/8069.
For more
information, see “Tesseract OCR,” Google Open
Source, https://opensource.google.com/projects/tesseract,
accessed 30 July 2018.
The team’s preliminary question, before analyzing Tesseract’s
output, was: how was Tesseract trained and on what sort of data? In the
absence of any documentation from Tesseract’s programmers, the team
considered two options: either a large set of texts was transcribed and fed
into Tesseract with its associated text images or an automated method was
used to produce text images and their corresponding texts. “TrainingTesseract 4.00,”
GitHub, last revised 15 July 2018, https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00,
accessed 30 July 2018.
Tesseract language-specific training guide, GitHub, last revised 19
July 2018, https://github.com/tesseract-ocr/tesseract/blob/master/src/training/language-specific.sh.
Figure 1: Example of Estrangelo TurAbdin Font
The Patriarchal Journal of the Syrian Orthodox Patriarchate of Antioch
and All the East 40, nos. 211–212–213 (2002), 94.
Fortunately, the standard non-calligraphic Meltho fonts are based on actual print types that were used in the vast majority of Syriac publications from the late-nineteenth century onward. This suggests that Tesseract should be able to accurately recognize printed editions that are typeset in those print types that were the basis of Meltho fonts. The non-calligraphic Meltho fonts with their corresponding print types are:
J. F.
Coakley, The Typography of Syriac: A
historical catalogue of printing types, 1537-1958 (New
Castle, DE, and London: Oak Knoll Press and The British Library,
2006), 172–76.
Coakley, Typography of Syriac,
178–79.
Coakley, Typography of
Syriac, 50–56.
Coakley,
Typography of Syriac,
125–28.
Coakley, Typography of
Syriac, 132–135.
Coakley, Typography of Syriac, 104-106,
136-137.
Coakley, Typography of Syriac, 144–46;
Bar Hebraeus, Le Livre des Splendeurs de
Grégoire Barhebraeus, ed. Alex Moberg (London: Humphrey
Milford, 1922).
Coakley, Typography of
Syriac, 225–28. For the history of this East Syriac
type, see J. F. Coakley, “Edward Breath and the Typography of
Syriac,” Harvard Library Bulletin, New
Series, vol. 6, no. 4 (1995): 4–64.
In addition to these print type-based fonts, the Meltho set
includes two fonts based directly on manuscripts. The Estrangelo Antioch
font was designed based on Damascus 12/21, a
manuscript copied in 1041/2 CE and housed in the Syriac Orthodox Patriarchal
Library in Damascus, and Estrangelo Midyat was designed based on a 13th
century manuscript, originally of the Church of Mort Shmoni in Midyat and
now at Mor Gabriel Monastery in Tur Abdin.
This project aimed to give an overview of Tesseract’s ability to OCR Syriac, and thus, the three major scripts were represented. A popular print type was chosen for each of the three type styles Estrangela, Serto, and East Syriac:
Estrangela: print type S14 Išo‘dad de Merv, Commentaire d’Išo‘dad de Merv Sur l’Ancien Testament I.
Genèse, ed. J.-M. Vosté and C. Van den Eynde, CSCO 126,
Scriptores Syri 67 (Louvain: Imprimerie Orientaliste L. Durbecq,
1950).
Serto: print type W64 Severus of Antioch, Les
Homiliae Cathedrales de Sévère d’Antioche: traduction Syriaque
de Jacques d’Édesse (Homélies LII–LVII), ed. and trans. R.
Duval, Patrologia Orientalis 4 (Paris: Librarie de Paris,
1908).
East Syriac: print type E22
Acta Martyrum et Sanctorum Syriace, vol. 1,
ed. Paul Bedjan (Hildesheim: Georg Olms Verlagsbuchhandlung,
1968).
Syriac texts that were printed in these frequently-used print
types were chosen for OCRing. The Estrangela and Serto texts are unvocalized
since the former is usually not vocalized and the latter is not generally
vocalized in text editions. East Syriac, on the other hand, tends to be
vocalized in many text editions. Several pages from each text were scanned
at 300 dpi in black and white and cropped in order to generate images that
only included the main body of the text, thus excluding marginal notes and
footnotes. In prior OCR tests, footnotes, headings, and
marginal notes had caused difficulties with line segmentation and
made the OCR results less accurate. In choosing a page to test, the clarity
of the image and the amount of text were both considered. Pages that
only included full lines of text were prioritized, as were images
with the most even lines. The pages were also not pre-segmented in
this OCR process; they were OCRed as they appear in Figure
2.
Figure 2: Samples of the
Original Images Shown to Scale
Tesseract was downloaded onto two Windows laptops. The team
decided to run Tesseract only on Windows computers as a point of
consistency across the outputs, but it used more than one machine in
order to check the outputs against each other. Tesseract 4.0 was
downloaded from the GitHub page maintained by the Mannheim
University Library (UB Mannheim):
https://github.com/UB-Mannheim/tesseract/wiki, accessed 30 July 2018.
Figure 3: Tesseract Run
through Command Prompt
There are multiple variables at play in this OCR process, and
they were accounted for as carefully as possible throughout the test. The
first is Tesseract’s language code command-line arguments (which run
Tesseract using Syriac Language and Syriac Script modes). A second variable
is the output file kind. Tesseract can be invoked to output text (.txt
files), PDF, or hOCR files. The hOCR file type is built upon HTML; see
FAQ, GitHub, https://github.com/tesseract-ocr/tesseract/wiki/FAQ,
accessed 30 July 2018. One option for command-line arguments that
occasionally interfered with accuracy was selecting multiple output
file types at once. When executing the Tesseract command-line
arguments, one must choose a file kind for output; selecting “txt”
generates the OCR as a .txt file, and so forth. However, one can choose
to generate multiple file kinds at once by entering all codes into the
same command-line argument. For the most part, the OCR text that
results from multi-output functions is identical to that generated by
single-output functions. However, sometimes a multi-output function
generated gibberish. For example, a page in Serto was run through
Tesseract with the Script command-line argument and with an output of
“txt pdf hocr,” and one of the resultant lines was: “bs avwnE vrtwAUM
oa airog +”
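The output-kind choice described above is passed as trailing configuration names on the command line. The sketch below only builds the argument list rather than running the engine; the file names are placeholders.

```python
def build_ocr_command(image, out_base, outputs=("txt",)):
    """Build a Tesseract invocation that writes one or more output kinds.

    Listing several config names at once (e.g., "txt", "pdf", "hocr")
    produces multiple files from a single run; this is the combination
    that occasionally yielded gibberish in the tests described above.
    """
    return ["tesseract", image, out_base, "-l", "syr", *outputs]

# Single output:   tesseract page.png out -l syr txt
single = build_ocr_command("page.png", "out")
# Multiple output: tesseract page.png out -l syr txt pdf hocr
multi = build_ocr_command("page.png", "out", ("txt", "pdf", "hocr"))
```
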
After the pages were processed through Tesseract and converted
to Word files, data collection began. First the team calculated data for the
original images in the following categories: consonants, punctuation marks,
diacritics, and non-Syriac characters (including numerals, brackets, and
asterisks). To determine these numbers, several lines of each page were
counted and averaged, and that average was then used to estimate character
counts for the whole page. One analyst counted the exact number of
characters for her pages without using averages and found her counts
to be similar to those of her colleagues. This information is given in
Appendix 2. In other words, the team estimated total character counts
for consonants, diacritics, punctuation, and non-Syriac characters,
but it counted exact numbers of individual consonants.
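The estimation method just described reduces to a simple average-and-scale calculation. The sketch below illustrates it with made-up line counts, not the project’s actual figures.

```python
def estimate_page_counts(sample_line_counts, total_lines):
    """Estimate a whole-page character count from a few hand-counted
    lines, mirroring the team's method: average the sampled lines,
    then scale by the number of lines on the page."""
    avg = sum(sample_line_counts) / len(sample_line_counts)
    return round(avg * total_lines)

# e.g., three sampled lines with 41, 38, and 44 consonants
# on a 30-line page:
estimate = estimate_page_counts([41, 38, 44], 30)  # -> 1230
```
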
Next, the analysts began to identify and record the errors
that appeared in the 18 OCRed documents, comparing each one to its original
image.
Regardless of the type style of the original image, the Word documents
used by the analysts for comparison typically presented the text in
each computer’s default Syriac font, generally an Estrangela font.
Analysts could leave the Estrangela font or change fonts to match
the original image. Two analysts chose the former method, one the
latter. Smudges or dust that appeared on the original image as if
they were characters were not considered errors, per se, as the OCR
correctly identified the character even though the character was not
intended in the original image.
If a nun was incorrectly recognized as a yudh in
one word and as an olaph in another word, these
occurrences were recorded as distinct errors. This allowed more precision in
data analysis later. Over the course of the project, it is likely that each
analyst reviewed her six pages more than five times, rechecking data and
pulling information on specific issues. These reviews were often prompted by
an unusual number of a certain type of error or a need to record some errors
even more specifically than originally intended. Though tedious, the process
allowed the analysts to identify the most common errors within Tesseract
modes, within particular type styles, and within individual images, as well
as to develop basic understandings of why these particular errors might have
occurred. Having a record of all errors produced also provided a wider
picture of the breadth of issues that can appear when OCRing Syriac
texts. Further on, there will be a discussion of truly odd occurrences
found within these pages. Of course, this does not present an exhaustive
account of what could appear in Syriac OCR pages, but it does give a
basic understanding of what sorts of errors can occur.
Figure 4: Sample
Selection of Data Collected from East Syriac Spreadsheet
The key data point to determine was the rate of recognition
accuracy. Accuracy, simply stated, is the inverse of the error rate;
accuracy was calculated by subtracting the error rate from one. For
most data points, the error rate is understood as the total errors
divided by the total characters in the original image. This includes
any truncated characters, which may be considered an OCR segmentation
issue rather than an OCR recognition issue. Because accuracy was
defined as total errors (including added characters) divided by
characters in the original image, the accuracy rates can be negative
if there are more resulting errors than characters in the original
image. The characters that were added cause this problem. Some
accuracy rates could not be calculated due to the lack of certain
characters (e.g., non-Syriac characters) in the original. The
consonant accuracy rates do not include truncated characters, as
that is typically a segmentation issue and not a problem of
recognition. They also do not include added consonants since added
characters do not reflect Tesseract’s ability to recognize the
specific character.
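The accuracy definition above, including the possibility of negative rates when added characters outnumber the originals, can be made concrete in a few lines. The numbers in the sketch are illustrative only.

```python
def accuracy(total_errors, original_chars):
    """Accuracy as defined in the paper: 1 minus (total errors divided
    by characters in the original image). Because added characters
    count as errors, the value can go negative when errors outnumber
    the original characters."""
    return 1 - total_errors / original_chars

# 12 errors against 78 original punctuation marks:
rate = accuracy(12, 78)        # about 0.846, i.e., 84.6%
# 5 errors (several of them added characters) against only
# 3 original characters yields a negative rate:
negative_rate = accuracy(5, 3)  # about -0.667
```
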
The analysts sought to discover which were the most
predominant kinds of errors, which combinations of type styles and
Tesseract modes were most accurate for which kinds of letters, and what
were the overall accuracy rates scholars could expect with the current
version of Tesseract. The project further sought to identify important
areas for future training of the Tesseract engine. Analysis began with
the most detailed descriptions of errors possible and grouped error
types into increasingly broader categories. At the most narrow level,
the analysts considered categories such as ayn
becoming ḥéth in East
Syriac-Script, which only occurred once across all three pages. At the
broadest level, the test tabulated total percentage of consonant errors
compared across all six combinations of type style and Tesseract mode.
The team also attempted to evaluate the impact of other factors such as
pixel size and image quality on the OCR results. The section that
follows details the specific results for each type style, subdivided
into discussions of consonants, diacritics, punctuation, and non-Syriac
characters.
In the broadest terms, Tesseract generated the most accurate results with the Estrangela type style and using Tesseract’s Language mode. The East Syriac type style paired with Script mode produced the lowest accuracy rates overall. Unsurprisingly, diacritics caused the most errors across all type styles and modes, while consonants were more easily recognized and differentiated. Punctuation and other non-Syriac characters also caused some errors across the board.
Figure 5: Overall
Accuracy Rates
The analysts, each of whom examined different pages,
naturally recorded different OCR errors. Individual results
can be found within Appendix 2. Within the data tables of
individual data, it was observed that there was no consistent
pattern that would indicate a strong distinction between the
analysts’ data. For example, no one analyst consistently had
the lowest or highest accuracy rates. The lack of pattern
speaks to the consistent method of data collection across each
of the three analysts.
The individual data was analyzed to determine the range of
the data set as well as the variety of accuracy rates within the pages.
Most of the data given in this paper is based on the averages of the
three analysts’ work, as this gives the best overall picture of the
results and compensates for human error. However, the range and standard
deviation between individual data gave the team a deeper understanding
of the underlying data points, and these are provided in tables in Appendix 2. For the most
part, the data remained reasonably close between analysts; consonant
accuracy rates only have ranges of one to four percent, while the ranges
of total accuracy rates do not reach six percent. Diacritic accuracy
rates, on the other hand, vary greatly between individual pages. For
example, the ranges within the data for the Estrangela and Serto texts
in Script mode hovered around 55%. Interestingly enough, East
Syriac-Script had the most consistent diacritic rates among the
analysts with a range of 1.65%.
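The range and standard deviation figures reported here and in Appendix 2 are straightforward to compute. The sketch below uses Python’s statistics module; the rates shown are placeholders, not the project’s actual per-analyst values.

```python
import statistics

def spread(rates):
    """Return the range and sample standard deviation across the three
    analysts' accuracy rates for one type-style/mode combination."""
    return max(rates) - min(rates), statistics.stdev(rates)

# Hypothetical per-analyst consonant accuracy rates:
rng, sd = spread([0.9928, 0.9871, 0.9895])
```
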
This section gives a general numerical summary of the test
results. Tables and graphs, rather than prose, dominate this section in
order to communicate clearly the most crucial pieces of data. The
results of each type style are given, along with brief discussions about
consonants, diacritics, punctuation, and non-Syriac characters. Further
data can be found in Appendix 2.
Estrangela consonants were recognized very accurately with 99.28% of consonants OCRed correctly in Language mode and 98.51% OCRed correctly in Script mode. A total of 22 different types of consonantal error were recorded, 16 of which occurred only once or twice. This low number of occurrences for the majority of errors makes it difficult to determine their cause.
Table 4 shows the six consonantal errors which occurred
three times or more. The most frequent error was word truncation, which
caused three consonants to be deleted in Language mode and eight to be
deleted in Script mode. The second most common error was a yudh being inserted into the text, which occurred
twice in Language and six times in Script. These two errors were also
common in Serto and East Syriac, and they are discussed in more detail
in the Analysis.
These most frequent errors include those which occurred three or more times in total across both Language and Script modes.
Table 5 gives the individual accuracy rate for each consonant in the two modes. In both modes, the majority of consonants were recognized with 100% accuracy, and only one consonant in each fell below 97% accuracy.
Colors have been added to the table as a visual aid to help with comparison. The colors green and red were set on a gradient, with every number from 1 to 100 spaced evenly across it. The shade of each box on the table visually indicates that consonant’s accuracy.
The Estrangela pages chosen for testing were all
unvocalized, meaning that in terms of diacritics they contained only
homograph dots and syomés. In Language mode, only
45.45% of these diacritics were recognized correctly, and in Script
mode, only 32.17% were recognized correctly. The most frequently
occurring error was the diacritic homograph dot being substituted for
another kind of diacritic. It was most often rendered as one of the very
similar-looking dotted vowels, but was also occasionally misidentified
as a Greek vowel, an Arabic diacritic, or a syomé. The second most common error was the homograph dot being
missed or added, which happened 14 times in Language and 32 times in
Script.
There were 78 Syriac punctuation marks on the Estrangela
pages tested, with 12 errors recorded in Language mode and 14 recorded
in Script. Only two types of errors were noted. The most common was the
addition or omission of punctuation, which happened 12 times in Language
and 11 times in Script. There were also three occurrences where a Syriac
punctuation mark was recognized as an Arabic diacritic—for example, the
pāsūqā was recognized as a sukun—and this error was exclusive to
Script. See George Anton Kiraz, Tūrrāṣ Mamllā: A Grammar of the Syriac
Language, vol. 1: Orthography (Piscataway, NJ: Gorgias Press, 2012),
149–50.
There were 21 non-Syriac characters in the Estrangela pages
tested, consisting of asterisks and superscript numbers. None of these
characters was recognized accurately by Tesseract, and the errors
involving these non-Syriac characters were very inconsistent. For
example, on one page there were ten superscripted Arabic digits (i.e.,
1, 2, 3) identifying footnotes (Estrangela Page 1). These were
rendered inconsistently (e.g., a digit recognized as an exclamation
point, a hé recognized as Latin m). Intriguingly, the number that was
recognized as an exclamation point was the digit 8, not a 7 or a 1 as
one might assume based on similarity of character shapes.
Attentive readers may be perplexed by the negative percentages in this category: they are caused by Tesseract’s addition of non-Syriac characters which were not present in the original images. This occurred in Script mode on only two occasions, but these two additions had a notable effect on the accuracy rate because of the small number of non-Syriac characters on the original pages.
In Language mode, Tesseract recognized 93.67% of consonants accurately. In Script mode, 88.98% of consonants were recognized accurately. The consonantal errors identified in Serto varied widely, with over 60 different types of errors recorded. Half of these errors only occurred once or twice—again making it difficult to identify any patterns. However, there were also certain errors which occurred with a particularly high frequency. Table 9 lists the 15 most commonly occurring errors and gives their frequency in both Language and Script mode.
The most common issue in Serto was the addition of extra
letters, particularly the tooth letters yudh, ayn, and ḥéth. This was common in both Language and Script
mode but occurred twice as many times in Script mode. This is discussed
in more detail in the Analysis section under Added Tooth Letters. Dolath-rish, waw, olaph, and béth were also incorrectly
added to the text fairly frequently. Waw and olaph were inserted more times in Script mode,
while dolath-rish was inserted more often in
Language mode, and béth was inserted an almost
equal number of times in both modes.
After the addition of consonants, misidentification of
consonants was the second most common cause of errors. Certain
consonants, when misidentified, were misidentified consistently. For
instance, when Tesseract misidentified a Ṣodhé,
it was always substituted with a dolath.
Similarly, on all the occasions that taw was
misidentified, it was recognized as an olaph.
Other consonantal substitutions were not completely consistent but still
showed a tendency towards a particular letter. For example, shin was most likely to be misidentified as phé, which accounted for seven out of the nine
substitution errors in Language and both of the substitution errors in
Script. Hé was most frequently misidentified as a
waw in both modes, and in Script it was also
frequently mistaken for a zayn and waw together, which in Serto type style creates a
very similar shape to hé. Yudh was mistaken for a ḥéth 11 out of the 16 times it was misidentified
and, likewise, all nine times ḥéth was misidentified it was mistaken for one or two yudhs. The remaining two consonants that were
frequently misidentified, mim and dolath-rish, showed more variation in their
substitutions, with the 13 mim substitutions
occurring across ten distinct error categories and the ten dolath-rish substitutions occurring in eight
distinct error categories.
Table 10 shows the accuracy rate of each consonant in
Serto in both Language and Script modes. The majority of consonants
achieved over 97% accuracy in both Serto modes. However, when compared
with Estrangela, Serto's numbers become less praiseworthy. Serto has an
average of eight consonants performing at lower-than-97% accuracy, while
Estrangela only has one that falls below this threshold. Moreover,
certain consonants in Serto had a significantly low accuracy rate.
Ṣodhé, for example, proved
particularly difficult for Tesseract to recognize. Also illustrated by
the table is the fact that the two modes can produce extremely varied
accuracy ratings for the same letter. For example, ḥéth, taw, and
mim were over 10% more accurate in Language
mode, whereas shin was 19% more accurate in
Script mode.
As with Estrangela, the Serto pages tested were
unvocalized, meaning that the images’ only diacritics were homograph
dots and syomés. Of all three type styles tested,
Tesseract produced the least accurate results for diacritics in Serto,
with only a 22.52% accuracy rate for diacritics in Language mode and a
16.56% accuracy rate for diacritics in Script mode. These figures
include a high number of diacritics that were incorrectly added by
Tesseract, many of which the analysts believe were caused by stray marks
in the original images.
Table 11 lists the seven different types of diacritic errors recorded for Serto and gives their frequency in Language mode and Script mode. The homograph dot was the locus of most Serto diacritic errors; it was frequently added, deleted, or substituted. When the homograph dot was misidentified it was most often substituted for a dotted vowel, as was also the case in Estrangela.
There were 74 punctuation marks on the Serto pages tested, and 15 errors were recorded in Language mode and 29 recorded in Script mode. The most common punctuation errors were the addition or omission of punctuation, accounting for about 73% of punctuation errors in both Language and Script. The remaining errors involved punctuation being misidentified—usually as another Syriac punctuation mark, though in one instance, Serto-Script rendered a punctuation mark as an Arabic diacritic.
The Serto pages tested contained only three non-Syriac characters, consisting of asterisks and one superscript numeral. None of these were recognized correctly by Tesseract. Again, a minor number of added non-Syriac characters caused the negative accuracy rates—one extra non-Syriac character was inserted in Language mode and two were inserted in Script mode.
Tesseract recognized 88.67% of consonants accurately with Language mode and 88.16% with Script mode. The consonantal errors were incredibly varied, with 99 different types of consonantal error recorded. Figure 6 gives a visual representation of the distribution of these errors.
Figure 6:
Distribution of Consonantal Errors in East Syriac
Since so many different categories of consonantal errors were produced in East Syriac and the majority of errors only occurred once or twice, it is difficult to discuss them all in depth and to discern patterns within the data. Table 14 lists the 12 most frequently-occurring errors and gives their frequency in both Language and Script modes. Here, specific errors have been subsumed under more general categories in order to provide a broader picture of the errors encountered.
The issue that most affected Tesseract’s accuracy with
East Syriac text in both modes was word truncation, a problem resulting
from segmentation issues. This caused Tesseract to omit 80 consonants in
Language and 82 in Script. Tesseract also omitted many individual
consonants from within words that had otherwise been recognized. The
most commonly omitted consonants were olaph, dolath-rish, nun, and ḥéth.
The majority of misidentification errors related to dolath-rish. Not only were these the most
commonly misidentified consonants, they were also the consonants that
others were most frequently substituted for. This issue is discussed in
more detail in the Analysis section. Another frequently misidentified consonant was ṭéth, which was rendered as a gomal on every occasion that it was misidentified.
Table 15 gives the accuracy rate for each individual
consonant in East Syriac. Tesseract was less than 97% accurate at
recognizing the majority of consonants in both modes. The letters dolath-rish, ṭéth, and zayn posed
particular difficulty for Tesseract in East Syriac type style. The table
also illustrates some of the large differences in accuracy of individual
consonants between the two modes. Zayn is 50 percentage points more accurate in Language and béth is 11 points more accurate in Language, while ṣodhé is 20 points more accurate in Script.
The original East Syriac pages tested contained syomés, homograph dots, dotted vowels, and Syriac
oblique lines (Unicode character U+0747). Tesseract recognized the
diacritics in East Syriac most accurately out of the three type styles
tested, but the accuracy rate was still very low, with a 60.69% rate of
accuracy in Language and a 53.17% rate of accuracy in Script.
Both the relative accuracy and the (still) high number of
errors in East Syriac are likely due to the high frequency of vowel
pointings in East Syriac to begin with. The three pages tested had a
total of 1,239 original diacritics. The more diacritics there are, and the more kinds of diacritics there are, the more opportunities Tesseract has to misread the image. In addition, as will be discussed further in the section on Segmentation Issues, East Syriac’s many diacritical marks invade the white space between lines and thus create difficulties in distinguishing words. Besides affecting Tesseract’s segmentation, this may also contribute to misidentified diacritics.
Tesseract’s two most frequent diacritic errors involved the dotted vowel. Dotted vowels were rendered as their equivalent Greek vowel 51 times in Language and 81 times in Script, and as a non-equivalent Greek vowel 23 times in Language and 43 times in Script. These two vowel shifts account for 15.20% and 21.42% of diacritic errors in Language and Script, respectively. The shift from a dotted vowel to its equivalent Greek vowel is considered an error because Tesseract is not representing the original image, especially given that Tesseract was able to correctly recognize the dotted vowels in some cases, confirming that the engine is capable of exact recognition and reproduction.
There were 80 punctuation marks in the East Syriac pages tested. The primary issue recorded was punctuation being added or deleted, accounting for over half of total punctuation errors. The remaining errors consisted of punctuation being misidentified—punctuation marks were most often substituted with other Syriac characters but on rare occasions were also recognized as Arabic or Latin characters.
There were no non-Syriac letters, numerals, or symbols in the original East Syriac pages, and so an accuracy rate was not calculated for this category.
With the detailed results for each type style fully recounted, this paper turns now to an analysis of prominent and unusual error trends that were observed during testing. To put it in terms of the popular metaphor, now that the individual trees have been catalogued, the following section will describe the forest.
While the OCR testing generated a long list of errors, some of
which were unique, trends are still discernible in the data. The following pages
detail major error trends that emerged from the Tesseract testing data. Tooth letters had notably high error rates across all type styles, especially in terms of addition and substitution errors. Other kinds of errors appeared primarily in one type style; examples of this were dolath and rish in East Syriac, mim in Serto, and added yudh in Serto. Additional subsections consider unique errors, segmentation issues, and
spacing issues as examples of what scholars can expect to find in their OCR
ventures. Along with summarizing each of these common error trends, the paper
analyzes what factors from letter shape to OCR training caused these tendencies.
All three type styles differed as to their most common type of error, whether added, deleted, or substituted. While Estrangela-Script had relatively even numbers of all three, Estrangela-Language had significantly more added and deleted letters than substituted. Interestingly, for both Script and Language modes in Estrangela, Tesseract deleted the same number of letters it added. In Serto, added letters were most common in both modes, and deleted letters were least common. In East Syriac-Language, deletions were slightly more common than substitutions, but both far outnumbered additions. In East Syriac-Script, additions remained the least common error type, but substitutions predominated over deletions.
Table 18: General Consonantal Errors in All Type Styles
Another important trend is immediately apparent from this information: Tesseract performed remarkably well with Estrangela under both modes. Very few consonantal errors were generated. In fact, Tesseract had a 99.28% consonantal accuracy in Estrangela-Language, and a 98.51% consonantal accuracy in Estrangela-Script.
Figure 7: Most Common Error
Types
Serto tended to have results at odds with the other two type styles, and this was mirrored at the underlying level of page layout. The Estrangela type style had an average of 32.2 consonants per line and an average of 28 lines per page on the three images tested. For all three type styles, average consonants per line was calculated by choosing five full lines at random from the three pages (two from one, two from a second, and one from a third), counting all the consonants on those lines, and taking the mean.
Humorously referred to as “tooth letters” by Syriac instructors, ḥéth, yudh, ayn, and nun exhibit strong similarities in all Syriac type styles. (Lomadh is not considered a tooth letter despite its similarities to ayn because its “leg” extends higher up the page, making it more recognizable to Tesseract. Similarly, zayn is not counted as a tooth letter because it does not connect to the letters on either side, making it uniquely distinguishable.) Tesseract frequently added tooth letters (most strikingly, yudhs in Serto), deleted tooth letters, or recognized other letters as tooth letters and vice versa. As Table 19 illustrates, tooth letter errors account for more than 14% of consonantal errors across all six type style-mode combinations, ranging from 14.10% of consonantal errors in East Syriac-Language to 54.04% in Serto-Script.
Figure 8: Tooth
Letter Errors as Percentage of Consonantal Error
The kinds of tooth letter errors varied between the different type styles. By far the most common type of tooth letter error in Serto was added tooth letters. East Syriac, on the other hand, had more substitution errors, in which one tooth letter was misread as a different tooth letter, and errors of tooth letter deletion. The types of tooth letter errors were relatively even in Estrangela, with four, eight, and six total substitution, addition, and deletion errors, respectively.
As Table 20 shows, Tesseract rendered tooth letters with a
range of accuracies across the six combinations of type style and mode.
Both Estrangela modes had higher than 97% accuracy rates at rendering
all four tooth letters, and the two East Syriac modes ranged from 90.61%
to 93.01% accurate. Particularly for Estrangela, Tesseract may be relied
upon with a high degree of certainty in regard to these four letters
(ḥéth, yudh, nun, and ayn). Serto, on the other hand, performed
relatively poorly. Serto-Language generated an 82.96% accuracy rate, and
Serto-Script was only 68.17% accurate.
Figure 9: Tooth
Letter Errors as Percentage of All Tooth Letters
Zayns are less similar to the “tooth letters,” differentiated particularly by their lack of attachment to surrounding letters. However, when a zayn preceded a medial nun or yudh, Tesseract occasionally recognized the two letters together as a ḥéth. This happened once with each combination in East Syriac-Script.
As illustrated by Table 21, the three type styles varied in
their encounters with added tooth letter errors. Surprisingly, East Syriac had the fewest problems with added tooth letters, and East Syriac-Language did not add any tooth letters at all. The OCR process added only a few tooth letters in Estrangela-Language and Estrangela-Script; in fact, Estrangela’s only added tooth letter errors were eight yudhs across all six pages (two in Language and six in Script).
Contributing in large measure to Serto’s poor numbers regarding tooth letters is the frequent addition of yudhs in both Serto-Language and Serto-Script modes. While the
reasoning behind this phenomenon is unclear, Tesseract added many yudhs during the OCR process. Sometimes—though
not every time—these appeared to be added when a dot rested above the
line of text in the original image, as if Tesseract believed a letter
was pushed out of line. The differences between Serto and the other type
styles are stark: while East Syriac added only one yudh across all six pages tested (the one occurrence was in
Script mode) and Estrangela added eight yudhs,
Serto inserted an astounding 41 yudhs in Language
and 84 in Script! This accounts for about 30–35% of consonantal errors
in both of Serto’s modes. Consequently, added yudhs also account for a large percentage of added tooth
letter errors in Estrangela and Serto.
In Serto-Language and Serto-Script, however, 48 and 111
tooth letters were added, respectively. Most of these added tooth letters were yudhs, which made up 85.42% of added tooth letter errors in Language mode and 75.68% in Script mode. If yudhs are
discounted, Serto only had 7 tooth letter additions in Language and 27
in Script.
Another unique error within Serto came when one line on Serto Page 3, in both Serto-Language and Serto-Script, added multiple ḥéths that could not be accounted for by stray marks. The ḥéths were added when there was a longer connecting line between two letters, such as a yudh and a taw. The rest of the page did not see such a wide expanse between letters, and it is possible that Tesseract added the ḥéths to make up for the added space between letters.
Given the similar appearances of each tooth letter to
another, the analysts anticipated prior to the testing that a high rate
of substitution errors would result, wherein one tooth letter (an ayn, for instance) would be incorrectly
recognized as a different tooth letter (say, a nun) by Tesseract. In actuality, Tesseract had a very high
accuracy rate in this regard. The least-accurate permutation tested,
Serto-Language, still only misidentified 2.01% of tooth letters as other
tooth letters. Put another way, while added and deleted tooth letters
made up a high percentage of consonantal errors, confusion between
letters did not significantly contribute to the error rate. Tesseract
thus appears to have a high degree of accuracy in recognizing tooth
letters, and evidently other factors cause the added letters.
Some errors are typical across all combinations of type style
and mode, but the tests also generated discrepancies between type styles in
which one type style had markedly different results than the other two. This
difference was particularly noticeable in the cases of rishes and dolaths, mims, and added yudhs.
While close to 100% of dolaths and
rishes in Estrangela and over 92% in Serto
were recognized accurately in both modes, East Syriac had comparatively poor accuracy rates for these letters: 72.94% in Language and
75.23% in Script. Close to 25% of dolaths and rishes were incorrectly identified or deleted in
each East Syriac mode. This may have to do with the shape of rishes and dolaths in the
East Syriac font. With their large curved shape, they look similar to
waws and kophs.
Indeed, Tesseract substituted a total of 35 dolaths and rishes for waws and a total of 14 dolaths and rishes for kophs. Interestingly, the misidentification did
not go both ways; on no occasion was a waw
substituted for a dolath or rish, and kophs were very rarely
identified as a rish or dolath (only three total occurrences of this type of error in
both East Syriac modes). Tesseract also had some difficulty
distinguishing dolaths and rishes from each other and from the dotless dolath-rish consonantal stem. These were incorrectly
substituted amongst themselves 18 times in East Syriac-Language and 11
times in East Syriac-Script.
Figure 10: Frequency
of Consonants Substituted for Dolath-Rish in East Syriac
The East Syriac dolath-rish also
proved problematic for Tesseract in another way. Not only did Tesseract
often misidentify dolath-rish as other
consonants, it also frequently mistook other consonants for a dolath or rish. After dolath-rish, the most commonly misidentified
consonants in East Syriac were nun, zayn, olaph, and shin, all of which were most frequently mistaken
as dolath-rish. The affected nuns were all medial except one. Perhaps the similar height of
the medial nun compared with the dolath-rish stem caused this issue. Zayn and shin are also
similar in height and shape to the dolath-rish
stem. Olaph stands out as an unusual consonant to
be mistaken as dolath-rish, as it is not similar
in height or shape. It is possible that East Syriac’s dotted vowels had
an effect in these cases. In Script mode, four out of the five instances
in which olaph was recognized as a dolath occurred when dotted vowels appeared below
the original olaphs. This indicates that
Tesseract’s ability to accurately recognize East Syriac text could be
improved by further training focused on distinguishing dolath-rish from consonants with similar shapes and further
training to distinguish diacritics from dolath-rish dots.
Mims exhibited a similar pattern, being very accurately rendered in two type styles but less accurately so in the third. Estrangela had a nearly 100% accuracy rate in both modes, and East Syriac-Language and -Script were both about 98% accurate. However, Serto-Language was 95.8% accurate, and Serto-Script was 86.55% accurate. The different shape of mims in Serto likely contributes to Tesseract’s less-accurate performance here. A glance at the kinds of mim substitution errors across type styles supports this hypothesis. In East Syriac, mims were mistaken as a qoph (twice in Script), a phé (once in Language), and an olaph (once in Language). Tesseract exhibited no mim substitution errors in Estrangela; in fact, Estrangela’s only mim error at all was one deleted mim in Estrangela-Script. In Serto, by contrast, mims were misidentified as six different letters or combinations of letters. In Serto-Language, mims were recognized as a lomadh (1), a waw-béth (1), and a koph-lomadh (1); in Serto-Script, mims were recognized as a koph-combination (3: once each as koph, koph-lomadh, and koph-ayn), a béth (2), a waw (1), a lomadh (1), a waw-lomadh (1), a ḥéth (1), and an ayn (1). The mim’s shape in Serto is similar to a greater variety of letters than in the other type styles; particularly in the font W64, mims can look like a rounded letter followed by a tooth letter. Given that even human readers might mistake the mim as a different set of letters, it is perhaps unsurprising that Tesseract was occasionally confused.
Figure 11: W64 Serto mim
A second issue arises when considering the great disparity
between Serto-Script and the five other Tesseract permutations. A nearly
87% accuracy rate is not unsatisfactory, but what makes this error rate
so intriguing is the fact that it differs so markedly from the five
other type style-mode combinations. What would make Serto mims particularly confusing to the Script mode
but not to the Language mode? While the shape of Serto mims seems to be an issue for Tesseract, if shape were the
only problematic variable one should expect to see it causing similar
problems for Tesseract in both modes, not only
one. Evidently, some aspect of the Script command-line argument (-l
script/Syriac) contributes to this misreading.
Rarely were other letters read as mims. A shin became a mim once in East Syriac-Script. A mim
was substituted for a koph once in
Serto-Language. A waw was recognized as a mim once in both Serto-Language and Serto-Script.
Estrangela did not substitute a mim for any
letter in either mode.
Added yudhs have already been
discussed in detail under Added Tooth Letters. However, it bears
repeating that this particular error was unique to Serto. Tesseract
added a high number of yudhs in Serto (41 and 84
in Language and Script modes, respectively), but the other two type styles had a minuscule number (eight total for Estrangela and one for East Syriac).
Attentive readers will have noticed that Serto tends to
have distinct error trends compared with the Estrangela and East Syriac
type styles. Tooth letters, particularly yudhs,
were added erroneously in all type styles but at a far higher rate in
Serto. Tesseract misread mims to a higher degree
in Serto. Serto had the lowest diacritic accuracy of all three type
styles. While stray marks in the Serto images may have contributed to
some of these OCR errors, they cannot account for all of Tesseract’s
mistakes. Since Serto images had essentially identical resolutions to
those in Estrangela and East Syriac, pixel size cannot account for this
discrepancy either.
The most likely explanation for Tesseract’s unusual error patterns in Serto lies at the level of font. The print types tested in this paper were chosen based on popularity: they were frequently used print types in each type style. As the analysts discovered partway through the testing project, Tesseract was trained on Meltho fonts, and thus print types closely related to Meltho fonts perform better with Tesseract. Print types S14 and E22, those tested here for Estrangela and East Syriac, happened to be part of the set of types used as the basis for the Meltho computer fonts. Their closeness in shape to Tesseract’s training modules explains their superior OCR results. However, the Serto print type tested, W64, was not part of the set of print types upon which the Meltho fonts were based, making its letter shapes further from Tesseract’s baseline; consequently, Tesseract is less able to recognize text printed in W64. This explains much of the broadest-level difference in results between Estrangela, East Syriac, and Serto. If Tesseract were trained with a font based on W64, its OCR accuracy with the Serto print type would likely improve significantly.
East Syriac still had the lowest average consonantal
accuracy (88.87%, averaging the two modes), slightly below Serto
(91.33%, averaged) and far behind Estrangela (98.89%, averaged). Yet E22
was part of the pool of print types underlying the Meltho fonts. How can
its low performance be explained? Very simply, as it turns out. The letters of the East Syriac type style are more similar in shape to one another than the same letters are in the Serto and Estrangela type styles, which show a greater differentiation of
letter design. Dolaths and rishes in particular are uniquely shaped in the E22 print type and look very similar to kophs, béths, waws, hés, phés, and qophs. Errors with dolath and rish make up 25.99% and 20.78% of consonantal errors in East Syriac-Language and East Syriac-Script, respectively. By comparison, dolath and rish errors make up only 16.67% of errors in Estrangela-Language, 10.81% in Estrangela-Script, 11.11% in Serto-Language, and 4.26% in Serto-Script.
In order to give a well-rounded picture of Tesseract’s OCR
results, it is necessary not only to highlight the larger trends observed
within the results but also to mention some of the specific oddities that
appeared in those same pages. These kinds of errors, examples of which are
given below, do not seem to follow any specific trend in formation. While
smudges or marks contributed to some errors throughout the texts and modes,
those errors are not included here. The following errors have been deemed to
lack a clear or obvious explanation. This list is not exhaustive; other unusual errors have been noted throughout this report, and still more occurred without specific mention here.
(Examples of such errors appeared on Estrangela-Script Page 3; Serto-Script Page 2; Serto-Script Page 3; East Syriac-Script Page 1; and East Syriac-Language Page 1 and East Syriac-Script Page 3.)
Before Tesseract can properly read the Syriac text it must be able to segment the page correctly into words and lines. One of the telling signs of incorrect segmentation comes from truncated words and missing lines. If Tesseract drops letters at the beginning or end of words, particularly words at the beginning or end of lines, or if entire lines are skipped during the OCR process, then Tesseract most likely did not correctly recognize the boundaries of words and lines. Tesseract is trained to determine word and line divisions based on white space. So, if many punctuation marks fill spaces between words or many diacritics appear above and below the lines, Tesseract can erroneously conclude that there is no white space and, hence, that there are no letters to be read in those places. The test results align with this understanding of Tesseract’s segmentation function. While the majority of lines were accurately recognized as such by Tesseract in both Language and Script modes, Tesseract nevertheless exhibited issues with segmentation in East Syriac, the type style containing the most diacritical marks.
Tesseract segmented Serto with 100% accuracy in both Script and Language modes. There were no truncated words or missed lines. Serto’s remarkable segmentation accuracy likely arises from its wide white spaces between lines, as well as from its extra-long lines of text. The Serto images which were OCRed contained on average 47.6 characters per line, a 45–47% increase from the average line length in the East Syriac and Estrangela images. And the tests conducted on manuscripts (discussed below) gave evidence that the longer a line of text is, the better Tesseract is able to recognize it.
Tesseract performed similarly well with Estrangela, truncating only 11 characters total across all six pages of Script and Language modes. Six of those characters occurred on Estrangela Page 1.
With East Syriac, Tesseract truncated multiple words and missed a full line in both Script and Language for a total of 162 dropped characters. With a higher total number of characters on the page to begin with, this still generated a 97.59% accuracy rate. Given that the East Syriac text had far less white space between lines compared with the Serto and Estrangela images as a result of its many diacritical marks, the less-precise segmentation results for East Syriac are naturally explained.
Alongside character recognition and segmentation errors, the testing detected errors in spacing between words. The OCR process often added extra paragraph breaks between lines, mostly when there was truly a paragraph break in the text. There were also additional spaces between punctuation marks and words. As those issues do not inhibit the reading or searchability of the text itself, this paper does not discuss them.
Some spacing issues, however, do impact the reading of the text. Several documents contained added spaces in the middle of words and deleted spaces between words; among these were East Syriac-Language Page 3 (additions and deletions) and Serto-Script Page 3 (a deletion).
It should be noted that the accuracy rates given in this paper do not include spacing errors since the project’s analysis is based on characters in the original image. The aim of the project was to determine character recognition, and the analysts prioritized texts with wide spacing in order to do so. When others perform OCR in the future, texts with narrower spacing between words may encounter more instances like the one given above, where multiple words are OCRed as one word. Texts with wide spacing, like the majority of those tested here, may also find deleted or added spaces scattered throughout with no obvious cause. While this did not occur in every page tested in this project, even the small numbers of errors on these few pages imply a high likelihood of spacing errors across large texts.
Given this, as Tesseract stands, it seems inevitable that
spaces will be added or eliminated between some words during the OCR process
and thus interfere with the accuracy of the searchable text. This assumption
is based on the team’s observation that the current version of Tesseract
does not encode whether the letters in an image are initial, medial, or
final. Instead, it appears that the program maps the various allographic
forms of a grapheme into one Unicode value and thus does not transfer the
differentiation between letter forms to the resulting document. With the
tight spacing of this East Syriac line (Figure 12), the reader must be able
to recognize the final forms of the letters in order to determine where
words end in the initial image. The OCRed document in Language mode, for
example, does not receive the information that the first nun in the line is a final nun, resulting
in a much larger word. Fonts determine a letter’s form based on what follows
it, and thus the second nun is turned into a final
nun because of the colon. Based on this
observation, future training of Tesseract should work to encode the allographic forms of the grapheme into separate Unicode values, one each for the initial, medial, and final forms. To aid with the spacing issue itself, spaces could be required after every final letter.
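The behavior the team observed reflects the Unicode standard itself: the Syriac block encodes one codepoint per letter, and the initial, medial, and final shapes are selected by the font at rendering time rather than recorded in the text. A minimal Python illustration (the letter sequence used is an arbitrary example, not one from the tested pages):

```python
import unicodedata

# Syriac nun is a single codepoint (U+0722). The Unicode Syriac block
# defines no separate codepoints for initial, medial, or final shapes;
# the font chooses the allograph from context.
nun = "\u0722"
print(unicodedata.name(nun))  # SYRIAC LETTER NUN

# A medial nun and a word-final nun are therefore the same codepoint in
# OCR output, so plain text cannot record that a letter was final.
word = "\u0722\u0718\u0722"  # nun-waw-nun (arbitrary letter sequence)
assert word[0] == word[2]    # identical codepoint regardless of rendered form
```

This is why the distinction between letter forms cannot survive into plain-text OCR output without either new codepoints or enforced spacing conventions, as the paragraph above proposes.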
Figure 12: Comparison of
East Syriac Original Image (top) with OCR Results for Language (middle)
and Script (bottom)
With Estrangela, Serto, and East Syriac type styles tested in Tesseract, only one category of Syriac texts remains: handwritten manuscripts. No analysis of Tesseract’s performance would be complete without an investigation into manuscripts. The analysts ran six manuscript pages through Tesseract, and their findings are outlined below.
Following the study of Tesseract’s performance on printed Syriac
texts, the analysts conducted additional tests on images of Syriac manuscripts.
The aim was to assess Tesseract’s potential as a tool for digitally transcribing
handwritten Syriac. Three pages of Damascus
12/21 from the Syriac Orthodox Patriarchate were chosen
for testing: 72r, 159r, and 163v. Damascus 12/21 was
selected because its hand is the basis for the Meltho font Estrangela Antioch,
and thus it was hypothesized that Tesseract had a high chance of being able to
recognize this text accurately. Three particular pages were chosen for their
relatively straight and well-spaced lines.
As in the study on printed texts, Tesseract was tested on all
three pages using both language command-line arguments: Syriac Language (-l Syr)
and Syriac Script (-l script/Syriac). Tesseract’s output was then compared to
the original image and all the consonantal errors were logged. In order to
provide a point of comparison, three additional pages of Estrangela text from
three different manuscripts were selected from W. H. P. Hatch’s An Album of Dated Syriac Manuscripts (Boston, MA: The American Academy of Arts and Sciences, 1946) to be run through Tesseract and analyzed for errors using the same method. Plate 29 is London British Museum Add. Ms. 14599 fol. 32; Plate 34 is Florence Biblioteca Laurenziana Plut. 1 Cod. 56 fol. 99; Plate 35 is London British Museum Add. Ms. 17152 fol. 30v.
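For readers wishing to reproduce the test setup, the two command-line arguments can be assembled as follows. This is only a sketch: the filenames are hypothetical, and it assumes a Tesseract 4.x installation with the syr and script/Syriac traineddata files available.

```python
import shutil
import subprocess

def tesseract_cmd(image, out_base, mode):
    """Build a Tesseract 4 command line for one of the two Syriac modes:
    "language" uses the Syriac language model (-l syr), while "script"
    uses the script-level model (-l script/Syriac)."""
    lang = {"language": "syr", "script": "script/Syriac"}[mode]
    return ["tesseract", image, out_base, "-l", lang]

# Hypothetical filenames; Tesseract writes its output to page1_out.txt.
cmd = tesseract_cmd("page1.png", "page1_out", "script")

# Only invoke the engine if the binary is actually installed.
if shutil.which("tesseract"):
    subprocess.run(cmd, check=True)
```

Note that traineddata names are conventionally lowercase (syr), even though the argument is printed above as "-l Syr."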
Manuscripts initially proved more difficult for Tesseract to
recognize than the printed texts. When the images of the manuscripts were
run through Tesseract without any prior editing done to them, the OCR engine
produced too little text to analyze—and in some cases no text at all.
Therefore, several stages of edits were conducted on the images to make them
“easier” for Tesseract to read and to derive more fruitful output files that
could be analyzed in depth. The multiple edits were initially carried out on
Damascus 12/21 72r until a result was achieved
that the analysts felt was suitably successful; then the other images were
edited following the same process.
The images of Damascus 12/21 supplied
for testing were high-quality color photographs that showed full pages of
the manuscript on a black background. The text was written in Estrangela in
two long and narrow columns per page, with approximately three words per
column line and twenty-five lines per page. At the first stage in the
editing process, the photograph of 72r was cropped of all empty space such
as the background and page margins and then color-edited to black and white.
Two images were created from the photograph, each containing one column of
text. These two images were run through Tesseract, and both modes were
tested. Despite the editing, Tesseract did not recognize any text in column
1 using either mode, and it recognized only 14 fragmentary lines in column
2. The specific 14 lines differed between Language and Script modes, but
both produced 14 OCRed lines.
Figure 13: Sample Image of Damascus 12/21 159r (image courtesy of the Syriac Orthodox Patriarchate)
At the next stage of editing, line spacing was exaggerated so that no letters overlapped onto other lines. These images were tested again in both modes. A modest improvement was made on the recognition of column 1, with 11 characters now recognized by Tesseract, but the results were still poor compared with the earlier tests on printed texts. Finally, the images were re-edited using a photo editor to combine both columns into a single image, elongating the lines so they fit approximately ten words per line rather than three. At this stage, more than half of the text was recognized by Tesseract and these improvements occurred in both modes.
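The editing pipeline applied to 72r can be sketched schematically. The functions below operate on a toy grayscale image represented as a list of pixel rows (0 = black, 255 = white); they are illustrative stand-ins for the photo-editor operations the team performed, not the actual tooling used.

```python
def binarize(img, threshold=128):
    """Color-edit to black and white: map grayscale pixels to 0 or 255."""
    return [[0 if p < threshold else 255 for p in row] for row in img]

def crop(img, top, bottom, left, right):
    """Remove background and page margins around the text block."""
    return [row[left:right] for row in img[top:bottom]]

def split_columns(img, gutter):
    """Separate the two text columns at the gutter x-coordinate, so each
    column can be run through Tesseract as its own image."""
    return [row[:gutter] for row in img], [row[gutter:] for row in img]

# Toy 4x6 "page": dark text pixels (40) on a light ground (200).
page = [[200, 40, 200, 200, 40, 200] for _ in range(4)]
bw = binarize(crop(page, 0, 4, 0, 6))
col1, col2 = split_columns(bw, gutter=3)
```

The later stages (exaggerating line spacing and recombining the columns into long lines) were likewise manual image edits, not algorithmic steps.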
Now that 72r had been successfully edited to produce similar accuracy rates to the printed texts, the remaining manuscript images were edited in the same way and run through Tesseract. As in previous tests, the output files were compared to the original images and the errors were recorded. The results that follow are from tests conducted on the final forms of the edited manuscript images.
When comparing Tesseract’s output files to the original images, analysts recorded only consonantal errors in this preliminary test to provide a general idea of how well Tesseract might perform on manuscripts. The consonantal error categories were also simplified so that the errors encountered were recorded in one of four general categories—deleted, substituted, added, and truncated—which encapsulated all the varieties of consonantal errors that occurred. Consonants were considered “deleted” when the consonant in the original image was missing in the output from a word that had otherwise been recognized by Tesseract; consonants were considered “substituted” when a consonant in the original image was mistaken for a different consonant; consonants were considered “added” when a consonant appeared in the output where no corresponding consonant appeared in the original; and consonants were considered “truncated” when they were missing from the output due to partial or full lines of text being missed by Tesseract. Using this data, the analysts calculated accuracy rates for each set of pages in both modes.
The three pages tested from Damascus
12/21 contained a total of 1,924 consonants. There were 553
consonantal errors in Language mode and 630 consonantal errors in Script
mode. With all these errors taken into account, the consonantal accuracy
rate for the Language mode is 72.30% and the consonantal accuracy rate
for Script is 67.26%. These results are significantly lower than the
consonantal accuracy rate for the printed Estrangela texts that were
tested, which were recognized with over 98% accuracy in both modes.
Tesseract generated significantly more errors of each type
for Damascus 12/21 than for the printed texts,
but the low accuracy was primarily the result of the high number of
consonants deleted due to line truncation. As Table 26 shows, truncated
characters were by far the most frequent error, accounting for over half
of the errors in both modes. As this error is possibly caused by issues
with the input image, an accuracy rate was also calculated which
excluded truncated characters from both the total consonant and the
total error counts. When truncated characters are removed from the
calculation, the accuracy rate rises significantly, to 83.34% in
Language and 83.32% in Script.
Plates 29, 34, and 35 contained a total of 1,770
consonants, with 240 errors recorded in Language mode and 160 in Script
mode. With all these errors taken into account, Tesseract was able to
correctly recognize 86.44% of consonants in Language mode and 90.96% of
consonants in Script mode. These accuracy rates are still lower than
those for the printed pages, but they are much higher than the accuracy
rates for Damascus 12/21. This increase in
accuracy is attributable in part to the far lower number of truncated
characters, but the number of errors recorded in the other three
categories decreased as well. When truncated characters are removed from
the calculation, the consonantal accuracy rate rises slightly, to 89.37%
for Language and 91.74% for Script.
These results indicate that Tesseract is able to recognize
handwritten Estrangela with a reasonably high accuracy rate but only when
the images have been heavily edited so that the manuscripts more closely
resemble pages of printed texts. Even after a long editing process,
Tesseract often truncates words or whole or partial lines, significantly
reducing its OCR accuracy. These obstacles mean that, at present, Tesseract
is probably not useful to scholars seeking a quick method for digitally
transcribing manuscripts. If in the future Tesseract can be trained to
recognize text in narrow columns, and if a method of minimizing line
truncation is found, Tesseract could become a useful and accurate tool for
the optical character recognition of manuscripts. Since the completion of this
study, the analysts have become aware of Transkribus, a software
platform specifically designed for the automatic recognition of handwritten
texts. Transkribus has shown good results with a variety of
handwriting styles in different languages. Once trained, it may have
the potential to work well on handwritten Syriac
manuscripts.
This paper has outlined and analyzed a large amount of data from this project but has yet to spell out how this information may best help those who wish to use Tesseract in their scholarship. Here is a brief list of such practical tips:
The one exception was added simkaths in Estrangela-Language, which did not occur in Estrangela-Script.
The program Voyant Tools has been brought to the analysts’ attention as software that can assist with word studies by analyzing word frequency. Some researchers might find it useful after they have confirmed the accuracy of their OCRed text corpus.
Several things may be concluded from the foregoing tests. First,
this testing process confirmed Tesseract’s reliability for OCRing printed
Estrangela texts. When consonants are the reader’s only concern, Tesseract 4.0
will perform at close to 99% accuracy using either mode. Since most printed
Syriac texts use Estrangela, this is encouraging news for Syriac studies. Estrangela's
consonantal accuracy likely cannot be improved much further. According to the
Tesseract programming community on GitHub, "unless you're using a very
unusual font or a new language retraining Tesseract is unlikely to
help." ["ImproveQuality," GitHub, last updated April 21, 2018,
https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality,
accessed 30 July 2018.] Fortunately
for scholars, Tesseract already generates useable OCR results for
Estrangela.
Second, Tesseract 4.0 is not currently recommended for OCRing Serto and East Syriac. Recognition of these type styles is less accurate, hovering around 90% consonantal accuracy (a range of 88.11–93.4%) in both modes. Tesseract may be useful for an initial computer-generated pass through texts in these type styles, but for full readability and searchability humans will need to review and edit the OCRed text themselves. While accuracy rates for these type styles can be improved, until a training method is developed for Tesseract that does not require inputting hundreds of thousands of lines of text, time spent retraining Tesseract's model is unlikely to be time-effective. That being said, since fewer texts have been printed in East Syriac and Serto, manually transcribing these texts may prove a feasible alternative for obtaining digital texts.
Third, Tesseract's OCR accuracy with Serto can likely be improved by training Tesseract on a Serto font based on the W64 print type. As the analysts discovered during the testing project, Tesseract was almost certainly trained on Meltho fonts, and thus print types closely related to Meltho fonts perform better with Tesseract. Print types S14 and E22, those tested here for Estrangela and East Syriac, happened to be part of the set of types used as the basis for the Meltho fonts, while W64, the type tested for Serto, was not. Many of Tesseract's unique errors for Serto are likely attributable to this discrepancy between print type and training font. If a programmer were to train Tesseract with a font based on W64, Tesseract's OCR accuracy with the W64 print type would likely improve significantly.
Fourth, these tests have verified that Tesseract has the potential to recognize handwritten Estrangela texts with good accuracy, but at present it demands a laborious and time-consuming editing process to make the images readable. Even after this extensive editing process the accuracy rate varies between 67% and 90%, so it is not recommended that scholars attempt to OCR manuscripts using Tesseract at this time. Tesseract would need many programming developments before it could become a practical, usable tool for OCRing Syriac manuscripts.
As practical advice: although results vary for individual letters, running Tesseract with the Language-mode command-line argument (i.e., "-l syr") generates slightly more accurate results in all three type styles and is the recommended mode to use. In addition, time spent cropping and deskewing scanned images before OCRing is well worth the investment. Without these steps, and without converting the images to black and white, Tesseract's accuracy rates would drop significantly.
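In concrete terms, the two invocations compared throughout this study can be sketched as follows. This is a minimal sketch rather than the analysts' actual workflow: the file names are hypothetical, and it assumes the standard Tesseract 4.0 traineddata names for Syriac ("syr" for the language model and "script/Syriac" for the script model).

```python
# Build the Tesseract argument list for each of the two Syriac modes.
# File names are hypothetical; the list can be passed to subprocess.run().
def tesseract_cmd(image_path, output_base, mode):
    if mode == "language":
        model = "syr"            # Language mode: tesseract <image> <out> -l syr
    elif mode == "script":
        model = "script/Syriac"  # Script mode: tesseract <image> <out> -l script/Syriac
    else:
        raise ValueError("mode must be 'language' or 'script'")
    return ["tesseract", image_path, output_base, "-l", model]
```

For example, `tesseract_cmd("page_72r.png", "page_72r", "language")` yields a Language-mode command line; cropping, deskewing, and black-and-white conversion would be applied to the image before this call.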
With a little practice, Tesseract can offer Syriac scholars a straightforward way of digitally transcribing printed Estrangela texts. If embraced by Syriac scholars, it has the potential to advance the field by improving the availability of printed Estrangela texts and opportunities to access them. As developments are made in OCR it is hoped that soon Serto and East Syriac will be recognized accurately enough to join Estrangela in this regard. The goal to OCR a high number and wide variety of printed Estrangela texts is sure to keep Syriac scholars busy in the meantime.
Tesseract occasionally inserted or swapped in
three Arabic characters (the fatḥah, shadda, and sukun) for
Syriac characters. Why Tesseract pulled Arabic characters into the
data set, let alone these particular characters and not others, is
not fully clear at this point and is difficult to determine due to
the inconsistency with which these errors occurred. At first
analysis it would seem that Tesseract prefers Syriac characters
over Arabic characters when OCRing in Syriac modes, typically
identifying a printed mark as a Syriac character if there is a
similar-looking one in the Unicode set and only secondarily
inserting Arabic characters if a Syriac one cannot be found.
However, this trend does not play out consistently. For instance,
whilst the Arabic fatḥah (U+064E) sometimes
appeared in place of the Syriac oblique line (U+0747), the Syriac
zqapha (U+0733) was never misidentified
as the similarly shaped Arabic ḍammah (U+064F).
One possible explanation is
that when Tesseract was trained, certain Arabic diacritics were
incorrectly incorporated into the Syriac data set, muddying the
waters, so to speak. Without further information about the Tesseract
training process, the analysts cannot ascertain all of the engine's
resultant OCR peculiarities.
The page numbers listed here do not reflect the page numbers in the printed books from which the images were scanned; rather, they reflect the analysts' testing designations.
Figure 14: Overall
Accuracy Rates of Type Style and Mode
Figure 15: Overall
Accuracy Rates of Type Style
The most common letters in Syriac, in
decreasing order of frequency, are: olaph
(13.9%), waw (10.1%), nun (9.6%), yudh (9.0%), lomadh (7.4%), dolath and mim (6.4%), hé (5.5%), taw
(5.3%), rish (4.4%), béth (4.3%), koph (3.0%), shin (2.8%), ayn
(2.6%), ḥéth (2.3%), phé (1.5%), qoph (1.4%), simkath (1.3%), gomal (0.9%), ṭéth (0.8%), zayn (0.6%), and ṣodhé (0.3%). [George Anton Kiraz, Tūrrāṣ Mamllā: A Grammar of The Syriac Language, vol.
1: Orthography (Piscataway, NJ: Gorgias Press, 2012),
53–54.]
This table shows the overall distribution of consonantal accuracy in each type style and mode. The percentage values here do not necessarily represent actual accuracy rates but rather give a clearer understanding of the results through statistical analysis (specifically the mean and standard deviation). The counts are the number of consonants whose accuracy rates are equal to or higher than the statistical value.
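The statistical summary described here can be sketched with Python's standard-library statistics module. This is a minimal illustration assuming a list of per-item accuracy rates; the rates in the example are hypothetical, not the study's data.

```python
import statistics

def summarize_accuracy(rates):
    """Return the mean and sample standard deviation of a list of
    accuracy rates, plus a count of how many rates are at or above
    the mean (the kind of count the table reports)."""
    mean = statistics.mean(rates)
    stdev = statistics.stdev(rates)
    at_or_above_mean = sum(1 for r in rates if r >= mean)
    return mean, stdev, at_or_above_mean

# Hypothetical rates: mean 98.5, sample stdev ~1.29, 2 rates at/above mean
example = summarize_accuracy([98.0, 99.0, 97.0, 100.0])
```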