man clara-adv (Commands) - a cooperative OCR

NAME

clara - a cooperative OCR

SYNOPSIS

clara [options]

DESCRIPTION

Welcome. Clara OCR is a free OCR program, written for systems supporting the C library and the X Window System. Clara OCR is intended for the cooperative OCR of books. There are some screenshots available at http://www.claraocr.org/.

This documentation is extracted automatically from the comments of the Clara OCR source code. It is known as "The Clara OCR Advanced User's Manual". It's currently unfinished. First-time users are invited to read "The Clara OCR Tutorial". Developers must read "The Clara OCR Developer's Guide".

CONTENTS

1. Welcome to Clara OCR

1.1 Early historical notes
1.2 Design notes
1.3 Supported Alphabets
1.4 Clara vs the others
1.5 The requirements
1.6 How to download and compile Clara
1.7 Compilation and startup pitfalls

2. A first OCR project

2.1 Scanning and thresholding
2.2 Manual and histogram-based (global)
2.3 Classification-based (local)
2.4 Classification-based (global)
2.5 Avoiding or correcting skew
2.6 The work directory
2.7 Building the book font
2.8 Skeleton tuning
2.9 Classification tentatives
2.10 Alignment tuning

3. Complex procedures

3.1 Using two directories
3.2 Adding a page (to be written)
3.3 Multiple books
3.4 Adding a book (to be written)
3.5 Removing a page
3.6 Dealing with classification errors
3.7 Rebuilding session files (to be written)
3.8 Importing revision data
3.9 How to use the web interface
3.10 Revision acts maintenance
3.11 Analysing the statistics
3.12 Upgrading Clara OCR (to be written)

4. Reference of the Clara GUI

4.1 The application window
4.2 Tabs and windows
4.3 The Application Buttons
4.4 The Alphabet Map

5. Reference of the menus

5.1 File menu
5.2 Edit menu
5.3 View menu
5.4 Alphabets menu
5.5 Options menu
5.6 PAGE options menu
5.7 PAGE_FATBITS options menu
5.8 OCR steps menu

6. Reference of command-line switches

7. AVAILABILITY

8. CREDITS

1. Welcome to Clara OCR

Clara is optical character recognition (OCR) software: a program that tries to identify the graphic images of the characters in a scanned document, converting their digital images to ASCII, ISO or other codes.

The name Clara stands for "Cooperative Lightweight chAracter Recognizer".

Clara offers two revision interfaces: a standalone GUI and a web interface that several reviewers can use simultaneously. Because of this feature Clara is a "cooperative" OCR (it's also "cooperative" in the sense of its free/open status and development model).

1.1 Early historical notes

For some years now we have tested and used OCR software, mainly for old books. Popular OCR programs (those bundled with scanners) are useful tools. However, OCR is not a simple task. The results obtained using those programs vary widely depending on the printed document, and, for most texts we're interested in, the results are really poor or even unusable. In fact, it's no surprise that many digitization projects prefer not to use OCR, relying on typists only.

For a programmer, it is somewhat intuitive that OCR could achieve good results even from low quality texts when an ad hoc approach is used, focusing on one specific book (for instance). Within this approach, OCR becomes a matter of finding software adequate for the texts you're trying to OCR, or perhaps developing a new program. So a free OCR that is easy to customize (at the source code level) would be a valuable resource for text digitization projects.

Dealing with graphics is not among our main occupations, but after analysing many scanned materials, we began to write some simple and specialized recognition tools. More recently (in the third quarter of 1999) a simple X interface linked to a naive bitmap comparison heuristic was written. From that prototype, Clara OCR evolved. Since then, many new ideas from various people have helped to make it better.

1.2 Design notes

It's not a bad idea to enumerate some principles that have driven Clara OCR development. They'll make it easier to understand the features and limitations of the software (these principles may change over time).

1. Clara is an OCR for printed texts, not for handwritten texts.

2. Clara was not designed to OCR one or two single pages, but to OCR a large number of documents with the same graphic characteristics (font, size, etc). So it can take advantage of a fine (and perhaps expensive) training. This will typically be the case when OCRing an entire book.

3. We chose not to support multiple graphic formats directly, but only Jef Poskanzer's raw PBM and PGM. Non-PBM/PGM files will be read through filters.

4. Clara OCR aims to be a tool that makes it viable to accumulate and reuse human revision effort. Because of this, in the OCR model implemented by Clara, training and revision are one and the same thing. Revision is a sum of small, independent acts and alternates with reprocessing steps along a refinement process.

5. The Clara GUI was implemented and behaves like a minimalistic HTML viewer. This is just an easy and standard way to implement a forms interface.

6. We have tried to make the source code portable across platforms that support the C library and Xlib. Clara has no special provision for being ported to environments that do not support Xlib. We avoided using a higher level graphic environment like Motif, GTK or Qt, but we do not discourage initiatives to add code that adapts Clara OCR, or adapts it better, to these or other graphic environments.

7. We generally try to make the code efficient in terms of RAM usage. CPU and disk usage (for session files) have lower priority.

1.3 Supported Alphabets

Clara OCR focuses on the Latin alphabet ("a", "b", "c", ...), used by most European languages, and the decimal digits ("0", "1", "2", ...), but we're trying to support as many alphabets as possible.

To say that Clara OCR supports a given alphabet means that Clara OCR

(a) is able to be trained from the keyboard for the symbols of that alphabet, possibly applying some transliteration from that alphabet to Latin. For instance, when OCRing a Greek text, if the user presses the Latin "a" key (assuming that the keyboard has Latin labels), Clara is expected to train the current symbol as "alpha".

(b) knows the vertical alignment of each letter of that alphabet, for instance, knows that the bottom of an "e" is aligned at the baseline;

(c) knows which letters accept or require which signs (accents and others, like the dot found on "i" and "j");

(d) contains code to help avoid common mistakes, like recognizing "e" as "c", "l" as "1", etc.

To say that Clara OCR supports a given alphabet does not necessarily mean that Clara OCR

(a) knows some particular encoding (ISO-8859-X, Unicode, etc) for that alphabet;

(b) contains or is able to use fonts for that alphabet to display the OCR output on the PAGE (OUTPUT) window.

Even without knowing the standard encodings for a given alphabet (e.g. ISO-8859-7 for Greek), Clara will eventually be able to produce output using TeX macros, like {Alpha}.

1.4 Clara vs the others

Clara differs from other OCR software in various aspects:

1. Most known OCRs are non-free and Clara is free. Clara focuses on the X Window System. Clara offers batch processing and a web interface, and supports cooperative revision effort.

2. Most OCR software focuses on omnifont technology, disregarding training. Clara does not implement omnifont techniques and concentrates on building specialized fonts (some day in the future, however, maybe we'll try classification techniques that do not require training).

3. Most OCR software makes the revision of the recognized text a process totally separate from recognition. Clara pragmatically joins the two processes, and makes training and revision one and the same thing. In fact, the OCR model implemented by Clara is an interactive effort where the use of the heuristics alternates with revision and visual fine-tuning of the OCR, guided by the user's experience and feeling.

4. Clara lets the user enter the transliteration of each pattern using an interface that displays a graphic cursor directly over the image of the scanned page, and builds and maintains a mapping between graphic symbols and their transliterations in the OCR output. This is a potentially useful mechanism for documentation systems, and a valuable tool for typists and reviewers. In fact, Clara OCR may be seen as a productivity tool for typists, instead of a typical OCR.

5. Most OCR software is integrated with scanning tools, offering the user a unified interface to execute all steps from scanning to recognition. Clara does not offer such an integrated interface, so you need separate software (e.g. SANE) to perform scanning.

6. Most OCR software expects the input to be a graphic file encoded in TIFF or other formats. Clara supports only raw PBM/PGM.

1.5 The requirements

Clara OCR will run on a PC (386, 486 or Pentium) with GNU/Linux and the X Window System. Clara OCR will hopefully compile and run on a PC with any Unix-like operating system and X. Currently Clara OCR won't run on big-endian CPUs (e.g. Sparc) nor on systems lacking X Window System support (e.g. MS-Windows). Higher-level libraries like Motif, GTK or Qt are not required.

A relatively fast CPU is recommended (300MHz or more). Memory usage depends on the documents, and may range from a few megabytes to several tens of megabytes. Normal operation will create session files on your hard disk, so some megabytes of free disk space are required (a large project may require plenty of gigabytes). Clara OCR can read and write gzipped files (see the -z command-line switch).

If you need to build the executable and/or the documentation, then an ANSI C compiler (with some GNU extensions) and a Perl interpreter (version 5) are required.

1.6 How to download and compile Clara

For those who need to download and compile the source code (hopefully this will be unnecessary for most users as soon as Clara binary distributions become available), it may be downloaded from http://www.claraocr.org/. It's a compressed tar archive with a name like clara-x.y.tar.gz (x.y is the version number).

The compilation will generally require no more than issuing the following commands at the shell prompt:

$ gunzip clara-x.y.tar.gz
$ tar xvf clara-x.y.tar
$ cd clara-x.y
$ make
$ make doc

Now you can copy the executable (the file "clara") to some directory of binaries (like /usr/local/bin), and the man page (file "clara.1") to some directory of man pages (like /usr/local/man/man1). For now there is no "make install" to perform these copies automatically.

If some of these steps fail, please try to obtain assistance from your local experts. They will solve most simple problems concerning wrong paths or compiler options. You can also read the subsection "Compilation and startup pitfalls".

1.7 Compilation and startup pitfalls

This subsection is intended to help people that are experiencing fatal errors when building the executable or when starting it. After each error message we'll point out some hints.

Bear in mind that most hints given below are very elementary concerning Unix-like systems. If you have problems, try to read all hints because details explained once are not repeated. If you cannot understand them, please try to ask your local experts, or try to read an introductory book on Unix things. Please don't email questions like these to the Clara developers, except when the hint suggests it.

1. Path-related pitfalls

$ make
bash: make: command not found

The shell could not find the "make" utility. Maybe there is no such utility installed on your system, or maybe the path to it is unknown to the shell. You can try to find the "make" utility with a command like

$ find /usr -name make -print

The following command will display the current path:

$ echo $PATH

Remember that on Unix-like systems the environment is per-process. So if you change the PATH variable at the shell prompt within an xterm, this won't affect the other running shells (in the other xterms). Remember also that Unix shells must be told explicitly which variables are to be exported to subprocesses (use "export" in Bourne-like shells and "setenv" in C-like shells).

$ make
gcc -I/usr/X11R6/include -g -c gui.c -o gui.o
make: gcc: Command not found
make: *** [gui.o] Error 127

The make utility could not find the gcc compiler. Check if gcc is installed. If not, check if some other C compiler is installed (for instance, "cc"), and edit the makefile to change the value of the CC variable.

If you don't know what we're talking about, take a look at the directory where the Clara source code is, and you'll see there a file named "makefile". This file contains the names of the tools to be used and rules to build the Clara executable. It also contains important paths, like those where the system headers (.h files) and libraries can be found. If the names or the paths don't reflect those on your system, you need to edit the makefile accordingly.

$ make
gcc -I/usr/X11R6/include -g -c gui.c -o gui.o
In file included from gui.c:16:
gui.h:12: X11/Xlib.h: No such file or directory
make: *** [gui.o] Error 1

The compiler could not find the header Xlib.h. Maybe your system does not include this header, or maybe it is in another directory that is not listed in the makefile through the INCLUDE variable.

$ make
gcc -o clara clara.o skel.o gui.o mc.o ...
/usr/bin/ld: cannot open -lX11: No such file or directory
make: *** [clara] Error 1

The linker could not find the X11 library. Maybe your system does not include this library, or maybe it is in another directory that is not listed in the makefile through the LIBPATH variable.

2. Compilation pitfalls

$ make
gcc -I/usr/X11R6/include -g -c clara.c -o clara.o
clara.c:70: parse error before `int'
make: *** [clara.o] Error 1

A syntax error on the line 70 of the file clara.c. Double check if the sources were not changed. Try to obtain the sources again. If you're a programmer, try to fix the problem. In any case, report it to claraocr@claraocr.org.

$ make
clara.c: In function `process_cl':
clara.c:2293: `ZPS' undeclared (first use in this function)
clara.c:2293: (Each undeclared identifier is reported only once
clara.c:2293: for each function it appears in.)
make: *** [clara.o] Error 1

A reference to an undeclared variable. Double check if the sources were not changed. Try to obtain the sources again. If you're a programmer, try to fix the problem. In any case, report it to claraocr@claraocr.org.

3. Runtime pitfalls

$ clara &
[1] 1924
bash: clara: command not found

The Clara executable does not exist or is not on the path. Most Unix systems don't include the current directory ("./") on the path, so if you're trying to start Clara from the directory where it was compiled, specify the current directory ("./clara").

$ ./clara &
[1] 1922
_X11TransSocketUNIXConnect: Can't connect: errno = 111
cannot connect to X server

Clara could not connect to the X server. The X Window System is a client-server system. The applications (xterm, xclock, etc) connect to a display server (the X server). If the server is not running, clients cannot connect to it. In some cases, the client must be told explicitly which server to connect to, using the environment variable DISPLAY.

$ ./clara
Segmentation fault (core dumped)

If you can reproduce the problem, report it to claraocr@claraocr.org. If you're a programmer and Clara was compiled with the -g option, try a debugger to locate the point of the source code where the segmentation fault happened. Using gdb, it's quite easy:

$ gdb clara
(gdb) run

Now try to reproduce the steps that led to the segmentation fault.

2. A first OCR project

Clara OCR is intended to OCR a relatively large collection of pages at once, typically a book. So we will refer to the material that we are OCRing as "the book".

Let's describe a small but real project as an example of how to use Clara to OCR one "book". This section is in fact an in-depth tutorial on using Clara OCR. In order to try all the techniques explained in this section, please download and uncompress the file referred to as "page 143" of the Manuel Bernardes Branco Dictionary (Lisbon, 1879), available at http://www.claraocr.org. It's a tarball containing the two text columns (one per file) of that page.

Just to make things easier, we will assume that the files 143-l.pgm and 143-r.pgm were downloaded to the directory /home/clara/books/MBB/pgm/. We will also assume that the programs "clara" and "selthresh" are on the PATH. Some programs for handling PBM files (pgmtopbm, pnmrotate and others, by Jef Poskanzer) are also required. These programs are easy to find and are included in most free operating systems.

2.1 Scanning and thresholding

Clara OCR cannot scan paper documents by itself. Scanning must be performed by another program. The Clara OCR development effort is using SANE (http://www.mostang.com/sane) to produce 600 or 300 dpi images. The Clara OCR heuristics are tuned to 600 dpi.

Scanners offer three scanning modes: black-and-white (also known as "bitmap" or "lineart", although the meaning of these words may vary depending on the context), "grayscale" and "color". Clara OCR requires black-and-white or grayscale input. Both black-and-white and grayscale images may be saved in a variety of formats by scanning programs. However, only the PBM (for black-and-white) and PGM (for grayscale) formats are recognized. Generally grayscale at 600 or 300 dpi will be the best choice, but black-and-white at 600 dpi may be good for new, high quality printed materials. If your scanning program does not support the PBM or PGM formats, try to save the images in TIFF format and convert them to PBM or PGM using the command tifftopnm. If for some reason the TIFF format cannot be used, choose any other format that preserves all data (don't use "compressing" formats like JPEG) and for which a tool is available to convert it to PBM or PGM.

Remark: Programs that scan or handle (e.g. rotate) images may sometimes perform unexpected tasks, such as applying dithering or reduction algorithms on their own. An image transformed to become nice or small may be useless for OCR purposes.

Remark: The PBM and PGM formats do not carry the original resolution (dots-per-inch) at which the image was scanned. As some heuristics require that information, Clara OCR expects to be informed about it through the command-line switch -y (so take note of the resolution used).
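The remark above can be checked by looking at what a raw PBM/PGM header actually stores. The following is a minimal sketch in Python (not part of Clara OCR; the function name read_pnm_header is hypothetical) that parses the header of a raw image held in memory. The only fields it finds are the magic number, the width, the height and, for PGM, the maximum gray level. Comments are skipped, so even a resolution written in a comment would be ignored.

```python
def read_pnm_header(data):
    """Parse the header of a raw PBM (P4) or PGM (P5) image given as bytes.

    Returns (magic, width, height, maxval, offset), where offset is the
    index of the first raster byte. PBM has no maxval field; 1 is reported.
    """
    magic = data[:2].decode("ascii", "replace")
    if magic not in ("P4", "P5"):
        raise ValueError("not a raw PBM (P4) or PGM (P5) image")
    wanted = 2 if magic == "P4" else 3
    fields = []
    pos = 2
    while len(fields) < wanted:
        # Skip whitespace and '#' comment lines between header fields.
        while data[pos:pos + 1].isspace() or data[pos:pos + 1] == b"#":
            if data[pos:pos + 1] == b"#":
                while data[pos:pos + 1] not in (b"\n", b""):
                    pos += 1
            else:
                pos += 1
        start = pos
        while data[pos:pos + 1].isdigit():
            pos += 1
        if pos == start:
            raise ValueError("malformed header")
        fields.append(int(data[start:pos]))
    offset = pos + 1  # a single whitespace byte precedes the raster
    width, height = fields[0], fields[1]
    maxval = fields[2] if magic == "P5" else 1
    return magic, width, height, maxval, offset


# A tiny 4x2 PGM built in memory; the comment line is not a header field.
sample = b"P5\n# scanned at 300 dpi\n4 2\n255\n" + bytes(range(8))
print(read_pnm_header(sample)[:4])
```

Nothing in the returned tuple says how many dots per inch the scanner used, which is why the resolution has to be passed separately with -y.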

Grayscale means that each pixel assumes one gray "level", typically from 0 (black) to 255 (white). This is a good choice for scanning old or low-quality printed materials, because it's possible to use specialized programs to analyse the image and choose a "threshold", in such a way that all pixels above that threshold will be considered "white", and all others will be considered black (when scanning in black-and-white mode, the threshold is chosen by the scanning program or by the user). The threshold may be global (fixed for the entire page) or local (vary along the page).
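As an illustration of the rule just described, here is a minimal Python sketch of global thresholding (an illustration only, not Clara's code; the function name threshold_global is hypothetical). The page is assumed to be a list of rows of gray levels from 0 to 255, and the result follows the PBM convention in which 1 means black and 0 means white.

```python
def threshold_global(gray, threshold):
    """Binarize a grayscale page with one threshold for the whole page.

    Pixels strictly above the threshold become white (0); all others
    become black (1).
    """
    return [[0 if level > threshold else 1 for level in row] for row in gray]


page = [
    [250, 120, 30],
    [200, 128, 90],
]
for row in threshold_global(page, 128):
    print(row)
```

A local thresholder would use the same comparison but let the threshold vary from one region (or symbol) to another.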

In most cases grayscale will achieve better results. However, as grayscale images are much larger than black-and-white images, 300 dpi (instead of 600 dpi) may be mandatory when using grayscale due to disk consumption requirements.
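To give an idea of the sizes involved, the sketch below (Python, not part of Clara OCR) estimates the raster size of an uncompressed raw image. The scan area of 150 mm by 210 mm is an assumption chosen for illustration, header bytes are ignored, and the helper name raster_bytes is hypothetical.

```python
MM_PER_INCH = 25.4


def raster_bytes(width_mm, height_mm, dpi, bits_per_pixel):
    """Approximate raster size, in bytes, of a raw PBM (1 bit) or PGM (8 bit)."""
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    if bits_per_pixel == 1:
        # Raw PBM packs 8 pixels per byte and pads each row to a whole byte.
        return ((width_px + 7) // 8) * height_px
    return width_px * height_px * bits_per_pixel // 8


for label, dpi, bits in (("gray 300 dpi", 300, 8),
                         ("gray 600 dpi", 600, 8),
                         ("b/w  600 dpi", 600, 1)):
    size = raster_bytes(150, 210, dpi, bits)
    print(f"{label}: {size / 1e6:.1f} MB")
```

Under these assumptions a grayscale page takes roughly 4.4 MB at 300 dpi and 17.6 MB at 600 dpi, while a black-and-white page at 600 dpi takes about 2.2 MB.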

Remark: Try to limit yourself to the optical resolution offered by the scanner. Most old scanners are 300 dpi, and the scanning software obtains higher resolutions only through interpolation. Newer scanners may be optical 600 dpi or 1200 dpi or more.

Remark: the page 143 of Manuel Bernardes Branco Dictionary that we're using along these tests was scanned using the SANE scanimage command:

scanimage -d microtek2:/dev/sga --mode gray -x 150 -y 210 --resolution 300 > 143.pgm

Thresholding is not the only method for converting grayscale images to black-and-white (such conversion is also called "binarization"), but it's the method currently used by Clara OCR. In practice, a threshold that is too low will break many symbols at their thin parts, and a threshold that is too high will link symbols together (in the figure, an "a-i" link and a broken "u").

[Figure: ASCII-art rendering of linked and broken symbols: an "a-i" link and a broken "u".]

It's a hard task to detect broken and linked symbols. The Clara OCR heuristics that handle these cases are incipient, so thresholding must be carefully performed, in order not to compromise the OCR results. If the printing intensity, the noise level or the paper quality vary from page to page, thresholding must be performed on a per-page basis.

Remark: Clara can now try to avoid links in the segmentation step. Just set the "Try avoid links" parameter in the TUNE tab (normal values are at most 1).

The four thresholding methods currently available are: manual (global), histogram-based (global), classification-based (local), classification-based (global).

2.2 Manual and histogram-based (global)

Histogram-based thresholding is the default method. It automatically computes a threshold value based on the distribution of gray levels. To use it, enter the TUNE tab and select the "use histogram-based global thresholder" entry (it's selected by default). To try it, load a PGM image and press OCR or request the Segmentation OCR step.
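This manual does not say which rule Clara uses to derive the threshold from the histogram. As an illustration of the general idea only, the Python sketch below uses Otsu's method, a well-known histogram-based technique that picks the level maximizing the separation between the dark and the light pixels. It is a stand-in, not Clara's algorithm, and the function name otsu_threshold is hypothetical.

```python
def otsu_threshold(gray):
    """Return the level t (0..255) that maximizes the between-class variance.

    Class 0 holds the pixels with level <= t (dark), class 1 the pixels
    with level > t (light). gray is a list of rows of integer levels.
    """
    hist = [0] * 256
    for row in gray:
        for level in row:
            hist[level] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels at or below t
    sum0 = 0  # sum of the levels at or below t
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


# Dark ink around level 30, light paper around level 210.
page = [[20, 30, 40, 200], [210, 220, 25, 205]]
t = otsu_threshold(page)
print(t, [[0 if level > t else 1 for level in row] for row in page])
```

On a page with a clear gap between ink and paper levels, any threshold inside the gap separates them; the sketch returns the lowest such level.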

Remark: You can correct the automatically detected threshold with the "Threshold factor" in the TUNE tab.

A global thresholding value can also be specified manually. This corresponds to the "use manual global thresholder" entry. The thresholding value is chosen through a visual interface called "instant thresholding". To use it, load one PGM image and select the "Instant thresholding" entry (Edit menu). Then use '<', '>', '+' and '-' to change the thresholding value. When satisfied, press ESC. Note that the selected value will be applied only when the segmentation step runs.

2.3 Classification-based (local)

Global thresholding does not address those cases where the printing intensity (or the paper properties) varies along one same page. Local thresholding methods are required in such cases. Clara OCR implements a classification-based local (per-symbol) thresholder. Saying that it's classification-based means that the OCR engine is used to choose the threshold. In other words, the threshold chosen is the one for which the classifier successfully recognizes the symbol (in fact, this is a brute-force approach).
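The brute-force idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Clara's implementation: the classifier is a toy lookup table of known bitmaps, and the names binarize, local_threshold and toy_classifier are hypothetical.

```python
def binarize(gray, threshold):
    """Levels above the threshold become white (0), the others black (1)."""
    return tuple(tuple(0 if level > threshold else 1 for level in row)
                 for row in gray)


def local_threshold(gray_symbol, classify, candidates):
    """Try each candidate threshold in turn on one symbol.

    Returns (threshold, label) for the first threshold whose binarized
    symbol the classifier recognizes, or None if none is recognized.
    """
    for t in candidates:
        label = classify(binarize(gray_symbol, t))
        if label is not None:
            return t, label
    return None


# Toy classifier: it knows a single pattern, a thin vertical bar ("l").
KNOWN = {((0, 1, 0), (0, 1, 0), (0, 1, 0)): "l"}


def toy_classifier(bitmap):
    return KNOWN.get(bitmap)


# A bar printed with a gray smear on its right side.
symbol = [
    [230, 60, 140],
    [225, 70, 150],
    [240, 65, 145],
]
print(local_threshold(symbol, toy_classifier, range(160, 40, -20)))
```

With the higher thresholds the smear is taken as ink and the symbol is not recognized; lowering the threshold removes the smear and the bar matches the trained pattern. As the text says, the approach is only useful once the classifier has patterns to match against.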

The local binarizer can be manually applied at any symbol. To do so, load one PGM page and click any symbol directly on the PAGE tab. Two thresholding values will be chosen. The pixels found to be "black" for each one are painted "black" (smaller value) and "gray" (larger value). At this moment, it's possible to add the thresholded symbol as a pattern (just press the key corresponding to its transliteration). Remember that this thresholder relies on the classifier, so if the OCR is not trained, you'll get no benefit.

Two versions of the local binarizer were developed, a "weak" one and a "strong" one. The "weak" one just tries to change the threshold on those symbols not successfully classified using the default threshold. The "strong" one (unfinished) also tries to criticize locally the segmentation results. By default, the weak version is used. To try the strong one, check the corresponding checkbox at the TUNE tab.

Remark: As an alternative, use the "Balance" feature + global thresholding.

2.4 Classification-based (global)

Clara OCR includes a simple threshold selection script to compute global best thresholds based on classification results. Let's try it on our 2-page book. Just create a directory, cd to it and run the selthresh script, giving the resolution and the names of the images:

$ cd /home/clara/books/MBB
$ mkdir pbm
$ cd pbm
$ selthresh -y 300 -l 0.45 0.55 ../pgm/*pgm
selthresh: scaling 2 times
Best thresholds:
143-l.pgm 0.49
143-r.pgm 0.51

In this case, selthresh will require around 4 minutes to complete on a 500MHz CPU. For larger collections of pages, selthresh may take much longer to complete (hours or days). If needed, the execution can be safely interrupted using Control-C (it's ok to shut down the machine while selthresh is running). The execution can be safely restarted from the point where it was interrupted by typing the same command again:

$ cd /home/clara/books/MBB/pbm
$ selthresh -y 300 -l 0.40 0.55 ../pgm/*pgm

The option -l is used to specify an interval of thresholds to try. For now, selthresh is unable to choose a "good" interval by itself. The user must manually check the results for some thresholds in order to make a choice. For instance, to examine the results for threshold 0.4 on page 143-l.pgm, try:

$ pgmtopbm -threshold -value 0.4 ../pgm/143-l.pgm >143-l.pbm
$ display 143-l.pbm

Change the threshold and repeat. Once you find a threshold value that produces a "nice" visual result, pass to -l an interval centered at that threshold, with total width 0.1 or 0.2. The same interval may be used for all pages because selthresh will warn about a bad interval choice. Example:

$ selthresh -y 300 -l 0.30 0.35 ../pgm/143-l.pgm
selthresh: scaling 2 times
Best thresholds:
143-l.pgm 0.32 (bad interval, try -l 0.30 0.4)

If a "bad interval" warning appears on the final output for some pages, it's ok to restart selthresh with a new, wider interval, as suggested by selthresh. Only the suspicious pages will be re-examined. In fact, selecting a narrow initial interval (and making it larger as required) may be a good strategy to reduce the total running time.

Once the best thresholds are known, use pgmtopbm to produce the black-and-white images. It's also a good idea to bring the resolution closer to 600 dpi using pnmenlarge. Although pnmenlarge does not add information to the image, the classification heuristics will behave better. In our case, the commands should be

$ cd /home/clara/books/MBB/pbm
$ pnmenlarge 2 ../pgm/143-l.pgm | pgmtopbm -threshold -value 0.49 >143-l.pbm
$ pnmenlarge 2 ../pgm/143-r.pgm | pgmtopbm -threshold -value 0.51 >143-r.pbm

Remark: it's not a bad idea to look at the PBM files, or at least at some of them. Although selthresh produced good results for us, your mileage may vary.

In order to capture the output of selthresh (to extract the per-page best thresholds), it's ok to regenerate it as many times as needed: just repeat the same selthresh command, because once all computations have been performed, the script will just read the results from selthresh.out and print them.

A final warning: selthresh may be fooled by images that are too dark. So if the right limit is much larger than it should be, selthresh may produce bad results. Be careful with the right limit of the interval. As practical advice, keep in mind that the best threshold for most images is less than 0.6. In the near future we'll use statistical measurements to choose the interval to analyse, in order to prevent such problems and to make a manual choice unnecessary.

Remark: the tarball also includes an alternative selthresh, named slethresh_fidian.pl. It contains instructions on how to use it.

2.5 Avoiding or correcting skew

Sometimes the printing is skewed relative to the paper margins. Skew is a problem for the OCR heuristics. As the Clara OCR engine just detects components by pixel contiguity and builds classes of symbols, in practice the effect of skew will be a larger number of patterns, and therefore a larger revision cost.
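To make "detects components by pixel contiguity" concrete, here is a minimal Python sketch of connected-component detection on a black-and-white bitmap. It is an illustration, not Clara's code. In particular, treating diagonal neighbours as contiguous (8-connectivity) is an assumption, and the function name connected_components is hypothetical.

```python
from collections import deque


def connected_components(bitmap):
    """Group black pixels (value 1) that touch each other.

    bitmap is a list of rows of 0/1 values. Returns a list of components,
    each one a sorted list of (row, column) coordinates.
    """
    height = len(bitmap)
    width = len(bitmap[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    components = []
    for r in range(height):
        for c in range(width):
            if bitmap[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first search from an unvisited black pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and bitmap[ny][nx] == 1
                                and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(sorted(pixels))
    return components


page = [
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]
for component in connected_components(page):
    print(component)
```

Because each component is taken as it appears on the page, a rotated copy of a letter yields a different bitmap from the upright one, which is why skew multiplies the number of patterns.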

In some cases, a careful manual scanning can solve the problem. When acceptable, a set-square solves the problem: just align one text line at one set-square rule and the edge of the scanner glass at the other rule (we're supposing that the bookbinding was disassembled).

The bundled preprocessor now includes a method to compute and correct skew, but it's not on by default. To activate it, enter the TUNE tab and select the "Use deskewer" checkbox. Now deskewing will be applied when the OCR button is pressed (or when the "Preprocessing" OCR step is requested). Note that preprocessing is called only once per page, so if the page was already preprocessed, it won't be deskewed.

2.6 The work directory

Clara OCR expects to find one or more images of scanned pages in a single directory. In our case, this directory is assumed to be /home/clara/books/MBB/pbm. By default, various files will be created in this same directory to store the OCR data structures. So, if 143-l.pbm and 143-r.pbm are the pages to OCR, then after processing all pages at least once (not done yet) the work directory will contain the following files:

143-l.pbm
143-l.html
143-l.session
143-r.pbm
143-r.html
143-r.session
acts
patterns

The files "*.pbm" are the PBM images, the files "*.html" are the current OCR output, the file "patterns" is the current "bookfont", the file "143-l.session" contains the OCR data structures for the page 143-l.pbm, and the file "acts" stores the human revision effort already spent.
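The naming scheme just described can be written down as a small Python helper (an illustration only; the function name expected_workdir_files is hypothetical and the helper is not part of Clara OCR). Given the page images, it lists the files one would expect in the work directory once every page has been processed at least once.

```python
from pathlib import PurePosixPath


def expected_workdir_files(pages):
    """Return the sorted file names expected in the work directory."""
    names = {"acts", "patterns"}  # shared by all pages of the book
    for page in pages:
        stem = PurePosixPath(page).stem  # "143-l.pbm" -> "143-l"
        names.update({page, stem + ".html", stem + ".session"})
    return sorted(names)


for name in expected_workdir_files(["143-l.pbm", "143-r.pbm"]):
    print(name)
```

Comparing this list with the real directory contents is a quick way to spot pages that were never processed.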

When Clara OCR is processing the page x.pbm, the files "x.session", "acts" and "patterns" are in memory. These three files together are generally referred to as "the session". So the menu option "save session" means saving all three files.

2.7 Building the book font

Patterns are selected symbols from the book. They're obtained from manual training, or from automatic selection. The patterns are used to deduce the transliteration of the unknown symbols by the bitmap comparison heuristics. In other words, the OCR discovers that one symbol is the letter "a" or the digit "1" comparing it with the patterns.
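As a rough picture of what "comparing it with the patterns" means, the Python sketch below classifies a symbol by counting the pixels in which it differs from each pattern and accepting the closest pattern if the difference is small enough. This is a deliberately naive stand-in, not one of Clara's classifiers: the names pixel_difference and classify and the max_difference parameter are hypothetical, and the bitmaps are assumed to be already aligned and of equal size.

```python
def pixel_difference(a, b):
    """Count the positions where two equally sized bitmaps differ."""
    return sum(pa != pb
               for row_a, row_b in zip(a, b)
               for pa, pb in zip(row_a, row_b))


def classify(symbol, patterns, max_difference=1):
    """Return the transliteration of the closest pattern, or None.

    patterns is a list of (transliteration, bitmap) pairs, i.e. a very
    small "book font".
    """
    best = None
    for label, bitmap in patterns:
        if len(bitmap) != len(symbol) or len(bitmap[0]) != len(symbol[0]):
            continue  # this sketch compares bitmaps of equal size only
        difference = pixel_difference(symbol, bitmap)
        if difference <= max_difference and (best is None or difference < best[0]):
            best = (difference, label)
    return best[1] if best else None


book_font = [
    ("l", [[0, 1, 0], [0, 1, 0], [0, 1, 0]]),
    ("-", [[0, 0, 0], [1, 1, 1], [0, 0, 0]]),
]
noisy_l = [[0, 1, 0], [0, 1, 0], [0, 1, 1]]
print(classify(noisy_l, book_font))
```

A symbol that is far from every pattern stays unknown (None), which corresponds to the symbols that still need to be trained by hand.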

The book font is the collection of all patterns. The term "book font" was chosen to make sure that we're not talking about the X font used by the GUI. The book font is stored in a separate file ("patterns", in the work directory). Clara OCR classifies the patterns into "types", one type for each printing font. For now, most of this work must be done manually. Someday in the future, the auto-tuning features and the pre-built customizations will hopefully make this process less painful.

So, before OCRing one book, it's convenient to observe the different fonts used. In our case, we have three fonts (the quotations refer to the page 5.pbm):

Unknown Latin 9pt ("Todos sao iguais...")
Unknown Latin 9pt bold ("Art. 5")
Unknown Latin 8pt italic (footings)

It's not mandatory to identify each font exactly by its "correct" name or style or size (Roman, Arial, Courier, etc). In our case, we've chosen the labels above ("Unknown Latin 9pt" and the others). These labels can be manually entered using the PATTERN (TYPES) tab, one "type" for each "font". So we'll have 3 "types", and, for each one, various parameters can be manually specified. At least the alphabet must be specified. In fact, the PATTERN (TYPES) tab allows all fonts used in the book to be structured very carefully. Even some intricate details, like the classification techniques that can be used for each symbol, can be set.

Now we can select some patterns from the pages 143-l.pbm and 143-r.pbm. Try:

$ cd /home/clara/books/MBB/pbm
$ clara &

Load the page 143-l.pbm. Observe the symbols, select a nice one using the mouse button 1 or the arrows (say, a letter "a", small) and train it pressing the corresponding key (the "a" key). Repeat this process for various symbols, all from one same type (so do not mix bold with non-bold, etc). The entered patterns belong by default to "type 0". The "Set pattern type" entry of the Edit menu can be used to move all "type 0" patterns to some other type (1, 2 or 3 in our case). To display the letters and digits for which few or no samples are trained, click the mouse right button over the PAGE tab and select "Show pattern type". This way, one can complete all fonts used along the book.

At this point, the "Auto-classify" feature (Edit menu) may be quite useful. When on, Clara OCR will apply the pattern just trained to solve all unknown symbols, so after training an "a", only those "a" letters dissimilar to the trained one will remain unknown (grayed).

Now save the session (menu "File"), exit Clara OCR (menu "File"), and enter Clara OCR again using the same commands above. Try to load one file and/or to observe the patterns on the tabs PATTERN, PATTERN (list), TUNE (SKEL), etc. This is a good way to get used to the fact that Clara OCR is started and exited many times during one OCR project.

The last remark in this subsection: instead of the manual pattern selection just described, Clara OCR is able to select by itself the patterns to use from the pages. To use this feature, select the checkbox "Build the bookfont automatically" (TUNE tab), then classify the symbols (just press the OCR button using the mouse button 1, or press the mouse button 3 over it and select the "classify" item). However, the current recommendation is to prefer the manual selection of patterns, at least as a first step.

2.8 Skeleton tuning

Currently, symbol classification can be performed by three different classifiers: skeleton fitting, border mapping or pixel distance. The choice is made on the TUNE tab. Border mapping is currently experimental. Pixel distance has been used as an auxiliary classifier. Skeleton fitting is the most mature code and is highly customizable. For now, it's the default classification method.

When using skeleton fitting, two symbols are considered similar when each one contains the skeleton of the other. So the classification result depends strongly on how skeletons are computed. As an example, the figure presents one symbol ("e"). The symbol black pixels are the dots ('.'). The skeleton black pixels are stars ('*').

     .......
   ..******..
   .*.  ..*..
 ..*.   ...*.
 .*..    ...*..
..*.........*..
..***********..
..*.       ....
..*.
..*..
..*...     ...
 ..*..........
  ..********..
   .........
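The similarity criterion described above can be sketched in a few lines of Python. This is only an illustration, assuming that bitmaps and skeletons are represented as sets of (x, y) black-pixel coordinates and that the two symbols are already aligned. The names contains and similar are hypothetical, and the real classifier has tunable tolerances that this sketch ignores.

```python
# Minimal sketch of the "each one contains the skeleton of the other" test.
# Bitmaps and skeletons are sets of (x, y) black pixels (an assumption made
# for illustration; Clara OCR's internal representation is not shown here).

def contains(symbol_pixels, skeleton_pixels):
    """True if every skeleton pixel is also a black pixel of the symbol."""
    return skeleton_pixels <= symbol_pixels

def similar(sym_a, skel_a, sym_b, skel_b):
    """Two symbols are similar when each contains the other's skeleton."""
    return contains(sym_a, skel_b) and contains(sym_b, skel_a)

# A thick vertical bar and a thin one; both contain the same central skeleton.
thick = {(x, y) for x in range(0, 3) for y in range(5)}
thin = {(1, y) for y in range(5)}
skel = {(1, y) for y in range(1, 4)}

# A horizontal bar, whose skeleton sticks out of the vertical bars.
horiz = {(x, 2) for x in range(5)}
horiz_skel = {(x, 2) for x in range(1, 4)}

same_shape = similar(thick, skel, thin, skel)
different_shape = similar(thick, skel, horiz, horiz_skel)
print(same_shape, different_shape)
```

The sketch also shows why the result depends so strongly on how skeletons are computed: a thicker skeleton is harder to fit inside another symbol, a thinner one is easier.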

Clara OCR offers seven different methods for computing skeletons. Each method has tunable parameters. The choice of the method and the parameters can be made through a visual interface on the TUNE (SKEL) tab. To try it, first save the session (menu "File"), then enter that tab. At least one pattern must exist. Vary the parameters and observe the results. Press the left and right arrows to navigate through the patterns, and use the "zoom" button to choose a comfortable image size. The last selection will be used for all skeleton computations. To discard it, exit Clara OCR without saving the session.

Instead of using the TUNE (SKEL) tab, it's possible to specify skeleton computation parameters through the -k command-line switch. Note however that if a selection was performed through the TUNE (SKEL) tab, that selection will override the parameters given to -k, so be careful.

Clara OCR has an auto-tune feature to choose the "best" skeleton computation parameters. To use it, check the "Auto-tune skeleton parameters" entry on the TUNE tab. This feature is currently left off by default because manual tuning can achieve better results. Examples:

1. Quality printing without thin details

use -k 2,1.4,1.57,10,3.8,10,4,4 or -k 0,1.4,1.57,10,3.8,10,4,4

2. Quality printing with thin details

use -k 2,1.4,1.57,10,3.8,10,1,1 or -k 4,,,,,,3,

3. Poor printing without thin details

use -k 2,1.4,1.57,10,3.8,10,1,1

4. Poor printing with thin details

use -k 2,1.4,1.57,10,3.8,10,1,1
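The -k values in the examples above are lists of eight comma-separated fields, some of which may be left empty (as in "4,,,,,,3,"). The following Python sketch shows one way to read such a string. It assumes that an empty field means "keep the current value", which is an interpretation, not something stated in this manual. The function name parse_k and the all-zero starting values are hypothetical.

```python
# Sketch: read a -k style parameter string into a list of numbers.
# Assumption (not confirmed by the manual): an empty field keeps the
# value already in use for that position.

def parse_k(value, current):
    """Return a new parameter list built from `value` on top of `current`."""
    fields = value.split(",")
    if len(fields) != len(current):
        raise ValueError(
            "expected %d comma-separated fields, got %d"
            % (len(current), len(fields))
        )
    return [old if field == "" else float(field)
            for field, old in zip(fields, current)]

# Start from placeholder values, apply a full list, then a partial one.
base = parse_k("2,1.4,1.57,10,3.8,10,4,4", [0.0] * 8)
partial = parse_k("4,,,,,,3,", base)
print(partial)
```

Only the first and seventh fields change in the second call. All other positions keep the values from the first one.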

Although these parameters may be changed along the way, it's wise to choose adequate skeleton computation parameters before OCRing, and keep them fixed along the project. Every time Clara OCR is started, pass the same parameters chosen. In our case, we can use the default parameters. To do so, just enter Clara OCR as before:

    $ cd /home/clara/books/BC/pbm
    $ clara &

2.9 Classification tentatives

To classify the book symbols (i.e. to discover the transliteration of unknown symbols using the patterns), enter Clara OCR, select "Work on all pages" ("Options" menu) and press the OCR button using the mouse button 1, or press the mouse button 3 and select "Classification". The classification may be performed many times. Each time, different parameters may be tried to refine the results already achieved.

When the classification finishes, observe the pages 5.pbm and 6.pbm. Most probably, some symbols will be greyed. In other words, the classifier was unable to classify all symbols. The statistics presented on the PAGE (LIST) tab may be useful now. To reduce the number of unknown symbols there are three choices: add more patterns, change the skeleton computation parameters, or try another classifier.

To add more patterns, just train some greyed symbols and reclassify all pages again. The reclassification will be faster than the first classification because most symbols, already classified, won't be touched.

To change the skeleton computation parameters, exit Clara OCR, restart it passing the new parameters through -k, select "Re-scan all patterns" ("Edit" menu), select "Work on all pages" ("Options" menu) and reclassify. It may be easier to choose and set the new parameters using the TUNE (SKEL) tab, as explained earlier. However, remember that the parameters chosen through the TUNE (SKEL) tab override the parameters passed through -k.

To try another classifier, first select the "Re-scan all patterns" entry on the "Edit" menu. Then enter the TUNE tab and select the classifier to use from the available choices (skeleton-based, border mapping and pixel distance). The pixel distance may be a good choice. Then reclassify all pages.

The "Re-scan all patterns" is required because for each symbol Clara OCR remembers the patterns already tried to classify it, and does not try those patterns again. However, when the skeleton computation parameters change, or when the classifier changes, those same patterns must be tried again. Maybe in the future Clara OCR will decide by itself about re-scanning all patterns.
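The bookkeeping behind "Re-scan all patterns" can be illustrated with a small Python sketch. It is not Clara OCR's code: the Symbol class, the classify function and the equality test standing in for a real classifier are all hypothetical. It only shows why a second pass is cheap and why a re-scan is needed after the classifier or its parameters change.

```python
# Sketch: each symbol remembers which patterns were already tried on it.
# All names are hypothetical and the "shape ==" test is a stand-in for a
# real classifier (skeleton fitting, border mapping or pixel distance).

class Symbol:
    def __init__(self, shape):
        self.shape = shape
        self.tried = set()   # ids of patterns already compared
        self.label = None    # transliteration, once classified

def classify(symbol, patterns, rescan=False):
    """Try untried patterns on `symbol`; return the number of comparisons."""
    if rescan:
        symbol.tried.clear()
    comparisons = 0
    for pattern_id, (shape, label) in patterns.items():
        if pattern_id in symbol.tried:
            continue
        symbol.tried.add(pattern_id)
        comparisons += 1
        if shape == symbol.shape:
            symbol.label = label
            break
    return comparisons

patterns = {0: ("round", "o"), 1: ("tall", "l")}
sym = Symbol("hook")

first = classify(sym, patterns)                # every pattern is tried
second = classify(sym, patterns)               # nothing left to try
third = classify(sym, patterns, rescan=True)   # everything is tried again
print(first, second, third)
```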

2.10 Alignment tuning

At this point, we can generate the output for all pages. The output is already available if the classification was performed clicking the OCR button with mouse button 1. If not, just select the "Work on all pages" item on the "Options" menu, and click the OCR button using the mouse button 1. The per-page output will be saved to the files 5.html and 6.html.

Maybe the output will contain unknown symbols. Maybe the output presents broken lines or broken words. If so, the numbers used to perform symbol alignment must be changed. These numbers are configured on the TUNE tab ("Magic numbers" section). They're part of the session data, so they'll be saved to disk.

There are 7 such numbers:

    max word distance                as percentage of x_height
    max symbol distance              as percentage of x_height
    dot diameter                     measured in millimeters
    max alignment error              as percentage of DD
    descent (relative to baseline)   as percentage of DD
    ascent (relative to baseline)    as percentage of DD
    x_height (relative to baseline)  as percentage of DD
    steps required to complete the unity

In order to understand why these numbers are relevant, suppose, for instance, that Clara OCR already knows that the "b" symbol below is a letter "b", but does not know that the "p" symbol is a letter "p". To decide if the "p" symbol seems to be a letter instead of a blot, Clara OCR checks if it fits the typical dimensions of a letter. To do so, alignment hints are needed. On the figure we can see the baseline-relative ascent (AS), descent (DS) and x_height (XH), and the dot diameter (DD).

[Figure: the letters "b" and "p" drawn side by side on a common baseline, annotated with the ascent (AS), the x_height (XH), the descent (DS) and the dot diameter (DD).]

The most relevant numbers to configure are the dot diameter, the maximum alignment error, the descent, the ascent and the x_height. The last three give the baseline-relative descent, ascent and x_height as percentages of the dot diameter. The usage of these numbers is expected to stop some day in the future, when the pattern types implementation becomes more mature.
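To make the role of these numbers more concrete, here is a hedged Python sketch of a "does this symbol have the dimensions of a letter?" test. It is not the heuristic used by Clara OCR. The function names, the acceptance rule and all numeric values (dot diameter 0.5 mm, ascent 300%, descent 150%, x_height 200%, alignment error 50%) are assumptions chosen only for illustration.

```python
# Sketch: turn percentages of the dot diameter (DD) into millimeters and
# test whether a symbol's extents look like those of a letter.
# All values and the acceptance rule are illustrative assumptions.

def letter_metrics(dd_mm, ascent_pct, descent_pct, xheight_pct, max_err_pct):
    """Convert the magic numbers from percentages of DD to millimeters."""
    scale = dd_mm / 100.0
    return {
        "ascent": ascent_pct * scale,
        "descent": descent_pct * scale,
        "x_height": xheight_pct * scale,
        "max_error": max_err_pct * scale,
    }

def fits_letter(top_mm, bottom_mm, metrics):
    """top_mm: extent above the baseline; bottom_mm: extent below it."""
    err = metrics["max_error"]
    # The top must end near the x_height or near the ascent.
    top_ok = any(abs(top_mm - metrics[name]) <= err
                 for name in ("x_height", "ascent"))
    # The bottom must rest on the baseline or reach the descent.
    bottom_ok = any(abs(bottom_mm - depth) <= err
                    for depth in (0.0, metrics["descent"]))
    return top_ok and bottom_ok

m = letter_metrics(0.5, 300, 150, 200, 50)
looks_like_p = fits_letter(1.05, 0.70, m)   # x_height top, descender bottom
looks_like_blot = fits_letter(0.30, 0.0, m)  # far too short to be a letter
print(looks_like_p, looks_like_blot)
```

With a larger maximum alignment error more blots are accepted as letters. With a smaller one more real letters are rejected, which is one reason why these numbers may need tuning per book.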

3. Complex procedures

OCRing an entire book is a long process, and problems may be detected along the way: a bad choice of skeleton computation parameters, a bad page contaminating the bookfont, files lost due to a crash, etc. How can they be solved?

Clara OCR does not currently offer a complete set of tools to solve all these problems. In some cases, a simple solution is available. In others, a solution is expected to become available in future versions. This section will describe some practical cases, and explain what can be done and what cannot be done for each one.

3.1 Using two directories

To make it easier to use read-only media, Clara OCR allows splitting the files in two directories, one for images and the other for work files. The path of the first is stored on pagesdir, and the second, on workdir. For instance:

(pagesdir)
  |
  +- 1.pbm
  |
  +- 2.pbm

(workdir)
  |
  +- 1.session
  |
  +- 1.html
  |
  +- 2.session
  |
  +- 2.html
  |
  +- acts
  |
  +- font

In this example, there are 2 pages (files "1.pbm" and "2.pbm"). The current font is the file "pattern". The files 1.session and 2.session are the dumps of the data structures built when processing the pages 1 and 2. The files 1.html and 2.html contain the current OCR output generated for pages 1 and 2.
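The relation between a page image and its work files can be written down as a short Python sketch. The naming rule (same base name, extensions .session and .html) is taken from the example above. The function name work_files and the directory paths are hypothetical.

```python
# Sketch: map a page image on pagesdir to its work files on workdir.
# The directory names used below are made-up examples.
from pathlib import PurePosixPath

def work_files(pagesdir, workdir, page):
    """Return the image, session and output paths for one page."""
    stem = PurePosixPath(page).stem          # "1.pbm" -> "1"
    work = PurePosixPath(workdir)
    return {
        "image": PurePosixPath(pagesdir) / page,
        "session": work / (stem + ".session"),
        "output": work / (stem + ".html"),
    }

files = work_files("/mnt/cdrom/book", "/home/clara/work", "1.pbm")
for role, path in files.items():
    print(role, path)
```

Since only the files under workdir are written, pagesdir can sit on a read-only medium.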

3.2 Adding a page (to be written)

3.3 Multiple books

A somewhat rigid directory structure is recommended for high-volume digitization projects based on Clara and using the web interface. In this case, there will be multiple "pagesdir" directories ("book1" and "book2" from the booksroot in the figure) and, for each one, a corresponding "workdir" ("book1" and "book2" from the workroot in the figure).

(booksroot)
  |
  +- book1/
  |    +- 1.pbm
  |    |
  |    +- 2.pbm
  |
  +- book2/
       +- 1.pbm
       |
       +- 2.pbm

(workroot)
  |
  +- book1/
  |    +- 1.session
  |    |
  |    +- 1.html
  |    |
  |    +- 2.session
  |    |
  |    +- 2.html
  |    |
  |    +- acts
  |    |
  |    +- doubts/
  |    |    +- s.1.319.pbm
  |    |    |
  |    |    +- u.2.7015.pbm
  |    |    |
  |    |    +- 1.958225189.17423.hal
  |    |
  |    +- pattern
  |
  +- book2/
  |    +- 1.session
  |    |

For each book subdirectory on the workroot subtree, there will be a "doubts" directory, used to communicate with the web server. Each OCR run on some page of this book will generate files of the form "u.page.symbol.pbm", each containing a PBM image of one symbol. When the CGI is asked to produce a revision page, it will choose one of these files and rename it to s.page.symbol.pbm. This procedure is performed without using locks, so two simultaneous revision accesses may pick the same symbol. The revision submission generates a qmail-style file doc.time.pid.host.
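The naming scheme of the "doubts" directory can be summarized by a small Python sketch. It only manipulates file names; it does not reproduce what clara.pl actually does. The function names parse_doubt and claimed_name are hypothetical, and the regular expression assumes that the page part contains no dot and that the symbol part is a decimal number, as in the figure.

```python
# Sketch: recognise "u.page.symbol.pbm" / "s.page.symbol.pbm" names and
# compute the name a doubt gets once it has been chosen for revision.
import re

DOUBT_RE = re.compile(r"^([us])\.([^.]+)\.(\d+)\.pbm$")

def parse_doubt(name):
    """Return (state, page, symbol) or None if `name` is not a doubt file."""
    match = DOUBT_RE.match(name)
    if match is None:
        return None
    state, page, symbol = match.groups()
    return state, page, int(symbol)

def claimed_name(name):
    """Name of a "u." doubt after it is selected for a revision page."""
    parsed = parse_doubt(name)
    if parsed is None or parsed[0] != "u":
        raise ValueError("not an unclaimed doubt: %r" % name)
    return "s.%s.%d.pbm" % (parsed[1], parsed[2])

for entry in ["s.1.319.pbm", "u.2.7015.pbm", "1.958225189.17423.hal"]:
    print(entry, parse_doubt(entry))
print(claimed_name("u.2.7015.pbm"))
```

Because the rename is performed without locks, a caller of this kind must be prepared for the "u." file to have disappeared between listing the directory and renaming it.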

3.4 Adding a book (to be written)

3.5 Removing a page

From the stats presented by the PAGE (LIST) tab it's possible to detect problems on specific pages. A low factorization may be a symptom of a bad choice of brightness for that page. In such a case, it's probably a good idea to remove that page completely.

To remove a page is a delicate operation. Clara OCR currently does not offer a "remove page" feature. Basically, such a feature would have to remove all patterns from that page, remove the revision data acquired from that page, and remove the page image and its session file.

3.6 Dealing with classification errors

What to do when the OCR classifies incorrectly a large quantity of symbols? (to be written)

3.7 Rebuilding session files (to be written)

3.8 Importing revision data

When OCRing a large book, a good approach is to divide its pages into a number of smaller sections and OCR each one. So for a book with, say, 1000 pages, we could OCR pages 1-200, then 201-400, etc.
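The split into sections is plain arithmetic. As a minimal Python sketch (the function name sections is hypothetical):

```python
# Sketch: divide the page range of a book into sections of a fixed size.

def sections(first_page, last_page, size):
    """Return a list of (start, end) page ranges, both ends inclusive."""
    return [(start, min(start + size - 1, last_page))
            for start in range(first_page, last_page + 1, size)]

plan = sections(1, 1000, 200)
print(plan)
```

The last section is simply shorter when the number of pages is not a multiple of the section size.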

After finishing the first section, of course we want to reuse on the second section the training and revision effort already spent. This is not the same as adding the pages 201-400 to the first section, because we do not want to handle the pages 1-200 anymore.

Basically we need to import the patterns of the first section when starting to process the second. Well, Clara OCR is currently unable to perform this operation.

3.9 How to use the web interface

The Clara OCR web interface allows remote training of symbols. To use it, a web server able to run perl CGIs (e.g. Apache) is required. Let's present the steps to activate the web interface for a simple case, with only one book (named "book1"). Basically, one needs to create a subtree anywhere on the server disk (say, "/home/clara/www/"), owned by the user that will manage the project (say, "clara"), with the subdirectories "bin", "book1" and "book1/doubts":

    $ id
    uid=511(clara) gid=511(clara) groups=511(clara)
    $ cd /home/clara/
    $ mkdir www
    $ cd www
    $ mkdir bin book1
    $ mkdir book1/doubts

Then copy to the directory "bin" the files clara.pl and sclara.c from the Clara OCR distribution (say, /usr/local/src/clara), edit sclara.c to change the hardcoded definition of the root directory to "/home/clara/www", compile it and make it setuid:

    $ cd bin
    $ cp /usr/local/src/clara/clara.pl .
    $ cp /usr/local/src/clara/sclara.c .
    $ emacs sclara.c
    $ grep '^char \*root' sclara.c
    char *root = "/home/clara/www";
    $ cc -o sclara -static sclara.c
    $ rm sclara.c
    $ chmod a+s sclara

Edit the script clara.pl. Example for the clara.pl configuration section (the script clara.pl contains default definitions for some of these variables, please comment out those definitions):

    $CROOT = "/home/clara/www";
    $U = "/cgi-bin/clara";
    $book[0] = 'Author, <I>Test 1</I>, City, year';
    $subdir[0] = "book1";
    $LANG = 'en';
    $opt = '-W -R 10 -b -k 2,1.4,1.57,10,3.8,10,4,1';

Now copy the PBM files to the directory "book1", create low-quality jpeg previews, gzip the PBM files, and select some patterns:

    $ cd /home/clara/www/book1
    $ cp /usr/local/src/clara/imre.pbm .
    $ pbmreduce 8 imre.pbm | convert -quality 25 - imre.jpg
    $ gzip -9 imre.pbm
    $ clara -k 2,1.4,1.57,10,3.8,10,4,1

(load one PBM file, train some symbols, save the session and quit the program).

Now we need to process the PBM files in order to create some "doubts". The script clara.pl also requires a symlink to the clara binary (change the path /usr/local/bin/clara as required):

    $ cd /home/clara/www/bin
    $ ln -s /usr/local/bin/clara clara
    $ ./clara.pl -s book1
    $ rm ../book1/*html
    $ ./clara.pl -p

Now your server must be instructed to exec /home/clara/www/bin/clara.pl when a visitor requests "/cgi-bin/clara" (if you prefer another URL, change the clara.pl customization too). An easy way to accomplish that is creating a symlink on the default directory for CGIs. The default directory of CGIs is platform-dependent (e.g. /home/httpd/cgi-bin, /usr/local/httpd/cgi-bin, /var/lib/apache/cgi-bin, etc). Example:

    # cd /home/httpd/cgi-bin
    # ln -sf /home/clara/www/bin/clara.pl clara

Try to access the URL "/cgi-bin/clara" on your web server. The correct behaviour is successfully loading a page entitled "Prototype of the Cooperative Revision". If it fails, check the following common problems:

1. Apache expects to be explicitly allowed to follow symlinks. The file access.conf should contain, in our case, a section similar to the following:

    <Directory /home/httpd/cgi-bin>
    AllowOverride None
    Options ExecCGI FollowSymLinks
    </Directory>

2. The directory /home/clara must be world readable:

    # ls -ld /home/clara
    drwxr-xr-x 4 clara clara 1024 Sep 17 09:56 /home/clara
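The permission requirement of item 2 can also be checked programmatically. The following Python sketch tests the "others" read and search bits of a mode, which is what the r-x at the end of drwxr-xr-x stands for. The function names are hypothetical, and the check is a simplification: it looks at one directory only, not at every component of the path.

```python
# Sketch: test whether a directory mode lets "others" list and traverse it.
import os
import stat

def world_accessible(mode):
    """True if the mode grants read and search permission to others."""
    return bool(mode & stat.S_IROTH) and bool(mode & stat.S_IXOTH)

def directory_is_world_accessible(path):
    """Apply the test above to an existing directory."""
    return world_accessible(os.stat(path).st_mode)

print(world_accessible(0o755))   # drwxr-xr-x
print(world_accessible(0o700))   # drwx------
```

On the server, directory_is_world_accessible("/home/clara") would perform the same test on the real directory.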

If you succeeded, congratulations! Note that from time to time it'll be necessary to reprocess the pages, adding to the session files the data collected from the web, just like done before:

    $ cd /home/clara/www/bin
    $ ./clara.pl -p
    $ ./clara.pl -s book1

3.10 Revision acts maintenance

Types of revision acts (to be written).

Discarding deduced data (to be written).

3.11 Analysing the statistics

The "page (list)" tab offers recognition statistics on a per-page basis. The contents of each column on this tab are described below:

POS: The sequential position on the list. The current page is informed by an asterisk on this column.

FILE: The name of the file that contains the PBM image of the document.

RUNS: The number of OCR runs on this page. Partial OCR runs, like classification (started by the "classify" button), also count as one run.

TIME: Total CPU time spent on OCR operations on this page. I/O time (reading and saving session files) is not included.

WORDS: Current number of words on this page. This variable is updated by the "build" step.

SYMBOLS: Current number of symbols on this page. This variable is updated by the "build" step.

DOUBTS: Current number of untransliterated CHAR symbols on this page. This variable is updated by the "build" step.

CLASSES: Current number of classes on this page.

FACT: Quotient between the number of symbols and the number of classes.

RECOG: Quotient between (symbols-doubts) and symbols, where "symbols" is the number of symbols and "doubts" is the number of doubts as defined above.

PROGRESS: difference between the current recog rate and the recog rate for the previous run.
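The three derived columns follow directly from the definitions above. A minimal Python sketch (the function name page_stats and the sample numbers are made up for illustration):

```python
# Sketch: compute FACT, RECOG and PROGRESS from the per-page counters.

def page_stats(symbols, doubts, classes, previous_recog=None):
    """Return (fact, recog, progress); progress is None on the first run."""
    fact = symbols / classes if classes else 0.0
    recog = (symbols - doubts) / symbols if symbols else 0.0
    progress = None if previous_recog is None else recog - previous_recog
    return fact, recog, progress

# A page with 2000 symbols in 250 classes, 500 of them still untransliterated,
# whose recognition rate was 0.5 after the previous run.
fact, recog, progress = page_stats(2000, 500, 250, previous_recog=0.5)
print(fact, recog, progress)
```

A high FACT means that many symbols share each class, so a low FACT on one page, compared with the others, points at a page where the classifier is grouping poorly.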

3.12 Upgrading Clara OCR (to be written)

4. Reference of the Clara GUI

In this section, the Clara application window will be described in detail, both to document all its features and to define the terminology.

4.1 The application window

The application window is divided into three major areas: the buttons ("zoom", "OCR", "stop", etc), the "plate" (right), including the tabs ("page", "symbol" and "font"), and one or more "document windows" inside the plate.

We say "document window" because each window is exhibiting one "document". This "document" may be the scanned page (PAGE window), the current OCR output for this page (PAGE OUTPUT window), the symbol form (PAGE SYMBOL window), the GPL (GPL window) and so on. However, we'll refer to the document windows merely as "windows".

Around each window there are two scrollbars. On the bottom of the application window there is a status line. On the top there is a menu bar (fully documented on the section "Reference of the menus").

+-----------------------------------------------+
| File Edit OCR ...                             |
+-----------------------------------------------+
| +--------+    +----+ +--------+ +-------+     |
| |  zoom  |    |page| |patterns| | tune  |     |
| +--------+  +-+    +-+        +-+       +-+   |
| +--------+  | +-------------------------+ |   |
| |  zone  |  | |                         | |   |
| +--------+  | |                         | |   |
| +--------+  | |                         | |   |
| |  OCR   |  | |       WELCOME TO        | |   |
| +--------+  | |                         | |   |
| +--------+  | |    C L A R A   O C R    | |   |
| |  stop  |  | |                         | |   |
| +--------+  | |                         | |   |
|     .       | |                         | |   |
|     .       | |                         | |   |
|             | |                         | |   |
|             | |                         | |   |
|             | +-------------------------+ |   |
|             +-----------------------------+   |
|                                               |
| (status line)                                 |
+-----------------------------------------------+

4.2 Tabs and windows

Three tabs are offered, and each one may operate in one or more "modes". For instance, pressing the PATTERN tab many times will cycle through two modes: one presenting the windows "pattern" and "pattern (props)" and another with the window "pattern (list)".

On each tab, Clara OCR displays on the plate one or more windows. Each such window is called a "document window" to distinguish them from the application window. Each such window is supposed to be displaying a portion of a larger document, for instance

    The scanned page (graphic)
    The OCR output (text)
    The list of pages (text)
    The list of patterns (text)
    The symbol description (text)

Unless the user hides them, two scrollbars are displayed for each document window, one horizontal and one vertical. On each one, a cursor is drawn to show the relative portion of the full document currently visible on the display.

All available tabs and the modes for each one are listed below. The numbers (1, 2, etc) are only there to make it easier to distinguish one mode from the others. There is no effective association between the modes and the numbers.

    tab       mode  windows
    ---------------------------------------------
               1    WELCOME
               2    GPL
               3    PATTERN_ACTION
    page       4    PAGE_LIST
               5    PAGE  PAGE_OUTPUT  PAGE_SYMBOL
               6    PAGE_FATBITS  PAGE_MATCHES
    pattern    7    PATTERN
               8    PATTERN_LIST
               9    PATTERN_TYPES
    tune      10    TUNE
              11    TUNE_PATTERN  TUNE_SKEL
              12    TUNE_ACTS

Note that the windows WELCOME and GPL have no corresponding tab. When these windows are displayed, there is no active tab. Except in these cases, the name of the current window is always presented as the label of the active tab.

4.3 The Application Buttons

The application buttons are those displayed on the left portion of the Clara X window. They're labelled "zoom", "OCR", etc. Three types of buttons are available. There are on/off buttons (like "italic"), multi-state buttons (like the alphabet button), where the state is shown by the current label, and there are buttons that merely capture mouse clicks, like the "zoom" button. Some are sensitive both to mouse button 1 and to mouse button 3, others are sensitive only to mouse button 1.

zoom - enlarge or reduce bitmaps. Mouse button 1 enlarges bitmaps, mouse button 3 reduces them. The bitmaps to enlarge or reduce are determined by the current window. If the PAGE window is active, then the scanned document is enlarged or reduced. If the PAGE (fatbits) or the PATTERN window is active, then the grid is enlarged or reduced. If the PAGE (symbol) or the PATTERN (props) or the PATTERN (list) window is active, then the web clip is enlarged or reduced.

OCR - start a full OCR run on the current page or on all pages, depending on the state of the "Work on current page only" item of the Options menu.

stop - stop the current OCR run (if any). OCR does not stop immediately, but will stop as soon as possible.

zone - start definition of the OCR zone. Currently zoning in Clara OCR is useful only for saving the zone as a PBM file, using the "save zone" item on the "File" menu. For now, only one zone can be defined and the OCR operations consider the entire document, ignoring the zone.

type - read-only button, set according to the pattern type of the current symbol or pattern. The various letter sizes or styles (normal, footnote, etc) used by the book are numbered from 0 by Clara OCR ("type 0", "type 1", etc).

bad - toggles the button state. The bad flag is used to identify damaged bitmaps.

latin/greek/etc - read-only button, set according to the alphabet of the current symbol or pattern.

4.4 The Alphabet Map

When the "Show alphabet map" option of the "View" menu is selected, the GUI will include an alphabet map between the buttons and the plate. This map presents all symbols from the current alphabet. The current alphabet is selected using the alphabet button. The alphabet button circulates all alphabets selected on the "Alphabets" menu.

Clara OCR offers an initial support for multiple alphabets. To become useful, it needs more work. The alphabet map currently does not offer any functionality. For some alphabets (Cyrillic and Arabic) the alphabet map is disabled on the source code due to the large alphabet size. Currently Clara OCR does not contain bitmaps for displaying Katakana.

5. Reference of the menus

Most menus are accessible from their labels on the menu bar (on the top of the application window). The labels are "File", "Edit", etc. Other menus are presented when the user clicks the mouse button 3 on some special places (for instance the button "OCR"). Let's describe all menus and their entries.

5.1 File menu

This menu is activated from the menu bar on the top of the application X window.

Load page

Enter the page list to select a page to be loaded.

Save session

Save on disk the page session (file page.session), the patterns (file "pattern") and the revision acts (file "acts").

Save first zone

Save on disk the first zone as the file zone.pbm.

Save replacing symbols

Save on disk the entire page replacing symbols by patterns, to achieve better compression rates (mostly to produce small web images).

Write report

Save the contents of the PAGE LIST window to the file report.txt on the working directory.

Quit the program

Just quit the program (asking before if the session is to be saved).

5.2 Edit menu

This menu is activated from the menu bar on the top of the application X window.

Only doubts

When selected, the right or the left arrows used on the PATTERN or the PATTERN PROPS windows will move to the next or the previous untransliterated patterns.

Re-scan all patterns

When selected, the classification heuristic will retry all patterns for each symbol. This is required when trying to resolve the unclassified symbols using a second classification method.

Auto-classify.

When selected, the engine will re-run the classifier after each new pattern trained by the user. So if various letters "a" remain unclassified, training one of them will perhaps recognize some others, helping to complete the recognition.

Fill region

When selected, the mouse button 1 will fill the region around one pixel on the pattern bitmap under edition on the font tab.

Paint pixel

When selected, the mouse button 1 will paint individual pixels on the pattern bitmap under edition on the font tab.

Clear region

When selected, the mouse button 1 will clear the region around one pixel on the pattern bitmap under edition on the font tab.

Clear pixel

When selected, the mouse button 1 will clear individual pixels on the pattern bitmap under edition on the font tab.

Sort patterns by page

When selected, the pattern list window will divide the patterns in blocks according to their (page) sources.

Sort patterns by matches

When selected, the pattern list window will use as the first criterion when sorting the patterns, the number of matches of each pattern.

Sort patterns by transliteration

When selected, the pattern list window will use as the second criterion when sorting the patterns, their transliterations.

Sort patterns by number of pixels

When selected, the pattern list window will use as the third criterion when sorting the patterns, their number of pixels.

Sort patterns by width

When selected, the pattern list window will use as the fourth criterion when sorting the patterns, their widths.

Sort patterns by height

When selected, the pattern list window will use as the fifth criterion when sorting the patterns, their heights.
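The five criteria above behave like the components of a composite sort key, used in a fixed priority order. The following Python sketch illustrates the idea under explicit assumptions: patterns are plain dictionaries, every criterion sorts in ascending order, and a criterion that is not selected is simply left out of the key. None of these details is confirmed by this manual, and the names are hypothetical.

```python
# Sketch: sort patterns by the selected criteria, in fixed priority order.
# Ascending order for every criterion is an assumption.

PRIORITY = ["matches", "transliteration", "pixels", "width", "height"]

def sort_patterns(patterns, selected):
    """Sort using only the criteria in `selected`, keeping PRIORITY order."""
    active = [name for name in PRIORITY if name in selected]
    return sorted(patterns, key=lambda p: tuple(p[name] for name in active))

patterns = [
    {"id": 0, "matches": 3, "transliteration": "b",
     "pixels": 40, "width": 5, "height": 9},
    {"id": 1, "matches": 3, "transliteration": "a",
     "pixels": 50, "width": 6, "height": 6},
    {"id": 2, "matches": 1, "transliteration": "a",
     "pixels": 30, "width": 6, "height": 6},
]

by_matches_then_text = sort_patterns(patterns, {"matches", "transliteration"})
by_pixels = sort_patterns(patterns, {"pixels"})
print([p["id"] for p in by_matches_then_text])
print([p["id"] for p in by_pixels])
```

Because Python's sort is stable, patterns that tie on every selected criterion keep their previous relative order.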

Del Untransliterated patterns

Remove from the font all untransliterated patterns.

Set pattern type.

Set the pattern type for all patterns marked as "other".

Search barcode.

Try to find a barcode on the loaded page.

Instant thresholding.

Perform on-the-fly global thresholding.

Reset skeleton parameters

Reset the parameters for skeleton computation for all patterns.

5.3 View menu

This menu is activated from the menu bar on the top of the application X window.

Small font

Use a small X font (6x13).

Medium font

Use the medium font (9x15).

Large font

Use a large X font (10x20).

Default font

Use the default font (7x13 or "fixed" or the one informed on the command line).

Hide scrollbars

Toggle the hide scrollbars flag. When active, this flag hides the scrollbars on all windows.

Omit fragments

Toggle the hide fragments flag. When active, fragments won't be included on the list of patterns.

Show HTML source

Show the HTML source of the document, instead of the graphic rendering.

Show web clip

Toggle the web clip feature. When enabled, the PAGE_SYMBOL window will include the clip of the document around the current symbol that will be used through web revision.

Show alphabet map

Toggle the alphabet map display. When enabled, a mapping from Latin letters to the current alphabet will be displayed.

Show current class

Identify the symbols on the current class using a gray ellipse.

Show matches

Display bitmap matches when performing OCR.

Show comparisons

Display all bitmap comparisons when performing OCR.

Show matches and wait

Display bitmap matches when performing OCR, waiting for a key after each display.

Show comparisons and wait

Display all bitmap comparisons when performing OCR, waiting for a key after each display.

Show skeleton tuning

Display each candidate when tuning the skeletons of the patterns.

Presentation

Perform a presentation. This item is visible on the menu only when the program is started with the -A option.

5.4 Alphabets menu

This item selects the alphabets that will be available on the alphabets button.

Arabic

This is a provision for future support of Arabic alphabet.

Cyrillic

This is a provision for future support of Cyrillic alphabet.

Greek

This is a provision for future support of Greek alphabet.

Hebrew

This is a provision for future support of Hebrew alphabet.

Kana

This is a provision for future support of Kana alphabet.

Latin

Words that use the Latin alphabet include those from the languages of most Western European countries (English, German, French, Spanish, Portuguese and others).

Number

Numbers like 1234, +55-11-12345678 or 2000.

Ideogram

Ideograms.

5.5 Options menu

Work on current page only

OCR operations (classification, merge, etc) will be performed only on the current page.

Work on all pages

OCR operations (classification, merge, etc) will be performed on all pages.

Emulate deadkeys

Toggle the emulate deadkeys flag. Deadkeys are useful for generating accented characters. Deadkey emulation is disabled by default. It may be set on startup through the -i command-line switch.

Menu auto popup

Toggle the automenu feature. When enabled, the menus on the menu bar will pop automatically when the pointer reaches the menu bar.

PAGE only

When selected, the PAGE tab will display only the PAGE window. The windows PAGE_OUTPUT and PAGE_SYMBOL will be hidden.

5.6 PAGE options menu

This menu is activated when the mouse button 3 is pressed on the PAGE window.

See in fatbits

Change to PAGE_FATBITS focusing this symbol.

Bottom left here

Scroll the window contents so that the current pointer position becomes the bottom left.

Use as pattern

The pattern of the class of this symbol will be the unique pattern used on all subsequent symbol classifications. This feature is intended to be used with the "OCR this symbol" feature, so it becomes possible to choose two symbols to be compared, in order to test the classification routines.

OCR this symbol

Starts classifying only the symbol under the pointer. The classification will re-scan all patterns even if the "re-scan all patterns" option is unselected.

Merge with current symbol

Merge this fragment with the current symbol.

Link as next symbol

Create a symbol link from the current symbol (the one identified by the graphic cursor) to this symbol.

Disassemble symbol

Make the current symbol nonpreferred and each of its components preferred.

Link as accent

Create an accent link from the current symbol (the one identified by the graphic cursor) to this symbol.

Diagnose symbol pairing

Run locally the symbol pairing heuristic to try to join this symbol to the word containing the current symbol. This is useful to know why the OCR is not joining two symbols on one same word.

Diagnose word pairing

Run locally the word pairing heuristic to try to join this word with the word containing the current symbol on one same line. This is useful to know why the OCR is not joining two words on one same line.

Diagnose lines

Run locally the line comparison heuristic to decide which is the preceding line.

Diagnose merging

Run locally the geometrical merging heuristic to try to merge this piece to the current symbol.

Show pixel coords

Present the coordinates and color of the current pixel.

Show closures

Identify the individual closures when displaying the current document.

Show symbols

Identify the individual symbols when displaying the current document.

Show words

Identify the individual words when displaying the current document.

Show pattern type

Display absent symbols on pattern type 0, to help building the bookfont.

Report scale

Report the scale on the tab when the PAGE window is active.

Display box instead of symbol

On the PAGE window displays the bounding boxes instead of the symbols themselves. This is useful when designing new heuristics.

5.7 PAGE_FATBITS options menu

This menu is activated when the mouse button 3 is pressed on the PAGE_FATBITS window.

Bottom left here

Scroll the window contents so that the current pointer position becomes the bottom left.

Centralize

Scroll the window contents so as to center the closure under the pointer.

Build border path

Build the closure border path and activate the flea.

Search straight lines (linear)

Build the closure border path and search straight lines there using linear distances.

Search straight lines (quadratic)

Build the closure border path and search straight lines there using correlation.

Is bar?

Apply the isbar test on the closure.

Detect extremities?

Detect closure extremities.

Show skeletons

Show the skeletons on the PAGE_FATBITS window. The skeletons are computed on the fly.

Show border

Show the border on the PAGE_FATBITS window. The border is computed on the fly.

Show pattern skeleton

For each symbol, show the skeleton of its best match on the PAGE (fatbits) window.

Show pattern border

For each symbol, show the border of its best match on the PAGE (fatbits) window.

5.8 OCR steps menu

This menu is activated when the mouse button 3 is pressed on the OCR button. It allows running specific OCR steps (all steps run in sequence when the OCR button is pressed).

Preproc.

Start preprocessing.

Detect blocks

Start detecting text blocks.

Segmentation.

Start binarization and segmentation.

Consist structures

All OCR data structures are submitted to consistency tests. This is under implementation.

Prepare patterns

Compute the skeletons and analyse the patterns for the achievement of best results by the classifier. Not fully implemented yet.

Read revision data

Revision data from the web interface is read, and added to the current OCR training knowledge.

Classification

Start classifying the symbols of the current page or of all pages, depending on the state of the "Work on current page only" item of the Options menu. It will also build the font automatically if the corresponding item is selected on the Options menu.

Geometric merging

Merge closures on symbols depending on their geometry.

Build words and lines

Start building the words and lines. These heuristics will be applied on the current page or on all pages, depending on the state of the "Work on current page only" item of the Options menu.

Generate spelling hints

Remark: this is not implemented yet.

Start filtering through ispell to generate transliterations for unknown symbols or alternative transliterations for known symbols. Clara will use the dictionaries available for the languages selected on the Languages menu. Filtering will be performed on the current page or on all pages, depending on the state of the "Work on current page only" item of the Options menu.

Generate output

The OCR output is generated to be displayed on the "PAGE (output)" window. The output is also saved to the file page.html.

Generate web doubts

Files containing symbols to be revised through the web interface are created on the "doubts" subdirectory of the work directory. This step is performed only when Clara OCR is started with the -W command-line switch.

6. Reference of command-line switches

A number of internal variables can now be defined on the command line. Variable names can optionally be preceded by '-'. If a value is absent, the default is 1. Examples:

clara -pp_deskew
clara pp_deskew
clara pp_deskew=1
clara -pp_deskew=1

(apply deskewer)

clara bin_method=3

(use the classification-based, local thresholder)

To learn about the variables that can be defined on the command line, see the source code, file clara.c, function checkvar().
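
The rules above (an optional leading '-', and a default value of 1 when no value is given) can be sketched in Python as follows. This is a minimal illustration only: parse_variable is a hypothetical helper and does not reproduce checkvar() from clara.c, and treating every value as an integer is an assumption made here for simplicity.

```python
def parse_variable(word):
    """Split one command-line word into (name, value).

    Hypothetical helper: it only mirrors the rules stated in this
    manual, not the actual checkvar() implementation.
    """
    name, sep, value = word.partition("=")
    # The variable name may optionally be preceded by '-'.
    if name.startswith("-"):
        name = name[1:]
    # Assumption: values are integers; an absent value defaults to 1.
    return name, (int(value) if sep else 1)


for word in ["-pp_deskew", "pp_deskew", "pp_deskew=1",
             "-pp_deskew=1", "bin_method=3"]:
    print(word, "->", parse_variable(word))
```

Under these assumptions, the four pp_deskew spellings all yield the same (name, value) pair.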

-a bf_auto,st_auto,st_auto_global,classifier
Bookfont handling options
-b
Run in batch mode.

The application window will not be created, and the OCR will automatically execute a full OCR run on all pages (or on the page specified through -f).

Implies -u.

-c N|c|black,gray,white,darkgray,vdgray
Choose the number of gray levels or the colors to be used by the GUI. To choose both the number of gray levels and the colors, this option must be used twice.

The Clara OCR GUI uses by default only five colors, internally called "white", "black", "gray", "darkgray" and "vdgray" ("very dark gray"). There are two predefined schemes to map these internal colors into RGB values: "c" (color) and the default (grayscale). Alternatively, the mapping may be given explicitly as a comma-separated list of colors. The notation #RRGGBB is not supported; the colors must be specified through color names known by the X server (e.g. "brown", "pink", "navyblue", etc.; see the file /etc/X11R6/lib/X11/rgb.txt). The following example specifies the default mapping:

-c black,gray80,white,gray60,gray40

To simulate reverse video try:

-c white,gray40,black,gray60,gray80

However, when displaying graymaps, the GUI may use more colors. On truecolor displays, the GUI uses by default 32 or 256 graylevels when displaying graymaps. On pseudocolor displays, 4 graylevels are used (in fact, the colors "black", "vdgray", "gray" and "white" are used, so the "graylevels" are not necessarily "gray"). To force only 4 graylevels on truecolor displays, use

-c 4

To force black-and-white, use

-c 2

(for now, '-c N' is useful mainly as a workaround for bad behaviour of the GUI on some displays).
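
As an illustration of the explicit five-color form of -c, the following Python sketch pairs each internal color with the X color name given on the command line. It assumes that the names are positional, in the order shown in the option synopsis (black, gray, white, darkgray, vdgray); color_mapping is a hypothetical helper, not part of Clara OCR, and it does not handle the "c" and "N" forms.

```python
# Assumption: positional order taken from the synopsis of -c.
INTERNAL_COLORS = ("black", "gray", "white", "darkgray", "vdgray")


def color_mapping(arg):
    """Map Clara's internal color names to the X color names in arg."""
    names = arg.split(",")
    if len(names) != len(INTERNAL_COLORS):
        raise ValueError("expected five comma-separated color names")
    if any(name.startswith("#") for name in names):
        raise ValueError("#RRGGBB is not supported; use X color names")
    return dict(zip(INTERNAL_COLORS, names))


default = color_mapping("black,gray80,white,gray60,gray40")
reverse = color_mapping("white,gray40,black,gray60,gray80")
print(default)
print(reverse)
```

Read this way, the reverse video example simply swaps the light and dark ends of the default mapping.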

-D or -display
X display to connect to (by default, read from the environment variable DISPLAY).
-d
Run in debug mode. Debug messages will be sent to stderr. Debug messages are generated when an acceptable but not reasonable event is detected.
-e reviewer,type
Reviewer and reviewer type.

All revision data is assigned by Clara to its originator. By default the reviewer name is "nobody" and its type is "A".

The reviewer will generally be an email address or a nickname. The type may be T (trusted), A (arbiter) or N (anonymous). Example:

-e ueda@ime.usp.br,T
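
A minimal Python sketch of how an argument of this form could be split and checked is shown below. The helper parse_reviewer is hypothetical and only reflects what is stated above: the default reviewer "nobody" of type "A", and the three type letters.

```python
REVIEWER_TYPES = {"T": "trusted", "A": "arbiter", "N": "anonymous"}


def parse_reviewer(arg=None):
    """Return (reviewer, type) for a '-e reviewer,type' argument."""
    if arg is None:
        # Defaults stated in this manual.
        return "nobody", "A"
    # Split on the last comma so the reviewer name is kept whole.
    name, sep, rtype = arg.rpartition(",")
    if not sep or not name or rtype not in REVIEWER_TYPES:
        raise ValueError("expected reviewer,type with type T, A or N")
    return name, rtype


print(parse_reviewer("ueda@ime.usp.br,T"))
print(parse_reviewer())
```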

-F fontname
The X font to use (must be a font with fixed column size, e.g. "fixed" or "9x15").
-f path
Scanned page or page directory. Defaults to the current directory.

The argument must be a pbm file (with absolute or relative path) or the path (absolute or relative) of the directory where the pbm file(s) was (were) placed.

To specify a range of pages, use the restrictors start_at and stop_at. Example:

$ clara start_at=12 stop_at=122

In this case pages like 12.pbm or 0033.pbm will be processed, but pages like 9.pbm, 0009.pbm or 590.pbm won't.
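
The page selection in this example is consistent with comparing the numeric value of the file name against the two restrictors. The Python sketch below illustrates that reading; page_selected is a hypothetical helper, the inclusive upper bound is an assumption, and so is the rejection of names that are not purely numeric.

```python
import re


def page_selected(filename, start_at, stop_at):
    """Tell whether a page file falls within [start_at, stop_at]."""
    match = re.fullmatch(r"(\d+)\.pbm", filename)
    if match is None:
        # Assumption: only purely numeric page names are considered.
        return False
    # Leading zeros are ignored, so 0033.pbm counts as page 33.
    return start_at <= int(match.group(1)) <= stop_at


for name in ["12.pbm", "0033.pbm", "9.pbm", "0009.pbm", "590.pbm"]:
    print(name, page_selected(name, 12, 122))
```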

-g wxh(+|-)x(+|-)y
X geometry.
-h
Display short help and exit.
-i
Emulate dead keys functionality.
-k list
Parameters SA,RR,MA,MP,ML,MB,RX,BT used to compute skeletons.

BUG: these parameters are ignored when a "patterns" file already exists. In this case, Clara will read the parameters from the "patterns" file.

-N list
Switch off optimizations. Generally useful only for debug purposes. Unsupported display depths (if any) may require some optimizations to be switched off (s, a, j, q, c, x or d). Examples:

-N s
-N aq
-N jq

-o t|h|d
Select output format (t=text, h=html, d=djvu). The default is HTML.
-P PNT1,PNT2,MD
Parameters for filtering symbol comparison.

PNT1 and PNT2 are the pixel number thresholds. These thresholds are used to filter out bad candidates when classifying symbols. The first threshold is for strong similarity and the second for weak similarity. The comparison algorithm performs two passes. The first pass uses PNT1 to filter. The second pass uses PNT2. So on the first pass only patterns "quite similar" to the symbol to classify are tried. On the second pass, we relax and permit more patterns to be tried. This method helps to achieve good performance. As PNT1 becomes larger, fewer patterns will be tried on the first pass. As PNT2 becomes smaller, more patterns will be tried on the second pass.

MD is the maximum clearance to try a skeleton. The clearance must be an integer in the range 4..30 (default 6). The shape recognition algorithm will refuse to try to fit a skeleton into a symbol if their widths or heights differ by more than twice the clearance.

Examples:

-P 50,5,8
-P 40,3,6
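
The clearance rule for MD can be illustrated with a short Python sketch. The function clearance_allows is hypothetical and is not taken from the Clara OCR sources; it only encodes the rule stated above, with widths and heights assumed to be measured in pixels.

```python
def clearance_allows(symbol_size, skeleton_size, md=6):
    """Tell whether a skeleton may be tried against a symbol.

    Both sizes are (width, height) pairs. The fit is refused when the
    widths or the heights differ by more than twice the clearance.
    """
    if not 4 <= md <= 30:
        raise ValueError("MD must be an integer in the range 4..30")
    (sym_w, sym_h), (ske_w, ske_h) = symbol_size, skeleton_size
    return abs(sym_w - ske_w) <= 2 * md and abs(sym_h - ske_h) <= 2 * md


print(clearance_allows((20, 30), (25, 41)))        # differences 5 and 11
print(clearance_allows((20, 30), (25, 43)))        # differences 5 and 13
print(clearance_allows((20, 30), (25, 43), md=8))  # larger clearance
```

With the default clearance of 6, a difference of up to 12 in either dimension is tolerated; raising MD to 8, as in the first example above, widens that tolerance to 16.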

-R doubts
Maximum number of doubts per run. The argument must be an integer (default 30).
-T
Avoid loading and creation of session files. Also reports bookfont size on stdout before exiting. This option is intended to be used by the selthresh script.
-t
Switch on trace messages. Trace messages depict the execution flow, and are useful for developers. Trace messages are written to stderr.
-v
Verbose mode. Without this option, Clara runs quietly (default). Otherwise, informative warnings about potentially relevant events are sent to stderr.
-V
Print version and compilation options and exit.
-W
Web mode. Reads the input collected from the web from the doubts subdirectory, and writes the doubts to be reviewed to that same directory.
-w path
Work directory. Defaults to the page directory (see -f).

The path of the directory where the OCR will write the output, the acts, the book font and the session files. The doubts directory (web operation) is assumed to be a subdirectory of the work directory.

-X 0|1
Switch off (0) or on (1) index checking. Index checking is performed in some critical points in order to detect memory leaks. Index checking is unavailable when Clara is compiled with the symbol MEMCHECK undefined.
-y resolution
Specify the resolution of the scanned image in dots per inch (default 600). This resolution applies to all pages to be processed until the program exits.
-z
Write (and read) compressed session files (*.session, acts and patterns will be compressed using GNU zip).

Be careful: if -z is used, any existing uncompressed file (*.session, acts or patterns) will be ignored. So if you start with uncompressed files and later decide to switch to compressed files, manually compress all existing files before starting Clara with the -z switch.

Clara OCR support for reading and writing compressed files depends on the platform, and requires gzip and gunzip to be installed in some directory of binaries included in the PATH.
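
The manual compression step mentioned above could be scripted along the following lines. This Python sketch is only an illustration: compress_existing_files is a hypothetical helper, and it assumes that the compressed files are expected under the usual gzip naming (the original name plus ".gz"), which this manual does not state. Like the gzip command, it removes each original after compressing it.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def compress_existing_files(work_dir):
    """Gzip the *.session, acts and patterns files found in work_dir."""
    work = Path(work_dir)
    candidates = list(work.glob("*.session"))
    candidates += [work / "acts", work / "patterns"]
    compressed = []
    for src in candidates:
        if not src.is_file():
            continue
        # Assumption: compressed files use the standard ".gz" suffix.
        dst = src.with_name(src.name + ".gz")
        with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
            shutil.copyfileobj(fin, fout)
        src.unlink()
        compressed.append(dst.name)
    return sorted(compressed)


# Demonstration on a throwaway directory, not on a real work directory.
with tempfile.TemporaryDirectory() as demo:
    Path(demo, "12.session").write_bytes(b"session data")
    Path(demo, "patterns").write_bytes(b"pattern data")
    print(compress_existing_files(demo))
```

Try it on a copy of the work directory first, since the naming convention is an assumption.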

-Z ZPS
ZPS, that is, the size of the bitmap pixels measured in display pixels, when in fat bit mode. Must be a small odd integer (1, 3, 5, 7 or 9).

7. AVAILABILITY

Clara OCR is free software. Its source code is distributed under the terms of the GNU GPL (General Public License), and is available at http://www.claraocr.org/. If you don't know what the GPL is, please read it and check the GPL FAQ at http://www.gnu.org/copyleft/gpl-faq.html. You should have received a copy of the GNU General Public License along with this software; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. The Free Software Foundation can be found at http://www.fsf.org.

8. CREDITS

Clara OCR was written by Ricardo Ueda Karpischek. Giulio Lunati wrote the internal preprocessor. Clara OCR includes bugfixes produced by other developers. The Changelog (http://www.claraocr.org/CHANGELOG) acknowledges all of them (see below). Imre Simon contributed high-volume tests, discussions with experts, selection of bibliographic resources, propaganda and many ideas on how to make the software more useful.

Ricardo authored various free materials, some included (at least) in Conectiva, Debian, FreeBSD and SuSE (the verb conjugator "conjugue", the ispell dictionary br.ispell and the proxy axw3). He recently ported the EiC interpreter to the Psion 5 handheld and patched the Xt-based vncviewer to scale framebuffers and compute image diffs. Ricardo works as an independent developer and instructor. He received no financial aid to develop Clara OCR. He's not an employee of any company or organization.

Imre Simon promotes the usage and development of free technologies and information from his research, teaching and administrative labour at the University.

Roberto Hirata Junior and Marcelo Marcilio Silva contributed ideas on character isolation and recognition. Richard Stallman suggested improvements on how to generate HTML output. Marius Vollmer is helping to add Guile support. Jacques Le Marois helped with the announcement process. We acknowledge Mike O'Donnell and Junior Barrera for their good criticism. We acknowledge Peter Lyman for his remarks about the Berkeley Digital Library, and Wanderley Antonio Cavassin, Janos Simon and Roberto Marcondes Cesar Junior for some web and bibliographic pointers. Bruno Barbieri Gnecco provided hints and explanations about GOCR (main author: Jorg Schulenburg). Luis Jose Cearra Zabala (author of OCRE) is kindly supporting our attempts to use portions of his code. Adriano Nagelschmidt Rodrigues and Carlos Juiti Watanabe carefully tried the tutorial before the first announcement. Eduardo Marcel Macan packaged Clara OCR for Debian and suggested some improvements. Mandrakesoft is hosting claraocr.org. We acknowledge Conectiva and SuSE for providing copies of their outstanding distributions. Finally, we acknowledge the late Jose Hugo de Oliveira Bussab for his interest in our work.

Adriano Nagelschmidt Rodrigues donated a 15" monitor.

The fonts used by the "view alphabet map" feature came from Roman Czyborra's "The ISO 8859 Alphabet Soup" page at http://czyborra.com/charsets/iso8859.html.

The names cited by the CHANGELOG (and not cited before) follow (small patches, bug reports, specfiles, suggestions, explanations, etc).

Brian G. (win32), Bruce Momjian, Charles Davant (server admin), Daniel Merigoux, De Clarke, Emile Snider (preprocessor, to be released), Erich Mueller, Franz Bakan (OS/2), groggy, Harold van Oostrom, Ho Chak Hung, Jeroen Ruigrok, Laurent-jan, Nathalie Vielmas, Romeu Mantovani Jr (packager), Ron Young, R P Herrold, Sergei Andrievskii, Stuart Yeates, Terran Melconian, Thomas Klausner (NetBSD), Tim McNerney, Tyler Akins.