ocrd-train

A training workflow for Tesseract 4, implemented as a Makefile that tracks dependencies and can build the required software from source.

Install

leptonica, tesseract

You will need a recent version (>= 4.0.0beta1) of tesseract built with the training tools and matching leptonica bindings. Build instructions and more can be found in the Tesseract project wiki.

Alternatively, you can build leptonica and tesseract within this project and install them to a subdirectory ./usr in the repo:

  make leptonica tesseract

Tesseract will be built from the git repository, which requires CMake, autotools (including autoconf-archive) and some additional libraries for the training tools. See the installation notes in the tesseract repository.

Provide ground truth

Place ground truth consisting of line images and transcriptions in the folder data/ground-truth. This list of files will be split into training and evaluation data; the ratio is defined by the RATIO_TRAIN variable.
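
The split controlled by RATIO_TRAIN can be pictured with a minimal Python sketch. It assumes a plain ordered split that rounds the training count down; the Makefile's actual rule may differ (for example, it may shuffle first), and the function and file names here are hypothetical.

```python
def split_train_eval(items, ratio_train=0.90):
    """Split a list into (train, eval) parts by a fraction, keeping input order."""
    if not 0.0 < ratio_train < 1.0:
        raise ValueError("ratio_train must be strictly between 0 and 1")
    # Number of training items, rounded down (an assumption of this sketch).
    n_train = int(len(items) * ratio_train)
    return items[:n_train], items[n_train:]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    lines = [f"line_{i:03d}.lstmf" for i in range(20)]
    train, evaluation = split_train_eval(lines, 0.90)
    print(len(train), "training lines,", len(evaluation), "evaluation lines")
```

With the default of 0.90, most of the ground truth goes to training and the remainder is held back for evaluation, so a very small data set leaves only a handful of evaluation lines.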

Images must be TIFF and have the extension .tif.

Transcriptions must be single-line plain text and have the same name as the line image but with .tif replaced by .gt.txt.
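
Before starting a training run, it can help to check that the folder follows the naming rules above. The following is a minimal sketch, not part of this project: the function name is hypothetical, and reading the transcriptions as UTF-8 is an assumption.

```python
from pathlib import Path


def find_ground_truth_problems(gt_dir):
    """Return a list of human-readable problems found in a ground-truth folder."""
    problems = []
    for image in sorted(Path(gt_dir).glob("*.tif")):
        # Transcription name: same as the image, with .tif replaced by .gt.txt
        transcription = image.with_name(image.name[: -len(".tif")] + ".gt.txt")
        if not transcription.is_file():
            problems.append(f"{image.name}: missing {transcription.name}")
            continue
        # UTF-8 is an assumption of this sketch.
        text = transcription.read_text(encoding="utf-8")
        # Tolerate trailing newlines, but no line break inside the text.
        if "\n" in text.rstrip("\n"):
            problems.append(f"{transcription.name}: more than one line")
    return problems


if __name__ == "__main__":
    for problem in find_ground_truth_problems("data/ground-truth"):
        print(problem)
```

An empty result means every .tif image has a single-line .gt.txt partner; the sketch does not validate the image content itself.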

The repository contains a ZIP archive with sample ground truth, ocrd-testset.zip. Extract it to ./data/ground-truth and run make training.

NOTE: If you want to generate line images for transcription from a full page, see tips in issue 7 and in particular @Shreeshrii's shell script.

Train

  make training MODEL_NAME=name-of-the-resulting-model

which is shorthand for

  make unicharset lists proto-model training

Run make help to see all the possible targets and variables:


  Targets

    unicharset       Create unicharset
    lists            Create lists of lstmf filenames for training and eval
    training         Start training
    proto-model      Build the proto model
    leptonica        Build leptonica
    tesseract        Build tesseract
    tesseract-langs  Download tesseract-langs
    clean            Clean all generated files

  Variables

    MODEL_NAME         Name of the model to be built. Default: foo
    START_MODEL        Name of the model to continue from. Default: ''
    PROTO_MODEL        Name of the proto model. Default: 'data/foo/foo.traineddata'
    CORES              No of cores to use for compiling leptonica/tesseract. Default: 4
    LEPTONICA_VERSION  Leptonica version. Default: 1.75.3
    TESSERACT_VERSION  Tesseract commit. Default: fd492062d08a2f55001a639f2015b8524c7e9ad4
    TESSDATA_REPO      Tesseract model repo to use. Default: _fast
    GROUND_TRUTH_DIR   Ground truth directory. Default: data/ground-truth
    NORM_MODE          Normalization Mode - see src/training/language_specific.sh for details. Default: 2
    PSM                Page segmentation mode. Default: 6
    RATIO_TRAIN        Ratio of train / eval training data. Default: 0.90
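
Any of these variables can be overridden on the make command line as NAME=value. When training is launched from a script, the command can be assembled as in the sketch below; the helper names and the model name `my-model` are hypothetical, and only the variable names come from the list above.

```python
import subprocess


def make_command(target="training", **variables):
    """Build an argv list such as ['make', 'training', 'MODEL_NAME=foo']."""
    args = ["make", target]
    for name, value in variables.items():
        # make accepts variable overrides as NAME=value arguments.
        args.append(f"{name}={value}")
    return args


def run_make(target="training", **variables):
    """Run make in the current directory and raise if it exits with an error."""
    return subprocess.run(make_command(target, **variables), check=True)


if __name__ == "__main__":
    # Only prints the command; call run_make(...) from the repository root to execute it.
    cmd = make_command("training", MODEL_NAME="my-model", RATIO_TRAIN="0.80", PSM=7)
    print(" ".join(cmd))
```

Passing the command as a list rather than a single shell string avoids quoting problems if a value contains spaces.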

License

The software is provided under the terms of the Apache 2.0 license.

Sample training data provided by Deutsches Textarchiv is in the public domain.