What is dhSegment?

dhSegment is a generic approach to historical document processing. It relies on a convolutional neural network (CNN) to do the heavy lifting of predicting pixelwise characteristics. Simple image processing operations are then provided to extract the components of interest (boxes, polygons, lines, masks, …).
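
As a rough illustration of the second stage, the sketch below thresholds a pixelwise probability map and extracts a single bounding box from it. It is not dhSegment's own API: the function name `mask_to_box`, the 0.5 threshold, and the toy array standing in for a network output are all assumptions made for this example, and only NumPy is used.

```python
import numpy as np


def mask_to_box(probs, threshold=0.5):
    """Return (x_min, y_min, x_max, y_max) of pixels above threshold, or None.

    Hypothetical helper for illustration; not part of dhSegment.
    """
    mask = probs > threshold          # binarize the pixelwise prediction
    ys, xs = np.nonzero(mask)         # row/column coordinates of foreground pixels
    if ys.size == 0:
        return None                   # nothing predicted above the threshold
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())


# Toy 6x8 "probability map" standing in for a network output (assumption).
probs = np.zeros((6, 8), dtype=np.float32)
probs[1:4, 2:7] = 0.9                 # a confident rectangular region

box = mask_to_box(probs)
print(box)                            # (2, 1, 6, 3)
```

A real pipeline would typically also clean the mask and handle several connected regions; this sketch only covers the simplest case of one region and an axis-aligned box.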

It was originally created by Benoit Seguin and Sofia Ares Oliveira at the Digital Humanities Laboratory (DHLAB) at EPFL for the needs of the Venice Time Machine.

A few key facts:

What sort of training data do I need?

Each training sample consists of an image of a document and a corresponding annotation of the parts to be predicted.

Additionally, a text file encoding the RGB values of the classes must be provided. For example, if the classes ‘background’, ‘document’ and ‘photograph’ are to be classes 0, 1, and 2 respectively, their colors are listed one per line, in class order:

0 255 0
255 0 0
0 0 255
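
To make the role of this file concrete, here is a minimal sketch of how such a class file could be read and used to turn a color-coded annotation into class indices. The helper names `parse_classes` and `rgb_to_class_ids`, the use of -1 for unknown colors, and the tiny 2x2 annotation are assumptions for illustration only; dhSegment's own loading code may work differently.

```python
import numpy as np


def parse_classes(text):
    """Parse 'R G B' lines into a list of RGB tuples; line i defines class i."""
    return [tuple(int(v) for v in line.split())
            for line in text.splitlines() if line.strip()]


def rgb_to_class_ids(label_rgb, colors):
    """Map an (H, W, 3) color-coded annotation to an (H, W) class-index array."""
    ids = np.full(label_rgb.shape[:2], -1, dtype=np.int64)  # -1 marks unknown colors
    for class_id, color in enumerate(colors):
        matches = np.all(label_rgb == np.array(color), axis=-1)
        ids[matches] = class_id
    return ids


# Contents of the class file shown above.
classes_txt = "0 255 0\n255 0 0\n0 0 255\n"
colors = parse_classes(classes_txt)

# Toy 2x2 annotation (assumption): mostly background, one pixel of each other class.
label = np.zeros((2, 2, 3), dtype=np.uint8)
label[:, :] = (0, 255, 0)      # class 0: background
label[0, 1] = (255, 0, 0)      # class 1: document
label[1, 1] = (0, 0, 255)      # class 2: photograph

class_ids = rgb_to_class_ids(label, colors)
print(class_ids.tolist())      # [[0, 1], [0, 2]]
```

The line order in the file is what assigns the class index, so reordering the lines changes which color maps to which class.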

Use cases

Page Segmentation

Dataset: READ-BAD [2], annotated by [1].

Layout Analysis

Dataset: DIVA-HisDB [3].

Ornament Extraction

Dataset: BCU collection.

Line Detection

Dataset: READ-BAD [2].

Document Segmentation

Dataset: Photo collection from the Cini Foundation.

TensorBoard Integration

References

If you want to cite this work, please cite it as:

S. Ares Oliveira, B. Seguin, and F. Kaplan, “dhSegment: A generic deep-learning approach for document segmentation,” in Frontiers in Handwriting Recognition (ICFHR), 2018 16th International Conference on, pp. 7–12, IEEE, 2018.

Dataset references:

[1] C. Tensmeyer, B. Davis, C. Wigington, I. Lee, and B. Barrett, “PageNet: Page boundary extraction in historical handwritten documents,” in Proceedings of the 4th International Workshop on Historical Document Imaging and Processing, pp. 59–64, ACM, 2017.
[2] T. Grüning, R. Labahn, M. Diem, F. Kleber, and S. Fiel, “READ-BAD: A new dataset and evaluation scheme for baseline detection in archival documents,” in 2018 13th IAPR International Workshop on Document Analysis Systems (DAS), pp. 351–356, IEEE, 2018.
[3] F. Simistira, M. Seuret, N. Eichenberger, A. Garz, M. Liwicki, and R. Ingold, “DIVA-HisDB: A precisely annotated large dataset of challenging medieval manuscripts,” in Frontiers in Handwriting Recognition (ICFHR), 2016 15th International Conference on, pp. 471–476, IEEE, 2016.