An Enhanced U-Net-based Approach for Sinhala Document Layout Analysis

Hulathdoowage S.K.D; Kumara B.T.G.S

doi:10.56427/jcbd.v4i3.767

Authors

Hulathdoowage S.K.D Faculty of Computing, Sabaragamuwa University of Sri Lanka
Kumara B.T.G.S Faculty of Computing, Sabaragamuwa University of Sri Lanka

Keywords:

Document Layout Analysis, Sinhala Documents, U-Net, Vision Transformer, Semantic Segmentation

Abstract

Document layout analysis plays a critical role in the digitization pipeline by identifying, segmenting, and classifying structural elements within documents to support accurate information extraction. This task becomes increasingly challenging when dealing with heterogeneous layouts that contain paragraphs, tables, figures, mathematical expressions, and other visual components. For Sinhala, a low-resource language with limited annotated datasets and specialized models, research in this area remains sparse. To address this gap, this study proposes an enhanced U-Net architecture that integrates convolutional neural networks with vision transformer blocks to improve semantic segmentation performance. The model leverages convolutional layers to capture fine-grained local features while employing transformer components to model long-range dependencies and global contextual relationships across document regions. A manually annotated dataset of 750 Sinhala document images covering 14 distinct element categories was developed to train and evaluate the model. Experimental results demonstrate that the proposed architecture significantly outperforms standard U-Net and attention U-Net variants, achieving 93.06% pixel accuracy, 64.37% mean IoU, and 77.32% mean F1-score. This research represents the first comprehensive document layout analysis framework tailored specifically for Sinhala documents and provides a strong foundation for future digitization, archival, and text processing initiatives within Sri Lankan academic, governmental, and cultural institutions.
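The three reported metrics can diverge widely on the same predictions, which may help explain the gap between 93.06% pixel accuracy and 64.37% mean IoU. The following is a minimal sketch of how such metrics are commonly derived from a per-pixel confusion matrix. It assumes macro averaging over classes, which the abstract does not state, and the function names and toy labels are hypothetical, not taken from the study.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Count pixels: rows are true classes, columns are predicted classes."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def segmentation_metrics(cm):
    """Return pixel accuracy, macro-averaged IoU, and macro-averaged F1."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(n))
    ious, f1s = [], []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                       # true class c, predicted as another class
        fp = sum(cm[r][c] for r in range(n)) - tp  # predicted c, truly another class
        if tp + fp + fn == 0:
            continue  # class absent from both labels and predictions (assumed skipped)
        ious.append(tp / (tp + fp + fn))
        f1s.append(2 * tp / (2 * tp + fp + fn))
    return {
        "pixel_accuracy": correct / total,
        "mean_iou": sum(ious) / len(ious),
        "mean_f1": sum(f1s) / len(f1s),
    }


# Toy flattened label maps with 3 classes (illustrative values only).
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]
metrics = segmentation_metrics(confusion_matrix(y_true, y_pred, 3))
print(metrics)
```

Pixel accuracy is dominated by whichever classes cover the most pixels, whereas the macro-averaged IoU and F1 weight every class equally, so rare element categories that are segmented poorly lower the averages even when most pixels are classified correctly.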
