Printed OCR dataset
Curated for precision and diversity, these image datasets feature a wide array of handwritten text samples, ranging from letters and notes to forms and invoices.
This dataset consists of 10 categories and a total of 5005 printed images, covering the most commonly encountered scenarios in daily life. Explore and accelerate your AI projects today!
Apr 14, 2023 · The HierText dataset contains ~12k images from the Open Images dataset v6 with a large number of text entities.
Vietnamese OCR dataset, including word-level and line-level image data. The data was collected in Vietnam, and all the images in the dataset include labeling results.
Traditional Chinese OCR text recognition dataset.
Discover our specialized English Handwritten OCR Image Datasets, designed to advance the recognition of handwritten English text.
Download Finnish printed OCR datasets to train robust OCR and text recognition models.
Unlock the potential of Danish text recognition with our carefully curated Danish Printed OCR Datasets.
With the ongoing advances in deep learning and optical character recognition (OCR) technologies, neural networks designed to perform large-scale classification play an essential role in facilitating OCR systems.
These datasets feature a diverse range of printed text from sources like books, invoices, receipts, product labels, digital menus, and more.
Aug 19, 2022 · Datasets related to using computer vision with images of documents, invoices, papers, contracts, screenshots, text, signatures, PDFs, JPEGs, PNGs, and more.
This dataset consists of 11 categories and a total of 5006 printed images, covering the most commonly encountered scenarios in daily life.
Download Tamil printed OCR datasets to train robust OCR and text recognition models.
Odia Printed OCR Image Datasets: Unlock the potential of Odia text recognition with our carefully curated Odia Printed OCR Datasets.
Furthermore, it supports multilingual text, as well as both handwritten and ...
This repo contains a dataset of 10,603 samples of printed Arabic numeral digits used in Egyptian ID cards, collected to be used as part of an Arabic OCR system.
Download Japanese printed OCR datasets to train robust OCR and text recognition models.
This overlap causes difficulties during the optical character recognition (OCR) and digitization process of documents and, subsequently, hurts downstream NLP tasks.
We also provide UrduDoc, a benchmark dataset for Urdu text line detection in scanned documents.
However, training modern OCR models, especially powerful vision-language models, is hampered by the lack of large, diverse, and well-structured datasets that mimic real-world book layouts.
This repo collects OCR-related datasets.
Our model outperforms these OCR engines on 8 out of 13 languages.
Datasets. We utilize a variety of printed and handwritten Arabic OCR datasets in this study, which we have collectively named MIDAD.
The IAM Handwritten dataset contains images of handwritten text.
OCR system for the Arabic language that converts images of typed text to machine-encoded text.
OCR is a technology that converts printed or handwritten text into machine-readable text, and having standardized datasets is crucial for benchmarking the accuracy and robustness of OCR solutions.
Designed for precision, these datasets include a wide variety of Odia printed text from sources like books, newspapers, invoices, and product labels.
Download Hindi printed OCR datasets to train robust OCR and text recognition models.
Thus, it is hoped that the dataset provides the basis for developing practical Telugu OCR systems.
It is designed for OCR training and Vision-Language Model (VLM) fine-tuning, offering a variety of real-world document layouts.
Designed for precision, these datasets include a wide variety of Chinese printed text from sources like books, newspapers, invoices, and product labels.
... and first released in this repository.
To the best of our knowledge, this is the first publicly available large-scale dataset, comprising over 2 million images that encompass a wide range of fonts, sizes, and styles for characters, words, and sentences.
This dataset contains scanned images from 10 types of documents, such as advertisements, emails, forms, letters, and news articles.
OCR software uses machine learning to recognize characters in the document. Such software needs to pass a training phase to learn how to recognize the letters in the text.
Download Thai printed OCR datasets to train robust OCR and text recognition models.
Designed for precision, these datasets include a wide variety of Polish printed text from sources like books, newspapers, invoices, and product labels.
Contribute to xylcbd/ocr-open-dataset development by creating an account on GitHub.
UTRNet: High-Resolution Urdu Text Recognition In Printed Documents (ICDAR'23) abdur75648.
The dataset can be ...
TrOCR model fine-tuned on the SROIE dataset.
Download Gujarati printed OCR datasets to train robust OCR and text recognition models.
Nov 3, 2024 · One of these is the traditional Mongolian MnOCR 1 dataset, and the other was released concurrently with the publication [10], consisting merely of a list of traditional Mongolian words printed in different fonts.
The English dataset consists of 21 categories, with a total of 2,637 printed images and 406 handwritten images, covering the most commonly used scenarios in daily life, with all data labeled.
Designed for optical character recognition (OCR), machine learning (ML), and data science applications, this dataset provides diverse and challenging samples to enhance model accuracy and robustness.
It was introduced in the paper TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models by Li et al.
To mitigate this gap, we present the Printed English and Chemical Equations (PEaCE) dataset, containing both synthetic and real-world records, and evaluate the efficacy of transformer-based OCR models when trained on this resource.
Similarly, the SROIE dataset consists of several thousand samples of receipt ...
DIDA: The largest historical handwritten digit dataset with 250k digits. DIDA is a new image-based historical handwritten digit dataset, collected from Swedish historical handwritten document images dated between 1800 and 1940.
Download English printed OCR datasets to train robust OCR and text recognition models.
SARD: Synthetic Arabic Recognition Dataset. Overview: SARD (Synthetic Arabic Recognition Dataset) is a large-scale, synthetically generated dataset designed for training and evaluating Optical Character Recognition (OCR) models for Arabic text.
Please refer to our GitHub ...
ICDAR 2017 Post OCR Data Processing data.
Featuring both handwritten and printed posters in multiple languages, our high-quality datasets are perfect for training and refining OCR models.
Dec 27, 2023 · The main purpose of the present article is to introduce a Printed Farsi Dataset for OCR research (called IDPL-PFOD).
We provide word-, line-, and paragraph-level annotations.
Evaluation of LLM-based Approaches: ...
Sep 1, 2023 · Automated Optical Inspection (AOI) is a process that uses cameras to autonomously scan printed circuit boards for quality control.
... 1L+ images in total.
It is the first large-scale dataset for text in natural images and also the first dataset to annotate scene text with attributes such as legibility and type of text.
Additionally, we have developed an online tool for end-to-end Urdu OCR from printed documents by integrating UTRNet with a text detection model.
The dataset is specifically curated for text detection and recognition tasks in documents such as scanned papers, forms, invoices, and handwritten notes.
Because the data is freely available, the cost of developing the application is reduced significantly.
Handwritten OCR Image Datasets: Unlock the power of handwritten text recognition with our specialized Handwritten OCR Image Datasets.
The editable version of the image is usually used for further processing operations like indexing, searching, analyzing, and modification.
Sep 28, 2025 · Explore OCR accuracy among ABBYY FineReader, Google Cloud Vision API, AWS Textract, Azure Computer Vision, and Tesseract on handwritten and printed images.
Jul 17, 2019 · Abstract: This paper describes how UHTelPCC, a dataset for Telugu printed character recognition, was created, along with its characteristics.
TextOCR provides ~1M high-quality word annotations on TextVQA images.
Nov 12, 2024 · Train Machine Learning Models Faster with 15 Best Open-source Handwriting & OCR Datasets.
Explore our Printed Digits Dataset featuring 3000 grayscale images, designed for Sudoku digit classification and OCR tasks.
Beyond meeting the demands of the Farsi language, the proposed dataset, with its ...
A Dataset is a curated collection of text data and corresponding images that is specifically designed for training and evaluating OCR systems and algorithms.
Each sample consists of a synthetically rendered formula image and its corresponding LaTeX formula.
Dataset and evaluation code for the paper "CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy".
In general, the datasets are classified into 6 types, i.e., Natural Scene Text, Document Text, Handwritten Text, Historical Document Text, Video Text, and Synthetic Text.
Featuring a diverse collection of printed materials in multiple languages, these high-quality datasets include detailed annotations for precise text extraction.
We compare our model's performance with popular publicly available OCR tools for end-to-end document image recognition.
Including PPT type, document type, natural light photography, screenshots, and handwriting type.
Ideal for enhancing handwriting recognition in AI applications.
The image encoder was initialized from the weights of BEiT, while the text decoder was initialized from ...
Download Telugu printed OCR datasets to train robust OCR and text recognition models.
Nov 23, 2022 · The search stage includes the following terms: Arabic OCR, Arabic optical character recognition, Arabic OCR dataset, Arabic OCR Database, handwritten Arabic characters, printed Arabic recognition, CNN Arabic OCR, Handcrafted Arabic OCR, and deep learning Arabic OCR.
Designed for precision, these datasets include a wide variety of Arabic printed text from sources like books, newspapers, invoices, and product labels.
Dec 4, 2024 · (iii) We also provide APIs for our OCR models and web-based applications that integrate these APIs to digitize Indic printed documents.
Nov 6, 2020 · OCR-based Dot Matrix Character Recognition using Tesseract. Hello folks, I recently took up a character recognition task for someone, and its dataset included characters in dot matrix form ...
Perfect for machine learning ...
A Comprehensive Dataset for Advancing OCR Accuracy and VLM Fine-tuning.
A list of all open datasets about OCR.
Download Malay printed OCR datasets to train robust OCR and text recognition models.
The data covers a wide range of real-world conditions: natural scenes, printed documents ...
Unlock the potential of Polish text recognition with our carefully curated Polish Printed OCR Datasets.
From handwritten menu images and calligraphy to printed images and street signs, our datasets offer limitless possibilities.
Unlock the potential of Czech text recognition with our carefully curated Czech Printed OCR Datasets.
A large-scale synthetic Arabic OCR dataset comprising 843,622 book-style document images across 10 fonts, designed to advance VLMs for Arabic text.
Optical Character Recognition (OCR) detects and recognizes printed and handwritten text in an image and converts it to editable text.
Abstract: Any text recognition or Optical Character Recognition (OCR) system requires a dataset to learn how to recognize the text.
OCR enables machines to interpret and convert printed or handwritten text into machine-readable data, revolutionizing how we interact with information.
Italian Printed OCR Image Datasets: Unlock the potential of Italian text recognition with our carefully curated Italian Printed OCR Datasets.
This page hosts the TSV version of the CC-OCR data, which is used for evaluation in VLMEvalKit.
The ViOCRVQA dataset contains 28,282 images and 123,781 questions relevant to the images, with answers.
May 30, 2025 · Abstract: Arabic Optical Character Recognition (OCR) is essential for converting vast amounts of Arabic print media into digital formats.
This OCR dataset consists of diverse types of images with text in the English language from different types of products.
Unlock the potential of Chinese text recognition with our carefully curated Chinese Printed OCR Datasets.
Unlock the potential of Arabic text recognition with our carefully curated Arabic Printed OCR Datasets.
Along with images, this dataset includes detailed metadata as well.
These datasets feature a diverse collection of handwritten samples, including letters, sticky notes, forms, invoices, etc.
The dataset is created from characters extracted from images of printed Telugu texts from the period 1950–1990.
Interestingly, it is primarily handwritten Mongolian OCR datasets that are in use, as shown in references [27, 28].
google-research-datasets/hiertext
Optical Character Recognition Dataset containing Various Fonts and Styles.
CC-OCR: This is the repository for the CC-OCR Benchmark.
We introduce a synthetic dataset, Mozhi-LR(S), and a real dataset, Mozhi-LR(R), comprising word-level images with textual transcriptions for these nine languages.
This dataset spans 20 languages, including Traditional Chinese, Simplified Chinese, Japanese, Korean, Thai, Vietnamese, Indonesian, Malay, Polish, and more.
Contains alphanumeric English characters (62 classes) - 2.
In order to implement the training phase, the OCR needs to use a standard dataset.
The dataset contains a variety of document types, including different layouts, font sizes, and styles.
Designed for precision, these datasets include a wide variety of Italian printed text from sources like books, newspapers, invoices, and product labels.
Supercharge your OCR technology with our comprehensive newspaper, book, and magazine image datasets.
Boost your OCR capabilities with high-quality Spanish printed text datasets, crafted to optimize recognition algorithms for accurate and reliable Arabic text processing.
Extracting text from chemistry publications requires an OCR model that is capable in both realms.
Nexdata provides trusted speech recognition, computer vision, and natural language understanding data for AI training, including OCR Training Datasets, OCR Datasets, Tailored Data, Handwriting OCR Datasets, Invoice OCR Datasets, English OCR Datasets, Japanese OCR Datasets, and Korean OCR Datasets, and is your reliable partner.
Jun 5, 2025 · Dataset Analysis: We analyze existing Vietnamese OCR datasets, including printed, handwritten, and scene text corpora, and discuss their limitations regarding size, diversity, and annotation quality.
The system aims to solve a simpler OCR problem, with images that contain only Arabic characters (check the dataset link below to see a sample of the images).
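The 62 alphanumeric classes mentioned above match the 10 digits plus the 26 uppercase and 26 lowercase English letters. A minimal sketch of a label map for such a character set follows. The class ordering and the names `CLASSES`, `encode`, and `decode` are assumptions made for illustration; the dataset itself may index its classes differently.

```python
import string

# 10 digits + 26 uppercase + 26 lowercase letters = 62 classes.
# The ordering (digits, then uppercase, then lowercase) is an assumption.
CLASSES = string.digits + string.ascii_uppercase + string.ascii_lowercase

CHAR_TO_ID = {ch: idx for idx, ch in enumerate(CLASSES)}
ID_TO_CHAR = {idx: ch for ch, idx in CHAR_TO_ID.items()}


def encode(text):
    """Map each character to its class index; raises KeyError for other symbols."""
    return [CHAR_TO_ID[ch] for ch in text]


def decode(ids):
    """Map class indices back to a string."""
    return "".join(ID_TO_CHAR[i] for i in ids)
```

A character classifier trained on such data would typically emit one of these 62 indices per glyph, and `decode` turns a sequence of predictions back into text.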
In this paper, IDPL-PFOD2, an innovative large-scale dataset for Farsi OCR, is presented.
Along with product images, this dataset includes detailed metadata as well.
We collected parts of each of the datasets and then randomly cut the sentences to collect the final training set.
LaTeX-OCR Dataset Summary: This dataset was created to train LaTeX-OCR, a model for recognizing LaTeX code from images of mathematical formulas.
Designed for precision, these datasets include a wide variety of Filipino printed text from sources like books, newspapers, invoices, and product labels.
Fine-tuning on this dataset makes the model recognize handwritten text better than the other models.
Our carefully curated datasets feature a diverse range of images, including printed and handwritten text from various sources such as invoices, flyers, business cards, and product labels.
A foundational understanding of OCR, text recognition, and training datasets to fuel these AI models.
In this paper, we introduce ICText, the largest dataset for text detection and recognition on integrated circuits ...
Jun 14, 2025 · The trocr-base-printed model is a Transformer-based Optical Character Recognition (OCR) model fine-tuned on the SROIE dataset.
The Digits and Alphabets Images Dataset offers a meticulously curated collection of characters and digits, blending samples from the renowned Chars74K and DIDA datasets.
This article explores the powerful capabilities of OCR and presents a TensorFlow-based model, a testament to the ...
Designed for precision, these datasets include a wide variety of Czech printed text from sources like books, newspapers, invoices, and product labels.
The HICMA Dataset Benchmarking Tool is a utility designed to assess the performance of Optical Character Recognition (OCR) models on the HICMA Dataset.
Download Portuguese printed OCR datasets to train robust OCR and text recognition models.
Download Malayalam printed OCR datasets to train robust OCR and text recognition models.
This OCR dataset consists of diverse types of images with text in the English language from newspapers, magazines, and books.
Prior research either focuses solely on the binary classification of handwritten text or performs a three-class segmentation of the ...
Jan 17, 2022 · In the field of computer vision, large-scale image classification tasks are both important and highly challenging.
Designed for precision, these datasets include a wide variety of Danish printed text from sources like books, newspapers, invoices, and product labels.
Provide human eyes to your OCR engine: Make your OCR engines see the world as we do with the expansive dataset collected by Flitto DataLab’s global network of collaborators.
Optical Character Recognition (OCR) is the process of recognizing characters automatically from scanned or image documents.
These include multi-column text, tables, and mixed media.
Discover our premium collection of Printed OCR Datasets, specifically designed to enhance the accuracy of printed text recognition.
Sep 5, 2023 · In this article, we fine-tune the TrOCR Small Printed model on the SCUT CTW1500 dataset to improve its performance on curved text.
Nougat, a recent tool ...
This includes traditional OCR methods, deep learning-based approaches, and emerging LLM-driven solutions.
Download Vietnamese handwritten OCR datasets to train accurate text recognition models.
Download Kannada printed OCR datasets to train robust OCR and text recognition models.
Elevate your OCR and text recognition with our extensive poster image datasets.
Unlock the potential of Turkish text recognition with our carefully curated Turkish Printed OCR Datasets.
Model Overview: This model is designed for reading printed text labels and performs exceptionally well on synthetic OCR datasets.
Hindi Natural Scene OCR Image Corpus: This dataset consists of 8 categories and a total of 21218 printed images, covering the most commonly encountered scenarios in daily life.
Abstract: While analyzing scanned documents, handwritten text can overlap with printed text.
Mar 23, 2024 · Abstract: Optical Character Recognition (OCR) is an established task with the objective of identifying the text present in an image.
The model is trained using synthetic datasets and fine-tuned with real ones, achieving high accuracy on synthetic and real datasets.
... 7%, whereas applying the three proposed techniques with YOLO4 achieved 82.
RETAS OCR Evaluation Dataset: The RETAS dataset (used in the paper by Yalniz and Manmatha, ICDAR'11) was created to evaluate the optical character recognition (OCR) accuracy of real scanned books.
Designed for precision, these datasets include a wide variety of Turkish printed text from sources like books, newspapers, invoices, and product labels.
Explore our extensive collection of OCR image datasets, specifically designed for training and fine-tuning robust Optical Character Recognition (OCR) and Text Recognition systems.
May 9, 2020 · We have created an OCR and page layout analysis dataset from one selected newspaper, namely “Ascher Zeitung,” printed in the second half of the nineteenth century.
The data was collected in India, and all the images in the dataset include labeling results.
Text is often printed on chip components, and it is crucial that this text is correctly recognized during AOI, as it contains valuable information.
Aug 19, 2023 · We also provide UrduDoc, a benchmark dataset for Urdu text line detection in scanned documents.
500,000 high-resolution images featuring multilingual Optical Character Recognition (OCR) data across both natural scenes and various document types.
Contribute to esun-ai/traditional-chinese-text-recogn-dataset development by creating an account on GitHub.
Unlock the potential of Filipino text recognition with our carefully curated Filipino Printed OCR Datasets.
Boost your OCR capabilities with high-quality Bengali printed text datasets, crafted to optimize recognition algorithms for accurate and reliable Arabic text processing.
Boost your OCR capabilities with high-quality Assamese printed text datasets, crafted to optimize recognition algorithms for accurate and reliable Arabic text processing.
Download Vietnamese printed OCR datasets to train robust OCR and text recognition models.
Boost your OCR capabilities with high-quality Russian printed text datasets, crafted to optimize recognition algorithms for accurate and reliable Arabic text processing.
..., formulae) or generic printed English text.
Commonly used with optical character recognition (OCR) to translate text into usable data.
Aug 21, 2022 · A summary of the mainstream public datasets in the OCR field, covering detection and recognition, a variety of scenes, and a variety of languages, with related information and download links for each dataset. - willpat1213/OCR-Datasets
The dataset contains real OCR outputs for 160 scanned books (100 English, 20 French, 20 German, 20 Spanish) downloaded from the Internet Archive website.
The data was collected in Saudi Arabia, and all the images in the dataset include labeling results.
Below is a summary of these datasets.
In this study, we developed an automatic OCR system designed to ...
This OCR dataset consists of diverse types of images with text in the Arabic language from different types of products.
Jul 23, 2025 · Optical Character Recognition (OCR) stands as a transformative force, bridging the gap between the physical and digital worlds.
It is the largest historical handwritten digit dataset, introduced to the Optical Character Recognition (OCR) community to help researchers test ...
... Digital Maktaba project 1, we present here a possible pipeline for the creation of a title pages dataset to effectively train an Open Source OCR model such as Kraken [1], through the escriptorium VRE [2], to extract cataloging metadata from Arabic printed frontispieces.
It is an encoder-decoder model, with an image Transformer as the encoder and a text Transformer as the decoder.
While many off-the-shelf OCR models exist, they are often trained for either scientific (e. ...
A new Quranic OCR dataset is developed based on the most famous printed version of the Holy Quran (Mushaf Al-Madinah), and page and line-text images with the corresponding labels are prepared.
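Evaluating OCR output against ground-truth text, the use case RETAS is described as serving above, is commonly done with an edit-distance metric such as character error rate (CER). The sketch below is one illustrative way to compute it. The source does not say which metric the RETAS authors used, so the choice of CER and the function names `edit_distance` and `character_error_rate` are assumptions.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    # prev[j] holds the distance between the reference prefix processed so far
    # and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_ch in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_ch in enumerate(hypothesis, start=1):
            cost = 0 if ref_ch == hyp_ch else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalized by the reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

For example, an OCR output of "printcd text" against the reference "printed text" contains one substitution in twelve characters.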
This dataset addresses the critical need for comprehensive Arabic text recognition resources by providing controlled, diverse, and scalable training data.
We constructed a novel dataset in the Vietnamese language called ViOCRVQA (Vietnamese Optical Character Recognition - Visual Question Answering dataset), which aims at enhancing the ability to solve the OCR-VQA task for Vietnamese.
OCR has many applications, such as data entry [1], vehicle license plate recognition [2], postal address reading ...
Sep 3, 2025 · This repository contains two datasets, SinFUND and SinOCR, aimed at enhancing handwritten and printed Optical Character Recognition (OCR) and form understanding for the Sinhala language.
Download Marathi printed OCR datasets to train robust OCR and text recognition models.
Due to the lack of a standard ...
Nov 14, 2024 · The dataset was collected from Google Image searches across more than 1,700 keywords, including a variety of categories such as legal documents, educational materials, signage, advertisements, and more.
The experimental results on the Arabic Printed Text Image (APTI) dataset showed that the Word Recognition Rate (WRR) of customized YOLO4 Arabic OCR is 66.
Abstract: Arabic Optical Character Recognition (OCR) is essential for converting vast amounts of Arabic print media into digital formats.
Existing Arabic OCR datasets often focus on isolated words or lines or are ...
Aug 29, 2023 · TrOCR Fine-Tuned Models: After the pretraining stage, the models were fine-tuned on the IAM Handwritten text images and the SROIE printed receipts dataset.
This dataset is made from the Miras text dataset, and the images are generated with different fonts, font styles, font sizes, and backgrounds.
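The Word Recognition Rate (WRR) reported for the APTI experiments above can be illustrated with a short sketch. It assumes the common definition, the fraction of word images whose predicted transcription exactly matches the ground truth; the cited study's exact definition is not given in the source, and the function name `word_recognition_rate` is hypothetical.

```python
def word_recognition_rate(references, predictions):
    """Fraction of words recognized exactly (assumed exact-match definition)."""
    if len(references) != len(predictions):
        raise ValueError("references and predictions must have the same length")
    if not references:
        raise ValueError("at least one word is required")
    correct = sum(1 for ref, pred in zip(references, predictions) if ref == pred)
    return correct / len(references)
```

With four ground-truth words and one misrecognized prediction, this returns 0.75. A single wrong character makes the whole word count as an error, which is why WRR is usually lower than character-level accuracy on the same output.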