Prerequisites

Optical Character Recognition (OCR) is the electronic or mechanical conversion of images of typed, handwritten, or printed text, such as a scanned document, into machine-encoded text.

This topic describes what the PdfProcessing library requires before you can start using the OcrFormatProvider.

The default Tesseract implementation is currently Windows-only. On other platforms, you can still use the OCR feature with a custom implementation.

For best results, use images with a resolution of 300 DPI.

Required Assemblies / NuGet Packages

To use the OcrFormatProvider, add references to the following assemblies:

.NET Framework	.NET Standard-compatible
Telerik.Windows.Documents.Core	Telerik.Documents.Core
Telerik.Windows.Documents.Fixed	Telerik.Documents.Fixed
Telerik.Windows.Documents.Fixed.FormatProviders.Ocr	Telerik.Documents.Fixed.FormatProviders.Ocr
Telerik.Windows.Zip	Telerik.Zip

The following reference should always be added as a NuGet package, because the package adds the required Tesseract references and files automatically. Otherwise, manual setup might be required.
Telerik.Windows.Documents.Tesseract.Ocr	Telerik.Documents.Tesseract.Ocr

To export images in formats other than Jpeg and Jpeg2000, or with an ImageQuality other than High, add references to the following assemblies:
- Telerik.Documents.ImageUtils (This assembly is not available in UI for Xamarin.)
- SkiaSharp (Telerik.Documents.ImageUtils depends on SkiaSharp.)

Ensure that all Tesseract dependencies are properly set up.

Language Data Setup

Create a "tessdata" folder and populate it with the ".traineddata" files for the languages you need. You can download the language data files from the official Tesseract GitHub repository. Results may vary depending on the language version:

Tesseract Languages Version
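
As a rough sketch of this step, the Python helper below builds the download URL for a language file and saves it into a "tessdata" folder. It assumes the files come from the "tesseract-ocr/tessdata" repository on GitHub and its "main" branch. The function names are illustrative and not part of PdfProcessing, so adjust the repository to the language version you need.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Assumption: language files are taken from the tesseract-ocr/tessdata
# repository on GitHub ("main" branch). Swap the repository name for
# "tessdata_best" or "tessdata_fast" if you need another model variant.
BASE_URL = "https://github.com/tesseract-ocr/{repo}/raw/main/{code}.traineddata"


def traineddata_url(code, repo="tessdata"):
    """Build the download URL for one language code, e.g. "eng"."""
    return BASE_URL.format(repo=repo, code=code)


def download_language(code, parent_folder, repo="tessdata"):
    """Download <code>.traineddata into <parent_folder>/tessdata."""
    tessdata = Path(parent_folder) / "tessdata"
    tessdata.mkdir(parents=True, exist_ok=True)
    target = tessdata / f"{code}.traineddata"
    urlretrieve(traineddata_url(code, repo), target)
    return target


# Example (requires network access):
#   download_language("eng", ".")   # writes ./tessdata/eng.traineddata
```

Keeping the URL builder separate from the download makes it easy to print and review the links before fetching anything.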

You can place the "tessdata" folder anywhere you like. Set the DataPath property of the TesseractOcrProvider to the parent folder that contains "tessdata" so that the provider can locate and use it.

"tessdata" Structure:

tessdata
├── deu.traineddata
├── eng.traineddata
└── spa.traineddata
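
Because DataPath must point to the parent of "tessdata" rather than to "tessdata" itself, a quick check before running OCR can catch a wrong path early. The following is a minimal sketch in Python that assumes the layout shown above. The function name "missing_languages" is hypothetical and not part of the library.

```python
from pathlib import Path


def missing_languages(data_path, language_codes):
    """Return the language codes that have no <code>.traineddata file
    under <data_path>/tessdata.

    data_path is the PARENT folder of "tessdata", i.e. the same value
    you would assign to the provider's DataPath property.
    """
    tessdata = Path(data_path) / "tessdata"
    return [
        code
        for code in language_codes
        if not (tessdata / f"{code}.traineddata").is_file()
    ]


# Example:
#   missing_languages(".", ["eng", "spa"])
# returns an empty list when ./tessdata holds both files.
```

If the function returns every requested code, the most likely cause is that the path points at "tessdata" itself or at an unrelated folder.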


Manually set up the Tesseract native assemblies

Ensure that the following already exist in the root directory of your project:

- The "Tesseract.dll" assembly.
- The Tesseract native assemblies (x86, x64):

Tesseract Native Assemblies Structure

If these requirements are not met, go through the following steps:

Extract the "Tesseract.dll" assembly from the Telerik.Windows.Documents.TesseractOcr NuGet package and add it to your project.
Download the "tesseract50.dll" and "leptonica-1.82.0.dll" native assemblies from the following links:
- https://github.com/charlesw/tesseract/tree/master/src/Tesseract/x64
- https://github.com/charlesw/tesseract/tree/master/src/Tesseract/x86

Create the following structure and add the two folders to the root of the application.

Folder Structure:

RootFolder
├── x64
│   ├── tesseract50.dll
│   └── leptonica-1.82.0.dll
└── x86
    ├── tesseract50.dll
    └── leptonica-1.82.0.dll
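
The layout above can be verified with a short script. This is a minimal sketch in Python that uses only the folder and file names listed in this topic. The function name "missing_native_files" is hypothetical, and the check only confirms that the files exist, not that they are the correct builds.

```python
from pathlib import Path

# Folder and file names as listed in the structure above.
ARCH_FOLDERS = ("x64", "x86")
NATIVE_FILES = ("tesseract50.dll", "leptonica-1.82.0.dll")


def missing_native_files(root_folder):
    """Return the relative paths (e.g. "x64/tesseract50.dll") that are
    expected under root_folder but do not exist."""
    root = Path(root_folder)
    return [
        f"{arch}/{name}"
        for arch in ARCH_FOLDERS
        for name in NATIVE_FILES
        if not (root / arch / name).is_file()
    ]


# Example:
#   missing_native_files(".")
# returns an empty list when both folders are complete.
```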
