Available for: UI for ASP.NET MVC | UI for ASP.NET AJAX | UI for Blazor | UI for WPF | UI for WinForms | UI for Silverlight | UI for Xamarin | UI for WinUI | UI for ASP.NET Core | UI for .NET MAUI

Prerequisites

Optical Character Recognition (OCR) is the electronic or mechanical conversion of images of typed, handwritten, or printed text, such as a scanned document, into machine-encoded text.

This topic describes what the PdfProcessing library requires before you can start using the OcrFormatProvider.

The default Tesseract implementation is currently Windows-only. On other platforms, you can still use the OCR feature by providing a custom implementation.

For best results, use images with a resolution of 300 DPI.
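
To make the 300 DPI recommendation concrete, the following minimal Python sketch works through the arithmetic. It is not part of the PdfProcessing API; the function names `pixels_for_dpi` and `effective_dpi` are hypothetical helpers, and the sketch assumes you know the physical size of the scanned page in millimeters.

```python
MM_PER_INCH = 25.4


def pixels_for_dpi(width_mm: float, height_mm: float, dpi: int = 300) -> tuple:
    """Pixel dimensions a page of the given physical size needs at `dpi`."""
    return (
        round(width_mm / MM_PER_INCH * dpi),
        round(height_mm / MM_PER_INCH * dpi),
    )


def effective_dpi(pixel_width: int, width_mm: float) -> float:
    """Resolution of an existing scan, from its pixel width and physical width."""
    return pixel_width / (width_mm / MM_PER_INCH)


if __name__ == "__main__":
    # An A4 page (210 mm x 297 mm) scanned at 300 DPI.
    print(pixels_for_dpi(210, 297))        # (2480, 3508)
    # An A4 scan that is only 1240 pixels wide is roughly 150 DPI.
    print(round(effective_dpi(1240, 210)))  # 150
```

A scan that falls well below the target can be rescanned at a higher resolution; resampling an existing image enlarges it but does not recover detail that was never captured.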

Required Assemblies / NuGet Packages

To use the OcrFormatProvider, add references to the following assemblies:

.NET Framework                                         .NET Standard-compatible
Telerik.Windows.Documents.Core                         Telerik.Documents.Core
Telerik.Windows.Documents.Fixed                        Telerik.Documents.Fixed
Telerik.Windows.Documents.Fixed.FormatProviders.Ocr    Telerik.Documents.Fixed.FormatProviders.Ocr
Telerik.Windows.Zip                                    Telerik.Zip
Telerik.Windows.Documents.Tesseract.Ocr                Telerik.Documents.Tesseract.Ocr

It is recommended to always add the Tesseract.Ocr reference (the last row) as a NuGet package, because the package adds the required Tesseract references and files automatically. Otherwise, manual intervention might be required.

To export images in formats other than Jpeg and Jpeg2000, or with an ImageQuality other than High, you also need to add references to the following assemblies:
- Telerik.Documents.ImageUtils (this assembly is not available in UI for Xamarin)
- SkiaSharp (required because Telerik.Documents.ImageUtils depends on it)

Ensure that all Tesseract dependencies are properly set up.

Language Data Setup

Create a "tessdata" folder and populate it with the language data files for the desired languages. You can download the language data files from the official Tesseract GitHub repository. Results may vary depending on the version of the language data.

You can place the "tessdata" folder wherever you like. Set the DataPath property of the TesseractOcrProvider to the parent folder that contains "tessdata" so that the provider can locate and use it.

"tessdata" Structure:

tessdata
├── deu.traineddata
├── eng.traineddata
└── spa.traineddata
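
Because DataPath refers to the parent of "tessdata" rather than to the folder itself, a wrong path is an easy mistake to make. The following Python sketch is a standalone pre-flight check, not part of the Telerik API; `missing_languages` is a hypothetical helper name, and the sketch assumes each language is stored as `<code>.traineddata` directly inside "tessdata", as in the structure above.

```python
from pathlib import Path


def missing_languages(data_path, language_codes):
    """Return the language codes that have no .traineddata file.

    `data_path` is the parent folder that contains "tessdata",
    mirroring what the DataPath property is described as pointing to.
    """
    tessdata = Path(data_path) / "tessdata"
    return [
        code
        for code in language_codes
        if not (tessdata / f"{code}.traineddata").is_file()
    ]


if __name__ == "__main__":
    # Check the current folder for English and Spanish language data.
    missing = missing_languages(".", ["eng", "spa"])
    if missing:
        print("Missing language data for:", ", ".join(missing))
    else:
        print("All requested language data files were found.")
```

Running a check like this before starting the application separates a misplaced folder from a genuinely missing language file.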

Manually set up the Tesseract native assemblies

Ensure that the following already exist in the root directory of your project:

  • The "Tesseract.dll" assembly.
  • The Tesseract native assemblies (x86 and x64).

If these requirements are not met, go through the following steps:

  1. Extract the "Tesseract.dll" assembly from the Telerik.Windows.Documents.TesseractOcr NuGet package and add it to your project.
  2. Download the "tesseract50.dll" and "leptonica-1.82.0.dll" native assemblies from the listed links:
  3. Create the following structure and add the two folders to the root of the application.
    • Folder Structure:
    RootFolder
    ├── x64
    │   ├── tesseract50.dll
    │   └── leptonica-1.82.0.dll
    └── x86
        ├── tesseract50.dll
        └── leptonica-1.82.0.dll
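
The layout described above can be verified with a short script before you run the application. This Python sketch is a standalone check under stated assumptions, not part of the Telerik API: `missing_native_files` is a hypothetical helper name, and the expected file names are taken from this section ("Tesseract.dll" in the root, plus "tesseract50.dll" and "leptonica-1.82.0.dll" in both the x64 and x86 folders).

```python
from pathlib import Path

# File and folder names as listed in the steps above.
NATIVE_FILES = ("tesseract50.dll", "leptonica-1.82.0.dll")
ARCH_FOLDERS = ("x64", "x86")


def missing_native_files(root):
    """Return the expected files (relative to `root`) that do not exist."""
    root = Path(root)
    expected = [root / "Tesseract.dll"]
    expected += [root / arch / name for arch in ARCH_FOLDERS for name in NATIVE_FILES]
    return [p.relative_to(root).as_posix() for p in expected if not p.is_file()]


if __name__ == "__main__":
    # Check the current folder, assumed here to be the application root.
    for path in missing_native_files("."):
        print("Missing:", path)
```

An empty result means every file named in this section is in place; it does not confirm that the assemblies are the correct versions or match the process architecture.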