Prerequisites
Optical Character Recognition (OCR) is the electronic or mechanical conversion of images of typed, handwritten, or printed text, such as a scanned document, into machine-encoded text.
This topic describes the requirements that must be met before the PdfProcessing library can use the OcrFormatProvider.
The default Tesseract implementation is currently Windows-only. You can still use the OCR feature with a custom implementation.
For best results, use images with a resolution of 300 DPI.
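As a rough illustration of what 300 DPI means in practice, the Python sketch below converts a physical page size into the pixel dimensions a scan would need, and estimates the effective DPI of an existing image. The function names and the US Letter example are illustrative assumptions and are not part of the PdfProcessing API.

```python
def pixels_for_dpi(width_in, height_in, dpi=300):
    """Return (width, height) in pixels for a page of the given size in inches."""
    return round(width_in * dpi), round(height_in * dpi)


def effective_dpi(pixel_width, width_in):
    """Estimate the DPI of an image from its pixel width and physical width."""
    return pixel_width / width_in


# US Letter is 8.5 x 11 inches.
print(pixels_for_dpi(8.5, 11))   # (2550, 3300)
print(effective_dpi(1275, 8.5))  # 150.0, i.e. half the recommended resolution
```

An image that falls well below the recommended value is usually better rescanned than upscaled, since upscaling does not add detail.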
Required Assemblies / NuGet Packages
To use the OcrFormatProvider, add references to the following assemblies:
.NET Framework | .NET Standard-compatible |
---|---|
Telerik.Windows.Documents.Core | Telerik.Documents.Core |
Telerik.Windows.Documents.Fixed | Telerik.Documents.Fixed |
Telerik.Windows.Documents.Fixed.FormatProviders.Ocr | Telerik.Documents.Fixed.FormatProviders.Ocr |
Telerik.Windows.Zip | Telerik.Zip |
Telerik.Windows.Documents.Tesseract.Ocr | Telerik.Documents.Tesseract.Ocr |

It is recommended to always add the Tesseract reference (the last row above) as a NuGet package, because the package adds the required Tesseract references and files automatically. Otherwise, manual intervention might be required.

To export images in formats other than Jpeg and Jpeg2000, or with an ImageQuality other than High, add references to the following assemblies:

.NET Framework | .NET Standard-compatible |
---|---|
- | Telerik.Documents.ImageUtils (not available in UI for Xamarin) |
- | SkiaSharp (required by Telerik.Documents.ImageUtils) |
Ensure that all Tesseract dependencies are properly set up.
Language Data Setup
Create a "tessdata" folder and populate it with the data files for the desired languages. You can download the language data files from the official Tesseract GitHub repository. Results may vary depending on the version of the language data.
You can place the "tessdata" folder wherever you like. Set the DataPath property of the TesseractOcrProvider to the parent folder that contains "tessdata" so that the provider can locate and use it.
"tessdata" Structure:
tessdata
├── deu.traineddata
├── eng.traineddata
└── spa.traineddata
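Before running OCR, it can help to confirm that the language files are where the provider will look for them. The following Python sketch is a standalone helper, not part of the Telerik API: the function name is hypothetical, and it assumes that `data_path` is the parent folder containing "tessdata", as described for the DataPath property above.

```python
from pathlib import Path


def missing_language_files(data_path, language_codes):
    """Return the language codes whose .traineddata file is absent.

    data_path is assumed to be the parent folder that contains "tessdata".
    """
    tessdata = Path(data_path) / "tessdata"
    return [
        code
        for code in language_codes
        if not (tessdata / f"{code}.traineddata").is_file()
    ]


if __name__ == "__main__":
    # Hypothetical location and languages; adjust them to your own setup.
    missing = missing_language_files(".", ["eng", "spa"])
    print("Missing language data:", missing or "none")
```

An empty result means every requested language has a matching file. The check only looks at file names, so it cannot tell whether a file is the right version or intact.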
Manually Set Up the Tesseract Native Assemblies
Ensure that the following already exist in the root directory of your project:
- The "Tesseract.dll" assembly.
- The Tesseract native assemblies (x86, x64).
If these requirements are not met, go through the following steps:
- Extract the "Tesseract.dll" assembly from the Telerik.Windows.Documents.TesseractOcr NuGet package and add it to your project.
- Download the "tesseract50.dll" and "leptonica-1.82.0.dll" native assemblies from the listed links:
- Create the following structure and add the two folders to the root of the application.
- Folder Structure:
RootFolder
├── x64
│   ├── tesseract50.dll
│   └── leptonica-1.82.0.dll
└── x86
    ├── tesseract50.dll
    └── leptonica-1.82.0.dll
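The folder structure above can be checked with a short script. This Python sketch is a standalone helper with a hypothetical function name; it only assumes the two file names and the x64/x86 folders listed in this section, and it does not check for "Tesseract.dll" itself.

```python
from pathlib import Path

# File and folder names taken from the structure described above.
NATIVE_FILES = ("tesseract50.dll", "leptonica-1.82.0.dll")
ARCH_FOLDERS = ("x64", "x86")


def missing_native_files(root):
    """Return relative paths of the expected native files that are absent under root."""
    root = Path(root)
    return [
        f"{arch}/{name}"
        for arch in ARCH_FOLDERS
        for name in NATIVE_FILES
        if not (root / arch / name).is_file()
    ]


if __name__ == "__main__":
    # "." stands in for the application root; replace it with your own path.
    missing = missing_native_files(".")
    print("Missing native files:", missing or "none")
```

Running it from the application root lists any file that still needs to be copied into place.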