Using OcrFormatProvider

Since Q1 2025, the RadPdfProcessing library supports Optical Character Recognition (OCR). OCR is the electronic or mechanical conversion of images of typed, handwritten, or printed text, such as a scanned document, into machine-encoded text. The library exposes this through the OcrFormatProvider class, which imports an image and returns it as a RadFixedPage. By default, OcrFormatProvider takes a TesseractOcrProvider as a parameter, which is built on the third-party Tesseract library, but you can supply any custom implementation instead.

You can find all the dependencies and required steps for the implementation in the Prerequisites article.

TesseractOcrProvider Public API

Method/Property | Description
TesseractOcrProvider(string dataPath) | Constructor that takes the path to the parent directory that contains the "tessdata" folder.
LanguageCodes | The language codes to use for the Tesseract OCR engine. Each code corresponds to a trained data file for that language. The default value is "eng".
CorrectVerticalPosition | Indicates whether the OCR processor tries to correct the vertical position of the text. Not available in .NET Framework.
DataPath | The path to the parent directory that contains the "tessdata" folder.
ParseLevel | The level of parsing that the OCR processor performs: OcrParseLevel.Line or OcrParseLevel.Word.
GetAllTextFromImage | Extracts all text from an image and returns it as a single string.
GetTextFromImage | Extracts the text from an image and returns the words and their bounding rectangles.

// Requirement for Images in .NET Standard - https://docs.telerik.com/devtools/document-processing/libraries/radpdfprocessing/cross-platform/images
//FixedExtensibilityManager.ImagePropertiesResolver = new ImagePropertiesResolver();

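// The constructor argument is the parent directory of the "tessdata" folder.
TesseractOcrProvider tesseractOcrProvider = new TesseractOcrProvider(".");
tesseractOcrProvider.LanguageCodes = new List<string>() { "eng" };
//tesseractOcrProvider.CorrectVerticalPosition = false; // Available in .NET Standard
// DataPath can also be set after construction; it points to the same kind of directory.
tesseractOcrProvider.DataPath = @"..\..\..\";
tesseractOcrProvider.ParseLevel = OcrParseLevel.Line;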

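string imagePath = @"..\..\..\images\image.png";
// Read the image once and reuse the bytes for both extraction calls.
byte[] imageBytes = File.ReadAllBytes(imagePath);

// Extract the text as a single string, then as text keyed by bounding rectangle.
string imageText = tesseractOcrProvider.GetAllTextFromImage(imageBytes);
Dictionary<Rectangle, string> imageTextAndTextDimensions = tesseractOcrProvider.GetTextFromImage(imageBytes);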

OcrFormatProvider ocrFormatProvider = new OcrFormatProvider(tesseractOcrProvider);

RadFixedDocument document = new RadFixedDocument();

// Import the image as a page; the using block closes the file stream afterwards.
RadFixedPage page;
using (FileStream imageStream = File.OpenRead(imagePath))
{
    page = ocrFormatProvider.Import(imageStream, null);
}
document.Pages.Add(page);

string outputPath = "output.pdf";
PdfFormatProvider pdfFormatProvider = new PdfFormatProvider();
// File.Create truncates an existing file; File.OpenWrite would keep stale trailing bytes.
using (Stream output = File.Create(outputPath))
{
    pdfFormatProvider.Export(document, output, TimeSpan.FromSeconds(10));
}
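
Because the provider depends on the "tessdata" folder and the configured language codes, it can help to check the folder layout before running OCR. The following is a minimal sketch, not part of the Telerik API: the TessdataCheck class and its FindMissingLanguages method are hypothetical names introduced for illustration, and the sketch assumes the standard Tesseract convention that each language code maps to a file named "<code>.traineddata" directly inside "tessdata".

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper for illustration; it is not part of RadPdfProcessing.
public static class TessdataCheck
{
    // Returns the language codes whose "<code>.traineddata" file is missing
    // from the "tessdata" folder located directly under dataPath.
    // Assumption: trained data files follow the standard Tesseract naming.
    public static List<string> FindMissingLanguages(string dataPath, IEnumerable<string> languageCodes)
    {
        string tessdataFolder = Path.Combine(dataPath, "tessdata");
        List<string> missing = new List<string>();

        foreach (string code in languageCodes)
        {
            string trainedDataFile = Path.Combine(tessdataFolder, code + ".traineddata");
            if (!File.Exists(trainedDataFile))
            {
                missing.Add(code);
            }
        }

        return missing;
    }
}
```

Calling TessdataCheck.FindMissingLanguages with the same path and codes that you assign to DataPath and LanguageCodes returns an empty list when every expected file is present, so a non-empty result points to a path or language setup problem before any image is processed.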

You can find a complete example of implementing an OcrFormatProvider in our SDK repository.
