Extracting Text from PDF Documents
Environment
| Version | Product | Author |
| --- | --- | --- |
| Q1 2025 | RadPdfProcessing | Desislava Yordanova |
Description
Learn how to extract the text content from a PDF document.
Solution
Follow these steps:
1. Import the PDF document using the PdfFormatProvider. The result is a RadFixedDocument.
2. Export the RadFixedDocument's content to text using the TextFormatProvider. Any text fragments that the PDF document contains are written to the plain-text result.
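using System.Diagnostics;
using System.IO;
using Telerik.Windows.Documents.Fixed.FormatProviders.Pdf;
using Telerik.Windows.Documents.Fixed.Model;

string filePath = "input.pdf";

// Step 1: import the PDF file into a RadFixedDocument.
PdfFormatProvider pdf_provider = new PdfFormatProvider();
RadFixedDocument fixed_document;
using (Stream stream = File.OpenRead(filePath))
{
    fixed_document = pdf_provider.Import(stream);
}

// Step 2: export the document's text fragments as a plain-text string.
Telerik.Windows.Documents.Fixed.FormatProviders.Text.TextFormatProvider provider = new Telerik.Windows.Documents.Fixed.FormatProviders.Text.TextFormatProvider();
string documentContent = provider.Export(fixed_document);
Debug.WriteLine(documentContent);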
However, depending on the document's internal content, the TextFormatProvider may not cover every case. A common scenario is a document made of scanned images that contain text. In this case, the above approach does not produce plain text, because what looks like text is not stored as text fragments but as image content or Path elements. This is where the OcrFormatProvider helps: it converts images of typed, handwritten, or printed text from a scanned document into machine-encoded text.
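
One way to decide whether the OCR route is worth trying is to check if the plain-text export produced anything meaningful. The following is a minimal sketch of such a check, not part of the RadPdfProcessing API: the class name `TextExtractionCheck`, the method `LooksLikeScannedDocument`, and the `minMeaningfulChars` threshold are hypothetical and only illustrate the idea. It assumes that a scanned document yields an empty or near-empty string from the TextFormatProvider.

```csharp
using System;
using System.Linq;

// Hypothetical helper (not part of RadPdfProcessing): guesses whether the
// text exported by the TextFormatProvider is too sparse to be real content.
public static class TextExtractionCheck
{
    // minMeaningfulChars is an assumed threshold; tune it for your documents.
    public static bool LooksLikeScannedDocument(string exportedText, int minMeaningfulChars = 20)
    {
        // No text at all: the pages most likely hold only images or paths.
        if (string.IsNullOrWhiteSpace(exportedText))
        {
            return true;
        }

        // Count only letters and digits so that whitespace and
        // separator characters do not count as extracted text.
        int meaningful = exportedText.Count(c => char.IsLetterOrDigit(c));
        return meaningful < minMeaningfulChars;
    }
}
```

With the earlier snippet, calling `TextExtractionCheck.LooksLikeScannedDocument(documentContent)` after the export would indicate whether to fall back to the OcrFormatProvider. The check is only a heuristic: a document that mixes real text with scanned pages can pass it while still missing content.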