
How to identify the actual document type when the filename extension is not correct

Product Version | Product | Author
2022.1.217 | WordsProcessing | Martin Velikov

Description

This article describes how to identify the actual document type when the filename extension is incorrect, so that the appropriate format provider can be selected for the import.

Solution

The following example reads two documents that both have a ".doc" filename extension but are actually of different document types. Using the [StringBuilder](https://docs.microsoft.com/en-us/dotnet/api/system.text.stringbuilder?view=net-6.0) class, we build a hexadecimal string from the first bytes of each document (its signature, or header) and compare it with predefined values. Once the actual document type is known, we can determine which format provider to use to import the document.

Example

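// Namespaces used by this example and by the GetHeaderInfo helper.
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using Telerik.Windows.Documents.Flow.FormatProviders.Doc;
using Telerik.Windows.Documents.Flow.FormatProviders.Rtf;
using Telerik.Windows.Documents.Flow.Model;

// Both files carry a ".doc" extension, but only one of them is a binary DOC file.
List<byte[]> documents = new List<byte[]>();
documents.Add(File.ReadAllBytes("rtf.doc"));
documents.Add(File.ReadAllBytes("doc.doc"));

foreach (byte[] document in documents)
{
    // Hexadecimal string built from the first bytes of the file.
    string headerCode = GetHeaderInfo(document).ToUpper();

    // The signatures are taken from: https://www.filesignatures.net/index.php?page=search
    if (headerCode.StartsWith("7B5C72746631"))
    {
        // The document is RTF: the signature is the ASCII text "{\rtf1".
        RtfFormatProvider rtfFormatProvider = new RtfFormatProvider();
        using (MemoryStream stream = new MemoryStream(document))
        {
            RadFlowDocument rtfDocument = rtfFormatProvider.Import(stream);
        }
    }
    else if (headerCode.StartsWith("D0CF11E0A1B11AE1"))
    {
        // The document is DOC.
        DocFormatProvider docFormatProvider = new DocFormatProvider();
        RadFlowDocument docDocument = docFormatProvider.Import(document);
    }
}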

Getting document header

private static string GetHeaderInfo(byte[] documentData) 
{ 
    byte[] buffer = documentData.Take(8).ToArray(); 
 
    StringBuilder sb = new StringBuilder(); 
    foreach (byte b in buffer) 
    { 
        sb.Append(b.ToString("X2")); 
    } 
 
    return sb.ToString(); 
} 
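
The same check can also be done on the raw bytes, without building a hexadecimal string first. The following is a minimal sketch of that idea and is not part of the Telerik API: the SignatureSniffer class, the DetectedFormat enum, and the Detect method are hypothetical names introduced only for illustration. The two signatures are the same values used in the example above.

```csharp
using System;

// Hypothetical result type for this sketch; not a Telerik type.
public enum DetectedFormat { Unknown, Rtf, Doc }

// Hypothetical helper for this sketch; not a Telerik type.
public static class SignatureSniffer
{
    // ASCII "{\rtf1", the same bytes as the hex string 7B5C72746631.
    private static readonly byte[] RtfSignature =
        { 0x7B, 0x5C, 0x72, 0x74, 0x66, 0x31 };

    // Compound file header used by binary DOC files (D0CF11E0A1B11AE1).
    private static readonly byte[] DocSignature =
        { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };

    // Returns the format whose signature matches the start of the data,
    // or Unknown when the data is null, too short, or matches neither.
    public static DetectedFormat Detect(byte[] documentData)
    {
        if (documentData == null) return DetectedFormat.Unknown;
        if (StartsWith(documentData, RtfSignature)) return DetectedFormat.Rtf;
        if (StartsWith(documentData, DocSignature)) return DetectedFormat.Doc;
        return DetectedFormat.Unknown;
    }

    // Byte-by-byte prefix comparison.
    private static bool StartsWith(byte[] data, byte[] signature)
    {
        if (data.Length < signature.Length) return false;
        for (int i = 0; i < signature.Length; i++)
        {
            if (data[i] != signature[i]) return false;
        }
        return true;
    }
}
```

A caller could switch on the returned value to pick RtfFormatProvider or DocFormatProvider, and treat Unknown as a case to report or skip instead of attempting an import. Note that D0CF11E0A1B11AE1 is the general compound file signature shared by other legacy Office formats such as XLS and PPT, so a match narrows the type down but does not by itself prove that the file is a Word document.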