How to identify the actual document type when the filename extension is not correct
Product Version | Product | Author |
---|---|---|
2022.1.217 | WordsProcessing | Martin Velikov |
Description
This article describes how to identify the actual type of a document when its filename extension is incorrect, so that the appropriate format provider can be selected.
Solution
The following example reads two documents that both have a ".doc" filename extension but are actually different document types. The GetHeaderInfo method uses the StringBuilder class to build a hexadecimal string from the first bytes of each document (its signature, or header), which is then compared with predefined values.
Once the actual document type is known, we can determine which format provider to use to import the document.
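To make the signature values concrete, the following minimal sketch shows how the leading bytes of an RTF document map to the hexadecimal string used in the comparison. The class name SignatureDemo, the ToHex helper, and the sample RTF content are illustrative assumptions and are not part of the original example.

```csharp
using System;
using System.Text;

public static class SignatureDemo
{
    // Converts up to 'count' leading bytes of a buffer into an uppercase hexadecimal string.
    public static string ToHex(byte[] data, int count)
    {
        int length = Math.Min(count, data.Length);
        StringBuilder sb = new StringBuilder(length * 2);
        for (int i = 0; i < length; i++)
        {
            sb.Append(data[i].ToString("X2"));
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // Hypothetical minimal RTF content; real files are much longer.
        byte[] rtfBytes = Encoding.ASCII.GetBytes(@"{\rtf1\ansi Hello}");

        // '{' = 7B, '\' = 5C, 'r' = 72, 't' = 74, 'f' = 66, '1' = 31, '\' = 5C, 'a' = 61
        Console.WriteLine(ToHex(rtfBytes, 8)); // prints 7B5C727466315C61
    }
}
```

The first six bytes spell "{\rtf1", which is why the RTF check compares against the prefix 7B5C72746631 rather than the full eight-byte header.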
Example
// Requires System.Collections.Generic, System.IO, and the Telerik namespaces
// Telerik.Windows.Documents.Flow.Model,
// Telerik.Windows.Documents.Flow.FormatProviders.Rtf, and
// Telerik.Windows.Documents.Flow.FormatProviders.Doc.
List<byte[]> documents = new List<byte[]>();
documents.Add(File.ReadAllBytes("rtf.doc"));
documents.Add(File.ReadAllBytes("doc.doc"));

foreach (byte[] document in documents)
{
    // Hexadecimal representation of the first 8 bytes of the file.
    string headerCode = GetHeaderInfo(document).ToUpper();

    // The signatures are taken from: https://www.filesignatures.net/index.php?page=search
    if (headerCode.StartsWith("7B5C72746631"))
    {
        // The document is RTF: the signature corresponds to the text "{\rtf1".
        RtfFormatProvider rtfFormatProvider = new RtfFormatProvider();
        RadFlowDocument rtfDocument = rtfFormatProvider.Import(new MemoryStream(document));
    }
    else if (headerCode.StartsWith("D0CF11E0A1B11AE1"))
    {
        // The document is DOC (binary Word format).
        DocFormatProvider docFormatProvider = new DocFormatProvider();
        RadFlowDocument docDocument = docFormatProvider.Import(document);
    }
}
Getting the document header
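// Requires System.Linq (for Take) and System.Text (for StringBuilder).
private static string GetHeaderInfo(byte[] documentData)
{
    // Only the first 8 bytes are needed to match the signatures used above.
    byte[] buffer = documentData.Take(8).ToArray();

    StringBuilder sb = new StringBuilder();
    foreach (byte b in buffer)
    {
        // "X2" formats each byte as two uppercase hexadecimal digits.
        sb.Append(b.ToString("X2"));
    }

    return sb.ToString();
}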
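The same check can also be written without building a string, by comparing the leading bytes directly against a small set of known signatures. The sketch below is one possible variation rather than the article's own method: the DetectedFormat enum and the FormatSniffer class are hypothetical names, and the ZIP signature (50 4B 03 04) is added only to illustrate how further formats could be recognized. Note that a signature identifies the container rather than the exact document type: D0CF11E0A1B11AE1 is shared by other legacy Office files such as XLS and PPT, and the ZIP signature is shared by DOCX, XLSX, and ordinary archives.

```csharp
using System;

// Hypothetical result type used only in this sketch.
public enum DetectedFormat
{
    Unknown,
    Rtf,
    OleCompound, // DOC and other legacy Office binary formats
    Zip          // DOCX and other ZIP-based formats
}

public static class FormatSniffer
{
    private static readonly byte[] RtfSignature = { 0x7B, 0x5C, 0x72, 0x74, 0x66, 0x31 };
    private static readonly byte[] OleSignature = { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };
    private static readonly byte[] ZipSignature = { 0x50, 0x4B, 0x03, 0x04 };

    // Returns the container format suggested by the leading bytes of the data.
    public static DetectedFormat Detect(byte[] data)
    {
        if (data == null)
        {
            throw new ArgumentNullException(nameof(data));
        }

        if (StartsWith(data, RtfSignature)) return DetectedFormat.Rtf;
        if (StartsWith(data, OleSignature)) return DetectedFormat.OleCompound;
        if (StartsWith(data, ZipSignature)) return DetectedFormat.Zip;

        return DetectedFormat.Unknown;
    }

    // True when 'data' begins with every byte of 'signature', in order.
    private static bool StartsWith(byte[] data, byte[] signature)
    {
        if (data.Length < signature.Length)
        {
            return false;
        }

        for (int i = 0; i < signature.Length; i++)
        {
            if (data[i] != signature[i])
            {
                return false;
            }
        }

        return true;
    }
}
```

With a helper like this, the calling code could switch on the returned value to pick a format provider, and treat Unknown as a signal to fall back to the filename extension or to report an unsupported file.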