Moer and Éric Moreau's web site


Detecting the Byte Order Mark of a file from a .Net application

Published date: Saturday, April 9, 2011

On: Moer and Éric Moreau's web site

Original URL: http://www.emoreau.com/Entries/Articles/2011/04/Detecting-the-Byte-Order-Mark-of-a-file-from-a-Net-application.aspx

Have you ever received a text file from a provider and had to test many encoding settings because it wasn’t really clear what format they were using?

It happened to me many times.

Did you know that a file can start with a couple of bytes that identify its encoding? This article provides code to quickly test the encoding of a text file.

This month’s demo code

This month’s demo code is provided in both VB and C#. It was created using Visual Studio 2008 but should work in all versions (yes, even 2002).

What is encoding?

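In short, an encoding defines how characters are stored as bytes, which is what makes it possible to support multiple character sets. Amongst the most popular encodings in .Net, we find UTF-7, UTF-8, UTF-16 (which .Net calls Unicode) and UTF-32.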

These days, Unicode is the most widely used character standard. You can find all the specifications at http://www.unicode.org/standard/standard.html.

Wikipedia has a great explanation of the byte order marks used by the different text file encodings at http://en.wikipedia.org/wiki/Byte-order_mark.

Let’s just say here that sometimes, what you expect to be a plain old text file is not. If you have to process it, the only difference for you will be to specify the encoding format when you open it. Once opened, there is nothing special to do.
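As a minimal sketch of what "specifying the encoding when you open it" looks like, the following reads a file through a StreamReader created with an explicit Encoding. The module name, the ReadAllWith helper and the file name are hypothetical and only serve the illustration.

```vbnet
Imports System
Imports System.IO
Imports System.Text

Module ReadWithEncodingSketch
    ' Hypothetical helper: reads a whole text file using the encoding supplied by the caller.
    Function ReadAllWith(ByVal path As String, ByVal enc As Encoding) As String
        Using reader As New StreamReader(path, enc)
            Return reader.ReadToEnd()
        End Using
    End Function

    Sub Main()
        Dim path As String = "sample-utf16.txt" ' hypothetical file name
        ' Write a small UTF-16 file, then read it back with the same encoding
        File.WriteAllText(path, "sample text", Encoding.Unicode)
        Console.WriteLine(ReadAllWith(path, Encoding.Unicode))
    End Sub
End Module
```

Once the reader has been created with the right encoding, the rest of the processing is the same as for any other text file.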

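In the Windows world, there is no single default. Notepad historically saved as ANSI (the system code page) unless you picked another encoding, and only switched its default to UTF-8 without a byte order mark in Windows 10 version 1903. In .Net, StreamWriter uses UTF-8 without a byte order mark when you do not specify an encoding.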

What is the trick?

There is no magic here. We simply have to open the file and read the first few bytes. Each Unicode encoding has its own signature, the byte order mark (BOM). If we detect this signature, we can safely (most of the time) assume that the file uses this encoding. A file without a signature cannot be identified this way.
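To see these signatures without opening a hex editor, one option is to ask each Encoding object for its preamble through GetPreamble. This is only a sketch: the module name and the PreambleHex helper are made up for the illustration.

```vbnet
Imports System
Imports System.Text

Module PreambleSketch
    ' Hypothetical helper: returns the signature of an encoding as hex pairs
    ' such as "EF-BB-BF", or an empty string when the encoding has no signature.
    Function PreambleHex(ByVal enc As Encoding) As String
        Return BitConverter.ToString(enc.GetPreamble())
    End Function

    Sub Main()
        Dim encodings As Encoding() = {Encoding.UTF8, Encoding.Unicode, Encoding.BigEndianUnicode, _
                                       Encoding.UTF32, New UTF32Encoding(True, True), Encoding.ASCII}
        For Each enc As Encoding In encodings
            Console.WriteLine("{0,-28} {1}", enc.EncodingName, PreambleHex(enc))
        Next
    End Sub
End Module
```

The UTF-8 signature is EF-BB-BF, UTF-16 uses FF-FE (little endian) or FE-FF (big endian), UTF-32 uses FF-FE-00-00 or 00-00-FE-FF, and ASCII has none.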

Creating test files

Because you surely don’t have a file of each encoding available, we will create a small set of files only for testing purposes.

The simplest way I found to quickly create the set of test files is this snippet of code, which creates 7 files, one for each of the 7 encodings tested in this article:

'Create dummy test files with each encoding
'(requires Imports System.IO and Imports System.Text at the top of the file)
ListBox1.Items.Clear()
ListBox1.Items.Add("Creation of test files...")
Dim objEncoder As Encoding = Nothing
For i As Integer = 1 To 7
    Select Case i
        Case 1 : objEncoder = New System.Text.ASCIIEncoding

        Case 2 : objEncoder = New System.Text.UnicodeEncoding(True, True) 'big endian
        Case 3 : objEncoder = New System.Text.UnicodeEncoding(False, True) 'little endian

        Case 4 : objEncoder = New System.Text.UTF32Encoding(True, True) 'big endian
        Case 5 : objEncoder = New System.Text.UTF32Encoding(False, True) 'little endian

        Case 6 : objEncoder = New System.Text.UTF7Encoding()
        Case 7 : objEncoder = New System.Text.UTF8Encoding(True)
    End Select

    ListBox1.Items.Add(String.Format("   Creating File{0}.txt with {1}", i, objEncoder.EncodingName))

    'open the file with the chosen encoding; Using closes it even if the write fails
    Using tw As TextWriter = New StreamWriter(String.Format("File{0}.txt", i), False, objEncoder)
        'write data here
        tw.Write("this is a sample file encoded using " + objEncoder.EncodingName)
    End Using
Next

When you run this code, you will see the listbox control being filled as shown in figure 1. Seven files (File1.txt to File7.txt) will be created in your bin/debug folder.

Figure 1: Test files just created

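The UTF7Encoding class of .Net does not add any bytes at the beginning of the file: its GetPreamble method returns an empty array, so the StreamWriter has no signature to write. The same is true of ASCIIEncoding. That means that the UTF-7 file will be detected as a plain old ASCII file.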

Detecting file’s encoding

Figure 2 shows the application after the files’ encoding has been detected.

Figure 2: Encoding being reported

Now that we have a set of files, we can loop through them to detect their encoding.

The loop is as simple as this:

ListBox1.Items.Clear()
For i As Integer = 1 To 7
    Dim strX As String = String.Format("File{0}.txt", i)
    Dim enc As Encoding = GetFileEncoding(strX)
    If enc Is Nothing Then
        ListBox1.Items.Add(String.Format("File >>{0}<< has an unknown encoding", strX))
    Else
        ListBox1.Items.Add(String.Format("File >>{0}<< is encoded using {1}", strX, enc.EncodingName))
    End If
Next

As you can see, there is nothing in that code that is detecting the encoding. This process has been encapsulated into the GetFileEncoding method which reads like this:

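' Requires Imports System.IO and Imports System.Text at the top of the file
Private Function GetFileEncoding(ByVal pFileName As String) As Encoding
    Try
        ' Using closes the file even if an exception is thrown
        Using file As New FileStream(pFileName, FileMode.Open, FileAccess.Read, FileShare.Read)
            Dim bom As Byte() = New Byte(3) {} ' Room for the longest byte-order mark (4 bytes)
            Dim count As Integer = file.Read(bom, 0, 4) ' A very short file returns fewer than 4 bytes

            ' Test the 4-byte signatures first: the UTF-32 little endian mark starts like the UTF-16 one
            If count >= 4 AndAlso bom(0) = &HFF AndAlso bom(1) = &HFE AndAlso bom(2) = 0 AndAlso bom(3) = 0 Then
                Return Encoding.UTF32 ' UTF-32 little endian
            ElseIf count >= 4 AndAlso bom(0) = 0 AndAlso bom(1) = 0 AndAlso bom(2) = &HFE AndAlso bom(3) = &HFF Then
                Return New UTF32Encoding(True, True) ' UTF-32 big endian (Encoding.UTF32 is little endian)
            ElseIf count >= 3 AndAlso bom(0) = &HEF AndAlso bom(1) = &HBB AndAlso bom(2) = &HBF Then
                Return Encoding.UTF8
            ElseIf count >= 3 AndAlso bom(0) = &H2B AndAlso bom(1) = &H2F AndAlso bom(2) = &H76 Then
                Return Encoding.UTF7
            ElseIf count >= 2 AndAlso bom(0) = &HFF AndAlso bom(1) = &HFE Then
                Return Encoding.Unicode ' UTF-16 little endian
            ElseIf count >= 2 AndAlso bom(0) = &HFE AndAlso bom(1) = &HFF Then
                Return Encoding.BigEndianUnicode ' UTF-16 big endian
            End If

            Return Encoding.ASCII ' No signature found: ASCII, or possibly UTF-8 without a BOM
        End Using
    Catch ex As Exception
        Return Nothing ' The file could not be opened or read
    End Try
End Function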

This code simply opens the file into a FileStream and reads the first 4 bytes into a byte array. These bytes are compared against the known signatures to determine the encoding. The UTF-32 little endian test must come before the UTF-16 little endian test because both signatures start with FF FE. This method can easily be reused in your projects.
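As an alternative to comparing the bytes by hand, StreamReader can do a similar detection itself when its detectEncodingFromByteOrderMarks argument is True. The sketch below is not the method used in this article's demo. The module name, the DetectWithReader helper and the file name are hypothetical, and the set of signatures that StreamReader recognizes may depend on the version of the framework.

```vbnet
Imports System
Imports System.IO
Imports System.Text

Module DetectWithStreamReaderSketch
    ' Hypothetical helper: lets StreamReader look for a BOM and
    ' returns the fallback encoding when no BOM is found.
    Function DetectWithReader(ByVal path As String, ByVal fallback As Encoding) As Encoding
        Using reader As New StreamReader(path, fallback, True)
            reader.Read() ' CurrentEncoding is only updated after the first read
            Return reader.CurrentEncoding
        End Using
    End Function

    Sub Main()
        Dim path As String = "detect-sample.txt" ' hypothetical file name
        File.WriteAllText(path, "sample text", Encoding.BigEndianUnicode)
        Console.WriteLine(DetectWithReader(path, Encoding.ASCII).EncodingName)
    End Sub
End Module
```

The hand-written GetFileEncoding keeps the detection rules visible and also covers the UTF-7 signature, while the StreamReader approach needs less code when you are going to read the file anyway.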

Conclusion

Using this simple method, you don’t have to guess the encoding of a file anymore, at least when it starts with a byte order mark.
