Have you ever received a text file from a provider and had to test many encoding settings because it wasn’t really clear what format they were using?
It has happened to me many times.
Did you know that there are often a few bytes at the beginning of a file (the byte-order mark) that identify its encoding? This article provides code to quickly detect the encoding of a text file.
This month’s demo code
This month’s demo code is provided in both VB and C#. It was created using Visual Studio 2008 but should work in all versions (yes, even 2002).
What is encoding?
In short, an encoding provides a way of supporting multiple character sets. Among the most popular encodings in .NET, we find UTF-7, UTF-8 and UTF-32.
These days, Unicode is the most widely used encoding standard.
You can find all the specifications at http://www.unicode.org/standard/standard.html.
Wikipedia has a great explanation of the different text file encodings and their signatures at http://en.wikipedia.org/wiki/Byte-order_mark.
Let’s just say here that sometimes, what you expect to be a plain old text file is not. If you have to process it, the only difference for you will be to specify the encoding format when you open it.
Once opened, there is nothing special to do.
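In the .NET world, the default is UTF-8: a StreamWriter created without an explicit encoding writes UTF-8 (without a signature), and recent versions of Notepad also save in this format by default.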
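As a minimal sketch of what specifying the encoding looks like, the console snippet below writes a small UTF-16 file and reads it back by passing the same encoding to the StreamReader constructor. The module name and the file name sample-utf16.txt are made up for this illustration.

```vb
Imports System
Imports System.IO
Imports System.Text

Module OpenWithEncodingDemo
    Sub Main()
        ' Hypothetical file name, used only for this illustration
        Dim fileName As String = "sample-utf16.txt"

        ' Write a small UTF-16 (little endian) file so that there is something to read
        File.WriteAllText(fileName, "plain old text", Encoding.Unicode)

        ' The only difference from reading an ASCII file is the second argument
        Using reader As StreamReader = New StreamReader(fileName, Encoding.Unicode)
            Dim content As String = reader.ReadToEnd()
            Console.WriteLine(content) ' prints: plain old text
        End Using
    End Sub
End Module
```

Once the reader is created with the right encoding, ReadToEnd and ReadLine behave exactly as they do for any other text file.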
What is the trick?
There is no magic here. We simply have to open the file and read the first few bytes. Each encoding that writes a byte-order mark has its own unique signature.
If we detect this signature, we can safely (most of the time) assume that we have a file using this encoding.
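To make these signatures concrete, here is a small console sketch that asks each encoding class for the bytes it writes at the start of a file, using Encoding.GetPreamble. The module name, the ShowSignature helper and the labels are hypothetical and only serve the illustration.

```vb
Imports System
Imports System.Text

Module PreambleDemo
    ' Prints the signature bytes that an encoding writes at the start of a file
    Sub ShowSignature(ByVal label As String, ByVal enc As Encoding)
        Dim preamble As Byte() = enc.GetPreamble()
        Dim bytesText As String = "(none)"
        If preamble.Length > 0 Then bytesText = BitConverter.ToString(preamble)
        Console.WriteLine(label & ": " & bytesText)
    End Sub

    Sub Main()
        ShowSignature("UTF-8", New UTF8Encoding(True))                  ' EF-BB-BF
        ShowSignature("UTF-16 little endian", New UnicodeEncoding(False, True)) ' FF-FE
        ShowSignature("UTF-16 big endian", New UnicodeEncoding(True, True))     ' FE-FF
        ShowSignature("UTF-32 little endian", New UTF32Encoding(False, True))   ' FF-FE-00-00
        ShowSignature("UTF-32 big endian", New UTF32Encoding(True, True))       ' 00-00-FE-FF
        ShowSignature("ASCII", New ASCIIEncoding())                     ' (none)
    End Sub
End Module
```

Note that the UTF-32 little-endian signature begins with the same two bytes as the UTF-16 little-endian one, which is why any detection code has to test the longer signature first.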
Creating test files
Because you surely don’t have a file of each encoding available, we will create a small set of files only for testing purposes.
The simplest way I found to quickly create the set of test files is this snippet of code, which creates 7 files, one for each of the 7 formats supported by .NET:
'Requires Imports System.IO and Imports System.Text at the top of the file
'Create dummy test files with each encoding
ListBox1.Items.Clear()
ListBox1.Items.Add("Creation of test files...")
Dim objEncoder As Encoding = Nothing
For i As Integer = 1 To 7
    Select Case i
        Case 1 : objEncoder = New System.Text.ASCIIEncoding()
        Case 2 : objEncoder = New System.Text.UnicodeEncoding(True, True) 'big endian
        Case 3 : objEncoder = New System.Text.UnicodeEncoding(False, True) 'little endian
        Case 4 : objEncoder = New System.Text.UTF32Encoding(True, True) 'big endian
        Case 5 : objEncoder = New System.Text.UTF32Encoding(False, True) 'little endian
        Case 6 : objEncoder = New System.Text.UTF7Encoding()
        Case 7 : objEncoder = New System.Text.UTF8Encoding(True) 'True = write the signature
    End Select
    ListBox1.Items.Add(String.Format(" Creating File{0}.txt with {1}", i, objEncoder.EncodingName))
    'open file with encoding; Using saves and closes it even if Write fails
    Using tw As TextWriter = New StreamWriter(String.Format("File{0}.txt", i), False, objEncoder)
        'write data here
        tw.Write("this is a sample file encoded using " & objEncoder.EncodingName)
    End Using
Next
When you run this code, the listbox control is filled as shown in figure 1. Seven files (File1.txt to File7.txt) are created in your bin/debug folder.
Figure 1: Test files just created
The UTF7Encoding class of .NET does not add any bytes at the beginning of the file: its GetPreamble method returns an empty array, just like that of ASCIIEncoding. That means that the UTF-7 file will be detected as a plain old ASCII file.
Detecting file’s encoding
Figure 2 shows the application after the files’ encoding has been detected.
Figure 2: Encoding being reported
Now that we have a set of files, the next step is to loop through the test files and detect their encoding.
The loop is as simple as this:
ListBox1.Items.Clear()
For i As Integer = 1 To 7
    Dim strX As String = String.Format("File{0}.txt", i)
    Dim enc As Encoding = GetFileEncoding(strX)
    If enc Is Nothing Then
        ListBox1.Items.Add(String.Format("File >>{0}<< has an unknown encoding", strX))
    Else
        ListBox1.Items.Add(String.Format("File >>{0}<< is encoded using {1}", strX, enc.EncodingName))
    End If
Next
As you can see, nothing in that code detects the encoding. This process is encapsulated in the GetFileEncoding method, which reads like this:
Public Function GetFileEncoding(ByVal pFileName As String) As Encoding
    Try
        Dim bom As Byte() = New Byte(3) {} ' Room for the longest byte-order mark (4 bytes)
        Dim count As Integer
        ' Using closes the file even if an exception is thrown: never forget this step!
        Using file As FileStream = New FileStream(pFileName, FileMode.Open, FileAccess.Read, FileShare.Read)
            count = file.Read(bom, 0, 4) ' Number of bytes actually read (fewer than 4 for tiny files)
        End Using
        ' The 4-byte signatures are tested first: FF FE 00 00 also starts with FF FE
        If (count >= 4 AndAlso bom(0) = &HFF AndAlso bom(1) = &HFE AndAlso bom(2) = 0 AndAlso bom(3) = 0) Then
            Return Encoding.UTF32 ' UTF-32 little endian
        ElseIf (count >= 4 AndAlso bom(0) = 0 AndAlso bom(1) = 0 AndAlso bom(2) = &HFE AndAlso bom(3) = &HFF) Then
            Return New UTF32Encoding(True, True) ' UTF-32 big endian (Encoding.UTF32 is little endian)
        ElseIf (count >= 2 AndAlso bom(0) = &HFF AndAlso bom(1) = &HFE) Then
            Return Encoding.Unicode ' UTF-16 little endian
        ElseIf (count >= 2 AndAlso bom(0) = &HFE AndAlso bom(1) = &HFF) Then
            Return Encoding.BigEndianUnicode ' UTF-16 big endian
        ElseIf (count >= 3 AndAlso bom(0) = &HEF AndAlso bom(1) = &HBB AndAlso bom(2) = &HBF) Then
            Return Encoding.UTF8
        ElseIf (count >= 3 AndAlso bom(0) = &H2B AndAlso bom(1) = &H2F AndAlso bom(2) = &H76) Then
            Return Encoding.UTF7
        Else
            Return Encoding.ASCII 'no signature found: it can also be UTF8
        End If
    Catch ex As Exception
        ' The file could not be opened or read
        Return Nothing
    End Try
End Function
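This code simply opens the file into a FileStream and reads the first 4 bytes into a byte array. This array is compared against the known signatures to determine the encoding. The order of the tests matters: the UTF-32 little-endian signature (FF FE 00 00) starts with the UTF-16 little-endian one (FF FE), so it has to be tested first.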
This method can easily be reused in your projects.
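If you prefer not to hard-code the signature bytes, one possible variation (a sketch only, not part of the demo code) compares a buffer against the result of GetPreamble for each candidate encoding, longest signature first. The names SignatureDemo, StartsWith and DetectFromBytes are hypothetical, and the sketch assumes the first bytes of the file are already in memory.

```vb
Imports System
Imports System.Text

Module SignatureDemo
    ' True when buffer begins with every byte of a non-empty prefix
    Function StartsWith(ByVal buffer As Byte(), ByVal prefix As Byte()) As Boolean
        If prefix.Length = 0 OrElse buffer.Length < prefix.Length Then Return False
        For i As Integer = 0 To prefix.Length - 1
            If buffer(i) <> prefix(i) Then Return False
        Next
        Return True
    End Function

    ' Returns the first candidate whose signature matches, or Nothing when none does
    Function DetectFromBytes(ByVal buffer As Byte()) As Encoding
        ' Longest signatures first, so FF FE 00 00 is not mistaken for FF FE
        Dim candidates As Encoding() = New Encoding() { _
            New UTF32Encoding(False, True), _
            New UTF32Encoding(True, True), _
            New UTF8Encoding(True), _
            New UnicodeEncoding(False, True), _
            New UnicodeEncoding(True, True)}
        For Each candidate As Encoding In candidates
            If StartsWith(buffer, candidate.GetPreamble()) Then Return candidate
        Next
        Return Nothing
    End Function

    Sub Main()
        Dim utf8Sample As Byte() = New Byte() {&HEF, &HBB, &HBF, &H68, &H69}
        Dim utf32Sample As Byte() = New Byte() {&HFF, &HFE, 0, 0, &H68, 0, 0, 0}
        Dim plainSample As Byte() = New Byte() {&H68, &H69}
        Console.WriteLine(DetectFromBytes(utf8Sample).WebName)     ' utf-8
        Console.WriteLine(DetectFromBytes(utf32Sample).WebName)    ' utf-32
        Console.WriteLine(DetectFromBytes(plainSample) Is Nothing) ' True
    End Sub
End Module
```

Returning Nothing for a buffer without a signature leaves the choice of fallback (ASCII, UTF-8 or something else) to the caller.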
Conclusion
Using this simple method, you no longer have to guess the encoding of a file that starts with a signature.