Have you ever received a text file from a provider and had to test many encoding settings because it wasn’t really clear what format they were using?
It has happened to me many times.
Did you know that a few bytes at the beginning of a file often identify its encoding? This article provides code to quickly detect the encoding of a text file.
This month’s demo code
This month’s demo code is provided in both VB and C#. It was created using Visual Studio 2008 but should work in all versions (yes, even 2002).
What is encoding?
In short, an encoding defines how characters are stored as bytes, which is what makes it possible to support multiple character sets. Among the most popular encodings in .NET, we find UTF-7, UTF-8, UTF-16 (which .NET simply calls Unicode) and UTF-32.
These days, Unicode encodings are the most widely used. Unicode is a standard. You can find all the specifications at http://www.unicode.org/standard/standard.html.
Wikipedia has a great explanation of the signatures used by the different text file encodings at http://en.wikipedia.org/wiki/Byte-order_mark.
Let’s just say here that sometimes, what you expect to be a plain old text file is not. If you have to process it, the only difference for you will be to specify the encoding format when you open it. Once opened, there is nothing special to do.
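As a minimal sketch of what "specifying the encoding when you open it" looks like, the following console module writes a small UTF-16 file and reads it back. The module name and the file name encoding-demo.txt are assumptions made up for this illustration.

```vb
Imports System
Imports System.IO
Imports System.Text

Module ReadWithEncodingSketch
    Sub Main()
        ' Hypothetical file name, used only for this illustration
        Dim fileName As String = "encoding-demo.txt"

        ' Create a small UTF-16 (little endian) file to read back
        File.WriteAllText(fileName, "plain old text", Encoding.Unicode)

        ' The encoding is specified once, when the file is opened
        Dim reader As New StreamReader(fileName, Encoding.Unicode)
        Try
            ' From here on, reading works exactly as with any text file
            Console.WriteLine(reader.ReadToEnd())
        Finally
            reader.Close()
        End Try
    End Sub
End Module
```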
In the .NET world, the default is UTF-8: StreamReader and StreamWriter use it when you don’t specify an encoding, and recent versions of the good old Notepad save in this format as well.
What is the trick?
There is no magic here. We simply have to open the file and read the first few bytes. Most Unicode encodings can write a unique signature, the byte-order mark (BOM), at the beginning of the file. If we detect this signature, we can safely (most of the time) assume that the file uses this encoding. A file without a signature cannot be identified this way.
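To look at such a signature yourself, a small sketch like this one dumps the first bytes of a file in hexadecimal. The module name and the file name signature-demo.txt are hypothetical. The file is created by the sketch itself so that it can run on its own.

```vb
Imports System
Imports System.IO
Imports System.Text

Module DumpSignatureSketch
    Sub Main()
        ' Hypothetical file name, used only for this illustration
        Dim fileName As String = "signature-demo.txt"

        ' Write a small UTF-8 file; passing True asks for the byte-order mark
        File.WriteAllText(fileName, "abc", New UTF8Encoding(True))

        ' Read the raw bytes back and show at most the first 4 in hexadecimal
        Dim bytes As Byte() = File.ReadAllBytes(fileName)
        Dim count As Integer = Math.Min(4, bytes.Length)
        Console.WriteLine(BitConverter.ToString(bytes, 0, count))
    End Sub
End Module
```

The first three bytes shown (EF-BB-BF) are the UTF-8 signature, and the fourth (61) is the letter "a" of the actual content.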
Creating test files
Because you surely don’t have a file of each encoding available, we will create a small set of files only for testing purposes.
The simplest way I found to quickly create the set of test files is this snippet of code, which creates 7 files, one for each of the 7 encodings tested here:
    'Create dummy test files with each encoding
    ListBox1.Items.Clear()
    ListBox1.Items.Add("Creation of test files...")
    Dim objEncoder As Encoding = Nothing
    For i As Integer = 1 To 7
        Select Case i
            Case 1 : objEncoder = New System.Text.ASCIIEncoding
            Case 2 : objEncoder = New System.Text.UnicodeEncoding(True, True) 'big endian
            Case 3 : objEncoder = New System.Text.UnicodeEncoding(False, True) 'little endian
            Case 4 : objEncoder = New System.Text.UTF32Encoding(True, True) 'big endian
            Case 5 : objEncoder = New System.Text.UTF32Encoding(False, True) 'little endian
            Case 6 : objEncoder = New System.Text.UTF7Encoding()
            Case 7 : objEncoder = New System.Text.UTF8Encoding(True)
        End Select
        ListBox1.Items.Add(String.Format(" Creating File{0}.txt with {1}", i, objEncoder.EncodingName))
        'open file with encoding
        Dim tw As TextWriter = New StreamWriter(String.Format("File{0}.txt", i), False, objEncoder)
        'write data here
        tw.Write("this is a sample file encoded using " + objEncoder.EncodingName)
        'save and close it
        tw.Close()
    Next
When you run this code, you will see the ListBox control being filled as shown in Figure 1. Seven files (File1.txt to File7.txt) will be created in your bin/debug folder.
Figure 1: Test files just created
Note that the UTF7Encoding class of .NET does not add any bytes at the beginning of the file: its preamble is empty. That means that this file carries no signature and will be detected as a plain old ASCII file.
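One way to check which signature each encoding object writes is to ask it directly through its GetPreamble method. The sketch below lists the preamble of the same 7 encoders used for the test files. The module name is an assumption, and on recent .NET versions the compiler flags UTF7Encoding as obsolete.

```vb
Imports System
Imports System.Text

Module PreambleSketch
    Sub Main()
        ' The same 7 encoders as in the test file snippet
        Dim encoders As Encoding() = { _
            New ASCIIEncoding(), _
            New UnicodeEncoding(True, True), _
            New UnicodeEncoding(False, True), _
            New UTF32Encoding(True, True), _
            New UTF32Encoding(False, True), _
            New UTF7Encoding(), _
            New UTF8Encoding(True)}

        For Each enc As Encoding In encoders
            ' GetPreamble returns the signature bytes, or an empty array
            Dim preamble As Byte() = enc.GetPreamble()
            Dim signature As String = "(none)"
            If preamble.Length > 0 Then
                signature = BitConverter.ToString(preamble)
            End If
            Console.WriteLine("{0}: {1}", enc.EncodingName, signature)
        Next
    End Sub
End Module
```

The ASCII and UTF-7 lines report no signature, which is why both files look identical to a detector based on the first bytes.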
Detecting file’s encoding
Figure 2 shows the application after the files’ encoding has been detected.
Figure 2: Encoding being reported
Now that we have a set of files, what you really want to do is loop through the test files and detect their encoding.
The loop is as simple as this:
    ListBox1.Items.Clear()
    For i As Integer = 1 To 7
        Dim strX As String = String.Format("File{0}.txt", i)
        Dim enc As Encoding = GetFileEncoding(strX)
        If enc Is Nothing Then
            ListBox1.Items.Add(String.Format("File >>{0}<< has an unknown encoding", strX))
        Else
            ListBox1.Items.Add(String.Format("File >>{0}<< is encoded using {1}", strX, enc.EncodingName))
        End If
    Next
As you can see, nothing in that code actually detects the encoding. This process has been encapsulated into the GetFileEncoding method, which reads like this:
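    'Requires Imports System.IO and Imports System.Text
    Public Function GetFileEncoding(ByVal pFileName As String) As Encoding
        Dim file As FileStream = Nothing
        Try
            Dim enc As Encoding = Nothing
            file = New FileStream(pFileName, FileMode.Open, FileAccess.Read, FileShare.Read)
            If file.CanSeek Then
                Dim bom As Byte() = New Byte(3) {}
                ' Get the byte-order mark, if there is one.
                ' Read can return fewer than 4 bytes for a very short file.
                Dim count As Integer = file.Read(bom, 0, 4)
                If (count >= 4 AndAlso bom(0) = &HFF AndAlso bom(1) = &HFE AndAlso bom(2) = 0 AndAlso bom(3) = 0) Then
                    enc = Encoding.UTF32 'little endian
                ElseIf (count >= 2 AndAlso bom(0) = &HFF AndAlso bom(1) = &HFE) Then
                    enc = Encoding.Unicode 'UTF-16 little endian
                ElseIf (count >= 2 AndAlso bom(0) = &HFE AndAlso bom(1) = &HFF) Then
                    enc = Encoding.BigEndianUnicode
                ElseIf (count >= 3 AndAlso bom(0) = &HEF AndAlso bom(1) = &HBB AndAlso bom(2) = &HBF) Then
                    enc = Encoding.UTF8
                ElseIf (count >= 3 AndAlso bom(0) = &H2B AndAlso bom(1) = &H2F AndAlso bom(2) = &H76) Then
                    enc = Encoding.UTF7
                ElseIf (count >= 4 AndAlso bom(0) = 0 AndAlso bom(1) = 0 AndAlso bom(2) = &HFE AndAlso bom(3) = &HFF) Then
                    enc = New UTF32Encoding(True, True) 'big endian (Encoding.UTF32 is little endian)
                Else
                    enc = Encoding.ASCII 'no signature: it can also be UTF8 without a byte-order mark
                End If
            Else
                enc = System.Text.Encoding.ASCII
            End If
            Return enc
        Catch ex As Exception
            Return Nothing
        Finally
            ' Close the file: never forget this step!
            If Not file Is Nothing Then file.Close()
        End Try
    End Function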
This code simply opens the file into a FileStream and reads the first 4 bytes into a byte array. These bytes are compared against the known signatures to set a variable to the detected encoding. If no signature matches, the method falls back to ASCII, and if the file cannot be read at all, it returns Nothing. This method can easily be reused in your projects.
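For comparison, the framework's own StreamReader can also look for a byte-order mark when it opens a file. This is not the method used in this article, only a hedged sketch of the built-in alternative. The module name and the file name detect-demo.txt are assumptions for illustration.

```vb
Imports System
Imports System.IO
Imports System.Text

Module StreamReaderDetectionSketch
    Sub Main()
        ' Hypothetical file name, used only for this illustration
        Dim fileName As String = "detect-demo.txt"

        ' Create a UTF-16 big endian file; Encoding.BigEndianUnicode writes a signature
        File.WriteAllText(fileName, "sample", Encoding.BigEndianUnicode)

        ' Third argument True = detectEncodingFromByteOrderMarks.
        ' Encoding.ASCII is only the fallback used when no signature is found.
        Dim reader As New StreamReader(fileName, Encoding.ASCII, True)
        Try
            ' Detection happens on the first read, so peek before asking
            reader.Peek()
            Console.WriteLine(reader.CurrentEncoding.EncodingName)
        Finally
            reader.Close()
        End Try
    End Sub
End Module
```

A hand-written method such as GetFileEncoding remains useful when you want the answer without creating a reader, or when you want to control the fallback yourself.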
Conclusion
Using this simple method, you don’t have to guess the encoding of a file anymore, as long as the file starts with a signature.