Text Encoding

Many programmers get by without having to worry about text encodings at all. Text is just a sequence of chars, right? Chars are just numbers that map to letters, numbers and symbols. To save text to a file you just write out a sequence of chars, or let whatever API you are using handle the encode/decode. It’s not until you start supporting multiple languages, or come across a file in a weird format, that you have to understand what encodings really are, and then it can be like opening a can of worms.

Here I want to explain the relationship between text encoding, unicode, wide chars and text file formats.

There are many standards for encodings, some things are misnamed (the ANSI format), some things have very few standards (text file formats), and some things are just wrong, but have been wrong for so long that they have been incorporated into standards.

Let’s dig a little deeper.

Code Point

A code point is the term used for a symbol that is represented as a number. As programmers we would normally think of them as characters, but code point is the more general term and doesn’t imply anything about how it is stored on a computer.

Encoding

An encoding converts a sequence of code points to a sequence of bytes. An encoding is typically used when writing text to a file. To read it back in we have to know how it was encoded and decode it back into memory. A text encoding is basically a file format for text files.

The simplest encoding is ASCII where each code point maps to a single byte. With ASCII we can simply write out byte chars directly to a file without doing any encoding work. Similarly, we don’t have to decode it when reading back in to memory. Although ASCII is encoded in bytes it only uses 7 bits for each character, and so only allows for 128 characters.

The term encoding is also used to refer to how numbers map to symbols. The letter A is encoded as 0x41. These mappings are called code points. So we encode symbols to numbers, but then we also encode those numbers to byte streams. This can get confusing, so from now on when I say encoding I’m talking about encoding code points to a byte stream.
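Here is a minimal C# sketch of the two levels: the character ‘A’ maps to the code point 0x41, and the ASCII encoding then writes that code point out as a single byte.

int code_point = 'A';                                     // 'A' maps to code point 0x41 (65)
byte[] bytes = System.Text.Encoding.ASCII.GetBytes("A");  // the ASCII encoding stores it as the single byte 0x41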

You might be wondering why we don’t just use 4 bytes for all code points and store text files as a sequence of 4 byte values. It’s a valid question. These days the size of text files is inconsequential compared to other file types. The reasons are mostly historical, from times when disk space was limited. It’s also very wasteful to use 4 bytes when most text in the world only needs 7 bits. Encoding allows us to store most text as an array of 1 byte chars, which is easy for a human to decipher, but also allows us to have more complicated encodings when we want to use more uncommon characters.

Encodings typically encode each code point into 1, 2 or 4 bytes. A variable length encoding means that each code point in a sequence can be encoded in a different number of bytes. You can think of this as a simple form of compression, where the first byte tells you how many bytes follow.

Text encodings are only used for text stored in files. When text is read into memory it is usually converted to an array of chars, where each char is 1 or 2 bytes. A char represents a code point. This makes text manipulation much simpler.

ASCII

As mentioned already, ASCII is a character mapping that has 128 code points numbered 0 – 127 and each code point is encoded as a single byte. If we are storing the text as 1 byte chars in memory we can simply dump out the memory to a file and we don’t have to do anything to encode it.

Unicode

Unicode is a superset of ASCII and allows for a lot more code points. Unicode has a maximum of 21 bits for each code point, allowing for 2,097,152 code points. Quite a jump up from 128. For backwards compatibility the first 128 code points in unicode are the same as ASCII.

It is important to understand that unicode is not an encoding itself, but a character mapping set. A unicode encoding encodes code points from the unicode set to a byte stream, and there are a number of unicode encodings, such as UTF-8 and UTF-16, which map unicode code points into byte streams in different ways. Unicode doesn’t say anything about how text is stored in memory or on disk; it is a standard character mapping set that maps numbers to symbols.

Let’s take a quick look at UTF-8, which maps unicode to a byte stream that can be parsed 1 byte at a time. It is a variable length encoding, which means you look at the first byte to tell you how many bytes the code point is encoded as. It is constructed in such a way that if you are only using code points 0-127 (as in ASCII) then the byte stream is identical to an ASCII encoded byte stream. UTF-8 and ASCII are exactly equivalent if you are only using characters 0-127, but UTF-8 also allows you to encode any code point from the unicode set. This is all down to the fact that ASCII didn’t use the top bit. We will see why the top bit is so important when we come to the ANSI text format.
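As a quick C# sketch of this, encoding a pure ASCII string with UTF-8 produces exactly the same bytes as ASCII, while a character outside 0-127 spills into extra bytes:

byte[] ascii   = System.Text.Encoding.ASCII.GetBytes("Hi");  // 0x48 0x69
byte[] utf8    = System.Text.Encoding.UTF8.GetBytes("Hi");   // 0x48 0x69 - identical for code points 0-127
byte[] utf8_ex = System.Text.Encoding.UTF8.GetBytes("Ä");    // 0xC3 0x84 - code point 0xC4 needs 2 bytes in UTF-8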

The important points to remember are that variable length encodings are only used when storing text in a file (think of it as compression). When we read text into memory we decode it to an array of chars.

Wide Chars

It’s important to make the distinction between how text is encoded in a file and how it is represented in memory. We don’t use variable length encodings in memory because this would make text manipulation very difficult, so we have to decide how many bytes we are going to use for each code point. If we are decoding ASCII we can use 1 byte chars, and if we are decoding unicode we use 2 byte wide chars.

The definition of a wide char is a bit vague, it just means a char that is bigger than 1 byte, and it is up to the compiler to decide how big they are. In practice though, wide chars are typically 2 bytes. It’s also worth noting that a char in C# is a 2 byte wide char. 2 byte wide chars give us 65,536 code points, which doesn’t cover the full 2,097,152 code points of the unicode standard, but it’s usually enough for most cases. Windows uses wide chars for most things internally, whereas Unix based systems tend to just use ASCII, or properly handle all unicode chars. Wide chars are kind of a half way point between ASCII and Unicode, which in practice works well enough, but is not a proper solution.
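To make this concrete, here is a small C# sketch showing that a char really is a 2 byte wide char, and that a string in memory is just an array of them:

System.Console.WriteLine(sizeof(char));  // 2 - a C# char is a 2 byte wide char
string s = "Äbc";
System.Console.WriteLine(s.Length);      // 3 - three 2 byte chars in memory, regardless of how it is encoded on disk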

It’s worth explaining what Visual Studio means by unicode in the project properties. Internally, Windows uses 2 byte wchar_t strings and that’s what it expects in its API functions. If you set this setting to “Not Set” it defines TCHAR as char instead of wchar_t. It also means that you can pass char strings to Windows API functions that usually take wchar_t. This is only really there for compatibility with old code, but I often turn off unicode because I prefer using char strings everywhere.

Which Encoding?

If we have a text file how do we know which encoding it is using? The simple answer is that we often have to guess. Yes, you read that right. There is a standard header called a BOM, but it can often cause more problems than it solves on systems that are not expecting it. If a text file doesn’t have a BOM then you just have to guess the format.

BOMs

A BOM is a Byte Order Mark, a special unicode character (U+FEFF) written as the first character of a file. This character is encoded differently by each encoding, so by reading the first few bytes of a file you can tell which encoding it is in. For example, if a file starts with 0xEF,0xBB,0xBF you know that it is in UTF-8, because that is how UTF-8 encodes the BOM character. If a text editor doesn’t read the BOM correctly you might see the characters ï»¿ at the start of the file (those three bytes interpreted as 1 byte ANSI characters).

The most common BOMs are UTF-8 (0xEF,0xBB,0xBF) and UTF-16 (0xFE,0xFF for big endian, 0xFF,0xFE for little endian). An interesting side note: the endianness of a text file can be detected from the BOM, because read with the wrong endianness it appears as U+FFFE, which is an invalid unicode character, so you can assume it is really U+FEFF and the file needs endian swapping on your system.
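In .NET you can see these byte sequences directly, because each Encoding object can report its own BOM. A minimal sketch:

byte[] utf8_bom     = System.Text.Encoding.UTF8.GetPreamble();             // EF BB BF
byte[] utf16_le_bom = System.Text.Encoding.Unicode.GetPreamble();          // FF FE (UTF-16 little endian)
byte[] utf16_be_bom = System.Text.Encoding.BigEndianUnicode.GetPreamble(); // FE FF (UTF-16 big endian)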

Many files don’t have BOMs, and most text editors will interpret these files as UTF-8 or make a best guess at the encoding. Editors will often show that the file has a BOM by calling the encoding something like “UTF-8 BOM”. The problem comes when a file is in the ANSI format with no BOM. This format can’t be parsed as UTF-8 and leads to invalid characters for codes > 127.

ANSI Format

The ANSI format is not actually a format standardised by ANSI (American National Standards Institute) as the name suggests. It is misnamed, and actually refers to a specific Windows code page called “Windows-1252”. People have been calling it the ANSI format for so long that the name has stuck. This format is a fixed length 1 byte encoding, just like ASCII except that it uses all 8 bits instead of 7, allowing it to represent 256 code points.

The .NET System.IO.File.ReadAllText() function seems to assume UTF-8 for files with no BOM and will produce invalid characters for ANSI files with characters > 127. This is because characters > 127 mean something totally different in the variable length UTF-8 encoding.

The way to fix this is to pass in the correct Encoding object: File.ReadAllLines(filename, Encoding.Default). ‘Default’ here means the default for Windows, which is Windows-1252 (which is what the ANSI format really is). Don’t be misled by the ‘Default’ encoding though: you can’t just use Default for everything, because it won’t work if the file really is using UTF-8 or another encoding.

So how do we know if the file is using the ANSI encoding rather than UTF-8? Put simply, we don’t. If we have a file with no BOM then it could be UTF-8 or ANSI and there is no way of knowing which it is. If it only contains characters < 128 then we can parse it as ASCII or UTF-8, but if it contains characters > 127 then it could be UTF-8 or ANSI and we must make a best guess.

There are certain byte combinations that result in invalid UTF-8, so if we detect a top bit set and the next byte is not what we expect it to be then we can assume it is ANSI. For example, if we encounter the byte 0xC4 (11000100), it could be the ANSI German umlaut Ä or the first byte of a 2 byte UTF-8 encoding. If it is a 2 byte encoding then the next byte should be of the form 10xxxxxx. If it isn’t, then it is invalid UTF-8 and we can assume it is ANSI. Tricks like this don’t cover every case, but they give us a best guess.

using System.Diagnostics;
using System.IO;

public static bool LooksLikeANSI(string filename)
{
	using (FileStream file_stream = File.OpenRead(filename))
	using (BinaryReader stream = new BinaryReader(file_stream))
	{
		long file_size = file_stream.Length;
		Debug.Assert(file_size < int.MaxValue);
		byte[] buffer = stream.ReadBytes((int)file_size);

		// if it has a BOM it is not ANSI
		if (
			(file_size >= 3 && buffer[0] == 0xEF && buffer[1] == 0xBB && buffer[2] == 0xBF) ||                      // UTF-8
			(file_size >= 2 && buffer[0] == 0xFE && buffer[1] == 0xFF) ||                                           // UTF-16 (BE)
			(file_size >= 2 && buffer[0] == 0xFF && buffer[1] == 0xFE) ||                                           // UTF-16 (LE)
			(file_size >= 4 && buffer[0] == 0x00 && buffer[1] == 0x00 && buffer[2] == 0xFE && buffer[3] == 0xFF) || // UTF-32 (BE)
			(file_size >= 4 && buffer[0] == 0xFF && buffer[1] == 0xFE && buffer[2] == 0x00 && buffer[3] == 0x00))   // UTF-32 (LE)
		{
			return false;
		}

		for (int i = 0; i < file_size; ++i)
		{
			byte b = buffer[i];

			// if the top bit is set it could be UTF-8 or ANSI
			if ((b & 0x80) != 0)
			{
				// if it is the last byte in the file it definitely isn't UTF-8, so it must be ANSI
				if (i == file_size - 1)
					return true;

				// if the next byte is not of the form 10xxxxxx it can't be UTF-8 so we assume it is ANSI
				if ((buffer[i + 1] & 0xC0) != 0x80)
					return true;
			}
		}

		// no top bits set or we couldn't determine that it was ANSI, so assume UTF-8
		return false;
	}
}
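One possible way to use this heuristic is to pick an encoding before reading the file. This is just a sketch, assuming the file is either ANSI or UTF-8 (on .NET Core the 1252 code page may also need the CodePages encoding provider registered):

Encoding encoding = LooksLikeANSI(filename) ? Encoding.GetEncoding(1252) : Encoding.UTF8;
string[] lines = File.ReadAllLines(filename, encoding);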

To make matters even worse, sometimes files using the Windows-1252 (or ANSI) encoding are labelled as ISO-8859-1, which is wrong again! This causes certain characters such as quotes to not display properly. However, people have been getting this wrong for so long that HTML5 has actually standardised ISO-8859-1 to mean Windows-1252. What a mess!
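A small sketch of the difference, assuming the 1252 code page is available: the byte 0x93 is a left curly quote in Windows-1252 but an unused control code in real ISO-8859-1, so decoding with the wrong encoding gives an invisible character instead of a quote.

byte[] data = { 0x93 };
string cp1252 = System.Text.Encoding.GetEncoding(1252).GetString(data);         // U+201C, a left double quote
string latin1 = System.Text.Encoding.GetEncoding("ISO-8859-1").GetString(data); // U+0093, an invisible C1 control character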

Encodings

Here is a description of the common encodings:

ASCII

ASCII has 128 code points, each encoded in a single byte with the top bit clear.

UTF-8

UTF-8 is a variable length encoding which can represent all unicode code points. It uses 1 to 4 bytes for each code point. If you are encoding characters < 128 then it is the same as ASCII, which in most cases makes it a very simple format to work with. Any other unicode character can be encoded by setting the top bit and adding more bytes. UTF-8 is backwards compatible with ASCII and just as compact for ASCII characters. UTF-8 is the most common format and is used by most of the internet.

It’s important to note that UTF-8 is not compatible with the Windows-1252 1 byte format. Windows-1252 is a fixed length encoding using all 8 bits allowing it to map 256 characters. UTF-8 is a variable length encoding and if the top bit is set it means something totally different.

The meaning of the first byte changes depending on how many of the top bits are set. If the top bit is clear (0xxxxxxx), then just treat it as a normal ASCII character and we are done. If the first byte starts with 110 (110xxxxx) then the code point is encoded in two bytes. Similarly, 1110xxxx means it is encoded in 3 bytes and 11110xxx means 4 bytes. Bytes 2, 3 and 4 are always of the form 10xxxxxx, where only the bottom 6 bits are used.

1 byte:	0xxxxxxx
2 bytes:	110xxxxx 10xxxxxx
3 bytes:	1110xxxx 10xxxxxx 10xxxxxx
4 bytes:	11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
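As a sketch of how the 2 byte case works, here is the code point U+00E9 (é) packed by hand into the 110xxxxx 10xxxxxx pattern:

int code_point = 0xE9;                              // é is 11101001 in binary
byte byte1 = (byte)(0xC0 | (code_point >> 6));      // 110 prefix + the high bits (11)       -> 0xC3
byte byte2 = (byte)(0x80 | (code_point & 0x3F));    // 10 prefix + the bottom 6 bits (101001) -> 0xA9
// matches System.Text.Encoding.UTF8.GetBytes("é") which gives 0xC3 0xA9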

Typically we don’t need to understand exactly how to decode these files because almost every programming language will have a library to do it for us, or a way of specifying the encoding when reading a text file. However, it is still useful to have an idea of how the encodings work, especially when trying to work out the encoding of a file.

UTF-16

UTF-16 is a bit different to UTF-8 but works on similar principles. Instead of parsing 1 byte at a time as in UTF-8, you parse 2 bytes at a time. The first 2 byte value tells you whether the code point is encoded in one or two 16 bit values (the two value case is known as a surrogate pair).
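For example, a minimal C# sketch of a code point above 0xFFFF being split into a surrogate pair of two 16 bit values:

string clef = char.ConvertFromUtf32(0x1D11E);                      // musical G clef, U+1D11E
System.Console.WriteLine($"{(int)clef[0]:X4} {(int)clef[1]:X4}");  // D834 DD1E - a surrogate pair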

Summary

  • ASCII uses 7 bits to encode 128 code points. Each code point is encoded into a single byte.
  • Unicode defines a character set of 2,097,152 possible code points. The first 128 are the same as ASCII
  • In memory text is typically represented by a sequence of 1 or 2 byte chars.
  • When we write text to a file it needs to be encoded into a byte stream
  • A variable length encoding can use a different number of bytes for each code point (UTF-8 uses 1 – 4 bytes, UTF-16 uses 2 or 4)
  • UTF-8 is identical to ASCII if only the first 128 code points are used
  • ANSI is the same as ASCII except that it also uses the top bit and can encode 256 code points.
  • Because ANSI uses the top bit it can be impossible to know if the file is UTF-8 or ANSI and we have to use tricks to make a best guess
  • BOMs are 2-4 byte headers at the start of a text file that specify the encoding, but most text files don’t have them.
