Text Encoding

Many programmers get by without having to worry about text encodings at all. Most of the time text is just a sequence of chars, chars are just numbers that map to letters, digits and symbols, and to save text to a file you just write out a sequence of chars, or let whatever API you are using handle the encode/decode. Unfortunately it’s not always that simple. It’s not until you start supporting multiple languages, or come across a file in an unfamiliar format, that you have to understand how text encodings really work.

Here I want to explain the relationship between text encoding, unicode, wide chars and text file formats.

There are many standards for text encodings: some are misnamed (the “ANSI” format), some areas barely have standards at all (text file formats), and some things are just plain wrong, but have been wrong for so long that they have been incorporated into standards.

Let’s dig a little deeper.

Code Point

A code point is the term for a symbol represented as a number. Programmers would normally think of them as characters, but code point is the more general term and doesn’t imply anything about how the symbol is stored on a computer. A code point can be stored in 1 to 4 bytes in memory. Obviously if fewer than 4 bytes are used, not all code points can be represented.

Encoding

An encoding converts a sequence of code points to a sequence of bytes. An encoding is typically used when writing text to a file. To read it back in we have to know how it was encoded and decode it back into memory. A text encoding is basically a file format for text files. It’s important to distinguish between a text file encoding and how each code point is stored in memory. Just because 2 bytes may be used to store each code point doesn’t mean that it is an encoding. They are different things. Whether an array of bytes in memory is an encoding is determined by how it is treated. For example, the C string functions assume strings are arrays of code points (no encoding). There is a push for the C++ STL to treat all strings as encoded.

The simplest encoding is ASCII, where each code point maps to a single byte. With ASCII we can simply write out 1 byte chars directly to a file without doing any encoding work. Similarly, we don’t have to decode it when reading back into memory. Although ASCII is stored in bytes it only uses 7 bits for each character, and so only allows for 128 characters.

Confusingly, the term encoding is also used in the more general form to refer to how numbers map to symbols. The letter A is encoded as 0x41. These mappings are called code points. So we encode symbols to numbers, but then we also encode those numbers to byte streams. This can get confusing, so from now on when I say encoding I’m talking about encoding code points to a byte stream.
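To make the two senses concrete, here is a small Python sketch (Python is used purely for illustration; the same ideas apply in any language). ord() gives the code point a symbol maps to, and str.encode() then encodes those code points to a byte stream:

```python
# First sense of "encoding": mapping symbols to numbers (code points).
assert ord("A") == 0x41      # the letter A maps to code point 0x41
assert chr(0x41) == "A"      # and 0x41 maps back to the letter A

# Second sense: encoding those code points to a byte stream.
data = "A€".encode("utf-8")  # the euro sign (U+20AC) needs 3 bytes
print(data.hex())            # 41e282ac
```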

You might be wondering why we don’t just use 4 bytes for all code points and just store text files as a sequence of 4 byte values. It’s a valid question. These days the size of text files is inconsequential compared to other file types. The reasons are mostly historical, from times when disk space was limited. It’s also very wasteful to use 4 bytes when most text in the world only uses 7 bits. Encoding allows us to store most text as an array of 1 byte chars, which is easy for a human to decipher, but also allows us to have more complicated encodings when we want to use more uncommon characters.

Encodings typically encode each code point into 1 – 4 bytes. A variable length encoding means that each code point in a sequence can be encoded in a different number of bytes. You can think of this as a simple form of compression, where the first byte tells you how many bytes follow.
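A quick Python sketch (illustrative only) shows how a variable length encoding like UTF-8 uses a different number of bytes for different code points:

```python
# UTF-8 encodes each code point in 1 to 4 bytes depending on its value.
for ch in ["A", "é", "€", "😀"]:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex()}")
```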

Text encodings are used for text stored in files. When text is read into memory it is usually converted to an array of chars, where each char is 1 or 2 bytes. A char represents a code point. This makes text manipulation much simpler. However, sometimes strings in memory are treated as encoded, so you need to know which you are dealing with.

ASCII

As mentioned already, ASCII is a character mapping that has 128 code points numbered 0 – 127 and each code point is encoded as a single byte. If we are storing the text as 1 byte chars in memory we can simply dump out the memory to a file and we don’t have to do anything to encode it.

Unicode

Unicode is a superset of ASCII and allows for many more code points. Unicode uses a maximum of 21 bits for each code point, which caps the range at 1,114,112 possible code points (U+0000 to U+10FFFF). Quite a jump up from 128. For backwards compatibility the first 128 code points in unicode are the same as ASCII.

It is important to understand that unicode is not an encoding itself, but a character set that maps numbers to symbols. A unicode encoding encodes code points from the unicode set to a byte stream, and there are a number of unicode encodings, such as UTF-8 and UTF-16. These map unicode code points into byte streams in different ways. Unicode doesn’t say anything about how text is stored in memory or on disk; it is a standard character mapping that assigns numbers to symbols.
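The distinction is easy to demonstrate (a Python sketch, for illustration): the code point is one number, but each unicode encoding turns it into a different byte stream:

```python
# One code point, three different byte streams.
ch = "é"                             # code point U+00E9
print(ch.encode("utf-8").hex())      # c3a9     (2 bytes)
print(ch.encode("utf-16-be").hex())  # 00e9     (2 bytes, but different ones)
print(ch.encode("utf-32-be").hex())  # 000000e9 (4 bytes)
```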

Let’s take a quick look at UTF-8, which maps unicode to a byte stream that can be parsed 1 byte at a time. It is a variable length encoding, which means you look at the first byte to tell how many bytes the code point is encoded as. It is constructed in such a way that if you are just using code points 0-127 (as in ASCII) then the byte stream is identical to an ASCII encoded byte stream. UTF-8 and ASCII are exactly equivalent if you are only using characters 0-127, but UTF-8 also allows you to encode any code point from the unicode set. This is all down to the fact that ASCII didn’t use the top bit. We will see why the top bit is so important when we look at the ANSI text format.
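This backwards compatibility is easy to verify (again, a Python sketch for illustration): a pure ASCII string produces byte-for-byte identical output under both encodings, while UTF-8 can still encode code points ASCII cannot:

```python
# For code points 0-127 the two encodings are identical.
text = "plain ASCII text"
assert text.encode("ascii") == text.encode("utf-8")

# But unlike ASCII, UTF-8 can also encode any unicode code point.
print("π".encode("utf-8").hex())  # cf80
```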

Wide Chars

It’s important to make the distinction between how text is encoded in a file and how it is represented in memory. We often don’t use variable length encodings in memory because this would make text manipulation more complex, so we have to decide how many bytes we are going to use for each code point. If we are decoding ASCII we can use 1 byte chars, and if we are decoding unicode we use 2 byte wide chars.

The definition of a wide char is a bit vague: it just means a char that is bigger than 1 byte, and it is up to the compiler to decide how big they are. In practice though, wide chars are typically 2 bytes. It’s also worth noting that a char in C# is a 2 byte wide char. 2 byte wide chars give us 65,536 code points, which doesn’t cover the full unicode range, but it’s usually enough for most cases. Windows uses wide chars for most things internally, whereas Unix based systems tend to just use ASCII, or properly handle all unicode chars. Wide chars are kind of a halfway house between ASCII and full unicode, which in practice works well enough, but is not a proper solution.

It’s worth explaining what Visual Studio means by unicode in the project properties. Internally, Windows uses 2 byte wchar_t strings and that’s what it expects in its API functions. If you set this setting to “Not Set” it defines TCHAR as char instead of wchar_t. This means you can pass char strings to Windows API functions that usually take wchar_t. This is only for compatibility with old code, and you are encouraged to use the unicode setting.

Which Encoding?

If we have a text file, how do we know which encoding it is using? The simple answer is that we often have to guess. Yes, you read that right. There is a standard header called a BOM, but it can often cause more problems than it solves on systems that are not expecting it. If a text file doesn’t have a BOM then it is usually UTF-8, but sometimes it is ANSI, and these can be tricky to tell apart. The trick to spotting ANSI is to check whether the file contains any invalid UTF-8 byte sequences (see the code below).

BOM

A BOM is a Byte Order Mark, a special unicode character (U+FEFF) placed at the very start of a file. This character is encoded differently by each encoding, so by reading the first few bytes you can tell which encoding the file is in. For example, if a file starts with 0xEF,0xBB,0xBF you know that it is in UTF-8, because that is how UTF-8 encodes the BOM character. If a text editor doesn’t read the BOM correctly you might see the characters ï»¿ at the start of the file (the Windows-1252 representation of those bytes).

The most common BOMs are UTF-8 (0xEF,0xBB,0xBF), UTF-16 (0xFE,0xFF big endian, 0xFF,0xFE little endian) and UTF-32 (0x00,0x00,0xFE,0xFF big endian, 0xFF,0xFE,0x00,0x00 little endian). An interesting side note: the endianness of a text file can be detected from a BOM, because a byte swapped BOM reads as U+FFFE, which is an invalid unicode character, so you can assume it is really U+FEFF and that the file needs endian swapping on your system.
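The BOM check itself is simple. Here is a Python sketch (the function name sniff_bom is my own invention; the BOM constants come from Python's standard codecs module). Note that the UTF-32 LE BOM starts with the same two bytes as the UTF-16 LE BOM, so the longer sequences must be checked first:

```python
import codecs

# BOM byte sequences mapped to encoding names, longest-prefix first.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF32_BE, "utf-32-be"),  # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "utf-32-le"),  # FF FE 00 00
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]

def sniff_bom(data: bytes):
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None  # no BOM: we have to guess (often UTF-8, possibly ANSI)

print(sniff_bom(b"\xef\xbb\xbfhello"))  # utf-8
```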

Many files don’t have BOMs, and most text editors will interpret these files as UTF-8 or make a best guess at the encoding. Editors will often show that a file has a BOM by calling the encoding something like “UTF-8 BOM”. The problem comes when a file is in the ANSI format with no BOM. This format can’t be parsed as UTF-8 and leads to invalid characters for codes > 127.

ANSI Format

The ANSI format is not actually a format standardised by ANSI (the American National Standards Institute) as the name suggests. It is misnamed, and actually refers to a specific Windows code page called “Windows-1252”. People have been calling it the ANSI format for so long that the name has stuck. This format is a fixed length 1 byte encoding, just like ASCII except that it uses all 8 bits instead of 7, allowing it to represent 256 code points.

The .NET System.IO.File.ReadAllText() function assumes UTF-8 for files with no BOM and will produce invalid characters for ANSI files with characters > 127. This is because characters > 127 mean something totally different in the variable length UTF-8 encoding.

The way to fix this is to pass in the correct encoding enum: File.ReadAllLines(filename, Encoding.Default). ‘Default’ here means the default for Windows, which is Windows-1252 (which is what the ANSI format really is). Don’t be misled by the ‘Default’ encoding, though: you can’t just use Default for everything, as it won’t work if the file really is using UTF-8 or another encoding.
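The same mismatch is easy to reproduce in Python (shown for illustration; Windows-1252 is available there as the "cp1252" codec):

```python
data = b"caf\xe9"  # "café" saved in the ANSI (Windows-1252) format

# Decoding with the right code page works.
assert data.decode("cp1252") == "café"

# But the same bytes are invalid UTF-8: 0xE9 looks like the start of a
# 3 byte sequence, and no valid continuation bytes follow.
try:
    data.decode("utf-8")
except UnicodeDecodeError:
    print("not valid UTF-8")
```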

So how do we know if the file is using the ANSI encoding rather than UTF-8? Put simply, we don’t. If we have a file with no BOM then it could be UTF-8 or ANSI and there is no way of knowing which it is. If it only contains characters < 128 then we can parse it as ASCII or UTF-8, but if it contains characters > 127 then it could be UTF-8 or ANSI and we must make a best guess.

There are certain byte combinations that result in invalid UTF-8, so if we detect a top bit set and the next byte is not what we expect it to be then we can assume it is ANSI. For example, if we encounter the byte 0xC4 (11000100), it could be the ANSI German umlaut Ä or the start of a 2 byte UTF-8 sequence. If it is a 2 byte sequence then the next byte must be of the form 10xxxxxx. If it isn’t then it is invalid UTF-8 and we can assume the file is ANSI. Tricks like this don’t cover every case, but they give us a best guess.

public static bool LooksLikeANSI(string filename)
{
	byte[] buffer = File.ReadAllBytes(filename);
	int file_size = buffer.Length;

	// if it has a BOM it is not ANSI (check the longer UTF-32 BOMs
	// before UTF-16 LE, which shares the same first two bytes)
	if (
		(file_size >= 3 && buffer[0] == 0xEF && buffer[1] == 0xBB && buffer[2] == 0xBF) ||                      // UTF-8
		(file_size >= 4 && buffer[0] == 0x00 && buffer[1] == 0x00 && buffer[2] == 0xFE && buffer[3] == 0xFF) || // UTF-32 (BE)
		(file_size >= 4 && buffer[0] == 0xFF && buffer[1] == 0xFE && buffer[2] == 0x00 && buffer[3] == 0x00) || // UTF-32 (LE)
		(file_size >= 2 && buffer[0] == 0xFE && buffer[1] == 0xFF) ||                                           // UTF-16 (BE)
		(file_size >= 2 && buffer[0] == 0xFF && buffer[1] == 0xFE))                                             // UTF-16 (LE)
		return false;

	for (int i = 0; i < file_size; ++i)
	{
		byte b = buffer[i];

		// if the top bit is set check if this is a valid multi-byte UTF-8 sequence
		if ((b & 0x80) != 0)
		{
			// check how many top bits are set to get the length of the UTF-8 sequence
			int length;
			if ((b & 0xE0) == 0xC0)			// 110xxxxx
				length = 2;
			else if ((b & 0xF0) == 0xE0)	// 1110xxxx
				length = 3;
			else if ((b & 0xF8) == 0xF0)	// 11110xxx
				length = 4;
			else
				return true;		// invalid number of top bits set, not UTF-8

			// if the sequence goes over the end of the file it isn't valid UTF-8, so assume ANSI
			if (i + length > file_size)
				return true;

			// if a continuation byte is not of the form 10xxxxxx it can't be UTF-8 so we assume it is ANSI
			for (int a = 1; a < length; ++a)
				if ((buffer[i + a] & 0xC0) != 0x80)
					return true;

			// skip the continuation bytes we have just validated
			i += length - 1;
		}
	}

	// this is valid UTF-8, so assume it isn't ANSI
	return false;
}

To make matters even worse, sometimes files using the Windows-1252 (or ANSI) encoding are labelled as ISO-8859-1, which is wrong again! This causes certain characters such as quotes to not display properly. However, people have been getting this wrong for so long that HTML5 has actually standardised ISO-8859-1 to mean Windows-1252. What a mess!
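The quote problem is easy to demonstrate (a Python sketch, for illustration): the bytes for curly quotes are printable characters in Windows-1252, but unprintable C1 control codes in real ISO-8859-1:

```python
data = b"\x93hello\x94"  # curly quotes in the Windows-1252 encoding

# Windows-1252 decodes 0x93/0x94 to proper curly quote characters.
assert data.decode("cp1252") == "\u201chello\u201d"

# Real ISO-8859-1 decodes the same bytes to invisible control codes,
# which is why mislabelled files show broken quotes.
assert data.decode("iso-8859-1") == "\x93hello\x94"
```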


Here is a description of the common encodings:

ASCII

ASCII has 128 code points, each encoded in a single byte with the top bit clear.

UTF-8

UTF-8 is a variable length encoding which can represent all unicode code points. It uses 1 to 4 bytes for each code point. If you are encoding characters < 128 then it is the same as ASCII, which in most cases makes it a very simple format to work with. If it needs to encode any other unicode character it can be done by setting the top bit and adding more bytes. UTF-8 is backwards compatible with ASCII and just as compact for ASCII characters. UTF-8 is the most common format and is used by most of the internet.

It's important to note that UTF-8 is not compatible with the 1 byte Windows-1252 format. Windows-1252 is a fixed length encoding using all 8 bits, allowing it to map 256 characters. UTF-8 is a variable length encoding, and if the top bit is set it means something totally different.

The meaning of the first byte changes depending on how many of the top bits are set. If the top bit is clear (0xxxxxxx), then just treat it as a normal ASCII character and we are done. If the top two bits of the first byte are set (110xxxxx) then the character is encoded in two bytes. Similarly, 1110xxxx means it is encoded in 3 bytes and 11110xxx means 4 bytes. Bytes 2, 3 and 4 are always of the form 10xxxxxx, where only the bottom 6 bits are used.

1 byte sequence:	0xxxxxxx
2 byte sequence:	110xxxxx 10xxxxxx
3 byte sequence:	1110xxxx 10xxxxxx 10xxxxxx
4 byte sequence:	11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
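The table above is enough to decode a sequence by hand. A Python sketch (for illustration) decoding the 2 byte sequence 0xC3 0xA9: take the bottom 5 bits of the lead byte and the bottom 6 bits of the continuation byte:

```python
b1, b2 = 0xC3, 0xA9
assert (b1 & 0xE0) == 0xC0               # 110xxxxx: a 2 byte sequence
assert (b2 & 0xC0) == 0x80               # 10xxxxxx: a continuation byte

# combine the payload bits: 5 from the lead byte, 6 from the continuation
code_point = ((b1 & 0x1F) << 6) | (b2 & 0x3F)
print(hex(code_point), chr(code_point))  # 0xe9 é

# check the result against the built-in decoder
assert bytes([b1, b2]).decode("utf-8") == chr(code_point)
```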

Typically we don't need to understand exactly how to decode these files because almost every programming language will have a library to do this for us, or a way of specifying the encoding when reading a text file. However, it is still useful to have an idea of how the encodings work, especially when trying to work out the encoding of a file.

UTF-16

UTF-16 is a bit different to UTF-8 but works on similar principles. Instead of parsing 1 byte at a time as in UTF-8, you parse 2 bytes at a time. The first 16 bit code tells you whether the code point is encoded in one 16 bit code or in two (a surrogate pair).
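A Python sketch of the 2 code case (the arithmetic follows the standard UTF-16 surrogate pair scheme): code points above U+FFFF have 0x10000 subtracted, and the remaining 20 bits are split across two 16 bit codes:

```python
cp = 0x1F600                 # 😀, too big for a single 16 bit code
v = cp - 0x10000             # leaves a 20 bit value
high = 0xD800 | (v >> 10)    # high surrogate carries the top 10 bits
low = 0xDC00 | (v & 0x3FF)   # low surrogate carries the bottom 10 bits
print(hex(high), hex(low))   # 0xd83d 0xde00

# check against the built-in encoder
assert chr(cp).encode("utf-16-be").hex() == "d83dde00"
```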

Summary

  • ASCII uses 7 bits to encode 128 code points. Each code point is encoded into a single byte.
  • UTF-8 is identical to ASCII if only the first 128 code points are used
  • Unicode defines a character set of 1,114,112 possible code points (up to 21 bits each). The first 128 are the same as ASCII
  • In memory, text is often treated as an array of code points (char or wchars), although sometimes it can be an UTF-8 or UTF-16 encoding. It all depends on how it is treated.
  • When we write text to a file it needs to be encoded into a byte stream
  • A variable length encoding can use 1-4 bytes (UTF-8) or one or two 16 bit codes (UTF-16) to encode each code point
  • ANSI is the same as ASCII (1 byte encoding) except that it also uses the top bit and can encode 256 code points.
  • Because ANSI uses the top bit it can be impossible to know if the file is UTF-8 or ANSI and we have to use tricks to make a best guess
  • BOMs are 2-4 byte headers at the start of a text file that specify the encoding, but most text files don't have them.
  • wchars are usually 2 bytes and can represent a subset of the unicode code points. This subset is enough for unicode support in most cases.