Manually Wrangling Unicode in Perl

The best advice I can give about Unicode and UTF-8 in Perl is: don’t worry about it. Just do what you need to do, read up on some basics, and it’ll mostly all work out. When you’re getting odd results, make sure your I/O is UTF-8, and perhaps sprinkle a “use utf8;” at the top of your source files.

And most of the time, this is just fine. Perl generally does the right thing, and many of us (myself included) have the luxury of mostly using the Basic Latin set of characters (that is, ASCII). But when you do need to worry about character encoding in Perl, when you’re getting garbage characters in your perfectly valid and sensible strings, you will need to become an expert, and quickly.

A large part of the reason you might encounter problems is that Perl is very forgiving. If you ask it to load a file, it’ll load a file, regardless of what’s in it. Compare this to Python, which is not nearly as forgiving. Python will check the contents of the file and raise a UnicodeDecodeError exception if it is not in the encoding you claimed. This is actually quite appealing once you’ve had to debug some thorny issue with Unicode in Perl; it would be quite pleasant if your data raised an error right at the start, rather than leaving you puzzled about why you’re getting double-encoded UTF-8 (which is to say, garbage) right at the end.
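
If you want something closer to Python’s strictness, Encode can be asked to die on malformed input rather than quietly carrying on. Here is a minimal sketch, assuming the core Encode module; the helper name strict_decode is mine and purely illustrative:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode UTF-8 bytes, dying on malformed input.
sub strict_decode {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die instead of substituting U+FFFD.
    # LEAVE_SRC stops decode() from modifying $bytes in place.
    return decode( 'UTF-8', $bytes,
        Encode::FB_CROAK() | Encode::LEAVE_SRC() );
}

my $good = strict_decode("d\xc3\xa9barqu\xc3\xa9");     # valid UTF-8
my $bad  = eval { strict_decode("d\xe9barqu\xe9") };    # Latin-1 bytes, not UTF-8
my $err  = $@;

print length($good), " characters decoded\n";
print defined $bad ? "second string decoded\n" : "second string rejected\n";
```

The eval is only there so the sketch can report the failure; in real code you might prefer to let the exception propagate.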

For the most part, you don’t have to worry. Most Perl tooling does the right thing when reading data in or writing data out. But there are a few common misunderstandings that will trap even the most careful programmer (frequently, if you’re me). The first is, when we say Unicode, we don’t mean UTF-8. UTF-8 is very much not Unicode. For forgetful coders like myself, the clue is in the name: Unicode Transformation Format. Unicode is a standard that aspires to index the characters of the world’s main writing systems. UTF-8 is the most successful attempt at packing the increasingly large codepoints in this index into the smallest number of bytes. In an ideal world, you don’t actually have to worry about the difference between the two. But if you’re reading or writing out data in Perl yourself, or using a limited Perl module to do so, you’d best not forget this.
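
To make the distinction concrete, here is a small sketch (assuming the core Encode module) that looks at one codepoint and then at its UTF-8 encoding. The character chosen is arbitrary:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char  = "\x{263A}";                 # WHITE SMILING FACE, one codepoint
my $bytes = encode( 'UTF-8', $char );   # the same character packed into bytes

printf "codepoint: U+%04X, characters: %d\n", ord($char), length($char);
printf "UTF-8 bytes: %s, length: %d\n",
    join( ' ', map { sprintf '%02X', ord } split //, $bytes ), length($bytes);
```

One character in the Unicode index becomes three bytes (E2 98 BA) once it is transformed to UTF-8.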

A second issue is that you may, like me, tend to think of the UTF-8 encoding as a successor to ISO-8859 encodings, and in particular, to ISO-8859-1. This misunderstanding can also cause you a few conceptual confusions, because, due to happy design, ISO-8859-1 is actually Unicode, unlike UTF-8. More specifically, Unicode’s first 256 codepoints were drawn directly from the ISO-8859-1 encoding, and make up the Basic Latin and Latin-1 Supplement blocks of Unicode. This means that, if you read an ISO-8859-1 file into Perl, you’ll be getting a string that is already in Perl’s internal Unicode (but still not UTF-8).
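
A quick way to convince yourself of this, sketched with the core Encode module: decoding ISO-8859-1 explicitly gives back a string that compares equal to the raw bytes, because every byte value maps to the codepoint with the same number.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $latin1 = "d\xe9barqu\xe9";                  # bytes as read from an ISO-8859-1 file
my $text   = decode( 'ISO-8859-1', $latin1 );   # explicit decode to characters

# Compare the decoded characters with the original bytes.
my $same = $text eq $latin1 ? 'identical' : 'different';
printf "U+%04X, %s\n", ord( substr( $text, 1, 1 ) ), $same;
```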

That’s right: Perl has its own internal Unicode representation, which may be native 8-bit, or may be actual Unicode. That is, your string may carry the notorious UTF8 flag (which has nothing to do with UTF-8 except the exceptionally poor choice of name).

Let’s look at a string that contains some valid Unicode using Devel::Peek. I’m passing the string directly in as a variable, so we’re not getting any fancy Unicode-aware behaviour that you might find when reading from a file or a database, just a bunch of bytes:

$ perl -MDevel::Peek -e 'my $str = "d\xe9barqu\xe9"; Devel::Peek::Dump $str; warn "$str\n";'
SV = PV(0x556dc2560ea0) at 0x556dc258c8e8
  REFCNT = 1
  FLAGS = (POK,IsCOW,pPOK)
  PV = 0x556dc275dc20 "d\351barqu\351"\0
  CUR = 8
  LEN = 10
  COW_REFCNT = 1
dbarqu

This is a plain string that happens to contain a couple of high codepoints. It’s not printing correctly because my terminal doesn’t know about Unicode, only about UTF-8 (I could have passed something like -COE as a command line argument to Perl, but that just obscures the problem in this case). The high codepoints are both LATIN SMALL LETTER E WITH ACUTE, which is U+00E9, commonly written in Perl with its hex code as \xe9 (or, more clearly, as \x{e9}), or in octal as \351. Or simply “é”.

Note that it’s just a bunch of characters. If you transform the string in any way, like uppercasing it, the result won’t be what you wanted:

$ perl -MDevel::Peek -e 'my $str = "d\xe9barqu\xe9"; my $cap = uc( $str ); Devel::Peek::Dump( $cap ); warn "$cap\n";'
SV = PV(0x55e5e04d7ed0) at 0x55e5e0503848
  REFCNT = 1
  FLAGS = (POK,IsCOW,pPOK)
  PV = 0x55e5e07040a0 "D\351BARQU\351"\0
  CUR = 8
  LEN = 10
  COW_REFCNT = 1
DBARQU

… you can see that it uppercases the ASCII, but leaves the high codepoints alone, since it has no idea what to do with them.

Enter the Unicode flag, UTF8. This does absolutely nothing to the characters of the string, but is simply a flag that tells Perl how it should manipulate the string. It can be set by using utf8::upgrade(), but in almost all cases you don’t want to do this directly. Instead, you should be consuming text in a way that already sets this flag, like decoding UTF-8 into Unicode, or relying on a Unicode-aware method for imbibing data into your Perl script.

But for comparison to the example above, let’s turn on the flag manually:

$ perl -MDevel::Peek -e 'my $str = "d\xe9barqu\xe9"; utf8::upgrade( $str ); Devel::Peek::Dump $str; warn "$str\n";'
SV = PV(0x55712c50dea0) at 0x55712c539788
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x55712c526dd0 "d\303\251barqu\303\251"\0 [UTF8 "d\x{e9}barqu\x{e9}"]
  CUR = 10
  LEN = 24
dbarqu

You can see that my terminal still doesn’t understand what is being printed, because it’s still not UTF-8. But now there’s an additional flag set, UTF8. Again, this was a poor naming choice; it has nothing to do with UTF-8, and certainly doesn’t mean the string is in UTF-8 (although Devel::Peek merrily contributes to the confusion by printing the string in UTF-8 characters). Rather, it lets Perl know that it should be manipulating the string using Unicode rules.

But now that Devel::Peek knows that it’s a Unicode string, it changes the PV output: the first string that was just a bunch of bytes is now printed as UTF-8 (\303\251 is the UTF-8 encoding of \351; in hex, \x{c3}\x{a9}) even though the string isn’t actually encoded in UTF-8, and there’s a new string printed after the first string, preceded by UTF8… which isn’t UTF-8, but is Unicode. \351 is now shown as \x{e9} — in Perl, the two are equivalent, but Devel::Peek is showing that Perl now knows what that character is and how to deal with it.
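
If the Devel::Peek output is hard to read, the same point can be checked from inside Perl. This is only a sketch; the variable names are mine. It compares a plain string with an upgraded copy:

```perl
use strict;
use warnings;

my $plain    = "d\xe9barqu\xe9";
my $upgraded = $plain;
utf8::upgrade($upgraded);      # sets the UTF8 flag on the copy only

printf "flag before: %d, flag after: %d\n",
    utf8::is_utf8($plain) ? 1 : 0, utf8::is_utf8($upgraded) ? 1 : 0;
printf "lengths: %d and %d\n", length($plain), length($upgraded);
print "the two strings compare equal\n" if $plain eq $upgraded;
```

Both strings are eight characters long and compare equal; only the flag differs.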

Let’s see the uppercasing example once more:

$ perl -MDevel::Peek -e 'my $str = "d\xe9barqu\xe9"; utf8::upgrade( $str ); my $cap = uc( $str ); Devel::Peek::Dump( $cap ); warn "$cap\n";'
SV = PV(0x5565e07eeed0) at 0x5565e07ee890
  REFCNT = 1
  FLAGS = (POK,IsCOW,pPOK,UTF8)
  PV = 0x5565e0a119a0 "D\303\211BARQU\303\211"\0 [UTF8 "D\x{c9}BARQU\x{c9}"]
  CUR = 10
  LEN = 12
  COW_REFCNT = 1
DBARQU

The resulting string is still not printing in my terminal, but if you look up \x{c9} you’ll see it is LATIN CAPITAL LETTER E WITH ACUTE, or É.

Now, let’s look at the original string when we convert to UTF-8 (note that I’m using utf8::encode() for convenience; Encode::encode() is usually a better choice):

$ perl -MDevel::Peek -e 'my $str = "d\xe9barqu\xe9"; utf8::encode( $str ); Devel::Peek::Dump $str; warn "$str\n";'
SV = PV(0x55d90baa2ea0) at 0x55d90bace8e8
  REFCNT = 1
  FLAGS = (POK,IsCOW,pPOK)
  PV = 0x55d90bca6300 "d\303\251barqu\303\251"\0
  CUR = 10
  LEN = 12
  COW_REFCNT = 1
débarqué

Unlike before, this produces the right output on my terminal (“débarqué”), because it’s now printing UTF-8.

However, Perl still sees it simply as a bunch of bytes, with no understanding of how to translate it. So attempting to uppercase the string yields:

$ perl -MDevel::Peek -e 'my $str = "d\xe9barqu\xe9"; utf8::encode( $str ); my $cap = uc( $str ); Devel::Peek::Dump $cap; warn "$cap\n";'
SV = PV(0x56371416bed0) at 0x56371416b890
  REFCNT = 1
  FLAGS = (POK,IsCOW,pPOK)
  PV = 0x5637143977f0 "D\303\251BARQU\303\251"\0
  CUR = 10
  LEN = 12
  COW_REFCNT = 1
DéBARQUé

… it prints correctly on my terminal because it’s encoded as UTF-8, but utf8::encode() does not set the UTF8 flag, so Perl can’t translate it correctly, even though the string is valid UTF-8. For Perl to know how to transform the string, the UTF8 flag needs to be set. But there’s no point in setting the UTF8 flag on a string that we know is already valid UTF-8; Perl will think it’s Unicode, because, once again, UTF-8 isn’t Unicode:

$ perl -MDevel::Peek -e 'my $str = "d\xe9barqu\xe9"; utf8::encode( $str ); utf8::upgrade( $str ); my $cap = uc( $str ); Devel::Peek::Dump $cap; warn "$cap\n";'
SV = PV(0x556a3bcc90f0) at 0x556a3bcf48e0
  REFCNT = 1
  FLAGS = (POK,IsCOW,pPOK,UTF8)
  PV = 0x556a3bef4550 "D\303\203\302\251BARQU\303\203\302\251"\0 [UTF8 "D\x{c3}\x{a9}BARQU\x{c3}\x{a9}"]
  CUR = 14
  LEN = 16
  COW_REFCNT = 1
DéBARQUé

… the result is no different. We are encoding a bunch of bytes that happens to be Unicode into UTF-8. Perl lets us do this, even though we haven’t told Perl it’s actually Unicode. The result is, indeed, valid UTF-8. Next, we tell Perl that it’s Unicode (it’s not; it’s the transformation format, UTF-8). When Devel::Peek prints it now, you can see in the PV that it’s now double-encoded UTF-8 in the first string, and valid UTF-8 in the second string, just displayed as if it were Unicode. Perl therefore sees two Unicode characters (not one UTF-8 character) and finds nothing to change: \x{c3} (Ã) is already uppercase, and \x{a9} (©) has no uppercase form.

Of course, the result is still valid UTF-8, since it hasn’t been changed, and still prints correctly to my terminal.
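
The double-encoding that Devel::Peek shows internally becomes visible in your data if you actually encode twice. A small sketch, using utf8::encode() and utf8::decode() for brevity as above:

```perl
use strict;
use warnings;

my $str = "d\xe9barqu\xe9";    # 8 characters
utf8::encode($str);            # now valid UTF-8 bytes
my $once = $str;
utf8::encode($str);            # encoded again: double-encoded
my $twice = $str;

printf "%d then %d bytes\n", length($once), length($twice);

# One decode undoes one layer of encoding.
utf8::decode($str);
print "back to single-encoded UTF-8\n" if $str eq $once;
```

The byte counts (10, then 14) match the CUR values in the dumps above.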

To get the right result when uppercasing, you need to pass uc() a Unicode string, and only then convert to UTF-8:

$ perl -MDevel::Peek -e 'my $str = "d\xe9barqu\xe9"; utf8::upgrade( $str ); my $cap = uc( $str ); utf8::encode( $cap ); Devel::Peek::Dump $cap; warn "$cap\n";'
SV = PV(0x55f5acdf1ed0) at 0x55f5acdf1890
  REFCNT = 1
  FLAGS = (POK,IsCOW,pPOK)
  PV = 0x55f5ace0aac0 "D\303\211BARQU\303\211"\0
  CUR = 10
  LEN = 12
  COW_REFCNT = 0
DÉBARQUÉ
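
In a real program the input usually arrives as UTF-8 bytes rather than as a Perl literal, so the same idea is more commonly written as decode, manipulate, encode. A minimal sketch, assuming the core Encode module and a hard-coded byte string standing in for real input:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $bytes_in  = "d\xc3\xa9barqu\xc3\xa9";       # UTF-8 bytes, as if read from a file
my $text      = decode( 'UTF-8', $bytes_in );   # bytes to characters
my $upper     = uc $text;                       # Unicode-aware uppercasing
my $bytes_out = encode( 'UTF-8', $upper );      # characters back to bytes

print $bytes_out, "\n";
```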

So there you have it. If you happen to be reading text into Perl code (from a file, or a database, or the Internet), you will need to decode any UTF-8 encodings into Perl’s Unicode, using Encode::decode() or utf8::decode() before working on the string. Or, you know, use a well-written Perl module that does this for you. This may involve a little digging: if you’re using LWP, you will need to use decoded_content() instead of content().
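
For files, one common way to have the decoding done for you is an encoding layer on open(). The sketch below uses an in-memory filehandle as a stand-in for a real file, so that it is self-contained; with a real file you would pass a filename instead of the scalar reference.

```perl
use strict;
use warnings;

# Stand-in for a real file: an in-memory filehandle over UTF-8 bytes.
my $file_bytes = "d\xc3\xa9barqu\xc3\xa9\n";

open my $fh, '<:encoding(UTF-8)', \$file_bytes
    or die "Cannot open in-memory file: $!";
my $line = <$fh>;      # the layer decodes UTF-8 as the line is read
close $fh;
chomp $line;

printf "%d characters read\n", length($line);
```

The line arrives as eight characters rather than ten bytes, ready for uc() and friends.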

And if you’re writing any text out (to a file, or a database, or the Internet), you will need to encode any Unicode into a character encoding, probably UTF-8 (but perhaps ASCII with HTML entities), for viewing at the other end. The JSON module’s encode_json() will do this for you, for example, although you may need to set an HTTP header to identify that you’re outputting UTF-8. And, on the command line, perl -COE will encode your output (including errors) as UTF-8.
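
As an illustration of a module doing the encoding for you, here is a sketch using encode_json(). I’m assuming JSON::PP here, since it ships with Perl and offers the same function; the hash key is made up for the example.

```perl
use strict;
use warnings;
use JSON::PP qw(encode_json);

my $text = "d\xe9barqu\xe9";                    # characters, not yet encoded
my $json = encode_json( { word => $text } );    # returns UTF-8 encoded bytes

# The accented characters take two bytes each in the output.
printf "%d characters in, %d bytes of JSON out\n", length($text), length($json);
```

The returned value is already bytes, so it should be written out as-is and not encoded a second time.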

Once your text is coming in as Unicode before you manipulate it, most character set issues will fall away. The few that remain tend either to happen because you’re not outputting UTF-8; you’re converting to UTF-8 too early; or you’ve combined non-decoded bytes with a decoded string. Tools like Devel::Peek, used with care, can help you figure out where your problem lies.
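
The last of these is the easiest to do by accident, so here is a sketch of what it looks like (again assuming the core Encode module; both strings are invented for the example):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $decoded = decode( 'UTF-8', "caf\xc3\xa9" );   # 4 characters
my $raw     = "d\xc3\xa9barqu\xc3\xa9";           # 10 undecoded UTF-8 bytes

my $mixed = "$decoded $raw";     # Perl treats each raw byte as a character
my $out   = encode( 'UTF-8', $mixed );

# The decoded half encodes cleanly; the raw half is now double-encoded.
printf "%d characters in, %d bytes out\n", length($mixed), length($out);
```

Decoding $raw before the concatenation would give 13 characters and 16 bytes, with nothing double-encoded.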
