Sat, 26 Jul 2008

Dumping UTF-8 Data



While writing Perl6::Str the other day, I found a small script, which I called utf8-dump, very helpful for debugging:

$ echo Überhacker | utf8-dump
\N{LATIN CAPITAL LETTER U WITH DIAERESIS}berhacker

It replaces every non-ASCII character with its Unicode name, in a form that can be used in Perl 5 double-quoted strings, provided that use charnames qw(:full) is loaded first.
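As a minimal sketch of that use, the output line from above can be pasted into a double-quoted string. The variable name $word and the output layer on STDOUT are my own choices for illustration, not part of utf8-dump:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use charnames qw(:full);    # enables \N{...} with full Unicode names

# The escaped form printed by utf8-dump, used as a string literal.
my $word = "\N{LATIN CAPITAL LETTER U WITH DIAERESIS}berhacker";

# Encode on output so the terminal receives UTF-8 bytes.
binmode STDOUT, ':encoding(UTF-8)';

# prints "10 characters, first code point U+00DC"
printf "%d characters, first code point U+%04X\n", length($word), ord($word);
print "$word\n";
```

The string holds one character for the whole \N{...} escape, so length counts 10 characters rather than the 11 bytes of the UTF-8 input.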

And this is how the script looks:

#!/usr/bin/perl
use strict;
use warnings;
use charnames ();
use Encode qw(decode_utf8);

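while (<>){
    # Turn the raw UTF-8 bytes into a Perl character string.
    $_ = decode_utf8($_);
    # Replace every character outside the ASCII range (octal 0 to 177).
    s{([^\0-\177])}{N_escape($1)}eg;
    # After the substitution the line is pure ASCII, so no output layer is needed.
    print;
}

# Return a \N{NAME} escape for a single character,
# or a \x{HEX} escape if the character has no known name.
sub N_escape {
    my $n = charnames::viacode(ord($_[0]));
    return defined($n) ? "\\N{$n}" : sprintf('\x{%x}', ord($_[0]));
}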

(Update 2010-04-19:) Added \x{...} escapes for characters for which charnames::viacode knows no name and returns undef.
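The fallback can be tried in isolation with a rough sketch like the following. The helper escape_for is hypothetical (it mirrors N_escape but takes a code point instead of a character), and the choice of U+0378 as a nameless code point is an assumption: it is unassigned in the Unicode versions I know of, but which code points lack a name depends on the Unicode data shipped with your Perl.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use charnames ();

# Hypothetical helper: same logic as N_escape, but takes a code point.
sub escape_for {
    my ($cp) = @_;
    my $name = charnames::viacode($cp);
    return defined($name) ? "\\N{$name}" : sprintf('\x{%x}', $cp);
}

# A named character yields a \N{...} escape.
print escape_for(0xDC), "\n";

# Assumption: U+0378 has no name, so this should yield a \x{...} escape.
print escape_for(0x378), "\n";
```

Both forms are valid escapes in a double-quoted string, so the output stays usable either way.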
