[Perl] adventures in i18n-land

Today I found out about Unicode normalisation, kind of: the likes of "office", "oﬃce", "oﬀice" and "oﬁce" turn out not to be the same string, but they can be collated together with the joint efforts of Unicode::Normalize and Unicode::Collate :)


What we have here are four strings: the plain spelling, plus three where part or all of the "ffi" combination is replaced by a ligature; viz.:
 office
 o\x{FB03}ce
 o\x{FB00}ice
 of\x{FB01}ce


These can be normalised using Unicode::Normalize:
bash$ perl -MEncode -MUnicode::Normalize -e 'BEGIN { binmode(STDIN, ":utf8") } while (<>) { chomp; my $n = NFKD($_); print encode_utf8("\"$_\" ==> \"$n\"\n") }'
"office" ==> "office"
"oﬃce" ==> "office"
"oﬀice" ==> "office"
"oﬁce" ==> "office"

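For anyone who would rather not pipe the strings in on STDIN, here is a minimal sketch of the same check as a small script. It assumes only the core Unicode::Normalize module; the variable names are mine, made up for illustration. It also runs NFD alongside NFKD, since the ligatures only have compatibility decompositions and so the canonical form leaves them alone.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFKD);

# The four spellings, written with escapes so this file stays plain ASCII.
my @words = ("office", "o\x{FB03}ce", "o\x{FB00}ice", "of\x{FB01}ce");

# NFKD applies compatibility decompositions, which expand the ligatures.
my @nfkd = map { NFKD($_) } @words;

# NFD applies canonical decompositions only, so the ligatures survive.
my @nfd = map { NFD($_) } @words;

for my $i (0 .. $#words) {
    printf "word %d: NFKD gives 'office'? %s; NFD unchanged? %s\n",
        $i,
        ($nfkd[$i] eq "office"   ? "yes" : "no"),
        ($nfd[$i]  eq $words[$i] ? "yes" : "no");
}
```

All four words should report "yes" on both counts, which is why the one-liner reaches for the K form rather than plain NFD.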

and this is automagically done by Unicode::Collate when compiling sort keys:
bash$ perl -MEncode -MUnicode::Collate -e 'BEGIN { binmode(STDIN, ":utf8") } $c = Unicode::Collate->new; while (<>) { chomp; print $c->viewSortKey($_), "\n" }'
[0B4B 0A91 0A91 0AD3 0A3D 0A65 | 0020 0020 0020 0020 0020 0020 | 0002 0002 0002 0002 0002 0002 | FFFF FFFF FFFF FFFF FFFF FFFF]
[0B4B 0A91 0A91 0AD3 0A3D 0A65 | 0020 0020 0020 0020 0020 0020 | 0002 0004 0004 001F 0002 0002 | FFFF FFFF FFFF FFFF FFFF FFFF]
[0B4B 0A91 0A91 0AD3 0A3D 0A65 | 0020 0020 0020 0020 0020 0020 | 0002 0004 0004 0002 0002 0002 | FFFF FFFF FFFF FFFF FFFF FFFF]
[0B4B 0A91 0A91 0AD3 0A3D 0A65 | 0020 0020 0020 0020 0020 0020 | 0002 0002 0004 0004 0002 0002 | FFFF FFFF FFFF FFFF FFFF FFFF]


This produces identical level 1 and level 2 weights for all four strings, with the only difference at level 3. Simply adding "normalization => 'NFKD'" to the collator constructor would remove even that difference.
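To see the effect of the levels without reading sort keys by eye, here is a rough sketch that compares each ligature spelling against plain "office" under three collators. It assumes the documented `level` and `normalization` constructor options and the `eq` method of Unicode::Collate; the hash and variable names are just illustrative.

```perl
use strict;
use warnings;
use Unicode::Collate;

# The three ligature spellings, written with escapes to keep the file ASCII.
my @ligatures = ("o\x{FB03}ce", "o\x{FB00}ice", "of\x{FB01}ce");

my %collator = (
    default => Unicode::Collate->new,                           # all levels, NFD normalization
    level2  => Unicode::Collate->new(level => 2),               # compare levels 1 and 2 only
    nfkd    => Unicode::Collate->new(normalization => 'NFKD'),  # expand ligatures before collating
);

my %all_equal;
for my $name (sort keys %collator) {
    my $c = $collator{$name};
    # Count how many ligature spellings collate as equal to the plain one.
    my $matches = grep { $c->eq("office", $_) } @ligatures;
    $all_equal{$name} = ($matches == @ligatures) ? 1 : 0;
    print "$name: ", ($all_equal{$name} ? "all equal to 'office'" : "not all equal"), "\n";
}
```

Going by the sort keys above, only the default collator should report a difference. Which of the other two to use depends on whether you want to ignore all level 3 distinctions (case included) or only the compatibility ones.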

araqnid
Steve Haslam
