Steve Haslam (araqnid) wrote,

[Perl] adventures in i18n-land

Today I found out about Unicode normalisation, kind of: the likes of "office", "oﬃce", "oﬀice" and "ofﬁce" turn out not to be the same string, but they can be collated together through the joint efforts of Unicode::Normalize and Unicode::Collate :)


What we have here is four strings: the plain spelling, plus three in which part of the "ffi" combination has been replaced by a ligature; viz.:
 office
 o\x{FB03}ce
 o\x{FB00}ice
 of\x{FB01}ce


These can be normalised using Unicode::Normalize:
bash$ perl -MEncode -MUnicode::Normalize -e 'BEGIN { binmode(STDIN, ":utf8") } while (<STDIN>) { chomp; my $n = NFKD($_); print encode_utf8("\"$_\" ==> \"$n\"\n") }'
"office" ==> "office"
"oﬃce" ==> "office"
"oﬀice" ==> "office"
"ofﬁce" ==> "office"

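For anyone who would rather not pipe ligatures through a terminal, here is the same check as a small script. It is only a sketch: the codepoints() helper and the two tally hashes are names I made up for illustration, and the strings are written with \x{...} escapes so that everything printed stays plain ASCII. It also runs NFD alongside NFKD, on the assumption (true for these ligatures, which only have compatibility decompositions) that canonical normalisation leaves them untouched.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFKD);

# The four spellings from the post, written with escapes so the
# source file itself stays plain ASCII.
my @words = ("office", "o\x{FB03}ce", "o\x{FB00}ice", "of\x{FB01}ce");

# Hypothetical helper: render a string as a list of code points,
# e.g. "U+006F U+FB03 U+0063 U+0065".
sub codepoints {
    my ($s) = @_;
    return join(' ', map { sprintf('U+%04X', ord($_)) } split(//, $s));
}

# Tally the distinct strings left over after each normalisation form.
my (%nfkd_forms, %nfd_forms);
for my $w (@words) {
    $nfkd_forms{ NFKD($w) }++;    # compatibility decomposition
    $nfd_forms{ NFD($w) }++;      # canonical decomposition only
    printf("%-36s NFKD: %s\n", codepoints($w), codepoints(NFKD($w)));
}

printf("distinct strings after NFKD: %d\n", scalar(keys %nfkd_forms));
printf("distinct strings after NFD:  %d\n", scalar(keys %nfd_forms));
```

The interesting part is the two counts at the end: NFKD should leave a single distinct string, while NFD should still see four.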

and Unicode::Collate automagically does the equivalent when compiling sort keys:
bash$ perl -MEncode -MUnicode::Collate -e 'BEGIN { binmode(STDIN, ":utf8") } $c = Unicode::Collate->new; while (<STDIN>) { chomp; print $c->viewSortKey($_), "\n" }'
[0B4B 0A91 0A91 0AD3 0A3D 0A65 | 0020 0020 0020 0020 0020 0020 | 0002 0002 0002 0002 0002 0002 | FFFF FFFF FFFF FFFF FFFF FFFF]
[0B4B 0A91 0A91 0AD3 0A3D 0A65 | 0020 0020 0020 0020 0020 0020 | 0002 0004 0004 001F 0002 0002 | FFFF FFFF FFFF FFFF FFFF FFFF]
[0B4B 0A91 0A91 0AD3 0A3D 0A65 | 0020 0020 0020 0020 0020 0020 | 0002 0004 0004 0002 0002 0002 | FFFF FFFF FFFF FFFF FFFF FFFF]
[0B4B 0A91 0A91 0AD3 0A3D 0A65 | 0020 0020 0020 0020 0020 0020 | 0002 0002 0004 0004 0002 0002 | FFFF FFFF FFFF FFFF FFFF FFFF]


All four strings produce identical level 1 and level 2 sort keys and differ only at level 3. Simply adding "normalization => 'NFKD'" to the collation constructor would make even that difference disappear.
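
To see what that means for actual comparisons, here is a rough sketch that asks three differently configured collators whether the plain spelling and the "ffi" ligature spelling are equal. The variable names and the %same hash are my own; "level" and "normalization" are constructor options of Unicode::Collate, and eq() is its equality test. The level => 2 collator is an extra I added for comparison: it gets the same effect by ignoring level 3 altogether instead of normalising first.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $plain    = "office";
my $ligature = "o\x{FB03}ce";    # "o" + U+FB03 (ffi ligature) + "ce"

# Default collator: compares all levels, so the level 3 difference counts.
my $default = Unicode::Collate->new;

# Only compare levels 1 and 2, where the sort keys are identical.
my $level2 = Unicode::Collate->new(level => 2);

# Apply NFKD before building sort keys, so the ligature is gone
# by the time the weights are looked up.
my $nfkd = Unicode::Collate->new(normalization => 'NFKD');

# Record whether each collator treats the two spellings as equal.
my %same = (
    default => ($default->eq($plain, $ligature) ? 1 : 0),
    level2  => ($level2->eq($plain, $ligature)  ? 1 : 0),
    nfkd    => ($nfkd->eq($plain, $ligature)    ? 1 : 0),
);

for my $name (qw(default level2 nfkd)) {
    printf("%-8s %s\n", $name, $same{$name} ? "equal" : "different");
}
```

Going by the sort keys above, the default collator should report the two as different and the other two should report them as equal.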