Length of string in Perl independent of character encoding
The length function treats Chinese characters as more than 1 character. How do I determine the length of a string in Perl independent of character encoding (treating Chinese characters as 1 character)?
The length function operates on characters, not octets (aka bytes). The definition of a character depends on the encoding. Chinese characters are still single characters (if the encoding is correctly set!), but they take up more than 1 octet of space. So the length of a string in Perl depends on the character encoding that Perl thinks the string is in; the only string length that is independent of character encoding is the simple byte length.
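To make the character/octet distinction concrete, here is a minimal sketch. It assumes the data arrives as raw UTF-8 octets, and the variable names are only illustrative. The octets are written as escapes so the result does not depend on how the script file itself is saved.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw UTF-8 octets for U+957F (长), written as escapes so the example
# does not depend on the encoding of this source file.
my $octets = "\xE9\x95\xBF";

# Decoding turns the octets into a Perl character string.
my $chars = decode('UTF-8', $octets);

my $octet_count = length($octets);                  # counts octets
my $char_count  = length($chars);                   # counts characters
my $byte_length = length(encode('UTF-8', $chars));  # size when stored as UTF-8

print "octets=$octet_count chars=$char_count utf8_bytes=$byte_length\n";
```

On the undecoded string, length counts octets; on the decoded string it counts characters. Encoding again before calling length gives the byte length for that particular encoding.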
Make sure that the string in question is flagged as UTF-8 and encoded in UTF-8. For example, this yields 3:
$ perl -e 'print length("长")'
whereas this yields 1:
$ perl -e 'use utf8; print length("长")'
as does:
$ perl -e 'use Encode; print length(Encode::decode("utf-8", "长"))'
If you're getting the Chinese characters from a file, make sure that you binmode $fh, ':utf8' the file before reading or writing it; if you're getting the data from a database, make sure the database is returning strings in UTF-8 format (or use Encode to do it for you).
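For the file case, a rough sketch might look like the following. The scratch file is only there to make the example self-contained, and it uses the ':encoding(UTF-8)' layer on open, which is a stricter alternative to ':utf8' that also validates the input; treat that substitution as my assumption rather than a requirement.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a scratch file holding the UTF-8 octets for "长" plus a newline.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;                        # write raw octets, no translation
print {$out} "\xE9\x95\xBF\n";
close $out or die "close failed: $!";

# Reopen with a decoding layer so each line arrives as characters.
open my $in, '<:encoding(UTF-8)', $path or die "cannot open $path: $!";
my $line = <$in>;
close $in;
chomp $line;

my $line_length = length($line);     # counts characters, not octets
print "$line_length\n";
```

Without the layer, the same read would hand back three octets and length would report 3.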
I don't think you have to have everything in UTF-8; you just need to ensure that the string is flagged as having the correct encoding. I'd go with UTF-8 up front (and sideways) though, as that's the lingua franca for Unicode, and it will make things easier if you use it everywhere.
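As an illustration of the non-UTF-8 case, here is a small sketch that round-trips the same character through GB2312, which Encode exposes under the name 'euc-cn'. The choice of GB2312 is just an assumption for the example; any encoding Encode supports would follow the same pattern.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $char = "\x{957F}";    # the character 长 as a Perl character string

# Produce octets in a non-UTF-8 encoding (GB2312, named 'euc-cn' in Encode).
my $gb_octets = encode('euc-cn', $char);

# Decode with the matching encoding name to get characters back.
my $decoded = decode('euc-cn', $gb_octets);

my $gb_octet_count = length($gb_octets);   # octets in the GB2312 form
my $gb_char_count  = length($decoded);     # characters after decoding

print "octets=$gb_octet_count chars=$gb_char_count\n";
```

The octet count differs from the UTF-8 case, but once the string is decoded with the right encoding name, length reports one character either way.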
You might want to spend some time reading the perlunicode man page if you're going to be dealing with non-ASCII data.