Encoding differences between tr (ubuntu) and tr (mac) -
echo – | tr ’ x
on mac: correctly prints –
on ubuntu 10.10: prints xx�
why?
for reference, first character 'en dash' (u+2013), second 'right single quotation mark' (u+2019)
edit: use perl instead. magic incantation use is:
perl -csd -p -mutf8 -e 'tr/a-zÁÉÍÓÚÀÈÌÒÙÄËÏÖÜÂÊÎÔÛÇa-zäëïöüáéíóúàèìòùâèìòùñçßœ/ /cs'
it means: except letters think of, replace character space, , squeeze spaces.
gnu coreutils tr
has issues multi-byte characters - considers them sequence of single byte characters instead.
taking account, see on linux "expected":
$ echo – | od -t x1 0000000 e2 80 93 0a 0000004 $ echo ’ | od -t x1 0000000 e2 80 99 0a 0000004 $ echo – | tr ’ x | od -t x1 0000000 58 58 93 0a 0000004
edit:
for set of utf-8 compatible utilities, might want have @ heirloom toolchest. contains tr
implementation apparently has utf-8 support, although have not tested it.
Comments
Post a Comment