Converting a UTF-16 file to UTF-8
Lorien recently received a bunch of SVG files encoded in UTF-16, which her Aptana editor didn’t support well. She noticed because every time she wanted to move to the next character, she had to press the right arrow key twice.
There were a few interesting properties of a UTF-16 file that we discovered by reading this post:
- This encoding is often used on Windows systems
- UTF-16 uses two bytes for most characters (and four bytes for characters outside the Basic Multilingual Plane)
- A UTF-16 file often begins with a BOM (Byte Order Mark): the bytes 0xFF 0xFE for little-endian files, or 0xFE 0xFF for big-endian ones
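If you want to see the BOM for yourself, something like the following should do it. This is only a sketch: the file name sample-utf16.svg is made up, and the first command just fabricates a tiny little-endian UTF-16 file so there is something to inspect.

```shell
# Create a small UTF-16LE sample that starts with a BOM (hypothetical file name).
printf '\377\376<\000s\000v\000g\000>\000' > sample-utf16.svg

# Dump the first two bytes as hex; a little-endian BOM shows up as "fffe".
bom=$(head -c 2 sample-utf16.svg | od -An -tx1 | tr -d ' \n')
echo "$bom"
```

The alternating zero bytes after the BOM are also why the editor needed two key presses per character: each ASCII character is followed by a 0x00 byte.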
She wanted to run a regular expression over the files, but it didn’t match because each character was stored as two bytes. The files appeared to contain only ASCII characters, with nothing special in them, so to fix her issue all we needed to do was convert the files to UTF-8.
We discovered a Unix utility program called iconv that helps with character set conversion. A basic run looks like:
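A typical invocation, as a minimal sketch: input.svg and output.svg are placeholder names, and the printf line only exists to create a small UTF-16 file so the example can be run on its own.

```shell
# Build a tiny UTF-16LE input with a BOM so the example is self-contained.
printf '\377\376<\000s\000v\000g\000/\000>\000\n\000' > input.svg

# -f names the source encoding, -t the target encoding.
# iconv writes the converted text to stdout, so redirect it to a new file.
iconv -f UTF-16 -t UTF-8 input.svg > output.svg

cat output.svg
```

With `-f UTF-16`, iconv uses the BOM to work out the byte order, so we didn't have to say whether the files were little-endian or big-endian.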
After writing a simple shell script to convert all the files, her problem was solved.
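The original script isn't reproduced here, but a rough equivalent might look like the sketch below. The function name convert_svgs and the temporary-file suffix are our own inventions for illustration, and it assumes every .svg file in the directory is still UTF-16.

```shell
#!/bin/sh
# Convert every .svg file in a directory from UTF-16 to UTF-8, in place.
# convert_svgs is a hypothetical helper name; the directory defaults to ".".
convert_svgs() {
    dir=${1:-.}
    for f in "$dir"/*.svg; do
        [ -e "$f" ] || continue          # the glob matched nothing
        tmp="$f.utf8.tmp"                # write to a temp file first
        if iconv -f UTF-16 -t UTF-8 "$f" > "$tmp"; then
            mv "$tmp" "$f"               # replace the original only on success
        else
            echo "failed to convert $f" >&2
            rm -f "$tmp"
        fi
    done
}
```

Calling `convert_svgs .` would convert the files in the current directory. Running it a second time on files that are already UTF-8 would likely mangle them or fail, so it is worth keeping a backup copy first.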
One Comment
Thank you for your shell scripting foo Pablo!