Converting a UTF-16 file to UTF-8

Lorien recently received a bunch of SVG files encoded in UTF-16 that her Aptana editor didn’t support well. She discovered this because every time she needed to move to the next character, she had to press the right arrow twice.

There were a few interesting properties of a UTF-16 file that we discovered by reading this post:

  • This encoding is often used on Windows systems
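  • UTF-16 uses two bytes for most characters, and four bytes (a surrogate pair) for characters outside the Basic Multilingual Plane
  • The beginning of a UTF-16 file usually contains a BOM (Byte Order Mark): the bytes 0xFF 0xFE for little-endian data, or 0xFE 0xFF for big-endian data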
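A quick way to check for the BOM is to look at the first two bytes of a file. The following is a minimal sketch, not something from the original post: the helper names first_two_bytes and utf16_bom are made up for illustration, and it assumes head -c and od are available, as they are on most Unix systems.

```shell
#!/bin/bash
# Print the first two bytes of a file as compact hex, e.g. "fffe".
# (first_two_bytes is a hypothetical helper name.)
first_two_bytes() {
  head -c 2 "$1" | od -An -tx1 | tr -d ' \n'
}

# Report which UTF-16 BOM, if any, a file starts with.
# (utf16_bom is a hypothetical helper name.)
utf16_bom() {
  case "$(first_two_bytes "$1")" in
    fffe) echo "UTF-16LE" ;;       # bytes 0xFF 0xFE
    feff) echo "UTF-16BE" ;;       # bytes 0xFE 0xFF
    *)    echo "no UTF-16 BOM" ;;
  esac
}
```

Calling utf16_bom on one of the SVG files would print UTF-16LE if the file starts with 0xFF 0xFE. Note that this only inspects two bytes, so it is a hint about the encoding rather than proof.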

She wanted to run a regular expression over the files, but it didn’t match anything because every character was stored as two bytes. The files contained only characters in the ASCII range, so to fix her issue all we needed to do was convert the files to UTF-8, where each of those characters takes a single byte.
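To see why a byte-oriented regular expression trips over this kind of text, it can help to look at the raw bytes. The sketch below is only an illustration and assumes iconv and od are installed; utf16le_hex is a hypothetical helper name.

```shell
#!/bin/bash
# Print the UTF-16LE encoding of a string as compact hex.
# (utf16le_hex is a hypothetical helper name used only for this illustration.)
utf16le_hex() {
  printf '%s' "$1" | iconv -f utf-8 -t utf-16le | od -An -tx1 | tr -d ' \n'
  echo
}

utf16le_hex "hi"   # prints 68006900
```

Each ASCII character is followed by a zero byte, so a pattern such as "hi" never finds the two letters next to each other in the raw file.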

We discovered a Unix utility program called iconv that helps with character set conversion. A basic run looks like:

iconv -f utf-16 -t utf-8 inputFile > outputFile

Once we wrote a simple shell script to convert all the files, her problem was solved.

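#!/bin/bash
mkdir -p out
# Let the shell expand the glob itself instead of parsing ls output,
# and quote "$f" so file names containing spaces survive intact.
for f in *.svg; do
  iconv -f utf-16 -t utf-8 "$f" > "out/$f"
done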
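If some of the files might not convert cleanly, a slightly more defensive variant of the loop can report failures instead of silently leaving broken output behind. This is a sketch under assumptions rather than the script we actually ran: convert_svgs is a hypothetical function name, and the source and output directories are passed as arguments.

```shell
#!/bin/bash
# Convert every .svg in a source directory from UTF-16 to UTF-8.
# (convert_svgs is a hypothetical helper; $1 = source dir, $2 = output dir.)
convert_svgs() {
  local src="$1" out="$2" failed=0 f name
  mkdir -p "$out"
  for f in "$src"/*.svg; do
    [ -e "$f" ] || continue            # the glob matched nothing
    name=$(basename "$f")
    if ! iconv -f utf-16 -t utf-8 "$f" > "$out/$name"; then
      echo "conversion failed: $f" >&2
      rm -f "$out/$name"               # do not keep partial output
      failed=1
    fi
  done
  return "$failed"                     # non-zero if any file failed
}
```

Running convert_svgs . out from the directory holding the SVG files would do the same job as the loop above, and the exit status tells you whether every file converted.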

One Comment

  1. lorien
    Posted September 23, 2010 at 4:02 pm | Permalink

    Thank you for your shell scripting foo Pablo!
