Converting a UTF-16 file to UTF-8

By pablokang in Uncategorized. Posted on September 23rd

Lorien recently received a bunch of SVG files encoded in UTF-16 that her Aptana editor didn’t support well. She discovered this because every time she wanted to move to the next character, she had to press the right arrow key twice.

There were a few interesting properties of a UTF-16 file that we discovered by reading this post:

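This encoding is often used on Windows systems
UTF-16 uses two bytes per character for most text (characters outside the Basic Multilingual Plane take four)
A UTF-16 file usually begins with a BOM (Byte Order Mark): the bytes 0xFF 0xFE for little-endian, or 0xFE 0xFF for big-endian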
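
The BOM is easy to inspect from a shell. Here is a minimal sketch, assuming head, od, and tr are available; the function names bom_bytes and describe_bom are made up for this example and are not part of any tool:

```shell
#!/bin/bash

# Print the first two bytes of a file as lowercase hex, e.g. "fffe".
bom_bytes() {
    head -c 2 "$1" | od -An -tx1 | tr -d ' \n'
}

# Guess the UTF-16 byte order from those two bytes.
describe_bom() {
    case "$(bom_bytes "$1")" in
        fffe) echo "UTF-16 little-endian BOM" ;;
        feff) echo "UTF-16 big-endian BOM" ;;
        *)    echo "no UTF-16 BOM" ;;
    esac
}
```

Calling describe_bom on one of the SVG files (say, a hypothetical drawing.svg) should report which byte order it was saved in. A file without a BOM may still be UTF-16, so treat "no UTF-16 BOM" as a hint rather than proof.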

She wanted to run a regular expression over the files, but it didn’t match because every character was stored as two bytes. The files contained only ASCII-range characters, so all we needed to do to fix her issue was convert them to UTF-8, where each of those characters takes a single byte.

We discovered a Unix utility program called iconv that helps with character set conversion. A basic run looks like:

iconv -f utf-16 -t utf-8 inputFile > outputFile
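
To see why the regular expression failed and why the conversion helps, here is a small round trip. It is only a sketch: the sample content and the file names sample16.svg and sample8.svg are invented for the demonstration, and it assumes iconv and grep are installed.

```shell
#!/bin/bash

# Scratch directory for the made-up sample files.
workdir=$(mktemp -d)

# Build a tiny UTF-16 file from ASCII text.
printf '<svg width="10"/>\n' | iconv -f utf-8 -t utf-16 > "$workdir/sample16.svg"

# In UTF-16 the letters of "width" are separated by zero bytes,
# so a plain byte-wise search does not find the word.
before=$(grep -c 'width' "$workdir/sample16.svg" || true)

# Convert to UTF-8 and search again.
iconv -f utf-16 -t utf-8 "$workdir/sample16.svg" > "$workdir/sample8.svg"
after=$(grep -c 'width' "$workdir/sample8.svg" || true)

echo "matching lines before: $before, after: $after"
```

The count of matching lines should go from 0 before the conversion to 1 after it.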

A simple shell script to convert all the files solved her problem:

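#!/bin/bash

# Convert every UTF-16 SVG in the current directory to UTF-8 under out/.
mkdir -p out

# Glob directly instead of parsing ls, and quote "$f" so names with spaces work.
for f in *.svg; do
    [ -e "$f" ] || continue    # no .svg files: the pattern stays unexpanded
    iconv -f utf-16 -t utf-8 "$f" > "out/$f"
done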
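
As a follow-up check, iconv can also be used to confirm that the converted files really decode as UTF-8, since it exits with a non-zero status on invalid input. This is a sketch rather than part of the original script; the helper name is_valid_utf8 is hypothetical, and it assumes the out directory created above.

```shell
#!/bin/bash

# Succeed only if the file decodes cleanly as UTF-8.
is_valid_utf8() {
    iconv -f utf-8 -t utf-8 "$1" > /dev/null 2>&1
}

# Report the status of every converted file.
for f in out/*.svg; do
    [ -e "$f" ] || continue    # nothing converted yet
    if is_valid_utf8 "$f"; then
        echo "ok: $f"
    else
        echo "not valid UTF-8: $f"
    fi
done
```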

By pablokang | Posted in Uncategorized | Tagged Aptana, iconv, shell, SVG, UTF-16 | Comments (1)

One Comment

lorien

Posted September 23, 2010 at 4:02 pm | Permalink

Thank you for your shell scripting foo Pablo!
