Ruby: Parsing a CSV file using different encodings and libraries?


$ curl -s http://jamesabbottdd.com/examples/testfile.csv | xxd | head -n3
0000000: fffe 4300 6100 6d00 7000 6100 6900 6700  ..C.a.m.p.a.i.g.
0000010: 6e00 0900 4300 7500 7200 7200 6500 6e00  n...C.u.r.r.e.n.
0000020: 6300 7900 0900 4200 7500 6400 6700 6500  c.y...B.u.d.g.e.
CSV.foreach('./testfile.csv', :encoding => 'utf-16le:utf-8') do |row| ...

Both of these appear to work okay.

However, that gives me an "invalid byte sequence in UTF-16LE (ArgumentError)" error coming from inside the CSV library. I think this is because IO#gets, when called from CSV, returns only a single byte when it meets the BOM, which produces invalid UTF-16.

Not only spot on, but very educational as well. Top job - thanks!

The byte order mark ff fe at the start suggests the file encoding is little-endian UTF-16, and the 00 bytes at every other position back this up.
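As a rough illustration of that reasoning, here is a minimal sketch that checks the first two bytes for a UTF-16 BOM and then looks at every second byte. The helper name sniff_utf16_bom and the sample data are made up for this example; they are not part of the original file or of any library.

```ruby
# Hypothetical helper: guess the UTF-16 byte order from the first two bytes.
# Caveat: ff fe 00 00 is also how a UTF-32LE BOM starts; this sketch ignores that.
def sniff_utf16_bom(data)
  head = data.b[0, 2] # String#b returns a binary copy, so we compare raw bytes
  case head
  when "\xFF\xFE".b then 'UTF-16LE'
  when "\xFE\xFF".b then 'UTF-16BE'
  end
end

# Made-up sample that mimics the start of the hex dump: BOM, then "Campaign\t...".
sample = "\uFEFFCampaign\tCurrency\tBudget\n".encode('UTF-16LE').b

guess = sniff_utf16_bom(sample)

# For ASCII-only text in UTF-16LE, the high byte of every pair after the BOM is 00.
high_bytes_zero = sample.bytes.drop(2).each_slice(2).all? { |_low, high| high.zero? }

puts "BOM suggests: #{guess.inspect}, every other byte is 00: #{high_bytes_zero}"
```

This is only a heuristic: a file without a BOM returns nil here, and non-ASCII text will not have 00 in every second byte.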

This would suggest that you should be able to read the file by passing utf-16le as the encoding to CSV.

You can get CSV to strip off the BOM by using bom|utf-16le as the encoding:
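A minimal sketch of the bom| prefix in use, under a few assumptions of mine: the sample rows are invented, the file is written to a temp file rather than fetched from the URL above, col_sep is set to a tab because the hex dump shows 09 bytes between the fields, and :utf-8 is appended as the internal encoding so the rows come back as UTF-8 strings.

```ruby
require 'csv'
require 'tempfile'

# Invented sample: BOM plus two tab-separated lines, stored as UTF-16LE bytes.
sample = "\uFEFFCampaign\tCurrency\tBudget\nSpring\tGBP\t100\n".encode('UTF-16LE')

rows = []
Tempfile.create(['testfile', '.csv']) do |file|
  # Write the bytes exactly as they are, with no newline or encoding conversion.
  File.binwrite(file.path, sample)

  # "bom|" asks Ruby to consume a leading BOM if one is present;
  # ":utf-8" transcodes what is read into UTF-8.
  CSV.foreach(file.path, encoding: 'bom|utf-16le:utf-8', col_sep: "\t") do |row|
    rows << row
  end
end

rows.each { |row| p row }
```

With the BOM consumed on open, the first field of the first row should be plain "Campaign" with no stray U+FEFF character in front of it.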

You might prefer to convert the string to a more familiar encoding instead, in which case you could do:
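To show what that conversion amounts to, here is a small sketch that does the transcoding by hand with String#encode instead of through the CSV encoding option. The sample bytes are invented, and the tab separator is my reading of the 09 bytes in the hex dump. Note that plain transcoding keeps the BOM as a leading U+FEFF character, so this sketch removes it explicitly.

```ruby
require 'csv'

# Invented sample bytes: BOM (ff fe) plus two tab-separated lines in UTF-16LE.
raw = "\uFEFFCampaign\tCurrency\tBudget\nSpring\tGBP\t100\n".encode('UTF-16LE').b

# Label the raw bytes as UTF-16LE, then transcode them to UTF-8.
utf8 = raw.force_encoding('UTF-16LE').encode('UTF-8')

# Transcoding alone does not drop the BOM; it survives as U+FEFF.
had_bom = utf8.start_with?("\uFEFF")
utf8 = utf8.sub(/\A\uFEFF/, '')

table = CSV.parse(utf8, col_sep: "\t")
table.each { |row| p row }
```

Reading the whole file into memory like this is fine for small files; for large ones, letting the IO layer transcode while streaming is probably the better choice.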

Converting the file to UTF-8 first and then reading it also works nicely:

iconv -f utf-16 -t utf8 testfile.csv | ruby -rcsv -e 'CSV(STDIN).each {|row| puts row}'

Iconv seems to understand correctly that the file has a BOM at the start and strips it off when converting.
