
Find out if CSV file contains empty field in Ruby?


File.new(filename).grep(/(^,|,(,|$))/)

require 'csv'

CSV.foreach('/tmp/big.csv') do |row|
  row.each do |column|
    # CSV yields nil for an empty, unquoted field
    unless column
      puts "empty field"
    end
  end
end

require 'csv'

File.new("/tmp/big.csv").grep(/(^,|,(,|$))/).each do |row_string|
  CSV.parse(row_string) do |row|
    puts row[1]
  end
end

require 'excelsior'

Excelsior::Reader.rows(File.open('/tmp/big.csv')) do |row|
  row.each do |column|
    unless column
      puts "empty field"
    end
  end
end

As you mentioned in your comment, there are a few more idiomatic ways to write this, such as using each instead of the for loop or using unless instead of if !, and using two spaces for indentation, which will turn it into:

Edit: Having my big file around, I also tested Uri Agassi's approach using grep to get the lines of the file with empty fields:

I tested this code with a file like yours (72 MB, ~30k entries, 2.5k fields). It is about twice as fast, but it segfaults after a few lines, so the gem might not be stable.

It's about 10 times faster. If you need access to the fields you can use CSV.parse:

Otherwise, if you have to parse the whole CSV file anyway, the answer is most likely no. Try running your script without the checking part, just reading the CSV rows. You will see no change in running time, because most of the time is spent reading and parsing the CSV file.
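One way to try this comparison is sketched below. It is only an illustration: the generated sample file, its size, and the variable names are assumptions and have nothing to do with the real 72 MB file, so the timings it prints will differ from yours.

```ruby
require 'benchmark'
require 'csv'
require 'tempfile'

# Hypothetical sample data: 2,000 rows of 50 fields each;
# every 100th row gets one empty field.
file = Tempfile.new(['sample', '.csv'])
2_000.times do |i|
  fields = Array.new(50) { |j| "v#{i}_#{j}" }
  fields[10] = nil if (i % 100).zero?
  file.puts(fields.join(','))
end
file.close

rows_read = 0
empty_rows = 0

# Pass 1: only read and parse the rows.
parse_only = Benchmark.realtime do
  CSV.foreach(file.path) { |_row| rows_read += 1 }
end

# Pass 2: read, parse, and check every row for an empty (nil) field.
parse_and_check = Benchmark.realtime do
  CSV.foreach(file.path) { |row| empty_rows += 1 if row.any?(&:nil?) }
end

puts format('parse only:      %.3fs', parse_only)
puts format('parse and check: %.3fs', parse_and_check)

file.unlink
```

If the two timings come out close to each other, the check itself is not the bottleneck.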

There is a Ruby gem named excelsior which uses a native CSV parser. You can install it with gem install excelsior and use it like this:

You might wonder if there is a faster CSV library for Ruby. There is indeed a gem called FasterCSV, but Ruby 1.9 adopted it as its built-in CSV library, so it probably won't get much faster using Ruby only.




File.new(filename).grep(/(^,|,(,|$))/)
# => all the lines which have an empty field

I'm afraid that you would still have to go over all the files and read them, so it might not be as fast as you would hope, but unless there is some index on the files, I can't see a way around it.

Parsing the CSVs could take a lot of your CPU. If all you want is to get the lines which contain an empty field (i.e. contain ,, start with a , or end with a ,), you can use grep on the raw lines of the files, without actually parsing them:
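To see what the pattern picks up, here is a small sketch that runs the same regular expression over a few in-memory lines. The sample rows are made up, and StringIO is used only as a stand-in for File.new(filename) so the example does not need a file on disk.

```ruby
require 'stringio'

# Same pattern as in the answer: a leading comma, two commas in a row,
# or a comma at the end of the line.
empty_field = /(^,|,(,|$))/

# Hypothetical sample rows; StringIO stands in for File.new(filename).
sample = StringIO.new(<<~ROWS)
  a,b,c
  ,b,c
  a,,c
  a,b,
  "x,,y",b,c
ROWS

# grep walks the input line by line and keeps the lines that match.
matches = sample.grep(empty_field).map(&:chomp)
puts matches
```

The last sample row is matched too, although its two commas sit inside a quoted field. Because the raw lines are never parsed, the pattern cannot tell these apart, so this approach assumes fields do not contain quoted commas.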
