Does your CSV file contain column headers? If not, explicitly passing header=None to pandas.read_csv can give a noticeable performance improvement for the Python parsing engine (but not for the C engine):

In [1]: np.savetxt('test.csv', np.random.randn(1000, 20000), delimiter=',')

In [2]: %timeit pd.read_csv('test.csv', delimiter=',', engine='python')
1 loops, best of 3: 9.19 s per loop

In [3]: %timeit pd.read_csv('test.csv', delimiter=',', engine='c')
1 loops, best of 3: 6.47 s per loop

In [4]: %timeit pd.read_csv('test.csv', delimiter=',', engine='python', header=None)
1 loops, best of 3: 6.26 s per loop

In [5]: %timeit pd.read_csv('test.csv', delimiter=',', engine='c', header=None)
1 loops, best of 3: 6.46 s per loop

If there are no missing or invalid values, then you can do a little better by passing na_filter=False (only valid for the C engine):

In [6]: %timeit pd.read_csv('test.csv', sep=',', engine='c', header=None)
1 loops, best of 3: 6.42 s per loop

In [7]: %timeit pd.read_csv('test.csv', sep=',', engine='c', header=None, na_filter=False)
1 loops, best of 3: 4.72 s per loop

Specifying the dtype explicitly shaves off a bit more:

In [8]: %timeit pd.read_csv('test.csv', sep=',', engine='c', header=None, na_filter=False, dtype=np.float64)
1 loops, best of 3: 4.36 s per loop

Finally, there's low_memory, which defaults to True and makes the C parser process the file in chunks; disabling it helps here:

In [9]: %timeit pd.read_csv('test.csv', sep=',', engine='c', header=None, na_filter=False, dtype=np.float64, low_memory=True)
1 loops, best of 3: 4.3 s per loop

In [10]: %timeit pd.read_csv('test.csv', sep=',', engine='c', header=None, na_filter=False, dtype=np.float64, low_memory=False)
1 loops, best of 3: 3.27 s per loop

For what it's worth, these benchmarks were all done using the current dev version of pandas (0.16.0-19-g8d2818e).

Huh, that's strange! However, when you enable the various simplifying options that I'm using (including header=None), the C-based parser wins.

I just tried using read_fwf (which you just edited out of your answer?) and it appears to be much slower: around 55 s for 3757 lines with 20,000 columns. I converted to fixed-width format using column -t (I compiled my own version with a much larger max line length setting) and used the settings header=None, engine="c", quoting=csv.QUOTE_NONE, index_col=False.

Yeah, ignore that - read_fwf doesn't do what I thought it did anyway (it's designed for reading tables with whitespace-delimited columns of fixed character width).
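
For reference, a minimal read_fwf sketch (the file name and column widths here are hypothetical, just to show the shape of the call):

import pandas as pd

# read_fwf parses fixed-width text; `widths` gives the character width of
# each field (an assumed layout, not taken from this thread)
df = pd.read_fwf('fixed.txt', widths=[6, 7, 5], header=None)
print(df.head())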

Could you also test with the undocumented low_memory=False? I'm seeing some improvement, but I'm guessing the difference only becomes apparent for really large files.

@moarningsun Good suggestion - that gives another ~25% improvement

python - Why is numpy/pandas parsing of a csv file with long lines so ...

python parsing csv numpy pandas

Another solution, similar to Loki Astari's answer, in C++11. Rows here are std::tuples of a given type. The code scans one line, then scans up to each delimiter, converting and dumping each value directly into the tuple (with a bit of template code).

for (auto row : csv<std::string, int, float>(file, ',')) {
    std::cout << "first col: " << std::get<0>(row) << std::endl;
}
Pros:
  • quite clean and simple to use, only C++11.
  • type conversion into std::tuple<t1, ...> via operator>>.
Cons:
  • no escaping and quoting.
  • no error handling in case of malformed CSV.

#include <iostream>
#include <iterator>
#include <sstream>
#include <string>
#include <tuple>
#include <type_traits>

namespace csvtools {
    /// Read the last element of the tuple without calling recursively
    template <std::size_t idx, class... fields>
    typename std::enable_if<idx >= std::tuple_size<std::tuple<fields...>>::value - 1>::type
    read_tuple(std::istream &in, std::tuple<fields...> &out, const char delimiter) {
        std::string cell;
        std::getline(in, cell, delimiter);
        std::stringstream cell_stream(cell);
        cell_stream >> std::get<idx>(out);
    }

    /// Read the @p idx-th element of the tuple and then calls itself with @p idx + 1 to
    /// read the next element of the tuple. Automatically falls in the previous case when
    /// reaches the last element of the tuple thanks to enable_if
    template <std::size_t idx, class... fields>
    typename std::enable_if<idx < std::tuple_size<std::tuple<fields...>>::value - 1>::type
    read_tuple(std::istream &in, std::tuple<fields...> &out, const char delimiter) {
        std::string cell;
        std::getline(in, cell, delimiter);
        std::stringstream cell_stream(cell);
        cell_stream >> std::get<idx>(out);
        read_tuple<idx + 1, fields...>(in, out, delimiter);
    }
}

/// Iterable csv wrapper around a stream. @p fields the list of types that form up a row.
template <class... fields>
class csv {
    std::istream &_in;
    const char _delim;
public:
    typedef std::tuple<fields...> value_type;
    class iterator;

    /// Construct from a stream.
    inline csv(std::istream &in, const char delim) : _in(in), _delim(delim) {}

    /// Status of the underlying stream
    /// @{
    inline bool good() const {
        return _in.good();
    }
    inline const std::istream &underlying_stream() const {
        return _in;
    }
    /// @}

    inline iterator begin();
    inline iterator end();
private:

    /// Reads a line into a stringstream, and then reads the line into a tuple, that is returned
    inline value_type read_row() {
        std::string line;
        std::getline(_in, line);
        std::stringstream line_stream(line);
        std::tuple<fields...> retval;
        csvtools::read_tuple<0, fields...>(line_stream, retval, _delim);
        return retval;
    }
};

/// Iterator; just calls recursively @ref csv::read_row and stores the result.
template <class... fields>
class csv<fields...>::iterator {
    csv::value_type _row;
    csv *_parent;
public:
    typedef std::input_iterator_tag iterator_category;
    typedef csv::value_type         value_type;
    typedef std::size_t             difference_type;
    typedef csv::value_type *       pointer;
    typedef csv::value_type &       reference;

    /// Construct an empty/end iterator
    inline iterator() : _parent(nullptr) {}
    /// Construct an iterator at the beginning of the @p parent csv object.
    inline iterator(csv &parent) : _parent(parent.good() ? &parent : nullptr) {
        ++(*this);
    }

    /// Read one row, if possible. Set to end if parent is not good anymore.
    inline iterator &operator++() {
        if (_parent != nullptr) {
            _row = _parent->read_row();
            if (!_parent->good()) {
                _parent = nullptr;
            }
        }
        return *this;
    }

    inline iterator operator++(int) {
        iterator copy = *this;
        ++(*this);
        return copy;
    }

    inline csv::value_type const &operator*() const {
        return _row;
    }

    inline csv::value_type const *operator->() const {
        return &_row;
    }

    bool operator==(iterator const &other) {
        return (this == &other) or (_parent == nullptr and other._parent == nullptr);
    }
    bool operator!=(iterator const &other) {
        return not (*this == other);
    }
};

template <class... fields>
typename csv<fields...>::iterator csv<fields...>::begin() {
    return iterator(*this);
}

template <class... fields>
typename csv<fields...>::iterator csv<fields...>::end() {
    return iterator();
}

I put a tiny working example on GitHub; I've been using it for parsing some numerical data and it served its purpose.
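
For completeness, here's a sketch of how the stream in the opening snippet might be wired up (the file name is a placeholder):

#include <fstream>

int main() {
    std::ifstream file("data.csv"); // placeholder path
    for (auto row : csv<std::string, int, float>(file, ',')) {
        std::cout << "first col: " << std::get<0>(row) << std::endl;
    }
    return 0;
}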

You may not care about marking methods inline, because most compilers decide that on their own. At least I'm sure Visual C++ does: it can inline a method regardless of your specification.

That's precisely why I marked them explicitly. GCC and Clang, the ones I mostly use, have their own conventions as well. The inline keyword should be just an incentive anyway.

parsing - How can I read and parse CSV files in C++? - Stack Overflow

c++ parsing text csv

Just use the fgetcsv function for parsing a CSV file:

$row = 1;
if (($handle = fopen("test.csv", "r")) !== FALSE) {
  while (($data = fgetcsv($handle, 1000, ",")) !== FALSE) {
    $num = count($data);
    echo "<p> $num fields in line $row: <br /></p>\n";
    $row++;
    for ($c=0; $c < $num; $c++) {
        echo $data[$c] . "<br />\n";
    }
  }
  fclose($handle);
}

It should be noted that this function does not correctly deal with quotes in CSV. Specifically, it can't handle the example found on Wikipedia: en.wikipedia.org/wiki/Comma-separated_values#Example. There was an open bug about this, but it was closed as "won't fix": bugs.php.net/bug.php?id=50686

Whoever keeps editing this: the link to the manual is there so they can RTFM. Do not delete it or add words that I never said. The point of editing is to correct errors.

How to parse a CSV file using PHP - Stack Overflow

php csv fgetcsv

Instead of parsing the file manually into a DataTable and then doing some LINQ over it, use LINQ directly on the file, using the LINQtoCSV library.

It works pretty well and is very efficient with big files.

1) Add the NuGet package to your project, and the following line to be able to use it:

using LINQtoCSV;

2) Define the class that holds the data

public class IdVolumeNameRow
{
    [CsvColumn(FieldIndex = 1)]
    public string ID { get; set; }

    [CsvColumn(FieldIndex = 2)]
    public decimal Volume { get; set; }

    [CsvColumn(FieldIndex = 3)]
    public string Name { get; set; }
}

3) And search for the value:

var csvAttributes = new CsvFileDescription
{
    SeparatorChar = ':',
    FirstLineHasColumnNames = true
};

var cc = new CsvContext();

var volume = cc.Read<IdVolumeNameRow>(@"C:\IDVolumeName.txt", csvAttributes)
    .Where(i => i.ID == "90")
    .Select(i => i.Volume)
    .FirstOrDefault();

C# Read a particular value from CSV file - Stack Overflow

c# csv

You can use fgetcsv to parse a CSV file without having to worry about parsing it yourself.

Example from the PHP Manual:

$row = 1;
if (($handle = fopen("test.csv", "r")) !== FALSE) {
    while (($data = fgetcsv($handle, 1000, ",")) !== FALSE) {
        $num = count($data);
        echo "<p> $num fields in line $row: <br /></p>\n";
        $row++;
        for ($c=0; $c < $num; $c++) {
            echo $data[$c] . "<br />\n";
        }
    }
    fclose($handle);
}

Is there a similar function for a line?

fgets

@liysd - Yes, you can pass a string to str_getcsv; however, this is only available in PHP 5.3+. The comments section has a replacement, though.

I mean the input is a line, not a file handle.

If the line is input, not from a file, please clarify that in the question.

How to extract data from csv file in PHP - Stack Overflow

php csv split

I've been using the TextFieldParser Class in the Microsoft.VisualBasic.FileIO namespace for a C# project I'm working on. It will handle complications such as embedded commas or fields that are enclosed in quotes etc. It returns a string[] and, in addition to CSV files, can also be used for parsing just about any type of structured text file.
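
A minimal usage sketch (the file path is a placeholder; the members shown are the standard TextFieldParser API):

using Microsoft.VisualBasic.FileIO; // add a reference to Microsoft.VisualBasic

using (var parser = new TextFieldParser(@"C:\data.csv"))
{
    parser.TextFieldType = FieldType.Delimited;
    parser.SetDelimiters(",");
    parser.HasFieldsEnclosedInQuotes = true; // handles embedded commas and quotes
    while (!parser.EndOfData)
    {
        string[] fields = parser.ReadFields();
        // process fields...
    }
}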

Interesting. I've never looked here before. I recently wrote a complete CSV class in C#, and this would have helped. I wound up converting newlines to ~'s, and since commas could only occur in the last field, I used the maxentries parameter of String.Split to capture the entire last field, commas and all. But I'll have to look at this class. Thanks for the link.

c# - Splitting Comma Separated Values (CSV) - Stack Overflow

c# csv

If you can use System.Web.Extensions, something like this could work:

var csv = new List<string[]>(); // or, List<YourClass>
var lines = System.IO.File.ReadAllLines(@"C:\file.txt");
foreach (string line in lines)
    csv.Add(line.Split(',')); // or, populate YourClass          
string json = new System.Web.Script.Serialization.JavaScriptSerializer()
    .Serialize(csv);

You might have more complex parsing requirements for the CSV file, and you might have a class that encapsulates the data from one line, but the point is that you can serialize to JSON with one line of code once you have a collection of lines.

This mostly results in an error if the file is huge, e.g.: "Error during serialization or deserialization using the JSON JavaScriptSerializer. The length of the string exceeds the value set on the maxJsonLength property."
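
If you hit that limit, one workaround (a sketch, assuming you control the serializer instance) is to raise MaxJsonLength from its default of 2097152 characters:

var serializer = new System.Web.Script.Serialization.JavaScriptSerializer
{
    MaxJsonLength = int.MaxValue // default is 2097152 characters
};
string json = serializer.Serialize(csv);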

Converting a csv file to json using C# - Stack Overflow

c# json csv

I wrote this a while back as a lightweight, standalone CSV parser. I believe it meets all of your requirements. Give it a try with the knowledge that it probably isn't bulletproof.

If it does work for you, feel free to change the namespace and use without restriction.

namespace NFC.Portability
{
    using System;
    using System.Collections.Generic;
    using System.Data;
    using System.IO;
    using System.Linq;
    using System.Text;

    /// <summary>
    /// Loads and reads a file with comma-separated values into a tabular format.
    /// </summary>
    /// <remarks>
    /// Parsing assumes that the first line will always contain headers and that values will be double-quoted to escape double quotes and commas.
    /// </remarks>
    public unsafe class CsvReader
    {
        private const char SEGMENT_DELIMITER = ',';
        private const char DOUBLE_QUOTE = '"';
        private const char CARRIAGE_RETURN = '\r';
        private const char NEW_LINE = '\n';

        private DataTable _table = new DataTable();

        /// <summary>
        /// Gets the data contained by the instance in a tabular format.
        /// </summary>
        public DataTable Table
        {
            get
            {
                // validation logic could be added here to ensure that the object isn't in an invalid state

                return _table;
            }
        }

        /// <summary>
        /// Creates a new instance of <c>CsvReader</c>.
        /// </summary>
        /// <param name="path">The fully-qualified path to the file from which the instance will be populated.</param>
        public CsvReader( string path )
        {
            if( path == null )
            {
                throw new ArgumentNullException( "path" );
            }

            FileStream fs = new FileStream( path, FileMode.Open );
            Read( fs );
        }

        /// <summary>
        /// Creates a new instance of <c>CsvReader</c>.
        /// </summary>
        /// <param name="stream">The stream from which the instance will be populated.</param>
        public CsvReader( Stream stream )
        {
            if( stream == null )
            {
                throw new ArgumentNullException( "stream" );
            }

            Read( stream );
        }

        /// <summary>
        /// Creates a new instance of <c>CsvReader</c>.
        /// </summary>
        /// <param name="bytes">The array of bytes from which the instance will be populated.</param>
        public CsvReader( byte[] bytes )
        {
            if( bytes == null )
            {
                throw new ArgumentNullException( "bytes" );
            }

            MemoryStream ms = new MemoryStream();
            ms.Write( bytes, 0, bytes.Length );
            ms.Position = 0;

            Read( ms );
        }

        private void Read( Stream s )
        {
            string lines;

            using( StreamReader sr = new StreamReader( s ) )
            {
                lines = sr.ReadToEnd();
            }

            if( string.IsNullOrWhiteSpace( lines ) )
            {
                throw new InvalidOperationException( "Data source cannot be empty." );
            }

            bool inQuotes = false;
            int lineNumber = 0;
            StringBuilder buffer = new StringBuilder( 128 );
            List<string> values = new List<string>();

            Action endSegment = () =>
            {
                values.Add( buffer.ToString() );
                buffer.Clear();
            };

            Action endLine = () =>
            {
                if( lineNumber == 0 )
                {
                    CreateColumns( values );
                }
                else
                {
                    CreateRow( values );
                }

                values.Clear();
                lineNumber++;
            };

            fixed( char* pStart = lines )
            {
                char* pChar = pStart;
                char* pEnd = pStart + lines.Length;

                while( pChar < pEnd ) // leave null terminator out
                {
                    if( *pChar == DOUBLE_QUOTE )
                    {
                        if( inQuotes )
                        {
                            if( Peek( pChar, pEnd ) == SEGMENT_DELIMITER )
                            {
                                endSegment();
                                pChar++;
                            }
                            else if( !ApproachingNewLine( pChar, pEnd ) )
                            {
                                buffer.Append( DOUBLE_QUOTE );
                            }
                        }

                        inQuotes = !inQuotes;
                    }
                    else if( *pChar == SEGMENT_DELIMITER )
                    {
                        if( !inQuotes )
                        {
                            endSegment();
                        }
                        else
                        {
                            buffer.Append( SEGMENT_DELIMITER );
                        }
                    }
                    else if( AtNewLine( pChar, pEnd ) )
                    {
                        if( !inQuotes )
                        {
                            endSegment();
                            endLine();
                            pChar++;
                        }
                        else
                        {
                            buffer.Append( *pChar );
                        }
                    }
                    else
                    {
                        buffer.Append( *pChar );
                    }

                    pChar++;
                }
            }

            // append trailing values at the end of the file
            if( values.Count > 0 )
            {
                endSegment();
                endLine();
            }
        }

        /// <summary>
        /// Returns the next character in the sequence but does not advance the pointer. Checks bounds.
        /// </summary>
        /// <param name="pChar">Pointer to current character.</param>
        /// <param name="pEnd">End of range to check.</param>
        /// <returns>
        /// Returns the next character in the sequence, or char.MinValue if range is exceeded.
        /// </returns>
        private char Peek( char* pChar, char* pEnd )
        {
            if( pChar + 1 < pEnd )
            {
                return *( pChar + 1 );
            }

            return char.MinValue;
        }

        /// <summary>
        /// Determines if the current character represents a newline. This includes lookahead for two character newline delimiters.
        /// </summary>
        /// <param name="pChar"></param>
        /// <param name="pEnd"></param>
        /// <returns></returns>
        private bool AtNewLine( char* pChar, char* pEnd )
        {
            if( *pChar == NEW_LINE )
            {
                return true;
            }

            if( *pChar == CARRIAGE_RETURN && Peek( pChar, pEnd ) == NEW_LINE )
            {
                return true;
            }

            return false;
        }

        /// <summary>
        /// Determines if the next character represents a newline, or the start of a newline.
        /// </summary>
        /// <param name="pChar"></param>
        /// <param name="pEnd"></param>
        /// <returns></returns>
        private bool ApproachingNewLine( char* pChar, char* pEnd )
        {
            if( Peek( pChar, pEnd ) == CARRIAGE_RETURN || Peek( pChar, pEnd ) == NEW_LINE )
            {
                // technically this cheats a little to avoid a two char peek by only checking for a carriage return or new line, not both in sequence
                return true;
            }

            return false;
        }

        private void CreateColumns( List<string> columns )
        {
            foreach( string column in columns )
            {
                DataColumn dc = new DataColumn( column );
                _table.Columns.Add( dc );
            }
        }

        private void CreateRow( List<string> values )
        {
            if( values.Where( (o) => !string.IsNullOrWhiteSpace( o ) ).Count() == 0 )
            {
                return; // ignore rows which have no content
            }

            DataRow dr = _table.NewRow();
            _table.Rows.Add( dr );

            for( int i = 0; i < values.Count; i++ )
            {
                dr[i] = values[i];
            }
        }
    }
}
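
The answer doesn't include a usage example, so here is a minimal sketch (the file path is a placeholder; note the class is unsafe, so the project must be compiled with /unsafe):

var reader = new NFC.Portability.CsvReader( @"C:\data.csv" );
System.Data.DataTable table = reader.Table;
System.Console.WriteLine( "{0} rows, {1} columns", table.Rows.Count, table.Columns.Count );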

c# - CSV Parsing Options with .NET - Stack Overflow

c# .net parsing

I'd try the EPPlus library for creating your Excel files directly.

It supports charts and has generally worked great for my projects in the past. The easiest way may be to prepare a "template" file in Excel with blank data (just a normal xlsx file) and insert the desired charts and any other required elements. Then you can just open the template file with that library in C#, fill the data sheet with data, and save it as another xlsx file with the actual data.
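
A rough sketch of that flow (the file paths and sheet name are hypothetical):

using System.IO;
using OfficeOpenXml; // EPPlus

using (var package = new ExcelPackage(new FileInfo(@"C:\template.xlsx")))
{
    ExcelWorksheet sheet = package.Workbook.Worksheets["Data"]; // the sheet the charts refer to
    sheet.Cells[1, 1].Value = "Volume"; // fill in the real data
    sheet.Cells[2, 1].Value = 42.0;
    package.SaveAs(new FileInfo(@"C:\report.xlsx"));
}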

Maybe some flag for "recalculation" of the data needs to be set so that it occurs when the file is opened. I don't know exactly for that library, but it was required for another one I used in the past for xls files.

(I suppose you already have your data in your application; if not, check the answers to CSV parser/reader for C#? for parsing CSV.)

I've used this library in the past as well, and it works beautifully. Basically, instead of creating a .csv that you open in Excel, it will create a .xlsx for you with the same data. From there, you can create a graph within that .xlsx based on that data.

c# - Externally create an excel graph from a CSV file - Stack Overflow

c# excel csv graph

I recommend referring to an existing solution rather than reinventing your own (unless you're going for the learning experience!). Parsing CSV is trickier than it seems.

The main trickiness with parsing CSV is figuring out what the exact rules are. There are so many different variants, with slight differences. Actually parsing it once you figured out which rules you need isn't that hard, even when you do it manually.
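
To illustrate the kind of corner case those rules have to cover (a tiny sketch, using the question's space-delimited scenario):

var line = "copy \"my file.txt\" backup";
var naive = line.Split(' ');
// naive is ["copy", "\"my", "file.txt\"", "backup"] - the quoted token is torn apart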

c# - Splitting string by spaces but ignore spaces inside a quote - Sta...

c# regex csv split

I have created specific CSV related functions from which a more general solution can be composed.

It turns out that attempting to parse a CSV file is quite tricky due to anomalies around both the comma (,) and the double quote ("). The rules for a CSV are: if a column value contains either a comma or a double quote, the entire value must be placed in double quotes; and if any double quotes appear in the value, each one must be escaped by inserting an additional double quote in front of it. This is one of the reasons the oft-cited StringOps.split(",") method simply doesn't work unless one can guarantee they will never encounter a file using the comma/double-quote escaping rules. And that's a very unreasonable guarantee.

Additionally, consider that there can be characters between a valid comma separator and the start of a double quote, or between a final double quote and the next comma or the end of the line. The rule to address this is that those outside-the-double-quote-bounds values are discarded. This is yet another reason a simple StringOps.split(",") is not only an insufficient answer, but actually incorrect.

One final note about an unexpected behavior I found using StringOps.split(","). Do you know what value result has in this code snippet?

val result = ",,".split(",")

If you guessed "result references an Array[String] containing three elements, each of which is an empty String", you would be incorrect. result references an empty Array[String]. And for me, an empty Array[String] isn't the answer I was expecting or needed. So, for the love of all that is Holy, please Please PLEASE put the final nail in the StringOps.split(",") coffin!
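
(If you are stuck with split, a sketch of the standard-library workaround: a negative limit disables the trailing-empty-string trimming, though it still does nothing for the quoting rules.)

",,".split(",")      // Array()           - trailing empty strings are dropped
",,".split(",", -1)  // Array("", "", "") - three empty fields, as expected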

So, let's start with the already-read-in file, presented as a List[String]. Below, in object Parser, is a general solution with two functions: fromLine and fromLines. The latter function, fromLines, is provided for convenience and merely maps across the former function, fromLine.

object Parser {
  def fromLine(line: String): List[String] = {
    def recursive(
        lineRemaining: String
      , isWithinDoubleQuotes: Boolean
      , valueAccumulator: String
      , accumulator: List[String]
    ): List[String] = {
      if (lineRemaining.isEmpty)
        valueAccumulator :: accumulator
      else
        if (lineRemaining.head == '"')
          if (isWithinDoubleQuotes)
            if (lineRemaining.tail.nonEmpty && lineRemaining.tail.head == '"')
              //escaped double quote
              recursive(lineRemaining.drop(2), true, valueAccumulator + '"', accumulator)
            else
              //end of double quote pair (ignore whatever's between here and the next comma)
              recursive(lineRemaining.dropWhile(_ != ','), false, valueAccumulator, accumulator)
          else
            //start of a double quote pair (ignore whatever's in valueAccumulator)
            recursive(lineRemaining.drop(1), true, "", accumulator)
        else
          if (isWithinDoubleQuotes)
            //scan to next double quote
            recursive(
                lineRemaining.dropWhile(_ != '"')
              , true
              , valueAccumulator + lineRemaining.takeWhile(_ != '"')
              , accumulator
            )
          else
            if (lineRemaining.head == ',')
              //advance to next field value
              recursive(
                  lineRemaining.drop(1)
                , false
                , ""
                , valueAccumulator :: accumulator
              )
            else
              //scan to next double quote or comma
              recursive(
                  lineRemaining.dropWhile(char => (char != '"') && (char != ','))
                , false
                , valueAccumulator + lineRemaining.takeWhile(char => (char != '"') && (char != ','))
                , accumulator
              )
    }
    if (line.nonEmpty)
      recursive(line, false, "", Nil).reverse
    else
      Nil
  }

  def fromLines(lines: List[String]): List[List[String]] =
    lines.map(fromLine)
}
val testRowsHardcoded: List[String] = {
  val superTrickyTestCase = {
    val dqx1 = '"'
    val dqx2 = dqx1.toString + dqx1.toString
    s"${dqx1}${dqx2}a${dqx2} , ${dqx2}1${dqx1} , ${dqx1}${dqx2}b${dqx2} , ${dqx2}2${dqx1} , ${dqx1}${dqx2}c${dqx2} , ${dqx2}3${dqx1}"
  }
  val nonTrickyTestCases =
"""
,,
a,b,c
a,,b,,c
 a, b, c
a ,b ,c
 a , b , c
"a,1","b,2","c,2"
"a"",""1","b"",""2","c"",""2"
 "a"" , ""1" , "b"" , ""2" , "c"",""2"
""".split("\n").tail.toList
  (superTrickyTestCase :: nonTrickyTestCases.reverse).reverse
}

val parsedLines =
  Parser.fromLines(testRowsHardcoded)
parsedLines.map(_.mkString("|")).mkString("\n")

I visually verified that the tests completed correctly and left me with accurately decomposed raw strings. So I now had what I needed on the input-parsing side and could begin my data refining.

After data refining was completed, I needed to be able to compose output so I could send my refined data back out reapplying all the CSV encoding rules.

So, let's start with a List[List[String]] as the source of the refinements. Below in object Composer is a general solution with two functions; toLine and toLines. The latter function, toLines, is provided for convenience and merely maps across the former function, toLine.

object Composer {
  def toLine(line: List[String]): String = {
    def encode(value: String): String = {
      if ((value.indexOf(',') < 0) && (value.indexOf('"') < 0))
        //no commas or double quotes, so nothing to encode
        value
      else
        //found a comma or a double quote,
        //  so double all the double quotes
        //  and then surround the whole result with double quotes
        "\"" + value.replace("\"", "\"\"") + "\""
    }
    if (line.nonEmpty)
      line.map(encode(_)).mkString(",")
    else
      ""
  }

  def toLines(lines: List[List[String]]): List[String] =
    lines.map(toLine)
}

To validate that the above code works for all the various weird input scenarios, I reused the test cases I used for Parser. Again using the Eclipse ScalaIDE worksheet, I added a bit more code below my existing code where I could visually verify the results. Here's the code I added:

val composedLines =
  Composer.toLines(parsedLines)
composedLines.mkString("\n")
val parsedLines2 =
  Parser.fromLines(composedLines)
parsedLines == parsedLines2

When the Scala worksheet is saved, it executes its contents. The very last line should show a value of "true"; it is the result of round-tripping all the test cases through the parser, through the composer, and back through the parser.

PS. Thanks to @dhg for pointing it out: there is a CSV Scala library which handles parsing CSVs, just in case you want something that is likely more robust and has more options than my Scala code snippets above.

Could you add your source regarding csv escaping rules? I'm not sure that what you describe is universal, so it would be nice to know what software produces the csvs your code parses. I'd assume MS Excel and compatible software? Anyway thanks for documenting this here!

@SillyFreak Tysvm for requesting that. I just appended a paragraph to cover both CSV being a loose definition AND the basis for the rules I used.

Seems like an absurd amount of effort just to avoid using a pre-existing library. Not to mention that an existing library has already thought of all the corner cases and is thoroughly tested.

Thanks for this, @chaotic3quilibrium. I agree this is useful and, moreover, instructive. Is it normal StackOverflow practice though to ask and answer your own question? Do you get points for answering just as if you answered someone else's question?

@Phasmid I don't know, and honestly don't care. It is vastly more costly for me to spend hours looking for a (robust) solution, fixing someone else's broken library (assuming that's even possible), or writing my own solution (only to forget about it when I need it on a future project), than whatever arbitrary non-financial point-based reward system is behind SO (StackOverflow). And SO encourages both posting questions and providing your own answer. I've done this before and ended up choosing a different answer than mine because it was better.

parsing - What's a simple (Scala only) way to read in and then write o...

scala parsing csv

Add "import os" to the list of imports at the top of the file. Then, right after parsing, you can check and set the argument:

if args.outputfile is None:
    args.outputfile = os.path.splitext(args.inputfile)[0] + '.xml'

By the way, arguments default to their long option name. You only need the 'dest' keyword when you want to use a different name. So, for example, 'verbose' could be:

parser.add_argument('-v', '--verbose', action='store_true',
    help='Increases messages being printed to stdout')

EDIT: Here is the example reworked with outputfile handling and with chepner's suggestion to use positional arguments for the file names.

import os
import sys
import argparse
import csv
import indent
from xml.etree.ElementTree import ElementTree, Element, SubElement, Comment, tostring

def get_args(args):
    parser=argparse.ArgumentParser(description='Convert wordlist text files to various formats.', prog='Text Converter')
    parser.add_argument('-v','--verbose',action='store_true',dest='verbose',help='Increases messages being printed to stdout')
    parser.add_argument('-c','--csv',action='store_true',dest='readcsv',help='Reads CSV file and converts to XML file with same name')
    parser.add_argument('-x','--xml',action='store_true',dest='toxml',help='Convert CSV to XML with different name')
    #parser.add_argument('-i','--inputfile',type=str,help='Name of file to be imported',required=True)
    #parser.add_argument('-o','--outputfile',help='Output file name')
    parser.add_argument('inputfile',type=str,help='Name of file to be imported')
    parser.add_argument('outputfile',help='(optional) Output file name',nargs='?')
    args = parser.parse_args(args)
    if not (args.toxml or args.readcsv):
        parser.error('No action requested')
        return None
    if args.outputfile is None:
        args.outputfile = os.path.splitext(args.inputfile)[0] + '.xml'
    return args

def main(argv):
    args = get_args(argv[1:])
    if args is None:
        return 1
    inputfile = open(args.inputfile, 'r')
    outputfile = open(args.outputfile, 'w')
    reader = read_csv(inputfile)
    if args.verbose:
        print ('Verbose Selected')
    if args.toxml:
        if args.verbose:
            print ('Convert to XML Selected')
        generate_xml(reader, outputfile)
    if args.readcsv:
        if args.verbose:
            print ('Reading CSV file')
    return 1 # you probably want to return 0 on success

def read_csv(inputfile):
      return list(csv.reader(inputfile))

def generate_xml(reader,outfile):
    root = Element('Solution')
    root.set('version','1.0')
    tree = ElementTree(root)

    head = SubElement(root, 'DrillHoles')
    head.set('total_holes', '238')

    description = SubElement(head,'description')
    current_group = None
    i = 0
    for row in reader:
        if i > 0:
            x1,y1,z1,x2,y2,z2,cost = row
            if current_group is None or i != current_group.text:
                current_group = SubElement(description, 'hole',{'hole_id':"%s"%i})

                collar = SubElement(current_group, 'collar', {'': ', '.join((x1, y1, z1))})
                toe = SubElement(current_group, 'toe', {'': ', '.join((x2, y2, z2))})
                cost = SubElement(current_group, 'cost', {'': cost})
        i+=1
    indent.indent(root)
    tree.write(outfile)

if (__name__ == "__main__"):
    sys.exit(main(sys.argv))

No, you can change the variables or add new ones.

However, there is an error: global name 'os' not defined

@Andy, Okay, added a comment to "import os".

Yes, I tried that as well, but I also receive: AttributeError: 'file' object has no attribute 'rfind'.

@Andy which version of python are you using? Which line gets the error?

Using argparse to convert csv to xml in python - Stack Overflow

python xml csv argparse