
Just make your reader subscriptable by wrapping it into a list. Obviously this will break on really large files (see alternatives in the Updates below):

>>> reader = csv.reader(open('big.csv', 'rb'))
>>> lines = list(reader)
>>> print lines[:100]
...

Update 1 (list version): Another possible way would be to just process each chunk as it arrives while iterating over the lines:
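The code for this update isn't included above, but a minimal sketch of the idea, assuming the same csv.reader as in the first example and a hypothetical process_chunk() standing in for the real work, could look like this:

import csv

reader = csv.reader(open('big.csv', 'rb'))
chunk, chunksize = [], 100

def process_chunk(chunk_of_lines):
    # placeholder for whatever you actually do with a chunk
    print chunk_of_lines

for i, line in enumerate(reader):
    chunk.append(line)
    if i % chunksize == chunksize - 1:
        process_chunk(chunk)
        chunk = []  # start a fresh list for the next chunk

if chunk:
    process_chunk(chunk)  # handle the last, possibly shorter chunk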

Update 2 (generator version): I haven't benchmarked it, but maybe you can increase performance by using a chunk generator:

#!/usr/bin/env python

import csv
reader = csv.reader(open('4956984.csv', 'rb'))

def gen_chunks(reader, chunksize=100):
    """ 
    Chunk generator. Take a CSV `reader` and yield
    `chunksize` sized slices. 
    """
    chunk = []
    for i, line in enumerate(reader):
        if (i % chunksize == 0 and i > 0):
            yield chunk
            chunk = []  # rebind rather than clear, so already-yielded chunks stay intact
        chunk.append(line)
    yield chunk

for chunk in gen_chunks(reader):
    print chunk # process chunk

# test gen_chunks on some dummy sequence:
for chunk in gen_chunks(range(10), chunksize=3):
    print chunk # process chunk

# => yields
# [0, 1, 2]
# [3, 4, 5]
# [6, 7, 8]
# [9]

The problem is that subscripting the file forces reading all the lines of the file. This is a really huge file, and memory usage rises too much if I do that.

@Mario: Added a generator version, which might be faster (but I didn't have time to test it - maybe you do).

@Mario: Wah, that's irritating. Here is another gist (gist.github.com/820490), just tried it myself with python 2.5. If that doesn't solve it, I'm out of options (and time ;) for this answer. Good luck!

list - How do you split a csv file into evenly sized chunks in Python?...


From the split(1) documentation:

...
-C, --line-bytes=SIZE
        put at most SIZE bytes of lines per output file
...

The description of this option may not be very obvious, but it seems to cover what you are asking for: the file is split at the latest possible line break before reaching SIZE bytes.
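If you would rather stay in Python than shell out to split, a rough, untested sketch of the same line-preserving behaviour might look like this (split_by_size and the part_ prefix are made-up names, not part of any library):

def split_by_size(path, max_bytes, prefix='part_'):
    # Write consecutive output files of at most max_bytes bytes each,
    # always breaking at a line boundary (mimicking split -C).
    # Caveat: a single line longer than max_bytes is still written whole,
    # whereas split -C would cut it.
    part, written, out = 0, 0, None
    for line in open(path, 'rb'):
        if out is None or written + len(line) > max_bytes:
            if out is not None:
                out.close()
            out = open('%s%04d' % (prefix, part), 'wb')
            part, written = part + 1, 0
        out.write(line)
        written += len(line)
    if out is not None:
        out.close()

split_by_size('big.log', 100 * 1024 * 1024)  # roughly 100 MB pieces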

python - Split large files by size limit without cutting lines - Stack...


There isn't a good way to do this for all .csv files. You should be able to divide the file into chunks using file.seek to skip a section of the file. Then you have to scan one byte at a time to find the end of the row. Then you can process the two chunks independently. Something like the following (untested) code should get you started.

import csv

file_one = open('foo.csv')
file_two = open('foo.csv')
file_two.seek(0, 2)     # seek to the end of the file
sz = file_two.tell()    # fetch the offset
file_two.seek(sz / 2)   # seek back to the middle
ch = ''
while ch != '\n':       # scan forward to the next line break
    ch = file_two.read(1)
    if not ch:          # reached EOF without finding a newline
        break
# file_two is now positioned at the start of a record
segment_one = csv.reader(file_one)
segment_two = csv.reader(file_two)

I'm not sure how you can tell that you have finished traversing segment_one. If you have a column in the CSV that is a row id, then you can stop processing segment_one when you encounter the row id from the first row in segment_two.
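A rough sketch of that stopping condition, assuming (hypothetically) that the row id is the first column and that process_row() stands in for the per-row work:

boundary_row = next(segment_two)   # first complete row of the second half
boundary_id = boundary_row[0]      # assumes the row id is in column 0

for row in segment_one:
    if row[0] == boundary_id:
        break                      # segment_two takes over from here
    process_row(row)

# Note: boundary_row has already been pulled off segment_two above,
# so the pass over segment_two needs to handle it separately.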

list - How do you split a csv file into evenly sized chunks in Python?...


You are almost certainly limited by hardware here; a Python or Perl implementation is not likely to work around this.

If you are limited by CPU, then using Python or Perl bindings to the same compression libraries won't make any difference.

If you are limited by disk IO, then using Python or Perl IO operations won't make your disks any faster.

python - Efficient way to split files based on size - Stack Overflow


I'm not sure what you mean by "then again by the parser". After the splitting has been done, there's no further traversal of the string, only a traversal of the list of split strings. This will probably be the fastest way to accomplish this, so long as your string isn't absolutely huge. The fact that Python uses immutable strings means that you must always create a new string, so this has to be done at some point anyway.
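(The split-into-a-list approach being discussed is presumably along these lines:)

for line in myString.splitlines():
    do_something_with(line)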

If your string is very large, the disadvantage is in memory usage: you'll have the original string and a list of split strings in memory at the same time, roughly doubling the memory required. An iterator approach can save you this, building each line as needed, though it still pays the "splitting" penalty. However, if your string is that large, you generally want to avoid having even the unsplit string in memory. It would be better just to read the string from a file, which already lets you iterate through it line by line.
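A minimal sketch of such an iterator, assuming a plain str with '\n' line endings (iter_lines is just an illustrative name):

def iter_lines(s):
    # Yield one line of s at a time without building a list of all lines.
    start = 0
    while True:
        end = s.find('\n', start)
        if end == -1:
            if start < len(s):
                yield s[start:]    # trailing text without a final newline
            return
        yield s[start:end]
        start = end + 1

for line in iter_lines(myString):
    do_something_with(line)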

However, if you do have a huge string in memory already, one approach is to use StringIO, which presents a file-like interface to a string, including iteration by line (internally using .find to locate the next newline). You then get:

import StringIO
s = StringIO.StringIO(myString)
for line in s:
    do_something_with(line)

python - Iterate over the lines of a string - Stack Overflow


def split(input_list, num_fractions=None, subset_length=None):
    '''
    Given a list/tuple, split the original sequence based on exactly one of
    the two parameters (but not both). Returns a generator.

    num_fractions : number of subsets the original list is to be divided
                    into, of the same size to the extent possible. In case
                    equilength subsets can't be generated, all but the last
                    subset will have the same number of elements.
    subset_length : split on every subset_length elements until the list
                    is exhausted.
    '''
    if not input_list:
        yield input_list  # yield the empty input back; `return <value>` isn't allowed in a Python 2 generator
    elif not bool(num_fractions) ^ bool(subset_length):  # exactly one must be truthy; 0 and None both count as "not provided"
        raise Exception("Exactly one of num_fractions, subset_length must be provided")
    else:
        if num_fractions:  # calculate subset_length in this case
            subset_length = max(len(input_list) / num_fractions, 1)

        for start in xrange(0, len(input_list), subset_length):
            yield input_list[start:start + subset_length]



>>> list(list_split.split((2, 2, 10, 10, 344, 344, 45, 43, 2, 2, 10, 10, 12, 8, 2, 10), subset_length=4))
[(2, 2, 10, 10), (344, 344, 45, 43), (2, 2, 10, 10), (12, 8, 2, 10)]

The code is longer than the solutions given above, but covers all possible sequence split conditions.
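For what it's worth, the num_fractions path works the same way; with the 16-element tuple from the example above, asking for four fractions should give the same chunks:

>>> list(list_split.split((2, 2, 10, 10, 344, 344, 45, 43, 2, 2, 10, 10, 12, 8, 2, 10), num_fractions=4))
[(2, 2, 10, 10), (344, 344, 45, 43), (2, 2, 10, 10), (12, 8, 2, 10)]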

Thanks, but the one above works just fine. Saving the code as it could be very useful in the right place.

python - Spliting a long tuple into smaller tuples - Stack Overflow


zipsplit (part of Info-ZIP) is available on most *nix distributions.

zipsplit - split a zipfile into smaller zipfiles

Or, if using split:

split -b 1024m file file.part

which leaves the original file plus the pieces:

file
file.partaa
file.partab

In order to put the pieces back together (and decompress them, since the original file is assumed here to be gzip-compressed), do

cat file.part* | gzip -dc > outfile

@Mari see edit above

python - Efficient way to split files based on size - Stack Overflow
