
What is the fastest / most efficient way in Python to loop through a large collection of files and save a plot of the data?


import multiprocessing

from numpy import loadtxt, absolute, median, fft
from matplotlib.pyplot import figure, plot, xlim, ylim, savefig, close

def do_transformation(filename):
    # Load the time and signal columns from the data file
    t, f = loadtxt(filename, unpack=True)

    # Sample spacing, magnitude spectrum, and frequency axis
    dt = t[1] - t[0]
    fou = absolute(fft.fft(f))
    frq = absolute(fft.fftfreq(len(t), dt))

    # Clip the y-axis at a multiple of the median amplitude
    ymax = median(fou) * 30

    figure(figsize=(15, 7))
    plot(frq, fou, 'k')

    xlim(0, 400)
    ylim(0, ymax)

    # Save the plot next to the data file, then release the figure
    iname = filename.replace('.dat', '.png')
    savefig(iname, dpi=80)
    close()

# filelist is assumed to be the list of .dat files to process
pool = multiprocessing.Pool(multiprocessing.cpu_count())
for filename in filelist:
    pool.apply_async(do_transformation, (filename,))
pool.close()
pool.join()

Have you considered using the multiprocessing module to parallelize processing the files, as in the example above? Assuming that you're actually CPU-bound here (meaning it's the Fourier transform that's eating up most of the running time, not reading/writing the files), that should speed up execution time without actually needing to speed up the loop itself. The example is untested, but it should give you the idea.
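
As a side note (not part of the original answer), a minimal variant of the same idea using pool.map instead of fire-and-forget apply_async calls: map blocks until every file is processed and re-raises any exception that occurs inside a worker, so failures aren't silently swallowed. It assumes the do_transformation function and filelist from the code above.

import multiprocessing

if __name__ == '__main__':
    # Reuses do_transformation and filelist defined above
    pool = multiprocessing.Pool(multiprocessing.cpu_count())
    try:
        # Blocks until all files are done; propagates worker exceptions
        pool.map(do_transformation, filelist)
    finally:
        pool.close()
        pool.join()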

I just selected your comment as the answer, since it effectively sped up the program almost 8x (the number of CPUs). But if you could, I have another question. There are a handful of files here that are quite substantial and take quite a long time to process. Is there a way to assign multiple processors to the same task instead of applying them to separate files?

There's no simple tweak to say "throw more CPUs at this task". You'd need to refactor the code to break your worker method up into smaller pieces that multiple processes can work on at the same time, and then pull the results back together once all the pieces are ready. For example, it looks like fou = absolute(... and frq = absolute(... could be calculated in parallel. You have to be careful, though, because passing large amounts of data between processes can be slow. It's hard for me to say exactly what kind of changes you could make, because I don't really understand the algorithms you're using.
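
To make the comment above concrete, here is a rough, untested sketch of what splitting the work for one large file might look like: the two array computations are submitted as separate tasks to a top-level pool and collected afterwards. The helper names (fft_magnitude, freq_axis, do_one_big_file) and the file name are made up for illustration; this cannot be launched from inside a function that is itself running as a Pool worker, and the arrays get pickled when they cross process boundaries, which can eat up the gains.

import multiprocessing
from numpy import loadtxt, absolute, fft

def fft_magnitude(f):
    # Magnitude spectrum of the signal column (hypothetical helper)
    return absolute(fft.fft(f))

def freq_axis(n, dt):
    # Frequency axis for n samples with spacing dt (hypothetical helper)
    return absolute(fft.fftfreq(n, dt))

def do_one_big_file(filename, pool):
    t, f = loadtxt(filename, unpack=True)
    dt = t[1] - t[0]
    # Submit both computations, then wait for their results
    fou_result = pool.apply_async(fft_magnitude, (f,))
    frq_result = pool.apply_async(freq_axis, (len(t), dt))
    fou = fou_result.get()
    frq = frq_result.get()
    # ... plotting and saving as in do_transformation above ...
    return fou, frq

if __name__ == '__main__':
    pool = multiprocessing.Pool(2)
    fou, frq = do_one_big_file('big_file.dat', pool)  # placeholder file name
    pool.close()
    pool.join()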

You may need to tweak what work actually gets done in the worker processes. Trying to parallelize the disk I/O portions may not help you much (or even hurt you), for example.
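
Before restructuring anything, it may be worth measuring where the time actually goes. A minimal, untested timing sketch along those lines (the function name and print format are just placeholders) that separates the read, the transform, and the plot/save steps for a single file:

import time
from numpy import loadtxt, absolute, median, fft
from matplotlib.pyplot import figure, plot, xlim, ylim, savefig, close

def time_one_file(filename):
    t0 = time.time()
    t, f = loadtxt(filename, unpack=True)   # disk read
    t1 = time.time()
    dt = t[1] - t[0]
    fou = absolute(fft.fft(f))              # CPU-bound transform
    frq = absolute(fft.fftfreq(len(t), dt))
    t2 = time.time()
    figure(figsize=(15, 7))
    plot(frq, fou, 'k')
    xlim(0, 400)
    ylim(0, median(fou) * 30)
    savefig(filename.replace('.dat', '.png'), dpi=80)
    close()
    t3 = time.time()
    print('read %.2fs  fft %.2fs  plot+save %.2fs' % (t1 - t0, t2 - t1, t3 - t2))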

Hmmm. Could you elaborate a little more? I'm in a bit of a time crunch with this program. At the rate I'm going, it looks like another day or two before finishing, and I'm really shooting for 12-18 hours.
