[galaxy-user] fasta_to_tabular.py slowness

Brad Chapman chapmanb at 50mail.com
Mon Jul 20 09:37:41 EDT 2009


Hi all;

Rasmus:
> >> I've found that fasta_to_tabular.py is very slow with big sequences,  
> >> e.g. ~4 minutes for a single 5MB sequence.
[...]
> >> -                fasta_seq = "%s%s" % ( fasta_seq, line )
> >> +                fasta_seq += line

Bob:
> > I suspect an additional improvement would be seen by keeping fasta_seq  
> > as a list of strings, using fasta_seq.append(line), and the catenating  
> > them together with "".join when it's time to output.

Greg:
> Think about memory when you have large files...

The memory usage shouldn't be any different from the current
implementation, since an entire sequence is already read into memory
and then written to the output file. Bob's list/join approach is the
standard way to do this quickly, although in Python 2.5 and above the
in-place concatenation (+=) approach is almost as good. The Python
wiki has a good summary of this common optimization:

http://wiki.python.org/moin/PythonSpeed/PerformanceTips#StringConcatenation
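As a rough sketch of the list/append/join pattern applied to the
record loop -- this is not the actual tool code, and the function and
file names are made up for illustration -- it could look something
like this:

def fasta_to_tabular(in_name, out_name):
    out = open(out_name, "w")
    title = None
    seq_parts = []                 # sequence lines for the current record
    for line in open(in_name):
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if title is not None:  # flush the previous record
                out.write("%s\t%s\n" % (title, "".join(seq_parts)))
            title = line[1:]
            seq_parts = []
        elif line:
            seq_parts.append(line)
    if title is not None:          # flush the last record
        out.write("%s\t%s\n" % (title, "".join(seq_parts)))
    out.close()

Each sequence is still held in memory once, but the per-line cost is a
cheap list append, with a single join when the record is written out.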

Definitely worth adding. If memory is a problem, the code could be
improved to read a fixed number of lines at a time and write them
incrementally to the output file, rather than buffering until a full
sequence record is complete.
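For reference, the simplest case of that incremental idea (writing
each sequence line as soon as it is read, so no full record is ever
buffered) might look roughly like the sketch below; again the names
are hypothetical and this is untested, not the Galaxy implementation:

def fasta_to_tabular_streaming(in_name, out_name):
    out = open(out_name, "w")
    in_record = False
    for line in open(in_name):
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if in_record:
                out.write("\n")           # end the previous record's row
            out.write("%s\t" % line[1:])  # title column, then a tab
            in_record = True
        elif line and in_record:
            out.write(line)               # sequence text, written as read
    if in_record:
        out.write("\n")
    out.close()

The trade-off is that any per-record processing (trimming the title,
column counts, etc.) has to happen before the line is written, since
the full sequence is never available in memory at once.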

Brad


