Monday 15 September 2014

performance - fast method in Python to split a large text file using the number of lines as an input variable


I am splitting a text file using the number of lines as the split variable. I wrote this function to save the chunks in a temporary directory; each output file is expected to hold 4 million lines:

    import os
    import tempfile
    from itertools import groupby, count

    temp_dir = tempfile.mkdtemp()

    def tempfile_split(filename, temp_dir, chunk=4000000):
        with open(filename, 'r') as datafile:
            groups = groupby(datafile, key=lambda line, c=count(): next(c) // chunk)
            for k, group in groups:
                output_name = os.path.normpath(
                    os.path.join(temp_dir, 'tempfile_%s.tmp' % k))
                for line in group:
                    with open(output_name, 'a') as outfile:
                        outfile.write(line)
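For readers unfamiliar with the trick in the key function: each call to next(c) returns the next line index, so integer division by chunk maps the first chunk lines to key 0, the next chunk to key 1, and so on, which is what groupby splits on. A minimal sketch with hypothetical toy data (Python 2.7, matching the question, with a chunk size of 2):

    from itertools import groupby, count

    lines = ['a\n', 'b\n', 'c\n', 'd\n', 'e\n']
    counter = count()
    for k, group in groupby(lines, key=lambda line: next(counter) // 2):
        print k, list(group)
    # 0 ['a\n', 'b\n']
    # 1 ['c\n', 'd\n']
    # 2 ['e\n']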

The main problem is the speed of the function. On Windows with Python 2.7, it takes about 30 minutes to split a file of 8 million lines into two files of 4 million lines each.

    for line in group:
        with open(output_name, 'a') as outfile:
            outfile.write(line)

opens the file and writes a single line for each line in the group, which is slow.

Instead, open the file once per group and write the whole group in one call:

    with open(output_name, 'a') as outfile:
        outfile.write(''.join(group))
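Putting the two together, a minimal sketch of the revised function, based on the question's code with the answer's fix applied (names and chunk size kept as in the question):

    import os
    import tempfile
    from itertools import groupby, count

    def tempfile_split(filename, temp_dir, chunk=4000000):
        with open(filename, 'r') as datafile:
            groups = groupby(datafile, key=lambda line, c=count(): next(c) // chunk)
            for k, group in groups:
                output_name = os.path.normpath(
                    os.path.join(temp_dir, 'tempfile_%s.tmp' % k))
                # open once per chunk and write all of its lines in one call
                with open(output_name, 'a') as outfile:
                    outfile.write(''.join(group))

Note that ''.join(group) materializes a whole chunk in memory (roughly the size of one output file); outfile.writelines(group) would avoid that while still keeping a single open per chunk.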
