Sunday 15 June 2014

unix - Parse CSV file and perform conversion in Linux


I have a large CSV file (several hundred MB) with many columns:

  1;18Jun2013;23:58:58;;;L;OT;;;;;;;;;;;O;F;;;;Ot;H,e,r;;;;;C;O;L;U,m,n of;;;;;

As you can see, the second column is a date, which I need to convert to the format %Y-%m-%d so that it can be inserted into a database and ordered easily. I believe it is easier and faster to transform the raw data now than to do it later in the database.

The main script uses bash for now. For the conversion I have done the following:

  sed -n '2,$p' $TMPF | while read line; do
      start=$(echo "$line" | cut -d\; -f1)
      origdate=$(echo "$line" | cut -d\; -f2)
      # cache date translations, a poor man's hash table
      eval origdateh=h$origdate
      if [ "x${!origdateh}" = "x" ]; then
          # not cached yet, need to call date, then store the result
          datex=$(date -d "$origdate" +%Y-%m-%d)
          eval h$origdate="$datex"
      else
          # cache hit
          datex=$(eval echo \$h$origdate)
      fi
      end=$(echo "$line" | cut -d\; -f3-)
      echo "$start;$datex;$end" >> $TMPF2
  done

I use sed to start at the second line (the first line contains the CSV header). I think all the echo and cut subshells slow the script down considerably, so the "hash table" is not really of much use ...

Who can make this fast?

Do not use a bash script for this; use a Python script instead. It is at least more readable and maintainable, and it will probably be more efficient.

The example code can look like this (untested):

  # file: converter.py
  import datetime
  import sys

  def convert_line(line):
      # Split the line at ';'
      line = line.split(';')
      # Parse the date from the second column, e.g. '18Jun2013'
      date = datetime.datetime.strptime(line[1], '%d%b%Y')
      # Convert to the desired format and replace the item in the line
      line[1] = date.strftime('%Y-%m-%d')
      # Return the changed line
      return ';'.join(line)

  # Each line read from stdin keeps its trailing newline, so write it back as-is
  for line in sys.stdin:
      sys.stdout.write(convert_line(line))

Now you just do:

  cat file.csv | python converter.py > file_converted.csv

Alternative implementation:

  # file: converter_2.py
  import datetime

  def convert_line(line):
      # Split the line at ';'
      line = line.split(';')
      # Parse the date from the second column, e.g. '18Jun2013'
      date = datetime.datetime.strptime(line[1], '%d%b%Y')
      # Convert to the desired format and replace the item in the line
      line[1] = date.strftime('%Y-%m-%d')
      # Return the changed line
      return ';'.join(line)

  with open('file_converted.csv', 'w+') as outfile:
      with open('file.csv') as infile:
          outfile.writelines(convert_line(line) for line in infile)

 If your CSV contains some header lines, then you should not convert them with this function, of course.   
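
To illustrate the point about header lines, here is a minimal sketch that copies the header through unchanged and also keeps the date cache idea from the question. It is untested against the real data. The names `convert_date`, `convert_stream` and the `header_lines` parameter are made up for this example, and it assumes the month abbreviations are English (the default C locale) and that the date is always in the second column.

```python
import io
from datetime import datetime
from functools import lru_cache


@lru_cache(maxsize=None)
def convert_date(raw):
    # Parse a date such as '18Jun2013' and return it as '2013-06-18'.
    # lru_cache remembers earlier results, so a repeated date is parsed only once.
    return datetime.strptime(raw, '%d%b%Y').strftime('%Y-%m-%d')


def convert_stream(infile, outfile, header_lines=1):
    # Copy the first header_lines lines unchanged, then rewrite the
    # second column of every following line.
    for index, line in enumerate(infile):
        if index >= header_lines:
            fields = line.rstrip('\n').split(';')
            fields[1] = convert_date(fields[1])
            line = ';'.join(fields) + '\n'
        outfile.write(line)


# Small in-memory demonstration with invented sample rows
sample = io.StringIO("id;date;time\n1;18Jun2013;23:58:58\n2;18Jun2013;23:59:01\n")
result = io.StringIO()
convert_stream(sample, result)
print(result.getvalue(), end='')
```

For real files, the two `StringIO` objects would be replaced by `open('file.csv')` and `open('file_converted.csv', 'w')`. Since a log-like file tends to repeat the same few dates, the cache should avoid most of the parsing work, though how much that helps would have to be measured.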
