Sunday 15 February 2015

python - Read CSV and remove duplicate values based on the values in two (out of many) columns


[Using Python 3] I have a CSV file that I want to read and remove a 'special' case of duplicates from. The script should write the deduplicated rows to a new CSV file while preserving the headers.

This is best explained by example. The CSV file looks something like this:

    id    name    HeaderX    HeaderY    HeaderZ    ...
    1     a       string     float      string     ...
    1     a       string     float      string     ...
    1     a       string     float      string     ...
    2     a       string     float      string     ...
    2     b       string     float      string     ...
    3     a       string     float      string     ...
    4     b       string     float      string     ...
    5     c       string     float      string     ...
    6     d       string     float      string     ...
    ...   ...     ...        ...        ...        ...

Here there are duplicate rows for id = 1 and id = 2. However, I want to keep all duplicate rows where the name is the same. So in this example I want to keep all instances of id = 1, but remove all instances of id = 2. In other words, remove all rows that are duplicates where the name has more than one variant. (Does that make sense?!)
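To make the rule concrete, here is a minimal sketch that finds the ids whose rows carry more than one distinct name. The helper name `ids_with_multiple_names` is made up for illustration, the sample rows are only loosely modelled on the example (the name values are assumed), and the column names `id` and `name` are assumed to match the file's headers.

```python
from collections import defaultdict


def ids_with_multiple_names(rows):
    """Return the set of ids that appear with more than one distinct name."""
    names_by_id = defaultdict(set)
    for row in rows:
        names_by_id[row['id']].add(row['name'])
    return {id_ for id_, names in names_by_id.items() if len(names) > 1}


# Hypothetical sample data (only the id and name columns are shown).
sample = [
    {'id': '1', 'name': 'a'},
    {'id': '1', 'name': 'a'},
    {'id': '1', 'name': 'a'},
    {'id': '2', 'name': 'a'},
    {'id': '2', 'name': 'b'},
    {'id': '3', 'name': 'a'},
]

bad_ids = ids_with_multiple_names(sample)
kept = [row for row in sample if row['id'] not in bad_ids]
print(bad_ids)                  # {'2'}
print(len(sample) - len(kept))  # 2
```

This does not depend on the order of the rows, at the cost of holding the id-to-names mapping in memory.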

Currently I have the following code (below), based on another thread. However, it does exactly the opposite: it removes duplicates based on the two columns, keeping all instances of id = 2 and removing the repeated rows where id = 1.

In addition, ideally I would like the script to print the count of duplicates it removed.

    import csv

    filename = 'test.csv'
    outfile = 'outfile.csv'

    with open(outfile, 'w') as fout:
        writer = None
        entries = set()

        with open(filename, 'r') as fin:
            reader = csv.DictReader(fin)

            if not writer:
                writer = csv.DictWriter(fout, lineterminator='\n', fieldnames=reader.fieldnames)
                writer.writeheader()

            for row in reader:
                # Key on both columns: only the first row per (id, name) pair is written.
                key = (row['id'], row['name'])

                if key not in entries:
                    writer.writerow(row)
                    entries.add(key)

If the rows are sorted by id, you can use the following code.

    import csv
    import itertools
    import operator

    filename = 'test.csv'
    outfile = 'outfile.csv'
    ndups = 0

    with open(filename, 'r') as fin, open(outfile, 'w') as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, lineterminator='\n', fieldnames=reader.fieldnames)
        writer.writeheader()  # keep the original header row in the output

        # groupby yields one group per run of consecutive rows with the same id.
        for _, grp in itertools.groupby(reader, key=operator.itemgetter('id')):
            rows = list(grp)
            # More than one distinct name for this id: drop the whole group.
            if len({row['name'] for row in rows}) > 1:
                ndups += len(rows)
                continue
            writer.writerows(rows)

    print('{} duplicates.'.format(ndups))
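The sorted-by-id condition comes from `itertools.groupby`, which only merges consecutive items that share a key. Below is a small sketch of what happens on unsorted input, together with one possible workaround: sorting the rows first, which assumes the file fits in memory. The sample rows are made up for illustration.

```python
import itertools
import operator

# Hypothetical rows in which the two id == '2' rows are not adjacent.
rows = [
    {'id': '2', 'name': 'a'},
    {'id': '1', 'name': 'a'},
    {'id': '2', 'name': 'b'},
]
get_id = operator.itemgetter('id')

# Unsorted input: each run of equal ids becomes its own group.
unsorted_groups = [
    (key, len(list(grp)))
    for key, grp in itertools.groupby(rows, key=get_id)
]
print(unsorted_groups)  # [('2', 1), ('1', 1), ('2', 1)]

# Sorting by the same key first makes equal ids adjacent.
sorted_groups = [
    (key, len(list(grp)))
    for key, grp in itertools.groupby(sorted(rows, key=get_id), key=get_id)
]
print(sorted_groups)  # [('1', 1), ('2', 2)]
```

The sort here compares the ids as strings, which is enough to make equal ids adjacent even though it is not numeric order.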
