Monday 15 April 2013

How to filter overlapping rows in a big file with R


I am trying to filter overlapping rows in a large file, with the overlap degree threshold set to 25%. In other words, for any two rows that are kept, the number of elements in their intersection must be less than 0.25 times the number of elements in their union. A row is removed when its overlap degree is more than 0.25. So if I have a large file with a total of 1,000,000 rows, the first 5 rows are as follows:

c6 c24 c32 c54 c67
c6 c24 c32 c51 c68 c78
c6 c32 c54 c67
c6 c32 c55 c63 c85 c94 c75
c6 c32 c53 c67

The intersection of the 1st and 2nd rows has 3 elements (c6, c24, c32), and their union has 8 elements (c6, c24, c32, c54, c67, c51, c68, c78). Since 3/8 = 0.375 > 0.25, the 2nd row is removed. The same goes for the 3rd and 5th rows. The final answer is the 1st and 4th rows:

c6 c24 c32 c54 c67
c6 c32 c55 c63 c85 c94 c75
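The overlap degree used in this example can be sketched in a few lines of R. This is only an illustration: `overlap_degree`, `row1` and `row2` are hypothetical names that do not appear in the original post, and the rows are assumed to be split on single spaces as displayed here.

```r
# Overlap degree of two rows: |intersection| / |union| of their elements.
# overlap_degree is a hypothetical helper name, not from the original post.
overlap_degree <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

# Rows 1 and 2 of the example, split into their elements.
row1 <- strsplit("c6 c24 c32 c54 c67", " ", fixed = TRUE)[[1]]
row2 <- strsplit("c6 c24 c32 c51 c68 c78", " ", fixed = TRUE)[[1]]

overlap_degree(row1, row2)          # 3 / 8 = 0.375
overlap_degree(row1, row2) > 0.25   # TRUE, so row 2 would be removed
```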

The pseudo-code is as follows:

  for i = 1:(n-1)    # n is the number of rows of the file
      for j = (i+1):n
          if the overlap degree of the ith row and the jth row is greater than 0.25
              then delete the jth row from the file
          end
      end
  end
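A minimal in-memory sketch of this pseudo-code in R might look as follows. It assumes the rows fit in memory as a list of character vectors, and that a row which has already been deleted is not used to delete others. `overlap_degree` and `filter_rows` are hypothetical names introduced for illustration.

```r
# Hypothetical helper: overlap degree = |intersection| / |union|.
overlap_degree <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

# filter_rows is a hypothetical name; rows is a list of character vectors.
filter_rows <- function(rows, threshold = 0.25) {
  n <- length(rows)
  removed <- logical(n)                # FALSE means the row is still in the file
  for (i in seq_len(max(n - 1, 0))) {
    if (removed[i]) next               # assumption: a deleted row deletes nothing
    for (j in (i + 1):n) {
      if (!removed[j] && overlap_degree(rows[[i]], rows[[j]]) > threshold) {
        removed[j] <- TRUE             # "delete the jth row"
      }
    }
  }
  which(!removed)                      # indices of the rows that survive
}

# The first four example rows, split on spaces.
rows <- strsplit(c("c6 c24 c32 c54 c67",
                   "c6 c24 c32 c51 c68 c78",
                   "c6 c32 c54 c67",
                   "c6 c32 c55 c63 c85 c94 c75"), " ", fixed = TRUE)
kept <- filter_rows(rows)              # rows 1 and 4 survive
```

The double loop compares every surviving row with every later row, so the cost grows roughly quadratically with the number of rows; for a file of 1,000,000 rows this sketch is likely to be slow.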

My R code is as follows:

  con <- file("inputfile.txt", "r")
  fileConn <- file("outputfile025.txt", "w")   # open for writing so rows accumulate
  data <- readLines(con, n = 1)                 # first row
  con1 <- strsplit(data, "\t")
  writeLines(data, fileConn)                    # the first row is always kept
  for (i in 2:1000000) {
    data <- readLines(con, n = 1)
    con2 <- strsplit(data, "\t")
    # overlap degree of row i with row 1: |intersection| / |union|
    n_inter <- length(intersect(con1[[1]], con2[[1]]))
    n_union <- length(union(con1[[1]], con2[[1]]))
    if (n_inter / n_union <= 0.25) {
      writeLines(data, fileConn)
    }
  }
  close(con)
  close(fileConn)

The problem is that the code above can only filter overlaps between the 1st row and every other row. How can I also filter overlaps between the 2nd, 3rd, ... rows and every other row? Does anyone have an idea about solving this problem? Thanks!
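One way to extend the single-row comparison to all rows is to stream the file once and compare each new row with every row kept so far. The sketch below is an assumption-laden illustration, not the original poster's method: `overlap_degree` and `filter_file` are hypothetical names, the tab separator is taken from the `strsplit(data, "\t")` call above, and the kept rows are assumed to fit in memory.

```r
# Hypothetical helper: overlap degree = |intersection| / |union|.
overlap_degree <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

# filter_file is a hypothetical name; sep = "\t" follows the strsplit call above.
filter_file <- function(infile, outfile, threshold = 0.25, sep = "\t") {
  con <- file(infile, "r")
  out <- file(outfile, "w")
  on.exit({ close(con); close(out) })
  kept <- list()                           # elements of every row written so far
  repeat {
    line <- readLines(con, n = 1)
    if (length(line) == 0) break           # end of file
    elems <- strsplit(line, sep, fixed = TRUE)[[1]]
    too_close <- FALSE
    for (k in kept) {
      # isTRUE guards against NaN when both rows are empty
      if (isTRUE(overlap_degree(k, elems) > threshold)) {
        too_close <- TRUE
        break
      }
    }
    if (!too_close) {
      kept[[length(kept) + 1]] <- elems
      writeLines(line, out)
    }
  }
  invisible(length(kept))                  # number of rows written
}
```

With the file names from the question, this would be called as `filter_file("inputfile.txt", "outputfile025.txt")`. Each row is compared only with rows that were kept, which matches the pseudo-code if deleted rows are never used for later comparisons; the running time still depends on how many rows survive.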

Here is a solution using agrep, which approximately matches a pattern (here, one element of your list) within the other elements of the list, using the generalized Levenshtein edit distance:

  max.distance = list(deletions = 0.25)

For example, looping through your data using lapply:

  res <- unlist(lapply(seq_along(ll), function(x) {
    res <- agrep(pattern = ll[x],    # for each string
                 ll[-x],             # I search among the other strings
                 value = FALSE,
                 max.distance = list(deletions = 0.25))  # I set the Levenshtein distance
    if (length(res) == 0) NA else res
  }))

You can then remove the duplicated and missing (NA) values, which gives:

  "c6 c24 c32 c54 c67" "c6 c32 c54 c67" "c6 c32 c55 c63 c85 c94 c75"


PS: Here ll is:

  ll    
