Giuseppe: How to filter overlapping rows in a big file with R

Monday, 15 April 2013

I am trying to filter overlapping rows in a big file in R. The overlap degree is set to 25%; in other words, the number of elements in the intersection of any two rows must be less than 0.25 times the number of elements in their union. If the overlap degree is more than 0.25, one of the rows is deleted. So if I have a big file with 1,000,000 rows in total, the first 5 rows are as follows:

c6 c24 c32 c54 c67
c6 c24 c32 c51 c68 c78
c6 c32 c54 c67
c6 c32 c55 c63 c85 c94 c75
c6 c32 c53 c67

Because the number of elements in the intersection of the 1st and 2nd rows is 3 (c6, c24, c32) and the number of elements in their union is 8 (c6, c24, c32, c54, c67, c51, c68, c78), the overlap degree is 3/8 = 0.375 > 0.25, so the 2nd row is deleted. The same goes for the 3rd and 5th rows. The final answer is the 1st and 4th rows:

c6 c24 c32 c54 c67
c6 c32 c55 c63 c85 c94 c75
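
The arithmetic above can be checked with a few lines of R. This is only a minimal sketch: the helper names split_row and overlap_degree are made up for illustration, and the rows are assumed to be separated by spaces or tabs.

```r
# Split one row into its elements (hypothetical helper, not from the post)
split_row <- function(row) {
  parts <- strsplit(trimws(row), "[ \t]+")[[1]]
  parts[nzchar(parts)]
}

# Overlap degree of two rows: size of the intersection / size of the union
overlap_degree <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

# The five sample rows from the example
rows <- c("c6 c24 c32 c54 c67",
          "c6 c24 c32 c51 c68 c78",
          "c6 c32 c54 c67",
          "c6 c32 c55 c63 c85 c94 c75",
          "c6 c32 c53 c67")
sets <- lapply(rows, split_row)

# Overlap of the 1st row with each of rows 2 to 5
first_vs_rest <- sapply(sets[-1], overlap_degree, a = sets[[1]])
print(first_vs_rest)
```

Only the 4th row (2 shared elements out of 10) stays at or below the 0.25 threshold, which agrees with the expected answer.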

The pseudo-code is as follows:

    for i = 1:(n-1)        # n is the number of rows in the file
      for j = (i+1):n
        if the overlap degree of the i-th row and the j-th row is greater than 0.25
          delete the j-th row from the file
        end
      end
    end
The R code is as follows:

    con <- file("inputfile.txt", "r")
    fileConn <- file("outputfile025.txt", "w")   # open for writing so that lines accumulate
    data <- readLines(con, n = 1)
    con1 <- strsplit(data, "\t")
    writeLines(data, fileConn)                   # the 1st row is always kept
    for (i in 2:1000000) {
      data <- readLines(con, n = 1)
      if (length(data) == 0) break               # stop at the end of the file
      con2 <- strsplit(data, "\t")
      n_intersect <- length(intersect(con1[[1]], con2[[1]]))
      n_union <- length(union(con1[[1]], con2[[1]]))
      if (n_intersect / n_union <= 0.25) {       # keep rows that do not overlap the 1st row
        writeLines(data, fileConn)
      }
    }
    close(con)
    close(fileConn)

The problem is that the code above only filters overlaps between the 1st row and the other rows. How can I also filter overlaps between the 2nd, 3rd, ... rows and all the other rows? Does anyone have an idea how to solve this problem? Thanks!
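
One possible way to extend the comparison beyond the 1st row is sketched below. It is not taken from the post: the function name filter_overlaps is hypothetical, the whole file is assumed to fit in memory, and a deleted row is assumed never to be used for later comparisons (which is how the pseudo-code reads). Each row is compared with every row kept so far and dropped as soon as one overlap exceeds the threshold.

```r
# Greedy filter: keep a row only if its overlap degree with every
# previously kept row is at most `threshold`.
filter_overlaps <- function(rows, threshold = 0.25) {
  # One set of elements per row; rows may be separated by spaces or tabs
  sets <- lapply(strsplit(trimws(rows), "[ \t]+"), unique)
  kept <- integer(0)                      # indices of the rows kept so far
  for (j in seq_along(sets)) {
    ok <- TRUE
    for (i in kept) {
      inter <- length(intersect(sets[[i]], sets[[j]]))
      uni <- length(union(sets[[i]], sets[[j]]))
      # uni > 0 guards against two empty rows (0 / 0)
      if (uni > 0 && inter / uni > threshold) {
        ok <- FALSE
        break
      }
    }
    if (ok) kept <- c(kept, j)
  }
  rows[kept]
}

# The five sample rows from the question
sample_rows <- c("c6 c24 c32 c54 c67",
                 "c6 c24 c32 c51 c68 c78",
                 "c6 c32 c54 c67",
                 "c6 c32 c55 c63 c85 c94 c75",
                 "c6 c32 c53 c67")
result <- filter_overlaps(sample_rows)
print(result)
```

For the real file, something like writeLines(filter_overlaps(readLines("inputfile.txt")), "outputfile025.txt") should work. Note that the number of comparisons grows with the number of kept rows, so in the worst case the run time is quadratic in the number of rows.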
   
Here is a solution using agrep, which does approximate matching of a pattern (here one element of your list) within the other elements of the list, using the generalized Levenshtein edit distance:

    max.distance = list(deletions = 0.25)

For example, looping through your data using lapply:

    res <- unlist(lapply(seq_along(ll), function(x) {
      res <- agrep(pattern = ll[x],                # for each string
                   ll[-x],                         # I search among the other strings
                   value = FALSE,
                   max = list(deletions = 0.25))   # I set the Levenshtein distance
      if (length(res) == 0) NA else res
    }))

You can then remove the duplicated and missing (NA) values from res, which gives:

    "c6 c24 c32 c54 c67"  "c6 c32 c54 c67"  "c6 c32 c55 c63 c85 c94 c75"
PS: here ll is the character vector of the sample rows from the question, one string per row.
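
To get a feel for how max.distance changes what agrep accepts, here is a small sketch on toy strings (not the rows from the question), following the kind of calls shown on the agrep help page. Keep in mind that agrep compares characters, so an edit distance between two row strings is not the same quantity as the intersection/union ratio of their elements.

```r
# Default tolerance: max.distance = 0.1, a fraction of the pattern length
hit_default <- agrep("lasy", "1 lazy 2")

# Forbid substitutions: only the string containing "lasy" exactly still matches
hit_no_sub <- agrep("lasy", c(" 1 lazy 2", "1 lasy 2"),
                    max.distance = list(sub = 0))

# value = TRUE returns the matching strings instead of their indices
hit_values <- agrep("laysy", c("1 lazy", "1", "1 LAZY"),
                    max.distance = 2, value = TRUE)

print(hit_default)
print(hit_no_sub)
print(hit_values)
```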

 




Posted by Unknown at 03:22










