I am trying to filter out overlapping rows in a large file, with the overlap degree set to 25%: in other words, the number of elements in the intersection of any two rows must be less than 0.25 times the number of elements in their union. If the ratio is greater than 0.25, the later line is deleted. Suppose I have a large file with 1,000,000 rows in total, and the first 5 rows are as follows:
C6 C24 C32 C54 C67
C6 C24 C32 C51 C68 C78
C6 C32 C54 C67
C6 C32 C55 C63 C85 C94 C75
C32 C53 C67
Because the intersection of row 1 and row 2 has 3 elements (C6, C24, C32) and their union has 8 elements (C6, C24, C32, C54, C67, C51, C68, C78), the overlap degree is 3/8 = 0.375 > 0.25, so row 2 is removed. The same happens to rows 3 and 5. The final answer is rows 1 and 4:
C6 C24 C32 C54 C67
C6 C32 C55 C63 C85 C94 C75
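As a quick check of that arithmetic, the overlap degree of rows 1 and 2 can be computed in R (a small illustration with the two rows typed in by hand):

row1 <- c("C6", "C24", "C32", "C54", "C67")
row2 <- c("C6", "C24", "C32", "C51", "C68", "C78")
length(intersect(row1, row2)) / length(union(row1, row2))
# 3 / 8 = 0.375, which is greater than 0.25, so row 2 is removed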
The pseudo-code is as follows:
for i = 1:(n-1)                    # n is the number of rows of the file
    for j = (i+1):n
        if the overlap degree between row i and row j is greater than 0.25,
        then delete row j from the file
    end
end
My R code is:
con <- file("inputfile.txt", "r")
fileConn <- file("outputfile025.txt", "w")
data <- readLines(con, n = 1)
con1 <- strsplit(data, "\t")
writeLines(data, fileConn)                 # always keep the first row
for (i in 2:1000000) {
  data <- readLines(con, n = 1)
  con2 <- strsplit(data, "\t")
  overlap <- length(intersect(con1[[1]], con2[[1]])) /
             length(union(con1[[1]], con2[[1]]))
  if (overlap <= 0.25) {                   # keep rows that overlap row 1 by at most 25%
    writeLines(data, fileConn)
  }
}
close(con)
close(fileConn)
The problem with the code above is that it only filters overlaps between row 1 and every other row. How do I also filter overlaps between row 2, row 3, ... and all the remaining rows? Does anyone have an idea about solving this problem? Thanks!
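For reference, a direct in-memory translation of the pseudo-code above might look like the following sketch (it assumes the whole file fits in memory and that the fields in each row are tab-separated; the file names are the same placeholders used above):

rows <- strsplit(readLines("inputfile.txt"), "\t")
keep <- rep(TRUE, length(rows))
for (i in seq_along(rows)) {
  if (!keep[i]) next                          # row i has already been removed
  for (j in seq_along(rows)) {
    if (j <= i || !keep[j]) next
    overlap <- length(intersect(rows[[i]], rows[[j]])) /
               length(union(rows[[i]], rows[[j]]))
    if (overlap > 0.25) keep[j] <- FALSE      # overlap degree too high: drop row j
  }
}
writeLines(sapply(rows[keep], paste, collapse = "\t"), "outputfile025.txt")

This is O(n^2) in the number of rows, so it only illustrates the pairwise logic; it will be very slow for 1,000,000 rows.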
One solution is agrep, which does approximate pattern matching (here, matching one element of your list within the other elements of the list) using the normalized Levenshtein edit distance:

max.distance = list(deletions = 0.25)
For example, looping through your data using lapply:
res <- lapply(seq_along(ll), function(x) {
  res <- agrep(pattern = ll[x],                       # for each string
               x = ll[-x],                            # I search within the other strings
               value = FALSE,
               max.distance = list(deletions = 0.25)) # the allowed Levenshtein deletions
  if (length(res) == 0) NA else res
})

Then you can remove the duplicated and NA (not available) values:

ll[!is.na(res) & !duplicated(res)]
# "c6 c24 c32 c54 c67"  "c6 c32 c54 c67"  "c6 c32 c55 c63 c85 c94 c75"
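To see what a single call does here, this is an illustrative (hypothetical) call matching the first element of ll against the other four; agrep returns the indices of the elements of x that approximately match the pattern, and a fractional deletions value is interpreted as a fraction of the pattern length:

# illustrative call, not part of the original answer
agrep(pattern = ll[1],                        # "c6 c24 c32 c54 c67"
      x = ll[-1],                             # the other four strings
      max.distance = list(deletions = 0.25))  # allow deletions up to 25% of the pattern length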
PS: Here ll is:

ll <- c("c6 c24 c32 c54 c67",
        "c6 c24 c32 c51 c68 c78",
        "c6 c32 c54 c67",
        "c6 c32 c55 c63 c85 c94 c75",
        "c32 c53 c67")