I am trying to read web page source into R and process it as a string. I want to extract the paragraphs and remove the html tags from the paragraph text. I am running into the following problem:
I tried to implement a function to remove the html tags:
cleanFun = function(fullStr) {
  # find the locations of the tags: position after each "<" and before each ">"
  tagLoc = cbind(str_locate_all(fullStr, "<")[[1]][, 2],
                 str_locate_all(fullStr, ">")[[1]][, 1])

  # storage for the tag strings
  tagStrings = list()

  # extract and store each tag string
  for (i in 1:dim(tagLoc)[1]) {
    tagStrings[i] = substr(fullStr, tagLoc[i, 1], tagLoc[i, 2])
  }

  # remove the tag strings from the paragraph
  newStr = fullStr
  for (i in 1:length(tagStrings)) {
    newStr = str_replace_all(newStr, tagStrings[[i]][1], "")
  }
  return(newStr)
}

This works for some tags but not all; an example where it fails is the following string:
test = "junk junk<a href=\"/wiki/Abstract_(mathematics)\" title=\"Abstract (mathematics)\">abstract (mathematics)</a> junk junk"

The goal would be:

cleanFun(test) = "junk junk abstract (mathematics) junk junk"

However, this does not seem to work. I thought it might be something to do with string length or escape characters, but I could not find a solution involving those.
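A likely culprit (my reading, not stated in the question): str_replace_all interprets its pattern argument as a regular expression, so the parentheses in the extracted tag act as a regex capture group rather than literal characters, and the tag is never matched. A minimal sketch of a fix, assuming the stringr package already used above, wraps the pattern in fixed():

```r
library(stringr)

# the opening tag from the test string; "(" and ")" are regex metacharacters
tag <- '<a href="/wiki/Abstract_(mathematics)" title="Abstract (mathematics)">'
s   <- paste0("junk junk", tag, "abstract (mathematics)</a> junk junk")

# as a regex, "(mathematics)" is a group matching only "mathematics",
# so the pattern never matches the literal parentheses and nothing is removed
str_replace_all(s, tag, "")

# fixed() forces a literal string match, so the tag is stripped
str_replace_all(s, fixed(tag), "")
```

The same idea would apply inside the loop above: str_replace_all(newStr, fixed(tagStrings[[i]][1]), "").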
This can be achieved simply through regular expressions and the grep family:

cleanFun <- function(htmlString) {
  return(gsub("<.*?>", "", htmlString))
}

This will also work with multiple html tags in the same string!
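A quick check of this function against the test string from the question (the tags are removed and the text between them is kept):

```r
cleanFun <- function(htmlString) {
  return(gsub("<.*?>", "", htmlString))
}

test <- "junk junk<a href=\"/wiki/Abstract_(mathematics)\" title=\"Abstract (mathematics)\">abstract (mathematics)</a> junk junk"

cleanFun(test)
# returns "junk junkabstract (mathematics) junk junk"
```

The non-greedy `.*?` matters here: a greedy `<.*>` would swallow everything from the first "<" to the last ">", including the anchor text.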