Monday 15 March 2010

Removing HTML tags from a string in R


I am trying to read a web page's source into R and process it as a string. I want to extract the paragraphs and remove the HTML tags from the paragraph text, but I am running into the following problem.

I tried to implement a function to remove the HTML tags:

  library(stringr)

  cleanFun = function(fullStr) {
    # find the locations of the tags
    tagLoc = cbind(str_locate_all(fullStr, "<")[[1]][, 2],
                   str_locate_all(fullStr, ">")[[1]][, 1]);

    # create storage for the tag strings
    tagStrings = list()

    # extract and store the tag strings
    for (i in 1:dim(tagLoc)[1]) {
      tagStrings[i] = substr(fullStr, tagLoc[i, 1], tagLoc[i, 2]);
    }

    # remove the tag strings from the paragraph
    newStr = fullStr
    for (i in 1:length(tagStrings)) {
      newStr = str_replace_all(newStr, tagStrings[[i]][1], "")
    }
    return(newStr)
  };

This works for some tags, but not for all of them. An example where it fails is the following string:

  test = "junk junk<a href=\"/wiki/abstract_(mathematics)\" title=\"abstract (mathematics)\"> junk junk"

The goal is to obtain:

  cleanFun(test) = "junk junk junk junk"

However, this does not seem to work. I thought it might have something to do with string length or escape characters, but I could not find a solution involving either.

This can be achieved simply with regular expressions and the grep family:

  cleanFun <- function(htmlString) {
    # assumed pattern: lazily match everything from "<" to the nearest ">"
    return(gsub("<.*?>", "", htmlString))
  }

This will also work with multiple HTML tags in the same string!
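As a rough check of the regular-expression approach, here is a minimal sketch in base R. The helper name `strip_tags`, the pattern `<.*?>`, and the `multi` string are assumptions made for illustration, and the test string is a cleaned-up version of the one from the question.

```r
# Hypothetical helper illustrating the regex approach; base R only.
strip_tags <- function(html_string) {
  # "<.*?>" lazily matches from a "<" up to the nearest ">"
  gsub("<.*?>", "", html_string)
}

# Test string from the question (tag contains parentheses and escaped quotes)
test <- "junk junk<a href=\"/wiki/abstract_(mathematics)\" title=\"abstract (mathematics)\"> junk junk"
strip_tags(test)   # "junk junk junk junk"

# Several tags in one string (made-up example)
multi <- "<p>Hello <b>world</b></p>"
strip_tags(multi)  # "Hello world"

# Parentheses are regex metacharacters, so a string containing them
# does not match itself when used as a pattern unless matching is literal.
grepl("a(b)", "a(b)")                # FALSE
grepl("a(b)", "a(b)", fixed = TRUE)  # TRUE
```

The last two lines hint at why the loop-based function probably fails on this input: `str_replace_all` treats each extracted tag as a regular expression, so the parentheses in `(mathematics)` are read as a group instead of literal characters. Note that a regex of this kind is only suitable for simple markup; it does not handle a literal `>` inside an attribute value.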
