Saturday 15 May 2010

ruby - How to parse an article from a web page using Regexp?


I have an article on the page and I need to parse all the text.

I know that an article is more than 15 words, joined by the symbols ' ', ',', '-', ':' or '.'.

How can I write a Regexp in Ruby that matches an article so that I can parse it?

For example:

I need to parse the main text: ATLANTA — From the emotional high provided by Matt Harvey and Zack Wheeler, the Mets’ young, hard-throwing right-handers, on Wednesday the team returned to the realities of their everyday existence ...

I know how to parse the page and get the content, but I do not know how to write a Regexp for it! Once the raw HTML tags have been parsed down to the expected text, the remaining step is to write some regexes that check the rule: an article is more than 15 words, joined only by ' ', ',', '-', ':' or '.'.

Given your requirements, take a look at Nokogiri; it is a great gem for web scraping.

    require 'nokogiri'
    require 'open-uri'

    # URI.open is provided by open-uri; on Ruby versions before 2.5, use Kernel#open instead.
    doc = Nokogiri::HTML(URI.open('http://www.nytimes.com/2013/06/20/sports/baseball/for-the-mets-an-afterglow-then-realitys-harsh-light.html?ref=sports&_r=1&'))

    # Text of the first paragraph inside the article body.
    str = doc.at_css('div.articleBody > nyt_text > p').inner_text
    # => "ATLANTA — From the emotional high provided by Matt Harvey and Zack Wheeler, the Mets’ young, hard-throwing right-handers, on Wednesday the team returned to the realities of their everyday existence."

    # Split the text into words.
    str.scan(/\w+/)
    # => ["ATLANTA", "From", "the", "emotional", "high", "provided", "by", "Matt", "Harvey", "and", ...]

I know that articles are more than 15 words:

    str.scan(/\w+/).size > 15 # => true
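If a page yields several candidate blocks of text, the same word count can be used to keep only the blocks long enough to be article text. This is only a minimal sketch, assuming the blocks are already available as plain strings; `article_paragraphs`, `min_words` and the sample strings are names and data made up for illustration.

```ruby
# Hypothetical helper: keep only the strings that contain more than
# `min_words` words, counting words the same way as str.scan(/\w+/).
def article_paragraphs(paragraphs, min_words: 15)
  paragraphs.select { |text| text.scan(/\w+/).size > min_words }
end

# Stand-in data; on a real page these strings would come from the parsed document.
paragraphs = [
  'Share this page',
  'From the emotional high provided by Matt Harvey, on Wednesday the team ' \
  'returned to the realities of their everyday existence.'
]

puts article_paragraphs(paragraphs).size # => 1
```

Note that `\w+` counts a hyphenated term such as "hard-throwing" as two words, which may or may not be what you want.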

and that the words are joined by the symbols ' ', ',', '-', ':' or '.':

    [' ', ',', '-', ':', '.'].map { |i| str.include?(i) } # => [true, true, true, false, false]
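The two checks can also be folded into a single Regexp, which is closer to what the question asked for. The following is a sketch under one reading of the rule: the whole text must consist of more than 15 words, and the only characters allowed between words are ' ', ',', '-', ':' and '.'. The names `ARTICLE_PATTERN` and `article?` are made up for illustration.

```ruby
# One word, followed by at least 15 more words, each preceded by one or more
# of the allowed separators. Trailing separators (such as a final '.') are
# allowed, and \A ... \z force the pattern to cover the whole string.
ARTICLE_PATTERN = /\A\w+(?:[ ,\-:.]+\w+){15,}[ ,\-:.]*\z/

# Hypothetical helper: true if `text` satisfies the rule as read above.
def article?(text)
  ARTICLE_PATTERN.match?(text)
end

sample = 'From the emotional high provided by Matt Harvey, on Wednesday the team ' \
         'returned to the realities of their everyday existence.'

puts article?(sample)            # => true
puts article?('Share this page') # => false
```

This reading is strict: text containing any other character, such as an apostrophe, a quotation mark or a long dash, is rejected, so for real articles the separator class would probably need to be widened.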
