Giuseppe: ruby - How to parse an article from web-page using Regexp?

Saturday, 15 May 2010

I have an article on the page and I need to parse all the text.

I know that an article is more than 15 words, joined by the symbols ' ', ',', '-', ':' or '.'.

How can I write a Regexp in Ruby that matches an article so I can parse it?

For example:

I need to parse the main text: ATLANTA — From the emotional high provided by Matt Harvey and Zack Wheeler, the Mets' young, hard-throwing right-handers, the team descended back on Wednesday to the realities of its everyday existence ...

I know how to fetch the page and get its content, but I do not know how to write a Regexp for this! To tell the expected text apart from the raw HTML tags, I want to write a Regexp that checks the rule: an article is more than 15 words, joined only by ' ', ',', '-', ':' or '.'.
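One literal reading of this rule can be written as a single pattern. This is only a sketch: the constant name `ARTICLE_RULE` and the sample strings are made up for illustration, and a "word" is assumed to be a run of `\w` characters.

```ruby
# Hypothetical constant: at least 16 words ("more than 15"), each pair
# separated by one or more of ' ', ',', '-', ':' or '.', with optional
# trailing separators at the end of the string.
ARTICLE_RULE = /\A\w+(?:[ ,\-:.]+\w+){15,}[ ,\-:.]*\z/

# Made-up sample strings.
short = "Too short, to be an article."
long  = "From the emotional high provided by the young, hard-throwing pitchers, " \
        "the team descended back on Wednesday to the realities of its everyday existence."

puts ARTICLE_RULE.match?(short)
puts ARTICLE_RULE.match?(long)
```

Read this strictly, any other character (an apostrophe, a quotation mark, a parenthesis) makes the match fail, so real article text may need a wider separator set.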

Given your requirements, Nokogiri is a good gem for web scraping.
    require 'nokogiri'
    require 'open-uri'

    # On Ruby 3.0+ open-uri needs URI.open; older versions also accepted plain open(url).
    doc = Nokogiri::HTML(URI.open('http://www.nytimes.com/2013/06/20/sports/baseball/for-the-mets-an-afterglow-then-realitys-harsh-light.html?ref=sports&_r=1&'))

    # Text of the first paragraph inside the article body.
    str = doc.at_css('div.articleBody > nyt_text > p').inner_text
    # => "ATLANTA — From the emotional high provided by Matt Harvey and Zack Wheeler, ..."

    # Split the text into words.
    str.scan(/\w+/)
    # => ["ATLANTA", "From", "the", "emotional", "high", "provided", "by", "Matt", "Harvey", "and", ..., "to", "the", "realities", "of", "its", "everyday", "existence"]

I know that articles are more than 15 words:

    str.scan(/\w+/).size > 15
    # => true

joined by the symbols ' ', ',', '-', ':' or '.':

    [' ', ',', '-', ':', '.'].map { |i| str.include?(i) }
    # => [true, true, true, false, false]
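The two checks above can be combined into one predicate and applied to every candidate paragraph. The sketch below is an illustration, not part of the original answer: `ALLOWED`, `article_like?` and `paragraphs` are hypothetical names, and the array is made-up data standing in for what something like `doc.css('p').map(&:inner_text)` might return.

```ruby
# Symbols that may join words, taken from the rule in the question.
ALLOWED = [' ', ',', '-', ':', '.'].freeze

# True when str has more than 15 words and every non-word character
# in it is one of the allowed joining symbols.
def article_like?(str)
  words   = str.scan(/\w+/)
  joiners = str.scan(/\W/).uniq
  words.size > 15 && (joiners - ALLOWED).empty?
end

# Made-up sample data standing in for the text of several <p> nodes.
paragraphs = [
  "Share this page",
  "From the emotional high provided by the young, hard-throwing pitchers, " \
  "the team descended back on Wednesday to the realities of its everyday existence.",
  "Copyright (c) some site; all rights reserved, and a few more words to pad this line past fifteen words total."
]

articles = paragraphs.select { |text| article_like?(text) }
puts articles.size
```

The last sample has enough words but is rejected because of the parentheses and the semicolon, which shows why the symbol check is stricter than the `include?` test: it rejects any character outside the allowed set instead of only reporting which allowed symbols are present.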




Posted by Unknown at 03:22










