Monday 15 August 2011

Parse HTML Table with Python BeautifulSoup


I'm trying to use BeautifulSoup to parse an HTML table that I uploaded, in order to get its three columns (0 to 735, 0.50 to 1.0 and 0.5 to 0.0) as lists. To explain why: I want the integers 0 to 735 to be keys and the decimal numbers to be values.

From reading a lot of other posts, I have come up with the following, which does not come close to creating the lists I want. All it does is display the text in the table:

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(open("fide.html"), "html.parser")
    table = soup.find('table')
    rows = table.findAll('tr')
    for tr in rows:
        cols = tr.findAll('td')
        for td in cols:
            # first text node inside the cell
            text = ''.join(td.find(text=True))
            print(text + "|", end=" ")
        print()

I'm new to Python and BeautifulSoup, so please be gentle with me!

HTML parsers like BeautifulSoup presume that what you want is an object model that mirrors the input HTML structure. But sometimes (as in this case) that model gets in the way more than it helps. Pyparsing includes some HTML parsing features that are more robust than raw regexes but otherwise work in a similar fashion: you define the snippets of interest and ignore the rest. Here is a parser that reads through the HTML source you posted:

    from pyparsing import makeHTMLTags, withAttribute, Suppress, Regex, Group

    """
    Looking for this recurring pattern:
        <td valign="top" bgcolor="#FFFFCC">00-03</td>
        <td valign="top">.50</td>
        <td valign="top">.50</td>
    and want a dict with keys 0, 1, 2, and 3, all with values (.50, .50).
    """

    td, tdend = makeHTMLTags("td")
    # key cells are the ones with the yellow background
    keytd = td.copy().setParseAction(withAttribute(bgcolor="#FFFFCC"))
    td, tdend, keytd = map(Suppress, (td, tdend, keytd))

    # convert matched text to numbers at parse time
    realnum = Regex(r'1?\.\d+').setParseAction(lambda t: float(t[0]))
    integer = Regex(r'\d{1,3}').setParseAction(lambda t: int(t[0]))
    DASH = Suppress('-')

    # build up an expression matching the HTML bits above
    entryExpr = (keytd + integer("start") + DASH + integer("end") + tdend +
                 Group(2 * (td + realnum + tdend))("vals"))

This parser not only picks out the matching triples, it also extracts the start-end integers and the pairs of real numbers (and already converts them from strings to integers or floats at parse time).
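For readers who want to try the idea without installing pyparsing, here is a rough sketch of the same extraction using only the standard library's `re` module. It is a swapped-in technique, not the pyparsing parser above, and raw regexes are less robust against variations in the markup. The sample HTML, `ROW_PATTERN`, and `extract_entries` are all made up for illustration; the sample assumes the real page's rows look like the recurring pattern described above.

```python
import re

# Hypothetical sample mirroring the recurring pattern described above;
# the real page's markup may differ in detail.
SAMPLE_HTML = """
<tr>
  <td valign="top" bgcolor="#FFFFCC">00-03</td>
  <td valign="top">.50</td>
  <td valign="top">.50</td>
</tr>
<tr>
  <td valign="top" bgcolor="#FFFFCC">620-735</td>
  <td valign="top">.99</td>
  <td valign="top">.01</td>
</tr>
"""

# A key cell (marked by bgcolor="#FFFFCC") holding "start-end",
# followed by two plain cells holding real numbers such as .50 or 1.0.
ROW_PATTERN = re.compile(
    r'<td[^>]*bgcolor="#FFFFCC"[^>]*>\s*(\d{1,3})-(\d{1,3})\s*</td>\s*'
    r'<td[^>]*>\s*(1?\.\d+)\s*</td>\s*'
    r'<td[^>]*>\s*(1?\.\d+)\s*</td>',
    re.IGNORECASE,
)


def extract_entries(html):
    """Return (start, end, (value1, value2)) tuples, numbers already converted."""
    entries = []
    for start, end, first, second in ROW_PATTERN.findall(html):
        entries.append((int(start), int(end), (float(first), float(second))))
    return entries


entries = extract_entries(SAMPLE_HTML)
print(entries)
```

Each tuple plays the role of one parsed entry: the two integers correspond to the `start` and `end` results, and the inner pair corresponds to `vals`.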

Looking at the table, I'm guessing that you actually want a lookup that takes a key like 700 and returns the pair of values (0.99, 0.01), because 700 falls in the range 620-735. This bit of code searches the source HTML text, iterates over the matched entries, and inserts key-value pairs into the dict lookup:

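    # search the input HTML for matches to the entryExpr expression, and
    # build up the lookup dict (sourcehtml is assumed to hold the page's HTML text)
    lookup = {}
    for entry in entryExpr.searchString(sourcehtml):
        for i in range(entry.start, entry.end + 1):
            # store a plain tuple so lookups print as (0.5, 0.5)
            lookup[i] = tuple(entry.vals)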
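The range expansion step can be tried on its own, without any HTML. The sketch below is only an illustration: `build_lookup` is a hypothetical helper, and the hand-written entries use just the two ranges mentioned in this post (00-03 and 620-735), whereas the real table has more rows.

```python
# Hypothetical parsed entries in (start, end, values) form. The two ranges
# and their values are the ones mentioned in this post; a real table has more.
entries = [
    (0, 3, (0.5, 0.5)),
    (620, 735, (0.99, 0.01)),
]


def build_lookup(entries):
    """Map every integer in each inclusive start-end range to that range's values."""
    lookup = {}
    for start, end, vals in entries:
        # range() excludes its upper bound, hence end + 1
        for i in range(start, end + 1):
            lookup[i] = vals
    return lookup


lookup = build_lookup(entries)
print(lookup[2])
print(lookup[700])
print(len(lookup))
```

Expanding every range into individual keys costs one dict entry per integer, which is cheap for a table that only spans 0 to 735 and makes each lookup a single dict access.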

And now to try out some lookups:

    # print out some test values
    for test in (0, 20, 100, 700):
        print(test, lookup[test])

Prints:

    0 (0.5, 0.5)
    20 (0.53, 0.47)
    100 (0.64, 0.36)
    700 (0.99, 0.01)
