Sunday 15 July 2012

python - a (presumably basic) web scraping of http://www.ssa.gov/cgi-bin/popularnames.cgi in urllib -


I am very new to Python (and web scrapping). let me ask you a question.

Many websites do not actually report their specific URLs in Firefox or other browsers. For example, popular child names are shown with ranks (since 1880) in the Social Security Administrator, but when I The year changes from 1880 to 1881, then the URL does not change. It is consistent,

Because I do not know the specific URL, I could not download the webpage using urlib.

The source of this page includes:

input type = "text" name = "year" id = "yob" size = "4" value = " 1880 "& gt;

It seems that, if I can control this "year" value (e.g., "1881" or "1991"), I can deal with this problem. Huh? I still do not know how to do it.

Can anyone tell me the solution for this?

If you know some websites that can help in my studies, please let me know.

Thank you!

You can still use urllib . The button does a post to the current URL. Using Firefox I took a look at network traffic and found that they are sending 3 parameters: member , top , and year . You can send the same logic:

  import urllib url = 'http://www.ssa.gov/cgi-bin/popularnames.cgi' post_params = {# member was empty, so I 'Top': '25', 'Year': Year} Post_Arg = urlib.urlencode (post_params)   

Now, just send url-encoded logic:

  urllib.urlopen (url, post_args)   

If you also need to send the header:

  header = {'accept' : 'Text /html,application/xhtml+xml.application/xml;q=0.9,*/*;q=0.8', 'Accept-language': 'N-US, N; Q = 0.5 ',' connection ':' keep-alive ',' host ':' www.ssa.gov ',' referrer ':' http://www.ssa.gov/cgi-bin/popularnames.cgi ' , 'User-agent': 'Mozilla / 5.0 (Windows NT 6.1; WOW64; rv: 21.0) Gecko / 20100101 Firefox / 21.0'} # POST with data: urllib.urlopen (url, post_args, headers)   

execute a code loop: year in the xrange (1880, 2014): # code above ...

No comments:

Post a Comment