I am very new to Python (and web scrapping). let me ask you a question.
Many websites do not actually report their specific URLs in Firefox or other browsers. For example, popular child names are shown with ranks (since 1880) in the Social Security Administrator, but when I The year changes from 1880 to 1881, then the URL does not change. It is consistent,
Because I do not know the specific URL, I could not download the webpage using urlib.
The source of this page includes:
It seems that, if I can control this "year" value (e.g., "1881" or "1991"), I can deal with this problem. Huh? I still do not know how to do it. Can anyone tell me the solution for this? If you know some websites that can help in my studies, please let me know. Thank you! You can still use Now, just send url-encoded logic: If you also need to send the header: execute a code loop: input type = "text" name = "year" id = "yob" size = "4" value = " 1880 "& gt;
urllib . The button does a post to the current URL. Using Firefox I took a look at network traffic and found that they are sending 3 parameters:
member ,
top , and
year . You can send the same logic:
import urllib url = 'http://www.ssa.gov/cgi-bin/popularnames.cgi' post_params = {# member was empty, so I 'Top': '25', 'Year': Year} Post_Arg = urlib.urlencode (post_params)
urllib.urlopen (url, post_args)
header = {'accept' : 'Text /html,application/xhtml+xml.application/xml;q=0.9,*/*;q=0.8', 'Accept-language': 'N-US, N; Q = 0.5 ',' connection ':' keep-alive ',' host ':' www.ssa.gov ',' referrer ':' http://www.ssa.gov/cgi-bin/popularnames.cgi ' , 'User-agent': 'Mozilla / 5.0 (Windows NT 6.1; WOW64; rv: 21.0) Gecko / 20100101 Firefox / 21.0'} # POST with data: urllib.urlopen (url, post_args, headers)
year in the xrange (1880, 2014): # code above ...
No comments:
Post a Comment