I am new to C# and WinForms. I want to create a web crawler that parses web pages and shows the links it finds in sequence. I do not know how to make the bot crawl to a specific hyperlink depth.
So I think I have 2 questions:
- How do I make the bot crawl to a specified link depth?
- How do I show all the hyperlinks in sequence?
P.S. I would love a code sample.
P.P.S. The form has 1 button (button1) and 1 rich text box (richTextBox1).
This is my code. I know it is very ugly (all the code is in one button handler):
    // Needs: System, System.Collections.Generic, System.IO, System.Linq,
    // System.Net, System.Text.RegularExpressions, System.Windows.Forms
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        private void button1_Click(object sender, EventArgs e)
        {
            // Go to this URL (declared first so the request below can use it):
            string url = UrlTextBox.Text = "http://www.yahoo.com";
            if (!(url.StartsWith("http://") || url.StartsWith("https://")))
                url = "http://" + url;

            // Declarations
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
            HttpWebResponse response = (HttpWebResponse)request.GetResponse();
            StreamReader sr = new StreamReader(response.GetResponseStream());
            Match m;
            string anotherTest = @"(((ht){1}tp[s]?://)[-a-zA-Z0-9@:%_\+.~#?&\\]+)";
            List<string> savedUrls = new List<string>();
            List<string> title = new List<string>();

            // Scrape the whole HTML code:
            string s = sr.ReadToEnd();
            try
            {
                // Get the URLs:
                m = Regex.Match(s, anotherTest, RegexOptions.IgnoreCase | RegexOptions.Compiled, TimeSpan.FromSeconds(1));
                while (m.Success)
                {
                    savedUrls.Add(m.Groups[1].ToString());
                    m = m.NextMatch();
                }

                // Get the title:
                Match m2 = Regex.Match(s, @"<title>\s*(.+?)\s*</title>");
                if (m2.Success)
                {
                    title.Add(m2.Groups[1].Value);
                }

                // Show the title:
                richTextBox1.Text += title[0] + "\n";

                // Show the URLs:
                TrimUrls(ref savedUrls);
            }
            catch (RegexMatchTimeoutException)
            {
                Console.WriteLine("The matching operation timed out.");
            }
            sr.Close();
        }

        private void TrimUrls(ref List<string> urls)
        {
            List<string> d = urls.Distinct().ToList();
            foreach (string v in d)
            {
                if (v.IndexOf('.') != -1 && v != "http://www.w3.org")
                {
                    richTextBox1.Text += v + "\n";
                }
            }
        }
    }
And another question: does anyone know how to save the result as XML?

I highly recommend using the HTML Agility Pack.
With the HTML Agility Pack you can do something like this:
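    // Requires the HtmlAgilityPack package (using HtmlAgilityPack;).
    var doc = new HtmlDocument();
    doc.LoadHtml(html); // html holds the page source as a string
    var urls = new List<string>();
    var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
    if (anchors != null) // SelectNodes returns null when nothing matches
    {
        foreach (var x in anchors)
            urls.Add(x.Attributes["href"].Value);
    }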
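The `href` values collected from a page are often relative (for example `news/today.html`), so they usually need to be resolved against the address of the page before they can be downloaded. Below is a minimal sketch of one way to do that with `System.Uri`. The class name `LinkHelper` and the method `ToAbsolute` are hypothetical names made up for this illustration, and dropping every non-http(s) link is an assumption about what a crawler would want.

```csharp
using System;

public static class LinkHelper
{
    // Hypothetical helper: resolves an href against the page it was found on.
    // Returns null for empty values and for links that are not http or https
    // (mailto:, javascript:, and so on).
    public static string ToAbsolute(string pageUrl, string href)
    {
        if (string.IsNullOrWhiteSpace(href)) return null;

        Uri baseUri;
        if (!Uri.TryCreate(pageUrl, UriKind.Absolute, out baseUri)) return null;

        Uri result;
        if (!Uri.TryCreate(baseUri, href, out result)) return null;

        if (result.Scheme != Uri.UriSchemeHttp && result.Scheme != Uri.UriSchemeHttps)
            return null;

        return result.AbsoluteUri;
    }
}
```

In the loop that collects links, one could call `LinkHelper.ToAbsolute(pageUrl, hrefValue)` and skip the link when the result is null.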
Edit:
You can do something like this, but please add some exception handling to it.
    public class ParsResult
    {
        public ParsResult Parent { get; set; }
        public String Url { get; set; }
        public Int32 Depth { get; set; }
    }

    private readonly List<ParsResult> _results = new List<ParsResult>();
    private Int32 _maxDepth = 5;

    public void Foo(String urlToCheck = null, Int32 depth = 0, ParsResult parent = null)
    {
        // Stop once the requested link depth is reached.
        if (depth >= _maxDepth) return;

        String html;
        using (var wc = new WebClient())
            html = wc.DownloadString(urlToCheck ?? parent.Url);

        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var aNods = doc.DocumentNode.SelectNodes("//a");
        if (aNods == null || !aNods.Any()) return;

        foreach (var aNode in aNods)
        {
            var url = aNode.Attributes["href"];
            if (url == null)
                continue;

            var result = new ParsResult
            {
                Depth = depth,
                Parent = parent,
                Url = url.Value
            };
            _results.Add(result);
            Console.WriteLine("{0} - {1}", depth, result.Url);

            // Follow the link one level deeper.
            Foo(depth: depth + 1, parent: result);
        }
    }
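For the second question (showing the links), here is a minimal sketch that turns a list of results into text, one URL per line, indented by depth. It relies on the results being stored in the order they were discovered, which is the case when each result is added to the list before the recursive call. `CrawlNode` is a hypothetical stand-in for the result class so that the snippet is self-contained, and `CrawlReport.Format` is an invented name. Skipping URLs that were already listed is an assumption about the desired output.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for the result class, with the same three properties.
public class CrawlNode
{
    public CrawlNode Parent { get; set; }
    public string Url { get; set; }
    public int Depth { get; set; }
}

public static class CrawlReport
{
    // Builds one line per result, indented two spaces per depth level.
    // A URL that was already listed (ignoring case) is skipped.
    public static string Format(IEnumerable<CrawlNode> results)
    {
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var sb = new StringBuilder();
        foreach (var r in results)
        {
            if (r.Url == null || !seen.Add(r.Url)) continue;
            sb.Append(new string(' ', r.Depth * 2)).Append(r.Url).Append('\n');
        }
        return sb.ToString();
    }
}
```

In the WinForms app the returned string could be assigned to `richTextBox1.Text` once the crawl has finished. The same `seen` idea could also be applied before downloading a page, so that the crawler does not fetch the same URL twice.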