Saturday 15 February 2014

winforms - C# Web Crawler/Parser/Spider

I am new to C# and WinForms. I want to create a web crawler (parser) that parses web pages and shows their links in sequence. I also do not know how to make the bot crawl to a specific hyperlink depth.

So I think I have 2 questions:

  1. How do I make the bot crawl to a specified link depth?
  2. How do I show all hyperlinks in sequence?

    P.S. I would love a code sample.

    P.P.S. There is 1 button = button1, and 1 richtextbox = richTextBox1.

    This is my code. I know that it is very ugly... (all of the code is in one button handler):

      using System;
      using System.Collections.Generic;
      using System.IO;
      using System.Linq;
      using System.Net;
      using System.Text.RegularExpressions;
      using System.Windows.Forms;

      public partial class Form1 : Form
      {
          public Form1()
          {
              InitializeComponent();
          }

          private void button1_Click(object sender, EventArgs e)
          {
              // Go to this URL (declared before the request that uses it):
              string url = UrlTextBox.Text;
              if (!(url.StartsWith("http://") || url.StartsWith("https://")))
                  url = "http://" + url;

              // Declarations
              HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
              HttpWebResponse response = (HttpWebResponse)request.GetResponse();
              StreamReader sr = new StreamReader(response.GetResponseStream());
              Match m;
              string anotherTest = @"(((ht){1}tp[s]?://)[-a-zA-Z0-9@:%_\+.~#?&\\]+)";
              List<string> savedUrls = new List<string>();
              List<string> title = new List<string>();

              // Scrape the whole HTML code:
              string s = sr.ReadToEnd();

              try
              {
                  // Get URLs:
                  m = Regex.Match(s, anotherTest, RegexOptions.IgnoreCase | RegexOptions.Compiled, TimeSpan.FromSeconds(1));
                  while (m.Success)
                  {
                      savedUrls.Add(m.Groups[1].ToString());
                      m = m.NextMatch();
                  }

                  // Get the title:
                  Match m2 = Regex.Match(s, @"<title>\s*(.+?)\s*</title>");
                  if (m2.Success)
                  {
                      title.Add(m2.Groups[1].Value);
                  }

                  // Show the title (only if one was found):
                  if (title.Count > 0)
                      richTextBox1.Text += title[0] + "\n";

                  // Show URLs:
                  TrimUrls(ref savedUrls);
              }
              catch (RegexMatchTimeoutException)
              {
                  Console.WriteLine("The matching operation timed out.");
              }

              sr.Close();
          }

          private void TrimUrls(ref List<string> urls)
          {
              List<string> d = urls.Distinct().ToList();
              foreach (var v in d)
              {
                  if (v.IndexOf('.') != -1 && v != "http://")
                  {
                      richTextBox1.Text += v + "\n";
                  }
              }
          }
      }

    And another question: does anyone know how to save it in XML like PEM?

    I highly recommend using the HTML Agility Pack.

    With HTML Agility Pack you can do something like this:

      var doc = new HtmlDocument();
      doc.LoadHtml(html);
      var urls = new List<String>();
      // SelectNodes returns null when nothing matches, so check before looping.
      var links = doc.DocumentNode.SelectNodes("//a[@href]");
      if (links != null)
      {
          foreach (var x in links)
          {
              urls.Add(x.Attributes["href"].Value);
          }
      }

    Edit:

    You can do something like this, but please add some exception handling to it.

      public class ParsResult
      {
          public ParsResult Parent { get; set; }
          public String Url { get; set; }
          public Int32 Depth { get; set; }
      }

    __

      private readonly List<ParsResult> _results = new List<ParsResult>();
      private Int32 _maxDepth = 5;

      public void Foo(String urlToCheck = null, Int32 depth = 0, ParsResult parent = null)
      {
          // Stop recursing once the maximum depth is reached.
          if (depth >= _maxDepth) return;

          String html;
          using (var wc = new WebClient())
              html = wc.DownloadString(urlToCheck ?? parent.Url);

          var doc = new HtmlDocument();
          doc.LoadHtml(html);
          var aNods = doc.DocumentNode.SelectNodes("//a");
          if (aNods == null || !aNods.Any()) return;

          foreach (var aNode in aNods)
          {
              var url = aNode.Attributes["href"];
              if (url == null)
                  continue;

              var result = new ParsResult
              {
                  Depth = depth,
                  Parent = parent,
                  Url = url.Value
              };
              _results.Add(result);
              Console.WriteLine("{0} - {1}", depth, result.Url);

              // Follow the link one level deeper.
              Foo(depth: depth + 1, parent: result);
          }
      }
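The recursive approach above follows every href as-is, so relative links and pages that link back to each other can cause trouble. Below is a minimal sketch of one way to handle that. It assumes a breadth-first queue instead of recursion, a visited set, and `Uri.TryCreate` to resolve relative links. `CrawlNode`, `DepthLimitedCrawler`, and the `getLinks` delegate are hypothetical names, not part of the original post or of any library. `getLinks` stands in for "download the page and return its href values", which could be done with `WebClient` and HTML Agility Pack as shown earlier. `ToXml` is one possible way to save the parent/child structure as nested XML, using `System.Xml.Linq`.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;

// Hypothetical result type: one crawled page plus the page it was found on.
public sealed class CrawlNode
{
    public CrawlNode Parent { get; set; }
    public Uri Url { get; set; }
    public int Depth { get; set; }
}

public static class DepthLimitedCrawler
{
    // Breadth-first crawl. The start page has depth 0; pages at maxDepth
    // are recorded but their links are not followed.
    public static List<CrawlNode> Crawl(Uri start, int maxDepth,
                                        Func<Uri, IEnumerable<string>> getLinks)
    {
        var results = new List<CrawlNode>();
        var visited = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        visited.Add(start.GetLeftPart(UriPartial.Query));
        var queue = new Queue<CrawlNode>();
        queue.Enqueue(new CrawlNode { Url = start, Depth = 0 });

        while (queue.Count > 0)
        {
            var page = queue.Dequeue();
            results.Add(page);
            if (page.Depth >= maxDepth) continue;

            IEnumerable<string> hrefs;
            try { hrefs = getLinks(page.Url); }
            catch (Exception) { continue; } // skip pages that fail to load or parse

            foreach (var href in hrefs)
            {
                Uri absolute;
                // Resolve relative links against the page they appear on.
                if (!Uri.TryCreate(page.Url, href, out absolute)) continue;
                if (absolute.Scheme != Uri.UriSchemeHttp &&
                    absolute.Scheme != Uri.UriSchemeHttps) continue;

                // Drop the #fragment so the same page is not queued twice.
                var key = absolute.GetLeftPart(UriPartial.Query);
                if (!visited.Add(key)) continue;

                queue.Enqueue(new CrawlNode
                {
                    Parent = page,
                    Url = new Uri(key),
                    Depth = page.Depth + 1
                });
            }
        }
        return results;
    }

    // Builds nested <page> elements that mirror the Parent links.
    // Relies on parents appearing before children, which Crawl guarantees.
    public static XElement ToXml(IEnumerable<CrawlNode> nodes)
    {
        var root = new XElement("crawl");
        var elements = new Dictionary<CrawlNode, XElement>();
        foreach (var n in nodes)
        {
            var el = new XElement("page",
                new XAttribute("url", n.Url.AbsoluteUri),
                new XAttribute("depth", n.Depth));
            elements[n] = el;

            XElement parentEl;
            if (n.Parent != null && elements.TryGetValue(n.Parent, out parentEl))
                parentEl.Add(el);
            else
                root.Add(el);
        }
        return root;
    }
}
```

Under these assumptions, `DepthLimitedCrawler.ToXml(nodes).Save("crawl.xml")` would write the tree to a file (the file name is only an example), and the same `Parent` links could be used to fill a WinForms control level by level.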
