Saturday 15 August 2015

apache - Nutch Crawler: read segment results


I crawled a site using Apache Nutch 1.6. After crawling, I try to read the content of the crawled results with this command:

  bin/nutch readseg -dump crawl/segments/* segmentAllContent

but I get this error:

  org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/ubuntu/nutch/framework/apache-nutch-1.6/blogs/segments/2013062110/crawl_generate
  Input path does not exist: file:/home/ubuntu/nutch/framework/apache-nutch-1.6/blogs/segments/2013062110/crawl_fetch
  Input path does not exist: file:/home/ubuntu/nutch/framework/apache-nutch-1.6/blogs/segments/2013062110/crawl_parse
  Input path does not exist: file:/home/ubuntu/nutch/framework/apache-nutch-1.6/blogs/segments/2013062110/content
  Input path does not exist: file:/home/ubuntu/nutch/framework/apache-nutch-1.6/blogs/segments/2013062110/parse_data
  Input path does not exist: file:/home/ubuntu/nutch/framework/apache-nutch-1.6/blogs/segments/2013062110/parse_text
      at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
      at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:40)
      at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
      at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:989)
      at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:981)
      at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174)
      at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:897)
      at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:416)
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
      at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
      at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824)
      at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1261)
      at org.apache.nutch.segment.SegmentReader.dump(SegmentReader.java:224)
      at org.apache.nutch.segment.SegmentReader.main(SegmentReader.java +7272)

How can I read the HTML content after crawling?

I usually merge all the segments first:

  bin/nutch mergesegs crawl/merged crawl/segments/*

and then read the merged segment:

  bin/nutch readseg -dump crawl/merged/* segmentAllContent
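
Before merging, it can help to see which segments are incomplete, since the error above lists six subdirectories that readseg expects inside a segment. The following is only a rough sketch of such a check: the function names `missing_parts` and `report_segments` are made up for illustration and are not part of Nutch, and the list of subdirectories is simply taken from the error message.

```shell
#!/usr/bin/env bash
# Subdirectories that the error message reports as missing input paths.
SEGMENT_PARTS="crawl_generate crawl_fetch crawl_parse content parse_data parse_text"

# Hypothetical helper: print the parts missing from one segment directory.
# Returns 0 when nothing is missing, 1 otherwise.
missing_parts() {
  local segment="$1" missing="" part
  for part in $SEGMENT_PARTS; do
    [ -d "$segment/$part" ] || missing="$missing $part"
  done
  echo "${missing# }"
  [ -z "$missing" ]
}

# Hypothetical helper: report every segment under a segments directory.
report_segments() {
  local segments_dir="$1" seg missing
  for seg in "$segments_dir"/*; do
    [ -d "$seg" ] || continue
    if missing=$(missing_parts "$seg"); then
      echo "$seg: complete"
    else
      echo "$seg: missing $missing"
    fi
  done
}
```

After sourcing this, running `report_segments crawl/segments` from the Nutch directory would print one line per segment. A segment that shows missing parts was probably not fully fetched or parsed, which would explain why readseg cannot find its input paths.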
