 

Tips and Best Practices

Page history last edited by 80legs 10 months, 1 week ago

This page contains an ever-growing list of best practices when running web crawls with 80legs.

 



 

 

How to Improve Crawl Performance


Make Crawls Start Faster

  • Ensure that your seed list contains at least a dozen URLs.  Several hundred or several thousand would be better.  In general, the more seed URLs, the better.

  • You can make your initial seed list larger by including sub-pages that you expect to crawl.  Example: if crawling http://dir.yahoo.com, add http://dir.yahoo.com/Arts/ and http://dir.yahoo.com/Business_and_Economy/ to make your crawl start faster.  Don't worry about recrawling the same pages over and over: 80legs crawls each page only once and automatically removes duplicates while crawling.

  • Setting a lower maximum number of pages to crawl (fewer than 1 million pages) helps your job start faster, even in times of high load.  If possible, design your crawls to use as low a maximum page count as you can.
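
The seed-list advice above can be sketched in code. This is only an illustration, not part of the 80legs API: the class SeedListBuilder and the method buildSeeds are hypothetical names, and the sketch assumes you already know which sub-paths you expect to crawl. It resolves each sub-path against the root URL and drops exact duplicates before you submit the list.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for preparing a seed list; not part of the 80legs API.
class SeedListBuilder {

    // Resolves each sub-path against the root URL and drops exact duplicates,
    // keeping the order in which URLs were first seen.
    // The root URL should end with "/" so that relative paths resolve cleanly.
    static List<String> buildSeeds(String rootUrl, List<String> subPaths) {
        URI root = URI.create(rootUrl);
        Set<String> seeds = new LinkedHashSet<>();
        seeds.add(root.toString());
        for (String path : subPaths) {
            seeds.add(root.resolve(path).toString());
        }
        return new ArrayList<>(seeds);
    }

    public static void main(String[] args) {
        List<String> seeds = buildSeeds(
                "http://dir.yahoo.com/",
                Arrays.asList("Arts/", "Business_and_Economy/", "Arts/"));
        // Print one seed URL per line, ready to paste into a seed list.
        for (String seed : seeds) {
            System.out.println(seed);
        }
    }
}
```

Deduplicating locally is optional, since 80legs removes duplicates while crawling; it simply keeps the seed list easier to review.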

     

 

Writing your Own 80apps


Things to Watch Out for with Custom parseLinks()

  • Many pages return a "301 Moved Permanently", "302 Moved Temporarily", or "303 See Other" response with the new URL in the "Location" header.  Make sure to handle these by reading the new URL from that header and adding it to the links to be crawled.  If you are using the built-in default parseLinks(), note that it handles 301 and 302 by default, but not 303.

 

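if (statusCodeLine.contains("301") || statusCodeLine.contains("302") || statusCodeLine.contains("303")) {
    try {
        String outLink = null;
        // Servers differ in header capitalization, so try both common spellings.
        String location = headers.get("Location");

        if (location == null) {
            location = headers.get("location");
        }
        if (location != null) {
            // Resolve the (possibly relative) Location value against the current page URL.
            URL newURL = new URL(url);
            outLink = URLFunctions.preProcessParsedLink(location, newURL, newURL, true);
        }
        // outLink is still null when no Location header was found, so check for null first.
        if (outLink != null && !outLink.equals("")) {
            links.add(outLink);
        }
    } catch (Exception e) {
        e.printStackTrace();
    }

    return links;
}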
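
The redirect check above matches "301", "302", or "303" anywhere in the status line. A stricter variant is sketched below using only the Java standard library. It is an illustration under assumptions, not 80legs code: the class RedirectLinks and its methods are hypothetical, and java.net.URI stands in for URLFunctions.preProcessParsedLink, whose exact behavior is not documented on this page. The sketch parses the numeric status code, looks up the Location header case-insensitively, and resolves a relative target against the page URL.

```java
import java.net.URI;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper; uses java.net.URI in place of the 80legs URLFunctions class.
class RedirectLinks {

    // Extracts the numeric code from a status line such as "HTTP/1.1 301 Moved Permanently".
    // Returns -1 when the line does not have the expected shape.
    static int statusCode(String statusLine) {
        if (statusLine == null) {
            return -1;
        }
        String[] parts = statusLine.trim().split("\\s+");
        if (parts.length < 2) {
            return -1;
        }
        try {
            return Integer.parseInt(parts[1]);
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    // HTTP header names are case-insensitive, so compare them without regard to case.
    static String header(Map<String, String> headers, String name) {
        for (Map.Entry<String, String> entry : headers.entrySet()) {
            if (entry.getKey() != null && entry.getKey().equalsIgnoreCase(name)) {
                return entry.getValue();
            }
        }
        return null;
    }

    // Returns the absolute redirect target for a 301, 302, or 303 response,
    // or an empty Optional when the response is not a redirect or has no usable Location.
    static Optional<String> redirectTarget(String pageUrl, String statusLine,
                                           Map<String, String> headers) {
        int code = statusCode(statusLine);
        if (code != 301 && code != 302 && code != 303) {
            return Optional.empty();
        }
        String location = header(headers, "Location");
        if (location == null || location.trim().isEmpty()) {
            return Optional.empty();
        }
        try {
            return Optional.of(URI.create(pageUrl).resolve(location.trim()).toString());
        } catch (IllegalArgumentException e) {
            // Malformed page URL or Location value: treat it as "no redirect link".
            return Optional.empty();
        }
    }
}
```

Inside a custom parseLinks(), a call such as redirectTarget(url, statusCodeLine, headers).ifPresent(links::add) would add the target to the outgoing links, assuming links is a collection of strings as in the snippet above.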

 

  • Returning null from your parseLinks() code invokes the default 80legs parseLinks() function.  This means that 80legs will parse that page for links using its default behavior.  If you do not want any links from a page to be crawled, return an empty set instead of null.
  • If the depth level of a crawl is set to 0, then parseLinks() will not be executed on any of the pages crawled.
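
The difference between returning null and returning an empty set can be sketched as follows. The method below is a hypothetical, simplified stand-in: the actual 80app parseLinks() signature is not shown on this page, and the URL rules ("/archive/", "/index.html") are invented purely for illustration.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical, simplified stand-in for an 80app's parseLinks().
class ParseLinksContract {

    static Collection<String> parseLinks(String url) {
        if (url.contains("/archive/")) {
            // Empty set: tells the crawler to follow no links from this page.
            return Collections.emptySet();
        }
        if (url.endsWith("/index.html")) {
            // Custom link set: only these links are followed from this page.
            Set<String> links = new LinkedHashSet<>();
            links.add(url.replace("/index.html", "/page2.html"));
            return links;
        }
        // null: defers to the default 80legs link parser for this page.
        return null;
    }

    public static void main(String[] args) {
        System.out.println(parseLinks("http://example.com/archive/2010"));
        System.out.println(parseLinks("http://example.com/index.html"));
        System.out.println(parseLinks("http://example.com/about"));
    }
}
```

Returning null by accident, for example from an early exit or an error path, silently re-enables default link parsing, so it is worth making every return in a custom parseLinks() explicit.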

 
