 

Writing 80apps

Page history last edited by Toan Duong 5 months, 1 week ago



 

80legs lets you write your own 80apps to fully customize how you crawl and process web pages.  Learning how to make an 80app can take some effort, but is well worth it if you want complete control over your crawls.

 

Step 1:  Setting Up your Development Environment


You will first need to set up your development environment.  Two popular choices are Eclipse and NetBeans.  Follow the instructions below to set up your preferred development environment.  You aren't required to use either of these IDEs; any development environment will work.

 

 

 

Step 2:  Writing your Code


After you've set up your development environment, it's time to start writing your first 80app.  Follow the instructions below to write an 80app in the appropriate environment.

 

 

 

Step 3:  Exporting your JAR


Now that your code is finished, you'll need to export it in the form of a JAR so you can use it in 80legs.  Follow the instructions below to export your 80app in the appropriate environment.

 

 

 

Step 4:  (Almost There!) Testing your Code


It is highly recommended that you follow the steps below as it is much harder to debug a live job.  To test your code, follow these steps:

 

  1. Test your code on your local machine (highly recommended)
    1. Download the latest 80legsProcessTest.jar from http://code.google.com/p/eightylegs/downloads/list.
    2. Test your code locally (click here for instructions) until it works for your test cases.

 

The following test is optional and is intended primarily for debugging:

 

  2. Test your code in the Test Sandbox using small crawls (this step is generally not necessary and should primarily be used to debug a live crawl that produces unexpected results)

    1. Upload your JAR file to 80legs in the Code section, which is accessible through the 80legs Portal.
    2. Test your code by creating a job in the Test Sandbox environment (selectable when creating a crawl).
      1. Instructions on creating a new crawl are here.
    3. Run your sandbox job and check your results.
    4. Debug your code as necessary.  Note that any exceptions thrown by your code, and any output from System.out.println(...), will be printed in the results for the crawl.  This can be very useful when you cannot reproduce a problem locally.
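
As a rough illustration of the debugging note above, the sketch below prints diagnostic lines from a page-processing method. The class name DebuggableApp and the simplified processDocument(byte[], String) signature are assumptions made for this example; the real 80legs interface may take different parameters. The code avoids Java 7+ features so that it stays compatible with Java 1.6.

```java
import java.nio.charset.Charset;

// Hypothetical stand-in for an 80app class. The real 80legs interface and
// its method signatures are not shown on this page and may differ.
class DebuggableApp {

    private static final Charset UTF8 = Charset.forName("UTF-8");

    // Assumed, simplified signature: raw page bytes plus the page URL.
    // Returns the page title as UTF-8 bytes, or an empty array if none is found.
    static byte[] processDocument(byte[] documentContent, String url) {
        try {
            String html = new String(documentContent, UTF8);
            // Diagnostic line: in a sandbox job this should show up in the results.
            System.out.println("DEBUG url=" + url + " length=" + html.length());

            int titleStart = html.indexOf("<title>");
            int titleEnd = html.indexOf("</title>");
            if (titleStart < 0 || titleEnd < titleStart) {
                System.out.println("DEBUG no <title> found for " + url);
                return new byte[0];
            }
            String title = html.substring(titleStart + "<title>".length(), titleEnd);
            return title.trim().getBytes(UTF8);
        } catch (RuntimeException e) {
            // Record which page failed before letting the exception propagate.
            System.out.println("DEBUG failed on " + url + ": " + e);
            throw e;
        }
    }
}
```

Keeping the diagnostic lines short and prefixed (here with DEBUG) makes them easier to pick out of the crawl results and to remove before a live run.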

 

 

Step 5:  (Finally!) Running your Code


To run your own 80legs analysis code in Java, follow these steps:

 

  1. Upload your code (Note: We do not require you to upload source code, only JAR files).
    1. Go to the 'Create Custom Code' section in the Web Portal and click 'Upload Code'.
    2. Give your code a name and select your JAR file.
    3. Select the heap space required.  If you are unsure of how much heap space is needed, start with the lowest option.
    4. If you want, you can select data to run with your code.  This will be used to verify your code.
    5. Click the 'Upload' button.
    6. If your code fails the approval process, you will be given an error code corresponding to the reason your code failed.  See the Error Codes page for more information.
  2. After your code is approved, create an 80legs job and run it in the Live environment, and specify which JAR file you want to use for computations.
  3. Retrieve the results from your job in the 80legs Portal.
  4. Extract your results from the .80 files using the instructions here.
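
If you are unsure which heap space option to select when uploading, one rough approach is to measure heap usage locally before uploading. The sketch below is only an approximation: the class name HeapEstimate and the byte-array workload are placeholders for your own 80app's data structures, and the JVM's reported figures vary between runs. Running it with the standard -Xmx256m JVM option mimics the maximum memory mentioned under Limitations.

```java
// Local helper for estimating heap usage; not part of any 80legs API.
class HeapEstimate {

    // Heap currently in use by this JVM, in bytes (approximate).
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    static long toMegabytes(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        long before = usedHeapBytes();

        // Placeholder workload: replace with the structures your 80app builds.
        byte[][] pages = new byte[200][];
        for (int i = 0; i < pages.length; i++) {
            pages[i] = new byte[100 * 1024];
        }

        long after = usedHeapBytes();
        System.out.println("Approx. heap used by workload (MB): " + toMegabytes(after - before));
        System.out.println("Max heap available to this JVM (MB): "
                + toMegabytes(Runtime.getRuntime().maxMemory()));
        System.out.println("Pages held: " + pages.length);
    }
}
```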

 

 

Limitations on your Code


80legs runs your code on a distributed computing system, which consists of a wide variety of computers.  Due to the heterogeneous nature of our infrastructure, we must impose a few limitations on your code.

 

JVM

Your code will only run on computers in this network that have Java 1.6.  The maximum amount of memory available on these nodes is 256MB, but in some cases there may not be this much available.  80legs runs code in a very limited Java sandbox.  This sandbox will prevent you from making any network connections.  Here is the policy file used by 80legs that specifies the permissions granted to your code:

 

grant codeBase {
    permission java.lang.RuntimePermission "stopThread";
    permission java.net.SocketPermission "localhost:1024-", "listen";
    permission java.util.PropertyPermission "java.version", "read";
    permission java.util.PropertyPermission "java.vendor", "read";
    permission java.util.PropertyPermission "java.vendor.url", "read";
    permission java.util.PropertyPermission "java.class.version", "read";
    permission java.util.PropertyPermission "os.name", "read";
    permission java.util.PropertyPermission "os.version", "read";
    permission java.util.PropertyPermission "os.arch", "read";
    permission java.util.PropertyPermission "file.separator", "read";
    permission java.util.PropertyPermission "path.separator", "read";
    permission java.util.PropertyPermission "line.separator", "read";
    permission java.util.PropertyPermission "java.specification.version", "read";
    permission java.util.PropertyPermission "java.specification.vendor", "read";
    permission java.util.PropertyPermission "java.specification.name", "read";
    permission java.util.PropertyPermission "java.vm.specification.version", "read";
    permission java.util.PropertyPermission "java.vm.specification.vendor", "read";
    permission java.util.PropertyPermission "java.vm.specification.name", "read";
    permission java.util.PropertyPermission "java.vm.version", "read";
    permission java.util.PropertyPermission "java.vm.vendor", "read";
    permission java.util.PropertyPermission "java.vm.name", "read";
};
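
Because the policy only grants read access to the listed system properties, code that reads any other property (for example user.home) can be expected to fail with a SecurityException inside the sandbox. The sketch below shows one defensive way to read a property; the class name SandboxProbe and the "(denied)" and "(unset)" markers are choices made for this example. Outside a sandbox, every read simply succeeds.

```java
// Defensive property access for code that may run under a restrictive policy.
class SandboxProbe {

    // Returns the property value, "(unset)" if it has no value,
    // or "(denied)" if the security policy refuses the read.
    static String readOrDenied(String key) {
        try {
            String value = System.getProperty(key);
            return value == null ? "(unset)" : value;
        } catch (SecurityException e) {
            return "(denied)";
        }
    }

    public static void main(String[] args) {
        // The first two keys appear in the policy above; user.home does not.
        String[] keys = { "java.version", "os.name", "user.home" };
        for (int i = 0; i < keys.length; i++) {
            System.out.println(keys[i] + " = " + readOrDenied(keys[i]));
        }
    }
}
```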

 

Data Size

We require that data be less than 10MB in size.  The smaller your data, the better it is for you and us.

 

Code Size

Your JAR file must be less than 10MB in size as well.  Again, smaller is better.
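
A quick local check before uploading can catch oversized files early. The sketch below assumes that "10MB" means 10 * 1024 * 1024 bytes; if 80legs counts megabytes differently, adjust LIMIT_BYTES accordingly. The class name UploadSizeCheck is made up for this example, and the file paths are taken from the command line.

```java
import java.io.File;

// Local pre-upload size check; not part of any 80legs API.
class UploadSizeCheck {

    // Assumption: the 10MB limit is interpreted as 10 * 1024 * 1024 bytes.
    static final long LIMIT_BYTES = 10L * 1024L * 1024L;

    // The page says files must be less than 10MB, so the comparison is strict.
    static boolean withinLimit(long sizeBytes) {
        return sizeBytes < LIMIT_BYTES;
    }

    public static void main(String[] args) {
        // Usage: java UploadSizeCheck my80app.jar mydata.bin
        for (int i = 0; i < args.length; i++) {
            File f = new File(args[i]);
            String verdict = withinLimit(f.length()) ? "OK" : "TOO LARGE";
            System.out.println(f.getName() + ": " + f.length() + " bytes, " + verdict);
        }
    }
}
```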

 

Time Limits

80legs enforces time limits on all custom code.  If a limit is exceeded for any given page, processing of that document is abandoned and a timeout error is logged for your job.  If your job generates too many timeout errors, it will be stopped (reasons that jobs are stopped).  The current limits are:

  • Your constructor and initialize() method must complete within 60 seconds. 
  • Your parseLinks() and processDocument() methods must complete within a total of 10 seconds per document processed. 
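
To see locally whether your per-document work fits inside these limits, you can time it on a worker thread. The sketch below is a local approximation only and does not reproduce how 80legs itself enforces timeouts; the class name TimeLimitHarness and the empty placeholder task are assumptions for this example. It uses only java.util.concurrent classes available in Java 1.6.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Local timing harness; not part of any 80legs API.
class TimeLimitHarness {

    // Runs the task on a worker thread and reports whether it finished
    // within limitMillis. A task that throws is reported as a failure.
    static boolean completesWithin(Runnable task, long limitMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<?> future = executor.submit(task);
        try {
            future.get(limitMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the overrunning task
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (ExecutionException e) {
            throw new RuntimeException("Task failed", e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Placeholder: call your own parseLinks() and processDocument() here.
        Runnable perDocumentWork = new Runnable() {
            public void run() {
                // per-document processing goes here
            }
        };
        // 10 seconds is the per-document limit stated above.
        boolean ok = completesWithin(perDocumentWork, 10L * 1000L);
        System.out.println(ok ? "Within the 10 second limit" : "Exceeded the 10 second limit");
    }
}
```

Since the nodes that run your code vary widely, it is sensible to leave a generous margin rather than aim to finish just under the limit on your own machine.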

 

 

Troubleshooting


parseLinks()

  • If the depth level of a crawl is set to 0, then parseLinks() will not be executed on any of the pages crawled.

 

 

Other Useful Resources


Serialization:
