 

Jobs

Page history last edited by 80legs 10 months ago

Each crawl is run as a "job" in the 80legs platform.  To use 80legs, you must create crawl jobs, configure their settings, and run them.

 

80legs crawl jobs have four categories of settings:  Job Settings, Crawl Settings, Analysis Settings, and Result Settings.  Crawl jobs can be created and managed through the web portal and the API.

 



 

Running Your First Crawl Job


  1. Log in to the 80legs portal.
  2. Click on the "Create New Crawl" link.
  3. Enter the details for your first crawl job:
    1. Job Name - Anything you want.  This is a label that will help you identify the job.
    2. Run Environment - choose "Run on Live Server"   ("Run in Test Sandbox" is for testing your custom code using a restricted sandbox for debugging purposes.)
    3. Repeat Frequency - choose "Do Not Repeat"
    4. Crawl Settings
      1. Seed List of URLs - type in "http://dir.yahoo.com" (you can enter as many URLs as you want or upload a seed list)
      2. Outgoing Links to Crawl - select "Crawl all links found" 
      3. Depth Level - select "80legs will auto-select the depth level of your crawl"
      4. Crawl Type - select "80legs will crawl pages as fast as possible"
      5. Number of URLs to Crawl - input "1000"
      6. Mime Types to Crawl - select "text"
    5. Analysis Settings
      1. Analysis to Run - select 80apps and choose "Lossy Return Page Content" (this app will return filtered page content)
      2. Maximum Result Size - enter "100" 
      3. Pages / Files to Analyze - select "80legs will run the analysis on all pages and files crawled"
      4. Mime Types to Analyze - select "text" 
  4. Click "Create Crawl" and you will be taken to the dashboard page where you should see your job.  This page refreshes automatically every 15-30 seconds while your crawl is running.
  5. You will start to see pages crawled and analyzed increasing within a few minutes.  Once it starts, it will get to 1,000 in a few minutes and your crawl job will be completed.
  6. When your crawl job is complete, you will see the final details of the crawl.
  7. To view your results, click on the "Results" tab.
    1. Download the "Crawled URLs" file first.  Unzip it and open it in a text editor or spreadsheet.  The first three columns show the following:
      1. URL crawled
      2. Status from crawl - this might show a status code (like a 200 or 404) if the page crawl was attempted or it might show a robots.txt error or some other reason that the page was not crawled.
      3. Status from the processing - this will mostly show GOOD and NO_PROCESS, but you will see some other status messages too.  (see the process status definitions)
    2. Download the "Analyzed URLs" file.  Unzip it and open it in a text editor or spreadsheet.  The columns are:
      1. URL analyzed
      2. Unique count - this is how many of your items were matched one or more times
      3. Total count - this is the total number of times all of your items were matched
      4. Individual counts - all of the remaining columns are the individual counts for the items you entered.  There will be a column for each item and they are in the order you entered them.  If you just used "yahoo" as your only string, there will be one column here that will be the total number of times yahoo appeared in the document.
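
Once the "Crawled URLs" file is unzipped, its status columns can be summarized with a short script instead of a spreadsheet. The sketch below is illustrative only: it assumes the file is comma-separated with no header row, and the sample rows are hypothetical rather than real crawl output.

```python
import csv
import io
from collections import Counter


def tally_statuses(csv_text):
    """Count the crawl status and processing status columns (columns 2 and 3)."""
    crawl_status = Counter()
    process_status = Counter()
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        crawl_status[row[1]] += 1
        process_status[row[2]] += 1
    return crawl_status, process_status


# Hypothetical sample rows, for illustration only.
sample = (
    "http://dir.yahoo.com/,200,GOOD\n"
    "http://dir.yahoo.com/missing,404,NO_PROCESS\n"
    "http://dir.yahoo.com/arts,200,GOOD\n"
)

crawl_counts, process_counts = tally_statuses(sample)
print(crawl_counts)
print(process_counts)
```

To run this against a real download, read the unzipped file's text and pass it to tally_statuses() in place of the sample.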

 

 

Crawl Job Settings


Setting Name Description Requirements
Job Name

The name of your job.

Can be at most 256 characters.
Run Environment

The environment in which your job will run.  80legs provides a 'sandbox' environment and a 'live' environment.

 

The sandbox environment lets you test your jobs in a more limited environment by running them on a small number of sample pages.  There are no charges for running a job in the sandbox.

 

The live environment runs your job normally.  Live jobs will incur charges.

N/A
Repeat Frequency

You can select how often you want your job to repeat. Available options are:

  • Do Not Repeat - Your job will only run once.
  • Repeat Daily - Your job will run on a daily basis.
  • Repeat Weekly - Your job will run on a weekly basis.
  • Repeat Monthly - Your job will run on a monthly basis.

 

For repeating options, you can choose a start date, end date, and frequency interval.  If you choose 'Do Not Repeat', the job will be queued to start immediately.  It will typically start within 2-3 minutes.

 

In the future, 80legs will allow more frequent crawling (e.g., on an hourly basis).
N/A

 

 

Crawl Settings


Setting Name Description Requirements
Seed List of URLs

These are the URLs from which your crawl will start.

Simple:

One URL per line.

Newline-delimited.

Cannot be more than 100,000 URLs.

Note:  For larger seed lists, you can use the "Seed Lists" section to upload files of size up to 1 GB.

 

Advanced:

A previously uploaded seed list must be selected.  The seed list can be uploaded using the "Seed Lists" section. 
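
The "Simple" seed list requirements (one URL per line, newline-delimited, at most 100,000 URLs) can be checked before pasting a list into the portal. The following is a minimal sketch; the function name build_seed_list is hypothetical and is not part of 80legs.

```python
MAX_SIMPLE_SEED_URLS = 100_000  # limit stated for the "Simple" seed list field


def build_seed_list(urls):
    """Return newline-delimited seed text, one URL per line."""
    cleaned = [u.strip() for u in urls if u.strip()]  # drop blanks and stray spaces
    if not cleaned:
        raise ValueError("seed list is empty")
    if len(cleaned) > MAX_SIMPLE_SEED_URLS:
        raise ValueError(
            f"{len(cleaned)} URLs exceeds the limit of {MAX_SIMPLE_SEED_URLS}; "
            "upload a seed list file instead"
        )
    return "\n".join(cleaned)


seed_text = build_seed_list(
    ["http://dir.yahoo.com", "  http://www.example.com/  ", ""]
)
print(seed_text)
```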

Outgoing Links to Crawl

Use this field to specify which outgoing links you want to crawl from a page.

 

'Crawl all links found' will crawl all links.  'Crawl links from the same domain for each URL in my seed list' will start with a URL in your seed list and follow links from the same domain as that URL.  This will be done separately for each URL in your seed list.

 

You can also specify a pattern to match each link against in the form of a regular expression.  To help you out, here are some examples:

  • yahoo\.com will tell 80legs to only crawl links that contain yahoo.com
  • (yahoo\.com|google\.com) will tell 80legs to only crawl links that contain yahoo.com or google.com

Simple:

N/A 

Advanced: 

Patterns must conform to the Java specification for regular expressions.
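
To preview which links a pattern would select, you can try it locally before creating a job. This sketch uses Python's re module rather than Java's regular expression engine, on the assumption that simple patterns like the examples above behave the same in both; the "contains" behavior is modeled with a search anywhere in the link. The link list is hypothetical.

```python
import re


def links_to_crawl(links, pattern):
    """Keep only the links that contain a match for the pattern."""
    compiled = re.compile(pattern)
    return [link for link in links if compiled.search(link)]


# Hypothetical outgoing links found on a page.
links = [
    "http://dir.yahoo.com/arts",
    "http://www.google.com/search",
    "http://www.example.com/",
]

yahoo_only = links_to_crawl(links, r"yahoo\.com")
yahoo_or_google = links_to_crawl(links, r"(yahoo\.com|google\.com)")
print(yahoo_only)
print(yahoo_or_google)
```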

Depth Level

You can tell 80legs how deep you want to crawl by setting this variable.

 

A depth level of 0 means only your seed list will be crawled.

Cannot be more than 32,768.

Must be at least 0.

Crawl Type

80legs can crawl multiple depth levels at a time for a faster crawl.  However, this can result in a less comprehensive crawl.

 

If you need an all-encompassing and comprehensive crawl, you can use the "Comprehensive" or true "Breadth-First" options to crawl 2 or 1 depth levels at a time respectively.  These two options ensure that your crawl covers all pages, but they may be much slower than the "Fast" crawl option.

N/A
Number of URLs to Crawl

This variable sets an upper limit on the total number of pages that your crawl will reach over a single run of your job, which lets you limit the size of your crawl.

 

Note that this field is used to calculate how much will be reserved for the job from your available balance.  The amount to be reserved is currently calculated using 1.25 times the value of this field.

Once a job is created, this amount will be reserved and will not be accessible for use with new jobs.  The reserved amount is automatically released when the job finishes or the job is cancelled or deleted.

Sandbox jobs are not charged and do not have a minimum balance requirement.

Cannot be more than 10,000,000 URLs.

Must be at least 1 URL.

 

Note:  The upper limit will be increased very soon and eventually lifted altogether.

 

Note:  JavaScript must be enabled in your browser for this field to dynamically update correctly.
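
The reservation arithmetic for "Number of URLs to Crawl" can be sketched as follows. This only reproduces the 1.25 multiplier and the 1 to 10,000,000 range stated above; how the resulting figure maps onto your account balance is not covered here, and the function name is hypothetical.

```python
MIN_URLS = 1
MAX_URLS = 10_000_000
RESERVE_FACTOR = 1.25  # multiplier stated for live jobs


def reserved_amount(number_of_urls):
    """Return 1.25 times the 'Number of URLs to Crawl' value for a live job."""
    if not MIN_URLS <= number_of_urls <= MAX_URLS:
        raise ValueError(
            f"number of URLs must be between {MIN_URLS} and {MAX_URLS}"
        )
    return number_of_urls * RESERVE_FACTOR


print(reserved_amount(1000))
```

For the 1,000-URL job from the walkthrough, the reserved figure is 1,250.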

MIME Types to Crawl

Select which MIME types 80legs should crawl.

 

The 'Simple' mode allows you to crawl major MIME types, such as image/* and text/* (this includes text/plain and text/html).  The 'Advanced' mode allows you to specify exactly which MIME types 80legs should crawl.

This option allows us to use headers to determine whether a page should be crawled or not.  If a page is not one of the specified MIME types, then it will not be crawled.

At least one must be selected.

Must be 'text' if 'Keyword Matching' or 'Pattern Matching' is selected as the 'Analysis to Run'.

Preserve Query Strings When Crawling

Select 'Yes' if you want 80legs to preserve query strings when crawling links.  Select 'No' if you do not.

 

Example:  80legs encounters the link http://www.example.com/page?q=1234.  If 'Yes' is selected, 80legs will follow the link http://www.example.com/page?q=1234.  If 'No' is selected, 80legs will follow the link http://www.example.com/page.

This setting is only available in the API.
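
The effect of this setting on a single link can be illustrated with Python's standard URL tools. This is a sketch of the behavior described in the example, not 80legs' own implementation; link_to_follow is a hypothetical name.

```python
from urllib.parse import urlsplit, urlunsplit


def link_to_follow(url, preserve_query):
    """Return the link as-is, or with its query string removed."""
    if preserve_query:
        return url
    parts = urlsplit(url)
    return urlunsplit(parts._replace(query=""))


link = "http://www.example.com/page?q=1234"
print(link_to_follow(link, preserve_query=True))
print(link_to_follow(link, preserve_query=False))
```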
Max Number of URLs Per Page

This is an upper limit on the number of links that are followed from each page that you crawl.  For example, if there are 100 links on a page, but you have set this field to 10, only the first 10 links from the page will be crawled.  This is useful for certain pages that have a large number of outgoing links.

This setting is only available in the API.

 

 

Analysis Settings


Setting Name Description Requirements
Analysis to Run

Use this field to specify what type of analysis you want to run.  Available options are:

  • Keyword Matching - 80legs will match the content of pages using a list of keywords you provide
  • Pattern Matching - 80legs will match the content of pages using a list of patterns you provide (in the form of regular expressions)
  • Code - 80legs will run your custom analysis code on web content and return the results of that code
  • 80apps - 80legs will use our built-in applications (80apps) to perform powerful computation on your crawl.

See below.

Maximum Result Size

The result files generated by 80legs can be quite large, so we provide this field to help make downloading these files easier. If the total size of your result file is larger than this variable, then 80legs will split the result file into appropriate chunk sizes.

Must be between 10 and 100.

 

This setting is not available in the API in this section.  It is available in the API in the 'Result Settings' section below.

Pages / Files to Analyze

Use this field to specify which pages are actually analyzed during your crawl.

 

By default, 80legs will run the analysis on all pages and files crawled.  For greater control, you can specify which pages or files to analyze by specifying a pattern to match its URL against in the form of a regular expression.

 

To help you out, here are some examples:

  • \.jpg will tell 80legs to only analyze pages whose URLs contain .jpg
  • (\.jpg|\.png) will tell 80legs to only analyze pages whose URLs contain .jpg or .png
Must conform to the Java specification for regular expressions.
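
In a regular expression, an unescaped dot matches any character, so escaping it matters when you mean a literal ".jpg". The sketch below shows the difference using Python's re module, which is assumed to agree with Java's engine for patterns this simple; the URLs are hypothetical.

```python
import re


def should_analyze(url, pattern):
    """Return True if the URL contains a match for the pattern."""
    return re.search(pattern, url) is not None


image_url = "http://www.example.com/logo.jpg"
other_url = "http://www.example.com/ajpg-news"

print(should_analyze(image_url, r"\.jpg"))  # literal ".jpg" is present
print(should_analyze(other_url, r"\.jpg"))  # no literal ".jpg" in this URL
print(should_analyze(other_url, r".jpg"))   # unescaped dot also matches "ajpg"
```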
MIME Types to Analyze

Select which MIME types 80legs should analyze.

 

The 'Simple' mode allows you to analyze major MIME types, such as image/* and text/* (this includes text/plain and text/html).  The 'Advanced' mode allows you to specify exactly which MIME types 80legs should analyze.

This option allows you to specify exactly which types of pages you would like to run an analysis on.  If a page is not one of the specified MIME types, then it is skipped.

At least one must be selected.

Must be 'text' if 'Keyword Matching' or 'Pattern Matching' is selected as the 'Analysis to Run'.

Keyword Matching List

(if selected)

This is a list of keywords which you want to look for in page content.

 

80legs will look for every keyword you provide in the pages you want to analyze and return count results based on the result type you select.

One keyword per line (note: the entire line is considered one keyword, so a line can have many words).

Newline-delimited.

At most 1,000 keywords.

File must be at most 1 MB in size.

 

Note:  We hope to lift the limit on number of keywords allowed in future releases.
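
A keyword file can be checked against these requirements before upload. The sketch below is an assumption-laden helper, not an 80legs tool: it treats "1 MB" as 1,048,576 bytes, measures size as UTF-8, and ignores blank lines.

```python
MAX_KEYWORDS = 1_000
MAX_FILE_BYTES = 1024 * 1024  # assumes "1 MB" means 1,048,576 bytes


def check_keyword_list(text):
    """Return (keywords, problems) for newline-delimited keyword text."""
    # Each non-blank line is one keyword, even if it contains several words.
    keywords = [line.strip() for line in text.splitlines() if line.strip()]
    problems = []
    if len(keywords) > MAX_KEYWORDS:
        problems.append(f"{len(keywords)} keywords exceeds {MAX_KEYWORDS}")
    if len(text.encode("utf-8")) > MAX_FILE_BYTES:
        problems.append("file is larger than 1 MB")
    return keywords, problems


keywords, problems = check_keyword_list("yahoo\nweb directory\n")
print(keywords)
print(problems)
```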

Pattern Matching List

(if selected)

This is a list of regular expressions which you want to use to match page content.

 

80legs will match every regular expression you provide against pages you want to analyze and return count results based on the result type you select.

One regular expression per line.

Newline-delimited.

At most 1,000 regular expressions.

File must be at most 1 MB in size.

Must conform to the Java specification for regular expressions.

 

Note:  We hope to lift the limit on number of regular expressions allowed in future releases.

Code

(if selected)

This is a JAR file created by you that will run your own custom code.

 

80legs will run this code on the web content you specify through the other settings and store the results of your code in the result file.

 

File must be in .jar format.

File must be at most 10 MB in size.

 

Data

(if selected)

If you select 'Code' as your analysis method, you'll be given the option of uploading data to be run with your code.  You can use this data field to upload information your code needs to run, such as an image to which you want to compare other images.

File must be at most 10 MB in size.

80apps

(if selected)

If you select '80apps' as your analysis method, you will be given the option of selecting one of our built-in applications (80apps) that can perform powerful computations on the pages that you crawl. The list of available 80apps contains 80apps that are FREE, 80apps that are part of the App Pack (marked AP), 80apps owned by the user (marked D), and 80apps bought from the Marketplace (marked *).

These markers appear only in the portal.  The list is also available through the API, but its entries are not marked with AP, D, or *.

 

 

Result Settings


This section is only available in the API.

 

Setting Name Description Requirements
Result Type

You can select what type of results 80legs generates for you. Available options are:

  • Unique and total count - 80legs outputs the # of unique matches and total # of matches for your content selection strings (i.e., keywords or regular expressions)
  • Boolean array - 80legs outputs the two numbers above plus a 1 or 0 for each string, depending on whether or not that string was found
  • Count array - 80legs outputs the unique and total count plus the total count for each string
  • Code results - If you select to analyze content using code, result type will default to this option

 

Here are some examples of each result type. In these examples, we've crawled and analyzed two pages:

 

The contents of the first page are 'test1 test1 test2 test3 test5'. The contents of the second page are 'this is a test'.  The keyword list is:

test

test1

test2

test3

test4

test5

test6

 

For 'Unique and total count' the output will be:

http://www.test.com/test1.html,5,10

http://www.test.com/test2.html,1,1

 

For 'Boolean array' the output will be:

http://www.test.com/test1.html,5,10,1,1,1,1,0,1,0

http://www.test.com/test2.html,1,1,1,0,0,0,0,0,0

 

For 'Count array' the output will be:

http://www.test.com/test1.html,5,10,5,2,1,1,0,1,0

http://www.test.com/test2.html,1,1,1,0,0,0,0,0,0
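
The example outputs above can be reproduced with a short script. This is a sketch, not 80legs' implementation: it assumes keywords are counted as plain substrings (which is what the sample output implies, since 'test' is counted 5 times on the first page), and the result type labels used as arguments are hypothetical names, not API values.

```python
KEYWORDS = ["test", "test1", "test2", "test3", "test4", "test5", "test6"]


def analyze(content, keywords):
    """Return (unique count, total count, per-keyword counts) for a page."""
    counts = [content.count(keyword) for keyword in keywords]
    unique = sum(1 for c in counts if c > 0)  # keywords matched at least once
    total = sum(counts)                       # all matches across all keywords
    return unique, total, counts


def result_line(url, content, keywords, result_type):
    """Format one CSV line for the given (hypothetically named) result type."""
    unique, total, counts = analyze(content, keywords)
    fields = [url, unique, total]
    if result_type == "boolean_array":
        fields += [1 if c else 0 for c in counts]
    elif result_type == "count_array":
        fields += counts
    elif result_type != "unique_and_total":
        raise ValueError(f"unknown result type: {result_type}")
    return ",".join(str(field) for field in fields)


page1 = "test1 test1 test2 test3 test5"
page2 = "this is a test"
url1 = "http://www.test.com/test1.html"
url2 = "http://www.test.com/test2.html"

for result_type in ("unique_and_total", "boolean_array", "count_array"):
    print(result_line(url1, page1, KEYWORDS, result_type))
    print(result_line(url2, page2, KEYWORDS, result_type))
```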

 

The result file will be in CSV format, with the first entry on each line being the URL that was analyzed. The subsequent entries on the line will contain the results for that URL as described above.

 

Result files will have the following naming convention:  <Job ID>_<Job Queue ID>_<a or c>_<part number>.csv.zip

 

The 'a or c' indicates whether the result file contains analyzed (a) or crawled (c) URLs.
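
When sorting a batch of downloaded result files, the naming convention can be split into its parts. The sketch below assumes that 'a' marks analyzed URLs and 'c' marks crawled URLs, and that the IDs and part number contain no underscores; the file name in the example is hypothetical.

```python
import re

# <Job ID>_<Job Queue ID>_<a or c>_<part number>.csv.zip
RESULT_NAME = re.compile(
    r"^(?P<job_id>[^_]+)_(?P<queue_id>[^_]+)_(?P<kind>[ac])_(?P<part>[^_.]+)\.csv\.zip$"
)


def parse_result_name(name):
    """Split a result file name into job ID, queue ID, kind, and part number."""
    match = RESULT_NAME.match(name)
    if match is None:
        raise ValueError(f"not a result file name: {name}")
    info = match.groupdict()
    # Assumption: 'a' = analyzed URLs, 'c' = crawled URLs.
    info["kind"] = "analyzed" if info["kind"] == "a" else "crawled"
    return info


parsed = parse_result_name("1234_5678_a_1.csv.zip")  # hypothetical file name
print(parsed)
```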

 

Note that there may be some URLs that contain all 0s in the results. This will happen if you told 80legs to analyze a page (via your analysis regular expression), but none of your content selection strings were found.

 

If running custom code, result files will show the results of the code.

This setting is only available in the API.  This works only when our default parseLinks() is used. 
Include Crawled URLs in Results

If you select Yes, 80legs will generate two sets of result files. The first set will contain results for pages that were analyzed. The second set will contain the URLs of pages that were crawled, but not analyzed.

 

If you select No, 80legs will only generate result files that contain the URLs of pages that were analyzed.

This setting is only available in the API.
Maximum Result Size

The result files generated by 80legs can be quite large, so we provide this field to help make downloading these files easier. If the total size of your result file is larger than this variable, then 80legs will split the result file into appropriate chunk sizes.

This setting is only available in the API.  In the portal, it is available in the 'Analysis Settings' section above.

 

Must be between 10 and 100.

 

 

Current Limits on Jobs


These are the current limits imposed by our live and sandbox servers:

 

Job Settings and Data Limits

  • Max URLs crawled per job (live servers) = 10,000,000
  • Max URLs crawled per job (sandbox server) = 100
  • Max size of seed list = 1GB
  • Max size of custom data = 10MB
  • Additional cost for processDocument() results larger than 1GB/MPC (larger than 1 gigabyte per million pages crawled - see Pricing)

 

Custom Code Limits

  • Max size of a result returned from processDocument() = 100KB
  • Max time for constructor and initialize() = 60 seconds
  • Max time for parseLinks() and processDocument() = 10 seconds
  • Max size of custom-code JAR = 10MB

 

 

 

 
