 

Seed Lists

Page history last edited by Shion Deysarkar 2 years, 2 months ago


 

Introduction


A seed list is the set of URLs from which your crawl starts.  These are the first URLs 80legs crawls when you run a job.

 

80legs lets you specify your seed list in one of two ways:

  1. Manually provide a seed list through the web portal or API
  2. Upload a text file containing your seed list through the web portal or API

 

 

Creating and Uploading a Seed List


If you plan to upload your seed list, take the following steps:

  1. Open a text editor
  2. Enter your seed list URLs, one URL per line.
  3. Save the file in a text format (e.g., .txt)
  4. Zip the file
  5. Upload it to 80legs through the Web Portal under the Seed Lists tab or through the API.  When you upload the seed list, 80legs will automatically remove any invalid URLs from the list.  See below for valid URL requirements.
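
Steps 1 through 4 above can also be scripted. The following is a minimal sketch using only the Python standard library; the function name `write_zipped_seed_list` and the file names `seeds.txt` and `seeds.zip` are illustrative assumptions, not names required by 80legs. The upload itself (step 5) is not shown, since it depends on the web portal or the API.

```python
import zipfile
from pathlib import Path


def write_zipped_seed_list(urls, txt_path="seeds.txt", zip_path="seeds.zip"):
    """Write one URL per line to a text file, then zip that file."""
    txt = Path(txt_path)
    # One URL per line, as described in step 2.
    txt.write_text("\n".join(urls) + "\n", encoding="utf-8")
    # Store the text file at the top level of the archive.
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.write(txt, arcname=txt.name)
    return Path(zip_path)


if __name__ == "__main__":
    # Hypothetical seed URLs, for illustration only.
    seeds = ["http://www.example.com/", "http://www.example.org/page"]
    print(write_zipped_seed_list(seeds))
```

The resulting zip file is what you would then upload under the Seed Lists tab or through the API.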

 

The maximum allowable size of your seed list depends on your plan.  See the Pricing page for details on seed list size limits.

 

 

Using a Seed List


Once you've uploaded your seed list, you can select it when creating a job.  In the web portal, click "Uploaded Seed List" next to the Seed List URLs field.  Your seed list will be available in the selection field.  If you're using the API, please refer to the API documentation.

 

 

Valid URL Requirements


URLs used as seed list URLs with 80legs must have the following properties:

  • Must not be empty
  • Must not contain < or >
  • Must contain /
  • Must start with http
  • Length must be greater than 12 and less than 800 characters
  • Must not be http://localhost or https://localhost
  • Must not contain \n or \r
  • Must have a valid domain
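
If you want to pre-check URLs before uploading, the URL-level rules above can be sketched roughly as follows. This is an illustration under stated assumptions, not the validator 80legs actually runs: the function name `passes_basic_url_rules` is hypothetical, the localhost rule is interpreted as "the host part of the URL is localhost", and domain validity is deliberately not checked here.

```python
from urllib.parse import urlsplit


def passes_basic_url_rules(url: str) -> bool:
    """Apply the URL-level rules listed above (domain validity excluded)."""
    if not url:                                # must not be empty
        return False
    if "<" in url or ">" in url:               # must not contain < or >
        return False
    if "\n" in url or "\r" in url:             # must not contain \n or \r
        return False
    if "/" not in url:                         # must contain /
        return False
    if not url.startswith("http"):             # must start with http
        return False
    if not (12 < len(url) < 800):              # length strictly between 12 and 800
        return False
    try:
        host = urlsplit(url).hostname
    except ValueError:
        # urlsplit rejects some malformed URLs outright; treat them as invalid.
        return False
    # Assumption: the localhost rule applies to the host part of the URL.
    if host == "localhost":
        return False
    return True
```

For example, `passes_basic_url_rules("http://www.example.com/")` returns `True`, while `passes_basic_url_rules("http://localhost/index.html")` returns `False`.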

 

Valid domains must have the following properties:

  • Top-level domain length must be greater than 1 and less than 7 characters
  • Must contain "."
  • Must not contain ".."
  • Must not start with "."
  • Third-level (sub) domain must contain only letters, numbers, dots, dashes, and underscores
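
The domain rules above can be approximated in a few lines of Python. This is a sketch under assumptions, not 80legs' own implementation: the function name `is_valid_domain` is hypothetical, "letters" is taken to mean ASCII letters, and "third level (sub) domain" is taken to mean everything to the left of the last two dot-separated labels.

```python
import re

# Characters allowed in the subdomain part (assumed to be ASCII only).
_SUBDOMAIN_CHARS = re.compile(r"[A-Za-z0-9._-]*")


def is_valid_domain(domain: str) -> bool:
    """Check a bare domain name against the rules listed above."""
    if "." not in domain:                      # must contain "."
        return False
    if ".." in domain:                         # must not contain ".."
        return False
    if domain.startswith("."):                 # must not start with "."
        return False
    labels = domain.split(".")
    tld = labels[-1]
    if not (1 < len(tld) < 7):                 # TLD length strictly between 1 and 7
        return False
    # Assumption: the subdomain is everything left of the last two labels.
    subdomain = ".".join(labels[:-2])
    return _SUBDOMAIN_CHARS.fullmatch(subdomain) is not None
```

With this reading, `is_valid_domain("my_sub-1.example.com")` returns `True`, while `is_valid_domain("www..example.com")` and `is_valid_domain("localhost")` both return `False`.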

 
