80legs delivers job results either as comma-separated values (.csv) files or as .80 files. The .80 format is a binary format specific to the 80legs service.
Overview
80legs produces two types of results files:
- Crawled URLs - this file contains information about the URLs that were crawled for your job.
- Analyzed URLs - there are two types of Analyzed URLs files. The type of your file depends on whether you ran your own custom I80App or not:
  - Built-in Keyword/Pattern Matching - each Analyzed URLs result file is a CSV file (a comma-separated values file with extension .csv).
  - Custom Code (I80App) - each Analyzed URLs result file is a .80 file, which is specific to the 80legs service. .80 files are binary files.
Result files produced after running your 80app will provide the following information:
- The URL your job analyzed
- The result of your analysis on the content located at that URL
- Additional meta-data for this URL and analysis result
NOTE: Result files are stored for a maximum of 3 days. Users subscribed to a pricing plan may have result files stored for longer. Please see our storage policy for further information.
Crawled URLs File
This file contains information about the URLs that were crawled for your job. To use it, unzip it and open it in a text editor or spreadsheet. It is a comma-separated values (.csv) file containing the following values:
- URL crawled
- Status from crawl - if the page crawl was attempted, this shows a status code (such as 200 or 404). Otherwise it shows a robots.txt error or some other reason that the page was not crawled.
- Status from the processing - this will mostly show GOOD and NO_PROCESS, but you will see some other status messages too (see the process status definitions).
- Document Size - this is the size of the document that was retrieved for the URL in bytes. It does not include the headers and other data from the request.
- Parse Time (ms) - this is the amount of CPU-time, in milliseconds, that was spent parsing this document for links in parseLinks().
- Processing Time (ms) - this is the amount of CPU-time, in milliseconds, that was spent processing this document in processDocument().
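The six columns above can be read with a few lines of Java. The following is only a minimal sketch under several assumptions that are not confirmed by this page: the class name CrawledUrlsSketch is hypothetical, the file is assumed to have no header row, fields are assumed to be unquoted and free of embedded commas, and the three numeric columns are assumed to be whole numbers. The sample rows are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any 80legs library.
class CrawledUrlsSketch {

    /** One row of the Crawled URLs file, in the column order listed above. */
    static final class CrawledUrl {
        final String url;
        final String crawlStatus;    // e.g. "200", "404", or a robots.txt error
        final String processStatus;  // e.g. "GOOD" or "NO_PROCESS"
        final long documentSizeBytes;
        final long parseTimeMs;
        final long processingTimeMs;

        CrawledUrl(String url, String crawlStatus, String processStatus,
                   long documentSizeBytes, long parseTimeMs, long processingTimeMs) {
            this.url = url;
            this.crawlStatus = crawlStatus;
            this.processStatus = processStatus;
            this.documentSizeBytes = documentSizeBytes;
            this.parseTimeMs = parseTimeMs;
            this.processingTimeMs = processingTimeMs;
        }
    }

    // ASSUMPTION: a plain split on commas is enough. If real files quote
    // fields or contain commas inside URLs, use a proper CSV parser instead.
    static CrawledUrl parseLine(String line) {
        String[] f = line.split(",", -1);
        if (f.length != 6) {
            throw new IllegalArgumentException("expected 6 columns, got " + f.length);
        }
        return new CrawledUrl(f[0].trim(), f[1].trim(), f[2].trim(),
                Long.parseLong(f[3].trim()),
                Long.parseLong(f[4].trim()),
                Long.parseLong(f[5].trim()));
    }

    public static void main(String[] args) {
        // Invented sample rows, not real 80legs output.
        String[] sampleLines = {
            "http://example.com/,200,GOOD,5120,3,12",
            "http://example.com/missing,404,NO_PROCESS,0,0,0"
        };
        List<CrawledUrl> rows = new ArrayList<>();
        for (String line : sampleLines) {
            rows.add(parseLine(line));
        }
        long goodBytes = 0;
        for (CrawledUrl row : rows) {
            if (row.processStatus.equals("GOOD")) {
                goodBytes += row.documentSizeBytes;
            }
        }
        System.out.println(rows.size() + " rows, " + goodBytes + " bytes in GOOD documents");
    }
}
```

To process a real file, the same parseLine() method could be applied to each line returned by a BufferedReader over the unzipped .csv file.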
Analyzed URLs File
Pre-Built 80apps
Pre-built 80apps will provide analyzed result files in textual format.
Keyword Matcher, Regular Expression Matcher 80apps
This file contains information about the URLs that were analyzed for your job. To use it, unzip it and open it in a text editor or spreadsheet. It is a comma-separated values (.csv) file containing the following values:
- URL analyzed
- Unique count - this is how many of your items were matched one or more times
- Total count - this is the total number of times all of your items were matched
- Keyword counts - all of the remaining columns are the individual counts for the items you entered. There will be a column for each item and they are in the order you entered them. If you just used "yahoo" as your only string, there will be one column here that will be the total number of times yahoo appeared in the document.
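The relationship between these columns can be checked in code: by the definitions above, the unique count should equal the number of keyword columns with a count greater than zero, and the total count should equal the sum of the keyword columns. The sketch below illustrates this in Java. The class name KeywordCountsSketch is hypothetical, the file is assumed to have no header row and no quoted fields, and the sample row is invented.

```java
// Hypothetical helper, not part of any 80legs library.
class KeywordCountsSketch {

    /** One row of a Keyword Matcher / Regular Expression Matcher result file. */
    static final class Row {
        final String url;
        final int uniqueCount;
        final int totalCount;
        final int[] keywordCounts; // one entry per item, in the order entered

        Row(String url, int uniqueCount, int totalCount, int[] keywordCounts) {
            this.url = url;
            this.uniqueCount = uniqueCount;
            this.totalCount = totalCount;
            this.keywordCounts = keywordCounts;
        }
    }

    // ASSUMPTION: fields are unquoted and contain no embedded commas.
    static Row parseLine(String line) {
        String[] f = line.split(",", -1);
        if (f.length < 4) {
            throw new IllegalArgumentException("expected URL, two counts, and at least one keyword column");
        }
        int[] counts = new int[f.length - 3];
        for (int i = 0; i < counts.length; i++) {
            counts[i] = Integer.parseInt(f[i + 3].trim());
        }
        return new Row(f[0].trim(),
                Integer.parseInt(f[1].trim()),
                Integer.parseInt(f[2].trim()),
                counts);
    }

    /** Recomputes the unique and total counts from the per-keyword columns. */
    static boolean isConsistent(Row row) {
        int unique = 0;
        int total = 0;
        for (int count : row.keywordCounts) {
            if (count > 0) {
                unique++;
            }
            total += count;
        }
        return unique == row.uniqueCount && total == row.totalCount;
    }

    public static void main(String[] args) {
        // Invented sample: three items were entered, two of them matched.
        Row row = parseLine("http://example.com/,2,7,5,0,2");
        System.out.println(row.url + " consistent=" + isConsistent(row));
    }
}
```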
Custom Code - .80 files
Because .80 files are binary, you cannot read them in a text editor. Instead, we recommend that you use the CustomerResults class (Java or .NET) to deserialize the binary data in a .80 file. A simple way to get your results is to call CustomerResults.readFile() as follows:
HashMap<String,byte[]> results = CustomerResults.readFile ( fileName );
The basic deserialize flow is as follows, but you can replace the BufferedInputStream with any InputStream if you want to read some other way (e.g. read from memory):
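import java.io.BufferedInputStream;
import java.io.FileInputStream;

CustomerResults results = new CustomerResults();
// try-with-resources closes the stream even if reading fails
try ( BufferedInputStream r = new BufferedInputStream ( new FileInputStream ( fileName ) ) ) {
    results.startRead ( r );
    String url;
    // readNextUrl() returns null once there are no more url/data pairs
    while ( (url = results.readNextUrl(r)) != null ) {
        byte[] customerData = results.readNextData ( r );
        // do something with the url and customerData
    }
}
catch ( Exception e ) {
    // format error, or an I/O error while opening or reading the file
}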
For people interested in deserializing in other languages, the file format this creates and reads is:
<classID><versionID><URL-SIZE><URL><DATA-SIZE><DATA>
Note that:
- The last 4 items (<URL-SIZE><URL><DATA-SIZE><DATA>) repeat for each URL/data pair.
- <classID>, <versionID>, <URL-SIZE>, and <DATA-SIZE> are encoded 32-bit integers.
- The URL is encoded using UTF-8.
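For readers who want to deserialize without the CustomerResults class, the layout above can be sketched in plain Java. This is a minimal sketch, not the 80legs implementation, and it rests on assumptions this page does not confirm: the 32-bit integers are assumed to be big-endian (the byte order java.io.DataInputStream uses), and <URL-SIZE> and <DATA-SIZE> are assumed to be lengths in bytes. The class name Dot80Sketch and its methods are hypothetical, and the sample bytes are generated by the sketch itself rather than taken from a real .80 file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical reader, not part of any 80legs library.
class Dot80Sketch {

    // Reads <classID><versionID>, then repeated <URL-SIZE><URL><DATA-SIZE><DATA>.
    // ASSUMPTION: integers are big-endian and sizes are byte counts.
    static Map<String, byte[]> read(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        Map<String, byte[]> results = new LinkedHashMap<>();
        data.readInt(); // classID, read but not interpreted here
        data.readInt(); // versionID, read but not interpreted here
        while (true) {
            int urlSize;
            try {
                urlSize = data.readInt();
            } catch (EOFException endOfFile) {
                break; // no more url/data pairs
            }
            byte[] url = readBlock(data, urlSize);
            byte[] payload = readBlock(data, data.readInt());
            results.put(new String(url, StandardCharsets.UTF_8), payload);
        }
        return results;
    }

    private static byte[] readBlock(DataInputStream data, int size) throws IOException {
        if (size < 0) {
            throw new IOException("negative block size: " + size);
        }
        byte[] block = new byte[size];
        data.readFully(block); // throws EOFException if the file is truncated
        return block;
    }

    // Builds an in-memory sample in the assumed layout, for demonstration only.
    static byte[] sample() throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        out.writeInt(1); // made-up classID
        out.writeInt(1); // made-up versionID
        byte[] url = "http://example.com/".getBytes(StandardCharsets.UTF_8);
        byte[] payload = {1, 2, 3};
        out.writeInt(url.length);
        out.write(url);
        out.writeInt(payload.length);
        out.write(payload);
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        Map<String, byte[]> results = read(new ByteArrayInputStream(sample()));
        for (Map.Entry<String, byte[]> entry : results.entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue().length + " bytes");
        }
    }
}
```

If a real .80 file produces implausible sizes with this sketch, the byte-order assumption is the first thing to revisit.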
Here are some additional resources for reading .80 files (n.b.: these have not been checked!):
Retrieving Large Result Files
Due to the nature of 80legs jobs, each result file can be fairly large in size (usually on the order of tens of MBs). Depending on the number of URLs you have crawled and the size of your results, the number of output files might be very large. In this case, we recommend using our 80legs Result Downloader App or the 80legs Command-Line Result Downloader for automated retrieval.
80legs does not store result files permanently. Please read our storage policy for details on how long we keep results.
Your job may encounter errors on occasion. If this happens, the result files should show errors so you know what went wrong. You can see a list of possible errors here.
80legs provides a desktop utility that can be used to download the results from your 80legs jobs. View the docs here.
80legs provides a command-line utility that can be used to download the results from your 80legs jobs. View the docs here.