Table of Contents
I80App Overview
Signatures
public String getVersion();
public void initialize (Properties properties, byte[] data);
public Collection<String> parseLinks(byte[] documentContent, String url, Map<String, String> headers, Map<Default80AppPropertyKeys, Object> default80AppProperties, String statusCodeLine);
public byte[] processDocument(byte[] documentContent, String url, Map<String, String> headers, Map<Default80AppPropertyKeys, Object> default80AppProperties, String statusCodeLine);
How do I implement and use an I80App object?
I80App is a Java interface that must be implemented by any custom code object submitted to 80legs. Using it takes a few simple steps:
- Create a class that implements the I80App interface (see Writing Your Code)
- Create a factory object that creates an instance of your object that implements the I80App interface
- Package your code as a JAR (Java ARchive) (see Exporting Your JAR)
- Test your JAR (see Testing Your Code)
- Submit your JAR to 80legs through the code area of the portal.
- Run 80legs jobs using your JAR through the portal.
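As a rough illustration of the first two steps, here is a minimal sketch of a class that implements the four I80App methods. The I80App and Default80AppPropertyKeys declarations below are stand-ins so that the sketch compiles on its own; in a real project they come from the 80legs sample code, and the enum constant PLACEHOLDER is hypothetical. WebAnalysisFactory is also hypothetical, since this page does not show the factory type that 80legs expects.

```java
import java.nio.charset.StandardCharsets;
import java.util.Collection;
import java.util.Map;
import java.util.Properties;

// Stand-in for the real enum from the 80legs sample code.
// PLACEHOLDER is hypothetical; the real constants are defined by 80legs.
enum Default80AppPropertyKeys { PLACEHOLDER }

// Stand-in for the real interface, using the signatures listed on this page.
interface I80App {
    String getVersion();
    void initialize(Properties properties, byte[] data);
    Collection<String> parseLinks(byte[] documentContent, String url,
            Map<String, String> headers,
            Map<Default80AppPropertyKeys, Object> default80AppProperties,
            String statusCodeLine);
    byte[] processDocument(byte[] documentContent, String url,
            Map<String, String> headers,
            Map<Default80AppPropertyKeys, Object> default80AppProperties,
            String statusCodeLine);
}

// Hypothetical 80App: keeps the default link parsing and reports each document's size.
class WebAnalysis implements I80App {
    @Override
    public String getVersion() {
        return "80App_1.2"; // value quoted on this page; confirm against Sample80App
    }

    @Override
    public void initialize(Properties properties, byte[] data) {
        // Nothing to set up in this sketch; the method must still exist.
    }

    @Override
    public Collection<String> parseLinks(byte[] documentContent, String url,
            Map<String, String> headers,
            Map<Default80AppPropertyKeys, Object> default80AppProperties,
            String statusCodeLine) {
        return null; // null asks 80legs to use its default link parser
    }

    @Override
    public byte[] processDocument(byte[] documentContent, String url,
            Map<String, String> headers,
            Map<Default80AppPropertyKeys, Object> default80AppProperties,
            String statusCodeLine) {
        if (documentContent == null) {
            return null; // null means "no result for this page"
        }
        String result = "bytes=" + documentContent.length;
        return result.getBytes(StandardCharsets.UTF_8);
    }
}

// Hypothetical factory; the factory type 80legs actually requires is not shown here.
class WebAnalysisFactory {
    static I80App create() {
        return new WebAnalysis();
    }
}
```

Returning null from parseLinks() and a short byte[] from processDocument() keeps the sketch within the requirements described in the sections below.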
What is the lifecycle of this object?
In pseudo-code, this object is used like this by 80legs (where WebAnalysis is assumed to be your class that implements the I80App interface):
// 80legs initializes your code
WebAnalysis webAnalysis = new WebAnalysis();
webAnalysis.initialize ( properties, userData );
// 80legs loops over the urls that need to be crawled - these are managed centrally, outside of the user's control
for ( String urlToCrawl : urlsToCrawl ) {
    // 80legs reads the url contents
    crawl ( urlToCrawl );
    // 80legs calls your link parsing code with the contents and information from the current url
    newLinksToCrawl = webAnalysis.parseLinks ( information from urlToCrawl );
    if ( newLinksToCrawl == null ) {
        // if you return null from your parseLinks function, 80legs uses its own default link parsing
        newLinksToCrawl = defaultParseLinks ( information from urlToCrawl );
    }
    // 80legs calls your custom document processing code with the contents and information from the current url
    results = webAnalysis.processDocument ( information from urlToCrawl );
    // 80legs controls the central storage and management of your results as well as the url queue (frontier) - this is outside of the user's control
    sendResultsToCentralManager ( urlToCrawl, newLinksToCrawl, results );
}
getVersion()
Signature
public String getVersion();
What is getVersion()?
getVersion() is used by 80legs to verify that the signatures of your object match the current signatures used. You should return the value suggested by Sample80App from the 80legsCustomCodeSample project. The suggested code is currently:
public String getVersion() {
return "80App_1.2"; // this must be set correctly or 80legs will not use your jar.
}
Are there any requirements for my getVersion() function?
Yes, it must return the correct string or your jar will be rejected.
initialize()
Signature
public void initialize (Properties properties, byte[] data);
What is initialize()?
initialize() is called once by 80legs to initialize each client. It is called after the constructor and before any document contents are read.
Are there any requirements for my initialize() function?
It can be completely empty, but it must exist. It must also complete within 60 seconds or your JAR will not be approved.
What are the arguments for initialize()?
- Properties properties - Currently, the properties argument is not used. In a later version of 80legs, this will be a convenient way to pass user parameters or any key/value (String/String) pairs to your custom object.
- byte[] data - This is the contents of the custom data file uploaded with your job from the portal. The contents are unused by 80legs and are simply passed into your initialize() function so you can use the data to do whatever initialization you might need to do.
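As one possible use of the data argument, the sketch below treats the uploaded file as a UTF-8 text file with one keyword per line. That file format, the KeywordApp class, and its getKeywords() accessor are assumptions made for illustration; 80legs does not interpret the data itself. In a real 80App the class would also implement I80App.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Properties;
import java.util.Set;

// Hypothetical class; only initialize() is shown.
class KeywordApp {
    private final Set<String> keywords = new LinkedHashSet<>();

    // Same signature as I80App.initialize().
    public void initialize(Properties properties, byte[] data) {
        keywords.clear();
        if (data == null) {
            return; // assumption: data may be null when no file was uploaded
        }
        // Assumption: the custom data file is UTF-8 text, one keyword per line.
        String text = new String(data, StandardCharsets.UTF_8);
        for (String line : text.split("\\R")) {
            String keyword = line.trim().toLowerCase(Locale.ROOT);
            if (!keyword.isEmpty()) {
                keywords.add(keyword); // blank lines and duplicates are dropped
            }
        }
    }

    Set<String> getKeywords() {
        return keywords;
    }
}
```

Parsing a small text file like this is cheap, which helps keep initialize() well inside its 60-second limit.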
parseLinks()
Signature
public Collection<String> parseLinks(byte[] documentContent, String url, Map<String, String> headers, Map<Default80AppPropertyKeys, Object> default80AppProperties, String statusCodeLine);
What is parseLinks()?
If you want to control the links that 80legs follows from any document, simply implement this function to return the list of links that you want to follow. Here are some ways this function can be used:
- Parse the links from documentContent
- Pull links from the headers (e.g. redirections or other oddities)
- Speculatively create urls to crawl based on the url of this page (e.g. if the page is http://www.domain.com/testA/testB/index.html, you might want to return http://www.domain.com/testA speculatively)
- Return an empty list - 80legs will not follow any links from this page if you return an empty list
- Return null - this will use the default 80legs link parsing
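The speculative case in the list above could be sketched as follows. SpeculativeLinks and parentUrls() are hypothetical helper names, not part of the 80legs API; the list they produce is what a parseLinks() implementation might return.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for building speculative parent urls.
class SpeculativeLinks {
    // Returns the ancestor paths of a url, nearest first.
    // Returns an empty list if the url cannot be parsed or has no parent path.
    static List<String> parentUrls(String url) {
        List<String> parents = new ArrayList<>();
        URI uri;
        try {
            uri = new URI(url);
        } catch (URISyntaxException e) {
            return parents;
        }
        String path = uri.getRawPath();
        if (uri.getScheme() == null || uri.getHost() == null || path == null) {
            return parents;
        }
        String base = uri.getScheme() + "://" + uri.getRawAuthority();
        // Repeatedly drop the last path segment until only the root is left.
        int slash = path.lastIndexOf('/');
        while (slash > 0) {
            path = path.substring(0, slash);
            parents.add(base + path);
            slash = path.lastIndexOf('/');
        }
        return parents;
    }
}
```

For http://www.domain.com/testA/testB/index.html this yields http://www.domain.com/testA/testB followed by http://www.domain.com/testA. The query string and fragment are deliberately discarded.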
Are there any requirements for my parseLinks() function?
- Return the links you want to crawl. If you don't want to crawl any links from this url, return an empty list. If you want to use the 80legs default link parser, return null.
- parseLinks() plus processDocument() must complete within 10 seconds for each document.
What are the arguments for parseLinks()?
- byte[] documentContent - The byte content of the document to be processed. This could be any sort of document content, including HTML, JPG, SWF, etc.
- String url - The URL of the document to be processed, in UTF-8 String representation.
- Map<String,String> headers - A Map of the HTTP headers retrieved with this document. The Map key is the name of the header and the Map value is the header value. All keys and values are UTF-8 String representations.
- Map<Default80AppPropertyKeys, Object> default80AppProperties - A Map of 80App default properties such as current page depth, etc. Default80AppPropertyKeys is a Java enum class that represents a set of property keys. This class is included in the 80AppTemplate project.
- String statusCodeLine - The first line from the HTTP response for this document, including the protocol, status code, and status description (e.g., "HTTP/1.1 200 OK"), as a UTF-8 String.
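To show how the url, headers, and statusCodeLine arguments might be combined, the sketch below builds a link from a redirect response. It assumes that 3xx responses are passed to parseLinks() with a Location header in the headers Map, and it looks the header up case-insensitively because this page does not say how header names are cased. RedirectLinks and its methods are hypothetical helpers.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Map;

// Hypothetical helper that could be called from parseLinks().
class RedirectLinks {
    // Case-insensitive header lookup; returns null if the header is absent.
    static String header(Map<String, String> headers, String name) {
        if (headers == null) {
            return null;
        }
        for (Map.Entry<String, String> e : headers.entrySet()) {
            if (e.getKey() != null && e.getKey().equalsIgnoreCase(name)) {
                return e.getValue();
            }
        }
        return null;
    }

    // Extracts the numeric code from a line such as "HTTP/1.1 301 Moved Permanently".
    // Returns -1 if the line cannot be parsed.
    static int statusCode(String statusCodeLine) {
        if (statusCodeLine == null) {
            return -1;
        }
        String[] parts = statusCodeLine.trim().split("\\s+");
        if (parts.length < 2) {
            return -1;
        }
        try {
            return Integer.parseInt(parts[1]);
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    // Returns the redirect target (resolved against the page url), or an empty collection.
    static Collection<String> redirectLinks(String url, Map<String, String> headers,
            String statusCodeLine) {
        Collection<String> links = new ArrayList<>();
        int code = statusCode(statusCodeLine);
        String location = header(headers, "Location");
        if (url != null && location != null && code >= 300 && code < 400) {
            try {
                links.add(URI.create(url).resolve(location.trim()).toString());
            } catch (IllegalArgumentException e) {
                // Malformed url or Location value: return no links.
            }
        }
        return links;
    }
}
```

A parseLinks() implementation could return this collection directly, or return null when it is empty so that the default link parser runs instead.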
Do I need to consider duplicates returned from different URLs?
Do not worry about returning duplicates from parseLinks(). 80legs centrally controls the urls to be crawled (the url frontier) and ensures it will not recrawl duplicates. For example, if you return http://www.80legs.com from every single parseLinks() call for your million page crawl, 80legs does not crawl 80legs.com one million times. It is only returned from the frontier once (for performance reasons it might be crawled a few more times if it is near the start or end of a job, but that is another consideration).
Can I see the 80legs link parsing code?
Yes! We have open sourced our link parsing code as a parseLinks() method in our default 80app. Please download 80AppDefaultApp from the source repository. DefaultRegexProcessing is our class that implements I80App. From there, you can see how our default parseLinks() method works. You are welcome to start with that code and create your own custom parseLinks() method from it to implement your custom functionality. Also, see this other question from our main FAQ.
processDocument()
Signature
public byte[] processDocument(byte[] documentContent, String url, Map<String, String> headers, Map<Default80AppPropertyKeys, Object> default80AppProperties, String statusCodeLine);
What is processDocument()?
processDocument() is used to provide your own custom content analysis to 80legs. It is a completely free-form function that you can use to perform any sort of document analysis that you would like (text analysis, image analysis, analyzing other file formats such as .PDF and .DOC). You serialize your results into a byte[] and 80legs will return URL/Result pairs to you for all urls you have analyzed on our system.
Are there any requirements for my processDocument() function?
- Return the results you want from this document as a byte[]. Return null if you do not want this page to be returned in your analyzed results.
- The return value must be no larger than 1024 bytes. It will be truncated and flagged as such if you return a larger byte[].
- parseLinks() plus processDocument() must complete within 10 seconds for each document.
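Because results larger than 1024 bytes are truncated by 80legs, you may prefer to trim them yourself so that the cut lands where you choose. The sketch below assumes the result is text serialized as UTF-8 and trims it without splitting a multi-byte character. ResultLimiter and toBoundedUtf8() are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for keeping processDocument() results within the size limit.
class ResultLimiter {
    static final int MAX_RESULT_BYTES = 1024; // limit stated on this page

    // Encodes text as UTF-8 and trims it to at most MAX_RESULT_BYTES,
    // backing up so that no multi-byte character is cut in half.
    static byte[] toBoundedUtf8(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        if (bytes.length <= MAX_RESULT_BYTES) {
            return bytes;
        }
        int end = MAX_RESULT_BYTES;
        // bytes[end] is the first byte that would be dropped. If it is a UTF-8
        // continuation byte (10xxxxxx), the cut is inside a character, so move back.
        while (end > 0 && (bytes[end] & 0xC0) == 0x80) {
            end--;
        }
        return Arrays.copyOf(bytes, end);
    }
}
```

A processDocument() implementation could then end with return ResultLimiter.toBoundedUtf8(resultText);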
What are the arguments for processDocument()?
- byte[] documentContent - The byte content of the document to be processed. This could be any sort of document content, including HTML, JPG, SWF, etc.
- String url - The URL of the document to be processed, in UTF-8 String representation.
- Map<String,String> headers - A Map of the HTTP headers retrieved with this document. The Map key is the name of the header and the Map value is the header value. All keys and values are UTF-8 String representations.
- Map<Default80AppPropertyKeys, Object> default80AppProperties - A Map of 80App default properties such as current page depth, etc. Default80AppPropertyKeys is a Java enum class that represents a set of property keys. This class is included in the 80AppTemplate project.
- String statusCodeLine - The first line from the HTTP response for this document, including the protocol, status code, and status description (e.g., "HTTP/1.1 200 OK"), as a UTF-8 String.
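Since documentContent can hold any kind of document, a text-oriented processDocument() will usually want to check the Content-Type header before decoding the bytes. The sketch below is one way to do that. ContentDecoder and decodeText() are hypothetical names, the UTF-8 fallback is an assumption, and treating only text/* types as text is a simplification.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that could be called from processDocument().
class ContentDecoder {
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*\"?([^\";\\s]+)", Pattern.CASE_INSENSITIVE);

    // Returns the document as text, or null if Content-Type says it is not text/*.
    static String decodeText(byte[] documentContent, Map<String, String> headers) {
        if (documentContent == null) {
            return null;
        }
        // Case-insensitive lookup, since header-name casing is not specified here.
        String contentType = null;
        if (headers != null) {
            for (Map.Entry<String, String> e : headers.entrySet()) {
                if ("content-type".equalsIgnoreCase(e.getKey())) {
                    contentType = e.getValue();
                }
            }
        }
        if (contentType != null
                && !contentType.trim().toLowerCase(Locale.ROOT).startsWith("text/")) {
            return null; // e.g. JPG or SWF content
        }
        Charset charset = StandardCharsets.UTF_8; // assumed fallback
        if (contentType != null) {
            Matcher m = CHARSET.matcher(contentType);
            if (m.find()) {
                try {
                    charset = Charset.forName(m.group(1));
                } catch (IllegalArgumentException e) {
                    // Unknown or illegal charset name: keep the fallback.
                }
            }
        }
        return new String(documentContent, charset);
    }
}
```

When the Content-Type header is missing, this sketch decodes the bytes as UTF-8 rather than skipping the document; a stricter implementation might return null in that case.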