The jARVEST transformer model
Every harvester built with jARVEST is composed of a set of transformers. A transformer is a component that receives a stream of strings (Java String) and outputs a stream of strings.
In addition, concrete transformers can receive parameters. For example, the xpath transformer (see available transformers) must be configured with the XPath expression to evaluate against the incoming strings, which are treated as HTML documents.
Each transformer can have children, which are also transformers. How strings flow to the children depends on the parent's connection policy, which is either cascade (or serial) or branched. This policy is a parameter common to every transformer.
Basic syntax of one transformer
Example:
Cascade (or serial) connection
The parent can treat its children serially: the parent's outputs are fed to the first child, the outputs of the first child are fed to the second, and so on.
Let's see an example. The following harvester:
is the same as this one:
The two transformers (wget and xpath) composing the harvester are connected in cascade mode because there is a hidden parent (added by default) that is configured to connect its children in cascade. So this is also the same harvester:
The pipe transformer is a simple transformer that does nothing with its inputs; it only forwards them to its children in cascade mode.
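The cascade behavior can be illustrated outside jARVEST with a short Python sketch. It is not jARVEST syntax: the `cascade` helper and the two toy transformers (`split_words`, `upper`) are hypothetical names, used only to model a transformer as a function from a list of strings to a list of strings.

```python
from typing import Callable, List

# Assumption: a transformer is modeled as a function list[str] -> list[str].
Transformer = Callable[[List[str]], List[str]]


def cascade(children: List[Transformer]) -> Transformer:
    """Connect children serially: each child's outputs feed the next child."""
    def run(inputs: List[str]) -> List[str]:
        stream = list(inputs)
        for child in children:
            stream = child(stream)
        return stream
    return run


# Hypothetical stand-ins for real transformers such as wget or xpath.
def split_words(inputs: List[str]) -> List[str]:
    return [word for s in inputs for word in s.split()]


def upper(inputs: List[str]) -> List[str]:
    return [s.upper() for s in inputs]


pipeline = cascade([split_words, upper])
print(pipeline(["a b", "c"]))  # ['A', 'B', 'C']
```

With an empty list of children, `cascade([])` simply forwards its inputs, which mirrors the role described for pipe.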
Branched connection
Any transformer can also branch its output among the children. There are two branch modes:
- Feed all children with its output, known as BRANCH_DUPLICATED branch mode.
- "Scatter" its output among the children, known as BRANCH_SCATTERED branch mode. The first output of the parent goes to the first child, the second output goes to the second, and so on. If there are more outputs than children (n), output n+1 goes to the first child again.
The outputs of the children are then merged into the parent's output. There are three merge modes:
- All the outputs of the first child, next the outputs of the second, and so on, known as ORDERED merge mode.
- The first output of every child, next the second output of every child, and so on, known as SCATTERED merge mode.
- All the outputs merged in SCATTERED order and concatenated into a single output, known as COLLAPSED merge mode. Collapsing the output can also be done with the merge transformer.
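The branch and merge modes listed above can be modeled with a few list operations. The following Python sketch is an illustration only, not jARVEST code: all function names are hypothetical, and two details are assumptions, namely that SCATTERED merging skips children that have run out of outputs and that COLLAPSED concatenates without a separator.

```python
from itertools import zip_longest
from typing import List

_MISSING = object()  # marks "this child has no more outputs"


def branch_duplicated(outputs: List[str], n: int) -> List[List[str]]:
    """BRANCH_DUPLICATED: every one of the n children receives all outputs."""
    return [list(outputs) for _ in range(n)]


def branch_scattered(outputs: List[str], n: int) -> List[List[str]]:
    """BRANCH_SCATTERED: output i goes to child i modulo n (round robin)."""
    return [list(outputs[i::n]) for i in range(n)]


def merge_ordered(child_outputs: List[List[str]]) -> List[str]:
    """ORDERED: all outputs of child 1, then all outputs of child 2, ..."""
    return [s for outs in child_outputs for s in outs]


def merge_scattered(child_outputs: List[List[str]]) -> List[str]:
    """SCATTERED: first output of every child, then the second of every child, ..."""
    merged = []
    for row in zip_longest(*child_outputs, fillvalue=_MISSING):
        merged.extend(s for s in row if s is not _MISSING)
    return merged


def merge_collapsed(child_outputs: List[List[str]]) -> List[str]:
    """COLLAPSED: SCATTERED order, concatenated into one single output."""
    return ["".join(merge_scattered(child_outputs))]


parent_outputs = ["a", "b", "c"]
print(branch_duplicated(parent_outputs, 2))  # [['a', 'b', 'c'], ['a', 'b', 'c']]
print(branch_scattered(parent_outputs, 2))   # [['a', 'c'], ['b']]

children = [["a1", "a2"], ["b1", "b2"]]
print(merge_ordered(children))    # ['a1', 'a2', 'b1', 'b2']
print(merge_scattered(children))  # ['a1', 'b1', 'a2', 'b2']
print(merge_collapsed(children))  # ['a1b1a2b2']
```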
The following harvester retrieves all the links (href) and their text from each input URL, generating two consecutive outputs per link. This is done by branching the output of the previous child (wget) among its children:
Please note that the branching parameters are common to every transformer, so this harvester could also be rewritten as:
Every transformer can be executed in loop mode. In this mode, the transformer includes a specific child (known as the loop controller) that receives the parent's output. If the loop controller returns some output, that output is fed back to the parent transformer; if it returns no output, the loop ends.
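The feedback rule can be modeled in a few lines of Python. This is a sketch under assumptions, not jARVEST code: `run_loop`, `fetch`, `next_page` and the `PAGES` table are hypothetical, the sketch assumes that the outputs of every iteration are accumulated, and the `max_iterations` guard is an addition that keeps the example from running forever.

```python
from typing import Callable, Dict, List

Transformer = Callable[[List[str]], List[str]]


def run_loop(transformer: Transformer, controller: Transformer,
             inputs: List[str], max_iterations: int = 100) -> List[str]:
    """Run the transformer, then feed the controller's outputs back to it."""
    collected: List[str] = []
    current = list(inputs)
    for _ in range(max_iterations):
        outputs = transformer(current)
        collected.extend(outputs)      # assumption: every iteration's outputs are kept
        current = controller(outputs)  # the loop controller sees the parent's output
        if not current:                # no output from the controller: the loop ends
            break
    return collected


# Toy "site": each page holds some content and the name of the next page.
PAGES: Dict[str, str] = {
    "page1": "A,B|page2",
    "page2": "C|page3",
    "page3": "D|",
}


def fetch(urls: List[str]) -> List[str]:
    return [PAGES[url] for url in urls]


def next_page(contents: List[str]) -> List[str]:
    links = [content.split("|")[1] for content in contents]
    return [link for link in links if link]


print(run_loop(fetch, next_page, ["page1"]))
# ['A,B|page2', 'C|page3', 'D|']
```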
To configure a given transformer in loop mode, you should call the repeat? method at the end of the transformer:
Loop mode is useful to iterate over paginated results in web pages. For example:
Parameters with input auto-references
Every parameter value can include a special placeholder: %%n%% (n is a number >= 0), which stands for "the value of input n". The first input is 0. For example, the following harvester:
will compare each input (except the first) against the first input.
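How such a placeholder could be resolved is sketched below in Python. The `resolve` function is hypothetical and does not reproduce jARVEST internals; in particular, leaving a placeholder untouched when no input with that index exists is an assumption.

```python
import re
from typing import List

_PLACEHOLDER = re.compile(r"%%(\d+)%%")


def resolve(param: str, inputs: List[str]) -> str:
    """Replace every %%n%% in a parameter value with the n-th input (0-based)."""
    def substitute(match: "re.Match[str]") -> str:
        index = int(match.group(1))
        if index < len(inputs):
            return inputs[index]
        return match.group(0)  # assumption: unknown inputs are left as written
    return _PLACEHOLDER.sub(substitute, param)


inputs = ["10", "7", "12"]
print(resolve("%%0%%", inputs))                # 10
print(resolve("id=%%1%%&ref=%%0%%", inputs))   # id=7&ref=10
```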
You can define global variables with the setvar transformer. It is useful for saving values at any time (including the value arriving as an input) and retrieving them later as a parameter of any other transformer. Example:
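In plain Python (not jARVEST syntax), the behavior of setvar can be approximated as follows. The `variables` dictionary and the `setvar` and `resolve_vars` functions are hypothetical names; the sketch assumes variable names start with a letter or underscore, and that an undefined variable is left as written.

```python
import re
from typing import Dict, List

variables: Dict[str, str] = {}  # assumption: one global store per harvester run

_VARIABLE = re.compile(r"%%([A-Za-z_]\w*)%%")


def setvar(name: str, value: str, inputs: List[str]) -> List[str]:
    """Store a variable and forward the inputs unchanged."""
    variables[name] = value
    return list(inputs)


def resolve_vars(param: str) -> str:
    """Replace every %%varname%% with the stored value, if there is one."""
    return _VARIABLE.sub(
        lambda match: variables.get(match.group(1), match.group(0)), param
    )


forwarded = setvar("base", "http://example.org", ["x", "y"])
print(forwarded)                       # ['x', 'y']
print(resolve_vars("%%base%%/page"))   # http://example.org/page
```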
You can filter some inputs in any transformer. These inputs will be ignored and consumed, so they will not be passed to the next transformers. This is done with the inputFilter parameter, which receives a string stating which inputs should be ignored (example 1: 1,5,6; example 2: 0-10; example 3: 3-).
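The three example strings suggest a small grammar: single indices, closed ranges and open-ended ranges. The Python sketch below parses such a string and drops the matching inputs. It is a hypothetical model, and it assumes that indices are zero-based and that ranges are inclusive, neither of which is stated above.

```python
from typing import List, Optional, Tuple

Range = Tuple[int, Optional[int]]  # (first index, last index or None if open-ended)


def parse_filter(spec: str) -> List[Range]:
    """Parse strings such as '1,5,6', '0-10' or '3-' into index ranges."""
    ranges: List[Range] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = part.split("-", 1)
            ranges.append((int(start), int(end) if end else None))
        else:
            ranges.append((int(part), int(part)))
    return ranges


def is_ignored(index: int, ranges: List[Range]) -> bool:
    return any(lo <= index and (hi is None or index <= hi) for lo, hi in ranges)


def apply_filter(inputs: List[str], spec: str) -> List[str]:
    """Return only the inputs whose position is not listed in the filter."""
    ranges = parse_filter(spec)
    return [s for i, s in enumerate(inputs) if not is_ignored(i, ranges)]


print(apply_filter(list("abcdefg"), "1,5,6"))  # ['a', 'c', 'd', 'e']
print(apply_filter(list("abcde"), "3-"))       # ['a', 'b', 'c']
```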
|Transformer||Description / Parameters|
|wget||For each input string 's', performs an HTTP GET request to the URL 's' and returns its contents as a new output.|
|xpath||For each input string 's', treats it as HTML by building its DOM tree and runs a given XPath expression. Each matched content is returned as a new output.|
|xpathscrap||For each input string 's', treats it as HTML by building its DOM tree and runs a given XPath expression. The whole inner XML of each matched content is retrieved.|
|select||For each input string 's', treats it as HTML and selects nodes with a given CSS selector expression. For each matched node, a) the inner combined text (default), b) a specified attribute or c) the inner HTML can be returned.|
|decorate||For each input string 's', generates a new output by prepending a 'head' and appending a 'tail'.|
|match||For each input string 's', matches a regular expression with only one capture group (in parentheses). Each captured result is returned as a new output.|
|append||All input strings (if any) are returned as new outputs, plus a given additional output string at the end.|
|replace||For each input string 's', generates a new output by replacing each match of a regular expression with a given string.|
|compare||For each input string 's', compares it with a given value 'v' as String, Date or Number, and generates a new output by prefixing the input with a different prefix depending on whether 's' is less than, equal to or greater than 'v', or whether the comparison produced an error.|
|merge||Collapses all inputs into a single output.|
|post||Performs an HTTP POST request to a URL given in its parameters. The output of this transformer can be (i) the input strings with no transformations or (ii) the response of the server as a single output (inputs are ignored).
Note: All returned cookies are kept during the rest of the harvester execution (including further wget/post requests). In other words, you can use this transformer to log in to sites with cookie-based sessions.|
|pipe||A simple transformer (does not transform the data), but useful for grouping a set of children in serial connection.|
|branch||A simple transformer (does not transform the data), but useful for grouping a set of children in branched connection.|
|one_to_one||Treats each input of the parent independently and ensures at most one output per input. That is, each output of the parent is forwarded to the child block one at a time, and the child block's outputs are collapsed before the next output of the parent is forwarded.
For example, suppose we have multiple input sites and want to run an xpath query over each site. Each xpath query could return more than one output, so to keep the correspondence between each input URL and its xpath query results, we must use the one_to_one approach.|
|setvar||Defines a "global variable" with a given name and value. The variable can be retrieved afterwards with %%varname%%. This transformer does not modify the inputs; it only forwards them.|
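The one_to_one entry in the table is the least obvious of these transformers, so a small Python model may help. It is a sketch, not jARVEST code: the `one_to_one` function, its `separator` parameter and the `find_digits` child are hypothetical, and the choice to emit nothing for an input whose child block returns no output is an assumption based on the "at most one output per input" wording.

```python
import re
from typing import Callable, List

Transformer = Callable[[List[str]], List[str]]


def one_to_one(child: Transformer, inputs: List[str],
               separator: str = "") -> List[str]:
    """Run the child block once per input and collapse its outputs into one."""
    results: List[str] = []
    for item in inputs:
        outputs = child([item])        # forward a single input at a time
        if outputs:                    # assumption: no outputs means no result
            results.append(separator.join(outputs))
    return results


# Hypothetical child block that may return several outputs per input.
def find_digits(inputs: List[str]) -> List[str]:
    return [digit for s in inputs for digit in re.findall(r"\d", s)]


print(find_digits(["a1b2", "none", "c3"]))                  # ['1', '2', '3']
print(one_to_one(find_digits, ["a1b2", "none", "c3"], ","))  # ['1,2', '3']
```

Without the wrapper, the three digits arrive as a flat list and it is no longer possible to tell which input produced which result.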