Define clue extraction rules

Tools used: MetaStudio, a data schema definition tool.

Create clues

Only clues are to be extracted from the category page, so defining data extraction rules is skipped in this phase.

On the Clue Editor work board, take the following steps to create a clue of type Pattern:

  1. Create a new clue by clicking the newClue button;
  2. Set the clue type to Pattern by selecting the Pattern radio button;
  3. In the Pattern tab window, right-click and select the menu item Insert to create a pattern record for the new clue.


Map clues

Operators must tell MetaStudio at which position, or within which scope, one or more clues are to be extracted. This is done by mapping a DOM node that represents the position or scope to the newly created clue. The steps are as follows:

  1. Enable reverse selection;
  2. Browse the sample page that has been loaded into the embedded Web browser. Click the string "Product Directory" so that MetaStudio positions the DOM node at row No. 465;
  3. Choose the parent of node 465, at row No. 461, as the scope within which clues are to be extracted. When the node is selected in the DOM Tree Viewer, the border of the corresponding HTML area blinks 3 times, which lets you check whether the DOM node is suitable;
  4. Right-click and choose the pop-up menu item Clue Mapping>>Clue Mapping>>s_clue 0 to map node 461 to clue No. 0.

Note: Some types of DOM nodes, e.g. the HTML TBODY element and text nodes, do not have blinking borders.

Note: When a node can represent the scope, its ancestors also enclose the scope. Operators should select a suitable node among them. If the scope is too large, unwanted clues may be extracted by mistake. If the scope is too small, some clues may be missed.
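
The trade-off described in the note above can be illustrated outside MetaStudio. The following is only a sketch: the page fragment, the element ids and the helper links_in_scope are hypothetical and are not part of MetaStudio or of the sample page. It collects the hyperlinks found below three candidate scope nodes.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified page: a navigation bar plus a product directory.
PAGE = """
<body>
  <div id="nav"><a href="/help">Help</a></div>
  <div id="directory">
    <h2>Product Directory</h2>
    <ul>
      <li><a href="/agriculture-food/">Agriculture &amp; Food</a></li>
      <li><a href="/apparel/">Apparel</a></li>
    </ul>
  </div>
</body>
""".strip()


def links_in_scope(scope):
    """Return the href of every A element at or below the given scope node."""
    return [a.get("href") for a in scope.iter("a")]


root = ET.fromstring(PAGE)
directory = root.find("./div[@id='directory']")
first_item = directory.find("./ul/li")

too_large = links_in_scope(root)        # also picks up the navigation link
suitable = links_in_scope(directory)    # exactly the directory links
too_small = links_in_scope(first_item)  # misses the second directory link

print(too_large)
print(suitable)
print(too_small)
```

In this sketch the whole body stands for a scope that is too large, a single list item for one that is too small, and the directory block for a suitable one.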



Map patterns

Take the following steps to map pattern values and to name target themes:

  1. With reverse selection enabled, click a hyperlink in the embedded browser window, e.g. Agriculture & Food, from which a clue will be extracted, to position the HTML A element node in the DOM Tree Viewer;
  2. Expand the sub-tree below the node and select the attribute node @href;
  3. Right-click and choose the pop-up menu item Clue Mapping>>Pattern Mapping>>scope 0 to automatically fill the Loc Prefix edit box of the pattern record with the URL of this hyperlink;
  4. Edit the pattern value so that it covers all the hyperlinks. Unfortunately, in this example, only the shortest string, "/", is valid;
  5. Fill in the Target Theme edit box with the target name, i.e. ComYellowPage_mic_en_l2.
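
Why the pattern value has to be shortened can be sketched as follows. The sketch assumes that Loc Prefix is compared with each URL as a plain leading substring; MetaStudio's actual matching may differ. The hrefs and the helper covered are hypothetical, since the real category page is not reproduced here.

```python
import os.path

# Hypothetical hrefs standing in for the category links on the sample page.
HREFS = ["/agriculture-food/", "/apparel/", "/chemicals/"]


def covered(hrefs, loc_prefix):
    """Return the hrefs that start with loc_prefix (assumed matching rule)."""
    return [h for h in hrefs if h.startswith(loc_prefix)]


# The value filled in automatically is the URL of the clicked link,
# so at first it covers only that one link.
copied = HREFS[0]
print(covered(HREFS, copied))

# The longest prefix shared by all links is the widest usable pattern value.
# os.path.commonprefix compares character by character, not path by path.
widest = os.path.commonprefix(HREFS)
print(repr(widest))
print(covered(HREFS, widest) == HREFS)
```

With links like these, which share nothing beyond the leading slash, the common prefix is "/", which is consistent with step 4.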

The following figure shows the clue and the pattern after mapping:
