Define data extraction rules

Tools used: MetaStudio, a data schema definition tool.

Create a bucket and properties

Take the following steps to create a bucket and its properties:

  1. On the Bucket Editor work board, click the newBckt button;
  2. Name the bucket. In this example, the name is Product. After you submit the name, a property mapping table is created;
  3. Right-click the property mapping table and choose Property>>Create from the pop-up menu to create properties and set their attributes.

The following properties and their attributes are created (shown in Figure 1):

Property Name   Key                 Clue   Url   Block   Null
title           Validation & Data   No     No    No      No
product page    Validation & Data   Yes    Yes   No      No
description     No                  No     No    No      No

Because the clue attribute of the property product page is set, an Info clue is created automatically on the Clue Editor work board.


Figure 1
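
The property table above can also be written down as plain data, which makes it easier to see which attributes are set on which property. The following is only an illustrative sketch in Python: the class names Property and Bucket and their fields are hypothetical and are not part of MetaStudio, which stores its schema in its own format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Property:
    """One row of the property table (hypothetical model, not a MetaStudio API)."""
    name: str
    key: str = "No"       # "Validation & Data" or "No", as listed in the table
    clue: bool = False
    url: bool = False
    block: bool = False
    null: bool = False


@dataclass
class Bucket:
    """A named bucket holding its properties."""
    name: str
    properties: List[Property] = field(default_factory=list)


product = Bucket("Product", [
    Property("title", key="Validation & Data"),
    Property("product page", key="Validation & Data", clue=True, url=True),
    Property("description"),
])

# Properties whose clue attribute is set; per the text, these lead to an Info clue.
clue_properties = [p.name for p in product.properties if p.clue]
print(clue_properties)
```

In this model only product page has its clue attribute set, which matches the table.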



Map properties

Take the following steps to map properties:

  1. Enable reverse selection;
  2. In the embedded browser, click the first product's name, i.e. "Antique Wooden Gift Box (TF7-0016)", to locate DOM node No. 1219 in the DOM Tree Viewer;
  3. Expand the DOM subtree to find the #text node whose value is "Antique Wooden Gift Box (TF7-0016)";
  4. Right-click the node and choose Info Mapping>>name from the pop-up menu to map the #text node (No. 1223) to the property title;
  5. Repeat steps 2 to 4 to map the property product page. Because the property's clue attribute is set, only the @href DOM node, No. 1222, can be mapped;
  6. Repeat steps 2 to 4 to map the property description.
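
The search performed by hand in steps 2 to 4 (walk down a subtree until the #text node with the expected value is found, then take the @href of its enclosing link) can be sketched with Python's standard xml.dom.minidom module. This is an illustration only and does not reproduce MetaStudio's internals. The sample markup and the href value /product/tf7-0016 are assumptions made up for the example, and the node numbers shown in the DOM Tree Viewer are not modelled.

```python
from xml.dom import minidom

# Hypothetical, simplified product row; the real page structure is not shown in this tutorial.
SAMPLE = (
    "<table><tr>"
    '<td><a href="/product/tf7-0016">Antique Wooden Gift Box (TF7-0016)</a></td>'
    "<td></td>"
    "</tr></table>"
)


def find_text_node(node, value):
    """Depth-first search for a #text node whose stripped value equals `value`."""
    if node.nodeType == node.TEXT_NODE and node.data.strip() == value:
        return node
    for child in node.childNodes:
        found = find_text_node(child, value)
        if found is not None:
            return found
    return None


doc = minidom.parseString(SAMPLE)
text_node = find_text_node(doc, "Antique Wooden Gift Box (TF7-0016)")

title = text_node.data                                     # value for property title
product_page = text_node.parentNode.getAttribute("href")   # value for property product page

print(text_node.nodeName, title, product_page)
```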

To learn more about properties and their attributes, refer to the MetaStudio User's Guide or the MetaStudio Senior User's Handbook.

Note: All properties, except those with the null attribute set, must be mapped. Even if a DOM node has an empty value, e.g. the #text node for the property description, it should still be mapped. If the HTML page's structure varies so much that the required node for a property cannot be found, there are two cases. If all required DOM nodes exist in the first two rows of the product list, structure changes in other rows do not affect data schema definition or data extraction. If any required DOM node in the first two rows cannot be found, choose another sample page.
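
The rule in the note (every property without the null attribute must be mapped) amounts to a simple completeness check. A minimal sketch follows, assuming the mapping is kept as a dictionary from property name to DOM node number. The function name unmapped_required is hypothetical. The node numbers are the ones quoted in the steps above, and description is deliberately left unmapped to show what the check reports.

```python
def unmapped_required(null_attribute, mapping):
    """Return the properties that must be mapped but are not.

    null_attribute: dict of property name -> True if the null attribute is set
    mapping:        dict of property name -> mapped DOM node number
    """
    return [
        name
        for name, nullable in null_attribute.items()
        if not nullable and name not in mapping
    ]


# None of the three properties has the null attribute set (see the property table).
null_attribute = {"title": False, "product page": False, "description": False}

# State after steps 1 to 5: description has not been mapped yet.
mapping = {"title": 1223, "product page": 1222}

missing = unmapped_required(null_attribute, mapping)
print(missing)
```

A property with the null attribute set would be skipped by the check, whether or not it is mapped.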

Figure 2 shows mapping relations between properties and DOM nodes.


Figure 2

In the Replica Management group box, right-click to enable the second replica. Then repeat the mapping operations to map the DOM nodes in the second row to the second replica. Figure 3 shows the mapping.


Figure 3
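
One way to see why a second replica is useful: comparing the position of the same property in two sample rows reveals which step of the DOM path repeats. The sketch below illustrates that idea only. It is not MetaStudio's actual algorithm, which this tutorial does not describe, and the path notation and the function name generalize are assumptions made for the example.

```python
def generalize(path_a, path_b):
    """Replace the indexes of steps that differ between two sample paths with '*'."""
    steps_a, steps_b = path_a.split("/"), path_b.split("/")
    if len(steps_a) != len(steps_b):
        raise ValueError("sample paths have different depths")
    pattern = []
    for a, b in zip(steps_a, steps_b):
        if a == b:
            pattern.append(a)
            continue
        name_a, name_b = a.split("[")[0], b.split("[")[0]
        if name_a != name_b:
            raise ValueError("sample paths differ in element name: %s vs %s" % (a, b))
        pattern.append(name_a + "[*]")  # same element, different index: a repeating step
    return "/".join(pattern)


# Hypothetical paths to the title #text node in the first and second product rows.
first_row = "table/tr[1]/td[1]/a/#text"
second_row = "table/tr[2]/td[1]/a/#text"

pattern = generalize(first_row, second_row)
print(pattern)
```

With a single sample row there is nothing to compare against, so the repeating step cannot be told apart from the fixed ones.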

Click the MAP and GEM buttons to preview the data extraction instruction file and the data structure specification file, shown in Figure 4 and Figure 5 respectively. Then click TestThis or TestAll to verify the data extraction rules. The extraction results are shown in Figure 6.


Figure 4



Figure 5



Figure 6