Validate extraction rules

Tools used:

After defining the data and clue extraction rules as described in the previous chapter, operators can preview and validate every kind of instruction file: the data extraction instruction file (an XSLT file, also called the MAP file), the data structure specification file (also called the GEM file), the clue extraction instruction file (also called the SCE file), and the data extraction workflow file (also called the Profile file). If the files can be previewed without any failure alerts, and all rules in the data schema recognition rule file (also called the DSD file) pass the validation test, further testing can proceed.
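Because the MAP file is an XSLT file, and XSLT stylesheets are XML documents, a quick well-formedness check can catch gross damage before the file is opened for preview. The following is a minimal sketch and not part of MetaStudio; the function name check_xslt_text is hypothetical, and the check is far weaker than the preview and validation that MetaStudio itself performs.

```python
import xml.etree.ElementTree as ET

# The standard XSLT namespace URI.
XSLT_NS = "http://www.w3.org/1999/XSL/Transform"


def check_xslt_text(text):
    """Return a list of problems found in an XSLT document given as text.

    Hypothetical helper: it only checks that the text is well-formed XML
    and that the root element is xsl:stylesheet or xsl:transform.
    """
    try:
        root = ET.fromstring(text)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]

    expected = (f"{{{XSLT_NS}}}stylesheet", f"{{{XSLT_NS}}}transform")
    if root.tag not in expected:
        # XSLT also permits a simplified syntax with another root element,
        # so treat this as a warning to look at, not a definite error.
        return [f"unexpected root element: {root.tag}"]
    return []


sample = (
    '<xsl:stylesheet version="1.0" '
    'xmlns:xsl="http://www.w3.org/1999/XSL/Transform">'
    '<xsl:template match="/"/>'
    "</xsl:stylesheet>"
)
print(check_xslt_text(sample))
```

An empty list means only that the file parses as XML with an XSLT root element; whether the rules extract the right data still has to be verified in MetaStudio.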

Operators can also verify the data extraction rules by trying to extract data from the sample page with MetaStudio. Push the TestThis or TestAll button in the GEM Editor tab window of MetaStudio's Output region to view the extracted results. If failures are reported or the results are not as expected, there must be errors in the data extraction rules. How to use these buttons is described in MetaStudio's User Guide.

If all of the above has succeeded, it is time to try the extraction rules with DataScraper. Run DataScraper and enter the number of clues to be extracted to start the workflow engine, as described in detail in DataScraper's User Guide. For a new theme, the number of clues should be small at first, because new extraction rules may not work as expected, in which case they should be redefined with MetaStudio. If the rules work properly in the trial, long-running continuous extraction can be started. Detailed information is available in MetaStudio's User Guide and DataScraper's User Guide.

During extraction, log messages may appear in DataScraper's output window. If multiple data schemas are defined for a single theme, DataScraper tries them one by one to find a suitable one. The process is logged as described in DataScraper's User Guide. If none is suitable, an error is logged and the clue's status is set to unknownschema. Later, operators can load the page that the clue points to into MetaStudio's work boards to analyze its data structure. An additional data schema may need to be defined, because the target site may use multiple templates to generate its HTML pages. Conversely, if the page contributes nothing to the overall extraction results, the clue with status unknownschema can be ignored.
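The schema selection described above can be sketched as a simple loop. This is an illustration of the described behavior only, not DataScraper's actual code; the Clue and Schema classes, the matches callback, the log wording, and the initial status name "start" are all assumptions made for the example. Only the status name unknownschema comes from the text.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Schema:
    name: str
    # Hypothetical stand-in for a schema's recognition rules:
    # returns True when the page fits this schema.
    matches: Callable[[str], bool]


@dataclass
class Clue:
    url: str
    status: str = "start"  # assumed initial status name
    log: List[str] = field(default_factory=list)


def pick_schema(clue: Clue, page: str, schemas: List[Schema]) -> Optional[Schema]:
    """Try each schema in order and return the first one that fits the page."""
    for schema in schemas:
        if schema.matches(page):
            clue.log.append(f"schema {schema.name}: suitable")
            return schema
        clue.log.append(f"schema {schema.name}: not suitable")
    # No schema fits: log an error and mark the clue for later analysis.
    clue.log.append("error: no suitable schema")
    clue.status = "unknownschema"
    return None


schemas = [
    Schema("list_page", lambda page: "<ul" in page),
    Schema("detail_page", lambda page: "<h1" in page),
]
clue = Clue("http://example.com/item")
chosen = pick_schema(clue, "<p>unrecognized layout</p>", schemas)
print(chosen, clue.status)
```

In this sketch a clue left in status unknownschema carries its own log, which mirrors the workflow in the text: the operator can later inspect the page and decide whether to define another schema or ignore the clue.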