Theme

When defining data schemas in MetaStudio, operators do not need to define a separate data schema for every Web page, because many pages share the same data structure and presentation format, much as the records of a table share one table schema. The operator only needs to define one data schema for such a group of pages, based on one sample page. So far, everything looks straightforward.

Unfortunately, Web pages with the same semantics may vary in presentation format. For example, a page may be presented differently depending on whether you are logged on to the site, even though the semantics of the content do not change. As a result, operators have to define multiple data schemas for pages with the same semantics. The concept of a Theme is used to group Web pages by semantics instead of by presentation format. Every theme has a theme name. To differentiate the data schemas belonging to the same theme, each data schema is assigned a Middle Name. In summary, a Data Schema is uniquely identified by the combination of its theme name and its Middle Name.
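The naming rule above can be pictured as a lookup table keyed by the pair (theme name, Middle Name). The following is only a minimal sketch of that idea in Python; `SchemaKey`, `SchemaRegistry`, and the sample names are hypothetical and are not part of MetaStudio or DataScraper.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class SchemaKey:
    """Hypothetical composite key: theme name plus Middle Name."""
    theme: str
    middle_name: str


class SchemaRegistry:
    """Illustrative in-memory store; not how MetaStudio stores schemas."""

    def __init__(self) -> None:
        self._schemas: Dict[SchemaKey, str] = {}

    def register(self, theme: str, middle_name: str, description: str) -> None:
        key = SchemaKey(theme, middle_name)
        # The pair must be unique, so a second registration is rejected.
        if key in self._schemas:
            raise ValueError(f"duplicate data schema: {theme}/{middle_name}")
        self._schemas[key] = description

    def schemas_for_theme(self, theme: str) -> Dict[str, str]:
        # All schemas of one theme, indexed by their Middle Name.
        return {
            key.middle_name: desc
            for key, desc in self._schemas.items()
            if key.theme == theme
        }


registry = SchemaRegistry()
# Same semantics (a product page), two presentation formats.
registry.register("product", "guest", "layout shown without logging on")
registry.register("product", "member", "layout shown after logging on")
```

Here the two schemas share the theme `product` and differ only in their Middle Name, so `registry.schemas_for_theme("product")` returns both of them.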

When extracting data from the Web, DataScraper first determines a theme by analyzing the URL string. It then loads all Data Schemas belonging to the theme and automatically selects a suitable one according to the Data Schema Recognition Rule File (DSRRF). After that, it loads all Data and Clue Extraction Instruction Files (DCEIF) of this Data Schema into its workflow engine.
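The three steps above (URL to theme, theme to Data Schema, Data Schema to instruction files) can be sketched roughly as follows. This is an illustration under stated assumptions, not DataScraper's implementation: the URL patterns, the `recognizes` predicate standing in for a DSRRF, and the instruction file names standing in for DCEIFs are all hypothetical, and the real file formats are not shown here.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class DataSchema:
    theme: str
    middle_name: str
    # Hypothetical stand-in for the recognition rules in a DSRRF.
    recognizes: Callable[[str], bool]
    # Hypothetical stand-ins for the DCEIFs of this schema.
    instruction_files: List[str] = field(default_factory=list)


# Assumed mapping from URL patterns to theme names.
URL_THEMES = [
    (re.compile(r"^https?://example\.com/product/"), "product"),
]

SCHEMAS = [
    DataSchema("product", "guest", lambda page: "Log in" in page,
               ["product_guest_1"]),
    DataSchema("product", "member", lambda page: "Log out" in page,
               ["product_member_1", "product_member_2"]),
]


def theme_for_url(url: str) -> Optional[str]:
    """Step 1: determine the theme by analyzing the URL string."""
    for pattern, theme in URL_THEMES:
        if pattern.search(url):
            return theme
    return None


def select_schema(url: str, page: str) -> Optional[DataSchema]:
    """Step 2: load the theme's schemas and pick the one whose rule matches."""
    theme = theme_for_url(url)
    if theme is None:
        return None
    candidates = [s for s in SCHEMAS if s.theme == theme]
    for schema in candidates:
        if schema.recognizes(page):
            return schema
    return None


def load_instructions(schema: DataSchema) -> List[str]:
    """Step 3: hand the schema's instruction files to the workflow engine."""
    return list(schema.instruction_files)


chosen = select_schema("https://example.com/product/42", "Welcome back. Log out")
```

With the sample page text above, the `member` schema is chosen because only its rule matches, and `load_instructions(chosen)` yields that schema's two instruction files. A URL that matches no pattern, or a page that no rule recognizes, yields `None`.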