Data Structure Specification File

A Data Structure Specification File, also called a GEM file, describes the data structure of a data extraction result file. The specification files are stored on the DataStore server in the folder $CATALINE/work/DataStore/context/extraction/config/<theme_name>/, and their names carry the suffix .gem.xml. The structure of such a file is as follows:

<?xml version="1.0" encoding="UTF-8"?>
<geometa-extraction-mapping> <!-- the root element -->
  <theme>testTheme</theme> <!-- the theme name -->
  <transform> <!-- If the target Web page is transformed into an intermediate DOM, this element specifies the transformation rules. -->
    <template>testTheme.default.gem.xslt <!-- name of the data extraction instruction file -->
      <context>//*[@id='blueFrame']</context> <!-- If the data to be extracted are enclosed in an HTML FRAME/IFRAME, -->
      <context>//*[@id='rightFrame']</context> <!-- each context value locates that FRAME/IFRAME element. -->
    </template>
    <output>transDOM_xxx</output> <!-- If a transformation is applied, this value is the name of the intermediate DOM. -->
  </transform>
  <bean name="objectName"> <!-- name of the bucket -->
    <property name="txt" type="string"> <!-- name of the property -->
      <from> <!-- the source from which the property is extracted -->
        <type>DOM</type> <!-- DOM denotes that the property is extracted from a DOM. -->
        <location>transDOM_xxx</location> <!-- Where to locate the source. transDOM_xxx denotes one of the intermediate DOMs. -->
        <path>//txt/text()</path> <!-- The XPath expression that locates the property -->
      </from>
    </property>
    <property name="header" type="string"> <!-- one more property -->
      <from>
        <type>DOM</type>
        <location>transDOM_xxx</location>
        <path>//header/text()</path>
      </from>
    </property>
  </bean>
</geometa-extraction-mapping>

Where,

  • Types of sources from which properties are extracted can take the following values:
    • DOM: properties are extracted from a DOM.
    • XML: properties are extracted from an XML document.
  • Locations of sources from which properties are extracted can take the following values:
    • transDOM_xxx: the name of the intermediate DOM from which properties are extracted. This DOM is produced by transforming a target Web page according to a data extraction instruction file.
    • a file path: properties are extracted from a file.
    • URL: properties are extracted from a source on the Web.
  • In the string "transDOM_xxx", xxx is the serial number of the XSLT file used to transform the target Web page.
  • There may be multiple transform elements, each corresponding to exactly one transDOM_xxx.
  • There are three types of properties:
    • string: the value of the property is a string.
    • block: the value of the property is an HTML fragment, although it is literally also a string.
    • link: the value of the property is a URL that should be handled separately when the result files are processed. The value may not be a complete URL, because the XSLT command does not modify the value during extraction. The software that processes the results should therefore complete any value of type link.
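
As a rough illustration of how downstream software might use a GEM file to complete link values, the Python sketch below reads the declared property types and resolves link-typed values against a base URL. It is a minimal sketch under stated assumptions, not part of DataStore. The function names (property_types, complete_links), the "url" property with its XPath, the dictionary standing in for one extracted record, and the base URL are all hypothetical. Only the element and attribute names come from the sample above.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Abbreviated GEM document that reuses the element names of the sample above.
# The "url" property and its XPath are made up for illustration.
SAMPLE_GEM = """<geometa-extraction-mapping>
  <theme>testTheme</theme>
  <bean name="objectName">
    <property name="txt" type="string">
      <from>
        <type>DOM</type>
        <location>transDOM_xxx</location>
        <path>//txt/text()</path>
      </from>
    </property>
    <property name="url" type="link">
      <from>
        <type>DOM</type>
        <location>transDOM_xxx</location>
        <path>//a/@href</path>
      </from>
    </property>
  </bean>
</geometa-extraction-mapping>
"""


def property_types(gem_xml):
    """Map each bean name to a {property name: declared type} dictionary."""
    root = ET.fromstring(gem_xml)
    return {
        bean.get("name"): {
            prop.get("name"): prop.get("type")
            for prop in bean.findall("property")
        }
        for bean in root.findall("bean")
    }


def complete_links(record, types, base_url):
    """Return a copy of record in which link-typed values are absolute URLs."""
    return {
        name: urljoin(base_url, value) if types.get(name) == "link" else value
        for name, value in record.items()
    }


# Hypothetical extracted record; the real result file format is not shown here.
types = property_types(SAMPLE_GEM)["objectName"]
record = {"txt": "hello", "url": "../docs/page.html"}
completed = complete_links(
    record, types, "http://example.org/site/list/index.html"
)
```

Values of type string and block pass through unchanged. Only values whose property is declared as link are resolved, and here the base is assumed to be the URL of the page that was extracted.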