Data Structure Specification File

The Data Structure Specification File, also called a GEM file, describes the data structure of a data extraction result file. Specification files are stored on the DataStore server in the folder $CATALINE/work/DataStore/context/extraction/config/<theme_name>/, and their file names carry the suffix .gem.xml. The structure of such a file is as follows:

<?xml version="1.0" encoding="UTF-8"?>

<geometa-extraction-mapping>

<theme>testTheme</theme>

<template>testTheme.default.gem.xslt

<context>//*[@id='blueFrame']</context>

<context>//*[@id='rightFrame']</context>

</template>

<output>transDOM_xxx</output>

</transform>

<location>transDOM_xxx</location>

</from>

</property>

<from>

<location>transDOM_xxx</location>

<path>//header/text()</path>

</from>

</property>

</bean>

</geometa-extraction-mapping>

Where,

The type of the source from which properties are extracted can take the following values:
- DOM: properties are extracted from a DOM.
- XML: properties are extracted from an XML document.
The location of the source from which properties are extracted can take the following values:
- transDOM_xxx: names the DOM from which properties are to be extracted; this DOM is the product of transforming a target Web page according to a data extraction instruction file.
- a file path: properties are extracted from a file.
- a URL: properties are extracted from a source on the Web.
In the string "transDOM_xxx", xxx is the serial number of the XSLT file used to transform the target Web page.
There may be multiple transform elements, each corresponding to exactly one transDOM_xxx.
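The three kinds of location value can be told apart mechanically. The following is a minimal sketch in Python, not part of the DataStore software: the function name classify_location is hypothetical, the serial number is assumed to consist of letters, digits, or underscores, and only http and https values are treated as URLs.

```python
import re
from urllib.parse import urlparse

# Assumption: the serial number after "transDOM_" is made of word characters.
_TRANS_DOM = re.compile(r"^transDOM_(\w+)$")


def classify_location(location):
    """Return (kind, detail) for the text of a <location> element.

    kind is "transDOM", "url", or "file"; detail is the serial number
    for a transDOM name, otherwise the trimmed location itself.
    """
    value = location.strip()
    match = _TRANS_DOM.match(value)
    if match:
        return ("transDOM", match.group(1))
    # Assumption: only http/https count as Web sources in this sketch.
    if urlparse(value).scheme in ("http", "https"):
        return ("url", value)
    # Anything else is treated as a file path.
    return ("file", value)
```

For example, classify_location("transDOM_001") yields ("transDOM", "001"), so the serial number can be used to look up the transform element that produced that DOM.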
There are three types of properties:
- string: the value of the property is a string.
- block: the value of the property is an HTML fragment, although it is literally stored as a string.
- link: the value of the property denotes a URL, which should be handled separately when the result files are processed. The value may not be a complete URL, because the XSLT command does not modify the value during extraction. Software that processes the results should therefore complete any value of type link.
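Completing a link value usually means resolving it against the address of the page it was extracted from. The following is a minimal sketch in Python using the standard library's urljoin; the function name complete_link and the example addresses are hypothetical, and the sketch assumes the consuming software knows the URL of the target Web page.

```python
from urllib.parse import urljoin


def complete_link(page_url, extracted_value):
    """Resolve a possibly relative link value against the page URL.

    Values that are already absolute URLs are returned unchanged
    (apart from trimming surrounding whitespace).
    """
    return urljoin(page_url, extracted_value.strip())


# Hypothetical page address, used only for illustration.
page = "http://www.example.com/news/index.html"
print(complete_link(page, "item?id=3"))   # http://www.example.com/news/item?id=3
print(complete_link(page, "/about"))      # http://www.example.com/about
```

Properties of type string or block would be passed through untouched; only values of type link need this step.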
