Preparation

Tools used: MetaStudio, a data schema definition tool.

In this step, the semantic structure of one of the target pages, called the sample page, is analyzed in preparation for defining the data schema in the next step.

In a forum, each discussion on a forum topic can be viewed as sitting in its own "container" on an HTML page. The containers are placed one after another in chronological order. Currently, most forums are built on a few popular forum platforms, and every HTML page that holds discussion containers is generated on the fly from a few templates. As a result, the data schema stays stable over time, so data extraction from forums is among the easiest extraction tasks and this preparation step can be finished quickly.

For MetaStudio to generate data and clue extraction rules, it must be told the structure of a discussion's "container" and where the data snippets sit within that container. A data snippet is positioned on an HTML page by means of the DOM, so familiarity with the DOM and HTML helps operators run MetaSeeker and define data schemas.
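To make the idea of "containers" and DOM positions concrete, here is a minimal sketch in Python, outside MetaStudio. It assumes a hypothetical, well-formed page in which every discussion sits in a <div class="post">; the class names (post, title, content, book) and the example.com URLs are invented for illustration and will differ on a real forum platform.

```python
import xml.etree.ElementTree as ET

# Hypothetical forum page: one <div class="post"> container per discussion.
PAGE = """
<html><body>
  <div class="post">
    <h2 class="title">Thinking in Java</h2>
    <div class="content">A classic introduction to Java.</div>
    <a class="book" href="http://example.com/tij.pdf">download</a>
  </div>
  <div class="post">
    <h2 class="title">Effective Java</h2>
    <div class="content">Best practices for the Java platform.</div>
    <a class="book" href="http://example.com/ej.pdf">download</a>
  </div>
</body></html>
"""

def extract_posts(page):
    """Return one dict per container, in document (chronological) order."""
    root = ET.fromstring(page.strip())
    posts = []
    # Each container is located by its position and attributes in the DOM.
    for container in root.findall(".//div[@class='post']"):
        posts.append({
            "title": container.find("h2[@class='title']").text,
            "content": container.find("div[@class='content']").text,
            "book_page": container.find("a[@class='book']").get("href"),
        })
    return posts

posts = extract_posts(PAGE)
```

Because every container comes from the same template, the same three relative paths work for each discussion; that is the property the extraction rules rely on.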

MetaStudio has an embedded DOM tree viewer, which provides a GUI for operators to view the structure of an HTML document. For a complex HTML page, the DOM tree may be very large and hard to analyze. Fortunately, the DOM tree viewer is not the only way to position a DOM node containing the data snippet to be extracted. MetaStudio also provides reverse selection, in which the DOM node is selected in the DOM tree viewer by clicking a data snippet directly in the embedded browser. Consider an example. To position the DOM node containing a discussion's title, "Thinking in Java", the operator simply clicks that string in the embedded browser. MetaStudio then positions the DOM node and highlights its row in the DOM tree viewer with a blue background.

Note that reverse selection can be imprecise: if the target node, typically a text node, is nested deeply in the DOM tree, one of its ancestors may be positioned instead of the node itself. In such cases, the selected node should be double-checked by verifying its value and attributes. If the current node is not the expected one, the exact node can be found by expanding the subtree under the currently positioned node and searching within it.
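The refinement step described above, searching the subtree under a positioned ancestor for the exact node, can be sketched as follows. This is only an illustration in Python, not MetaStudio's implementation; the function names find_exact_node and path_to and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical container in which the title is nested several levels deep.
PAGE = """<html><body>
<div class="post"><table><tr><td><b><span>Thinking in Java</span></b></td></tr></table></div>
</body></html>"""

def find_exact_node(start, snippet):
    """Search the subtree under `start` for the element that directly owns `snippet`.

    Only each element's own leading text is checked (not text that follows
    a child element), which is enough for this sketch.
    """
    for node in start.iter():
        if node.text and snippet in node.text:
            return node
    return None

def path_to(root, node):
    """Return the chain of tag names from `root` down to `node`."""
    parents = {child: parent for parent in root.iter() for child in parent}
    tags = [node.tag]
    while node in parents:
        node = parents[node]
        tags.append(node.tag)
    return "/".join(reversed(tags))

root = ET.fromstring(PAGE)
# Suppose reverse selection stopped at this ancestor instead of the text's owner.
ancestor = root.find(".//div[@class='post']")
exact = find_exact_node(ancestor, "Thinking in Java")
path = path_to(root, exact)
```

Here the positioned ancestor is the div, while the node that actually holds the title is the span five levels below it, which is the kind of mismatch the operator has to check for.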

While analyzing the structure of the HTML page, the operator should decide which data snippets are to be extracted and processed later. As an example, we are interested in the following data snippets:

  • content brief: a brief summary of the discussion on a topic, which will be indexed and stored in the Lucene index.
  • content: the discussion on a topic, which will be indexed and stored in the Lucene index.
  • title: the title of a book, which will be indexed and stored in the Lucene index.
  • book page: the location where the book is stored, which will be stored in the Lucene index without being indexed. When a Lucene document containing this field is retrieved, the field is presented as a hyperlink through which the book can be downloaded.

The goal, evidently, is to build an ebook search engine through which a source for downloading a book can be found.
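The four data snippets listed above, with their indexing and storage choices, can be written down as a small table in code. The sketch below is plain Python and is not MetaStudio's schema format or Lucene's API; the Field class and its attribute names are hypothetical and only restate the list.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Field:
    name: str
    description: str
    indexed: bool  # searchable through the index
    stored: bool   # kept so it can be shown in search results

# Hypothetical in-code summary of the data snippets listed above.
SCHEMA = [
    Field("content brief", "brief summary of the discussion", indexed=True, stored=True),
    Field("content", "the discussion on a topic", indexed=True, stored=True),
    Field("title", "the title of a book", indexed=True, stored=True),
    Field("book page", "location where the book is stored", indexed=False, stored=True),
]

# Fields a user can search on, and fields only shown with a result.
indexed_names = [f.name for f in SCHEMA if f.indexed]
display_only = [f.name for f in SCHEMA if f.stored and not f.indexed]
```

Only the book page is display-only: it is not searched, but it is returned with each hit so that it can be rendered as a download link.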



Exercises

Use this forum topic as the sample page for the following exercises.

  • Enable reverse selection.
  • Click any data snippet in the embedded browser and watch MetaStudio automatically position the corresponding DOM node.
  • Click the content of the first discussion in the discussion thread. Switch to the DOM tree viewer and traverse the tree to locate the nodes for content brief, content, title and book page. Study the HTML structure.
  • Click the content of the second discussion in the discussion thread. Switch to the DOM tree viewer and traverse the tree to locate the nodes for content brief, content, title and book page. Study the HTML structure and compare it with that of the first discussion.
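The comparison in the last exercise can also be done mechanically. The following Python sketch reduces each container to a structural "signature" (depth, tag name and class, ignoring the text) and checks whether two containers share one. The markup and the signature function are hypothetical and stand in for whatever the sample page actually contains.

```python
import xml.etree.ElementTree as ET

def signature(container):
    """Return the container's structure as (depth, tag, class) tuples, ignoring text."""
    result = []

    def walk(node, depth):
        result.append((depth, node.tag, node.get("class")))
        for child in node:
            walk(child, depth + 1)

    walk(container, 0)
    return result

# Two hypothetical discussion containers produced by the same template.
FIRST = ET.fromstring(
    '<div class="post"><h2 class="title">Thinking in Java</h2>'
    '<div class="content">First post.</div></div>'
)
SECOND = ET.fromstring(
    '<div class="post"><h2 class="title">Effective Java</h2>'
    '<div class="content">Second post.</div></div>'
)

same_template = signature(FIRST) == signature(SECOND)
```

If the two signatures match, rules defined on the first discussion can be expected to apply to the second one as well; a mismatch points to optional elements that the data schema has to allow for.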