Define data and clue extraction rules

Tools used: MetaStudio, a data schema definition tool.

MetaStudio provides several types of buckets, which describe the semantic structure of the "containers" that hold data snippets on HTML pages. In the current release only one type, ListBucket, is available; it is best suited to extracting forums, blogs, commodity lists and yellow pages. A bucket stores a group of properties. During data extraction, each instance of a bucket, together with its properties, is cast into a bean. Multiple buckets may be defined against a sample page; together they make up a data schema.
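The relation between a bucket, its properties and a bean can be pictured with a small Python model. This is only an illustrative sketch: the class `ListBucket` below, its `to_bean` method and the sample values are hypothetical names chosen for this example and are not part of MetaStudio, which defines buckets through its work boards rather than in code.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ListBucket:
    """Hypothetical stand-in for a bucket: a name plus its property names."""
    name: str
    properties: List[str] = field(default_factory=list)

    def to_bean(self, values: Dict[str, str]) -> Dict[str, str]:
        """Cast one extracted bucket instance into a flat record (the "bean").

        Only declared properties are kept; a property with no extracted
        value is stored as an empty string.
        """
        return {prop: values.get(prop, "") for prop in self.properties}


# One bucket with two properties, and one extracted instance of it.
bucket = ListBucket("PostDetail", ["title", "content"])
bean = bucket.to_bean({"title": "Hello", "content": "World", "extra": "ignored"})
print(bean)
```

The point of the sketch is that the bucket fixes the shape of every record in advance, so each repeated item on the page yields a bean with the same set of properties.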

On the work boards of MetaStudio, the operator assigns the mapping relations from data snippets on the sample page to properties. MetaStudio provides the DOM Tree Viewer and Reverse Selection to make this easier; both are described in detail in MetaStudio User's Guide and MetaStudio Senior User's Handbook.

A forum holds tens or hundreds of discussions on a topic. To define data extraction rules, MetaStudio has to find the semantic structure of a single discussion and the rule by which discussions repeat on the page. To start, the operator creates a few buckets and their properties on the Bucket Editor work board, then maps the data snippets of the first two discussions to the primary and secondary replicas of the bucket. From these two replicas MetaStudio can find the repeat rule and generate data extraction rules automatically. See MetaStudio User's Guide#Define data extraction rules for further information.
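How two mapped replicas can be enough to find a repeat rule can be sketched as follows. This is not MetaStudio's actual algorithm, which the text does not describe. It is a minimal illustration under the assumption that each mapped snippet is identified by a path of (tag, index) steps from the document root, and that the two replicas differ at exactly one step. All names and the sample paths are made up for the example.

```python
from typing import List, Optional, Tuple

Step = Tuple[str, int]  # (tag name, 1-based index among same-tag siblings)


def infer_repeat_level(primary: List[Step], secondary: List[Step]) -> Optional[int]:
    """Return the position of the single step where the two paths differ.

    Returns None when the paths have different lengths, differ at more than
    one step, or differ in tag name rather than only in index.
    """
    if len(primary) != len(secondary):
        return None
    diffs = [i for i, (a, b) in enumerate(zip(primary, secondary)) if a != b]
    if len(diffs) != 1:
        return None
    level = diffs[0]
    if primary[level][0] != secondary[level][0]:
        return None
    return level


def to_path_rule(path: List[Step], repeat_level: int) -> str:
    """Build an XPath-like rule, dropping the index at the repeating step."""
    parts = []
    for i, (tag, index) in enumerate(path):
        parts.append(tag if i == repeat_level else f"{tag}[{index}]")
    return "/" + "/".join(parts)


# Hypothetical paths of the same property in the first two discussions.
first = [("html", 1), ("body", 1), ("table", 2), ("tr", 1), ("td", 2)]
second = [("html", 1), ("body", 1), ("table", 2), ("tr", 2), ("td", 2)]

level = infer_repeat_level(first, second)
rule = to_path_rule(first, level)
print(rule)  # /html[1]/body[1]/table[2]/tr/td[2]
```

In this sketch the table row is the repeating container, so the generalized rule matches the same cell in every row, not only in the two rows that were mapped by hand.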

Like an ordinary spider, MetaSeeker should extract from a page not only data but also clues, that is, links along which new pages can be visited. Since not all hyperlinks are of interest to the operator, MetaStudio can be told which clues to extract, which theme each clue belongs to, and whether a clue is in-thread or new.
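The idea of keeping only selected clues can be illustrated with a short Python sketch built on the standard library's `html.parser`. It is not how MetaSeeker is implemented. The sketch assumes that an "in-thread" clue leads to a further page of the current theme while a "new" clue leads to a page of another theme; the selection criteria (a ">>" link text and a `thread-` URL prefix), the theme names and the sample HTML are all hypothetical.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional, Tuple


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from the anchors of an HTML page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[Tuple[str, str]] = []
        self._href: Optional[str] = None
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def select_clues(links, current_theme, marker=">>", detail_prefix="thread-",
                 detail_theme="PostDetail") -> List[Dict[str, str]]:
    """Keep only the links of interest and label each with a kind and a theme."""
    clues = []
    for href, text in links:
        if text == marker:  # assumed: marker link continues the current theme
            clues.append({"url": href, "kind": "in-thread", "theme": current_theme})
        elif href.startswith(detail_prefix):  # assumed: link opens another theme
            clues.append({"url": href, "kind": "new", "theme": detail_theme})
        # every other link is ignored
    return clues


page = ('<a href="thread-1.html">Topic</a> '
        '<a href="about.html">About</a> '
        '<a href="list-2.html">&gt;&gt;</a>')
collector = LinkCollector()
collector.feed(page)
collector.close()
clues = select_clues(collector.links, current_theme="PostList")
print(clues)
```

Of the three links on the sample page, only two become clues; the "About" link matches neither criterion and is dropped.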

In summary, MetaSeeker lets you precisely extract semantically structured data from the Web and saves you from fishing for a needle in the ocean.



Exercises

Take the following steps:

  1. Name the new theme PostDetail_itpub.
  2. Create a bucket named PostDetail, with the properties determined in the previous chapter.
  3. Map data snippets for the primary replica as stated in MetaStudio User's Guide. The resulting mapping relations are:
    • DOM node No. 885 -> content brief
    • DOM node No. 893 -> content
    • DOM node No. 939 -> title
    • DOM node No. 938 -> book page


  4. Map data snippets for the secondary replica. The following figure shows the mapping relations assigned.
  5. An Info Clue already exists on the Clue Editor work board. Create one more clue of type Marker Clue and map the marker ">>" as stated in MetaStudio User's Guide.
  6. Upload the data schema specification file and the data and clue extraction instruction files to the MetaCamp server and the DataStore server as stated in MetaStudio User's Guide.