Phase 1: extract categories

Most B2B, B2C, C2C and yellow-page sites provide a page that classifies their information into categories. As discussed in the previous section, this page is the entrance for crawling the site. MetaSeeker first extracts a clue from every category item, then follows those clues to extract the commodity lists or business entity lists in each category. Extracting all business information from a site of this kind generally takes the following phases:

  1. Clues are extracted from the category page, i.e. the entrance to all business information;
  2. Following the clue for a specific category, the whole catalog of commodities or business entities in that category is extracted;
  3. If the previous step yielded a clue to the detailed information of a commodity or business entity, that information is extracted by following the clue.
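
The three phases above amount to a crawl that starts at the category page and repeatedly follows clues. The following is a minimal sketch of that idea in plain Python, not MetaSeeker's actual implementation. It assumes a clue can be modelled as a URL, and it replaces real HTTP access with a hypothetical in-memory table named SITE. Every URL, record, and function name in it is invented for illustration.

```python
from collections import deque

# Hypothetical in-memory "site": each page maps to (records, clues).
# All URLs and records are invented; a real crawl would fetch pages instead.
SITE = {
    "/categories": ([], ["/cat/tools", "/cat/toys"]),            # phase 1
    "/cat/tools": ([{"name": "hammer"}], ["/item/hammer"]),      # phase 2
    "/cat/toys": ([{"name": "kite"}], ["/item/kite"]),           # phase 2
    "/item/hammer": ([{"name": "hammer", "price": "5"}], []),    # phase 3
    "/item/kite": ([{"name": "kite", "price": "3"}], []),        # phase 3
}


def crawl(entrance, fetch):
    """Follow clues breadth-first, starting from the entrance page.

    fetch(url) must return a (records, clues) pair for one page.
    Returns a dict mapping each visited URL to the records found there.
    """
    queue = deque([entrance])
    seen = {entrance}
    results = {}
    while queue:
        url = queue.popleft()
        records, clues = fetch(url)
        results[url] = records
        for clue in clues:
            if clue not in seen:  # never follow the same clue twice
                seen.add(clue)
                queue.append(clue)
    return results


results = crawl("/categories", SITE.__getitem__)
for url, records in results.items():
    print(url, records)
```

In this sketch the category page yields no records of its own, only clues, which mirrors what phase 1 produces.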

This chapter covers phase 1. The other phases are described in "Phase 2: extract catalog" and "Phase 3: extract detailed information". The target site is http://www.made-in-china.com

Note: MetaStudio currently provides only ListBucket, which describes the data schema of a two-dimensional table. On category pages, however, the categories are organized as a tree whose sub-trees represent sub-categories. Because ListBucket is not suited to capturing the relations between categories and their sub-categories, only clues are extracted in this phase.
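
To illustrate the limitation described in the note, the sketch below flattens a small category tree into table-like rows that hold only clues. It is plain Python and does not reflect ListBucket's real format. The tree, its URLs, and the function name collect_clues are all hypothetical.

```python
# Hypothetical category tree: each node has a name, a link (the clue),
# and a list of sub-categories. All values are invented for illustration.
TREE = {
    "name": "All",
    "url": None,
    "children": [
        {
            "name": "Machinery",
            "url": "/cat/machinery",
            "children": [
                {"name": "Pumps", "url": "/cat/pumps", "children": []},
            ],
        },
        {"name": "Toys", "url": "/cat/toys", "children": []},
    ],
}


def collect_clues(node):
    """Return one flat row per category link, in depth-first order.

    The parent/child relation between categories is deliberately dropped,
    because a flat two-dimensional table has no place to store it.
    """
    rows = []
    if node["url"] is not None:
        rows.append({"clue": node["url"]})
    for child in node["children"]:
        rows.extend(collect_clues(child))
    return rows


rows = collect_clues(TREE)
for row in rows:
    print(row)
```

The resulting rows no longer record that "Pumps" sits under "Machinery". That is acceptable here, because the clues alone are enough to reach every category's catalog in the next phase.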

The following steps are taken in this phase:

  1. Define a data schema for the category page with MetaStudio, then upload the data schema and the data extraction and clue extraction instruction files to the MetaCamp server.
  2. Extract a clue for every category with DataScraper.