Clue

The most important information in a Clue is the URL of the target page. DataScraper uses it to load the target page into its embedded browser and extract data and new clues. From those new clues, DataScraper extracts still more clues. The process repeats, and the crawl covers an ever wider part of the Web.
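
The repeating process described above can be sketched as a small loop. This is only an illustration, not DataScraper's actual implementation: the FAKE_WEB dictionary, the crawl function, and the example.com URLs are all made up, and the breadth-first order and duplicate check are assumptions that the description above does not specify.

```python
from collections import deque

# Hypothetical stand-in for the Web: URL -> (data on the page, clue URLs found on it).
FAKE_WEB = {
    "http://example.com/": ("home", ["http://example.com/a", "http://example.com/b"]),
    "http://example.com/a": ("page a", ["http://example.com/b", "http://example.com/c"]),
    "http://example.com/b": ("page b", ["http://example.com/"]),
    "http://example.com/c": ("page c", []),
}


def crawl(original_clue, web):
    """Follow clues breadth-first, starting from the original clue."""
    pending = deque([original_clue])  # clues waiting to be crawled
    seen = {original_clue}            # clues already queued (assumed de-duplication)
    extracted = {}                    # URL -> data extracted from that page

    while pending:
        url = pending.popleft()
        # "Load" the target page: look up its data and the clues it contains.
        data, new_clues = web.get(url, (None, []))
        extracted[url] = data
        for clue in new_clues:
            if clue not in seen:
                seen.add(clue)
                pending.append(clue)
    return extracted


result = crawl("http://example.com/", FAKE_WEB)
print(list(result))
```

Each pass through the loop loads one target page and may add new clues to the queue, so the set of crawled pages grows until no unseen clue remains.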

The original clue, the entry point for crawling the Web on a specific theme, is created by MetaStudio when a data schema is defined against a sample page. How to define a data schema and create an original clue is described on the page MetaSeeker Cook Book#Scenario 2: collect information on commodities.

From the point of view of how DataScraper uses clues, they fall into two categories:

  • In-thread clue: When DataScraper begins to crawl the Web from a clue retrieved from the DataStore server, the workflow engine starts one thread to execute the task. If, for example, a forum spans multiple pages and all of its topics are to be extracted, DataScraper must turn the pages one after another. To turn a page, DataScraper extracts from the current page a clue that points to the next page and uses it to navigate there. Clues of this type are not stored on the DataStore server; they are used only within the current execution thread of the workflow.
  • New clue: During data and clue extraction, every clue of this type is stored in the SpiderClue table on the DataStore server. DataScraper starts a new workflow thread by retrieving a New clue whose status is start from the DataStore server.
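
The difference between the two categories can be sketched as follows. This is a simplified model under stated assumptions, not the real workflow engine: the FORUM dictionary, the run_thread function, and the example.com URLs are hypothetical, a plain Python list stands in for the SpiderClue table, and treating topic links as New clues is an assumption made for the example.

```python
# Hypothetical forum: each page lists topic links and, optionally, a next-page link.
FORUM = {
    "http://example.com/forum?page=1": {
        "topics": ["http://example.com/topic/1", "http://example.com/topic/2"],
        "next": "http://example.com/forum?page=2",
    },
    "http://example.com/forum?page=2": {
        "topics": ["http://example.com/topic/3"],
        "next": None,
    },
}


def run_thread(start_clue, pages, clue_store):
    """Run one workflow thread, turning pages until no next-page clue is found."""
    visited = []
    url = start_clue
    while url is not None:
        page = pages[url]
        visited.append(url)
        # New clues: persisted with status "start" so that a later thread can
        # pick them up (assumption: topic links are defined as New clues).
        for topic_url in page["topics"]:
            clue_store.append({"url": topic_url, "status": "start"})
        # In-thread clue: never stored, only followed inside this thread.
        url = page["next"]
    return visited


spider_clue = []  # stands in for the SpiderClue table on the DataStore server
visited = run_thread("http://example.com/forum?page=1", FORUM, spider_clue)
print(visited)
print(spider_clue)
```

After the thread finishes, the stand-in table holds only the topic clues; the next-page clue was consumed inside the thread and left no record.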

From the point of view of how DataScraper extracts new clues from a target page, clues fall into five categories:

  • Info Clue: A clue of this type is extracted by a data extraction rule rather than a clue extraction rule. In this case the extracted data snippet is presented as a hyperlink, so a clue can be extracted from it at the same time as the data.
  • Single Clue: A clue of this type is extracted from a fixed location on the target page according to a clue extraction rule. Because the location is fixed, this kind of extraction rule is the most likely to fail when the HTML DOM structure changes.
  • Marker Clue: A clue is extracted within a scope on the target page by matching a mark, that is, a character string enclosed in an HTML A tag that must appear within the scope. For example, when forum topics are paginated, the page carries a mark, e.g. next page, for navigating to the next page. A Marker Clue can be defined on this mark, and DataScraper uses it to navigate to the next page.
  • Pattern Clue: A series of clues is extracted within a scope of the target HTML document according to patterns in the URL strings. For example, all clues whose URLs contain http://www.gooseeker.com are extracted from a specific scope of the target page.
  • Relative Clue: Once a reference HTML DOM node has been determined within a scope of the target page, a relative clue can be extracted from its next sibling node of the same type. For example, the marks [1] [2] [3] indicate pagination. If we are browsing page [2], we take the node presented as [2] as the reference point and extract a clue from the node presented as [3].
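
To make the Marker, Pattern, and Relative categories more concrete, here is a rough sketch built on Python's standard html.parser module. It is not how DataScraper implements its extraction rules: the function names, the sample HTML, and the URL paths are invented for illustration, the whole fragment is treated as the scope, and "next sibling node of the same type" is approximated as the next A element in document order.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> element, in document order."""

    def __init__(self):
        super().__init__()
        self.anchors = []   # finished (href, text) pairs
        self._href = None   # href of the <a> currently open; None when outside one
        self._text = []     # text pieces seen inside the open <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Anchors without an href carry no clue and are ignored.
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None


def collect_anchors(scope_html):
    parser = AnchorCollector()
    parser.feed(scope_html)
    parser.close()
    return parser.anchors


def marker_clue(anchors, mark):
    """Marker Clue: URL of the first anchor whose text equals the mark."""
    for href, text in anchors:
        if text == mark:
            return href
    return None


def pattern_clues(anchors, pattern):
    """Pattern Clue: every URL in the scope that contains the pattern string."""
    return [href for href, _ in anchors if pattern in href]


def relative_clue(anchors, reference_text):
    """Relative Clue: URL of the anchor that follows the reference anchor."""
    for i, (_, text) in enumerate(anchors[:-1]):
        if text == reference_text:
            return anchors[i + 1][0]
    return None


# Hypothetical scope: a pager block; the paths under www.gooseeker.com are invented.
SCOPE = """
<div class="pager">
  <a href="http://www.gooseeker.com/forum?page=1">[1]</a>
  <a href="http://www.gooseeker.com/forum?page=2">[2]</a>
  <a href="http://www.gooseeker.com/forum?page=3">[3]</a>
  <a href="http://www.gooseeker.com/forum?page=3">next page</a>
  <a href="http://example.com/ad">sponsor</a>
</div>
"""

anchors = collect_anchors(SCOPE)
print(marker_clue(anchors, "next page"))
print(pattern_clues(anchors, "http://www.gooseeker.com"))
print(relative_clue(anchors, "[2]"))
```

In this sample the Marker and Relative lookups lead to the same page-3 URL, while the Pattern lookup returns the four gooseeker.com links and skips the sponsor link.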