Pros and Cons

Compared to an Ordinary Web Crawler

A Web crawler goes by many aliases, for example Web spider, Web robot, or Web worm. While the DataScraper tool in MetaSeeker acts much like a Web crawler, the MetaSeeker toolkit as a whole is different from an ordinary Web crawler.

An ordinary Web crawler starts by loading a Web page that serves as its entry point for crawling a specific area of the Web. After manipulating the page, e.g. storing it on disk, it tries to find all clues, i.e. links (hyperlinks, or Web links), that can lead it to further HTML pages. How far the found clues are followed, i.e. the crawling depth, is determined by the crawler's user. Crawling ever further and wider, it behaves just like a real spider crawling a real web. If you ask a Web crawler what it has found, it can tell you it has a great amount of data and clues; unfortunately, it does not know what the data is about. If you only want to set up an information retrieval system, e.g. a search engine based on full-text indexing, a crawler satisfies all your requirements by downloading as many HTML pages as possible. On the contrary, if you want to manipulate the information on the pages differently according to its meta data (what the data is about), an ordinary Web crawler can do little for you.
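To make this concrete, here is a minimal sketch of such a crawler in Python. It is illustrative only: the entry URL and depth limit are assumptions, and a real crawler would add politeness delays, robots.txt handling, and persistent storage.

```python
# A minimal sketch of an ordinary Web crawler: breadth-first link
# following up to a fixed depth, with no notion of data schema.
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

def crawl(entry_url, max_depth=2):
    seen = {entry_url}
    queue = deque([(entry_url, 0)])
    while queue:
        url, depth = queue.popleft()
        try:
            html = urlopen(url).read().decode("utf-8", errors="ignore")
        except OSError:
            continue
        # "Manipulate" the page: here we merely note its size.
        print(f"fetched {url} ({len(html)} bytes) at depth {depth}")
        if depth >= max_depth:
            continue
        # Find clues (links) and follow them further and wider.
        for href in re.findall(r'href="([^"]+)"', html):
            link = urljoin(url, href)
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))

# crawl("http://example.com/", max_depth=1)  # hypothetical entry page
```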

On the contrary, MetaSeeker tries to define and recognize the data schema (or data structure, or meta data) of Web pages, which is specified with XML tags in Data Schema Specification Files. Keep in mind that the schema only uncovers what the data is about, not the meaning of the data, i.e. what the data IS. MetaSeeker is not an artificial intelligence toolkit. What it does is help users define and recognize the data schema of Web pages, providing a friendly GUI and many facilities. It cannot recognize a data schema autonomously; instead, users must tell it which piece of data is about what. While this is simple compared to some solutions based on artificial intelligence technology, it is one of the most effective, flexible, and robust approaches.
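As a purely hypothetical illustration (the element names below are invented, not MetaSeeker's actual Data Schema Specification format), output annotated with a schema lets a program address each field by what it is about:

```python
# Hypothetical example of schema-annotated extraction output, where
# each field carries a semantic name so programs can consume it by
# meaning rather than by its position on the page.
import xml.etree.ElementTree as ET

extracted = """
<item>
  <title>Example product</title>
  <price currency="USD">19.99</price>
  <seller>ACME Corp</seller>
</item>
"""

item = ET.fromstring(extracted)
# Because the schema says what each piece of data is about, code can
# address fields directly instead of guessing from page layout.
print(item.findtext("title"), item.findtext("price"))
```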

In summary, while MetaSeeker needs the user's direction to specify which data on a Web page is about what, the extracted data comes well formatted with meta information, which lets computers manipulate the data exactly and semantically. For example, the extracted data can easily be transformed and aggregated into HTML or XML documents of different structures using an XSLT engine. As another example, the extracted data can be aggregated and presented on portals or mashup services automatically. None of the above can be accomplished by an ordinary crawler.
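As a minimal sketch of the XSLT example, using Python's lxml bindings (an assumption; any XSLT engine would do) and the invented <item> structure from the earlier illustration:

```python
# Re-aggregate hypothetical extracted XML into HTML via XSLT.
from lxml import etree

data = etree.XML(
    "<items><item><title>Example product</title>"
    "<price currency='USD'>19.99</price></item></items>"
)

stylesheet = etree.XML("""
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/items">
    <html><body>
      <xsl:for-each select="item">
        <p><xsl:value-of select="title"/>:
           <xsl:value-of select="price"/></p>
      </xsl:for-each>
    </body></html>
  </xsl:template>
</xsl:stylesheet>
""")

transform = etree.XSLT(stylesheet)
print(str(transform(data)))  # extracted XML re-aggregated as HTML
```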

Tip: MetaSeeker can also download whole Web pages onto hard disk, as an ordinary Web crawler does. However, MetaSeeker cannot preserve the folder hierarchy of the target site, which is a feature some Web crawlers offer for mirroring target sites.



Compared to Ordinary HTML Wrappers

An HTML wrapper is a must for extracting data from the Web: it transforms the original HTML pages and filters out useless data. A template may be fed into a wrapper to direct how it transforms the target page. Since the Web is a vast repository of HTML pages with widely different meanings and formats, a great number of wrappers get implemented, each for a specific Web site or even for a specific Web page. The situation becomes more complicated when the wrappers are implemented in all kinds of programming languages; we programmers are forever re-inventing the wheel (for contrast, a sketch of such a hand-written wrapper follows the list below). It is therefore a good idea to implement a factory that generates a series of wrappers. MetaSeeker acts as exactly such a wrapper factory, with the following distinguishing characteristics:

  • The wrappers generated by MetaSeeker are meta-data-driven and independent of any programming language.
  • MetaSeeker is a GUI-based wrapper factory that frees users from writing even a single line of code.
  • The generated wrapper is not a discrete functional component: its functions are distributed across the whole MetaSeeker network. As a result, on-the-fly, high-capacity data extraction becomes a reality.
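
For contrast, below is a minimal sketch of the kind of hand-written, site-specific wrapper that MetaSeeker generates automatically. The CSS selectors are hypothetical; they would break whenever the target page changed, which is exactly the maintenance burden a wrapper factory removes.

```python
# A hand-coded, single-site HTML wrapper (the selectors are invented
# and assume a particular page layout). It requires the third-party
# package beautifulsoup4.
from bs4 import BeautifulSoup

def wrap_product_page(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    # Filter out useless data, keeping only the fields we care about;
    # assumes both selectors actually match on the target page.
    return {
        "title": soup.select_one("h1.product-title").get_text(strip=True),
        "price": soup.select_one("span.price").get_text(strip=True),
    }
```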


MetaSeeker's Strengths

  • MetaSeeker is transparent to Web pages' authoring methods. That is, it manipulates all Web pages in a consistent manner, regardless of whether they are authored with HTML, PHP, JSP, ASP, ASPX, etc.
  • MetaSeeker is well suited to defining data schemas for, and extracting data from, most forums, blogs, yellow pages, and product or business listings. Without MetaSeeker, users would have to code many HTML wrappers, one for every site or even for every channel or column within a site.
  • MetaSeeker generates data extraction rules automatically from the directions users give through its friendly GUI. Users never have to suffer through coding piles of one-off HTML wrappers.
  • Operating MetaSeeker is straightforward: it takes a user only minutes to define the data schema of a group of Web pages, not counting the time needed to understand the data structure of a specific sample page.
  • MetaSeeker provides many validation facilities that help users check whether the defined data schema is precise and whether the generated data extraction rules work as expected. Consequently, defining a data schema and validating it can proceed simultaneously and interleaved, which shortens the time to finish the work.
  • MetaSeeker provides many monitoring facilities, so data schema definition and data extraction procedures stay fully under control.
  • MetaSeeker is deployed in a distributed fashion, which effectively prevents performance bottlenecks.
  • MetaSeeker extracts data more exactly than many competing products. Exactness is a must if computers are to manipulate extracted data autonomously, without user intervention; otherwise it costs far more to have human beings pre-process the data, for example by finding and discarding invalid records.


MetaSeeker's Weaknesses

  • Defining a data schema with MetaStudio, a tool in the MetaSeeker toolkit, can be complex work. But it is worthwhile for the improved exactness, which saves a great deal of manpower that would otherwise go into pre-processing extracted data or supervising the computer's work.
  • In some special cases, e.g. when an HTML page is composed from many content sources, DataScraper may be somewhat slow compared to Web scrapers that run on servers, because it waits for all composite contents to finish loading. This approach prevents required data from being overlooked; for example, DataScraper properly extracts information from nested HTML FRAME/IFRAME sections (as sketched below). The weakness can be remedied by deploying multiple DataScrapers in a distributed fashion.
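
As a rough sketch of this trade-off, using Selenium purely as a stand-in (DataScraper's internals are not shown here): a client-side scraper waits for the full page, then visits each nested frame, so no composite content is overlooked, at the cost of waiting for everything to load.

```python
# Client-side extraction that waits for composite contents, including
# nested iframes. Assumes Firefox and geckodriver are installed.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Firefox()
driver.get("http://example.com/")  # hypothetical composite page

# Wait until the top-level document has finished loading.
WebDriverWait(driver, 30).until(
    lambda d: d.execute_script("return document.readyState") == "complete"
)

# Extract from every nested frame as well as the main document.
for frame in driver.find_elements(By.TAG_NAME, "iframe"):
    driver.switch_to.frame(frame)
    print(driver.page_source[:80])  # placeholder for real extraction
    driver.switch_to.default_content()

driver.quit()
```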