
[tools-yak@collab] Re: Scraping Web pages

To: tools-yak@xxxxxxxxxxxxxxxxxxx
From: Simon Buckingham Shum <S.Buckingham.Shum@xxxxxxxxxx>
Date: Mon, 14 Apr 2003 11:22:05 +0100
Message-id: <5.2.0.9.2.20030414111852.0242dac0@owa2000.open.ac.uk>
At 16:03 05/04/2003 -0800, you wrote:
Who in this group has thought much about and acted on the process of scraping Web pages?

Jack et al

on the AKT technology profiles page: http://www.aktors.org/technologies/

you will find Dome: http://www.aktors.org/technologies/dome/

which has been used to screen-scrape data into a huge RDF 'triple-store' covering the UK computer science research community and its publications - www.hyphen.info

this itself uses '3-Store' - recently released open source - http://www.aktors.org/technologies/3store/

Simon


Web homepage: http://www.ecs.soton.ac.uk/~tal00r/Dome

Developer: Thomas Leonard

Owner: University of Southampton

Builds on other technologies: XML, HTML

What's the Problem?

Populating the knowledge base requires collecting data from many web pages. Because of the large number of pages to be examined and the need to regularly update the information, a tool is needed to do this automatically.

Eventually, all pages should provide machine-readable meta-data in a standard format, which would make this task very easy. In the meantime, and to bootstrap this process, we need to cope with the current situation, in which each site generates differently structured pages from its own database.

Towards a Solution

Dome is a programmable XML/HTML editor. Users load in a page from the target site and record a sequence of editing operations to extract the desired information. This sequence can then be replayed automatically on the rest of the site's pages.

Dome automatically converts the source HTML to XHTML using the W3C's HTML-Tidy program, tidying it up in the process. A Dome program is then recorded that removes all unnecessary elements from the page, leaving just the desired data. The element names and layout can also be changed to suit a desired output format, such as RDF.
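The record-and-replay idea can be illustrated with a rough Python sketch. This is not Dome's own code or file format: the operation names (drop, rename, replay), the class attributes and the sample page are all hypothetical, and the input is assumed to be well-formed XHTML of the kind HTML-Tidy would produce. It only shows how one recorded sequence of edits might be replayed on any page that shares the same layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page, assumed to be already tidied into well-formed XHTML.
PAGE_1 = (
    '<html><body><div class="nav">Home</div><table>'
    '<tr><td class="name">A. Author</td><td class="title">First Paper</td></tr>'
    '</table></body></html>'
)

def drop(tag, cls):
    """Recorded operation: remove every <tag class=cls> element."""
    def op(root):
        # ElementTree has no parent pointers, so build a child -> parent map.
        parents = {child: parent for parent in root.iter() for child in parent}
        for el in list(root.iter(tag)):
            if el.get("class") == cls and el in parents:
                parents[el].remove(el)
    return op

def rename(cls, new_tag):
    """Recorded operation: rename every element with class=cls to new_tag."""
    def op(root):
        for el in root.iter():
            if el.get("class") == cls:
                el.tag = new_tag
                del el.attrib["class"]
    return op

def replay(program, xhtml):
    """Replay a recorded sequence of operations on one page."""
    root = ET.fromstring(xhtml)
    for op in program:
        op(root)
    return root

# The "recorded" program: strip navigation, then rename cells for the output format.
PROGRAM = [drop("div", "nav"), rename("name", "author"), rename("title", "paper")]

def extract(xhtml):
    """Return (author, paper) pairs from a page after replaying PROGRAM."""
    root = replay(PROGRAM, xhtml)
    return [(row.findtext("author"), row.findtext("paper")) for row in root.iter("tr")]

if __name__ == "__main__":
    print(extract(PAGE_1))
```

The same PROGRAM list could then be passed over every other page from the site, which is the point of recording the edits once.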

<image>

Dome has a number of simple programming constructs, such as loop-over-sequence (shown above by the blue bars), nesting (the rectangles) and simple exceptions (the red arrows).
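As a loose analogy only, the three constructs might look like this in Python. The profile does not describe how Dome's exceptions actually behave, so the assumption here is that a step which cannot be applied diverts control to an alternative path (in this sketch, skipping the current item). The element names and the StepFailed class are hypothetical.

```python
import xml.etree.ElementTree as ET

class StepFailed(Exception):
    """Raised when a step cannot be applied to the current node (assumed semantics)."""

def text_of(node, tag):
    """Return the stripped text of a child element, or fail the step."""
    found = node.find(tag)
    if found is None or not (found.text or "").strip():
        raise StepFailed(f"no <{tag}> in <{node.tag}>")
    return found.text.strip()

def extract_row(row):
    # Nested block: the steps that run once for each row.
    return {"author": text_of(row, "author"), "paper": text_of(row, "paper")}

def run(root):
    records, skipped = [], 0
    for row in root.iter("row"):              # loop over a sequence
        try:
            records.append(extract_row(row))  # nested steps
        except StepFailed:
            skipped += 1                      # exception path: skip this row
    return records, skipped

# Hypothetical input: the second row lacks a <paper> element.
DOC = (
    "<rows>"
    "<row><author>A. Author</author><paper>First Paper</paper></row>"
    "<row><author>B. Author</author></row>"
    "</rows>"
)

if __name__ == "__main__":
    print(run(ET.fromstring(DOC)))
```

Handling the failure per item means one malformed entry does not stop the rest of the page from being processed.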
