| To: | tools-yak@xxxxxxxxxxxxxxxxxxx |
|---|---|
| From: | Simon Buckingham Shum <S.Buckingham.Shum@xxxxxxxxxx> |
| Date: | Mon, 14 Apr 2003 11:22:05 +0100 |
| Message-id: | <5.2.0.9.2.20030414111852.0242dac0@owa2000.open.ac.uk> |
At 16:03 05/04/2003 -0800, you wrote:

> Who in this group has thought much about and acted on the process of scraping Web pages?

Jack et al,

On the AKT technology profiles page: http://www.aktors.org/technologies/

you will find Dome: http://www.aktors.org/technologies/dome/

which has been used to screen-scrape the data for a huge RDF 'triple-store' covering the UK computer science research community and its publications - www.hyphen.info

This itself uses '3-Store' - recently released as open source - http://www.aktors.org/technologies/3store/

Simon

Web homepage: http://www.ecs.soton.ac.uk/~tal00r/Dome
Developer: Thomas Leonard
Owner: University of Southampton
Builds on other technologies: XML, HTML

What's the Problem?

Populating the knowledge base requires collecting data from many web pages. Because of the large number of pages to be examined and the need to regularly update the information, a tool is needed to do this automatically.

Eventually, all pages should provide machine-readable meta-data in a standard format to make this task very easy. In the meantime, and to boot-strap this process, we need to cope with the current situation, in which each site generates different pages from its own database.

Towards a Solution

Dome is a programmable XML/HTML editor. Users load in a page from the target site and record a sequence of editing operations to extract the desired information. This sequence can then be replayed automatically on the rest of the site's pages.

Dome automatically converts the source HTML to XHTML using the W3C's HTML-Tidy program, tidying it up in the process. A Dome program is then recorded which removes all unnecessary elements from the page, leaving just the desired data. The element names and layout can also be changed to suit a desired output format, such as RDF.

[image]

Dome has a number of simple programming constructs, such as loop-over-sequence (shown above by the blue bars), nesting (the rectangles) and simple exceptions (the red arrows).
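To make the record-and-replay idea above concrete, here is a minimal Python sketch. It is not Dome's actual implementation or program format. The operation names (`select`, `rename`, `strip_attrs`), the `replay` function, the `people`/`person` output elements and the sample pages are all hypothetical, and the input is assumed to be well-formed, namespace-free XML already (the job HTML-Tidy does in Dome).

```python
import xml.etree.ElementTree as ET


def select(path):
    """Operation: replace the selection with every match of `path` below it."""
    def op(nodes):
        # Loop over the current sequence of nodes and collect all matches.
        return [match for node in nodes for match in node.findall(path)]
    return op


def rename(tag):
    """Operation: rename every selected element to `tag`."""
    def op(nodes):
        for node in nodes:
            node.tag = tag
        return nodes
    return op


def strip_attrs():
    """Operation: drop all attributes from the selected elements."""
    def op(nodes):
        for node in nodes:
            node.attrib.clear()
        return nodes
    return op


def replay(program, xhtml, wrapper="people"):
    """Replay a recorded list of operations on one page; return the output XML."""
    nodes = [ET.fromstring(xhtml)]
    for op in program:
        nodes = op(nodes)
    out = ET.Element(wrapper)  # hypothetical output container element
    for node in nodes:
        node.tail = None       # discard layout whitespace that followed the element
        out.append(node)
    return ET.tostring(out, encoding="unicode")


# The "recorded" program: keep only the name items, then reshape them.
PROGRAM = [select(".//li[@class='name']"), rename("person"), strip_attrs()]

# Two pages generated from the same (hypothetical) site template.
PAGE_A = ("<html><body><h1>Staff</h1><ul><li class='name'>Ada</li>"
          "<li class='name'>Grace</li></ul><p>footer</p></body></html>")
PAGE_B = ("<html><body><h1>Students</h1><ul><li class='name'>Alan</li></ul>"
          "<p>footer</p></body></html>")

if __name__ == "__main__":
    for page in (PAGE_A, PAGE_B):
        print(replay(PROGRAM, page))
```

The same `PROGRAM` runs unchanged on both pages, which is the point of recording the edits once against a sample page. Each site template would need its own recorded program.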