
[tools-yak@collab] Re: Scraping Web pages

To: tools-yak <tools-yak@xxxxxxxxxxxxxxxxxxx>
From: Shawn Murphy <smurp@xxxxxxxxx>
Date: 14 Apr 2003 08:41:49 -0600
Message-id: <1050331309.1712.1430.camel@prometheus>
The intriguing CritLink Mediator by Ka Ping Yee 
  http://crit.org
accepts URLs such as:
  http://crit.org/http://www.bootstrap.org/
and then presents the fetched pages to you with a crit toolbar at the top. 
The toolbar (and some JavaScript) lets you select regions of text on the
'transcoded' page, which can then be commented upon.  Crit, in effect,
lets anybody comment on any HTML on the web, in a fine-grained fashion.
The links are classified as support, issue, comment or query.
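
To make the URL scheme concrete, here is a minimal Python sketch of the
prefixing shown above.  It only reproduces the pattern visible in the
example URL; the function names are my own, and accepting https targets
is an assumption, not something crit.org documents here.

```python
from urllib.parse import urlsplit

# Prefix taken from the example URL in this message.
CRIT_PREFIX = "http://crit.org/"


def mediated_url(target: str) -> str:
    """Build a CritLink-style URL by prepending the mediator prefix."""
    parts = urlsplit(target)
    # Assumption: only absolute http(s) URLs make sense as targets.
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"expected an absolute http(s) URL, got {target!r}")
    return CRIT_PREFIX + target


def original_url(mediated: str) -> str:
    """Recover the target URL from a mediated one (hypothetical helper)."""
    if not mediated.startswith(CRIT_PREFIX):
        raise ValueError(f"not a mediated URL: {mediated!r}")
    return mediated[len(CRIT_PREFIX):]


if __name__ == "__main__":
    print(mediated_url("http://www.bootstrap.org/"))
```

The point is only that the mediator needs no registration of the target
page: any page becomes annotatable by rewriting its address.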

Crit deserves a look, at the very least on two technical grounds:
  1) how it points links into foreign pages
  2) how it copes with messy, weird HTML; there are probably a few
     effective heuristics there.


On Mon, 2003-04-14 at 04:22, Simon Buckingham Shum wrote:
> At 16:03 05/04/2003 -0800, you wrote:
> > Who in this group has thought much about and acted on the process of
> > scraping Web pages?
> 
> Jack et al
> 
> on the AKT technology profiles page:
> http://www.aktors.org/technologies/
> 
> you will find Dome: http://www.aktors.org/technologies/dome/
> 
> which has been used to screen scrape a huge RDF 'triple-store' of UK
> computer science research community and publications - www.hyphen.info
> 
> this itself uses '3-Store' - recently released open source -
> http://www.aktors.org/technologies/3store/
> 
> Simon
> 
> 
> Web homepage: http://www.ecs.soton.ac.uk/~tal00r/Dome
> 
> Developer: Thomas Leonard
> 
> Owner: University of Southampton
> 
> Builds on other technologies: XML, HTML
> 
> 
> What's the Problem?
> Populating the knowledge base requires collecting data from many web
> pages. Because of the large number of pages to be examined and the
> need to regularly update the information, a tool is needed to do this
> automatically.
> 
> Eventually, all pages should provide machine-readable meta-data in a
> standard format to make this task very easy. In the meantime, and to
> boot-strap this process, we need to cope with the current situation of
> each site generating different pages from their databases. 
> 
> 
> Towards a Solution
> Dome is a programmable XML/HTML editor. Users load in a page from the
> target site and record a sequence of editing operations to extract the
> desired information. This sequence can then be replayed automatically
> on the rest of the site's pages. 
> 
> Dome automatically converts the source HTML to XHTML using the W3C's
> HTML-Tidy program, tidying it up in the process. A Dome program is
> then recorded which removes all unnecessary elements from the page,
> leaving just the desired data. The element names and layout can then
> be changed to suit a desired output format, such as RDF.
> 
> <image>
> 
> Dome has a number of simple programming constructs, such as
> loop-over-sequence (shown above by the blue bars), nesting (the
> rectangles) and simple exceptions (the red arrows). 
> 
> 
> Links
>       * Dome's homepage - including a short tutorial. 
>       * Hyphen.info - a large collection of data harvested using Dome.
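
For anyone who wants a feel for the approach described above without
installing Dome, here is a rough Python sketch of the same idea: parse
untidy HTML leniently, keep only the wanted data, and reshape it as
subject/predicate/value triples.  This is not Dome and does not use
HTML-Tidy; it substitutes the standard library's forgiving html.parser,
and the class name, function names and the "researchArea" predicate are
all made up for illustration.

```python
from html.parser import HTMLParser


class CellExtractor(HTMLParser):
    """Collect the text of every <td>, grouped by <tr>, from messy HTML."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.close_row()            # tolerate a missing </tr>
            self._row = []
        elif tag == "td":
            self._close_cell()          # tolerate a missing </td>
            if self._row is None:
                self._row = []          # tolerate a <td> with no <tr>
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "td":
            self._close_cell()
        elif tag == "tr":
            self.close_row()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def _close_cell(self):
        if self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None

    def close_row(self):
        self._close_cell()
        if self._row:
            self.rows.append(self._row)
        self._row = None


def extract_rows(html):
    """Return the table rows found in html as lists of cell text."""
    parser = CellExtractor()
    parser.feed(html)
    parser.close()
    parser.close_row()  # flush a row left open at end of input
    return parser.rows


def to_triples(rows, predicates):
    """Treat the first cell as subject; pair the rest with predicates."""
    triples = []
    for row in rows:
        subject, values = row[0], row[1:]
        for predicate, value in zip(predicates, values):
            triples.append((subject, predicate, value))
    return triples


if __name__ == "__main__":
    messy = "<table><tr><td>Ada<td>Logic<tr><td>Bob</td><td>Networks</table>"
    for triple in to_triples(extract_rows(messy), ["researchArea"]):
        print(triple)
```

The extractor plays the role of the recorded program: written once
against one sample page, then replayed over every page that the same
site generates from its database.
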
-- 
    ====================================================================
      Shawn Murphy                                http://www.smurp.com
      mailto:smurp@smurp.com                     http://www.nooron.org
      tel:+780-903-2428                       http://www.noosphere.org

-- 
This message is archived at:

http://collab.blueoxen.net/forums/cgi-bin/mesg.cgi?a=tools-yak&i=1050331309.1712.1430.camel@prometheus