[tools-yak@collab] Re: Scraping Web pages

To: tools-yak@xxxxxxxxxxxxxxxxxxx
Cc: shannon@xxxxxxxxxx, thinkingrelevantly@xxxxxxxxxxxxxxx
From: Andrius Kulikauskas <ms@xxxxx>
Date: Mon, 07 Apr 2003 19:53:57 +0300
Message-id: <3E91AD25.6090600@ms.lt>
[at BlueOxen, and cc: to Minciu Sodas thinking relevantly]    (01)

Jack Park wrote:    (02)

  > Who in this group has thought much about and acted on the process of
  > scraping Web pages?    (03)


Jack, I haven't followed this thread completely, but I just wanted to
point you to a couple of our lab members:    (04)

- Peter Kaminski, http://www.socialtext.com and
http://www.peterkaminski.com  knows a lot about job boards and, in
particular, the scraping of resumes and job postings that they would get.    (05)

- Shannon Clark, http://www.jigzaw.com , has developed really cool
technology that can scrape web pages in real-time for all manner of
applications, for example, making use of Google in real-time to build a
calendar of all conferences in the Bay Area on a certain topic for a
certain weekend.  He's used that to make web portal modules of a whole
bunch of different kinds for news feeds, stock prices, Amazon books,
etc.  He's got an applet that, I think, automatically finds the URLs
at Amazon for books that you blog and adds your ID number.  He's very
keen to license his AI technology.    (06)

I learned a lot from thinking through with Shannon the relationship
between AI and markup.  It crystallized for me when I tried to write
some code to monitor our lab members' sites for changes.    (07)

The first thing I learned was that, although it's just about trivial to
figure out if a page has changed, it's not simple at all to isolate the
part that did (from the part that did not).  I came up with a nice
trick: I just looked for the lines in the new page that weren't in the
old one.    (08)
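
A minimal sketch of that trick, in Python.  The function names are
made up for illustration and this is not the code I actually ran; it
assumes both versions of the page have already been fetched as strings.

```python
import hashlib


def page_changed(old_html: str, new_html: str) -> bool:
    """Cheap check: did the page change at all? Compare content hashes."""
    old_digest = hashlib.sha256(old_html.encode("utf-8")).hexdigest()
    new_digest = hashlib.sha256(new_html.encode("utf-8")).hexdigest()
    return old_digest != new_digest


def new_lines(old_html: str, new_html: str) -> list[str]:
    """Return the (stripped) lines of the new page that were not in the old one."""
    seen = {line.strip() for line in old_html.splitlines()}
    result = []
    for line in new_html.splitlines():
        stripped = line.strip()
        # Skip blank lines and anything already present in the old page.
        if stripped and stripped not in seen:
            result.append(stripped)
    return result
```

Comparing whole lines as a set ignores where a line moved to, which is
fine for "what is new here?" but is not a real diff.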

However, this led to all kinds of false positives.  First of all, you
get lots of timestamps!  But there's more obscure stuff.  For example, 
Stephen Danic's Lucid Fried Eggs site http://www.memes.net has a "random 
page" link.  Well, of course that's different each time you visit.    (09)

So there are two options:
A) You get Stephen Danic to output changes in some markup language form.
B) You design a special-case workaround for reading changes in Stephen 
Danic's pages.    (010)
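
Option B might look roughly like the Python sketch below: generic rules
that throw away timestamp-like lines, plus a per-site rule table.  The
rule for www.memes.net is a guess (I am assuming the link text contains
the word "random"), and all the names and patterns here are
hypothetical.

```python
import re

# Generic noise: lines containing a clock time or an ISO-style date.
GENERIC_RULES = [
    re.compile(r"\b\d{1,2}:\d{2}(:\d{2})?\b"),
    re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
]

# Special cases per site. The pattern for www.memes.net is an assumption
# about how its "random page" link is written, not taken from the site.
SITE_RULES = {
    "www.memes.net": [re.compile(r"random", re.IGNORECASE)],
}


def is_noise(line: str, host: str) -> bool:
    """True if the line matches a generic rule or a rule for this host."""
    rules = GENERIC_RULES + SITE_RULES.get(host, [])
    return any(rule.search(line) for rule in rules)


def meaningful_changes(old_html: str, new_html: str, host: str) -> list[str]:
    """New lines of the page, minus those the rules flag as noise."""
    seen = {line.strip() for line in old_html.splitlines()}
    result = []
    for line in new_html.splitlines():
        stripped = line.strip()
        if stripped and stripped not in seen and not is_noise(stripped, host):
            result.append(stripped)
    return result
```

Each new site that misbehaves gets another entry in the table, which is
exactly the "start with B for the things you want most" approach.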

The main conclusion is that:
You can't force everybody to do A for you.  Just because you want to 
monitor their data doesn't mean that they will make an effort to make it 
easy for you.
Therefore - you have to do B for at least some of the things you want.
Therefore - you start with B for the things you want most.
Therefore - there's not that much reason to do A because you've got 
pretty decent machinery to do B that also handles a lot of what you'd do 
with A.
Therefore - you may never actually get around to A.    (011)

In other words, Shannon convinced me that AI (or at least, special-case 
solutions) is truly fundamental.  Sharing markup feeds etc. is only for 
people who really want to make the effort to work with you AND whose 
data is exceptionally weird.    (012)

Note that one interesting case is "copyrights" because it's not obvious 
how you can pick out what parts of a page (or blog) are from the author, 
and what parts are excerpts from other works.  So that may be an example 
of intrinsically "exceptionally weird" data where we need human 
participation (and could use markup).    (013)

Peace,    (014)

Andrius    (015)

Andrius Kulikauskas
Minciu Sodas
http://www.ms.lt
ms@ms.lt
+370 (5) 2645950
Vilnius, Lithuania    (016)

-- 
This message is archived at:    (017)

http://collab.blueoxen.net/forums/cgi-bin/mesg.cgi?a=tools-yak&i=3E91AD25.6090600@ms.lt    (018)