Re: wget configuration and using offsite URL

From: John R. McPherson
Date: Wed, 19 Jun 2002 08:53:19 +1200
Subject: Re: wget configuration and using offsite URL
In-Reply-To: (OF0E33EFE1-B92BAEC2-ON85256BDC-0069EBC0-altarum-org)
On Tue, Jun 18, 2002 at 03:48:12PM -0400, steve.brophy@altarum.org wrote:
> Hello list,

> I'm trying to build a collection that holds the indexes for documents
> from another site, with links back to the original site.

> Then I run the "import.pl" program. This puts the 'doc.xml' files under
> the archive area, but the [URL] metadata defined in them only contains
> something like "http://filename.html" -- it is missing the original web
> site and the path. Since none of that information is sitting in the
> import area, I'm guessing that I need better arguments for the "mirror" or
> "wget" programs?
>
> Is there an example of a working "wget.cfg" file to look at? One was
> mentioned in the list archives but that link isn't working anymore.

> The setup here is running Greenstone version 2.38 on Windows NT.
> The collect.cfg contains:
> plugin HTMLPlug -file_is_url
> format DocumentUseHTML true
> The wget.url contains one URL.
> The wget.cfg contains only "verbose = on"

The wget option you are looking for is:

    add_hostdir = on
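Since you asked for a working wget.cfg to look at: I don't have the
archived one, but a minimal sketch in the same wgetrc style as your
existing "verbose = on" line might look something like this
(add_hostdir is the important command here; the others are standard
wgetrc commands I'm assuming you'd want for a recursive mirror, so
take or leave them):

    # wget.cfg -- wgetrc-style commands for the mirroring step
    # the line you already have
    verbose = on
    # follow links recursively from the URL in wget.url
    recursive = on
    # how deep to recurse (an assumed depth; adjust to suit the site)
    reclevel = 2
    # don't wander above the start URL's directory
    no_parent = on
    # save pages under a directory named after the host
    add_hostdir = on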
The add_hostdir option creates a directory structure based on the
web site's name, e.g.

    <collectdir>/import/www.example.com/index.html

and so on. Greenstone's HTMLPlug uses the filename to determine the
page's URL. As far as I can tell, HTMLPlug doesn't need the
-file_is_url option for this to work. We have a collection that
does this for searching our school's web pages:
http://nzdl2.cs.waikato.ac.nz/cgi-bin/library?c=scms&p=about
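To make the mapping concrete (www.example.com and the page path here
are just placeholders): with add_hostdir on, a page fetched from

    http://www.example.com/docs/page.html

ends up on disk as

    <collectdir>/import/www.example.com/docs/page.html

and HTMLPlug can turn that path straight back into the original URL,
so the [URL] metadata comes out complete instead of just
"http://filename.html".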

If you'd still like to see our actual wget.cfg or collect.cfg files,
email me and I'll send them.

Hope this helps,
John McPherson

--
"I've never met a human being who would want to read 17,000 pages of
documentation, and if there was, I'd kill him to get him out of the
gene pool."
-- Joseph Costello, President of Cadence