wget configuration and using offsite URL

From: steve.brophy@altarum.org
Date: Tue, 18 Jun 2002 15:48:12 -0400
Subject: wget configuration and using offsite URL

Hello list,

I'm trying to build a collection which holds the indexes for documents
from another site, with links back to the original site.

It works OK when I use the collector to add a page, but I'm missing
something when I try to do the same thing from the command line.

First I run the "mirror.pl" program. The downloaded files are put into
the import directory, but there is no other information along with them.
The collector didn't leave the import directory around, so I'm not sure
what else should be there when it works.

Then I run the "import.pl" program. This puts the 'doc.xml' files under
the archive area, but the [URL] metadata defined in them only contains
something like "http://filename.html" -- it is missing the original web
site and the path. Since none of that information is sitting in the
import area, I'm guessing that I need better arguments for the "mirror" or
"wget" programs?

Is there an example of a working "wget.cfg" file to look at? One was
mentioned in the list archives but that link isn't working anymore.

The setup here is Greenstone version 2.38 running on Windows NT.
The collect.cfg contains:
    plugin HTMLPlug -file_is_url
    format DocumentUseHTML true
The wget.url contains one URL.
The wget.cfg contains only "verbose = on".


Steve Brophy, Altarum, Ann Arbor Michigan
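
On the wget.cfg question above, here is a minimal sketch rather than a known-good file. It assumes two things that this message does not confirm: that mirror.pl passes wget.cfg to wget as a wgetrc-style startup file, and that HTMLPlug with -file_is_url rebuilds [URL] from each file's path under the import directory. The settings themselves are standard wgetrc commands. The idea is to make wget keep the host name and path in the saved file names instead of writing a bare filename.html.

```
# Hypothetical wget.cfg sketch in wgetrc syntax; not taken from a
# Greenstone distribution.
verbose = on

# Always create a directory hierarchy for downloaded files
# (the command-line equivalent is -x).
dirstruct = on

# Prefix saved paths with the host name
# (setting this to off is the equivalent of -nH).
add_hostdir = on
```

With these settings, wget should save a page such as http://www.example.org/docs/page.html (a made-up address) as www.example.org/docs/page.html, so the original site and path would be present in the import area.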