On Tue, Jun 18, 2002 at 03:48:12PM -0400, email@example.com wrote:
> Hello list,
> I'm trying to build a collection which holds the indexes for documents
> from another site, with links back to the original site.
> Then I run the "import.pl" program. This puts the 'doc.xml' files under
> the archive area, but the [URL] metadata defined in them only contains
> something like "http://filename.html" -- it is missing the original web
> site and the path. Since none of that information is sitting in the
> import area, I'm guessing that I need better arguments for the "mirror" or
> "wget" programs?
> Is there an example of a working "wget.cfg" file to look at? One was
> mentioned in the list archives but that link isn't working anymore.
> The setup here is running Greenstone version 2.38 on Windows NT.
> The collect.cfg contains:
> plugin HTMLPlug -file_is_url
> format DocumentUseHTML true
> The wget.url contains one url.
> The wget.cfg contains only "verbose = on"
The wget option you are looking for is
add_hostdir = on
This creates a directory structure based on the website's host name.
Greenstone's HTMLPlug uses the filename to determine the page's URL.
As far as I can tell, HTMLPlug doesn't need the -file_is_url option
for this to work. We have a collection that
does this for searching our school's webpages:
If you still need to look at our wget.cfg or collect.cfg files,
email me and I'll send them.
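
To illustrate why the host directory matters, here is a rough sketch of
the filename-to-URL idea in Python. It is not HTMLPlug's actual code
(Greenstone is written in Perl); the function name url_from_import_path
and the host www.example.org are made up for the example, and it simply
assumes the URL is the path below the import area with "http://" in
front.

```python
def url_from_import_path(relative_path, scheme="http"):
    """Build a URL from a file's path relative to the import area.

    Hypothetical helper for illustration only; it assumes the first
    path segment is the host directory that wget created.
    """
    # Accept Windows-style separators and drop empty segments.
    segments = [s for s in relative_path.replace("\\", "/").split("/") if s]
    return scheme + "://" + "/".join(segments)


# Without a host directory, only the bare filename is available:
print(url_from_import_path("filename.html"))
# With a host directory, the site and path carry through:
print(url_from_import_path("www.example.org\\docs\\filename.html"))
```

The first call gives "http://filename.html", which matches the symptom
described above; the second gives
"http://www.example.org/docs/filename.html".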
Hope this helps,
"I've never met a human being who would want to read 17,000 pages of
documentation, and if there was, I'd kill him to get him out of the
-- Joseph Costello, President of Cadence