Re: [greenstone-users] [GSDL-user] import from WWW

From John R. McPherson
Date Thu, 24 Jul 2003 10:29:55 +1200
Subject Re: [greenstone-users] [GSDL-user] import from WWW
In-Reply-To (200307181048-h6IAmPQ31907-mailgate5-cinetic-de)
der_maudi@web.de wrote:
> Dear list,
>
> I have found a useful encyclopedia on the web. Now I want to import this "book" into a collection. Importing it with the Collector works fine (apart from taking a long time).
> But if I import from the command line:
>
> "perl -S import.pl -removeold -importdir http://www.website.com coll"
>
> I get an error: "no plugin could process this file" (in fail.log).
> How can I import from the web from the command line?

Hi,
you can't set the import directory to be a website. What you can do
is copy the website into your import directory. To do this, you can
use the "mirror.pl" script that comes with Greenstone. It uses the
wget program, which we also distribute with Greenstone. The script
needs some extra configuration files in your collection's "etc"
directory:
1) <collect>/etc/wget.url
which contains a list of URLs to start with, e.g.
##
http://example.com/foo/bar
http://example.org/blah.html
##
2) <collect>/etc/wget.cfg
which contains settings for wget. Some useful ones are:
##
# stop after downloading 100 MBytes of data
quota = 100m
# recursively download links as well
recursive = on
# maximum distance of links from homepage (default is 5)
reclevel = 6
##
You can find most of the options either by running
"wget --help", or by reading the online manual at
http://www.gnu.org/manual/wget-1.8.1/html_node/wget_toc.html
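
As a rough illustration of the two files described above, here is a
minimal Python sketch that writes them into a collection's etc
directory. The function name write_wget_files and its parameters are
made up for this example and are not part of Greenstone; the settings
are just the three listed above.

```python
from pathlib import Path


def write_wget_files(collect_dir, urls, quota="100m", reclevel=6):
    """Write etc/wget.url and etc/wget.cfg under a collection directory.

    Hypothetical helper for illustration only; not shipped with Greenstone.
    """
    etc = Path(collect_dir) / "etc"
    etc.mkdir(parents=True, exist_ok=True)

    # wget.url: one starting URL per line.
    (etc / "wget.url").write_text("\n".join(urls) + "\n")

    # wget.cfg: the wget settings mentioned in the message.
    settings = [
        f"quota = {quota}",
        "recursive = on",
        f"reclevel = {reclevel}",
    ]
    (etc / "wget.cfg").write_text("\n".join(settings) + "\n")
    return etc
```

For example, write_wget_files("/path/to/collect/coll", ["http://example.com/foo/bar"])
would create both files, where the collection path is a placeholder
for wherever your collection lives.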

After setting up these files, you can run
perl -S mirror.pl
and after that you can run import.pl and buildcol.pl as normal.
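
The order of the steps can be sketched as a small Python wrapper. This
is only an illustration: build_commands and run_all are hypothetical
names, "coll" stands for your collection name, and mirror.pl is called
exactly as written above (this message does not say whether it takes
further arguments).

```python
import subprocess


def build_commands(collection, remove_old=False):
    """Return the three commands in the order they should be run."""
    import_cmd = ["perl", "-S", "import.pl"]
    if remove_old:
        # -removeold is the flag used in the original question.
        import_cmd.append("-removeold")
    import_cmd.append(collection)
    return [
        # Written as in the message above; any extra arguments are unknown here.
        ["perl", "-S", "mirror.pl"],
        import_cmd,
        ["perl", "-S", "buildcol.pl", collection],
    ]


def run_all(collection, remove_old=False):
    """Run mirror, import and build in turn, stopping at the first failure."""
    for cmd in build_commands(collection, remove_old):
        subprocess.run(cmd, check=True)
```

Calling build_commands("coll") only returns the command lists, so you
can inspect them before running anything with run_all("coll").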

> Can I import more than one link? (How)

If you use wget, just list all the links in the wget.url file in the
collection's etc directory, one per line.

> Is it possible to use metadata.xml (produced with the organizer), e.g. if I want to list the author or organisation?
> If YES: must the filename look like "http://www.website.com/book1.htm"?
> If NO: how can I create such a list?

If you want to use metadata.xml to manually add metadata, the filename
will look like <collect>/import/www.example.com/book1.htm
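
To make the mapping from URL to mirrored file concrete, here is a
small Python sketch. The function name import_path_for_url is made up
for this example. It assumes wget's default layout (one directory per
host name, with "index.html" used for URLs that end in a slash) and
does not handle query strings.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def import_path_for_url(url, collect_dir="<collect>"):
    """Guess where a mirrored URL ends up under the import directory.

    Assumes wget's default host-name directories; illustration only.
    """
    parts = urlsplit(url)
    path = parts.path
    if path == "" or path.endswith("/"):
        # By default wget saves directory-style URLs as index.html.
        path += "index.html"
    return str(PurePosixPath(collect_dir) / "import" / parts.netloc / path.lstrip("/"))


print(import_path_for_url("http://www.example.com/book1.htm"))
# prints: <collect>/import/www.example.com/book1.htm
```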

Hope this helps
John McPherson