Re: [greenstone-users] Unicode and dead links

From Katherine Don
Date Wed, 31 Mar 2004 16:08:23 +1200
Subject Re: [greenstone-users] Unicode and dead links
In-Reply-To (BC81BB7B-B1C3%james-elmborg-uiowa-edu)
Hi Jim

HTMLPlug by default only associates gif|jpe?g|jpe|png|css files with the
document - you can change this by adding

-assoc_files (?i)\.(gif|jpe?g|jpe|png|css|pdf)$

as an option to HTMLPlug.
This means that your PDFs will be saved as associated files with the
HTML documents - and I think this will mean that the links work (hopefully).
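
To see which files such an expression picks up, here is a small Python sketch. It assumes the dot before the extension is meant as a literal dot (written \.), and the file names are made up for illustration. Python's regular expressions behave the same way as Perl's for this pattern, but this is only a check of the pattern, not of Greenstone itself.

```python
import re

# Pattern from the -assoc_files option, with the dot escaped so it matches
# a literal "." (assumption: that is what the option intends).
ASSOC_FILES = re.compile(r"(?i)\.(gif|jpe?g|jpe|png|css|pdf)$")

def is_associated(filename):
    """Return True if the file name ends in one of the listed extensions."""
    return ASSOC_FILES.search(filename) is not None

# Hypothetical file names, for illustration only.
for name in ["pdfs/author.PDF", "images/photo.jpeg", "index.html", "notes.pdf.txt"]:
    print(name, is_associated(name))
```

The (?i) makes the match case-insensitive, so author.PDF matches, and the trailing $ means only the final extension counts, so notes.pdf.txt does not.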

As they are also being processed in their own right by PDFPlug, another
copy will be saved as the source document for this.

If you don't want the PDFs to be processed separately you can modify the
block expression for HTMLPlug - this specifies the files that don't get
passed on to other plugins. Add this option to HTMLPlug (and make sure
HTMLPlug is before PDFPlug in the plugin list):

-block_exp (?i)\.(gif|jpe?g|jpe|png|css|pdf)$

This means that the PDFs won't be processed by PDFPlug - they won't be
available by searching, only by clicking from the HTML document.
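
Why the plugin order matters can be sketched with a toy model in Python. This is a simplified, hypothetical picture of a plugin list, not Greenstone's actual code: each file is offered to the plugins in order, and a plugin either blocks it, processes it, or passes it on. The process patterns below are illustrative and are not the real defaults.

```python
import re

# Simplified, hypothetical model of a plugin list. Each entry is
# (name, process pattern, block pattern). The process patterns are
# illustrative only, not Greenstone's actual defaults.
PLUGINS = [
    ("HTMLPlug", r"(?i)\.html?$", r"(?i)\.(gif|jpe?g|jpe|png|css|pdf)$"),
    ("PDFPlug", r"(?i)\.pdf$", None),
]

def handler_for(filename, plugins=PLUGINS):
    """Return the plugin that processes the file, or None if blocked or unclaimed."""
    for name, process_exp, block_exp in plugins:
        if block_exp and re.search(block_exp, filename):
            return None  # blocked: never offered to later plugins
        if re.search(process_exp, filename):
            return name
    return None

print(handler_for("writer.html"))       # HTMLPlug
print(handler_for("pdfs/writer.pdf"))   # None
print(handler_for("pdfs/writer.pdf", list(reversed(PLUGINS))))  # PDFPlug
```

In this model, with PDFPlug listed first, the PDF is claimed before HTMLPlug's block expression is ever consulted, which is why HTMLPlug needs to come first.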

I hope this helps.

Note that you can find out the options to any plugin by using the GLI to
build your collection (all options are listed with tooltips) or running
perl -S pluginfo.pl XXXPlug

Katherine Don

> I have a folder full of web pages I'm trying to make into a Greenstone
> library. The pages contain information about international writers and so
> many languages are represented on the pages with all the attendant
> diacritics. The pages are created with a generic template. Each has some
> text, a photograph, and a link to a PDF file. HTML files are in the main
> folder along with one folder that contains all the images and one folder
> that contains the PDFs. All links to PDFs and images are relative.
> I've built the collection multiple times on two machines (one Mac OS X and
> one Red Hat Linux 9). I've used both the GLI interface and the web
> collector. In every case the pages import fine and the library builds
> without errors. When I go to view the collection, I have two problems:
> 1. None of the links to the PDF files work. The PDF files were processed
> and show up as browseable and searchable in the collection, but all the
> links to them from the HTML pages are broken and retrieve the standard
> "Internal Link Missing" message. Am I right to assume that if the links
> work in the original HTML, they should work in the final Greenstone library?
> Am I right to assume links to PDF files should be retained in the final
> library? If I am right, can anyone point me to my problem?