Re: [greenstone-users] Excluding documents from full-text indexing

From Michael Dewsnip
Date Tue, 22 Mar 2005 13:36:42 +1200
Subject Re: [greenstone-users] Excluding documents from full-text indexing
In-Reply-To (423B0365-30004-atp-rub-de)

Hi Axel,

You should be able to use UnknownPlug to process the files (with
associated metadata in metadata.xml files). If this only applies to some
of your files you can collect these together into a directory and
specify the directory name in the "-process_exp" option to UnknownPlug.
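
A rough, untested sketch of how this could look. The directory name
"hidden", the metadata names "hidden" and "weburl", the URL and the title
text are all made up for illustration. Plugins are tried in the order
they are listed, so the assumption here is that UnknownPlug has to come
before PDFPlug for the dummy files to reach it first. In collect.cfg:

```
# Dummy files under import/hidden/ go to UnknownPlug, so no
# PDF-to-HTML conversion is attempted for them.
# "[.]" is used for a literal dot to avoid backslash escaping.
plugin UnknownPlug -process_exp "hidden/.*[.]pdf$"

# All other PDF files are converted and full-text indexed as usual.
plugin PDFPlug
```

The metadata for each dummy file would then come from a metadata.xml
file placed in the same directory, for example:

```
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE DirectoryMetadata SYSTEM "http://greenstone.org/dtd/DirectoryMetadata/1.0/DirectoryMetadata.dtd">
<DirectoryMetadata>
  <FileSet>
    <!-- FileName is treated as a regular expression -->
    <FileName>paxson96endtoend\.pdf</FileName>
    <Description>
      <Metadata name="Title">Title of the document</Metadata>
      <!-- "hidden" and "weburl" are hypothetical metadata names -->
      <Metadata name="hidden">yes</Metadata>
      <Metadata name="weburl">http://www.example.org/paxson96endtoend.pdf</Metadata>
    </Description>
  </FileSet>
</DirectoryMetadata>
```

Depending on the Greenstone version, RecPlug may need its
"-use_metadata_files" option for the metadata.xml files to be read.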

Perhaps an easier solution is to keep the text with the documents, but
use format statements to remove the link to the Greenstone version of
the document (just link to the web version of the documents). By
specifying an extra metadata value for the "hidden" documents, you can
check this in your format statements to customise what options the users
have for the hidden documents. The advantage of this is that you keep
the full-text search ability, and it doesn't require any build time
changes. This may not be good enough from a legal point of view, however
(I don't know much about this).
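
As an untested sketch of such a format statement: assume each "hidden"
document carries two extra metadata values, here called "hidden" and
"weburl" (both names are made up for this example and could be assigned
in a metadata.xml file). The list entry then links to the web version
when "hidden" is set, and to the Greenstone version otherwise:

```
# collect.cfg: [hidden] and [weburl] are hypothetical metadata names
format VList "<td valign=top>{If}{[hidden],<a href='[weburl]'>[Title]</a>,[link][icon][/link] [Title]}</td>"
```

The same {If} test would probably need to go into any other format
statement that produces a link to the document, since a plain VList
statement only covers the lists that do not have a more specific one.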

Regards,

Michael

schild wrote:

> Hi list,
>
> Does anybody know whether it is possible to exclude specific files
> from the full-text indexing process in the build phase of a
> collection, while still indexing the metadata for those documents? In
> particular, what I want to do is:
>
> Certain documents in my collection should be accessible only via a web
> link (those files must not be included directly in the repository of
> the digital library), whereas others are included directly. For those
> that should be accessed via a web link, I included a dummy file
> instead (just an empty file with the same name as the original
> file). When I run the build process, all plugins exit with an error
> (since there is no text to be indexed, or the file does not conform to
> the normal file structure). As an example, here is what the Librarian
> Interface prints in expert mode:
>
> import.pl> Converting paxson96endtoend.pdf to HTML format
> import.pl> Error: May not be a PDF file (continuing anyway)
> import.pl> Error (0): PDF file is damaged - attempting to reconstruct xref table...
> import.pl> Error: Couldn't find trailer dictionary
> import.pl> Error: Couldn't read xref table
> import.pl> Error executing pdftohtml.pl
> import.pl> Could not convert paxson96endtoend.pdf to HTML format
>
> Obviously an empty .pdf file does not conform to the PDF file
> structure. By the way, my need for this arises from copyright law.
>
> Does anybody have a clue on this one?
>
> Thanks,
>
> Axel
>