Re: my 8 questions about GreenStone system

From John R. McPherson
Date Thu, 03 Oct 2002 11:46:25 +1200
Subject Re: my 8 questions about GreenStone system
In-Reply-To (000501c26a3d$3b8c1260$35aa29c0-p410)
> Phan Vo Minh Thang wrote:
>
> Dear John,

I am on both the greenstone and the greenstone developers lists. You
don't need to email copies to the developers individually.

Most of the questions below are answered in the manuals. By the way,
it might be better to think of the Developers Manual as an "Advanced
Users' Manual" as it explains how to do more than just the basic
stuff described in the User Manual.

> 1.About searching methods.
> Does GreenStone version 2.38 support ranked searching ? If yes, how could I
> do to switch between ranked searching and boolean searching ? I can't find
> out that option in the preperences page.

By default, search results are ranked. If you change the "some" to "all"
on the search page, then a boolean search is performed. There is also an
advanced boolean query setting on the preferences page.

> 2.About remote handling.
> Is it possible to add new documents to the collection remotely?. Once, I
> tried to specify the input file from collector wizard via "file:/" option at
> a remote computer (let say computer B), but it always looks for the files
> from the computer that installed collection server (let say computer A)! Can
> I handle collection, specify the input files located in the remote computer
> (different from A). Or I have to create a ftp server ? Is it necessary ?

The "file://" URI is always local to a particular machine. If you are
running the collector remotely, this URI must point to a location on the
server machine - files are not uploaded from your current machine.
This is discussed on page 24 of the Users Guide. This is the
standard behaviour of web browsers as well.
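
To illustrate why a "file://" URI cannot reach back to the machine it was
typed on, here is a minimal Python sketch (not part of Greenstone). The
function name and the example path are hypothetical. It shows only that a
typical file URI carries a path and an empty host, so whichever machine
opens it looks on its own disk.

```python
from urllib.parse import urlparse, unquote

def split_file_uri(uri):
    """Return (host, path) for a file: URI.

    Hypothetical helper for illustration only.
    """
    parts = urlparse(uri)
    if parts.scheme != "file":
        raise ValueError("not a file: URI: " + uri)
    # The host part is normally empty, so the URI names no machine at all;
    # the path is resolved by whoever opens it (here, the collection server).
    return parts.netloc, unquote(parts.path)

# Example path is made up; it would have to exist on the server machine.
host, path = split_file_uri("file:///home/user/import")
print(repr(host), path)
```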

> 4 About metadata.
> In the GreenStone Developer Guide document page 19-20, it is said that the
> Dublin Core metadata was used for defining metadata types and the meta data
> was store with document. But I can't see any other metadata fields accept
> the Title one. How should I modify my config file to get the full metadata
> information. Right now,I declare my plugins without any arguments.

Which metadata is assigned depends on each plugin, as different document
types contain different metadata. Unfortunately I can't find a list of
which fields each plugin is capable of extracting... maybe this is
something we need to do...

> 6. About the splitting documents into sections and searching on those
> sections indices.
> I have some MSWord documents. As we know that we can use wvWare to convert
> them into HTML format and then HTMLPlug will convert the output to be GML
> format. How can I split each document into sections

> ..<p> <div name="MSWordTemplateNameComeoutHere" .....><p> Marked text by
> applied MSWord template tag<p></p></div>..

As I told you previously, HTMLPlug will split a document into sections if
given the "-description_tags" option. However, this relies on the HTML code
having special <Section></Section> tags. This is discussed on page 36
of the Developers Guide. You could either edit the HTML manually, or
do this automatically based on the text you have above, if you know
some scripting.
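
A rough sketch of the scripted approach, in Python, might look like the
following. It is only an illustration under several assumptions: the
function and constant names are hypothetical, the <div name="..."> blocks
are assumed not to be nested, and the comment-wrapped layout of the
<Section> tags is an assumption that should be checked against page 36 of
the Developers Guide before use.

```python
import re

# Matches <div name="...">...</div> as produced in the converted HTML.
# Assumes the divs are not nested inside one another.
DIV_RE = re.compile(r'<div\s+name="([^"]*)"[^>]*>(.*?)</div>',
                    re.DOTALL | re.IGNORECASE)

# Assumed tag layout; verify against the Developers Guide.
SECTION_OPEN = ('<!--\n<Section>\n<Description>\n'
                '<Metadata name="Title">{title}</Metadata>\n'
                '</Description>\n-->\n')
SECTION_CLOSE = '\n<!--\n</Section>\n-->'

def add_section_tags(html):
    """Replace each named div with a section titled after the div's name."""
    def wrap(match):
        title, body = match.group(1), match.group(2)
        return SECTION_OPEN.format(title=title) + body + SECTION_CLOSE
    return DIV_RE.sub(wrap, html)

sample = '<p>intro</p><div name="Heading1"><p>Marked text</p></div>'
print(add_section_tags(sample))
```

The output of such a script would then be imported with HTMLPlug and the
"-description_tags" option, in place of the raw converted HTML.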

> 5. About assigning the value for metadata.
> You can see that the Dublin Core information for each document in MSWord
> format is available inside the properties information of themselves (MSWord
> supports that) so how can we reuse that information. I looked at wvWare
> output file. Only Title field is available. So what can be the solution !

Since MS Word is a proprietary format, we use an external 3rd-party
conversion tool. I don't think wvware can extract this metadata. If you
figure it out, tell the wvware authors how you did it!
Otherwise you can manually add metadata to your documents, as discussed
on page 34 of the Developers manual.
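
One way to add metadata by hand is a metadata.xml file placed next to the
documents in the import directory. The Python sketch below generates such
a file from a dictionary. The helper name, the file name pattern and the
values are hypothetical, and the element names (DirectoryMetadata,
FileSet, FileName, Description, Metadata) are assumed here; check them
against page 34 of the Developers manual.

```python
import xml.etree.ElementTree as ET

def build_metadata_xml(entries):
    """Build metadata.xml content.

    entries maps a file name pattern to a dict of metadata name -> value.
    Element names are an assumption; verify them against the manual.
    """
    root = ET.Element("DirectoryMetadata")
    for pattern, fields in entries.items():
        fileset = ET.SubElement(root, "FileSet")
        ET.SubElement(fileset, "FileName").text = pattern
        description = ET.SubElement(fileset, "Description")
        for name, value in fields.items():
            meta = ET.SubElement(description, "Metadata", name=name)
            meta.text = value
    return ET.tostring(root, encoding="unicode")

# Hypothetical document and values, for illustration only.
xml_text = build_metadata_xml({
    r"report\.doc": {"Title": "Annual Report", "Date": "20021003"},
})
print(xml_text)
```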

> 7. About Date format.
> I major about the Date information, and want to classify document on Date.
> How can I add DateClassifier into my collection. Can you show me step by
> step to do it. I'm sure that it will be useful for other people. I tried to
> add format string to the collection config but the Date tag doesn't appear
> in the navigation bar.

As I told you in an earlier email, the Date Classifier requires
the input documents to have Date metadata. This metadata must be in a
numerical YYYYMMDD (year/month/day) format. As far as I am aware, only
the BibTex, Email, and Postscript plugins can automatically extract
dates from import files, as these filetypes have structured date information
in them. You could use the metadata.xml files as mentioned above.
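
If your dates are in some other form, they need converting to YYYYMMDD
before being assigned as Date metadata. A minimal Python sketch, assuming
the input dates look like "03/10/2002" (day/month/year); the function name
and the default input format are hypothetical:

```python
from datetime import datetime

def to_yyyymmdd(text, input_format="%d/%m/%Y"):
    """Convert a date string to the numerical YYYYMMDD form.

    input_format is an assumption about how your dates are written;
    change it to match your documents.
    """
    return datetime.strptime(text, input_format).strftime("%Y%m%d")

print(to_yyyymmdd("03/10/2002"))              # 20021003
print(to_yyyymmdd("2002-10-03", "%Y-%m-%d"))  # 20021003
```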

> 8. About import directories
> Is import directory is a temporary directory ? After building the
> collection, is it deleted ? The reason I ask so because I couldn't see the
> import directory in my collection. But the system store my input file in the
> tmp build* directory ! So if I remove the tbuild* in the tmp
> directory, does the system work properly ? Some time when I build the
> collection again, document content disappears. Is is the result of remove
> files in tmp directory.

If you are using The Collector, I think that the import directory is
removed, as the data is copied from the locations that you give. If you
use the command line, then the import directory is left untouched.
I personally don't use the collector, so I can't really help you
here.

I hope people find this information useful.

John McPherson