[greenstone-users] Re: Word docs in greenstone

From Katherine Don
Date Wed, 29 Oct 2003 15:50:37 +1300
Subject [greenstone-users] Re: Word docs in greenstone
In-Reply-To (20031027050718-39342-qmail-web10304-mail-yahoo-com)
nathan shan wrote:

> Sir
>
> Kindly refer to your reply to my query (given below):
>
> My query:
> > when a document is displayed (especially
> > journal article / book / theses) either through
> > Browsing/Searching, Is it possible to display its
> > Table of contents (i.e. the different headings and
> > subheadings, just like book-marks in MS Word document
> > OR subject hierarchy display while browsing as
> > mentioned above) of that document and to select and
> > see whichever portion of the document that is
> > required).
>
> Your reply:
>
> Do you mean like the table of contents seen for documents in the
> greenstone demo collection? This is automatically generated for
> documents that have sections - you can tag source documents to make
> them into sections. See section "tagging document files" at the end
> of section 2.1 in the developers guide.
> --------------------
>
> Yes, as seen in greenstone demo collection. Such
> display I want for any document (book, journal article
> etc.) that has headings and subheadings.
>
> As you have indicated, there is no section like
> "TAGGING DOCUMENT FILES" at the end of section 2.1 in
> Dev. Guide. However, there is one at the end (5.3) of
> "From Paper to Collection" document. This is the one
> you meant.

No, I did mean the Developer's Guide. In version 2.39, at the end of
section 2.1, there is an unnumbered section titled 'tagging document
files'. Perhaps you have an old version? Anyway, the section in 'From
Paper to Collection' has pretty much the same information, and may
actually be a better version. So thanks for pointing that out.


>
> Suppose I have a MS-WORD document which is a journal
> article. That has headings like abstract,
> Introduction, Materials and methods etc. Now to
> import this into a Digital collection and display it
> as seen in the "greenstone demo collection", PLEASE
> TELL ME THE STEPS so that I shall try. I am not
> catching the thing correctly, from reading the manual.

Greenstone cannot use Word documents directly. We use a conversion
program (wvware) to convert them to HTML and then index and display the
HTML. What sectioning you get from the conversion depends on the wvware
program, the HTML it outputs, and what the HTMLPlug plugin can do
with that HTML. I don't know much about wvware or the Word plugin, but
it appears from our Word and PDF demo on nzdl.org (
http://www.nzdl.org/cgi-bin/library?a=p&p=about&c=wrdpdf-e) that it
doesn't do any sectioning. So to get Greenstone to section your
document you will need to add section tags to the document. These
tags are described in the manual. All I know about this I got from
reading the manual, so I am not sure I can give you any more information
than what's there.

You need to open the document in Word and type in the section tags.
For example, add the following before the Introduction heading:

<!--
<Section>
<Description>
<Metadata name="Title"> Introduction</Metadata>
</Description>
-->

Next comes the heading and content of the introduction.

Then close the section with:
<!--
</Section>
-->

You need to do this for each section that you want separate. The tags
can be nested to provide subsections.
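
To make the nesting concrete, here is a minimal sketch in Python that
prints the tagged text for a small outline. It is only an illustration
of how the opening and closing tags pair up: the function names and the
sample headings are made up, and in practice you would type the same
tags into the Word document by hand.

```python
# Hypothetical helper: shows how section tags nest for subsections.
# The tag layout follows the example above; all names here are
# illustrative and not part of Greenstone.

def open_section(title):
    """Return the comment block that opens a section with this title."""
    return (
        "<!--\n"
        "<Section>\n"
        "<Description>\n"
        f'<Metadata name="Title">{title}</Metadata>\n'
        "</Description>\n"
        "-->\n"
    )

# Comment block that closes the most recently opened section.
CLOSE_SECTION = "<!--\n</Section>\n-->\n"

def tag_outline(outline):
    """Tag an outline given as a list of (title, body, children) tuples.

    children is itself an outline, so subsections end up enclosed
    between the opening and closing tags of their parent section.
    """
    parts = []
    for title, body, children in outline:
        parts.append(open_section(title))
        parts.append(body.rstrip("\n") + "\n")
        parts.append(tag_outline(children))
        parts.append(CLOSE_SECTION)
    return "".join(parts)

if __name__ == "__main__":
    article = [
        ("Abstract", "Abstract text.", []),
        ("Materials and methods", "Overview text.",
         [("Sampling", "Sampling details.", [])]),
    ]
    print(tag_outline(article))
```

In the printed output, the "Sampling" section opens and closes before
the closing tag of "Materials and methods", which is what makes it a
subsection.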

Then in the collection configuration file, you need to add the
-description_tags option to HTMLPlug.
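
As a rough sketch, the HTMLPlug line in the collection's collect.cfg
might then look like the following. I am assuming the usual one
"plugin" line per plugin; leave the other plugin lines in your file as
they are and only add the option to this one.

```
plugin HTMLPlug -description_tags
```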

If you then reimport and rebuild the collection with the modified
documents, you should get a table of contents like those in the demo
collection.

>
> Further I notice, in the greenstone demo collection,
> that the different parts of a document are displayed
> in html mode. Whereas in my case, the document is in
> MSWORD and want to display the different parts of a
> document in MSWORD.

You cannot do this in Greenstone. We cannot handle proprietary-format
documents except by running third-party software to convert them to HTML
or another simple format, and by providing a link to the entire document
as is. We have no way of splitting up a Word document into its parts. If
you are viewing the document in Word inside your browser, we have no
control over how it is displayed.


> What to do in case of documents in
> other formats (pdf, html, etc.) which has headings and
> subheadings.

The same thing as for Word documents. Either the plugin extracts
structure, or you have to add it yourself using the tags described
above, in which case HTMLPlug can extract it.

>
> awaiting your reply on the greenstonelist
>

Please also mail your questions to the list so other people can answer
too.

regards,
Katherine Don

>
> Sincerely
> Shanmuganathan
>
> PS: I am going through all the manuals and trying to
> understand the things. Still, I feel that, without
> having a "query and answer" sessions with you people,
> it is difficult to understand and make progress.
>