[greenstone-users] Section classifiers

From Diego Spano
Date Thu Jan 15 01:10:37 2009
Subject [greenstone-users] Section classifiers
In-Reply-To (20090114082549-M54889-webmail-etfbl-net)
Vladimir,

I just want to help with one of your problems. You mentioned "the
problem of empty introductory pages for each section. Is it possible to
avoid them, because they break the continuity of the material".

Do you have the "-headerpage" option set in PagedImgPlug? If so,
remove it and rebuild the collection. There should then be no more blank
pages.
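
For illustration only, assuming the plugin is configured in the collection's
etc/collect.cfg and has no other options set (keep any other options you
already have on that line), the change would look roughly like this:

```
# before: a header page is added to each document
plugin PagedImgPlug -headerpage

# after: no header page
plugin PagedImgPlug
```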

Hope this helps.

Diego

-----Original Message-----
From: greenstone-users-bounces@list.scms.waikato.ac.nz
[mailto:greenstone-users-bounces@list.scms.waikato.ac.nz] On behalf of
Vladimir R. Risojevic
Sent: Wednesday, January 14, 2009 6:28
To: kjdon@cs.waikato.ac.nz
CC: greenstone-users@list.scms.waikato.ac.nz
Subject: Re: [greenstone-users] Section classifiers

Dear Katherine,

Thank you very much for your answer. I apologize for not answering your
questions earlier. My first question concerned the classifiers.
Now I understand that sorting the metadata is the default behavior of
classifiers, and I have figured out that SectionList has a -sort nosort
option that suppresses the sorting. In this way I am able to produce a
classifier which is essentially a table of contents for the chapters in my
book. This is satisfactory functionality to start with.
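
As a sketch, the classifier line I am describing would appear in the
collection's collect.cfg roughly as follows (dc.Title is the metadata element
I use for chapter titles; other collections may use a different one):

```
classify SectionList -metadata dc.Title -sort nosort
```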

I obviously didn't research that thoroughly before my previous post. I still
haven't experimented with the Hierarchy classifier but I will try it as soon
as possible.

However, I still don't understand what the top-level bookshelf is in the case
of SectionList -metadata dc.Title -sort nosort. I have only one document in
the collection, and it seems that this is somehow connected. In the future
there will probably be more documents in this collection.

At the moment, I am in the process of deciding what would be the best way to
organize metadata in order to comply with standards and ensure future
interoperability. That is why I decided to use PageGroups to define a
logical structure for chapters and sections. This approach, on the other
hand, causes the problem of empty introductory pages for each section. Is it
possible to avoid them, because they break the continuity of the material?

As I said, I am pretty satisfied with a table of contents implemented as a
SectionList classifier, but I would like to investigate the possibilities of
having both the table of contents and a goto box on a document page. Could
you give me some clues on where to start in the source code to achieve this
functionality?

I hope that I answered your questions, but I also asked some new ones...

Thank you very much for your patience.

Regards,

Vladimir


--

Hi Vladimir

I would just like some clarification on what you are wanting.

When you say table of contents, do you mean on the document page itself, or
as a classifier?

In the standard Greenstone interface (I'm not sure whether you have modified
it much) we have:

classifiers: Built using AZList, SectionList etc. Accessed from the
navigation bar. Not shown on a document page (the link is shown but the
content is not). Sorting is always done, as this is the point of classifiers:
to organise the documents into some structure so that they can be found
easily.

document page: for multi-section documents, we have a choice of navigation
structures: table of contents, and goto page box. Currently Greenstone lets
you have one or the other.

Do you have many documents in your collection, or just one?
Do you want a "table of contents" on the document page or as a classifier?

If you are interested in doing coding, you may be able to get a table of
contents and a goto box on the same document page.

The Hierarchy classifier can be used to produce a classifier with a fixed
order. You write a structure file, and assign metadata to the documents
based on that structure. Then that structure is used in the classifier
instead of sorting. This only works at the document level, not at section
level.
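
A rough sketch of that setup, with hypothetical names throughout: the file
name chapters.txt, the metadata element dc.Subject, and the entries are made
up for illustration, and the file is assumed to live in the collection's etc
directory. The classifier line in collect.cfg would be something like:

```
classify Hierarchy -hfile chapters.txt -metadata dc.Subject
```

The structure file then lists one node per line, in the order you want them
shown: an identifier (matched against the document's metadata value), its
position in the hierarchy, and the name to display:

```
1 1 "Part one"
1.1 1.1 "Early history"
2 2 "Part two"
```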

When you are searching, the search results come back either ranked or in
build order. This depends a little bit on which indexer you are using.
MG/MGPP: If you do a "some" search then the results are ranked, if you do an
"all" search then the results are in build order.
Lucene: you have an option to sort by rank or by metadata that you have
built indexes on.

Build order is the order that documents were processed during build. This
can be changed using -sortmeta option to import. Unfortunately, whole
documents are processed at once, so you can never change the order of
sections inside a document.

You may want to use Lucene as your indexer. The user can then choose what to
sort search results by. If you want a fixed order, then you can modify the
macros so that the sort option is not displayed, but is hard-coded to a
specific field.
To do section sorting, you may need to add the
-sections_index_document_metadata unless_section_metadata_exists option to
buildcol, unless all sections have the metadata you want to sort by (which
they may do in your case).

I hope you can understand all this.
Regards,
Katherine

Vladimir R. Risojevic wrote:
> Dear all,
>
> I have a PagedImage collection with the following structure:
>
> <PagedDocument>
>   <Metadata name="dc.Title">Book Title</Metadata>
>   <PageGroup>
>     <Metadata name="dc.Title">Chapter 1</Metadata>
>     <Page pagenum="1" imgfile="page1.tif" txtfile="page1.txt" />
>     ...
>   </PageGroup>
>   <PageGroup>
>     <Metadata name="dc.Title">Chapter 2</Metadata>
>     <Page ... />
>     ...
>   </PageGroup>
>   ...
> </PagedDocument>
>
> I would like to have a table of contents with sections (Chapter 1,
> Chapter 2, etc.). To this end I built a paged document and created a
> classifier SectionList -metadata dc.Title which produced a list of
> sections sorted in some strange order (my titles are in Cyrillic
> script and I know that Unicode sorting is not quite right with
> SectionList), but there is no way to turn off sorting - I would like
> section titles to appear in the same order as in the item file.
> Moreover, there is a top bookshelf which is always expanded, labeled
> "Title", and clicking on it crashes the server. I tried with Latin
> metadata and the list is alphabetically sorted and everything else is
> the same.
> Then I tried
> AZSectionList -metadata dc.Title
> There aren't many sections and a hlist is not produced. Everything is
> the same as before except for the top bookshelf which is missing.
> AZCompactSectionList -metadata dc.Title -doclevel section returns
> nothing for Cyrillic script and for Latin script is the same as
> AZSectionList except that chapters are bookshelves.
> Finally,
> GenericList -metadata dc.Title -classify_sections sorts Cyrillic
> metadata alphabetically. I tried to add some additional metadata and
> use -sort_leaf_nodes_using option but it didn't work, probably because
> these are not leaf nodes.
>
> When I build a hierarchical document the order of sections in the list
> is the same as with a paged document. However, when I remove
> -classify_sections from GenericList then sections are in the same
> order as in the item file, which is fine.
>
> I can live with a hierarchical document (although I would like to have
> something else, see 3. below) but I would like to know whether there is
> a way to avoid sorting the titles of sections. Well, maybe AZ*
> classifiers have to be sorted, as their name suggests, but what about
> SectionList and GenericList?
> Also, I don't think that I understand the difference between
> AZSectionList and AZCompactSectionList.
>
> 2. The documents are OCR'ed so I want to add full-text searching.
> When I build a search index on full text at the section level, the
> search results give me a list of pages which is not sorted in any way.
> Contrary to the above, here I would like to sort the list. I tried the
> -sortmeta ex.Title option but that didn't help. Is there a way to sort
> the search results according to the page numbers?
>
> 3. For me the holy grail of the organization of this collection is to
> have a paged document with prev/next buttons, a goto box and a table
> of contents (as produced with GenericList above) which is always
> present, similar to hierarchical documents. I've built a few
> collections with Greenstone and I don't see how this is possible with
> standard Greenstone. Please correct me if I'm wrong, or tell me whether
> it would be possible to modify Greenstone to allow for this. If so,
> please give me some pointers on where to look in the source code,
> because I would like to try to do it.
>
> I apologize for this extremely long post but I would like to get some
> things straight, and to achieve some functionality for the collections
> I'm building.
>
> Thank you very much in advance.
>
> Best regards,
>
> Vladimir Risojevic