[greenstone-users] Section classifiers

From Vladimir R. Risojevic
Date Wed Jan 14 21:28:29 2009
Subject [greenstone-users] Section classifiers
Dear Katherine,

Thank you very much for your answer. I apologize for not answering your
questions earlier. My first question concerned the classifiers. I now
understand that sorting by metadata is the default behavior of classifiers,
and I have found that SectionList has a -sort nosort option that suppresses
the sorting. With it I can produce a classifier that is essentially a table
of contents for the chapters in my book. This is satisfactory functionality
to start with.
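
A minimal sketch of what "unsorted" means here, assuming an item file shaped
like the one quoted further down in this thread. The chapter titles and the
helper name section_titles are illustrative only. The sketch is not Greenstone
code; it just contrasts item-file order with alphabetical order.

```python
import xml.etree.ElementTree as ET

# Hypothetical item file, modelled on the structure quoted later in the thread.
ITEM_XML = """<PagedDocument>
  <Metadata name="dc.Title">Book Title</Metadata>
  <PageGroup>
    <Metadata name="dc.Title">Introduction</Metadata>
    <Page pagenum="1" imgfile="page1.tif" txtfile="page1.txt" />
  </PageGroup>
  <PageGroup>
    <Metadata name="dc.Title">Background</Metadata>
    <Page pagenum="2" imgfile="page2.tif" txtfile="page2.txt" />
  </PageGroup>
</PagedDocument>"""


def section_titles(item_xml):
    """Return the dc.Title of each top-level PageGroup, in item-file order."""
    root = ET.fromstring(item_xml)
    titles = []
    for group in root.findall("PageGroup"):
        for meta in group.findall("Metadata"):
            if meta.get("name") == "dc.Title":
                titles.append(meta.text)
    return titles


titles = section_titles(ITEM_XML)
print(titles)          # order as written in the item file
print(sorted(titles))  # alphabetical order, which reorders the chapters
```

Python's sorted() compares strings by code point, which is one reason sorted
Cyrillic titles may not come out in the expected alphabetical order.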

I obviously didn't research that thoroughly before my previous post. I still
haven't experimented with the Hierarchy classifier but I will try it as soon
as possible.

However, I still don't understand what a top-level bookshelf is in the case of
SectionList -metadata dc.Title -sort nosort. I have only one document in the
collection, and it seems that this is somehow connected. In the future there
will probably be more documents in this collection.

At the moment, I am in the process of deciding what would be the best way to
organize metadata in order to comply with standards and ensure future
interoperability. That is why I decided to use PageGroups to define a logical
structure for chapters and sections. This approach, on the other hand, causes
the problem of empty introductory pages for each section. Is it possible to
avoid them, because they break the continuity of the material?

As I said, I am pretty satisfied with a table of contents implemented as a
SectionList classifier, but I would like to investigate the possibilities of
having both the table of contents and a goto box on a document page. Could you
give me some clues about where to start in the source code to achieve this
functionality?

I hope that I answered your questions, but I also asked some new ones...

Thank you very much for your patience.

Regards,

Vladimir


--

Hi Vladimir

I would just like some clarification on what you are wanting.

When you say table of contents, do you mean on the document page itself, or as
a classifier?

In the standard greenstone interface (not sure if you have done much
modification or not) we have:

classifiers: Built using AZList, SectionList etc. Accessed from the
navigation bar. Not shown on a document page (the link is shown but the
content is not). Sorting is always done as this is the point of classifiers,
to organise the documents into some structure so that they can be found easily.

document page: for multi section documents, we have a choice of navigation
structures: table of contents, and goto page box. Currently greenstone lets
you have one or the other.

Do you have many documents in your collection, or just one?
Do you want a "table of contents" on the document page or as a classifier?

If you are interested in doing coding, you may be able to get a table of
contents and a goto box on the same document page.

The Hierarchy classifier can be used to produce a classifier with a fixed
order. You write a structure file, and assign metadata to the documents based
on that structure. Then that structure is used in the classifier instead of
sorting. This only works at the document level, not at section level.

When you are searching, the search results come back either ranked or in build
order. This depends a little bit on which indexer you are using. MG/MGPP: If
you do a "some" search then the results are ranked, if you do an "all" search
then the results are in build order.
Lucene: you have an option to sort by rank or by metadata that you have built
indexes on.

Build order is the order that documents were processed during build. This can
be changed using -sortmeta option to import. Unfortunately, whole documents
are processed at once, so you can never change the order of sections inside a
document.
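
The point about build order can be modelled in a few lines. This is a sketch
under stated assumptions, not Greenstone's implementation: the documents,
their fields and the function build_order are all hypothetical. It only shows
that sorting whole documents by a metadata value leaves the order of sections
inside each document untouched.

```python
# Hypothetical documents; "title" stands in for whatever metadata is sorted on.
docs = [
    {"title": "Volume B", "sections": ["Chapter 2", "Chapter 1"]},
    {"title": "Volume A", "sections": ["Preface", "Appendix"]},
]


def build_order(documents, sort_key):
    """Sort whole documents by sort_key, then list sections in stored order."""
    ordered = sorted(documents, key=lambda d: d[sort_key])
    return [(d["title"], s) for d in ordered for s in d["sections"]]


for title, section in build_order(docs, "title"):
    print(title, "/", section)
```

"Volume A" now comes before "Volume B", but "Chapter 2" still precedes
"Chapter 1" because sections are never reordered within a document.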

You may want to use Lucene as your indexer. The user can then choose what to
sort search results by. If you want a fixed order, then you can modify macros
so that the sort option is not displayed, but is hard coded to a specific
field.
To do section sorting, you may need to add -sections_index_document_metadata
unless_section_metadata_exists option to buildcol, unless all sections have
the metadata you want to sort by (which they may do in your case).
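
As an illustration of the difference between ranked order and a fixed
metadata order, here is a sketch that sorts a list of hits either way. The hit
records, the "page" and "score" fields and both function names are
assumptions made for the example; they do not reflect how Greenstone or
Lucene store results.

```python
# Hypothetical search hits: one per page, with a relevance score.
hits = [
    {"page": "10", "score": 0.4},
    {"page": "2", "score": 0.9},
    {"page": "1", "score": 0.1},
]


def sort_by_rank(results):
    """Highest score first, as in a ranked search."""
    return sorted(results, key=lambda h: h["score"], reverse=True)


def sort_by_page(results):
    """Ascending page number; convert to int so that 2 sorts before 10."""
    return sorted(results, key=lambda h: int(h["page"]))


print([h["page"] for h in sort_by_rank(hits)])
print([h["page"] for h in sort_by_page(hits)])
```

If page numbers are compared as plain strings, "10" sorts before "2", so
a sort on page-number metadata may need zero-padded values to give the
expected order.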

I hope you can understand all this.
Regards,
Katherine

Vladimir R. Risojevic wrote:
> Dear all,
>
> I have a PagedImage collection with the following structure:
>
> <PagedDocument>
> <Metadata name="dc.Title">Book Title</Metadata>
> <PageGroup>
> <Metadata name="dc.Title">Chapter 1</Metadata>
> <Page pagenum="1" imgfile="page1.tif" txtfile="page1.txt" />
> ...
> </PageGroup>
> <PageGroup>
> <Metadata name="dc.Title">Chapter 2</Metadata>
> <Page ... />
> ...
> </PageGroup>
> ...
> </PagedDocument>
>
> I would like to have a table of contents with sections (Chapter 1, Chapter 2,
> etc.). To this end I built a paged document and created a classifier
> SectionList -metadata dc.Title
> which produced a list of sections sorted in some strange order (my titles are
> in Cyrillic script and I know that Unicode sorting is not quite right with
> SectionList), but there is no way to turn off sorting - I would like section
> titles to appear in the same order as in the item file. Moreover, there is a
> top bookshelf which is always expanded, labeled "Title", and clicking on it
> crashes the server. I tried with Latin metadata and the list is alphabetically
> sorted and everything else is the same.
> Then I tried
> AZSectionList -metadata dc.Title
> There aren't many sections and a hlist is not produced. Everything is the same
> as before except for the top bookshelf which is missing.
> AZCompactSectionList -metadata dc.Title -doclevel section
> returns nothing for Cyrillic script and for Latin script is the same as
> AZSectionList except that chapters are bookshelves.
> Finally,
> GenericList -metadata dc.Title -classify_sections
> sorts Cyrillic metadata alphabetically. I tried to add some additional
> metadata and use -sort_leaf_nodes_using option but it didn't work, probably
> because these are not leaf nodes.
>
> When I build a hierarchical document the order of sections in the list is the
> same as with a paged document. However, when I remove -classify_sections from
> GenericList then sections are in the same order as in the item file, which is
> fine.
>
> I can live with a hierarchical document (although I would like to have
> something else, see 3. below) but I would like to know whether there is a
> way to avoid sorting the titles of sections. Maybe AZ* classifiers have to
> be sorted, as their name suggests, but what about SectionList and
> GenericList? Also, I don't think that I understand the difference between
> AZSectionList and AZCompactSectionList.
>
> 2. The documents are OCR'ed, so I want to add full text searching. When I
> build a search index on full text at the section level, the search results
> are a list of pages which is not sorted in any way. Contrary to the above,
> here I would like to sort the list. I tried the -sortmeta ex.Title option
> but that didn't help. Is there a way to sort the search results according
> to the page numbers?
>
> 3. For me the holy grail of the organization of this collection is to have a
> paged document with prev/next buttons, a goto box and a table of contents (as
> produced with GenericList above) which is always present, similar to
> hierarchical documents. I've built a few collections with Greenstone and I
> don't see how this is possible with standard Greenstone. Please correct me if
> I'm wrong, or tell me whether it would be possible to modify Greenstone to
> allow for this. If the answer is positive, please give me some pointers on
> where to look in the source code, because I would like to try to do it.
>
> I apologize for this extremely long post but I would like to get some things
> straight, and to achieve some functionality for the collections I'm building.
>
> Thank you very much in advance.
>
> Best regards,
>
> Vladimir Risojevic