[greenstone-users] Section classifiers

From Vladimir R. Risojevic
Date Thu Jan 15 11:29:25 2009
Subject [greenstone-users] Section classifiers
In-Reply-To (002501c97641$5f251f70$1d6f5e50$-gov-ar)
Diego,

Thank you for your post. Unfortunately, I don't have the "-headerpage" option
set. I can avoid empty pages if I don't use the PageGroup, but instead assign
metadata with section titles to the first pages of sections. But that doesn't
feel right because these item files don't represent the structure of the
document anymore.
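
A minimal sketch of that workaround, for illustration only. It assumes the item file has top-level PageGroup elements as in the original post quoted below, and that a Metadata element nested inside a Page is accepted by the plugin; the function name flatten_page_groups is hypothetical and not part of Greenstone.

```python
import xml.etree.ElementTree as ET


def flatten_page_groups(item_xml: str) -> str:
    """Replace each top-level <PageGroup> with its pages, copying the
    group's dc.Title onto the first page of that group.

    Assumption: nested PageGroups are not handled, and the plugin is
    assumed to read <Metadata> placed inside <Page>.
    """
    root = ET.fromstring(item_xml)
    new_children = []
    for child in list(root):
        if child.tag != "PageGroup":
            # Document-level metadata and loose pages are kept as-is.
            new_children.append(child)
            continue
        # Find the section title assigned to this group, if any.
        title = None
        for meta in child.findall("Metadata"):
            if meta.get("name") == "dc.Title":
                title = meta.text
                break
        pages = child.findall("Page")
        if pages and title is not None:
            # Attach the section title to the first page of the section.
            meta = ET.Element("Metadata", {"name": "dc.Title"})
            meta.text = title
            pages[0].insert(0, meta)
        new_children.extend(pages)
    # Rebuild the document with the flattened list of children.
    for child in list(root):
        root.remove(child)
    root.extend(new_children)
    return ET.tostring(root, encoding="unicode")
```

Calling flatten_page_groups on the text of an item file returns the flattened XML as a string. The result no longer records where a section ends, which is exactly the loss of structure described above.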

Regards,

Vladimir

On Wed, 14 Jan 2009 10:12:29 -0200, Diego Spano wrote
> Vladimir,
>
> I just want to help with one of your problems. You mentioned "the
> problem of empty introductory pages for each section. Is it possible
> to avoid them, because they break the continuity of the material".
>
> Do you have the option "-headerpage" set in PagedImgPlug? If you
> have it, remove it and rebuild the collection. This way there will
> be no more blank pages!
>
> Hope this helps.
>
> Diego
>
> -----Original message-----
> From: greenstone-users-bounces@list.scms.waikato.ac.nz
> [mailto:greenstone-users-bounces@list.scms.waikato.ac.nz] On behalf
> of Vladimir R. Risojevic
> Sent: Wednesday, 14 January 2009 6:28
> To: kjdon@cs.waikato.ac.nz
> CC: greenstone-users@list.scms.waikato.ac.nz
> Subject: Re: [greenstone-users] Section classifiers
>
> Dear Katherine,
>
> Thank you very much for your answer. I apologize for not answering your
> questions earlier. My first question concerned the classifiers.
> Now, I understand that sorting of the metadata is the default
> behavior of classifiers, and I have figured out that SectionList has
> a -sort nosort option that suppresses the sorting. In this way I am
> able to produce a classifier which is essentially a table of
> contents for chapters in my book. This is a satisfactory
> functionality to start with.
>
> I obviously didn't research that thoroughly before my previous post.
> I still haven't experimented with the Hierarchy classifier but I
> will try it as soon as possible.
>
> However, I still don't understand what a top-level bookshelf is in
> the case of SectionList -metadata dc.Title -sort nosort. I have only
> one document in a collection and it seems that this is somehow
> connected. In the future there will probably be more documents in
> this collection.
>
> At the moment, I am in the process of deciding what would be the
> best way to organize metadata in order to comply with standards and
> ensure future interoperability. That is why I decided to use
> PageGroups to define a logical structure for chapters and sections.
> This approach, on the other hand, causes the problem of empty
> introductory pages for each section. Is it possible to avoid them,
> because they break the continuity of the material?
>
> As I said I am pretty satisfied with a table of contents implemented
> as a SectionList classifier but I would like to investigate the
> possibilities of having both the table of contents and a goto box on
> a document page. Could you give me some clues where to start with
> the source code to achieve this functionality?
>
> I hope that I answered your questions, but I also asked some new
> ones...
>
> Thank you very much for your patience.
>
> Regards,
>
> Vladimir
>
> --
>
> Hi Vladimir
>
> I would just like some clarification on what you are wanting.
>
> When you say table of contents, do you mean on the document page
> itself, or as a classifier?
>
> In the standard greenstone interface (not sure if you have done much
> modification or not) we have:
>
> classifiers: Built using AZList, SectionList etc. Accessed from the
> navigation bar. Not shown on a document page (the link is shown but the
> content is not). Sorting is always done as this is the point of
> classifiers, to organise the documents into some structure so that
> they can be found easily.
>
> document page: for multi section documents, we have a choice of navigation
> structures: table of contents, and goto page box. Currently
> greenstone lets you have one or the other.
>
> Do you have many documents in your collection, or just one?
> Do you want a "table of contents" on the document page or as a classifier?
>
> If you are interested in doing coding, you may be able to get a
> table of contents and a goto box on the same document page.
>
> The Hierarchy classifier can be used to produce a classifier with a fixed
> order. You write a structure file, and assign metadata to the documents
> based on that structure. Then that structure is used in the
> classifier instead of sorting. This only works at the document level,
> not at section level.
>
> When you are searching, the search results come back either ranked
> or in build order. This depends a little bit on which indexer you
> are using. MG/MGPP: If you do a "some" search then the results are
> ranked, if you do an "all" search then the results are in build
> order. Lucene: you have an option to sort by rank or by metadata
> that you have built indexes on.
>
> Build order is the order that documents were processed during build.
> This can be changed using -sortmeta option to import. Unfortunately,
> whole documents are processed at once, so you can never change the
> order of sections inside a document.
>
> You may want to use Lucene as your indexer. The user can then choose
> what to sort search results by. If you want a fixed order, then you
> can modify macros so that the sort option is not displayed, but is
> hard coded to a specific field. To do section sorting, you may need
> to add the -sections_index_document_metadata
> unless_section_metadata_exists option to buildcol, unless all
> sections have the metadata you want to sort by (which they may do in
> your case).
>
> I hope you can understand all this.
> Regards,
> Katherine
>
> Vladimir R. Risojevic wrote:
> > Dear all,
> >
> > I have a PagedImage collection with the following structure:
> >
> > <PagedDocument>
> >   <Metadata name="dc.Title">Book Title</Metadata>
> >   <PageGroup>
> >     <Metadata name="dc.Title">Chapter 1</Metadata>
> >     <Page pagenum="1" imgfile="page1.tif" txtfile="page1.txt" />
> >     ...
> >   </PageGroup>
> >   <PageGroup>
> >     <Metadata name="dc.Title">Chapter 2</Metadata>
> >     <Page ... />
> >     ...
> >   </PageGroup>
> >   ...
> > </PagedDocument>
> >
> > I would like to have a table of contents with sections (Chapter 1,
> > Chapter 2, etc.). To this end I built a paged document and created a
> > classifier SectionList -metadata dc.Title which produced a list of
> > sections sorted in some strange order (my titles are in Cyrillic
> > script and I know that Unicode sorting is not quite right with
> > SectionList), but there is no way to turn off sorting - I would like
> > section titles to appear in the same order as in the item file.
> > Moreover, there is a top bookshelf which is always expanded, labeled
> > "Title", and clicking on it crashes the server. I tried with Latin
> > metadata and the list is alphabetically sorted and everything else is the
> > same.
> > Then I tried
> > AZSectionList -metadata dc.Title
> > There aren't many sections and an hlist is not produced. Everything is
> > the same as before except for the top bookshelf which is missing.
> > AZCompactSectionList -metadata dc.Title -doclevel section returns
> > nothing for Cyrillic script and for Latin script is the same as
> > AZSectionList except that chapters are bookshelves.
> > Finally,
> > GenericList -metadata dc.Title -classify_sections sorts Cyrillic
> > metadata alphabetically. I tried to add some additional metadata and
> > use -sort_leaf_nodes_using option but it didn't work, probably because
> > these are not leaf nodes.
> >
> > When I build a hierarchical document the order of sections in the list
> > is the same as with a paged document. However, when I remove
> > -classify_sections from GenericList then sections are in the same
> > order as in the item file, which is fine.
> >
> > I can live with a hierarchical document (although I would like to have
> > something else, see 3. below) but I would like to know whether there is
> > a way to avoid sorting the titles of sections. Well, maybe AZ* classifiers
> > have to be sorted, which is suggested by their name, but what about
> > SectionList and GenericList?
> > Also, I don't think that I understand the difference between
> > AZSectionList and AZCompactSectionList.
> >
> > 2. The documents are OCR'ed so I want to add full-text searching.
> > When I build a search index on full text at the section level, the
> > search results are a list of pages which is not sorted in any way.
> > Contrary to the above, here I would like to sort the list. I tried the
> > -sortmeta ex.Title option but that didn't help. Is there a way to sort
> > the search results according to the page numbers?
> >
> > 3. For me the holy grail of the organization of this collection is to
> > have a paged document with prev/next buttons, a goto box and a table
> > of contents (as produced with GenericList above) which is always
> > present, as in hierarchical documents. I've built a few
> > collections with Greenstone and I don't see how this is possible with
> > standard Greenstone. Please correct me if I'm wrong, or tell me whether
> > it would be possible to modify Greenstone to allow for this and, if so,
> > give me some pointers on where to look in the source code, because I
> > would like to try to do it.
> >
> > I apologize for this extremely long post but I would like to get some
> > things straight, and to achieve some functionality for the collections
> > I'm building.
> >
> > Thank you very much in advance.
> >
> > Best regards,
> >
> > Vladimir Risojevic
> >