Re: [greenstone-users] newspapers

From Katherine Don
Date Thu, 16 Dec 2004 10:36:53 +1300
Subject Re: [greenstone-users] newspapers
In-Reply-To (5-2-1-1-2-20041213163006-02666b58-postoffice10-mail-cornell-edu)
Hi Rick

I'll answer the easy bits first.
cross-collection search:

Create several collections using the same configuration (same indexes)
and add the line
supercollection collname1 collname2 collname3
to each collection's configuration file (collect.cfg), where you specify
the names of all the collections.
When you do a search, you will search over all the collections. In the
preferences page, the user has the option of selecting only a few
collections if they wish.
The downside of this is that browsing classifiers are not combined
between collections. This is probably not too bad if the sub collections
are split up by series or date - as these are the most useful browsing
mechanisms, and the first browsing step could be selecting which
collection to look in.
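
As a rough illustration of the setup above, here is a small Python sketch that builds the supercollection line for a set of per-decade collections. The collection names (news1900, news1910, ...) and both helper functions are invented for the example; only the supercollection keyword comes from the description above.

```python
def decade_collection_names(first_year, last_year, prefix="news"):
    """Return one hypothetical collection name per decade, e.g. news1900."""
    start = first_year - first_year % 10
    return [f"{prefix}{decade}" for decade in range(start, last_year + 1, 10)]


def supercollection_line(names):
    """Build the line that lists every collection to be searched together."""
    return "supercollection " + " ".join(names)


if __name__ == "__main__":
    names = decade_collection_names(1900, 1929)
    # Assumption: the same line is added to the configuration of each
    # collection named in it.
    print(supercollection_line(names))
```

Adding a new decade then means creating one new collection and regenerating this one line, rather than rebuilding the older collections.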

mgpp indexing:
mgpp indexing uses xml tags to determine where sections start and end,
and what is metadata. Before indexing the content of sections, we strip
out any html tags - there is no point in indexing all the <p>, <img> etc
tags as metadata elements. PDF files are converted to html for
processing by Greenstone, and therefore go through this process of
having the html stripped out. This is the bit that is very slow. We have
on our todo list to speed this up, but haven't gotten around to it yet.
So you have two options. If you know Perl, you could try to speed up
this bit of code (and please send us your improved version). Or you can
use the -no_strip_html option to buildcol.pl. This means none of the
html tags will be removed, hopefully speeding up the preprocessing
time. Of course, there will now be a lot more to index, though :-(
The indexes will be larger, but hopefully it shouldn't make too much
difference to the user. If you try this, please let me know if you have
any problems with it.
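
If you drive the build from a script, the second option might look like the Python sketch below. It assumes buildcol.pl is on the PATH (i.e. the Greenstone environment has been set up in the shell), and "newspapers" is a placeholder collection name; the function names are mine, not part of Greenstone.

```python
import subprocess


def buildcol_command(collection, strip_html=True):
    """Return the argument list for a buildcol.pl run."""
    command = ["buildcol.pl"]
    if not strip_html:
        # Option named in the text above; leaves the html tags in place.
        command.append("-no_strip_html")
    command.append(collection)
    return command


def run_build(collection, strip_html=True):
    """Run the build and raise CalledProcessError if it fails."""
    subprocess.run(buildcol_command(collection, strip_html), check=True)


if __name__ == "__main__":
    # "newspapers" is a placeholder collection name.
    print(" ".join(buildcol_command("newspapers", strip_html=False)))
```

Timing one run with the option and one without, on the same small set of documents, should show whether the stripping step really is the bottleneck for your material.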

forward and back navigation:
We generally do this within documents, i.e. a PDF file that has many
pages can be split into sections, one per page, and then you get forward
and back arrows to navigate between pages.
Greenstone is not so good at supporting this between documents.
What you might want to do is add NextPage and PrevPage (for example)
metadata to each document, then modify the document formatting to
display links using the NextPage and PrevPage metadata.

One way to do this is to create an index on a metadata field that has a
unique value for each file. Then you can use the values from this to
identify each page, and these values would go into the NextPage and
PrevPage metadata. Then the next page link would look like:
a=q&h=<indexname>&q=[NextPage]&ifl=1
This does a search on the index for the NextPage value, and as it is
unique, this will only return one document. The ifl=1 is "I feel lucky"
and takes you straight to the first document rather than to the search
results page.
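
To make the link arguments concrete, this Python sketch assembles the same query string for one page. In the collection itself Greenstone substitutes [NextPage] for you; the code only shows what the result looks like. The index name "pid" and the page identifier are made up, and the library address and any other arguments are left out, as in the link above.

```python
from urllib.parse import urlencode


def next_page_query(index_name, next_page_value):
    """Build the query string for an "I feel lucky" search on a unique value."""
    params = {
        "a": "q",              # do a search
        "h": index_name,       # the index built on the unique metadata field
        "q": next_page_value,  # the NextPage value to look up
        "ifl": "1",            # jump straight to the first matching document
    }
    return urlencode(params)


if __name__ == "__main__":
    # Hypothetical index name and page identifier.
    print(next_page_query("pid", "19041216p002"))
```

Using urlencode means identifiers containing spaces or punctuation are escaped properly, which is one reason to keep the unique values simple (for example date plus page number).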

If the xml files have all this next and prev information in them, you
may be able to write a plugin to process this, and add the metadata to
the documents. Or you could write a script to preprocess these files and
turn the information into Greenstone metadata.xml files. See the
Developer's Guide for more info about metadata.xml files.
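
A preprocessing script along those lines could be sketched in Python as follows. It assumes you have already worked out the page order from the vendor xml and have it as a list of file names; it uses the file name without its extension as the unique identifier. PageID is an invented metadata name (NextPage and PrevPage are the example names from above), and any header lines the Developer's Guide asks for at the top of metadata.xml are omitted, so check the output against the guide before relying on it.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath


def build_metadata_xml(page_files):
    """Return metadata.xml content linking each page to its neighbours.

    page_files must already be in reading order.
    """
    root = ET.Element("DirectoryMetadata")
    ids = [PurePosixPath(name).stem for name in page_files]
    for i, name in enumerate(page_files):
        fileset = ET.SubElement(root, "FileSet")
        # FileName is treated as a pattern, so escape the dots.
        ET.SubElement(fileset, "FileName").text = name.replace(".", r"\.")
        description = ET.SubElement(fileset, "Description")
        values = {"PageID": ids[i]}  # hypothetical metadata name
        if i > 0:
            values["PrevPage"] = ids[i - 1]
        if i < len(ids) - 1:
            values["NextPage"] = ids[i + 1]
        for key, value in values.items():
            ET.SubElement(description, "Metadata", name=key).text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical file names for three consecutive pages.
    print(build_metadata_xml(["p001.pdf", "p002.pdf", "p003.pdf"]))
```

The first page gets no PrevPage and the last page no NextPage, so the document formatting should only show each arrow when the corresponding metadata exists.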

Here are some more general suggestions.

I recommend doing command line building if you are not already. Using
the GLI means you have to import each time, but from the command line
you can just rebuild if you don't need to import. Also, while you are
developing the style of the collection, just use a few documents so it
doesn't take too long.
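
For the "just use a few documents" part, something like this Python sketch could copy a small sample into the import folder of a development collection. The directory names in the usage note are placeholders, and the function is only an illustration, not a Greenstone tool.

```python
import shutil
from pathlib import Path


def copy_sample(source_dir, import_dir, limit=20, pattern="*.pdf"):
    """Copy the first `limit` matching files (sorted by name) for test builds.

    Returns the names of the files that were copied.
    """
    source = Path(source_dir)
    target = Path(import_dir)
    target.mkdir(parents=True, exist_ok=True)
    chosen = sorted(source.glob(pattern))[:limit]
    for path in chosen:
        shutil.copy2(path, target / path.name)
    return [path.name for path in chosen]
```

For example, copy_sample("vendor/1904", "collect/newsdev/import", limit=10) would give a ten-page test collection (both paths are hypothetical). Once the formatting looks right, the same configuration can be moved to the full collection.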

You should think about what you actually want to present to the user. Do
you want to show the Greenstone version or just the original pdf? It may
help to preconvert all your files to html, clean the html up and perhaps
add links to the original pdf, before giving them to Greenstone.
If you don't want to display the Greenstone version, you don't need to
keep any of the html tags. Stripping these out and just giving
Greenstone text documents with an associated pdf would speed up the
build process. (You may need to modify TextPlug to associate the file,
but that shouldn't be too difficult.)
If you never want to display the pdf, do the conversion to html and just
give those documents to Greenstone, and forget about the pdfs.
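
If you go the text-only route, the tags can be stripped ahead of time with a short script. The Python sketch below uses the standard html.parser module; it is not how Greenstone does its own stripping, just one simple way to turn converted html into plain text. It keeps all text content (including anything inside script or style elements), which should be fine for html produced from pdfs.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text between tags, discarding the tags themselves."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(html_text):
    """Return the text content of html_text with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()
    return " ".join("".join(parser.parts).split())


if __name__ == "__main__":
    print(strip_html("<p>Hello <b>world</b></p><img src='x.png'>"))
```

Each resulting text file would then sit next to its pdf, so that the pdf can be associated with the text document at import time.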

I hope this helps, and is useful to get you started.

Regards,
Katherine Don

Enrico Silterra wrote:
> So,
> we are trying to load a collection of digitized newspapers in
> greenstone.
> files provided by our vendor give us
> 1) pdf files per page
> 2) pdf files per article.
> 3) xml files which relate these things.
>
> We have the following problems though.
> a) we need a useful forward and backward navigation. How do people do
> forward,
> and backward navigation? I am unable to find any documentation of
> forw/backw
> are there good examples somewhere we could follow?
> b) our indexes will be for over 100 years worth of material, and we simply
> cannot rebuild an index for this material simply to add an issue. How
> do we implement cross "collection" searching? I think we could
> implement a year or decade, as one
> collection, and batch things up that way.
> c) Our indexing takes phenomenal amounts of time. How does mgpp index
> pdf documents?
> Is some sort of ocr happening?
> d) Have other people solved similar problems with their newspaper
> collection?
> How?
> Thanks in advance for hints, information or suggestions.
> Rick Silterra
>
>
> ******************************
> Enrico Silterra
> Meta Data Engineer
> 107-E Olin Library
> Cornell University
> Ithaca NY 14853
>
> Voice: 607-255-6851
> Fax: 607-255-6110
> E-mail: es287@cornell.edu
> http://www.library.cornell.edu/cts/
> ******************************