HTMLPlug, page turning, page images, and alternate versions of the document.

From: Stephen.DeGabrielle@ntu.edu.au
Date: Fri, 29 Nov 2002 10:54:17 +0930
Subject: HTMLPlug, page turning, page images, and alternate versions of the document.
To: greenstone@colosys.net


Hi,
I am trying to make a collection of digitised books (scanned and OCR'd to
create text).

I am hoping to use the HTMLPlug to do this.

My first problem is that I want my documents to be a sequence of pages, but
my documents always seem to be hierarchical in structure. From reading the
manual I believe that hierarchical documents cause Greenstone to display a
TOC, while non-hierarchical documents cause the 'Next/Previous' arrows to be
available (see format DocumentContents true/false, page 43 of the developers
manual).

Secondly, I am trying to get them to refer to 'alternate versions':
- the image (jpg) of the page
- a pdf of the whole document

I believe I have done this properly by including these filenames in my page
metadata and in the metadata for each page. But weirdly it only works for
one document, and then only for the pdf (and only on the first page).
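
To narrow this down, a small script can list which metadata each section
actually carries. The following is only a rough sketch in Python: the function
names are invented for illustration, it is not part of Greenstone, and it
assumes the description tags are written exactly as in the example document
further down. It says nothing about how Greenstone itself treats a missing
element.

```python
import re

# Matches one <Section> header up to the end of its <Description> block.
SECTION_RE = re.compile(r"<Section>\s*<Description>(.*?)</Description>", re.S)
# Matches a single <Metadata name="...">value</Metadata> element.
METADATA_RE = re.compile(r'<Metadata name="([^"]+)">(.*?)</Metadata>', re.S)


def section_metadata(html):
    """Return one dict of metadata per <Section>, in document order."""
    return [
        {name: value.strip() for name, value in METADATA_RE.findall(block)}
        for block in SECTION_RE.findall(html)
    ]


def missing(html, required=("PageImage", "AlternateVersion")):
    """Return (section_number, missing_names) for sections lacking a required name."""
    report = []
    for number, meta in enumerate(section_metadata(html), start=1):
        absent = [name for name in required if name not in meta]
        if absent:
            report.append((number, absent))
    return report


# Cut-down copy of the first two sections of the example document.
SAMPLE = """
<!--
<Section>
<Description>
<Metadata name="Title">A terra, a gente e os costumes de Timor</Metadata>
<Metadata name="PageImage">frontcover.jpg</Metadata>
<Metadata name="AlternateVersion">doc.pdf</Metadata>
</Description>
-->
<h1>CADERNOS COLONIAIS</h1>
<!--
</Section>
<Section>
<Description>
<Metadata name="Title">A terra, a gente e os costumes de Timor</Metadata>
<Metadata name="PageImage">titlepage.jpg</Metadata>
</Description>
-->
<h1>CADERNOS COLONIAIS</h1>
"""

if __name__ == "__main__":
    for number, absent in missing(SAMPLE):
        print(f"Section {number} is missing: {', '.join(absent)}")
```

Run against the whole source file, this kind of check shows at a glance which
pages have no value for an element used in the format statement.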
--
format DocumentText '<p><a href="/gsdl/collect/gsarch/index/assoc//[PageImage]">View page
image</a><br><a href="/gsdl/collect/gsarch/index/assoc//[AlternateVersion]">PDF Version </a>
<p>[Text]'
# where [PageImage] is some metadata element set as described above.

--
This email contains my collect.cfg, a sample of my document, and the
original email in the GS archives which gave me the idea.
(I adapted the advice to use HTMLPlug instead of the book plugin (HBSPlug)
that was advised. This may be my fatal mistake.)
--

I am using Greenstone 2.38 on Windows 2000.

Regards,

Stephen De Gabrielle

PS
I can't get 'format DocumentArrowsBottom true' to make arrows show up
either.
PPS
I can't use PDFPlug as the documents will be too big, and my clients will
have slow old computers and low-speed connections. Supplying the link to the
pdf is a desirable but optional extra.


--
Stephen De Gabrielle
Digitisation Officer
AraDA Project
NTU Library
+61 8 8946 7009
http://www.ntu.edu.au/library


--example document---
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="Generator" CONTENT="OmniPage 11 - www.scansoft.com">
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8">
<TITLE>Recognized HTML document</TITLE>
<link href="aterra.css" rel="stylesheet" type="text/css">
</HEAD>
<BODY>
<!--
<Section>
<Description>
<Metadata name="Title">A terra, a gente e os costumes de Timor</Metadata>
<Metadata name="Author">Braga, Paulo</Metadata>
<Metadata name="PageImage">frontcover.jpg</Metadata>
<Metadata name="AlternateVersion">doc.pdf</Metadata>
</Description>
-->
<h1>CADERNOS COLONIAIS</h1>
<p align="center"><img src="front-cover.jpg" alt="Front Cover: Includes
image with captiion: Rapaz de Manufai" width="300" height="460" border
="1"></p>
<h1>A TERRA, A GENTE <font size="-1">E OS</font></h1>
<h1>COSTUMES DE TIMOR</h1>
<p>POR</p>
<p>PAULO BRAGA</p>
<!--
</Section>
<Section>
<Description>
<Metadata name="Title">A terra, a gente e os costumes de Timor</Metadata>
<Metadata name="PageImage">titlepage.jpg</Metadata>
</Description>
-->
<h1>CADERNOS COLONIAIS</h1>
<p align="center">N.º 7</p>
<p align="center"> <em>PAULO BRAGA </em> </p>
<h1>A TERRA, A GENTE E OS</h1>
<h1>: COSTUMES DE TIMOR :</h1>
<p>EDITORIAL COSMOS</p>
<p>Rua do Mundo, 100, S.□ </p>
<p>L 1 S B 0 A </p>
<p align="center"><img src="front-cover.jpg" width="300" height="460"
border="1">
</p>
<!--
</Section>
<Section>
<Description>
<Metadata name="Title">A terra, a gente e os costumes de Timor</Metadata>
<Metadata name="PageImage">01.jpg</Metadata>
</Description>
-->
<p class="text"><em>Este Caderno jala da mais distante e desconhecida
colónia de Portugal
? e é um simples apanhado de impressões. </em>


-collect.cfg--

creator stephen.degabrielle@ntu.edu.au
maintainer stephen.degabrielle@ntu.edu.au
public true

indexes document:text document:Title
defaultindex document:text

plugin GAPlug
plugin HTMLPlug -description_tags -assoc_files '(?i).(jpe?g|gif|png|css|pdf)$' -block_exp "(?i).(gif|jpe?g|png|css|pdf)$"
# -rename_assoc_files
plugin ArcPlug
plugin RecPlug
format DocumentArrowsBottom true
#
# To generate a link to a file alter the
# DocumentText format string from its default of
# format DocumentText '[Text]'
# to something more like
format DocumentText '<p><a href="/gsdl/collect/gsarch/index/assoc//[PageImage]">View page
image</a><br><a href="/gsdl/collect/gsarch/index/assoc//[AlternateVersion]">PDF Version </a>
<p>[Text]'
# where [PageImage] is some metadata element set as described above.

classify AZList -metadata Title

collectionmeta collectionname "AraDA Project demo 271102"
collectionmeta iconcollection ""
collectionmeta collectionextra ""
collectionmeta .document:text "text"
collectionmeta .document:Title "titles"


-message in archives-
mailto:greenstone@colosys.net
----- Original Message -----
From: "Marvin Brunner" <mbru@oce.nl>
To: <greenstone@tripath.colosys.net>
Sent: Tuesday, March 13, 2001 10:57 PM
Subject: computer science technical reports collection

> Hi,
> I'm working on building a collection of old books, scanned, cleaned up
> and made searchable by OCR. Each page is currently a single .txt file.
> The IndexPlug is used to add metadata such as the pagenumber and chapter
> to each file.
> Reading the book however would be much easier when a next and previous
> bookpage button is available, such as in the Computer Science Technical
> Reports collection. I have tried to add metadata fields in the index.txt
> file for next and previous, but this was not successful.
> Additionally, I would like to have a link on the bookpage to a PDF file
> of the scanned page.
>
> Is it possible to do it with the IndexPlug, and if so, how?
> Can I get the files from the Computer Science Technical Reports
> collection that are used to build that collection (such as the
> collect.cfg, plugins used, and additional files like index.txt)?
>
>
> Thanks in advance,
> Marvin Brunner
> Océ Technologies B.V.
> The Netherlands

Hi Marvin,

The CSTR collection (and others that do similar things e.g. Gutenberg) uses
a specialized plugin to split each document into sections (pages in this
case) as it goes along.

Unfortunately these plugins have all been produced to handle the specific
format of input documents used by their intended collection. This makes
them unsuitable for use with another collection unless you're using the
same format of documents or are willing to alter the plugins.

This leads to your two options as far as getting your collection formatted
the way you want it.

1. Write your own plugin. While this is not too difficult it does require a
certain amount of Perl knowledge and a reasonable understanding of how
Greenstone plugins work. If you want to tackle this the Greenstone
developers manual is a good place to start.

2. Use an existing plugin (HBSPlug - see below) and format your input
documents to suit. This is fine if you don't have too many documents or can
automate the reformatting task. It may be impractical otherwise. I'll
assume that you have reasonable control over the format of your input
documents and don't mind spending some time fiddling about to get the
collection looking the way you want (after all you're already creating an
index.txt file).

HBSPlug
----------
The HBSPlug plugin comes as a standard part of Greenstone and is ideal for
creating a collection of books (e.g. the Humanity Development Library).
Have a look at the comments at the top of gsdl/perllib/plugins/HBSPlug.pm
for details of how to use it. Basically you'll need to reformat your input
documents so that each book is contained in a .hb file, the format of which
goes something like:

<<TOC1>><<Title>>Book title<</Title>><<Creator>>Authors name<</Creator>>
May or may not be some text here (this applies to any level of
the hierarchical structure).
<<TOC2>><<Title>>Chapter 1<</Title>>
<<TOC3>><<Title>>Page 1<</Title>>
Some text belonging to page 1
<<TOC3>><<Title>>Page 2<</Title>>
Some text belonging to page 2
<<TOC2>><<Title>>Chapter 2<</Title>>
<<TOC3>><<Title>>Page 3<</Title>>
Some text belonging to page 3

Note that any metadata can be added to any <<TOC>> line (e.g. you might
want to add some metadata which is the link to the pdf file for each page).
Also, since all your metadata is being added in the .hb files you'll no
longer need your index.txt.

Note also that you can have any depth of <<TOC>> tags.
e.g. <<TOC1>>book level
<<TOC2>>chapter level
<<TOC3>>section level
<<TOC4>>page level
and that any section, at any level, may contain text.
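
If the reformatting can be automated, the .hb text could be generated by a
small script. What follows is only a sketch in Python, assuming the page text
and chapter grouping are already available in memory. The function names are
made up for illustration, and "linktopdf" is just a hypothetical metadata name
for a per-page PDF link. It produces the tag layout shown above and nothing
more, so check the comments in HBSPlug.pm for any further requirements on the
files.

```python
def hb_tag(name, value):
    """Format one metadata element in the <<Name>>value<</Name>> style."""
    return f"<<{name}>>{value}<</{name}>>"


def toc_line(level, metadata):
    """Build a <<TOCn>> marker followed by its metadata elements."""
    tags = "".join(hb_tag(name, value) for name, value in metadata.items())
    return f"<<TOC{level}>>{tags}"


def build_hb(title, creator, chapters):
    """Return .hb text for one book.

    chapters is a list of (chapter_title, pages) pairs; each page is a
    (page_metadata_dict, page_text) pair.
    """
    lines = [toc_line(1, {"Title": title, "Creator": creator})]
    for chapter_title, pages in chapters:
        lines.append(toc_line(2, {"Title": chapter_title}))
        for page_metadata, page_text in pages:
            lines.append(toc_line(3, page_metadata))
            lines.append(page_text)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    book = build_hb(
        "Book title",
        "Authors name",
        [
            ("Chapter 1", [
                # "linktopdf" is a hypothetical metadata name.
                ({"Title": "Page 1", "linktopdf": "page1.pdf"},
                 "Some text belonging to page 1"),
                ({"Title": "Page 2", "linktopdf": "page2.pdf"},
                 "Some text belonging to page 2"),
            ]),
        ],
    )
    print(book)
```

Each page's metadata is passed as a dictionary, so extra elements can be added
per page without changing the functions.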


As far as generating a link to the pdf file, I guess to do that you'll want
to alter the DocumentText format string from its default of
format DocumentText '[Text]'
to something more like
format DocumentText '<p><a href="[linktopdf]">Go to pdf</a><p>[Text]'
where [linktopdf] is some metadata element set as described above.


Please mail the list if you have any questions about any of this.
regards,
Stefan.

