Date: Fri, 25 Feb 2005 11:59:03 +0930
Subject: Re: [greenstone-users] images
Are you digitising books, inserting the page images into Word documents, and then importing them into Greenstone?
If so, I assume this is a function of your scanner/OCR software (and that you are using Word to get the text as well as the images).
WordPlug converts Word documents into HTML and should take the images with it, but I am not sure whether it will automatically split your documents into pages. More likely you will get one long page (which could be a pain if you have a 300-page book loading into the browser with page images and all).
The scanner/OCR software I have used also provides an option for making PDF files. This might be an option for you, if it is available in the software you use (I have used Omnipage and HP scanner software, and they both do this).
The PDF files can be made to include the images (hiding the text behind the images - so full text search is possible).
The reason I mention PDF is that PDFPlug will retain your pages if you use the -use_sections option. You may also need to try the other settings in PDFPlug to get the hidden text. You can still link to the PDF as well as to the paginated Greenstone version.
If you are not required to use Word as an intermediate format, I would suggest using PagedImgPlug.pm, as
it "processes sequences of images, with optional OCR text".
NOTE: PDFPlug and WordPlug both convert your documents into Greenstone Archive Documents. These contain escaped HTML extracted from the original files by the third-party tools that do the PDF or Word conversion to HTML. (The plugins written by the Greenstone developers don't actually do the conversion themselves; they just pass the document on to 'GhostScript' or vxWord.)
You can use a combination of plugins, so RecPlug can still be used to add metadata.
BTW, I use ImagePlug to maintain a small library of digital photos of my cat. I find it useful for image collections (photographs, not page images) where each image gets assigned metadata using the GLI (via RecPlug, but that's automated in the GLI).
I have copied the following text from the PagedImgPlug plugin itself (see below).
Let me know if this helps,
Stephen De Gabrielle
# processes sequences of images, with optional OCR text
# This plugin takes *.item files, which contain metadata and lists of image
# files, and produces a document containing sections, one for each page.
# The files should be named something.item, then you can have more than one
# book in a directory. You will need to create these files, one for each
# The format of the xxx.item file is as follows:
# The first lines contain any metadata for the whole document
# <Title>Snail farming
# Then comes a list of pages, one page per line, each line has the format
# page num and imagefile are required. pagenum is used for the Title
# of the section, and in the display is shown as page <pagenum>.
# imagefile is the image for the page. textfile is an optional text
# file containing the OCR (or any) text for the page - this gets added
# as the text for the section. r is optional, and signals that the image
# should be rotated 180deg. Eg use this if the image has been made upside down.
# So an example item file looks like:
# <Title>Snail farming
# The second page has no text, the fourth page is a back page, and
# should be rotated.
# All the supplementary image and text files should be in the same folder as
# the .item file.
# To display the images instead of the document text, you can use [srcicon]
# in the DocumentText format statement.
# For example,
# format DocumentText "<center><table width=537><tr><td>[srcicon]</td></tr></table></center>"
# To have it create thumbnail size images, use the '-thumbnail' option.
# To have it create medium size images for display, use the '-screenview'
# option. As usual, running
# 'perl -S pluginfo.pl PagedImgPlug' will list all the options.
# If you want the resulting documents to be presented with a table of
# contents, use '-documenttype hierarchy', otherwise they will have
# next and previous arrows, and a goto page X box.
# If you have used -screenview, you can also use [screenicon] in the format
# statement to display the smaller image. Here is an example that switches
# between the two:
# format DocumentText "<center><table width=537><tr><td><a href=/gsdlmod?e=d-00000-00---off-0gsarch--00-0----0-10-0---0---0direct-10---4-----dfr--0-1l--11-en-50---20-preferences-Haik+Zargaryan--00-0-1-00-0--4----0-0-11-10-0utfZz-8-00&a=d&d=OF02FD70C4-84DEF90C-ON69256FB2-00835373-69256FB3-000D7EFF-nt-gov-au&p=small>Switch to small version.</a></td></tr><tr><td><a href=/gsdlmod?e=d-00000-00---off-0gsarch--00-0----0-10-0---0---0direct-10---4-----dfr--0-1l--11-en-50---20-preferences-Haik+Zargaryan--00-0-1-00-0--4----0-0-11-10-0utfZz-8-00&a=d&d=OF02FD70C4-84DEF90C-ON69256FB2-00835373-69256FB3-000D7EFF-nt-gov-au&p=small title=Switch to small version>[srcicon]</a></td></tr></table></center>"
# Additional metadata can be added into the .item files, alternatively you can
# use normal metadata.xml files, with the name of the xxx.item file as the
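
Since the plugin comments above say you have to create the .item files yourself, a small script can save some typing. The following is only a rough sketch, not part of Greenstone. It assumes the page lines are colon-separated in the order pagenum:imagefile:textfile:r (the exact format line is missing from the text above, so please check it against the plugin source), and the names `Page`, `build_item_file` and the p1.gif-style file names are made up for illustration.

```python
from typing import Dict, Iterable, NamedTuple, Optional


class Page(NamedTuple):
    pagenum: str                    # used as the section title ("page <pagenum>")
    imagefile: str                  # required image for the page
    textfile: Optional[str] = None  # optional OCR (or any) text file
    rotate: bool = False            # True if the image should be rotated 180deg


def build_item_file(metadata: Dict[str, str], pages: Iterable[Page]) -> str:
    """Return the text of an .item file: metadata lines, then one line per page."""
    # Metadata for the whole document comes first, e.g. <Title>Snail farming
    lines = [f"<{name}>{value}" for name, value in metadata.items()]
    for page in pages:
        # ASSUMPTION: fields are colon-separated in the order
        # pagenum:imagefile:textfile:r, with unused fields left empty.
        lines.append(":".join([
            page.pagenum,
            page.imagefile,
            page.textfile or "",
            "r" if page.rotate else "",
        ]))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical file names; they would sit in the same folder as the .item file.
    pages = [
        Page("1", "p1.gif", "p1.txt"),
        Page("2", "p2.gif"),                            # no text for this page
        Page("3", "p3.gif", "p3.txt"),
        Page("3b", "p3b.gif", "p3b.txt", rotate=True),  # back page, upside down
    ]
    print(build_item_file({"Title": "Snail farming"}, pages), end="")
```

Saving the printed output as something.item next to the image and text files should give PagedImgPlug one document with a section per page, provided the colon-separated layout assumed here matches your version of the plugin.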
Stephen De Gabrielle
I'm sick of the Internet - I want a yabby net.
This is a resend of a post from last Saturday, as no one has responded
yet. Could someone do so, or suggest a better place to go for info?
I've read on the mailing list that using ImagePlug is not a good idea
if you already have documents with lots of metadata, and simply want
someone to be able to see the original image of an HTML or Word
document. The suggestion was to put a URL to the image (posted on
another server) in the metadata, because using ImagePlug will end up
making a duplicate of each image.
Our situation is that we anticipate that for every Word doc (the
usual format we import in), we'll want to have a link to an original
image. In fact an original book with many pages will be made into
one Word document, with pagination noted in the Word doc. The user
should have the ability to click on a page number in the Word doc to
see the original image. It would also be helpful for the user to be
able to browse the original images by paging through them. I've seen
different libraries use a variety of image browsers for such a purpose.
Can someone suggest any such (freeware!) browsers and how to
integrate them into a Greenstone library, or another solution?
Pointers to Greenstone libraries with this capability would help as well.