[greenstone-users] Collection of Devanagari documents in pdf

From Tim Chase
Date Mon Aug 31 22:52:16 2009
Subject [greenstone-users] Collection of Devanagari documents in pdf
Hello Greenstone Users

I'm attempting to create a collection of Nepali documents written in
Devanagari script. I seem to be having difficulties during the import
process: when PDF files are converted to the Greenstone Archive Format
(GAF), some of the characters are missing from the resulting XML file. As
a result, once the build process is complete, the indexes over the text
are incorrect.
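
One way to pin down which characters are being lost is to compare the text
from a good import against the text from the PDF import. The following is
only a minimal sketch, not part of Greenstone: the function names are
hypothetical, and it assumes both texts have already been read into Python
strings (for example from the two XML files in the archives directory).

```python
from collections import Counter
import unicodedata

# Devanagari Unicode block (assumption: only this block matters here).
DEVANAGARI_START, DEVANAGARI_END = 0x0900, 0x097F


def devanagari_counts(text):
    """Count each Devanagari code point in text after NFC normalisation."""
    normalised = unicodedata.normalize("NFC", text)
    return Counter(
        ch for ch in normalised
        if DEVANAGARI_START <= ord(ch) <= DEVANAGARI_END
    )


def missing_characters(reference, extracted):
    """Map each character to how many more times it occurs in reference
    than in extracted. Characters with no shortfall are omitted."""
    ref = devanagari_counts(reference)
    got = devanagari_counts(extracted)
    # Counter subtraction keeps only strictly positive differences.
    return dict(ref - got)
```

Calling missing_characters(htm_text, pdf_text) would then list the code
points that appear less often in the PDF-derived text, which may show
whether it is mainly vowel signs or conjuncts that are being dropped.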

Obviously the best solution would be for my PDF files to import correctly,
but if that is not possible, can I have my collection display the PDF files
while using doc or htm files as the source documents during the import, so
that correct indexes are built during the build process?

I have done some testing with different input formats for the same
document, with the results below. All text is typed in Unicode Devanagari
using the Arial Unicode font.

The original source was a Word document that was saved as an htm file and
also printed to a PDF file. Files attached.

Import:

Htm - imports flawlessly

pdf - lost characters on conversion to GAF

doc - imports flawlessly


Build:

Html - good index; text displays according to how it was formatted in Word,
i.e. named styles are converted to H1, etc. So formatting is a bit tricky
when coming from Word.

PDF - bad index resulting from a bad import to GAF. The XML file is
produced but with missing characters. This format works best for printing
files and does not require a word processor.

Doc - good index; the text in the htm displays somewhat scrunched unless
the windows_scripting option is selected in the WordPlugin, in which case
the file displays the same as the HTML-formatted file above. A word
processor is required to open the doc file.

Any advice would be much appreciated. I'm using version 2.82.

Tim


-------------- next part --------------
An HTML attachment was scrubbed...
URL: https://list.scms.waikato.ac.nz/mailman/private/greenstone-users/attachments/20090819/3603ca72/poem-0001.htm
-------------- next part --------------
A non-text attachment was scrubbed...
Name: poem.pdf
Type: application/pdf
Size: 55467 bytes
Desc: not available
Url : https://list.scms.waikato.ac.nz/mailman/private/greenstone-users/attachments/20090819/3603ca72/poem-0001.pdf
-------------- next part --------------
A non-text attachment was scrubbed...
Name: poem.doc
Type: application/msword
Size: 30208 bytes
Desc: not available
Url : https://list.scms.waikato.ac.nz/mailman/private/greenstone-users/attachments/20090819/3603ca72/poem-0001.doc