Re: [greenstone-users] Accents in PDF and Word files

From Michael Dewsnip
Date Thu, 30 Sep 2004 10:33:14 +1200
Subject Re: [greenstone-users] Accents in PDF and Word files
In-Reply-To (5-0-2-1-2-20040929150504-00b62d00-pop-doc-bondy-ird-fr)
Hi,

(Thanks for joining the mailing list and asking your question here. We did
receive your question on the support form and were intending to answer it, but
prefer to answer questions on the list so other users can contribute answers and
benefit from the discussion.)

I note from your support form request that you are using Greenstone v2.50.
Greenstone v2.51 includes a newer version of the PDF to HTML conversion program
that is used to process PDF files. It is likely that the problems you are having
with encodings have been fixed in this new version, so I recommend you try this
first. You can download Greenstone v2.51 (which also contains many other
improvements over Greenstone v2.50) from
http://sourceforge.net/projects/greenstone. Or, you can just download the new
version of the pdftohtml.exe program from
http://www.cs.waikato.ac.nz/~mdewsnip/greenstone/pdftohtml.exe.zip and overwrite
your existing gsdl\bin\windows\pdftohtml.exe file (but of course you won't get
the other v2.51 improvements).
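
As a general illustration of how an encoding mismatch garbles accented text
(not a diagnosis of this particular collection, and not Greenstone code), here
is a minimal Python sketch. The word and the two encodings are chosen only
because they appear in this thread; everything else is an assumption made for
the example.

```python
# Illustration only: how one accented word looks when its bytes are
# decoded with the wrong encoding.
word = "unité"

utf8_bytes = word.encode("utf-8")          # "é" is stored as two bytes
latin1_bytes = word.encode("iso-8859-1")   # "é" is stored as one byte

# UTF-8 bytes read as ISO-8859-1: the two-byte "é" turns into two
# separate characters.
utf8_read_as_latin1 = utf8_bytes.decode("iso-8859-1")

# ISO-8859-1 bytes read as UTF-8: the lone byte is not valid UTF-8, so it
# is swapped for U+FFFD, the replacement character (often drawn as a box).
latin1_read_as_utf8 = latin1_bytes.decode("utf-8", errors="replace")

print(repr(utf8_read_as_latin1))
print(repr(latin1_read_as_utf8))
```

Either symptom suggests that the text was decoded with a different encoding
from the one it was written in, somewhere between conversion and display.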

If you still have the same problems after rebuilding your collection, please
send me personally one of the PDF files you are having problems with and I'll
try it here.

Regards,

Michael

Pier.Luigi.Rossi@bondy.ird.fr wrote:

> Hi,
> I just became a member of the list.
>
> I have to make collections from PDF files in French.
>
> A file contains a title with accents and text with accents.
>
> When I build a collection using pdfplug I have 2 choices:
>
> If I use pdfplug with input_encoding iso_8859_1:
> the accents in the title are shown correctly when I consult the collection:
> Échelle is shown as Échelle
> the accents in the text are not shown correctly in the text document (they
> are fine in the PDF document!): unitÃ© and not unité
> but all the documents are indexed in the collection
>
> If I use pdfplug with input_encoding auto:
> the accents in the title are not shown correctly when I consult the collection:
> Échelle is shown as chelle
> the accents in the text are shown correctly in the text document (as in the
> PDF document!): unité is shown as unité
> but not all the documents are indexed in the collection
>
> When I search the collection, finding documents is very hard ....
> If I search for documents about unité I have to type .... unitÃ© (and I can't
> with my PC)
> The difficulty is explaining that to all the users ....
>
> I tried changing the preferences to UTF-8 or to iso_8859_1, but if
> I search for unité I can't find unitÃ© in the index .....
>
> Maybe people working with accents in other languages have the same problems?
>
> Is it possible to build the index without accents and filter the search entry
> to strip its accents too?
> If I search for unité, a filter translates it to unite, and the index contains
> just unite ....
>
> Regards
>
> Pier Luigi Rossi
>
> _______________________________________________
> greenstone-users mailing list
> greenstone-users@list.scms.waikato.ac.nz
> https://list.scms.waikato.ac.nz/mailman/listinfo/greenstone-users
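
The idea raised in the quoted message (index the words without accents and
strip accents from the search entry in the same way) can be sketched in a few
lines of Python. This is only an illustration of the general technique using
the standard unicodedata module; it is not how Greenstone builds its indexes,
and the names fold_accents, build_index and search are hypothetical.

```python
import unicodedata


def fold_accents(text):
    """Return text lowercased with combining accent marks removed."""
    # NFD splits "é" into "e" followed by a combining acute accent.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only the characters that are not combining marks.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def build_index(documents):
    """Map each accent-folded word to the set of document ids containing it."""
    index = {}
    for doc_id, text in documents.items():
        for word in text.split():
            index.setdefault(fold_accents(word), set()).add(doc_id)
    return index


def search(index, query):
    """Look up a single word after folding it the same way as the index."""
    return index.get(fold_accents(query), set())


# Hypothetical sample documents, used only for this sketch.
documents = {
    1: "Échelle et unité",
    2: "autre texte",
}
index = build_index(documents)

print(search(index, "unité"))
print(search(index, "unite"))
```

Because the same folding is applied at indexing time and at query time, a
user can type the word with or without accents and reach the same entry. The
trade-off is that words differing only by accents become indistinguishable.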