We have had a look at this document, and can't get it to display
properly. Here is what John Thompson had to say about it:
The Arabic file is using custom embedded fonts and custom encoding... so
I doubt we'll ever be able to extract from it without some sort of
custom mapping from the characters behind the glyphs of the Axt Bassima
Light font through to proper Unicode characters. For instance, from the
document there is this character:
Which, when extracted, appears to map to this ANSI character:
Ü (hex: 00DC, dec: 220)
When in Unicode it should be:
ب (hex: 0628, dec: 1576)
I've checked, and there are plenty of characters outside the ANSI range,
so I think the underlying encoding must be Unicode of some form. My
guess is that if you extract as UTF-8 then display using UTF-8 but use
the Axt Bassima Light font the characters will be fine.
So, my suspicion is that the next step would be to convert the
characters using Perl and a mapping through to their proper characters.
Given I can't even find a character map for Axt Bassima Light, I doubt
that's going to be very easy.
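To make the idea concrete, here is a minimal sketch of such a mapping, written in Python rather than Perl purely for illustration. The only pair taken from the example above is 00DC -> 0628; the names GLYPH_TO_UNICODE and remap are hypothetical, and a real table for Axt Bassima Light would still have to be worked out by hand.

```python
# Hypothetical table from extracted code points to the intended Unicode
# code points. Only the first entry comes from the example in this message;
# every other entry would have to be found by inspecting the font.
GLYPH_TO_UNICODE = {
    0x00DC: 0x0628,  # extracted as U+00DC, intended ARABIC LETTER BEH
}


def remap(text: str) -> str:
    """Replace each character that has a known mapping; leave the rest."""
    return text.translate(GLYPH_TO_UNICODE)


if __name__ == "__main__":
    # Write the result as UTF-8 bytes so it does not depend on the console.
    import sys
    sys.stdout.buffer.write(remap("\u00dc").encode("utf-8") + b"\n")
```

Characters with no entry in the table pass through unchanged, so a partial table can be tested on the extracted text and extended as more glyphs are identified.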
I am not sure what to do about this.
Is this font a common one that is used for Arabic? Have you tried
viewing the Greenstone document in the font mentioned above?
If John is right, then I guess if we can get the mapping for the
characters used, we could add it in to Greenstone.
Maybe you could try the Arabic mailing list for help - visit
http://www.freelists.org/list/greenstone4arab to subscribe.
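As a side check, the percent-encoded file name in the import log quoted below can be decoded on its own. The sketch below uses Python's standard urllib.parse.unquote; the names ENCODED and decode_filename are just for illustration. It assumes the escapes are UTF-8 bytes, and strict decoding will raise an error if they are not.

```python
from urllib.parse import unquote

# Percent-encoded file name root copied from the import.pl log.
ENCODED = "%D8%A7%D9%84%D8%A7%D9%85%D8%A7%D8%B2%D8%BA%D9%8A%D8%A9"


def decode_filename(encoded: str) -> str:
    """Percent-decode and interpret the bytes strictly as UTF-8."""
    return unquote(encoded, encoding="utf-8", errors="strict")


if __name__ == "__main__":
    name = decode_filename(ENCODED)
    # Print code points rather than the letters, to avoid console issues.
    print(len(name), [f"U+{ord(c):04X}" for c in name])
```

If this decodes without error to Arabic letters, the file name itself is probably valid UTF-8, and the FilenameRoot warning is more likely to come from how the name is handled later than from the name itself.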
Katherine of Greenstone Team wrote:
> Hi Zineb
> Can you please send me your document (off list) and I will take a
> look. It looks like maybe the pdftohtml program is not producing utf8,
> but the code is assuming it does.
> Zineb Naji wrote:
>> Hi all,
>> An error is displayed when creating a collection:
>> import.pl> Converting
>> %D8%A7%D9%84%D8%A7%D9%85%D8%A7%D8%B2%D8%BA%D9%8A%D8%A9.*pdf to HTML
>> *import.pl> HTMLPlugin processing C:\Program
>> import.pl> doc::add_utf8_metadata - warning: 'FilenameRoot''s value
>> ????????? wasn't utf8. Tried converting to utf8: □□□□□□□□□
>> import.pl> Fin de l'importation.
>> on the Design panel > PDFPlugin - convert auto - filename_encoding
>> (to display the Arabic characters)