I'm no expert in Arabic OCR, although I have done some work with Arabic
PDFs. In one collection that I have been helping with (incidentally this is
not a Greenstone collection) there are a number of PDFs in Dari and Pashto.
Some work without any problems, but for others searching doesn't work
correctly. The root cause was that the problem PDFs had stored the Arabic
text using the presentation forms.
Because Arabic is a cursive script, it needs a separate set of character
shapes when it is displayed on screen. In very crude terms this ensures that
each character joins up correctly; it is crude because a character looks
different depending on whether it appears at the start, middle or end of
the word.
The character ڀ, "Arabic Letter Beheh", has Unicode U+0680.
The same character in its different presentation forms is:
Isolated ﭚ has Unicode U+FB5A
Final ﭛ has Unicode U+FB5B
Initial ﭜ has Unicode U+FB5C
Medial ﭝ has Unicode U+FB5D
In addition to this, multiple letters can be merged together into a single
ligature when the script is presented.
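
For anyone who wants to check these relationships, here is a minimal Python
sketch using only the standard unicodedata module. It prints the name and
compatibility decomposition of the Beheh base letter and its four
presentation forms. The lam-alef ligature U+FEFB is my own choice of example
for the merging of letters, not something taken from the collection. The
output is ASCII only, so it should print safely on any terminal.

```python
import unicodedata

# Base letter Beheh and its four presentation forms, plus U+FEFB,
# a lam-alef ligature chosen here purely as an illustration.
CODE_POINTS = [0x0680, 0xFB5A, 0xFB5B, 0xFB5C, 0xFB5D, 0xFEFB]

def describe(cp):
    """Return the code point, its Unicode name and its decomposition."""
    ch = chr(cp)
    name = unicodedata.name(ch)
    # decomposition() returns an empty string for ordinary base letters.
    decomposition = unicodedata.decomposition(ch) or "(none)"
    return f"U+{cp:04X}  {name}  ->  {decomposition}"

for cp in CODE_POINTS:
    print(describe(cp))
```

Each presentation form should report a decomposition that points back at the
base letter, tagged with its position, while the ligature points back at two
separate letters.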
When the data is saved it should not be saved in any presentation form. To
quote the Unicode FAQ <http://unicode.org/faq/middleeast.html> on this:
Q: Can one use the Arabic presentation forms in a data file?
A: It is strongly discouraged and not recommended because it does not
guarantee data integrity and interoperability. Data files should include
only the Arabic script code values that are defined in Row 6, U+0600 to
The issue that I came across was that when the data was stored in
presentation form, words would not be matched when doing a search. This is
understandable once you realise that the underlying Unicode is very
different (even if the word searched for is presented identically).
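
To make the mismatch concrete, here is a small Python sketch. The sample word
(kaf, teh, alef, beh) is my own illustration, not text from the collection,
and NFKC compatibility normalisation is just one possible way of folding
presentation forms back to base letters. I am not claiming it is what
Greenstone or any other search engine does internally.

```python
import unicodedata

# The query as a user would type it: base letters from the U+0600 block.
query = "\u0643\u062A\u0627\u0628"

# The same word as a PDF might store it: initial kaf, medial teh,
# final alef, isolated beh, all taken from the presentation-form blocks.
stored = "\uFEDB\uFE98\uFE8E\uFE8F"

def fold(text):
    """Fold compatibility characters, presentation forms included,
    back to their base letters."""
    return unicodedata.normalize("NFKC", text)

print(query in stored)              # raw comparison of code points
print(fold(query) in fold(stored))  # comparison after folding both sides
```

The raw comparison fails because not one code point is shared between the
two strings; after folding, both sides are the same four base letters. Note
that NFKC also rewrites unrelated compatibility characters (Latin ligatures,
full-width digits and so on), which may or may not be acceptable.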
We don't have a tool for this at the moment. I have done a fair amount of
searching but haven't found anything that does what is needed, so I plan to
write a simple program that can parse the text (once the PDFs have been
through the pdftotext program). I haven't got around to writing it yet; if
anyone wants to sponsor me to do this then its priority can certainly be
bumped up.
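
As a rough idea of what such a program might look like, here is a minimal
Python sketch. It assumes the text has already been extracted to a UTF-8 file
(for example by pdftotext), and the function names are hypothetical. It only
touches characters in the two Arabic presentation-form blocks and leaves
everything else alone. This is a sketch under those assumptions, not a tested
tool.

```python
import unicodedata

# Arabic Presentation Forms-A (U+FB50..U+FDFF) and Forms-B (U+FE70..U+FEFF).
PRESENTATION_RANGES = ((0xFB50, 0xFDFF), (0xFE70, 0xFEFF))

def is_presentation_form(ch):
    """True if the character lies in one of the presentation-form blocks."""
    cp = ord(ch)
    return any(low <= cp <= high for low, high in PRESENTATION_RANGES)

def to_base_letters(text):
    """Replace each presentation-form character with its compatibility
    equivalent; a ligature expands to its component letters."""
    return "".join(
        unicodedata.normalize("NFKC", ch) if is_presentation_form(ch) else ch
        for ch in text
    )

def convert_file(src_path, dst_path):
    """Rewrite an extracted UTF-8 text file using base letters only."""
    with open(src_path, encoding="utf-8") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(to_base_letters(text))
```

Characters in those blocks that have no compatibility equivalent pass through
unchanged. A few ligatures in the Forms-A block expand to several words, so
the output would be worth spot-checking with someone who reads the language
before it is indexed.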
I hope that has been a little helpful and I should stress that I am not able
to read Arabic (or even speak it) so much of my knowledge has come from
colleagues at AREU and of course the Internet.
2009/7/20 John Rose <email@example.com>
> Dear Arabic speaking colleagues,
> I'm inviting Graeme to explain his reference to "characters in presentation
> format". It would seem to me that, if the OCR program can save Arabic text
> in any standard character set, searching for full or truncated words should
> be possible in Greenstone (the latter only if the collection is built with
> the Lucene indexer, same restriction as for other languages).
> I have been told by some Arabic speakers that satisfactory OCR software is
> not readily available for Arabic text. It would be nice if Amr and Graeme or
> other colleagues could comment on this, providing information on their
> experience with Arabic OCR software.
> Reminding that there is an Arabic Greenstone blog at
> http://arabicgsdlblog.blogspot.com/ and also an Arabic discussion list at
> http://www.freelists.org/list/greenstone4arab . We hope to promote
> the establishment of an Arabic Greenstone user group, but this effort is
> hampered by lack of information on users and applications. It would be
> useful if users of Greenstone in Arabic could identify themselves and
> provide feedback on their needs, problems and applications, either to this
> list or offlist to me.
> Best regards, John Rose of Greenstone Team
> From: amr hassan <firstname.lastname@example.org>
>> To: "email@example.com"
>> Date: Tue, 14 Jul 2009 03:15:40 -0700 (PDT)
>> Subject: [greenstone-users] Very Important
>> Can Greenstone Recognize Arabic Letters Made By OCR ?
>> From: graeme <email@example.com>
>> Cc: "firstname.lastname@example.org"
>> To: amr hassan <email@example.com>
>> Date: Tue, 14 Jul 2009 18:11:43 +0700
>> Subject: Re: [greenstone-users] Very Important
>> The short answer is yes.
>> A possible problem is that if the OCR saves the characters in presentation
>> format then searching for part words would not work.
> John B. Rose
> 1 Bis, Rue des Châtre-Sacs
> 92310 Sèvres
> Email: <email@example.com>
> (in case of bounce then send to