[greenstone-devel] PagedImagePlugin and PDFs

From Yitzchak Schaffer
Date Fri Jan 7 06:30:10 2011
Subject [greenstone-devel] PagedImagePlugin and PDFs
Hello all,

We are starting work on a book collection based on single-page PDFs. We
need to extract the text from the PDFs for indexing. We are planning to
use the collection with our EmeraldView standalone frontend, with a new
paged PDF presentation scheme.

It looks to me like the PDFPlugin should be able to handle this with the
convertto option, but Ghostscript was choking on our PDFs. Generating
images is not important, as we plan on presenting the PDFs as-is only.

In experimenting with the import process to get this to work, I have
produced a modified PagedImagePlugin that appears to do what we need. I
attach a patch here in case it might prove useful to anyone else. It
assumes that one has the pdftotext executable installed; on my dev
machine I just put it in GSDLHOME\bin\windows.
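
As a rough illustration of that assumption (not part of the patch), the lookup and the call can be sketched in standalone Perl. The helper names pdftotext_path and pdf_to_text are hypothetical, and File::Spec stands in for Greenstone's util::filename_cat so that the sketch runs outside Greenstone:

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: build the expected location of pdftotext from
# GSDLHOME and GSDLOS, mirroring the util::filename_cat call in the patch.
sub pdftotext_path {
    my ($gsdlhome, $gsdlos) = @_;
    return File::Spec->catfile($gsdlhome, 'bin', $gsdlos, 'pdftotext');
}

# Hypothetical helper: return the text of a PDF as UTF-8 bytes, or undef
# if the executable or the PDF is missing, or if pdftotext reports failure.
sub pdf_to_text {
    my ($cmd, $pdf) = @_;
    # On Windows the executable carries an .exe suffix.
    return undef unless -f $cmd || -f "$cmd.exe";
    return undef unless -f $pdf;
    my $text = `"$cmd" -enc UTF-8 "$pdf" -`;    # "-" sends the text to stdout
    return $? == 0 ? $text : undef;
}

# Only runs when a PDF is given on the command line inside a Greenstone
# environment (GSDLHOME and GSDLOS set).
if (defined $ENV{GSDLHOME} && defined $ENV{GSDLOS} && @ARGV) {
    my $cmd  = pdftotext_path($ENV{GSDLHOME}, $ENV{GSDLOS});
    my $text = pdf_to_text($cmd, $ARGV[0]);
    print defined $text ? $text : "no text extracted\n";
}
```

Returning undef when the executable is missing keeps that case distinct from a PDF that simply contains no text, which the patch treats as a warning only.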

The patch includes a few lines that are collection-specific, around line
40 of the patch.

Cheers,

--
Yitzchak Schaffer
Systems Manager
Touro College Libraries
212.742.8770 ext. 2432
http://www.tourolib.org/

-------------- next part --------------
255c255
< return q^(?i)(\.jpe?g|\.gif|\.png|\.tif?f|\.te?xt|\.html?|~)$^
---
> return q^(?i)(\.jpe?g|\.gif|\.png|\.tif?f|\.te?xt|\.html?|\.pdf|~)$^
397a398,465
> sub process_pdf {
> my $self = shift;
> my ($filename_full_path, $filename_no_path, $doc_obj, $section) = @_;
>
> return 0 if ($filename_no_path eq "" || !-f $filename_full_path);
>
> if (!$self->{'processing_tmp_files'} ) {
> $doc_obj->associate_source_file($filename_full_path);
> }
>
> $self->generate_pdf_stuff( $filename_full_path, $filename_no_path, $doc_obj, $section );
>
> return 1; # what are we really supposed to return? seems like void, from ImageConverter::generate_images()
> }
>
> sub extract_pdf_text {
> my $self = shift;
> my ($filename_full_path, $file, $doc_obj, $cursection) = @_;
>
> # check that the PDF exists!!
> if (!-f $filename_full_path) {
> print "PagedImagePlugin: ERROR: File $filename_full_path does not exist, skipping\n";
> return 0;
> }
>
> # remember that this text file was one of our source files, but only
> # if we are not processing a tmp file
> if (!$self->{'processing_tmp_files'} ) {
> $doc_obj->associate_source_file($filename_full_path);
> }
>
> my $cmd = &util::filename_cat($ENV{'GSDLHOME'}, "bin", $ENV{'GSDLOS'}, "pdftotext");
>
> my $text = `$cmd -enc UTF-8 "$filename_full_path" -`;
> $text =~ s/ '/'/g; # confusion around RTL indicators
> $text =~ s/[ ]+/ /g;
> $text =~ s/ / /g;
>
> if (!length ($text)) {
> # It's a bit unusual but not out of the question to have no text, so just give a warning
> print "PagedImagePlugin: WARNING: $filename_full_path contains no text\n";
> }
>
> # we need to escape the escape character, or else mg will convert into
> # eg literal newlines, instead of leaving the text as '\n'
> $text =~ s/\\/\\\\/g; # macro language
> $text =~ s/_/\\_/g; # macro language
>
>
> if ($text =~ m/<html.*?>\s*<head.*?>.*<\/head>\s*<body.*?>(.*)<\/body>\s*<\/html>\s*$/is) {
> # looks like HTML input
> # no need to escape < and > or put in <pre> tags
>
> $text = $1;
>
> # insert preformat tags and add text to document object
> $doc_obj->add_utf8_text($cursection, "$text");
> }
> else {
> $text =~ s/</&lt;/g;
> $text =~ s/>/&gt;/g;
> # insert preformat tags and add text to document object
> $doc_obj->add_utf8_text($cursection, "<pre>\n$text\n</pre>");
> }
>
> return 1;
> }
>
400a469
>
425a495,536
> sub generate_pdf_stuff {
> my $self = shift;
> my ($filename_full_path, $filename_no_path, $doc_obj, $section) = @_;
>
> return 0 if ($filename_no_path eq "" || !-f $filename_full_path);
>
> if ($self->{'enable_cache'}) {
> $self->init_cache_for_file($filename_full_path);
> }
>
> my $verbosity = $self->{'verbosity'};
> my $outhandle = $self->{'outhandle'};
>
> my $filehead = $filename_no_path;
> $filehead =~ s/\.([^\.]*)$//; # filename with no extension
> my $assocfilemeta = "[assocfilepath]";
> if ($section ne $doc_obj->get_top_section()) {
> $assocfilemeta = "[parent(Top):assocfilepath]";
> }
>
> # The images that will get generated may contain percent signs in their src filenames
> # Encode those percent signs themselves so that urls to the imgs refer to them correctly
> my $url_to_filehead = &unicode::filename_to_url($filehead);
> my $url_to_filename_no_path = &unicode::filename_to_url($filename_no_path);
>
> my $type = "application/pdf";
>
> # here we overwrite the original with the potentially converted one
> $doc_obj->set_utf8_metadata_element($section, "Source", &unicode::url_decode($filename_no_path)); # displayname of generated image
> $doc_obj->set_utf8_metadata_element($section, "SourceFile", $url_to_filename_no_path); # displayname of generated image
>
> #overwrite the ones added in BasePlugin
> $doc_obj->set_metadata_element ($section, "FileFormat", 'PagedPDF');
> $doc_obj->set_metadata_element ($section, "FileSize", (-s $filename_full_path) );
>
> $doc_obj->add_metadata ($section, "srclink", "<a href=\"/gsdl/collect/[collection]/index/assoc/$assocfilemeta/[Image]\">");
> $doc_obj->add_metadata ($section, "/srclink", "</a>");
> $doc_obj->add_metadata ($section, "srcicon", "<img src=\"/gsdl/collect/[collection]/index/assoc/$assocfilemeta/[Image]\" width=\"[ImageWidth]\" height=\"[ImageHeight]\">");
>
> # Add the PDF as an associated file ($type is already the full MIME type)
> $doc_obj->associate_file($filename_full_path, $filename_no_path, $type, $section);
> }
445c556,563
< if (defined $imgfile) {
---
> if (defined $imgfile && $imgfile =~ /\.pdf$/i) {
> $self->process_pdf($self->{'xml_file_dir'}.$imgfile, $imgfile, $doc_obj, $self->{'current_section'});
>
> if ( ! defined $_{'txtfile'} ) {
> $self->extract_pdf_text($self->{'xml_file_dir'}.$imgfile, $imgfile, $doc_obj, $self->{'current_section'});
> }
> }
> elsif (defined $imgfile) {
449a568
>
453c572,573
< } else {
---
> }
> if ( ! $doc_obj->get_text( $self->{'current_section'} ) ) {