Re: [greenstone-users] PDF plugin and number of pages

From Michael Dewsnip
Date Thu, 23 Sep 2004 11:09:59 +1200
Subject Re: [greenstone-users] PDF plugin and number of pages
In-Reply-To (415062A3-2000807-unesco-org-uy)
Hi Eduardo,

Yes, these are two useful bits of metadata that PDFPlug should be extracting automatically. In fact, we decided recently that all plugins should extract file size metadata, so hopefully this will make it into the next release.

In terms of the "number of pages" metadata, luckily this isn't too difficult to add. The pdftohtml program that Greenstone uses creates anchor tags in the HTML output (<a name=1>, <a name=2> etc.) at the start of each page. It is fairly simple to look for these tags and count them to determine the number of pages. I've added this code near the end of gsdl/perllib/plugins/

# Add NumPages metadata (we have "<a name=1>" etc for each page)
my $textref = $_[0];
my @pages = ($$textref =~ /<a name=\d+>/ig);
$doc_obj->add_utf8_metadata($cursection, "NumPages", scalar(@pages));

just before

return $result;
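
To try the counting step outside Greenstone, here is a minimal standalone sketch. The count_pages helper and the sample HTML are made up for illustration and are not part of the plugin; the pattern assumes unquoted anchors of the form <a name=1>, as described above. Note that the digit class needs its backslash (\d+); a bare d+ would only match the letter "d".

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Greenstone): counts the per-page
# anchors in pdftohtml-style HTML, given a reference to the text.
sub count_pages {
    my ($textref) = @_;
    # In list context, a /g match with no capture groups returns
    # every matched substring, so the list length is the page count.
    my @pages = ($$textref =~ /<a name=\d+>/ig);
    return scalar(@pages);
}

# Made-up sample input standing in for the converted HTML.
my $html = "<a name=1></a>Page one<br>\n"
         . "<a name=2></a>Page two<br>\n"
         . "<A NAME=3></A>Page three<br>\n";

print count_pages(\$html), "\n";   # prints 3
```

If your version of pdftohtml quotes the attribute value (<a name="1">), the pattern would need adjusting to match.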

If you want to add the file size metadata yourself you'll need to determine the size of the original file, then call add_utf8_metadata as I've done above.
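
As a rough sketch of the file size part: Perl's built-in stat returns the size in bytes as its eighth field. The file_size_bytes helper, the temporary file, and the "FileSize" metadata name below are all assumptions for illustration; inside the plugin you would pass whatever variable holds the path of the original PDF.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper (not part of Greenstone): returns the size of a
# file in bytes, or undef if the file cannot be stat'ed.
sub file_size_bytes {
    my ($path) = @_;
    my @st = stat($path);          # empty list on failure
    return @st ? $st[7] : undef;   # field 7 is the size in bytes
}

# Create a throwaway 1234-byte file so the sketch runs on its own.
my ($fh, $path) = tempfile(UNLINK => 1);
binmode $fh;
print {$fh} "x" x 1234;
close $fh or die "close failed: $!";

my $size = file_size_bytes($path);
print "$size\n";

# Inside the plugin, the call would then look something like this
# ("FileSize" is an assumed metadata name, not an existing convention):
# $doc_obj->add_utf8_metadata($cursection, "FileSize", $size);
```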

Hope this helps,


Eduardo Trápani wrote:

> Hi,
> I would like to add metadata for the number of pages and the file size as automatically extracted elements for PDFs. That will allow people to know in advance how big files are (bandwidth/paper) before they download them.
> Do you know how I can go about it? If you have to point me to the developer's manual, don't hesitate to do so. I don't know if there is information on changes to an existing plugin. But then again, maybe I should write a new plugin that extracts ex.pages and ex.size. What is best?
> Eduardo.