[greenstone-devel] PDFPlug.pm plugin page/prefix processing perl problems

From Stephen.DeGabrielle@nt.gov.au
Date Wed, 12 Jan 2005 17:46:37 +0930
Subject [greenstone-devel] PDFPlug.pm plugin page/prefix processing perl problems

Hi,

Have a look at the following snippet:
(starts somewhere near line 100 of PDFPlug.pm)
####
    # following title_sub removes "Page 1" added by pdftohtml, and a leading
    # "1", which is often the page number at the top of the page. Bad Luck
    # if your document title actually starts with "1 " - is there a better way?

    #my $self = new ConvertToPlug ($class, @args, "-title_sub", '^(Page\s+\d+)?(\s*1\s+)?');
    my $self = new ConvertToPlug ($class, @args);
    $self->{'plugin_type'} = "PDFPlug";
    if ($use_sections) {
        $self = new ConvertToPlug ($class, @args, "-title_sub", '^(Page\s+\d+)?(\s*1\s+)?');
        $self->{'use_sections'}=1;
    }
####
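
For reference, here is a small standalone sketch of what that default pattern does to a few titles. It assumes the expression is applied as a plain substitution that deletes whatever it matches; the sub `strip_title` and the sample titles are made up for illustration and are not part of Greenstone.

```perl
use strict;
use warnings;

# Default pattern quoted in the PDFPlug.pm comment.
my $title_sub = '^(Page\s+\d+)?(\s*1\s+)?';

# Hypothetical helper: delete whatever the pattern matches at the
# start of the title (assumed to mirror how a title_sub is applied).
sub strip_title {
    my ($title) = @_;
    $title =~ s/$title_sub//;
    return $title;
}

# Made-up sample titles, including one that really starts with "1 ".
for my $title ('Page 1 1 Introduction', '1 Introduction', '1 Samuel: a commentary') {
    printf "%-26s => %s\n", "'$title'", "'" . strip_title($title) . "'";
}
```

The last sample is the "Bad Luck" case from the comment: a title that really does start with "1 " loses its first word.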

It is part of 'sub new' in PDFPlug.pm, and is my attempt to fix PDFPlug so that it doesn't override a
title_sub specified in the plugin's arguments with the default '^(Page\s+\d+)?(\s*1\s+)?' expression
that is needed to remove the "Page 1" added by pdftohtml (as noted in the comments).
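
What I am aiming for could be sketched roughly as follows: append the default expression only when the caller has not already passed a -title_sub of their own. This is a standalone illustration, not the actual plugin code; `with_default_title_sub` is a hypothetical helper, and the '^Draft\s+' pattern is a made-up user value.

```perl
use strict;
use warnings;

# Default pattern quoted in the PDFPlug.pm comment.
my $DEFAULT_TITLE_SUB = '^(Page\s+\d+)?(\s*1\s+)?';

# Hypothetical helper (not part of Greenstone): return the argument
# list unchanged if it already contains -title_sub, otherwise return
# it with the default -title_sub appended.
sub with_default_title_sub {
    my @args = @_;
    return @args if grep { defined $_ && $_ eq '-title_sub' } @args;
    return (@args, '-title_sub', $DEFAULT_TITLE_SUB);
}

# No title_sub supplied: the default is appended.
my @plain = with_default_title_sub('-use_sections');

# User-supplied title_sub: the list is passed through untouched.
my @custom = with_default_title_sub('-use_sections', '-title_sub', '^Draft\s+');

print join(' ', @plain),  "\n";
print join(' ', @custom), "\n";
```

The resulting list would then be handed to the ConvertToPlug constructor once, so the object is built a single time regardless of whether sections are in use.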

I thought I had it right, but attempts to rebuild a small collection seem to have killed my ability to generate sections at all (my test documents included the Greenstone Developer's Guide).

Any help/suggestions appreciated

regards,

Stephen

PS

Here is my import log.
I have no idea how the leading s turned up.
-----
s
Command: C:Program FilesgsdlbinwindowsperlbinPerl.exe -S C:Program Filesgsdlbinscriptimport.pl -gli -language en -collectdir C:Program Filesgsdlcollect -OIDtype incremental -verbosity 3 -removeold testpref
import.pl> Windows does not support pdf to text. PDFs will be converted to HTML instead
import.pl> Windows does not support pdf to text. PDFs will be converted to HTML instead
import.pl> Removing current contents of the archives directory...
import.pl> util::rm_r couldn't remove directory C:/Program Files/gsdl/collect/testpref/archives/D3.dir
import.pl> util::rm_r couldn't remove directory C:/Program Files/gsdl/collect/testpref/archives
import.pl> RecPlug: getting directory C:Program Filesgsdlcollect estprefimport
import.pl> RecPlug: preparing metadata for 20040825_pt_remotehealth.shtml
import.pl> RecPlug recurring: 20040825_pt_remotehealth.shtml
import.pl> HTMLPlug: processing 20040825_pt_remotehealth.shtml
import.pl>  extracted Title metadata "Northern Territory Government of Australia - Media Releases - HOWARD TOLD &ndash; NO NUCLEAR DUMP FOR TERRITORY" from <title> tags
import.pl> Title sub expression "" applied to <title> tags
import.pl>  extracted Title metadata "Northern Territory Government of Australia - Media Releases - HOWARD TOLD &ndash; NO NUCLEAR DUMP FOR TERRITORY" from <title> tags
import.pl> docsave::process couldn't copy the associated file C:Program Filesgsdlcollect estprefimport tgimages opspacer.gif to topspacer.gif
import.pl> docsave::process couldn't copy the associated file C:Program Filesgsdlcollect estprefimport tgimages itle5.gif to title5.gif
import.pl> docsave::process couldn't copy the associated file C:Program Filesgsdlcollect estprefimportimagesbannershp03.jpg to 03.jpg
import.pl> docsave::process couldn't copy the associated file C:Program Filesgsdlcollect estprefimportimagespeter_toyne.jpg to peter_toyne.jpg
import.pl> docsave::process couldn't copy the associated file C:Program Filesgsdlcollect estprefimport tgimages tglogo.gif to ntglogo.gif
import.pl> RecPlug: preparing metadata for 20040825_ss_alcoholframework.shtml
import.pl> RecPlug recurring: 20040825_ss_alcoholframework.shtml
import.pl> HTMLPlug: processing 20040825_ss_alcoholframework.shtml
import.pl>  extracted Title metadata "Northern Territory Government of Australia - Media Releases - HOWARD TOLD &ndash; NO NUCLEAR DUMP FOR TERRITORY" from <title> tags
import.pl> Title sub expression "" applied to <title> tags
import.pl>  extracted Title metadata "Northern Territory Government of Australia - Media Releases - HOWARD TOLD &ndash; NO NUCLEAR DUMP FOR TERRITORY" from <title> tags
import.pl> docsave::process couldn't copy the associated file C:Program Filesgsdlcollect estprefimport tgimages opspacer.gif to topspacer.gif
import.pl> docsave::process couldn't copy the associated file C:Program Filesgsdlcollect estprefimport tgimages itle5.gif to title5.gif
import.pl> docsave::process couldn't copy the associated file C:Program Filesgsdlcollect estprefimportimagesbannershp03.jpg to 03.jpg
import.pl> docsave::process couldn't copy the associated file C:Program Filesgsdlcollect estprefimportimagessyd_stirling.jpg to syd_stirling.jpg
import.pl> docsave::process couldn't copy the associated file C:Program Filesgsdlcollect estprefimport tgimages tglogo.gif to ntglogo.gif
import.pl> RecPlug: preparing metadata for 20040826_cm_communitycabinet.shtml
import.pl> RecPlug recurring: 20040826_cm_communitycabinet.shtml
import.pl> HTMLPlug: processing 20040826_cm_communitycabinet.shtml
import.pl>  extracted Title metadata "Northern Territory Government of Australia - Media Releases - HOWARD TOLD &ndash; NO NUCLEAR DUMP FOR TERRITORY" from <title> tags
import.pl> Title sub expression "" applied to <title> tags
import.pl>  extracted Title metadata "Northern Territory Government of Australia - Media Releases - HOWARD TOLD &ndash; NO NUCLEAR DUMP FOR TERRITORY" from <title> tags
import.pl> docsave::process couldn't copy the associated file C:Program Filesgsdlcollect estprefimport tgimages opspacer.gif to topspacer.gif
import.pl> docsave::process couldn't copy the associated file C:Program Filesgsdlcollect estprefimport tgimages itle5.gif to title5.gif
import.pl> docsave::process couldn't copy the associated file C:Program Filesgsdlcollect estprefimportimagesbannershp03.jpg to 03.jpg
import.pl> docsave::process couldn't copy the associated file C:Program Filesgsdlcollect estprefimportimagesministersclare_martin.jpg to clare_martin.jpg
import.pl> docsave::process couldn't copy the associated file C:Program Filesgsdlcollect estprefimport tgimages tglogo.gif to ntglogo.gif
import.pl> RecPlug: preparing metadata for Develop-en.pdf
import.pl> RecPlug recurring: Develop-en.pdf
import.pl> Converting Develop-en.pdf to HTML format
import.pl> PDFPlug: Calculating sections...
import.pl> PDFPlug: warning - no sections found
import.pl> PDFPlug: passing Develop-en.pdf on to HTMLPlug
import.pl> HTMLPlug: processing Develop-en.html
import.pl> HTMLPlug: WARNING: Develop-en.html contains the following text outside
import.pl>           of the final closing </Section> tag. This text will
import.pl>           be ignored. (". Font and spacing is ignor...)
import.pl> RecPlug: preparing metadata for martin.0201.100 teachers.pdf
import.pl> RecPlug recurring: martin.0201.100 teachers.pdf
import.pl> Converting martin.0201.100teachers.pdf to HTML format
import.pl> PDFPlug: Calculating sections...
import.pl> PDFPlug: warning - no sections found
import.pl> PDFPlug: passing martin.0201.100 teachers.pdf on to HTMLPlug
import.pl> HTMLPlug: processing martin.0201.100teachers.html
import.pl> HTMLPlug: WARNING: martin.0201.100teachers.html appears to contain no Section tags so
import.pl>           will be processed as a single section document
import.pl> Use of uninitialized value in pattern match (m//) at C:Program Filesgsdl/perllib/plugins/HTMLPlug.pm line 624.
import.pl> Use of uninitialized value in pattern match (m//) at C:Program Filesgsdl/perllib/plugins/HTMLPlug.pm line 626.
import.pl> Use of uninitialized value in pattern match (m//) at C:Program Filesgsdl/perllib/plugins/HTMLPlug.pm line 686.
import.pl>  extracted Title metadata "Clare Martin CHIEF MINISTER ACTING MINISTER FOR EMPLOYMENT EDUCATION AND..." from first 100 chars
import.pl> Title sub expression "^(Page\s+\d+)?(\s*1\s+)?" applied to first 100 chars
import.pl>  extracted Title metadata "Clare Martin CHIEF MINISTER ACTING MINISTER FOR EMPLOYMENT EDUCATION AND..." from first 100 chars
import.pl> RecPlug: preparing metadata for martin.0401.ParksLatest.pdf
import.pl> RecPlug recurring: martin.0401.ParksLatest.pdf
import.pl> Converting martin.0401.ParksLatest.pdf to HTML format
import.pl> PDFPlug: Calculating sections...
import.pl> PDFPlug: warning - no sections found
import.pl> PDFPlug: passing martin.0401.ParksLatest.pdf on to HTMLPlug
import.pl> HTMLPlug: processing martin.0401.ParksLatest.html
import.pl> HTMLPlug: WARNING: martin.0401.ParksLatest.html appears to contain no Section tags so
import.pl>           will be processed as a single section document
import.pl> Use of uninitialized value in pattern match (m//) at C:Program Filesgsdl/perllib/plugins/HTMLPlug.pm line 624.
import.pl> Use of uninitialized value in pattern match (m//) at C:Program Filesgsdl/perllib/plugins/HTMLPlug.pm line 626.
import.pl> Use of uninitialized value in pattern match (m//) at C:Program Filesgsdl/perllib/plugins/HTMLPlug.pm line 686.
import.pl>  extracted Title metadata "Clare Martin CHIEF MINISTER MINISTER FOR INDIGENOUS AFFAIRS 4 January..." from first 100 chars
import.pl> Title sub expression "^(Page\s+\d+)?(\s*1\s+)?" applied to first 100 chars
import.pl>  extracted Title metadata "Clare Martin CHIEF MINISTER MINISTER FOR INDIGENOUS AFFAIRS 4 January..." from first 100 chars
import.pl> RecPlug: preparing metadata for Martin.0701.DestinationKatherine&TennantCreek.pdf
import.pl> RecPlug recurring: Martin.0701.DestinationKatherine&TennantCreek.pdf
import.pl> Converting Martin.0701.DestinationKatherine&TennantCreek.pdf to HTML format
import.pl> PDFPlug: Calculating sections...
import.pl> PDFPlug: warning - no sections found
import.pl> PDFPlug: passing Martin.0701.DestinationKatherine&TennantCreek.pdf on to HTMLPlug
import.pl> HTMLPlug: processing Martin.0701.DestinationKatherine&TennantCreek.html
import.pl> HTMLPlug: WARNING: Martin.0701.DestinationKatherine&TennantCreek.html appears to contain no Section tags so
import.pl>           will be processed as a single section document
import.pl> Use of uninitialized value in pattern match (m//) at C:Program Filesgsdl/perllib/plugins/HTMLPlug.pm line 624.
import.pl> Use of uninitialized value in pattern match (m//) at C:Program Filesgsdl/perllib/plugins/HTMLPlug.pm line 626.
import.pl> Use of uninitialized value in pattern match (m//) at C:Program Filesgsdl/perllib/plugins/HTMLPlug.pm line 686.
import.pl>  extracted Title metadata "Clare Martin CHIEF MINISTER MINISTER FOR TOURISM 7 January 2005 ..." from first 100 chars
import.pl> Title sub expression "^(Page\s+\d+)?(\s*1\s+)?" applied to first 100 chars
import.pl>  extracted Title metadata "Clare Martin CHIEF MINISTER MINISTER FOR TOURISM 7 January 2005 ..." from first 100 chars
import.pl> RecPlug: preparing metadata for test.pdf
import.pl> RecPlug recurring: test.pdf
import.pl> Converting test.pdf to HTML format
import.pl> BasPlug: WARNING: language could not be extracted from C:Program Filesgsdlcollect estpref mp est.html - defaulting to en
import.pl> PDFPlug: Calculating sections...
import.pl> PDFPlug: warning - no sections found
import.pl> PDFPlug: passing test.pdf on to HTMLPlug
import.pl> HTMLPlug: processing test.html
import.pl> HTMLPlug: WARNING: test.html appears to contain no Section tags so
import.pl>           will be processed as a single section document
import.pl> Use of uninitialized value in pattern match (m//) at C:Program Filesgsdl/perllib/plugins/HTMLPlug.pm line 624.
import.pl> Use of uninitialized value in pattern match (m//) at C:Program Filesgsdl/perllib/plugins/HTMLPlug.pm line 626.
import.pl> Use of uninitialized value in pattern match (m//) at C:Program Filesgsdl/perllib/plugins/HTMLPlug.pm line 686.
import.pl>  extracted Title metadata "Gfsdgsdg sdgf sdGfsdgsdg sdgf sdGfsdgsdg sdgf sdGfsdgsdg sdgf sdGfsdgsdg sdgf sdGfsdgsdg sdgf..." from first 100 chars
import.pl> Title sub expression "^(Page\s+\d+)?(\s*1\s+)?" applied to first 100 chars
import.pl>  extracted Title metadata "Gfsdgsdg sdgf sdGfsdgsdg sdgf sdGfsdgsdg sdgf sdGfsdgsdg sdgf sdGfsdgsdg sdgf sdGfsdgsdg sdgf..." from first 100 chars
import.pl> *********************************************
import.pl> Import complete
import.pl> *********************************************
import.pl> * 8 documents were considered for processing
import.pl> * 8 were processed and included in the collection
import.pl> Command complete.
import.pl> Extracting new metadata from archive files.
import.pl> Archived metadata extraction complete.
Command: C:Program FilesgsdlbinwindowsperlbinPerl.exe -S C:Program Filesgsdlbinscriptbuildcol.pl -gli -language en -collectdir C:Program Filesgsdlcollect -verbosity 3 testpref
buildcol.pl> doclevel = document
buildcol.pl> Windows does not support pdf to text. PDFs will be converted to HTML instead
buildcol.pl> Windows does not support pdf to text. PDFs will be converted to HTML instead
buildcol.pl> *** creating the compressed text
buildcol.pl>     collecting text statistics (mgpp_passes -T1)
buildcol.pl> ArcPlug: processing C:Program Filesgsdlcollect estprefarchivesarchives.inf
buildcol.pl> GAPlug: processing D0.dirdoc.xml
buildcol.pl> GAPlug: processing D1.dirdoc.xml
buildcol.pl> GAPlug: processing D2.dirdoc.xml
buildcol.pl> GAPlug: processing D3.dirdoc.xml
buildcol.pl> GAPlug: processing D4.dirdoc.xml
buildcol.pl> GAPlug: processing D5.dirdoc.xml
buildcol.pl> GAPlug: processing D6.dirdoc.xml
buildcol.pl> GAPlug: processing D7.dirdoc.xml
buildcol.pl> Stats (Compressing text from text)
buildcol.pl> Total bytes in collection: 138163
buildcol.pl> Total bytes in text: 138163
buildcol.pl>     creating the compression dictionary
buildcol.pl>     compressing the text (mgpp_passes -T2)
buildcol.pl> ArcPlug: processing C:Program Filesgsdlcollect estprefarchivesarchives.inf
buildcol.pl> GAPlug: processing D0.dirdoc.xml
buildcol.pl> GAPlug: processing D1.dirdoc.xml
buildcol.pl> GAPlug: processing D2.dirdoc.xml
buildcol.pl> GAPlug: processing D3.dirdoc.xml
buildcol.pl> GAPlug: processing D4.dirdoc.xml
buildcol.pl> GAPlug: processing D5.dirdoc.xml
buildcol.pl> GAPlug: processing D6.dirdoc.xml
buildcol.pl> GAPlug: processing D7.dirdoc.xml
buildcol.pl> Stats (Compressing text from text)
buildcol.pl> Total bytes in collection: 138163
buildcol.pl> Total bytes in text: 138163
buildcol.pl> *** building index text,Title,Source in subdirectory idx
buildcol.pl>     creating index dictionary (mgpp_passes -I1)
buildcol.pl> ArcPlug: processing C:Program Filesgsdlcollect estprefarchivesarchives.inf
buildcol.pl> GAPlug: processing D0.dirdoc.xml
buildcol.pl> GAPlug: processing D1.dirdoc.xml
buildcol.pl> GAPlug: processing D2.dirdoc.xml
buildcol.pl> GAPlug: processing D3.dirdoc.xml
buildcol.pl> GAPlug: processing D4.dirdoc.xml
buildcol.pl> GAPlug: processing D5.dirdoc.xml
buildcol.pl> GAPlug: processing D6.dirdoc.xml
buildcol.pl> GAPlug: processing D7.dirdoc.xml
buildcol.pl> Stats (Creating index text,Title,Source)
buildcol.pl> Total bytes in collection: 138163
buildcol.pl> Total bytes in text,Title,Source: 114750
buildcol.pl>     inverting the text (mgpp_passes -I2)
buildcol.pl> ArcPlug: processing C:Program Filesgsdlcollect estprefarchivesarchives.inf
buildcol.pl> GAPlug: processing D0.dirdoc.xml
buildcol.pl> GAPlug: processing D1.dirdoc.xml
buildcol.pl> GAPlug: processing D2.dirdoc.xml
buildcol.pl> GAPlug: processing D3.dirdoc.xml
buildcol.pl> GAPlug: processing D4.dirdoc.xml
buildcol.pl> GAPlug: processing D5.dirdoc.xml
buildcol.pl> GAPlug: processing D6.dirdoc.xml
buildcol.pl> GAPlug: processing D7.dirdoc.xml
buildcol.pl> Stats (Creating index text,Title,Source)
buildcol.pl> Total bytes in collection: 138163
buildcol.pl> Total bytes in text,Title,Source: 114750
buildcol.pl>     create the weights file
buildcol.pl>     creating 'on-disk' stemmed dictionary
buildcol.pl>     creating stem indexes
buildcol.pl> deleting testpref.ic
buildcol.pl> deleting testpref.ict
buildcol.pl> deleting testpref.id
buildcol.pl> deleting testpref.idh
buildcol.pl> deleting testpref.ii
buildcol.pl> deleting testpref.invf.state.1608
buildcol.pl> *** creating the info database and processing associated files
buildcol.pl> ArcPlug: processing C:Program Filesgsdlcollect estprefarchivesarchives.inf
buildcol.pl> GAPlug: processing D0.dirdoc.xml
buildcol.pl> GAPlug: processing D1.dirdoc.xml
buildcol.pl> GAPlug: processing D2.dirdoc.xml
buildcol.pl> GAPlug: processing D3.dirdoc.xml
buildcol.pl> GAPlug: processing D4.dirdoc.xml
buildcol.pl> GAPlug: processing D5.dirdoc.xml
buildcol.pl> GAPlug: processing D6.dirdoc.xml
buildcol.pl> GAPlug: processing D7.dirdoc.xml
buildcol.pl> *** creating auxiliary files
buildcol.pl> Command complete.
----

Develop-en.pdf and test.pdf are multi-page PDF documents;
the other documents are HTML or single-page PDFs.

Sorry about the subject line. I couldn't resist.

s.