From | Stephen.DeGabrielle@nt.gov.au |
Date | Wed, 12 Jan 2005 17:46:37 +0930 |
Subject | [greenstone-devel] PDFPlug.pm plugin page/prefix processing perl problems |
Hi,

Have a look at the following snippet (it starts somewhere near line 100 of PDFPlug.pm):

####
# following title_sub removes "Page 1" added by pdftohtml, and a leading
# "1", which is often the page number at the top of the page. Bad Luck
# if your document title actually starts with "1 " - is there a better way?
#my $self = new ConvertToPlug ($class, @args, "-title_sub", '^(Page\s+\d+)?(\s*1\s+)?');
my $self = new ConvertToPlug ($class, @args);
$self->{'plugin_type'} = "PDFPlug";
if ($use_sections) {
    $self = new ConvertToPlug ($class, @args, "-title_sub", '^(Page\s+\d+)?(\s*1\s+)?');
    $self->{'use_sections'}=1;
}
####

It is part of 'sub new' in PDFPlug.pm, and is my attempt to fix PDFPlug so that it doesn't override a title_sub specified in the plugin's arguments with the '^(Page\s+\d+)?(\s*1\s+)?' argument required to remove the "Page 1" added by pdftohtml (as noted in the comments).

I thought I got it right, but attempts to rebuild with a small collection seem to have killed my ability to generate sections at all (my test documents included the Greenstone Developer's Guide).

Any help/suggestions appreciated.

regards,
Stephen

PS: here is my import log; I have no idea how the leading s turned up.

-----
s Command: C:Program FilesgsdlbinwindowsperlbinPerl.exe -S C:Program Filesgsdlbinscriptimport.pl -gli -language en -collectdir C:Program Filesgsdlcollect -OIDtype incremental -verbosity 3 -removeold testpref
import.pl> Windows does not support pdf to text. PDFs will be converted to HTML instead
import.pl> Windows does not support pdf to text. PDFs will be converted to HTML instead
import.pl> Removing current contents of the archives directory...
import.pl> util::rm_r couldn't remove directory C:/Program Files/gsdl/collect/testpref/archives/D3.dir
import.pl> util::rm_r couldn't remove directory C:/Program Files/gsdl/collect/testpref/archives
import.pl> RecPlug: getting directory C:Program Filesgsdlcollect estprefimport
import.pl> RecPlug: preparing metadata for 20040825_pt_remotehealth.shtml
import.pl> RecPlug recurring: 20040825_pt_remotehealth.shtml
import.pl> HTMLPlug: processing 20040825_pt_remotehealth.shtml
import.pl> extracted Title metadata "Northern Territory Government of Australia - Media Releases - HOWARD TOLD – NO NUCLEAR DUMP FOR TERRITORY" from <title> tags
import.pl> Title sub expression "" applied to <title> tags
import.pl> extracted Title metadata "Northern Territory Government of Australia - Media Releases - HOWARD TOLD – NO NUCLEAR DUMP FOR TERRITORY" from <title> tags
import.pl> docsave::process couldn't copy the associated file C:Program Filesgsdlcollect estprefimport tgimages opspacer.gif to topspacer.gif
import.pl> docsave::process couldn't copy the associated file C:Program Filesgsdlcollect estprefimport tgimages itle5.gif to title5.gif
import.pl> docsave::process couldn't copy the associated file C:Program Filesgsdlcollect estprefimportimagesbannershp03.jpg to 03.jpg
import.pl> docsave::process couldn't copy the associated file C:Program Filesgsdlcollect estprefimportimagespeter_toyne.jpg to peter_toyne.jpg
import.pl> docsave::process couldn't copy the associated file C:Program Filesgsdlcollect estprefimport tgimages tglogo.gif to ntglogo.gif
import.pl> RecPlug: preparing metadata for 20040825_ss_alcoholframework.shtml
import.pl> RecPlug recurring: 20040825_ss_alcoholframework.shtml
import.pl> HTMLPlug: processing 20040825_ss_alcoholframework.shtml
import.pl> extracted Title metadata "Northern Territory Government of Australia - Media Releases - HOWARD TOLD – NO NUCLEAR DUMP FOR TERRITORY" from <title> tags
import.pl> Title sub expression "" applied to <title> tags
import.pl> extracted Title metadata "Northern Territory Government of Australia - Media Releases - HOWARD TOLD – NO NUCLEAR DUMP FOR TERRITORY" from <title> tags
import.pl> docsave::process couldn't copy the associated file C:Program Filesgsdlcollect estprefimport tgimages opspacer.gif to topspacer.gif
import.pl> docsave::process couldn't copy the associated file C:Program Filesgsdlcollect estprefimport tgimages itle5.gif to title5.gif
import.pl> docsave::process couldn't copy the associated file C:Program Filesgsdlcollect estprefimportimagesbannershp03.jpg to 03.jpg
import.pl> docsave::process couldn't copy the associated file C:Program Filesgsdlcollect estprefimportimagessyd_stirling.jpg to syd_stirling.jpg
import.pl> docsave::process couldn't copy the associated file C:Program Filesgsdlcollect estprefimport tgimages tglogo.gif to ntglogo.gif
import.pl> RecPlug: preparing metadata for 20040826_cm_communitycabinet.shtml
import.pl> RecPlug recurring: 20040826_cm_communitycabinet.shtml
import.pl> HTMLPlug: processing 20040826_cm_communitycabinet.shtml
import.pl> extracted Title metadata "Northern Territory Government of Australia - Media Releases - HOWARD TOLD – NO NUCLEAR DUMP FOR TERRITORY" from <title> tags
import.pl> Title sub expression "" applied to <title> tags
import.pl> extracted Title metadata "Northern Territory Government of Australia - Media Releases - HOWARD TOLD – NO NUCLEAR DUMP FOR TERRITORY" from <title> tags
import.pl> docsave::process couldn't copy the associated file C:Program Filesgsdlcollect estprefimport tgimages opspacer.gif to topspacer.gif
import.pl> docsave::process couldn't copy the associated file C:Program Filesgsdlcollect estprefimport tgimages itle5.gif to title5.gif
import.pl> docsave::process couldn't copy the associated file C:Program Filesgsdlcollect estprefimportimagesbannershp03.jpg to 03.jpg
import.pl> docsave::process couldn't copy the associated file C:Program Filesgsdlcollect estprefimportimagesministersclare_martin.jpg to clare_martin.jpg
import.pl> docsave::process couldn't copy the associated file C:Program Filesgsdlcollect estprefimport tgimages tglogo.gif to ntglogo.gif
import.pl> RecPlug: preparing metadata for Develop-en.pdf
import.pl> RecPlug recurring: Develop-en.pdf
import.pl> Converting Develop-en.pdf to HTML format
import.pl> PDFPlug: Calculating sections...
import.pl> PDFPlug: warning - no sections found
import.pl> PDFPlug: passing Develop-en.pdf on to HTMLPlug
import.pl> HTMLPlug: processing Develop-en.html
import.pl> HTMLPlug: WARNING: Develop-en.html contains the following text outside
import.pl> of the final closing </Section> tag. This text will
import.pl> be ignored. (". Font and spacing is ignor...)
import.pl> RecPlug: preparing metadata for martin.0201.100 teachers.pdf
import.pl> RecPlug recurring: martin.0201.100 teachers.pdf
import.pl> Converting martin.0201.100teachers.pdf to HTML format
import.pl> PDFPlug: Calculating sections...
import.pl> PDFPlug: warning - no sections found
import.pl> PDFPlug: passing martin.0201.100 teachers.pdf on to HTMLPlug
import.pl> HTMLPlug: processing martin.0201.100teachers.html
import.pl> HTMLPlug: WARNING: martin.0201.100teachers.html appears to contain no Section tags so
import.pl> will be processed as a single section document
import.pl> Use of uninitialized value in pattern match (m//) at C:Program Filesgsdl/perllib/plugins/HTMLPlug.pm line 624.
import.pl> Use of uninitialized value in pattern match (m//) at C:Program Filesgsdl/perllib/plugins/HTMLPlug.pm line 626.
import.pl> Use of uninitialized value in pattern match (m//) at C:Program Filesgsdl/perllib/plugins/HTMLPlug.pm line 686.
import.pl> extracted Title metadata "Clare Martin CHIEF MINISTER ACTING MINISTER FOR EMPLOYMENT EDUCATION AND..." from first 100 chars
import.pl> Title sub expression "^(Page\s+\d+)?(\s*1\s+)?" applied to first 100 chars
import.pl> extracted Title metadata "Clare Martin CHIEF MINISTER ACTING MINISTER FOR EMPLOYMENT EDUCATION AND..." from first 100 chars
import.pl> RecPlug: preparing metadata for martin.0401.ParksLatest.pdf
import.pl> RecPlug recurring: martin.0401.ParksLatest.pdf
import.pl> Converting martin.0401.ParksLatest.pdf to HTML format
import.pl> PDFPlug: Calculating sections...
import.pl> PDFPlug: warning - no sections found
import.pl> PDFPlug: passing martin.0401.ParksLatest.pdf on to HTMLPlug
import.pl> HTMLPlug: processing martin.0401.ParksLatest.html
import.pl> HTMLPlug: WARNING: martin.0401.ParksLatest.html appears to contain no Section tags so
import.pl> will be processed as a single section document
import.pl> Use of uninitialized value in pattern match (m//) at C:Program Filesgsdl/perllib/plugins/HTMLPlug.pm line 624.
import.pl> Use of uninitialized value in pattern match (m//) at C:Program Filesgsdl/perllib/plugins/HTMLPlug.pm line 626.
import.pl> Use of uninitialized value in pattern match (m//) at C:Program Filesgsdl/perllib/plugins/HTMLPlug.pm line 686.
import.pl> extracted Title metadata "Clare Martin CHIEF MINISTER MINISTER FOR INDIGENOUS AFFAIRS 4 January..." from first 100 chars
import.pl> Title sub expression "^(Page\s+\d+)?(\s*1\s+)?" applied to first 100 chars
import.pl> extracted Title metadata "Clare Martin CHIEF MINISTER MINISTER FOR INDIGENOUS AFFAIRS 4 January..." from first 100 chars
import.pl> RecPlug: preparing metadata for Martin.0701.DestinationKatherine&TennantCreek.pdf
import.pl> RecPlug recurring: Martin.0701.DestinationKatherine&TennantCreek.pdf
import.pl> Converting Martin.0701.DestinationKatherine&TennantCreek.pdf to HTML format
import.pl> PDFPlug: Calculating sections...
import.pl> PDFPlug: warning - no sections found
import.pl> PDFPlug: passing Martin.0701.DestinationKatherine&TennantCreek.pdf on to HTMLPlug
import.pl> HTMLPlug: processing Martin.0701.DestinationKatherine&TennantCreek.html
import.pl> HTMLPlug: WARNING: Martin.0701.DestinationKatherine&TennantCreek.html appears to contain no Section tags so
import.pl> will be processed as a single section document
import.pl> Use of uninitialized value in pattern match (m//) at C:Program Filesgsdl/perllib/plugins/HTMLPlug.pm line 624.
import.pl> Use of uninitialized value in pattern match (m//) at C:Program Filesgsdl/perllib/plugins/HTMLPlug.pm line 626.
import.pl> Use of uninitialized value in pattern match (m//) at C:Program Filesgsdl/perllib/plugins/HTMLPlug.pm line 686.
import.pl> extracted Title metadata "Clare Martin CHIEF MINISTER MINISTER FOR TOURISM 7 January 2005 ..." from first 100 chars
import.pl> Title sub expression "^(Page\s+\d+)?(\s*1\s+)?" applied to first 100 chars
import.pl> extracted Title metadata "Clare Martin CHIEF MINISTER MINISTER FOR TOURISM 7 January 2005 ..." from first 100 chars
import.pl> RecPlug: preparing metadata for test.pdf
import.pl> RecPlug recurring: test.pdf
import.pl> Converting test.pdf to HTML format
import.pl> BasPlug: WARNING: language could not be extracted from C:Program Filesgsdlcollect estpref mp est.html - defaulting to en
import.pl> PDFPlug: Calculating sections...
import.pl> PDFPlug: warning - no sections found
import.pl> PDFPlug: passing test.pdf on to HTMLPlug
import.pl> HTMLPlug: processing test.html
import.pl> HTMLPlug: WARNING: test.html appears to contain no Section tags so
import.pl> will be processed as a single section document
import.pl> Use of uninitialized value in pattern match (m//) at C:Program Filesgsdl/perllib/plugins/HTMLPlug.pm line 624.
import.pl> Use of uninitialized value in pattern match (m//) at C:Program Filesgsdl/perllib/plugins/HTMLPlug.pm line 626.
import.pl> Use of uninitialized value in pattern match (m//) at C:Program Filesgsdl/perllib/plugins/HTMLPlug.pm line 686.
import.pl> extracted Title metadata "Gfsdgsdg sdgf sdGfsdgsdg sdgf sdGfsdgsdg sdgf sdGfsdgsdg sdgf sdGfsdgsdg sdgf sdGfsdgsdg sdgf..." from first 100 chars
import.pl> Title sub expression "^(Page\s+\d+)?(\s*1\s+)?" applied to first 100 chars
import.pl> extracted Title metadata "Gfsdgsdg sdgf sdGfsdgsdg sdgf sdGfsdgsdg sdgf sdGfsdgsdg sdgf sdGfsdgsdg sdgf sdGfsdgsdg sdgf..." from first 100 chars
import.pl> *********************************************
import.pl> Import complete
import.pl> *********************************************
import.pl> * 8 documents were considered for processing
import.pl> * 8 were processed and included in the collection
import.pl> Command complete.
import.pl> Extracting new metadata from archive files.
import.pl> Archived metadata extraction complete.
Command: C:Program FilesgsdlbinwindowsperlbinPerl.exe -S C:Program Filesgsdlbinscriptbuildcol.pl -gli -language en -collectdir C:Program Filesgsdlcollect -verbosity 3 testpref
buildcol.pl> doclevel = document
buildcol.pl> Windows does not support pdf to text. PDFs will be converted to HTML instead
buildcol.pl> Windows does not support pdf to text. PDFs will be converted to HTML instead
buildcol.pl> *** creating the compressed text
buildcol.pl> collecting text statistics (mgpp_passes -T1)
buildcol.pl> ArcPlug: processing C:Program Filesgsdlcollect estprefarchivesarchives.inf
buildcol.pl> GAPlug: processing D0.dirdoc.xml
buildcol.pl> GAPlug: processing D1.dirdoc.xml
buildcol.pl> GAPlug: processing D2.dirdoc.xml
buildcol.pl> GAPlug: processing D3.dirdoc.xml
buildcol.pl> GAPlug: processing D4.dirdoc.xml
buildcol.pl> GAPlug: processing D5.dirdoc.xml
buildcol.pl> GAPlug: processing D6.dirdoc.xml
buildcol.pl> GAPlug: processing D7.dirdoc.xml
buildcol.pl> Stats (Compressing text from text)
buildcol.pl> Total bytes in collection: 138163
buildcol.pl> Total bytes in text: 138163
buildcol.pl> creating the compression dictionary
buildcol.pl> compressing the text (mgpp_passes -T2)
buildcol.pl> ArcPlug: processing C:Program Filesgsdlcollect estprefarchivesarchives.inf
buildcol.pl> GAPlug: processing D0.dirdoc.xml
buildcol.pl> GAPlug: processing D1.dirdoc.xml
buildcol.pl> GAPlug: processing D2.dirdoc.xml
buildcol.pl> GAPlug: processing D3.dirdoc.xml
buildcol.pl> GAPlug: processing D4.dirdoc.xml
buildcol.pl> GAPlug: processing D5.dirdoc.xml
buildcol.pl> GAPlug: processing D6.dirdoc.xml
buildcol.pl> GAPlug: processing D7.dirdoc.xml
buildcol.pl> Stats (Compressing text from text)
buildcol.pl> Total bytes in collection: 138163
buildcol.pl> Total bytes in text: 138163
buildcol.pl> *** building index text,Title,Source in subdirectory idx
buildcol.pl> creating index dictionary (mgpp_passes -I1)
buildcol.pl> ArcPlug: processing C:Program Filesgsdlcollect estprefarchivesarchives.inf
buildcol.pl> GAPlug: processing D0.dirdoc.xml
buildcol.pl> GAPlug: processing D1.dirdoc.xml
buildcol.pl> GAPlug: processing D2.dirdoc.xml
buildcol.pl> GAPlug: processing D3.dirdoc.xml
buildcol.pl> GAPlug: processing D4.dirdoc.xml
buildcol.pl> GAPlug: processing D5.dirdoc.xml
buildcol.pl> GAPlug: processing D6.dirdoc.xml
buildcol.pl> GAPlug: processing D7.dirdoc.xml
buildcol.pl> Stats (Creating index text,Title,Source)
buildcol.pl> Total bytes in collection: 138163
buildcol.pl> Total bytes in text,Title,Source: 114750
buildcol.pl> inverting the text (mgpp_passes -I2)
buildcol.pl> ArcPlug: processing C:Program Filesgsdlcollect estprefarchivesarchives.inf
buildcol.pl> GAPlug: processing D0.dirdoc.xml
buildcol.pl> GAPlug: processing D1.dirdoc.xml
buildcol.pl> GAPlug: processing D2.dirdoc.xml
buildcol.pl> GAPlug: processing D3.dirdoc.xml
buildcol.pl> GAPlug: processing D4.dirdoc.xml
buildcol.pl> GAPlug: processing D5.dirdoc.xml
buildcol.pl> GAPlug: processing D6.dirdoc.xml
buildcol.pl> GAPlug: processing D7.dirdoc.xml
buildcol.pl> Stats (Creating index text,Title,Source)
buildcol.pl> Total bytes in collection: 138163
buildcol.pl> Total bytes in text,Title,Source: 114750
buildcol.pl> create the weights file
buildcol.pl> creating 'on-disk' stemmed dictionary
buildcol.pl> creating stem indexes
buildcol.pl> deleting testpref.ic
buildcol.pl> deleting testpref.ict
buildcol.pl> deleting testpref.id
buildcol.pl> deleting testpref.idh
buildcol.pl> deleting testpref.ii
buildcol.pl> deleting testpref.invf.state.1608
buildcol.pl> *** creating the info database and processing associated files
buildcol.pl> ArcPlug: processing C:Program Filesgsdlcollect estprefarchivesarchives.inf
buildcol.pl> GAPlug: processing D0.dirdoc.xml
buildcol.pl> GAPlug: processing D1.dirdoc.xml
buildcol.pl> GAPlug: processing D2.dirdoc.xml
buildcol.pl> GAPlug: processing D3.dirdoc.xml
buildcol.pl> GAPlug: processing D4.dirdoc.xml
buildcol.pl> GAPlug: processing D5.dirdoc.xml
buildcol.pl> GAPlug: processing D6.dirdoc.xml
buildcol.pl> GAPlug: processing D7.dirdoc.xml
buildcol.pl> *** creating auxiliary files
buildcol.pl> Command complete.
----

Develop-en.pdf and test.pdf are multipage PDF documents; the other documents are HTML or single-page PDFs.

Sorry about the subject line. I couldn't resist.

s.
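
One possible way to keep a user-supplied -title_sub, while still falling back to the default pattern otherwise, is sketched below in Perl. It is only an illustration under assumptions: with_default_title_sub and apply_title_sub are hypothetical helpers, not Greenstone functions, and the sketch assumes the plugin applies the pattern as a plain substitution on the extracted title.

```perl
use strict;
use warnings;

# Default pattern from the PDFPlug.pm comment: strips the "Page 1" that
# pdftohtml adds and a leading "1" page number.
my $DEFAULT_TITLE_SUB = '^(Page\s+\d+)?(\s*1\s+)?';

# Hypothetical helper (not part of Greenstone): append the default
# -title_sub only when the caller has not already supplied one.
sub with_default_title_sub {
    my @args = @_;
    return @args if grep { $_ eq '-title_sub' } @args;
    return (@args, '-title_sub', $DEFAULT_TITLE_SUB);
}

# Assumption: the pattern is applied as a simple s/// on the title text.
sub apply_title_sub {
    my ($title, $pattern) = @_;
    $title =~ s/$pattern//;
    return $title;
}

# No -title_sub given: the default is appended.
print join(' ', with_default_title_sub('-use_sections')), "\n";

# A -title_sub given by the user: the argument list is left untouched.
print join(' ', with_default_title_sub('-title_sub', '^Draft\s+')), "\n";

# Effect of the default pattern on a title as pdftohtml might produce it.
print apply_title_sub('Page 1 1 Annual Report', $DEFAULT_TITLE_SUB), "\n";
```

With an argument list built this way, 'sub new' would need only a single call to new ConvertToPlug, so fields such as 'plugin_type' that are set on $self are not discarded by a second construction. Whether that alone restores section generation is untested.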