Re: [greenstone-devel] PDFPlug.pm plugin page/prefix processing perl problems

From Michael Dewsnip
Date Thu, 20 Jan 2005 16:52:37 +1300
Subject Re: [greenstone-devel] PDFPlug.pm plugin page/prefix processing perl problems
In-Reply-To (OF67A379C0-8FA06079-ON69256F87-002A976B-69256F87-002D55E2-nt-gov-au)
Hi Stephen,

I think the problem might be that the "plugin_type" attribute isn't set
after you create the new ConvertToPlug object when $use_sections is
true. Try:

my $self = new ConvertToPlug ($class, @args);
$self->{'plugin_type'} = "PDFPlug";
if ($use_sections) {
    $self = new ConvertToPlug ($class, @args, "-title_sub",
                               '^(Page\s+\d+)?(\s*1\s+)?');
    # Otherwise plugin_type won't be set for the new ConvertToPlug object
    $self->{'plugin_type'} = "PDFPlug";
    $self->{'use_sections'} = 1;
}
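
A minimal, self-contained sketch of why the second assignment matters. It uses a hypothetical stand-in class (MockPlug) rather than the real ConvertToPlug, and the two make_plugin_* helpers are invented for illustration. Each call to new returns a fresh hash, so anything stored on the first object is gone once $self is reassigned. The second helper shows one possible alternative, assuming nothing else depends on the order of construction: build the argument list first and construct the object once.

```perl
use strict;
use warnings;

# Hypothetical stand-in for ConvertToPlug; the real constructor does far more.
package MockPlug;
sub new {
    my ($class, @args) = @_;
    my $self = { args => [@args] };   # a fresh hash on every call
    return bless $self, $class;
}

package main;

# Pattern from the thread: a second "new" replaces the first object.
sub make_plugin_twice {
    my ($use_sections, @args) = @_;
    my $self = MockPlug->new(@args);
    $self->{'plugin_type'} = "PDFPlug";
    if ($use_sections) {
        $self = MockPlug->new(@args, "-title_sub", '^(Page\s+\d+)?(\s*1\s+)?');
        # plugin_type is NOT carried over from the discarded object
        $self->{'use_sections'} = 1;
    }
    return $self;
}

# Alternative sketch: decide the extra arguments first, construct once.
sub make_plugin_once {
    my ($use_sections, @args) = @_;
    my @extra = $use_sections
        ? ("-title_sub", '^(Page\s+\d+)?(\s*1\s+)?')
        : ();
    my $self = MockPlug->new(@args, @extra);
    $self->{'plugin_type'}  = "PDFPlug";
    $self->{'use_sections'} = 1 if $use_sections;
    return $self;
}

my $lost = make_plugin_twice(1);
my $kept = make_plugin_once(1);
print defined $lost->{'plugin_type'} ? "set\n" : "unset\n";
print defined $kept->{'plugin_type'} ? "set\n" : "unset\n";
```

With $use_sections true, the first helper leaves plugin_type undefined and the second keeps it.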

Hope this fixes your problem.

All the best,

Michael

Stephen.DeGabrielle@nt.gov.au wrote:

>
> Hi,
>
> Have a look at the following snippet:
> (starts somewhere near line 100 of PDFPlug.pm)
> ####
> # following title_sub removes "Page 1" added by pdftohtml, and a leading
> # "1", which is often the page number at the top of the page. Bad Luck
> # if your document title actually starts with "1 " - is there a better way?
>
> #my $self = new ConvertToPlug ($class, @args, "-title_sub",
> #    '^(Page\s+\d+)?(\s*1\s+)?');
> my $self = new ConvertToPlug ($class, @args);
> $self->{'plugin_type'} = "PDFPlug";
> if ($use_sections) {
>     $self = new ConvertToPlug ($class, @args, "-title_sub",
>         '^(Page\s+\d+)?(\s*1\s+)?');
>     $self->{'use_sections'}=1;
> }
> ####
>
> It is part of 'sub new' in PDFPlug.pm, and is my attempt to fix
> PDFPlug so it doesn't override a title_sub specified in the arguments
> of PDFPlug.pm with the '^(Page\s+\d+)?(\s*1\s+)?' argument required to
> remove the "Page 1" added by pdftohtml (as noted in the comments).
>
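
To make the trade-off in that comment concrete, here is a small sketch of how such a pattern behaves on a few sample titles. It assumes the pattern is '^(Page\s+\d+)?(\s*1\s+)?' and that it is applied as a plain substitution at the start of the title; strip_title and the sample strings are invented for illustration and are not taken from PDFPlug.pm.

```perl
use strict;
use warnings;

# Assumed pattern: optional "Page N" prefix, then an optional leading "1".
my $title_sub = '^(Page\s+\d+)?(\s*1\s+)?';

# Hypothetical helper: apply the pattern as a substitution to a title string.
sub strip_title {
    my ($title) = @_;
    $title =~ s/$title_sub//;
    return $title;
}

for my $t ("Page 1 1 Greenstone Guide", "Page 12 Introduction", "1 Fish, 2 Fish") {
    printf "%-28s -> %s\n", "'$t'", "'" . strip_title($t) . "'";
}
```

The last sample is the "Bad Luck" case: a title that genuinely starts with "1 " loses its first word.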
> I thought I got it right - but attempts to rebuild with a small
> collection seem to have killed my ability to generate sections at all
> (my test document included the greenstone developers guide).
>
> Any help/suggestions appreciated
>
> regards,
>
> Stephen
>
> PS
>
> here is my import log;
> I have no idea how the leading s turned up.
> -----
> s
> Command: C:\Program Files\gsdl\bin\windows\perl\bin\Perl.exe -S
> C:\Program Files\gsdl\bin\script\import.pl -gli -language en
> -collectdir C:\Program Files\gsdl\collect -OIDtype incremental
> -verbosity 3 -removeold testpref
> import.pl> Windows does not support pdf to text. PDFs will be
> converted to HTML instead
> import.pl> Windows does not support pdf to text. PDFs will be
> converted to HTML instead
> import.pl> Removing current contents of the archives directory...
> import.pl> util::rm_r couldn't remove directory C:/Program
> Files/gsdl/collect/testpref/archives/D3.dir
> import.pl> util::rm_r couldn't remove directory C:/Program
> Files/gsdl/collect/testpref/archives
> import.pl> RecPlug: getting directory C:\Program
> Files\gsdl\collect\testpref\import
> import.pl> RecPlug: preparing metadata for 20040825_pt_remotehealth.shtml
> import.pl> RecPlug recurring: 20040825_pt_remotehealth.shtml
> import.pl> HTMLPlug: processing 20040825_pt_remotehealth.shtml
> import.pl> extracted Title metadata "Northern Territory Government of
> Australia - Media Releases - HOWARD TOLD – NO NUCLEAR DUMP FOR
> TERRITORY" from <title> tags
> import.pl> Title sub expression "" applied to <title> tags
> import.pl> extracted Title metadata "Northern Territory Government of
> Australia - Media Releases - HOWARD TOLD &ndash; NO NUCLEAR DUMP FOR
> TERRITORY" from <title> tags
> import.pl> docsave::process couldn't copy the associated file
> C:\Program Files\gsdl\collect\testpref\import\ntgimages\topspacer.gif
> to topspacer.gif
> import.pl> docsave::process couldn't copy the associated file
> C:\Program Files\gsdl\collect\testpref\import\ntgimages\title5.gif to
> title5.gif
> import.pl> docsave::process couldn't copy the associated file
> C:Program Filesgsdlcollect estprefimportimagesbannershp03.jpg
> to 03.jpg
> import.pl> docsave::process couldn't copy the associated file
> C:\Program Files\gsdl\collect\testpref\import\images\peter_toyne.jpg
> to peter_toyne.jpg
> import.pl> docsave::process couldn't copy the associated file
> C:\Program Files\gsdl\collect\testpref\import\ntgimages\ntglogo.gif
> to ntglogo.gif
> import.pl> RecPlug: preparing metadata for
> 20040825_ss_alcoholframework.shtml
> import.pl> RecPlug recurring: 20040825_ss_alcoholframework.shtml
> import.pl> HTMLPlug: processing 20040825_ss_alcoholframework.shtml
> import.pl> extracted Title metadata "Northern Territory Government of
> Australia - Media Releases - HOWARD TOLD &ndash; NO NUCLEAR DUMP FOR
> TERRITORY" from <title> tags
> import.pl> Title sub expression "" applied to <title> tags
> import.pl> extracted Title metadata "Northern Territory Government of
> Australia - Media Releases - HOWARD TOLD &ndash; NO NUCLEAR DUMP FOR
> TERRITORY" from <title> tags
> import.pl> docsave::process couldn't copy the associated file
> C:\Program Files\gsdl\collect\testpref\import\ntgimages\topspacer.gif
> to topspacer.gif
> import.pl> docsave::process couldn't copy the associated file
> C:\Program Files\gsdl\collect\testpref\import\ntgimages\title5.gif to
> title5.gif
> import.pl> docsave::process couldn't copy the associated file
> C:Program Filesgsdlcollect estprefimportimagesbannershp03.jpg
> to 03.jpg
> import.pl> docsave::process couldn't copy the associated file
> C:\Program Files\gsdl\collect\testpref\import\images\syd_stirling.jpg
> to syd_stirling.jpg
> import.pl> docsave::process couldn't copy the associated file
> C:\Program Files\gsdl\collect\testpref\import\ntgimages\ntglogo.gif
> to ntglogo.gif
> import.pl> RecPlug: preparing metadata for
> 20040826_cm_communitycabinet.shtml
> import.pl> RecPlug recurring: 20040826_cm_communitycabinet.shtml
> import.pl> HTMLPlug: processing 20040826_cm_communitycabinet.shtml
> import.pl> extracted Title metadata "Northern Territory Government of
> Australia - Media Releases - HOWARD TOLD &ndash; NO NUCLEAR DUMP FOR
> TERRITORY" from <title> tags
> import.pl> Title sub expression "" applied to <title> tags
> import.pl> extracted Title metadata "Northern Territory Government of
> Australia - Media Releases - HOWARD TOLD &ndash; NO NUCLEAR DUMP FOR
> TERRITORY" from <title> tags
> import.pl> docsave::process couldn't copy the associated file
> C:\Program Files\gsdl\collect\testpref\import\ntgimages\topspacer.gif
> to topspacer.gif
> import.pl> docsave::process couldn't copy the associated file
> C:\Program Files\gsdl\collect\testpref\import\ntgimages\title5.gif to
> title5.gif
> import.pl> docsave::process couldn't copy the associated file
> C:Program Filesgsdlcollect estprefimportimagesbannershp03.jpg
> to 03.jpg
> import.pl> docsave::process couldn't copy the associated file
> C:\Program
> Files\gsdl\collect\testpref\import\images\ministers\clare_martin.jpg
> to clare_martin.jpg
> import.pl> docsave::process couldn't copy the associated file
> C:\Program Files\gsdl\collect\testpref\import\ntgimages\ntglogo.gif
> to ntglogo.gif
> import.pl> RecPlug: preparing metadata for Develop-en.pdf
> import.pl> RecPlug recurring: Develop-en.pdf
> import.pl> Converting Develop-en.pdf to HTML format
> import.pl> PDFPlug: Calculating sections...
> import.pl> PDFPlug: warning - no sections found
> import.pl> PDFPlug: passing Develop-en.pdf on to HTMLPlug
> import.pl> HTMLPlug: processing Develop-en.html
> import.pl> HTMLPlug: WARNING: Develop-en.html contains the following
> text outside
> import.pl> of the final closing </Section> tag. This text will
> import.pl> be ignored. (". Font and spacing is ignor...)
> import.pl> RecPlug: preparing metadata for martin.0201.100 teachers.pdf
> import.pl> RecPlug recurring: martin.0201.100 teachers.pdf
> import.pl> Converting martin.0201.100teachers.pdf to HTML format
> import.pl> PDFPlug: Calculating sections...
> import.pl> PDFPlug: warning - no sections found
> import.pl> PDFPlug: passing martin.0201.100 teachers.pdf on to HTMLPlug
> import.pl> HTMLPlug: processing martin.0201.100teachers.html
> import.pl> HTMLPlug: WARNING: martin.0201.100teachers.html appears to
> contain no Section tags so
> import.pl> will be processed as a single section document
> import.pl> Use of uninitialized value in pattern match (m//) at
> C:\Program Files\gsdl/perllib/plugins/HTMLPlug.pm line 624.
> import.pl> Use of uninitialized value in pattern match (m//) at
> C:\Program Files\gsdl/perllib/plugins/HTMLPlug.pm line 626.
> import.pl> Use of uninitialized value in pattern match (m//) at
> C:\Program Files\gsdl/perllib/plugins/HTMLPlug.pm line 686.
> import.pl> extracted Title metadata "Clare Martin CHIEF MINISTER
> ACTING MINISTER FOR EMPLOYMENT EDUCATION AND..." from first 100 chars
> import.pl> Title sub expression "^(Page\s+\d+)?(\s*1\s+)?" applied to
> first 100 chars
> import.pl> extracted Title metadata "Clare Martin CHIEF MINISTER
> ACTING MINISTER FOR EMPLOYMENT EDUCATION AND..." from first 100 chars
> import.pl> RecPlug: preparing metadata for martin.0401.ParksLatest.pdf
> import.pl> RecPlug recurring: martin.0401.ParksLatest.pdf
> import.pl> Converting martin.0401.ParksLatest.pdf to HTML format
> import.pl> PDFPlug: Calculating sections...
> import.pl> PDFPlug: warning - no sections found
> import.pl> PDFPlug: passing martin.0401.ParksLatest.pdf on to HTMLPlug
> import.pl> HTMLPlug: processing martin.0401.ParksLatest.html
> import.pl> HTMLPlug: WARNING: martin.0401.ParksLatest.html appears to
> contain no Section tags so
> import.pl> will be processed as a single section document
> import.pl> Use of uninitialized value in pattern match (m//) at
> C:\Program Files\gsdl/perllib/plugins/HTMLPlug.pm line 624.
> import.pl> Use of uninitialized value in pattern match (m//) at
> C:\Program Files\gsdl/perllib/plugins/HTMLPlug.pm line 626.
> import.pl> Use of uninitialized value in pattern match (m//) at
> C:\Program Files\gsdl/perllib/plugins/HTMLPlug.pm line 686.
> import.pl> extracted Title metadata "Clare Martin CHIEF MINISTER
> MINISTER FOR INDIGENOUS AFFAIRS 4 January..." from first 100 chars
> import.pl> Title sub expression "^(Page\s+\d+)?(\s*1\s+)?" applied to
> first 100 chars
> import.pl> extracted Title metadata "Clare Martin CHIEF MINISTER
> MINISTER FOR INDIGENOUS AFFAIRS 4 January..." from first 100 chars
> import.pl> RecPlug: preparing metadata for
> Martin.0701.DestinationKatherine&TennantCreek.pdf
> import.pl> RecPlug recurring:
> Martin.0701.DestinationKatherine&TennantCreek.pdf
> import.pl> Converting
> Martin.0701.DestinationKatherine&TennantCreek.pdf to HTML format
> import.pl> PDFPlug: Calculating sections...
> import.pl> PDFPlug: warning - no sections found
> import.pl> PDFPlug: passing
> Martin.0701.DestinationKatherine&TennantCreek.pdf on to HTMLPlug
> import.pl> HTMLPlug: processing
> Martin.0701.DestinationKatherine&TennantCreek.html
> import.pl> HTMLPlug: WARNING:
> Martin.0701.DestinationKatherine&TennantCreek.html appears to contain
> no Section tags so
> import.pl> will be processed as a single section document
> import.pl> Use of uninitialized value in pattern match (m//) at
> C:\Program Files\gsdl/perllib/plugins/HTMLPlug.pm line 624.
> import.pl> Use of uninitialized value in pattern match (m//) at
> C:\Program Files\gsdl/perllib/plugins/HTMLPlug.pm line 626.
> import.pl> Use of uninitialized value in pattern match (m//) at
> C:\Program Files\gsdl/perllib/plugins/HTMLPlug.pm line 686.
> import.pl> extracted Title metadata "Clare Martin CHIEF MINISTER
> MINISTER FOR TOURISM 7 January 2005 ..." from first 100 chars
> import.pl> Title sub expression "^(Page\s+\d+)?(\s*1\s+)?" applied to
> first 100 chars
> import.pl> extracted Title metadata "Clare Martin CHIEF MINISTER
> MINISTER FOR TOURISM 7 January 2005 ..." from first 100 chars
> import.pl> RecPlug: preparing metadata for test.pdf
> import.pl> RecPlug recurring: test.pdf
> import.pl> Converting test.pdf to HTML format
> import.pl> BasPlug: WARNING: language could not be extracted from
> C:\Program Files\gsdl\collect\testpref\tmp\test.html - defaulting to en
> import.pl> PDFPlug: Calculating sections...
> import.pl> PDFPlug: warning - no sections found
> import.pl> PDFPlug: passing test.pdf on to HTMLPlug
> import.pl> HTMLPlug: processing test.html
> import.pl> HTMLPlug: WARNING: test.html appears to contain no Section
> tags so
> import.pl> will be processed as a single section document
> import.pl> Use of uninitialized value in pattern match (m//) at
> C:\Program Files\gsdl/perllib/plugins/HTMLPlug.pm line 624.
> import.pl> Use of uninitialized value in pattern match (m//) at
> C:\Program Files\gsdl/perllib/plugins/HTMLPlug.pm line 626.
> import.pl> Use of uninitialized value in pattern match (m//) at
> C:\Program Files\gsdl/perllib/plugins/HTMLPlug.pm line 686.
> import.pl> extracted Title metadata "Gfsdgsdg sdgf sdGfsdgsdg sdgf
> sdGfsdgsdg sdgf sdGfsdgsdg sdgf sdGfsdgsdg sdgf sdGfsdgsdg sdgf..."
> from first 100 chars
> import.pl> Title sub expression "^(Page\s+\d+)?(\s*1\s+)?" applied to
> first 100 chars
> import.pl> extracted Title metadata "Gfsdgsdg sdgf sdGfsdgsdg sdgf
> sdGfsdgsdg sdgf sdGfsdgsdg sdgf sdGfsdgsdg sdgf sdGfsdgsdg sdgf..."
> from first 100 chars
> import.pl> *********************************************
> import.pl> Import complete
> import.pl> *********************************************
> import.pl> * 8 documents were considered for processing
> import.pl> * 8 were processed and included in the collection
> import.pl> Command complete.
> import.pl> Extracting new metadata from archive files.
> import.pl> Archived metadata extraction complete.
> Command: C:\Program Files\gsdl\bin\windows\perl\bin\Perl.exe -S
> C:\Program Files\gsdl\bin\script\buildcol.pl -gli -language en
> -collectdir C:\Program Files\gsdl\collect -verbosity 3 testpref
> buildcol.pl> doclevel = document
> buildcol.pl> Windows does not support pdf to text. PDFs will be
> converted to HTML instead
> buildcol.pl> Windows does not support pdf to text. PDFs will be
> converted to HTML instead
> buildcol.pl> *** creating the compressed text
> buildcol.pl> collecting text statistics (mgpp_passes -T1)
> buildcol.pl> ArcPlug: processing C:\Program
> Files\gsdl\collect\testpref\archives\archives.inf
> buildcol.pl> GAPlug: processing D0.dir\doc.xml
> buildcol.pl> GAPlug: processing D1.dir\doc.xml
> buildcol.pl> GAPlug: processing D2.dir\doc.xml
> buildcol.pl> GAPlug: processing D3.dir\doc.xml
> buildcol.pl> GAPlug: processing D4.dir\doc.xml
> buildcol.pl> GAPlug: processing D5.dir\doc.xml
> buildcol.pl> GAPlug: processing D6.dir\doc.xml
> buildcol.pl> GAPlug: processing D7.dir\doc.xml
> buildcol.pl> Stats (Compressing text from text)
> buildcol.pl> Total bytes in collection: 138163
> buildcol.pl> Total bytes in text: 138163
> buildcol.pl> creating the compression dictionary
> buildcol.pl> compressing the text (mgpp_passes -T2)
> buildcol.pl> ArcPlug: processing C:\Program
> Files\gsdl\collect\testpref\archives\archives.inf
> buildcol.pl> GAPlug: processing D0.dir\doc.xml
> buildcol.pl> GAPlug: processing D1.dir\doc.xml
> buildcol.pl> GAPlug: processing D2.dir\doc.xml
> buildcol.pl> GAPlug: processing D3.dir\doc.xml
> buildcol.pl> GAPlug: processing D4.dir\doc.xml
> buildcol.pl> GAPlug: processing D5.dir\doc.xml
> buildcol.pl> GAPlug: processing D6.dir\doc.xml
> buildcol.pl> GAPlug: processing D7.dir\doc.xml
> buildcol.pl> Stats (Compressing text from text)
> buildcol.pl> Total bytes in collection: 138163
> buildcol.pl> Total bytes in text: 138163
> buildcol.pl> *** building index text,Title,Source in subdirectory idx
> buildcol.pl> creating index dictionary (mgpp_passes -I1)
> buildcol.pl> ArcPlug: processing C:\Program
> Files\gsdl\collect\testpref\archives\archives.inf
> buildcol.pl> GAPlug: processing D0.dir\doc.xml
> buildcol.pl> GAPlug: processing D1.dir\doc.xml
> buildcol.pl> GAPlug: processing D2.dir\doc.xml
> buildcol.pl> GAPlug: processing D3.dir\doc.xml
> buildcol.pl> GAPlug: processing D4.dir\doc.xml
> buildcol.pl> GAPlug: processing D5.dir\doc.xml
> buildcol.pl> GAPlug: processing D6.dir\doc.xml
> buildcol.pl> GAPlug: processing D7.dir\doc.xml
> buildcol.pl> Stats (Creating index text,Title,Source)
> buildcol.pl> Total bytes in collection: 138163
> buildcol.pl> Total bytes in text,Title,Source: 114750
> buildcol.pl> inverting the text (mgpp_passes -I2)
> buildcol.pl> ArcPlug: processing C:\Program
> Files\gsdl\collect\testpref\archives\archives.inf
> buildcol.pl> GAPlug: processing D0.dir\doc.xml
> buildcol.pl> GAPlug: processing D1.dir\doc.xml
> buildcol.pl> GAPlug: processing D2.dir\doc.xml
> buildcol.pl> GAPlug: processing D3.dir\doc.xml
> buildcol.pl> GAPlug: processing D4.dir\doc.xml
> buildcol.pl> GAPlug: processing D5.dir\doc.xml
> buildcol.pl> GAPlug: processing D6.dir\doc.xml
> buildcol.pl> GAPlug: processing D7.dir\doc.xml
> buildcol.pl> Stats (Creating index text,Title,Source)
> buildcol.pl> Total bytes in collection: 138163
> buildcol.pl> Total bytes in text,Title,Source: 114750
> buildcol.pl> create the weights file
> buildcol.pl> creating 'on-disk' stemmed dictionary
> buildcol.pl> creating stem indexes
> buildcol.pl> deleting testpref.ic
> buildcol.pl> deleting testpref.ict
> buildcol.pl> deleting testpref.id
> buildcol.pl> deleting testpref.idh
> buildcol.pl> deleting testpref.ii
> buildcol.pl> deleting testpref.invf.state.1608
> buildcol.pl> *** creating the info database and processing associated
> files
> buildcol.pl> ArcPlug: processing C:\Program
> Files\gsdl\collect\testpref\archives\archives.inf
> buildcol.pl> GAPlug: processing D0.dir\doc.xml
> buildcol.pl> GAPlug: processing D1.dir\doc.xml
> buildcol.pl> GAPlug: processing D2.dir\doc.xml
> buildcol.pl> GAPlug: processing D3.dir\doc.xml
> buildcol.pl> GAPlug: processing D4.dir\doc.xml
> buildcol.pl> GAPlug: processing D5.dir\doc.xml
> buildcol.pl> GAPlug: processing D6.dir\doc.xml
> buildcol.pl> GAPlug: processing D7.dir\doc.xml
> buildcol.pl> *** creating auxiliary files
> buildcol.pl> Command complete.
> ----
>
> Develop-en.pdf and test.pdf are multipage PDF documents;
> the other documents are HTML or single-page PDFs.
>
> Sorry about the subject line. I couldn't resist.
>
> s.
>