Re: [greenstone-users] Newbie seeking some guidance

From Michael Dewsnip
Date Thu, 31 Aug 2006 09:51:56 +1200
Subject Re: [greenstone-users] Newbie seeking some guidance
In-Reply-To (029c01c6cc5f$61dc8840$5c016c0a-hq-prl-ab-ca)
Hi Michael,

Glad to hear the new CSVPlug is useful. Thanks for reporting the problem
with quotes in the first line; I have just modified CSVPlug to handle
this case. You can download the updated version from
http://www.cs.waikato.ac.nz/~mdewsnip/greenstone/temp-2.70w/CSVPlug.pm
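
To illustrate the case being handled, here is a minimal Python sketch (not the
actual Perl code in CSVPlug) of reading a CSV file whose first line holds
quoted field names and turning each following line into one metadata record.
The function name read_csv_records and the sample rows are hypothetical; the
field names are borrowed from the index names in the log below.

```python
import csv
import io


def read_csv_records(text):
    """Return one metadata dict per data row; the first row names the fields."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    # csv.reader already drops the surrounding double quotes that a
    # quoted export puts around field names; also trim stray whitespace.
    names = [name.strip() for name in header]
    # Skip completely empty lines, keep everything else as one record each.
    return [dict(zip(names, row)) for row in reader if row]


# Hypothetical sample in the style of an export with a double-quote qualifier.
sample = (
    '"dc.Title","dc.Description^abstract"\n'
    '"First record","An abstract, with a comma"\n'
    '"Second record","Another abstract"\n'
)

records = read_csv_records(sample)
for record in records:
    print(record)
```

The point of the sketch is only that the header line needs the same quote
handling as the data lines, so that the metadata names come out unquoted.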

If the updated plugin doesn't fix the problem, can you please send me a small
CSV file that triggers it and I'll try it here.
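
For anyone else who sees the "not well-formed (invalid token)" error shown in
the build log below, one plausible explanation (an assumption, not checked
against the generated doc.xml files) is that the quotes around a field name
ended up inside an XML attribute in the archive file. The Python sketch below
shows that an expat-based parser rejects such a line; the Metadata element
shown is a hypothetical example, and parse_error is just a helper name.

```python
import xml.etree.ElementTree as ET


def parse_error(xml_text):
    """Return the parser's error message, or None if the text is well-formed."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return str(err)
    return None


# Hypothetical metadata lines: the second keeps the CSV quotes in the name.
good = '<Metadata name="dc.Title">First record</Metadata>'
bad = '<Metadata name=""dc.Title"">First record</Metadata>'

print(parse_error(good))
print(parse_error(bad))
```

If this is what happened, stripping the quotes from the field names before
they are written out avoids the malformed attribute.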

All the best,

Michael

Michael Silver wrote:

>greenstone-users-bounces@list.scms.waikato.ac.nz wrote:
>
>
>>Hi Ed,
>>
>>I've just created a new CSVPlug that should be a lot closer to what
>>you're looking for. This plugin reads CSV files and creates a
>>Greenstone record for each line of the file (containing the metadata
>>from that line). The first line of the file must specify the metadata
>>element names, comma separated. The plugin will be included in
>>Greenstone v2.71, or you can download it now from
>>http://www.cs.waikato.ac.nz/~mdewsnip/greenstone/temp-2.70w/CSVPlug.pm
>>
>>
><snip>
>
>Oh frabjous day! This is exactly what I've been trying to do with my data,
>and it works like a charm! Thank you very much.
>
>I have encountered one issue, easily worked around. When I export the data
>from Access as a CSV file, I have to choose a text qualifier (double
>quotes) because the abstracts have commas in them. Access then also puts
>double quotes around the field names, which causes an error during the build
>process. The import process seems to work just fine, but the build process
>dies. Here are the logs:
>
>Command: C:\Perl\bin\Perl.exe -S C:\Program
>Files\Greenstone\bin\script\import.pl -gli -language en -collectdir
>C:\Program Files\Greenstone\collect -removeold testsub
>import.pl> Removing current contents of the archives directory...
>import.pl> RecPlug: getting directory C:\Program
>Files\Greenstone\collect\testsub\import
>import.pl> SplitPlug found 2 documents in C:\Program
>Files\Greenstone\collect\testsub\import\test.csv
>import.pl> WARNING: No plugin could recognise .test.csv.swp
>import.pl> segment 1 -
>import.pl> CSVPlug: processing test.csv
>import.pl> segment 2 -
>import.pl> CSVPlug: processing test.csv
>import.pl> WARNING: No plugin could recognise test.csv~
>import.pl> *********************************************
>import.pl> Import complete
>import.pl> *********************************************
>import.pl> * 4 documents were considered for processing
>import.pl> * 2 were processed and included in the collection
>import.pl> * 2 were unrecognised
>import.pl> See C:\Program Files\Greenstone\collect\testsub\etc\fail.log for
>a list of unrecognised and/or rejected documents
>import.pl> Command complete.
>import.pl> Extracting new metadata from archive files.
>import.pl> Archived metadata extraction complete.
>
>Obviously, the swp and ~ files are the result of editing the document in the
>import directory while I was testing the different variations.
>
>Command: C:\Perl\bin\Perl.exe -S C:\Program
>Files\Greenstone\bin\script\buildcol.pl -gli -language en -collectdir
>C:\Program Files\Greenstone\collect -removeold testsub
>buildcol.pl> *** creating the compressed text
>buildcol.pl> collecting text statistics (mgpp_passes -T1)
>buildcol.pl> ArcPlug: processing C:\Program
>Files\Greenstone\collect\testsub\archives\archives.inf
>buildcol.pl> GAPlug: processing HASH77c8.dir\doc.xml
>buildcol.pl> **** Error is:
>buildcol.pl> not well-formed (invalid token) at line 13, column 21, byte 558
>at C:/Perl/site/lib/XML/Parser.pm line 187
>buildcol.pl> WARNING: No plugin could process HASH77c8.dir\doc.xml
>buildcol.pl> GAPlug: processing HASH77c84dadc5fd.dir\doc.xml
>buildcol.pl> **** Error is:
>buildcol.pl> not well-formed (invalid token) at line 13, column 21, byte 558
>at C:/Perl/site/lib/XML/Parser.pm line 187
>buildcol.pl> WARNING: No plugin could process HASH77c84dadc5fd.dir\doc.xml
>buildcol.pl> Stats (Compressing text from text)
>buildcol.pl> Total bytes in collection: 0
>buildcol.pl> Total bytes in text: 0
>buildcol.pl> ***************
>buildcol.pl> WARNING: There is very little or no text to compress
>buildcol.pl> Was this your intention?
>buildcol.pl> ***************
>buildcol.pl> creating the compression dictionary
>buildcol.pl> compressing the text (mgpp_passes -T2)
>buildcol.pl> ArcPlug: processing C:\Program
>Files\Greenstone\collect\testsub\archives\archives.inf
>buildcol.pl> GAPlug: processing HASH77c8.dir\doc.xml
>buildcol.pl> **** Error is:
>buildcol.pl> not well-formed (invalid token) at line 13, column 21, byte 558
>at C:/Perl/site/lib/XML/Parser.pm line 187
>buildcol.pl> WARNING: No plugin could process HASH77c8.dir\doc.xml
>buildcol.pl> GAPlug: processing HASH77c84dadc5fd.dir\doc.xml
>buildcol.pl> **** Error is:
>buildcol.pl> not well-formed (invalid token) at line 13, column 21, byte 558
>at C:/Perl/site/lib/XML/Parser.pm line 187
>buildcol.pl> WARNING: No plugin could process HASH77c84dadc5fd.dir\doc.xml
>buildcol.pl> Stats (Compressing text from text)
>buildcol.pl> Total bytes in collection: 0
>buildcol.pl> Total bytes in text: 0
>buildcol.pl> ***************
>buildcol.pl> WARNING: There is very little or no text to compress
>buildcol.pl> Was this your intention?
>buildcol.pl> ***************
>buildcol.pl> *** building index text;dc.Title;dc.Description^abstract in
>subdirectory idx
>buildcol.pl> creating index dictionary (mgpp_passes -I1)
>buildcol.pl> ArcPlug: processing C:\Program
>Files\Greenstone\collect\testsub\archives\archives.inf
>buildcol.pl> GAPlug: processing HASH77c8.dir\doc.xml
>buildcol.pl> **** Error is:
>buildcol.pl> not well-formed (invalid token) at line 13, column 21, byte 558
>at C:/Perl/site/lib/XML/Parser.pm line 187
>buildcol.pl> WARNING: No plugin could process HASH77c8.dir\doc.xml
>buildcol.pl> GAPlug: processing HASH77c84dadc5fd.dir\doc.xml
>buildcol.pl> **** Error is:
>buildcol.pl> not well-formed (invalid token) at line 13, column 21, byte 558
>at C:/Perl/site/lib/XML/Parser.pm line 187
>buildcol.pl> WARNING: No plugin could process HASH77c84dadc5fd.dir\doc.xml
>buildcol.pl> Stats (Creating index text;dc.Title;dc.Description^abstract)
>buildcol.pl> Total bytes in collection: 0
>buildcol.pl> Total bytes in text;dc.Title;dc.Description^abstract: 0
>buildcol.pl> ***************
>buildcol.pl> WARNING: There is very little or no text to process for
>text;dc.Title;dc.Description^abstract
>buildcol.pl> Was this your intention?
>buildcol.pl> ***************
>buildcol.pl> inverting the text (mgpp_passes -I2)
>buildcol.pl> ArcPlug: processing C:\Program
>Files\Greenstone\collect\testsub\archives\archives.inf
>buildcol.pl> GAPlug: processing HASH77c8.dir\doc.xml
>buildcol.pl> **** Error is:
>buildcol.pl> not well-formed (invalid token) at line 13, column 21, byte 558
>at C:/Perl/site/lib/XML/Parser.pm line 187
>buildcol.pl> WARNING: No plugin could process HASH77c8.dir\doc.xml
>buildcol.pl> GAPlug: processing HASH77c84dadc5fd.dir\doc.xml
>buildcol.pl> **** Error is:
>buildcol.pl> not well-formed (invalid token) at line 13, column 21, byte 558
>at C:/Perl/site/lib/XML/Parser.pm line 187
>buildcol.pl> WARNING: No plugin could process HASH77c84dadc5fd.dir\doc.xml
>buildcol.pl> Stats (Creating index text;dc.Title;dc.Description^abstract)
>buildcol.pl> Total bytes in collection: 0
>buildcol.pl> Total bytes in text;dc.Title;dc.Description^abstract: 0
>buildcol.pl> ***************
>buildcol.pl> WARNING: There is very little or no text to process for
>text;dc.Title;dc.Description^abstract
>buildcol.pl> Was this your intention?
>buildcol.pl> ***************
>buildcol.pl> create the weights file
>buildcol.pl> creating 'on-disk' stemmed dictionary
>buildcol.pl> creating stem indexes
>buildcol.pl> Use of uninitialized value in concatenation (.) or string at
>C:\Program Files\Greenstone/perllib/mgppbuilder.pm line 792.
>buildcol.pl> Use of uninitialized value in concatenation (.) or string at
>C:\Program Files\Greenstone/perllib/mgppbuilder.pm line 792.
>buildcol.pl> *** creating the info database and processing associated files
>buildcol.pl> Use of uninitialized value in string eq at C:\Program
>Files\Greenstone/perllib/mgppbuilder.pm line 628.
>buildcol.pl> Use of uninitialized value in concatenation (.) or string at
>C:\Program Files\Greenstone/perllib/mgppbuilder.pm line 719.
>buildcol.pl> Use of uninitialized value in concatenation (.) or string at
>C:\Program Files\Greenstone/perllib/mgppbuilder.pm line 731.
>buildcol.pl> Use of uninitialized value in string eq at C:\Program
>Files\Greenstone/perllib/mgppbuilder.pm line 628.
>buildcol.pl> Use of uninitialized value in concatenation (.) or string at
>C:\Program Files\Greenstone/perllib/mgppbuilder.pm line 719.
>buildcol.pl> Use of uninitialized value in concatenation (.) or string at
>C:\Program Files\Greenstone/perllib/mgppbuilder.pm line 731.
>buildcol.pl> ArcPlug: processing C:\Program
>Files\Greenstone\collect\testsub\archives\archives.inf
>buildcol.pl> GAPlug: processing HASH77c8.dir\doc.xml
>buildcol.pl> **** Error is:
>buildcol.pl> not well-formed (invalid token) at line 13, column 21, byte 558
>at C:/Perl/site/lib/XML/Parser.pm line 187
>buildcol.pl> WARNING: No plugin could process HASH77c8.dir\doc.xml
>buildcol.pl> GAPlug: processing HASH77c84dadc5fd.dir\doc.xml
>buildcol.pl> **** Error is:
>buildcol.pl> not well-formed (invalid token) at line 13, column 21, byte 558
>at C:/Perl/site/lib/XML/Parser.pm line 187
>buildcol.pl> WARNING: No plugin could process HASH77c84dadc5fd.dir\doc.xml
>buildcol.pl> *** creating auxiliary files
>buildcol.pl> arcinfo::save_info couldn't write C:\Program
>Files\Greenstone\collect\testsub\archives\HASH77c84dadc5fd.dir\doc.xml\archives.inf
>buildcol.pl> Command failed.
>
>A popup then appears, titled "Collection Preview State": "An error has
>occurred which will prevent the collection being previewed." The new
>collection is not available. If there was a pre-existing collection, it is
>not replaced or overwritten.
>
>
>Removing the quotes produces the following output, with no error messages:
>
>Command: C:\Perl\bin\Perl.exe -S C:\Program
>Files\Greenstone\bin\script\import.pl -gli -language en -collectdir
>C:\Program Files\Greenstone\collect -removeold testsub
>import.pl> Removing current contents of the archives directory...
>import.pl> RecPlug: getting directory C:\Program
>Files\Greenstone\collect\testsub\import
>import.pl> SplitPlug found 2 documents in C:\Program
>Files\Greenstone\collect\testsub\import\test.csv
>import.pl> WARNING: No plugin could recognise .test.csv.swp
>import.pl> segment 1 -
>import.pl> CSVPlug: processing test.csv
>import.pl> segment 2 -
>import.pl> CSVPlug: processing test.csv
>import.pl> WARNING: No plugin could recognise test.csv~
>import.pl> *********************************************
>import.pl> Import complete
>import.pl> *********************************************
>import.pl> * 4 documents were considered for processing
>import.pl> * 2 were processed and included in the collection
>import.pl> * 2 were unrecognised
>import.pl> See C:\Program Files\Greenstone\collect\testsub\etc\fail.log for
>a list of unrecognised and/or rejected documents
>import.pl> Command complete.
>import.pl> Extracting new metadata from archive files.
>import.pl> Archived metadata extraction complete.
>Command: C:\Perl\bin\Perl.exe -S C:\Program
>Files\Greenstone\bin\script\buildcol.pl -gli -language en -collectdir
>C:\Program Files\Greenstone\collect -removeold testsub
>buildcol.pl> *** creating the compressed text
>buildcol.pl> collecting text statistics (mgpp_passes -T1)
>buildcol.pl> ArcPlug: processing C:\Program
>Files\Greenstone\collect\testsub\archives\archives.inf
>buildcol.pl> GAPlug: processing HASH77c8.dir\doc.xml
>buildcol.pl> GAPlug: processing HASH77c84dadc5fd.dir\doc.xml
>buildcol.pl> Stats (Compressing text from text)
>buildcol.pl> Total bytes in collection: 137
>buildcol.pl> Total bytes in text: 137
>buildcol.pl> creating the compression dictionary
>buildcol.pl> compressing the text (mgpp_passes -T2)
>buildcol.pl> ArcPlug: processing C:\Program
>Files\Greenstone\collect\testsub\archives\archives.inf
>buildcol.pl> GAPlug: processing HASH77c8.dir\doc.xml
>buildcol.pl> GAPlug: processing HASH77c84dadc5fd.dir\doc.xml
>buildcol.pl> Stats (Compressing text from text)
>buildcol.pl> Total bytes in collection: 137
>buildcol.pl> Total bytes in text: 137
>buildcol.pl> *** building index text;dc.Title;dc.Description^abstract in
>subdirectory idx
>buildcol.pl> creating index dictionary (mgpp_passes -I1)
>buildcol.pl> ArcPlug: processing C:\Program
>Files\Greenstone\collect\testsub\archives\archives.inf
>buildcol.pl> GAPlug: processing HASH77c8.dir\doc.xml
>buildcol.pl> GAPlug: processing HASH77c84dadc5fd.dir\doc.xml
>buildcol.pl> Stats (Creating index text;dc.Title;dc.Description^abstract)
>buildcol.pl> Total bytes in collection: 137
>buildcol.pl> Total bytes in text;dc.Title;dc.Description^abstract: 326
>buildcol.pl> inverting the text (mgpp_passes -I2)
>buildcol.pl> ArcPlug: processing C:\Program
>Files\Greenstone\collect\testsub\archives\archives.inf
>buildcol.pl> GAPlug: processing HASH77c8.dir\doc.xml
>buildcol.pl> GAPlug: processing HASH77c84dadc5fd.dir\doc.xml
>buildcol.pl> Stats (Creating index text;dc.Title;dc.Description^abstract)
>buildcol.pl> Total bytes in collection: 137
>buildcol.pl> Total bytes in text;dc.Title;dc.Description^abstract: 326
>buildcol.pl> create the weights file
>buildcol.pl> creating 'on-disk' stemmed dictionary
>buildcol.pl> creating stem indexes
>buildcol.pl> *** creating the info database and processing associated files
>buildcol.pl> ArcPlug: processing C:\Program
>Files\Greenstone\collect\testsub\archives\archives.inf
>buildcol.pl> GAPlug: processing HASH77c8.dir\doc.xml
>buildcol.pl> GAPlug: processing HASH77c84dadc5fd.dir\doc.xml
>buildcol.pl> *** creating auxiliary files
>buildcol.pl> Command complete.
>
>
>As I said, this is easily worked around - get rid of the quotes on the first
>line. On the other hand, I greatly appreciate you making this available, so
>I wanted to pass along my thanks and this detail in case someone else
>encounters the same problem.
>
>Thank you,
>Michael
>
>Michael Silver, Network Administrator
>Parkland Regional Library
>5404 56 Avenue Lacombe, AB T4L 1G1
>Phone: 403.782.3850 Fax: 403.782.4650
>http://www.prl.ab.ca/