[greenstone-users] Problems with doc.xml and parser

From Katherine of Greenstone Team
Date Tue Oct 7 09:46:37 2008
Subject [greenstone-users] Problems with doc.xml and parser
In-Reply-To (39EECA7249F24AE7B8A9F5EA636757E9-orsna-gov-ar)
Hi Diego

It's probably easiest if you send a couple of sample files that cause the
problem. It looks like there are & characters in the text which need to
be escaped as &amp;.
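
For illustration, here is a minimal Python sketch of that kind of escaping. It
is not part of Greenstone, and the function name escape_bare_ampersands is
made up for this example. It assumes the only problem is stray & characters
(both errors in the log, the '&' with no entity name after it and the
"rging" reference with no ';', fit that pattern), and it leaves the five
predefined XML entities and numeric character references untouched.

```python
import re

# An '&' that does NOT begin a predefined XML entity (&amp; &lt; &gt;
# &quot; &apos;) or a numeric character reference (&#233; / &#xE9;).
# Assumption: the documents define no custom entities of their own.
_BARE_AMP = re.compile(r"&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)")


def escape_bare_ampersands(text: str) -> str:
    """Replace every stray '&' with '&amp;' and leave valid references alone."""
    return _BARE_AMP.sub("&amp;", text)


if __name__ == "__main__":
    # A lone '&' and an '&' glued to a word, as OCR output might contain.
    print(escape_bare_ampersands("Smith & Sons, me&rging"))
```

Something like this could be run over the OCR text files before import. It
does not handle other characters that are illegal in XML, such as a stray <.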

Send to me off list if you like.
Cheers,
Katherine

Diego Spano wrote:
> Hi list,
>
> I'm running GS 2.80 on Fedora 8 with a collection of TIFF images with
> OCR, using PagedImgPlug and Lucene. The import runs with no problems, but
> when I run the build process I get a lot of parsing-related errors:
>
> GAPlug: processing 96-2576-00-Cuerpo02/doc.xml
> Starting to index <xml doc on stdin>
> [ Doc: 1parse error:
> org.xml.sax.SAXParseException: The entity name must immediately follow
> the '&' in the entity reference.
> at
> com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(Unknown
> Source)
> at
> com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(Unknown
> Source)
> at
> com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(Unknown
> Source)
> at
> com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(Unknown
> Source)
> at
> com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEntityReference(Unknown
> Source)
> at
> com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(Unknown
> Source)
> at
> com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown
> Source)
> at
> com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown
> Source)
> at
> com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown
> Source)
> at
> com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown
> Source)
> at
> com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown Source)
> at
> com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(Unknown
> Source)
> at
> com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown
> Source)
> at javax.xml.parsers.SAXParser.parse(Unknown Source)
> at org.greenstone.LuceneWrapper.Indexer.index(Indexer.java:117)
> at
> org.greenstone.LuceneWrapper.IndexXML.indexFile(IndexXML.java:65)
> at
> org.greenstone.LuceneWrapper.GS2LuceneIndexer.main(GS2LuceneIndexer.java:110)
> GAPlug: processing 96-2576-00-Cuerpo03/doc.xml
> Starting to index <xml doc on stdin>
> [ Doc: 102parse error:
> org.xml.sax.SAXParseException: The entity name must immediately follow
> the '&' in the entity reference.
> at
> com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(Unknown
> Source)
> at
> com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(Unknown
> Source)
> at
> com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(Unknown
> Source)
> at
> com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(Unknown
> Source)
> at
> com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEntityReference(Unknown
> Source)
> at
> com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(Unknown
> Source)
> at
> com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown
> Source)
> at
> com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown
> Source)
> at
> com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown
> Source)
> at
> com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown
> Source)
> at
> com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown Source)
> at
> com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(Unknown
> Source)
> at
> com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown
> Source)
> at javax.xml.parsers.SAXParser.parse(Unknown Source)
> at org.greenstone.LuceneWrapper.Indexer.index(Indexer.java:117)
> at
> org.greenstone.LuceneWrapper.IndexXML.indexFile(IndexXML.java:65)
> at
> org.greenstone.LuceneWrapper.GS2LuceneIndexer.main(GS2LuceneIndexer.java:110)
> GAPlug: processing 96-2576-00-Cuerpo04/doc.xml
> Starting to index <xml doc on stdin>
> [ Doc: 307parse error:
> org.xml.sax.SAXParseException: The reference to entity "rging" must
> end with the ';' delimiter.
> at
> com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(Unknown
> Source)
> at
> com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(Unknown
> Source)
> at
> com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(Unknown
> Source)
> at
> com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(Unknown
> Source)
> at
> com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEntityReference(Unknown
> Source)
> at
> com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(Unknown
> Source)
> at
> com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown
> Source)
> at
> com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown
> Source)
> at
> com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown
> Source)
> at
> com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown
> Source)
> at
> com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown Source)
> at
> com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(Unknown
> Source)
> at
> com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown
> Source)
> at javax.xml.parsers.SAXParser.parse(Unknown Source)
> at org.greenstone.LuceneWrapper.Indexer.index(Indexer.java:117)
> at
> org.greenstone.LuceneWrapper.IndexXML.indexFile(IndexXML.java:65)
> at
> org.greenstone.LuceneWrapper.GS2LuceneIndexer.main(GS2LuceneIndexer.java:110)
> All the text comes from OCR, so it's probably very dirty, but the
> import process is supposed to take care of this when creating
> doc.xml. I can't edit every txt file and replace all the strange chars!
>
> Any help?
>
> TIA
>
> Diego Spano