[greenstone-users] Problems with doc.xml and parser

From Diego Spano
Date Tue Oct 7 05:06:18 2008
Subject [greenstone-users] Problems with doc.xml and parser
Hi list,

I'm running GS 2.80 on Fedora 8 with a collection of TIFF images with OCR
text, using PagedImgPlug and Lucene. The import ran with no problems, but
when I run the build process I get a lot of parsing-related errors:

GAPlug: processing 96-2576-00-Cuerpo02/doc.xml
Starting to index <xml doc on stdin>
[ Doc: 1parse error:
org.xml.sax.SAXParseException: The entity name must immediately follow the '&' in the entity reference.
at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEntityReference(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
at javax.xml.parsers.SAXParser.parse(Unknown Source)
at org.greenstone.LuceneWrapper.Indexer.index(Indexer.java:117)
at org.greenstone.LuceneWrapper.IndexXML.indexFile(IndexXML.java:65)
at org.greenstone.LuceneWrapper.GS2LuceneIndexer.main(GS2LuceneIndexer.java:110)
GAPlug: processing 96-2576-00-Cuerpo03/doc.xml
Starting to index <xml doc on stdin>
[ Doc: 102parse error:
org.xml.sax.SAXParseException: The entity name must immediately follow the '&' in the entity reference.
at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEntityReference(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
at javax.xml.parsers.SAXParser.parse(Unknown Source)
at org.greenstone.LuceneWrapper.Indexer.index(Indexer.java:117)
at org.greenstone.LuceneWrapper.IndexXML.indexFile(IndexXML.java:65)
at org.greenstone.LuceneWrapper.GS2LuceneIndexer.main(GS2LuceneIndexer.java:110)
GAPlug: processing 96-2576-00-Cuerpo04/doc.xml
Starting to index <xml doc on stdin>
[ Doc: 307parse error:
org.xml.sax.SAXParseException: The reference to entity "rging" must end with the ';' delimiter.
at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEntityReference(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
at javax.xml.parsers.SAXParser.parse(Unknown Source)
at org.greenstone.LuceneWrapper.Indexer.index(Indexer.java:117)
at org.greenstone.LuceneWrapper.IndexXML.indexFile(IndexXML.java:65)
at org.greenstone.LuceneWrapper.GS2LuceneIndexer.main(GS2LuceneIndexer.java:110)

All the text comes from OCR, so it's probably very dirty, but I would
expect the import process to take care of this when creating doc.xml. I
can't edit all the text and replace all the strange chars!
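
Both error messages are what an XML parser reports when the text contains an
'&' that does not begin a complete entity reference (for example "& " or
"&rging" with no closing ';'). The sketch below is only an illustration in
Python, not Greenstone's own code or a confirmed fix: the sample string and
the helper names escape_bare_ampersands and parses are made up, and it
assumes the text holds no CDATA sections. It reproduces the failure on a
small document and shows that escaping the stray '&' characters lets the
same text parse.

```python
import re
import xml.sax

# An '&' that is NOT the start of a well-formed reference such as
# &amp;, &#38; or &#x26; (hypothetical helper, not part of Greenstone).
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z_][\w.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def escape_bare_ampersands(text: str) -> str:
    """Replace stray '&' characters with '&amp;', leaving real references alone."""
    return BARE_AMP.sub("&amp;", text)


def parses(xml_text: str) -> bool:
    """Return True if a SAX parser accepts the document, False on a parse error."""
    try:
        xml.sax.parseString(xml_text.encode("utf-8"), xml.sax.ContentHandler())
        return True
    except xml.sax.SAXParseException:
        return False


# Made-up sample standing in for dirty OCR text inside doc.xml.
dirty = "<Doc>Smith & Sons, me&rging &amp; more</Doc>"
clean = escape_bare_ampersands(dirty)

print(parses(dirty))
print(parses(clean))
print(clean)
```

Note that this only handles a stray '&'. A reference that is well formed but
undefined (such as "&foo;") would still be rejected by the parser, and other
OCR debris such as control characters is not touched.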

Any help?

TIA

Diego Spano