[greenstone-devel] Re: Lucene building support in greenstone

From Katherine Don
Date Wed, 01 Jun 2005 09:58:58 +1200
Subject [greenstone-devel] Re: Lucene building support in greenstone
In-Reply-To (Pine-LNX-4-44-0505311133230-21931-100000-fornax-it-iitb-ac-in)
Hi Chaitra

> Now, when I build test, I get the following exception:
> buildcol.pl> **** Indexing Doc 3
> buildcol.pl> org.xml.sax.SAXParseException: Next character must be ";"
> terminating reference to entity "A".
> buildcol.pl> at
> org.apache.crimson.parser.Parser2.parse(Parser2.java:305)
> buildcol.pl> at
> org.apache.crimson.parser.XMLReaderImpl.parse(XMLReaderImpl.java:442)
> buildcol.pl> at javax.xml.parsers.SAXParser.parse(SAXParser.java:345)
> buildcol.pl> at Indexer.index(Indexer.java:87)
> buildcol.pl> at IndexXML.indexFile(IndexXML.java:43)
> buildcol.pl> at GS2LuceneIndexer.main(GS2LuceneIndexer.java:51)
> buildcol.pl> parse error:
Hmm, this is a bit weird because it should be using Xerces as the parser,
not Crimson. Xerces is included in LuceneWrap.jar.
The error means that you have an '&' in the document that is not part of
an entity reference (e.g. &lt;), which is invalid XML.
You wouldn't get the error with mgpp because its input is not parsed with
an XML parser.
You could try looking at the doc.xml files to find where this '&' is
and get rid of it (for example by escaping it as &amp;).
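
One way to track down the offending '&' is sketched below in Python. This is only a rough helper, not part of Greenstone: the function names find_bare_ampersands and scan_collection are made up for illustration, and the only thing assumed about the layout is that the files are called doc.xml somewhere below the directory you point it at. It flags every '&' that does not start a complete reference such as &amp;, &#38; or &#x26;. It does not understand CDATA sections or comments, so an '&' inside those would be reported even though it is legal there.

```python
import re
from pathlib import Path

# An '&' is acceptable only when it starts a complete reference:
# a named entity (&amp;), a decimal (&#38;) or a hex (&#x26;) character reference.
BARE_AMP = re.compile(
    r"&(?!(?:[A-Za-z_:][A-Za-z0-9_.:\-]*|#[0-9]+|#x[0-9A-Fa-f]+);)"
)


def find_bare_ampersands(text):
    """Return (line_number, column, line_text) for each '&' that does not start a reference."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for match in BARE_AMP.finditer(line):
            # Columns are reported 1-based, like most parser error messages.
            hits.append((line_no, match.start() + 1, line.strip()))
    return hits


def scan_collection(root):
    """Scan every file named doc.xml below root (assumed layout); map path -> hits."""
    report = {}
    for path in sorted(Path(root).rglob("doc.xml")):
        text = path.read_text(encoding="utf-8", errors="replace")
        hits = find_bare_ampersands(text)
        if hits:
            report[str(path)] = hits
    return report
```

Calling scan_collection on the directory that holds the doc.xml files returns a dictionary keyed by file path, so an empty result means no bare '&' was found by this (approximate) check.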

> Each time I rebuild the collection this exception appears for random doc
> numbers (it's Doc1 in the first build, Doc2 or Doc3 in the 2nd build and so
> on, and sometimes without any doc numbers). This works fine when the
> indexes are built using mgpp and no exceptions are thrown! This error is
> thrown for only one doc but other docs are also not displayed on the
> greenstone page.
I guess the whole collection is not built properly when a parse
exception happens.
>>I'll have a look. But just to check, do both your collections have the
>>same structure? most importantly they need to have the same index names.
>>this means that unless you manually rename indexes, you can't cross
>>collection search across different types (mg, mgpp, lucene) of collections.
> I tried renaming the .cfs files in one of the collections to have the same
> name as the other, but I am unable to search the collections.
I meant the index directory names. Looking at your config files, they
seem OK and cross-collection searching should work (for text only;
searching titles will only work for the one collection that indexes
titles). Your indexes are probably called didx in both collections, so
that's fine.
If one of your collections is broken due to the above errors, then of
course cross-collection searching will not work.
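
To double-check that the index directory names really do match, something like the following Python sketch could be used. It is an illustration only: the helper names are invented, and it simply assumes that each collection directory has an index subdirectory whose own subdirectories are the things to compare.

```python
from pathlib import Path


def index_dir_names(collection_dir):
    """Names of the subdirectories of <collection_dir>/index (assumed layout)."""
    index_root = Path(collection_dir) / "index"
    if not index_root.is_dir():
        return set()
    return {p.name for p in index_root.iterdir() if p.is_dir()}


def compare_index_dirs(coll_a, coll_b):
    """Return (shared, only_in_a, only_in_b) as sorted lists of directory names."""
    a = index_dir_names(coll_a)
    b = index_dir_names(coll_b)
    return sorted(a & b), sorted(a - b), sorted(b - a)
```

If the second or third list is non-empty, the two collections do not have the same set of index directories, which is the situation where renaming would be needed.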

I suggest you get both collections working properly as individuals, then
try the cross collection searching again.
>>>1. It seems to be working for word docs only. When I use text and pdf
>>>files, the text is not displayed on the greenstone page.
>>this is strange. Lucene works off the archive files, not off the import
>>documents, so it shouldn't make any difference what type of documents
>>they were.
>>Have a look at the documents in index/text/hashid/doc.xml - this is
>>where the text is taken from. Is there text in there?
>>are there other differences between the word and pdf docs? eg do your
>>word docs have sections while the others don't?
>>are you using section level in your index or just document level?
> The text is present in the doc.xml files. I am using document level in my
> indexes
Weird. I have tried Word, PDF, HTML and text documents. They all worked
for me except the text one: I could search it but not display it.
I'll add this to our list to be looked at at some stage.