[greenstone-users] Re: Lucene building support in greenstone

From Chaitra Rao
Date Fri, 17 Jun 2005 11:00:45 +0530 (IST)
Subject [greenstone-users] Re: Lucene building support in greenstone
In-Reply-To (429CDE22-9030706-cs-waikato-ac-nz)
Hi Katherine,
Thanks for the response. I am now able to build the collection without
any exceptions. I can view the file contents (except for text files), but
I'm unable to search these docs. Please help!

Regards,
Chaitra

On Wed, 1 Jun 2005, Katherine Don wrote:

> Hi Chaitra
>
>
> > Now, when I build test, I get the following exception:
> > buildcol.pl> **** Indexing Doc 3
> > buildcol.pl> org.xml.sax.SAXParseException: Next character must be ";" terminating reference to entity "A".
> > buildcol.pl> at
> ...
> > org.apache.crimson.parser.Parser2.parse(Parser2.java:305)
> > buildcol.pl> at
> > org.apache.crimson.parser.XMLReaderImpl.parse(XMLReaderImpl.java:442)
> > buildcol.pl> at javax.xml.parsers.SAXParser.parse(SAXParser.java:345)
> > buildcol.pl> at Indexer.index(Indexer.java:87)
> > buildcol.pl> at IndexXML.indexFile(IndexXML.java:43)
> > buildcol.pl> at GS2LuceneIndexer.main(GS2LuceneIndexer.java:51)
> > buildcol.pl> parse error:
> >
> hmmm, this is a bit weird cos it should be using xerces as the parser,
> not crimson. xerces is included in the LuceneWrap.jar.
> The error means that you have an '&' in the document that's not part of
> an entity reference (e.g. &lt;), which is invalid XML.
> You wouldn't get the error with mgpp cos it's not parsed using an XML parser.
> You could try looking at the doc.xml files to see where this & is
> and get rid of it.
>
> > Each time I rebuild the collection this exception appears for random doc
> > numbers (it's Doc1 in the first build, Doc2 or Doc3 in the 2nd build and so
> > on and sometimes without any doc numbers). This works fine when the
> > indexes are built using mgpp and no exceptions are thrown! This error is
> > thrown for only one doc but other docs are also not displayed on the
> > Greenstone page.
> >
> I guess the whole collection is not built properly when a parse
> exception happens.
> >
> >
> >>I'll have a look. But just to check, do both your collections have the
> >>same structure? most importantly they need to have the same index names.
> >>this means that unless you manually rename indexes, you can't cross
> >>collection search across different types (mg, mgpp, lucene) of collections.
> >
> >
> > I tried renaming the .cfs files in one of the collections to have the same
> > name as the other, but I am unable to search the collections.
> >
> I meant the index directory names. Looking at your config files, they
> seem ok and cross coll searching should work. (searching text only.
> searching titles will only work for the one collection that is indexing
> titles). Your indexes are probably called didx in both collections, so
> that's fine.
> If one of your collections is broken due to the above errors, then of
> course cross coll searching will not work.
>
> I suggest you get both collections working properly as individuals, then
> try the cross collection searching again.
> >
> >>>1. It seems to be working for word docs only. When I use text and pdf
> >>>files, the text is not displayed on the greenstone page.
> >>
> >>this is strange. Lucene works off the archive files, not off the import
> >>documents, so it shouldn't make any difference what type of documents
> >>they were.
> >>Have a look at the documents in index/text/hashid/doc.xml - this is
> >>where the text is taken from. Is there text in there?
> >>Are there other differences between the word and pdf docs? e.g. do your
> >>word docs have sections while the others don't?
> >>Are you using section level in your index or just document level?
> >
> >
> > The text is present in the doc.xml files. I am using document level in my
> > indexes
> >
> weird. I have tried word, pdf, html and text. They all worked for me
> except the text one - I could search it but not display it.
> I'll add this to our list to be looked at at some stage.
>
> Katherine
>
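
For the stray '&' described in the quoted reply, checking the doc.xml files
by eye gets tedious once there are more than a few documents. Below is a
minimal Python sketch that lists every '&' that does not begin a well-formed
reference such as &lt; or &#38;. It is only an illustration, not part of
Greenstone: the function names are made up for this example, and the regular
expression is a simplification of the XML name rules. It also reports any
bare '&' inside a CDATA section, which would actually be legal, so treat the
output as a list of places to look rather than a list of confirmed errors.

```python
import re
from pathlib import Path

# An '&' that is NOT followed by a complete reference:
#   &name;   (named entity, e.g. &lt; or &amp;)
#   &#38;    (decimal character reference)
#   &#x26;   (hexadecimal character reference)
# The name pattern is a simplified approximation of the XML rules.
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z_:][\w.:-]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def find_bare_ampersands(text):
    """Return (line_number, column, stripped_line) for each bare '&'."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in BARE_AMP.finditer(line):
            hits.append((lineno, match.start() + 1, line.strip()))
    return hits


def scan_tree(root):
    """Print the location of each bare '&' in every doc.xml under root."""
    for path in sorted(Path(root).rglob("doc.xml")):
        # errors="replace" keeps the scan going if a file is not valid UTF-8
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, col, line in find_bare_ampersands(text):
            print(f"{path}:{lineno}:{col}: {line}")
```

Calling scan_tree("index/text") from inside the collection directory would
cover the index/text/hashid/doc.xml files mentioned in the quoted reply;
adjust the path if your files live elsewhere. Each reported '&' can then be
removed or written as &amp; before rebuilding.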