[greenstone-users] Using Greenstone as a webography : some problems with php files

From ak19@cs.waikato.ac.nz
Date Wed Mar 26 12:43:02 2008
Subject [greenstone-users] Using Greenstone as a webography : some problems with php files
In-Reply-To (996a19c0803210819s699ffce7hb0f7f0f5f0d96c7b-mail-gmail-com)
Hi Damien,

I've been thinking about your problem, but it is hard for me to replicate
as I don't have a web mirroring tool here (that I know of) which replaces
the "?" after the php with an "@" sign.

However, I think the solution may lie in the HTMLPlug's configuration for
process_exp which is currently set to:
"(?i)(\.html?|\.shtml|\.shm|\.asp|\.php\d?|\.cgi|.+\?.+=.*)$"

To view your HTMLPlug's configuration, you first need to be in Expert mode.
Go to File > Preferences and, on the Mode tab, select Expert. Then, in the
Design tab for your collection, click on Document Plugins on the left,
select HTMLPlug on the right and press the Configure Plugin button.
You will see that process_exp is one of the options for this plugin.

I don't know how familiar you are with the Perl programming language,
but the bit above inside quotes is a "regular expression" in Perl. It
tells the HTMLPlug which file names it should process, mostly by their
extension, and as you can see ".php" is listed as one of them.
To explain that regular expression in words and be more precise about its
meaning, it says the file name must end in any of:
- .htm or .html, .shtml, .shm, .asp, .php followed by one optional digit,
  or .cgi,
- OR a sequence of one or more characters followed by a question-mark,
  which is followed by one or more characters, then an equals-sign
  followed by 0 or more characters at the end.

This means that the second description above is the most relevant bit for
you as to what the HTMLPlug recognises as something it can process:
- a_sequence_of_chars?word=optional_word
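
To make the matching concrete, here is a minimal Python sketch. It assumes
the expression has its backslash escapes in place (a literal dot is "\.",
a digit is "\d", a literal question-mark is "\?") and that the plugin
searches for the expression within the file name. Python's re module is
only a stand-in for Perl here, and the file names are invented for
illustration.

```python
import re

# Original process_exp with its escapes written out (assumed form).
PROCESS_EXP = r"(?i)(\.html?|\.shtml|\.shm|\.asp|\.php\d?|\.cgi|.+\?.+=.*)$"


def is_processed(filename: str) -> bool:
    """Return True if the expression matches somewhere in the file name."""
    return re.search(PROCESS_EXP, filename) is not None


# Hypothetical file names, not taken from any real mirror.
for name in ["index.html", "page.php3",
             "spip.php?page=article", "spip.php@page=article"]:
    print(name, is_processed(name))
```

In this sketch only the last name is rejected: it has no question-mark,
and ".php" is not at the end of the name.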

THE IMPORTANT BIT FOR YOU:
Could you try pasting the following in the process_exp field for your
collection (after first making a copy of what expression was already there
in the field, so that we don't lose the expression that was actually
working):
- "(?i)(\.html?|\.shtml|\.shm|\.asp|\.php\d?|\.cgi|.+(\?|@).+=.*)$"
Here's the explanation of the change I've made: it says it can expect
EITHER a question-mark OR an at-sign (@).

If that didn't work, could you try:
- "(?i)(\.html?|\.shtml|\.shm|\.asp|\.php\d?|\.cgi|.+(\?|\@).+=.*)$"
The intention here is the same, but will hopefully deal with the event
that the at-sign is a reserved character and therefore has a special
meaning in Perl regular expressions (I don't think it does). But if this
is the case, the second expression is escaping it with "\" to treat it as
the at-sign character that we want, rather than the special meaning.
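
As a rough check of the two candidates, the sketch below runs both against
a few made-up file names. It assumes the expressions carry their backslash
escapes ("\?" for a literal question-mark, "\@" for the escaped at-sign)
and uses Python's re module as a stand-in for Perl, so it suggests rather
than proves how Greenstone will behave.

```python
import re

# Shared prefix of both candidate expressions (escapes written out).
PREFIX = r"(?i)(\.html?|\.shtml|\.shm|\.asp|\.php\d?|\.cgi|"
FIRST = PREFIX + r".+(\?|@).+=.*)$"    # plain at-sign
SECOND = PREFIX + r".+(\?|\@).+=.*)$"  # escaped at-sign


def matches(pattern: str, filename: str) -> bool:
    """Return True if the pattern matches somewhere in the file name."""
    return re.search(pattern, filename) is not None


# Hypothetical file names, chosen only to exercise the new alternative.
names = ["spip.php@page=article", "spip.php?page=article", "spip.php@article12"]
results = {name: (matches(FIRST, name), matches(SECOND, name)) for name in names}
for name, outcome in results.items():
    print(name, outcome)
```

In Python the two expressions give the same answer for every name. Note
that the last name is rejected by both: the new alternative still expects
an equals-sign somewhere after the "?" or "@".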

Of course all this explanation is needless confusion unless it actually
worked for you.
- Therefore, could you please try the first (and if that didn't work, then
the second) solution above on your mirrored php documents and let us know
whether it worked?
- You would also need to check that it didn't break any of your regular
web links (the ones without php@). It's no good if the solution works just
for php@ but suddenly the php? or even other file extensions stopped
working.
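
One way to sketch that check outside Greenstone is to list every name the
old expression accepted but the new one rejects. The expressions below
assume the backslash escapes are in place, Python's re module stands in
for Perl, and the sample names and the helper "regressions" are
hypothetical.

```python
import re

OLD = r"(?i)(\.html?|\.shtml|\.shm|\.asp|\.php\d?|\.cgi|.+\?.+=.*)$"
NEW = r"(?i)(\.html?|\.shtml|\.shm|\.asp|\.php\d?|\.cgi|.+(\?|@).+=.*)$"


def regressions(old: str, new: str, filenames: list[str]) -> list[str]:
    """Return the names matched by the old expression but not by the new one."""
    return [
        name for name in filenames
        if re.search(old, name) and not re.search(new, name)
    ]


# Hypothetical names covering the extensions the old expression lists.
sample = ["index.html", "INDEX.HTM", "page.shtml", "form.asp",
          "old.php3", "script.cgi", "view.php?id=7"]
broken = regressions(OLD, NEW, sample)
print(broken)
```

An empty list means nothing in the sample stopped matching. A real check
would use the actual file names from the mirrored folders.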

Alternatively, if the above was all just too confusing, you could send one
or two dummy php documents with "php@article" in the file name and also
tell me their mirrored folder structure and I could replicate the
structure here and try it out.

Fingers crossed and hope this works out,
Anupama

> Hello everyone,
>
> I discovered Greenstone a couple of months ago, and I decided to include
> it in an information project about the French students' strike last
> November. Our plan was to use Greenstone as a tool which links documents
> to their real sources (for legal reasons). So, the value "file_as_url" is
> indispensable.
>
> To build indexes and a logical organization of our documents, I used the
> mirroring module. For the majority of documents there aren't any
> problems; they arise where I gathered some PHP files.
> Indeed, some mirrored websites use a CMS such as Spip:
>
> 1) The URL has been modified. Normally, it would be spip.php?article... it
> became spip.php@article...
> 2) Greenstone can't identify the file type because of the text after the
> extension, so the plugin UnknownPlug indexes the document.
>
> Is there any way to configure Greenstone to index that kind of file
> correctly, in order to get all my items under the icon "weblink"?
>
> Thanks for your support.
> --
> Damien NICOL
> nicol.damien@gmail.com