NOTE: A Belorussian translation is available,
kindly provided by Bohdan Zograf.
PreScript is a utility for extracting text from PostScript files.
- PostScript conversion to plain ASCII or HTML.
- PreScript is really a PostScript to plain text converter,
but rudimentary HTML can also be produced. Tags are inserted to mark
paragraphs (<p>), short lines (<br>), page breaks
(<hr>), and headers and footers (italicized with
- Paragraph boundary detection.
- PreScript determines the line spacing of a document and
uses this (and also indentations) to determine paragraph boundaries.
- Hyphenation removal.
- Hyphenated words are de-hyphenated.
- Ligature translation.
- Most ligatures used by TeX documents are detected.
PreScript doesn't track font changes, which makes it impossible to
reliably detect all ligatures.
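The paragraph and hyphenation features above can be illustrated with a small sketch. This is not PreScript's actual code: the function names, the (y, x, text) line representation, and the tolerance and indent_step thresholds are all assumptions made for the example. It takes the most common baseline gap as the document's line spacing, starts a new paragraph on a larger gap or an indented line, and joins the lines of each paragraph while dropping end-of-line hyphens.

```python
from collections import Counter


def dominant_spacing(ys):
    """Most common gap between consecutive baselines (the body line spacing)."""
    gaps = [round(a - b, 1) for a, b in zip(ys, ys[1:])]
    return Counter(gaps).most_common(1)[0][0]


def split_paragraphs(lines, tolerance=1.5, indent_step=5.0):
    """Group lines into paragraphs.

    lines: list of (y, x, text) tuples, top of page first. PostScript's
    y axis points up, so y decreases from one line to the next.
    tolerance and indent_step are hypothetical thresholds.
    """
    if not lines:
        return []
    if len(lines) == 1:
        return [[lines[0][2]]]
    spacing = dominant_spacing([y for y, _, _ in lines])
    left_margin = min(x for _, x, _ in lines)
    paragraphs = [[lines[0][2]]]
    for (prev_y, _, _), (y, x, text) in zip(lines, lines[1:]):
        large_gap = (prev_y - y) > spacing * tolerance
        indented = (x - left_margin) >= indent_step
        if large_gap or indented:
            paragraphs.append([text])
        else:
            paragraphs[-1].append(text)
    return paragraphs


def join_lines(texts):
    """Join the lines of one paragraph, removing end-of-line hyphens.

    Naive on purpose: a genuine compound broken at its hyphen
    (e.g. "well-" / "known") is also merged into one word.
    """
    out = ""
    for text in texts:
        text = text.strip()
        if out.endswith("-"):
            out = out[:-1] + text
        elif out:
            out += " " + text
        else:
            out = text
    return out


page = [
    (700, 72, "PreScript deter-"),
    (688, 72, "mines the line spacing."),
    (676, 82, "Indented lines start"),
    (664, 72, "a new paragraph."),
    (640, 72, "So does a large gap."),
]
for para in split_paragraphs(page):
    print(join_lines(para))
```

A real converter needs more care than this (multiple columns, headings, and differing font sizes all disturb the spacing statistics), which is why the thresholds here are left as parameters.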
PreScript is written in PostScript and Python. You will
need Ghostscript (at least
version 4.01) and the
Python interpreter (at least
The PreScript 0.1 distribution
This distribution is the most stable - it is what you should use to
do real work.
- Download the PreScript
- Define the environment variable PRESCRIPT_DIR to the
directory where PreScript is installed (or wherever you put
- Move prescript.py to a directory listed in your
PATH environment variable. You may want to remove the
.py suffix (prescript.py can be either a standalone
program, or an imported library of another Python program).
- Change #! /usr/local/bin/python in
prescript.py to the location of your Python interpreter.
The PreScript 2 distribution
This is a beta release of our latest version. This version is a lot cleaner and
faster; it is also extensible (users can write their own renderers),
better documented, and contains better prediction of line, paragraph,
and page breaks. If you notice any bugs, want to request new
features, or want to become a beta tester, please email the New Zealand Digital Library
- Download a PreScript 2 distribution (the later versions are more stable).
PreScript 2.2 -- same as PreScript 2.1, but compatibility issues
with Python 1.5 have been fixed
- On Unix systems, 'make install' will install prescript to /usr/local/bin. It will
also install the accompanying manual page (to install somewhere else simply edit
- If not installing with the make utility:
It is easiest if all of the program scripts are kept in the same directory,
which ideally should be listed in the PATH environment variable. If
this is inconvenient, be sure that PRESCRIPT_DIR points to where
prescript.ps is installed, and that PYTHONPATH points to
where the *.py files are installed.
- Change #!/usr/local/bin/python in prescript to the
location of your Python interpreter ('make install' does NOT do this for you).
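For a manual installation where the scripts are split across directories, the two environment variables mentioned above can be sanity-checked with a short script. This is only a sketch for a modern Python 3 interpreter, not part of the PreScript distribution; the function name check_prescript_env and its messages are made up for the example.

```python
import os
from pathlib import Path


def check_prescript_env(environ=None):
    """Return a list of problems found; an empty list means the layout looks fine.

    Hypothetical helper: checks only what the installation notes describe,
    namely that PRESCRIPT_DIR holds prescript.ps and that some directory
    on PYTHONPATH holds *.py files.
    """
    env = os.environ if environ is None else environ
    problems = []

    ps_dir = env.get("PRESCRIPT_DIR")
    if not ps_dir:
        problems.append("PRESCRIPT_DIR is not set")
    elif not (Path(ps_dir) / "prescript.ps").is_file():
        problems.append(f"prescript.ps not found in {ps_dir}")

    py_dirs = [d for d in env.get("PYTHONPATH", "").split(os.pathsep) if d]
    if not any(any(Path(d).glob("*.py")) for d in py_dirs):
        problems.append("no directory on PYTHONPATH contains *.py files")

    return problems


if __name__ == "__main__":
    for problem in check_prescript_env():
        print("warning:", problem)
```

The check is not needed when all program scripts live in one directory on the PATH, which is the simpler layout recommended above.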
prescript format input [output]
- format is either plain or html.
- input is the input filename, a PostScript file.
- output is the output filename. By default, the output file
name is the same as the input filename with the path removed and the suffix
replaced by either .txt or .html.
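The default output name rule can be written out as a few lines of Python. This is a sketch of the rule as stated above, not code taken from PreScript; the function name default_output_name and the SUFFIXES table are assumptions for the example.

```python
import os

# Output suffix for each format accepted on the command line.
SUFFIXES = {"plain": ".txt", "html": ".html"}


def default_output_name(input_path, fmt):
    """Strip the directory part of input_path and swap its suffix for the format's."""
    if fmt not in SUFFIXES:
        raise ValueError("format must be 'plain' or 'html'")
    stem, _old_suffix = os.path.splitext(os.path.basename(input_path))
    return stem + SUFFIXES[fmt]


print(default_output_name("/tmp/reports/paper.ps", "html"))
print(default_output_name("paper.ps", "plain"))
```

Under this rule, running prescript html /tmp/reports/paper.ps without an output argument would write paper.html in the current directory.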
Please report bugs to the New Zealand Digital Library
PreScript is a port of a Perl program used by the New Zealand Digital
Library project to convert computer science technical reports to HTML. The
Perl version is deemed unfit for a public release because the code is quite
messy (a consequence of Perl's cumbersome syntax for defining objects). The
Python version is considerably easier to understand, maintain, and extend.
The technical paper prescript.ps.gz documents the
algorithms and heuristics used in PreScript 0.1. An update to this
for PreScript 2 is included in its distribution
Other PostScript Converters
Here is a summary of other PostScript to text converters we found.
- From the DEC Virtual Paper research project. PostScript program
and C program. Probably the best PostScript to text converter (after
PreScript, of course).
- Developed at Johns Hopkins University to convert JHU journal
articles to HTML. This converter attempts to preserve the formatting
of the original PostScript document, but is tied to PostScript
files generated with a specific package (QuarkXPress?). A table
describing a number of parameters is used to aid conversion and can be
modified for new formats. Uses a variation of Ghostscript's
- Part of the Ghostscript distribution. ps2ascii.ps is
considerably less robust than PreScript.
- A PostScript program similar to Ghostscript's ps2ascii.ps.
- A PostScript program and Perl script.
- A Perl script that extracts parenthesized text from a PostScript
- A standalone C program that extracts parenthesized text. Includes some
special code to deal with dvips-generated files.
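To show what "extracting parenthesized text" amounts to, here is a minimal Python sketch of the idea. It is not the code of any converter listed above, and the function name extract_ps_strings is made up. PostScript string literals are written in parentheses, may contain balanced nested parentheses, and use backslash escapes, so the sketch tracks nesting depth and skips % comments.

```python
def extract_ps_strings(source):
    """Return the contents of top-level (...) string literals in PostScript source.

    Simplifications: hex strings (<...>) are ignored, and escapes other
    than \\( \\) and \\\\ (for example octal codes) are kept undecoded.
    """
    strings = []
    current = []
    depth = 0
    i = 0
    while i < len(source):
        ch = source[i]
        if depth == 0:
            if ch == "%":
                # Outside a string, a comment runs to the end of the line.
                newline = source.find("\n", i)
                i = len(source) if newline == -1 else newline
                continue
            if ch == "(":
                depth = 1
                current = []
        else:
            if ch == "\\" and i + 1 < len(source):
                nxt = source[i + 1]
                current.append(nxt if nxt in "()\\" else ch + nxt)
                i += 2
                continue
            if ch == "(":
                depth += 1
                current.append(ch)
            elif ch == ")":
                depth -= 1
                if depth == 0:
                    strings.append("".join(current))
                else:
                    current.append(ch)
            else:
                current.append(ch)
        i += 1
    return strings


sample = "(Hello) show % (not text)\n(a \\(nested\\) and (balanced) one) show\n"
print(extract_ps_strings(sample))
```

The output is only a list of string fragments in drawing order. It carries no position information, so word spacing, line breaks, and paragraph boundaries have to be guessed, which is the gap that a converter tracking text positions is meant to fill.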