By Jon: First published in the AusSI Newsletter, 2003
Syntactica (http://www.syntactica.com) is the latest ‘automatic indexing’ system for the general public. Unlike Indexicon and other earlier programs, Syntactica is a web-based service (see Glenda Browne’s extensive review of automatic indexing in LASIE V27, No 3, p58-65, reproduced at http://www.aussi.org/conferences/papers/browneg.htm).
Users pay a subscription fee which entitles them to upload their own documents to the Syntactica site. An opening balance of $US10 is credited to new users, allowing them to try Syntactica out on up to 100 pages of text before a payment is required. A registered user is given their own password-protected workspace in which the ‘indexes’ are kept and made available.
The website uses cookies to keep track of your details, a harmless but unnecessary liberty.
Syntactica will only process plain text, RTF and Word files.
Uploaded documents are analysed and the resulting ‘indexes’ become available for the user to download. Three indexes are available for each document: a short (Min) index with about one term per 100 words, a longer (Mid) one with about 4 terms per 100 words, and a longer one again (Max) with about 1 term for every 10 words. These ratios appear to decline for longer documents – see the table below for details.
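The decline in these ratios can be checked with a little arithmetic. The sketch below uses the word counts and entry counts reported in this review's table of document sizes; the function and variable names are mine, chosen for illustration only.

```python
# Word counts and index sizes as reported in this review's table.
DOCUMENTS = {
    1655: {"Min": 13, "Mid": 39, "Max": 141},
    4646: {"Min": 22, "Mid": 67, "Max": 212},
}


def entries_per_100_words(entries: int, words: int) -> float:
    """Return index density as entries per 100 words of text."""
    return round(100 * entries / words, 2)


def density_table(documents: dict) -> dict:
    """Compute the density of each index level for each document."""
    return {
        words: {
            level: entries_per_100_words(entries, words)
            for level, entries in sizes.items()
        }
        for words, sizes in documents.items()
    }


if __name__ == "__main__":
    for words, densities in density_table(DOCUMENTS).items():
        print(words, "words:", densities)
```

On these figures the Max index falls from roughly 8.5 entries per 100 words for the shorter document to roughly 4.6 for the longer one, and the Min and Mid indexes thin out in similar proportion.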
Once the indexes are produced the user can view any of these in a pop-up window alongside a plain text version of the input file, with hyperlinks from the index terms to the text they refer to. The indexes can be edited in another pop-up window by adding or removing entries and rewriting index terms. The indexes can be downloaded as text files and marked-up versions of the original documents are made available in Word format for making embedded Word indexes. The actual operation is slick and user-friendly, though I had some minor problems with the pop-up windows; these were possibly due to my own PC set-up.
I uploaded two text files for analysis, one of 1655 words and one of 4646 words. The shorter document took about one minute to analyse and the longer one about three minutes. Syntactica don’t reveal their algorithms, but an examination of the results shows that the program is basically looking for noun phrases and inverting these.
The results are as you might expect.
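To make ‘looking for noun phrases and inverting these’ concrete, here is a minimal sketch of purely mechanical inversion. It is a guess at the general approach, not Syntactica's actual algorithm, which is not published; the pivot-word list and the function name are assumptions made for illustration.

```python
# Hypothetical pivot words: a phrase "X of Y" is inverted to "Y, X of".
PIVOTS = ("of", "for", "in")


def invert_phrase(phrase: str) -> str:
    """Mechanically invert a phrase into an index-style entry.

    The phrase is split at the first pivot word if there is one,
    otherwise before its last word. No attempt is made to understand
    what the phrase means.
    """
    words = phrase.split()
    if len(words) < 2:
        return phrase

    split_at = len(words) - 1  # default: the last word becomes the heading
    for i, word in enumerate(words):
        if word.lower() in PIVOTS and 0 < i < len(words) - 1:
            split_at = i + 1  # heading is everything after the pivot word
            break

    head = " ".join(words[split_at:])
    tail = " ".join(words[:split_at])
    # Capitalise only the first letter so acronyms such as NSW survive.
    return f"{head[0].upper()}{head[1:]}, {tail}"


if __name__ == "__main__":
    for phrase in ("digital cameras", "University of NSW", "treasurer of NSW"):
        print(invert_phrase(phrase))
```

A routine like this treats ‘University of NSW’ and ‘treasurer of NSW’ identically, because it looks only at word positions and never at what the phrase refers to.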
Syntactica does almost nothing that an indexer would recognise as analysis. It also makes blunders which any indexer would avoid. The most obvious of these:
- Every index entry is capitalised.
- Sequencing is in ASCII order (all upper-case letters file before lower-case letters); this is partly masked by the capitalisation but is obvious for acronyms.
- It doesn’t use index terms that are not in the text.
- It doesn’t understand synonymy; AusSI and Australian Society of Indexers are not combined.
- It can’t distinguish most multi-word descriptions from nouns + verbs: thus NSW, treasurer of is treated the same as NSW, University of.
- It has trouble with plurals: Rosella and Rosellas, Website and Websites, appear separately.
- It has problems with names: Sylvia Klienert and Margo Neal appear in the index but not Kleinert, Sylvia or Neale, Margo.
- Entries are often doubled as their own subheadings, thus: Fictionwise, Fictionwise
- It favours two-word phrases: thus we have Web, Wide but not Web, World Wide or World Wide Web.
- Odd capitalisations are imposed on non-standard words: eBookMan becomes EbooKman.
- It failed to pick up several one-line headings from a text document, e.g. ‘Current trends’ was a heading but was not indexed. It was no better at picking up headings or underlined words in a formatted Word document.
- Words after full stops are given special emphasis even where this is not appropriate: e.g. ‘www.baen.com’ is indexed under Com.
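The sequencing problem in the list above is easy to reproduce. The sketch below contrasts plain ASCII ordering with a case-insensitive sort, which is only one ingredient of conventional filing rules; the entries are illustrative and the function names are mine.

```python
def ascii_order(entries: list[str]) -> list[str]:
    """Sort by raw character codes: every upper-case letter files
    before every lower-case letter."""
    return sorted(entries)


def case_insensitive_order(entries: list[str]) -> list[str]:
    """Sort ignoring case, so acronyms interfile with ordinary words."""
    return sorted(entries, key=str.casefold)


if __name__ == "__main__":
    # Illustrative entries, not taken from an actual Syntactica index.
    entries = ["Names", "NSW", "Memory cards", "MP3"]
    print(ascii_order(entries))
    print(case_insensitive_order(entries))
```

In ASCII order MP3 files ahead of Memory cards and NSW ahead of Names, because ‘P’ and ‘S’ have lower character codes than any lower-case letter. Sorting on a case-folded key puts the acronyms where a reader would look for them.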
The sample index shown below will reveal other flaws in this approach. The inadequacy of the algorithm is obvious. But it also appears, from the first two points above, that the designers haven’t read even an elementary indexing textbook. The work involved in cleaning up a Syntactica index of any length would be far greater than the work involved in making an index from scratch. Still, when even the blurb on the ‘About Us’ page contains a misspelling, what can you expect?
Syntactica, Inc. is dedicated to the study and analysis of the linguistic structure of the English language. Syntactica, Inc. designs and produces software that understands the English language. Dedicated to the analysis of long, complex documents, the company has produced software which generates summaries, dictionaries, indices and abstracts to enable the user to quickly understand documents’ contents and determing (sic) their relevance. It is truly a unique company!!
Alas! would that it were…
Document size and index entries
Document size | Processing time | Min index size (entries) | Mid index size (entries) | Max index size (entries) |
1655 words | 1 min | 13 | 39 | 141 |
4646 words | 3 min | 22 | 67 | 212 |
Sample Syntactica MIN index to 1655-word article
Blank lines have been removed, but the index is shown here in its entirety.
B
Blackmask
Blackmask
Bookmark
specified page or
C
Cameras
digital
Colour screen
capabilities and
Com
Com and Baen Books baen
Cybereditions
Cybereditions
D
Domain
public
F
Fictionwise
Fictionwise
Fonts
L
Literature
thousands of works of
M
Memory cards
Music file
Mp3