A new thesaurus standard for the web – Jon Jermey and Glenda Browne

By Glenda: First published in Online Currents – Vol.18 Issue 4, May 2003

Information specialists who have grappled with the task of using thesauri or other controlled vocabularies for information retrieval on the Web will be delighted to hear that NISO (the United States’ National Information Standards Organization) is currently revising the standard Guidelines for the Construction, Format, and Management of Monolingual Thesauri, ANSI/NISO Z39.19. These guidelines, last revised in 1993, form an important basis for thesaurus construction and use in library and database environments, and the revision aims to make them more appropriate for use on the Web and with a wide range of online documents.

In addition, the revision will take into account the fact that thesaurus management software has changed significantly, that research has shown the need for usability testing of controlled vocabularies, and that it is now assumed that thesauri will be used for both indexing and searching (http://www.niso.org/committees/TRAG/ThesaurusAG.html).

As I read through the details of the discussion paper I thought ‘Yes, this is just what we need’. The revision is a timely and well-considered development to expand the usefulness of the standard from its traditional role to the wider arena of online retrieval, where many of the struggles are with the management of unstructured data and information on intranets and the Web. For those of us who believe that traditional library and information science approaches have value for the Web, any road map showing best practice will be much appreciated.

The revision is being coordinated by Dr Amy Warner, with an Advisory Group drawn from NISO members and the institutions funding the work. Funding has come from the Getty Foundation, the H.W. Wilson Foundation, and the National Library of Medicine. Peter Morville (an IA consultant, and co-author of Information Architecture for the World Wide Web) posted information about a survey on a number of mailing lists, and received 71 responses from members of SIG-IA, WEB4LIB, Index-L, AIFIA-members, NKOS, and SIGCR-L.

The first question asked how people had used the Z39.19 standard. Answers included: for thesaurus and taxonomy design; for metadata tagging; in education; for periodical indexing; for copy cataloguing; to introduce client groups to the principles of thesaurus construction; and for automated categorisation. Some respondents answered simply that they had not used it.

The second question asked about competing or complementary standards. A number of people mentioned new XML-based standards including XFML (for faceted classification), VocML (Vocabulary Markup Language), topic maps, and RDF (Resource Description Framework), as well as work done on the semantic Web in general. Others noted traditional library tools including Library of Congress Subject Headings (LCSH), Medical Subject Headings (MeSH), MARC, Library of Congress Classification (LCC), and the Dewey Decimal Classification (DDC), as well as the existing BSI (British) and ISO (international) thesaurus standards, including those on multilingual thesauri. One respondent noted the importance of internal corporate standards.

Organisations that were mentioned included the American Library Association (ALA), American Society of Indexers (ASI), American Society for Information Science & Technology (ASIST), Dublin Core Metadata Initiative (DCMI), Internet Engineering Task Force (IETF), National Federation of Abstracting and Indexing Services (NFAIS), Networked Knowledge Organization Systems/Services (NKOS) forum, and Open Language Archives Community (OLAC). (A field day for acronyms!)

The third question asked about relevant software products. Many of the answers suggested that the guidelines should be generic, rather than listing specific products. This is appropriate; however, the standard must not be divorced from the software capabilities it will be used with. Software categories included thesaurus management software, automatic categorisation and classification tools, and content management systems. Specific thesaurus management packages that were named include MultiTES, Term Tree, Lexico and WebChoir. In addition, relational database management systems, such as MS-Access and Oracle, were said to be important back-end components for various packages.

Suggested revisions were offered by 44 out of 71 respondents. These included:

Definitions for the following terms: controlled vocabulary; thesaurus; taxonomy; ontology; glossary; lexicon; classification system
An XML structure for thesauri
Guidance on how to create Web services that deliver or interact with thesauri (including integration of the thesaurus with the document database, and online presentation of thesauri)
Guidelines for harmonising one thesaurus with another
Information on machine-assisted generation of thesauri
Information on graphical displays, including term mapping (visualisation)
Information on alternatives (thesauri-lite)
Guidance on the development of taxonomies or category structures with a limited number of top terms, which are opened successively to show lower levels of the hierarchy
Clarification of the difference between compound terms and pre-coordination (the suggestion is that pre-coordination is more suited to the development of Web navigation schemes, and that post-coordination is better suited to Boolean searching)
Fuller discussion of the effect of removing terms from context on their usefulness for describing and finding information
Scenarios of real-world use
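
To make the pre-coordination and post-coordination suggestion in the list above more concrete, here is a minimal Python sketch. It is not taken from the standard or from the survey responses: the document ids, index terms, compound headings and function names are all invented for illustration. It assumes that post-coordination means indexing with single-concept terms that the searcher combines at query time, and that pre-coordination means the indexer builds compound headings in advance, which can then be listed for browsing.

```python
# Hypothetical sample data: document ids and index terms are invented.

# Post-coordination: each document carries single-concept terms, and the
# searcher combines them at query time with Boolean operators.
post_coordinated_index = {
    "doc1": {"thesauri", "standards"},
    "doc2": {"thesauri", "software"},
    "doc3": {"indexing", "standards"},
}


def boolean_and_search(index, *terms):
    """Return ids of documents indexed with every one of the given terms."""
    wanted = set(terms)
    return sorted(
        doc_id for doc_id, doc_terms in index.items() if wanted <= doc_terms
    )


# Pre-coordination: the indexer combines concepts in advance into compound
# headings, which can be listed as a browsable navigation scheme.
pre_coordinated_index = {
    "Thesauri -- Standards": ["doc1"],
    "Thesauri -- Software": ["doc2"],
    "Indexing -- Standards": ["doc3"],
}


def browse_headings(index, top_term):
    """Return the compound headings filed under a top-level term."""
    prefix = top_term + " -- "
    return sorted(heading for heading in index if heading.startswith(prefix))


print(boolean_and_search(post_coordinated_index, "thesauri", "standards"))
print(browse_headings(pre_coordinated_index, "Thesauri"))
```

In this toy example the Boolean search finds the one document indexed with both terms, while the browse function lists the ready-made headings under a top term, much as a Web navigation menu would.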

This revision seems well targeted to address the practical issues facing librarians and others applying their traditional skills to taxonomy construction, metadata creation and intranet and Web navigation. I am impatiently awaiting the result, which I think will be a crucial resource for taxonomists and thesaurus creators in the coming years.