By Glenda: First published in Online Currents – Vol.18 Issues 1, 2 and 3, Jan-Apr 2003
1. Principles of Classification
Automatic categorisation is the new ‘killer app’ for information access on Web sites, intranets and portals. However, is it really the solution to information overload, or is it just another promised technological fix that doesn’t deliver? This three-part article examines the state of the art in automatic categorisation. This part examines research in classification theory and its relevance to automatic categorisation. The second looks at some of the principles and practices in automatic categorisation, while the third focuses on specific software products.
Automatic text processing research has been around for a long time, and people often proclaim success with automatic indexing, but somehow practical applications are rarely developed. It turns out that language is much more complex than people realise, and, although academic projects within limited domains might have been successful, these approaches just didn’t work in the real world. The potential benefits of automatic text processing, however, are too significant to be ignored, and work has continued, with automatic methods having some success in some areas.
The size of the Web has created information access problems, and potential, on a scale larger than anything considered before. There are those who say the Web is far too big to be manually indexed, while others proclaim that it is far too big for totally computer-based methods to be effective. Solutions to date have offered a bit of each approach, with much access through search programs such as Google, and some through manually-created directories such as Yahoo (which also offers Search) and more subject-specific directories such as AustLII and SOSIG. Metadata – i.e. keywords embedded in Web pages – has been proposed as an indexing method for the Web, but it hasn’t taken off, particularly as most search engines give it little weighting, or don’t take it into account at all. This is largely because metadata is susceptible to misuse: people have used irrelevant metadata terms as a way of bringing more users to their sites (spamming). Metadata creation is also time-consuming, skilled work. In addition, unless a significant number of sites include metadata, it is of limited use.
Corporate intranets also suffer from information overload and access problems. Metadata has a much bigger part to play here than on the Web, as it can be targeted to specific user groups, and should be much less susceptible to spamming. Nevertheless, there is a feeling that new, better solutions are needed, and one of the major proposals is the use of automatic categorisation to group content in meaningful ways, to make it more accessible to users. These methods are not mutually exclusive, however, and quality metadata can make automatic categorisation methods much more effective.
Automatic categorisation software is available with a number of content management and portal software systems. The software can perform two separate functions: it can create a taxonomy from the content on a site, and it can populate that taxonomy with links to specific bits of content. Different software packages allow these parts to be integrated in different ways, and most now allow for human checking and editing of the automatic results. Some proposed uses of automatic categorisation are to generate the navigation structure of intranets (or multiple navigation approaches for different users), to categorise the results of searches on the Web, and to organise news feeds.
This area is dynamic, and companies that I started studying a few weeks ago have now been taken over, with software offerings being integrated or subsumed. So, more than in almost any other field, you will have to keep looking for the most up-to-date information on this subject.
The definition I use is that ‘taxonomies are controlled vocabularies with hierarchical structures that are used to organise electronic content for information access. They include thesauri, although they have not always been created with the analytical rigour and consistent description that is applied to thesauri.’ I do not use taxonomy to include the content as well as the structure within which the content has been organised, which Gilchrist (1) has noted as one usage in the field. The trouble with such a broad definition is that it leaves us without a word for the taxonomy structure alone. I do, however, include the language and business rules that are associated with terms in the taxonomy (in the same way that a thesaurus includes scope notes for terms within it).
Automatic categorisation is the use of computer software to allocate content into logically useful categories. The first step is usually to create the taxonomy, or structure of terms, that apply to the content. However, with statistical clustering techniques, sometimes a topological map is generated instead of a taxonomy. The second step is to make links from terms in the taxonomy to content relevant to those terms. The content can then be displayed in logical groups. For example, if the content on an intranet is automatically categorised, this grouping can then be used as the navigation structure of the site. If results from a search are automatically categorised, the hits can be presented to the searcher in logical groups. (For example, if you searched on ‘terror’, a logical grouping of resulting hits might include the categories ‘world affairs’, ‘literature’ and ‘psychology’.)
Figure 1. Topological Map Generated for a Search for ‘bears’ at http://www.kartoo.com
Automatic categorisation is treated as a synonym of automatic classification, automatic indexing, and automatic tagging by Peter Morville (2). To me, classification is a more formal structure, and my library training makes me think of notations, but either term can be used. Automatic indexing and tagging are appropriate terms when the automatic categorisation software allocates metadata keywords to content for use in grouping the topics. When searching for information on automatic categorisation, it is important to remember the alternative spellings and forms (categorization and automated), the shorter form (auto-categorisation), and related terms, such as unstructured data management, automatic text processing and text mining.
Research into Classification Theory
To fully consider automatic categorisation it helps to understand how humans categorise information, and what this teaches us about words, their meanings, and the relationships between them.
Classical and Family Resemblance Categories
Steven Pinker (3) tells us that people think in categories such as ‘furniture’ and ‘vegetables’. These categories underlie much of our vocabulary, and much of our reasoning. Traditionally, people considered categories as ‘classical’, or ‘Aristotelian’, and assumed they could be based on logic and definitions.
These rule-based categories were challenged by the philosopher Wittgenstein, who used the example of finding what characteristics things called ‘games’ had in common. He found that there was not one common thread, but instead a series of similarities and relationships, linking all games, but allowing many differences as well, e.g. most games involve competition, but solitaire (patience) doesn’t. It can be hard, or impossible, to find one definition that encompasses all members of a category. For example, lizards have legs and snakes don’t, but what about legless lizards?
In addition, not all members of a category are created equal – a kookaburra is a more representative example of a bird than the flightless emu. The best example in a category – the one that sums up the group in people’s minds – is the prototype. For example, if someone says they need a Valium you know how they are feeling better than if they long for Lorazepam, because Valium is the prototype benzodiazepine anti-anxiety drug – it is the one that represents the category in people’s minds.
Categories also have fuzzy borders – for example, most people agree that carrots are vegetables, but what about parsley or garlic powder? A school funding decision in the US hinged on whether ketchup (tomato sauce) was a vegetable or not. Categories also have stereotyped features – traits that are associated with the category, even if they are not necessary for membership of the category. So flying is stereotypically associated with the category ‘birds’, even though not all birds can fly.
Two psychologists used the most ‘classical’ categories they could find, and asked students to rate examples. Subjects rated ‘7’ as a better odd number than ‘447’, and ‘housewife’ as a better example of woman than ‘policewoman’. This is not to say that the subjects didn’t know that 447 is an odd number, and that a policewoman remains totally female. It appears that people are able to turn fuzziness of categories on and off as the need demands – they know that an emu is a bird, but they are also aware that it lacks some of the stereotypical features of birds. Family resemblance categories and classical categories seem to live side by side in people’s minds as alternative ways of understanding the world.
Implications of This Research
Two sorts of categories exist – ‘classical’ categories, for which membership rules can be written, and ‘family resemblance’ categories, for which there are general relationships but no absolute rules, and in which there are fuzzy areas where people disagree about whether something is a member of the category or not. These distinctions mean there are some areas in which automatic categorisation will work better than others, and there might have to be different approaches to the automatic categorisation of different sorts of content (see the example about cars in the rule-based categorisation section below).
With ‘family resemblance’ categories, we have to realise that there are no clear-cut rules for membership of the category, and that decisions people make at the ‘fuzzy’ edges will often not match those made by either human or automatic categorisers. There are two possible approaches to this. One is to try to generate categories that suit all possible users. Automatic categorisation into many different views is possible, although if one size fits all is difficult, three sizes will still be problematic; in addition, the management of multiple hierarchies for different users could be an administrative nightmare. The other is to let users know that categorisation is a fuzzy science, and that if they don’t find what they need in one category, they should try another. We can also try to give as many paths as possible to the information, even if we only use one overall hierarchy.
The fact that we all have different approaches to categorisation has been brought home to me. I was preparing some training notes and asked a 20-year-old to sort some cards into piles that he thought would make logical groupings for a Web site offering information about a local area. He put the Conservation Society, plant nursery, and bowling green in one group, having picked up the similarity that they were all green and environmental. He hadn’t thought to group the bowling green with the squash club in a sports category, although this was the obvious grouping to me. This experience showed me how different an information specialist’s approach can be from that of the users we are trying to serve, and how critical it is to test the schemes we create with real users.
These issues apply equally to manual and automatic categorisation, and in both cases user testing of categories is crucial. With automatic categorisation it is important to do a human evaluation of results, as the computer doesn’t think about the appropriateness of the work it is doing in the way that a human specialist would.
Folk Classification and Access, and Basic Level Terms
Marcia Bates (4) discusses linguistic and anthropological research on classification and approaches to information access. ‘Folk classifications’ – categories used for plants, animals, colours and so on – have been found to have consistent characteristics across different cultures. These taxonomies include from 250 to 800 terms, focus on the generic level (e.g. ‘monkey’, rather than ‘howler monkey’ or ‘primates’), and usually have a shallow hierarchy.
Other research has confirmed the importance of generic, or basic level, terms. While it is not so easy to identify basic level terms for real-world examples, Bates suggests that we would probably find people using basic level terms, rather than the broadest or narrowest terms, while searching.
Bates also discusses ‘folk access’ – the way people access information systems. Focus groups have found that users were positive about the idea of some sort of classification tree available at the start of their search, and that they often start with a broader question than the one they really need an answer to. In a real-life reference interview in a library this can be a way of making some contact with the librarian who is helping them; in an electronic search it can be a way of working out the context of the search and getting a feel for the system and its contents.
Implications of This Research
Folk classification research has implications for the number of categories and level of hierarchy that might be optimal for online information systems. Systems might aid users by providing a hierarchy from which they can select terms to start their search, or in which they can review search results.
One of the difficult issues to address is that searchers often use broader terms than those which suit their information needs. It would, therefore, be useful if search results allowed users to drill down to narrower topics in the hierarchy when the results they retrieve are too broad. This could be done by showing them a hierarchy from which they can select new search terms, or by automatically broadening a search by ‘exploding’ it to include narrower terms. (‘Explode’ is a very useful option in Medline searching, as it automatically includes all narrower terms in a search.)
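The mechanics of ‘exploding’ are simple enough to sketch. The fragment below uses an invented fragment of a subject hierarchy (the terms are illustrative, not taken from Medline) to show how a broad search term can be expanded to include everything narrower than it:

```python
# Hypothetical fragment of a subject hierarchy: each term maps to
# its immediate narrower terms.
NARROWER = {
    "heart diseases": ["arrhythmia", "myocardial infarction"],
    "arrhythmia": ["atrial fibrillation"],
}

def explode(term):
    """Return the term plus all narrower terms, recursively."""
    terms = [term]
    for narrower in NARROWER.get(term, []):
        terms.extend(explode(narrower))
    return terms

# A search on the broad term now also covers the narrower topics.
print(explode("heart diseases"))
```

A document indexed only under ‘atrial fibrillation’ would then still be found by an exploded search on ‘heart diseases’.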
The Inktomi Search CCE Module (Content Classification Engine) (5) has taken the research into basic level categories into account, and builds its hierarchy up and down from basic-level categories in the middle. (It is unclear what will happen to this technology, as parts of Inktomi have just been taken over by Verity).
It is not really possible to solve this problem from the indexing side, because if the indexer allocates only the broader terms, the person who needs a specific search will not be catered for. Part of the solution may also be training, as for some searches it is a waste of time to search broadly first and drill down through information in context. For example, at a Business Support centre someone phoned to find out how high a window had to be to be considered ‘accessible’ with respect to Home Contents insurance risk (the lower a window, the more risk of a break-in). She was told that she could have found the answer in the Online Help under ‘accessible windows’ and also under ‘windows:accessible’. She had obviously looked up a much broader term such as ‘security’, assuming that nothing as specific as ‘accessible windows’ would be indexed.
Importance of Categories in Information Access
Some of the general research discussed above has shown how people naturally create categories in their daily life, and may search on a broader term than the one that actually describes their topic.
When looking for information on a site, Jakob Nielsen (6) found that about half of all users were search-dominant (that is, they preferred to find information using search engines), about a fifth were link-dominant (they preferred to follow links and browse through the hierarchy of the site by selecting appropriate categories one at a time), and the rest showed mixed behaviour. This suggests that for 50% of users the browsing hierarchy is important at some time.
Other writers (often those promoting categorisation software) have stressed the importance of categories in electronic information access (see, e.g. Bryar (7) and the Delphi Taxonomy report (8)). On the other hand, Marcia Bates has pointed out that faceted classification may be more useful than traditional hierarchical categories, so this is another thing to take into account (9).
It is thus apparent that appropriate categories are important for electronic information retrieval, although they are not the only option, and need to be considered in conjunction with other access methods that are offered. One of the most important applications of automatic categorisation might be in the grouping of search results, thus combining the strength of both approaches.
What Else We Need
When I read that infoglut is such a problem, and that automatic categorisation is a solution to help us manage this information, my first thought is ‘cull’. There is so much rubbish on intranets, and even more so on the Web, that the ideal first step in its management would be to get rid of a good proportion of the content, allowing the remainder to be managed more efficiently. Well, we can’t cull the Web, but attention to quality and user need should be a top priority in intranet policies. Karen Bishop (10) has written an overview which acts as a guide for creating a quality corporate intranet.
Information access is also enhanced if the content being searched has been written according to consistent styles and standards. So, if all writers on an intranet write ‘millennium bug’ you will get more consistent search results than if some write ‘millennium bug’, others ‘millenium bug’, others ‘Y2K’, ‘Y2K bug’, ‘Y2K problem’, and so on. (Of course synonym tables and other devices behind the scenes help deal with some of this variety, but it can be hard to catch all the variants and errors). It is also important for authors to include keywords in the key parts of the document (such as titles and introductory sentences) as these are often given higher weighting than the remainder of the document when statistical (and human) indexing methods are used.
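A behind-the-scenes synonym table of the kind mentioned above can be sketched in a few lines; every query term (and, ideally, every indexed term) is mapped to one canonical form before matching. The entries are illustrative only:

```python
# Hypothetical synonym table mapping variants, abbreviations and a
# common misspelling to one canonical search term.
SYNONYMS = {
    "millenium bug": "millennium bug",   # misspelling
    "y2k": "millennium bug",
    "y2k bug": "millennium bug",
    "y2k problem": "millennium bug",
}

def normalise(term):
    """Map a term to its canonical form; unknown terms pass through."""
    term = term.lower().strip()
    return SYNONYMS.get(term, term)

print(normalise("Y2K bug"))   # same documents as 'millennium bug'
print(normalise("portal"))    # unknown term, unchanged
```

The hard part, as noted above, is not the mechanism but compiling a table that catches all the variants people actually use.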
Human taxonomy creation and categorisation can be an important tool in an organisation to encourage many stakeholders to think about the language they use and the content they create and offer to staff. If automatic methods are used, it is important to ensure that consensus gathering and other forms of feedback and standard setting continue on a human level. On the other hand, trying to reach consensus from stakeholders barracking for a major spot on the intranet home page can be an impossible task. Saying ‘the computer did it’ could remove some of the personality problems from the issue. The computer is the only organiser that can be said to truly reflect the content with no bias, except that which comes from the people who control the computer settings.
In addition, while it might be hard to quantify, indexers, categorisers, metadata taggers and the like add value to documents by being ‘final readers’ – unofficial proofreaders who pick up inconsistencies, particularly of language. If computers do all the tagging, someone else might have to help with editing and quality control.
This article has examined basic principles of classification, many of which are highly relevant to the modern problem of too much information, too little of it relevant and useful. The next two articles look at automatic categorisation and the software that is available for this task.
I would like to thank Derek Jardine, of Information Solutions, and Roger Browne, for their comments on this series of articles.
(1) Gilchrist, Alan. ‘The corporate taxonomy – the latest tool in the battle against information overload.’ Bulletin 100 of the Records Management Society of Great Britain. December 2000. http://www.rmsgb.org.gb (only available online to members)
‘A taxonomy aspires to be a correlation of the different functional languages used by the enterprise:
- to support a mechanism for navigating, and gaining access to, the intellectual enterprise
- by providing such tools as portal navigation aids, authority for tagging documents and other information objects, support for search engines, and knowledge maps
- to become a knowledge base in its own right.’
(2) Morville, Peter. ‘Strange connections: software for information architects.’ 19 February, 2001. http://argus-acia.com/strange_connections/current_article.html (Sighted 19 November 2002)
(3) Pinker, Steven. Words and rules: the ingredients of language. London: Weidenfeld & Nicolson, 1999, p270-275.
(4) Bates, Marcia J. ‘Indexing and access for digital libraries and the Internet: human, database and domain factors.’ http://www.gseis.ucla.edu/faculty/bates/articles/indexdlib.html. A previous online draft without figures at http://dlis.gseis.ucla.edu/research/mjbates.html was dated 1996. Also published in JASIS v49 (November 1998): 1185-1205. (Sighted 25 November 2002)
(5) Underwood, Walter. ‘The philosophy behind Inktomi Search CCE Module: why topics inherit upwards.’ http://www.inktomi.com/products/search/products/ultraseek/cce/ccephilosophy.htm (Sighted 16 November 2001)
(6) Nielsen, Jakob. Designing Web usability. Indianapolis, Indiana: New Riders Publishing, 2000, p.224.
(7) Bryar, JV. ‘Taxonomies: the value of organized business knowledge: a white paper prepared for NewsEdge.’ 2001. http://www.newsedge.com/materials/whitepapers/taxonomies.pdf (Sighted 25 November 2002)
(8) Taxonomy & content classification: market milestone report. Available as a guest download at http://www.delphigroup.com/coverage/taxonomy.htm (Sighted 25 November 2002). [Note that companies paid to be included in this report. It is also available from many sites with details for just one of the companies]
(9) Bates, Marcia. After the dot-bomb: getting Web information retrieval right this time. 12 July, 2002. http://firstmonday.org/issues/issue7_7/bates/index.html (Sighted 25 November 2002)
(10) Bishop, Karen. Information as an intellectual asset: a sample strategy for implementing a corporate intranet. http://www.oneumbrella.com.au/strategy.html (Sighted 21 July 2001)
Automatic categorisation is the new ‘killer app’ for information access on Web sites, intranets and portals. But is it really the solution to information overload, or is it just another promised technological fix that doesn’t deliver? This, the second in a series of three articles, examines the state of the art in automatic categorisation. The first examined research in classification theory and its relevance to automatic categorisation. This one looks at some of the principles of automatic categorisation, while the third will focus on specific software products.
Automatic categorisation is offered as part of the following software applications:
- content, document and knowledge management systems
- search and retrieval software
- categorisation, taxonomy generation and data visualisation software.
With the expansion of services and acquisition of companies with complementary offerings, the boundaries between these groups are not that clear any more.
Most automatic categorisation software packages promise effective categorisation of content from unstructured data – e-mails, text documents (Word and PDF), and PowerPoint presentations – and sometimes from structured data, such as SQL and Oracle databases, as well. This means that the format the information is in doesn’t limit access to it. Much automatic categorisation software also offers the option to automatically generate the taxonomy structure, although some software packages provide a taxonomy as a starting point.
Some of the packages offer extra features, such as real-time delivery of organised information via telephone. (I hope the companies purchasing this software have a psychologist on the team to compare the benefits of access to news 24/7 with the psychological disadvantages of never being able to get away from the office). Other offerings provide the information visually, with topics named, and then joined to other relevant topics – the closer they are in space, the more related they are. I will not be discussing these data visualisation applications in detail (1).
Categorisation can be used to generate navigation structures for intranets and portals, including alternative views for different user groups. Other significant uses are for the organisation of news, where automation is important in coping with a continual influx that has to be findable immediately, and for organising results from Web and other electronic searches, making it easier for users to sift the wheat from the chaff. The information specialist with a messy hard drive need not feel ignored – there is even software that can automatically tidy your computer files (2).
Automatic Taxonomy Generation
Taxonomies can be automatically generated using statistical methods; however, the results are often unacceptable, and need a lot of human editing. On automated category generation, Peter Morville (3) had this to say (February 19, 2001): ‘Proceed with great caution! The demos we’ve seen produce truly confusing category schemes with tremendous redundancy and mixed granularity.’
Many companies have found it cost-effective to manually create a taxonomy to ensure a sound framework, and then to automatically or semi-automatically populate it. For example, Alan Gilchrist (4) notes that Arthur Andersen, Ernst & Young and PriceWaterhouseCoopers have all constructed taxonomies manually, saying ‘all are labour-intensive operations, both in the consultation and consensus forming process and in the hand-crafted nature of the actual taxonomy compilation.’
Most vendors now offer the opportunity for people to use a pre-existing taxonomy, or to edit the taxonomy after it has been automatically generated. TopicalNet, Applied Semantics, and H5Technologies provide an initial taxonomy with a predefined world knowledge, to be used as a starting point for automated methods.
The example in the Verity white paper (5) does nothing to inspire confidence in the possibilities for the fully automatic generation of taxonomies. It is a ‘category hierarchy’ (taxonomy) that was automatically created by Verity’s Thematic Mapping for the San Jose Mercury News articles. First the document collection was specified, then dominant themes were extracted, labelled and organised into a hierarchy. The sample includes the following hierarchy:
The other sub-categories under ‘soviet_president’ are ‘yeltsin’ and ‘nuclear’ (white paper written July 2002). The paper makes no comment on the quality of the taxonomy, yet it seems unsatisfactory. It is possible that a little bit of editing would improve it (e.g. maybe ‘bush’ should be rewritten as ‘relations with US’), but there still seem to be so many inappropriate categories (e.g. ‘yes’), and so many missing categories that nothing short of a line-by-line revision could salvage it, by which time you might as well have paid to do it properly in the first place. Before making final conclusions it would be necessary to evaluate the taxonomy in context – probably Bush is also listed under US Presidents (maybe along with Yeltsin and Howard?).
My view is that it is hard to imagine the fully automated taxonomy being feasible for projects working with mission-critical information, although it may be appropriate for less-important material, for sub-categories (once the major groupings have been created manually), or for rough-and-ready organisation of search results into more manageable groups.
On the other hand, it has been suggested to me that, since this hierarchy automatically picks up the current areas of relevance to Soviet Presidents, the fact that it is not logically constructed is not necessarily a disadvantage. It may be that ‘housing’, ‘cuomo-trade’ and ‘buchanan’ are the three areas which are currently important between Bush and the Soviet President, and that the other topics that could logically belong there but haven’t been identified as relevant by the program would actually form clutter. This reasoning could possibly apply to the management of recent news.
It is a benefit that automatically generated taxonomies are instantly generated, readily kept up to date, and derived from the text they categorise. And although these taxonomies may have ‘tremendous redundancy and mixed granularity’ (as Morville noted), the research discussed in Part 1 of this series showed that people often search at mixed granularity and with a wide range of keywords.
It seems to me that user testing with real data is needed to make valid conclusions about the value of these taxonomies.
Automatic categorisation relies on the ability of computers to process enormous amounts of information. Some approaches follow explicit rules, others learn from ‘exemplar’ documents provided to them, and others use statistical clustering to group related information. The value of each approach depends to some extent on the nature of the documents being categorised, and most software packages now use a combination of approaches. Katherine Adams’ and Thomas Reamy’s articles give good overviews of classification technology (6 and 7), and Thomas Reamy’s includes a hypothetical costing, which challenges the much-preached statement that automatic categorisation is necessarily cheaper than human categorisation.
Rule-based categorisation allocates content to categories depending on expert-generated predefined rules noting the presence or absence of specific words in the content. For example, documents with the words ‘Alice Springs’ and ‘Wollongong’ could be listed in a category called ‘Places, Australia’. Extra rules would be needed for place names such as ‘Newcastle’ and ‘Wellington’ to determine which country those references applied to. Rules can also apply to business policies, e.g. ruling that only Microsoft Powerpoint documents generated after 1 January 2002 should be allocated to a certain category.
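A minimal sketch of this kind of rule, using the place-name examples above, might look like the following. The rule details are invented for illustration; commercial packages use much richer rule languages:

```python
# Each rule assigns a category if any of the 'any' words is present
# and none of the 'none' words is (a simple disambiguation device).
RULES = [
    {"category": "Places, Australia",
     "any": ["alice springs", "wollongong"], "none": []},
    {"category": "Places, Australia",
     "any": ["newcastle"], "none": ["tyne", "england"]},
]

def categorise(text):
    """Return the set of categories whose rules the text satisfies."""
    words = text.lower()
    cats = set()
    for rule in RULES:
        if any(w in words for w in rule["any"]) and \
           not any(w in words for w in rule["none"]):
            cats.add(rule["category"])
    return cats

print(categorise("Flights from Wollongong to Alice Springs"))
print(categorise("Newcastle upon Tyne, England"))  # excluded by the rule
```

The second example shows why the extra rules mentioned above are needed: ‘Newcastle’ alone is ambiguous, so the rule writer must anticipate the disambiguating context.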
It seems that rule-based categorisation should work well for the ‘classical’ categories discussed by Pinker (8), where clear rules describe categories with well-defined boundaries. It would not work so well for the fuzzy ‘family resemblance’ categories, for which it is difficult to write rules that include all relevant examples, and exclude those that don’t belong. These ideas have been discussed in more detail in part 1 of this article.
Basically, a category such as ‘odd numbers’ is a classical category, as a simple rule defines membership of the class. A category such as ‘vegetables’, on the other hand, is a ‘family resemblance’ category, in which some items definitely belong, while for others we are unsure. For example, carrots are definitely vegetables, but what about parsley and garlic powder? Try writing rules for membership of the categories birds, or doors, or books (don’t forget e-books) and see how you go. If you have difficulty defining a category it is probably a family resemblance category, not a classical category.
Verity’s Intelligent Classifier is an example of a program that uses rule-based categorisation (among other techniques).
Learning by Example/Categorising by Example
While rule-based categorisation might suit ‘classical’ categories that can be well-defined, learning by example can be better for ‘family resemblance’ categories, as the computer identifies common threads and relationships, but doesn’t expect every member of a class to fulfil rigid requirements (although rules still have to be generated and applied by the computer, so that it can sort documents into appropriate categories).
In learning by example methods, a set of ‘exemplar’ (i.e. typical) documents, is associated with each category, and the software automatically learns rules that define each category. With some systems, including Verity, both positive and negative exemplars can be used, resulting in greater accuracy. For example, for a category ‘Java’, the programming language, you could use as negative exemplars articles that discussed Java, the Indonesian island.
Many programs use Bayesian analysis (based on a probability formula developed by Thomas Bayes) for identifying patterns in sample documents and making predictions about unseen text. In addition, pattern analysis components of software allow it to differentiate between words with multiple meanings, using keywords, sentence structure, word length and other textual features to determine patterns. The programs learn through iterative processes, gradually refining their understanding of a concept.
Mohomine’s MohoClassifier, Inxight’s Categorizer and Autonomy use learning by example. Verity and GammaSite use SVM (Support Vector Machines), a variation on learning by example.
Statistical Clustering
With statistical clustering, similar concepts are grouped together using statistical and linguistic algorithms. Clusters of closely related documents are identified using methods such as co-occurrence of terms or neural networks, and are assigned to a category. Reamy (7) has pointed out that this is the only truly automatic classification approach, since humans must set up rules for rule-based categorisation, and training sets for learning by example. It can, however, be used with human editors and/or pre-existing taxonomies.
Statistical clustering is more dependent on the underlying data for its categories than are the other types of software. This perhaps makes it more responsive to changing needs, but less consistent and generally useful.
Semio (now Entrieva), Autonomy and Mohomine use statistical clustering.
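A bare-bones clustering sketch shows why no human input is needed: documents whose term vectors are similar (here, by cosine similarity over raw word counts) simply end up grouped together. The similarity threshold and the greedy single-pass scheme below are arbitrary simplifications; real products use richer linguistic features and more sophisticated algorithms.

```python
import math
from collections import Counter

# Statistical clustering sketch: no rules, no training sets - documents
# are grouped purely by similarity of their word-count vectors.

def vector(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

def cluster(docs, threshold=0.3):
    clusters = []  # each cluster is a list of (doc index, vector)
    for i, doc in enumerate(docs):
        v = vector(doc)
        for c in clusters:
            if cosine(v, c[0][1]) >= threshold:  # compare to cluster seed
                c.append((i, v))
                break
        else:
            clusters.append([(i, v)])
    return [[i for i, _ in c] for c in clusters]

docs = [
    "polar bears hunt seals on the ice",
    "grizzly bears and polar bears in the wild",
    "stockmarket bears expect prices to fall",
    "falling prices cheer stockmarket bears",
]
print(cluster(docs))  # [[0, 1], [2, 3]] - animals vs stockmarket
```

The categories emerge from the data itself, which is exactly why the approach is responsive to changing content but less consistent than human-designed taxonomies.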
Automatic Grouping of Search Results
As well as being used to create a navigation structure for the whole site, categorisation can be used to limit a search to a subset of relevant topics, or to group the results of a search, making it easier to select the relevant pages. Dumais and Chen (9) have worked in this field using support vector machine (SVM) technology, in which test items are sorted to different sides of a plane, depending on their relevance.
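The idea of sorting items to different sides of a plane can be illustrated with a simple linear classifier. The sketch below uses a perceptron, which finds some separating line; a real SVM finds the separator with maximum margin, which generalises better to unseen documents. The word-frequency features are invented for illustration.

```python
# Documents are mapped to points (here, two word frequencies) and a
# line is learned that puts each class on a different side - the
# two-dimensional analogue of the SVM's separating plane.

def train_perceptron(points, labels, epochs=20):
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(points, labels):   # y is +1 or -1
            if y * (w[0] * x1 + w[1] * x2 + b) <= 0:  # misclassified: nudge the line
                w[0] += y * x1
                w[1] += y * x2
                b += y
    return w, b

def side(w, b, point):
    return 1 if w[0] * point[0] + w[1] * point[1] + b > 0 else -1

# Hypothetical features: x = frequency of 'market', y = frequency of 'fur'
points = [(5, 0), (4, 1), (0, 5), (1, 4)]
labels = [1, 1, -1, -1]   # +1 = stockmarket bears, -1 = animal bears
w, b = train_perceptron(points, labels)
print(side(w, b, (6, 1)))   # a market-heavy document lands on the +1 side
```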
The Inktomi white paper ‘Best practices for search and categorization’ (10) shows a neat example in which five results from a search on ‘bears’ have been automatically grouped into categories – 2 hits for the football club, 2 hits for the hairy creatures, and 1 hit for stockmarket downward trends. This is a nifty solution that lets the searcher go straight to the subset of articles of interest. However, when I tried a Google search on ‘bears’ and examined which categories the hits might belong to, the results weren’t quite so tidy. The top five hits were:
1. The official site of the Chicago Bears
2. The Official Berenstain Bears Website
Arts>Literature>…>Authors>B>Berenstain, Stan and Jan
3. Bears.Org, Bear Information and Resources [biology, mythology, physiology about all species of bears]
4. Brown Electronic Article Review Service (BEARS) (online forums on philosophy)
5. Resources for Bears – Welcome [directory of services, homepages and clubs]
Society>Gay, Lesbian and Bisexual>Gay Men>Bears
So, instead of three neat, fairly predictable categories, as in the example, we get one hit each about animals, children’s literature, philosophy, gay lifestyles and football. (The next five continued the diversity – another on animals, then two on teddy bears, one on the Hershey Bears (hockey), and one on The Country Bears, a Disney comedy.) You could hardly get a better example of the varied results retrieved for a seemingly simple search, along with the potential benefits if you could categorise them well.
You can further explore categorisation concepts on Google by selecting ‘Similar pages’ next to a result of interest. When I selected ‘Similar pages’ next to a hit about polar bears, all of the top ten in the next result set were about polar bears. This is a quick way of generating your own categories.
Part of the answer to mixed search results with a lot of irrelevant material is to teach users to search more effectively. A search on ‘gays bears’ is crucial if those are the bears being sought. This defaults to ‘gays AND bears’, meaning both terms must be present in the document. And a search for stockmarket bears could either include the word ‘stockmarket’ or some other specific term such as ‘Australia’ or ‘BHP–Billiton’. Unfortunately the ‘search better’ approach is limited, as the words you want to use to limit the search often have multiple meanings themselves. Limiting a search with ‘shares’ or ‘stocks’ did not work nearly as well as using the term ‘stockmarket’. And colourful language means that combining ‘bears’ with ‘children’ retrieves as the top hit (on 22 November 2002) ‘Apache – IBM marriage bears children’. The other problem with using search terms to limit a search to a specific domain is that, while this increases precision (the hits are more likely to be relevant), it decreases recall (you will miss some relevant hits).
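The precision/recall trade-off at the end of that paragraph comes down to two simple ratios, sketched below with made-up numbers for the ‘bears’ example:

```python
# Precision: what fraction of the retrieved hits are relevant?
# Recall: what fraction of all relevant pages were retrieved?

def precision(relevant_retrieved, retrieved):
    return relevant_retrieved / retrieved

def recall(relevant_retrieved, relevant_total):
    return relevant_retrieved / relevant_total

# Hypothetical figures: 100 relevant stockmarket pages exist in the index.
# Broad query 'bears': 1000 hits, 90 of them relevant.
# Narrow query 'stockmarket bears': 60 hits, 55 of them relevant.
print(precision(90, 1000), recall(90, 100))  # broad: low precision, high recall
print(precision(55, 60), recall(55, 100))    # narrow: high precision, lower recall
```

Adding qualifying terms moves you along this trade-off: fewer irrelevant hits, but some relevant pages that lack the qualifying term are lost.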
Figure 1. Search on ‘Automatic Categorisation’ in Vivisimo.
Vivisimo.com is another Web site at which you can experiment with categorisation. It is a metasearch engine that groups search results. Here a search on ‘bears’ results in some useful categories, with other more dubious ones. A selection is given below (the number in parentheses shows the number of hits for that category):
- Teddy Bear (66)
- Business (18)
- Black Bear (13)
- Country (12)
- Box, Toys (5)
- Cats (4)
- North American (4)
- Bear in Big Blue House (4)
The final category ‘Other’ includes http://www.bearinthebigbluehouse.com/mainbody.html, so the system is obviously not perfect yet, although it offers great potential for enhancing search results. You lose nothing but screen space because you don’t have to follow the categories, but they are there if you want them.
While taxonomy generation and taxonomy population are offered as a package these days, the potential of each differs. Automatic taxonomy generation seems to offer less hope, and a manual or partly automated solution is needed for high-quality output. Automatic categorisation is based on extensive research into computer statistical methods, and is developing greater sophistication. These programs offer the potential for rapidly organising large quantities of content in an acceptable manner.
Peter Morville (3) has noted that automatic methods work best with full-text document collections, can’t index images, applications, or other multimedia (most of them), do not adjust for user needs or business goals, and do not understand meaning. But if you can live with those limitations, and you are suffering from infoglut, it is worth considering an automatic, or partly automatic solution.
The next, and final, part of this series looks at some of the software packages available, and the features they offer.
(2) ‘Enfish Personal’ [formerly Enfish Onespace]. http://www.enfish.com/desktop/desktop_personal.asp (Sighted 5 February 2003)
(3) Morville, Peter. ‘Strange connections: software for information architects’ 19 February, 2001. http://argus-acia.com/strange_connections/current_article.html (Sighted 5 February 2003).
(4) Gilchrist, Alan. ‘The corporate taxonomy – the latest tool in the battle against information overload.’ Bulletin 100 of the Records Management Society of Great Britain. December 2000. http://www.rms-gb.org.gb (only available online to members)
(5) ‘The ABCs of content organization’. July 2002. http://www.verity.com/pdf/white_papers/MK0391a_ContentOrg_WP.pdf [a Verity white paper] (Sighted 5 February 2003)
(6) Adams, Katherine C. ‘Word wranglers.’ http://www.intelligentkm.com/feature/010101/feat1.shtml (Sighted 5 February 2003)
(7) Reamy, Thomas. ‘Cyborg Categorization: the salvation of search? Part 1.’ Intranet Professional. v.5, n.1, January/February 2002. http://www.infotoday.com/IP/jan02/reamy.htm (Sighted 5 February 2003)
(8) Pinker, Steven. Words and rules: the ingredients of language. London: Weidenfeld & Nicolson, 1999, p270-275. (Discussed in more detail in Part 1 of this series)
(9) Dumais, Susan and Chen, Hao. ‘Hierarchical classification of Web content’. http://research.microsoft.com/~sdumais/sigir00.pdf (Sighted 5 February 2003)
(10) ‘Best practices for search and categorization’. [Inktomi white paper]. http://programs.inktomi.com/mk/get/webwp (Sighted 5 February 2003)
Automatic categorisation is the new ‘killer app’ for information access on Web sites, intranets and portals. But is it really the solution to information overload, or is it just another promised technological fix that doesn’t deliver? This three-part article examines the state of the art in automatic categorisation. The first part examined research in classification theory and its relevance to automatic categorisation. The second looked at some of the principles of automatic categorisation, while this one focuses on specific software products.
Below I will discuss features of some of the major categorisation software packages, with a general comparison at the end. References are given with each company, with general references only at the end.
Auto-Categorizer from Applied Semantics allows you to create your own unique taxonomy, then map categories from the taxonomy to concepts in their ontology. Their ontology contains 1.2 million terms derived from sweeping the Web. All input and output is through XML. Typical costs range from US $140,000 to US $160,000.
http://www.topic.com.au/products/autonomy (select White Paper and Executive Brief – EB)
Autonomy provides software for enterprise portals and customer relationship management (CRM) systems. Features include automatic categorisation and taxonomies, automatic clustering, and multilingual access. They seem to be one of the few companies still focusing on automatic methods without significant human intervention, although they do allow reorganisation of categories ‘with just one simple click’.
Autonomy uses automatic categorisation to alert users to new information pertinent to their profiles. Automatic clustering is used to identify ‘what’s hot’ (areas of current interest within the organisation), breaking news, and information gaps – clusters of user interests that do not match clusters of information content. Early this year Autonomy launched its Eduction Module which automatically forms complex metadata. (Educe means to draw out or elicit – a very erudite name that unfortunately looks like a typo every time I see it).
Figure 1: Automatic Taxonomy Generated by Autonomy
http://www.endeca.com (Click the link called ‘View demos’ for a multimedia demonstration – it needs Flash which can be downloaded from the site.) Endeca is a young company, founded in 1999, and boasting 7 successful customers in January 2002. It provides the Endeca Navigation Engine in retail packages (Endeca InFront) and enterprise analytic packages (Latitude). Retail packages cost from US $75,000 to US $250,000 per year, and enterprise packages start at US $150,000 per year for a three year term.
Guided Navigation puts search results in useful groups, and provides links for refinement of searches. For example, if you search the music section for ‘Mozart’ at http://www.towerrecords.com, you retrieve a list of matches categorised in groups, including ‘Composer Matches’ and ‘Ensemble Matches’ (e.g. ‘Vienna Mozart Ensemble’). At the left you can select to navigate further:
- by Form/Genre (eg Holiday Music)
- by Feature (eg Boxed Sets; In Stock)
- by Price (eg Under $7), and so on.
This means that users can see the relevant options and choose areas of interest for further exploration.
Entopia believes that taxonomy structure generation and population should be a by-product of an individual’s normal work process. By making it valuable to users to have their information categorised, they hope to make that information useful and accessible throughout the organisation.
Their ‘dynamic semantic profiling’ is a two-stage process: content is reduced to its bare concepts, and then metadata is added to give contextual information. When an individual searches the knowledge base, profiles of the user and the content are created using the invisible taxonomy, to ensure up-to-date and relevant content. The user is prompted to put the information that is collected into one of three zones: personal, work group, and enterprise. Users can personalise the taxonomy for use with documents relevant to them.
Entrieva (previously Semio and Webversa)
Webversa bought Semio Corporation (including SemioTagger Categorization Software), then changed its name to Entrieva. Entrieva offers categorisation, discovery and notification software that groups information sources and provides real-time notification via telephone, PDA or e-mail, when incoming content matches conceptual criteria set up by the client. Users can ask their computer to let them know as soon as anything arrives on the desktop in the category ‘pay rise’, or ‘share price changes’ or ‘spouse’. Users can also respond to the notification with voice input technology. Entrieva has more than 100 customers in the private sector and government markets.
Factiva, see Inxight
Inktomi Enterprise Search, see Verity
The Inmagic Gatherer extracts content from text files, then Inmagic Classifier (powered by TopicalNet; now LightSpeed) automatically classifies the content, using thematic and semantic analysis, against a taxonomy of over 1 million pre-built categories. The Inmagic Classifier is a ‘glass box’ classifier, meaning it makes classification decision-making clear, and allows the implementer to customise the results according to local needs. Output can be a browseable directory, or XML for incorporation into DB/Textworks or IntelliMagic databases. If required, content can be classified into a taxonomy provided by the customer.
http://www.content-wire.com/FreshPicks/Index.cfm?ccs=86&cs=2337 (Captiva InputAccel)
Interwoven is a content management system used by nearly half of the US Fortune 100 companies. In Australia it is used by ten government agencies, including the Australian Tourist Commission (http://www.australia.com ) and the NSW Department of Education and Training. It has purchased the content tagging and taxonomy technology of Metacode Technologies, Inc. and uses InputAccel, the Captiva information capture software.
Interwoven MetaTagger 3.0 is tightly integrated with Interwoven TeamSite. Much of the initial setup, including taxonomy configuration and designation of training sets, is done by editing XML-based configuration files. Once everything has been set up, MetaTagger can be accessed from the browser-based administration interface or from the command line. MetaTagger can also manage some nontextual content such as multimedia files (especially those such as MP3 files, which have built-in tagging).
MetaTagger costs from US $85,000 to US $110,000 on top of the cost of Interwoven, already a six-figure investment – and they say manual indexing is expensive!
Inxight (incorporating WhizBang), with Factiva
Inxight’s unstructured data management software is used by more than 200 companies worldwide, and was chosen as the best categorisation software package in a report by the 451 group. One special feature is its visualisation capabilities, which you can see at http://www.inxight.com/map/ . Inxight has a software licensing agreement with Factiva, which uses Inxight’s text categorisation and entity extraction in the Factiva Fusion content enhancement tool. The entity extraction technology identifies and extracts information on company names, and groups it by categories. The text categorisation software is used to automatically analyse, code and classify text data from all enterprise content sets, according to the Factiva Intelligent Indexing taxonomy.
The Klarity suite of software products has been developed by Intology, a subsidiary of the tSA Group in Canberra. Different tools are used to categorise text (Klarity), to generate keywords and build taxonomies and thesauri (Keyword and Taxonomy Builder), and to provide an alternative virtual file structure based on a subject hierarchy for information on shared drives (Network Neighborhood). Network Neighborhood does not manage or move the files, but rather provides a unified view based on subject, rather than just the individual file structures of each individual who has contributed material to the shared drive.
Intology also provides Feature Parser, a tool that identifies features such as countries, politicians or chemicals, and KleanUp, which identifies duplicate and very similar documents, enabling companies to ‘clean up’ their legacy data before implementing categorisation or other systems.
Figure 2: Intology’s Network Neighborhood Showing Subject-based Hierarchy of PC Resources
Metacode, see Interwoven
MetaTagger, see Interwoven
Mohomine sells software for processing resumes and for categorising text (so next time you send a resume to a large organisation, it might be their software assessing you). They provide the software to independent software vendors and application service providers, as well as to intelligence agencies and large enterprises. (And we all know how well intelligence agencies have managed infoglut lately!)
MohoClassifier was created as an OEM (original equipment manufacturer) offering – a program created to be implemented as part of another vendor’s system – so it was made to integrate easily with existing enterprise applications.
Mohomine claims to offer ‘accuracy, speed, and scalability’. N.B. accuracy means 90-95% accuracy, i.e. 5-10% error. It classifies 100 Mb of data per minute; i.e. between 20 and 300 documents per second. Mohomine works with North American English, all Western European languages, Chinese, other Asian languages, and Arabic. I think you have to train it to speak Australian English!
Mohomine and Wordmap have now announced a partnership offering Wordmap’s Taxonomy Management System integrated with Mohomine’s MohoClassifier.
http://www.infotoday.com/newsbreaks/wnd021230.htm (Gale partnership)
Nstein offers computer-aided indexing for primary, secondary, and tertiary e-publishers and for news content. Nstein automatically categorises documents according to the news industry’s standard IPTC taxonomy. Systems are modular, including components such as the nconcept extractor and the IPTC ntelligent categorizer. Language modules are available for French, Russian, Turkish and Korean (among others). Their ndexing may be ntelligible but their nnaming is nnot.
Nstein and Gale (http://www.gale.com ), a part of The Thomson Corp, have recently announced that Gale’s subject taxonomy will be packaged with Nstein’s categorizer. An alternative is to use LDNA (Linguistic DNA) text analysis software with nserver to automatically isolate concepts within a text without use of a predefined taxonomy.
Stratify (formerly PurpleYogi)
http://www.stratify.com (select links to white paper, Delphi Group Report, and Documentum Datasheet)
Stratify Discovery System 2.0 automatically builds taxonomies and classifies content using a number of different classification approaches. It believes that all the approaches have strengths and weaknesses, so runs them in parallel and uses the Combiner module to compare the results. It also provides pre-built reference taxonomies for specific industries.
Stratify learns users’ interests and alerts them to new documents matching their profiles. It also classifies documents while they are being written, and presents related documents to the user. It offers Total Taxonomy Lifecycle Management, meaning a single interface that can be used by a number of editors for all taxonomy and classification tasks. Stratify is being used by Dialog to organize data on its NewsEdge service, and by Documentum in its document management system.
Texis Categorizer 4.1
Texis Categorizer is a flexible system that is easily integrated with Web-based applications. It uses a browser-based interface once the initial scripts are set up, and needs about 20 training sets for each category in the taxonomy. Categories can be fine-tuned as needed, and changes can be made by uncategorising content then re-entering it.
Cost is US $10,000 for the Texis engine and US $10,000 for the Categorizer – cheaper than many competing products.
Verity (incorporating Inktomi Enterprise Search and Quiver Classifier)
http://www.verity.com/pdf/white_papers/MK0391a_ContentOrg_WP.pdf (ABCs of content organization – a Verity white paper)
http://www.eweek.com/article2/0,3959,828522,00.asp (‘Verity outlines Inktomi integration path’)
Verity was founded in 1988 and has about 1,500 customers, mainly Global 2000 companies. It is also the OEM (original equipment manufacturer) search provider to over 200 software vendors.
Verity has a three-tier system, comprising Verity Intelligent Classifier, search, and social network technology (which connects users with each other). With the recent purchase of Inktomi Enterprise Search (which had in turn recently purchased Quiver Classifier), the categorisation offerings to Verity and ex-Inktomi customers may change. It appears that Verity may use the Inktomi product for mid-range clients, while maintaining the Verity K2 toolkit for its top-end clients. It may also return to the name Ultraseek for the search engine Inktomi acquired from Infoseek (via Disney’s Go.com). The remainder of Inktomi is to be bought by Yahoo! Verity now also offers the LexisNexis taxonomies and concept definitions.
Verity’s K2 Catalog is the company’s e-commerce solution, and Verity Intelligent Merchandising, a companion product, competes with Endeca InFront. The average deal size for Verity K2E is about US $170,000.
Webversa, see Entrieva
Wordmap, see Mohomine
Other Sites to Investigate
If that’s not enough, here are a few more sites to visit:
http://www.clearforest.com (meta tagging and text analysis)
http://www.convera.com (enterprise search, retrieval and categorisation)
http://www.gammasite.com (automatic categorisation and tagging software)
http://www.h5technologies.com (unstructured data management)
http://www.hummingbird.com (Fulcrum KnowledgeServer offers automatic text categorisation using neural net technology and ClusterMap offers taxonomy generation via clustering)
http://www.hyperwave.com (Integrated eKnowledge Suite)
http://www.recommind.com (automated information management software)
http://www.semagix.com/home/ (automated aggregation, classification and synchronisation with existing content)
http://www.similesoftware.com (unstructured data management including relevance, summarisation, and categorisation)
http://www.smartlogik.com (decision intelligence solutions from Applied Psychology Research)
See also Wherewithall and Textology (among others discussed above) in the Delphi report (4).
For lists of categorisation software vendors (not quite up-to-date) see http://www.avaquest.com/resources-cat.html and http://www.kmconnection.com/pguide . KM World has published a list of the 100 companies that they think matter most in knowledge management in 2003 (http://www.kmworld.com/100.cfm ). While not all of these companies offer automatic categorisation, many of them do and might be worth following up.
Choosing a Program
You now need to select the best-of-breed, benchmark-leading, next-generation, cutting, if not bleeding, edge, patent-pending taxonomy software solution with real intelligence, to seamlessly integrate with your legacy systems to leverage corporate knowledge and fast-track real-time access to business-critical content, and to lifecycle manage explosive infoglut to surface relevant knowledge and actionable information in a scalable way, with a zero-training interface to meet your most aggressive requirements.
Easier said than done. All of the software packages discussed above claim to identify key concepts from a range of textual document types, and to sort them automatically into categories. To find out which one, if any, can add value to your organisation you really need to try these products with real data, real users, and real needs. See also Online Comments in this issue for more on the brochure engineering and buzzword marketing used to sell these products.
Where to Start?
Some of the things you should consider when looking for categorisation software are listed below. The specific suggestions are not the only alternatives, but they are possible starting points:
- Make sure the programs you are looking at deal with all the file formats and languages you need to access. Most work with at least HTML, word processing documents, plain text and PDF. Some access Oracle and SQL databases. Multi-language categorisation is available with Mohomine, Autonomy, Interwoven, and Nstein (through the nlanguage modules).
- Check that the system will manage the amount of data that you have at the speeds that you need, and that it will remain adequate as your content grows. Most systems are fast and offer continual updating of categories.
- If you don’t have a pre-existing taxonomy, you will have to generate a new one manually or automatically, or purchase a system such as Inmagic or Entrieva (Semio) that comes with a taxonomy. For content such as news, a pre-existing taxonomy that follows industry standards might be best (check Nstein and Factiva). Note that many automatically generated taxonomies are not very good – you need to allocate time for manual review. You will also have to allocate someone to maintain the taxonomy.
- If many people will be working on the system you need a program such as Stratify that offers good workflow control. If you want instant notification of incoming content pertinent to user profiles, you need a package such as Entrieva, Autonomy or Stratify. Entrieva also includes notification by telephone.
- You should also look for a taxonomy and categorisation package that fits well with pre-existing software. For example, if you have Interwoven Teamsite, MetaTagger is the obvious choice, and if you use DB/Textworks, Inmagic Classifier could work well.
- For retail catalogues, try Endeca InFront or Verity K2 Catalog. For multimedia categorisation try Interwoven, and for data visualization try Inxight. For an Australian product with local support, and for a program to give better access to organisational information on shared drives, try Klarity.
- If accuracy is important, consider that for many of these programs ‘accurate’ means 90% accurate, 10% wrong. This might not matter for fast-flowing news, but certainly does for pharmaceutical efficacy data. Of course, manual categorisation won’t necessarily be better, but it can be. Tom Reamy (1) reports claims by Verity that they achieved 99% accuracy by incorporating editor-derived rules with their automatic categorisation. Similar improvements with human intervention could be expected with other systems as well.
- For a general content management system with categorisation software, try Verity. It was previously aimed at the high-level market, but with the purchase of Inktomi Enterprise Search should be offering mid-level software as well. Even if you don’t buy it, it is a good benchmark for comparison.
- If you still can’t decide, read the articles by Katherine Adams (2) and Jim Rapoza (3). Katherine Adams’ article provides a table comparing features such as classification method, source of taxonomy, XML reformatting, multiple views and languages of seven products: Mohomine MohoClassifier, Inxight Software Categorizer, Metacode Metasaurus, Semio Taxonomy, Cartia Themescape, Autonomy products and Verity Intelligent Classifier (unfortunately already somewhat dated by takeovers). Jim Rapoza compared the performance of Applied Semantic’s Auto-Categorizer 1.1, Interwoven Inc.’s MetaTagger 3.0, and Thunderstone Software LLC’s Texis Categorizer 4.1, and found they all performed well. His article also provides links to screen shots.
- Finally, the Delphi Group report on Taxonomy and Content Classification (4) covers the state of the industry, and gives specific information on a number of companies. Note, however, that the companies paid to be included in this report.
While automatic categorisation still shows its computer origins in an obvious lack of understanding of the material it is dealing with, leading to occasional ludicrous groupings, it also shows promise and the ability to deal with vast quantities of material, quickly, with passable results. It is, therefore, sure to expand its market, particularly in areas that may not traditionally have had good intellectual organisation (e.g. corporate intranets), but also in areas that have been traditionally indexed and catalogued by information professionals. One of its key uses may be in the organisation of search results, where it adds value to what is otherwise simply a list of results (possibly ranked by relevance, but not sorted by category).
It is interesting, though, that there is widespread agreement now that you don’t get optimal results using computers alone, and that the most cost-effective solution is ‘cyborg categorisation’ – the use of both human and computer input. A strategy might involve manual taxonomy generation or automated taxonomy generation with human review, followed by training or rule creation, then automated categorisation of documents. Human review is crucial while the system is being established, and to check that standards are maintained. Human input may also be valuable for areas identified in user studies (or from analysis of search logs and other records) as very important and in those areas in which computers do not perform well, such as the allocation of documents into ‘genre’ categories (‘overview’, ‘technical information’, ‘for children’). It might also be important to identify key tasks that are not well represented in the historical documents on which the system has been trained. For example, each new legislative or regulatory requirement involves new systems and training for staff. In this way, top quality can be maintained for the most important documents, and high throughput for the remainder. With these new scenarios, information professionals will still be needed, and their work may rise in status. There will be more jobs in certain areas, and probably job losses in others.
Finally, this is such a dynamic area that there have been a few takeovers and partnership agreements as I have been writing this article, and it is likely that there will have been major changes in companies and their products by the time you read it. Make sure you get up-to-the-minute information to base your decisions on.
(1) Reamy, Tom. ‘Auto-categorization: coming to a library or intranet near you!’ Econtent magazine, November 2002. http://www.econtentmag.com/r5/2002/reamy11_02.html (Sighted 28 November 2002)
(2) Adams, Katherine C. ‘Word wranglers.’ http://www.intelligentkm.com/feature/010101/feat1.shtml (Sighted 25 November 2002)
(3) Rapoza, Jim. ‘Three paths to sorting content.’ Eweek, 15 July, 2002. http://www.eweek.com/print_article/0,3668,a=29100,00.asp (Sighted 26 November 2002)
(4) ‘Taxonomy & content classification: market milestone report.’ Available as a guest download at http://www.delphigroup.com/coverage/taxonomy.htm (Sighted 25 November 2002). [Note that companies paid to be included in this report. It is also available from many sites with details for just one of the companies]
All Web links are current as of 13 March 2003.