Optimising your site's internal search using metadata

Amber Dean
May 01 2014

Metadata is often an overlooked part of the web development process, perhaps because it's one of the last steps before you hit publish... (or because for most it is the least 'interesting' part of web publishing?). This article is a gentle reminder that while metadata can be 'tedious', there is unfortunately 'no pain, no gain'. It is important, and often critical, to the findability and general information organisation of your site once users have arrived (note, this is an article on site search - not on improving SEO to attract users to your site). In particular, it seeks to demystify an often 'hidden gem' in the metadata family - subject metadata - and its role in improving site search precision.

The case for metadata

Let's pause for a moment and lament the consequences of poor website search. Whether you are search-dominant or not, at one stage or another you may have been driven to the site search by a poor information architecture (IA) and navigation system... and out of a thousand results, not a single result on the first or second page looked like a reasonable hit. Sound familiar? Because of this reduced trust in site searches, we have observed users again and again turning to Google to find information rather than using the site search. If they're looking for a strategic plan for a Government site, they type into Google something like 'strategic plan' + 'Government department name', effectively bypassing the site IA and site search.

On intranets, it's worse. There are no 'Google workarounds'; users are virtually hostages to the site. They have to use the site's navigation, or rely on the site search, and if they consistently can't find what they're looking for, they become participants in a culture of 'dialing' colleagues for help - which ultimately reduces work productivity.

These scenarios are all too common. There are a lot of elements that feed into effective search systems, but one of the primary forces strengthening search is metadata. Metadata is 'data about data' - it is the vehicle that gives us access to information. In the words of a few metadata commentators:

  • 'Be good to your metadata, and it will be good to you.'
  • 'There are dangers of false economy by skimping on metadata capture and quality control... metadata ensures the content business has a future, because without metadata assets are lost.'
  • 'Metadata is a love note to the future.'

The case for subject metadata

One type of metadata that is frequently overlooked but particularly useful for improving search precision is 'subject metadata'. This is one of 15 metadata properties recommended for resource description by the well-respected international web metadata standard - Dublin Core. The Australian Government AGLS Metadata Standard also recommends subject metadata. According to Dublin Core, this subject metadata property should typically be 'represented using keywords, key phrases, or classification codes. Recommended best practice is to use a controlled vocabulary.'

This recommendation to use controlled vocabularies is echoed by the pioneers of Information Architecture - Peter Morville & Louis Rosenfeld - in their 'polar bear' book. It even gets a mention as the third layer on their 'IA iceberg'.

Figure 1: Rosenfeld and Morville's information architecture iceberg (sourced from UX Matters)
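
To make the Dublin Core subject property a little more concrete, here is a minimal sketch of how subject terms might be written into a page's HTML head. It assumes the common convention of embedding Dublin Core in `<meta>` elements named `DCTERMS.subject`, one element per term; the function name `subject_meta_tags` and the sample terms are hypothetical, and your CMS may well handle this step for you.

```python
from html import escape


def subject_meta_tags(subjects):
    """Return one <meta> element per subject term, with the term HTML-escaped.

    Assumption: subjects are embedded using the 'DCTERMS.subject' naming
    convention. Adjust the name if your site follows a different profile.
    """
    return [
        f'<meta name="DCTERMS.subject" content="{escape(term, quote=True)}">'
        for term in subjects
    ]


# Hypothetical subject terms taken from a controlled vocabulary.
tags = subject_meta_tags(["Houses", "Cottages"])
print("\n".join(tags))
```

The terms themselves should come from the controlled list rather than being typed freely by each author, which is the point of the sections that follow.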

Controlled vocabulary demystified

A controlled vocabulary is essentially where some form of order and control is applied to the way in which 'keywords' or 'subjects' are added to metadata. This is in contrast to an uncontrolled vocabulary, where anyone uploading content to the site determines the subject or keyword tags based on what they subjectively 'think'. This is not an uncommon approach (in case you're thinking this sounds familiar!)... but it relies on 'common sense' - the assumption that all content editors will view the content largely in the same way. It has been shown time and again that common sense is very much 'in the eye of the beholder', and human decision-making is very prone to bias (see acclaimed psychologist Daniel Kahneman's work in this area to see how 'irrational' we humans can be).

What happens with this approach is that, as the site grows and ages, it becomes 'unruly' (think 'wild west'!):

  • the metadata language becomes too broad, and ultimately out-of-line with user needs
  • combined with a lack of content governance, there is orphaned and out-of-date content that just keeps on growing
  • before you know it, you have a site where no-one can find anything!

Figure 2: Conceptual illustration of metadata not fitting content

With a controlled vocabulary, you have a narrower set of terms (a controlled language) that is centrally managed. If developed in a user-centered framework, this should also be in line with users' information needs.

Figure 3: Conceptual illustration of metadata fitting content

There are a number of different types of controlled vocabularies. A thesaurus is generally well-regarded, and in this article we'll look at it more closely.

Thesauri explained

What it is: A thesaurus identifies relevant terminology for a subject area or topic (preferred terms), and creates a structure for the vocabulary. The hierarchical structure with preferred terms includes:

  • 'Parent-child' relationships - broader and narrower terms;
  • 'Sibling' relationships - semantic relationships between terms;
  • 'False friends' - mappings from non-preferred synonyms to the preferred thesaurus term.

For example:

  • Preferred term: Houses
  • Broader term: Building
  • Narrower term: Cottages
  • Related term: Palaces
  • Non-Preferred term*: Dwellings
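
The 'Houses' entry above can be sketched as a small data structure. This is only an illustration under stated assumptions, not how any particular thesaurus tool stores its terms: the `Term` and `Thesaurus` classes and the `resolve` and `expand` methods are hypothetical names. It shows the two jobs a thesaurus does for site search: mapping a non-preferred word to its preferred term, and widening a query to include narrower terms.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Term:
    """One thesaurus entry: a preferred term and its relationships."""
    preferred: str
    broader: List[str] = field(default_factory=list)
    narrower: List[str] = field(default_factory=list)
    related: List[str] = field(default_factory=list)
    non_preferred: List[str] = field(default_factory=list)


class Thesaurus:
    def __init__(self, terms: List[Term]):
        # Index preferred terms and non-preferred synonyms, ignoring case.
        self.terms = {t.preferred.lower(): t for t in terms}
        self.use_for = {
            alias.lower(): t.preferred for t in terms for alias in t.non_preferred
        }

    def resolve(self, word: str) -> Optional[str]:
        """Return the preferred term for a word, or None if it is unknown."""
        key = word.strip().lower()
        if key in self.terms:
            return self.terms[key].preferred
        return self.use_for.get(key)

    def expand(self, word: str) -> List[str]:
        """Return the preferred term plus its narrower terms for searching."""
        preferred = self.resolve(word)
        if preferred is None:
            return [word]  # unknown words are searched as typed
        return [preferred, *self.terms[preferred.lower()].narrower]


houses = Term(
    preferred="Houses",
    broader=["Building"],
    narrower=["Cottages"],
    related=["Palaces"],
    non_preferred=["Dwellings"],
)
thesaurus = Thesaurus([houses])

print(thesaurus.resolve("dwellings"))  # Houses
print(thesaurus.expand("Dwellings"))   # ['Houses', 'Cottages']
```

A user who searches for 'dwellings' is therefore matched against content tagged 'Houses' or 'Cottages', even though no editor ever used the word 'dwellings' as a tag.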

The Australia and New Zealand Society of Indexers (ANZSI) provides standards for the development of thesauri.

Thesauri in practice

Developing a thesaurus from scratch is a big investment and requires specialist skills from a trained indexer or taxonomist. It's generally best to first see whether one is already available that can be adapted or customised to your site. There are a large number of general thesauri (e.g. see Library of Congress Subject Headings) and specialist thesauri (e.g. see Australian Pictorial Thesaurus) available on the open web.

Many content management and publishing systems provide metadata tools (including in-built thesauri tools), which allow authors, editors and librarians to add appropriate entries more easily, using standard vocabulary and formatting. However, this is not yet a standard part of web publishing. To complicate matters, these tools are often not simply called 'controlled vocabulary' or 'thesaurus' - each system uses its own name. For example, in SharePoint, the in-built thesaurus tool is called the 'Term Store Management Tool'.

If a full-blown thesaurus sounds like too much for your site right now, just start with a small common set of preferred (and ideally non-preferred) terms - say 50 terms. These can be generated from your user content requirements for the site. This at least raises awareness and puts the principle in place that metadata for your site needs to be managed from a 'controlled' list.
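
As a rough sketch of what managing tags from a 'controlled' list could look like at publishing time, the snippet below checks an author's tags against a small starter list. Everything in it is an assumption for illustration: the terms in `PREFERRED_TERMS` and `NON_PREFERRED` are made up, and `check_tags` is a hypothetical helper rather than a feature of any CMS.

```python
# Hypothetical starter vocabulary; a real list would come from user research.
PREFERRED_TERMS = {"Strategic plans", "Annual reports", "Media releases"}

# Hypothetical non-preferred synonyms, each mapped to its preferred term.
NON_PREFERRED = {
    "press releases": "Media releases",
    "corporate plans": "Strategic plans",
}


def check_tags(tags):
    """Split author-supplied tags into accepted preferred terms and rejects.

    Matching ignores case and surrounding spaces. Non-preferred synonyms are
    replaced by their preferred term, and duplicates are dropped.
    """
    lookup = {term.lower(): term for term in PREFERRED_TERMS}
    accepted, rejected = [], []
    for tag in tags:
        key = tag.strip().lower()
        term = lookup.get(key) or NON_PREFERRED.get(key)
        if term is None:
            rejected.append(tag)  # not on the controlled list
        elif term not in accepted:
            accepted.append(term)
    return accepted, rejected


accepted, rejected = check_tags(["press releases", "Media releases", "misc"])
print(accepted)  # ['Media releases']
print(rejected)  # ['misc']
```

Rejected tags are worth reviewing rather than discarding: a word that authors keep reaching for may be a good candidate for a new preferred or non-preferred term.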


Categories: Navigation & search