Heresy: An opinion profoundly at odds with what is generally accepted.
Machines work, people should think.
Heresy the Second
Back in Heresy the First, Axiom 1 asserted that documentation is written primarily for people, with computers being a secondary audience (one that ultimately serves the purposes of people by searching, organizing, and cataloging documents). Computers are very useful tools, but are not terribly intelligent: a computer can “read” a document, in the sense of loading it from a hard drive, but cannot “read” it the way we do. The closest a computer comes to reading is searching for a series of bits that either match a particular pattern or appear in a particular part of the file.
Computers are very good at finding patterns, or arranging data into desired patterns (e.g. sorting); but when it comes to document handling they might need a little help. That help often takes the form of metadata, literally “data about the data.” Metadata can take several forms, some of which I’ll describe shortly.
Sometimes, metadata is simply a way to muddy the waters. “What about metadata?” comes up often when I stump for simpler, human-scale authoring systems. The proper answer may well be, “Well, what about it?” If metadata is what computers use, let the computers deal with it.
Metadata takes many forms, but these forms can be graphed:
In this graph, implicit vs. explicit metadata appears on the X-axis; internal vs. external metadata appears on the Y-axis. Let’s take a look at each pair of attributes.
Implicit vs. Explicit
Sometimes, metadata just comes along for the ride. For example, these items appear in most technical documents:
- directory path
- file names/extensions
- title
- header/footer info
- terms or index entries
- styles, elements, attributes, or document types
- links (or references) to other documents
This is implicit metadata, information that’s part of your document and meant for other purposes, but that can be used as metadata too. Even plain text can be a kind of metadata: for example, search engines build an index based on words or phrases appearing in the document. A more sophisticated search engine could, based on keywords or phrases, form a rough idea of what the document is about, which would allow a more refined search.
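To make the indexing idea concrete, here is a minimal sketch of the kind of word index a search engine might build from plain text. The function names and the sample documents are made up for illustration; real search engines do far more (stemming, ranking, phrase matching).

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return index


def search(index, query):
    """Return the names of documents that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical sample documents; the text itself serves as the metadata.
docs = {
    "Installation.odt": "Installing the Widget 2000 on your desk",
    "Maintenance.odt": "Cleaning and servicing the Widget 2000",
}
index = build_index(docs)
print(sorted(search(index, "widget installing")))
```

Nothing was added to the documents to make them searchable; the words the author wrote for human readers are all the index needs.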
The document’s name and path can provide metadata? Why not? For example, a document’s path and file name might be:
~/Documents/Widget2000/UserGuide/Installation.odt
We can infer several important pieces of information just from this: the document is part of the Widget 2000 User’s Guide, describes installation procedures, and is an OpenOffice file. If we wanted to glean more information, knowing that it’s an OpenOffice file means we can extract the meta.xml data (we’ll explore OpenOffice files in a later post). By listing the directory, we could find closely related files.
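As a rough sketch of that inference, the following assumes a product/book/topic directory layout like the example path; the `FORMATS` table and both function names are invented for illustration. The second function relies on the fact that OpenDocument files are zip archives, so the standard `zipfile` module can pull out meta.xml.

```python
import zipfile
from pathlib import PurePosixPath

# Hypothetical mapping from file extension to a readable format name.
FORMATS = {".odt": "OpenDocument Text", ".html": "HTML", ".xml": "XML"}


def infer_metadata(path):
    """Guess product, book, topic, and format from a path like the example."""
    p = PurePosixPath(path)
    if len(p.parts) < 3:
        raise ValueError("expected at least product/book/file in the path")
    return {
        "product": p.parts[-3],  # assumes .../<product>/<book>/<file>
        "book": p.parts[-2],
        "topic": p.stem,
        "format": FORMATS.get(p.suffix.lower(), "unknown"),
    }


def read_odf_meta(file):
    """Return the raw bytes of meta.xml from an OpenDocument (zip) file."""
    with zipfile.ZipFile(file) as archive:
        return archive.read("meta.xml")


info = infer_metadata("~/Documents/Widget2000/UserGuide/Installation.odt")
print(info)
```

The path conventions do all the work here, which is why consistent directory and file naming pays off.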
The document title is important: a proper title describes the document, and some search engines give extra weight to terms found in the title. For example, Unix “manpages” take advantage of implicit metadata; a script called makewhatis extracts the title and description from the “NAME” block to build a database of keywords that can be searched using apropos or man -k (keywords). Writing a proper manpage title can be an art. Not only does it have to properly describe the content, it must contain keywords that readers would use to find that manpage in an apropos search.
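The idea behind makewhatis and apropos can be sketched in a few lines. This is not how the real tools are implemented; it is a simplified illustration that assumes plain roff source with a `.SH NAME` section of the form `name \- description`, and the sample manpage is invented.

```python
import re


def parse_name_section(man_source):
    """Return (names, description) from the NAME section, or None."""
    match = re.search(
        r'^\.SH\s+"?NAME"?\s*\n(.*?)(?=^\.SH|\Z)',
        man_source,
        re.MULTILINE | re.DOTALL,
    )
    if not match:
        return None
    # Drop roff request lines and join what is left into one string.
    lines = [ln for ln in match.group(1).splitlines() if not ln.startswith(".")]
    text = " ".join(lines).replace("\\-", "-")
    names, sep, description = text.partition(" - ")
    if not sep:
        return None
    return [n.strip() for n in names.split(",")], description.strip()


def build_whatis(pages):
    """Collect the NAME entries of every manpage source that has one."""
    return [entry for entry in map(parse_name_section, pages) if entry]


def apropos(database, keyword):
    """Return entries whose names or description contain the keyword."""
    keyword = keyword.lower()
    return [
        (names, desc)
        for names, desc in database
        if keyword in desc.lower() or any(keyword in n.lower() for n in names)
    ]


# Hypothetical manpage source for an imaginary "widget" command.
SAMPLE = """.TH WIDGET 1
.SH NAME
widget, wdgt \\- control the Widget 2000
.SH SYNOPSIS
.B widget
"""
database = build_whatis([SAMPLE])
print(apropos(database, "control"))
```

A reader searching for "install" would never find this page, which is exactly why the wording of the NAME line matters.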
In a markup context, attributes, tags, and even the document type help to describe the document. For example, the difference between DITA task and concept documents should be obvious; each topic type describes what kind of content should be found in that topic. Going deeper, attributes like class (HTML) or role (DocBook, DITA) can be used to describe parts of a document.
Links and references, as well as the document’s directory path, describe the document’s relationship to other documents (its place in the world, as it were). Even link targets (such as bookmarks or <a name="foo"/> tags) can be significant, as they call out sections important enough to be referenced from other places. One could write a script to walk through a body of documentation and create a relational map of jumps and destinations; such a map could, for example, ensure that all referenced topics or chapters are included in a book or other collection.
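Such a script might look something like the sketch below. It assumes the documentation is HTML held in a dictionary of file name to markup, treats both `<a name="...">` and `id` attributes as link targets, and skips anything with `://` in it as an external URL; the class and function names, and the sample pages, are invented for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect link targets (anchors, ids) and outgoing hrefs from one page."""

    def __init__(self):
        super().__init__()
        self.targets = set()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("name"):
            self.targets.add(attrs["name"])
        if attrs.get("id"):
            self.targets.add(attrs["id"])
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])


def build_link_map(documents):
    """Return (link_map, broken) for a dict of file name -> HTML text."""
    collectors = {}
    for name, html in documents.items():
        collector = LinkCollector()
        collector.feed(html)
        collectors[name] = collector

    link_map, broken = {}, []
    for name, collector in collectors.items():
        link_map[name] = []
        for href in collector.links:
            if "://" in href:
                continue  # simplification: ignore external URLs
            target_file, _, fragment = href.partition("#")
            target_file = target_file or name  # "#foo" points at this page
            link_map[name].append((target_file, fragment))
            known = collectors.get(target_file)
            if known is None or (fragment and fragment not in known.targets):
                broken.append((name, href))
    return link_map, broken


# Hypothetical two-page document set.
docs = {
    "install.html": '<a name="unpack"/><p>See <a href="usage.html#start">usage</a>'
                    ' and <a href="missing.html">this</a>.</p>',
    "usage.html": '<h2 id="start">Start</h2><a href="install.html#unpack">back</a>'
                  ' <a href="#nowhere">x</a>',
}
link_map, broken = build_link_map(docs)
print(broken)
```

The `broken` list is the completeness check described above: every entry is a jump whose destination is not in the collection.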
Just as we can make a case for all documents having an implied structure (aka shared context), we can make the same case for metadata: certain items that already appear in the document do a pretty good job of describing it. All documents have metadata associated with them, whether we recognize it as such or not.
Explicit metadata, obviously, is metadata that is added specifically as metadata. Examples include:
- descriptions
- catalog info
- properties/file info
- cataloging (ToC)
- search (Index)
Ironically, explicit metadata depends much more on the document format than does implicit metadata. You might specify metadata using a Properties dialog (word processors), <meta /> tags (HTML), or Dublin Core (various XML document types).
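For the HTML case, reading explicit metadata back out is straightforward. The sketch below collects name/content pairs from <meta /> tags using the standard library parser; the sample page is invented, and the `DC.title` name follows the common convention for embedding Dublin Core in HTML.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collect name/content pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name, content = attrs.get("name"), attrs.get("content")
        if name and content is not None:
            self.meta[name.lower()] = content  # keys are stored lowercase


def extract_meta(html):
    """Return a dict of the explicit metadata found in an HTML page."""
    parser = MetaExtractor()
    parser.feed(html)
    return parser.meta


# Hypothetical page header.
PAGE = """<html><head>
<meta charset="utf-8">
<meta name="description" content="How to install the Widget 2000" />
<meta name="DC.title" content="Installation">
<title>Installation</title>
</head><body></body></html>"""
print(extract_meta(PAGE))
```

Note that this extractor is useless on a word processor file or a DITA topic, which is the format dependence described above.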
The Table of Contents and Index are constructs that provide a high-level look at the entire document and a rudimentary search function.
Internal vs. External
Metadata can be either internal to the document or external. Each type has specific advantages and disadvantages.
Internal metadata can include:
- tags (<meta/>, Dublin Core)
- document types, elements, attributes
- Properties
External metadata includes:
- XML Topic Maps (XTM)
- file and path names
- Catalogs
- Search engine indexes
External metadata has the advantage of being highly flexible, and is necessary for some media (such as video). However, moving or copying a file requires changes to the external metadata; this may or may not present a problem.
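The maintenance cost is easy to see in a toy example. The sketch below assumes an external catalog that is just a dictionary keyed by file path; the function names and catalog entries are hypothetical, and real catalogs (topic maps, search indexes) are more elaborate.

```python
def move_document(catalog, old_path, new_path):
    """Re-key a catalog entry when its file moves."""
    if old_path not in catalog:
        raise KeyError(old_path)
    if new_path in catalog:
        raise ValueError(f"{new_path} is already cataloged")
    catalog[new_path] = catalog.pop(old_path)


def orphaned_entries(catalog, existing_paths):
    """List catalog entries whose files are no longer where the catalog says."""
    return sorted(path for path in catalog if path not in existing_paths)


# Hypothetical catalog for a video, which cannot carry this data internally.
catalog = {
    "videos/unboxing.mp4": {"title": "Unboxing the Widget 2000", "minutes": 4},
}

# The file is moved on disk, but nobody tells the catalog.
files_on_disk = {"videos/widget2000/unboxing.mp4"}
print(orphaned_entries(catalog, files_on_disk))

# Updating the catalog as part of the move keeps the two in step.
move_document(catalog, "videos/unboxing.mp4", "videos/widget2000/unboxing.mp4")
print(orphaned_entries(catalog, files_on_disk))
```

Whether this is a problem depends on how often files move and whether the move goes through a tool that updates the catalog.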
Now let’s populate that graph above with examples:
Nearly any type of metadata can be pigeonholed into one or two of these four quadrants (Dublin Core metadata might be internal or external, for example).
Audience Analysis
Remember, metadata is primarily a processing aid rather than something human readers use directly. If you focus on explicit metadata, you have to write two documents: one for humans, another for computers. Even if computers are the primary consumers of this second(ary) document, the computers are still serving human users, so audience analysis can help you to focus the work for best effect:
- How will computers use the metadata? For example, you may not need to cater to popular search engines, but need to provide hooks for automated builds.
- What metadata do the computers need to do the job?
- How much maintenance is required to keep the metadata current?
If possible, focus on implicit metadata — it requires less maintenance, which means it has a better chance of staying current. Meaningful path and file names, descriptive titles, and proper element (or style) selection can all play a part, and often improve the readability and maintainability of your documents… and that’s the most important thing.