To most biologists, it seems
inconceivable that the simple act of naming a biological entity has any more
significance than identifying a personal achievement or staking a claim to a
territory of research interest, akin to carving one’s initials into the
tree-of-life. However, this simple act
has potentially far-reaching and long-lived consequences. Names, especially
those ascribed to organisms, serve as a primary entry point into the
scientific, medical, and technical literature and figure prominently in
countless laws and regulations governing various aspects of commerce, public
safety and public health. These names
also serve as a primary entry point into many of the central databases that the
scientific community and the general public now rely upon.
Even well-formed and
properly applied names can serve as a source of confusion and considerable
frustration. This is hardly a new
problem. In the mid-18th century, Carl von Linné proposed the use of latinized
binomial names and a hierarchical classification scheme as an alternative to
trivial and colloquial names, which were a constant source of confusion among
his contemporaries (Linné 1735). The Linnaean
system of nomenclature was widely accepted and has subsequently been codified
into four separate legalistic frameworks (the codes of botanical (Greuter 2000), zoological (Ride
1985), prokaryotic (Sneath
1992), and viral nomenclature (Buchen-Osmond 2002)), describing the rules for forming and ascribing
names to species and higher taxa, circumscription and emendation, priority and
citation, synonymy and, more recently, homonymy, correction of orthographic errors,
and adjudication of disputes in nomenclature.
Chief among the stated objectives of these codes is stability and order
in the nomenclature of the taxa covered by each. Achieving this goal, however, remains elusive for a variety of reasons.
Relatively few contemporary
biologists are actively engaged in systematic studies and have reason to formally
propose names for new biological entities. Rather, they are end-users of the
classifications and nomenclature produced by a small group of specialists. Most
biologists fail to recognize that taxonomic proposals are expert opinions,
arising from comparative studies of small numbers of species that may or may
not be representative of larger natural groups, and that the opinions rendered
by these experts are subject to acceptance or rejection by the larger community
of biologists. They also fail to recognize that the Codes of Nomenclature do
not govern the process of biological classification or identification, only the
formation and assignment of names to proposed taxa. Legitimate and valid names
may be ascribed to poorly formed taxa, and illegitimate and invalid names may be
assigned to well-formed taxa.
The name ascribed to a given
group is fixed in both time and scope and may or may not be revised when new
information is available. When taxonomic revisions do occur, resulting in the
division or joining of previously described taxa, authors frequently fail to
address synonymies or formally emend the descriptions of higher taxa that are
affected.
Whereas the different Codes
of Nomenclature guarantee persistence of a formal name, the serial, cumulative
nature of effective and valid publication allows the name to obsolesce in
relation to the taxon it originally denoted. In contrast, it is the taxon itself
that persists, and the granularity with which it is defined increases over
time. The formal name provides an archival record of taxonomic definition only
for a single point in time — the date of publication. A robust and persistent
taxonomy requires taxonomic definition to be a maintained, networked resource,
rather than a retrospective sequence of names and emendations. A commonly referenced terminology based on
persistent, increasingly refined taxa is needed to replace or augment a static
nomenclature that diverges over time from the taxonomy it initially denotes.
This disjunction of
nomenclature and taxonomy results in an accumulation of names of dubious value
in the literature and databases. While systematic biologists may be adept at
recognizing such problems, most others (including the curators of some
databases) are not. This can have a
significant impact on activities such as assertions of taxonomic identity,
commonality of metabolic function, and recognition of homologous, paralogous or
xenologous genes. It can also have significant and unintended consequences, such
as adding species to, or removing them from, lists of tightly regulated species (e.g., the
current list of biothreat agents).
There are a number of
significant distinctions between the more mature Botanical and Zoological Codes
of Nomenclature and the Code of Prokaryotic Nomenclature. The record of
botanical and zoological nomenclature dates back to the time of von Linné, and
names are considered valid regardless of where they were published, provided
that other criteria are fulfilled. Each
worker is obliged to establish the priority of names that are used in their
proposals. While lists of names in
common usage are available as a starting point (Melville and Smith 1987; Buchen-Osmond 2002), the number of synonyms for a given group can easily
exceed the number of names deemed legitimate.
Prokaryotic biologists have addressed this problem in a much more
elegant manner. In 1980, those names
that were considered of dubious value were purged from the record and the
Approved Lists of Bacterial Names were published (Skerman, McGowan et al. 1980). These
officially sanctioned names represented a new starting point and a formal
mechanism for adding new names to the record was established; in essence, a
registry was created. The effects of
this mechanism prove instructive in other ways, as well.
Since publication of the
Approved Lists, the number of validly published names of species and higher taxa has
increased dramatically. This can be
attributed to the widespread application of high-resolution molecular methods
to resolve taxonomic problems and to infer evolutionary relationships among
Bacteria and Archaea. While these
studies have led to the first natural taxonomy of prokaryotes (Garrity and Holt 2001; Garrity, Johnson
et al. 2002), a side effect has been a degradation of
nomenclature. By the end of 2002,
one-fifth of the more than 6500 validly published species names were reduced to
synonymy. In the absence of a priori
knowledge of the biological nomenclature, it becomes increasingly
unlikely that even the most sophisticated users will be able to perform a
complete and accurate search of either the scientific, technical or regulatory
literature or databases. There is a clear and unfulfilled need for a mechanism
that integrates this critical information into a networked, distributed
environment to circumvent this problem.
As
biological data proliferate and interconnect, they depend increasingly on software
infrastructure, and it becomes increasingly obvious that biological names do
not meet the requirements of a good identifier, in strict computing terms. A
good identifier should be unique and persistent. Ideally, each name should
point to a single time-specific or publication-specific definition of a taxon
(although if necessary each taxon can have more than one identifier). In
reality, only the lower taxonomic entities are persistent and permanently
coherent in phylogenetic terms. As new data become available, the inferred
relationships among the named entities may change: a taxon may
be promoted or demoted, or new taxa may be interposed between formerly contiguous
taxa. While such events should trigger a full and automatic updating of the
definitions of the affected names at all levels, that rarely occurs. As a
result, the formal association of names
with taxa tends to weaken as gene sequencing accelerates.
(Gene sequencing is increasingly used to define and delineate taxa.) Failure to address this problem will result in increasingly
unpredictable responses when biological names are used to query either the
literature or databases. What is required is a resolution system that can
handle the complex relationships between biological names and the entities they
denote and provide links to both the historical and current definition of each
named taxon.
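The kind of resolution described above can be illustrated with a minimal sketch in Python. It assumes a simple in-memory registry in which each name carries a dated history of definitions; the class names, fields, and the fictitious taxon name are all hypothetical and are not part of any existing nomenclatural database.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Definition:
    """One published definition of the taxon denoted by a name (hypothetical model)."""
    year: int       # year of the publication that defined or emended the name
    citation: str   # free-text label for that publication
    status: str     # e.g. "valid" or "synonym"


class NameRegistry:
    """Maps each name to the chronological history of its definitions."""

    def __init__(self) -> None:
        self._history: dict[str, list[Definition]] = {}

    def add(self, name: str, definition: Definition) -> None:
        entries = self._history.setdefault(name, [])
        entries.append(definition)
        entries.sort(key=lambda d: d.year)  # keep the history in date order

    def history(self, name: str) -> list[Definition]:
        """Full historical record for a name (empty if the name is unknown)."""
        return list(self._history.get(name, []))

    def current(self, name: str) -> Definition | None:
        """Most recent definition, or None if the name is unknown."""
        entries = self._history.get(name)
        return entries[-1] if entries else None

    def as_of(self, name: str, year: int) -> Definition | None:
        """Definition that was in force in the given year, if any."""
        candidates = [d for d in self._history.get(name, []) if d.year <= year]
        return candidates[-1] if candidates else None


# Usage with a fictitious name; the dates and citations are placeholders.
registry = NameRegistry()
registry.add("Examplea prima", Definition(1980, "original description", "valid"))
registry.add("Examplea prima", Definition(2001, "emended description", "valid"))
print(registry.as_of("Examplea prima", 1990))
print(registry.current("Examplea prima"))
```

Keeping every dated definition, rather than overwriting the previous one, is what allows a query to return either the historical or the current sense of a name.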
We
believe that an implementation of the Digital Object Identifier (DOI; Paskin 2002; Paskin in press) may provide the most robust and future-proof
solution to this problem. A DOI is a unique, persistent identifier of an
information resource that is registered together with a URL. Its purpose is the
management and retrieval of that resource in the networked environment. In practice, most current DOIs identify
journal articles, but DOIs are now being applied to trade publications, stock
photography, and physicochemical data sets.
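The separation between a persistent identifier and the URL registered with it can be sketched as follows. This is a toy in-memory stand-in, not the actual DOI resolution infrastructure; the class, the identifier string, and the example.org URLs are assumptions made purely for illustration.

```python
class IdentifierRegistry:
    """Toy registry: a persistent identifier maps to a URL that may change."""

    def __init__(self) -> None:
        self._locations: dict[str, str] = {}

    def register(self, identifier: str, url: str) -> None:
        # An identifier is assigned once and never reused for another resource.
        if identifier in self._locations:
            raise ValueError(f"{identifier} is already registered")
        self._locations[identifier] = url

    def relocate(self, identifier: str, new_url: str) -> None:
        # The resource moved; only the registered URL changes.
        if identifier not in self._locations:
            raise KeyError(identifier)
        self._locations[identifier] = new_url

    def resolve(self, identifier: str) -> str:
        return self._locations[identifier]


# Usage: citations keep the same identifier while the location changes.
registry = IdentifierRegistry()
registry.register("10.9999/example.1", "https://example.org/old/record")
registry.relocate("10.9999/example.1", "https://example.org/new/record")
print(registry.resolve("10.9999/example.1"))
```

Because references carry the identifier rather than the URL, nothing that cites the resource needs to change when the resource is moved.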
We
are developing a model for assigning DOIs to prokaryotic taxa as a test case.
Though the definition of a taxon may be refined and its nomenclature redefined,
the DOI will persist, leaving a forward-pointing trail that can be used to
reliably locate digital and physical resources, even when a name may be deemed
obsolete. Forward linking from a synonym
to a record of the publication that asserts synonymy is especially important,
as there is currently no mandatory mechanism for asserting and resolving names
that become ambiguous. Our model seeks to strengthen the association of names
with taxa by using DOIs to track the taxonomic definition of a name over time.
It is extensible to the level of individual genes within a given species.
However, the real power of this method lies in the ability of DOIs to become
embedded in the information environment, providing a direct and persistent link
to the full record of taxonomic and nomenclatural revision and ensuring
consistency and accuracy throughout online scientific resources. In building a DOI-based infrastructure for formally
associating nomenclature with taxonomy, we envision a time when a name can be
used unambiguously and persistently, only one mouse-click away from a record of
its current definition and historical development.
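The forward-pointing trail described above might be modelled along the following lines. This is a sketch under stated assumptions rather than the model under development: the record fields, the DOI-like identifier strings, and the fictitious taxon names are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class TaxonRecord:
    identifier: str                    # hypothetical DOI-like identifier
    name: str                          # name attached to this record
    superseded_by: str | None = None   # forward pointer to the newer record
    asserted_in: str | None = None     # publication asserting the change


def resolve_trail(records: dict[str, TaxonRecord], start: str) -> list[TaxonRecord]:
    """Follow forward pointers from `start` to the current record.

    Returns every record visited, oldest first, so that both the
    historical development and the current definition are available.
    """
    trail: list[TaxonRecord] = []
    seen: set[str] = set()
    current: str | None = start
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle in forward links at {current}")
        seen.add(current)
        record = records[current]
        trail.append(record)
        current = record.superseded_by
    return trail


# Usage with fictitious records: an obsolete name still leads to the current one.
records = {
    "10.9999/taxon.1": TaxonRecord(
        "10.9999/taxon.1", "Examplea prima",
        superseded_by="10.9999/taxon.2",
        asserted_in="placeholder citation asserting synonymy",
    ),
    "10.9999/taxon.2": TaxonRecord("10.9999/taxon.2", "Sampleus primus"),
}
for record in resolve_trail(records, "10.9999/taxon.1"):
    print(record.identifier, record.name, record.asserted_in)
```

Storing the asserting publication on the superseded record is one way to make the link from a synonym to its justification explicit. The cycle check guards against contradictory assertions being entered by mistake.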
References cited
Buchen-Osmond, C. (2002). ICTVdB: The Authorized
Universal Virus Database, Biosphere 2 Center, Columbia University.
http://ictvdb.bio2.edu/index.htm
Garrity, G. M. and J. G. Holt, Eds. (2001). The
Road Map to the Manual. Bergey's
Manual of Systematic Bacteriology. New York, Springer-Verlag.
Garrity, G. M., K. L. Johnson, et al. (2002).
Taxonomic Outline of the Prokaryotes, Bergey's Manual Of Systematic
Bacteriology, Bergey's Manual Trust. DOI 10.1007/bergeysoutline
Greuter, W. (2000). International code of
botanical nomenclature (St. Louis Code), Koeltz Scientific Books.
Linné, C. v. (1735). Regna tria naturae
systematice proposita per classes, ordines, genera, & species.
Stockholm, Lugduni Batavorum.
Melville, R. V. and J. D. D. Smith, Eds. (1987). Official
lists and indexes of names and works in zoology. London, International
Trust for Zoological Nomenclature.
Paskin, N. (2002). Digital Object Identifiers.
ICSTI Seminar: Digital Preservation of the Record of Science, IOS Press.
Paskin, N. (in press). DRM Technologies:
Identification and Metadata. Lecture Notes in Computer Science.
Digital Rights Management: Technical, Economical,
Juridical, and Political Aspects. E.
Becker, D. Gunnewig, W. Buhse and N. Rump. Heidelberg, Springer-Verlag.
Ride, W. D. L., Ed. (1985). Code international de
nomenclature zoologique. Berkeley, Univ. California Press.
Skerman, V. D. B., V. McGowan, et al. (1980).
“Approved Lists of Bacterial Names.” Int. J. Syst. Bacteriol. 30: 225-420.
Sneath, P. H. A., Ed. (1992). International Code
of Nomenclature of Bacteria
(1990 Revision). Washington, D.C., American Society for Microbiology.