David J. Birnbaum
University of Pittsburgh
Department of Slavic Languages and Literatures
Email: djbpitt+@pitt.edu
URL: http://clover.slavic.pitt.edu/~djb/
Copyright © 2000 by David J. Birnbaum. All rights reserved.
Do not reproduce or cite without permission. Comments welcome.
Last revised: 2000-07-05 12:28:05.
Keywords: architectural forms, attributes, critical editions, document type definitions, DTD, elements, extensible markup language, XML, OmniMark, standard generalized markup language, SGML, text encoding initiative, TEI
Abstract: The present study discusses the advantages and disadvantages of general vs specific DTDs at different stages in the life of an SGML document based on the example of support for textual critical editions in the TEI. These issues are related to the question of when to use elements, attributes, or data content to represent information in SGML and XML documents, and the article identifies several ways in which these decisions control both the degree of structural control and validation during authoring and the generality of the DTDs. It then offers three strategies for reconciling the need for general DTDs for some purposes and specific DTDs for others. All three strategies require no non-SGML structural validation and ultimately produce fully TEI-conformant output. The issues under consideration are relevant not only for the preparation of textual critical editions, but also for other element-vs-attribute decisions and general design issues pertaining to broad and flexible DTDs, such as those employed by the TEI.
SGML (Standard Generalized Markup Language)
provides at least three ways for representing textual information: 1)
GIs (generic identifiers, the names of elements), 2) attributes, and 3)
pcdata
element content. For example, if a witness entitled
"witnessname" includes a reading "here is some text", these two pieces of
information (witness name and textual content) might reasonably be represented
without redundancy in SGML in any of the following five ways:
<witnessname>here is some text</witnessname>
<reading name="witnessname">here is some text</reading>
<reading> <content>here is some text</content> <witness>witnessname</witness> </reading>
<witnessname content="here is some text">
<reading name="witnessname" content="here is some text">
Thus, an editor must make the following decisions: 1) whether to represent the name of the witness as a GI (1 and 4), as an attribute value (2 and 5), or as the content of a subordinate element (3); and 2) whether to represent the textual content as pcdata content of the main element (1 and 2), in the content of a subordinate element (3), or as the value of an attribute (4 and 5).

Versions 4 and 5 can be dismissed immediately, and I have mentioned
the possibility of encoding the text provided by a witness as an attribute
value of an empty element only because this type of approach to representing
textual content has recently achieved a certain popularity in an
XML (Extensible Markup Language) e-commerce context as an alternative to traditional
DBMS (database management system) representations. Under this approach, a
DBMS record is encoded as an
XML empty element and
DBMS fields are each encoded as attribute values of
that element. This approach may be effective when the field content is plain
text, but it raises complications when the text in question may itself contain
markup that reflects an internal hierarchical element structure, especially if
this markup needs to be parsed for other purposes. This complication means that
whatever its merits in certain specific contexts, encoding significant (and
possibly hierarchically-structured) natural-language text as an attribute value
of an empty element is unappealing as a general solution to the problem of
preparing textual critical editions. For this reason, I will assume that the
text from witnesses should be represented as pcdata
content, which
means that I will omit further discussion of strategies 4 and 5, above.
Instead, I will concentrate on examining the different consequences of
representing the names of witnesses as
GIs, as attribute values, or as pcdata
content, as exemplified by strategies 1-3, above.
In some respects the informational differences among strategies 1-3, above, may appear slight. All three representations record unambiguously the name of the witness and the content, and these are the most important pieces of information for philological research. Each representation can be converted automatically to either of the others. There are, however, important differences among these three strategies involving both what they are capable of representing and how they interact with SGML syntax.
Before these issues can be addressed, it is necessary to examine the three methods employed by the TEI (Text Encoding Initiative) for encoding textual critical editions, all of which are based on strategy 2, above. Section 2, below (Flexible DTDs) outlines the general TEI strategy for ensuring the flexibility of a set of DTDs designed for multiple purposes. Section 3 (The TEI Approach to Critical Editions) narrows the focus from general TEI flexibility issues to those that pertain specifically to the encoding of textual critical editions. Section 4 (Problems with the TEI Approach to Critical Editions) identifies limitations imposed by the TEI DTDs on both what may be represented in critical editions and how that representation may be processed with SGML tools. Section 5 (Three Solutions) outlines three strategies for preparing TEI-conformant textual critical editions that overcome the liabilities identified in Section 4. Section 6 (Conclusions) summarizes the principal conclusions that emerge from this report.
One virtue of DTDs (document type definitions) is that they can help ensure structural uniformity (or, at least, coherence) across a body of document instances. This uniformity enables a set of similar or related documents to be processed, whether automatically by an application or intuitively in the mind of a user, in a consistent way.
The TEI DTDs are intended to serve a very broad community of users, whose interests and prejudices may vary considerably. This ambitious purpose creates an inevitable tension between the uniformity inherent in the notion of shared, communal DTDs, on the one hand, and the variation inherent in a heterogeneous user community, on the other. The TEI attempts to resolve this tension in three ways: modules (discussed in section 2.1, below), alternatives (discussed in section 2.2, below), and extensions (discussed in section 2.3, below).
One of the most ingenious features of the TEI architecture is its modular design. One popular metaphor for this design is that of a Chicago pizza:
All pizzas have some ingredients in common (cheese and tomato sauce); in Chicago, at least, they may have entirely different forms of pastry base, with which (universally) the consumer is expected to make his or her own selection of toppings. Using SGML syntax this might be summarized as follows:
<!ENTITY % base    "(deepDish | thinCrust | stuffed)" >
<!ENTITY % topping "(sausage | mushroom | pepper | anchovy ...)">
<!ELEMENT pizza - - (%base;, cheese & tomato, (%topping;)*)>

In the same way, the user of the TEI scheme constructs a view of the TEI DTD by combining the core tag sets (which are always present), exactly one "base" tag set and his or her own selection of "additional" tag sets or toppings. ([Burnard])
For example, the TEI DTD for critical editions used in the present report selects the prose base and the textual criticism topping. This modular approach to DTD design represents an attempt to compromise between the competing and equally unrealistic goals of providing a single, uniform DTD that will be suitable for all users, on the one hand, and enabling each user to design a custom DTD that is tailored to the specific needs of his or her documents and purposes, on the other. One practical implication of this approach is that there should be nothing completely surprising in a document designed according to the TEI pizza philosophy; while not all TEI-conformant documents will encode the same information the same way (or even at all), and not all will have been authored with the same specific TEI-conformant DTD, all such documents will have been constructed from common elements used in a flexible but not unrestricted way.[1]
In addition to permitting users to select the DTD modules they wish to include in their individual DTDs, in some cases the TEI DTDs also provide support for multiple ways of encoding particular structures. One striking example of this approach is the three different mechanisms proposed for linking an apparatus to a text in a critical edition: 1) location referencing (using line numbers or some other canonical reference scheme), 2) double-end-point attachment (indicating precise locations of variant readings), and 3) parallel segmentation (providing variant readings in parallel within the text) ([TEI P3], section 19.2).
These three mechanisms are described and discussed in greater detail below, but the differences among them are essentially of two types: the technological and the personal. Concerning technological differences, for example, the location reference and double-end-point attachment methods may be used either in line or in an external apparatus, while the parallel segmentation method may be used only for an in-line apparatus. Similarly, the double-end-point attachment method can identify the exact end points of variants, while the precision of the location reference method depends on the precision of the canonical reference system underlying it (for example, the use of biblical chapters and verses does not directly permit references to units smaller than a verse). As for personal differences, scholars may be used to visualizing a critical edition in a particular way (concerning both the reference method and the choice between an in-line and an external apparatus), and may find it either intellectually difficult or psychologically unacceptable to adopt a markup strategy that does not reflect their conceptualizations of a text directly, even when there is no informational difference at stake. The result of the compromise adopted by the TEI is that most editors will be able to prepare TEI-compatible critical editions that model their analytical perspectives on the texts fairly closely, but at the expense of reducing the extent to which a user can anticipate how an arbitrary TEI-conformant critical edition will be structured. The three methods mentioned above are discussed individually in section 3, below.
The TEI developers understood that it would not be possible to anticipate all the ways in which users would wish to encode documents, and that no degree of flexibility in choice of encoding methods could prove sufficient for all members of a broad community with varying goals and established practices. Accordingly, the TEI DTDs contain a standardized extension mechanism, which permits the introduction of markup not anticipated by the TEI editors, including the deletion of elements, the renaming of elements, the extension of classes, and the modification of content models or attribute lists. The mechanism is, in fact, so powerful that it:
... if used in an extreme way, permits deletion of the entire set of TEI definitions and their replacement by an entirely different DTD! Such revisions would result in documents that are not TEI conformant in even the broadest sense, and it is not intended that encoders use the mechanism in this way. ([TEI P3], section 29)[2]
Where elements are renamed or classes extended, the TEI guidelines provide for the inclusion of a TEIform attribute for each element, which can be used to associate a new GI with the original TEI element on which it is based. This strategy enables an application to refer to the TEIform attribute value when deciding how to process the new element. For example, a processing application might regard as TEI paragraphs not just elements where the GI is <p> (the default TEI element for encoding paragraphs), but also all other (new) elements where the value of the TEIform attribute is p. This type of system is illustrated in Section 5.1, below.
Texts undergo alteration during copying, either accidentally (when a scribe inadvertently miscopies a source text) or deliberately (when a scribe consciously changes a source text, often in an attempt to improve it or to correct what he perceives to be an error). A textual critical edition attempts to provide evidence from multiple copies of a work (called witnesses), which scholars may then use for a variety of purposes (e.g., to determine the filiation of the witnesses, to reconstruct lost early or intermediary copies, to trace the history of the transmission of the text, etc.). Editors sometimes treat all witnesses as equivalent; in other cases they will identify a principal witness (called a copy text), which is transcribed in its entirety, and cite selected variants from other witnesses (called control texts) only when those witnesses provide information that the editor considers important and that is not available from the copy text. A reading from a privileged text is sometimes called a lemma and the collection of variants from control texts is called a critical apparatus. Witnesses in a critical apparatus are traditionally identified by short unique strings of letters, numbers, and symbols called sigla (a contraction of plural sigilla, singular sigillum).
As was noted above, the TEI provides three methods for linking a critical apparatus to a text: 1) location referencing (using chapter and verse numbers or some other canonical reference scheme), 2) double-end-point attachment (indicating precise locations of variant readings), and 3) parallel segmentation (providing variant readings in parallel within the text) ([TEI P3], section 19.2).
What all three methods have in common is that the apparatus,
whether in-line or external, is contained in
<app>
(apparatus) elements. An <app>
element normally consists of <rdg>
(reading) elements, plus
an optional <lem>
(lemma) element, which may be used to
represent the reading from a privileged witness. Individual reading elements
may be included immediately within an apparatus element or they may be combined
within intermediary <rdgGrp>
(reading group) elements to
represent any grouping considered desirable by the editor. In either case, the
names of the witnesses to each reading will normally be encoded as the value of
a wit
attribute of the <rdg>
element.
Alternatively or additionally, the names of the witnesses may be specified in a
<wit>
element inside the <rdg>
element.
A skeletal apparatus element that uses reading groups might look like the following:
<app>
 <rdgGrp>
  <rdg wit="A">Text from witness A</rdg>
  <rdg wit="B">Text from witness B</rdg>
 </rdgGrp>
 <rdgGrp>
  <rdg wit="C">Text from witness C</rdg>
  <rdg wit="D E">Text that is identical in witnesses D and E</rdg>
 </rdgGrp>
</app>
As was noted above, the witnesses attesting a particular reading may be recorded as the value of the wit attribute of the <rdg> element (as above) or they may be enclosed in a separate <wit> element within the <rdg> element. In either case, the witnesses that are included in an edition are supposed to be documented in <witness> elements inside a <witList> element found elsewhere in the document.[3] The <witness> element has an attribute sigil, representing the sigillum associated with that witness, and it is intended that the values found in these sigil attributes will correspond to the witness identifiers associated with <rdg> elements (whether these are given as values of the attribute wit or as data content of the element <wit>).
Unfortunately, because the attribute wit of the <rdg> element and the attribute sigil of the <witness> element inside a <witList> element are both of type cdata, an SGML parser is unable to validate any aspect of the desired correspondences. But [TEI P3] tantalizingly suggests that:

The advantage of holding witness information in the wit attribute of <lem> or <rdg> is that this may make it more convenient for an SGML application to check that every sigil identifier has been declared elsewhere in the document. By giving the wit attribute a declared value of idrefs, for example, one could more easily ensure that readings are assigned only to witness sigla given as id values for witnesses in a <witList> element ... . (Section 19.1.4.2)
Because the standard TEI DTDs do not, in fact, declare the attributes in question as id and idrefs, as described in the preceding paragraph, and declare them instead as cdata, the advantages described above are not available. A user could, however, modify the TEI DTDs to change the attribute types, and because all attributes of type id and idrefs also meet the requirements for cdata, documents created in this way would be fully TEI conformant.[4]
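A minimal sketch of such a modification (simplified; the actual TEI attribute lists declare additional attributes) might look like the following:

<!-- redeclared attribute types (sketch only) -->
<!attlist witness sigil id     #required>
<!attlist rdg     wit   idrefs #required>

<!-- the parser will now report any wit value that does not -->
<!-- match a sigil declared somewhere in the document       -->
<witList>
 <witness sigil="A">First witness</witness>
 <witness sigil="B">Second witness</witness>
</witList>
...
<rdg wit="A B">a reading attested in witnesses A and B</rdg>

Note that because values of type id must be unique across the entire document, sigla declared in this way must not collide with other id values (such as the id attributes of <p> elements).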
The following subsections describe briefly each of the three methods for associating an apparatus with texts, identify the strengths and weaknesses of each, and compare their features.
The location reference method gives the reading from the base text in line and inserts the apparatus wherever the editor wishes. The apparatus may be associated with the text either by physical location (e.g., it may be included within the element to which it refers) or by explicit location reference. The precision of the reference depends on the precision of the reference method employed, which is related to the granularity of the markup. For example, as was noted above, a location reference system that relies on biblical chapters and verses is not able to associate an apparatus element with any portion of text smaller than a verse. ([TEI P3], section 19.2.1)
Location referencing is convenient where a canonical reference system is well-known and where precise alignment of the apparatus with the base text is not required. Location referencing also requires that there be a base text, since this method requires that the reading from exactly one witness appear in line.
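For example, an external apparatus keyed to a canonical numbering might look like the following sketch, which assumes that the loc attribute of <app> carries the canonical reference:

<!-- base text -->
<div1 n="1"><p>here is some sample text</p></div1>

<!-- external apparatus, keyed to canonical unit 1 -->
<app loc="1">
 <rdg wit="B">there is some sample text</rdg>
</app>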
Double-end-point attachment indicates the exact endpoints of a span of text either by referring to attributes of type id located within the main text or through indirect pointing location methods. If the apparatus is in line, the <app> element itself can mark one end point of the span to which it refers, with the other end point indicated by reference to an attribute of type id. ([TEI P3], section 19.2.2)
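For example (a sketch, assuming an external apparatus and empty TEI <anchor> elements marking the end points):

<!-- base text, with anchors delimiting the targeted span -->
<p id="p1">here is <anchor id="dep1">some sample text<anchor id="dep2"></p>

<!-- the apparatus points at the span between the two anchors -->
<app from="dep1" to="dep2">
 <rdg wit="B">a different sample text</rdg>
</app>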
The primary advantage of double-end-point attachment is that it is the only one of the three methods that is designed to handle overlap (discussed in greater detail in section 3.3, below). On the other hand, one consequence of this power is that double-end-point attachment is the most complex and least legible method, making it the most difficult to implement and process without specialized tools.
Parallel segmentation is the only method that does not require a privileged base text. Under the parallel segmentation method, all readings are grouped together in parallel, and although one may be designated as a lemma, this is not required, and except for assigning it a special name, the markup does not necessarily treat the lemma any differently from any other witness. ([TEI P3] section 19.2.3)
The only disadvantage to the parallel segmentation method is that it does not support overlap. For example, suppose three witnesses attest the following:
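Witness A: here is some sample text
Witness B: there is some sample text
Witness C: there is some sample prose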
Witnesses A and B vary only with respect to the first word, witnesses B and C vary only with respect to the last word, and witnesses A and C vary with respect to both the first and last words. Under parallel segmentation, one must either treat the entire line as a single <app> element with three separate readings (which fails to create any formal record of the agreement that does occur) or divide it into three portions: an <app> element for the first word (where A != B = C), data content for the central section (where there is no variation), and another <app> element for the last word (where A = B != C). This last strategy creates a formal record of all agreement and disagreement, but at what may be an inappropriate granularity; for example, with respect to A and B there are only two logical segments, the first word (where they disagree) and the rest of the sentence (where they agree). Because parallel segmentation, unlike double-end-point attachment, must be applied to all witnesses at once, the only alternative to one long segment is three short segments, as if the agreement between A and B in the middle of the sentence were a separate phenomenon from their agreement at the end.
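Using the hypothetical line above, the three-segment encoding might look like the following sketch:

<app>
 <rdg wit="A">here</rdg>
 <rdg wit="B C">there</rdg>
</app>
is some sample
<app>
 <rdg wit="A B">text</rdg>
 <rdg wit="C">prose</rdg>
</app>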
The present section begins by describing a different type of critical edition (section 4.1). It then identifies both limitations of this type of edition (section 4.2) and ways in which this type of edition is capable of resolving problems inherent in the standard TEI methods (section 4.3). Section 5, below provides three solutions to the problems identified in Section 4.
The present section outlines the rationale for employing an alternative type of edition (section 4.1.1) and describes its structure (section 4.1.2).
Traditional printed critical editions most commonly provide a base text (either transcribed from a privileged principal witness or constructed by the editor) and record variants in a separate apparatus, usually printed in the margins of the page. This type of presentation simplifies reading the base text and it economizes on paper, but it achieves these goals at the expense of complicating both reading any text other than the base (since that reading must be reconstructed on the fly by mentally replacing selected base readings with variants plucked from the apparatus) and studying variation in general (since readers must move their eyes constantly between the in-line reading of the base text and the apparatus that is found elsewhere). Furthermore, the compromised legibility of this type of apparatus encourages editors to be selective about citing variants, and this type of selectivity means that readers will be unable to distinguish text where there is no variation at all from text where the editor has determined that although variation exists, it is not significant. These two problems are identified in [Birnbaum] as "compromised legibility" and "incomplete presentation of the evidence," respectively.
The problems of compromised legibility and incomplete presentation of the evidence, noted above, can be avoided by transcribing all witnesses in full in parallel, along the lines of:
A: Line 1 from Witness A
B: Line 1 from Witness B
C: Line 1 from Witness C

A: Line 2 from Witness A
B: Line 2 from Witness B
C: Line 2 from Witness C

etc.
The presentation of full transcriptions of all text from all witnesses in parallel resembles a conductor's musical score, and is sometimes called a "score-like" edition. A score-like edition enables a user to read any witness easily, since there is no need to move one's eyes between a base text in one location and variants in another, and it also allows a user to see at a glance which witnesses agree with which others at a particular location.
The score-like structure may be considered a special case of
TEI parallel segmentation, and it can be
implemented using standard
TEI parallel segmentation methods. It is similar
to this method in that it incorporates an in-line apparatus in a way that does
not privilege a single base text. But it differs from the
TEI model in two respects: 1) all text is
included in <app>
elements (that is, even text that is
identical in all witnesses will nonetheless be recorded separately for each
witness) and 2) each <rdg>
element is associated with
exactly one witness.
The two differences noted above result in important limitations
in the power of a score-like edition that are not present in a true
TEI parallel segmentation edition. The first
difference, the inclusion of all text in
<app>
elements, even when there is no variation, means that
there is no formal distinction between locations where some witnesses differ
and locations where all witnesses agree. In the
TEI model, <app>
elements would
be introduced only where there is variation, which enables text that varies to
be identified automatically (although, as was noted above, overlap in variants
may require segmentation that is either narrower or broader than the agreement
patterns among specific witnesses would justify). The second difference, the
association of each <rdg>
element with exactly one witness,
means that there is no formal encoding of agreement among selected witnesses.
In the
TEI model, a <rdg>
element has a
wit
attribute of type cdata
that contains a list of
sigla for all witnesses attesting the reading in question, a strategy that
enables an application to locate patterns of agreement by parsing the document
and postprocessing the content of the wit
attributes. Under the
score-like structure, each reading will be associated with exactly one witness,
which means that the only way to identify patterns of agreement will be to
postprocess the data content of the <rdg>
elements, a much
more complicated operation than comparing simple and standardized sigla (and,
furthermore, one
that makes it impossible to use SGML tools to
locate patterns of agreement).
These two types of differences make it impossible to determine from the markup of a score-like edition where there is variation and where there is not (the first difference, but one that is also present to a lesser extent in the TEI parallel segmentation method) and where particular manuscripts agree and where they do not (the second difference). These are important limitations, and editors will need to decide whether supporting the formal encoding of this type of information is more valuable than overcoming the legibility and completeness problems noted above.
It is, of course, the case that a true TEI parallel segmentation encoding, if it includes all variation, can be converted to a score-like edition automatically, and one might argue that the more informative TEI parallel segmentation structure should be used for encoding a source file that can then be transformed into the more legible score-like format for rendering. This is a sensible compromise, although it cannot address fully two problems: 1) as was noted above, even the pure TEI parallel segmentation method is not able to formalize all patterns of agreement at the appropriate level of granularity because it is inherently unable to represent overlap, and 2) it is easier for an editor to avoid the incompleteness issue by encoding all witnesses in full.
From a different perspective, the intellectual process of identifying variation involves comparing all readings from all witnesses, which is precisely the perspective afforded by a score-like edition. With this consideration in mind, one might envision the relationship between the score-like and strict TEI parallel segmentation representations the other way around: the score-like representation is the input to identifying the patterns of variation that may then be encoded explicitly in a strict TEI parallel segmentation representation. From a production perspective, one might start by transcribing all witnesses in full, use these to collate the witnesses in a score-like encoding, run the collated composite text through an application that identifies variation, and then use the output of the variant analysis to generate a critical apparatus using one of the standard TEI methods. Not only does an approach that takes full transcriptions of individual manuscripts as a starting point mirror the intellectual process of textual criticism, but it is also easily rerun should the editor need to change the list of witnesses, whether because of new discoveries or because certain witnesses must later be eliminated.
A score-like edition provides an opportunity to overcome certain limitations in structural control that are inherent in the three standard TEI methods of encoding a critical apparatus. Section 4.3.1 describes the technical weaknesses of those methods and section 4.3.2 describes how a score-like edition provides an opportunity to overcome them.
One striking difference between the parallel segmentation and
score-like editions is that parallel segmentation supports the association of a
single <rdg>
element with multiple witnesses, and this is in
many respects a strong argument for the superiority of the parallel
segmentation method. But this feature of the parallel segmentation method also
imposes a certain cost, since there are times when the association of a
single <rdg>
element with multiple witnesses would be an
error, and an
SGML parser is unable to distinguish these
situations from those where this association is appropriate. As the
TEI guidelines note: "The hand
and resp
attributes [of <rdg>
elements] are
intelligible only on an element recording a reading from a single witness
... If more than one witness is given for a reading, they are undefined."
([TEI P3], 19.1.2) This must be given as a prose admonition,
rather than encoded in the
DTD, because
SGML tools are incapable of distinguishing when
attributes are defined based on the content of another attribute. This means
that the considerable advantages in the parallel segmentation method of being
able to associate a single reading with multiple witnesses is partially offset by
the creation of an opportunity to introduce undefined markup that cannot be
discovered through normal
SGML validation.
The representation of all readings from all witnesses
as <rdg>
(reading) or <lem>
(lemma)
elements in the
TEI
DTDs introduces an important control problem. An
editor who is creating a score-like edition using the
TEI parallel segmentation apparatus method might
mark up a section of text as follows:
<p id="p1"> <app> <rdg wit="WitnessA">text from witness A</rdg> <rdg wit="WitnessB">text from witness B</rdg> <rdg wit="WitnessC">text from witness C</rdg> </app> </p>
Because
SGML controls linear and hierarchical document
structure through elements, but not through attributes, there is no way that an
SGML parser can ensure that all witnesses are
represented, that no witness appears more than once, or that the witnesses
occur in a particular order. If reading groups are used, an
SGML parser is unable to validate
whether <rdg>
elements associated with specific witnesses
occur in the correct <rdgGrp>
element. This type of
validation can be performed externally, but because what
SGML does best is validate structure, it seems
perverse to create an
SGML document that depends on structural
features that are not representable in the
DTD. However, this type of validation could be performed
internally using standard
SGML tools if individual witnesses were
distinguished not by attributes, but by elements. That is, what is at issue is not an inherent limitation in the expressive power of DTDs, but a difference between the syntactic properties of elements and attributes within a DTD framework.
It is clear that a score-like critical edition offers both advantages and disadvantages with respect to traditional critical editions. It is also clear that the TEI parallel segmentation method is able to represent many--but not all--of the features of a score-like edition. Finally, it is clear that some features of a score-like edition could be represented with greater structural control using the TEI parallel-segmentation method if the TEI DTDs could be changed as follows:
1) Individual witnesses could be identified by GIs, rather than by attribute values, so that content models could ensure that each reading refers to exactly one witness, that no witness is omitted or duplicated, and that witnesses occur in a consistent order. 2) Witnesses could likewise be assigned by the content models to reading groups. The TEI DTDs permit editors to organize readings (<rdg> elements) into reading groups (<rdgGrp> elements), but they do not provide a mechanism for ensuring that readings are grouped correctly. For example, an editor might wish to group manuscript witnesses in one reading group and printed witnesses in another, but although the TEI DTDs support the creation of these groups, they do not support the use of an SGML parser to verify which witnesses are listed in which group. If individual witnesses were identified by GIs, rather than attribute values, the content models could control grouping.

This section examines three strategies for implementing the desiderata listed above. The initial requirements for solutions to this problem were that 1) all validation had to be performed using SGML tools and 2) the final result had to be fully TEI-conformant.
The three solutions involve 1) modifying the
TEI
DTDs according to the recommendations in the
guidelines ([TEI P3], Section 29) and subsequently processing
the document by referring to the TEIform
attribute (Section 5.1);
2) encoding the document using a custom
DTD and then transforming it to a standard
TEI
DTD using an arbitrary transformation tool (Section
5.2), and 3) encoding the document using a custom
DTD that incorporates the
TEI
DTD as a base architecture, and then using
SGML architectural processing to transform the
document to a standard
TEI
DTD (Section 5.3).
The test document used in this report is the following small hypothetical critical edition:
Witness A: First line from witness A
Witness B: First line from witness B
Witness C: First line from witness C

Witness A: Second line from witness A
Witness B: Second line from witness B
Witness C: Second line from witness C
The standard TEI markup for this document in a parallel segmentation edition would be:
<!-- tei-standard.sgml -->
<!doctype tei.2 public "-//TEI P3//DTD Main Document Type 1996-05//EN" [
  <!entity % TEI.prose 'INCLUDE'>
  <!entity % TEI.textcrit 'INCLUDE'>
]>
<tei.2>
<teiheader>
 <filedesc>
  <titlestmt>
   <title>TEI Critical Edition Test Document, Standard TEI Version</title>
  </titlestmt>
  <publicationstmt>
   <p>Unpublished.</p>
  </publicationstmt>
  <sourcedesc>
   <p>Original test document created 2000-03-10 by djb.</p>
  </sourcedesc>
 </filedesc>
</teiheader>
<text>
 <body>
  <p id="p1">
   <app>
    <rdg wit="A">First line from witness A</rdg>
    <rdg wit="B">First line from witness B</rdg>
    <rdg wit="C">First line from witness C</rdg>
   </app>
  </p>
  <p id="p2">
   <app>
    <rdg wit="A">Second line from witness A</rdg>
    <rdg wit="B">Second line from witness B</rdg>
    <rdg wit="C">Second line from witness C</rdg>
   </app>
  </p>
 </body>
</text>
</tei.2>
As was noted earlier, this representation is unable to use SGML tools to validate the occurrence, order, or grouping of the witnesses. All three solutions proposed below will achieve greater control over these features in two ways: by representing each witness as its own element with its own GI and by revising and constraining certain content models to a subset of the content permitted by the TEI DTDs.
Creating new GIs for each witness makes it possible to develop a DTD that ensures that each <rdg>-type element refers to exactly one witness, that no witness is omitted inadvertently (although it is possible to declare certain witnesses as omissible, should that be desired), that no witness occurs more
than once, and that all witnesses occur in a consistent order. This test case
does not use the
TEI <rdgGrp>
element, but the
technique described here can also be applied to ensure that specific witnesses
appear only in the appropriate reading groups, and [Birnbaum]
illustrates a custom
DTD approach that incorporates reading groups.
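For example, content models along the following lines (the group elements here are hypothetical) would allow an SGML parser to verify that manuscript and printed witnesses appear only in their respective groups:

<!-- sketch: witness-specific elements constrained to specific groups -->
<!element app        - - (msgroup, printgroup)>
<!element msgroup    - - (witnessa, witnessb)> <!-- manuscript witnesses -->
<!element printgroup - - (witnessc, witnessd)> <!-- printed witnesses    -->
<!attlist msgroup    teiform cdata #fixed "rdgGrp">
<!attlist printgroup teiform cdata #fixed "rdgGrp">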
These issues of content control are not unique to critical
editions. For example, Simons notes that it may be convenient to design custom
GIs for frequent combinations of standard
TEI
GIs with particular attribute values, and he gives
the example of replacing the markup <foreign lang="SIK">
with
a custom <sik>
element in a dictionary of Sikaina. ([Simons], Section 2.2)
Convenience is certainly important, but an even
more compelling reason to design custom
GIs is that elements can provide types of structural
control that are unavailable with attributes. As Simons notes, one advantage to
creating a new <idiom>
element as a replacement for the
standard
TEI <eg type="idiom">
(where
<eg>
represents examples of any type) and a new
<lit>
element for literal translations of idioms as a
replacement for the standard
TEI <tr type="lit">
element
(where <tr>
represents translations of any type) is that the
<lit>
element then can be constrained to occur only in the
<idiom>
element (that is, translations may occur freely in
examples of all types, but literal translations may occur in examples only when those
examples are idioms).
Constraining the content models of elements addresses a problem that Simons describes as the SGML and XML counterpart to "fatware". Much as software may be encumbered with features that not only are not needed by many users, but also may get in the way, so a large and general DTD, such as the TEI DTDs, may support more elements, broader content models, and more and broader attributes than are required for a specific project. ([Simons], Section 2.3) This is to be expected, since general DTDs need to support a variety of projects, but the availability of unneeded markup is both inconvenient (for example, unwanted markup in a menu is clutter, and may overwhelm visually the list of elements a user might actually need) and pernicious (since it enables the author to use markup that is legal in the general DTD but not desirable in the particular project).
One crucial feature of these issues is that the greater control over frequency, ordering, and grouping that is provided by the new GIs and content models is required at certain stages in the life of the document, but not at others. Most commonly, one might require strict control during authoring with a validating editor; alternatively, one might use a non-validating authoring tool and then ensure the validity of the document through an iterative process of external validation and revision. But whether validation is part of the authoring process from the beginning or introduced only at the end, once the validity of a completed document has been confirmed, subsequent processing does not require additional validation. This means that although the new structural control features discussed above may need to be present in the DTD used for validation during or after authoring, once the document has been completed and validated, transformation engines, rendering engines, and other post-authoring processes may have no direct need for witness-specific GIs or constrained content models. In this respect, the strategy in question extends a feature that was first observed as a general principle when XML was developed: much useful processing can be performed independently of a DTD.
From a slightly different perspective, the standard TEI DTDs may be unable to constrain document structure during authoring and validation as well as the alternatives discussed below, but as long as those constraints can be ensured in some other way, this limitation of the standard TEI DTDs may be unimportant during subsequent processing. This distinction reflects the different roles of the DTD at different stages in the life of the document. Specifically, during authoring and subsequent validation, the DTD defines the set of possible valid documents that can be created. But once the document has reached the stage where it is valid and will not be edited further, the document instance itself defines its own unique structure, and any other possible document structures that may also be licensed by the DTD become irrelevant as far as that particular document is concerned.[5]
The TEIform Attribute Approach

The TEIform-attribute approach is the only one of
the three strategies that does not require the explicit use of a non-TEI
DTD at any stage. Instead, this approach involves
modifying the
TEI
DTDs as prescribed in the published guidelines
([TEI P3], Section 29), in this case by:
1) creating new witness-specific elements that are modeled on <rdg> and that declare "rdg" as the value of the TEIform attribute, which will enable a processing system to determine that the new elements should be processed identically to standard TEI <rdg> elements. The value of the wit attribute of each new element will be declared as a specific fixed value, which will prevent mismatches, omission, or duplication. Because no other standard attributes of <rdg> are used in the test file, these are not declared for the new elements, thus avoiding an opportunity for error; and 2) modifying the content model of <app> to admit these new elements and prohibit the use of the original <rdg> element or any other original content (to avoid an opportunity for error).

Section 5.1.1, below, illustrates these modifications of the TEI DTDs. Section 5.1.2, below, evaluates these modifications according to the clean/unclean dichotomy established by the TEI Guidelines ([TEI P3], Section 29.1). Section 5.1.3, below, demonstrates how a document created with a modified TEI DTD can be processed by a generic TEI-aware tool without requiring any special knowledge about the modifications.
The approach described above is implemented by creating the following TEI.extensions.ent file (called teiform-test.ent) and TEI.extensions.dtd file (called teiform-test.dtd). As was noted above, unused parts of the original content models and unused original attributes are removed, since their presence only creates an opportunity for error. The consequences of these modifications and of others that achieve a similar effect are discussed below.
<!-- teiform-test.ent -->
<!-- The following element is revised -->
<!entity % app 'IGNORE'>
<!-- teiform-test.dtd -->
<!-- The following declaration defines a new content -->
<!-- model for the revised app element               -->
<!element app - - (witnessa, witnessb, witnessc) >
<!attlist app
          %a.global
          teiform cdata #fixed "app" >

<!-- The following three declarations define new     -->
<!-- elements, which occur in the revised content    -->
<!-- model of the app element                        -->
<!element witnessa - o (%paraContent) +(%m.fragmentary)>
<!attlist witnessa
          %a.global
          wit     cdata #fixed "A"
          teiform cdata #fixed "rdg" >

<!element witnessb - o (%paraContent) +(%m.fragmentary)>
<!attlist witnessb
          %a.global
          wit     cdata #fixed "B"
          teiform cdata #fixed "rdg" >

<!element witnessc - o (%paraContent) +(%m.fragmentary)>
<!attlist witnessc
          %a.global
          wit     cdata #fixed "C"
          teiform cdata #fixed "rdg" >
With modifications of this type in place, the following valid TEI-conformant document can be created:
<!-- tei-teiform.sgml -->
<!doctype tei.2 public "-//TEI P3//DTD Main Document Type 1996-05//EN" [
  <!entity % TEI.extensions.ent system "teiform-test.ent" >
  <!entity % TEI.extensions.dtd system "teiform-test.dtd">
  <!entity % TEI.prose 'INCLUDE'>
  <!entity % TEI.textcrit 'INCLUDE'>
]>
<tei.2>
<teiheader>
 <filedesc>
  <titlestmt>
   <title>TEI Critical Edition Test Document, TEIform Version</title>
  </titlestmt>
  <publicationstmt>
   <p>Unpublished.</p>
  </publicationstmt>
  <sourcedesc>
   <p>Original test document created 2000-03-10 by djb.</p>
  </sourcedesc>
 </filedesc>
</teiheader>
<text>
 <body>
  <p id="p1">
   <app>
    <witnessa>First line from witness A</witnessa>
    <witnessb>First line from witness B</witnessb>
    <witnessc>First line from witness C</witnessc>
   </app>
  </p>
  <p id="p2">
   <app>
    <witnessa>Second line from witness A</witnessa>
    <witnessb>Second line from witness B</witnessb>
    <witnessc>Second line from witness C</witnessc>
   </app>
  </p>
 </body>
</text>
</tei.2>
As was noted above, the TEI DTDs allow the association of multiple witnesses with a reading by declaring the type of the wit attribute of <rdg> elements as cdata. A score-like edition, on the other hand, presents the full text of each witness on a separate line, which can best be represented in SGML by requiring that each reading element be associated with exactly one witness. The declaration of the wit and TEIform attributes as fixed and the use of shorttag in the default TEI SGML declaration associates the correct attribute value with each new element, while ensuring both that these attributes do not need to be included explicitly in the markup (which is a convenience) and that the inadvertent inclusion of any value other than the one specified in the DTD will raise a parser error (which is a safeguard). ([DeRose], Section 5.15.)
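For example, given the declarations above, the first two start-tags below are equivalent (the parser supplies the fixed attribute values automatically), while the third raises a parser error:

<witnessa>here is some text</witnessa>
<witnessa wit="A" teiform="rdg">here is some text</witnessa>
<witnessa wit="B">here is some text</witnessa> <!-- error: conflicts with fixed value "A" -->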
This modification both permits (actually, requires) the use of
the newly-defined elements inside <app>
elements and
prohibits the use of other elements usually permitted in that context. It does
not, however, restrict the content of the <text>, <body>, or <p> elements that surround
the <app>
element, which creates an opportunity for a user to
input element or data content that is legal in standard
TEI documents but not wanted in this particular
modified document. To provide greater protection against such errors, the
content of these outer elements could be redefined similarly to the
redefinition of the <app>
element documented above.
Furthermore, if the content of the new witness elements will always be pcdata, the content models can be narrowed, providing additional protection against the inadvertent inclusion of unwanted markup.
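For example, the declaration of <witnessa> could be narrowed from the one given above to:

<!element witnessa - o (#pcdata)>

With this declaration, an SGML parser will reject any subelement inside the witness reading.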
The TEI Guidelines divide modifications of the TEI DTDs into two classes, called "clean" and "unclean." Clean modifications are of two types: "The set of documents parsed by the original DTD may be properly contained in the set of documents parsed by a modified DTD, or vice versa." ([TEI P3], Section 29.1) In the present study, the first type of modified DTD will be called a "new superset DTD" and the second type will be called a "new subset DTD." The TEI Guidelines draw no further distinction between the two types of clean modification, and they also do not state explicitly that clean modifications are preferable to unclean ones, although this might be inferred from the vernacular meaning of the terms themselves.
In fact, there are striking practical differences in the utility of new subset DTDs and new superset DTDs. Document instances prepared with new subset DTDs may be parsed by any user who has a standard TEI configuration. This means that such document instances conform fully to the standard TEI distribution and may be exchanged without regard for the modifications (which in this case are relevant only during document preparation). Document instances prepared with new superset DTDs, on the other hand, cannot be parsed in arbitrary environments configured for the standard TEI distribution, which means that the document instances themselves do not conform to the standard (unmodified) TEI model. While the interchange and processing advantages of new subset DTDs are clear, the only processing advantage to new superset DTDs accrues to the developer, whose modified environment will be able to process standard TEI documents alongside his or her superset documents. On the other hand, such a use of modified DTDs to parse unmodified TEI document instances would make it impossible to verify whether the instances are valid against the unmodified TEI DTDs.
In light of these considerations, the different practical consequences of these two models suggest that the most important distinction may lie not between clean and unclean modifications, but between document instances prepared with clean subset DTDs, which can be exchanged freely without special preparation, and those prepared with clean superset DTDs, which cannot.
The modifications of the text file documented in the preceding
section are necessarily unclean because they are of two types that have
opposite consequences: the creation of entirely new elements means that the new
DTD cannot be a subset of the original
TEI
DTDs, while the imposition of new restrictions
on the content model of some standard elements means that the new
DTD also cannot be a superset of the original
TEI
DTDs. As is demonstrated below, however, it is
nonetheless possible to process documents created with the modified
DTD using tools that do not need to refer
explicitly to the newly-created elements. This is the strategy that motivated
the creation of the TEIform
attribute ([TEI P3],
Section 3.5), and the use of this feature during processing means that the
unclean documents in question can enjoy the same interchange and processing
benefits of documents that reflect clean subset modifications.
From a more general perspective, there is no way to achieve a
clean subset
DTD while creating new elements. Once one is
committed to creating new elements, it is possible to create a clean superset
DTD by permitting the optional use of the new
elements alongside all other elements that are already legal in the context in
question. For example, instead of redefining the <app>
element to admit only the new elements, one could extend the content model to
admit either the new elements or the original content licensed by the standard
TEI
DTDs. This approach was not adopted in the
present case for two reasons: 1) it creates the opportunity for error should
the user inadvertently populate the <app>
element with the
original
TEI content, rather than the new elements, and
2) the advantages of clean superset modifications are much less than the
advantages of clean subset modifications, and they were not considered
sufficient to justify the compromise in content control that would
result.
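For illustration, such a clean superset modification might redeclare <app> along the following lines (the original TEI content model is abbreviated here to a simplified form):

<!-- sketch: admit either the new elements or (simplified) original content -->
<!element app - - ((witnessa, witnessb, witnessc)
                 | (lem?, (rdg | rdgGrp)+))>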
The advantage of the modified TEI DTD approach is that the resulting document is fully TEI-conformant from start to finish, since although it extends the TEI DTDs, it does so in a standardized and well-documented way. The disadvantage of this approach is that although this type of extension may be standardized and well-documented, tools and applications that have been configured to process TEI-conformant documents based on standard TEI GIs will not know how to process the new elements created above. One solution to this problem involves modifying the processors explicitly as needed to recognize new elements, but this approach has no general value and is suitable only for small and infrequent use.
A more robust and scalable approach is to reconfigure
TEI processors to use the
TEIform
attribute value in lieu of or in addition to the
GI. For example, while most Omnimark scripts
designed to process
TEI documents are designed to respond to the
GIs of the standard
TEI elements, an alternative approach that
ignores the
GIs and acts instead on
TEIform
attribute values can deal with the modified
DTD above without having to know anything more
about the new
GIs. Omnimark permits the definition of a
catch-all "implied" element, which fires whenever an element is encountered for
which specific actions are not declared. The use of this feature means that a
generic Omnimark
TEI transformation script could read the
TEIform
attribute of an unfamiliar element and act on it according
to that attribute value. More generally, an Omnimark script for processing both
unmodified and modified
TEI documents could be written with only one
element rule (for the default "implied" element), with different actions
depending on the value of the TEIform
attribute. A script designed
in this way for standard
TEI documents will be able to process any
extended document for which the new elements were created only to provide
control during authoring, and do not require custom processing.
The following brief Omnimark script converts the test document
in section 5.1.1, above, to
HTML 4.0 without calling any explicit rules for
the newly-defined elements. Similar strategies can be implemented with any
scripting language that is capable of processing an element with an unfamiliar
GI according to its TEIform
attribute value.
; tei-teiform.xom
; run with omnimark -s tei-teiform.xom
;   teisgml.dec tei-teiform.sgml
;   -d socat "e:\lib\sgml\dtd\teip3\catalog"
;   -i "e:\program files\omnimark\xin\"
;   -of tei-teiform.html

down-translate

include "socatete.xin"

global counter rdgcount initial {0}

element #implied
   do when attribute teiform = "TEI.2"
      output '<!doctype html public "-//W3C//DTD HTML 4.0//EN">'
      output "%n<html>"
      output "%c"
      output "%n</html>"
   else when attribute teiform = "teiHeader"
      output "%n<head>%c%n</head>"
   else when attribute teiform = "fileDesc"
      output "%c"
   else when attribute teiform = "titleStmt"
      output "%c"
   else when attribute teiform = "title"
      output "%n<title>%c</title>"
   else when attribute teiform = "publicationStmt"
      output '%n<meta name="publicationStmt" content="%c">'
   else when attribute teiform = "sourceDesc"
      output '%n<meta name="sourceDesc" content="%c">'
   else when attribute teiform = "text"
      output "%c"
   else when attribute teiform = "body"
      output "%n<body>%c%n</body>"
   else when attribute teiform = "p"
      output "%n<p>" unless ancestor is teiheader
      output "%c"
      output "</p>" unless ancestor is teiheader
   else when attribute teiform = "app"
      set rdgcount to 0
      output "%c"
   else when attribute teiform = "rdg"
      increment rdgcount
      output "%n"
      output "<br>" when rdgcount > 1
      output "<strong>Witness %uv(wit):</strong> %c"
   else
      output "%nUndefined element: %q"
      suppress
   done
Although the preceding script can be considered only a proof of concept (in its current form it includes only the TEIform attribute values that occur in the test document), it is nonetheless clear that: 1) the script is able to process the newly-created witness elements without any explicit knowledge of their GIs (it responds only to TEIform attribute values); and 2) the same approach will serve for any extension in which new elements declare appropriate values for the TEIform attribute, as documented in [TEI P3], (Section 29).

Should the modification strategy discussed here become
widespread (and this was clearly the intent underlying the creation of the
TEIform
attribute), designers of general
TEI processing tools might wish to build in the
appropriate support for processing the TEIform
attribute when new
GIs are encountered. This strategy would enable
a generic
TEI transformation or other processing tool to
process an arbitrary document that includes a
DTD that has been extended in this way without
compromising the ability to process standard (unextended)
TEI documents. A generic processor obviously
cannot anticipate newly-created
GIs, but if those
GIs are used only to increase structural control
during authoring and do not otherwise require special handling, subsequent
processing can be controlled through the TEIform
attributes, which
are not extended during this type of modification.
One might wish, on this basis, to distinguish two types of unclean modifications: those where the uncleanliness is important during authoring but can then be ignored during processing and those where the uncleanliness must be maintained at all stages of the life of the document. The score-like edition project described here is of the first type.
As was demonstrated above, modifying the TEI DTDs is not as difficult as it may appear to those who have never tried, thanks to the developers' creative use of parameter entities and marked sections and the excellent documentation available in [TEI P3]. But the full TEI DTDs, even when only the necessary modules have been selected, will be overkill for many projects, and using a DTD that licenses an element one doesn't need creates an opportunity for error. The preceding section dealt with this problem by redefining TEI content models and attribute declarations to exclude unneeded markup, but a comparable result might also be achieved by creating a TEI-independent minimal custom DTD for use in authoring and then converting the document to the standard TEI DTDs later. This approach does not require any modification to the TEI DTDs because the custom DTD provides the necessary control during authoring, and once authoring is finished, that control is no longer needed.
The custom DTD approach is described in greater detail in [Birnbaum], where it was used in a large troff-to-SGML conversion that was part of a critical edition project. Briefly, the prolog including a custom DTD for the test document used here might look like:
<!-- tei-custom.dtd -->
<!doctype document [
  <!element document - - (p)+>
  <!element p - - (witnessa, witnessb, witnessc)>
  <!attlist p id id #required>
  <!element (witnessa | witnessb | witnessc) - - (#pcdata)>
  <!attlist witnessa wit cdata #fixed "A">
  <!attlist witnessb wit cdata #fixed "B">
  <!attlist witnessc wit cdata #fixed "C">
]>
The principal advantage of this approach is the extreme simplicity of the DTD. The TEI header is not included because there is no advantage to authoring it outside the real TEI DTDs. That is, one would encode the TEI header (using the real TEI DTDs) and the body of the textual critical edition (using a custom DTD) separately, transform the body so that it will be TEI-conformant, and then combine the two parts for publication. The custom DTD illustrated here contains no markup that is not used in the test document, but the content models could be expanded to correspond to those found in the standard TEI DTDs, if desired.
Note that the attribute wit
is declared
as cdata
with a fixed value. The cdata
declaration
corresponds to the value in the
TEI
DTDs, although an alternative declaration
requiring a name (such as id) would also be
TEI-compatible, since names are a subset
of cdata
strings (although they are subject to case folding and
character restrictions). As was noted in section 5.1.1, above,
the fixed
declaration ensures that only the declared value will be
permitted, and the use of shorttag
in the standard
TEI
SGML declaration means that the default value does
not need to be specified explicitly in the document instance. This strategy
allows the user to declare any value that would be acceptable in the
TEI
DTDs, but it also avoids the opportunity for error
by ensuring that the user does not need to specify the value explicitly, and
that incorrect values specified explicitly will be caught by the parser.
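To illustrate (these fragments are not part of the test document):

<!-- valid: the parser supplies wit="A" automatically -->
<witnessa>First line from witness A</witnessa>

<!-- also valid: the explicit value matches the #fixed declaration -->
<witnessa wit="A">First line from witness A</witnessa>

<!-- invalid: a conflicting value is reported as an error by the parser -->
<witnessa wit="B">First line from witness A</witnessa>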
The test document instance marked up according to this DTD would look like:[6]
<!-- tei-custom.sgml -->
<document>
<p id="p1">
 <witnessa>First line from witness A</witnessa>
 <witnessb>First line from witness B</witnessb>
 <witnessc>First line from witness C</witnessc>
</p>
<p id="p2">
 <witnessa>Second line from witness A</witnessa>
 <witnessb>Second line from witness B</witnessb>
 <witnessc>Second line from witness C</witnessc>
</p>
</document>
The following Omnimark script will convert the custom version of
the test document into the standard
TEI version, suitable for combination with a
<teiHeader>
(with a <TEI.2>
root element):
; tei-custom.xom
; run with omnimark -s tei-custom.xom
;   teisgml.dec tei-custom.dtd tei-custom.sgml
;   -of tei-custom-standard.sgml

down-translate

element document
   output "%n<text>"
   output "%n <body>"
   output "%c"
   output "%n </body>"
   output "%n</text>"

element p
   output '%n <p id="%lv(id)">'
   output "%n <app>"
   output "%c"
   output "%n </app>"
   output "%n </p>"

element (witnessa | witnessb | witnessc)
   output '%n <rdg wit="%uv(wit)">%c</rdg>'
As with the modified TEI DTD approach, above, any SGML-aware tool can be used for the transformation.
This section discusses the use of SGML architectures to implement the mapping and conversion between a custom DTD and the TEI DTDs. Section 5.3.1 discusses the advantages of architectural processing, section 5.3.2 describes how architectural processing works, and section 5.3.3 illustrates how architectural processing can be implemented to support the current project.
The custom DTD approach described above allows the user to construct a project-specific DTD that enforces much greater structural control than is available in the standard TEI DTDs. The transformation of the document from markup according to the custom DTD to markup according to the standard TEI DTDs enables the author to take advantage of this structural control when it is needed, viz. during authoring, and then to get it out of the way when custom markup becomes an impediment, viz. during publication and interchange.
The transformation process described above uses a custom Omnimark script to convert the custom document to a standard TEI document, and the advantages of this approach are the simplicity of the custom DTD and the fact that users may employ any scripting language with sufficient power to accomplish the desired transformation. One disadvantage of this approach, however, is that the transformation script must deal with all elements, including those that have the same GIs and attributes in both the custom DTD and the TEI DTDs, although this limitation could be overcome by building into the transformation script a default identity transformation.
A more significant limitation to the transformation process described above is that the relationship between the custom DTD and the TEI DTDs is completely external to the custom DTD itself. Since the mapping between the custom DTD and the TEI DTDs is logically part of the informational value of the custom DTD (that is, the custom DTD is designed with remapping to the TEI DTDs in mind), it is desirable to build that mapping into the DTD itself. This relationship could be expressed through comments in the custom DTD, but 1) this approach is not obligatorily formalized, 2) the completeness of the mapping cannot be validated automatically, and 3) the TEI version of the document that results from the implementation of the mapping cannot be validated without performing the conversion and then validating the output separately. That is, the validity of the remapped document against the TEI DTDs is not inherent in the document itself.
SGML architectural forms provide a mechanism for formalizing the mapping between a custom DTD (called a "document DTD") and a TEI DTD (called an "architectural DTD"). The principal informational advantage of architectural processing over the custom DTD strategy described above is that architectural processing integrates into the document DTD itself the identity of the architectural DTD, information about how an architectural processor can access the architectural DTD, and information about how markup in the document DTD is associated with markup in the architectural DTD. The principal practical advantage of architectural processing is that an architectural processor can validate a document against both the document DTD and the architectural DTD simultaneously, and some architectural engines can even generate an output document that implements the associations between the two DTDs as transformations. In other words, architectural engines of this type are capable of converting a document marked up with a document DTD into a new document with the same basic content, but with the original markup replaced by markup taken from the architectural DTD. Architectural forms are not an all-purpose transformation mechanism, and their transformational power is considerably less than that of Omnimark and other scripting languages that are common in SGML environments, but, as is shown below, they are fully capable of supporting the associations required by the critical edition project described here.
Simons observes that architectural processing provides an alternative to the TEI notion of clean and unclean modifications, described in section 5.1.2, above. According to this new model, a custom document that employs a TEI DTD as an architectural DTD may be considered architecturally cleanly conformant if the document in question is valid with respect to both the document DTD and the architectural DTD. As Simons notes, unlike with the TEI definition of clean and unclean modifications, architecturally clean conformance can be validated in one step with an architectural parser. ([Simons], Section 6) Furthermore, this interpretation avoids the imbalance between clean subset and clean superset modifications, since a document can be architecturally clean only in one way. The architectural implementation illustrated below is architecturally clean, which is to say that the test document is valid against both the document DTD and the TEI architectural DTD.
As is shown below, architectural processing, like modifying the TEI DTDs, is not as complicated a procedure as it may appear to those who have never tried it. Excellent introductions include [Kimber1], [Kimber2], [Clark1], and [Clark2]. The architectural form standard is defined in [ISO10744] and a convenient set of links to additional information is available at [Cover].
Briefly, architectural processing requires the following steps (illustrated in full in section 5.3.3, below):
1. Create a file containing the standard TEI DTD with the required modules (here, tei-textcrit-pizza.dtd).

2. Add the following declarations to the document DTD:

<?IS10744 ArcBase tei >
<!entity % teidtd system "tei-textcrit-pizza.dtd" >
<!notation tei system >
<!attlist #notation tei
   arcDocF  name  #fixed TEI.2
   arcFormA name  #fixed tei
   arcDTD   cdata #fixed "%teidtd"
>

The system entity refers to the standard TEI DTD created in the preceding step. The first attribute (arcDocF) identifies the root element of the standard TEI DTD as <TEI.2>. The second attribute (arcFormA) assigns the name tei to the architectural DTD attribute (that is, it says that a global attribute with the name tei will be used to identify the element in the standard TEI DTD that corresponds to each new element in the document DTD). The third attribute (arcDTD) identifies the architectural DTD as the standard TEI DTD declared earlier.

3. Declare an architectural attribute tei for each element that must be mapped to a different element in the TEI architectural DTD as follows:

<!element witnessa - - (#pcdata) >
<!attlist witnessa
   wit cdata   #fixed "A"
   tei nmtoken #fixed "rdg"
>

This example says that the element <witnessa> in the document DTD will be mapped to the element <rdg> in the architectural DTD.

4. Parse the document with an architecture-aware tool by typing something like "nsgmls -A tei -s filename", where "tei" names the architecture to use when parsing the document and "filename" represents the name of the document file. This procedure is illustrated below.
One useful feature of architectural forms is that GIs that correspond in the document and architectural DTDs are mapped automatically. This means that it is not necessary to specify architectural attributes for such elements, which can greatly simplify the creation of the document DTD. To take advantage of this feature, one needs to assign the same GIs to corresponding elements wherever possible. For this reason, all GIs in the following document DTD are borrowed from the architectural TEI DTD except where they correspond to new elements. In the custom DTD strategy discussed in section 5.2, there was no necessary advantage to implementing this type of correspondence, although one could create a script that performs the same type of default mapping that occurs in architectural processing (which would be done in Omnimark, for example, by using the default [implied] element rule).
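A minimal sketch of such a default rule follows (illustrative only, and not part of the conversion script in section 5.2; note that it copies only element structure and content, and a full identity transformation would also need to reproduce attribute specifications):

; hypothetical default rule: copy any element that has no
; more specific rule, using %q (the current element's GI)
element #implied
   output "<%q>%c</%q>"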
The custom
DTD strategy described in Section 5.2 created
only the eventual content of the <text>
element of the
TEI version of the document, under the
assumption that the <teiHeader>
element could be authored
separately and then combined with the <text>
element after
the latter was generated from the custom document. This strategy is
inconvenient under the architectural approach because this approach validates
the document against both
DTDs simultaneously, and a
TEI document without a
<teiHeader> element is invalid. For this
reason, the most convenient authoring strategy within the architectural
approach involves authoring the entire document (including the
<teiHeader>
) using the document
DTD.
For the present project, it was important to restrict the
content of the eventual
TEI <text>
element as much as
possible as a way of preventing the inadvertent use of unwanted markup. On the
other hand, there was no desire to restrict the content of
the <teiHeader>
element. This means that the document
DTD, which would be used for authoring, needed
to include all features of the <teiHeader>
, but very little
of the original content of the <text>
element.
As a shortcut to including the
<teiHeader>
markup in the document
DTD, a monolithic
DTD was generated for an independent TEI header by creating the following
empty
TEI document:
<!-- tei-header.sgml -->
<!doctype ihs public
  "-//TEI P3//DTD Auxiliary Document Type: Independent TEI Header//EN" [
]>
and then using the spam tool from the SP suite to run "spam -pp
tei-header.sgml > tei-header.dtd". The output of this procedure was then
edited by hand to remove the doctype declaration line at the top of the file
and the "]>" at the bottom. As was noted above, because the
<teiHeader>
element and all its content elements in the
document
DTD will have the same
GIs as in the architectural
DTD, it is not necessary to edit the
<teiHeader>
DTD to declare tei
architectural
attributes for them explicitly.
The resulting
DTD fragment supports all elements in the
standard <teiHeader>
but none of the content that is
specific to the <text>
element. It also does not define a
root <TEI.2>
element. This means that the user must define
the root element and non-header markup, which can be designed to support only
what is needed for a particular project. Because some elements may be used in
both the <teiHeader>
and the <text>
, it is
safest to create entirely new elements where any special constraints are
required inside the <text>
, rather than to redefine existing
elements. For example, it might be useful for this project to define paragraphs
in the body of the document as consisting entirely of <app>
elements, but paragraphs also occur in the <teiHeader>
, where
it is important that they be able to support their usual content. Conflicts are
avoided by a type of name-spacing, where the string "djb-" is prefixed to the
original
GI of corresponding original
TEI elements. For example,
<djb-p>
is a replacement for the standard
<p>
element in the <text>
(actually,
<djb-text>
), while the standard
TEI <p>
element remains
available within the <teiHeader>
element. Because no standard
TEI element begins with the string "djb-", this
strategy ensures that new
GIs will not conflict with existing ones.
The following is a possible SGML prolog (doctype declaration and internal DTD subset) for the architectural version of the test file:
<!-- tei-architecture.dtd -->
<!doctype tei.2 [
<!-- magic incantation to support architectural processing -->
<!-- uses "tei" as architecture name -->
<!-- incorporates TEI DTD with prose and text crit modules -->
<?IS10744 ArcBase tei >
<!entity % teidtd system "tei-textcrit-pizza.dtd" >
<!notation tei system >
<!attlist #notation tei
   arcDocF  name  #fixed TEI.2
   arcFormA name  #fixed tei
   arcDTD   cdata #fixed "%teidtd"
>
<!-- header markup is all declared in a separate file -->
<!entity % tei-header system "tei-header.dtd" >
%tei-header;
<!-- root and all non-header markup declared here -->
<!-- all new elements are prefixed "djb-" and bear "tei" -->
<!-- attributes to define architectural mapping -->
<!element tei.2 - - (teiheader, djb-text) >
<!element djb-text - - (djb-body) >
<!attlist djb-text tei nmtoken #fixed "text" >
<!element djb-body - - (djb-p)+ >
<!attlist djb-body tei nmtoken #fixed "body" >
<!element djb-p - - (djb-app) >
<!attlist djb-p
   id  id      #required
   tei nmtoken #fixed "p"
>
<!element djb-app - - (djb-wita, djb-witb, djb-witc) >
<!attlist djb-app tei nmtoken #fixed "app" >
<!element (djb-wita | djb-witb | djb-witc) - - (#pcdata) >
<!attlist djb-wita
   wit cdata   #fixed "A"
   tei nmtoken #fixed "rdg"
>
<!attlist djb-witb
   wit cdata   #fixed "B"
   tei nmtoken #fixed "rdg"
>
<!attlist djb-witc
   wit cdata   #fixed "C"
   tei nmtoken #fixed "rdg"
>
]>
The test document instance marked up according to the preceding DTD looks as follows:
<!-- tei-architecture.sgml -->
<tei.2>
<teiheader>
 <filedesc>
  <titlestmt>
   <title>TEI Critical Edition Test Document, Architectural Version</title>
  </titlestmt>
  <publicationstmt>
   <p>Unpublished.</p>
  </publicationstmt>
  <sourcedesc>
   <p>Original test document created 2000-03-10 by djb.</p>
  </sourcedesc>
 </filedesc>
</teiheader>
<djb-text>
 <djb-body>
  <djb-p id="p1">
   <djb-app>
    <djb-wita>First line from witness A</djb-wita>
    <djb-witb>First line from witness B</djb-witb>
    <djb-witc>First line from witness C</djb-witc>
   </djb-app>
  </djb-p>
  <djb-p id="p2">
   <djb-app>
    <djb-wita>Second line from witness A</djb-wita>
    <djb-witb>Second line from witness B</djb-witb>
    <djb-witc>Second line from witness C</djb-witc>
   </djb-app>
  </djb-p>
 </djb-body>
</djb-text>
</tei.2>
The document DTD ensures that only newly-defined
elements may occur outside the <teiHeader>
, while the
architectural attributes ensure that these new elements are associated with
standard
TEI elements.
Using the SP toolkit, the document may be parsed simultaneously against both the document and architectural DTDs using nsgmls by typing "nsgmls -A tei -s tei-architecture.dtd tei-architecture.sgml" (where "tei-architecture.dtd" represents the prolog file illustrated above and "tei-architecture.sgml" represents the instance file). The instance may be converted to standard TEI markup by typing "sgmlnorm -A tei tei-architecture.dtd tei-architecture.sgml".
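Modulo normalization details (such as case folding under the standard TEI SGML declaration), the output of the second command should be a standard TEI instance along the following lines (a sketch, not verbatim tool output):

<TEI.2>
<teiHeader>
<!-- header content carried over unchanged -->
</teiHeader>
<text>
<body>
<p id="p1">
<app>
<rdg wit="A">First line from witness A</rdg>
<rdg wit="B">First line from witness B</rdg>
<rdg wit="C">First line from witness C</rdg>
</app>
</p>
<!-- second paragraph converted in parallel fashion -->
</body>
</text>
</TEI.2>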
This section summarizes the conclusions that emerge concerning several related problems posed at the beginning of and during the course of this report.
If one marks up witness variants as
<rdg>
elements distinguished by the value of the
wit
attribute, the principal strategy available in the standard
TEI
DTDs, it is impossible to use an
SGML parser to ensure that 1) each witness appears
exactly once in each <app>
element, 2) the witnesses occur
in a consistent and specific order, and 3) where reading groups are employed,
the witnesses fall inside the desired <rdgGrp>
element within
the <app>
element. This limitation can be overcome by
representing the names of witnesses not as attribute values (as in the
TEI wit
attribute) or data content
(as in the
TEI <wit>
element), but as
GIs.
As was noted above, there are advantages in structural control to
changing the declarations in the standard
TEI
DTDs for the sigil
attribute
of <witness>
elements (from cdata
to id
) and the wit
attribute
of <rdg>
elements (from cdata
to idrefs
). This control can be combined with the strategy of
creating new
GIs for each witness, should that be desired,
although it becomes less necessary once new
GIs have been declared, since the new content
models of the
DTD already ensure better control over witness
identification than would be available from the id/idrefs
mechanism.
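In outline, the redeclared attributes would read as follows (a sketch of the end state only, with other attributes omitted; in practice the change would be made through the TEI parameter-entity mechanism rather than by editing the declarations directly):

<!-- sketch: tighter declarations than the standard cdata versions -->
<!attlist witness sigil id     #implied >
<!attlist rdg     wit   idrefs #implied >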
If we now return to the first three strategies for representing
witness names listed in section 1, above (GIs, attributes, and data content, respectively),
we can conclude that the first (GIs) provides the most structural control, the
second (attributes) can enforce some coordination between readings and
witnesses (especially if the attribute type is changed from cdata
to idref
or idrefs
), and the third (data content)
provides no significant structural control (it can ensure that an element
exists to hold a witness identifier, but it cannot validate the specific data
content of that element at all).
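The contrast can be recapitulated schematically in DTD terms (fragments adapted from declarations used earlier in this report; the two content models for <app> are alternatives, not parts of one DTD):

<!-- alternative 1: witness GIs; presence, uniqueness, and order are parser-enforced -->
<!element app - - (witnessa, witnessb, witnessc) >

<!-- alternative 2: attribute values; idrefs ties readings to declared witnesses, -->
<!-- but order and exactly-once occurrence cannot be validated -->
<!element app - - (rdg)+ >
<!attlist rdg wit idrefs #required >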
Very tight structural control may be desirable during authoring, but much looser control is often completely satisfactory for subsequent processing. This suggests that instead of assuming that a single DTD will be used for all purposes, it might be profitable to employ a strict DTD for authoring and a flexible one for interchange and subsequent processing. This observation extends the XML philosophy that one can do many useful things with a structured document without accessing a formal DTD. Another way to look at this issue is that the DTD is important during authoring because it constrains the types of documents that may be created. Once a document has been created, it can have only one type, which is the type implemented in the specific document instance itself. A processor that has to deal with such a document will have no need to know about all the other structures it might have had.
As was noted above, there are inherent contradictions between the need for DTDs that provide appropriate structural control during authoring for specific projects, on the one hand, and the need for DTDs that are flexible enough to enable a community of users to exchange files without requiring special accommodations, on the other. This contradiction can be resolved by modifying a communal DTD (such as the TEI DTDs) in a way that enforces authoring control but still permits the resulting document to be processed without needing to revise the tools to accommodate the modifications directly. Alternatively, the contradiction can be resolved by using one DTD for authoring and then transforming the document instance so that it conforms to a different DTD before publication or other processing. This transformation can be performed with an arbitrary scripting language or with SGML architectural processing.
From a publishing perspective, score-like critical editions address two problems that are widespread in traditional critical editions: incomplete presentation of the evidence and compromised legibility. From an SGML engineering perspective, score-like critical editions make it possible to develop project-specific DTDs that provide much better structural control than is available through the standard TEI approach. On the other hand, encoding for score-like editions does not distinguish formally between situations where there is textual variation and situations where there is not; similarly, where there is variation, encoding for score-like editions does not provide a formal record of which witnesses agree.
A compromise approach might view a score-like edition as a presentational view that can be generated from one of the standard TEI models. This is true as far as it goes, but it turns out that two of the three TEI methods are likewise incapable of formalizing all details of variation.
A robust authoring environment for score-like critical editions
requires that all witnesses be represented in all sections (unless they are
designed to be omissible), that no witness occur more than once, that all
witnesses occur in a particular order, and that where reading groups
(<rdgGrp>
) are used, witnesses occur only in the appropriate
reading groups. These issues cannot be controlled purely through
SGML with standard
TEI parallel segmentation editions because only
GIs (and not attributes) can enforce these
requirements, and the ability to associate a reading with multiple witnesses
makes it impossible (or, at least, grossly impractical) to try to replace
attribute value witness identifiers with
GIs. A score-like edition, on the other hand,
because it does not permit a single reading entry to be associated formally
with multiple witnesses, provides an opportunity to implement additional
structural control features by designing an appropriate
DTD.
Clean subset modified TEI DTDs differ from clean superset TEI DTDs in that only the former produce document instances that show no trace of having been created with modified DTDs. Although the declaration of new elements combined with restrictions on the content models of standard elements creates what the TEI Guidelines ([TEI P3]) call an unclean modification, the resulting document instance can nonetheless be processed with unmodified TEI-aware scripts. This observation can be extended to all unclean modifications where the superset aspects of the new DTD are required only during authoring, and not during subsequent processing.
Some projects may be authored with very small specific
DTDs, after which the document instances may be
converted so that the custom markup is replaced with standard
TEI markup. Designing a custom
DTD is relatively easy, especially if it is used
only for the material that will be included in the
TEI <text>
element, with
the <teiHeader>
authored separately. The principal
disadvantages of this method are that the mapping from the custom
DTD to the
TEI
DTDs is external to the document and that the
eventual
TEI document can be validated only after the
transformation has been performed.
Although SGML architectures are not as powerful as Omnimark and other scripting languages commonly used for SGML processing, they are fully capable of implementing the types of associations required by the current project. Architectural tools can validate a document simultaneously against both the document DTD and the architectural DTD, and can also output a new version of a document with the original markup replaced by the corresponding markup from the architectural DTD.
Any of the three strategies discussed in section 5, above
(processing a modified
TEI
DTD with respect to TEIform
attribute
values, [section 5.1], transformation of a custom
DTD to a
TEI structure [section 5.2], and architectural
forms [section 5.3]) provides a solution to the issues posed by a score-like
edition. Specifically, these strategies all permit much greater structural
control than is available in the standard
TEI
DTDs, rely entirely on
SGML for all validation, and produce a final
document that is fully
TEI-conformant.
[1] The TEI DTDs actually also permit the simultaneous use of multiple bases (a "mixed" base), which means that it is technically possible to create a universal TEI DTD that includes all markup available in all TEI DTD modules. This, in turn, means that any TEI document created without modifying existing TEI components (see below) should be able to be parsed against this universal TEI DTD.
[2] This warning is not formalized, which is to say that it is not possible to determine unambiguously when deletion, renaming, extension, and modification have become so extensive that one can no longer claim TEI compatibility. While the text states explicitly that deleting all TEI definitions would not produce a TEI-conformant document, it is surely the case that deleting all but one such definition, or all but two, etc. would also not be considered conformant practice.
[3] For reasons explained immediately below, the
TEI
DTDs do not strictly require the inclusion of
<witness>
elements for each witness; in fact, they do not
require the inclusion of a <witList>
element at all.
The <witList>
element has been omitted from the examples in
this paper to save space, but this would not normally happen with real critical
editions.
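For the test document, a minimal <witList> along the following lines would normally be included (a sketch; the descriptions are placeholders):

<witList>
<witness sigil="A">Description of witness A</witness>
<witness sigil="B">Description of witness B</witness>
<witness sigil="C">Description of witness C</witness>
</witList>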
[4] The principal impediment to revising the official
TEI
DTDs to change the declarations of the
sigil
and wit
attributes to id
and
idrefs
, respectively, is that this change could render some
existing documents invalid. In particular, some editors may wish to employ
sigla that begin with digits (such as years), and because attributes of type
id
are names, which must begin with name start characters, and
digits are not name start characters according to the standard
TEI
SGML declaration, a value for name attributes
that began with a digit would not be valid. Other issues that arose during
TEI development discussions of this question
included sigillum references of the type "c-e" (to indicate witnesses c, d, and e)
and the convenience of including annotations (such as question marks to
indicate uncertainty) within the wit
attribute (C. M.
Sperberg-McQueen, personal communication).
In retrospect, since the <wit>
element is
already available as an alternative to the wit
attribute and can
answer the needs described above, it seems particularly unfortunate that the
utility of the sigil
and wit
attributes was
compromised through the cdata
declarations. An alternative solution
might have involved providing an opportunity for editors who wish to employ
sigla that are not valid
SGML names to use id
-type
sigil
and idrefs
-type wit
attributes purely for internal control, which could involve, for example,
tagging a witness that one would like to call "1643" as <witness
sigil="witness1643" n="1643">
(using the global cdata
attribute n
). An application could then render the vernacular name
by accessing the cdata
value of the n
attribute, while
the system could validate the relationship between the witnesses in
the <witList>
element and those cited in readings through the
SGML id/idrefs
mechanism.
Because names (including id
and
idrefs
attribute values) are case-insensitive (using the standard
TEI
SGML declaration), while cdata
is
case-sensitive, authors who change the default type of the sigil
and wit
attributes will need to monitor the consistency of their
case usage. Authors do not normally need to be consistent in case usage when
authoring using attribute values that are names, but if the document is then
distributed with a standard
TEI
DTD, where the values that were authored as
names come to be published as cdata
, they will all be valid, but
the
ESIS (element structure information set) output of cdata
attributes that differ
in case will also differ in case. A user can choose to ignore this mismatch
during subsequent processing (which will have to be handled separately from
SGML validation in any case, since
SGML tools cannot validate correspondences
between cdata
attributes) or to normalize the case usage of the
name-type attributes before changing
DTDs (for example, with a tool such as sgmlnorm
from the SP toolkit).
[5] This is not intended to suggest that a DTD is always needed only during editing. For example, during processing one might wish to identify not only the attribute value that has been associated with a particular element, but also the universe of possible values from which a particular one was chosen.
[6] The assignment of id
attributes to
<p>
elements is left to the editor, but the
DTD and script could easily be revised to
prohibit the editor from specifying an id
value explicitly and
require the script to assign consecutive numerical values automatically.
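A sketch of how the conversion script from section 5.2 might be revised to assign such values (hypothetical; the id attribute would at the same time be removed from the declaration of <p> in the custom DTD):

; hypothetical revision: number paragraphs consecutively
; instead of reading an id attribute from the instance
global counter p-count initial { 0 }

element p
   increment p-count
   output '%n <p id="p%d(p-count)">'
   output "%n  <app>"
   output "%c"
   output "%n  </app>"
   output "%n </p>"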
Acknowledgements: I am grateful to David Mundie, Casey Palowitch, and Elizabeth Shaw for comments on an earlier version of this paper.