Text Encoding Initiative
1. Introduction
At the end of the 1980s a group of people came together to develop a standard way of encoding textual material. They used SGML (Standard Generalized Markup Language). The group was called the Text Encoding Initiative (TEI).
When we now speak of TEI, we mean the XML format that is especially suited to encoding many different types of textual material, such as monographs, articles, poetry and drama.
The TEI and all TEI components are free to use. For some parts the GPL (GNU General Public License) applies, for example the DTDs that form the TEI P4, the HTML version of the TEI Guidelines and the documentation that can be found online.
2. From P1 to P5
Those of you who are already familiar with the 'TEI' acronym may also have seen the 'P'. The P stands for proposal, the first of which was published in 1990 and was called TEI P1. Starting with the third proposal, the standard is called Guidelines for Electronic Text Encoding and Interchange and was published as a two-volume book.
After the introduction of XML, the newly formed TEI Consortium developed the TEI P4. This proposal appeared in 2002 and can be used for both XML and SGML documents. The newest TEI specification is called P5 and was released on November 1, 2007.
3. Applicability of the TEI Guidelines
The TEI is especially designed for encoding textual material in such a way that as much information as possible is kept intact. Each type of text has its own characteristic structure (e.g. the division into acts and the different speakers in a play, or the chapters of a book). Using TEI, those specific elements of the text are encoded in meaningful representations in order to preserve the information. The complete text is encoded in digital format, including information about its content, its structure and its metadata. This standardized way of encoding should make it possible to exchange texts between different environments. Specific content can also be indexed easily using the markup information, as is done in the Digital Locke Project.
To give you an impression of the various markup elements used within the TEI, the following fragment shows an example. It’s a fragment of one of Shakespeare’s plays, King Lear:
<sp>
  <speaker>Kent.</speaker>
  <lg>
    <l part="Y">Do;</l>
    <l part="N">Kill thy physician, and the fee bestow</l>
    <l part="N">Upon the foul disease. Revoke thy gift,</l>
    <l part="N">Or, whilst I can vent clamour from my throat,</l>
    <l part="N">I'll tell thee thou dost evil.</l>
  </lg>
</sp>
This short fragment shows how separate lines are encoded in <l> elements, which are grouped together in an <lg> element. Extra information about a line can be encoded in the part attribute: part="N" (No) shows that the corresponding line is complete, while part="Y" (Yes) shows that the line is incomplete, i.e. something is missing. In the fragment only the first line is incomplete. The <speaker> element contains the name of the speaker of these lines.
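As a minimal sketch of how such markup can be processed, the following Python snippet reads the King Lear fragment with the standard library module xml.etree.ElementTree and reports the speaker and the lines marked as incomplete. The helper name read_speech is hypothetical and chosen only for this illustration.

```python
import xml.etree.ElementTree as ET

# The King Lear fragment from this section, stored as a string.
FRAGMENT = """
<sp>
  <speaker>Kent.</speaker>
  <lg>
    <l part="Y">Do;</l>
    <l part="N">Kill thy physician, and the fee bestow</l>
    <l part="N">Upon the foul disease. Revoke thy gift,</l>
    <l part="N">Or, whilst I can vent clamour from my throat,</l>
    <l part="N">I'll tell thee thou dost evil.</l>
  </lg>
</sp>
"""


def read_speech(xml_text):
    """Return (speaker, lines); lines is a list of (text, is_incomplete)."""
    sp = ET.fromstring(xml_text.strip())
    speaker = sp.findtext("speaker")
    # part="Y" marks an incomplete line, part="N" a complete one.
    lines = [(line.text, line.get("part") == "Y") for line in sp.iter("l")]
    return speaker, lines


speaker, lines = read_speech(FRAGMENT)
incomplete = [text for text, is_incomplete in lines if is_incomplete]
print(speaker, len(lines), incomplete)
```

Because the structure is explicit in the markup, the same approach could be used to build an index of speakers or lines across a whole play.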
4. Structure of the TEI Guidelines
The Guidelines explain how specific elements can be used for encoding a text. Because there are many different types of texts, the TEI Guidelines can be seen as a set of separate DTDs, each of which contains specialized elements for encoding text. These DTDs are called DTD fragments or tag sets. The tag sets can be divided into three groups:
- core tag set - these are the standard components of the TEI DTD. These form the base of the TEI DTD and will be used for every type of text.
The core tag set can be enriched with any of the following:
- base tag sets - these are the base tag sets for specific types of texts. Typically one of these is chosen. Examples are prose, verse and drama.
- additional tag sets - extra tag sets for specific goals. As an example, the set textcrit was used for encoding the Locke texts; it provides specific functionality for text-critical markup.
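The way tag sets are combined can be sketched in Python. The snippet below builds a P4-style document type declaration that switches on one base tag set and any number of additional tag sets through parameter entities. The public identifier, the system identifier tei2.dtd, the TEI.XML switch and the entity naming pattern (TEI. followed by the tag set name) reflect the P4 convention as best understood here and are assumptions that should be checked against the P4 Guidelines; the function name p4_doctype is hypothetical.

```python
def p4_doctype(base, additional=()):
    """Build a P4-style DOCTYPE enabling the chosen tag sets (a sketch)."""
    # Assumption: each tag set is enabled by a parameter entity "TEI.<name>"
    # set to 'INCLUDE'; "TEI.XML" selects the XML (rather than SGML) version.
    entities = ["TEI.XML", "TEI." + base]
    entities += ["TEI." + name for name in additional]
    subset = "\n".join(
        "  <!ENTITY % {} 'INCLUDE'>".format(entity) for entity in entities
    )
    return (
        '<!DOCTYPE TEI.2 PUBLIC "-//TEI P4//DTD Main Document Type//EN" '
        '"tei2.dtd" [\n' + subset + "\n]>"
    )


# Prose as the base tag set, enriched with the textcrit additional tag set.
doctype = p4_doctype("prose", ["textcrit"])
print(doctype)
```

The core tag set needs no switch in this sketch, since it is used for every type of text.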
The XML fragment below shows the global structure of a TEI document. Every text that is encoded in TEI begins with the element <TEI.2>. Directly inside it comes a header containing information about the text (metadata), such as the original author and information about the markup. The header is encoded using the <teiHeader> element. After the header comes the front matter (in the element <front>), containing the title, preface, table of contents, etc. Then comes the actual text, in the <body> element. Finally, some back matter (e.g. an appendix) may follow in the element <back>.
<TEI.2>
  <teiHeader>
    <!-- ... -->
  </teiHeader>
  <text>
    <front>
      <!-- front matter of copy text goes here. -->
    </front>
    <body>
      <!-- body of text goes here. -->
    </body>
    <back>
      <!-- back matter of text, if any, here. -->
    </back>
  </text>
</TEI.2>
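The skeleton above can also be inspected programmatically. As a rough sketch, the following Python snippet parses a copy of the skeleton and reports the root element, whether a header is present, and the order of the parts inside <text>. The function name outline is hypothetical; this checks only the overall shape and is no substitute for validation against the DTD.

```python
import xml.etree.ElementTree as ET

# A copy of the global TEI P4 document structure shown above.
SKELETON = """
<TEI.2>
  <teiHeader><!-- ... --></teiHeader>
  <text>
    <front><!-- front matter of copy text goes here. --></front>
    <body><!-- body of text goes here. --></body>
    <back><!-- back matter of text, if any, here. --></back>
  </text>
</TEI.2>
"""


def outline(xml_text):
    """Summarize the top-level structure of a TEI document."""
    root = ET.fromstring(xml_text.strip())
    text = root.find("text")
    return {
        "root": root.tag,
        "has_header": root.find("teiHeader") is not None,
        # The default parser skips comments, so only elements are listed.
        "text_parts": [child.tag for child in text] if text is not None else [],
    }


info = outline(SKELETON)
print(info)
```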
5. Differences between TEI P4 and TEI P5
The TEI Consortium released the TEI P5 specification on November 1, 2007. The P5 specification differs quite a lot from the older P4. Documents that are encoded in P4 will not validate against a P5 DTD and vice versa. P5 offers many new possibilities that make the encoding of texts more effective. As an example, the P5 specification allows other types of XML documents to be included; MARC records, for instance, can be embedded within the text.
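One visible difference is the root element: P4 documents start with <TEI.2>, whereas P5 documents use a <TEI> root element in the namespace http://www.tei-c.org/ns/1.0 (a detail of P5 that is not covered in the text above). The following Python sketch uses this to guess which version a document follows before choosing a DTD or schema to validate against. The function name guess_tei_version is hypothetical, and the check is a heuristic only.

```python
import xml.etree.ElementTree as ET

# Namespace used by TEI P5 documents.
TEI_P5_NS = "http://www.tei-c.org/ns/1.0"


def guess_tei_version(xml_text):
    """Guess 'P4' or 'P5' from the root element; 'unknown' otherwise."""
    root = ET.fromstring(xml_text)
    if root.tag == "TEI.2":
        return "P4"
    # ElementTree writes namespaced tags as "{namespace}localname".
    if root.tag == "{%s}TEI" % TEI_P5_NS:
        return "P5"
    return "unknown"


p4_doc = "<TEI.2><teiHeader/><text/></TEI.2>"
p5_doc = '<TEI xmlns="http://www.tei-c.org/ns/1.0"><teiHeader/><text/></TEI>'
print(guess_tei_version(p4_doc), guess_tei_version(p5_doc))
```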
6. Further reading
The amount of material that can be read about the TEI is huge. For those of you who cannot afford the time to delve into the separate tag sets, TEI Lite may be helpful. This is a ready-to-use DTD that covers a fair share of the TEI's possible applications.
Further information about the TEI starts here:
Directly jump to the TEI P5 Guidelines:
Directly jump to the TEI P4 Guidelines:
Text Encoding for Information Exchange:
TEI Lite: An Introduction to Text Encoding for Interchange
Online edition of the Guidelines for Electronic Text Encoding and Interchange