Marking Up The Hobbit in XML
As a starting point, I’m working on the electronic markup of the text of The Hobbit in the Extensible Markup Language (XML).
Because much of the analysis we want to do goes beyond a list of words and into the structure of the texts, there is work to be done in representing that structure.
At first glance, a novel like The Hobbit has a simple structure: chapters made up of paragraphs.
Chapters are numbered and so we can use TEI-like XML tags that carry the chapter number to delimit them and the familiar <p> to delimit paragraphs. Each chapter has a name, for which we can use a heading element.
Hence the start of chapter one of The Hobbit might be something like:
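A minimal sketch of that, assuming a TEI-style <div> with type and n attributes for the chapter and <head> for its name. These element and attribute choices follow common TEI practice and are assumptions, not confirmed project markup:

```xml
<!-- Assumed structure: <div type="chapter" n="..."> wrapping <head> and <p> -->
<div type="chapter" n="1">
  <head>An Unexpected Party</head>
  <p>In a hole in the ground there lived a hobbit. ...</p>
  <!-- remaining paragraphs of the chapter -->
</div>
```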
It doesn’t quite end there, however. Other issues one must deal with include:
- words and phrases in italics
- section breaks
- illustrations (or at least their placement and captioning)
- other specially-formatted sections such as Thorin’s letter
In a later post, we’ll explore the different purposes for which italics are used in The Hobbit, but for now we won’t distinguish between them and will simply use <hi> per the TEI guidelines. For example:
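A minimal sketch, using placeholder text rather than a quotation from the book:

```xml
<!-- Placeholder sentence; only the use of <hi> around the italic phrase matters -->
<p>A sentence of narrative with <hi>an italicised phrase</hi> in the middle of it.</p>
```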
<hi> is short for “highlight” and, in TEI, neither suggests why the content is being highlighted nor (unless there is a rend attribute) how. Given <hi> in our markup is always rendered as italics in the printed books, I have omitted the rend attribute for now.
On 52 occasions, there is a section break in the text. In the 2007 printed edition, this is indicated by extra vertical spacing mid-page and, on a page break where it would otherwise go unnoticed, by three asterisks.
A forthcoming blog post will look at the different functions of these section breaks but they usually represent a shift of some kind, e.g. from dialog to compressed narrative.
There is no single obvious way to mark up these section breaks and nothing in the TEI guidelines addresses them. I am currently vacillating between a separate empty element (but which? Nothing in TEI seems appropriate) and some attribute on the following paragraph (but again, it’s unclear which attribute would be appropriate).
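To make the two options concrete, here is a sketch of each. The element name section-break and the attribute value after-break are purely hypothetical, chosen for illustration only:

```xml
<!-- Option 1: a separate empty element between paragraphs
     (the element name "section-break" is hypothetical, not TEI) -->
<p>Last paragraph before the break.</p>
<section-break/>
<p>First paragraph after the break.</p>

<!-- Option 2: an attribute on the following paragraph
     (the attribute name and value are hypothetical) -->
<p>Last paragraph before the break.</p>
<p type="after-break">First paragraph after the break.</p>
```

The first keeps the break as an object in its own right; the second treats it as a property of the paragraph that follows it.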
I’ve tentatively settled on one form for now, with a variant for the places where asterisks are used.
We currently represent illustrations with markup like:
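A sketch of what such markup might look like, assuming TEI's <figure> element with a <graphic> for the image and a <head> for the caption. The file name is a hypothetical placeholder:

```xml
<!-- Assumed markup: TEI <figure> with <graphic> and <head>;
     the url value is a hypothetical placeholder -->
<figure>
  <graphic url="illustrations/the-trolls.png"/>
  <head>The Trolls</head>
</figure>
```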
The location of these elements is currently based on the Kindle edition although, in the print edition, they are on pages of their own and so their location in the running text is dependent on where page breaks fall. Page breaks are not currently marked up but if and when they are, that will provide an opportunity for better indicating the location of illustrations in the text.
The Hobbit has 25 instances of verse. The lines and stanzas (or “line-groups”) of these are marked up. There is no dedicated TEI element for an overall poem and so we use a generic element with a type for that:
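A minimal sketch, assuming the generic element is a <div> with a type of poem (that particular choice is an assumption for illustration):

```xml
<!-- Assumed wrapper: a generic <div> typed as a poem (hypothetical choice) -->
<div type="poem">
  <lg>
    <l>Far over the misty mountains cold</l>
    <l>To dungeons deep and caverns old</l>
    <!-- remaining lines of the stanza -->
  </lg>
  <!-- further <lg> stanzas -->
</div>
```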
(<lg> indicates a line-group and <l> a line.)
Some lines are indented (with up to four distinct levels in some poems) and so we mark that up on the line as well:
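One way this might look, assuming the indentation level is recorded in TEI's rend attribute. The values indent1 and indent2 are hypothetical, and the line text is placeholder:

```xml
<!-- rend is a real TEI attribute; the values used here are hypothetical -->
<lg>
  <l>An unindented line</l>
  <l rend="indent1">A line indented one level</l>
  <l rend="indent2">A line indented two levels</l>
</lg>
```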
There is a single footnote in The Hobbit. In Chapter 17, the mention of Bolg has a footnote explaining that he is the “Son of Azog” with a reference to the page where Azog is mentioned.
We have, for now, left the page reference as-is (just as in the Kindle edition) even though, in an electronic text, it could be linked better to the original reference.
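A sketch of how the footnote could sit inline at the point of reference, assuming TEI's <note> element with a place attribute. The surrounding sentence and the page reference are elided rather than reproduced:

```xml
<!-- Assumed markup: TEI <note> placed where the footnote marker appears;
     the page reference is kept exactly as printed and is elided here -->
<p>... Bolg<note place="bottom">Son of Azog. See p. ...</note> ...</p>
```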
Other Markup Issues
Beyond the structures discussed above, there are three others that manifest presentationally in the text. The first is Thorin’s letter to Bilbo. The overall markup is straightforward and we use a generic element with a type:
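A minimal sketch, assuming the generic element is a <div> with a type of letter (an assumed choice, with placeholder content):

```xml
<!-- Assumed wrapper: a generic <div> typed as a letter (hypothetical choice) -->
<div type="letter">
  <p>First paragraph of the letter ...</p>
  <!-- further paragraphs, then the closer -->
</div>
```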
The closer of the letter is a little more interesting, however. We could mark it up:
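As a sketch of this first option, assuming line-breaks are recorded with TEI's empty <lb/> element. Quotation marks are left out, and exactly where the closer begins is an assumption:

```xml
<!-- Option 1: the closer as one paragraph with explicit line-breaks -->
<p>We have the honour to remain<lb/>
Yours deeply<lb/>
Thorin &amp; Co.</p>
```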
This assumes the closer is a single paragraph with line-breaks, which raises questions we’ll explore a little more in a subsequent post. Alternatively, we can treat each line of the closer as a paragraph and use the specific TEI element for the entire closer:
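A sketch of this second option, assuming the element in question is TEI's <closer> and again leaving out the quotation marks:

```xml
<!-- Option 2: each line of the closer as its own paragraph inside <closer> -->
<closer>
  <p>We have the honour to remain</p>
  <p>Yours deeply</p>
  <p>Thorin &amp; Co.</p>
</closer>
```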
As yet another alternative, we could use specific TEI elements within the closer:
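A sketch of this third option, assuming TEI's <salute> and <signed> elements. Which element suits which line is itself a judgment call, so the assignment below is only one possibility:

```xml
<!-- Option 3: specific TEI elements inside <closer>;
     the assignment of lines to elements is an assumption -->
<closer>
  <salute>We have the honour to remain</salute>
  <salute>Yours deeply</salute>
  <signed>Thorin &amp; Co.</signed>
</closer>
```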
It is interesting to note that the 2012 Kindle version has the opening quotation marks only at the start of the closer whereas the 2007 print version repeats the opening quotation marks on each new line (as is a common print convention for new paragraphs in a multi-paragraph quote).
For now, I have elected to go with the second alternative, although I am still on the fence.
The other two issues are in chapter 5 and have to do with a riddle or response to a riddle that is distinctly presented in the text.
The first is the No-legs lay on one-leg riddle. It is presented not as verse but merely as a paragraph all in italics. We mark this up as:
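One plausible form, assuming the whole paragraph is simply wrapped in <hi> in line with the treatment of italics described earlier:

```xml
<!-- Assumed markup: an ordinary paragraph whose entire content is highlighted -->
<p><hi>No-legs lay on one-leg, two-legs sat near on three-legs, four-legs got some.</hi></p>
```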
The second is Bilbo’s riddle answer “Time! Time!”, which is not in italics but is centred and offset with space above and below. I initially chose to mark this up as a paragraph, but the fact that the preceding paragraph ends with a colon suggests it is really just a set-off part of that paragraph. I’ve now gone with marking it up within that paragraph.
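A sketch of how a set-off phrase inside its paragraph might be marked, assuming TEI's <seg> element with a rend attribute. The element choice and the rend value are hypothetical:

```xml
<!-- Hypothetical markup: the answer stays inside the preceding paragraph,
     with <seg> recording that it is set off and centred -->
<p>... But all that came out with a sudden squeal was:
  <seg rend="centred">Time! Time!</seg>
</p>
```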
Not Marked Up
There is much that could still be marked up. As already mentioned, there is presently no attempt to indicate page breaks or the semantics of each italic phrase.
Another potentially important next step could be to mark up dialogue more explicitly. This would make it much easier to study the vocabulary used by particular characters.
There are other possibilities, such as marking up named entities (people, places, etc.). And there are all the deeper linguistic annotations that can be made over time.
All of these will be discussed in future posts. This is very much just the beginning—even just for The Hobbit—and the same markup needs to eventually be applied to other Tolkien texts.
One of the very next things we’ll do, however, is compare the paragraph divisions in our markup to the work done by Sparrow Alden. We’ll also provide some basic statistics on the text and its structure.
Many thanks to Paul O’Rear for discussions on a number of the issues dealt with here.