In this post I briefly describe the work that went into creating and correcting the Indexes, and their gradual evolution into a Tolkien Authority List of names. The post documents the early work on the index and outlines the next steps.

Introduction

Names, or Named Entities, are among the most fundamental features of works of literature, and indexes of names are important aids for readers. The annotation and modeling of names in a digital space is one of the most important tasks in digital philology.

One of the goals of the Digital Tolkien project is to create machine-actionable indexes of all named entities in Tolkien’s works, starting with The Lord of the Rings. These indexes are extracted directly from the text, annotated and classified, and connected to their context through the canonical citation system used in Digital Tolkien. Scholars will be able to use them to model patterns of occurrence in the texts, to examine character agency (for example, who speaks and where, who is named what and when), or simply for statistical analysis.

Creation of the Indexes

We started with LOTR partly because the complicated history of its Indexes makes it one of the best places to start, and partly because it was one of the first works that we had available in XML format with a solid citation system.

We extracted names from the XML text through explicit string matching, with a script that simply parsed the text for capitalized words. The script excluded some matches automatically, such as words capitalized after a period or at the start of direct speech. We added other instances manually, such as common Elvish words, or all-caps words from inscriptions and letters: for example, in the string “For ADELARD TOOK, for his VERY OWN, from Bilbo”, “ADELARD TOOK” and “Bilbo” were retained as instances of those names, while “VERY OWN” and “For” were excluded. Names that were more uncertain and could not be excluded from the start were placed in two separate lists: unexplained capitalized words (which we will parse and include, where necessary, in the index), and words found at the start of sentences, which need to be examined manually.
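To make the extraction step more concrete, here is a minimal Python sketch of the general technique: scan for capitalized words and set aside those that follow a period or open direct speech. It is not the project's script. The function name, the token pattern, and the choice of punctuation are my own assumptions, and none of the manual decisions described above (Elvish words, all-caps inscriptions) are reproduced.

```python
import re

# Hypothetical token pattern: runs of letters, plus the punctuation marks
# that this sketch treats as starting a new sentence or direct speech.
TOKEN = re.compile(r"[^\W\d_]+|[.!?“]")

def extract_candidates(text):
    """Split capitalized words into likely names and sentence-initial words."""
    names, sentence_initial = [], []
    at_start = True  # the first word of the text is sentence-initial
    for match in TOKEN.finditer(text):
        token = match.group()
        if token in {".", "!", "?", "“"}:
            at_start = True
            continue
        if token[0].isupper():
            # Sentence-initial capitals are ambiguous, so keep them apart.
            (sentence_initial if at_start else names).append(token)
        at_start = False
    return names, sentence_initial

names, to_check = extract_candidates(
    "Frodo looked at Sam. Then Gandalf spoke: “Fly, you fools!”"
)
print(names)     # ['Sam', 'Gandalf']
print(to_check)  # ['Frodo', 'Then', 'Fly']
```

Note that “Frodo” lands in the second list because it opens the sentence, which illustrates why that list has to be examined by hand.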

The script created a preliminary, comprehensive list of names as a YAML file, which simply contained the extracted string, a category to be assigned, and a section for comments. We also decided to create a preliminary categorization of name types, and assigned them manually in the YAML file. I will provide more details about this in another post.
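As an illustration of what such a preliminary list might look like, the sketch below builds one entry per distinct extracted string and writes it out as YAML. The field names (`name`, `category`, `comment`) and both helper functions are assumptions made for this example and may not match the actual file. Values are written as JSON strings, which are also valid YAML scalars, so no external YAML library is needed.

```python
import json

def preliminary_entries(strings):
    """One entry per distinct extracted string, in first-seen order."""
    distinct = dict.fromkeys(strings)  # dicts preserve insertion order
    return [{"name": s, "category": None, "comment": ""} for s in distinct]

def to_yaml(entries):
    """Serialize the entries as a YAML sequence of mappings."""
    lines = []
    for entry in entries:
        category = entry["category"]
        # An unassigned category is written as YAML null.
        cat_text = "null" if category is None else json.dumps(category, ensure_ascii=False)
        lines.append(f"- name: {json.dumps(entry['name'], ensure_ascii=False)}")
        lines.append(f"  category: {cat_text}")
        lines.append(f"  comment: {json.dumps(entry['comment'], ensure_ascii=False)}")
    return "\n".join(lines) + "\n"

print(to_yaml(preliminary_entries(["Bilbo", "Shire", "Bilbo"])))
```

The category starts out empty and is filled in by hand afterwards, which mirrors the manual assignment described above.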

The script also generates annotated HTML files of the text, which display the names in different colors and formats depending on their assigned category, and HTML files of the indexes: each name extracted and categorized appears in the index but can also be seen in the context of the line where it appears, and referenced back to the full text through the citation system. Even the excluded names are highlighted, so that they can be checked for false negatives.
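The annotation step can be sketched in a few lines of Python: wrap every known name in a `span` whose class carries its category, so that a stylesheet can color it. This is only an assumed approach, not the project's generator. The function name and the category labels (`person`, `place`) are hypothetical, and categories are assumed to be simple identifiers that are safe to use as CSS class names.

```python
import html
import re

def annotate_line(line, categories):
    """Wrap each known name in a span classed by its category."""
    if not categories:
        return html.escape(line)
    # Try longer names first so a multi-word name wins over its parts.
    ordered = sorted(categories, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in ordered) + r")\b")
    parts, pos = [], 0
    for match in pattern.finditer(line):
        name = match.group()
        parts.append(html.escape(line[pos:match.start()]))
        parts.append(
            f'<span class="name {categories[name]}">{html.escape(name)}</span>'
        )
        pos = match.end()
    parts.append(html.escape(line[pos:]))
    return "".join(parts)

print(annotate_line("Frodo left Bag End.", {"Frodo": "person", "Bag End": "place"}))
```

Excluded words could be given a category of their own in the same mapping, which would let them be highlighted and checked in the same way.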

The text, with various categories of names in different colors.
Full Index of Places.
Index of Places: Letter A.

Currently, the text has been annotated with the Index, and all the extracted names have been categorized, manually corrected, and disambiguated with the help of the generated HTML text. I have corrected and disambiguated the whole list of Creatures, and am going through Places at present, while at the same time refining the Index categories and their definitions.

Creation of an Authority List

One of the issues with working with a text-generated index is that it cannot, on its own, associate names with entities. By “entity”, I mean the thing or living being, unambiguously and uniquely identified, appearing in Tolkien’s text, which may have various names, descriptives, or epithets assigned to it. One of the most famous examples of this phenomenon is certainly Aragorn:

Then Faramir stood up and spoke in a clear voice: “Men of Gondor, hear now the Steward of this Realm! Behold! one has come to claim the kingship again at last. Here is Aragorn son of Arathorn, chieftain of the Dúnedain of Arnor, Captain of the Host of the West, bearer of the Star of the North, wielder of the Sword Reforged, victorious in battle, whose hands bring healing, the Elfstone, Elessar of the line of Valandil, Isildur’s son, Elendil’s son of Númenor. Shall he be king and enter into the City and dwell there?”

This association is essential for large-scale computational work, since when we want to count all the times when the character Aragorn appears in the text, we definitely want to include the times when he appears under another name, or the full string associated with it. So, we are modeling a more detailed version of the index, an Authority List with unique identifiers for each entity, and a mechanism of association of their various occurrences.
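A small Python sketch shows why the association matters for counting. The mapping from name strings to entity identifiers below is invented for the example (the identifiers `aragorn` and `faramir` are placeholders, not the project's actual identifiers), and the occurrence list is a hand-picked subset of the names in the passage above.

```python
from collections import Counter

# Hypothetical authority mapping: several name strings, one entity id.
ENTITY_OF = {
    "Aragorn": "aragorn",
    "Elfstone": "aragorn",
    "Elessar": "aragorn",
    "Faramir": "faramir",
}

def count_entities(occurrences, entity_of):
    """Count occurrences per entity, skipping names with no assigned entity."""
    return Counter(entity_of[name] for name in occurrences if name in entity_of)

occurrences = ["Faramir", "Aragorn", "Elfstone", "Elessar", "Gondor"]
print(count_entities(occurrences, ENTITY_OF))
# Counter({'aragorn': 3, 'faramir': 1})
```

Counting the bare string “Aragorn” would give 1 here, while counting by entity gives 3.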

We created a list of Keys based on a KWIC (Key Word in Context) concordance format, which allows us to read a portion of the line where the name appears, and makes it possible to associate various types of occurrences to an entity identified by the same key. The Keys are manually assigned (by me) in the same YAML file that contains the general index, and generated as an additional HTML page.
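For readers unfamiliar with the KWIC format, here is a minimal Python sketch of how occurrences could be grouped under a key together with a window of context. The function names, the window width, and the citation label `cit-1` are all assumptions for illustration, and matching is by plain substring, which is cruder than a real index would need.

```python
from collections import defaultdict

def kwic(line, name, width=30):
    """Yield (left, keyword, right) context triples for each occurrence."""
    start = line.find(name)
    while start != -1:
        end = start + len(name)
        yield line[max(0, start - width):start], name, line[end:end + width]
        start = line.find(name, end)

def keys_index(cited_lines, keys, width=30):
    """Group KWIC rows under the key assigned to each name."""
    index = defaultdict(list)
    for citation, line in cited_lines:
        for name, key in keys.items():
            for left, keyword, right in kwic(line, name, width):
                index[key].append((citation, left, keyword, right))
    return dict(index)

lines = [("cit-1", "Here is Aragorn son of Arathorn, the Elfstone, Elessar of the line of Valandil")]
keys = {"Aragorn": "aragorn", "Elessar": "aragorn"}
for row in keys_index(lines, keys, width=10)["aragorn"]:
    print(row)
```

Both rows end up under the same key even though the keyword differs, which is the association the Keys are meant to capture.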

A preview of the Keys Index
A few of the occurrences of the entity Aragorn in LOTR.
Occurrences of the entity Arwen in LOTR.

Keys are currently assigned to the entire index of Creatures. The next step will be to further model the authority list, based on a categorization of the names connected to each entity. The entities will also be enriched with unique identifiers and references to available resources, such as the Tolkien Gateway, the Encyclopedia of Arda, ArdaCraft, and the Encyclopedia of Middle Earth, among others.

I haven’t touched on some of the finer modeling issues, such as the inclusion of non-capitalized words, the creation of “name-strings”, and the inclusion of determiners (you can see some of these things in the screenshot). I’ll provide more details on these aspects of the Index and the Authority List in another post.

PS: I haven’t provided many citations in this post, but you can easily find the provenance of every single example using the Search Tolkien tool.