Tokenizing the Hobbit
How many words are there in The Hobbit?
No, not two. But jokes aside (thanks, Iian Neill), we need to be clear straight away about whether we’re talking about unique words or not.
For example, does:
In a hole in the ground there lived a hobbit.
consist of 10 words or only 8 (a, ground, hobbit, hole, in, lived, the, there)?
This distinction is often referred to as type vs token. There are 10 tokens but 8 types (assuming we treat “In” and “in” as the same word, of course).
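The distinction can be checked with a few lines of Python. This is only a minimal sketch: stripping punctuation with string.punctuation and folding case with lower() are my assumptions about what counts as "the same word", not rules taken from the text.

```python
import string

sentence = "In a hole in the ground there lived a hobbit."

# Tokens: every whitespace-separated word, with surrounding punctuation stripped.
tokens = [w.strip(string.punctuation) for w in sentence.split()]

# Types: distinct tokens, treating "In" and "in" as the same word.
types = sorted({t.lower() for t in tokens})

print(len(tokens))  # 10
print(len(types))   # 8
print(types)        # ['a', 'ground', 'hobbit', 'hole', 'in', 'lived', 'the', 'there']
```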
So let’s be clear up front: we’re talking about tokens. So how many tokens are there in The Hobbit?
Well, there are some more things to deal with first.
[UPDATE: As José Anido pointed out on Twitter, before anything else: which edition of The Hobbit are we talking about? All my digital work is currently based on the 70th Anniversary edition although I’m working with Ugo Truffelli on expanding that.]
Are we including Christopher Tolkien’s Preface? Douglas Anderson’s Note on the Text? The short description of the spelling of dwarves and of runes at the start?
Let’s say no. Let’s just deal with the 19 chapters of the main narrative. How many tokens in those?
Are we including chapter headings or not? The footnote about Bolg? Figure captions?
Again let’s say no. Can we count tokens now? How might we go about doing that?
The easiest way would be to just “split on whitespace”, that is, to treat every run of whitespace as the end of one token and the start of the next.
If we do that, we get 95,137 tokens in The Hobbit. This is a reasonable thing to do but we might want to check the shape of these tokens. Do they all look like what we might expect our tokens to look like?
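As a rough sketch of what "split on whitespace" means in practice, here it is applied to the opening sentence only (the full text is not reproduced here, so these counts are for this one sentence). Note that str.split() with no arguments splits on any run of whitespace, and that nothing is normalized: "In" and "in" stay distinct, and "hobbit." keeps its full stop.

```python
from collections import Counter

sentence = "In a hole in the ground there lived a hobbit."

# Raw whitespace tokens: punctuation and capitalization are left untouched.
raw_tokens = sentence.split()

# Counting the raw tokens gives one entry per raw type.
raw_types = Counter(raw_tokens)

print(len(raw_tokens))           # 10
print(len(raw_types))            # 9
print(raw_types.most_common(1))  # [('a', 2)]
```

Because nothing is cleaned up, the number of raw types here is 9 rather than 8, which is why it is worth looking at the shape of the tokens before trusting the counts.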
The 95,137 tokens come in 11,532 different types. Just under half (5,576) of those types consist of just a sequence of letters (including ä which is found in Roäc and is the only character with a diacritic in the text). Another 291 types are hyphenated words like
Another 166 types are a sequence of letters (and possibly hyphens) followed by one of:
53 types have punctuation, either
( before a word. 5,165 types consist of a word followed by punctuation
—. 4 types have both punctuation before and after a word.
Note that single (or double) quotation marks don’t show up here because I replace them with markup to avoid ambiguity between quotes and apostrophes.
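The shape check described above can be sketched with regular expressions. This is an illustration under assumptions, not the code behind the counts quoted here: the category names, the patterns, and the treatment of any character that is neither a word character nor whitespace as punctuation are all mine, and only the categories that can be described by shape alone are covered. The tokens hobbit-hole and (hill are included purely to exercise the hyphenated and leading-punctuation cases.

```python
import re
from collections import Counter

LETTERS = r"[^\W\d_]+"               # roughly: Unicode letters, so the ä in Roäc counts
WORD = rf"{LETTERS}(?:-{LETTERS})*"  # letters, optionally hyphenated
PUNCT = r"[^\w\s]+"                  # assumption: anything else that is not whitespace

# Checked in order; a token gets the first shape it fully matches.
SHAPES = [
    ("letters", re.compile(LETTERS)),
    ("hyphenated", re.compile(rf"{LETTERS}(?:-{LETTERS})+")),
    ("punct + word", re.compile(rf"{PUNCT}{WORD}")),
    ("word + punct", re.compile(rf"{WORD}{PUNCT}")),
    ("punct + word + punct", re.compile(rf"{PUNCT}{WORD}{PUNCT}")),
]

def shape(token):
    """Return the name of the first shape the whole token matches, else 'other'."""
    for name, pattern in SHAPES:
        if pattern.fullmatch(token):
            return name
    return "other"

sentence = "In a hole in the ground there lived a hobbit."
sample_types = set(sentence.split()) | {
    "Roäc", "hobbit-hole", "(hill", "hill\u2014The", "it\u2014and",
}

shape_counts = Counter(shape(t) for t in sample_types)
for name, count in shape_counts.most_common():
    print(name, count)
```

Anything that lands in "other" is a candidate for closer inspection; in this sample that is exactly the two tokens containing an em-dash.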
In one case (in 05.146) there’s a hyphenated expression with an apostrophe marking possession in the middle, plus punctuation at the end:
We also have
02.013 as part of “Thorin & Co.”.
We have a number of words with apostrophes indicating something other than the possession or contraction covered above:
runnin’. With the possibility of being followed by punctuation, this accounts for 18 types.
And so we have 5,576 + 291 + 166 + 53 + 5,165 + 4 + 1 + 4 + 18 = 11,278 types, leaving 254 unaccounted for.
These all have em-dashes
— in them. We’ve already encountered em-dashes in our pre-word punctuation and post-word punctuation so let’s briefly review those before we get to the mid-token em-dashes.
The three times an em-dash starts a token (
07.116) it is at the start of direct speech and marks the resumption of an interrupted utterance.
Similarly the tokens ending in an em-dash are usually the interruption of speech (
07.073 which is what
07.075 is resuming).
There are two cases where a final en-dash is used:
01.128. I do wonder if the latter at least should be an em-dash instead (and possibly the former) although this doesn’t make a difference to our token counting.
Let’s get back to our 254 remaining types that have an em-dash in the middle. Consider the following from
The tunnel wound on and on, going fairly but not quite straight into the side of the hill—The Hill, as all the people for many miles round called it—and many little round doors opened out of it, first on one side and then on another.
Here the em-dash is used to mark out a parenthetical comment about the hill being called “The Hill”. If we just tokenize based on spaces, we end up with tokens like it—and, which is almost certainly not what we want.
What we probably want to do instead is split these into
and as if the em-dashes were “open” (i.e. had spaces around them) rather than “closed”. This is fairly easy to do, but there are two things we must be careful about when doing it.
Firstly, we get things like
01.097 which we probably would NOT want to split into
, because we’re not splitting off punctuation like the comma as a separate token. Other examples of this are
Secondly, we have the following:
Do we want to split these into multiple tokens? Let’s assume we keep the first four as single tokens but split the last one into
chest;. How many tokens do we end up with?
In that case, 95,643 tokens in 11,324 different types.
One may quibble with those last few tokens with multiple em-dashes. There is also some question whether to treat hill—The as three tokens or just two.
If we don’t include the medial em-dash in our token count (i.e. treat hill—The as two tokens) we end up with 95,390 tokens in total.
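One way to sketch the em-dash handling is below. The function name tokenize, the keep_dash flag, and the rule "split only where the em-dash has a word character on both sides" are assumptions for illustration. That rule leaves a token alone when the em-dash is followed only by punctuation, but it does not reproduce the special treatment of the tokens with multiple em-dashes discussed above. The sample is a fragment of the sentence about the tunnel quoted earlier.

```python
import re

EM_DASH = "\u2014"

# A "closed" em-dash with a word character immediately on each side.
MEDIAL = re.compile(rf"(?<=\w){EM_DASH}(?=\w)")

def tokenize(text, keep_dash=True):
    """Split on whitespace, then split tokens at medial em-dashes.

    keep_dash=True counts each medial em-dash as a token of its own;
    keep_dash=False drops it and keeps only the words on either side.
    """
    tokens = []
    for chunk in text.split():
        parts = MEDIAL.split(chunk)
        for i, part in enumerate(parts):
            if i > 0 and keep_dash:
                tokens.append(EM_DASH)
            tokens.append(part)
    return tokens

sample = (
    "into the side of the hill\u2014The Hill, as all the people for many "
    "miles round called it\u2014and many little round doors"
)

print(len(sample.split()))                     # 21
print(len(tokenize(sample, keep_dash=False)))  # 23
print(len(tokenize(sample, keep_dash=True)))   # 25
```

Each split adds one token, and each retained em-dash adds one more. That fits the three totals for the full text, which differ by 253 each (95,390 - 95,137 = 253 and 95,643 - 95,390 = 253), although the figures themselves come from the real text and not from this sketch.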
And so we’ve suggested 95,137, 95,390 and 95,643 as possibilities. And this is still assuming we’re dropping chapter headings, the footnote, the preface, etc.
But the key point is that we need to be explicit about what we mean when counting tokens: what rules we’re following to get our count.
Incidentally, doing the coding for this blog post turned up a couple of errors in my digital text, so apart from anything else this was a useful exercise in validating the text against certain heuristics to find mistakes.