The previous blog post introduced a citation system for The Hobbit and linked to an index that showed the first five tokens in each paragraph. How often is five a sufficient number to uniquely identify the paragraph? How often can we get away with less?
These are relevant questions because they bear on whether a lookup can be provided without reproducing enough text to violate any rights of the copyright holder.
If we strip punctuation (other than hyphens and apostrophes) and convert the tokens to lower case, we can actually uniquely identify the paragraph reference with five or fewer initial tokens in 1,542 out of 1,566 cases.
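The normalization described above could be sketched roughly as follows. This is a minimal illustration rather than the code actually used for the index: the function name `tokenize`, the folding of the typographic apostrophe into a plain one, and the trimming of stray leading or trailing apostrophes and hyphens (so that closing single quotation marks do not become tokens) are all assumptions.

```python
import re

def tokenize(text):
    """Lower-case text and strip punctuation other than hyphens and apostrophes."""
    # Assumption: treat the typographic apostrophe like a plain one.
    text = text.lower().replace("’", "'")
    # Replace anything that is not a word character, whitespace,
    # apostrophe or hyphen with a space.
    text = re.sub(r"[^\w\s'-]", " ", text)
    # Assumption: trim apostrophes/hyphens left dangling at token edges
    # (e.g. closing quotation marks) and drop tokens that become empty.
    tokens = (t.strip("'-") for t in text.split())
    return [t for t in tokens if t]

print(tokenize("“Thank you!” said Thorin."))
```

Whether to keep underscores or digits, and how to treat dashes between words, are choices that would need checking against the real text.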
For example: there is only one paragraph that begins with “sorry” (01.021) or “next morning” (07.121). We can’t just use “next” because of “next day”, but even that is not sufficient because we have “next day the” (17.001) and “next day they” (07.134). In those two cases, though, the initial three tokens suffice.
I initially hit an interesting case with 01.108, which is simply the token “why”. There are six other paragraphs that start with “why”, but there are no further tokens in 01.108 that can be used to disambiguate. Rather, it’s the fact that the paragraph ends there that disambiguates it. So I found it helpful to put a stop token # in my prefixes, so that 01.108 is really tokenized as “why #”.
This stop token trick turned out to be helpful in two other cases.
02.015 is just “but said bilbo” but 02.017 begins “but said bilbo again”.
02.127 is “thank you said thorin” but 03.023 begins “thank you said thorin a”.
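One way to find the shortest unique prefix for every paragraph, stop token included, is to count how many paragraphs share each prefix and then take the first prefix with a count of one. The sketch below is an assumption about how this could be done, not the actual implementation; `shortest_unique_prefixes` and `STOP` are hypothetical names, and the input is assumed to be a mapping from reference to token list.

```python
from collections import Counter

STOP = "#"  # stop token marking the end of a paragraph

def shortest_unique_prefixes(paragraphs):
    """Map each reference to its shortest unique prefix, or None if none exists."""
    # Append the stop token so a paragraph that ends early can still be told apart.
    padded = {ref: tuple(tokens) + (STOP,) for ref, tokens in paragraphs.items()}
    # Count how many paragraphs share each prefix of each length.
    counts = Counter(
        seq[:n] for seq in padded.values() for n in range(1, len(seq) + 1)
    )
    result = {}
    for ref, seq in padded.items():
        # First prefix seen only once; identical paragraphs never get one.
        result[ref] = next(
            (seq[:n] for n in range(1, len(seq) + 1) if counts[seq[:n]] == 1),
            None,
        )
    return result

# Only the tokens quoted above are used here; 02.017 is truncated for illustration.
sample = {
    "02.015": ["but", "said", "bilbo"],
    "02.017": ["but", "said", "bilbo", "again"],
}
for ref, prefix in shortest_unique_prefixes(sample).items():
    print(ref, " ".join(prefix))
```

Counting every prefix of every paragraph is wasteful for long paragraphs; capping the prefix length would be an easy refinement. Paragraphs that are word-for-word identical come back as `None`, since even the stop token cannot separate them.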
With all of this in place, there are:
- 180 paragraphs that need just one initial token to be uniquely identified
- 649 paragraphs that need just two initial tokens to be uniquely identified
- 515 paragraphs that need just three initial tokens to be uniquely identified
- 156 paragraphs that need just four initial tokens to be uniquely identified
- 42 paragraphs that need just five initial tokens to be uniquely identified
There are 11 paragraphs that actually need six initial tokens:
- “far over the misty mountains grim”
- “i am mr bilbo baggins he”
- “i am mr bilbo baggins i”
- “just at that moment the lord”
- “just at that moment the wolves”
- “o where are you going so”
- “o where are you going with”
- “roads go ever ever on over”
- “roads go ever ever on under”
- “wrong said bilbo who had lost”
- “wrong said bilbo who had luckily”
4 that need seven:
- “on silver necklaces they strung the flowering”
- “on silver necklaces they strung the light”
- “why o why did i ever bring”
- “why o why did i ever leave”
2 that need eight:
- “what has it got in its pocketses he”
- “what has it got in its pocketses the”
2 that need nine:
- “who are you and what do you want he”
- “who are you and what do you want they”
There remain 5 cases that still cannot be uniquely identified even with nine tokens. All of them are poetry stanzas with repetition.
01.144 all begin “far over the misty mountains cold” and continue identically until the twenty-first token!
- “far over the misty mountains cold to dungeons deep and caverns old we must away ere break of day to find”
- “far over the misty mountains cold to dungeons deep and caverns old we must away ere break of day to claim”
- “far over the misty mountains cold to dungeons deep and caverns old we must away ere break of day to seek”
15.038 are identical stanzas:
“the dwarves of yore made mighty spells while hammers fell like ringing bells in places deep where dark things sleep in hollow halls beneath the fells #”
All in all, it does seem like a citation lookup based on initial tokens might be achievable without copyright violation. In 98.5% of paragraphs, it takes five tokens or fewer to identify the reference. In 95.8% of paragraphs, it takes four tokens or fewer. Even with just the initial three tokens, you can identify 85.8% of references unambiguously.
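As a quick check on those percentages, the counts reported above can be summed cumulatively. The variable and function names here are just for illustration.

```python
# Paragraph counts by number of initial tokens needed, as reported above.
needed = {1: 180, 2: 649, 3: 515, 4: 156, 5: 42, 6: 11, 7: 4, 8: 2, 9: 2}
ambiguous = 5  # stanzas that are still not unique at nine tokens
total = sum(needed.values()) + ambiguous  # 1,566 paragraphs

def coverage(max_tokens):
    """Percentage of paragraphs identified with at most max_tokens initial tokens."""
    covered = sum(count for n, count in needed.items() if n <= max_tokens)
    return round(100 * covered / total, 1)

for n in (3, 4, 5):
    print(n, coverage(n))
# 3 85.8
# 4 95.8
# 5 98.5
```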
I will follow this analysis up with the equivalent for The Lord of the Rings and The Silmarillion soon.