After much blasphemous and daemonic experimentation with abominable and vile Dark Practices, I have been able to make contact with the undead mind of H.P. Lovecraft, summoning him from the insane and unspeakable dimensions beyond this realm, and extracting from him these fragments of text that he has written whilst in his wretched timeless, bodiless state. (Clicking will open in a new window.)
Isn’t that cool?
OK, but seriously, though: I’ve been re-reading a number of H.P. Lovecraft stories because of the art project that I’m working on. Lovecraft has a beautiful, elaborate and poetic writing style; but sometimes, when I come across sentences like this one,
There were, in such voyages, incalculable local dangers; as well as that shocking final peril which gibbers unmentionably outside the ordered universe, where no dreams reach; that last amorphous blight of nethermost confusion which blasphemes and bubbles at the centre of all infinity – the boundless daemon sultan Azathoth, whose name no lips dare speak aloud, and who gnaws hungrily in inconceivable, unlighted chambers beyond time amidst the muffled, maddening beating of vile drums and the thin, monotonous whine of accursed flutes; to which detestable pounding and piping dance slowly, awkwardly, and absurdly the gigantic Ultimate gods, the blind, voiceless, tenebrous, mindless Other gods whose soul and messenger is the crawling chaos Nyarlathotep.
…I just can’t help but think he’s just fucking with people.
I mean, come on.
So I thought to myself: since he has such a distinct vocabulary and style, it probably wouldn’t be that difficult to create an automatic text-generation program that would simulate his writing style!
Thus began my little project with H. P. Lovecraft random text generation!
Random Text Generation
There are two main approaches when it comes to random “language-like” text generation. What I mean by “language-like” is streams of text that are intended to sound like something sensible that someone would write, but that are obviously computer-generated nonsense (over the long term, at least).
1. Word-Transition Probabilities
The first method uses word-transition probabilities. You take whatever source text you want your random text to “sound like” (in this case, the source would be text by H.P. Lovecraft), and you create a table of n-gram frequencies. In other words, you count the number of times that a sequence of n words happens together.
Just to give you a sense of what this looks like, I took the complete text of two stories by Lovecraft, “The Whisperer in Darkness” and “The Dream-Quest of Unknown Kadath”, calculated the frequencies of all of the trigrams (i.e. groups of three consecutive words) in the text, and sorted them in decreasing order of frequency.
As you can see, the most commonly occurring group of three words is “the Great Ones” (which should surprise nobody who is a fan of Lovecraft’s work).
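For the curious, here is a minimal Python sketch of that counting step. It is not the exact code used for the counts here; the function name `count_trigrams`, the word-splitting regex, and the tiny sample string are all assumptions made up for illustration.

```python
import re
from collections import Counter

def count_trigrams(text):
    """Count every run of three consecutive words in the text."""
    # Assumption: a "word" is a run of letters/digits, optionally joined
    # by apostrophes or hyphens (so "night-gaunts" stays one word).
    words = re.findall(r"\w+(?:['’-]\w+)*", text)
    return Counter(zip(words, words[1:], words[2:]))

# Hypothetical stand-in for the real source text.
sample = "the Great Ones fear and the Great Ones themselves dread the Great Ones"
trigram_counts = count_trigrams(sample)

# Print the trigrams in decreasing order of frequency.
for trigram, count in trigram_counts.most_common(3):
    print(count, " ".join(trigram))
```

With a full story in place of `sample`, `most_common()` gives you the sorted table directly.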
Once you have a table like this, you can use it to randomly generate text that follows the same probability distribution for these trigrams over the long term, even though words are selected randomly over the short term.
This is how you do it. First, you pick a trigram to start with. Let’s start with “The Great Ones” just as an example. How do you decide what word comes next? You can find all of the trigrams in the table that begin with the two words “Great Ones”, and see their respective counts.
These are all of your options for the next word. You pick one randomly… but you weight your decision based on the counts in the original text. So, for example, the chance that the next word after “The Great Ones” is “themselves” should be three times the chance that the next word is “gently.”
Let’s suppose you select the word “wished.” So far, your sentence is “The Great Ones wished.” Your most recent two words are now “Ones wished.” So, you find the list of all trigrams that begin with “Ones wished,” and that will give you your new list of what word might come next. Once again, you use a random number generator to select the next word from that list, but you weight the choice by the total count, so that those trigrams that occurred more often in the original text will have a higher probability of showing up in your random text.
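The whole loop fits in a few lines of Python. This is only a sketch under some assumptions: the names `build_table` and `generate` are made up, the source text is a tiny hypothetical word list, and generation simply stops if it reaches a pair of words that nothing ever followed in the source.

```python
import random
from collections import Counter, defaultdict

def build_table(words):
    """Map each pair of words to a Counter of the words that followed it."""
    table = defaultdict(Counter)
    for first, second, third in zip(words, words[1:], words[2:]):
        table[(first, second)][third] += 1
    return table

def generate(table, start, length, rng=random):
    """Extend `start` (at least two words) until it is `length` words long."""
    output = list(start)
    while len(output) < length:
        followers = table.get((output[-2], output[-1]))
        if not followers:
            break  # dead end: nothing followed this pair in the source
        candidates = list(followers)
        counts = list(followers.values())
        # Weighted pick: a word seen three times as often is three times as likely.
        output.append(rng.choices(candidates, weights=counts)[0])
    return " ".join(output)

# Hypothetical miniature source text.
words = ("the Great Ones wished it and the Great Ones themselves "
         "feared it and the Great Ones themselves wished it").split()
table = build_table(words)
print(generate(table, ("the", "Great", "Ones"), 12))
```

Here `random.choices` does the weighted selection, using the raw counts as weights, so there is no need to convert them into probabilities first.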
In the long run, this will generate text that has the same overall “feel” as H.P. Lovecraft’s writing, because it will have the same overall word-transition probabilities.
What does it look like? Try it yourself:
Clicking the above link will open up a page in a new window that will generate pseudo-Lovecraftian text based on this method. You can refresh that window as many times as you want, and it will generate new random text each time. I’ve run it several times, and here is one of my favorites:
It would be much worse; for the first time the great Ones fear, and to the north they traded onyx for the sound of lutes and pipes that stole timid from inner courts where marble fountains bubbled. They had been left behind. It was not sure but that he was a great black doorway which marked olden wrath of the ghouls and night gaunts had left the living room. I had never been sought by any vessel because the Great Ones will prance and jump with antique mirth and forthwith stride after the yak merchants. The great Ones themselves dread them.
Indeed they do, Mr. Lovecraft. Indeed they do.
Of course, there is nothing in this method that guarantees that your sentences will be grammatical. There is nothing that represents grammar structure. Because it is random and based simply on word-transition probabilities, it could potentially create a 100-word run-on sentence. (Not that that type of thing is out of the question for real Lovecraft sentences.)
If you want to make sure that you actually generate grammatically correct (or at least, reasonable) sentences, then you can use the second method of random text generation.
2. Generative Grammar Rewrite Rules
The second method uses generative grammar rewrite rules. You analyze different sentences in the text that you want to simulate and examine their grammatical structure. From that, you can figure out what grammar rules are used most often, and you can generate sentences that follow those same rules.
You probably remember sentence diagrams from school. Words can be tagged with their parts of speech, which can then be grouped together into things like noun phrases, verb phrases, and so on. Finally, as you group together each component of the sentence into larger and larger chunks, eventually you get up to the very top level of the full sentence.
As you can see with this example sentence, Lovecraft uses some pretty strange sentence structures. He makes very heavy use of adjectives and adverbs, and in this example he has inverted the usual “NP VP” structure and actually put the verb phrase first.
This example sentence can be created using a set of rewrite rules. So, beginning with the full sentence (at the top of the graph), we see that we will re-write the token “sentence” with a pair of tokens, “VP” (verb phrase) and “NP” (noun phrase). Then, we can see that the noun phrase is in fact made up of a noun phrase followed by a prepositional phrase. This is another rewrite rule, where NP gets rewritten as NP PP.
You can continue to examine the creation of this sentence in this way, and come up with a set of re-write rules that will gradually break down grammatical structures into their parts, and eventually replace grammatical tokens like “noun” and “verb” and “adjective” with the actual individual words. The set of rewrite rules needed to produce the above sentence is shown at the right.
These re-write rules are the basic units for randomly generating sentences using this method. We can amass a large set of these rules by analyzing text by H.P. Lovecraft. We can count up how often each of the rewrite rules is used, in order to figure out which ones are used more often and which ones are used less often. Because we are expressing the conversion of the token “noun” (for example) into individual words, this will also include word frequency counts.
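To make the counting concrete, here is a small Python sketch. It assumes the sentences have already been parsed (by hand, in this case) into nested tuples of the form `(label, child, child, ...)`, where a bare string is an actual word. The function name `count_rules` and the example tree are hypothetical, invented just for this illustration.

```python
from collections import Counter

def count_rules(tree, counts):
    """Tally one rewrite rule per node of a parsed sentence."""
    label, *children = tree
    # The right-hand side is the child labels, or the word itself for a leaf.
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    counts[(label, rhs)] += 1
    for child in children:
        if not isinstance(child, str):
            count_rules(child, counts)
    return counts

# Hypothetical hand-parsed sentence: "the captain saw a toad".
tree = ("sentence",
        ("NP", ("det", "the"), ("noun", "captain")),
        ("VP", ("verb", "saw"),
               ("NP", ("det", "a"), ("noun", "toad"))))

rule_counts = count_rules(tree, Counter())
for (label, rhs), count in rule_counts.most_common():
    print(count, label, "->", " ".join(rhs))
```

Rules like `noun -> captain` land in the same table as rules like `NP -> det noun`, which is why the word frequencies come along for free.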
Then, we can generate a random sentence using a similar process as before. The big difference is that now, instead of randomly selecting trigrams, we are randomly selecting re-write rules.
This is how it works. Start with the “sentence” token. Then, randomly select a re-write rule that will convert that into some other tokens. You should weight your selection based on the frequencies found in the original text. So, for example, if the rule sentence → NP VP appears twice as often in the original text as the rule sentence → VP NP, then it should have twice the chance of being selected randomly.
Then, simply repeat this process until every single “token” in your string has been replaced with a word. That creates one sentence. Keep repeating that for as many sentences as you want to generate.
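Here is roughly what that looks like in Python. This is a sketch, not the real generator: the rule table `RULES`, its counts, and the function name `expand` are all invented for illustration, and anything that does not appear as a key in the table is treated as a finished word.

```python
import random

# Hypothetical toy grammar: token -> list of (expansion, count) pairs.
RULES = {
    "sentence": [(["NP", "VP"], 2), (["VP", "NP"], 1)],
    "NP": [(["det", "noun"], 3), (["det", "adj", "noun"], 2), (["NP", "PP"], 1)],
    "VP": [(["verb"], 2), (["verb", "NP"], 1)],
    "PP": [(["prep", "NP"], 1)],
    "det": [(["the"], 3), (["many"], 1)],
    "adj": [(["hideous"], 1), (["lonely"], 1)],
    "noun": [(["horrors"], 1), (["dreams"], 1), (["zoogs"], 1)],
    "verb": [(["followed"], 1), (["climbed"], 1)],
    "prep": [(["from"], 1), (["beyond"], 1)],
}

def expand(token, rules, rng=random):
    """Rewrite `token` recursively until only words are left."""
    if token not in rules:
        return [token]  # no rule for it, so it is already a word
    expansions = [expansion for expansion, _ in rules[token]]
    counts = [count for _, count in rules[token]]
    # Weighted pick: "sentence -> NP VP" is chosen twice as often as "VP NP".
    chosen = rng.choices(expansions, weights=counts)[0]
    words = []
    for part in chosen:
        words.extend(expand(part, rules, rng))
    return words

for _ in range(3):
    print(" ".join(expand("sentence", RULES)).capitalize() + ".")
```

One thing to watch for: recursive rules like NP → NP PP need low enough counts that the expansion usually winds down, as they do in this toy table. Otherwise a sentence can keep growing until Python hits its recursion limit.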
What does it look like? Try it yourself:
Clicking the above link will open up a page in a new window that will generate pseudo-Lovecraftian text based on this method. You can refresh that window as many times as you want, and it will generate new random text each time. Here is one of my favorites:
Then all lonely horrors, and all humans but a hideous living sunset and a onyx body, had Ackley. Where a twilight sinister you bitterly told many jagged, enchanted dreams and many gates cannot doubt. You followed deeply. The captain saw he and a toad. Where most outside birds came they, a marvellous, crawling screaming cannot arise. Many dreams came from most zoogs, and followed evidently. Then some slippery, stone horrors nearly softly likely had the living taverns, and climbed it, and felt chiefly. Then the great rock sickeningly felt it, and saw a lonely space. The monstrous phonograph cannot were.
Now, a few things should be noticed here. Although this method “in theory” should only produce grammatical sentences, it really doesn’t. In real English, not all re-write rules can be applied completely independently. For example, some re-write rules can only be used for certain types of words, or can only be used in conjunction with other rewrite rules or for certain constructions. This kind of complexity could, in theory, be built into a model like this one… but for this particular example, we kept it simple. As a result, you will see that some rules have been applied in places where they probably shouldn’t have been.
Also, collecting the data for this kind of analysis is actually very hard. It’s much more difficult to automatically pull sentence structure rules out of a text than to just examine trigram frequencies. As a result, I used a much smaller “source text” (just the first part of “The Dream-Quest of Unknown Kadath”) in this example. This will make the result a bit more repetitious. If I spent more time on collecting data for a more diverse set of rewrite rules, it would probably look even better.
Anyway, have fun with the random Lovecraftian text. Amaze your friends by quoting it to them at parties. And if you want, try making some random text generators to simulate the writing style of your favorite author!