Writing about code

Marky Markov: Generating "Random" Text With Ruby

When I first signed up for Twitter as @skitelman in 2012 I thought I had declared myself the world's greatest Skittles fan. I would tweet almost exclusively about Skittles and I would use Twitter to connect Skittles to the zeitgeist. For my efforts I hoped that the Skittles corporation would reward me with some free candy or at least a lousy retweet. I got neither. So I became obsessed with revenge. I knew that people could create Twitter bots that could do all sorts of things. Couldn't I create one that would establish me as America's foremost Skittles fan? Certainly it couldn't be too hard to create a bot that would just tell the Skittles Twitter account how much I loved their product?

It turns out that without knowing how to program, it is nearly impossible to make interesting Twitter bots. In my research on Twitter bots I came across a blog post by Darius Kazemi, the creator of many great Twitter bots. His advice was summed up in this handy link, but he says it more eloquently in his post:

The reason I am able to make Twitter bots is because I have been programming computers in a shitty, haphazard way for 15 years, followed by maybe 5 years of less-shitty programming. ... Every little atom of knowledge represents hours of banging my head up against a series of technical walls, googling for magic words to get libraries to compile, scouring obscure documentation to figure out what the hell I’m supposed to do, and re-learning stuff I’d forgotten because I hadn’t used it in a while.

In the intervening years I have learned a lot about computer programming. I can finally gaze into the source code for some Twitter bots and not be dazed by their magic. As it turns out, many Twitter bots rely on a piece of mathematics that I had known about for even longer: the Markov process.

Markov Processes

Let's say that one morning Mark Wahlberg decides to go on a walk through Manhattan. Rather than having a fixed destination in mind, he decides to let a 4-sided die determine how he will turn at every intersection. While the majority of his friends and loved ones would be quite concerned, his mathematician friends would gleefully recognize this as a Markov process before they expressed their concern.

Mark Wahlberg's disturbing random walk is known as a Markov process. In fact a Markov process is just a sequence of events that obeys the following four conditions:

  1. The outcome at any stage depends on chance.
  2. The set of possible outcomes is finite.
  3. The probability of the next outcome depends only on the outcome immediately before it, not on anything earlier.
  4. The probabilities do not change over time.
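
As a rough sketch of those four conditions, here is the walk simulated in Ruby. Everything in it is an assumption made for illustration: the 5x5 grid standing in for Manhattan, the names GRID_SIZE, DIRECTIONS, next_intersection and random_walk, and the choice to treat each face of the die as a compass direction rather than a turn. A roll that would step off the grid just leaves the walker where he is.

```ruby
# Hypothetical 5x5 grid of intersections; coordinates run from 0 to 4,
# so the set of possible outcomes is finite (condition 2).
GRID_SIZE = 5

# One face of the 4-sided die per compass direction.
DIRECTIONS = [[0, 1], [1, 0], [0, -1], [-1, 0]].freeze

# Pick the next intersection using only the current one and a die roll
# (conditions 1 and 3). The die never changes (condition 4).
def next_intersection(position, rng)
  dx, dy = DIRECTIONS[rng.rand(4)]
  x = (position[0] + dx).clamp(0, GRID_SIZE - 1)
  y = (position[1] + dy).clamp(0, GRID_SIZE - 1)
  [x, y]
end

# Walk for a fixed number of steps and return every intersection visited.
def random_walk(start, steps, rng = Random.new)
  path = [start]
  steps.times { path << next_intersection(path.last, rng) }
  path
end

p random_walk([2, 2], 10)
```

Notice that next_intersection never looks at the path so far, only at the current position. That is the whole trick.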

A great many types of probabilistic processes are Markov processes. In fact, when Larry Page and Sergey Brin developed the PageRank algorithm for Google they considered the Web to be a giant Markov process with each user randomly clicking a link on a site or choosing to randomly visit another website. My college linear algebra textbook held up Markov processes and PageRank as a tangible reason to care about the nightmarish tasks of orthogonalizing matrices and calculating eigenvalues. But much simpler properties of Markov processes are used to generate "random" speech for Twitter bots and many other internet distractions.

Marky Markov and The Funky Sentences


A relatively simple Markov process can be used to create "real" seeming text that is in fact randomly generated from a piece of source material. To generate a new random text, a word (or a series of words) is chosen from the source material at random. Then all instances of this word (or series of words) are found in the source material and the words that follow are recorded. The next word in the Markov text is just a random word chosen from this collection. The process continues as long as desired with the last word (or series of words) being used to generate the next.
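
The process above can be sketched in a few lines of Ruby using single words rather than a series of words. This is my own minimal version, not code from any gem: build_followers, generate_text and the source string are all made up for illustration, and the starting word is picked from the words that have at least one follower.

```ruby
# Map each word to the list of words that follow it in the source.
def build_followers(text)
  followers = Hash.new { |hash, word| hash[word] = [] }
  text.split.each_cons(2) { |word, next_word| followers[word] << next_word }
  followers
end

# Start from a random word and repeatedly pick a random follower.
def generate_text(followers, length, rng = Random.new)
  word = followers.keys.sample(random: rng)
  words = [word]
  (length - 1).times do
    candidates = followers.fetch(word, [])
    break if candidates.empty? # this word only appears at the very end of the source
    word = candidates.sample(random: rng)
    words << word
  end
  words.join(" ")
end

source = "the plane has keys and the pilot has the keys to the plane"
puts generate_text(build_followers(source), 8)
```

Because repeated followers are stored as many times as they occur, a word that often follows another is proportionally more likely to be picked.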

The Ruby gem Marky Markov implements such a process and I will attempt to explain its implementation. To play around with the gem, I used Jerry Seinfeld's book "SeinLanguage" as a basis to see if I could randomly generate new Seinfeld jokes. First, to generate the random text, SeinLanguage is parsed to create a hash. By default the text is parsed in two-word chunks. For example, the "joke" setup:

I was on a plane the other day, and I was wondering, "Are there keys to the plane? Do they need keys to start the plane?"

is parsed as the hash:

{["I", "was"]=>["on", "wondering,"], ["was", "on"]=>["a"], ["on", "a"]=>["plane"], ["a", "plane"]=>["the"], ["plane", "the"]=>["other"], ["the", "other"]=>["day,"], ["other", "day,"]=>["and"], ["day,", "and"]=>["I"], ["and", "I"]=>["was"], ["was", "wondering,"]=>["Are"], ["wondering,", "Are"]=>["there"], ["Are", "there"]=>["keys"], ["there", "keys"]=>["to"], ["keys", "to"]=>["the", "start"], ["to", "the"]=>["plane"], ["the", "plane"]=>["?", "?"], ["Do", "they"]=>["need"], ["they", "need"]=>["keys"], ["need", "keys"]=>["to"], ["to", "start"]=>["the"], ["start", "the"]=>["plane"], ["plane", "?"]=>[""], ["?", ""]=>["."]}

Notice that the text is read two words at a time. I found that reading more than two words at a time results in a hash where most values are arrays that only contain one word and reading one word at a time results in nonsense. The Marky Markov gem applies additional logic to keep track of capitalized words in order to begin sentences appropriately. Also the gem uses some pretty crafty regular expressions to deal with punctuation and other weirdness.
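
To make the two-words-at-a-time idea concrete, here is a simplified sketch of how such a hash could be built. It is not the gem's code: build_dictionary and its depth parameter are names I made up, and it splits on whitespace only, so it skips the capitalization tracking and the punctuation handling described above.

```ruby
# Build a hash from every run of `depth` consecutive words
# to the list of words that follow that run in the text.
def build_dictionary(text, depth = 2)
  dictionary = Hash.new { |hash, key| hash[key] = [] }
  text.split.each_cons(depth + 1) do |chunk|
    key = chunk.first(depth)
    dictionary[key] << chunk.last
  end
  dictionary
end

joke = 'I was on a plane the other day, and I was wondering, "Are there keys to the plane?"'
dictionary = build_dictionary(joke)
p dictionary[["I", "was"]] # => ["on", "wondering,"]
```

Passing a depth of 1 or 3 is an easy way to see the trade-off for yourself: longer keys match fewer places in the source, so there are fewer words to choose from at each step.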

To generate a new sentence, the gem finds a random capitalized word and pulls a key from the dictionary that corresponds to it. From then on, the last two words of the sentence are used as the key to look up in the hash, and a random word is taken from the values stored under that key.

For example, let's say that the sentence so far is "I'd ride in the van with my". Marky Markov would then look in the dictionary for the key ["with", "my"] and return

 ["with", "my"]=>["sneakers", "life", "cereal,", "parents"]

Then the array method sample is called on the values, which returns a random value that is appended to the sentence. So the sentence becomes "I'd ride in the van with my life" and the process continues until the last word is a piece of punctuation.
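
That loop can be sketched as follows. Again this is my own simplified version and not the gem's implementation: finish_sentence and max_words are hypothetical names, a sentence counts as finished when its last word ends in ".", "!" or "?", and every dictionary entry except the ["with", "my"] one is invented so that the example can terminate.

```ruby
# Extend a sentence one word at a time until it ends in punctuation
# or the dictionary has no entry for the last two words.
def finish_sentence(words, dictionary, rng = Random.new, max_words = 50)
  words = words.dup
  until words.last.match?(/[.!?]\z/) || words.length >= max_words
    candidates = dictionary[words.last(2)]
    break if candidates.nil? || candidates.empty?
    words << candidates.sample(random: rng)
  end
  words.join(" ")
end

dictionary = {
  ["with", "my"] => ["sneakers", "life", "cereal,", "parents"], # the entry from the example above
  ["my", "sneakers"] => ["on."],        # made-up entries from here down
  ["my", "life"] => ["savings."],
  ["my", "cereal,"] => ["obviously."],
  ["my", "parents"] => ["driving."]
}

puts finish_sentence("I'd ride in the van with my".split, dictionary)
```

The max_words cap is only a safety net, since a dictionary built from real text can contain loops that never reach a full stop.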

So to use Marky Markov to generate pieces of Seinfeldian wisdom, I first installed the gem and then ran the following lines of code:

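require 'marky_markov'
# create a new instance of the text generator with a temporary dictionary.
markov = MarkyMarkov::TemporaryDictionary.new
# give the generator a file to parse into the dictionary
# (the filename here is a placeholder; use the path to your own text file)
markov.parse_file "seinlanguage.txt"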

markov.generate_n_sentences 2  
=> "It's too good to be Superman? Have a good time?"

markov.generate_n_sentences 2  
=> "Friends are the people at the airport can make the toast darker. Men, the transplant is the fire."

It is not exactly English and it is not exactly a reasonable approximation of Jerry Seinfeld. But I would argue it is about the most we can expect from Jerry Seinfeld today.
