Lorem Ipsum in Polish

Minimum viable Markov chains in Haskell
2016-02-27

As a weekend hack, I set out to build a minimalist Markov chain generator for Polish text. This was a personal itch I needed to scratch, unsatisfied with the two available implementations. The end product: “Słowotok, or Lorem Ipsum in Polish” is now online.

The program itself is not tied to a particular locale, so if you need a generator, drop your chosen text files into data and you’re good to go. It should work for European languages, and fail catastrophically for languages that don’t delimit words by spaces.

Below are some notes from the project.

Basic project structure

At its most general level, the program does the following:

Read some files with Polish text
Clean up the data
Generate an N-gram database
When requested, serve a random sampling of text, based on the N-grams

That translates to the following main function:

main =
  listDirectory "data" >>= mapM TIO.readFile >>=
  withNgrams . digrams . clean . T.concat
 where
   withNgrams (!t) = scotty 3111 $ do
     get "/" $ file "./data/index.html"
     get "/text/:length" $ do
       len <- (min 1000) `fmap` param "length"
       result <- liftIO $ evalRandIO (fromDigrams len t)
       text (connect result)

The relevant top-level functions exposed by the Lib module being:

module Lib
  (Token, Corpus, Digrams, 
  clean, digrams, fromDigrams) where

clean :: T.Text -> Corpus
digrams :: Corpus -> Digrams
fromDigrams :: (RandomGen g) => Integer -> Digrams -> Rand g [Token]

Interesting and Problematic bits

1. Lazy IO vs. “readiness to serve”

I’d really like steps 1-3 from the main plan to be completed before the HTTP server starts accepting requests. Otherwise — due to Haskell’s lazy evaluation — the first request to hit the server would force the runtime to evaluate the entire path of dependencies leading up to a response. In practice, this would mean that the first request to a freshly started server would take ~3-5 seconds, depending on the available CPU/IO resources on the hosting machine.

Consider this first iteration of the main function:

main = do
  files <- listDirectory "data"
  t <- (trigrams . clean . T.concat) `fmap` mapM TIO.readFile files
  scotty 3111 $ do
    get "/text/:length" $ do
      len <- (min 1000) `fmap` param "length"
      result <- liftIO $ evalRandIO (fromTrigrams len t)
      text (connect result)

The problem occurs when our get handler receives the first HTTP request, and the value of t in the expression (fromTrigrams len t) is needed to produce output.

This, in turn, forces the runtime to:

list the “data” directory (listDirectory "data')
read all the text files’ contents (mapM TIO.readFile files)
prepare the text and create a trigram database (trigrams . clean . T.concat)

But these are exactly the steps we want completed by the time we’ve started serving requests.

The quickest solution here is to use GHC’s Bang Patterns, which tell the compiler we want a value to be fully evaluated inside a given function body. Hence the helper function:

withNgrams (!t) = scotty 3111 $ do

This ensures that when we evaluate scotty 3111, t has been constructed and is ready to use.

2. Randomness

In contrast with every practical language I’ve used, Haskell does not provide an “easy” method of acquiring a random value. One can get “easy” random values in the IO monad, (randomIO :: IO a) but that’s no fun.

System.Random, in tandem with Control.Monad.Random, give us a nice way out. We can have the best of both worlds with testable, predictable “randomness” in tests and real IO-dependent randomness in production.

The Rand monad lets us generate pure random values, based on a random generator g. Here’s a test that always passes:

it "is generated from digrams" $ do
  let t = digrams ["a", "b", "c", "a", "b", "c"]
      g = mkStdGen 1
  evalRand (fromDigrams 1 t) g `shouldBe` ["c", "a", "b"]

The type of the generating function is:

 fromDigrams :: (RandomGen g) => Integer -> Digrams -> Rand g [Token]

where Rand g [Token] could be read as: a monadic function that, when applied to a source of randomness g, will yield a list of Tokens of the desired length.

The digrams themselves are modeled as a Map (Token,Token) [Token]. For each sequence of Tokens t1,t2,t* we hold a list of t*s found in the Corpus.

Given an existing Markov chain holding:

[..,"hello", "there"]

we have to proceed by adding one of the elements held in the Map entry:

("hello","there") => ["gentlemen", "ladies", "world", "ladies"]

Notice how we encode the probablities of each transition in the form of duplicated list entries. The probablities of particular transitions are:

hello there ladies : 2/4
hello there gentlemen : 1/4
hello there world : 1/4

How do we make this happen? Simple!

randomEl :: (RandomGen g) => [a] -> Rand g a
randomEl l = getRandomR (0, length l - 1) >>= return . (l  !!)

3. A proliferation of types

I am not particularly proud of the way the various N-gram contracts are reflected in the types. We have:

type Unigrams = M.Map Token [Token]
type Digrams = M.Map (Token,Token) [Token]
type Trigrams = M.Map (Token,Token,Token) [Token]

This guarantees that for any sequence (as long as it’s a sequnce of length 1,2 or 3) of Tokens found in the source Corpus, there will always be a key containing those Tokens, tupled up, in the resulting database.

If we know that the words “veni”, “vidi”, “vici” appear together in the source, we can be sure that if we lookup (“veni”, “vidi”) in a map of Digrams, we’ll at least get [“vici”]. This means that we can generate sequences of arbitrary length, without ever hitting a “terminating” key, for which no subsequent tokens will be found.

This is a nice safety property, but it forces us to provide separate functions for each Nness of the N-grams we want to support.

For example: to build a dictionary of digrams, we use:

digrams :: Corpus -> Digrams
digrams = let merge (t,u,v) m = M.insertWith (++) (t,u) [v] m
           in ngrams merge triplets

…and for trigrams:

trigrams :: Corpus -> Trigrams
trigrams = let merge (t,u,v,w) m = M.insertWith (++) (t,u,v) [w] m
           in ngrams merge quadruplets

And the reverse — generating functions:

fromDigrams :: (RandomGen g) => Integer -> Digrams -> Rand g [Token]
fromDigrams k m = fromNGrams (\(t,v) e -> [e,v,t]) (\(v:t:_) -> (t,v)) k m

fromTrigrams :: (RandomGen g) => Integer -> Trigrams -> Rand g [Token]
fromTrigrams k m = fromNGrams (\(t,v,u) e -> [e,u,v,t]) (\(u:v:t:_) -> (t,v,u)) k m

There is a lot of rather pointless marshalling and unmarshalling of tuples to lists and vice versa. All of the functions could just use lists as the Map keys, but then the type checker wouldn’t be able to tell us that inserting the key [“veni”] into a map of Trigrams :: Map [Token] [Token] is a bad idea.

Since the program realistically only needs to support digrams and trigrams, this is not that much of a burden, but rather an aesthetic objection.

I’ll be looking into the Type Family Tutorial to see how this issue could be resolved in a more generic manner, with less code.

Hopefully more coming up on the topic! Thanks for reading.