Lorem Ipsum in Polish
Minimum viable Markov chains in Haskell
2016-02-27
As a weekend hack, I set out to build a minimalist Markov chain generator for Polish text. This was a personal itch I needed to scratch, unsatisfied with the two available implementations. The end product: “Słowotok, or Lorem Ipsum in Polish” is now online.
The program itself is not tied to a particular locale, so if you need a generator, drop your chosen text files into data
and you’re good to go. It should work for European languages, and fail catastrophically for languages that don’t delimit words by spaces.
Below are some notes from the project.
Basic project structure
At its most general level, the program does the following:
- Read some files with Polish text
- Clean up the data
- Generate an N-gram database
- When requested, serve a random sampling of text, based on the N-grams
That translates to the following main
function:
main =
listDirectory "data" >>= mapM TIO.readFile >>=
withNgrams . digrams . clean . T.concat
where
withNgrams (!t) = scotty 3111 $ do
get "/" $ file "./data/index.html"
get "/text/:length" $ do
len <- (min 1000) `fmap` param "length"
result <- liftIO $ evalRandIO (fromDigrams len t)
text (connect result)
The relevant top-level functions exposed by the Lib
module being:
module Lib
(Token, Corpus, Digrams,
clean, digrams, fromDigrams) where
clean :: T.Text -> Corpus
digrams :: Corpus -> Digrams
fromDigrams :: (RandomGen g) => Integer -> Digrams -> Rand g [Token]
Interesting and Problematic bits
1. Lazy IO vs. “readiness to serve”
I’d really like steps 1-3 from the main plan to be completed before the HTTP server starts accepting requests. Otherwise — due to Haskell’s lazy evaluation — the first request to hit the server would force the runtime to evaluate the entire path of dependencies leading up to a response. In practice, this would mean that the first request to a freshly started server would take ~3-5 seconds, depending on the available CPU/IO resources on the hosting machine.
Consider this first iteration of the main
function:
main = do
files <- listDirectory "data"
t <- (trigrams . clean . T.concat) `fmap` mapM TIO.readFile files
scotty 3111 $ do
get "/text/:length" $ do
len <- (min 1000) `fmap` param "length"
result <- liftIO $ evalRandIO (fromTrigrams len t)
text (connect result)
The problem occurs when our get
handler receives the first HTTP request,
and the value of t in the expression (fromTrigrams len t)
is needed to
produce output.
This, in turn, forces the runtime to:
- list the “data” directory (
listDirectory "data'
) - read all the text files’ contents (
mapM TIO.readFile files
) - prepare the text and create a trigram database (
trigrams . clean . T.concat
)
But these are exactly the steps we want completed by the time we’ve started serving requests.
The quickest solution here is to use GHC’s Bang Patterns, which tell the compiler we want a value to be fully evaluated inside a given function body. Hence the helper function:
withNgrams (!t) = scotty 3111 $ do
This ensures that when we evaluate scotty 3111
, t has been constructed and is
ready to use.
2. Randomness
In contrast with every practical language I’ve used, Haskell does not provide an “easy”
method of acquiring a random value. One can get “easy” random values in the IO
monad, (randomIO :: IO a
) but that’s no fun.
System.Random
, in tandem with Control.Monad.Random
, give us a nice way out.
We can have the best of both worlds with testable, predictable “randomness” in tests
and real IO-dependent randomness in production.
The Rand
monad lets us generate pure random values, based on a random generator g
.
Here’s a test that always passes:
it "is generated from digrams" $ do
let t = digrams ["a", "b", "c", "a", "b", "c"]
g = mkStdGen 1
evalRand (fromDigrams 1 t) g `shouldBe` ["c", "a", "b"]
The type of the generating function is:
fromDigrams :: (RandomGen g) => Integer -> Digrams -> Rand g [Token]
where Rand g [Token]
could be read as: a monadic function that, when applied
to a source of randomness g
, will yield a list of Token
s of the desired length.
The digrams themselves are modeled as a Map (Token,Token) [Token]
.
For each sequence of Tokens
t1
,t2
,t*
we hold a list of t*
s
found in the Corpus
.
Given an existing Markov chain holding:
[..,"hello", "there"]
we have to proceed by adding one of the elements held in the Map entry:
("hello","there") => ["gentlemen", "ladies", "world", "ladies"]
Notice how we encode the probablities of each transition in the form of duplicated list entries. The probablities of particular transitions are:
hello there ladies : 2/4
hello there gentlemen : 1/4
hello there world : 1/4
How do we make this happen? Simple!
randomEl :: (RandomGen g) => [a] -> Rand g a
randomEl l = getRandomR (0, length l - 1) >>= return . (l !!)
3. A proliferation of types
I am not particularly proud of the way the various N-gram contracts are reflected in the types. We have:
type Unigrams = M.Map Token [Token]
type Digrams = M.Map (Token,Token) [Token]
type Trigrams = M.Map (Token,Token,Token) [Token]
This guarantees that for any sequence (as long as it’s a sequnce of length 1,2
or 3) of Token
s found in the source Corpus
, there will always be a key
containing those Token
s, tupled up, in the resulting database.
If we know that the words “veni”, “vidi”, “vici” appear together in the source, we
can be sure that if we lookup
(“veni”, “vidi”) in a map of Digrams
, we’ll
at least get [“vici”]. This means that we can generate sequences of
arbitrary length, without ever hitting a “terminating” key, for which no
subsequent tokens will be found.
This is a nice safety property, but it forces us to provide separate
functions for each N
ness of the N-grams we want to support.
For example: to build a dictionary of digrams, we use:
digrams :: Corpus -> Digrams
digrams = let merge (t,u,v) m = M.insertWith (++) (t,u) [v] m
in ngrams merge triplets
…and for trigrams:
trigrams :: Corpus -> Trigrams
trigrams = let merge (t,u,v,w) m = M.insertWith (++) (t,u,v) [w] m
in ngrams merge quadruplets
And the reverse — generating functions:
fromDigrams :: (RandomGen g) => Integer -> Digrams -> Rand g [Token]
fromDigrams k m = fromNGrams (\(t,v) e -> [e,v,t]) (\(v:t:_) -> (t,v)) k m
fromTrigrams :: (RandomGen g) => Integer -> Trigrams -> Rand g [Token]
fromTrigrams k m = fromNGrams (\(t,v,u) e -> [e,u,v,t]) (\(u:v:t:_) -> (t,v,u)) k m
There is a lot of rather pointless marshalling and unmarshalling of tuples
to lists and vice versa. All of the functions could just use lists as the Map
keys, but then the type checker wouldn’t be able to tell us that inserting the key
[“veni”] into a map of Trigrams :: Map [Token] [Token]
is a bad idea.
Since the program realistically only needs to support digrams and trigrams, this is not that much of a burden, but rather an aesthetic objection.
I’ll be looking into the Type Family Tutorial to see how this issue could be resolved in a more generic manner, with less code.
Hopefully more coming up on the topic! Thanks for reading.