Sunday, June 17, 2012

Shake Storage Layer

Summary: Shake maintains metadata as the build progresses. This metadata must remain up to date and consistent even if the process is killed. This post describes my new strategy for doing that.

Shake is an advanced build system, and in common with nearly all advanced build systems, it maintains extra metadata about rules - when the rule was last run, what the dependencies were, how long it took etc. If the metadata associated with a rule is not available, the rule must be rerun, which is often expensive. Any build system is likely to be interrupted on a regular basis - both due to failing rules (compile errors) and the user aborting a build. As a result, it is important that the metadata is robustly stored to disk as soon as it is produced.

In this post, I outline the old solution to maintaining metadata, along with the new solution available in shake-0.3, which I just released. The new solution has a number of benefits:

  • Reduces time loading/saving metadata by up to 75%. In practice this is unlikely to make a significant difference unless no rules need running.
  • Exceptions at any point will not cause file handles to be left open.
  • Previously there were very small windows where, if the process died suddenly, all metadata would be corrupted. These windows have been eliminated.
  • I removed all knowledge of the build system from the storage layer, making it properly decoupled.

Most of these improvements have been driven by people using Shake in new ways. When used as a replacement for Make, with one invocation per run, many of these issues are theoretical. Now that people are running Shake in background threads, forcibly killing and restarting it on a regular basis, these issues can be observed in practice. However, the improvements will benefit everyone.

The Old Solution

The old solution has remained basically the same since the very first version of Shake, over three years ago. Shake maintains two files - the database contains the metadata, while the journal is an append-only list of metadata updates. The sequence of steps is (a toy Haskell sketch follows the list):

  • Load the database
  • If the journal exists then:
    • Replay the journal into the database
    • Save the database
    • Delete the journal
  • Run the build, storing any updates to the journal
  • Save the database
  • Delete the journal
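
As a rough illustration, here is a toy model of that loop. The representation (a Map serialised with show/read) and the helper names are my own simplification for this post, not the real Shake code, which uses a binary format with version stamps:

-- Toy model of the old scheme; all names here are invented for illustration.
import Control.Exception (evaluate)
import qualified Data.Map as Map
import System.Directory (doesFileExist, removeFile)

type DB = Map.Map String String

-- Strict file read, so we can safely rewrite or delete the file afterwards.
readFileStrict :: FilePath -> IO String
readFileStrict file = do
    src <- readFile file
    _ <- evaluate (length src)
    return src

applyUpdates :: DB -> [(String, String)] -> DB
applyUpdates = foldl (\m (k, v) -> Map.insert k v m)

oldScheme :: FilePath -> FilePath -> (DB -> IO [(String, String)]) -> IO ()
oldScheme database journal build = do
    -- load the database
    hasDb <- doesFileExist database
    db0 <- if hasDb then fmap read (readFileStrict database) else return Map.empty
    -- if the journal exists: replay it into the database, save, delete the journal
    hasJournal <- doesFileExist journal
    db1 <- if not hasJournal then return db0 else do
        updates <- fmap (map read . lines) (readFileStrict journal)
        let db = applyUpdates db0 updates
        writeFile database (show db)
        removeFile journal
        return db
    -- run the build, storing any updates to the journal
    -- (a real build appends each update as it happens; here the whole build
    -- runs first and its updates are appended at the end)
    updates <- build db1
    appendFile journal (unlines (map show updates))
    -- save the database and delete the journal
    writeFile database (show (applyUpdates db1 updates))
    removeFile journal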

This solution works well, but has a couple of flaws. Whenever we save the database, if it gets corrupted half-way through, we lose the entire database, causing the build to start from scratch. Another problem is that if we are building nothing, we read in all the metadata, then write it all out again with only a single modification (incrementing the build time step). Since serialisation takes 3x longer than deserialisation (in benchmarks on the Shake metadata), the rewrite accounts for three of the four units of metadata time, so about 75% of the time associated with the metadata is wasted. Even when we have made many updates, the data is already stored in the journal, so rewriting the database is not strictly necessary.

The New Solution

The new solution keeps a single database, containing a list of key/value pairs, which can be appended to. At certain points a backup file is made, simply a copy of an existing database. The sequence of steps is:

  • If the backup file exists, delete the database and use the backup file
  • Read all records from the database
  • Put the records into a Map
  • If the Map is significantly smaller than the number of records then:
    • Rename the database to the backup
    • Resave the database
    • Delete the backup
  • Run the build, storing any updates to the database

In this method we never save the data after a successful run, but just close the file handles. The database accumulates key/value pairs, but only the last value associated with any key in the database is useful - earlier values are ignored. At some point the database will contain a significant number of keys that are no longer useful, and at that point we rewrite the database, taking care to make a backup before starting.
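
Continuing the toy model from the old solution, a sketch of the new scheme might look like the following. Again the representation and the build callback signature are my own simplification, not the real Development.Shake.Storage implementation, and the compaction threshold is arbitrary:

-- Toy model of the new scheme; all names here are invented for illustration.
import Control.Exception (evaluate)
import Control.Monad (when)
import qualified Data.Map as Map
import System.Directory (doesFileExist, removeFile, renameFile)
import System.IO

type DB = Map.Map String String

readFileStrict :: FilePath -> IO String
readFileStrict file = do
    src <- readFile file
    _ <- evaluate (length src)
    return src

newScheme :: FilePath -> FilePath -> (Handle -> DB -> IO ()) -> IO ()
newScheme database backup build = do
    -- if the backup exists, a previous rewrite died part way: recover from it
    hasBackup <- doesFileExist backup
    when hasBackup $ do
        hasDb <- doesFileExist database
        when hasDb (removeFile database)
        renameFile backup database
    -- read all records and collapse them into a Map (the last value per key wins)
    exists <- doesFileExist database
    records <- if exists then fmap (map read . lines) (readFileStrict database)
                         else return []
    let db = Map.fromList records
    -- if the Map is significantly smaller than the record count (the factor of
    -- two is an arbitrary threshold), compact: rename to backup, resave, delete
    when (Map.size db * 2 < length records) $ do
        renameFile database backup
        writeFile database (unlines (map show (Map.toList db)))
        removeFile backup
    -- run the build; each update is appended to the open database handle as it
    -- happens, and we simply close the handle at the end rather than resaving
    h <- openFile database AppendMode
    build h db
    hClose h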

This post outlines the general steps, omitting details such as version stamps and consistency checks, which are highly important for a robust build system. These details are taken care of in the full implementation, available in the source as Development.Shake.Storage, taking about 100 lines.

Monday, June 04, 2012

The Flavours of MVar

Update: These functions have been cleaned up, improved with respect to error conditions and async exceptions, and put in the extra package.

The MVar is a flexible and powerful locking primitive, used extensively in Haskell. An MVar is like a box which is empty (has zero elements inside) or full (has one element inside). You block when trying to take from an empty MVar or put to a full MVar. On top of MVars, lots of interesting concurrent programs can be written. However, with such a flexible mechanism, there is scope for confusion. Every MVar can block on either a take or a put, but for any individual MVar it is likely you expect it to block on only one of those operations. In my programs I usually restrict my MVars to one of three flavours, each of which is described below.
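
For anyone who has not met MVar before, here is a minimal example of the raw operations, independent of the wrappers below:

import Control.Concurrent

main :: IO ()
main = do
    box <- newEmptyMVar                -- an empty MVar: takes block, puts succeed
    forkIO $ do
        threadDelay 1000000            -- pretend to do some work
        putMVar box "finished"         -- fill the box; a second put here would block
    msg <- takeMVar box                -- blocks until the forked thread has put a value
    putStrLn msg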

Lock


The Lock guarantees single-threaded access, typically to some system resource.

type Lock = MVar ()

newLock :: IO Lock
newLock = newMVar ()

withLock :: Lock -> IO a -> IO a
withLock x = withMVar x . const

And as an example:

lock <- newLock
let output = withLock lock . putStrLn
forkIO $ do ...; output "hello"
forkIO $ do ...; output "world"

Here we are creating a lock to ensure that when writing output our messages do not get interleaved. This use of MVar never blocks on a put. It is permissible, but rare, that a withLock contains a withLock inside it - but if so, watch out for deadlocks.
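
For completeness, a self-contained variant of that example, with invented thread bodies standing in for the elided work (it assumes the Lock definitions above are in scope):

import Control.Concurrent
import Control.Monad

main :: IO ()
main = do
    lock <- newLock
    let output = withLock lock . putStrLn
    forkIO $ replicateM_ 5 $ do threadDelay 1000; output "hello"
    forkIO $ replicateM_ 5 $ do threadDelay 1000; output "world"
    threadDelay 100000    -- crude: give both threads time to finish before exiting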

Var


The Var operates on a mutable variable in a thread-safe way.

type Var a = MVar a

newVar :: a -> IO (Var a)
newVar = newMVar

modifyVar :: Var a -> (a -> IO (a, b)) -> IO b
modifyVar = modifyMVar

modifyVar_ :: Var a -> (a -> IO a) -> IO ()
modifyVar_ = modifyMVar_

readVar :: Var a -> IO a
readVar = readMVar

And as an example:

hits <- newVar 0
forkIO $ do ...; modifyVar_ hits (return . (+1)); ...
i <- readVar hits
print ("HITS",i)

Here we have a variable which we modify atomically, so modifications are not interleaved. This use of MVar never blocks on a put. No modifyVar operation should ever block, and they should always complete in a reasonable timeframe. A Var should not be used to protect some external resource, only the variable contained within. Information from a readVar should not be subsequently inserted back into the Var.
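
As a runnable variant of the example above (assuming the Var definitions are in scope, and with an invented workload), including the two-result form of modifyVar:

import Control.Concurrent
import Control.Monad

main :: IO ()
main = do
    hits <- newVar (0 :: Int)
    forM_ [1 .. 10 :: Int] $ \_ -> forkIO $ modifyVar_ hits (return . (+1))
    threadDelay 100000                        -- crude: let the forked threads finish
    -- modifyVar can also compute a result from the state; here we read the
    -- count and reset it in one atomic step
    total <- modifyVar hits (\i -> return (0, i))
    print ("HITS", total)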

Barrier


A barrier starts with no value, is written to once, and read one or more times.

type Barrier a = MVar a

newBarrier :: IO (Barrier a)
newBarrier = newEmptyMVar

signalBarrier :: Barrier a -> a -> IO ()
signalBarrier = putMVar

waitBarrier :: Barrier a -> IO a
waitBarrier = readMVar

And as an example:

bar <- newBarrier
forkIO $ do ...; val <- ...; signalBarrier bar val
print =<< waitBarrier bar

Here we create a barrier which will contain some computed value. A thread is forked to fill the barrier, while the main thread waits for it to complete. A barrier has similarities to a future or promise from other languages, has been known as an IVar in other Haskell work, and in some ways is like a manually managed thunk. It is an error to signal a barrier more than once and a deadlock to never signal it. Since the barrier goes from empty to full, it never blocks on a put, unless you incorrectly call signal more than once.
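
A runnable variant of the example, with an invented computation standing in for the elided work (assuming the Barrier definitions above are in scope):

import Control.Concurrent

main :: IO ()
main = do
    bar <- newBarrier
    forkIO $ do
        threadDelay 1000000              -- simulate an expensive computation
        signalBarrier bar (42 :: Int)
    print =<< waitBarrier bar            -- blocks until the value is signalled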

Combining MVar Flavours - Once


The previous three MVar wrappers are the flavours of MVar which I use regularly. These can be combined into higher-level abstractions specific to certain situations. I now give two examples, intended to show how to combine these primitives.

The once function takes an action, and returns a new action. If the new action is never called, the argument action will never be executed; if the new action is called more than once, the argument action will only be executed once. We can write this function as:

-- requires the ScopedTypeVariables extension for the pattern signature,
-- and join from Control.Monad
once :: IO a -> IO (IO a)
once act = do
    var :: Var (Maybe (Barrier a)) <- newVar Nothing
    return $ join $ modifyVar var $ \v -> case v of
        Nothing -> do b <- newBarrier; return (Just b, do x <- act; signalBarrier b x; return x)
        Just b -> return (Just b, waitBarrier b)

Here we create a variable to store the result, whose state is either Nothing (we have not yet started computing) or Just a barrier (we have started computing, use this barrier to get the result out). I have found 'join $ modifyVar' is a common idiom, used to defer a blocking action (often waitBarrier) until after a modifyVar has completed, ensuring we preserve our invariant of not blocking inside a modifyVar. When running the resulting action, if the variable is a Nothing we create a new barrier, store it, and then start an action (after leaving the modifyVar) to compute the result, signal the barrier and return. If we already have a barrier, we just wait for this barrier.

[Note that you can implement once in terms of MVar directly, using only one MVar, but that violates the simple rules of the restricted MVars - and rightly so, since you would have to use the MVar's empty state to mean both atomic access to shared state and computation in progress.]
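
As a small usage sketch of once (assuming the definitions above are in scope, with an invented workload), showing that the argument action runs a single time however many threads demand the result:

import Control.Concurrent
import Control.Monad

main :: IO ()
main = do
    compute <- once $ do putStrLn "computing..."; return (6 * 7 :: Int)
    forM_ [1 .. 3 :: Int] $ \_ -> forkIO (print =<< compute)
    threadDelay 100000   -- crude: wait for the forked threads
    -- expected output: "computing..." once, then 42 three times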

Combining MVar Flavours - Queue


As another practical example of using these restricted MVars, let us consider a special kind of queue. Messages arrive individually, but are collected in bulk. When someone tries to retrieve messages, if any are waiting they are sent immediately. If there are none, the read blocks until either a message arrives or a new reader arrives, in which case the old reader is sent away with nothing. This can be implemented as:

type Queue a = Var (Either [a] (Barrier [a]))

arrive :: Queue a -> a -> IO ()
arrive q x = modifyVar_ q $ \q -> case q of
    Left xs -> return $ Left $ xs ++ [x]
    Right b -> do signalBarrier b [x]; return $ Left []

collect :: Queue a -> IO [a]
collect q = join $ modifyVar q $ \q -> case q of
    Left xs@(_:_) -> return (Left [], return xs)
    _ -> do
        case q of Right b -> signalBarrier b []; _ -> return ()
        b <- newBarrier
        return (Right b, waitBarrier b)

The type of Queue tells us most of what we need to know about the invariants - Queue has a mutable state, which is either Left (zero or more messages waiting) or a Right (someone waiting to collect messages). If we had used MVar instead of both Var and Barrier, the invariant and design would be far less clear. With these invariants clearly stated, the code just follows directly.
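
And a small usage sketch (assuming the Queue definitions and the wrappers above are in scope), showing both the immediate and the blocking cases of collect:

import Control.Concurrent

main :: IO ()
main = do
    q <- newVar (Left []) :: IO (Queue String)
    arrive q "a"
    arrive q "b"
    print =<< collect q          -- returns ["a","b"] immediately
    forkIO $ do threadDelay 1000000; arrive q "c"
    print =<< collect q          -- blocks until "c" arrives, then prints ["c"]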

Creating New Flavours


I find the three MVar wrappers (Lock, Var, Barrier) much easier to understand since the rules are simpler, which in turn makes maintenance easier. I have also found that most projects benefit from higher-level abstractions in some places. As an example, I defined Queue in one recent project, and Shake defines a Resource type, on top of which the resources feature is implemented. Concurrency is hard, but robust abstractions split the complexity, and thus simplify the programs.

Sunday, June 03, 2012

Hoogle Update

Summary: I just updated the Hoogle website. It looks nicer on the iPhone and the source is on GitHub.

The Website

The Hoogle website is (as always) at http://haskell.org/hoogle. I've just uploaded a fresh Hackage index (currently a manual operation, but one I'm intending to automate imminently). I've also made a number of improvements for anyone using Hoogle on an iPhone - to simulate the iPhone experience click here.

The Source

The source code has moved to GitHub: https://github.com/ndmitchell/hoogle. While darcs is a much nicer version control system than Git, GitHub offers a lot of nice features, so I'm using Hoogle as an experiment. I've been promised that projects on GitHub get lots of contributions, so now I wait!

I'm leaving the bug tracker on Google Code for the moment, and am considering where the Hoogle manual should live, but a GitHub wiki is currently looking likely.