Thursday, May 22, 2008

Interactive TagSoup parsing

I've written quite a few programs using the tagsoup library, but have never really used the library interactively. Today I was wondering how many packages on hackage use all lower case names, compared to those starting with an initial capital. This sounds like a great opportunity to experiment! The rest of this post is a GHCi transcript, with my comments on what I'm doing prefixed with -- characters.


$ ghci
GHCi, version 6.8.2: http://www.haskell.org/ghc/ :? for help
Loading package base ... linking ... done.
-- load some useful packages
Prelude> :m Text.HTML.TagSoup Text.HTML.Download Data.List Data.Char Data.Maybe
Prelude Data.Maybe Data.Char Data.List Text.HTML.Download Text.HTML.TagSoup>
-- ouch, that prompt is a bit long - we can use :set prompt to shorten it
-- side note: I actually supplied the patch for set prompt :)

:set prompt "Meep> "
-- let's download the list of packages
Meep> src <- openURL "http://hackage.haskell.org/packages/archive/pkg-list.html"
... src scrolls past the screen ...
-- parse the file, dropping everything before the packages
Meep> let parsed = dropWhile (~/= "<h3>") $ parseTags src
-- grab the list of packages
Meep> let packages = sort [x | a:TagText x:_ <- tails parsed, a ~== "<a href>"]
-- now we can query the list of packages
Meep> length packages
648
Meep> length $ filter (all isLower) packages
320
Meep> length $ filter ('_' `elem`) packages
0
Meep> length $ filter ('-' `elem`) packages
165
Meep> length $ filter (any isUpper . dropWhile isUpper) packages
100
Meep> length $ filter (isPrefixOf "hs" . map toLower) packages
47
Meep> length $ filter (any isDigit) packages
37
Meep> reverse $ sort $ map (\(x:xs) -> (1 + length xs,x)) $ group $ sort $ concat packages
[(484,'e'),(374,'a'),(346,'r'),(336,'s'),(335,'t'),(306,'i'),(272,'l'),(248,'c'),
 (247,'n'),(240,'o'),(227,'p'),(209,'h'),(185,'-'),(171,'m'),(159,'d'),(126,'g'),
 (112,'b'),(96,'u'),(87,'y'),(78,'k'),(76,'f'),(74,'x'),(58,'S'),(53,'H'),(35,'w'),
 (33,'v'),(29,'q'),(29,'L'),(27,'A'),(26,'F'),(24,'D'),(23,'C'),(22,'T'),(16,'P'),
 (16,'M'),(16,'I'),(16,'G'),(13,'B'),(12,'W'),(12,'3'),(12,'2'),(10,'O'),(9,'R'),
 (9,'1'),(8,'z'),(8,'j'),(8,'E'),(7,'X'),(7,'U'),(7,'N'),(6,'Y'),(6,'V'),(5,'J'),
 (4,'Q'),(4,'5'),(4,'4'),(3,'Z'),(3,'8'),(3,'6'),(1,'9')]


We can see that loads of packages (320 of 648) are entirely lower case, many of the rest use an upper case letter somewhere, 100 use CamelCase, 47 start with "hs", 37 contain a digit, none use "_", but 165 use "-". The final query counts the most common characters in hackage package names, and rather unsurprisingly, the letters roughly follow English letter frequency.
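
The same session can also be packaged up as a standalone program. Here is a rough sketch, assuming exactly the modules imported above; if your tagsoup doesn't provide Text.HTML.Download, substitute any other way of fetching the page for openURL:


import Text.HTML.TagSoup
import Text.HTML.Download (openURL)
import Data.Char
import Data.List

main :: IO ()
main = do
    src <- openURL "http://hackage.haskell.org/packages/archive/pkg-list.html"
    -- drop everything before the package listing, then pull out the link text
    let parsed = dropWhile (~/= "<h3>") $ parseTags src
        packages = sort [x | a:TagText x:_ <- tails parsed, a ~== "<a href>"]
    print $ length packages
    print $ length $ filter (all isLower) packages
    -- character frequencies, most common first
    print $ reverse $ sort $ map (\(x:xs) -> (1 + length xs, x))
          $ group $ sort $ concat packages
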

TagSoup and GHCi make a potent combination for obtaining and playing with webpages.

Sunday, May 18, 2008

Haskell and Performance

There has been lots of discussion on the Haskell mailing lists about the speed of Haskell. There are many conflicting opinions, and lots of different advice. Some of the information on Haskell's performance is written as a sales pitch, some is based on outdated knowledge. Since I've been working on optimisation for a while, I thought I'd try to give a snapshot of Haskell performance. Most of the following is personal opinion, and others could quite validly disagree. Since GHC is the best-performing Haskell compiler, I use Haskell to mean GHC with the -O2 flag throughout.

High-level Haskell is not as fast as C

If you write Haskell in a standard manner, it is unlikely to perform as fast as C. In most cases, linked lists are slower than arrays, and laziness is more expensive than strictness. The Haskell code will almost always be shorter and more concise, since it abstracts over low-level detail, but spelling that detail out in the C code is likely to produce a faster program.
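
For a concrete (hypothetical) example of the high-level style, consider computing a mean the obvious way:


-- Concise, but it allocates a lazy list of boxed Doubles and
-- traverses it twice (once for sum, once for length).
mean :: [Double] -> Double
mean xs = sum xs / fromIntegral (length xs)
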

Low-level Haskell is competitive with C

If you use GHC with unboxed operations, written in a low-level style, you can obtain performance similar to C. The Haskell won't be as nice as before, but will still probably express fewer details than the C code. Writing in such a low-level manner requires more knowledge of Haskell, and is probably not something a beginner should attempt. However, for a critical inner loop, low-level Haskell is a very attractive option.
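
Continuing the hypothetical mean example, the lower-level version threads strict accumulators through a single pass; with -O2, GHC can typically unbox the Double and Int and compile it to a tight loop:


{-# LANGUAGE BangPatterns #-}

-- One pass over the list, with both accumulators forced at each step.
meanLoop :: [Double] -> Double
meanLoop = go 0 0
  where
    go :: Double -> Int -> [Double] -> Double
    go !s !n []     = s / fromIntegral n
    go !s !n (x:xs) = go (s + x) (n + 1) xs
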

Haskell's Code Generator is weak

The back-end assembly generator in GHC is a weak link, but improvements are being carried out. Once this work is finished, it is likely that low-level Haskell will be able to produce assembly code nearly identical to C's.

Some Haskell libraries are poorly optimised

Some of the central Haskell libraries have functions which are badly optimised. For example, the MTL library is known to perform poorly, and the words and isSpace functions in the base library aren't very good. These issues are being addressed over time: the Binary and ByteString libraries have plugged two holes, and a new implementation of inits has been contributed. Over time, more issues will be identified and fixed, improving the speed of all code.

Haskell's multi-threaded performance is amazing

A lot of clever people have done a lot of clever work on making multi-threaded programming in Haskell both simple and fast. While low-level speed matters for ordinary sequential programming, multi-threaded programming brings in many higher-level performance considerations. Haskell supports better abstraction, can optimise at that higher level, and as a result can outperform C.
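
As a tiny illustration (my own sketch, not code from the discussion), the par and pseq combinators from Control.Parallel express parallelism at the level of "evaluate these two things at the same time", leaving the scheduling to the runtime:


import Control.Parallel (par, pseq)

-- Naive Fibonacci, sparking the two recursive calls in parallel above
-- a threshold. Compile with -threaded and run with +RTS -N2.
parFib :: Int -> Integer
parFib n
  | n < 25    = seqFib n                 -- small problems stay sequential
  | otherwise = x `par` (y `pseq` x + y) -- spark x, evaluate y, combine
  where
    x = parFib (n - 1)
    y = parFib (n - 2)
    seqFib m = if m < 2 then toInteger m else seqFib (m - 1) + seqFib (m - 2)
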

Reading the Core is not easy

Standard advice to people trying to optimise Haskell is to read the Core - the low-level functional language used as an intermediate form in the compiler. While Core provides much useful information about what optimisations were performed, it isn't easy to read, and interpreting it takes a lot of practice. Some effort has been made to make reading Core easier, but I still wouldn't recommend it for beginners.
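
If you do want to look at the Core, the usual starting point is GHC's -ddump-simpl flag, which prints the optimised Core after the simplifier has run (MyModule.hs is a placeholder name):


$ ghc -O2 -c -ddump-simpl MyModule.hs > MyModule.core
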

Optimisation without profiling is pointless

People often want to make programs run faster. In general, this activity is a waste of time. I recently wrote a program for the HCI group at my university which takes 10 minutes to run and requires 4GB of RAM on a very expensive machine. I haven't even bothered to profile it, because I have better things to do. Unless the speed of something actually makes a difference, you should not spend excessive effort on optimisation.

If you have determined that the program in question is running too slowly, then profile. After profiling, you can usually identify some small part of the program that needs optimisation. Too often there is a focus on speeding up something that is not slow enough to make a difference.
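
For reference, a typical profiling run with the GHC of this era looks something like the following; Main.hs is a placeholder, later releases renamed -auto-all, and the resulting main.prof lists cost centres by time and allocation:


$ ghc -O2 --make -prof -auto-all Main.hs -o main
$ ./main +RTS -p
$ cat main.prof
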

The trend is for higher-level optimisation

As time goes by, higher-level programs keep getting faster. The ByteString work allows programmers to write high-level programs that are competitive with C. Performance enhancements are being made to the compiler regularly: pointer tagging, constructor specialisation and the like are all helping to improve things. Longer-term projects such as Supero and NDP are showing some nice results. Optimisation is a difficult problem, but progress is being made, allowing programs to be written at a higher level.

My goal is that one day Haskell programs will be written in a very declarative, high-level style - and outperform C at the same time. I think this goal is attainable, albeit some way in the future.

Saturday, May 10, 2008

Bad strictness

Haskell has one primitive construct for enforcing strictness, seq :: a -> b -> b. The idea is that the first argument is evaluated to weak-head normal form (WHNF), then the second argument is evaluated to WHNF and returned. WHNF is reduction until the outermost bit is available - either a function value, or an outer constructor.
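
For example (a small test of my own, not from the original post), seq forces only as far as the outermost constructor:


main :: IO ()
main = do
    -- Just undefined is already in WHNF (the Just constructor is there),
    -- so seq has nothing more to do and "ok" is printed.
    putStrLn (seq (Just undefined) "ok")
    -- undefined has no outer constructor to reach, so forcing it
    -- throws Prelude.undefined.
    putStrLn (seq (undefined :: Maybe Int) "never printed")
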

You can model the behaviour by introducing an evaluate function, in a lower-level language, and showing how to perform reduction:


evaluate (seq a b) = do
    a' <- evaluate a
    b' <- evaluate b
    return b'


The evaluate function must return an already-evaluated value, and it wants to return b, which is not necessarily evaluated, so it must make a recursive call. The evaluate function for id, which simply returns its argument, is:


evaluate (id a) = do
    a' <- evaluate a
    return a'


Notice that even though id does "nothing", it still has to evaluate its argument. Of course, evaluate (id x) is the same as evaluate x, so id performs no additional work.

Haskell is lazy, so if an expression has already been evaluated, then the evaluate call will be incredibly cheap, and just return the previous result. So let's consider the result of calling seq with the same argument twice.


evaluate (seq a a) = do
    a' <- evaluate a
    return a'


This time the second evaluation of a is skipped, as a is already evaluated. We can easily see that evaluating seq a a is exactly equivalent to evaluating a. This means that any code which writes a `seq` a is wrong. There is plenty of this code around; one example (which prompted me to write this) is on slide 15 of this talk.

The other classic seq-related mistake is id $! x. The ($!) operator is for strict application, and is defined:


f $! x = x `seq` f x


For the particular instance of id $! x, we obtain x `seq` id x. Of course, all that id x does is evaluate x, so again there is no change from just writing x.

There are valid uses of seq, but any time you see either of the following constructs, you know the programmer got it wrong:

x `seq` x
id $! x
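
For contrast, a valid use of seq (again a sketch of my own, not from the post) is forcing an accumulator before the recursive call, so that a left fold runs in constant space rather than building a chain of thunks:


sumStrict :: [Int] -> Int
sumStrict = go 0
  where
    go acc []     = acc
    -- force acc' now, rather than passing along the thunk (acc + x)
    go acc (x:xs) = let acc' = acc + x
                    in acc' `seq` go acc' xs
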