Dan Piponi’s latest post “Beyond Monads” has prompted some wonderment (and forehead-slapping). For example:
How did we all miss that before?
The answer (of course!) is that while we might have, they didn’t.
For example, Edward Kmett’s category-extras package has had Control.Monad.Indexed available for use in Haskell since early last year, while the concept for its implementation in Haskell has been around since at least 2005 (see the paper “Monadic Augment and Generalised Short Cut Fusion”). Needless to say, Oleg has been there too.
For me, this prompted a vaguely-related revelation. Say that we have a Category class:
class Category (⤳) where id ∷ a ⤳ a (∘) ∷ (b ⤳ c) → (a ⤳ b) → (a ⤳ c)
I had recently been wondering how categories other than (→), such as graphs (and even the trivial category) fit into this model. Now that I have figured it out, the answer is simple; it is a matter of arity. Graphs and other datastructure-related categories don’t have the extra type arguments (unless you’re working in a dependently-typed language), and so can’t fit into this. What we want to do is fake the extra arguments, by having some extra ones that we can just ignore (the name, FF, is short for FakeFake, i.e. two fake arguments).
newtype FF f a b = FF { unFF :: f }
Now we can define the trivial category!
class Category (FF ()) where id = FF () f ∘ g = FF ()
Viewed in this light, it is obvious why (as Dan notes in the original post) Monoids give rise to Monads and Categories give rise to ParametrizedMonads, as we have:
class Monoid m ⇒ Category (FF m) where id = FF mempty (FF f) ∘ (FF g) = FF (mappend f g)
Next, we want to write implementations for all our old Monads in the new PMonad class. We might do it like this (with a new newtype for the different arguments of this datatype, RealFakeFakeReal):
newtype RFFR f x y a = RFFR { unRFFR ∷ (f a) } class PMonad m where (>>=) ∷ m s1 s2 a → (a → m s2 s3 b) → m s1 s3 b return ∷ a → m s s a instance (Old.Monad m) ⇒ PMonad (RFFR m) where (RFFR x) >>= f = RFFR (x Old.>>= unRFFR ∘ f) return x = RFFR (Old.return x)
There is a big issue that this highlights. It is hard to work with types of different arities in Haskell. (This is noted in the ‘quantified contexts’ proposal.)
As an example, since we now know that every Monoid is a Category, we might think to drop Monoids and just use Categories everywhere (I should note that this is actually a silly idea
!). Unfortunately, we can’t just do this as we’ll have to use FF wrappers/unwrappers everywhere:
x = unFF $ FF [1,2,3] ∘ FF [4,5,6]
I have not yet found a workaround for this. Help would be nice ![]()

Cleaning up a set of tags with Awk
Introduction
David R. MacIver has recently written this blog post about cleaning up a set of tags. This blog post, on the other hand, is about a nice old Unix tool called ‘awk’.
Awk is one of those programs that is often overlooked. It is really a small domain-specific language for processing text. In some ways it resembles sed, but it is more powerful, and it especially excels at processing line- and field-structured input.
Processing cite-u-like’s data
First of all, before we begin I want to say this this isn’t a criticism of David’s work. Using a general-purpose language like Ruby to process data comes with several benefits. This post is to explain the benefits of awk and what it excels at doing.
Now, cite-u-like’s tag data comes in a pipe-separated format, so we have input like this:
From now on, whenever you see x-separated data, I want you to scream ‘USE AWK!’
Awk scripts look something like this:
pattern { expression }pattern is used to match against a record, and if successful, the action in the braces (expression) will be carried out. But what is a record? Awk allows you to define what a line is using the variable
RS(short for ‘record separator’). By default it is set to\n(so that each line is a record), which is what we want here.Within the expression you can refer to fields of the current record, using the syntax
$n.$0refers to the whole record, while$1,$2… refer to individual fields.Awk also allows you to define what the field separator will be via the variable
FS.So how do we set these variables? Awk has two special patterns
BEGINandEND, which run before and after everything else. In this case, we want the fields to be separated by|, so we use the pattern:BEGIN { FS = "|" }This is rather long-winded, so mawk (an implementation of awk) also allows you to set
FSvia a command-line option-F.For this application, we want the last field of the record (we could hard-code it as
$4, but we’re exploring awk!). Awk provides the number of fields in the record as the variableNF(number of fields), so we want to access this field. We do so using the record syntax$and the variableNF.So to put all this together, what we want is the command:
We set the field separator to “|”, and write an awk expression to print the last field of each record. Since we left the pattern empty, this expression is evaluated on every record.
Awk vs. Ruby
So what good is learning all this, anyway? There are a couple of reasons:
Filtering data with awk
After the above we use
sortanduniqthe same as David does to get the results in the following form:David uses the following to filter out lines with no alphabetical content:
We can use awk’s patterns to do the same thing:
Here we use the
~(match) operator to write a pattern that matches only the records with an alphabetical character in them. (Remember that$0refers to the entire record.) Notice also that we can leave off the expression after the pattern, because it defaults to{ print }, which is exactly what we want.In this case, awk really shines. On my machine it outperforms the Ruby version by a factor of 8–9.
Programming with awk
The third task that David does is to consolidate all tags which are differentiated only by hyphens or underscores. That is, ‘a-tag’, ‘atag’, and ‘a_tag’ should all be considered the same. We choose which one to put into the output by whichever one is used the most times (and then we normalize the tag by replacing ‘-’ with ‘_’ so there are only underscores in the output).
Here is David’s code to do the job:
I’m not going to explain it here, because that’s not the point of the post
Here’s my awk script:
{ tag_counts[$2] = $1 } END { for (tag in tag_counts) { normtag=tag gsub(/-|_/,"",normtag) count=tag_counts[tag] sum[normtag]+=count if (count > max[normtag]) { names[normtag]=tag max[normtag]=count } } for (tag in names) { finaltag=names[tag] gsub(/(-|_)+/,"_",finaltag) print " " sum[tag] " " finaltag } }That’s right, you can use awk to do some ordinary programming tasks! Arrays are used using the usual syntax, and awk even has a foreach-style loop for looping over them. I’ll walk through the rest of the script slowly.
First we apply an expression to each record, creating an array of tags and their counts:
{ tag_counts[$2] = $1 }Then, once everything has finished (the
ENDpattern), we process this array. For each tag, we do the following:Normalize the tag (the gsub function overwrites the variable, so we have to make a copy):
(Notice the similarity between this and Ruby’s equivalent
normtag.gsub!(/-|_/, "")!)Get the count for that tag and add it to the count for the normalized version:
Like in PHP and Perl, if a value is not present in an array it is automatically added with a default value.
Next we check to see if the current tag is the commonest version of the normalized tag, and if so we save its name and count in two other arrays:
if (count > max[normtag]) { names[normtag]=tag max[normtag]=count }Notice again the usefulness of a default value for nonexistent keys:
count > max[normtag]will be true ifmax[normtag]doesn’t exist.Now we have all we need to print out the answer. For each tag we normalize it to the final version (remembering to make a copy):
Then we print out the line (concatenation is done by simply juxtaposing variables or strings):
If you’ve been watching closely you’ll notice there is a small difference between the awk and the Ruby scripts; Ruby sorts before outputting, while the awk version will come out in a non-defined order. This is fine! We can use the standard *nix tool ‘sort’ to sort the lines;
(Note that this fits in with the *nix philosophy of ‘do one thing well’ and reusing small components.)
Again, the awk version outperforms the Ruby by a factor of 3–4.
Reading external commands and files
Unfortunately there doesn’t seem to be a command-line stemming program (a quick Perl script would suffice but it isn’t what we’re here for!), so I’ll skip that stage (here’s one of the aforementioned weaknesses of a non-general-purpose language). Instead we’ll go straight to implementing stopwords.
Here’s David’s Ruby code again:
And here’s my equivalent in awk:
BEGIN { while (getline < "smart.txt") { stopwords[$0] = 1 } } !($2 in stopwords)Here I use the ‘getline’ function which does as its name suggests. We make an array of all the stopwords, with 1 as a placeholder value. The pattern is then short and simple: Print every record where the tag isn’t in the stopwords (again, we can leave off the expression to print the whole record).
Note: There is a discrepancy here: David claims this eliminated 46 tags, while I get a value of 368 for both my awk code and his Ruby code.
Again, the Ruby takes about 8 times as long to execute.
Conclusion
Here’s a couple of points:
Record- or field-oriented data? Think awk.
Don’t discount it just because it’s venerable. It is very well-suited to its task.
pattern { expression }syntax is extremely flexible.Awk’s regular expressions are fast.