Things start to really heat up when it dawns on us how a good set of standard parts can be used to express totally different ideas just by composing them in different ways.
(Edit 2022-03-10: speaking of lessons, how about the one in the appendix?!)
"Designing is fundamentally about taking things apart. It's about taking things apart in such a way that they can be put back together. i.e. Separating into things that can be composed."
— Rich Hickey, "Design, Composition, and Performance", 2013
tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed 10q
— Douglas McIlroy, Communications of the ACM, 1986
Previously: Shell ain't a bad place to FP: part 0/N
The Pipeline that Douglas Built
Douglas McIlroy famously (infamously?) wrote the following in reply to a problem posed by Jon Bentley for his column "Programming pearls: a literate program" (Communications of the ACM magazine, June 1986, Vol. 29, No. 6).
I first heard of it some years ago in More Shell Less Egg, and saw it again in the book Classic Shell Scripting (which taught me much of my shell-fu). The original was not online then. Now I see the ACM has kindly published it along with the rest of their archives!
Here it is, lightly paraphrased:
# Problem statement (word frequency):
#
# - Read a file of text
# - Determine the n most frequently-used words
# - Print out a sorted list of those words, along with their frequencies
# Douglas McIlroy's answer
# 1. Transliterate complement (-c) of words into newlines,
# squeezing out (-s) duplicates
tr -cs A-Za-z '\n' |
# 2. Transliterate uppercase to lowercase
tr A-Z a-z |
# 3. Sort to bring identical words together
sort |
# 4. Replace each run of duplicate words with
# a single representative, and include a count
uniq -c |
# 5. Sort reverse (-r), numeric (-n)
sort -rn |
# 6. Pass through stream editor; quit after printing
# the first 10 lines received
sed 10q
Here I am, punching the Bash manual page through it…
man bash |
tr -cs A-Za-z '\n' | tr A-Z a-z |
sort | uniq -c | sort -rn |
sed 10q
… and here are the top 10 words by frequency.
4200 the
1822 is
1251 to
1221 a
1147 of
869 if
805 and
570 shell
570 in
563 command
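One nice property of such a pipeline is that we can cut it short anywhere to inspect the intermediate stream. For example, here is a quick probe of my own that stops right after the two tokenising steps:
man bash | tr -cs A-Za-z '\n' | tr A-Z a-z | sed 5q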
"Coolcoolcoolcool nodoubt nodoubt… So, uh… that's it?"
Take Apart! Semantics/Idioms -> Functions
It's worth observing that the same tools composed in different ways express totally different concepts. sort just sorts. uniq just returns uniques. But sort | uniq is an idiom for a set of things, whereas sort | uniq -c | sort -rn is an idiom for a frequency distribution.
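To make the difference concrete, here is a throwaway demonstration (the six-word input is invented purely for illustration, and the whitespace in the counts is approximate):
printf '%s\n' b a b c a b | sort | uniq
# a
# b
# c
printf '%s\n' b a b c a b | sort | uniq -c | sort -rn
#   3 b
#   2 a
#   1 c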
Now…
What if we use Bash functions to name the idioms we see in McIlroy's pipeline?
flatten_paragraphs() {
# English-only for easy explanation, but can be more general
tr -cs A-Za-z '\n'
}
tokenise_lowercase() {
# Transliterate uppercase to lowercase
tr A-Z a-z
}
frequencies() {
# Produce frequency distribution of input
sort | uniq -c | sort -rn
}
take_n() {
# Given a number n, return those many lines of input
# or 10 lines by default, if n is not specified.
sed ${1:-10}q
}
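(Each of these is immediately usable at the prompt, by the way. This quick check of take_n is my own, not part of the original pipeline:)
seq 5 | take_n 3
# 1
# 2
# 3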
And what if we update the pipeline with function calls like this?
man bash |
flatten_paragraphs |
tokenise_lowercase |
frequencies |
take_n 10
Yes, we get the same result!
4200 the
1822 is
1251 to
1221 a
1147 of
869 if
805 and
570 shell
570 in
563 command
Yes, yes, YES! Functions + pipes = mind blown!
Play! Semantics -> Functions -> "Ooh, what if I…"
Now that we've lifted out a couple of text processing functions, we can try to make more text processing functions. Here are some examples.
(Edit 2022-03-10: the "clever" mkfifo-ery contains dangers I did not know of. More at the bottom, in the appendix.)
sort_dictionary() {
# dictionary-order sort (-d), ignoring leading blanks (-b), keyed from field 2
sort -b -d -k2
}
sort_rhyme() {
# reverse each word, sort, reverse back: rhyming endings cluster together
rev | sort -b -d | rev
}
# eliminate stop-words
drop_stopwords() {
# match whole tokens only, so e.g. "command" is not dropped just
# because it happens to contain the stop-word "and"
local stopwords=${1:-"the,is,to,a,of,if,and,in,or,be,by,not,with,for,when,it"}
local grep_pattern="^($(tr , '|' <<<"${stopwords}"))$"
grep -v -E "${grep_pattern}"
}
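# (A quick sanity check of drop_stopwords; my own, not from the original.
# Expected output: shell, command, language, one per line.)
printf '%s\n' the shell is a command language | drop_stopwords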
# n-grams
butlast_n() {
# utility for picking appropriate collection of n-grams
head -n -${1:-0}
}
bigram() {
# we need intermediate state, but we can make it stream,
# instead of accumulating in temp files
mkfifo bigram_buffer
tee >(tail -n +2 > bigram_buffer) |
paste - bigram_buffer |
# take all but the last entry as it is not a bigram
butlast_n 1
rm bigram_buffer
}
trigram() {
# we need intermediate state, but we can make it stream,
# instead of accumulating in temp files
mkfifo trigram_buffer_one trigram_buffer_two
tee >(tail -n +2 > trigram_buffer_one) |
tee >(tail -n +3 > trigram_buffer_two) |
paste - trigram_buffer_one trigram_buffer_two |
# take all but the last 2 entries as they are not trigrams
butlast_n 2
rm trigram_buffer_one trigram_buffer_two
}
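(And a quick smoke test of bigram on a tiny input, purely my own. Note that bigram and trigram create their fifos in the current working directory, so run this somewhere writable. The output is tab-separated pairs:)
printf '%s\n' to be or not to be | bigram
# to    be
# be    or
# or    not
# not   to
# to    be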
Clearly there is a lot to explore about functions and pipelines and other techniques in this code. We will do deep dives in upcoming posts. For now just know that Bash functions…
- name a group of shell statements,
- can be composed with pipes,
- thus intermix with regular shell tools (see the one-liner just below), and
- can help create domain-specific abstractions out of domain-agnostic ones.
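For instance, here is a one-off probe of mine that mixes our new functions with plain old grep, to count how many times the token "builtin" occurs in the Bash man page:
man bash | flatten_paragraphs | tokenise_lowercase | grep -c '^builtin$'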
But before we go there, indulge me and my Oh, and One More Thing (TM) …
Compose Again! Semantics -> Functions -> Play -> Grand New Pipeline
What's the point of making a library of text processing functions if we don't process any text?
Well…
- Start a new shell session.
- Copy-paste all the Bash functions above into it.
- Then copy-paste this pipeline and
- Hit Enter!
# I assume you have Bash version 4+.
man bash |
# pre-process
flatten_paragraphs |
tokenise_lowercase |
drop_stopwords |
# cache raw pre-processed data, if we need to re-analyse later
tee /tmp/bash_manpage_raw_tokens.txt |
# cache various views or compressions of the raw data
tee >(sort_dictionary | uniq > /tmp/bash_manpage_sorted_as_dictionary.txt) |
tee >(sort_rhyme | uniq > /tmp/bash_manpage_sorted_as_rhyme.txt) |
# accumulate various analyses of the OG raw data
tee >(frequencies > /tmp/bash_manpage_token_freqs.txt) |
tee >(bigram | frequencies > /tmp/bash_manpage_bigram_freqs.txt) |
tee >(trigram | frequencies > /tmp/bash_manpage_trigram_freqs.txt) |
take_n
And why not experiment?!
Reorder it! Remove parts of it! Change parts of it! Give it 10 GiB of input!
Play and learn!!!
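For instance, once the grand pipeline has run, the cached files are ordinary text that the very same functions can keep chewing on:
# the five most frequent bigrams, straight from the cache
take_n 5 < /tmp/bash_manpage_bigram_freqs.txt
# or re-analyse the raw tokens without re-running man bash
frequencies < /tmp/bash_manpage_raw_tokens.txt | take_n 5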
(#protip: The shell can auto-complete functions. Type flat and hit TAB, and you should get a completion for flatten_paragraphs.)
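(Relatedly, Bash can list the functions it currently knows about. declare and compgen are Bash builtins, so this should work in any session where you pasted the functions:)
# all defined function names
declare -F
# just the ones starting with "flat"
compgen -A function flat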
Addendum: Remarkable aspects of Doug's O.G. pipeline
The UNIX tools philosophy is clearly at work. sort just sorts, uniq just returns uniques, pipes connect parts. Ho hum.
The things I do find remarkable are:
- The year is now 2022, which is to say McIlroy wrote the program some three and a half decades ago. It continues to edify; the ideas it contains have a timeless quality.
- It also works as-is on my cheap Thinkpad running GNU/Linux (Ubuntu), even though the original code was written for a UNIX that might live only in a museum today (or maybe in your bank). Odds look good that come 2036 it will still work as-is on mainstream boxen.
- It is plain text, and so eminently portable. (I slapped it into the org-mode file of this blog post, evaluated it via org-babel, and captured the results inline. How? Because Emacs org-babel can simply "shell out"; i.e. make a standard request to a standard shell to evaluate the program and have the shell process return any result in a standard way.)
- I bet it runs way faster now, because my box is a supercomputer compared to the UNIX boxen of that era.
- Pipes remove the burden of explicit state handling. Oh, also, Douglas McIlroy invented UNIX pipes.
- The entire composition is itself a function (sketched right after this list).
- map (tokenise), filter (uniquify), reduce (frequency distribution), and early termination (take first 10) are automatic, needing no special machinery.
- It is an abstract computation that is independent of data source/sink. We can hook into any I/O combination of sockets, or fifo pipes, or files on disk without modifying the pipeline code, much like Clojure transducers or monadic I/O in Haskell land.
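That last cluster of points is easy to demonstrate. Wrap the whole composition in a function of its own (top_words is my name for it, not McIlroy's) and any source of text can feed it:
top_words() {
# map (tokenise), reduce (frequencies), early-terminate (take_n)
flatten_paragraphs | tokenise_lowercase | frequencies | take_n "${1:-10}"
}
man grep | top_words 5
# (assumes the cache file from the grand pipeline above exists)
top_words 5 < /tmp/bash_manpage_raw_tokens.txt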
Most importantly, a rank amateur like me could figure out each part and the whole in one sitting. It is eminently doable because:
- each sub part is understandable in isolation and
- the whole is amenable to incremental as well as large-scale adaptation,
- in playful, interactive, low-risk ways.
I was clueless then and had to dig through manpages and flail about at the command line. It took me a while to grok the function of each tool and how it is applied to the text processing problem.
If you haven't already, I'd say bear that small cost, because it teaches a priceless lesson in modular, composable, functional architecture.
Plus, why not step up one's shell-fu?
Appendix: an unexpected masterclass!
My head is exploding. Prof. McIlroy emailed me some remarks. (There is a backstory, but first the important stuff.)
The danger lurking in the pipes
(Emails redacted to stymie spambots.)
----- Original message -----
From: Douglas McIlroy <Email at his web page. Link posted below.>
To: Aditya Athalye <Email at this web page. See footer.>
Subject: Musings on your blog
Date: Wednesday, March 09, 2022 8:16 PM
Aditya,
A reader might complain that the bigram example in your blog
can be done more efficiently, with a similar amount of typing,
by a sed script instead of mkfifo, tee, and paste:
sed -n '1bx; H; g; s/\n/ /p; s/.* //; :x; h'
A slightly different example is immune to this charge:
trap "rm -f fifo" 0 HUP TERM PIPE INT
mkfifo fifo
sort |
uniq |
tee >(rev | sort >fifo) |
join -o 1.1 - fifo >palindromes
But ... join can't move until rev|sort produces output, so
essentially the whole word list piles up in its input pipe.
If there's not enough buffer space, deadlock will occur.
The moral of this tale is that loops in the (undirected)
graph of a pipe network pose a hazard of deadlock if some
pipe queue necessarily suffers unbounded growth. This
hazard manifests in the palindrome example but not in
the bigram example.
Sidelight. Buffering by C's stdio package can cause
deadlock in a feedback loop. A process that buffers its
output will starve if it needs feedback from stuff that's
waiting in its output buffer. stdio's buffering is evil!
Prof. McIlroy also pointed me to his notes on coroutine-based programs (examples of stream processing in Unix).
In case you haven't already seen it,
https://www.cs.dartmouth.edu/~doug/sieve/sieve.pdf
exhibits some unusual plumbing.
The PDF is available at his Dartmouth College home page, which has other fun stuff too.
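(Aside: since the sed one-liner went right over my head at first, here is my amateur decoding of it, spread out to one command per line and run on a four-word input. This is my reading, not the Professor's explanation, so treat it accordingly.)
printf '%s\n' to be or not | sed -n '
# line 1: nothing to pair with yet, so jump straight to the end
1bx
# later lines: append the current word to the held previous word
H
# copy the held pair back into the pattern space
g
# join the pair with a space and print it: one bigram
s/\n/ /p
# keep only the current word
s/.* //
:x
# remember the current word for the next line
h
'
# to be
# be or
# or not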
Backstory
I habitually cold-email people if something they did or said moved me in some constructive way. So, I wrote a little thank you note to Prof. McIlroy after posting this blog entry (nobody thinks straight at 3 AM).
He replied! We exchanged a couple of emails. "That was so cool!" thought I, and went back to life as usual.
Yesterday he emailed these follow-up remarks! A nice little masterclass in Unix programming that I'm so pleased to share here, with Prof. McIlroy's gracious permission.
Postscript
Wow, this is one of the best emails I've ever received! The reader's complaints are warranted and deserved.
I was fooling around with mkfifo and accidentally discovered it "worked" after a fashion. "What's the buffering story?" crossed my mind, but I didn't find out. I'm also slapping my forehead for not using trap to auto-clean the pipes; a sketch of what that might look like follows below. And needless to say, my sed-fu is weak. Brown belt at best :)
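Here is roughly what I have in mind now, folded back into bigram. A sketch only: the subshell body, the mktemp -u fifo name, and the EXIT trap are my choices, not something the Professor prescribed:
bigram() (
# subshell body, so the trap and the fifo variable stay local to this call
fifo=$(mktemp -u) || return
mkfifo "${fifo}" || return
# remove the fifo however the subshell exits
trap 'rm -f "${fifo}"' EXIT
tee >(tail -n +2 > "${fifo}") |
paste - "${fifo}" |
butlast_n 1
)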
I rue the fact that I haven't paid due attention to The Machine. I can write Clojure to make a living, but can't write C to save my life :))
So now this excellent complaint leaves me no choice but to crack open my long-unused copies of the K&R book and The Unix Programming Environment.
Thank you so much for taking the time to teach me, Professor!
—
Next up: Part 2/N: Deep-dive into bash functions and function design techniques
- Using functions to craft one’s own Bytes-sized UNIX tools
- Using them interactively like regular UNIX tools
- maybe more…
The ol' noodle is noodlin' over it. Stay tooned!