Mergesort. Dissected.

4 07 2008

(Note: All code in this under the WTFPL [0]. Except possibly for Stepane Delcroix’s adapted code, which is under the MIT/X11 license)

Very recently, I decided to spend some time coding those standard algorithms that every programmer should know, but I am woefully ignorant of. The plan was to do an algorithm a day. Unfortunately, yesterday night, I started on the Merge sort algorithm, and I found some ancient bash script I wrote to profile and graph the running time of some other sorts (i.e. the bubble sort). Which led to me spending precious time playing around with gnuplot and various program parameters. Expect graphs. Lots of them.

Some background: I’ve lately developed a foreboding sense of being a ‘code monkey’. While I know how to create a UI using GTK+ (and wx), and how to link up stuff using DBus, I still don’t know a fig about QuickSort (which is next on my list). Being a CS student, this is in my view a crime. So I pulled out the big ‘ol Cormen, Rivest, et al.[1] and started to peruse it (btw, I still haven’t done my algos class).

It took me very little time to get a basic implementation of the merge sort ready. I was about to get back to my GSoC work [2], when I remembered I had written a bash script I wrote some time last Novemeber in my introductory CS lab session (of my own volition). At the time, having already programmed for a while, I managed to complete all my lab sessions very quickly, and wasted my time in the lab doing non-sensical things to my C programs, this being one of them (I still fondly remember a CS lab teaching assistant telling me about the heap sort. Unfortunately, I didn’t get it at the time). Nonetheless, I was surprised when I found this script, since I honestly don’t know that much scripting now.


# Compile over different optimizations
for i in `seq 0 3`; do
make CFLAGS="-g -O$i" LDFLAGS="-pg";
echo "#mainO$i SORT DATA" > mainO$i.dat;
echo "#No. of elements     Time taken (ms)" >> mainO$i.dat;
for j in `seq 1000 1000 50000`; do
./main $j;
gprof -b -a main > mainO$;
t=$(awk '/sort/ {print $2}' mainO$ | head -n 1);
echo "$j        $t" >> mainO$i.dat;
echo "$j        $t";
echo "mainO$j";
make clean;
gnuplot tmpl.p;

The script has been subtley modified to make it more robust and cleaner. The original source is here[3]. Modifying the script required me to go over all of those bash basics I keep forgetting, and then doing the same for gnuplot.

I initially made the merge sort work only for arrays of size 2^n, and had a some minor complications extending it to an arbitrary sized array. I don’t know about the optimality of my method, but I essentially split every array into two parts, one that’s 2^n sized, it’s remainder, recursively. This ends when the size of the array is 1, which happens always. So I needed a method to find the nearest power of 2. Luckily, Jeffery Steadfast had already written something about this. [4]

When I first read it, I made some absolutely baseless and stupid comment about how his method works (Method #2 that is). It’s far simpler than I thought. To redeem myself however, I figured out how the very efficient method suggested by Stephane Delcroix works (Method #4). It is by far the most beautiful code snippet I’ve seen, doing a sort of a binary search for the top most bit set. Here is the code:


nearest_pow (long num)
long i, j;
(i = num & 0xFFFF0000) || (i = num);
(j = i & 0xFF00FF00) || (j = i);
(i = j & 0xF0F0F0F0) || (i = j);
(j = i & 0xCCCCCCCC) || (j = i);
(i = j & 0xAAAAAAAA) || (i = j);
return i;

And here is a graphical break down of it:

There are 4 bytes to an int, so 8 hex numbers. Each step divides the into into 2. For example in the first step, it checks if there exists a bit set in the first 2 bytes (4 hex numbers), using & FFFF0000. If so, the last 2 bytes are set to 0. In the other case, the first 2 bytes are anyway 0. Now, you only need to focus on 2 bytes. The mask & FF00FF00 works on *both* the upper 2 bytes and the lower two bytes at the same time. In effect, you’re applying the mask FF00 on the 2 bytes filtered from the first step. This continues to next step as well, with F0F0F0F0 being F0 in for the single byte you’ve filtered (remember, F is 1/2 a byte long). A in binary is 11001100, and C is 10101010. So at the last step, the uppermost bit survives, and all the lower ones are set to 0. And that’s the nearest power of two (on the lower end).

Back to merge sort. Here are some graphs. (Programs compiled with (-g -O{0,1,2,3} -pg), I’ve iterated from 1e6 to 5e7 in divisions of 1e6. That would be 50 data points. Attempts to profile stuff below about 1e5 fail, since it’s too short. I chose 1e6 because I needed to strike a balance between torturing my computer and my patience.):

(Note, list_ is a scratch buffer. It’s been allocated the same memory as list).

void sort (long len, int *list, int *list_)
long i = 0, idx1 = 0, idx2, idx;
idx2 = idx = prev_pow(len);
/* divide */
if (len != 1)
sort (idx, list, list_);
sort (len - idx, list + idx, list_);
/* merge */
while ((idx1 < idx) && (idx2 < len))
list_[i++] = (list[idx1] < list[idx2])?list[idx1++]:list[idx2++];
if (idx1 < idx)
memcpy (list_+i, list+idx1, sizeof(int)*(idx - idx1));
memcpy (list_+i, list+idx2, sizeof(int)*(len - idx2));
memcpy (list, list_, sizeof(int)*(len));


Mergesort Vanillia

Mergesort Vanillia: Using Lines

Using CSplines

Mergesort Vanillia: Using CSplines

Mergesort Vanillia. Using Beizer

Mergesort Vanillia. Using Beizer

At first I thought the spikeiness of the graphs was caused due to load variations on my system, but after looking at how smooth the other graphs were, I don’t know. Perhaps it’s caused by variation due to sensitivity of the initial array (but technically speaking, merge sort is tightly bound on O(n * log n), and there shouldn’t be that much variation…). Does anyone  have an insight on what might be the matter?

A very popular modification of the merge-sort algorithm is to use an insertion sort for branches smaller than a particular cutoff [5]. I of course, wanted to see whether there was a relation between input size and optimal cutoff. However, on analysis, I found that the cutoff did not help in any way (note: the cutoff=2 case (i.e. it doesn’t insert sort much) actually takes *longer* than the vanilla mergesort). Perhaps it is something wrong in my insertion sort implementation, though I don’t see anything that could go wrong (after all, insertion sort is very trivial). (All graphs made using the program compiled with (-g -O2 -pg), given increasing cutoffs at fixed array sizes).


idx2 = idx = prev_pow(len);
/* insertion-sort */
if (len < cutoff)
for (idx1 = 0; idx1< len; idx1++)
i = idx1;
for (idx2 = idx1; idx2< len; idx2++)
i = (list[idx2] < list[i])?idx2:i;
if (i == idx1)
idx2 = list[idx1];
list[idx1] = list[i];
list[i] = idx2;
/* divide */
else     {
sort (idx, list, list_);


Using Lines

Mergesort (with Insertion Sort): Using Lines

Using CSplines

Mergesort (with Insertion Sort): Using CSplines

Using Bezier

Mergesort (with Insertion Sort): Using Bezier

Something interesting worth noting are the big jumps somewhere between 60 and 80, and 120 and 140, which I’m sure are 64 and 128. This is because that marks a branching point for merge sort. So by having an insert sort at 65, you ‘save’ the merge sort a branch. I don’t see any savings though (btw, 64 is the most often quoted cutoff I’ve seen). I also tried to plot a 3d graph, which looks rather terrible:
Finally, I decided since I’ve already done the effort, why not plot graphs for Bubble Sort and Insertion Sort (these were the only two sorts I did in the past).

Using Lines

Bubble Sort: Using lines

Using CSplines

Bubble Sort: Using CSplines

Using Bezier

Bubble Sort: Using Bezier

Using lines

Insertion Sort: Using Lines

Using CSplines

Insertion Sort: Using CSplines

Using Bezier

Insertion Sort: Using Bezier

Nothing particularly interesting to see, other than proof that bubble sort is the worst sorting algorithm of all time. Also, I’d like to point out how smooth the graphs are.

That’s it for today. Hopefully, more graphs are on their way.

[0] This more for my amusement than anything else. I think it is absolutely unneccessary to license code as trivial as shown in this blog.
[1] Introduction to Algorithms : Cormen, Rivest, et. al
I recently found out that a friend of mine studying economics at Dartmouth (Darth Mouth ;-)),
had a course taken by one of these authors. Which is just awesome beyond words. Pity he wasn’t
doing a solid algorithms course with them though.

[2] For people interested in my GSoC progress, I’m happy to say that the plugin is fundamentally complete, and requires some bug fixing and feature bloating. There are a few changes to the Anjuta source code, and until that’s been approved (which requires me cleaning it up and sending it first), I can’t release it. When I do though, it’d be awesome if you’d try it out. Vim is something that has so many bells and whistles, its hard to make sure it doesn’t fail on somebody else’s configuration.

[3] The astute reader may notice that the file is hosted at Yes, I have indeed bought my first plot on the internet. Hopefully someday I’ll find enough time to build a home there.
[4] Jeffery Steadfast Nearest Power of Two
[5] The Algoritmist’s take

[1] A template. sed -i ‘s/tmpl/<algo_name>/g’ and write some code. ./rand.bash to use. Easy to extend to other kinds of algorithms as well. tmpl.tar.gz
[2] Code, graphs, and curresponding data (with proper indetnation). /code/sort




7 responses

4 07 2008
Marius Gedminas

Cormen et al. is a very good book.

I had memorized Quicksort but never understood how it worked. Then I saw the algorithm expressed as a Haskell two-liner and it suddenly became clear. And I don’t even know Haskell, just read the tutorial once!

4 07 2008

A few years ago, Jeffrey Stedfast blogged about a bunch of sorting algorithms for a while (he was hitting on a bunch of them).

Here’s the link to the merge sort post:

5 07 2008

The usual problem with graphing is that you want to smooth out the noise but keep the significant blips intact (not half-smoothed away). there is a really good way to do that: use a Butterworth filter. It’s a low-pass signal filter, meant to smooth out digital (or analog; the original filter is an analog ciruit), but it works great for graphs like this as well.

The only messy part is to figure out the parameters for a given filter, but fortunately the Signal toolbox in Octave (or Matlab of course) has a “butter” function that will give you the parameters for any frequency cutoff.

5 07 2008
C nazi

> There are 4 bytes to an int
That’s mostly implementation defined.

> so 8 hex numbers
Even if int is 4 bytes, it is not said that those bytes are 8 bits.

5 07 2008
website design

Upmodded for use of the WTFPL.

5 07 2008

@Marius: Thanks for the link! I didn’t get the syntax at first, but looking at the explanation, it’s rather beautfiul. Haskell’s been added to the “Must Learn” list.

@Rob: I saw that actually. I wouldn’t have bothered with this post if it weren’t for the graphs…

@Janne: Interesting. Thanks for the tip. (Added Octave to the aforementioned list).

@C Nazi: Too true. I can’t deny it, but I decided to convinently ignore the non-x86 users (call me racist if you’d like). That code wouldn’t work on such a platform, but it’s pretty obvious how to make the desired changes.

5 07 2008
Gopala Krishna

Wow! I guess i am in the same boat 🙂
I had my ADA (Analysis and Design of Algorithms) in the last week, and only then did i understand how heapsort is so efficient.

Good work, and no i am not going to divert from gsoc for time being as i have already lost most of time in exams 😉

fellow gsoc student

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: