Micro benchmarks can make you short-sighted

In the last 24 hours, two articles related to code benchmarking showed up on my Twitter timeline, where the authors describe the optimization of a tiny bit of code via successive benchmarks. Both articles immediately triggered the same reaction in my head: "WTF, this is wrong! Why did they wrote this in the first place, and why do they focus on optimizing this bad solution?"

The second article even ends up with a solution that not only does not solve the original problem, but is really suboptimal. So in true xkcd #386 fashion I had to do something about it!

Both articles show that benchmark-driven optimization can lead people to focus on incrementally improving what is there rather than taking a step back and thinking about the original problem in a different way.

Sum of fizz and buzz

The first article, "What I learned from writing six functions that all did the same thing", describes a variant of Fizz buzz: given a number n, return the sum of all multiples of 3 & 5 that are between 0 and n. The initial solution is to build an array of multiples of 3 or 5, and then sum the members of this array. WTF? Two loops and an array? How can one even come up with this solution in the first place?

Reading the code, it seems trivial that the array not only useless but harmful by requiring memory allocation and two loops! After many incremental improvements to the sum of the array elements, the author finally takes a step back and replaces it by a single loop where the sum computed on the fly.

And he feels like a champion since it's 28 times faster... until someone comes up with the fact that it can be solved by adding/subtracting the sum of arithmetic progressions with common difference of 3, 5 & 15 (some math we learn in high school in France).

This article ends on a positive note though: even if there were 3 versions focusing on an incremental approach of the bad part of the solution (the array), the author finally takes a step back to consider things from a different angle, and ends up with a decent solution even if we could further reduce the number of useless iterations in the loop.

A loop that should have been a lookup table

The second article, "All The Small Things - JVM Profiling Lessons From The Trenches" starts by analyzing memory allocation in the CouchBase client library, wondering why there are so many allocations of an array of enums. The finding is that in Java, SomeEnumType.values() allocates an array at each call.

Allocating an array for a list of constants?

Let's study this first: the list of values for an enum is constant, so why do we need to allocate a brand new array at each call to values()? Simply because arrays in Java are mutable, and that the caller can thus modify the contents of the resulting array. Returning a singleton array would potentially cause weird bugs in subsequent callers of this method if someone modifies it.

It certainly almost never happens in real code, but the JDK had to make sure it is protected against it. A solution would have been for the method to return an unmodifiable List rather than a low-level mutable array.

Linear scan? Srsly?

Back to the article: they have an enum representing status codes and want to, given a numerical status code, find the corresponding enum value. The initial solution for this was to loop over all enum values until we find the one that has the correct code.

This is when the "WTF?!" kicks in: this code searches for a single element in a collection, given the value of a property of that element. Hmm... wait... doesn't it sound like a table lookup? It sure does! And the implementation is a very inefficient linear scan! Fortunately, looking at the enum definition, it looks like the most frequent statuses come first, alleviating a bit the cost of the scan. But still.

Then the article goes to great length about progressive improvements by adding fast path checks for common codes. This is indeed a good idea, but the codes in the fast path are mostly at the beginning of the list, so the performance gain shouldn't be high compared to a linear scan.

But more importantly, it does not solve the original problem, which was reducing memory allocation: for values out of the fast path, values() is still called, with the associated array allocation/copy.

Solving the original problem, even if badly

So before going to constant time lookup, let's remove array allocation, and keep a loop with no fast path. This is still a bad linear scan, but allows us some interesting comparison with the fast path solution (see the full code on GitHub):

// Original implementation, with no fast path
public static KeyValueStatus valueOfLoop(final short code) {
    for (KeyValueStatus value: values()) {
        if (value.code() == code) return value;
    }
    return UNKNOWN;
}

// Current code: fast path, but array allocation for other values
public static KeyValueStatus valueOf(final short code) {
    if (code == SUCCESS.code) {
        return SUCCESS;
    } else if (code == ERR_NOT_FOUND.code) {
        return ERR_NOT_FOUND;
    } else if (code == ERR_EXISTS.code) {
        return ERR_EXISTS;
    } else if (code == ERR_NOT_MY_VBUCKET.code) {
        return ERR_NOT_MY_VBUCKET;
    }

    for (KeyValueStatus keyValueStatus : values()) {
        if (keyValueStatus.code() == code) {
            return keyValueStatus;
        }
    }
    return UNKNOWN;
}

// No fast path, use a static array to avoid allocation when calling values()
private static final KeyValueStatus[] VALUES = values();

public static KeyValueStatus valueOfLoopOnConstantArray(final short code) {
    for (KeyValueStatus value: VALUES) {
        if (value.code() == code) return value;
    }
    return UNKNOWN;
}

Benchmark code:

@Param({
        "0", // 0x00, Success
        "1", // 0x01, Not Found
        "134", // 0x86 Temporary Failure
        "255", // undefined
        "1024" // undefined
})
public short code;

@Benchmark
public KeyValueStatus loopNoFastPath() {
    return KeyValueStatus.valueOfLoop(code);
}

@Benchmark
public KeyValueStatus loopFastPath() {
    return KeyValueStatus.valueOf(code);
}

@Benchmark
public KeyValueStatus loopOnConstantArray() {
    return KeyValueStatus.valueOfLoopOnConstantArray(code);
}

Benchmark results (benchmark name truncated to fit the page width):

Benchmark               (code)  Mode  Samples   Score  Score error  Units

loopNoFastPath               0  avgt       10  19.383        0.331  ns/op
loopNoFastPath               1  avgt       10  19.243        0.376  ns/op
loopNoFastPath             134  avgt       10  24.855        0.651  ns/op
loopNoFastPath             255  avgt       10  30.587        0.833  ns/op
loopNoFastPath            1024  avgt       10  30.619        1.209  ns/op

loopFastPath                 0  avgt       10   3.044        0.092  ns/op
loopFastPath                 1  avgt       10   3.040        0.051  ns/op
loopFastPath               134  avgt       10  25.070        0.637  ns/op
loopFastPath               255  avgt       10  31.215        1.089  ns/op
loopFastPath              1024  avgt       10  32.464        0.930  ns/op

loopOnConstantArray          0  avgt       10   2.975        0.086  ns/op
loopOnConstantArray          1  avgt       10   3.035        0.080  ns/op
loopOnConstantArray        134  avgt       10  10.215        0.269  ns/op
loopOnConstantArray        255  avgt       10  16.856        0.679  ns/op
loopOnConstantArray       1024  avgt       10  17.015        0.577  ns/op

We get some pretty interesting results here:

when looping on a constant array, avoiding memory allocation, the worst case scenario (255 & 1024, out of the value range) are still better than the best case scenario (0) with memory allocation.
for the values at the beginning of the list, iterating on a couple of values on the constant array isn't different that testing the fast path. It's easy to understand: the fast path is a partial loop unrolling.

So to solve the original memory allocation issue, rather than adding a fast path we could have just added a static copy of values() and be done with it!

Constant time lookup

But this approach is still bad, with a linear scan. Let's do a constant time lookup. For this, the classical Java approach is to use a HashMap, which we'll initialize once in a static block:

private static final HashMap<Short, KeyValueStatus> code2statusMap = new HashMap<>();

static {
    for (KeyValueStatus value: values()) {
        code2statusMap.put(value.code(), value);
    }
}

public static KeyValueStatus valueOfLookupMap(final short code) {
    return code2statusMap.getOrDefault(code, UNKNOWN);
}

And here's the benchmark:

Benchmark               (code)  Mode  Samples   Score  Score error  Units
lookupMap                    0  avgt       10   4.954        0.134  ns/op
lookupMap                    1  avgt       10   4.036        0.125  ns/op
lookupMap                  134  avgt       10   5.597        0.157  ns/op
lookupMap                  255  avgt       10   4.006        0.144  ns/op
lookupMap                 1024  avgt       10   6.752        0.228  ns/op

This is goes from 1/3 slower for the first values to more than 4 times faster for the last values. Time is no more increasing linearly, but we are slower on the most frequent values.

There is also a large variance, from 4.0 to 6.7 ns. Where does this come from? It's caused by hash collisions in the internal HashMap lookup table, causing lookups to degenerate into linear scan on the colliding key values.

So this looks like an acceptable solution, giving a more or less constant time lookup without large object allocation.

Don't forget boxing

There is still some allocation happening though: a Java HashMap expects objects for its keys, not primitive types. So we have an implicit boxing of code when looking up in the map. The JDK has an internal cache for small values to alleviate part of this problem. But for codes over 127 it still allocates a Short object.

The code of this allocation doesn't show up in the benchmarks results though, with 255 (not cached) being similar to 0 & 1 (cached). There's probably some HotSpot optimisation happening behind the scenes, furthermore considering that Short.valueOf() is an intrinsic function, i.e. calls to this method are hijacked by the JVM and implemented in native code (without going through JNI).

To completely remove this boxing, we could consider the excellent CarrotSearch's high performance primitive collections. But there's an even simpler solution.

A good old reverse lookup table

This is what immediately came to my mind when looking at KeyValueStatus source code. The code values range from 0 to 0xCC (204), which is small enough to allow keeping a simple lookup table in an array:

// Lookup table: code -> KeyValueStatus
private static final KeyValueStatus[] code2status = new KeyValueStatus[0x100];

static {
    Arrays.fill(code2status, UNKNOWN);
    for (KeyValueStatus keyValueStatus : values()) {
        if (keyValueStatus != UNKNOWN) {
            code2status[keyValueStatus.code()] = keyValueStatus;
        }
    }
}

public static KeyValueStatus valueOfLookupArray(final short code) {
    if (code >= 0 && code < code2status.length) {
        return code2status[code];
    } else {
        return UNKNOWN;
    }
}

Note that we pre-fill the array with UNKNOWN to avoid the cost of a if (result == null) at each call. Do a bit more upfront to do a lot less later.

The benchmark of array lookup, with the loop with fast path checks for comparison:

Benchmark               (code)  Mode  Samples   Score  Score error  Units
loopFastPath                 0  avgt       10   3.044        0.092  ns/op
loopFastPath                 1  avgt       10   3.040        0.051  ns/op
loopFastPath               134  avgt       10  25.070        0.637  ns/op
loopFastPath               255  avgt       10  31.215        1.089  ns/op
loopFastPath              1024  avgt       10  32.464        0.930  ns/op

lookupArray                  0  avgt       10   3.061        0.126  ns/op
lookupArray                  1  avgt       10   3.048        0.127  ns/op
lookupArray                134  avgt       10   3.070        0.084  ns/op
lookupArray                255  avgt       10   3.035        0.113  ns/op
lookupArray               1024  avgt       10   3.034        0.113  ns/op

There we are! Constant time lookup, no allocation, and as fast as the article's solution for values in the fast path.

Note: the code on GitHub has two more experiments (array lookup with a try/catch, and a large switch statement) but they don't give better results.

Conclusion

What can be learned from both articles is that benchmarks are awesome tools to actually measure improvements, but they can also be dangerous, by making people short-sighted and optimize only part of the problem, even on a 10-line code block.

Before going into nano-optimization driven by micro-benchmarks, take a step back and ask yourself "What is the problem I have to solve? How can I do it differently?"

And be curious, look at how things are implemented under the covers, and go down the rabbit hole once in a while to really understand how things work. You will learn a lot, and it will most certainly change the way you solve problems.

Ah, and don't forget at the end to check that you actually solved the original problem :-)