In the last 24 hours, two articles related to code benchmarking showed up on my Twitter timeline, where the authors describe the optimization of a tiny bit of code via successive benchmarks. Both articles immediately triggered the same reaction in my head: "WTF, this is wrong! Why did they write this in the first place, and why do they focus on optimizing this bad solution?"
The second article even ends up with a solution that not only fails to solve the original problem, but is also really suboptimal. So in true xkcd #386 fashion, I had to do something about it!
Both articles show that benchmark-driven optimization can lead people to focus on incrementally improving what is there rather than taking a step back and thinking about the original problem in a different way.
Sum of fizz and buzz
The first article, "What I learned from writing six functions that all did the same thing", describes a variant of Fizz buzz: given a number n, return the sum of all multiples of 3 or 5 that are between 0 and n. The initial solution is to build an array of the multiples of 3 or 5, and then sum the members of this array. WTF? Two loops and an array? How can one even come up with this solution in the first place?
Reading the code, it seems obvious that the array is not only useless but harmful, since it requires memory allocation and two loops! After many incremental improvements to the summing of the array elements, the author finally takes a step back and replaces it all with a single loop where the sum is computed on the fly.
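For reference, a minimal Java sketch of that single-loop shape (the function name is mine, and I assume "between 0 and n" is inclusive):

static long sumOfFizzBuzzLoop(long n) {
    long sum = 0;
    for (long i = 3; i <= n; i++) {
        // Accumulate multiples of 3 or 5 on the fly, no intermediate array
        if (i % 3 == 0 || i % 5 == 0) {
            sum += i;
        }
    }
    return sum;
}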
And he feels like a champion since it's 28 times faster... until someone points out that the problem can be solved by adding/subtracting the sums of arithmetic progressions with common differences of 3, 5 & 15 (some math we learn in high school in France).
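Here's a sketch of that closed-form solution (my reconstruction with hypothetical names, same inclusive bound as above):

// Sum of the multiples of k up to n: k + 2k + ... + mk = k * m * (m + 1) / 2,
// where m = n / k (integer division)
static long sumOfMultiples(long k, long n) {
    long m = n / k;
    return k * m * (m + 1) / 2;
}

// Inclusion-exclusion: multiples of 15 are counted by both terms, so subtract them once
static long sumOfFizzBuzz(long n) {
    return sumOfMultiples(3, n) + sumOfMultiples(5, n) - sumOfMultiples(15, n);
}

No loop at all: constant time, whatever the value of n.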
This article ends on a positive note though: even if three versions focused on incrementally improving the bad part of the solution (the array), the author finally takes a step back, considers things from a different angle, and ends up with a decent solution, even if we could further reduce the number of useless iterations in the loop.
A loop that should have been a lookup table
The second article, "All The Small Things - JVM Profiling Lessons From The Trenches", starts by analyzing memory allocation in the Couchbase client library, wondering why there are so many allocations of an array of enums. The finding is that in Java, SomeEnumType.values() allocates a new array at each call.
Allocating an array for a list of constants?
Let's study this first: the list of values for an enum is constant, so why do we need to allocate a brand new array at each call to values()? Simply because arrays in Java are mutable, and the caller can thus modify the contents of the resulting array. Returning a shared singleton array could cause weird bugs in subsequent callers of this method if someone modified it.
It almost never happens in real code, but the JDK has to guard against it. A better solution would have been for this method to return an unmodifiable List rather than a low-level mutable array.
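To illustrate, a cached read-only variant could look like this inside the enum (a sketch, assuming the usual java.util imports; valueList() is a hypothetical name, not an actual JDK method):

// Cache the values once and hand out a read-only view: no per-call
// allocation, and callers cannot mutate the shared list
private static final List<KeyValueStatus> VALUES =
        Collections.unmodifiableList(Arrays.asList(values()));

public static List<KeyValueStatus> valueList() {
    return VALUES;
}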
Linear scan? Srsly?
Back to the article: they have an enum representing status codes and want to, given a numerical status code, find the corresponding enum value. The initial solution for this was to loop over all enum values until we find the one that has the correct code.
This is when the "WTF?!" kicks in: this code searches for a single element in a collection, given the value of a property of that element. Hmm... wait... doesn't that sound like a table lookup? It sure does! And the implementation is a very inefficient linear scan! Fortunately, looking at the enum definition, it seems the most frequent statuses come first, which alleviates the cost of the scan a bit. But still.
Then the article goes to great lengths describing progressive improvements, adding fast path checks for the most common codes. This is indeed a good idea, but those codes are mostly at the beginning of the list anyway, so the performance gain shouldn't be high compared to the plain linear scan.
But more importantly, it does not solve the original problem, which was reducing memory allocation: for values out of the fast path, values() is still called, with the associated array allocation/copy.
Solving the original problem, even if badly
So before moving to a constant time lookup, let's just remove the array allocation and keep a loop with no fast path. This is still a bad linear scan, but it allows for some interesting comparisons with the fast path solution (see the full code on GitHub):
// Original implementation, with no fast path
public static KeyValueStatus valueOfLoop(final short code) {
    for (KeyValueStatus value : values()) {
        if (value.code() == code) return value;
    }
    return UNKNOWN;
}

// Current code: fast path, but array allocation for other values
public static KeyValueStatus valueOf(final short code) {
    if (code == SUCCESS.code) {
        return SUCCESS;
    } else if (code == ERR_NOT_FOUND.code) {
        return ERR_NOT_FOUND;
    } else if (code == ERR_EXISTS.code) {
        return ERR_EXISTS;
    } else if (code == ERR_NOT_MY_VBUCKET.code) {
        return ERR_NOT_MY_VBUCKET;
    }
    for (KeyValueStatus keyValueStatus : values()) {
        if (keyValueStatus.code() == code) {
            return keyValueStatus;
        }
    }
    return UNKNOWN;
}

// No fast path, use a static array to avoid allocation when calling values()
private static final KeyValueStatus[] VALUES = values();

public static KeyValueStatus valueOfLoopOnConstantArray(final short code) {
    for (KeyValueStatus value : VALUES) {
        if (value.code() == code) return value;
    }
    return UNKNOWN;
}
Benchmark code:
@Param({
    "0",    // 0x00, Success
    "1",    // 0x01, Not Found
    "134",  // 0x86, Temporary Failure
    "255",  // undefined
    "1024"  // undefined
})
public short code;

@Benchmark
public KeyValueStatus loopNoFastPath() {
    return KeyValueStatus.valueOfLoop(code);
}

@Benchmark
public KeyValueStatus loopFastPath() {
    return KeyValueStatus.valueOf(code);
}

@Benchmark
public KeyValueStatus loopOnConstantArray() {
    return KeyValueStatus.valueOfLoopOnConstantArray(code);
}
Benchmark results (benchmark name truncated to fit the page width):
Benchmark            (code)  Mode  Samples   Score  Score error  Units
loopNoFastPath            0  avgt       10  19.383        0.331  ns/op
loopNoFastPath            1  avgt       10  19.243        0.376  ns/op
loopNoFastPath          134  avgt       10  24.855        0.651  ns/op
loopNoFastPath          255  avgt       10  30.587        0.833  ns/op
loopNoFastPath         1024  avgt       10  30.619        1.209  ns/op
loopFastPath              0  avgt       10   3.044        0.092  ns/op
loopFastPath              1  avgt       10   3.040        0.051  ns/op
loopFastPath            134  avgt       10  25.070        0.637  ns/op
loopFastPath            255  avgt       10  31.215        1.089  ns/op
loopFastPath           1024  avgt       10  32.464        0.930  ns/op
loopOnConstantArray       0  avgt       10   2.975        0.086  ns/op
loopOnConstantArray       1  avgt       10   3.035        0.080  ns/op
loopOnConstantArray     134  avgt       10  10.215        0.269  ns/op
loopOnConstantArray     255  avgt       10  16.856        0.679  ns/op
loopOnConstantArray    1024  avgt       10  17.015        0.577  ns/op
We get some pretty interesting results here:
- when looping on a constant array, avoiding memory allocation, the worst case scenarios (255 & 1024, out of the value range) are still faster than the best case scenario (0) with memory allocation.
- for the values at the beginning of the list, iterating over a couple of values of the constant array is no different from testing the fast path. It's easy to understand: the fast path is just a partial loop unrolling.
So to solve the original memory allocation issue, rather than adding a fast path we could have just added a static copy of values() and been done with it!
Constant time lookup
But this approach still relies on a bad linear scan. Let's move to a constant time lookup. The classical Java approach is to use a HashMap, which we'll initialize once in a static block:
private static final HashMap<Short, KeyValueStatus> code2statusMap = new HashMap<>();

static {
    for (KeyValueStatus value : values()) {
        code2statusMap.put(value.code(), value);
    }
}

public static KeyValueStatus valueOfLookupMap(final short code) {
    return code2statusMap.getOrDefault(code, UNKNOWN);
}
And here's the benchmark:
Benchmark            (code)  Mode  Samples   Score  Score error  Units
lookupMap                 0  avgt       10   4.954        0.134  ns/op
lookupMap                 1  avgt       10   4.036        0.125  ns/op
lookupMap               134  avgt       10   5.597        0.157  ns/op
lookupMap               255  avgt       10   4.006        0.144  ns/op
lookupMap              1024  avgt       10   6.752        0.228  ns/op
This goes from about 1/3 slower for the first values to more than 4 times faster for the last ones. Time no longer increases linearly, but we are now slower on the most frequent values.
There is also a large variance, from 4.0 to 6.7 ns. Where does it come from? Hash collisions in the HashMap's internal lookup table, which cause lookups to degenerate into a linear scan over the colliding keys.
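To see why, here's a sketch of how OpenJDK's HashMap picks a bucket (assuming the table has grown to 64 buckets; the actual size depends on the number of entries):

short code = 134;
int h = Short.hashCode(code);  // a Short's hash is its own value: 134
int hash = h ^ (h >>> 16);     // HashMap's hash spreading, a no-op for such small values
int bucket = hash & (64 - 1);  // bucket 6: any key with the same low 6 bits collides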
So this looks like an acceptable solution, giving a more or less constant time lookup without large object allocation.
Don't forget boxing
There is still some allocation happening though: a Java HashMap expects objects for its keys, not primitive types. So there's an implicit boxing of code when looking up in the map. The JDK has an internal cache of boxed values from -128 to 127 that alleviates part of this problem, but for codes over 127 it still allocates a Short object.
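A quick demo of this cache (my example, not from the article):

Short a = Short.valueOf((short) 127);
Short b = Short.valueOf((short) 127);
System.out.println(a == b);    // true: same cached instance

Short c = Short.valueOf((short) 255);
Short d = Short.valueOf((short) 255);
System.out.println(c == d);    // false: two freshly allocated objects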
The cost of this allocation doesn't show up in the benchmark results though, with 255 (not cached) being similar to 0 & 1 (cached). There's probably some HotSpot optimization happening behind the scenes, especially considering that Short.valueOf() is an intrinsic function, i.e. calls to this method can be hijacked by the JVM and implemented in native code (without going through JNI).
To completely remove this boxing, we could consider CarrotSearch's excellent high performance primitive collections. But there's an even simpler solution.
A good old reverse lookup table
This is what immediately came to my mind when looking at the KeyValueStatus source code. The code values range from 0 to 0xCC (204), which is small enough to allow keeping a simple lookup table in an array:
// Lookup table: code -> KeyValueStatus
private static final KeyValueStatus[] code2status = new KeyValueStatus[0x100];

static {
    Arrays.fill(code2status, UNKNOWN);
    for (KeyValueStatus keyValueStatus : values()) {
        if (keyValueStatus != UNKNOWN) {
            code2status[keyValueStatus.code()] = keyValueStatus;
        }
    }
}

public static KeyValueStatus valueOfLookupArray(final short code) {
    if (code >= 0 && code < code2status.length) {
        return code2status[code];
    } else {
        return UNKNOWN;
    }
}
Note that we pre-fill the array with UNKNOWN to avoid the cost of an if (result == null) check at each call: do a bit more work upfront to do a lot less later.
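For comparison, the variant we avoided would look like this (a sketch with a hypothetical name):

public static KeyValueStatus valueOfLookupArrayNullCheck(final short code) {
    if (code >= 0 && code < code2status.length) {
        // Without pre-filling, every lookup pays an extra null check
        KeyValueStatus status = code2status[code];
        return status != null ? status : UNKNOWN;
    }
    return UNKNOWN;
}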
The benchmark of array lookup, with the loop with fast path checks for comparison:
Benchmark            (code)  Mode  Samples   Score  Score error  Units
loopFastPath              0  avgt       10   3.044        0.092  ns/op
loopFastPath              1  avgt       10   3.040        0.051  ns/op
loopFastPath            134  avgt       10  25.070        0.637  ns/op
loopFastPath            255  avgt       10  31.215        1.089  ns/op
loopFastPath           1024  avgt       10  32.464        0.930  ns/op
lookupArray               0  avgt       10   3.061        0.126  ns/op
lookupArray               1  avgt       10   3.048        0.127  ns/op
lookupArray             134  avgt       10   3.070        0.084  ns/op
lookupArray             255  avgt       10   3.035        0.113  ns/op
lookupArray            1024  avgt       10   3.034        0.113  ns/op
There we are! Constant time lookup, no allocation, and as fast as the article's solution for values in the fast path.
Note: the code on GitHub has two more experiments (array lookup with a try/catch, and a large switch statement) but they don't give better results.
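For the curious, the try/catch variant might look like this (my reconstruction, not necessarily the repository's exact code):

public static KeyValueStatus valueOfTryCatch(final short code) {
    try {
        // Rely on the JVM's bounds check instead of an explicit range test
        return code2status[code];
    } catch (ArrayIndexOutOfBoundsException e) {
        return UNKNOWN;
    }
}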
Conclusion
What can be learned from both articles is that benchmarks are awesome tools to actually measure improvements, but they can also be dangerous: they can make people short-sighted and lead them to optimize only part of the problem, even in a 10-line code block.
Before going into nano-optimization driven by micro-benchmarks, take a step back and ask yourself "What is the problem I have to solve? How can I do it differently?"
And be curious, look at how things are implemented under the covers, and go down the rabbit hole once in a while to really understand how things work. You will learn a lot, and it will most certainly change the way you solve problems.
Ah, and don't forget at the end to check that you actually solved the original problem :-)