Computation Caching | The Fundamentals Behind Nx’s Lightning Fast Execution

Zack DeRose
Published in Nx Devtools · Dec 16, 2020 · 8 min read

This article will explain the fundamentals behind Computation Caching and the benefits it brings. While we’ll be explaining this concept from the perspective of Nx and the Web platform, using JavaScript code to demonstrate the ideas, these concepts are just as relevant outside of Nx and on other platforms and languages!

What is Computation Caching?

Computation caching is a term used by Nx to describe a method of improving the performance of processes run against an Nx workspace. Most of the time these processes are “build processes”: processes meant to create an executable or deliverable artifact from the codebase. In front-end development, these would be the distributable files we push up for our server to host. Computation caching can also improve the performance of other processes, e.g. running unit/e2e tests, linting, etc.

The simplest description of this concept is as follows:

  • The inputs to a process are all of the source code you wrote that the process uses, plus all of the dependency code that it uses.
  • If you’ve already run a process and those inputs have not changed, then rather than actually running the process again, you *could* instead just pull the result of that process out of your history!

In this way, we drastically improve our best-case performance without affecting the worst case (if no history exists for the inputs, we’ll need to run the process just as we would have without computation caching).

In the rest of this article, we’ll take an in-depth look at the Computer Science concepts and patterns that form the foundation of computation caching.

Pure Functions > Memoization > Serialization > Hashing > Persistence > Distributed Computation

Pure Functions

Pure Functions are functions that:

  • for a given set of parameters always return the same value
  • do not trigger any side effects

Consider the following function:
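```typescript
// A pure function: the same argument always produces the same result,
// and calling it triggers no side effects.
function addOne(n: number): number {
  return n + 1;
}
```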

Looking at the signature of our function shows that it takes a single number as an argument and returns a number. Based on the implementation, we can correctly determine that a given argument will always produce the same result, and that calling the function triggers no side effects.

In the context of ‘state’, we can say that this function is not reliant on any external state, it persists no state of its own, and it in no way alters any external state. In other words, state is removed entirely from the picture. Any pieces of data required to determine a result are included in the parameters.

Because of this, we can say that this function meets the criteria for a pure function. Pure functions are a very helpful concept in the realm of Computer Science, closely mirroring the Mathematical definition of a function. They have the benefit of reducing a developer’s cognitive load, they are easy to test, and they are easy to optimize via memoization.

Memoization

To explore memoization, let’s consider our addOne() function again. There’s not much computation happening in this implementation, but if a function required considerable computation (e.g. building an application), it would be beneficial to take advantage of the nature of pure functions, specifically: for a given set of parameters, they always return the same value.

Since this function is pure, we can leverage that property so the result for a given n is only ever calculated once. We can accomplish this by using memory to keep track of what our function has been called with and what the result was. This process is called memoization.
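One possible implementation, wrapping the addOne() function from above in a closure:

```typescript
const addOne = (n: number): number => n + 1; // our pure function from earlier

function createMemoizedAddOne(): (n: number) => number {
  // The cache lives in the closure, mapping arguments to computed results.
  const cache: Record<number, number> = {};
  return (n: number): number => {
    if (!(n in cache)) {
      // Cache miss: compute the result and store it before returning.
      cache[n] = addOne(n);
    }
    // Cache hit (or freshly stored result): no recomputation needed.
    return cache[n];
  };
}

const memoizedAddOne = createMemoizedAddOne();
memoizedAddOne(2); // cache miss: addOne(2) is actually computed
memoizedAddOne(2); // cache hit: the stored result is returned instead
```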

In the code above, we use a closure to wrap our addOne() function, and enhance it with a cache object. This cache object serves as an indexed map that we’ll use to remember what we’ve called our function with, and what the results were. When we call this function, we can detect if we already have a result for the given argument.

  • If we don’t have a result, we call this a ‘cache miss.’ Since we don’t yet know the result, we’ll compute it and then store it in the cache before returning the computed result.
  • If we do have a result, we call this a ‘cache hit.’ In this case we will skip computation and return the result as stored in the cache object.

Serialization & Hashing

In our addOne() example, we had a single argument that happened to be a number, which made it rather simple to index our cache for the various arguments. What happens if our arguments become a bit more complicated?
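Here’s a sketch of one way to handle it:

```typescript
// A pure, commutative function: the order of the arguments doesn't matter.
function sum(...args: number[]): number {
  return args.reduce((total, n) => total + n, 0);
}

// Serialize the arguments into a string suitable for indexing an object.
// Sorting first means sum(1, 2) and sum(2, 1) share a single cache entry.
function serializeArgs(args: number[]): string {
  return JSON.stringify([...args].sort((a, b) => a - b));
}

function createMemoizedSum(): (...args: number[]) => number {
  const cache: Record<string, number> = {};
  return (...args: number[]): number => {
    const key = serializeArgs(args);
    if (!(key in cache)) {
      cache[key] = sum(...args); // cache miss: compute and store
    }
    return cache[key]; // cache hit (or freshly stored result)
  };
}

const memoizedSum = createMemoizedSum();
memoizedSum(3, 1, 2); // cache miss: computed and stored under "[1,2,3]"
memoizedSum(2, 3, 1); // cache hit: same sorted key, no recomputation
```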

In the example above, we see another pure function sum() that takes any number of numeric arguments. The serializeArgs() function takes those arguments and turns them into a string. Inside of our createMemoizedSum() closure, we’re then able to index our cache by calling serializeArgs() on our arguments. Serialization is the name for the process of translating data into something that can be stored. In the case of our arguments here, we’re serializing them into something JS can use to index an object.

Probably the easiest way to serialize our arguments would have been to call JSON.stringify(args). However, our sum() function, in addition to being pure, is commutative (the order of the arguments does not matter), which means that by applying a deterministic sort we can optimize for more cache hits, as the example calls at the bottom of the snippet show.

Note that our serialization function here is potentially problematic! Sorting the arguments (Array.prototype.sort() will at best be O(n log(n)), depending on the JS engine being used) is invariably more expensive than simply iterating through our arguments and adding them up! In light of this, serializing this way defeats the point for the specific case of this sum() example; but if the function being memoized took minutes to compute, then sorting our arguments here could lead to a significant increase in performance!

When we consider a set of arguments that would produce an excessively long string under JSON.stringify(), we realize it becomes impractical to index our cache this way. This is where hashing comes in handy. Hashing deterministically ‘scrambles’ data of arbitrary size down to a fixed-size value. Hashing has many uses in the field of security, but there are also many practical uses for hashing in the realm of data consistency.

A popular hashing function is SHA-1, which is the hashing algorithm Git uses to hash document content for consistency. Quoting Linus Torvalds (the principal developer of Linux and the original creator of Git) on using SHA-1 in Git:

Nobody has been able to break SHA-1, but the point is the SHA-1, as far as Git is concerned, isn’t even a security feature. It’s purely a consistency check. The security parts are elsewhere, so a lot of people assume that since Git uses SHA-1 and SHA-1 is used for cryptographically secure stuff, they think that, Okay, it’s a huge security feature. It has nothing at all to do with security, it’s just the best hash you can get.
(link: https://www.youtube.com/watch?v=4XpnKHJAok8&t=57m43s)

The popular npm package object-hash uses SHA-1 by default, and is a common library that we could use to hash any set of arguments down to 20 bytes (a 40-character hexadecimal string) for the purpose of indexing our cache.
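For example (a minimal sketch; the arguments being hashed here are purely hypothetical):

```typescript
import hash from 'object-hash'; // npm install object-hash

// object-hash defaults to SHA-1, so arguments of any size collapse down
// to a fixed-length, 40-character hexadecimal string we can index by.
const key = hash({ project: 'my-app', target: 'build', args: ['--prod'] });
console.log(key); // always 40 hex characters
```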

Persistence

In addition to serializing our arguments in the context of a function, we can further optimize our performance by serializing the entirety of our cache and saving it to disk. The following snippet demonstrates this with our original addOne() example:
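```typescript
import { existsSync, readFileSync, writeFileSync } from 'fs';

const CACHE_FILE = '.addOne.cache.json';

// Pre-populate the cache from disk if a previous run persisted one.
const cache: Record<string, number> = existsSync(CACHE_FILE)
  ? JSON.parse(readFileSync(CACHE_FILE, 'utf8'))
  : {};

function memoizedAddOne(n: number): number {
  if (!(n in cache)) {
    // Cache miss: compute the result, then persist the whole cache to disk.
    cache[n] = n + 1;
    writeFileSync(CACHE_FILE, JSON.stringify(cache));
  }
  return cache[n];
}

console.log(memoizedAddOne(2)); // a miss on the first run, a hit on every run after
```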

When running this in Node (running in-browser, you could potentially persist to local storage instead, but this would require different code), the first run will result in a cache miss, but after it completes, a .addOne.cache.json file will be present in your file system. The second time you run this program, you’ll read this file from disk and pre-populate your cache. In this way, we can persist our cache across builds, further improving our chances of a cache hit!

Distributed Computing and Distributed Caching

Rather than persisting our cache to our local disk, we could instead create a distributed system where some central cache is maintained. In this case, a network of running applications would write to this central cache on cache misses, and read from it on cache hits.
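For example, a minimal sketch (the port and route shape here are assumptions):

```typescript
import express from 'express'; // npm install express

const app = express();
const cache: Record<string, number> = {};

app.get('/add-one/:n', (req, res) => {
  const key = req.params.n;
  if (!(key in cache)) {
    // Cache miss: compute the result and write it to the central cache.
    cache[key] = Number(key) + 1;
  }
  // Cache hit (or freshly computed): serve the result from the cache.
  res.json({ result: cache[key] });
});

app.listen(3000, () => console.log('addOne service listening on port 3000'));
```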

This snippet above takes the same code we wrote earlier, but creates an Express server to serve networked requests for results of our addOne() function.

Note that in this snippet the cache is only held in the server’s memory. We could further optimize our system by having our server read from a shared data source, which would allow us to create many instances of this service, with all instances reading from the same central location.

Also note in this example that the actual computation lives on the server. We could also create a solution where the computation is performed on the clients, and instead host ONLY the cache on the server. This solution would require two endpoints: one to check for a cache hit and return the result if there is one, and another for the client to send its calculated results to the server, so the server can store them in the centralized cache.
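A minimal sketch of those two endpoints (the route names and payload shape are assumptions):

```typescript
import express from 'express';

const app = express();
app.use(express.json());

// The server hosts ONLY the cache; clients perform the actual computation.
const cache: Record<string, unknown> = {};

// Endpoint 1: check for a cache hit, returning the stored result if present.
app.get('/cache/:key', (req, res) => {
  const { key } = req.params;
  if (key in cache) {
    res.json({ hit: true, result: cache[key] }); // cache hit
  } else {
    res.json({ hit: false }); // cache miss: the client must compute the result
  }
});

// Endpoint 2: a client reports a result it computed, so the server can cache it.
app.post('/cache/:key', (req, res) => {
  cache[req.params.key] = req.body.result;
  res.sendStatus(204);
});

app.listen(3001);
```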

Computation Caching Solutions

At Nrwl, we’ve used these principles to build computation caching into Nx (our open-source workspace toolset), so that all “build targets” (think: processes run against your codebase) benefit from a local computation cache stored inside your node_modules directory.

In addition, we’ve built a cloud-based distributed computation cache solution for Nx called Nx Cloud (available both as a public cloud solution [with end-to-end data encryption] and as an on-premises solution). By adding a single line of configuration, you receive all the benefits of distributed cloud caching, significantly improving the performance of your Continuous Integration/Continuous Deployment pipelines, and you only ever pay for the time saved!

Here’s a great example of the practical power of Distributed Computation Caching, in an excellent video by Nrwl’s own Rares Matei.
