Do not design systems with caches

Okay, I am just going to pretend it has not been a long while since my last post. Let's get right to it.

In my previous post I noted that operating systems do not cache DNS results. The simple explanation for this is that caching is really difficult. Not impossible, but really difficult. The annoying thing about caching, good caching, is that when it is done right and it works, no one notices, but when it is done poorly it makes systems horrible.

I am not a big fan of the recent surge in popularity of caching software. Maybe it is not really recent, just recent to me as a noob. I am talking about the spread of Memcached, Redis, etc. Don't get me wrong. I think the software is brilliant. I have a problem with the way it is abused. It makes me cry.

What is caching?

I am probably the least qualified person to write about caching but here I am. That's the internet for you.

I like to define caching as the mechanism of using memory to avoid recalculating values that otherwise take longer to calculate. "Calculate" here could mean several things that are not really calculating, like fetching over the network. The reason I like this definition is that it doesn't specify the type of memory. This means even an RDBMS like MySQL can be used as a cache. This definition also clarifies that caching plays a role in speeding things up, not in making them work.

For example, on a website I built I had to calculate a popularity score based on the number of downloads, number of votes and time since upload. Calculating this value during each request-response cycle was taking too long and using too many resources on the application and database servers. I wrote an offline script that calculated this popularity score and saved it in a column in MySQL. This cut response times by over 40% and decreased CPU usage by about 30%.
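Here is a rough sketch of what such an offline script could look like. It is not the script I actually ran. The scoring formula, the `items` table and its column names are all made up for illustration, and it uses Python's built-in `sqlite3` as a stand-in for MySQL so that it runs anywhere.

```python
import sqlite3
import time

SECONDS_PER_DAY = 86400.0


def popularity(downloads, votes, uploaded_at, now):
    """Hypothetical score: engagement divided by (1 + age in days)."""
    age_days = max((now - uploaded_at) / SECONDS_PER_DAY, 0.0)
    return (downloads + 2 * votes) / (1.0 + age_days)


def refresh_popularity(conn, now=None):
    """Recompute every row's score and store it in the popularity column.

    Meant to run offline (for example from cron), so requests only read
    the stored value instead of recalculating it each time.
    """
    now = time.time() if now is None else now
    rows = conn.execute(
        "SELECT id, downloads, votes, uploaded_at FROM items"
    ).fetchall()
    conn.executemany(
        "UPDATE items SET popularity = ? WHERE id = ?",
        [(popularity(d, v, u, now), item_id) for item_id, d, v, u in rows],
    )
    conn.commit()
    return len(rows)
```

How often the script runs is effectively the TTL of the cached value: between runs, requests see a score that may be slightly out of date.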

In the above example the website worked and returned the correct responses without the caching. It was just slow and always used the latest numbers to calculate popularity. I noticed that there was no need for popularity to be based on the latest numbers for each request. It was okay for the popularity value to not be live. The example seems simple and might even sound like it is not caching. But it is actually caching and has all the problems that make caching difficult: consistency, TTL, invalidation. However, as I said, I will not even try to discuss caching and what makes it difficult here. It would be a case of the blind leading the blind.

Where is this going?

We'll get there. I haven't written in a while, I am shaking off the rust.

I have been involved in a few designs where the design diagrams include some form of cache from the beginning. This brings me great sadness. I usually raise my hand to say "can we do that part last?" In some cases I even try to get that part left out entirely until after the software is launched. I do all of that because I believe a design with a cache from the beginning is a bad design.

Bear with me, let me finish.

A system design is supposed to be an idealistic view of the system. Real systems should strive to be as close to that design as possible. If your system ideally contains explicit caching then you are asking for trouble.

What!?

Okay, calm down.

A good designer keeps reality in mind. But a good designer does not dirty the ideal system with reality flaws. A cache is a fix for a reality flaw. It should not be part of the ideal system. The flaw could be anything really. It could be the network. Even when you have a Gigabit connection, the network is relatively slow. It could be some complex computation. CPUs are really fast lately but they can still take a large amount of time to calculate things we really care about. Caching is how we try to overcome these flaws that physics imposes upon our systems.

So when I see caches as first class design components I shed a tear.

Okay, this is silly

Please, relax. I am getting there.

Look, I am not saying caches are bad and should be avoided. If I said anything like that I think I would check myself into a mental institution. What I am trying to say here is that the basic design of any new system should not contain explicit caches at all. However, possible caching points should be kept in mind and pointed out in the design. This will allow the system to be used first to learn usage patterns. Only then can a valid caching strategy be chosen.

Caches should be hidden details in a design.

What should be done then?

Systems should work first before they are fast.

Recently I see a lot of systems that are designed to depend on caches to work correctly. Systems that crash and burn without these brilliant pieces of software like Memcached.

When systems are designed to have caches before they are even used, the real components that should do the work hide behind the caches. For example, database servers that are surrounded by caches from launch. They work brilliantly with lightning fast response times. Until one day the cache is emptied for some reason and the database explodes. What should happen instead is that the database takes over and continues to work, just slower. The explosion happens because the cache always takes the load off the database server and we end up not knowing how much load the database server can actually handle.
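To make "the database should take over" concrete, here is a minimal sketch of a read path where the cache is optional. Everything in it is hypothetical: `ReadThrough`, `load_from_db` and the dict-like `cache` are names I made up, and a real Memcached or Redis client would need its own error handling. The only point is that a missing, empty or broken cache costs speed, not correctness.

```python
class ReadThrough:
    """Read path that works with or without a cache (illustrative only)."""

    def __init__(self, load_from_db, cache=None):
        self.load_from_db = load_from_db  # the component that does the real work
        self.cache = cache                # any dict-like object, or None
        self.db_reads = 0                 # how much load the database really sees

    def get(self, key):
        if self.cache is not None:
            try:
                if key in self.cache:
                    return self.cache[key]
            except Exception:
                pass  # a broken cache must not break the request

        # Cache miss, empty cache, or no cache at all: the database takes over.
        self.db_reads += 1
        value = self.load_from_db(key)

        if self.cache is not None:
            try:
                self.cache[key] = value
            except Exception:
                pass  # failing to store is only a lost optimisation
        return value
```

Running the same system with `cache=None` for a while, and watching `db_reads`, is one cheap way to find out how much load the database can handle before the cache is allowed to hide it.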

Also, software is usually designed before it is used. Caching is not one-strategy-fits-all. In most cases one can estimate how a system will be used based on existing data and intuition. However, you can never really know how a system will be used until you give it to users. Caching can be done randomly. But good caching is dependent on usage patterns. The type of cache to use, how long to cache values, the caching strategy to use: all of these depend on how a system is used. They are bandages for specific flaws. I don't see how they can be part of an ideal design.
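As a small illustration of letting usage patterns decide, here is a sketch that replays a log of requested keys and estimates the hit ratio an LRU cache of a given size would have had. The function name and the idea of replaying a key log are my own assumptions, and LRU is just one of many strategies you could simulate this way.

```python
from collections import OrderedDict


def lru_hit_ratio(keys, capacity):
    """Replay observed keys through a simulated LRU cache of `capacity` entries.

    Returns the fraction of requests that would have been cache hits.
    """
    cache = OrderedDict()
    hits = 0
    total = 0
    for key in keys:
        total += 1
        if key in cache:
            hits += 1
            cache.move_to_end(key)         # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used key
    return hits / total if total else 0.0
```

Replaying a day of real traffic through this for a few capacities gives a rough idea of whether a cache would help at all, and how big it would need to be, before any caching software goes into the diagram.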

I mentioned earlier that a good cache is not noticeable. A system should work without a cache and carry as much load as it can. A cache should be introduced to speed up parts of the system that are otherwise too slow without it, to help the system deal with that load faster. The cache should not be the system itself.

A good caching strategy could buy you time. It could relieve your database of heavy loads while you figure out which is the most popular sharding technique. Popularity is how we design systems after all. It could remove strain from your tiny monolith application server while you figure out how to use Kubernetes. To autoscale, right? The list goes on and on. However, the key is to always remember to actually fix the system to be able to handle things on its own.

Echo

Software design is part engineering and part art. Experience shows up a great deal in art. In software design, engineering is knowing how a system should ideally be and art is knowing where potential reality flaws could creep in.

Be a good designer. Do not just cache!