A short post this week.
While I was helping some friends recently, we ran into a curious problem: as soon as the application started up, it was hit by a denial-of-service attack that played out in the most mundane way you can imagine.
The application itself is an API that is replacing the old Windows-based .NET Framework API, and it has many dependencies. When we removed Entity Framework (and dramatically scaled back the reliance on the caching layer because of innumerable race conditions) we discovered that Redis was timing out, but it seemed to be consistently random; at some unpredictable time after application start-up, Redis would time out and the whole house of cards came tumbling down.
But what was causing Redis to time out?
In the old version of the code, Redis was the dumping ground for data. For whatever reason, the original developers thought that their performance issues were because of the SQL Server database, not the awful no-good unindexed queries that Entity Framework was executing. So, as a precautionary measure, every time the application cycled (i.e., whenever it was started after a deployment or restarted for any other reason), a third-party process would refresh the data in Redis, which meant pulling it out of SQL Server first, row by agonizing row.
Because the original developers never cleared the Redis cache, the Redis instance was effectively a permanent data store rather than a cache. In other words, nothing was ever evicted, so anything in Redis was long-lived. Then I came along, and I told my team that the SQL Server database was the source of truth, and that whenever the application was restarted, we should make sure it fetched the latest information from the database before putting anything in Redis.
When we dug into the problem, we discovered that the third-party process only ran on one of the four load-balanced instances of the application, and it requested all the information for all the inventory at every site. So, while three of the four instances were idle, one was struggling to return a huge amount of data from the database that it didn’t need, and eventually timed out.
It didn’t need that data because we had adopted a lazy loading pattern, and there’s no need to warm up the cache (i.e., load everything into Redis) on start-up because SQL Server is plenty fast enough if you ask for only the data you need, when you need it, and only cache it for a short period of time in Redis.
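For the curious, here is a minimal sketch of that lazy loading (cache-aside) idea with a short expiry. It is not the code from the actual application, which is a .NET API. It is written in Python for brevity, and everything in it is a hypothetical stand-in: `TTLCache` plays the role of Redis in memory, `load_from_db` plays the role of a narrow, indexed SQL Server query, and the `inventory:<site>` key format and 60-second expiry are assumptions made for illustration.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """Hypothetical in-memory stand-in for Redis: each entry expires after a TTL."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Entry is stale: evict it so the caller goes back to the database.
            del self._store[key]
            return None
        return value

    def set(self, key: str, value: str, ttl_seconds: float) -> None:
        self._store[key] = (self._clock() + ttl_seconds, value)


def get_inventory(
    site_id: str,
    cache: TTLCache,
    load_from_db: Callable[[str], str],
    ttl_seconds: float = 60.0,  # assumed value; tune to how stale data may be
) -> str:
    """Cache-aside read: try the cache, fall back to the database on a miss."""
    key = f"inventory:{site_id}"  # hypothetical key format
    cached = cache.get(key)
    if cached is not None:
        return cached
    # Miss: ask the database for this one site only, then cache it briefly.
    value = load_from_db(site_id)
    cache.set(key, value, ttl_seconds)
    return value
```

Nothing runs at start-up in this sketch. The first request for a site pays for one small query, later requests within the expiry window are served from the cache, and once the entry expires the next request goes back to the database, which stays the source of truth.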
Share your coding fails in the comments below.
I once saw this happen at the hardware level.
At a conference, the auditorium had cinema-style chairs, and each chair had a LAN port and an 80 cm Ethernet cable so you could connect your laptop to the internal network (there was no Wi-Fi yet).
A guy used one of those 80 cm cables to link two Ethernet ports together, from one chair to another.
Kaboom, internet went down.