A Bug Story

10 July 2023

Once every year or two, I run into an incredible detective story at work, the kind that leaves you wondering how the whole thing didn't fall apart sooner. I want to share one such case.

Users report that the app takes a long time to launch: it sits on the splash screen seemingly forever, a minute for example. There is no message saying that everything froze, as there would be with an ANR. It just looks as if something is running in the background for a long time.

By some miracle I find an affected user's session, look through the events, and confirm that the problem really is at startup. An ANR and a crash in a neighboring session catch my eye; their stack traces make absolutely no sense. According to Crashlytics there are quite a lot of these. We knew about them, but all this time it had been completely unclear how to fix them.

We make several mandatory requests on the splash screen. Backend metrics show that the requests complete within the normal range, which crosses the backend off the list of suspects. One of the requests fetches strings so that we can swap them in at runtime. After receiving the response, we write the strings to the local database so that we don't hit the network all the time. We go to the network no more than once a day, and the update runs in a daily WorkManager job.

Everything sounds reasonable, until my eye falls on the Dao with the database queries: Insert and Select All. I look at the table definition and see an auto-generated id, while the string key is not unique. I check the theory and, sure enough, the table holds several tens of thousands of rows instead of about two thousand. Digging further, I realize that once a day we steadily add a couple thousand records to the table without deleting the previous ones. At app startup, Select All tries to load every one of them and builds a map from them. That map is what we then work with, so nobody notices a problem.
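To make the failure mode concrete, here is a minimal sketch of the same pattern in Python with the standard `sqlite3` module. It is not the app's actual Room code: the table name `strings`, the helpers `daily_sync` and `load_all`, and the two-entry payload are all assumptions made up for illustration.

```python
import sqlite3

# Hypothetical schema mirroring the description: an auto-generated id,
# and a string key with no uniqueness constraint.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE strings (id INTEGER PRIMARY KEY AUTOINCREMENT, key TEXT, value TEXT)"
)


def daily_sync(conn, payload):
    """Append-only insert: nothing removes the rows written on previous days."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO strings (key, value) VALUES (?, ?)", payload.items()
        )


def load_all(conn):
    """Select All, then collapse the rows into a map keyed by the string key."""
    rows = conn.execute("SELECT key, value FROM strings ORDER BY id").fetchall()
    return rows, dict(rows)


# A toy payload of two strings, synced once a day for a year.
payload = {"title": "Hello", "subtitle": "Welcome"}
for _ in range(365):
    daily_sync(conn, payload)

rows, strings = load_all(conn)
print(len(rows), len(strings))  # 730 rows loaded, 2 entries in the map
```

The map hides the duplication: later rows simply overwrite earlier ones with the same key, so the code that consumes the map behaves correctly while the table underneath keeps growing.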

In other words, a user who has kept the app installed for a year has almost a million rows in this table, all of which are fetched on a background thread at every launch. And here the puzzle finally comes together. The OutOfMemory crashes are caused by this. Almost all the ANRs are too, since the code has to process collections of hundreds of thousands of elements. The stack traces suddenly make more sense, and the way the app's size bloats over time is explained as well. This code was written about two years ago by previous generations of developers, and with each passing day it got a little worse, dragging the user experience down with it.

The fix was simple: we made the string key the primary key and rewrote the logic that overwrites the table. The number of ANRs dropped by half, the number of crashes fell by a factor of almost 1.5, and ratings are going up. Everyone is delighted.
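A minimal sketch of that kind of fix, again in Python with `sqlite3` rather than the real Room code. The schema and the `replace_strings` helper are assumptions. In particular, deleting and re-inserting inside one transaction is just one plausible way to "overwrite the table"; the actual rewrite may differ.

```python
import sqlite3

# Hypothetical fixed schema: the string key itself is the primary key,
# so the table cannot hold two rows for the same key.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE strings (key TEXT PRIMARY KEY, value TEXT NOT NULL)")


def replace_strings(conn, payload):
    """Overwrite the table with the latest response in a single transaction."""
    with conn:  # either both statements are committed or neither is
        conn.execute("DELETE FROM strings")
        conn.executemany(
            "INSERT INTO strings (key, value) VALUES (?, ?)", payload.items()
        )


def row_count(conn):
    return conn.execute("SELECT COUNT(*) FROM strings").fetchone()[0]


# A year of daily syncs, with one value changing every day.
for day in range(365):
    replace_strings(conn, {"title": f"Hello {day}", "subtitle": "Welcome"})

count_after_year = row_count(conn)
print(count_after_year)  # 2
```

Clearing the table before inserting also drops strings that the backend has stopped sending, which a plain insert-or-replace would leave behind.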

Developers and QA reinstall the app or clear the cache very often in the course of their work, so we could not reproduce this at all. To reproduce it, you simply need to be a loyal user who has had the app installed for long enough.

It is not entirely clear what conclusions to draw here, except that time cripples. Any code that contains date checks is a potential time bomb and needs to be treated very seriously. Integration tests would probably have helped here: change the date, re-request the list of strings, and check that the database contains exactly as many records as the response did. But even in hindsight, coming up with such a test is not entirely trivial, let alone anticipating the need for it in advance.
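For what it's worth, such a test might look roughly like the sketch below. It is written in Python against an in-memory SQLite database, not against the real app. `StringsCache`, its injectable `today` clock, and the `fetch` callable are all hypothetical stand-ins for the once-a-day sync described above.

```python
import sqlite3
from datetime import date, timedelta


class StringsCache:
    """Hypothetical stand-in for the app's cache: refreshes at most once a day."""

    def __init__(self, fetch, today):
        self.fetch = fetch    # callable returning {key: value}; stands in for the request
        self.today = today    # callable returning the current date; injectable for tests
        self.last_sync = None
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            "CREATE TABLE strings (key TEXT PRIMARY KEY, value TEXT NOT NULL)"
        )

    def sync_if_stale(self):
        """Fetch and store the strings unless that already happened today."""
        if self.last_sync == self.today():
            return False
        payload = self.fetch()
        with self.conn:
            self.conn.execute("DELETE FROM strings")
            self.conn.executemany(
                "INSERT INTO strings (key, value) VALUES (?, ?)", payload.items()
            )
        self.last_sync = self.today()
        return True

    def row_count(self):
        return self.conn.execute("SELECT COUNT(*) FROM strings").fetchone()[0]


def test_row_count_matches_response_after_date_changes():
    clock = {"now": date(2023, 7, 10)}
    response = {"title": "Hello", "subtitle": "Welcome"}
    cache = StringsCache(fetch=lambda: response, today=lambda: clock["now"])

    for _ in range(30):
        assert cache.sync_if_stale()              # new day: the request is made
        assert cache.row_count() == len(response)  # no rows left over from earlier days
        assert not cache.sync_if_stale()          # same day: no second request
        clock["now"] += timedelta(days=1)         # move the date forward


test_row_count_matches_response_after_date_changes()
```

The important part is the injectable clock: without a way to move the date forward, the test would exercise only the first sync, which is exactly the case that always worked.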