Rewriting Android Priority JobQueue – Lessons Learned
Earlier this year (2016), I decided to rewrite the internals of the Android Priority JobQueue, a task queue that I wrote while working at Path to provide a decent offline experience in the app.
Over 4 years, it grew organically from a simple task manager into a complex one that allows fine-grained control over how background tasks are managed. There were also changes in the Android scene, especially the addition of the JobScheduler.
It was getting really hard to improve the JobManager and adapt it to these changes due to some bad decisions that I made early in its development. Having recently completed the rewrite, and feeling much better about the future of the project, I thought it would be a good idea to share these lessons.
Do not communicate by sharing memory; instead, share memory by communicating
This is the single most important change in the rewrite, and also the thing I got most wrong in V1. The JobManager is multithreaded by nature, since it has to run multiple jobs in parallel. It is intuitive to put shared resources behind locks so that only 1 thread accesses them at a given time. Even though this works well at the beginning, it is very hard to manage, especially if you have APIs that trigger access to these resources. JobManager V1 had some thread lock bugs that were almost impossible to solve. You also need to be careful with memory barriers, or mark all necessary fields volatile, which is very hard to track.
In V2, I’ve changed this communication entirely to make JobManager single threaded. JobManager gets its own thread, which is the only thread that can access shared resources. Jobs that run on other threads can communicate with the JobManager only via message passing. They cannot access the resources. Indeed, they don’t even have a way to grab a reference to a shared resource, so this mistake cannot be made in the future. Any public API call also goes through this message passing, just like the Job consumer threads.
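To make the idea concrete, here is a minimal sketch of a manager that owns its state on a single thread and is reachable only through messages. It is not the JobManager's actual code: SingleThreadedManager, Message, addJob and countJobs are hypothetical names chosen for illustration, and the real library does considerably more.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch, not the real JobManager implementation.
final class SingleThreadedManager {
    /** A message is a unit of work that runs on the manager thread. */
    interface Message {
        void handle(List<String> pendingJobs);
    }

    private final BlockingQueue<Message> inbox = new LinkedBlockingQueue<>();
    // Shared state: only the manager thread ever touches it, so it needs
    // neither locks nor volatile. Callers never receive a reference to it.
    private final List<String> pendingJobs = new ArrayList<>();
    private final Thread thread;

    SingleThreadedManager() {
        thread = new Thread(() -> {
            try {
                while (true) {
                    // Block until a message arrives, then handle it here.
                    inbox.take().handle(pendingJobs);
                }
            } catch (InterruptedException e) {
                // Interrupted by stop(): leave the loop and let the thread end.
            }
        }, "manager");
        thread.setDaemon(true);
        thread.start();
    }

    /** Public API: callable from any thread, only enqueues a message. */
    void addJob(String name) {
        inbox.add(jobs -> jobs.add(name));
    }

    /** Public API: a query is also a message; the caller waits for the reply. */
    int countJobs() {
        int[] result = new int[1];
        CountDownLatch done = new CountDownLatch(1);
        inbox.add(jobs -> {
            result[0] = jobs.size();
            done.countDown();
        });
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for reply", e);
        }
        return result[0];
    }

    void stop() {
        thread.interrupt();
    }
}
```

Because the inbox is a FIFO queue, a caller that invokes addJob and then countJobs sees its own job counted. The one rule this sketch relies on is that countJobs must never be called from the manager thread itself, since it would wait for a reply that only that thread can produce.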
This solution is not without its downsides either. There was a bug during development that created a message passing loop between the JobManager and the Job consumers, which meant the JobManager was constantly running. Still, this kind of problem is much easier to solve, by making one party (the JobManager) the master that initiates every conversation, and much easier to keep under control than deadlocks.
For background, also read the excellent Go article on Share Memory By Communicating.
If your code needs to do clock-related stuff, abstract it out
JobManager provides the ability to delay execution of jobs. Java has built-in functionality to defer things or get the current time. They work great until you run your tests on a CI server, where your “solid” tests become flaky. Or in other words, they were always flaky; you just didn’t know it. And that is only half the story. You also have no way of testing race conditions, because you don’t control time.
In V2, I’ve abstracted all time-related tasks into a helper class. I even wrote lint checks to ensure that the real time is never accessed directly. This abstraction does not just wrap System.nanoTime calls. It also covers message queue delay timing and any timer that is used across the codebase.
Of course, this introduces the risk of tests never exercising real time, but this relatively simple abstraction can itself be tested with real time to mitigate that risk.
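As a rough illustration of what such an abstraction can look like, the sketch below hides the clock behind a small interface so that a test can move time forward by hand. TimeSource, SystemTimeSource, FakeTimeSource and DelayedJob are hypothetical names used only for this example; they are not the helper class from the library.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: the only place that is allowed to read the real clock.
interface TimeSource {
    long nanoTime();
}

/** Production implementation backed by the real monotonic clock. */
final class SystemTimeSource implements TimeSource {
    @Override
    public long nanoTime() {
        return System.nanoTime();
    }
}

/** Test implementation: time only moves when the test says so. */
final class FakeTimeSource implements TimeSource {
    private long nowNs;

    @Override
    public long nanoTime() {
        return nowNs;
    }

    void advance(long amount, TimeUnit unit) {
        nowNs += unit.toNanos(amount);
    }
}

/** A delayed job that asks the injected TimeSource instead of the system. */
final class DelayedJob {
    private final TimeSource time;
    private final long readyAtNs;

    DelayedJob(TimeSource time, long delayMs) {
        this.time = time;
        this.readyAtNs = time.nanoTime() + TimeUnit.MILLISECONDS.toNanos(delayMs);
    }

    boolean isReady() {
        // Compare by subtraction, which stays correct if nanoTime wraps around.
        return time.nanoTime() - readyAtNs >= 0;
    }
}
```

With a FakeTimeSource, a test can check the exact boundary of a 500 ms delay (not ready at 499 ms, ready at 500 ms) without sleeping, so the result does not depend on how loaded the CI machine is.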
Think twice and then think again before adding a new API
This is something I learned while working on the Android Framework, seeing all the functionality that we need to support going forward but that does not make sense or was not designed well for flexibility. JobManager had APIs like this, the worst one being the addJob method, which returns a long id that is only unique when combined with the persistent property of the Job. So a persistent and a non-persistent job could have the same id. It worked this way because ids were provided by the Job queues, which had default implementations but could be swapped out by the developer.
The API-breaking V2 gave me the opportunity to clean these up, and I had fun :). I’ve moved from long ids to UUIDs that are assigned when the Job is created, so that it is easier for the developer to tie things to the job lifecycle without relying on a response. The worst part of this story is that I knew it was a bad API when I added it, but I added it anyway. So don’t do it, no matter what the use case is. Just wait until you have a better solution before implementing whatever functionality the user requested.
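The difference between the two id schemes can be sketched as follows. SketchJob is a hypothetical class written for this example, not the library's Job class; it only shows the id being generated in the constructor rather than handed back by a queue.

```java
import java.util.UUID;

// Hypothetical sketch of a job whose id exists from the moment it is created.
final class SketchJob {
    // Assigned at construction, so it does not depend on which queue
    // (persistent or not) the job later ends up in.
    private final String id = UUID.randomUUID().toString();
    private final boolean persistent;

    SketchJob(boolean persistent) {
        this.persistent = persistent;
    }

    String getId() {
        return id;
    }

    boolean isPersistent() {
        return persistent;
    }
}
```

Since the id is known before the job is handed to any queue, calling code can register listeners or store the id up front instead of waiting for a return value from an add call.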
Overall, writing V2 was a fun experience and a chance for me to try new things. There are more things that I would like to change but didn’t, as there is a balance between providing the most desirable API and keeping backward compatibility. I’m happy with what I have right now as I release V2, and I hope you enjoy using it to develop the most responsive apps that work offline.