Work tracking Sprints for DevOps teams (a review)

 

Why am I writing this article?

Recently there have been a number of big changes to Atlassian Cloud products. A few even became free or were bundled with other price plans. Arguably, all are good products, and some you may say are great in their space. As all of this has rolled out, they’ve sent out quite a few emails, and one of the recent ones got me thinking about a tutorial I wrote a while back on how to manage DevOps work streams in the context of Agile sprints and Agile software, like Jira and Confluence. This made me want to write an update to that article and see how the idea has stood the test of time and different work environments as I've introduced it for other clients.

 

What's it based on?

The TL;DR for the previous article is the idea of how to manage long-, mid- and short-term goals and priorities for Operations (DevOps/SRE) teams in a way which fits with other engineering work in the company and aims to gain the benefits of Scrum for teams which have wider-reaching responsibilities. It presented a way to capture business-driven goals along with technology-driven ones, and to create a hierarchy from these down to the small tasks which engineers would be working on week-to-week; and how to present this in a way which is always available and up-to-date for everyone. It provided a tutorial on how to do this with Jira Epics, Issues and Sub-tasks, and metrics dashboards, plus Confluence macros and reports.

For the last few years I, along with another Mesoform engineer for a period, have played a significant part in the set-up and growth of the Google Cloud platform team at a large financial client. We've often gone backwards and forwards between different ways of working our sprints, usually looking to do pure Agile as would normally be done by software development teams. However, these approaches have been too narrowly focused for a DevOps platform team, which is usually looking after many different types of technologies. So, whenever we made such changes, we would usually quickly agree that the way described in the article was better, and switch back. However, it hasn’t always been perfect and we’ve learned from a few issues we regularly faced.

 

What did we discover?

With some exceptions, the process was eventually adopted fully. It became established within the engineering team and eventually spread, to some degree, to the wider platform teams including product management, security and compliance. It worked well but there were still a few things that could be improved.

First off, we found that all too often tasks weren't being identified as part of the higher epic as in the model described in the tutorial. Instead, lots of 'individual, small tasks' were being assigned to individual engineers to work on alone. This meant issues often had no relationship to each other from a reason or technology perspective, and engineers would have to context switch a lot between tasks. This not only impacts productivity and team knowledge, it's also a bit of a security issue. For example, if someone senior asks something of an engineer, the inclination is to just do it. Often questions aren't asked; sometimes the engineer raises the ticket for the request instead of the requester, and sometimes no one does.

Next up was poor in-team collaboration and knowledge sharing. Even when planning was thorough, when all issues were associated with an epic and had good acceptance criteria and clear descriptions (i.e. the full shebang), there was still very little discussion around ideas and final solutions. Initially this seemed like it was because too many issues were being dealt with for too many epics. Each sprint consisted of a broad range of topics and everyone was working pretty much in silos within their own team. Then, the next sprint, the same engineer would probably pick up tasks relating to work they'd already done. As a result, the knowledge became siloed as well. This also made it difficult to handle unplanned time (unexpected time off, fire drills or even engineers just cycling onto ops support work).

Lastly, epics, and often sprint issues, were never being properly completed. This also felt like it was down to too many things being worked on. Because of this, work dragged out over longer periods of time and, as a result, would almost always be reprioritised due to some other major issue which was the hot topic of the week.

We also noticed that our metrics on what was working and what wasn't were unreliable. This was partly because the burndown charts we were using relied on engineers logging time to issues and, as humans, we just don't think about this much. I think I'm good at this overall but, in reality, I'm still bad.

The process and tutorial described in the original article were supposed to promote a way of working that prevented these sorts of things from happening, but clearly something was missing... It never really described how to manage the specific engineering tasks, just how to organise them.


Updating the solution

Once we'd given the process enough time and seen the benefits, we were then in a good position to be able to make small improvements to it. Below are the things we changed:

  1. We split the platform engineering team into four: three (initially two) engineering teams and one support team
  2. Management and team leads would do backlog grooming and roadmap planning based on epics alone. In our enterprise version we organised and tracked these with a plugin called Jira Portfolio
  3. Each engineering team would have a different set of epics on their backlog, based on priorities from the backlog/roadmap. They would do their own planning and create engineering tasks under those epics.
  4. Instead of engineers working on individual issues, we would prioritise our top team issue, then every engineer in the team would get involved in planning it and voting on how long it would take to complete (using a Slack app called Storyplan). We'd then do the same for the next highest priority, and so on. When the sprint started, all engineers would (generally) work on the highest priority issue first until it was complete, then move on to the next.
  5. As well as asking engineers to log time, we changed our estimation and velocity metric to be based on story points. Logged time became a secondary metric. Story points were always based on Fibonacci numbers (as is common in sprint planning)

 

Conclusion

The results are still coming in, but we've been operating this model fully in one of the teams for about 5 months now, and the others have partially adopted the additional changes more recently. The team that has been doing it completely has shown good results so far.

In the team that had made the changes fully, we found fewer issues were worked on during a single sprint but more epics got closed. This meant that things like feature requests or platform improvements were more completely done, rather than being dragged out and often dropped before they were properly completed. We also saw an uptick in the number of story points being completed (an average of 19 story points per month, up from 9).

On top of this, I did a small survey within the team to get a qualitative view on the changes. The survey had the following questions:

  1. In your opinion, are we getting more done as a team?
  2. In your opinion, are the tasks we're working on more completely finished? This was about whether we have better testing, monitoring, documentation, automation, etc.: things that would often be left out
  3. Do you believe you have a better understanding of the work/technologies the team is responsible for?
  4. Do you feel more connected and updated on issues when you're not currently working on them? I.e. when covering operational support issues or doing unplanned work.

Each question was a simple score out of 10, 0 being considerably worse, 5 being no noticeable difference and 10 being a significant improvement. Some engineers also took a little extra time to give further feedback which I'll cover later.

Taking the engineers' results, I totalled the answers to each question and calculated a positive or negative perception percentage from the result. I.e. if everyone gave 10/10 it would be 100% approval, 0/10 would be -100% and 5/10 would be 0%. It turned out as follows:

  1. 30%
  2. 85%
  3. 70%
  4. 60%

This was great news and was consistent with the fact that we were working on fewer tickets (people only being 30% confident that we were getting more done in a sprint) but closing more epics (people being 85% confident that we were completing things better and more comprehensively).
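For anyone wanting to run a similar survey, the scoring can be sketched in a few lines of Python. This is only a minimal illustration of the scale described above (0/10 maps to -100%, 5/10 to 0% and 10/10 to 100%); the function name and the example scores are hypothetical and are not the team's actual responses.

```python
from typing import Sequence

NEUTRAL_SCORE = 5   # "no noticeable difference"
MAX_SCORE = 10      # "a significant improvement"


def perception_percentage(scores: Sequence[int]) -> float:
    """Convert 0-10 answers to one question into a -100%..+100% perception score.

    The answers are totalled, the neutral total (5 per respondent) is
    subtracted, and the result is scaled so that all 10s give +100,
    all 5s give 0 and all 0s give -100.
    """
    if not scores:
        raise ValueError("at least one score is required")
    if any(s < 0 or s > MAX_SCORE for s in scores):
        raise ValueError("scores must be between 0 and 10")

    respondents = len(scores)
    total = sum(scores)
    neutral_total = NEUTRAL_SCORE * respondents
    half_range = (MAX_SCORE - NEUTRAL_SCORE) * respondents
    return 100.0 * (total - neutral_total) / half_range


# Hypothetical answers from four engineers to a single question.
example_scores = [6, 7, 6, 7]
print(f"{perception_percentage(example_scores):.0f}%")
```

Centring the scale on 5 means a lukewarm result shows up as a small positive number rather than a misleading "60% approval", which makes it easier to compare questions against each other.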

The feedback also highlighted that people felt they had a much better knowledge of the things we are responsible for. This is partly because engineers would often pair on specific tasks, partly because they had a vested interest in the issue being resolved, and also because they were all already working on the main issue, so collaborating was much easier. Furthermore, because they were subscribers to the issue through Jira and Slack conversations, engineers could take time out to cover Ops support or ad-hoc issues, or just take time off, and never be that far out of the loop.

 

Final points

As I mentioned before, some engineers offered some extra text on what they felt was good and what could still do with improving. These are some of those points:

  1. With the pace picking up quickly, when people lacked experience in certain areas, they sometimes found it difficult to keep up
  2. The planning was often team-led, so the sense of teamwork was high
  3. We needed to highlight better when issue sub-tasks were a blocker on others. Main sprint issues were prioritised but more focus was needed to prioritise and flag blockers within an issue
  4. Ops support work was still quite disruptive to engineers' ability to track sprint work. This showed in a 60% approval: good, but lower than others, so maybe there are some further improvements we can make here.

 

Thanks for reading. I hope you found some of this useful and, as always, if you would like to discuss any of these topics in more detail, please feel free to get in touch.

 

 

About Mesoform

For more than two decades we have been implementing solutions to wasteful processes and inefficient systems in large organisations like Tiscali, HSBC and HMRC, and impressing our cloud-based IT Operations on well-known brands, such as RIM, Sony, Samsung and SiriusXM.
