This article condenses practices, observed over a period of two years, that have increased the effectiveness of a software development team. The results have been higher customer satisfaction, increased business and improved quality. While the new atmosphere of greater employee participation was the result of organizational changes that brought in Agile software development, the observations and interpretations are our own and do not reflect company policy.
Among all the factors that we discovered, perhaps the most influential was doing frequent releases. Building software is complex work. Overlooking some software behavior, even under usual operating conditions, is bound to happen. For example: what should happen in the absence of an input column? Should the software discard the entire row or assume a default value? Prescribing solutions is easy in hindsight, but experience tells us that the next problem will be different.
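The "default value" policy above can be made concrete with a small sketch. This is illustrative only: the column name, the default, and the row shape are our assumptions, not details from the original system.

```python
# Hypothetical policy: fill a missing or empty column with a default
# instead of discarding the whole row. Column name and default value
# are made up for illustration.
DEFAULTS = {"priority": "normal"}

def normalize(row):
    """Return a copy of the row with missing columns filled from DEFAULTS."""
    out = dict(row)
    for col, default in DEFAULTS.items():
        if out.get(col) in (None, ""):
            out[col] = default
    return out
```

The point is not the three lines of logic but that the policy is written down in one place, so the next omission is a one-line change rather than a debate.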
One common reaction to the discovery of such omissions is to insist on detailed formal specifications in the future. But that rarely happens. The domain experts might have left the organisation, the remaining ones might be busy on a new project, and the customer might not respond to every query within a useful span of time. What is an effective strategy under such conditions? A development process that acknowledges the inevitability of small mistakes is more resilient.
A corollary to this understanding is that fancy features should be deferred. This is not a new idea; it has been explored in the Worse Is Better approach to software development.
By this point, we have established that working software has more value than a grand design. However, the point of our kind of development is to satisfy a business need, so it is wiser to follow a strategy that avoids excessive planning and gives high importance to feedback. Hence the importance of confining development activities to short phases, typically of two or three weeks. Such short phases, called sprints, provide the flexibility to demonstrate value, take feedback and make course corrections. Even if a release cannot go into production, it can go to a customer lab, which helps gather input on the next course of action.
One common question about a short release cycle is: when does development end and testing begin? We tackled this problem by doing component-level testing throughout, while keeping aside two days at the end for overall stabilization.
There are areas of software delivery that are essential but do not get enough attention in longer (2-3 month) release cycles. For instance, our frequent releases also included release notes and installation guides. Performance and soak testing were discussed and carried out according to the maturity of each software component. Repeating the whole process builds habits and increases efficiency.
A paradigm shift in the development process was brought about by cyclically focusing on all aspects of development, every couple of weeks. This ensured that deficiencies did not escape the spotlight, thus improving quality. We used two weeks as the standard duration for one cycle of development activities.
2. Use To-Do Lists
It is challenging even for one person to bring a bunch of tasks to completion. It is much more difficult to get 8-10 people to complete a few dozen activities. Almost everybody knows how to-do lists help organize work, personal or professional, and improve completion. It is very beneficial to fully represent the work that team members must do in the next two-week period.
This results in:
- Work breakdown into achievable tasks.
- Increased visibility, which helps produce better estimates and thus improves predictability. Predictability is crucial to building trust. Trust is crucial to success in business.
- Setting the right expectations by increasing transparency. For example, it surfaces activities that take away time but are generally difficult to talk about (e.g. assisting a co-worker).
- Stimulating everyone to think through work items and avoid ad-hoc requests that lower predictability.
The to-do list in this context is known as a sprint backlog. It is a prioritized list of items (e.g. features, bugs, technical work) to be carried out in the current sprint. As a side note, to-do lists are not meant to stifle flexibility. The activity is about representing a plan and sticking to it; the plan itself is decided mutually.
Historically, the emphasis in software development has been on the execution of segregated tasks like implementation, verification, review or documentation. One of the striking changes was to downplay individual tasks and focus ruthlessly on the end goal. The end goal was defined as a small software change that could potentially be released to the customer.
A user story is the term commonly used for such a small change in software. We imposed two constraints on ourselves:
- A user story is not complete until it is tested. This follows from its having to be usable in a production-like environment.
- A user story has to be something that can be completed within the usual development period of two weeks.
After several months, certain patterns emerged:
- A properly completed user story is finished work. It does not leave behind any serious technical debt. It is work that is truly complete (not just code complete) and enables everyone to move forward and pick something new. The software as a whole is in a consistent state after the addition of one user story. These benefits cannot be overstated.
- Smaller user stories get completed on time and have a higher chance of being properly tested. In a way, this is intuitive. A clunky task is impossible to define properly, and defining a problem correctly is essential to tackling it correctly.
- Bulky user stories signal uncertainty. They represent an area of shallow thought, are likely to contain unspecified assumptions and may not get completed. The lesson is to avoid large user stories.
In short, progressing in units of user stories brought about predictability and confidence. For newcomers, it is a challenge to break down a big piece of work into parts such that the overall software remains shippable after the completion of every part. An example will be instructive. The high-level requirement was an enhancement to an existing system in which files were fetched from a single host over an IPv4 network.
The new requirement was: Add the ability to fetch input files from multiple hosts over an IPv6 network.
The fetcher program ran inside a Kubernetes pod. At that time, Kubernetes did not have stable IPv6 support. Therefore, we introduced socat to bridge between IPv4 and IPv6. The fetcher program would connect over IPv4 to socat, which would connect to the customer's host over IPv6. Also, the multiple hosts could generate identical filenames. Further, since we always build in some level of fault tolerance, all programs should be relaunched on unexpected exit and start automatically at boot.
It will be easy for most engineers to break down this work into several tasks. The challenge is in defining the software layers such that each layer is a functional increment of a complexity that can be finished in under two weeks. We broke the work down into the following user stories:
- Add a feature to the fetcher program to fetch from multiple hosts, without caring whether files were getting overwritten.
- Add a feature to the fetcher program that appends an id to fetched filenames identifying the source host. We merely added a unique integer id towards the end of the filename. Making the id unique was not difficult: it was the position of the source host as listed in the configuration file.
- Write a systemd managed program that can launch multiple socat instances. Kubernetes already handled relaunching the fetcher pod on failure. Therefore, doing anything extra for auto-relaunch was not required here.
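The filename-tagging story above is small enough to sketch directly. This is a hedged illustration: the function name, the placement of the id before the extension, and the list-based configuration are our assumptions about one plausible shape of the change.

```python
# Sketch of the filename-tagging user story: append the source host's
# position in the configuration file as a unique integer id.
def tagged_name(filename, host, hosts):
    """hosts is the ordered list of source hosts from the config file;
    the host's index in that list doubles as its unique id."""
    idx = hosts.index(host)
    if "." in filename:
        stem, ext = filename.rsplit(".", 1)
        return f"{stem}_{idx}.{ext}"
    return f"{filename}_{idx}"
```

Because the id derives from configuration order, no coordination between fetchers is needed to keep names unique.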
In this work breakdown, the main thought barrier to cross is this: when confronted with modifying a single program, there is a tendency to work on all planned features together. Notice that the various features are independent of one another; there is no need to complete them in any particular order. However, when completed, they deliver the business requirement.
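The socat bridging story can also be sketched. The fetcher connects over IPv4 to a local port, and a socat instance forwards the connection to the customer's IPv6 host. The base port, remote port and address literals below are illustrative assumptions, not values from the original deployment.

```python
# Build one socat command line per remote IPv6 host. The fetcher then
# talks IPv4 to localhost:BASE_PORT+i instead of the remote host.
BASE_PORT = 9000      # first local IPv4 listen port (assumption)
REMOTE_PORT = 21      # remote service port (assumption)

def socat_commands(ipv6_hosts):
    """One socat forwarder per host, each on its own local IPv4 port."""
    cmds = []
    for i, host in enumerate(ipv6_hosts):
        cmds.append([
            "socat",
            f"TCP4-LISTEN:{BASE_PORT + i},fork,reuseaddr",
            f"TCP6:[{host}]:{REMOTE_PORT}",
        ])
    return cmds
```

A systemd-managed launcher would spawn one such process per configured host; systemd's restart behavior then covers the relaunch-on-exit and start-at-boot requirements for the socat side, just as Kubernetes covers them for the fetcher pod.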
One piece of data we collected was the subjective amount of effort that would go into delivering a user story, also known as story points. A story point is a measure of effort that expresses the relative difficulty of implementing a user story. We started off with a few rules:
- Associate a difficulty of small, medium or large with each user story. Since numbers make summation and aggregation possible, the three categories were represented by 5, 8 and 13, respectively. The skew between the values conveys that difficulty increases, but not necessarily linearly.
- A small user story is one where there is a clear understanding of both the problem and the solution. It is something that can be completed in a week. A medium one is more complex but can be achieved in two weeks. A large one can be achieved in two weeks only with at least 2-3 people working on it.
- The development team’s collective decision on the number of points is final.
- These may not be the best bunch of rules around story pointing. More important than endless debate is agreeing on a workable approach and trying it out.
Notice that given the layered development style that we introduced earlier, completed story points would be a representation of the value delivered by the team. After about one year, several advantages could be seen:
- Better Prediction. The average number of points delivered per sprint, or velocity, made it possible to estimate team output. People associated with a project, particularly managers, are often interested in knowing whether a certain bunch of features can be delivered within a given amount of time. This difficult and universal problem in software development found a working solution.
- Better Planning. The presence of a historical average helped establish the amount of work that the team could pick with a high chance of completion. In other words, if the average velocity of the team was 75 points per sprint, then a target of 120 immediately exposed a gap. Further, improved visibility into estimation had far-reaching benefits. For instance, we were able to allocate time even for training and learning new skills.
- Increased Objectivity. Story pointing made trade-off conversations more objective. Given an established team capacity, it made it easier to choose user stories for delivery.
- Red Flags. One, large user stories became a signal for work that is unlikely to get completed in a sprint, or that would require several people to work together. Two, if the team had been delivering more than usual in consecutive sprints, this hinted at exhaustion in the near future.
- Faster Estimation. Assigning story points became very quick. These days, if we have a fair idea of a user story, we spend only a few seconds setting its story points.
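The velocity arithmetic behind the planning point above is simple enough to sketch. The 10% tolerance is our illustrative assumption; the 75-versus-120 figures come from the example in the text.

```python
# Average completed points per sprint, and a capacity check against it.
def velocity(points_per_sprint):
    """Historical average of completed story points per sprint."""
    return sum(points_per_sprint) / len(points_per_sprint)

def over_capacity(history, target, tolerance=1.1):
    """Flag a sprint target more than ~10% above historical velocity
    (the tolerance factor is an assumption, not a team rule)."""
    return target > velocity(history) * tolerance
```

With a history averaging 75 points, a proposed target of 120 is flagged immediately, which is exactly the gap the planning discussion relies on.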
The other metric we collected was the happiness index of team members on a scale of 1 to 5, the former representing very unhappy and the latter, very happy. This gave everyone, even the shy ones, an opportunity to quickly communicate their overall level of interest during the most recent sprint. Often, a one-to-one discussion was prompted when someone reported a low (< 3.5) value. It is much better to give and receive feedback over shorter periods, while matters are still rectifiable. Also, it is worth noting that our highest happiness index values coincided with delivering an important release that was scheduled to go to production.
In our domain of processing large data streams, there usually are at least three different technologies in each core functional area:
- Languages: Java, Python, Scala
- Cluster Computing: Spark, CDAP, Flink
- Databases: HBase, Redis, Postgres
- Resource Managers: Kubernetes, Yarn, Mesos
- Messaging: Kafka, RabbitMQ, Pulsar
- Scheduling: Oozie, Nifi, Azkaban
- Packaging: RPM, Docker, Ansible
This list does not include other aspects like logging, statistics, alerting, OS environment and shared filesystems that also routinely demand attention. And we have not even mentioned the architecture, which is expected to solve the business problem within resource constraints and achieve a high level of performance. Any project requires at least ten different skills. It is very unlikely to find one engineer with a high level of competence in so many areas. Further, the traditional supervisor-subordinate model is not effective in such a complex environment. A way of organizing people that allows the free sharing of skills is an improvement.
Thus, one of the startling observations was the jump in work completion when multiple people worked together. This working together is not symbolic. It is an active form where two people stare at the same computer screen.
A related source of inefficiency is the antagonism and misalignment between developers and testers that evolves in segregated environments. This was hinted at in the section on building in layers. In the usual workplace, developers blame testers for harassing them and testers blame developers for not giving them enough time. One big source of waste is the unnecessary reporting of bugs in the tracking software. We figured out that the only bugs that actually mattered fell into the following categories:
- Those that represented an inadequacy that ought to be recorded. Typically, these were bugs that could not be fixed in the same sprint in which they were discovered.
- The ones that came from production or a similar environment.
Testers and developers often have complementary skills. When they sit together and work for a common goal (e.g. make this release happen by Tuesday) the results are astounding.
The work environment, which focused on final results while giving engineers freedom, resulted in a charged-up atmosphere where motivation was high and people enjoyed the process. We successfully used the same group problem-solving approach when debugging production issues too.
Coding is attention intensive. Meetings can take more time away from productive work than their stipulated duration, because programmers are on a different schedule: at least half a day is required for any meaningful output in coding.
Over the months, we enriched our management of meetings by:
- Encouraging managers to check progress during the daily stand-up meeting (DSM). For this to work, the credibility of the DSM needed to be high, which was made possible by having everything represented in the sprint backlog mentioned earlier. If a boss feels that a certain piece of work is important and it does not feature in the backlog, then there are going to be unexpected status checks and interruptions.
- Ensuring that the DSM stays within 15-20 minutes. This was done by tracking items in the backlog, not people. Team members were encouraged to pursue technical discussions outside of the daily stand-up.
- Clustering meetings to 1-2 days in a sprint. Some meetings (e.g. planning) require everyone to think together and are unavoidable. Keeping them on the same two days (e.g. Wednesday and Thursday) in a two-week period lets developers organize their schedules accordingly.
Another new practice was demoing the work at the end of every sprint. In a strategy of constructing software in short increments, feedback from the people who really matter is a powerful tool. Demos need preparation and might not appear worthwhile if the work is not immediately required in production. However, presenting the work even to customer-facing employees has significant benefits:
- It provides a means to validate the work that has been done. Presenting before the actual user is a big deal: it reveals whether they were expecting the same thing. The demo is a checkpoint between what has been constructed and what is expected.
- The development team gets a chance to review all the recent functionality and installation changes. They are compelled to verify the sanity of the complete cluster, not only of the components that have been touched.
- Formal invitations bring about an atmosphere of seriousness. The demo is a ritual that brings closure to the sprint work. Sometimes something falls through the cracks and is luckily caught during the demo. Omissions, and opportunities to reduce installation complexity, get spotted too.
Before concluding, here are some additional practices that have also been crucial:
- Formal communication is time consuming. Something that can be discussed in a direct message need not go into an email thread or a Slack channel with 20 other people.
- Focus on finishing tasks more than starting new ones. Starting new activities gives a false impression of progress by increasing concurrency. Even from a theoretical standpoint, this time slicing prevents any task from completing early. Practically, delays get exacerbated due to human context switching losses.
- Watch the number of areas (e.g. new features, bug fixes, troubleshooting) that are in-progress. Reduce the count to increase predictability.
- Maintain transparency. Honesty is the best policy.
The energetic application of the above practices brought about a major transformation: people became more engaged and work became fun.