The Risks of Rewrites

(4 comments)

July 14th 2020

In the last chapter we drew some clearer lines around the terms rewrite and refactor, defining both as approaches to improve the quality attributes of an application, but differing in scope. Whereas refactoring is about making incremental changes to an existing system, rewriting is about starting over and building anew. Rewrites, in other words, are BIG. And risky. And yet despite this, there's still a deceptive allure to them. The logic can go something like this:

The system is already in production, we obviously know how it works, so we just need to port what we have to a better platform, and once we're there things will be easier.

In this chapter, we're going to rip this basic intuition to shreds. Rewrites, as we'll see, are anything but easy. While they may be immune to some of the challenges of a nascent app, they introduce completely new ones we often don't anticipate. In order to successfully deliver a rewrite, we must navigate these challenges, and so it's important to understand what they are up front.

Parallel Development

The first hazard we often confront on our journey is the realization that a rewrite is fundamentally a dual prong effort. Sure, on the technology side we'd love to think we can pause development to focus only on writing code in the new platform, but reality (and more specifically, the business) rarely abides. Maybe in order to win (or keep) an important customer, the business demands we add some new feature to the existing system NOW. Or maybe a third-party system has changed its API and we need to refactor, or a sev-1 defect pops up, or a new government regulation is written. The point is, almost never can a production system sit static in space and time for long, and that's true even during a rewrite period. Some care and maintenance will inevitably be necessary on the old system, which means parallel development.

The first problem here is that developing in parallel violates that principle we developers hold most sacred: Don't Repeat Yourself. Any change made to the legacy system will need to be ported into the new system as well. For example, if we add feature X into the old stack, we must add X again in the new one (but this time write it in the new language or framework). And that duplicative effort applies for the testing, project management, build, and deployment of X as well. Essentially, whatever it would cost (in time or resources) to make a change to the legacy system, it could now, during this rewrite phase, easily cost double.

Even more, depending on how far along the rewrite effort is, the change (e.g. feature, defect fix, etc.) might need to be made to a part of the source code that has not yet been rewritten, and so the team will have to remember and account for it when they get there (which will then push the timeline back for the rewrite). Or in the opposite case, the change is in a section of the app that has already been rewritten, and so the team has to circle back and make updates to code that was just migrated (and then re-test, etc.). Either way, the parallel development element of a rewrite can make it feel like the team is keeping a plane flying while also building its replacement on the ground. For the safety of the passengers, that plane's got to stay in the air, but if we spend too much focusing on its maintenance, we'll never get the new one off the runway. And of course the team responsible for both "planes" is often one and the same, which leads to the next issue...

Team Schism

Given that there's some amount of effort necessary to support the existing system, the question is: who is responsible for this? There are a couple approaches that teams typically take to manage the parallel work. One option is to have the more junior developers stay back in maintenance mode, and thereby free up the senior devs to forge the path forward for the new system. And this makes sense. To pull off a rewrite generally requires a higher level of technical bad-assery - getting new technologies up and running and working together, configuring environments, establishing patterns and conventions, and so on are all tricky tasks that could leave a more green developer spinning their wheels in the mud.

And so teams may settle on this type of split, junior developers to maintain, senior developers to go forward. And it may work for a short while, but it's not long until the senior devs are pulled back in to help solve a critical all-hands-on-deck issue or discuss the impact of some change. Their years of acquired knowledge of the old system and their relationships with business stakeholders, ops people, and others make them effectively irreplaceable. They need to be included. And even if these "pull-outs" only take up a couple hours a week, that level of context-switching can still stymie productivity. As developers float back and forth between the old and new systems, the timeline for the rewrite just gets pushed back.

To avoid this, teams may take an alternative approach: spin up a new team that can be more easily dedicated to the rewrite. The developers on this "rewrite team" would have negligible ties to the existing system (maybe they're consultants, new hires, etc.) and would therefore be better able to isolate themselves and stay productive on the new work. Problem solved? No. The first issue, of course, is that these developers are by definition less familiar with the codebase, domain, and generally how and why things work the way they do. Sure, they have the original code as documentation, but this can be like learning a programming language by looking at the source. It's all facts, but no story. And so this new team will either flail while trying to understand the intricacies and complexities of the existing system, or make bad assumptions just to move forward, or end up having to pull in the original team anyway. None of these are productive outcomes.

Additionally, this "fresh team" approach can also alienate the veteran developers who are charged with staying back to support the legacy system, and make them feel as though they are being penalized for their loyalty and experience. While the new team gets to work with modern technologies and a clean slate, they are still left in the doldrums of maintenance. No fun. And so unsurprisingly, this can quickly lead to disgruntled developers and shortly after that, brain-drain. Somehow a team structure needs to be devised where the old system can be supported, the new system can be built correctly and efficiently, and everyone is happy. But even when you put the right people on the right busses, there are still surprises...

Unknown Unknowns

Now even if the rewrite team is staffed with seasoned developers - i.e. those who either helped write the legacy system or at least have some experience with it - there are always surprises when migrating the existing source code to the new platform. Donald Rumsfeld (yes, that Donald Rumsfeld) made this not-so-crazy insight about these types of surprises:

Reports that say that something hasn't happened are always interesting to me, because as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns—the ones we don't know we don't know. And if one looks throughout the history of our country and other free countries, it is the latter category that tend to be the difficult ones.

When we begin the rewrite process, we typically try to estimate how much effort it will be. And it's here where we begin to catalogue the first two categories of work. There are of course segments of the codebase that we have direct experience with, and so we give quicker (and more accurate) estimates. These are the known knowns. But then there are also pockets of the system that we recognize we have less direct experience with, and so we add a little buffer. "Joe wrote the registration flow, and he quit last year, so that might take a little longer to migrate". These are the known unknowns. And so we proceed with planning, estimating both types of work and giving extra time for the known unknowns, until an overall timeline is agreed to.

It's not until we dig in and start migrating code, however, that we stumble upon that third and most pernicious category of work, the unknown unknowns. It may be a 1,000 lines of convoluted business logic we never knew existed, or a set of reports we didn't know were used by anyone, or some "duct tape" that unbeknownst to us holds an entire integration together. Whatever it is, our original rewrite plan never took it into account, but we're in too far at this point to bail out. The result is that this (formerly) unknown unknown will need to be be analyzed ("what the heck does it do?") and then dealt with ("do we need to bring this over, or can we leave it behind?"). And this extra analysis, discussion, and effort just pushes back the initial timeline and strains the patience of business sponsors and customers.

And while a few of these unknown unknowns can be absorbed, too many can put the entire rewrite effort in jeopardy. With better planning and decomposition (more on this later), they can be minimized, but avoiding them altogether is difficult. There is another hazard, however, that we inflict on ourselves...

The Second System Effect

Often, we've been living with the warts and imperfections of the existing system for so long that when we finally get the chance for a do-over, we can't help but want to make everything better. Or perfect. Decades ago, Fred Brooks labeled this tendency the Second System Effect. In the Mythical Man Month, speaking of the architect of the system, Brooks observes...

As he designs the first work, frill after frill and embellishment after embellishment occur to him. These get stored away to be used "next time." Sooner or later the first system is finished, and the architect, with firm confidence and a demonstrated mastery of that class of systems, is ready to build a second system.

This second is the most dangerous system a man ever designs. When he does his third and later ones, his prior experiences will confirm each other as to the general characteristics of such systems, and their differences will identify those parts of his experience that are particular and not generalizable.

This effect can manifest itself in a few ways. First, we tend to see the rewrite as an opportunity to eliminate all the technical debt we accrued in the past system. We want to decompose God classes, fix inconsistent variable naming, re-shuffle and restructure data structures, and so on. Basically, we see this as our chance to resolve all of our little peccadillos with the legacy code, because it wouldn't feel right to port that debt into the new codebase.

Additionally, it's also easy to think of the rewrite as not just a chance to modernize certain pieces of the architecture that are flawed, deficient, or out of support, but also as a vehicle to leap-frog everything to the cutting edge. Maybe the UI is still the old AngularJS and doesn't render well for mobile web users, so it does seem justified to port to a more modern web framework. But while we're at it, let's also break the backend up into microservices and write them in Go! Essentially, the rewrite can become like a piece of legislation - it may start out as a simple bill to reduce gun violence, but by the time it gets passed into law West Virginia is getting five new bridges and Oregon's farmers are being subsidized to grow soy.

And the Second System Effect is not just specific to developers. Oftentimes business stakeholders will take the same tact, but with their priorities. "We might as well add this feature to the rewrite that the customers have been dying for, since we'll be touching that code anyway." And just as with the technical improvements and embellishments, the scope just gets bigger and bigger and the release date of the rewrite gets pushed back further and further. In the end, while the mantra of development for the initial system was "get something up and running fast!", for the rewrite it can be "we'll never get another chance to do X, Y, and Z!"

Fighting the Last War

It's often said that generals in peace time prepare themselves to fight the last war. They look back and reflect on the flaws and faults of their old battle plans, and vow to never let those mistakes happen again. And then the next war comes, and it's nothing like the last. Their preparation was for naught. The same can happen with a rewrite. We are so keenly aware of our mis-steps of the first system, that when it's time to design the second, we make sure above all else that we don't get burned again in the same way. But it's easy to miss that things have changed. Those past mistakes may be non-issues at the point we rewrite, and so it's not necessary to go to any ends to prevent them. At the same time the modern technologies being used in the rewrite have introduced a whole new set of issues that we haven't yet looked ahead to spot.

In my early years of consulting, I helped build for a client a large web-app that served a few thousand internal users. The application was a success - meaning that we delivered the functionality within time and budget (plus or minus), but we unknowingly hitched our "horse" to the wrong wagon, so to speak. These were the early days of single-page apps, so we used a budding framework called Google Web Toolkit, which was quite cool at the time (you'll have to take my word on this). Unfortunately, a year or so after this app was launched, Google effectively jumped ship from GWT in favor of something better (Angular), leaving everyone (including this client) stranded on an essentially unsupported technology. Not a good situation, though in our defense it would have been difficult to predict.

Years later, I had the chance to reconnect with some of the developers still there, and they told me they had no choice but to rewrite the app. And rightly so. But when I asked what tech stack they had picked, I was surprised. Not wanting to be burned by single-page app technology or a flash-in-the-pan framework, they chose an incredibly mature but decade-old server-side page rendering technology, despite the fact that single-page frameworks were, at that point, extremely solid and used by, well, everyone, and their app had some very rich user interaction that could have benefitted from that model. In other words, the central driver for their rewrite was to avoid the mistakes of the first system, but in the process they missed out on many of the advantages of modern web frameworks. They were fighting the last war.

All or Nothing

The last big hazard of a rewrite is something that can be avoided (as we'll discuss in later chapters), but often is not. While we now accept the wisdom of iterative development and the concept of the Minimum Viable Product, with a rewrite it can be difficult to find a way around delivering everything all at once in one big bang. For example, if the customers have been using the legacy system which has, say, one hundred features, how can we deliver a new system with anything less than those one hundred features if we expect them to (happily) switch over? Basically, the rewrite needs to do everything the old system did, and so it would seem that we must first deploy the entire replacement before we can get back to iterative development. Depending on how big the existing system is though, this can represent a considerable effort.

Further, this "all or nothing" requirement can combine with the ever constant demands of the business to put incredible pressure on the team. If the team can't deliver anything until it's all finished, then they aren't able to show any tangible value until the replacement is launched. The business is in a sense being held captive by the rewrite effort, and it's up to them to stay patient and trust in their development team to finish. Not an easy feat. If the customers are especially demanding or if the team consistently hits unexpected delays and issues (e.g. "unknown unknowns", etc.) that push the release date out for the rewrite, the business might feel like it has to apply pressure (which can force the team to cut corners) or can pull the plug altogether.

Summing Up

In the end, hopefully some of the dangers that lurk in the Rewrite phase are now more visible. While the greenfield stage of development surely has its perils, the rewrite is no walk in the park either. The challenges of parallel development, team organization, feature and technology gold plating, and big-bang deployment can easily drag down the most competent of teams. And so it would seem that we would be apprehensive about taking on a massive rewrite. As we'll see in the next chapter, surprisingly, this isn't always the case. Despite the moral dangers to teams in the rewrite phase, we often put caution to the wind and take the journey of a rewrite anyway.

Rewrite or Refactor - The Risks of Rewrites

The Risks of Rewrites

Parallel Development

Team Schism

Unknown Unknowns

The Second System Effect

Fighting the Last War

All or Nothing

Summing Up

Recommend

分布式哈希表与Kademlia算法

binlog 的并行重放问题

第1-2课：算法设计常用思想之贪婪法

DO Principle #5: Data has a literal representation

Nonsmooth models and spline effects

How to manage time? Time management tips for anyone.

Git Tips - ls-tree 和 ls-files

Go 語言 Select Multiple Channel 注意事項

Update: Services are back online – Multiple Google services are down, it’s not j...

MeasuringU: What a Randomization Test Is and How to Run One in R

About Joyk