
Project Hygiene, Part 2: Combatting Goodhart’s Law and Other “Project Smells”

source link: https://dzone.com/articles/project-hygiene-part-2-combatting-goodharts-law-an

This is a continuation of the Project Hygiene series about best software project practices that started with this article.

Background

“It works until it doesn’t” is a phrase that sounds like a truism at first glance but holds a lot of insight in software development. Take, for instance, the very software that gets produced. There is no shortage of jokes and memes about how the “prettiness” of what the end-user sees when running an application is a mere façade hiding a nightmare of kludges, “temporary” fixes that have become permanent, and other less-than-ideal practices. These get bundled up into a program that works only as far as the developers have planned for; a use case that falls outside of what the application was designed to handle could cause the entire rickety code base to fall apart. When a catastrophe of this kind does occur, a post-mortem is usually conducted to find out just how things went so wrong. Maybe it was some black-swan moment that simply could never have been predicted (and would be unlikely to occur again), but it’s just as possible that there was some issue within the project that never got treated until it was too late.

Code Smells...

Sections of code that may indicate that there are deeper issues within the code base are called “code smells” because, like the milk carton in the fridge that’s starting to give off a bad odor, they should provoke a “there’s something off about this” reaction in a veteran developer. Sometimes, these are relatively benign items, like this Java code snippet:

var aValue = foo.isFizz() ? foo.isFazz() ? true : false : false;

This single line contains two different code smells:

  1. Multiple ternary operators are being used within the same statement: This makes the code hard to reason about and needlessly increases the cognitive load of the code base.
  2. Returning hard-coded boolean values in the ternary statement, itself already being a boolean construction: This is unnecessary redundancy and suggests that there’s a fundamental misunderstanding of what the ternary statement is to be used for. In other words, the developers do not understand the tools that they are using.

Both of these points can be addressed by eliminating the use of ternary statements altogether:

var aValue = foo.isFizz() && foo.isFazz();

Some code smells, however, might indicate an issue that would require a more thorough evaluation and rewrite. For example, take this constructor for a Kotlin class:

Kotlin
class FooService(
   private val fieldOne: SomeServiceOne,
   private val fieldTwo: SomeServiceTwo,
   private val fieldThree: SomeServiceThree,
   private val fieldFour: SomeServiceFour,
   private val fieldFive: SomeServiceFive,
   private val fieldSix: SomeServiceSix,
   private val fieldSeven: SomeServiceSeven,
   private val fieldEight: SomeServiceEight,
   private val fieldNine: SomeServiceNine,
   private val fieldTen: SomeServiceTen,
) {

A constructor that takes in ten arguments for ten different class members is an indicator that the FooService class might be doing too much work within the application; i.e., the so-called “God Object” anti-pattern. Unfortunately, there’s no quick fix this time around. The code architecture would need to be re-evaluated to determine whether some of the functionality within FooService could be transferred to another class within the system, or whether FooService needs to be split up into multiple classes that each conduct different parts of the workflow on their own.
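
As an illustration only, here is a minimal sketch of what such a split might look like. The new class names and the way the dependencies are grouped are hypothetical stand-ins, not a prescription for any real code base:

Kotlin
// Hypothetical split: each class owns only the collaborators it actually needs,
// so no single constructor balloons to ten dependencies.
class FooReportingService(
   private val fieldOne: SomeServiceOne,
   private val fieldTwo: SomeServiceTwo,
   private val fieldThree: SomeServiceThree,
) {
   // reporting-related behavior formerly housed in FooService
}

class FooPersistenceService(
   private val fieldFour: SomeServiceFour,
   private val fieldFive: SomeServiceFive,
) {
   // persistence-related behavior formerly housed in FooService
}

// FooService keeps only the orchestration logic and delegates the rest.
class FooService(
   private val reportingService: FooReportingService,
   private val persistenceService: FooPersistenceService,
) {
   // coordination of the workflow across the delegated services
}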

…And Their “Project” Equivalent

The same concept can be elevated to the level of the entire project around the code being developed: there can exist signs within a project that the team is engaging in practices that could lead to issues down the road. At a quick glance, all may appear to be fine in the project - no fires are popping up, the application is working as desired, and so on - but push just a bit, and the problems quickly reveal themselves, as the examples outlined below demonstrate.

Victims Of Goodhart’s Law

To reduce the likelihood of software quality issues causing problems for a code base and its corresponding applications, the industry has developed tools to monitor the quality of the code that gets written, such as code linters, test coverage analyzers, and more. Without a doubt, these are excellent tools; their effectiveness, however, depends on exactly how they’re being used. Software development departments have leveraged the reporting mechanisms of these code quality tools to produce metrics that function as gates for whether a development task may proceed to the next stage of the software development lifecycle. For example, if a code coverage analysis service like SonarQube reports that the code within a given pull request’s branch only has test coverage for 75% of the code base, then the development team may be prohibited from integrating the code until the test coverage ratio improves. Note the specific wording there: whether the ratio improves, *not* whether more test cases have been added - a difference that frequently comes back to haunt the project’s quality.

For those unfamiliar with “Goodhart’s Law,” it can be briefly summed up as, “When a measure becomes a target, it ceases to be a good measure.” What this means in software development is that teams run the risk of developing their code in exact accordance with the metrics that have been imposed upon the team and/or project. Take the aforementioned case of a hypothetical project needing to improve its test coverage ratio. Working in the spirit of the metric would compel a developer to add more test cases to the existing code so that more parts of the code base are covered, but with Goodhart’s Law in effect, one would only need to improve the ratio however possible, and that could entail:

  • Modifying the code coverage tool’s configuration so that it excludes whole swathes of the code base that do not have test coverage. This may have legitimate use cases - for example, excluding testing a boilerplate web request mechanism because the only necessary tests are for the generated client API that accompanies it - but it can easily be abused to essentially silence the code quality monitor.
  • Generating classes that are ultimately untouched in actual project usage and whose only purpose is to be tested so that the code base has indeed “improved” its code coverage ratio (see the sketch after this list). This has no defensible purpose, but tools like SonarQube will not know that they’re being fooled.
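
To make that second tactic concrete, here is a minimal, purely hypothetical sketch (assuming JUnit 5) of such “coverage padding”: a class that no production code ever calls, written only so that a trivial test can nudge the ratio upward:

Kotlin
import org.junit.jupiter.api.Assertions.assertTrue
import org.junit.jupiter.api.Test

// Hypothetical class that exists purely to be "covered"; nothing in the
// application ever calls it.
class CoveragePadding {
   fun alwaysTrue(): Boolean = true
}

class CoveragePaddingTest {
   @Test
   fun testAlwaysTrue() {
       // The coverage tool now counts these lines as covered, even though
       // nothing of value about the real application has been verified.
       assertTrue(CoveragePadding().alwaysTrue())
   }
}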

Furthermore, there can be issues with the quality of the code quality verification itself. Code coverage signifies only that a line of code gets executed at some point during the tests - nothing more, nothing less. Here’s a hypothetical test case for a web application (in Kotlin):

Kotlin
@Test
fun testGetFoo() {
   val fooObject: FooDTO = generateRandomFoo()
   `when`(fooService.getFoo(1L)).thenReturn(fooObject)
   mockMvc.perform(
       MockMvcRequestBuilders.get("/foo/{fooId}", 1L)
   ).andExpect(status().isOk())
}

This test code is minimally useful for actually verifying the behavior of the controller - the only piece of behavior being verified here is the HTTP status code that the controller endpoint produces - but code coverage tools will nonetheless mark the code within the controller for this endpoint as “covered.” A team that produces this type of testing is not actually checking the quality of its code - it is merely satisfying the requirements imposed on it in the quickest and most convenient way possible - and it ultimately leaves itself open to future issues because its code is never effectively validated.
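
For contrast, a version of the same test that actually verifies the endpoint’s behavior might look something like the sketch below. It assumes FooDTO exposes name and description properties (hypothetical field names) and uses Spring’s jsonPath result matchers alongside the status matcher already shown above:

Kotlin
@Test
fun testGetFooReturnsExpectedBody() {
   val fooObject: FooDTO = generateRandomFoo()
   `when`(fooService.getFoo(1L)).thenReturn(fooObject)
   mockMvc.perform(
       MockMvcRequestBuilders.get("/foo/{fooId}", 1L)
   )
       .andExpect(status().isOk())
       // Assert on the response payload, not just the HTTP status code.
       .andExpect(jsonPath("$.name").value(fooObject.name))
       .andExpect(jsonPath("$.description").value(fooObject.description))
}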

Too Much Of A Good Thing

Manual test execution is fraught with issues: input errors, eating up the limited time that a developer could be using for other tasks, the developer simply forgetting to conduct the testing, and so on. Just as with code quality, the software development world has produced tools (i.e., CI/CD services like Jenkins or CircleCI) that allow tests to be executed automatically in a controlled environment, either at the behest of the developer or (ideally) entirely autonomously upon a configured event like the creation of a pull request within the code base. This brings enormous convenience to the developer and improves the ability to identify potential code quality issues within the project, but its availability can turn into a double-edged sword. It can be easy for a project team to develop an over-dependence on the service, running any and all tests only via the service and never in the developers’ local environments.

In one Kotlin-based Spring Boot project that I had just joined, running the code base’s test suite on my machine would always fail because certain tests did not pass, yet:

  • I had pulled the main code branch, hadn’t modified the code, and had followed all build instructions.
  • The code base was passing all tests on the CI/CD server.
  • No other coworkers were complaining about the same issue.

Almost all the other coworkers on the project were using Macs, whereas I had been assigned a Windows machine as my work laptop. Another coworker had a Windows machine as well, yet even they weren’t complaining about the tests’ constant failures. Upon asking this coworker how they were able to get these tests to pass, I received an unsettling answer: they never ran the tests locally, instead letting the CI/CD server do all the testing for them whenever they had to change code.

As bad as the answer was in terms of the quality of the project’s development - refusing to run tests locally meant a longer turnaround time between writing code and checking it, ultimately reducing programmer productivity - it at least gave me the lead to check on the difference between Windows and non-Windows environments with regard to the failing tests. Ultimately, the issue boiled down to the system clock: Instant.now() calls on *nix platforms like Linux and Mac were returning values with microsecond-level precision, whereas Instant.now() calls on Windows were returning values with nanosecond-level precision. Time comparisons in the failing tests were based on hard-coded values; since the developers were almost all using *nix-based Mac environments, those values were written with the lesser time precision, so when the tests ran in an environment with more precise time values, they failed. After forcing microsecond-based precision for all time values within the tests - and updating the project’s documentation to indicate as much for future developers - the tests passed, and both my Windows-based colleague and I could finally run all tests locally.
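
For reference, here is a minimal sketch (assuming only the standard java.time APIs) of the kind of fix involved: either truncate individual values or derive all Instants from a clock that never produces sub-microsecond components, so that values compare the same way on every platform:

Kotlin
import java.time.Clock
import java.time.Duration
import java.time.Instant
import java.time.temporal.ChronoUnit

// Option 1: truncate each value to microsecond precision before storing or comparing it.
val truncatedNow: Instant = Instant.now().truncatedTo(ChronoUnit.MICROS)

// Option 2: derive all Instants from a clock that ticks in whole microseconds,
// so no platform ever produces finer-grained values.
val microsecondClock: Clock = Clock.tick(Clock.systemUTC(), Duration.ofNanos(1_000))
val clockNow: Instant = Instant.now(microsecondClock)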

Ignoring Instability

In software terminology, a test is deemed “unstable” if it has an inconsistent result: it might pass just as easily as it might fail when executed, despite no changes having been made to the test or the affected code between executions. Any of a multitude of causes could be at fault: insufficient thread safety within the code, for example, could lead to a race condition wherein result A or result B is produced depending on how the machine interleaves the threads (a minimal sketch of such a case follows the list below). What’s important is how the project development team opts to treat the issue. The “ideal” solution, of course, would be to investigate the unstable test(s) and determine what is causing the instability. However, there exist other “solutions” to test instability as well:

  • Disable the unstable tests. Like code coverage exclusions, this does have utility under the right circumstances; e.g., when the instability is caused by an external factor that the development team has no ability to affect.
  • Have the CI/CD service re-execute the code base’s tests until all tests eventually pass. Even this has valid applications - the CI/CD service might be subject to freak occurrences that cause one-off test failures, for example - although it’s far likelier that the team just wants to get past the “all tests pass” gate of the software development lifecycle as quickly as possible.
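
To make the earlier point about thread safety concrete, here is a minimal, hypothetical sketch (assuming JUnit 5) of an unstable test: the unsynchronized counter loses updates on some thread interleavings, so the assertion passes on some runs and fails on others:

Kotlin
import org.junit.jupiter.api.Assertions.assertEquals
import org.junit.jupiter.api.Test
import kotlin.concurrent.thread

// Hypothetical class with an unsynchronized read-modify-write operation.
class UnsafeCounter {
   var count = 0
   fun increment() { count += 1 }
}

class UnsafeCounterTest {
   @Test
   fun testParallelIncrements() {
       val counter = UnsafeCounter()
       val workers = (1..4).map { thread { repeat(10_000) { counter.increment() } } }
       workers.forEach { it.join() }
       // Lost updates make the final count come up short on some runs,
       // so this test passes or fails depending on thread scheduling.
       assertEquals(40_000, counter.count)
   }
}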

The second of these “solutions” was an issue that I came across in the same Kotlin-based Spring Boot project as above. While it was possible to execute all tests for the code base and have them pass, it was almost as likely that some set of tests would fail, with the failing tests differing from run to run. Just as before, the convenience of the CI/CD testing automation meant that, upon receiving the failing-test notification, all one needed to do was hit the “relaunch process” button and go back to one’s other tasks while the CI/CD service re-ran the build and testing process. These seemingly random test failures were occurring frequently enough that it was evident some issue was plaguing the test code, yet the team never confronted the instability head-on; relying on what was, essentially, a roll of the dice with the CI/CD service prolonged the software development lifecycle unnecessarily and reduced the team’s productivity.

After conducting an investigation of the different tests that were randomly failing, I ultimately discovered that the automated database transaction rollback mechanism (via the @Transactional annotation for the tests) was not working as the developers had expected. This was due to two issues:

  1. Some code required that another database transaction be opened (via the @Transactional(propagation = Propagation.REQUIRES_NEW) annotation), and this new transaction fell outside of the reach of the “automated rollback” test transaction.
  2. Database transactions run within Kotlin coroutines were not being included within the test’s transaction mechanism, as those coroutines were being executed outside of the “automated rollback” test transaction’s thread.

As a result, some data artifacts were being left in the database tables after certain tests; this “dirty” database was subsequently causing failures in other tests. Re-running the tests in CI/CD meant that these problems were being ignored in favor of simply brute-forcing an “all tests pass” outcome; in addition, the over-reliance on CI/CD for running the code base’s tests in general meant that there was no database that the team’s developers could investigate, as the CI/CD service would erase the test database layer after executing the testing job.
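
As an illustration of how both escape routes can leave data behind, here is a minimal, hypothetical sketch (the service, repository, and entity names are invented for the example). A @Transactional test that calls either method will roll back its own transaction, but not the REQUIRES_NEW transaction or the write performed on the coroutine’s thread:

Kotlin
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.launch
import org.springframework.stereotype.Service
import org.springframework.transaction.annotation.Propagation
import org.springframework.transaction.annotation.Transactional

// Hypothetical types standing in for the project's real entities and repositories.
class AuditEntry
interface AuditRepository { fun save(entry: AuditEntry): AuditEntry }

@Service
class FooAuditService(private val auditRepository: AuditRepository) {

   // REQUIRES_NEW opens a second transaction that commits on its own;
   // the test's "automated rollback" transaction never touches it.
   @Transactional(propagation = Propagation.REQUIRES_NEW)
   fun recordAudit(entry: AuditEntry) {
       auditRepository.save(entry)
   }

   // Work launched in a coroutine runs on another thread, outside the test's
   // transactional context, so its write also survives the test's rollback.
   fun recordAuditAsync(entry: AuditEntry, scope: CoroutineScope) {
       scope.launch(Dispatchers.IO) {
           auditRepository.save(entry)
       }
   }
}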

A Proposal

So, how do we go about discovering and rectifying such issues? Options might appear to be limited at first glance.

  • Automated tools like SonarQube or CI/CD services don’t have a “No, not like that!” setting that detects when their own usefulness has been deliberately blunted by the development team’s practices. Even if breakthroughs in artificial intelligence were to produce some sort of meta-analysis capability, the ability for a team to configure exceptions would still need to exist. If a team has to fight against its tools too much, it’ll look for others that are more accommodating. Plus, Goodhart’s Law will still reign supreme, and one should never underestimate a team’s ability to work around statically-imposed guidelines to implement what it deems necessary.
  • Spontaneous discovery and fixing of project smells within the project team - that is, discovery not brought on by the investigation that follows some catastrophe - is unlikely to occur. The development team’s main goal is going to be providing value to the company via the code that it develops; fixing project smells like unstable tests does not have the same demonstrable value as deploying new functionality to production. Besides, it’s possible that the team - being so immersed in its environment and day-to-day practices - is simply unaware that there’s any issue at all. The fish, as they say, is the last to discover water!

A better approach would be to have an outside party look at the project and how it’s being developed from time to time. A disinterested reviewer whose dedicated task is to find project smells has a better chance of rooting these issues out than people within the project, whose principal work assignment is to keep actively developing it. Take, say, a group of experienced developers from across the entire product development department and form a sort of “task force” that inspects the department’s different projects every three or six months. This team would examine, for example:

  • The quality of the code coverage and whether the tests can be improved
  • Whether the tests - and the application itself! - can be run on all platforms that the project is supposed to support; i.e., both on a developer’s machine as well as in dedicated testing and production environments
  • How frequently unstable test results occur and how those unstable tests are resolved

Upon conducting this audit of the project, the review team would present its findings to the project development team and make suggestions for how to address any issues it has found. In addition, the review team would ideally conduct a follow-up with the development team to understand the context of how such project smells came about. Such causes might include:

  • A lack of awareness of better practices, both project-wise and within the code being developed.
  • Morale within the team being low enough that it only does the bare minimum to get the assigned functionality out the door.
  • Company requirements overburdening the team such that it can *only* do the bare minimum to get the assigned functionality out the door.

This follow-up would be vital to help determine what changes can be made to both the team’s and the department’s practices in order to reduce the likelihood of such project smells recurring, as the three underlying causes listed above - along with other potential causes - would produce significantly different recommendations for improvement compared to one another.

A Caveat

As with any review process - such as code reviews or security audits - there must never be blame placed on one or more people in the team for any problematic practices that have been uncovered. The objective of the project audit is to identify what can be improved within the project, not to “name and shame” developers and project teams for supposedly “lazy” practices or other project smells. Some sense of trepidation about one’s project being audited would already be present - nobody wants to hear that they’ve been working in a less-than-ideal way, after all - but making the process an event to be dreaded would make developer morale plummet. In addition, it’s entirely possible that the underlying reasons for the project smells were ultimately out of the development team’s hands. A team that’s severely under-staffed, for example, might already be fully occupied frantically achieving its base objectives; anything beyond that, like well-written tests or ensuring testability on a developer’s machine, would be a luxury.

Conclusion

Preventative care can be a hard sell, as its effectiveness is measured by the absence of negative events at some point in the future - something that cannot be effectively predicted. If a company were presented with this proposal to create an auditing task force to detect and handle project smells, it might object that its limited resources are better spent on continuing the development of new functionality. The return on investment for a brand-new widget, after all, is much easier to quantify than that of preventing abstract (to the non-development departments in the company, at least) issues that may or may not cause problems sometime down the line. Furthermore, if everything’s “working,” why waste time changing it? To repeat the phrase from the beginning of this article, “It works until it doesn’t”, and that “doesn’t” could range from one team having to fix an additional set of bugs in a development sprint to a multi-day outage that costs the company considerable revenue (or much worse!). The news is replete with stories about companies that neglected issues within their software development departments until those issues blew up in their faces. Some companies that have avoided such an event so far have simply been lucky that things have played out in their favor; it is a far safer bet to be proactive in hunting down project smells and to avoid one’s fifteen minutes of infamy.

