ML Model Testing: 4 Teams Share How They Test Their Models

Despite the progress of the machine learning industry in developing solutions that help data teams and practitioners operationalize their machine learning models, testing these models to make sure they’ll work as intended remains one of the most challenging aspects of putting them into production. 

Most processes used to test ML models for production are native to traditional software applications, not machine learning applications. When starting a machine learning project, it’s standard to take careful note of the business, technical, and dataset requirements. Still, teams often defer the testing requirements until they are ready to deploy, or skip testing before deployment altogether.

How do teams test machine learning models?

With ML testing, you are asking the question: “How do I know if my model works?” Essentially, you want to ensure that your learned model will behave consistently and produce the results you expect from it. 

Unlike traditional software applications, it is not straightforward to establish a standard for testing ML applications because the tests do not just depend on the software, they also rely on the business context, problem domain, dataset used, and the model selected. 

While most teams are comfortable with using the model evaluation metrics to quantify a model’s performance before deploying it, these metrics are mostly not enough to ensure your models are ready for production. You also need to perform thorough testing of your models to ensure they are robust enough for real-world encounters.

This article will teach you how various teams perform testing for different scenarios. At the same time, it’s worth noting that this article should not be used as a template (because ML testing is problem-dependent), but rather as a guide to the types of test suites you might want to try out for your application, based on your use case.

Developing, testing, and deploying machine learning models | Source

Small sidenote

The information shared in this article is based on interviews I had with team representatives who either worked on, or are still working with, teams that perform testing for their ML projects.

If you feel anything needs to be updated in the article or have any concerns, do not hesitate to reach out to me on LinkedIn.

1. Combining automated tests and manual validation for effective model testing

Organization

GreenSteam – an i4 insight company

Industry

Computer software

Machine learning problem

Various ML tasks

Thanks to Tymoteusz Wolodzko, a former ML Engineer at GreenSteam, for granting me an interview. This section draws on both the responses Tymoteusz gave during the interview and his case study blog post on the Neptune.ai blog.

Business use case

GreenSteam – An i4 Insight Company provides software solutions for the marine industry that help reduce fuel usage. Excess fuel usage is both costly and bad for the environment, and vessel operators are obliged by the International Maritime Organization to become greener and reduce CO2 emissions by 50% by 2050.

GreenSteam – an i4 insight company dashboard mock | Source

Testing workflow overview

To perform ML testing in their projects, this team had a few levels of test suites, as well as validation:

  • Automated tests for model verification,
  • Manual model evaluation and validation.

To implement automated tests in their workflow, the team leveraged GitOps, with Jenkins running code quality checks and smoke tests through production-like runs in the test environment. As a result, the team had a single pipeline for model code where every pull request went through code reviews and automated unit tests.

The pull requests also went through automated smoke tests. The automated test suites’ goal was to make sure tests flagged erroneous code early in the development process.

After the model pipeline ran and passed the automated tests, a domain expert manually reviewed the evaluation metrics to make sure that they made sense, validated them, and marked the model ready for deployment.

Automated tests for model verification

The workflow for the automated tests was that whenever someone on the team made a commit, the smoke test would run to ensure the code worked, then the unit tests would run, making sure that the assertions in the code and data were met. Finally, the integration tests would run to ensure the model worked well with other components in the pipeline.

Automated smoke test

Every pull request went through automated smoke tests where the team trained models and made predictions, running the entire end-to-end pipeline on some small chunk of actual data to ensure the pipeline worked as expected and nothing broke. 
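
As an illustration, a smoke test of this kind could be written as a pytest check that runs the whole pipeline end to end on a small data sample. This is only a sketch: the my_project.pipeline module, its functions, and the cheap hyperparameters are hypothetical placeholders, not GreenSteam’s actual code.

```python
# test_smoke.py -- a sketch of an end-to-end smoke test.
# `load_sample`, `train_pipeline`, and `predict` are hypothetical
# placeholders for the project's own pipeline entry points.
import pandas as pd

from my_project.pipeline import load_sample, train_pipeline, predict  # hypothetical module


def test_pipeline_end_to_end_on_small_sample():
    # Use a small slice of real data so the whole run stays fast.
    data = load_sample(n_rows=500)

    # Train with cheap hyperparameters; the goal is "does it run", not accuracy.
    model = train_pipeline(data, n_estimators=5, max_iter=10)

    # Predictions should come back for every row and contain no NaNs.
    preds = predict(model, data)
    assert len(preds) == len(data)
    assert not pd.isna(preds).any()
```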

The right kind of testing for the smoke suite can give any team a chance to understand the quality of their pipeline before deploying it. Still, passing the smoke test suite does not guarantee that the entire pipeline is fully working just because the code ran. So the team also relied on the unit test suite to test data and model assumptions.

Automated unit and integration tests

The unit and integration tests the team ran were to check some assertions about the dataset to prevent low-quality data from entering the training pipeline and prevent problems with the data preprocessing code. You could think of these assertions as assumptions the team made about the data. For example, they would expect to see some kind of correlation in the data or see that the model’s prediction bounds are non-negative.
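
A minimal sketch of what such data and model assertions could look like as pytest tests is shown below. The column names, the correlation threshold, and the dataset/model fixtures are illustrative assumptions, not the team’s real checks.

```python
# Sketch of data- and model-assumption checks in the spirit described above.
# Column names, the correlation threshold, and the `dataset`/`model` pytest
# fixtures are illustrative assumptions.
import numpy as np


def test_fuel_speed_correlation(dataset):
    # Expect a positive correlation between vessel speed and fuel consumption.
    corr = np.corrcoef(dataset["speed"], dataset["fuel_consumption"])[0, 1]
    assert corr > 0.3, f"unexpectedly weak correlation: {corr:.2f}"


def test_prediction_bounds_are_non_negative(model, dataset):
    # The model should never predict negative fuel consumption.
    preds = model.predict(dataset.drop(columns=["fuel_consumption"]))
    assert (preds >= 0).all(), "fuel consumption predictions should never be negative"
```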

Unit testing machine learning code is more challenging than unit testing typical software code, and several aspects of the model code were difficult for the team to unit test. For example, to test them accurately, they would have to train the model, and even with a modest dataset, a unit test could take a long time.

Furthermore, some of the tests were erratic and flaky (they failed at random). One of the challenges of running the unit tests to assert data quality was that running these tests on sample datasets was more complex, even though it took far less time than running them on the entire dataset. This was difficult for the team to fix, so to address the issue, they opted to eliminate part of the unit tests in favour of smoke tests.

The team defined acceptance criteria, and their test suite kept evolving as they experimented, adding new tests and removing others as they learned what was working and what wasn’t.

They would train the model in a production-like environment on a complete dataset for each new pull request, except that they would set the hyperparameters to values that produced results quickly. Finally, they would monitor the pipeline’s health and catch any issues early.

GreenSteam – an i4 insight company MLOps toolstack including testing tools | Source

Manual model evaluation and validation

“We had a human-in-the-loop framework where after training the model, we were creating reports with different plots showing results based on the dataset, so the domain experts could review them before the model could be shipped.”

Tymoteusz Wołodźko, a former ML Engineer at GreenSteam 

After training the model, a domain expert generated and reviewed a model quality report. The expert would approve (or reject) the model through a manual auditing process; only after this validation, and after passing all previous tests, could the team eventually ship the model to production.
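
As a rough sketch, a human-in-the-loop report of this kind could be produced as follows; the metrics, the plot, and the file name are illustrative assumptions, not GreenSteam’s actual reporting code.

```python
# Sketch: assemble a post-training report for domain-expert review.
# The chosen metrics and the predicted-vs-observed plot are illustrative.
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error, r2_score


def build_review_report(y_true, y_pred, path="model_review_report.png"):
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)

    fig, ax = plt.subplots(figsize=(6, 6))
    ax.scatter(y_true, y_pred, alpha=0.3)
    ax.set_xlabel("Observed fuel consumption")
    ax.set_ylabel("Predicted fuel consumption")
    ax.set_title(f"MAE={mae:.2f}, R2={r2:.2f} (for domain expert review)")
    fig.savefig(path)

    # The domain expert reviews the plot and metrics before sign-off.
    return {"mae": mae, "r2": r2, "report_path": path}
```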

2. Approaching machine learning testing for a retail client application

Organization

Undisclosed

Industry

Retail and consumer goods

Machine learning problem

Classification

Business use case

This team helped a retail client resolve tickets in an automated way using machine learning. When users raise tickets, or when tickets are generated by maintenance problems, the application uses machine learning to classify them into different categories, which helps resolve them faster.

Testing workflow overview

This team’s workflow for testing models involved generating builds in the continuous integration (CI) pipeline upon every commit. The build pipeline would also run a code quality test (a linting test) to ensure there were no code problems.

Once the pipeline generated the build (a container image), the models were stress-tested in a production-like environment through the release pipelines. Before deployment, the team would also occasionally carry out A/B testing on the model to evaluate performance in varying situations.

After the team deployed the pipeline, they would run deployment and inference tests to ensure it did not break the production system and the model continuously worked correctly.

Let’s take an in-depth look at some of the team’s tests for this use case.

Code quality tests

Running tests to check code quality is crucial for any software application. You always want to test your code to make sure that it is: 

  • Correct,
  • Reliable (doesn’t break in different conditions),
  • Secure,
  • Maintainable, 
  • and highly performant.

This team performed linting tests on their code before any container image builds in the CI pipeline. The linting tests ensured that they could enforce coding standards and high-quality code to avoid code breakages. Performing these tests also allowed the team to catch errors before the build process (when they are easy to debug).

A screenshot showing a mock example of linting tests | Source
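
As a rough illustration, a linting gate of this kind could be a small script that the CI pipeline runs before building the container image. The use of flake8, the source directory name, and the line-length limit below are assumptions, not the team’s actual configuration.

```python
# Sketch of a CI linting gate: fail the build if flake8 reports any issue.
# Assumes flake8 is installed; the source directory name is an assumption.
import subprocess
import sys


def run_lint_gate(src_dir: str = "src") -> None:
    result = subprocess.run(
        [sys.executable, "-m", "flake8", src_dir, "--max-line-length", "100"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        print(result.stdout)
        raise SystemExit("Linting failed: fix the issues above before building the image.")


if __name__ == "__main__":
    run_lint_gate()
```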

A/B testing machine learning models

“Before deploying the model, we sometimes do the A/B testing, not every time, depending on the need.”

Emmanuel Raj, Senior Machine Learning Engineer

Depending on the use case, the team also carried out A/B tests to understand how their models performed in varying conditions before they deployed them, rather than relying purely on offline evaluation metrics. With what they learned from the A/B tests, they knew whether a new model improved a current model and tuned their model to optimize the business metrics better.
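
Below is a minimal sketch of how such an A/B comparison might be wired up: incoming tickets are randomly routed to one of two model versions, and the success counts are later compared with a chi-square test. The routing rule, the notion of a “successfully resolved ticket”, and the model interface are assumptions, not the team’s actual setup.

```python
# Sketch of an A/B comparison between two model versions.
# The routing rule and the "ticket resolved correctly" signal are assumptions.
import random

from scipy.stats import chi2_contingency


def route(ticket_text, model_a, model_b):
    """Randomly assign each incoming ticket to one of the two model versions."""
    variant = "A" if random.random() < 0.5 else "B"
    model = model_a if variant == "A" else model_b
    return variant, model.predict([ticket_text])[0]


def compare_variants(success_a, total_a, success_b, total_b):
    """Chi-square test on success counts; a small p-value suggests a real difference."""
    table = [
        [success_a, total_a - success_a],
        [success_b, total_b - success_b],
    ]
    _, p_value, _, _ = chi2_contingency(table)
    return p_value
```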

Stress testing machine learning models

“We use the release pipelines to stress test the model, where we bombard the deployment of the model with X number of inferences per minute. The X can be 1000 or 100, depending on our test. The goal is to see if the model performs as needed.”

Emmanuel Raj, Senior Machine Learning Engineer

Testing the model’s performance under extreme workloads is crucial for business applications that typically expect high traffic from users. Therefore, the team performed stress tests to see how responsive and stable the model would be under an increased number of prediction requests at a given time scale. 

This way, they benchmarked the model’s scalability under load and identified the breaking point of the model. In addition, the test helped them determine if the model’s prediction service meets the required service-level objective (SLO) with uptime or response time metrics.

It is worth noting that the point of stress testing the model isn’t so much to see how many inference requests the model could handle as to see what would happen when users exceed such traffic. This way, you can understand the model’s performance problems, including the load time, response time, and other bottlenecks.
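
A sketch of such a stress test is shown below: it fires a burst of inference requests at the prediction service from a thread pool and reports the p95 latency and the failure rate. The endpoint URL, payload, request volume, and concurrency are hypothetical.

```python
# Sketch of a stress test: send many inference requests and record
# latency and failure rate. The endpoint URL and payload are assumptions.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

ENDPOINT = "http://model-service.internal/predict"  # hypothetical endpoint


def one_request(payload):
    start = time.perf_counter()
    try:
        resp = requests.post(ENDPOINT, json=payload, timeout=5)
        ok = resp.status_code == 200
    except requests.RequestException:
        ok = False
    return time.perf_counter() - start, ok


def stress_test(payload, n_requests=1000, workers=50):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(one_request, [payload] * n_requests))
    latencies = sorted(lat for lat, _ in results)
    failures = sum(1 for _, ok in results if not ok)
    p95 = latencies[int(0.95 * len(latencies))]
    print(f"p95 latency: {p95:.3f}s, failure rate: {failures / n_requests:.1%}")
```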

Testing model quality after deployment

“In production after deploying the model, we test the data and model drifts. We also do the post-production auditing; we have quarterly auditing to study the operations.”

Emmanuel Raj, Senior Machine Learning Engineer

The goal of testing production models is to ensure that the deployment of the model is successful and that the model works correctly in production together with other services. For this team, testing the inference performance of the model in production was a crucial process for continuously providing business value.

In addition, the team tested for data and model drift to make sure models could be monitored and perhaps retrained when such drift was detected. On another note, testing production models can enable teams to perform error analysis on their mission-critical models through manual inspection from domain experts.

An example of a dashboard showing information on data drift for a machine learning project in Azure ML Studio | Source
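
Dashboards like the example above can surface drift; as a library-agnostic illustration, a simple data-drift check could compare the training and production distributions of a feature with a two-sample Kolmogorov–Smirnov test. The feature choice and the significance level below are assumptions.

```python
# Sketch of a simple data-drift check: compare a feature's production
# distribution against its training distribution with a two-sample KS test.
# The 0.05 significance level is an assumption.
from scipy.stats import ks_2samp


def detect_drift(train_values, prod_values, alpha=0.05):
    statistic, p_value = ks_2samp(train_values, prod_values)
    return {
        "ks_statistic": statistic,
        "p_value": p_value,
        "drift_detected": p_value < alpha,
    }


# Example with a hypothetical feature:
# report = detect_drift(train_df["ticket_length"], prod_df["ticket_length"])
```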

3. Behavioural tests for machine learning applications at a Fin-tech startup

Organization

MonoHQ

Industry

Fin-tech

Machine learning problem

Natural language processing (NLP) and classification tasks

Thanks to Emeka Boris for granting me an interview and reviewing this excerpt before publication.

Business use case

The transaction metadata product at MonoHQ uses machine learning to classify transaction statements so that they are useful for a variety of corporate customer applications, such as credit applications, asset planning/management, BNPL (buy now, pay later), and payments. Based on the narration, the product classifies transactions for thousands of customers into different categories.

MonoHQ | Source

Testing workflow overview

Before deploying the model, the team conducts a behavioral test. This test consists of 3 elements:

  • Prediction distribution,
  • Failure rate,
  • Latency.

If the model passes the three tests, the team lists it for deployment; if it does not, they rework it until it passes. They always ensure that they set a performance threshold as a metric for these tests.
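
A sketch of what such a three-part gate could look like is shown below; the thresholds, the reference class frequencies, and the model interface are illustrative assumptions rather than MonoHQ’s actual test suite.

```python
# Sketch of a three-part pre-deployment gate: prediction distribution,
# failure rate, and latency. All thresholds here are illustrative.
import time

import numpy as np


def behavioural_gate(model, samples, reference_class_freq,
                     max_failure_rate=0.02, max_avg_latency=0.1):
    preds, failures, latencies = [], 0, []
    for text in samples:
        start = time.perf_counter()
        try:
            preds.append(model.predict([text])[0])
        except Exception:
            failures += 1
        latencies.append(time.perf_counter() - start)

    # 1. Prediction distribution should stay close to the reference frequencies.
    observed = {c: preds.count(c) / max(len(preds), 1) for c in reference_class_freq}
    dist_ok = all(abs(observed[c] - reference_class_freq[c]) < 0.1
                  for c in reference_class_freq)

    # 2. Failure rate and 3. average latency must stay below their thresholds.
    failure_ok = failures / len(samples) <= max_failure_rate
    latency_ok = np.mean(latencies) <= max_avg_latency
    return dist_ok and failure_ok and latency_ok
```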

They also perform A/B tests on their models to learn what version is better to put into the production environment.

Behavioural tests to check for prediction quality

This test shows how the model responds to inference data; it is especially relevant for NLP models.

  1. First, the team runs an invariance test, introducing small perturbations to the input data.
  2. Next, they check if the slight change in the input affects the model response—its ability to correctly classify the narration for a customer transaction. 

Essentially, the question they are trying to answer is: does a minor tweak to the data, with a similar context, produce consistent output?
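
An illustrative invariance test is sketched below: small, meaning-preserving perturbations of a transaction narration should leave the predicted category unchanged. The example narration, the perturbations, and the model fixture are assumptions, not MonoHQ’s actual tests.

```python
# Sketch of an invariance test: small, meaning-preserving perturbations of a
# transaction narration should not change the predicted category.
# The example narration and the `model` pytest fixture are assumptions.
def perturbations(narration: str):
    yield narration.lower()
    yield narration.upper()
    yield f"  {narration}  "                        # extra whitespace
    yield narration.replace("transfer", "trnsfer")  # minor typo


def test_classification_is_invariant_to_small_perturbations(model):
    narration = "POS transfer to grocery store Lagos"
    expected = model.predict([narration])[0]
    for variant in perturbations(narration):
        assert model.predict([variant])[0] == expected, (
            f"prediction changed for perturbed input: {variant!r}"
        )
```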

Performance testing for machine learning models

To test the response time of the model under load, the team configures a testing environment where they would send a lot of traffic to the model service. Here’s their process:

  • They take a large transaction dataset,
  • Create a table, 
  • Stream the data to the model service,
  • Record the inference latency,
  • And finally, calculate the average response time for the entire transaction data.

If the response time meets the specified latency threshold, the model is up for deployment. If it doesn’t, the team would have to rework the model to improve it, or devise another strategy to deploy the model that reduces the latency. 

A/B testing machine learning models

“We do A/B tests to see which version of the model is most optimal to be deployed.” 

Emeka Boris, Senior Data Scientist at MonoHQ.

For this test, the team containerizes two models and deploys them to the production system for upstream services to consume. They deploy one of the models to serve traffic from a random sample of users and the other to a different sample of users, so they can measure the real impact of each model’s results on their users. In addition, they can tune their models using their real customers and measure how those customers react to the model predictions. 

This test also helps the team avoid introducing complexity from newly trained models that are difficult to maintain and add no value to their users.

4. Performing engineering and statistical tests for machine learning applications

Organization

Arkera

Industry

FinTech – Market intelligence

Machine learning problem

Various ML tasks

Thanks to Laszlo Sragner for granting me an interview and reviewing this excerpt before publication.

Business use case

A system that processes news from emerging markets to provide intelligence to traders, asset managers, and hedge fund managers. 

Arkera.ai LinkedIn cover image
Arkera.ai LinkedIn cover image | Source

Testing workflow overview

This team performed two types of tests on their machine learning projects:

  • Engineering-based tests (unit and integration tests),
  • Statistical-based tests (model validation and evaluation metrics)

The engineering team ran the unit tests and checked whether the model threw errors. To make this possible, the data team would hand off to the engineering team a mock model with the same input-output relationship as the model they were building. The engineering team would also test this mock model to ensure it did not break the production system, and then serve it until the correct model from the data team was ready.

Once the data team and stakeholders evaluate the model and validate that it is ready for deployment, the engineering team runs an integration test with the original model. Finally, if it works, they swap the mock model for the original model in production.

Engineering-based test for machine learning models

Unit and integration tests

To run an initial test to check if the model will integrate well with other services in production, the data team will send a mock (or dummy) model to the engineering team. The mock model has the same structure as the real model, but it only returns random output. The engineering team will write the service for the mock model and prepare it for testing.

The data team will provide data and input structures to the engineering team to test whether the input-output relationships match what they expect, whether the inputs arrive in the correct format, and whether anything throws errors. 

The engineering team does not check whether that model is the correct model; they only check if it works from an engineering perspective. They do this to ensure that when the model goes into production, it will not break the product pipeline.
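
A minimal sketch of such a mock model is shown below; the class names and the predict interface are assumptions chosen for illustration, not Arkera’s actual code.

```python
# Sketch of a mock model: same interface and output shape as the real model,
# but it returns random class labels. Class names and interface are assumptions.
import random


class MockNewsClassifier:
    """Stand-in that honours the real model's input-output contract."""

    CLASSES = ["politics", "markets", "central-banks", "commodities"]

    def predict(self, texts):
        # Random outputs: enough to exercise the serving path, useless for accuracy.
        return [random.choice(self.CLASSES) for _ in texts]


# The engineering team can wire this into the serving code and run its tests;
# the real, validated model is swapped in later without changing the interface.
```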

When the data team trains and evaluates the correct model and stakeholders validate it, the data team will package it and hand it off to the engineering team. The engineering team will swap the mock model with the correct model and then run integration tests to ensure that it works as expected and does not throw any errors.

Statistical-based test for machine learning models

Model evaluation and validation

The data team would train, test, and validate their model on real-world data and statistical evaluation metrics. The head of data science audits the results and approves (or denies) the model. If there is evidence that the model is the correct model, the head of data science will report the results to the necessary stakeholders. 

He will explain the results, the inner workings of the model, the risks of the model, and the errors it makes, and confirm whether the stakeholders are comfortable with the results or the model still needs to be reworked. If the model is approved, the engineering team swaps the mock model with the original model, reruns an integration test to confirm that it does not throw any errors, and then deploys it.

Conclusion

As you have hopefully learned from these use cases and workflows, model evaluation metrics alone are not enough to ensure your models are ready for production. You also need to perform thorough testing of your models to ensure they are robust enough for real-world encounters.

Developing tests for ML models can help teams systematically analyze model errors and detect failure modes, so that resolution plans can be made and implemented before the models are deployed to production.


Stephen Oladele

Data Scientist by day, collaboratively leading nonprofit organizations by night, (technical) content creation on the weekends | TA @TGB | Podcast Host & Technical Author | AWS Community Builder
