Sadik Bakiu

A practical guide to A/B Testing in MLOps with Kubernetes and seldon-core

Photo by Jens Lelie on Unsplash

Many companies are using data to drive their decisions. The aim is to remove uncertainties, guesswork, and gut feeling. A/B testing is a methodology that can be applied to validate a hypothesis and steer the decisions in the right direction.

In this blog post, I want to show how to create a containerized microservice architecture that is easy to deploy, monitor, and scale every time we run A/B tests. The focus is on the infrastructure and on automating the deployment of the models rather than on the models themselves, so I will not explain the details of the model design. For illustration purposes, the model used is one created in this post by Alejandro Saucedo. The model performs text classification based on a Reddit moderation dataset. How the model is built and how it works is described in the original post.

In case you already know how A/B testing works, feel free to jump to the A/B Testing in Kubernetes with seldon-core section.

What is A/B testing?

A/B testing, at its core, compares two variants of one variable and determines which variant performs better. The variable can be anything. For instance, for a website, it can be the background color. We can use A/B testing to validate if changing the background color invites the users to stay longer on the website. For recommendation engines, it can be whether changing the model parameters generates more revenue.

A/B testing has been used since the early twentieth century to improve the success of advertising campaigns. Since then, it has been adopted in medicine, marketing, SaaS, and many other domains. Notably, A/B testing was used in Obama’s 2012 campaign to increase voters’ donations.

How Does A/B Testing Work?

To run an A/B test, you need to create two versions of the same component that differ in only one variable. In the case of selecting the most inviting background color for our website, version A is the current color (e.g. red), also known as the Control, and version B is the new color (e.g. blue), called the Variation. Users are randomly shown either version A or version B.

Split traffic between version A and B in A/B testing.
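To make the split concrete, here is a minimal sketch of how users could be assigned to a version. It swaps pure randomness for a hash of the user ID, so the same user always sees the same version while the overall split stays close to 50/50. The function name, the experiment label, and the user IDs are hypothetical and not part of the original setup.

```python
import hashlib
from collections import Counter


def assign_variant(user_id: str, experiment: str = "background-color",
                   share_b: float = 0.5) -> str:
    """Deterministically map a user to 'A' (Control) or 'B' (Variation)."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes of the hash as a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "B" if bucket < share_b else "A"


# Assign 10,000 made-up users and count how many land in each version.
counts = Counter(assign_variant(f"user-{i}") for i in range(10_000))
print(counts)
```

Including the experiment label in the hash keeps assignments independent across experiments, so a user who lands in B for one test is not automatically in B for the next.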

Before we can pick the better performing version, we need a way to objectively measure the results. For this, we have to choose (at least) one metric to analyze. A meaningful metric in this example would be user session duration. Longer sessions mean users are using the website more. In our analysis, we search for the variation with longer user sessions. If we are happy with the results, we can choose to deploy this version to all the users.

How Does A/B Testing Apply in Machine Learning?

In Machine Learning, as in almost every engineering field, there are no silver bullets. We need to test to find a better solution. Searching for the data science technique that fits our data best means we have to test many of them. To optimize the model, we need to experiment with many parameters or parameter sets.

A/B testing is an excellent strategy to validate whether a new model or a new version outperforms the current one, all while testing with actual users. The improvements can be in terms of F1 score, user engagement, or response time.

A/B Testing Process

Let’s consider the case when we want to test the model’s performance. To run A/B testing, we need to:

  1. Define the hypothesis. In this step, we decide what it is we want to test. In our case, we want to test whether the new model version provides faster predictions.

  2. Define metric(s). To pick the better version in the test, we need to have a metric to measure. In this example, we will measure the time spent from the user’s request arrival until it gets a response.

  3. Create variant. Here we need to create a deployment that uses the new model version.

  4. Run experiment. After preparing the variant, we can run the experiment, ideally for a defined time frame.

  5. Collect data. While the experiment runs, we must collect the data needed to measure the metric(s) from step 2.

  6. Analyze results. All that is left now is to analyze the data and pick the variant that yields better results.
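For the last step, a simple way to check whether a difference in response time is more than noise is a two-sample test. The sketch below computes Welch's t statistic using only the standard library. It is one possible analysis, not something the original setup prescribes, and the latency values are made up for illustration.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and the approximate degrees of freedom."""
    mean_a, mean_b = statistics.fmean(sample_a), statistics.fmean(sample_b)
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each sample mean.
    se_a = statistics.variance(sample_a) / n_a
    se_b = statistics.variance(sample_b) / n_b
    t = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se_a + se_b) ** 2 / (se_a**2 / (n_a - 1) + se_b**2 / (n_b - 1))
    return t, df


# Illustrative response times in seconds (not measured data).
latencies_a = [0.52, 0.49, 0.55, 0.51, 0.50, 0.53]
latencies_b = [0.11, 0.09, 0.12, 0.10, 0.13, 0.10]

t_stat, dof = welch_t(latencies_a, latencies_b)
print(f"t = {t_stat:.2f}, degrees of freedom = {dof:.1f}")
```

A large absolute t value relative to the degrees of freedom suggests the difference between the versions is unlikely to be chance. With real traffic, the samples would hold thousands of measurements rather than six.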

A/B Testing in Kubernetes with seldon-core

Kubernetes is the default choice for the infrastructure layer because of its flexibility and industry-wide adoption. We can provision resources dynamically based on the tests we want to run, and release them once the testing phase is finished.

In the MLOps space, there are not many open-source tools that support A/B testing out-of-the-box. Seldon-core is the exception here and it:

  • supports Canary Deployments (configurable A/B deployments)

  • provides customizable metrics endpoint

  • supports REST and gRPC interfaces

  • has very good integration with Kubernetes

  • provides out-of-the-box tooling for metric collection and visualization (Prometheus and Grafana)

Therefore, choosing seldon-core for model deployment is not a difficult decision.

Resources used in this post are in this GitHub repository.

What Tooling is Needed

The architecture we are aiming for looks as follows:

The target architecture for A/B testing.
  1. Ingress Controller — makes sure requests coming from the users reach the correct endpoints. The ingress controller concept is described in great detail in the Kubernetes documentation. We will be using ambassador as an ingress controller since it is supported by seldon-core. Other alternatives are Istio, Linkerd, or NGINX.

  2. Seldon-core — as mentioned above, will be used for serving ML models as containerized microservices and for exposing metrics.

  3. Seldon-core-analytics — is a seldon-core component that bundles together tools to collect metrics (Prometheus) and visualize them (Grafana).

  4. Helm — is a package manager for Kubernetes. It will help us install all these components.

Building the models

To compare the two model versions, we first need to have the model versions.

In the test, we want to find which model version is better. The better model version for this case is defined as the one that provides predictions faster. For demonstration purposes, the only difference between the two versions will be an artificial delay added to version A before returning the response.

Let’s start by cloning the repository:

git clone
cd ab-testing-in-ml

Build version A:

docker build -t ab-test:a -f Dockerfile.a .

Now, let’s build version B:

docker build -t ab-test:b -f Dockerfile.b .

Setting up

After preparing the container images, let’s get started with the deployment. We will assume that we have a Kubernetes cluster up and running. The easiest way to get a Kubernetes cluster is to set up a local one with minikube, kind, or k3s.

Initially, we will install the helper components. We start with the ingress controller:

helm repo add datawire
helm upgrade --install ambassador datawire/ambassador \
    --set \
    --set service.type=ClusterIP \
    --set replicaCount=1 \
    --set crds.keep=false \
    --set enableAES=false \
    --create-namespace \
    --namespace ambassador

Install ambassador.

Next, we will install seldon-core-analytics:

helm upgrade --install seldon-core-analytics seldon-core-analytics \
    --repo \
    --set grafana.adminPassword="admin" \
    --create-namespace \
    --namespace seldon-system

Install seldon-core-analytics.

The next step is to install seldon-core:

helm upgrade --install seldon-core seldon-core-operator \
    --repo \
    --set ambassador.enabled=true \
    --create-namespace \
    --namespace seldon-system

This step, among other things, installs the custom resource definitions (CRDs) that SeldonDeployment resources need in order to work.

Install seldon-core CRDs.

Up to this point, we have installed the necessary components for our infrastructure. Next, we need to bring this infrastructure to life with the actual microservices that we want to test.

Deploy ML models as microservices

The next step is deploying both versions of our ML model. Here, we will use a canary configuration of a SeldonDeployment resource. A canary configuration is a setting where two (or more) versions are deployed and each receives a share of the overall traffic. The Kubernetes manifest file looks like this:

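# apiVersion is assumed here; the line is missing from the original listing.
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  labels:
    app: seldon
  name: ab_test
  namespace: seldon
spec:
  name: ab_test
  annotations:
    project_name: project_name
    deployment_version: v1
  predictors:
    # Version A (with the artificial delay) receives half of the traffic.
    - name: classifier-a
      replicas: 1
      traffic: 50
      componentSpecs:
        - spec:
            containers:
              - image: ab-test:a
                imagePullPolicy: IfNotPresent
                name: classifier-a
                env:
                  - name: VERSION
                    value: "A"
            terminationGracePeriodSeconds: 1
      graph:
        children: []
        endpoint:
          type: REST
        name: classifier-a
        type: MODEL
    # Version B receives the other half.
    - name: classifier-b
      replicas: 1
      traffic: 50
      componentSpecs:
        - spec:
            containers:
              - image: ab-test:b
                imagePullPolicy: IfNotPresent
                name: classifier-b
                env:
                  - name: VERSION
                    value: "B"
            terminationGracePeriodSeconds: 1
      graph:
        children: []
        endpoint:
          type: REST
        name: classifier-b
        type: MODEL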
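To build some intuition for the traffic values in the manifest, the sketch below simulates a weighted split in plain Python. It only illustrates the idea: in the real deployment the routing is done by the ingress layer, not by code like this. The helper names are hypothetical, and the sketch assumes the weights are percentages that add up to 100.

```python
import random
from collections import Counter

# Predictor names and traffic weights copied from the manifest.
PREDICTORS = [
    {"name": "classifier-a", "traffic": 50},
    {"name": "classifier-b", "traffic": 50},
]


def check_traffic(predictors):
    """Raise if the traffic percentages do not add up to 100."""
    total = sum(p["traffic"] for p in predictors)
    if total != 100:
        raise ValueError(f"traffic weights sum to {total}, expected 100")


def route(predictors, rng):
    """Pick one predictor name at random, proportionally to its weight."""
    names = [p["name"] for p in predictors]
    weights = [p["traffic"] for p in predictors]
    return rng.choices(names, weights=weights, k=1)[0]


check_traffic(PREDICTORS)
rng = random.Random(0)  # fixed seed so the simulation is repeatable
routed = Counter(route(PREDICTORS, rng) for _ in range(10_000))
print(routed)
```

Changing the weights to, say, 90 and 10 turns the same configuration into a more cautious canary rollout, where only a small share of users sees the new version.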

After this deployment, we have the complete architecture up and running.

Fully working architecture.

In our code, we have created a custom metric that reports the time taken to produce the prediction for every request. In addition, we have added a tag that records which version generated each prediction.

Code snippet to add custom metrics and tags.
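As a rough idea of what such a model class can look like, here is a minimal sketch that follows the seldon-core Python wrapper convention of optional metrics() and tags() methods next to predict(). It is not the code from the repository: the class name, the metric key, the delay value, and the placeholder prediction are all assumptions for illustration. The sketch reports the timer in milliseconds.

```python
import os
import time


class Classifier:
    """Hypothetical model wrapper; names and values are illustrative."""

    def __init__(self, version=None):
        # VERSION is the environment variable set in the manifest.
        self.version = version or os.environ.get("VERSION", "A")
        # Assumed artificial delay for version A (the real value may differ).
        self.delay_seconds = 0.02 if self.version == "A" else 0.0
        self.last_duration_ms = 0.0

    def predict(self, X, features_names=None):
        start = time.perf_counter()
        if self.delay_seconds:
            time.sleep(self.delay_seconds)
        # Placeholder output; a real model would classify the text here.
        prediction = [[0.5, 0.5] for _ in X]
        self.last_duration_ms = (time.perf_counter() - start) * 1000
        return prediction

    def metrics(self):
        # Custom metric reported alongside each prediction.
        return [{"type": "TIMER", "key": "prediction_time_ms",
                 "value": self.last_duration_ms}]

    def tags(self):
        # Tag that records which version produced the prediction.
        return {"version": self.version}


model = Classifier(version="A")
model.predict(["This is a nice comment."])
print(model.metrics(), model.tags())
```

Storing the last duration on the instance keeps the sketch short, but it is not safe when several requests are handled concurrently by the same object.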

Let’s now test whether the deployment works. First, we will port-forward the ambassador port, to send requests directly from our machine:

kubectl port-forward svc/ambassador -n ambassador 8080:80

and then we will send a request:

curl -X POST -H 'Content-Type: application/json' \
  -d '{"data": { "ndarray": ["This is a nice comment."]}}' \

Inspecting the response we can see:

  • In the data.ndarray key, we get the prediction.

  • In the meta.metrics and meta.tags keys, we get the custom metrics and tags defined above.

  • In the meta.requestPath key, we can see the model(s) that handled this request.

JSON response from the seldon-core microservice.
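The fields listed above can also be pulled out programmatically. The sketch below uses a hand-written payload in the shape just described. The values, the metric key prediction_time_ms, and the helper name are made up for illustration and are not actual output from the deployment.

```python
def summarize_response(response: dict) -> dict:
    """Extract the prediction, tags, timers, and serving model(s)."""
    meta = response.get("meta", {})
    timers = {
        m["key"]: m["value"]
        for m in meta.get("metrics", [])
        if m.get("type") == "TIMER"
    }
    return {
        "prediction": response.get("data", {}).get("ndarray"),
        "tags": meta.get("tags", {}),
        "timers": timers,
        # requestPath maps each model that handled the request to its image.
        "handled_by": sorted(meta.get("requestPath", {})),
    }


# Hypothetical response payload; all values are invented.
sample = {
    "data": {"ndarray": [[0.9, 0.1]]},
    "meta": {
        "metrics": [
            {"key": "prediction_time_ms", "type": "TIMER", "value": 12.5}
        ],
        "tags": {"version": "B"},
        "requestPath": {"classifier-b": "ab-test:b"},
    },
}

summary = summarize_response(sample)
print(summary)
```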

Finally, we need to build a dashboard to visualize the collected metric. We start by port-forwarding Grafana, to make it reachable from our machine:

kubectl port-forward svc/seldon-core-analytics-grafana -n seldon-system 3000:80

We will create the dashboard, or import one already prepared from here.

The imported dashboard.

After the dashboard is created, we have the deployment and the monitoring ready.

Now, let’s generate some load. The seldon_client/ script in the repo uses the seldon-core Python client to send a few thousand requests.

python seldon_client/
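The script in the repository is the one to use. Purely as an illustration of what a load generator does, here is a small sketch that sends a payload repeatedly through a caller-supplied function and groups client-side latency by the version tag in the response. The run_load and fake_send names are hypothetical, and the fake sender only stands in for a real call to the deployment.

```python
import itertools
import time
from collections import defaultdict
from statistics import fmean


def run_load(send, payload, n_requests):
    """Call send(payload) n_requests times; report latency per version."""
    latencies = defaultdict(list)
    for _ in range(n_requests):
        start = time.perf_counter()
        response = send(payload)
        elapsed = time.perf_counter() - start
        tags = response.get("meta", {}).get("tags", {})
        latencies[tags.get("version", "unknown")].append(elapsed)
    return {
        version: {"count": len(values), "mean_seconds": fmean(values)}
        for version, values in latencies.items()
    }


# Stand-in sender that alternates between versions; a real one would
# forward the payload to the deployed service and return its JSON reply.
_versions = itertools.cycle(["A", "B"])


def fake_send(payload):
    return {"data": {"ndarray": [[0.5, 0.5]]},
            "meta": {"tags": {"version": next(_versions)}}}


payload = {"data": {"ndarray": ["This is a nice comment."]}}
report = run_load(fake_send, payload, n_requests=10)
print(report)
```

Latency measured on the client includes network and routing overhead, so it will be somewhat higher than the prediction time reported by the model's own metric.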

We can now see the values visualized in our dashboard and compare the model versions by prediction response time.

Model A vs. Model B prediction times.

Unsurprisingly, model version B, which has no artificial delay, responds faster.

What’s next?

In real life, choices are not always binary. More often than not, we have to consider more options. To address this, we can use Multi-armed Bandits (MABs). MABs are A/B/n tests that update in real time based on the performance of each variation. MABs are supported in seldon-core as well.
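To illustrate what "update in real time" means, here is a generic epsilon-greedy bandit, one of the simplest MAB strategies. It is a textbook-style sketch and not seldon-core's router implementation. The arm names and the simulated success rates are invented.

```python
import random


class EpsilonGreedy:
    """Explore a random arm with probability epsilon, else exploit the best."""

    def __init__(self, arms, epsilon=0.1, rng=None):
        self.epsilon = epsilon
        self.rng = rng or random.Random()
        self.counts = {arm: 0 for arm in arms}
        self.values = {arm: 0.0 for arm in arms}  # running mean reward

    def select(self):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.counts))
        return max(self.values, key=self.values.get)

    def update(self, arm, reward):
        self.counts[arm] += 1
        # Incremental update of the mean reward for this arm.
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


# Simulated experiment with three variations and made-up success rates.
true_rates = {"A": 0.3, "B": 0.5, "C": 0.4}
rng = random.Random(42)
bandit = EpsilonGreedy(true_rates, epsilon=0.1, rng=rng)
for _ in range(5_000):
    arm = bandit.select()
    reward = 1.0 if rng.random() < true_rates[arm] else 0.0
    bandit.update(arm, reward)

print(bandit.counts)
print(bandit.values)
```

Unlike a fixed 50/50 split, the bandit shifts traffic toward the variation that currently looks best while still sending a small share of requests to the others.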


With this demo, we were able to set up all the necessary infrastructure for A/B testing, collect and visualize metrics, and in the end, make an informed decision about the initial hypothesis. We also have the option to track further metrics using the custom tags we added in the response.


At Data Max, we work hard to bring state-of-the-art machine learning solutions to our customers. Implementing such solutions enables our customers to make informed decisions and become truly data-driven.

