DevOps

Airflow

Links:

Server installation with database: https://www.statworx.com/de/blog/a-framework-to-automate-your-work-how-to-set-up-airflow/

First dag: https://medium.com/better-programming/how-to-write-your-first-pipeline-in-airflow-a51141c3f4dd

How to unit-test Airflow: https://godatadriven.com/blog/testing-and-debugging-apache-airflow/

CRON scheduler translator: https://crontab.guru/#0_08_27_*_*

A common source of confusion in Airflow regarding dates is the fact that the run timestamped with a given date only starts when the period that it covers ends! See for example the pictures in https://medium.com/nerd-for-tech/airflow-catchup-backfill-demystified-355def1b6f92 .

If you run a DAG on a schedule_interval of one day, the run stamped 2020-01-01 will be triggered soon after 2020-01-01T23:59. In other words, the job instance is started once the period it covers has ended. The execution_date available in the context will also be 2020-01-01.

The first DAG Run is created based on the minimum start_date for the tasks in your DAG. Subsequent DAG Runs are created by the scheduler process, based on your DAG’s schedule_interval, sequentially. If your start_date is 2020-01-01 and schedule_interval is @daily, the first run will be created on 2020-01-02 i.e., after your start date has passed.

Re-run DAG

start_date: https://airflow.apache.org/docs/apache-airflow/stable/faq.html#what-s-the-deal-with-start-date

execution_date:

macros for execution_date: https://airflow.apache.org/docs/apache-airflow/stable/macros-ref.html . For example:

{{ ds }} : the execution date as YYYY-MM-DD

{{ next_ds }} : the next execution date as YYYY-MM-DD. If {{ ds }} is 2018-01-01 and schedule_interval is @weekly, {{ next_ds }} will be 2018-01-08
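As an illustration, here is a minimal DAG sketch (dag_id and task name are illustrative, not from the source; the import path is the Airflow 1.x one, matching the CLI used below) showing both the start_date semantics described above and the templated macros:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# With start_date=2020-01-01 and @daily, the first run (execution_date
# 2020-01-01) is triggered once that daily period has ended, i.e. soon
# after 2020-01-02T00:00.
dag = DAG(
    dag_id="daily_example",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
)

print_dates = BashOperator(
    task_id="print_dates",
    # {{ ds }} and {{ next_ds }} are rendered by the templating engine at run time
    bash_command="echo 'ds={{ ds }} next_ds={{ next_ds }}'",
    dag=dag,
)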

Airflow CLI

Detection of a specific dag DAG_NAME contained in a file ‘/folder/dags/any_subfolder/dag_filename’:

airflow list_dags -sd /folder/dags/ -r | grep dag_filename

# output such as
# /any_subfolder/dag_filename.py                                     | 0.018184 |       1 |        1 | ['DAG_NAME']

Docker

Good links:

The 3 following resources form a series that is very simple to follow and reproduce, and very useful for grasping the basic concepts:

Additional interesting resources:

High-level principle of Docker images:

Docker images principle

Activate a conda environment in Docker: https://pythonspeed.com/articles/activate-conda-dockerfile/

Spark on Docker: https://www.datamechanics.co/blog-post/spark-and-docker-your-spark-development-cycle-just-got-ten-times-faster

Note: bug with Dockerhub authentication (when there have been too many failed login attempts): https://github.com/docker/hub-feedback/issues/1250. In the CLI: rm ~/.docker/config.json, and then: docker login

Some basic Docker CLI commands include:

  • docker build builds an image from a Dockerfile

  • docker images displays all Docker images on the machine

  • docker run starts a container and runs commands in it

  • docker run options:

-p specifies the port mapping between the host and the Docker container

-it opens an interactive console

-d starts the container in detached mode (runs in the background)

-e sets environment variables

  • docker ps displays all running containers

  • docker rmi removes one or more images

  • docker rm removes one or more containers

  • docker kill kills one or more running containers

  • docker tag tags an image with an alias which can be referenced later

  • docker login logs in to your Docker registry
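For scripting, the same operations are also available from Python through the Docker SDK (the docker package; not covered above, so this is just an illustrative sketch, with the image name and variables made up):

import docker

# connect to the local Docker daemon (same environment as the CLI)
client = docker.from_env()

# equivalent of: docker run -d -p 8080:80 -e APP_ENV=dev nginx:alpine
container = client.containers.run(
    "nginx:alpine",
    detach=True,                     # -d
    ports={"80/tcp": 8080},          # -p 8080:80 (container port -> host port)
    environment={"APP_ENV": "dev"},  # -e
)
print(container.short_id)

# equivalent of: docker kill + docker rm
container.kill()
container.remove()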

Docker-compose

Links:

Kubernetes

Good links:

Again, a very interesting series to address the basic concepts:

Intro to docker & kubernetes video: https://www.youtube.com/watch?v=bhBSlnQcq2k&ab_channel=Amigoscode

How to switch between different clusters: https://stackoverflow.com/questions/43643463/how-to-switch-kubectl-clusters-between-gcloud-and-minikube

You first need to list the contexts, which name the different clusters available:

$ kubectl config get-contexts

Example of output, for minikube and AKS clusters:

CURRENT   NAME              CLUSTER           AUTHINFO                             NAMESPACE
          aks-cluster-dev   aks-cluster-dev   clusterUser_aks-rg_aks-cluster-dev
*         minikube          minikube          minikube                             default

Then to switch from the minikube to the AKS one:

kubectl config use-context aks-cluster-dev

And to switch back to the minikube cluster:

kubectl config use-context minikube

Minikube

How to install Minikube: https://minikube.sigs.k8s.io/docs/start/ and https://minikube.sigs.k8s.io/docs/drivers/docker/ (when using docker as backend runtime)

Here specifically on Ubuntu, with some addons: https://www.linuxtechi.com/how-to-install-minikube-on-ubuntu/

$ curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube_latest_amd64.deb

$ sudo dpkg -i minikube_latest_amd64.deb

# Start a cluster using the virtualbox driver: https://minikube.sigs.k8s.io/docs/drivers/virtualbox/
minikube start --driver=virtualbox

# You can also set these parameters
minikube start --cpus=4 --memory=8g

# To make virtualbox the default driver:
minikube config set driver virtualbox

# Access the Kubernetes Dashboard running within the minikube cluster:
minikube dashboard

# Once started, you can interact with your cluster using kubectl, just like any other Kubernetes cluster. For instance, starting a server:
kubectl create deployment hello-minikube --image=k8s.gcr.io/echoserver:1.4

# Exposing a service as a NodePort
kubectl expose deployment hello-minikube --type=NodePort --port=8080

# minikube makes it easy to open this exposed endpoint in your browser:
minikube service hello-minikube

# Upgrade your cluster:
minikube start --kubernetes-version=latest

# Stop your local cluster:
minikube stop

# Delete your local cluster:
minikube delete

# Exposing a service as a LoadBalancer
kubectl expose deployment hello-minikube1 --type=LoadBalancer --port=8080

# Then the service can be accessed using:
minikube service --url hello-minikube1
# example of output: http://192.168.99.100:12345/ This is the address where the service will be available

The last command exposes the service related to the deployment on port 8080.

To open that in the browser, one needs to open a tunnel in Minikube (see https://minikube.sigs.k8s.io/docs/handbook/accessing/)

$ minikube tunnel

At the end of the experimentation, we need to close the tunnel:

$ minikube tunnel --cleanup

Minikube Addons

# minikube has a built-in list of applications and services that may be easily deployed, such as Istio or Ingress. To list the available addons for your version of minikube:
minikube addons list

# To enable an addon:
minikube addons enable <name>

# To enable an addon at start-up (the --addons option can be specified multiple times):
minikube start --addons <name1> --addons <name2>

# For addons that expose a browser endpoint, you can quickly open them with:
minikube addons open <name>

# To disable an addon:
minikube addons disable <name>

# To clear out minikube's home folder containing configs and other files:
minikube delete --all --purge

More Minikube links:

How to create a deployment on minikube/kubernetes?

# To see which deployments are already there:
kubectl get deployments

# To deploy a deployment yaml file:
kubectl create -f deployment.yaml

# Note that it is also possible to use (https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#creating-a-deployment):
kubectl apply -f https://k8s.io/examples/controllers/nginx-deployment.yaml

# See the different deployments
kubectl get deployment

# Describe the deployments (show the yaml files behind them)
kubectl describe deployments

# Delete a deployment
kubectl delete deploy mlflow-deployment

# How to scale a deployment to 4 pods (https://cloud.google.com/kubernetes-engine/docs/how-to/scaling-apps#scaling_an_application)
kubectl scale deployment mlflow-deployment --replicas 4

# How to scale a statefulset (or other controller) to 4 pods
kubectl scale statefulset mlflow-postgres --replicas 4

# How to autoscale the application (https://cloud.google.com/kubernetes-engine/docs/how-to/scaling-apps#autoscaling-deployments)
kubectl autoscale deployment my-app --max 6 --min 4 --cpu-percent 50

apply is more declarative, while create is imperative (see https://stackoverflow.com/questions/47241626/what-is-the-difference-between-kubectl-apply-and-kubectl-replace), so apply will figure out by itself how to reconcile the live state with the manifest (kubectl patch, replace, delete, create, and even edit are all imperative).

Create a Stateless (classical app) app deployment: https://cloud.google.com/kubernetes-engine/docs/how-to/stateless-apps#anatomy

Create a Statefulset (like a database) app: https://cloud.google.com/kubernetes-engine/docs/how-to/stateful-apps#creating_a_statefulset

Configure Ingress

How to merge several yaml manifest files into one: https://stackoverflow.com/questions/59287850/how-to-have-multiple-object-types-in-a-single-openshift-yaml-template

Kubernetes command cheatsheets:

Azure Kubernetes Service

Start/stop already created cluster: https://docs.microsoft.com/en-us/azure/aks/start-stop-cluster?tabs=azure-cli

For a given cluster aks-cluster-dev in a resource group aks-rg, stopping/starting it is easy:

$ az aks stop --name aks-cluster-dev --resource-group aks-rg

$ az aks start --name aks-cluster-dev --resource-group aks-rg

Spark on Kubernetes

Good links:

Openshift

How to create a dummy app?

  1. Create a Python repo in Bitbucket (or GitHub). It should have a requirements.txt and an app.py file (dummy example)

  2. Then in the catalog, search for python, select a version, give an app name and select the repo

  3. Before moving on to the next step, go to the advanced settings and fill in the configurations

  4. Image building. Once the build config is ready, the application can be built. To do this navigate to: Your OpenShift project → Builds: Builds → Your application Name → Start Build

  5. Deployment. There are basically 3 ways of deploying your app:

  • (continuous) deployment: Once an image is built, it is usually automatically pushed into deployment with one running pod.

If one doesn’t want to have a continuously running and/or indefinitely restarting application, one can decrease the pods in the deployment to zero: Openshift → Overview → Application

  • job (single run): to run an application once, one should create a job, using the CLI:

oc run single-job --image=docker-registry.default.svc:5000/projectname-spi-syst/application:1.0 --restart=OnFailure --limits="memory=2Gi"

A pod will then be immediately created and run to completion. The pod will restart indefinitely if it fails, meaning that some supervision should take place if the user is not experienced.

The --image part can be retrieved with the commands oc get imagestream or oc get imagestreamtag, or via Builds: Images → Docker Repo. It is best to pin the imagestream to a specific tag (1.0 in this case), because older image versions have been observed when using the tag "latest".

--limits is used to extend the amount of memory (and other resources) available to the pod.

The pod restarts on failure regardless of which --restart parameter is set (--restart=Never has not been proven to work as expected, however).

Latest jobs that ran to completion or currently running jobs can be viewed by running oc get pods.

  • scheduled job

Scheduling a run is essentially the same as creating a job, except that an additional schedule parameter in CRON format should be added, for example: --schedule="0 0 */1 * *", which results in a run at 00:00 every day.

Modifying the single-run command accordingly, the command for scheduling a job looks something like this:

oc run job-scheduler --image=docker-registry.default.svc:5000/projectname-spi-syst/application:1.0 --schedule="0 3 1,11,21 * *" --restart=OnFailure --limits="memory=2Gi"

This will create a pod with the app that should run once at 03:00 every 1st, 11th and 21st day of the month.

Scheduled jobs can be viewed with oc get cronjobs.

Settings for this cronjob can be edited by running oc edit cronjob/job-scheduler

Persistent storage

Users can create storage volumes where data persists after the pods are closed. A volume can be claimed in Storage → Create Storage.

A volume can be mounted to a deployment by modifying the deployment config. This can be done from the CLI by running a command similar to oc set volume dc/application --add --claim-name app-storage --mount-path /opt/app-root/src/mount:

dc/.. indicates the chosen deployment, which can be found in Overview → Application

--claim-name indicates which volume is mounted (your created volume name)

--mount-path indicates the path to your mounted volume within the deployment pods

Other relevant resources can have volumes mounted as well; for example, to mount a volume to a cronjob, change dc/.. to cronjob/.., etc.

Deletion

Finding and deleting resources that are no longer used is done with commands similar to:

oc delete pod/application-pod

The resource type is indicated before the slash, and the resource name after the slash.

To delete all resources that belong to an application, this command can be used:

oc delete all --selector app=application

Additional Openshift doc

General documentation: https://docs.openshift.com/container-platform/4.7/welcome/index.html?ref=learn-cluster-access

Learning:

Note: there is some playground for Openshift (60 min trials):

Also, there is a (non-free, $50/month) cloud version of Openshift: https://manage.openshift.com/sign_in

Openshift can be integrated with Azure DevOps (by plugins within Azure DevOps). LOOK INTO THIS!!!

Localhost version of openshift: https://developers.redhat.com/openshift/local-openshift

Open source example of flask using OS3: https://github.com/idealo/flask-openshift-example

Same, using Minishift: https://medium.com/@nikhilkumar435/deploy-python-flask-application-using-openshift-minishift-af098eb74e26

For Openshift 4.X, to run locally, better to use codeready-containers: https://developers.redhat.com/products/codeready-containers/overview?ref=learn-cluster-access

Getting started (Medium): https://sindhumurugavel.medium.com/getting-started-with-kubernetes-9fb8995d0726

Get started with CLI: https://docs.openshift.com/enterprise/3.0/cli_reference/get_started_cli.html#basic-setup-and-login

Import images from external registry into openshift image registry: https://dzone.com/articles/pulling-images-from-external-container-registry-to

Example: to import the minio image from DockerHub into the OpenShift image registry: oc import-image docker.io/minio/minio --confirm

CI/CD development

What is it exactly?

Interesting series of videos on the CI/CD of an ML model (NLP detection of positive/negative comments) using GitLab, Jenkins and Flask (and Docker):

GitLab

Jenkins

Azure DevOps

Azure DevOps with azure container registry (for Docker container images): https://docs.microsoft.com/en-us/azure/devops/pipelines/ecosystems/containers/acr-template?view=azure-devops

Examples of deployment of a flask app using Azure DevOps:

Example of full cicd (ok, using dotnet, not python, but useful to see full steps): https://dotnetplaybook.com/build-test-and-deploy-a-rest-api-with-azure-devops/

Doc on docker image building using Azure DevOps (and a Dockerfile): https://docs.microsoft.com/en-us/azure/devops/pipelines/tasks/test/publish-test-results?view=azure-devops&tabs=junit%2Cyaml#docker

Example of Azure pipeline for a TEST-QA-PROD deployment of a python-flask application: https://docs.microsoft.com/en-us/samples/azure-samples/azure-pipelines-variable-templates/azure-pipelines-variable-templates/

jobs, conditions, variables in Azure Pipelines: A recurrent tricky problem: when you define a variable in one job, how do you make it available in another job? https://gaunacode.com/passing-variables-between-jobs-for-azure-devops-pipelines (note that jobs run on different agents by default, so variables are not passed automatically).

Also, if I need to use a variable of a previous job in a condition in a downstream job: https://docs.microsoft.com/en-us/azure/devops/pipelines/process/conditions?view=azure-devops&tabs=yaml#use-the-output-variable-from-a-job-in-a-condition-in-a-subsequent-job

Passing variables between different STAGES: https://docs.microsoft.com/en-us/azure/devops/pipelines/process/expressions?view=azure-devops#stage-to-stage-dependencies

This one is more concrete: https://arunksingh16.medium.com/azure-devops-share-variable-across-stages-9bca85abfe8a (simple working example):

stages:

- stage: STAGE_X
  jobs:
  - job: STAGE_X_JOB_A
    steps:
    - checkout: none
    - script: |
        echo "This is job STAGE_X_JOB_A. Lets set the value"
        export PARAM_X="TEST"
        echo "##vso[task.setvariable variable=VarKey;isOutput=true]$PARAM_X"
      name: ValidateVar
    - script: |
        echo "Key Value :"
        echo $(ValidateVar.VarKey)
      name: Print_Key_value
  - job: STAGE_X_JOB_B
    dependsOn: STAGE_X_JOB_A
    condition: eq(dependencies.STAGE_X_JOB_A.outputs['ValidateVar.VarKey'], 'TEST')
    steps:
    - checkout: none
    - script: |
        echo "This is job STAGE_X_JOB_B and will run as per the valid condition"
      displayName: Print Details

- stage: STAGE_Y
  dependsOn: STAGE_X
  jobs:
  - job: STAGE_Y_JOB_B
    condition: eq(stageDependencies.STAGE_X.STAGE_X_JOB_A.outputs['ValidateVar.VarKey'], 'TEST')
    variables:
      varFromStageA: $[ stageDependencies.STAGE_X.STAGE_X_JOB_A.outputs['ValidateVar.VarKey'] ]
    steps:
    - checkout: none
    - script: |
        echo "This Job will print value from Stage STAGE_X"
        echo $(varFromStageA)

Good post on the different variable types in Azure DevOps: https://adamtheautomator.com/azure-devops-variables/#What_are_Azure_DevOps_Pipeline_Variables

Tests

  • regression testing

  • performance testing

  • coverage testing: https://www.guru99.com/test-coverage-in-software-testing.html

Git Flow

Intro here: https://medium.com/@patrickporto/4-branching-workflows-for-git-30d0aaee7bf

Git flow extended demo: https://datasift.github.io/gitflow/IntroducingGitFlow.html

The one of Azure DevOps: https://docs.microsoft.com/en-us/azure/devops/repos/tfvc/effective-tfvc-branching-strategies-for-devops?view=azure-devops

Infrastructure as Code (IaC)

Terraform

# TODO: add simple refs for Terraform examples here

How to store the state file in a backend (like S3, ADLS): https://www.terraform.io/language/settings/backends/configuration

How to store the state file in Terraform Cloud: https://www.terraform.io/language/settings/terraform-cloud

MLOps - Machine learning life cycle

Great intro to the core principles of MLOps by Google: https://cloud.google.com/solutions/machine-learning/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning Therein are discussed the 3 levels of development & deployment, which are very useful.

Associated video: https://cloud.withgoogle.com/next/sf/sessions?session=AI212#cloud-ai, https://www.youtube.com/watch?v=6gdrwFMaEZ0&feature=emb_logo&ab_channel=GoogleCloudTech

Here are the 3 levels of MLOps defined by Google:

  • Level 0: the level that most companies try to achieve:

MLOps level 0
  • Level 1:

MLOps level 1
  • Level 2:

MLOps level 2

ML platforms: https://medium.com/better-programming/5-great-mlops-tools-to-launch-your-next-machine-learning-model-3e403d0c97d3

MLFlow

Advantages:

MLFlow is a Data Science platform built with machine learning model development, versioning and deployment in mind.

Developed by Databricks, open-sourced, and donated to the Linux Foundation. As such, it is heavily documented, and it has become a de facto standard over the last two years.

For development, ability to log parameters (see tracking API)

For deployment, ability to version-control models and tag them: None-Staging-Production-Archived (see model registry API)

The open source version exists as a server-client application, accessible through a web UI, a REST API, and Python/CLI clients.

See the components of MLFlow here: https://www.mlflow.org/docs/latest/concepts.html#mlflow-components

Good links:

1. MLflow Tracking: https://www.mlflow.org/docs/latest/tracking.html

Log scikit-learn models in MLFlow: https://mlflow.org/docs/latest/python_api/mlflow.sklearn.html#mlflow.sklearn.log_model

Log pyspark models in MLFlow: https://mlflow.org/docs/latest/python_api/mlflow.spark.html#mlflow.spark.log_model

Log tensorflow models in MLFlow: https://www.mlflow.org/docs/latest/python_api/mlflow.tensorflow.html#mlflow.tensorflow.log_model
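For instance, a minimal tracking sketch with a scikit-learn model (parameter and metric names are illustrative):

import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200).fit(X, y)

with mlflow.start_run():
    # tracking API: log parameters and metrics for the run
    mlflow.log_param("max_iter", 200)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    # serialize the model as an artifact of the run
    mlflow.sklearn.log_model(model, artifact_path="model")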

2. MLflow Projects: https://www.mlflow.org/docs/latest/projects.html

In MLflow Projects, we can containerize the MLflow model (training and/or inference) into Docker containers!

Here is an example for a containerized training: https://github.com/mlflow/mlflow/tree/master/examples/docker

3. MLFlow Models: https://www.mlflow.org/docs/latest/models.html

There are 2 ways to score/serve the MLflow models: offline and online.

For offline scoring:

mlflow offline scoring
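A minimal sketch of offline (batch) scoring through the python_function flavor (the registry URI reuses the illustrative sklearn-iris model from the serve example further below; the input file name is made up):

import mlflow.pyfunc
import pandas as pd

# load from the model registry (a runs:/<run-id>/model URI also works)
model = mlflow.pyfunc.load_model("models:/sklearn-iris/production")

# score a batch of data
batch = pd.read_csv("new_data.csv")
predictions = model.predict(batch)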

For online scoring, there are also multiple possibilities:

MLflow also has a CLI that supports the following commands:

For more info, see:

mlflow models --help
mlflow models serve --help
mlflow models predict --help
mlflow models build-docker --help

build-docker:

Builds a Docker image whose default entrypoint serves the specified MLflow model at port 8080 within the container, using the ‘python_function’ flavor.

For example, the following command builds a docker image named ‘my-image-name’ that serves the model from run ‘some-run-uuid’ at run-relative artifact path ‘my-model’:

mlflow models build-docker -m "runs:/some-run-uuid/my-model" -n "my-image-name"

# We can then serve the model, exposing it at port 5001 on the host via:
docker run -p 5001:8080 "my-image-name"

# NB: by default, the container will start nginx and gunicorn processes. If you don’t need the nginx process to be started (for instance if you deploy your container to Google Cloud Run), you can disable it via the DISABLE_NGINX environment variable:
docker run -p 5001:8080 -e DISABLE_NGINX=true "my-image-name"

# See https://www.mlflow.org/docs/latest/python_api/mlflow.pyfunc.html for more information on the ‘python_function’ flavor.
mlflow serving on databricks platform

MLflow serve deploys the model as a local REST API server.

How to launch a MLflow webserver using the MLflow CLI:

mlflow models serve --model-uri models:/sklearn-iris/production --port 5001 (example)

Two formats, CSV or JSON (pandas-split or pandas-records format): https://www.mlflow.org/docs/latest/models.html#deploy-mlflow-models

mlflow serving csv format
mlflow serving json format
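For example, querying the local server launched above with a JSON payload in pandas-split format (feature names are illustrative; recent MLflow 2.x versions expect the payload wrapped in a dataframe_split key instead):

import requests

payload = {
    "columns": ["sepal_length", "sepal_width", "petal_length", "petal_width"],
    "data": [[5.1, 3.5, 1.4, 0.2]],
}

response = requests.post(
    "http://127.0.0.1:5001/invocations",
    json=payload,
    headers={"Content-Type": "application/json; format=pandas-split"},
)
print(response.json())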

Here is the entire procedure:

mlflow end2end

4. MLFlow Model registry: https://www.mlflow.org/docs/latest/model-registry.html

Model registry example: https://docs.microsoft.com/en-us/azure/databricks/applications/mlflow/model-registry-example

Model registry accessible from multiple environments: https://docs.microsoft.com/en-us/azure/databricks/applications/machine-learning/manage-model-lifecycle/multiple-workspaces

Here is the full chain with Databricks’ MLflow and azure ML: https://databricks.com/blog/2020/10/13/using-mlops-with-mlflow-and-azure.html Very good tutorial.

How to create an open-source MLFlow server on Docker or Kubernetes

Deploy MLflow on Openshift:

How to setup MLflow on an Azure VM: https://medium.com/swlh/how-to-setup-mlflow-on-azure-5ba67c178e7d

SonarQube (or SonarCloud, SonarLint): static code analysis

How to use SonarQube with Azure DevOps: https://docs.sonarqube.org/latest/analysis/azuredevops-integration/

  • SonarQube: on prem version

  • SonarCloud: cloud available version

  • SonarLint: version available in IDEs, like VSCode.

How to use the cloud version of SonarQube, SonarCloud, with Azure DevOps:

What does SonarQube/SonarCloud do better than classical linting tools like flake8 or pylint?

Rule explorer: here is the list of 174 rules used in SonarQube/SonarCloud/SonarLint: https://rules.sonarsource.com/python (for Python, but exists also for different languages)

Good links on SQ: https://tomgregory.com/sonarqube-branch-analysis/ and related pages

Kubeflow

Intro: https://www.kubeflow.org/docs/about/kubeflow/

Use cases: https://www.kubeflow.org/docs/about/use-cases/

On Azure: https://www.kubeflow.org/docs/azure/ and https://www.kubeflow.org/docs/azure/azureendtoend/

On Openshift: https://www.kubeflow.org/docs/openshift/

On premises: https://www.kubeflow.org/docs/other-guides/kubeflow-on-multinode-cluster/

On Minikube:

DataOps

Feature Store

Databricks (Azure) FS doc: https://docs.microsoft.com/en-us/azure/databricks/applications/machine-learning/feature-store/#create-a-training-dataset

Databricks (AWS) FS doc: http://docs.databricks.com.s3-website-us-west-1.amazonaws.com/applications/machine-learning/feature-store/index.html#create-a-training-dataset

Why do we need a feature store? (https://www.datarevenue.com/en-blog/feature-store-feast-vs-hopsworks)

If you train models without a feature store, your setup might look something like this:

Taken from https://www.datarevenue.com/en-blog/feature-store-feast-vs-hopsworks

Every model has to access the data and do some transformation to turn it into features, which the model then uses for training.

There’s a lot of duplication in this process - many of the models use many of the same features.

This duplication is one problem a feature store can solve. Every feature can be stored, versioned, and organized in your feature store.

Taken from https://www.datarevenue.com/en-blog/feature-store-feast-vs-hopsworks

Reasons for a FS (https://madewithml.com/courses/mlops/feature-store/):

  1. Isolation: feature development in isolation (for each unique ML application) can lead to duplication of efforts (setting up ingestion pipelines, feature engineering, etc.).

Solution: create a central feature repository where the entire team contributes maintained features that anyone can use for any application.

  2. Skew: we may have different pipelines for generating features for training and serving which can introduce skew through the subtle differences.

Solution: create features using a unified pipeline and store them in a central location that the training and serving pipelines pull from.

  3. Values: once we set up our data pipelines, we need to ensure that our input feature values are up-to-date so we aren’t working with stale data, while maintaining point-in-time correctness so we don’t introduce data leaks.

Solution: retrieve input features for the respective outcomes by pulling what’s available when a prediction would be made.

When actually constructing our feature store, there are several core components we need to have to address these challenges:

  • data ingestion: ability to ingest data from various sources (databases, data warehouse, etc.) and keep them updated.

  • feature definitions: ability to define entities and corresponding features

  • historical features: ability to retrieve historical features to use for training.

  • online features: ability to retrieve features from a low latency origin for inference.

FS Batch processing: in batch processing, inference requests on specific entity instances can use features that have been materialized from the offline store

Taken from https://madewithml.com/courses/mlops/feature-store/
  1. Application data is stored in a database and/or a data warehouse, etc. And it goes through the DataOps pipeline to validate the data and engineer the features.

  2. These features are written to the offline store which can then be used to retrieve historical training data to train a model with.

  3. Once we have our training data, we can start the MLOps pipeline to optimize, train and validate a model.

  4. We can incrementally materialize features to the online store so that we can retrieve an entity’s feature values with low latency. In a production setting, something like Redis or DynamoDB would be used.

  5. These online features are passed on to the deployed model to generate predictions that would be used downstream.

FS Stream processing: we require near real-time feature values to deliver up-to-date predictions at low latency. While we’ll still utilize an offline store for retrieving historical data, our application’s real-time event data will go directly through our data streams to an online store for serving.

Taken from https://madewithml.com/courses/mlops/feature-store/
  1. Real-time event data enters our running data streams (Kafka / Kinesis, etc.) where they can be processed to generate features.

  2. These features are written to the online store which can then be used to retrieve online features for serving at low latency. In a production setting, something like Redis or DynamoDB would be used.

  3. Streaming features are also written from the data stream to the batch data source (data warehouse, db, etc.) to be processed for generating training data later on.

  4. Historical data will be validated and used to generate features for training a model. How often this happens depends on whether there are data annotation lags, compute constraints, etc.

Hopsworks

Intro:

Documentation:

FEAST

FEAST is an open source feature store.

FEAST is heavily linked with Kubeflow:

Comparison FEAST vs Hopsworks: https://www.datarevenue.com/en-blog/feature-store-feast-vs-hopsworks

Getting started with FEAST: https://www.kubeflow.org/docs/components/feature-store/getting-started/
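A minimal online-retrieval sketch with the FEAST Python SDK (the feature view, feature names and entity are illustrative, and a feature repo with a feature_store.yaml is assumed in the current directory):

from feast import FeatureStore

# points at a feature repo containing feature_store.yaml
store = FeatureStore(repo_path=".")

# low-latency retrieval from the online store for one entity
features = store.get_online_features(
    features=["driver_stats:avg_trips", "driver_stats:rating"],
    entity_rows=[{"driver_id": 1001}],
).to_dict()
print(features)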

Monitoring of ML models

Great hands-on blog: https://www-jeremyjordan-me.cdn.ampproject.org/c/s/www.jeremyjordan.me/ml-monitoring/amp/

madewithml: https://madewithml.com/courses/mlops/monitoring/

What exactly can we monitor:

  • the system health (underlying platform health)

  • the model performance

  • the data (but that could be monitored even before the model prediction!)

Problem of delayed outcome: it is sometimes difficult to obtain the real model performance on fresh data (absence of feedback on the model output). We can use a subset of the dataset being served, with an annotation pipeline, to be able to produce some performance metric.

“If we wait to catch the model decay based on the performance, it may have already caused significant damage to downstream business pipelines that are dependent on it. We need to employ more fine-grained monitoring to identify the sources of model drift prior to actual performance degradation.”

Different types of drifts

Copied from https://madewithml.com/courses/mlops/monitoring/

Data drift: also known as feature drift or covariate shift, occurs when the distribution of the production data is different from the training data.

data drift example

Copied from https://madewithml.com/courses/mlops/monitoring/

Target drift (aka label drift): shift in the outcomes/labels. This can be a shift in the distributions but also the removal or addition of new classes with categorical tasks. Though retraining can mitigate the performance decay caused by target drift, it can often be avoided with proper inter-pipeline communication about new classes, schema changes, etc.

Concept drift: External factors cause the label to evolve. Besides the input and output/label data drifting, we can have the actual relationship between them drift as well. This concept drift renders our model ineffective because the patterns it learned to map between the original inputs and outputs are no longer relevant. Such drift can be gradual, abrupt, or even periodic.

Prediction drift: the model’s prediction distribution deviates. Highly related to target drift, but where target drift is linked to an outside influence, prediction drift is linked to something within the model (unclear here!!)

Scikit-multiflow provides a toolkit for concept drift detection techniques directly on streaming data. The package offers windowed, moving average functionality (including dynamic preprocessing) and even methods around concepts like gradual concept drift.
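For example, a minimal sketch with scikit-multiflow's ADWIN detector on a synthetic stream whose mean shifts halfway through:

import numpy as np
from skmultiflow.drift_detection import ADWIN

adwin = ADWIN()

# synthetic stream: the distribution shifts after 1000 points
stream = np.concatenate([np.random.normal(0, 1, 1000),
                         np.random.normal(3, 1, 1000)])

for i, value in enumerate(stream):
    adwin.add_element(value)
    if adwin.detected_change():
        print(f"Change detected at index {i}")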

See also: https://www.youtube.com/watch?v=tGckE83S-4s&ab_channel=Databricks

What to do against drifts?

drift actions

Copied from https://www.youtube.com/watch?v=tGckE83S-4s&t=660s&ab_channel=Databricks

What to monitor?

what to monitor

Copied from https://www.youtube.com/watch?v=tGckE83S-4s&t=660s&ab_channel=Databricks

Monitoring numerical features:

monitoring_numerical_features.png

Copied from https://www.youtube.com/watch?v=tGckE83S-4s&t=660s&ab_channel=Databricks

About the KS+Bonferroni correction (https://www.youtube.com/watch?v=tGckE83S-4s&t=660s&ab_channel=Databricks):

KS_bonferroni_correction.png

Copied from https://www.youtube.com/watch?v=tGckE83S-4s&t=660s&ab_channel=Databricks

More about Drifts: https://medium.com/@mahajan_sameer/mlops-series-introduction-to-mlops-data-drift-concept-drifts-and-how-to-handle-them-in-ml-e3821e05f948

To monitor the variance, look into the Levene test (https://www.youtube.com/watch?v=tGckE83S-4s&t=660s&ab_channel=Databricks):

levene_test_for_variance_monitoring.png

Copied from https://www.youtube.com/watch?v=tGckE83S-4s&t=660s&ab_channel=Databricks

See scipy and this demo (https://www.statology.org/levenes-test-python/) to see how to use the Levene statistical test to monitor variance.
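A minimal sketch with scipy (synthetic data; the variance of the "production" sample is deliberately inflated):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
reference = rng.normal(0, 1, 1000)   # e.g. feature values at training time
production = rng.normal(0, 2, 1000)  # e.g. live feature values, larger variance

stat, p_value = stats.levene(reference, production)
if p_value < 0.05:
    print(f"Variance drift detected (statistic={stat:.2f}, p={p_value:.3g})")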

Monitoring categorical features:

monitoring_categorical_features.png

Copied from https://www.youtube.com/watch?v=tGckE83S-4s&t=660s&ab_channel=Databricks

where

one_way_xi2_test.png

Copied from https://www.youtube.com/watch?v=tGckE83S-4s&t=660s&ab_channel=Databricks

monitoring_models.png

Copied from https://www.youtube.com/watch?v=tGckE83S-4s&t=660s&ab_channel=Databricks

How to measure the drifts?

  1. Comparing the expectations (in the sense of Great Expectations) on reference data and production data

  2. For the univariate case:

  • Kolmogorov-Smirnov (KS) test: determines the maximum distance between two distributions’ cumulative density functions.

  • Chi-squared test: similarly, for categorical data (input features, targets, etc.) we can apply Pearson’s chi-squared test to determine if a frequency of events (1D) is consistent with a reference distribution.

  3. Multivariate case: measuring drift is fairly straightforward for univariate data but difficult for multivariate data. In https://madewithml.com/courses/mlops/monitoring/, it is recommended to reduce data dimensionality (PCA and the like) and then measure. The different dimensionality reduction techniques applied on multivariate data yield either 1D or multidimensional data, and so different statistical tests are used to calculate drift:

  • Maximum Mean Discrepancy (MMD): a kernel-based approach that determines the distance between two distributions by computing the distance between the mean embeddings of the features from both distributions.

  • Kolmogorov-Smirnov (KS) Test + Bonferroni Correction: determines the maximum distance between two distribution’s cumulative density functions. We can apply this on each dimension of the multidimensional data and then use the Bonferroni correction (conservative) or the False discovery rate (FDR) correction to mitigate issues stemming from multiple comparisons.