The theory behind End-to-End MLOps CI/CD Pipeline
4 min read · Nov 20, 2023
MLOps Benefits
- Reproducibility
- Deployment
- Monitoring: Monitoring in MLOps refers to observing, measuring, and analyzing the performance, health, and behavior of machine learning models and systems in a production environment.
DevOps vs. MLOps
Different Tools for MLOps
Key Features of MLOps Tools
- Model training, tuning, and drift management
- Pipeline management
- Collaboration and communication capabilities
AWS SageMaker: Integrated APIs
- Data scientists and developers can quickly build and train machine learning models and deploy them directly into a production-ready hosted environment.
- Integrated Jupyter authoring notebook instance for easy access to your data sources for exploration and analysis
- Zero setup for data exploration
- Algorithms designed for massive datasets
The following diagram shows how you train and deploy a model with Amazon SageMaker.
The client data is stored in an S3 bucket, which serves as our storage platform.
Amazon SageMaker is divided into three blocks:
- Model training: Our data scientists write two kinds of scripts.
a. Helper Code: Defines where the data comes from and how we preprocess it. The algorithm and hyperparameter-tuning strategy are also defined here.
b. Training Code: The algorithm itself. AWS keeps all the training images in the ECR container registry (no need to pip install the packages); we pull the Docker image to start model training. Once training is done, the model artifact is automatically saved to the S3 bucket as a model.tar.gz file.
- Deployment/Hosting:
a. Helper Code: Specifies which machine (instance type) we deploy our ML model to and how many instances we deploy it on.
b. Inference Code: Fetched from the ECR container registry; it is responsible for creating the endpoint. The endpoint is an API, our REST API, that developers use to build client-side applications.
- ECR Container Registry: The place where the Dockerized images are stored.
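The model.tar.gz artifact format mentioned above can be produced with standard Python tooling. This is a minimal local sketch; the pickle-based serialization and the `model.pkl` filename are illustrative assumptions (the real layout depends on the framework container you train with), only the `model.tar.gz` name comes from SageMaker's convention.

```python
import pickle
import tarfile
import tempfile
from pathlib import Path

def package_model(model, out_dir: str) -> Path:
    """Serialize a trained model object and pack it as model.tar.gz,
    the artifact format SageMaker writes to S3 after training."""
    out = Path(out_dir)
    model_file = out / "model.pkl"           # illustrative filename
    with open(model_file, "wb") as f:
        pickle.dump(model, f)
    archive = out / "model.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(model_file, arcname="model.pkl")
    return archive

if __name__ == "__main__":
    # Stand-in "model": any picklable object works for the sketch.
    with tempfile.TemporaryDirectory() as d:
        path = package_model({"weights": [0.1, 0.2]}, d)
        print(path.name)  # model.tar.gz
```

In a real pipeline the archive would then be uploaded to the S3 bucket that the deployment block reads from.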
Amazon SageMaker can be integrated with AWS Ground Truth to collect incoming data from customers. We can reuse this data to retrain the model, for example after six months, and deploy a new version of the ML model.
AWS for MLOps
Amazon SageMaker Pipelines
- First purpose-built CI/CD service for machine learning
- With SageMaker Pipelines, you can create, automate, and manage end-to-end ML workflows at scale.
- Orchestrates workflows across each step of the machine learning process
Key Features:
- Compose, manage, and reuse ML workflows
- Choose the best models for deploying into production
- Automatic tracking of models: AWS CloudWatch
- Bring CI/CD to machine learning
AWS ML Model Monitoring
- Serving issues
- Built-in analysis
- Monitor your ML models by scheduling monitoring jobs through Amazon SageMaker Model Monitor.
- Automatically kick off monitoring jobs to analyze model predictions during a given period.
- Generated reports by monitoring jobs can be saved in Amazon S3 for further analysis.
- View model metrics via Amazon CloudWatch and consume notifications to trigger alarms or corrective actions, such as retraining the model or auditing data.
- Integrates with other visualization tools, including Tensorboard, Amazon QuickSight, and Tableau.
- Use AWS EventBridge Service to trigger a timer to run the pipelines on SageMaker.
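The EventBridge timer mentioned above boils down to a scheduled rule. Here is a sketch that builds the parameters for such a rule; the rule name is a hypothetical placeholder, and the actual `boto3` call is shown commented out because it needs AWS credentials and a target (the SageMaker pipeline) to be attached.

```python
def schedule_rule_params(rule_name: str, rate_hours: int) -> dict:
    """Build kwargs for an EventBridge put_rule call that fires
    on a fixed schedule (rate expression)."""
    unit = "hour" if rate_hours == 1 else "hours"
    return {
        "Name": rule_name,
        "ScheduleExpression": f"rate({rate_hours} {unit})",
        "State": "ENABLED",
    }

params = schedule_rule_params("retrain-pipeline-timer", 24)
# With real credentials, the rule would be created like this, then a
# target pointing at the SageMaker pipeline would be attached:
#   import boto3
#   boto3.client("events").put_rule(**params)
print(params["ScheduleExpression"])  # rate(24 hours)
```

Cron expressions (`cron(0 2 * * ? *)` etc.) are the other schedule form EventBridge accepts, useful when retraining must run at a specific time of day.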
Feature Store
Centralized repository or a platform for managing and serving machine learning features.
Key Benefits
- Improved model accuracy and consistency
- Faster model development and deployment
- Better governance and compliance
- Increased collaboration and knowledge sharing
Ingest data from many sources
✓ Ingest features using streaming data sources like Amazon Kinesis Data Firehose
✓ Create features using data preparation tools such as Amazon SageMaker Data Wrangler and store them directly in SageMaker Feature Store
Search and discovery
✓ Tag and index features
✓ Browse the feature catalog
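The two capabilities above, storing records and discovering them via a catalog, can be illustrated with a toy in-memory sketch. This is not the SageMaker Feature Store API; it is a deliberately simplified assumption-laden model (one record per entity id, tag-based search) to show what a feature store does conceptually. Real systems add offline/online stores, versioning, and point-in-time retrieval.

```python
from collections import defaultdict

class MiniFeatureStore:
    """Toy in-memory feature store: feature groups hold one record
    per entity id, and groups are discoverable through tags."""

    def __init__(self):
        self._groups = defaultdict(dict)   # group -> {entity_id: features}
        self._tags = defaultdict(set)      # group -> {tag, ...}

    def put_record(self, group, entity_id, features):
        self._groups[group][entity_id] = dict(features)

    def get_record(self, group, entity_id):
        return self._groups[group].get(entity_id)

    def tag_group(self, group, *tags):
        self._tags[group].update(tags)

    def search(self, tag):
        """Browse the catalog: return groups carrying a given tag."""
        return sorted(g for g, t in self._tags.items() if tag in t)

store = MiniFeatureStore()
store.put_record("customers", "c-42", {"tenure_months": 18, "plan": "pro"})
store.tag_group("customers", "churn", "billing")
print(store.search("churn"))                     # ['customers']
print(store.get_record("customers", "c-42"))     # the stored features
```

The same feature record can now be read consistently by both the training pipeline and the inference service, which is where the "improved model accuracy and consistency" benefit comes from.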
AWS MLOps Post-Deployment Challenges
Data Drift: Data drift is a change in the distribution of the input data over time.
- Example: as movie and TV streaming becomes increasingly popular, the input distribution shifts away from what the model was trained on, so we should retrain the model.
Example: Iris flower dataset (Credits: Evidently AI)
- Concept Drift: Over time, the relationship between the features and the target changes, so the importance of different features shifts and the model needs to be retrained.
- In the figure, loading time becomes more important for churn over time.
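Data drift like the above is typically quantified with a distribution-distance score between a baseline (training) sample and a production sample. Below is a minimal pure-Python sketch of one common choice, the Population Stability Index (PSI); the thresholds in the comment are a widely used rule of thumb, not a fixed standard, and the exact metric a team uses is an implementation choice.

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline sample and a
    production sample. Common (assumed) rule of thumb:
    < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 significant drift."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0

    def bin_fractions(sample):
        counts = [0] * bins
        for x in sample:
            i = min(int((x - lo) / width), bins - 1)
            counts[i] += 1
        # Small floor avoids log(0) for empty bins.
        return [max(c / len(sample), 1e-6) for c in counts]

    e = bin_fractions(expected)
    a = bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

baseline = [i / 100 for i in range(100)]         # uniform on [0, 1)
shifted = [0.5 + i / 200 for i in range(100)]    # shifted distribution
print(psi(baseline, baseline) < 0.1)   # True: no drift against itself
print(psi(baseline, shifted) > 0.25)   # True: clear drift
```

A monitoring job (e.g. one scheduled through SageMaker Model Monitor) would run a check like this on each batch of production data and raise an alarm, or trigger retraining, when the score crosses the threshold.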
Software engineering challenges
- Environment changes: some libraries in use may go out of support.
- Out-of-service Cloud
- Compute resources (CPU/GPU/memory)
- Security and Privacy