Case Study


The client wants to install a monitoring system for their machine learning models in the cloud.


The client wishes to deploy their machine learning models to production on the Azure cloud so that consumers can actively use them. In a production environment, a model’s behavior may change because new data differs from the data used to train it. This affects the model’s accuracy and performance, and the client does not want to be caught off guard if critical production models deteriorate beyond acceptable levels. Because data science models are so important to the client’s business, they require a production monitoring system; maintaining an end-to-end model lifecycle is impossible without observability. Datics was therefore asked to design a solution that overcomes the limitations of the client’s existing infrastructure.


Major cloud service providers already integrate monitoring solutions into their products, for example Azure Monitor on Azure and CloudWatch on AWS. However, the combination of machine learning and the cloud is still a rather new concept, and such a specific application of cloud technology demands more than the integrated monitoring solutions offer. As a result, the lack of information about the models’ performance in production made it hard for the client’s data scientists to improve their models. So far, they have relied on the PyTorch framework to obtain machine learning metrics. This makes it difficult to correlate hardware metrics with model performance, because the two sets of metrics cannot be viewed alongside each other.


Datics researched various monitoring products to find an affordable solution that is compatible with the existing infrastructure. Based on this research, Datics concluded that Grafana and Prometheus would be the best tools to collect and visualize metrics. Both tools are open source and free of charge. Grafana’s customizable dashboards are feature-rich and can be configured to display data from a wide range of databases using visualizations such as heatmaps, histograms, and charts. The platform is flexible and easy to use, and it has native support for a broad range of data sources, including CloudWatch, Azure Monitor and, of course, Prometheus. Prometheus offers capabilities to monitor cloud-native applications and infrastructure and can watch over hundreds of microservices. It can also send alerts when major issues need attention. Furthermore, the Prometheus Node Exporter can be configured to expose metrics from client machines for Prometheus to collect. For the client’s scenario, Datics proposed serving the machine learning models in the cloud with TorchServe, because the data scientists are already working with the PyTorch framework. In addition, TorchServe collects important model performance metrics and makes them available to Prometheus through its metrics API. The monitoring system is packaged in a Docker container and is ready to be deployed on any cloud platform.
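To make the TorchServe-to-Prometheus hand-off more concrete, the sketch below parses metrics text in the Prometheus exposition format and derives a mean inference latency per request. It is an illustration under stated assumptions, not the client’s implementation: the metric names (ts_inference_requests_total, ts_inference_latency_microseconds), the model name "demo", and all sample values are assumed for the example, and the names TorchServe actually exposes depend on its version and configuration. In a live setup the text would come from TorchServe’s metrics endpoint (commonly port 8082, path /metrics, by default) rather than from a hard-coded string.

```python
import re

# Matches one sample line: name{label="value",...} value [timestamp]
_SAMPLE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+\d+)?$'
)
# Matches one label pair inside the braces: key="value"
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

# Hypothetical scrape output; names and values are assumptions for illustration.
SAMPLE_METRICS = """\
# HELP ts_inference_requests_total Total number of inference requests.
# TYPE ts_inference_requests_total counter
ts_inference_requests_total{model_name="demo",model_version="default"} 4.0
# TYPE ts_inference_latency_microseconds counter
ts_inference_latency_microseconds{model_name="demo",model_version="default"} 8000.0
"""


def parse_metrics(text):
    """Return {(metric_name, sorted_label_pairs): value} for each sample line."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE.match(line)
        if match is None:
            continue  # Ignore lines this simple parser does not understand.
        name, raw_labels, value = match.groups()
        labels = tuple(sorted(_LABEL.findall(raw_labels or "")))
        samples[(name, labels)] = float(value)
    return samples


def mean_latency_ms(samples, model_name):
    """Mean inference latency per request in milliseconds, or None if no requests."""
    requests = 0.0
    latency_us = 0.0
    for (name, labels), value in samples.items():
        if dict(labels).get("model_name") != model_name:
            continue
        if name == "ts_inference_requests_total":
            requests += value
        elif name == "ts_inference_latency_microseconds":
            latency_us += value
    if requests == 0:
        return None
    return latency_us / requests / 1000.0


print(mean_latency_ms(parse_metrics(SAMPLE_METRICS), "demo"))  # prints 2.0
```

In practice Prometheus performs this scraping itself and a ratio like this would be computed in a dashboard query; the sketch only shows what the exposed data looks like and how the two counters relate.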


The data scientists can now view hardware metrics from Azure Monitor and model metrics from Prometheus side by side in a Grafana dashboard. Not only has the performance of the models improved, but the IT staff can also respond more quickly to critical issues in the production environment.


Grafana, Prometheus, Azure Monitor, TorchServe, Docker.  

