How to Scale a Microservice Using EC2 GPU Instances in the Cloud

IAS Tech Blog
7 min read · Oct 3, 2022

--

Translating millions of requests per day using Scalable ML models.

By Vikram Gupta, Senior Software Engineer at Integral Ad Science.

Introducing IAS’s In-House Language Translation Services

Integral Ad Science’s (IAS) Machine Learning-Based Language Translation Service is a neural machine translation service that delivers fast, high-quality, affordable, and configurable language translation. Neural machine translation is a form of language translation automation that uses deep learning models to provide more accurate and natural-sounding translation than traditional statistical and rule-based translation algorithms.

With the help of the IAS Machine Learning Translation Service, you can localize content such as websites and applications for linguistically diverse users, easily translate large volumes of text for analysis, and efficiently enable cross-lingual communication between users.

Currently, IAS translates 40 languages for contextual and sentiment analysis.

The IAS Machine Learning Translation Service is a scalable service that runs on Amazon ECS powered by GPU-based EC2 instances. This article talks about IAS’s in-house language translation service, its performance, and cost savings in detail.

Scalable and Optimal Multi-Language Deployment Design

The current architecture supports multiple languages, each deployed via separate CloudFormation (CF) stacks. Each language has two CF stacks: one for log groups and one for infrastructure. The log group stack creates CloudWatch log groups that the infrastructure stack references when it creates the ECS cluster and service. The high-level diagram below depicts the architecture of the language translation service.

The translation service uses the Flask framework to expose translation APIs, which internally load the data science models and call the translation server to get the translated strings. The Flask application is dockerized and runs on AWS ECS with the EC2 launch type. The service is then wrapped into a CloudFormation stack.
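To make the shape of the API concrete, here is a minimal sketch of a Flask translation service in the spirit described above. The model loading and the call to the translation server are stubbed out, and all names (routes, fields, functions) are illustrative rather than IAS's actual code:

```python
# Minimal sketch of a per-language Flask translation API.
# load_model() and translate() are stand-ins for loading the data science
# model and calling the translation server; names are hypothetical.
from flask import Flask, jsonify, request

app = Flask(__name__)

def load_model(language):
    """Stand-in for loading the per-language translation model."""
    return {"language": language}

# Each container serves a single language.
MODEL = load_model("ro")

def translate(model, text):
    """Stand-in for the call to the translation server."""
    return f"[{model['language']}] {text}"

@app.route("/translate", methods=["POST"])
def translate_endpoint():
    payload = request.get_json(force=True)
    return jsonify({"translation": translate(MODEL, payload["text"])})

@app.route("/health")
def health():
    # Answers the ALB target group health checks.
    return "ok"
```

In a setup like this, the Docker image would simply run the Flask app behind a production WSGI server, and ECS task definitions would point at that image.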

The ECS service's tasks register themselves to their own language-specific target group (e.g., Dutch has its own target group, Japanese has its own, and so on). Each language's target group is attached to the public-facing ALB through listener rules. When a request arrives at the ALB, a query parameter identifies the target group, and traffic is routed to the corresponding language. Multiple tasks can be registered to a single target group, so requests are balanced across them using the round-robin algorithm.
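The routing behavior described above can be simulated in a few lines: pick the target group from the query parameter, then hand requests to that group's tasks round-robin. The language codes and task names below are hypothetical:

```python
# Illustrative simulation of the ALB routing described above:
# query param -> language-specific target group -> round-robin over tasks.
from itertools import cycle

TARGET_GROUPS = {
    "nl": ["task-nl-1", "task-nl-2"],
    "ja": ["task-ja-1", "task-ja-2", "task-ja-3"],
}

# One round-robin iterator per target group, as the ALB balances each
# target group independently.
_round_robins = {lang: cycle(tasks) for lang, tasks in TARGET_GROUPS.items()}

def route(query_params):
    """Return the next task for the language named in the query string."""
    lang = query_params["lang"]
    return next(_round_robins[lang])
```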

The IAS language translation service is designed to support as many languages as customers require. Two basic steps are needed to add support for a new language:

  1. Build the new Machine Learning Model for required languages.
  2. Add infrastructure requirements to the JSON file.

See the example below.
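The original post shows the JSON file as a screenshot; a hypothetical shape for a per-language entry might look like the following (the field names and values are illustrative, not the actual IAS schema):

```json
{
  "language": "ro",
  "modelS3Path": "s3://models-bucket/ro/model.tar.gz",
  "instanceType": "g4dn.xlarge",
  "minTasks": 2,
  "maxTasks": 10,
  "desiredTasks": 2
}
```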

The service is designed so that you only need to upload a model file to a specified S3 bucket.

Once the model is uploaded, we add the corresponding changes to the JSON file, as in the snippet above.

The Jenkins pipeline builds the infrastructure on AWS to run the service in ECS containers.

Cost Savings From the New Design: Choosing G Series Over C Series Instances

The highly scalable multi-region language translation service was earlier deployed on EC2 c6i series instances. Compute-optimized instances are ideal for compute-bound applications that benefit from high-performance processors. Instances in this family are well suited for batch processing workloads, media transcoding, high-performance web servers, and other compute-intensive applications.

But EC2 G4 instances are among the industry's most cost-effective and versatile GPU instances for deploying machine learning models for tasks such as translation, image classification, object detection, and speech recognition, as well as for graphics-intensive applications. G4 instances are available with a choice of NVIDIA GPUs (G4dn) or AMD GPUs (G4ad).

G4dn instances are optimized for machine learning inference and model training. At IAS, other teams were already using this instance type (G4dn) for video frame analysis. After consulting with those teams about our use case, we went ahead with g4dn instances for evaluation.

Initially, the language translation services were deployed on c6i.4xlarge instances because they are compute optimized, but we ran several performance tests on both c6i.4xlarge and g4dn.xlarge and compared the results, including cost, latency, and error rates. Based on these results, we decided to use g4dn instances for the language translation services.

The following results compare the key parameters for Romanian:

We performed multiple tests for different languages, and in every test the results were better on the GPU instances.

Cost Savings: In-House Language Translation Services vs. Third-Party Services

Before the launch of the in-house language translation service, IAS relied on third-party services, which cost a significant amount of money for translation and license renewals.

The IAS Engineering and Data Science teams conceptualized and built their own service, which is scalable and reliable, with high performance and significant cost savings.

The following table illustrates the comparison based on 25 languages currently supported by in-house services.

Scaling Helps Reduce Costs Using Amazon ECS Service and Cluster Autoscaling

The language translation service uses two types of scaling:

  1. Service auto-scaling
  2. Cluster auto-scaling

Service Auto-Scaling:

ECS service auto-scaling is configured with a minimum, maximum, and desired number of tasks for the service. The service scales itself based on average service CPU utilization. Currently, the service is configured with a 45% average CPU target: whenever the average CPU utilization across the service's tasks goes above this threshold, new tasks are created; when it drops below, tasks are removed, so the service maintains the target value.

Note that each language has different minimum, maximum, and desired counts. The desired count is initialized to the minimum count so that the desired number of tasks keeps running, and requests keep being handled, while the stack is being updated. The desired count can be raised to reduce response latency during a stack update.

A target tracking policy is used to track CPU utilization and scale tasks in and out.
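Since the service is wrapped in a CloudFormation stack, the target tracking setup described above can be sketched roughly as follows. The logical resource names, capacity values, and referenced cluster/service/role are placeholders, not the actual IAS template:

```yaml
ServiceScalableTarget:
  Type: AWS::ApplicationAutoScaling::ScalableTarget
  Properties:
    ServiceNamespace: ecs
    ScalableDimension: ecs:service:DesiredCount
    ResourceId: !Sub service/${TranslationCluster}/${TranslationService.Name}
    MinCapacity: 2        # per-language minimum task count
    MaxCapacity: 10       # per-language maximum task count
    RoleARN: !GetAtt AutoScalingRole.Arn

ServiceCpuScalingPolicy:
  Type: AWS::ApplicationAutoScaling::ScalingPolicy
  Properties:
    PolicyName: cpu-target-tracking
    PolicyType: TargetTrackingScaling
    ScalingTargetId: !Ref ServiceScalableTarget
    TargetTrackingScalingPolicyConfiguration:
      PredefinedMetricSpecification:
        PredefinedMetricType: ECSServiceAverageCPUUtilization
      TargetValue: 45.0   # the 45% average CPU target mentioned above
```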

Host/Service Auto Scaling for Romanian Language Translation Service:

Real-time requests vs. latency graph for the Romanian language translation service

Cluster Auto-Scaling:

The cluster is assigned a capacity provider (CP), and the CP is linked to an Auto Scaling group (ASG). The CP has a target capacity of 95%, which means the cluster aims to utilize at most 95% of its Amazon EC2 instance capacity at any time; the remaining 5% is kept free as headroom.

Managed scaling is enabled on the CP to handle scale-in and scale-out of the ASG. Managed termination protection is enabled to protect EC2 instances in the ASG that are running tasks from being terminated. The ASG uses a target tracking policy to scale EC2 instances in and out according to ECS service demand as the services scale their tasks.
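In CloudFormation terms, the capacity provider configuration described above might look roughly like this (logical names and the referenced ASG/cluster are placeholders, not the actual IAS template):

```yaml
TranslationCapacityProvider:
  Type: AWS::ECS::CapacityProvider
  Properties:
    AutoScalingGroupProvider:
      AutoScalingGroupArn: !Ref TranslationAsg   # ASG defined elsewhere in the stack
      ManagedScaling:
        Status: ENABLED
        TargetCapacity: 95        # keep ~5% of instance capacity free
      ManagedTerminationProtection: ENABLED

CapacityProviderAssociation:
  Type: AWS::ECS::ClusterCapacityProviderAssociations
  Properties:
    Cluster: !Ref TranslationCluster
    CapacityProviders:
      - !Ref TranslationCapacityProvider
    DefaultCapacityProviderStrategy:
      - CapacityProvider: !Ref TranslationCapacityProvider
        Weight: 1
```

With managed scaling enabled, ECS itself adjusts the ASG's desired capacity to hit the 95% target, so no separate instance-level alarms need to be maintained.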

Concluding the Story

IAS's language-based services are getting bigger and handling more requests every day, with support continually added for new customer languages. These services also make a significant business impact and support IAS's other contextual products, including context control and pre-bid and post-bid offerings.

The services deployed on AWS ECS are working well for language-based workloads, with good performance and cost savings. There is an opportunity to explore a new infrastructure design for these services using Kubernetes.

Join Our Innovative Team

IAS is a global leader in digital media quality. Our engineers collaborate daily to design for excellence as we strive to build high-performing platforms and leverage impactful tools to make every impression count. We analyze emerging industry trends in order to drive innovation, research new areas of interest, and enhance our revolutionary technology to provide top-tier media quality outcomes. IAS is an ever-expanding company in a constantly evolving space, and we are always looking for new collaborative, self-starting technologists to join our team. If you are interested, we would love to have you on board! Check out our job opportunities here.
