Google Machine Learning Engineer Professional-Machine-Learning-Engineer Question # 48 Topic 6 Discussion

You have deployed a scikit-learn model to a Vertex AI endpoint using a custom model server. You enabled auto scaling; however, the deployed model fails to scale beyond one replica, which has led to dropped requests. You notice that CPU utilization remains low even during periods of high load. What should you do?

Attach a GPU to the prediction nodes.

Increase the number of workers in your model server.

Schedule scaling of the nodes to match expected demand.

Increase the minReplicaCount in your DeployedModel configuration.

Explanation

Auto scaling automatically adjusts the number of prediction nodes based on the traffic and load of your deployed model [1]. However, auto scaling depends on the CPU utilization of your prediction nodes, which is the percentage of CPU resources used by your model server [1]. If CPU utilization stays low even during periods of high load, your model server is not fully using the available CPU resources, so auto scaling never triggers additional replicas [2].
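The dependence on CPU utilization can be illustrated with a proportional scaling rule of the kind that utilization-based autoscalers commonly use. This is a minimal sketch, not the service's actual implementation: the function name `target_replicas`, the 60% target, and the replica limits are assumptions chosen for illustration.

```python
import math


def target_replicas(current_replicas: int, current_cpu_pct: float,
                    target_cpu_pct: float = 60.0,
                    min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Proportional rule: scale the replica count by observed / target utilization.

    All defaults here are assumed values for illustration only.
    """
    raw = math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)
    # Clamp the result to the configured replica range.
    return max(min_replicas, min(max_replicas, raw))


# CPU stays at 20% even though requests are being dropped:
# the rule never asks for a second replica.
print(target_replicas(1, 20.0))  # 1

# Same node once the model server keeps its cores busy (90% CPU).
print(target_replicas(1, 90.0))  # 2
```

Under this rule, no amount of queued or dropped traffic adds a replica unless that traffic shows up as CPU load on the existing node.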

One possible reason for low CPU utilization is that your model server uses a single worker process to handle prediction requests [3]. A worker process is a subprocess that runs your model code and serves prediction requests [3]. With only one worker process, the server handles one request at a time, which leads to dropped requests when traffic is high [3]. To increase both the CPU utilization and the throughput of your model server, increase the number of worker processes so that the server can handle multiple requests in parallel [3].
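The effect of the worker count on throughput can be seen in a toy simulation. This is a sketch under simplifying assumptions: a thread pool stands in for the server's worker processes, `time.sleep` stands in for the time a worker is busy with one prediction, and the names `handle_request` and `drain_queue` are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def handle_request(service_time: float) -> None:
    # Stand-in for one prediction; the worker is occupied for service_time seconds.
    time.sleep(service_time)


def drain_queue(num_requests: int, workers: int, service_time: float = 0.05) -> float:
    """Return the wall-clock seconds needed to serve num_requests with a fixed worker pool."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each pool slot serves one request at a time; the rest wait in the queue.
        list(pool.map(handle_request, [service_time] * num_requests))
    return time.perf_counter() - start


single = drain_queue(num_requests=8, workers=1)
parallel = drain_queue(num_requests=8, workers=4)
print(f"1 worker: {single:.2f}s, 4 workers: {parallel:.2f}s")
```

With one worker the eight requests are served back to back, while four workers serve them in two batches. In a real server, requests that wait too long in the queue are the ones that time out and get dropped.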

To increase the number of workers, change how your custom model server is started and specify the number of worker processes you want [3]. With a Gunicorn server, for example, the --workers flag sets this number. The following command starts the model server with four worker processes:

gunicorn --bind :$PORT --workers 4 --threads 1 --timeout 60 main:app

By increasing the number of workers in your model server, you can increase the CPU utilization of your prediction nodes, and thus enable auto scaling to scale beyond one replica.
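The same settings can also live in a Gunicorn configuration file, which is plain Python, instead of on the command line. The sketch below mirrors the flags of the command above; sizing the pool to one worker per CPU core and overriding it through a `NUM_WORKERS` environment variable are assumptions made for illustration, not recommendations from the source.

```python
# gunicorn.conf.py
import multiprocessing
import os

# Listen on the port given in the PORT environment variable, as in the
# command-line example; 8080 is an assumed fallback.
bind = f":{os.environ.get('PORT', '8080')}"

# Assumption: one worker per CPU core, overridable through a hypothetical
# NUM_WORKERS environment variable.
workers = int(os.environ.get("NUM_WORKERS", multiprocessing.cpu_count()))

# One thread per worker and a 60-second request timeout, matching the flags above.
threads = 1
timeout = 60
```

The server would then be started with `gunicorn -c gunicorn.conf.py main:app`. Tying the worker count to the core count means a larger machine type raises the worker count without editing the start command.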

The other options are not suitable because they do not address the root cause. Attaching a GPU or scheduling scaling does nothing to raise CPU utilization, and increasing minReplicaCount only raises the fixed number of nodes that always run regardless of traffic; it does not make auto scaling respond to load [1].

References:

1. Scaling prediction nodes | Vertex AI | Google Cloud

2. Troubleshooting | Vertex AI | Google Cloud

3. Using a custom prediction routine with online prediction | Vertex AI | Google Cloud

Actual exam question for Google Professional-Machine-Learning-Engineer exam by Kairo38350 at May 19, 2025, 12:00:11 AM
