Google Cloud announced that Ops Agent, the agent for collecting telemetry from Compute Engine instances, can now collect and aggregate metrics from NVIDIA GPUs on VMs.
The use of AI and ML technologies within organizations has grown, notably in domains like product recommendations, scientific computing, and gaming. Meeting the demanding computational requirements of these applications requires GPUs, and using them effectively to optimize AI and ML development calls for a thorough understanding of GPU performance metrics. To address these needs, Google Cloud expanded the capabilities of Ops Agent with the ability to collect metrics from NVIDIA GPUs.
Ops Agent enables users to:
1. Visualize GPU Fleet Health: Gain insights into GPU fleet health through GPU metrics and pre-built dashboards.
2. Optimize Costs and Workloads: Identify underutilized GPUs and optimize workload distribution to reduce costs and maximize efficiency.
3. Plan Scaling Efficiently: Analyze trends and patterns to make informed decisions on expanding GPU capacity or upgrading existing GPUs.
4. Identify Workload Consumption: Pinpoint which GPU processes, notably ML models, are consuming GPU utilization and memory.
5. Use DCGM Profiling Metrics: Leverage DCGM profiling metrics to detect bottlenecks and performance issues within the GPU.
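To make item 2 concrete, here is a minimal Python sketch of how underutilized GPUs might be flagged once utilization samples are available. The function name, the GPU identifiers, the sample values, and the 20% threshold are all illustrative assumptions; they are not part of Ops Agent or Cloud Monitoring.

```python
from statistics import mean
from typing import Dict, List


def find_underutilized_gpus(
    samples: Dict[str, List[float]],
    threshold_percent: float = 20.0,  # assumed cutoff, tune per workload
) -> List[str]:
    """Return the ids of GPUs whose mean utilization is below the threshold.

    `samples` maps a GPU id to utilization readings in percent.
    GPUs with no readings are skipped rather than flagged.
    """
    return sorted(
        gpu_id
        for gpu_id, values in samples.items()
        if values and mean(values) < threshold_percent
    )


# Hypothetical readings for three GPUs across two VMs.
samples = {
    "vm-a/gpu0": [85.0, 90.0, 78.0],
    "vm-a/gpu1": [5.0, 12.0, 3.0],
    "vm-b/gpu0": [15.0, 40.0, 20.0],
}
underutilized = find_underutilized_gpus(samples)
print(underutilized)
```

In practice the readings would come from the collected GPU metrics rather than a hard-coded dictionary, and a mean over a longer window would likely be more meaningful than three samples.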
Ops Agent uses the NVIDIA Management Library (NVML) to collect essential GPU metrics without additional configuration. These metrics include GPU utilization, GPU memory usage, process maximum GPU memory usage, and process lifetime GPU utilization.
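To give a sense of the data NVML exposes, the sketch below reads per-GPU utilization and memory usage through the `pynvml` Python bindings. This is an assumption-laden illustration, not how Ops Agent is implemented: the function name and the returned dictionary keys are hypothetical, and the module is passed in as a parameter so the code can be imported on machines without a GPU.

```python
from typing import Any, Dict, List


def collect_gpu_metrics(nvml: Any) -> List[Dict[str, float]]:
    """Read utilization and memory usage for every GPU visible to NVML.

    `nvml` is expected to be the `pynvml` module, or any object that
    exposes the same functions.
    """
    nvml.nvmlInit()
    try:
        metrics = []
        for index in range(nvml.nvmlDeviceGetCount()):
            handle = nvml.nvmlDeviceGetHandleByIndex(index)
            util = nvml.nvmlDeviceGetUtilizationRates(handle)  # percentages
            mem = nvml.nvmlDeviceGetMemoryInfo(handle)  # values in bytes
            metrics.append({
                "gpu_index": index,
                "utilization_percent": float(util.gpu),
                "memory_used_bytes": float(mem.used),
                "memory_total_bytes": float(mem.total),
            })
        return metrics
    finally:
        # Always release NVML, even if a query above fails.
        nvml.nvmlShutdown()
```

On a VM with an NVIDIA driver and `pynvml` installed, calling `collect_gpu_metrics(pynvml)` after `import pynvml` should return one dictionary per GPU.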

In addition, Ops Agent supports the collection of advanced GPU metrics using NVIDIA's Data Center GPU Manager (DCGM). DCGM offers an API for profiling-level metrics of various hardware components, providing deeper insights into GPU performance.

Ops Agent simplifies GPU metric visualization alongside other offerings in Google Cloud's operations suite. Users can query and visualize the collected GPU metrics, build custom charts, and create dashboards. A dedicated NVIDIA GPU Monitoring dashboard offers a consolidated view of GPU fleet health.
Ops Agent is a unified telemetry agent that automates the collection of host metrics, system logs, and other metrics. It simplifies the management of telemetry processes, allowing users to focus on getting the most out of their GPU VMs.
To ease adoption, Google Cloud has introduced a one-click option to add Ops Agent while creating a new VM via the Google Cloud console. This lets users try Ops Agent with its default configuration.

For complete instructions on how to install and configure Ops Agent for GPU instance monitoring, refer to the documentation.
Other cloud vendors also offer monitoring solutions for GPUs: CloudWatch, the monitoring tool of AWS, supports GPU monitoring, and Azure supports GPU monitoring with Container insights.
