Toward a collaborative and efficient usage of GPUs in DICLUB  

3 June 2024 | Sala Stringa - Online | 11:00 | Marco Gaido and Roldano Cattoni (FBK-MT)

The GPUs in the DICLUB cluster are a precious resource shared among the groups and users of our center. Their scarcity, combined with high demand, leads users to compete for the right to run jobs on them. This is particularly true for the top-performing GPUs, the A40s, for which many jobs often have to wait in a queue, while many other GPUs (such as the K80s) are underutilized. In light of this, using the GPUs in our cluster efficiently has become important for increasing the productivity of our center. However, over the past years we have seen patterns of allocation and usage that do not prioritize overall throughput and efficient use of resources, particularly among new users with less experience. For this reason, in this seminar we will report some of the "usage problems" we have seen over these years, with suggestions on how to monitor and avoid them. Our goal is to create a collaborative culture on DICLUB, where users focus on fully exploiting the available resources and prioritize throughput (i.e., the overall number of experiments executed in a given amount of time), instead of competing for resource allocation with the latency of their own single experiment as the only goal, at the expense of overall throughput.

Marco Gaido obtained his PhD in Computer Science in 2023 at the University of Trento (Italy) with a thesis on speech-to-text translation, and his MS in Computer Engineering from the Politecnico di Torino with a thesis on text clustering. Before starting his PhD, he worked in big data computing, becoming an Apache Spark contributor and Apache Livy PPMC member.

Roldano Cattoni joined FBK in 1990, working for 5 years on planning and control of mobile robots, for 3 years on visual-based monitoring and user profiling, and for 2 years on software agents and distributed computing. In early 2000 he started working on Machine Translation. His interests include Machine and Speech Translation, web-based tools for systems and demonstrators, and virtualization tools.