Background

The DAVE multibeam sonar plugin, adapted for ROS 2 Jazzy and Gazebo Harmonic, uses a ray-based multibeam model that simulates phase, reverberation, and speckle noise through a point scattering approach. It generates realistic intensity-range (A-plot) data while accounting for time and angular ambiguities as well as speckle noise.

Below is a diagram showing how the plugin is structured and how CUDA is used to perform the sonar calculations after the optimization and ROS 2 migration. The functions highlighted in green are new and replace older ones; other modifications are discussed later in this article.

[Image: diagram of the plugin structure and CUDA usage]

Methodology

The CUDA code is divided into four steps: sonar calculation, ray summation, beam culling & windowing, and FFT (the process and underlying math are detailed in https://www.frontiersin.org/journals/robotics-and-ai/articles/10.3389/frobt.2021.706646/full#e8). In the context of this project, the goal was to optimize the code to reduce execution time. The initial optimization strategy was based on the hypothesis that applying well-known CUDA performance principles (see https://ams148-spring18-01.courses.soe.ucsc.edu/lecture-notes.html) would improve the code's efficiency:

1. Minimize data transfers between host and device.
2. Reuse buffers across calls and use pinned host memory for the transfers that remain.
3. Keep intermediate results on the GPU instead of copying them back to the host.
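
To make the four-step structure concrete, here is an illustrative sketch of one frame of the wrapper; all kernel and parameter names below are hypothetical placeholders, not the plugin's actual identifiers:

```cuda
#include <cuda_runtime.h>
#include <cufft.h>

// Hypothetical kernels standing in for the plugin's real ones.
__global__ void sonar_calculation(const float* rays, cufftComplex* field, int n_rays);
__global__ void ray_summation(const cufftComplex* field, cufftComplex* beams,
                              int n_rays, int n_beams);
__global__ void window_and_cull(cufftComplex* beams, const float* window,
                                int n_beams, int n_samples);

// One frame of the wrapper: all four steps run on the device in order.
void sonar_frame(const float* d_rays, cufftComplex* d_field, cufftComplex* d_beams,
                 const float* d_window, int n_rays, int n_beams, int n_samples,
                 cufftHandle fft_plan)
{
    const int T = 256;
    sonar_calculation<<<(n_rays + T - 1) / T, T>>>(d_rays, d_field, n_rays);            // step 1
    ray_summation<<<(n_beams + T - 1) / T, T>>>(d_field, d_beams, n_rays, n_beams);     // step 2
    window_and_cull<<<(n_beams + T - 1) / T, T>>>(d_beams, d_window, n_beams, n_samples); // step 3
    cufftExecC2C(fft_plan, d_beams, d_beams, CUFFT_FORWARD);                            // step 4
    cudaDeviceSynchronize();
}
```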

To begin analyzing the pain points of the CUDA code, the sonar wrapper was tested in isolation using standard inputs.

| Computation Step | Avg Time (ms) | Min (ms) | Max (ms) | Samples |
|---|---|---|---|---|
| GPU Sonar Computation | 108.27 | 102.56 | 124.56 | 50 |
| Sonar Ray Summation | 267.57 | 235.24 | 345.40 | 50 |
| GPU Window & Correction | 38.85 | 34.77 | 51.21 | 50 |
| GPU FFT Calculation | 22.20 | 17.29 | 28.35 | 50 |
| Total Sonar Calculation Wrapper | 434.71 | 396.84 | 540.35 | 50 |

Table 1 presents an overview of the original implementation used in the DAVE demo, with the code divided into its main components.

The results, obtained on an NVIDIA GeForce MX330, indicate that the sonar computation and ray summation stages are the most time-consuming, accounting for 86% of the total sonar wrapper single-call time. Notably, these sections are also where the majority of malloc and memcpy operations occur, making them the primary focus of the optimization efforts in this work.
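
Per-stage timings like those in Table 1 can be collected with CUDA events, which are recorded on the GPU's own timeline and so measure the stage itself rather than host-side overhead. A minimal sketch (the stage callback is a placeholder for one of the real wrapper steps):

```cuda
#include <cuda_runtime.h>

// Time one stage of the wrapper with CUDA events.
float time_stage(void (*stage)(void))
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    stage();                        // e.g. the sonar-computation step
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);     // wait until the stage has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```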

Using nsys stats (NVIDIA Nsight Systems), we profiled a single run at initialization (see Figure 1). The figure shows where time is spent in one execution of the plugin's CUDA code (which is called every Gazebo frame): most of it goes to memory allocation, device synchronization, and memory copies. cudaMalloc takes the most time, around 127.52 ms, to allocate all arrays and variables on the device. Next is cudaMemcpy at 122.79 ms, followed by cudaMallocHost at 40.51 ms.

When running a second time immediately afterward, cudaMalloc still takes roughly the same total time (even with twice the number of calls), but cudaMemcpy increases to 246.06 ms and cudaMallocHost to 87.68 ms. The CUDA code allocates and frees memory during every sonar frame, which explains why the cudaMallocHost time doubles while the cudaMalloc time does not (driver-side caching is one hypothesis). In any case, memory transfers between device and host are the main bottleneck of the code, and the repeated cudaMallocHost calls add further overhead.

Figure 1: Top 10 CUDA Functions by Total Time for the old code in one run

Figure 2: CUDA Kernels by Total Time for the old code in one run


Figure 2 shows the execution time of the kernels themselves. Although the kernels are not the main bottleneck, the way they are called still warrants analysis, particularly how many times each kernel is invoked per run. For example, in the demo, the column_sums_reduce kernel is launched 1,026 times in a single run, which introduces significant launch overhead, whereas most other kernels are called only once or twice per run.
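
One general way to cut that kind of launch overhead (a sketch of the technique, not necessarily the plugin's actual fix) is to cover all columns in a single launch with a grid-stride loop, so one kernel call replaces the many per-column ones:

```cuda
#include <cuda_runtime.h>

// Sum every column of a row-major (rows x cols) matrix in ONE launch:
// each thread walks one or more columns via a grid-stride loop, so the
// same kernel handles any number of columns without repeated launches.
__global__ void column_sums_all(const float* m, float* sums, int rows, int cols)
{
    for (int c = blockIdx.x * blockDim.x + threadIdx.x;
         c < cols;
         c += gridDim.x * blockDim.x)
    {
        float acc = 0.0f;
        for (int r = 0; r < rows; ++r)
            acc += m[r * cols + c];  // adjacent threads read adjacent columns: coalesced
        sums[c] = acc;
    }
}
```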

With all this in mind, the first step in optimizing the code was to examine how memory allocation was being performed, determine whether buffers could be reused, and verify if pinned memory was being applied (principle 2).
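
In practice this means allocating once at startup and reusing the same buffers every frame, with the host staging buffer in pinned memory instead of being re-allocated per call. A minimal sketch (all names are illustrative, not the plugin's):

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Buffers allocated once at plugin startup and reused every sonar frame,
// instead of cudaMalloc/cudaFree inside the per-frame path.
struct SonarBuffers {
    float* d_rays  = nullptr;  // device input
    float* d_beams = nullptr;  // device output
    float* h_beams = nullptr;  // pinned host buffer for fast D2H copies
};

void init_buffers(SonarBuffers& b, size_t n_rays, size_t n_beams)
{
    cudaMalloc(&b.d_rays,  n_rays  * sizeof(float));
    cudaMalloc(&b.d_beams, n_beams * sizeof(float));
    cudaMallocHost(&b.h_beams, n_beams * sizeof(float)); // pinned: faster DMA, enables async copies
}

void free_buffers(SonarBuffers& b)
{
    cudaFree(b.d_rays);
    cudaFree(b.d_beams);
    cudaFreeHost(b.h_beams);
}
```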

The second step involved identifying the parts of the code that consumed the most execution time. Optimization efforts then focused on minimizing memory transfers wherever possible, either by modifying kernel calls or avoiding unnecessary memory copies when values could remain on the GPU (principles 1 and 3).
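
A sketch of the per-frame flow under those principles: intermediates stay device-resident between kernels, and the single remaining device-to-host copy is asynchronous into pinned memory (all identifiers below are hypothetical):

```cuda
#include <cuda_runtime.h>

__global__ void stage_a(const float* in, float* mid, int n);  // hypothetical kernels
__global__ void stage_b(const float* mid, float* out, int n);

// Per-frame path: the intermediate 'd_mid' never leaves the GPU; only the
// final result crosses the PCIe bus, asynchronously, into pinned memory.
void run_frame(const float* d_in, float* d_mid, float* d_out,
               float* h_out_pinned, int n, cudaStream_t stream)
{
    const int T = 256, B = (n + T - 1) / T;
    stage_a<<<B, T, 0, stream>>>(d_in, d_mid, n);    // no host round-trip here
    stage_b<<<B, T, 0, stream>>>(d_mid, d_out, n);
    cudaMemcpyAsync(h_out_pinned, d_out, n * sizeof(float),
                    cudaMemcpyDeviceToHost, stream); // async requires pinned host memory
    cudaStreamSynchronize(stream);                   // single sync point per frame
}
```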