Background

The DAVE multibeam sonar plugin, adapted for ROS 2 Jazzy and Gazebo Harmonic, uses a ray-based multibeam model that simulates phase, reverberation, and speckle noise through a point scattering approach. It generates realistic intensity-range (A-plot) data while accounting for time and angular ambiguities as well as speckle noise.

Below is a diagram showing how the plugin is structured and how CUDA is used to perform the sonar calculations after the optimization and ROS 2 migration. The functions highlighted in green are new and replace older ones; other modifications are discussed later in this article.

[Image: diagram of the plugin structure and CUDA usage]

Methodology

The CUDA code is divided into four steps: sonar calculation, ray summation, beam culling & windowing, and FFT (the process and underlying math are detailed in https://www.frontiersin.org/journals/robotics-and-ai/articles/10.3389/frobt.2021.706646/full#e8). In the context of this project, the goal was to optimize the code to reduce execution time. The initial optimization strategy was based on the hypothesis that applying well-known CUDA performance principles (see https://ams148-spring18-01.courses.soe.ucsc.edu/lecture-notes.html) would improve the code's efficiency:

1. Minimize data transfers between host and device.
2. Reuse buffers across calls and use pinned host memory for the transfers that remain.
3. Keep intermediate results on the GPU instead of copying them back to the host.
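
To make the four-step structure concrete, here is an illustrative sketch of one frame of the wrapper; all kernel and parameter names below are hypothetical placeholders, not the plugin's actual identifiers:

```cuda
#include <cuda_runtime.h>
#include <cufft.h>

// Hypothetical kernels standing in for the plugin's real ones.
__global__ void sonar_calculation(const float* rays, cufftComplex* field, int n_rays);
__global__ void ray_summation(const cufftComplex* field, cufftComplex* beams,
                              int n_rays, int n_beams);
__global__ void window_and_cull(cufftComplex* beams, const float* window,
                                int n_beams, int n_samples);

// One frame of the wrapper: all four steps run on the device in order.
void sonar_frame(const float* d_rays, cufftComplex* d_field, cufftComplex* d_beams,
                 const float* d_window, int n_rays, int n_beams, int n_samples,
                 cufftHandle fft_plan)
{
    const int T = 256;
    sonar_calculation<<<(n_rays + T - 1) / T, T>>>(d_rays, d_field, n_rays);            // step 1
    ray_summation<<<(n_beams + T - 1) / T, T>>>(d_field, d_beams, n_rays, n_beams);     // step 2
    window_and_cull<<<(n_beams + T - 1) / T, T>>>(d_beams, d_window, n_beams, n_samples); // step 3
    cufftExecC2C(fft_plan, d_beams, d_beams, CUFFT_FORWARD);                            // step 4
    cudaDeviceSynchronize();
}
```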

To begin analyzing the pain points of the CUDA code, the sonar wrapper was tested in isolation using standard inputs.

| Computation Step | Avg Time (ms) | Min (ms) | Max (ms) | Samples |
|---|---|---|---|---|
| GPU Sonar Computation | 108.27 | 102.56 | 124.56 | 50 |
| Sonar Ray Summation | 267.57 | 235.24 | 345.40 | 50 |
| GPU Window & Correction | 38.85 | 34.77 | 51.21 | 50 |
| GPU FFT Calculation | 22.20 | 17.29 | 28.35 | 50 |
| Total Sonar Calculation Wrapper | 434.71 | 396.84 | 540.35 | 50 |

Table 1 presents an overview of the original implementation used in the DAVE demo, with the code divided into its main components.

The results, obtained on an NVIDIA GeForce MX330, indicate that the sonar computation and ray summation stages are the most time-consuming, accounting for 86% of the total sonar wrapper single-call time. Notably, these sections are also where the majority of malloc and memcpy operations occur, making them the primary focus of the optimization efforts in this work.
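
Per-stage timings like those in Table 1 can be collected with CUDA events, which are recorded on the GPU's own timeline and so measure the stage itself rather than host-side overhead. A minimal sketch (the stage callback is a placeholder for one of the real wrapper steps):

```cuda
#include <cuda_runtime.h>

// Time one stage of the wrapper with CUDA events.
float time_stage(void (*stage)(void))
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    stage();                        // e.g. the sonar-computation step
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);     // wait until the stage has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```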

Using nsys stats (NVIDIA Nsight Systems), we profiled a single run at initialization (see Figure 1). The figure shows where time is spent in one execution of the plugin's CUDA code (which is called every Gazebo frame): most of it goes to memory allocation, device synchronization, and memory copies. cudaMalloc takes the most time, around 127.52 ms, to allocate all arrays and variables on the device. Next is cudaMemcpy at 122.79 ms, followed by cudaMallocHost at 40.51 ms.

When running a second time immediately afterward, cudaMalloc still takes roughly the same total time (even with twice the number of calls), but cudaMemcpy increases to 246.06 ms and cudaMallocHost to 87.68 ms. The CUDA code allocates and frees memory during every sonar frame, which explains why the cudaMallocHost time doubles while the cudaMalloc time does not (driver-side caching is one hypothesis). In any case, memory transfers between device and host are the main bottleneck of the code, and the repeated cudaMallocHost calls add further overhead.

Figure 1: Top 10 CUDA Functions by Total Time for the old code in one run

Figure 2: CUDA Kernels by Total Time for the old code in one run


Figure 2 shows the execution time of the kernels themselves. Although the kernels are not the main bottleneck, the way they are called still warrants analysis, particularly how many times each kernel is invoked per run. For example, in the demo, the column_sums_reduce kernel is launched 1,026 times in a single run, which introduces significant launch overhead, whereas most other kernels are called only once or twice per run.
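
One general way to cut that kind of launch overhead (a sketch of the technique, not necessarily the plugin's actual fix) is to cover all columns in a single launch with a grid-stride loop, so one kernel call replaces the many per-column ones:

```cuda
#include <cuda_runtime.h>

// Sum every column of a row-major (rows x cols) matrix in ONE launch:
// each thread walks one or more columns via a grid-stride loop, so the
// same kernel handles any number of columns without repeated launches.
__global__ void column_sums_all(const float* m, float* sums, int rows, int cols)
{
    for (int c = blockIdx.x * blockDim.x + threadIdx.x;
         c < cols;
         c += gridDim.x * blockDim.x)
    {
        float acc = 0.0f;
        for (int r = 0; r < rows; ++r)
            acc += m[r * cols + c];  // adjacent threads read adjacent columns: coalesced
        sums[c] = acc;
    }
}
```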

With all this in mind, the first step in optimizing the code was to examine how memory allocation was being performed, determine whether buffers could be reused, and verify if pinned memory was being applied (principle 2).
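
In practice this means allocating once at startup and reusing the same buffers every frame, with the host staging buffer in pinned memory instead of being re-allocated per call. A minimal sketch (all names are illustrative, not the plugin's):

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Buffers allocated once at plugin startup and reused every sonar frame,
// instead of cudaMalloc/cudaFree inside the per-frame path.
struct SonarBuffers {
    float* d_rays  = nullptr;  // device input
    float* d_beams = nullptr;  // device output
    float* h_beams = nullptr;  // pinned host buffer for fast D2H copies
};

void init_buffers(SonarBuffers& b, size_t n_rays, size_t n_beams)
{
    cudaMalloc(&b.d_rays,  n_rays  * sizeof(float));
    cudaMalloc(&b.d_beams, n_beams * sizeof(float));
    cudaMallocHost(&b.h_beams, n_beams * sizeof(float)); // pinned: faster DMA, enables async copies
}

void free_buffers(SonarBuffers& b)
{
    cudaFree(b.d_rays);
    cudaFree(b.d_beams);
    cudaFreeHost(b.h_beams);
}
```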

The second step involved identifying the parts of the code that consumed the most execution time. Optimization efforts then focused on minimizing memory transfers wherever possible, either by modifying kernel calls or avoiding unnecessary memory copies when values could remain on the GPU (principles 1 and 3).
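
A sketch of the per-frame flow under those principles: intermediates stay device-resident between kernels, and the single remaining device-to-host copy is asynchronous into pinned memory (all identifiers below are hypothetical):

```cuda
#include <cuda_runtime.h>

__global__ void stage_a(const float* in, float* mid, int n);  // hypothetical kernels
__global__ void stage_b(const float* mid, float* out, int n);

// Per-frame path: the intermediate 'd_mid' never leaves the GPU; only the
// final result crosses the PCIe bus, asynchronously, into pinned memory.
void run_frame(const float* d_in, float* d_mid, float* d_out,
               float* h_out_pinned, int n, cudaStream_t stream)
{
    const int T = 256, B = (n + T - 1) / T;
    stage_a<<<B, T, 0, stream>>>(d_in, d_mid, n);    // no host round-trip here
    stage_b<<<B, T, 0, stream>>>(d_mid, d_out, n);
    cudaMemcpyAsync(h_out_pinned, d_out, n * sizeof(float),
                    cudaMemcpyDeviceToHost, stream); // async requires pinned host memory
    cudaStreamSynchronize(stream);                   // single sync point per frame
}
```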