Abstract
The sequential Monte Carlo probability hypothesis density (SMC-PHD) filter has received much interest in the field of nonlinear non-Gaussian visual tracking due to its ability to handle a variable number of speakers. The SMC-PHD filter employs surviving, spawned and born particles to model the states of the speakers and jointly estimates the variable number of speakers together with their states. The born particles play a critical role in the detection of new speakers, which makes it necessary to propagate them in each frame; however, this increases the computational cost of the visual tracker. Here, we propose to use audio data to determine when to propagate the born particles and to re-allocate the surviving and spawned particles. In our framework, audio data aids the visual SMC-PHD (V-SMC-PHD) filter: the direction of arrival (DOA) angles of the audio sources are used to reshape the distribution of the particles. Experimental results on the AV16.3 dataset with multi-speaker sequences show that the proposed audio-visual SMC-PHD (AV-SMC-PHD) filter improves the tracking performance in terms of estimation accuracy and computational efficiency.