- OpenSource Name: shvdiwnkozbw/Multi-Source-Sound-Localization
- OpenSource URL: https://github.com/shvdiwnkozbw/Multi-Source-Sound-Localization
- OpenSource Language: Python 99.6%

# Multi-Source-Sound-Localization

Code for the ECCV paper *Multiple Sound Sources Localization from Coarse to Fine*. We have just uploaded the latest simplified version, which is much easier to use and reaches performance comparable to the original one (see the corresponding folder in this repository).

This is a PyTorch implementation that aims to perform sound localization in complex audiovisual scenes, where there are multiple objects making sounds. We disentangle a complex scene into several simple scenes, each consisting of a one-to-one sound-object pair. We propose a two-stage learning framework, which establishes coarse-grained audiovisual correspondence at the category level in the first stage, and achieves fine-grained sound-object alignment in the second stage.

## Requirements
## Prepare Dataset

The detailed preprocessing code and the classification pseudo-label generation are referenced in the repository.

### SoundNet-Flickr Dataset

The audiovisual pairs are defined as one frame and a corresponding 5-second audio clip. Each image is resized to a fixed resolution.

### AVE Dataset

There are 4143 10-second video clips available in total. We extract video frames at a fixed frame rate.

### AudioSet Instrument Dataset

This is a subset of AudioSet covering 15 musical instruments. The video clips are annotated with labels that indicate which musical instruments make sound in the audio. We extract video frames at a fixed frame rate.

For unlabeled videos in the SoundNet-Flickr and AVE datasets, it is optional to introduce class-agnostic proposals generated by Faster RCNN and perform classification on each ROI region to improve the quality of the pseudo labels (a rough sketch of this optional step is shown below).

## Procedure of the simplified version

The input data required is as follows:
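As an illustration of the optional proposal-based pseudo-label refinement described in the dataset section above, here is a minimal sketch that takes class-agnostic region proposals from an off-the-shelf Faster R-CNN and classifies each cropped region. The detector, classifier, and aggregation choices are assumptions for illustration, not necessarily what this repository uses.

```python
# Hedged sketch only: model choices and aggregation are assumptions.
import torch
from torchvision.models import resnet18
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.transforms import functional as TF

detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()   # class-agnostic proposals
classifier = resnet18(weights="DEFAULT").eval()                # stand-in category classifier

@torch.no_grad()
def refine_pseudo_label(image, score_thr=0.5, topk=5):
    """image: float tensor [3, H, W] with values in [0, 1]."""
    out = detector([image])[0]
    boxes = out["boxes"][out["scores"] > score_thr][:topk]
    roi_probs = []
    for x0, y0, x1, y1 in boxes.round().long().tolist():
        if x1 <= x0 or y1 <= y0:
            continue
        crop = TF.resize(image[:, y0:y1, x0:x1], [224, 224])   # ImageNet normalization omitted
        roi_probs.append(classifier(crop.unsqueeze(0)).softmax(-1))
    if not roi_probs:
        return None
    # Aggregate ROI-level predictions into an image-level pseudo label, e.g. by max-pooling.
    return torch.cat(roi_probs).max(dim=0).values
```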
### Training 1st stage
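Conceptually, the first stage establishes category-level correspondence by training the audio branch and the visual branch against the same classification pseudo labels. The sketch below only illustrates that idea; the encoders, heads, and the multi-label BCE loss are assumptions, not the repository's actual code.

```python
# Conceptual sketch of stage-1 training, not the repository's implementation.
import torch
import torch.nn as nn

class CoarseStage(nn.Module):
    def __init__(self, audio_encoder, visual_encoder, dim=512, num_classes=15):
        super().__init__()
        self.audio_encoder = audio_encoder    # e.g. a CNN over log-mel spectrograms
        self.visual_encoder = visual_encoder  # e.g. a CNN over video frames
        self.audio_head = nn.Linear(dim, num_classes)
        self.visual_head = nn.Linear(dim, num_classes)

    def forward(self, audio, frame):
        return (self.audio_head(self.audio_encoder(audio)),
                self.visual_head(self.visual_encoder(frame)))

criterion = nn.BCEWithLogitsLoss()

def stage1_loss(model, audio, frame, pseudo_label):
    """pseudo_label: multi-hot tensor [B, num_classes] from pretrained classifiers."""
    audio_logits, visual_logits = model(audio, frame)
    # Supervising both branches with the same pseudo labels places audio and vision
    # in a shared category space, i.e. the coarse-grained correspondence.
    return criterion(audio_logits, pseudo_label) + criterion(visual_logits, pseudo_label)
```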
### Training 2nd stage
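Conceptually, the second stage disentangles the scene into one-to-one sound-object pairs by comparing class-specific audio features with the visual feature map at every spatial location. Below is a minimal sketch of one way such class-conditioned localization maps could be computed; the tensor layout and the cosine-similarity formulation are assumptions, not the repository's exact design.

```python
# Hedged sketch of class-conditioned localization maps for stage 2.
import torch
import torch.nn.functional as F

def localization_maps(visual_feat, audio_embeds, active_classes):
    """
    visual_feat:    [B, D, H, W] spatial visual features
    audio_embeds:   [B, C, D]    one audio embedding per category
    active_classes: indices of categories present in the audio
    Returns {class_index: [B, H, W] map of non-negative responses}.
    """
    v = F.normalize(visual_feat, dim=1)
    maps = {}
    for c in active_classes:
        a = F.normalize(audio_embeds[:, c], dim=-1)      # [B, D]
        sim = torch.einsum("bdhw,bd->bhw", v, a)          # cosine similarity per location
        maps[c] = sim.clamp(min=0)                        # one map per sound source
    return maps
```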
### Evaluation
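Quantitative results on the human-annotated SoundNet-Flickr subset are reported in terms of cIoU and AUC (see the evaluation section of the original code below). As a reference, here is a hedged sketch following the consensus-IoU protocol commonly used for this benchmark; the thresholds and implementation details are assumptions and may differ from this repository.

```python
# Hedged sketch of cIoU / AUC computation; protocol details are assumptions.
import numpy as np

def ciou(pred_map, consensus_map, thr=0.5):
    """pred_map, consensus_map: [H, W]; consensus_map holds annotator-agreement weights."""
    fg = pred_map > thr
    inter = consensus_map[fg].sum()
    union = consensus_map.sum() + fg[consensus_map == 0].sum()
    return inter / (union + 1e-12)

def auc(ciou_values, thresholds=np.linspace(0, 1, 21)):
    """Area under the success-ratio curve as the cIoU success threshold varies."""
    success = [(np.asarray(ciou_values) > t).mean() for t in thresholds]
    return np.trapz(success, thresholds)
```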
## Results

### Sound Localization on SoundNet-Flickr

We visualize the localization maps corresponding to the different elements contained in mixed sounds of two sources.

### Sound Localization on AudioSet Instrument

We visualize some examples in AudioSet with two categories of instruments making sound simultaneously. The localization maps in each subfigure are listed from left to right: AVC, Multi-task, Ours. The green boxes are detection results of Faster RCNN.

### Spatio-temporal Sound Localization in Videos

We visualize how the localization maps in videos change over time. The frames shown are extracted at 1 fps, and the heatmaps show the localization responses to the corresponding 1-second audio clips. When there is only noise, our model mainly focuses on background regions, as in the first two frames of Fig. (a). When there are sounds produced by specific objects, our model can accurately capture the sound makers, e.g., it can distinguish the sounds of guitar and accordion in Fig. (b), and dog barking and toy-car sound in Fig. (c).

### Comparison with CAM Baseline

We show some comparisons between our model and the CAM method. The images in each subfigure are listed as: original image, localization result of our model, result of the CAM method. It is clear that the CAM method cannot distinguish objects belonging to the same category, e.g., violin and piano in Fig. (e), while our model can precisely localize the object that makes sound in the input audio.

## Procedure of original code

### Training

#### Training 1st stage

For the SoundNet-Flickr or AVE dataset, run the corresponding first-stage training script.
For the AudioSet dataset, run the corresponding first-stage training script.
#### Training 2nd stage

For the SoundNet-Flickr or AVE dataset, run the corresponding second-stage training script.
For the AudioSet dataset, run the corresponding second-stage training script.
The training log file and the trained model are stored in the configured output directory.
### Evaluate

For quantitative evaluation on the human-annotated SoundNet-Flickr subset, run the evaluation script.
It outputs the cIoU and AUC results as well as visualizations of the localization maps; a typical heatmap overlay is sketched below. For evaluation on the AudioSet Instrument dataset, run the corresponding evaluation script.
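The localization-map visualizations mentioned above are typically produced by overlaying a heatmap on the frame. Here is a minimal sketch using OpenCV; this is an illustration only and may differ from how the repository renders its figures.

```python
# Hedged sketch: overlay a localization map on a frame with OpenCV.
import cv2
import numpy as np

def overlay_heatmap(frame_bgr, loc_map, alpha=0.5):
    """frame_bgr: uint8 [H, W, 3]; loc_map: float [h, w] localization map."""
    heat = cv2.resize(loc_map.astype(np.float32), (frame_bgr.shape[1], frame_bgr.shape[0]))
    heat = (255 * (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)).astype(np.uint8)
    heat = cv2.applyColorMap(heat, cv2.COLORMAP_JET)
    return cv2.addWeighted(frame_bgr, 1 - alpha, heat, alpha, 0)
```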
It outputs class-specific localization maps for each sample, which are stored on disk. Run the provided post-processing script to calculate the evaluation results and visualize localization maps at different difficulty levels.

## Citation