AudioCaps: Generating Captions for Audios in The Wild
Abstract
We explore the problem of audio captioning: generating natural language descriptions for any kind of audio in the wild, which has been surprisingly unexplored in previous research. We contribute a large-scale dataset of about 46K pairs of audio clips and human-written text, collected via crowdsourcing on the AudioSet dataset. Our thorough empirical studies not only show that our collected captions are indeed faithful to the audio inputs but also reveal which forms of audio representation and captioning models are effective for audio captioning. Based on extensive experiments, we also propose two novel components that help improve audio captioning performance: the top-down multi-scale encoder and aligned semantic attention.
Examples
(Ours) a man speaking with a series of whistling in the background
(GT) a man talking as another person whistles while water trickles on a hard surface in the background
(Ours) a large explosion followed by a loud pop
(GT) a whooshing noise followed by an explosion
(Ours) a truck engine is running, a siren is occurring, and an adult male speaks
(GT) a child shouts, and adult male speaks, and an emergency vehicle siren sounds and the horn blows
(Ours) a small motor is running , whirring occurs , and a high - pitched whine is present
(GT) a drone whirring followed by a crashing sound
(Ours) a man and woman talking , then a baby crying
(GT) a kid crying as a man and a woman talk followed by a car door opening then closing
(Ours) a man speaking as plastic crinkles
(GT) plastic crumpling and crinkling are ongoing , and an adult male speaks
(Ours) a vehicle engine is running and revving , and tires squeal
(GT) white noise , then a motor revving up and a tire skidding
(Ours) hissing and gurgling of water flowing down a toilet
(GT) plastic crumpling and crinkling are ongoing , and an adult male speaks
(Ours) a man speaks with birds chirping in the distance
(GT) a man speaking with light wind followed by brief silence then birds chirping
(Ours) a man speaking followed by bees buzzing
(GT) a man speaks with wind blowing and buzzing of insects
(Ours) a thunderstorm is in the distance
(GT) rain falling and thunder roaring in the distance
(Ours) a large aircraft engine is running
(GT) humming of a nearby jet engine
Failure Examples
(Ours) a vehicle engine revving
(GT) high pitched continuous drilling that slows down
(Ours) a man speaking and a sewing machine working
(GT) two men speaking followed by plastic clacking then a power tool drilling