AudioCaps

Generating Captions for Audios in the Wild

What is AudioCaps?

We explore audio captioning: generating natural language descriptions for any kind of audio in the wild.

We contribute AudioCaps, a large-scale dataset of about 46K pairs of audio clips and human-written text captions, collected via crowdsourcing on the AudioSet dataset. The collected captions of AudioCaps are indeed faithful to the audio inputs.
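
A minimal sketch of loading the collected audio-caption pairs, assuming the released splits are CSV files with columns such as audiocap_id, youtube_id, start_time, and caption; the file and column names here are illustrative and should be checked against the actual download:

# Load AudioCaps annotations from a split CSV (illustrative column names).
import csv

def load_audiocaps(csv_path):
    """Read (clip id, YouTube id, start time, caption) rows into a list of dicts."""
    pairs = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            pairs.append({
                "audiocap_id": row["audiocap_id"],
                "youtube_id": row["youtube_id"],
                "start_time": float(row["start_time"]),  # clip offset in seconds
                "caption": row["caption"],
            })
    return pairs

if __name__ == "__main__":
    pairs = load_audiocaps("train.csv")  # hypothetical local path
    print(len(pairs), "audio-caption pairs")
    print(pairs[0]["caption"])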

We provide the source code of our models to explore which forms of audio representations and captioning models are effective for audio captioning.
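
As an illustration of the kind of captioning model the code base explores, below is a minimal PyTorch sketch of a generic encoder-decoder captioner over pre-extracted audio feature frames. It is not the exact model from the paper; all dimensions and names are illustrative:

# Generic audio captioner: RNN encoder over audio feature frames,
# RNN decoder over caption tokens (teacher forcing). Illustrative only.
import torch
import torch.nn as nn

class AudioCaptioner(nn.Module):
    def __init__(self, feat_dim=128, hidden_dim=256, vocab_size=5000, emb_dim=256):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.decoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, audio_feats, captions):
        # audio_feats: (batch, time, feat_dim); captions: (batch, length) token ids
        _, h = self.encoder(audio_feats)      # summarize the audio into a hidden state
        emb = self.embed(captions)            # embed ground-truth tokens (teacher forcing)
        dec_out, _ = self.decoder(emb, h)     # condition the decoder on the audio state
        return self.out(dec_out)              # per-step vocabulary logits

if __name__ == "__main__":
    model = AudioCaptioner()
    feats = torch.randn(2, 100, 128)          # two clips, 100 feature frames each
    toks = torch.randint(0, 5000, (2, 12))    # two dummy captions of 12 tokens
    print(model(feats, toks).shape)           # -> torch.Size([2, 12, 5000])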

Examples

[Video clip] Caption: A drone is whirring followed by a crashing sound

Papers

AudioCaps: Generating Captions for Audios in The Wild

Chris Dongjoo Kim, Byeongchang Kim, Hyunmin Lee, and Gunhee Kim. NAACL-HLT 2019 (Oral)

code
Bibtex
@inproceedings{audiocaps,
  title={AudioCaps: Generating Captions for Audios in The Wild},
  author={Kim, Chris Dongjoo and Kim, Byeongchang and Lee, Hyunmin and Kim, Gunhee},
  booktitle={NAACL-HLT},
  year={2019}
}

Downloads

Acknowledgement