TAU: Taiwan Audio Understanding Benchmark

Authors:
Yi-Cheng Lin, Yu-Hua Chen, Jia-Kai Dong, Yueh-Hsuan Huang, Szu-Chi Chen,
Yu-Chen Chen, Chih-Yao Chen, Yu-Jung Lin, Yu-Ling Chen, Zih-Yu Chen,
I-Ning Tsai, Hsiu-Hsuan Wang, Ho-Lam Chung, Ke-Han Lu, Hung-yi Lee

Affiliations:
National Taiwan University, University of Toronto


What is TAU?

Large audio–language models are advancing rapidly, yet most evaluations emphasize speech or globally sourced sounds and overlook culturally distinctive cues. This gap raises a critical question: can current models generalize to localized, non-semantic audio that communities recognize instantly but outsiders do not? To address this, we present TAU (Taiwan Audio Understanding), a benchmark of everyday Taiwanese “soundmarks.” TAU curates locally distinctive Taiwanese non-speech sounds and evaluates models with multiple-choice questions (MCQs) that cannot be answered by semantic reasoning alone, steering evaluation toward timbre, rhythm, and iconic acoustic patterns. Beyond Taiwan, TAU illustrates how localized benchmarks can expose cultural blind spots in audio–language models, bring underrepresented communities into multimodal evaluation, and guide the design of more equitable and robust multimodal systems.

Dataset Snapshot

TAU contains 702 audio clips across 10 culturally distinctive categories. The categories cover both highly frequent urban soundmarks (e.g., transit chimes, store jingles) and less common but socially important cues (e.g., emergency alarms, religious chants). The category distribution is intentionally imbalanced to reflect how often these sounds occur in everyday Taiwanese soundscapes, rather than enforcing artificial uniformity. Each clip is paired with up to four MCQs, for a total of 1,794 evaluation items. The median clip length is 9.43 seconds, and clips are capped at 30 seconds by design. This cap balances realism with usability, allowing evaluation without excessive cognitive load. On average, each soundmark has 2.1 recording variants that differ by location, time, or background conditions, which increases robustness and reduces overfitting to specific contexts.
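As a rough sanity check on the figures above, the following Python sketch derives two numbers the snapshot implies but does not state: the mean number of MCQs per clip and an approximate count of distinct soundmarks. The function names are illustrative and are not part of the TAU release. The soundmark count is only an estimate, because the 2.1 variants-per-soundmark figure is a rounded average.

```python
# Figures quoted in the dataset snapshot.
NUM_CLIPS = 702
NUM_MCQS = 1794
MAX_MCQS_PER_CLIP = 4
AVG_VARIANTS_PER_SOUNDMARK = 2.1  # rounded average, so results are approximate


def mean_mcqs_per_clip(num_mcqs: int, num_clips: int) -> float:
    """Average number of multiple-choice questions attached to each clip."""
    return num_mcqs / num_clips


def approx_num_soundmarks(num_clips: int, avg_variants: float) -> int:
    """Estimate of distinct soundmarks, assuming clips = soundmarks * variants."""
    return round(num_clips / avg_variants)


if __name__ == "__main__":
    mean_mcqs = mean_mcqs_per_clip(NUM_MCQS, NUM_CLIPS)
    # The mean must not exceed the stated per-clip maximum.
    assert mean_mcqs <= MAX_MCQS_PER_CLIP
    print(f"Mean MCQs per clip: {mean_mcqs:.2f}")  # 2.56
    print(
        "Approximate distinct soundmarks:",
        approx_num_soundmarks(NUM_CLIPS, AVG_VARIANTS_PER_SOUNDMARK),
    )  # 334
```

The mean of about 2.56 questions per clip is consistent with the stated limit of four MCQs per clip, which suggests that many clips carry fewer than the maximum.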


Note: The homepage features selected examples only. The full dataset is available via the download link.

Cite Our Paper

MLA Format:

Lin, Yi-Cheng, et al. "TAU: A Benchmark for Cultural Sound Understanding Beyond Semantics." arXiv, 2025, arXiv:2509.26329.

BibTeX:

@misc{lin2025taubenchmarkculturalsound,
      title={TAU: A Benchmark for Cultural Sound Understanding Beyond Semantics}, 
      author={Yi-Cheng Lin and Yu-Hua Chen and Jia-Kai Dong and Yueh-Hsuan Huang and Szu-Chi Chen and Yu-Chen Chen and Chih-Yao Chen and Yu-Jung Lin and Yu-Ling Chen and Zih-Yu Chen and I-Ning Tsai and Hsiu-Hsuan Wang and Ho-Lam Chung and Ke-Han Lu and Hung-yi Lee},
      year={2025},
      eprint={2509.26329},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2509.26329}, 
}