CH-SIMS v2.0, a Fine-grained Multi-label Chinese Sentiment Analysis Dataset, is an enhanced and extended version of the CH-SIMS dataset. We re-labeled all instances in CH-SIMS at a finer granularity, and the video clips as well as the pre-extracted features have been remade. We also extended the dataset to a total of 14,563 instances. The new dataset contains videos collected from a much wider range of scenarios, as shown in the banner image.
As shown in the figure below, CH-SIMS v2.0 contains 4,402 supervised instances, denoted CH-SIMS v2.0 (s), and 10,161 unsupervised instances, denoted CH-SIMS v2.0 (u). The supervised instances share similar properties with the original CH-SIMS dataset. The unsupervised instances show a much more diverse distribution of video durations, which better simulates real-world scenarios. The text of the unsupervised instances is taken from ASR transcripts without manual correction and therefore contains noise, which also better matches real-world conditions.
We split the data into train, valid, and test sets at a ratio of roughly 9:2:3. The regression labels range from -1 to 1. The classification labels are Negative (NEG), Weakly Negative (WNEG), Neutral (NEU), Weakly Positive (WPOS), and Positive (POS). The label distribution is shown in the figure below. The test set is speaker-independent from the train/valid sets.
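For readers who want to derive the five classification labels from the regression scores, here is a minimal sketch. The bin edges below are illustrative, evenly spaced thresholds over [-1, 1], not the official CH-SIMS v2.0 annotation boundaries; check the released label files for the exact mapping.

```python
def to_five_class(score: float) -> str:
    """Map a regression label in [-1, 1] to one of the five sentiment
    classes. NOTE: the thresholds below are assumed for illustration,
    not taken from the official CH-SIMS v2.0 annotation scheme."""
    if score < -0.6:
        return "NEG"
    elif score < -0.1:
        return "WNEG"
    elif score <= 0.1:
        return "NEU"
    elif score <= 0.6:
        return "WPOS"
    else:
        return "POS"

# Example: a strongly negative score falls into NEG,
# a score near zero into NEU.
print(to_five_class(-0.8))  # NEG
print(to_five_class(0.0))   # NEU
```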
The baseline experiments are conducted on the MMSA platform. The results are reported HERE on GitHub.
| Dataset | Links |
| --- | --- |
| CH-SIMS v2.0 (s) | [Google Drive] [Baiduyun Drive] |
| CH-SIMS v2.0 (u) | [Google Drive] [Baiduyun Drive] |
| CH-SIMS | [Google Drive] [Baiduyun Drive] |
Please cite us if you find our work useful.
```bibtex
@misc{liu2022make,
  title={Make Acoustic and Visual Cues Matter: CH-SIMS v2.0 Dataset and AV-Mixup Consistent Module},
  author={Yihe Liu and Ziqi Yuan and Huisheng Mao and Zhiyun Liang and Wanqiuyue Yang and Yuanzhe Qiu and Tie Cheng and Xiaoteng Li and Hua Xu and Kai Gao},
  year={2022},
  eprint={2209.02604},
  archivePrefix={arXiv},
  primaryClass={cs.MM}
}
```