July 3rd, 2024
Speech Emotion Recognition, commonly known as SER, is a rapidly evolving field focused on identifying human emotions through voice analysis. The discipline is important for understanding human communication, as emotions play a crucial role in how messages are conveyed and interpreted in daily interactions. By analyzing vocal cues such as intonation, loudness, tempo, and rhythm, SER aims to detect and categorize a speaker's emotional state, which is valuable in applications ranging from healthcare to human-computer interaction.

One of the primary challenges in SER is the reliance on staged scenarios and the scarcity of natural databases. Most existing databases are either acted or elicited, meaning they involve actors simulating emotions or participants reacting to controlled stimuli. While these methods provide valuable data, they often produce exaggerated or unnatural emotional expressions that do not accurately represent real-world conditions. Acted databases typically feature predetermined sentences performed by actors, leading to stereotypical portrayals of emotion. Elicited databases gather responses to artificially induced situations, which, despite being closer to genuine expression, still carry limitations and ethical concerns.

Natural databases, collected in spontaneous and unscripted environments, capture more authentic emotional expressions, but they present additional challenges: overlapping voices, background noise, and the subjective nature of emotion perception among different raters. Their scarcity, especially in languages other than English, further complicates the development of robust SER models and restricts their advancement in real-world applications.

The EMOVOME database addresses these challenges and represents a significant advance in the field. EMOVOME stands for Emotional Voice Messages and is a publicly available database of 999 voice messages collected from real WhatsApp conversations of 100 Spanish speakers. By capturing spontaneous interactions in natural settings, it bridges the gap between staged and real-life scenarios, and it gives researchers a basis for developing and refining models that recognize genuine emotions, improving the reliability and applicability of SER in practical use.

Each voice message is meticulously labeled for both continuous emotion dimensions and discrete emotion categories, making the corpus a rich resource for developing and testing SER models. The labeling methodology is particularly noteworthy: it combines contributions from expert and non-expert annotators, ensuring a diverse range of perspectives in the emotional evaluation process.
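As a rough sketch of how multi-rater labels like these might be consolidated before model training, the snippet below averages the continuous ratings and majority-votes the discrete category. The field names, the 1-to-5 rating scale, and the aggregation scheme itself are illustrative assumptions, not the procedure used for EMOVOME.

```python
from collections import Counter
from statistics import mean

# Hypothetical annotations for one voice message; the rater IDs, the
# 1-5 valence/arousal scale, and the category names are assumptions.
annotations = [
    {"rater": "expert_1",     "valence": 2, "arousal": 4, "category": "anger"},
    {"rater": "expert_2",     "valence": 2, "arousal": 5, "category": "anger"},
    {"rater": "non_expert_1", "valence": 3, "arousal": 4, "category": "sadness"},
]

def aggregate(labels):
    """Average the continuous dimensions and majority-vote the category."""
    valence = mean(a["valence"] for a in labels)
    arousal = mean(a["arousal"] for a in labels)
    category, _ = Counter(a["category"] for a in labels).most_common(1)[0]
    return {"valence": round(valence, 2), "arousal": round(arousal, 2),
            "category": category}

print(aggregate(annotations))
# {'valence': 2.33, 'arousal': 4.33, 'category': 'anger'}
```

Simple averaging and majority voting are only one possibility; weighting expert ratings more heavily, or keeping expert and non-expert labels as separate training targets, are equally reasonable variants.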
Expert annotators bring depth of knowledge and consistency, while non-expert annotators offer judgments that may align more closely with how the general public perceives emotion. The combination of annotator types has a significant impact on SER model performance, helping to balance accuracy with relatability, and including non-expert annotations can improve a model's ability to generalize across different user groups, making it more applicable in real-world scenarios.

Comparing SER models built on EMOVOME with reference databases such as IEMOCAP and RAVDESS reveals both unique challenges and advantages. IEMOCAP contains emotions elicited through controlled interactions and RAVDESS contains acted emotional expressions; both provide valuable benchmarks but differ fundamentally from EMOVOME's naturalistic approach. Models evaluated on EMOVOME show lower performance in emotion category prediction than on these staged databases, but comparable results on the valence and arousal dimensions. In other words, recognizing specific emotion categories in spontaneous speech remains challenging, while the broader emotional tone and intensity of speech are captured well.

A further issue addressed by EMOVOME is gender fairness. SER systems have historically shown biases that affect accuracy across demographic groups, including gender. EMOVOME supports a more equitable evaluation by incorporating a balanced representation of male and female speakers and by analyzing the impact of gender on model performance. The study found that combining expert and non-expert annotations led to better results and improved fairness, suggesting that diverse annotator perspectives can help mitigate bias in SER systems.

In summary, the EMOVOME database stands out as a pioneering resource in the field of Speech Emotion Recognition. By offering a rich, naturalistic dataset and addressing key issues such as annotation diversity and gender fairness, it paves the way for more accurate and equitable SER models capable of performing effectively in real-world applications.
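As a closing illustration of the evaluation ideas above, the sketch below scores a valence regressor with the concordance correlation coefficient (a metric commonly used for continuous emotion dimensions) and measures a simple gender-fairness gap as the difference in category accuracy between female and male speakers. The metric choices, the toy data, and the function names are assumptions made for illustration, not the exact protocol followed in the EMOVOME study.

```python
import numpy as np

def ccc(y_true, y_pred):
    """Concordance correlation coefficient, a common score for
    continuous valence/arousal prediction."""
    mean_t, mean_p = y_true.mean(), y_pred.mean()
    cov = np.mean((y_true - mean_t) * (y_pred - mean_p))
    return 2 * cov / (y_true.var() + y_pred.var() + (mean_t - mean_p) ** 2)

def gender_gap(y_true, y_pred, gender):
    """Absolute difference in category accuracy between female and male
    speakers; a smaller gap suggests a fairer model."""
    acc = {g: float(np.mean(y_pred[gender == g] == y_true[gender == g]))
           for g in ("female", "male")}
    return abs(acc["female"] - acc["male"]), acc

# Toy data standing in for real model predictions (illustrative only).
rng = np.random.default_rng(0)
valence_true = rng.uniform(1, 5, size=200)
valence_pred = valence_true + rng.normal(0, 0.5, size=200)   # noisy regressor
print(f"valence CCC: {ccc(valence_true, valence_pred):.3f}")

labels_true = rng.integers(0, 4, size=200)                    # 4 emotion categories
labels_pred = np.where(rng.random(200) < 0.7, labels_true,    # roughly 70% correct
                       rng.integers(0, 4, size=200))
gender = rng.choice(np.array(["female", "male"]), size=200)
gap, per_group = gender_gap(labels_true, labels_pred, gender)
print(f"accuracy gap: {gap:.3f}, per-group accuracy: {per_group}")
```

A per-group accuracy gap of this kind is only an easy first check; in practice, fairness is usually probed with several complementary measures across demographic groups.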