Revista Nebrija de Lingüística Aplicada a la Enseñanza de las Lenguas. Vol. 20. Núm. 40 (2026).
ISSN 1699-6569
Assessing emotion in L2 writing: Validating Watson NLU with emotional vocabulary
training
Evaluación de las emociones en la escritura en L2: Validación de Watson NLU con
entrenamiento de vocabulario emocional
María Jesús Sánchez a, Elisa Pérez-García b, Beatriz Bermúdez-Margaretto c
a Universidad de Salamanca, mjs@usal.es
b Universidad de Salamanca, elisapg@usal.es
c Universidad de Salamanca, bermudezmargaretto@usal.es
Abstract
Affective word values have been widely studied across languages, often focusing on isolated words due to the
difficulty of assessing emotionality in texts. This study examines whether written emotional content can be reliably
captured using a specific software tool (Watson Natural Language Understanding). Thirty-three Spanish
undergraduates wrote 150-word autobiographical texts in their L2 (English) before and after training with
emotional vocabulary. Normative valence ratings of content words obtained in the pre- and post-training phases
were compared with sentiment scores generated by Watson NLU. Strong positive relations were found between
sentiment and normative valence scores in both phases, with stronger relations at post-training. Regression
analyses confirmed that sentiment scores significantly predicted normative valence. Importantly, while normative
valence did not differ between phases, sentiment scores increased after training. These results suggest that Watson
NLU is a valid and sensitive tool for assessing emotionality in written language and its modulation through training
during text writing.
Keywords. Emotion, bilingualism, sentiment, valence, emotional training
Resumen
Los valores afectivos de las palabras se han estudiado ampliamente en distintas lenguas, a menudo centrándose
en palabras aisladas debido a la dificultad de evaluar la emocionalidad en los textos. Este estudio analiza si el
contenido emocional escrito puede captarse de forma fiable mediante Watson Natural Language Understanding.
Treinta y tres universitarios españoles escribieron textos autobiográficos de 150 palabras en su L2 (inglés) antes
y después de un entrenamiento en vocabulario emocional. La valencia normativa de las palabras de contenido se
comparó con las puntuaciones de sentimiento generadas por Watson NLU. Ambas medidas mostraron
correlaciones positivas y fuertes en las fases pre- y post-entrenamiento, siendo mayores tras el entrenamiento. Los
análisis de regresión confirmaron que las puntuaciones de sentimiento predijeron significativamente la valencia
normativa. Aunque no se observaron cambios en la valencia normativa, las puntuaciones de sentimiento
aumentaron tras el entrenamiento, lo que indica la sensibilidad de la herramienta a la modulación emocional del
lenguaje durante la escritura de textos.
Palabras clave. Emoción, bilingüismo, tono emocional, valencia, entrenamiento emocional
DOI: 10.26378/rnlael2040659
Received: 09/01/2026 - Accepted: 1/04/2026
Published under a Creative Commons Attribution-NoDerivatives 4.0 International license
1. Introduction
Word emotionality has been investigated in first (L1) and second (L2) languages (Imbault et al., 2021;
Warriner et al., 2013) by means of different measures: subjective ratings (Dewaele, 2004; Pavlenko,
2005), more objective, behavioral data (response times or accuracy rates during reading or word
categorization), or physiological (skin conductance, Harris, 2004; Harris et al., 2006) and
neurophysiological indices (neural responses, Opitz & Degner, 2012). This research has shown that
emotional words are perceived and evaluated as more emotionally extreme (i.e., more positive or
negative) in L1 than in L2 (Caldwell-Harris, 2015; Ferré et al., 2010; Sánchez et al., 2025) and that their
processing is more automatic and less effortful in the L1 too, leading to higher physiological reactivity (skin
conductance, electromyography, pupillometry) and increased or faster brain responses (Conrad et al.,
2011; Fan et al., 2018; Foroni, 2015; Opitz & Degner, 2012; Toivo & Scheepers, 2019; Winskel, 2013).
Although results in the production domain are scarcer, recent evidence also shows higher emotional
verbal fluency in L1 than L2 (Lam & Mardquardt, 2022) as well as more diverse emotional vocabulary
during L1 than L2 text production (Pavlenko & Driagina, 2007; Kyriakou et al., 2024; Vidal Noguera &
Mavrou, 2025). Similarly, greater gesture use has been reported during the retelling of emotional experiences
in L1 than in L2 (Emir Özder et al., 2023). Overall, this research consistently suggests that L2 speakers
tend to experience their L2 as less emotional.
Nonetheless, there is still no unanimous conclusion on the reduced emotional sensitivity in L2 (see, for
instance, lack of L1-L2 differences in Eilola & Havelka, 2010; Kazanas & Altarriba, 2016), with various
factors such as age of L2 acquisition, proficiency, and exposure potentially modulating L2
emotionality (Conrad et al., 2011; Opitz & Degner, 2011). In this line of research, the commonly reported reduced
emotionality in L2 has been attributed to weaker associations between words and their emotional
contexts, largely due to fewer meaningful encounters with those words (Pavlenko, 2012). Unlike L1, L2
is often learned in formal instructional contexts such as the school or university, where language use
tends to be less spontaneous and less embedded in emotionally rich interactions. As a result, the limited
exposure to words in socially and affectively meaningful contexts has been proposed as a key factor
underlying the lower emotional resonance in L2.
From this perspective, it is reasonable to hypothesize that increasing the number of such encounters and
embedding learning in richer emotional contexts—as implemented in the present study—may help
counteract this reduced emotionality. In this sense, this study aims to examine whether a pedagogical
intervention enhances the emotional content of learners’ written production by increasing meaningful
encounters with emotional vocabulary and promoting deeper lexical-semantic processing. Among the
instructional strategies that may help enhance emotionality in L2 within formal learning contexts, the
summary strategy appears particularly relevant. Summarizing involves reducing a source text to its
essential ideas and therefore requires a demanding higher-order cognitive process in which learners
synthesize content and identify the most relevant information (Khoshsima & Rabani Nia, 2014). This
complex task engages both cognitive and metacognitive operations, including scanning, skimming,
inferencing, and information construction (Keck, 2006; Mokeddem & Houcine, 2016), making reading
and writing closely interdependent. Previous research has shown that the use of summarization promotes
reading comprehension, writing development, and vocabulary acquisition (Keck, 2014; Shokrpour et
al., 2013; Stevens et al., 2019; Hsiang et al., 2020), likely because it encourages deeper lexical-semantic
processing. Such effortful processing may strengthen the mental representation of L2 words and
facilitate vocabulary learning, including emotionally charged lexical items. Thus, it was expected that
producing personal summaries of texts containing both positive and negative emotional content would
increase learners’ encounters with emotional vocabulary in meaningful contexts, and that this repeated,
elaborative engagement would promote a stronger integration of emotional language in L2.
When assessing emotional content in written texts, two complementary approaches can be adopted:
focusing on the emotional value of individual words or analysing emotion at the level of the text as a
whole. Previous studies on emotional processing in L2 have predominantly examined isolated words
(Kousta et al., 2009; Opitz & Degner, 2012; Palazova et al., 2011), whereas evidence at sentence or text
level remains comparatively limited (Tang & Ding, 2024; Sheikh & Titone, 2016; Vidal Noguera &
Mavrou, 2025; Kyriakou et al., 2024), largely because capturing the emotional tone of an entire text is
methodologically more complex. To address this limitation, the present study also explores whether
emotional content in written production in L2 can be captured using recently developed AI-based tools,
specifically IBM Watson Natural Language Understanding (hereafter, Watson NLU¹). This natural
language processing system extracts meaning from both structured and unstructured language data and
can provide information about sentiment expressed in text. Unlike traditional approaches based solely
on lexical items, this tool estimates sentiment at the phrase or text level, assigning a score on a continuum
from negative to positive (from –1 to +1), thereby allowing the analysis of the overall emotional tone
conveyed in a text rather than only the valence of isolated words.
At this point, it is worth clarifying the distinction between lexical valence analysis and sentiment
analysis carried out at single-word or text level, respectively. Lexical valence refers to the affective
polarity associated with individual words along a bipolar continuum from negative to positive,
traditionally derived from normative affective databases such as those developed by Warriner et al.
(2013) in L1 English. These databases provide emotional ratings for isolated words based on native-
speaker judgments, while comparable L2 norms for bilinguals and foreign language learners remain
scarce (Imbault et al., 2021). By contrast, sentiment refers to the overall evaluative attitude or emotional
tone expressed by the writer toward a topic within discourse, making it inherently context-sensitive.
Although the terms emotion and sentiment are sometimes used interchangeably, emotion typically refers
to internal affective states, whereas sentiment reflects how those states are linguistically conveyed in
context. From this perspective, sentiment analysis may offer a more naturalistic way of assessing
emotional tone in writing, because it evaluates meaning at text level rather than assigning normative
polarity to lexical items (Taherdoost & Madanchian, 2023). In addition, AI-based tools like Watson NLU
provide important methodological advantages: they enable consistent, replicable analysis of large text
samples and reduce the degree of subjectivity associated with manual coding procedures (Pérez-García
& Sánchez, 2020).
Regarding Watson-based sentiment tools, the earlier version of Watson NLU, IBM Watson Tone
Analyzer, has been shown to be particularly useful for examining emotional features of language in
diverse contexts. For example, Maleki et al. (2023) investigated whether financial incentives influence
the production of health-related content on social media by comparing posts from Steemit, a platform
that rewards user participation, with posts from Reddit, where no such incentives are provided. Their
analysis showed that posts written in the incentive-based environment displayed a more confident and
analytical language style, were less tentative, and expressed more joy and less negativity than those
published on the non-incentivized platform. Similarly, Steffens et al. (2021) used Watson Tone Analyzer
to examine whether the source of funding influences how medical research findings are written. By
examining emotional features in the texts—such as expressions of anger, fear, joy, and sadness—as well
as language style (for example, whether the writing sounded more analytical, confident, or cautious),
they found that studies without commercial funding tended to use language that reflected more fear and
a more impersonal tone than commercially funded studies. Langerhuizen et al. (2021) also analysed
patients’ online comments about healthcare providers and found that comments characterized by joy and
confidence were associated with higher service ratings, whereas sadness and tentativeness were linked
to lower evaluations.
This tool has also been applied in the field of music. Marouf et al. (2019), for instance, analysed a large
corpus of English song lyrics and classified them according to both language style (analytical, confident,
tentative) and emotional tone (anger, fear, joy, sadness). Extending this line of work, Somse et al. (2022)
combined tone detection with voice analysis to identify users’ emotional states and subsequently
recommend music that matched their mood. More applied research has addressed neuromotor
disability with the aim of helping dependent people (Jain & Verma, 2020). The authors presented a solution to control
the movement of people who speak clearly but cannot walk because of their disability, and proposed a
machine-learning-based methodology to detect emotion from speech to help people interact better
with their surroundings. Likewise, Gain and Hotti (2017) suggested that emotional tones and linguistic
patterns extracted from text may also contribute to assessing personality traits and social tendencies.
Taken together, these findings demonstrate the usefulness of Watson-based language analysis tools and
indicate that they are sufficiently robust to capture emotional and linguistic variation in written discourse
in domains as diverse as healthcare, social media, music, and scientific communication.
However, despite this potential, most previous applications of Watson tools have been developed outside
the fields of philology and language teaching, which highlights the novelty of applying this tool to the
study of emotional content in L2 written production. Nonetheless, a recent study (Sánchez et al., under
review) applied this software to examine the effects of two instructional strategies—summary and
guessing—on emotional writing performance in L2 English. In that study, learners’ written productions
before and after the intervention were analysed using Watson Tone Analyzer to quantify the emotional
tone of each text. Although the intervention did not produce statistically significant changes in overall
emotional writing performance—measured as the average score of four emotional dimensions (anger,
fear, joy, and sadness)—it did lead to a reduction in the analytical tone of the texts produced by both
experimental groups. In this way, the tool made it possible to identify that both teaching strategies were
similarly effective in encouraging learners to write in a less analytical and comparatively more affective
manner.
Based on this rationale, the main aim of the present study was to further investigate possible changes in
the emotional content of L2 written texts following specific instruction based on the summary strategy².
It was expected that the summary-based training would modulate the emotional content in L2 written
production. Consequently, changes after training were expected to be reflected in both normative
valence scores and sentiment scores. More specifically, the study aimed to test the validity of Watson
NLU for measuring emotional language in written texts. It was hypothesized that sentiment scores
(generated by Watson) would correlate with and predict mean normative valence scores (obtained from
affective ratings traditionally used in L2 research) of the L2 written texts collected both before and after
the teaching instruction.
2. Method
2.1. Participants
A group of Spanish undergraduate students (n = 33, 7 males, Mage = 18.27) enrolled in the English Studies
degree (University of Salamanca, Spain) participated in this study. They were B2-level English students
(Council of Europe, 2018) who volunteered to participate in the research. A consent form was
administered to inform participants about the study and to request permission to use their data
anonymously and in aggregated form, never individually.
2.2. Procedure
Participants wrote an autobiographical text (approximately 150 words) in their L2 (English) before
(pretest) and after (posttest) instruction with emotional language (see section Instruction for more
details).
2.2.1. Pretest and Posttest
In the pretest phase, students had approximately 30 minutes to write a text (150 words) about a dream
they had had. In the posttest phase, they wrote another short text (150 words) about a personal experience
(30 minutes); this phase took place two weeks after the instruction sessions to measure long-term effects.
The topics were chosen to generate texts in which participants felt inclined to use emotional vocabulary
and expressions through the retelling of subjective autobiographical experiences (Pavlenko, 2012).
These activities were administered online through the students’ virtual campus.
2.2.2. Instruction
Instruction was provided in two 50-minute sessions one week apart, and the tasks participants completed
during these two sessions were paper-based. In the first instruction session, a text adapted from a blog
post was used to address negative, high-arousal emotions related to anger (see Appendix 1). Participants
were first asked to read the text to become familiar with the words and emotional expressions, and then
they were asked to summarize it. They were advised to outline the main ideas before summarizing the
text to help them paraphrase and rephrase ideas. While writing the summary they could look at the text,
and guidance and support were always provided by the instructor, in order to motivate the participants
and make them feel more confident (Méndez López, 2016).
In the second instruction session, the input text used was adapted from a blog post on positive (pleasant)
feelings (see Appendix 2). The procedure was the same as in the first session. Participants read and
summarized the text, and while writing the summary they were allowed to look at the text and were
encouraged to ask any questions.
The fact that the first text dealt with negative emotions (session 1) and the second with positive emotions
(session 2) did not jeopardize the validity of the research because in the pre- and posttest participants
were not directed towards positivity or negativity and could express themselves freely with the
emotional terms learned in the instruction sessions.
2.3. Data analyses
A quasi-experimental pretest / posttest design (Larson-Hall, 2010; Rogers & Révész, 2020) was used to
examine the effect of the summary teaching strategy on students’ emotional L2 writing performance
(cf., Sánchez et al., 2026, where this design was applied to test the effect of teaching strategies on
vocabulary learning in EFL).
Students’ texts were analyzed using two different yet complementary indices: lexical valence scores
derived from a normative database and sentiment scores obtained through automated sentiment analyses
by means of the Watson NLU software tool. For the lexical valence analysis, each text produced in the pre-
and post-instruction phases was first corrected for spelling and tokenized. Once all the words from each
text were extracted, the content words (nouns, adjectives, verbs, and adverbs) were selected and
lemmatized. Thus, words were reduced to their base form (e.g., singular nouns and infinitive verb forms)
in order to facilitate matching with the normative database. Then, valence scores for each lexical item
were obtained from the set of English affective norms compiled by Warriner et al. (2013), which provides
emotional valence ratings for a large set of words on a scale ranging from 1 (very negative) to 9
(very positive). Content words that were not present in the database were excluded from
the analysis (23.69% of the words in the pretest texts and 23.87% in the posttest texts). For each text in
both phases, the mean valence score was computed by averaging the valence ratings extracted across all
content words, providing an index of the affective polarity of the lexical items used in the text. This
measure captures the emotional characteristics of the vocabulary, rather than the evaluative tone of the
discourse as a whole. For descriptive purposes, the proportion of valenced words in each text was also
extracted as an index of emotional vocabulary density. Following common practices in previous studies,
words with valence scores ≥ 6 or ≤ 4 were classified as emotional, and their proportion relative to the
total number of words matched with the normative database was calculated for each text.
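To make the lexical valence procedure concrete, the following Python sketch computes the three quantities described above for a single lemmatized text: the mean valence, the proportion of emotional words (valence ≥ 6 or ≤ 4), and the share of content words excluded because they have no normative rating. It is illustrative only. The function name valence_indices and the TOY_NORMS lexicon are hypothetical, the ratings are placeholders rather than values from Warriner et al. (2013), and spelling correction, tokenization, and lemmatization are assumed to have been completed beforehand.

```python
from statistics import mean

# Hypothetical excerpt of a valence lexicon (1 = very negative, 9 = very positive).
# The numbers are illustrative placeholders, NOT the published norms.
TOY_NORMS = {
    "dream": 7.0,
    "fear": 2.5,
    "run": 5.5,
    "house": 6.5,
    "dark": 3.5,
    "table": 5.0,
}


def valence_indices(lemmas, norms, low=4.0, high=6.0):
    """Compute word-level valence indices for one lemmatized text.

    lemmas: content-word lemmas of the text (already spell-checked and lemmatized).
    norms:  mapping from lemma to normative valence rating.
    """
    # Keep only the lemmas that have a rating in the normative database.
    matched = [norms[w] for w in lemmas if w in norms]
    if not matched:
        raise ValueError("no lemma matched the normative database")
    # A word counts as emotional when its rating falls at either extreme.
    emotional = [v for v in matched if v <= low or v >= high]
    return {
        "mean_valence": mean(matched),
        "emotional_density": len(emotional) / len(matched),
        "excluded_share": 1 - len(matched) / len(lemmas),
    }


# Two made-up lemmas ("zorp", "blick") stand in for words missing from the norms.
lemmas = ["dream", "fear", "run", "house", "dark", "table", "zorp", "blick"]
result = valence_indices(lemmas, TOY_NORMS)
print(result)
```

Note that the emotional-word proportion is taken over matched words only, mirroring the description above, so unmatched words lower coverage but do not dilute the density index.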
Regarding sentiment, scores for each text in the pre- and posttest phases were automatically generated by
the Watson NLU tool. The system applies machine-learning models trained on large text corpora to analyze
the linguistic features in the input text, providing a computational estimate of the emotional tone
expressed in the text. The resulting sentiment scores range from -1 to 1, where values closer to 1 indicate
a more positive tone, values closer to -1 reflect a negative tone, and values around 0 indicate a more
neutral evaluative tone. Therefore, whereas lexical valence operates at the level of individual words,
sentiment scores indicate the overall polarity of texts as a whole, taking into account the linguistic
context in which words appear.
Two different analyses were then carried out with the normative valence and sentiment scores
obtained for the texts. First, to determine the relationship between sentiment and normative valence
scores, correlational and regression analyses were carried out. Pearson correlations were
computed between sentiment and normative valence scores, separately for the texts written
in the pre- and posttest phases. Regression analyses were then conducted with sentiment scores as the
predictor (independent variable) and mean valence scores as the dependent variable, again separately
for texts written in the pre- and posttest phases. Second, to determine the effect of the specific training,
paired-sample t-tests were performed to contrast written L2 texts in the pre- and post-training phases,
separately for normative valence and sentiment scores. Statistical analyses were conducted with
the SPSS package (IBM, version 23), and R (R Core Team, 2021) was used to plot and
visualize results by means of the ggplot2 package (Wickham, 2016) in RStudio (version
2022.02.0).
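The analysis pipeline described above can be sketched in a few lines of Python. This is not the SPSS procedure used in the study, only a minimal illustration under stated assumptions: the helper names (pearson_r, simple_regression, paired_t) are hypothetical, the score lists are invented for demonstration, and p-values are omitted because they would require a t-distribution routine outside the standard library.

```python
from math import sqrt
from statistics import mean, stdev


def pearson_r(x, y):
    """Pearson correlation between two equally long score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def simple_regression(x, y):
    """Least-squares regression of y on x: slope, intercept, adjusted R^2."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    n = len(x)
    r2 = pearson_r(x, y) ** 2  # with one predictor, R^2 equals r^2
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - 2)
    return slope, intercept, adj_r2


def paired_t(pre, post):
    """Paired-sample t statistic (pre minus post) and its degrees of freedom."""
    diffs = [a - b for a, b in zip(pre, post)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n)), n - 1


# Invented scores for five texts; these are NOT the study's data.
sentiment_pre = [-0.4, -0.1, 0.0, 0.2, 0.5]
valence_pre = [5.2, 5.6, 5.9, 6.1, 6.4]
sentiment_post = [-0.1, 0.1, 0.3, 0.4, 0.6]

print("r (pretest):", round(pearson_r(sentiment_pre, valence_pre), 3))
print("regression (pretest):", simple_regression(sentiment_pre, valence_pre))
print("paired t (sentiment):", paired_t(sentiment_pre, sentiment_post))
```

Because the difference is computed as pretest minus posttest, an increase after training yields a negative t value, which matches the sign convention of a negative mean difference.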
3. Results
Correlational analyses conducted between sentiment and normative valence scores obtained in the
pretest phase demonstrated a strong, positive relation between both indices (r = .55, p = .001). The relation
was even stronger when the indices obtained in the posttest were analyzed (r = .78, p < .001).
Importantly, regression analyses confirmed that sentiment scores significantly predicted normative
valence ratings, showing a strong, linear relation in the pretest [F(1, 32) = 13.9, R²Adj. = .28, p = .001;
see Graph 1, left panel]. Moreover, this relation was even stronger in the posttest phase [F(1, 32)
= 50.21, R²Adj. = .60, p < .001; see Graph 1, right panel]. These results confirm that more positive
sentiment scores predict more positive valence scores obtained through normative ratings, thus
indicating that Watson NLU is able to determine the emotional tone of a given text.
Graph 1. Normative valence scores obtained for each written text in pretest (left panel) and posttest (right panel)
as a function of sentiment scores (each point represents the mean obtained for each written text, for normative
valence and sentiment scores)
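As a consistency check, the adjusted R² values reported for the regressions can be recovered from the correlations, since in a regression with a single predictor R² equals r². The short worked calculation below assumes n = 33 texts per phase and the standard adjustment formula for one predictor; it is a reader's verification, not an additional analysis.

```latex
\begin{align*}
R^2_{\mathrm{Adj.}} &= 1 - (1 - r^2)\,\frac{n - 1}{n - 2} \\
\text{Pretest: } & 1 - (1 - .55^2)\,\frac{32}{31} = 1 - .6975 \times 1.0323 \approx .28 \\
\text{Posttest: } & 1 - (1 - .78^2)\,\frac{32}{31} = 1 - .3916 \times 1.0323 \approx .60
\end{align*}
```

Both values agree with the adjusted R² figures reported for the pretest and posttest regressions.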
Regarding the t-tests performed to compare the emotionality of L2 texts in the pretest vs. posttest
phases, no differences were observed in normative valence scores between the two phases [t(32) =
0.815, p = 0.421, mean pretest = 5.966, mean posttest = 6.006, mean difference = -0.04]. See Graph 2
(left panel). Indeed, the proportion of emotionally valenced words was similar in both phases (pretest:
61.50%, posttest: 60.72%) and did not differ significantly across phases (p > .05). Conversely, the
analysis considering text-derived sentiment values revealed that the emotionality of the written texts
significantly increased after the specific training [t(32) = -2.01, p = .05, mean pretest = -0.002, mean
posttest = 0.244, mean difference = -0.246]. See Graph 2 (right panel).
Graph 2. Distribution of normative valence scores (left panel) and sentiment scores (right panel) for L2 written
texts across pretest and posttest phases. Dots within each boxplot represent the mean obtained in normative
valence and sentiment scores for each pretest and posttest condition; the asterisk indicates significant differences
for the contrast between sentiment scores in pretest and posttest phases
4. Discussion
The first aim of the present study was to investigate whether specific training could promote changes in
L2 emotionality in written production. Previous research has consistently reported differences between
L1 and L2 in the processing and use of emotional language, as shown through subjective evaluations
and objective measures such as behavioral and neurophysiological responses (Caldwell-Harris, 2015;
Ferré et al., 2010; Foroni, 2015; Kousta et al., 2009; Opitz & Degner, 2012). Since these differences are
often attributed to L2 learning in affectively detached contexts, such as formal instructional settings, it
was hypothesized that increasing learners’ exposure to L2 words in emotional contexts through targeted
training would encourage the use of emotional language and, consequently, improve emotional L2
communication.
To test this hypothesis, B2-level learners of English as an L2 underwent training based on the summary
strategy. Emotionality in their written production before and after the intervention was assessed using
two complementary indices: normative valence scores derived from emotional norms in English
(Warriner et al., 2013) and sentiment scores estimated through Watson NLU. Overall, results confirmed
the usefulness of the training for enhancing L2 emotionality in written production. Notably, this
improvement was more clearly detected through discourse-level sentiment scores than through the
traditional word-based approach based on normative valence ratings.
The emotional tone captured by sentiment scores significantly increased after the application of the
summary strategy. This result supports the usefulness of this teaching approach for enhancing emotional
expression in L2 written production and aligns with previous research showing the benefits of
summarization for reading comprehension and vocabulary learning (Hsiang et al., 2020; Keck, 2014;
Shokrpour et al., 2013; Stevens et al., 2019). Summarizing likely promotes deeper processing of
affective language because learners must re-elaborate linguistic content through metacognitive
operations such as identifying key information, paraphrasing, and synthesizing ideas during text
production (Keck, 2006; Mokeddem & Houcine, 2016). This process may facilitate the integration of
emotional L2 vocabulary into memory and improve later access to such vocabulary during writing. It is
also possible that this strategy could strengthen learners’ ability to express emotions in oral
communication or even influence emotional experience in L2, although these possibilities remain open
questions for future research.
However, contrary to our predictions, the emotional valence of the words used in the texts, measured
through standardized normative ratings in English (Warriner et al., 2013), did not change significantly
across testing phases, despite a slight increase after training. The different sensitivity shown by
sentiment and valence scores may be explained by the distinct dimensions captured by each measure,
particularly in a relatively small sample. Whereas sentiment analysis evaluates the emotional tone