In What Languages are Generative Language Models the Most Formal? Analyzing Formality Distribution across Languages
Asım Ersoy, Gerson Vizcarra, Tasmiah Tahsin Mayeesha, Benjamin Muller

TL;DR
This paper investigates the distribution of formality in multilingual language models across five languages, revealing biases and behaviors in how models generate formal or informal text depending on prompts and language.
Contribution
It provides the first analysis of formality biases in multilingual models like XGLM and BLOOM across multiple languages, with a new dataset of 6,000 annotated samples.
Findings
XGLM generates more informal text in Arabic and Bengali with informal prompts
Models tend to produce more formal text when prompted neutrally
Significant informal predictions occur even with formal prompts
| Prompt | Generation: XGLM(7.5B) | Generation: BLOOM(7.1B) |
| Language | Neutral+ | Formal* | Informal* |
| ar | (When/Then), (Yes), (There), (Unless), (If), (From), (At/When), (I swear), (In), (No) | TAOCD (Zaidan and Callison-Burch, 2011) | TAOCD (Zaidan and Callison-Burch, 2011) |
| bn | | InFormal (Krishna et al., 2022) | InFormal + Microblog dataset (Chowdhury and Chowdhury, 2014) |
| en | The, I, This, He, She, You, They, We, Do, There | GYAFC (Rao and Tetreault, 2018) | GYAFC (Rao and Tetreault, 2018) |
| fr | C’est (It is), Ils (They), Elles (They), Il (He), elle (She), ce (This), Est-ce que (question), Ça (That), Ce (This), Deux (Two) | XFORMAL (Briakou et al., 2021b) | XFORMAL (Briakou et al., 2021b) |
| es | Por la (For the), Las (The), Los (The), Por el (For the), Con unos (With some), Por que la (Why the), Se ha (It had), Por su (Because of), Para un (For a), De una (Of a) | Wikipedia (Cañete, 2019) | 9322 rap lyrics in Spanish (filtered) (Muñoz, 2018) |
| Model/Language | Arabic | Bengali | English | French | Spanish |
| XGLM(2.9B) | 9.3% | 8.0% | 6.7% | 16.0% | 6.7% |
| BLOOM(3B) | 13.3% | 4.3% | 3.3% | 12.0% | 3.3% |
| XGLM(7.5B) | 8.7% | 5.0% | 10.0% | 18.0% | 7.7% |
| BLOOM(7.1B) | 12.3% | 6.3% | 3.7%* | 8.7%* | 2.7%* |
| Model/Language | Arabic | Bengali | English | French | Spanish |
| XGLM(2.9B) | 92% | -3% | 14% | 41% | 58% |
| BLOOM(3B) | 100% | -6% | -6% | -1%* | 79% |
| XGLM(7.5B) | 83%* | 33% | 8% | 32% | 45% |
| BLOOM(7.1B) | 100% | -3%* | -13% | 14% | 67% |
| Model/Language | Arabic (FF% / II%) | Bengali (FF% / II%) | English (FF% / II%) | French (FF% / II%) | Spanish (FF% / II%) |
| XGLM(2.9B) | 89.4% / 61.1% | 79.8% / 100.0%* | 34.0% / 94.0% | 26.7% / 59.5% | 85.9% / 80.2% |
| BLOOM(3B) | 94.2% / 55.1% | 83.7% / 87.1% | 29.2% / 91.7% | 32.0% / 82.0%* | 77.8% / 90.4% |
| XGLM(7.5B) | 88.6% / 76.7%* | 75.5% / 98.8% | 34.4% / 84.7% | 54.0%* / 75.6% | 86.9% / 75.8% |
| BLOOM(7.1B) | 93.5% / 51.1% | 74.0% / 91.9% | 27.6% / 94.0%* | 25.8% / 66.7% | 83.8% / 96.8%* |
| Language | Top-k | Top-p | Temperature |
| Arabic | 50 | 0.95 | 1 |
| Bengali | 50 | 0.95 | 1 |
| English | 50 | 0.95 | 1 |
| French | 50 | 1 | 0.8 |
| Spanish | 50 | 1 | 0.8 |
| Language | Corpus size (GiB): XGLM | Corpus size (GiB): BLOOM | Data source domains: XGLM | Data source domains: BLOOM |
| ar | 64.34* (0.88%) 66 upsampled | 69.71 (4.34%) | Web | Web, news, books, subtitles, Wikipedia, wikisources |
| bn | 11.19* (0.15%) 50 upsampled | 17.32 (1.15%) | Web | Web, Wikipedia, Wikisource, open-source NLP datasets |
| en | 3,324.45 (45.66%) | 451.64 (30.04%) | Web | Papers, Web, patents, books, subtitles, forums, Wikipedia, news |
| fr | 303.76 (4.17%) | 193.94 (12.90%) | Web | Web, scholarly documents from all academic fields (HAL), Wikisource, Wikipedia, subtitles |
| es | 363.83 (4.99%) | 163.07 (10.84%) | Web | Web, subtitles, Wikipedia, news, magazines |
| Arabic: Prompt/Statistic | Avg. Length | Avg. # of sentences | Avg. Length of sentences | Avg. # of emojis | Avg. # of punctuation marks | Avg. # of new lines | Avg. # of dialogue mark(-) |
| | 175.074 | 1.338 | 130.593 | 0.000 | 6.794 | 0.000 | 0.000 |
| | 246.696 | 1.686 | 145.895 | 0.010 | 4.755 | 0.000 | 0.000 |
| | 446.444 | 2.037 | 218.664 | 0.000 | 5.463 | 2.185 | 0.019 |
| | 495.403 | 3.583 | 137.523 | 0.005 | 6.820 | 3.626 | 0.345 |
| | 187.345 | 1.471 | 127.023 | 0.000 | 9.391 | 0.000 | 0.000 |
| | 244.610 | 1.620 | 150.581 | 0.000 | 3.299 | 0.000 | 0.000 |
| | 441.538 | 2.692 | 163.357 | 0.000 | 5.404 | 2.019 | 0.058 |
| | 506.123 | 3.185 | 158.176 | 0.000 | 5.905 | 2.645 | 0.104 |
| Bengali: Prompt/Statistic | Avg. Length | Avg. # of sent. per gen. | Avg. Length of sent. | Avg. # of emojis per gen. | Avg. # of punctuation marks per gen. | Avg. # of new lines per gen. | Avg. # of dialogue mark(-) |
| | 151.734 | 1.552 | 97.431 | 0.357 | 17.487 | 0.000 | 0.000 |
| | 164.149 | 1.256 | 130.467 | 0.000 | 3.091 | 0.000 | 0.000 |
| | 413.338 | 2.128 | 193.667 | 0.128 | 11.047 | 1.561 | 0.020 |
| | 384.252 | 1.338 | 286.909 | 0.014 | 5.518 | 1.288 | 0.007 |
| | 167.110 | 1.728 | 96.294 | 0.360 | 15.507 | 0.000 | 0.000 |
| | 152.767 | 1.248 | 122.199 | 0.008 | 2.880 | 0.000 | 0.000 |
| | 419.400 | 1.845 | 226.829 | 0.187 | 13.484 | 2.058 | 0.155 |
| | 418.500 | 1.198 | 349.046 | 0.000 | 5.063 | 1.127 | 0.008 |
| English: Prompt/Statistic | Avg. Length | Avg. # of sent. per gen. | Avg. Length of sent. | Avg. # of emojis per gen. | Avg. # of punctuation marks per gen. | Avg. # of new lines per gen. | Avg. # of dialogue mark(-) |
| | 225.720 | 3.332 | 67.047 | 0.005 | 7.518 | 0.000 | 0.000 |
| | 261.529 | 3.103 | 83.544 | 0.000 | 5.943 | 0.000 | 0.000 |
| | 584.288 | 10.236 | 56.152 | 0.014 | 19.803 | 6.159 | 0.620 |
| | 646.354 | 6.829 | 93.727 | 0.000 | 12.537 | 2.159 | 0.000 |
| | 241.613 | 3.497 | 68.359 | 0.022 | 8.680 | 0.000 | 0.006 |
| | 281.921 | 3.371 | 82.887 | 0.000 | 6.000 | 0.000 | 0.000 |
| | 575.236 | 10.718 | 52.733 | 0.005 | 22.278 | 7.204 | 1.324 |
| | 639.466 | 6.808 | 93.020 | 0.027 | 14.123 | 2.959 | 0.110 |
| French: Prompt/Statistic | Avg. Length | Avg. # of sent. per gen. | Avg. Length of sent. | Avg. # of emojis per gen. | Avg. # of punctuation marks per gen. | Avg. # of new lines per gen. | Avg. # of dialogue mark(-) |
| | 207.861 | 2.723 | 75.713 | 0.058 | 8.927 | 0.000 | 0.000 |
| | 231.435 | 2.652 | 86.646 | 0.000 | 6.861 | 0.000 | 0.000 |
| | 621.216 | 11.869 | 51.417 | 0.006 | 25.562 | 8.273 | 1.051 |
| | 612.727 | 6.125 | 99.197 | 0.000 | 13.909 | 2.205 | 0.034 |
| | 208.567 | 2.850 | 72.525 | 0.047 | 9.323 | 0.000 | 0.000 |
| | 235.277 | 2.891 | 80.735 | 0.000 | 7.403 | 0.000 | 0.000 |
| | 588.804 | 13.375 | 43.091 | 0.006 | 29.667 | 10.583 | 2.607 |
| | 637.415 | 6.500 | 97.216 | 0.000 | 15.679 | 2.425 | 0.066 |
| Spanish: Prompt/Statistic | Avg. Length | Avg. # of sent. per gen. | Avg. Length of sent. | Avg. # of emojis per gen. | Avg. # of punctuation marks per gen. | Avg. # of new lines per gen. | Avg. # of dialogue mark (-) |
| | 222.789 | 2.798 | 78.974 | 0.028 | 9.514 | 0.000 | 0.000 |
| | 249.69 | 2.228 | 111.517 | 0.000 | 6.123 | 0.000 | 0.000 |
| | 553.59 | 9.291 | 58.689 | 0.000 | 21.000 | 5.427 | 0.846 |
| | 613.827 | 4.532 | 134.672 | 0.000 | 12.012 | 1.734 | 0.012 |
| | 225.454 | 2.981 | 74.957 | 0.019 | 9.870 | 0.000 | 0.000 |
| | 248.728 | 2.32 | 106.663 | 0.006 | 6.254 | 0.000 | 0.000 |
| | 530.218 | 8.589 | 60.846 | 0.000 | 20.565 | 5.435 | 1.331 |
| | 640.643 | 4.661 | 136.668 | 0.000 | 12.393 | 1.655 | 0.012 |
In What Languages are Generative Language Models the Most Formal? Analyzing Formality Distribution across Languages
Asım Ersoy1*, Gerson Vizcarra2,3*, Tasmiah Tahsin Mayeesha4*, Benjamin Muller5
1Huawei Turkey R&D Center 2Nisum Latam 3Banco de Crédito e Inversiones
4North South University 5Sorbonne Université
* Equal contribution. This work was done as part of the Fatima Fellowship mentoring program.
Abstract
Multilingual generative language models (LMs) are increasingly fluent in a large variety of languages. Trained on the concatenation of corpora in multiple languages, they enable powerful transfer from high-resource languages to low-resource ones. However, it is still unknown what cultural biases are induced in the predictions of these models. In this work, we focus on one language property highly influenced by culture: formality. We analyze the formality distributions of XGLM and BLOOM’s predictions, two popular generative multilingual language models, in 5 languages. We classify 1,200 generations per language as formal, informal, or incohesive and measure the impact of the prompt formality on the predictions. Overall, we observe a diversity of behaviors across the models and languages. For instance, XGLM generates informal text in Arabic and Bengali when conditioned with informal prompts, much more than BLOOM. In addition, even though both models are highly biased toward the formal style when prompted neutrally, we find that the models generate a significant amount of informal predictions even when prompted with formal text. We release with this work 6,000 annotated samples, paving the way for future work on the formality of generative multilingual LMs.
1 Introduction
Natural Language Processing (NLP) systems are used worldwide across multiple cultures, audiences, contexts, communication goals, demographics, and languages. It is therefore essential that these models be able to adapt to the sociocultural context of their users. As described by Hershcovich et al. (2022), linguistic style is one of the major dimensions along which cultures vary in NLP technologies.
In this work, we focus on formality. Formality is a stylistic property of language that can impact how we perceive a text. It typically carries information about the culture of the speaker (or writer), is constrained by the context of the message, and can impact the communicative goal of a text (Heylighen and Dewaele, 1999). Generating text with a desired level of formality can be useful for different NLP applications (Hovy and Yang, 2021). Examples include controlling the tone of machine translation models (Sennrich et al., 2016; Niu et al., 2017; Feely et al., 2019), designing formality-aware chatbots that respond in the user's preferred conversational style (Cox and Ooi, 2022), and assisting users in changing the formality level of their writing (Rao and Tetreault, 2018; Wang et al., 2019, 2020).
Generative language models have demonstrated capabilities in producing cohesive texts and solving NLP tasks with zero/few-shot learning (Radford et al., 2019; Brown et al., 2020b; Chowdhery et al., 2022; Zhang et al., 2022), even in multilingual scenarios (Lin et al., 2021b; Scao et al., 2022; Barbieri et al., 2022; Jiang et al., 2022). Multilingual language models are trained with large amounts of text from different sources. That training process could make the model biased towards a certain level of formality because of the data of each language as well as cross-lingual transfer (Pires et al., 2019; Libovický et al., 2020; Muller et al., 2021), limiting the capabilities of the model to adapt to different cultures of an NLP application.
This work analyzes the formality level of two multilingual language models, XGLM (Lin et al., 2021b) and BLOOM (Scao et al., 2022), across five languages: Arabic, Bengali, English, French, and Spanish. To do so, a native/proficient speaker of each language classifies the generation outputs of each model into three categories: formal, informal, and incohesive. This evaluation allows us to analyze the generations along three dimensions: the cohesiveness of the generations (in short, we define a sequence as incohesive if it cannot be evaluated as formal/informal; more details in Section 4.4), the formality bias given neutral prompts, and the formality preservation given formal/informal prompts. As an example, Table 1 shows the predictions of BLOOM and XGLM conditioned on the same Bengali prompt but generating text of different formality levels. Overall, our contributions are the following:
- To the best of our knowledge, this is the first work to analyze the formality of generative multilingual language models across multiple languages. While we have focused on specific models and languages in this work, the procedures followed to define formality, source prompts, generate text, and measure how formality is preserved from prompts are generalizable to any generative system and language. We open-source 1,200 generations per language, manually annotated as formal, informal, or incohesive (https://github.com/asimokby/formality-bias-analysis).
- We find that BLOOM generates texts about twice as long as those of XGLM. In addition, almost all the generated formal sentences are longer than the informal ones. Informal generations in English, French, and Spanish are characterized by being more conversational, and in Bengali by having more punctuation marks.
- We find that BLOOM is significantly more cohesive than XGLM in English, French, and Spanish and performs similarly in the other languages.
- Both XGLM and BLOOM are generally biased toward formal text when prompted in a neutral way. However, both models are very sensitive to the formality of the prompt and will generate informal text if conditioned with an informal prompt. This is particularly striking for Arabic: BLOOM generates dialectal Arabic (considered informal) when prompted with informal text while being extremely biased toward Modern Standard Arabic (considered formal).
2 Formality Across Different Languages
We start by defining formality in the five languages of our study.
Arabic
The Arabic language is spoken in many dialects (Watson, 2011). These dialects are variants of classical or standard Arabic, which has a modernized version called Modern Standard Arabic (MSA). Badawi (1973), in his famous book Mustawayat Al-arabiyya Al-muasira Fi Misr (The levels of contemporary Arabic in Egypt), presents a theory on the relationship between standard Arabic (Fusha) and vernacular Arabic (Ammiya) in Egypt. His theory describes the situation as a continuum with 5 major divisions: illiterate colloquial Arabic, educated colloquial Arabic, elevated colloquial Arabic, modern standard Arabic, and classical Arabic. The first three divisions are Ammiya, which is considered informal and not necessarily grammatically correct. The last two divisions are Fusha, which is considered formal. However, the definition of what is formal and what is informal can depend on the problem at hand. For example, in one setting, elevated colloquial Arabic could be considered formal and illiterate colloquial Arabic informal. In our work, we define formality for Arabic as follows: a piece of text is formal if it contains no words coming from any Arabic dialect that is not considered Fusha, following the definition of Fusha by Badawi (1973). For example, the following sentence: ? ("Where is the closest mosque?") is formed of only Fasih (formal) words. Similarly, a piece of text is informal if it contains a word coming from any dialect other than Fusha. For example, ? ("Where is the closest mosque?") is informal because the word for "where" comes from Egyptian Arabic.
Bengali
Bengali has a complex and elaborate system of pronouns for expressing the degrees of familiarity and formality between the participants in a conversation (Das, 1968; Uddin, 2019). The T-V distinction (Brown et al., 1960), that is, the contextual use of pronouns to convey varying levels of formality, familiarity, and politeness, is found in many European languages (French, German, Italian, Spanish, etc.) and can also be seen in Bengali. Bengali follows a tripartite form of second-person pronouns like other Asian languages, including Hindi/Urdu (Bhatt, 2012, 2015) and Malaysian (McGinn et al., 1991), and can be considered a T/N/V language (Thompson, 2006; Uddin, 2019), with a neutral or semi-formal level in addition to formal and informal.
The set of pronouns to be used depends on the relationship between the speaker and the audience and on their level of intimacy. For instance, you in English has three variants in Bengali: Apni (formal) for respected elders and strangers, Tumi (polite) for siblings, friends, or familiar people, and Tui (informal) for younger people, children, or very close friends. The third-person he/she can be translated as Tini (formal) or Se (informal), which encodes two levels of formality: honorific and non-honorific. Bengali pronouns can encode number (singular/plural), but the notion of formality is not changed by gender or number (David, 2015).
The following are other considerations of formality in Bengali:
- Texts containing a high frequency of Sanskrit-originated words can be considered formal. Agglutinated or compound words can be considered more formal than their analytical or elaborated forms. For example, the word for death has a formal and an informal form with the same meaning but a different formality (Panda, 1992; Nagarajan, 2014; Ghosh et al., 2022).
- Bengali pronouns agree with the verb in level of formality, and there are formal and informal variations of the same verb (David, 2015; Sultana, 2016). For instance, verbs like give, eat, and go each have a formal and an informal written form, chosen depending on the context.
- Bengali does not contain any negative pronoun or adverb, and sentences are made negative at the syntactic level by adding negation modifiers such as na/nei/nai/ni. These negation modifiers can indicate variations in formality (Thompson, 2006).
- Among Bengali speakers in Bangladesh, regional dialects like Sylheti, Chakma, and Chittagonian are generally considered deviant or informal, while the classical Bengali dialect (Sādhubhāsā) and the standardized Bengali dialect (Cholito vasha) are considered formal (Ray et al., 1966).
English
Formality in English is commonly defined as the style of language used in a given situation. A formal speech, for instance, has a very careful selection of pronunciation, words, and structure (Richards and Schmidt, 2013). Heylighen and Dewaele (1999) divide English formality into two dimensions: a deep formality, characterized by the understanding of the precise meaning, avoiding ambiguity; and a surface formality which focuses on the rigorous selection of manners. Some recent works focus on the latter to evaluate formality using the selection of words (Brooke et al., 2010) and discarding the topic (Pavlick and Tetreault, 2016). In accordance with Liardét et al. (2019), we use the following rules to evaluate cohesive English text as informal:
- Presence of contractions, abbreviations, and colloquial expressions.
- Presence of grammar infelicities, that is, unsuitable expressions, inconsistencies in writing, and misspellings.
- High occurrence of delexical verbs and phrasal verbs.
- Higher involvement of human participants and subjective judgments, such as opinions.
French
Formality in French is typically classified into three registers: soutenu, courant, and familier (Gadet, 2005; Beeching et al., 2009). The register soutenu is reserved for legal documents, literature, or addressing someone to whom we want to show particular respect (e.g., a judge). It usually involves addressing a single person with the second-person plural vous (vouvoiement). The register courant is the one used in day-to-day life, for instance when talking to someone new; it is typically neutral and includes few grammatical errors. The register familier is the one used with friends or within a family circle. It usually involves addressing someone with the second-person singular tu (tutoiement). It can include a large number of grammatical errors, as well as slang and insults in their most vulgar form. In this work, following XFORMAL (Briakou et al., 2021b), we classify generated text into two classes: soutenu is associated with the formal class, while familier and courant are associated with the informal class.
Spanish
Formality in Spanish is commonly described by the T-V distinction in the singular second-person pronoun derived from Latin. Specifically, there are two possible translations for the English pronoun "you": tú is considered informal while usted is formal. The two pronouns have different conjugations. Thus, the formality of sentences that use the singular second person is easily recognizable.
In the case of the other pronouns, the first person is often considered less polite than the third one (Stewart, 2001). For that reason, the third person is commonly used in scientific texts (Salazar et al., 2013). Aside from the pronouns and their conjugations, according to Cépeda and Tavera (2007), a formal text in Spanish should have other characteristics, such as:
- Having no typographical or grammatical errors.
- Being a set of sentences referring to the same topic.
- Being arranged in paragraphs and having a coherent correlation between ideas using appropriate connectors.
In our work, we first check for the presence of slang or offensive terms in a sequence to classify text as informal. Then, for sentences written in the second person, the T-V distinction defines the formality level. Similarly, sentences written in the third person are more likely to be classified as formal than those written in the first person. The final criterion is the layout: paragraph-structured sequences are considered formal in more scenarios than conversational-structured ones.
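The priority order described above can be sketched as a small rule cascade. This is only an illustration: in the paper the classification is done by a human annotator, the word lists below are tiny hypothetical placeholders, and the "more likely" tendencies for grammatical person and layout are simplified here into hard decisions.

```python
import re

# Hypothetical, tiny word lists for illustration only; not taken from the paper.
SLANG = {"wey", "joder"}
INFORMAL_SECOND_PERSON = {"tú", "contigo"}
FORMAL_SECOND_PERSON = {"usted"}
FIRST_PERSON = {"yo", "nosotros"}


def classify_spanish(text: str) -> str:
    """Apply the four priorities in order and return 'formal' or 'informal'."""
    words = set(re.findall(r"\w+", text.lower()))
    # 1. Slang or offensive terms make the sequence informal.
    if words & SLANG:
        return "informal"
    # 2. T-V distinction for sentences in the second person.
    if words & INFORMAL_SECOND_PERSON:
        return "informal"
    if words & FORMAL_SECOND_PERSON:
        return "formal"
    # 3. First person leans informal (simplified to a hard rule).
    if words & FIRST_PERSON:
        return "informal"
    # 4. Layout: mostly dialogue-style lines lean informal, paragraphs lean formal.
    lines = [line for line in text.splitlines() if line.strip()]
    dialogue_lines = sum(line.lstrip().startswith("-") for line in lines)
    if lines and dialogue_lines > len(lines) / 2:
        return "informal"
    return "formal"
```

A real annotator weighs these cues together rather than stopping at the first match, so the cascade should be read as a description of the priority order, not as a classifier.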
3 Related Work
Biases of Generative Language Models
Recent literature on Large Language Models (LLMs) demonstrated social bias and prejudice against minorities Sheng et al. (2021); Blodgett et al. (2020); Bender et al. (2021); Bommasani et al. (2021); Liang et al. (2021) in terms of many categories including gender Sun et al. (2019); Cao and Daumé III (2020); Felkner et al. (2022), race Davidson et al. (2019), religion Abid et al. (2021); Malik et al. (2022), occupation, politics and disabilities which result in the production of damaging content. Evaluating social bias and harm produced by monolingual language models is hard, but difficulties increase in multilingual settings. To create multilingual evaluation frameworks, it has been argued that careful curation of culturally aware datasets and knowledge of cultural differences that exist between languages is necessary Talat et al. (2022).
Many papers have focused on measuring social biases and stereotypes against historically disadvantaged groups and counteracting them for a limited number of languages like English Nadeem et al. (2021); Nangia et al. (2020); Barikeri et al. (2021), French Névéol et al. (2022), Hindi Malik et al. (2022), but similar work has not been done for low-resource languages like Bengali. Since LLMs such as BLOOM Scao et al. (2022) can be continuously (re)trained and are deployed by companies to be accessible by users, proposals have been made to create social bias verification pipelines for LLMs similar to software testing Nozza et al. (2022). To our knowledge, the evaluation of multilingual models for measuring cultural biases like formality has not been attempted so far.
Formality Analysis
Previous work in formality analysis has focused on formality classification Heylighen and Dewaele (1999); Abu Sheikha and Inkpen (2010); Pavlick and Tetreault (2016); Dementieva et al. (2022), formality style transfer in English Rao and Tetreault (2018); Wang et al. (2019, 2020); Czeresnia Etinger and Black (2019); Madaan et al. (2020); Yao and Yu (2021); Briakou et al. (2021a), and in the multilingual setting Korotkova et al. (2019); Briakou et al. (2021b); Krishna et al. (2022). Formality-sensitive machine translation to control the generation of machine translation models to target formality has received attention in recent years Sennrich et al. (2016); Niu et al. (2017); Feely et al. (2019); Viswanathan et al. (2020); Niu and Carpuat (2020); Schioppa et al. (2021) and benchmark MT datasets and models have been published Nadejde et al. (2022); Rippeth et al. (2022).
Recently, several datasets with formality annotations have been introduced in multiple languages. Initial attempts included annotating sentences from various resources such as emails, news, online forums, and blogs with numerical formality ratings (Lahiri, 2015; Pavlick and Tetreault, 2016). Grammarly's Yahoo Answers Formality Corpus (GYAFC) (Rao and Tetreault, 2018) is a benchmark formality style transfer dataset for English. XFORMAL (Briakou et al., 2021b) extended formality style transfer to the multilingual setting by collecting data for three languages (Brazilian Portuguese, French, and Italian). InFormal (Indic Formality Evaluation Dataset) (Krishna et al., 2022) is a small dataset of 4k samples in four Indic languages (Hindi, Bengali, Kannada, and Telugu) with crowdsourced formality annotations. TAOCD (The Arabic Online Commentary Dataset) (Zaidan and Callison-Burch, 2011) is an annotated dataset of informal Arabic with high dialectal content, containing 108k labeled sentences. In our work, we use GYAFC (English), XFORMAL (French), TAOCD (Arabic), and InFormal (Bengali) to source prompts for our analysis of language models, along with other resources described in Table 2. In the following sections, we describe our experiments and results for the different languages.
4 Experiments
We evaluate different dimensions of formality in the generation outputs of two state-of-the-art generative multilingual language models, XGLM (Lin et al., 2021b) and BLOOM (Scao et al., 2022), in five languages: Arabic, Bengali, English, Spanish, and French. We hypothesize that the influence of high-resource languages in the training corpus can induce formality biases in the models as a whole. To observe their behavior in different scenarios, we vary the length and formality of the prompts. In addition, we adjust some generation parameters to avoid incohesive outputs.
4.1 Language Models
XGLM
(Lin et al., 2021b) is a multilingual generative language model based on a decoder-only transformer. XGLM is trained with 500 billion tokens belonging to 30 languages. XGLM aims to achieve multilingual zero-shot and few-shot learning performance for different tasks. To do so, its authors propose multilingual prompting to improve on the results of single-language prompts. XGLM comes in five sizes, ranging from 564 million to 7.5 billion parameters. We employ the models with 2.9 and 7.5 billion parameters for this study (we use the checkpoints and implementations from https://huggingface.co/models).
BLOOM
(Scao et al., 2022) is also a multilingual generative language model, trained on around 341 billion tokens from a corpus of 59 languages (13 of them programming languages) with the goal of democratizing access to very large pre-trained language models. BLOOM was trained on a collection of multiple sources such as Huggingface datasets, Github code, and Web Common Crawl. The data sources were then preprocessed to reduce non-natural language and anonymize personally identifiable information. BLOOM's architecture follows Megatron-LM GPT2 (Shoeybi et al., 2019) with modifications such as a normalization layer after the embeddings, ALiBi positional embeddings (Press et al., 2021), and byte-level Byte Pair Encoding tokenization (Radford et al., 2019). BLOOM was released in different sizes ranging from 560 million to 176 billion parameters. We use the 3B and 7.1B parameter checkpoints (also from https://huggingface.co/models) for our experiments, as they are comparable in size to the XGLM ones.
XGLM and BLOOM are decoder-based transformers pre-trained on a similar set of languages with a comparable amount of data. We compare checkpoints of similar scale (i.e., XGLM 2.9B with BLOOM 3B, and XGLM 7.5B with BLOOM 7.1B). Regarding training data, BLOOM was trained on a more varied set of domains than XGLM, even though the XGLM corpus is larger. In addition, the BLOOM corpus has a more balanced distribution of data across the languages evaluated in this study. More details about the quantity and sources of the training data of both models can be found in Appendix C.
4.2 Prompting for Formality Evaluation
We employ two prompting strategies to condition the generation of the models. In that way, the behavior of the model in different scenarios can be assessed.
Short Neutral Prompts
A short prompt is composed of up to three words to condition the language of the output without giving any context that could impact the formality level. That allows us to measure the models’ tendency to produce a certain formality level with a neutral input. From the lexicon of each language (http://corpus.rae.es/lfrecuencias.html, https://www.pinhok.com/kb/bengali/98/100-basic-bengali-vocabularies/, https://talkinarabic.com/arabic-words/, https://en.wikipedia.org/wiki/Most_common_words_in_English, https://strommeninc.com/1000-most-common-french-words-frequency-vocabulary/), we pick a set of common words (or combinations of them, to avoid confusion between languages when generating) that can be used in both formal and informal sentences. We illustrate the short prompts we use in Table 2.
Long Informal/Formal Prompts
This set of prompts is composed of truncated sentences extracted from existing formal/informal sources. Using these prompts, we can verify how much the models preserve the formality level of their input. The sources of the prompts include formality datasets such as GYAFC (Rao and Tetreault, 2018), XFORMAL (Briakou et al., 2021b), and InFormal (Krishna et al., 2022). We also include datasets crawled from the web (Zaidan and Callison-Burch, 2011; Cañete, 2019) and informal song lyrics (Muñoz, 2018).
Table 2 details which words/group of words we use as short prompts and the dataset sources of the formal/informal prompts for each language.
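As a rough illustration of how a long prompt could be cut from a source sentence, the sketch below keeps the first whitespace-separated words of a sentence. The fixed cut-off of 15 words is an assumption for illustration (Section 4.3 reports 15 words on average, not an exact rule), and the function names are hypothetical.

```python
def truncate_to_prompt(sentence: str, max_words: int = 15) -> str:
    """Keep the first `max_words` whitespace-separated words of a sentence."""
    words = sentence.split()
    return " ".join(words[:max_words])


def is_short_prompt(prompt: str, max_words: int = 3) -> bool:
    """Check that a neutral prompt has at most `max_words` words."""
    return 0 < len(prompt.split()) <= max_words
```

Splitting on whitespace mirrors the word counting described for the formal/informal prompts; a subword tokenizer would give different lengths.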
4.3 Generation Parameters
Decoding parameters are important because they can affect the output of a language model directly. For each language, we select a set of parameters to produce fluent text that can be evaluated properly. All parameters were chosen to affect the natural formality level of the models as little as possible. This subsection lists the generation parameters needed to reproduce our experiments.
Global generation parameters
Our evaluation of the models is based on the formality of the outputs of each model. Very short sentences, code snippets, and outputs in other languages cannot be evaluated properly. This set of parameters is a collection of language-independent configurations to produce an assessable amount of outputs with a significant length to be evaluated.
1. We filter out the generation sequences that are not natural language (i.e., code) by excluding from the generation process all the tokens that contain any of the following symbols: , , , , , , , , , , and .
2. We force the model to generate at least 30 new subword tokens (excluding the prompt) to have a long enough generation sequence and be able to assess formality.
3. We set a maximum of 150 new tokens of generation to avoid long outputs that could include multiple formality variations.
4. Length of the prompts. For the short-prompt setting, we employ at most three tokens to condition the generation in the desired language. For the formal/informal prompts, we use 15 words (tokenization with white spaces) on average.
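The token filter and the length bounds listed above can be sketched in a few lines. This is a minimal illustration rather than the code used in the experiments: the symbol set `HYPOTHETICAL_SYMBOLS` is a placeholder (the actual exclusion list is not reproduced here), and the toy vocabulary and helper names are assumptions made for the example.

```python
# Placeholder symbols only; the actual exclusion list is not reproduced here.
HYPOTHETICAL_SYMBOLS = {"{", "}", "<", ">", "=", ";"}

MIN_NEW_TOKENS = 30   # lower bound on newly generated subword tokens (from the text)
MAX_NEW_TOKENS = 150  # upper bound on newly generated subword tokens (from the text)


def banned_token_ids(vocab, symbols=HYPOTHETICAL_SYMBOLS):
    """Return the ids of all vocabulary tokens that contain a banned symbol."""
    return sorted(
        token_id
        for token, token_id in vocab.items()
        if any(symbol in token for symbol in symbols)
    )


def length_is_valid(num_new_tokens):
    """Check that a generation respects the minimum and maximum length bounds."""
    return MIN_NEW_TOKENS <= num_new_tokens <= MAX_NEW_TOKENS


# Toy vocabulary mapping token strings to ids (hypothetical).
toy_vocab = {"hello": 0, "world": 1, "x={": 2, "<div>": 3, "bonjour": 4}
banned = banned_token_ids(toy_vocab)
```

In practice, the resulting id list would be handed to the decoding loop so that those tokens are never sampled, while the two bounds constrain when generation may stop.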
Regarding the total number of evaluated outputs, we generated three sets for each evaluated model and language: 100 with short, 100 with formal, and 100 with informal prompts. With four models (XGLM(2.9B), XGLM(7.5B), BLOOM(3B), and BLOOM(7.1B)), that resulted in 1,200 generated outputs for each language.
Language-specific generation parameters
Before generating the sequences for formality evaluation, we tweaked some logit parameters for each language. All modifications were done to obtain more fluent sequences and reduce incohesive outputs such as ones with generation repetitions or non-understandable text. This process was done with a varied set of prompts regardless of length and formality level.
We use sampling to obtain the generation outputs for both models. Three specific parameters were set for both models. We set top-k to 50, which truncates the number of tokens to sample from. We set a high top-p (Holtzman et al., 2019) to generate diverse sampled tokens by cumulative probability, and a high temperature (Ackley et al., 1985), which does not skew the distribution towards high-probability tokens. The specific details of the parameters can be found in Appendix A (Table 6).
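To make the interaction of the three sampling parameters concrete, the following is a minimal pure-Python sketch of temperature scaling followed by top-k and top-p truncation. It is an illustration under stated assumptions, not the implementation used for the experiments: only top-k = 50 comes from the text, while the default `temperature` and `top_p` values below are placeholders (the actual per-language values are in Table 6), and the function names are hypothetical.

```python
import math
import random


def filter_distribution(logits, temperature=1.0, top_k=50, top_p=0.95):
    """Apply temperature, then top-k, then top-p; return {token_id: probability}."""
    # Temperature scaling: values above 1 flatten the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep only the k most probable tokens.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for token_id in order:
        kept.append(token_id)
        cumulative += probs[token_id]
        if cumulative >= top_p:
            break

    # Renormalize over the surviving tokens.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample_token(logits, rng, **kwargs):
    """Draw one token id from the filtered distribution."""
    dist = filter_distribution(logits, **kwargs)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]


toy_logits = [2.0, 1.0, 0.0, -1.0]  # hypothetical logits over a 4-token vocabulary
token = sample_token(toy_logits, random.Random(0), top_k=3, top_p=0.9)
```

A higher temperature lowers the probability of the top token before truncation, which is why it widens the set of tokens that survive the top-p cut.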
4.4 Formality Evaluation
We assessed the formality of all generated outputs. To do so, one native/proficient speaker of each language classified all 1200 generated sequences individually. We opted for this evaluation procedure because, at the time of performing the experiments, to our knowledge, there were no multilingual formality classifier models that include Arabic, Bengali, English, Spanish, and French. To avoid possible biases, each generated output was annotated without looking at its prompt and in a randomized order.
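The blind, randomized annotation setup described above could be prepared along the following lines. This is a hedged sketch: the record fields (`prompt`, `generation`), the fixed seed, and the helper name are assumptions made for illustration, and the actual annotation tooling is not described in the text.

```python
import random


def build_annotation_queue(records, seed=0):
    """Shuffle generations and hide their prompts before annotation.

    Returns the queue shown to the annotator and a key that maps each
    queue position back to the index of the original record.
    """
    rng = random.Random(seed)
    order = list(range(len(records)))
    rng.shuffle(order)
    # The annotator only sees an opaque item id and the generated text.
    queue = [
        {"item_id": position, "generation": records[index]["generation"]}
        for position, index in enumerate(order)
    ]
    key = {position: index for position, index in enumerate(order)}
    return queue, key


# Hypothetical records; real ones would hold the model outputs.
toy_records = [
    {"prompt": f"prompt {i}", "generation": f"generated text {i}"} for i in range(5)
]
queue, key = build_annotation_queue(toy_records)
```

Keeping the key separate from the queue lets the labels be joined back to the prompt type and model only after annotation is finished.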
The classification categories for all languages are formal, informal, and incohesive. A sequence is classified as formal or informal according to the rules of each language described in section 2. The “Incohesive” label is only assigned under certain conditions, such as sequences written in other languages, non-understandable text, very short sequences that cannot be evaluated for formality level, or code snippets.
5 Results & Analysis
We interpret our results across different dimensions. We start by analyzing the cohesiveness of each model. We then exclude the incohesive text from our formality analysis.
5.1 Cohesiveness of Generation
As seen in Table 3, BLOOM(7.1B) generates significantly more cohesive texts than XGLM(7.5B) for English, French, and Spanish with p-values under 5%, based on a permutation-based statistical test.
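A permutation test of this kind can be sketched as follows. The exact test statistic and number of permutations are not specified in the text, so this sketch assumes a two-sided test on the absolute difference in the proportion of cohesive outputs; the function name and the example counts are hypothetical and do not come from Table 3.

```python
import random


def permutation_test(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference of two proportions.

    a and b are lists of 0/1 flags (1 = cohesive). Returns an estimated p-value.
    """
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Reassign the pooled labels to the two groups at random.
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[: len(a)], pooled[len(a):]
        diff = abs(sum(perm_a) / len(perm_a) - sum(perm_b) / len(perm_b))
        if diff >= observed - 1e-12:  # tolerance for floating-point ties
            at_least_as_extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Hypothetical cohesiveness flags for two models on 100 generations each.
model_a = [1] * 95 + [0] * 5
model_b = [1] * 70 + [0] * 30
p_value = permutation_test(model_a, model_b, n_permutations=2000)
```

A p-value under 0.05 would then be read as a significant difference in cohesiveness between the two models.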
Interestingly, the results in Table 3 also show that a larger model does not necessarily lead to more cohesive generations. For example, BLOOM(3B) generates more cohesive texts than BLOOM(7.1B) for Bengali and English. XGLM(2.9B) also generates more cohesive texts than XGLM(7.5B) for English, French, and Spanish. We note that we are only evaluating cohesiveness in a binary way (cohesive vs. incohesive) and are not judging the quality of the predictions beyond that.
Besides, the percentage of incohesive texts is noticeably higher for some languages than others for both BLOOM and XGLM. For example, the highest percentage of incohesive texts in the case of Bengali, English, and Spanish is less than or equal to 10%, while that percentage is higher in the case of Arabic and French.
5.2 Formality-Level Bias
Neutral prompts, given to a presumably unbiased model, should lead to balanced distributions of formal and informal generations, with a difference close to zero between the two. However, this is not the case here, as we show in Table 4. In the case of Bengali, we see that XGLM(2.9B), BLOOM(3B), and BLOOM(7.1B) are almost neutral, with small differences of -3%, -6%, and -3%, respectively, showing a slight bias toward informal generations. On the other hand, XGLM(7.5B), surprisingly, shows significantly more bias toward formal generations than BLOOM(7.1B), with a difference of 33%. Upon qualitative analysis, we found that many of the generations of XGLM(7.5B) had attributes resembling Bengali Islamic religious text, which were considered formal during annotation, and that hashtags and emojis were used less often than with the smaller model for neutral prompts.
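The bias measure discussed here can be written as a signed difference between the shares of formal and informal generations. The sketch below assumes that percentages are computed over cohesive generations only (following the exclusion stated at the start of Section 5) and that positive values denote a formal bias, consistent with the negative values reported for informal-leaning models; the function name and the example labels are hypothetical.

```python
def formality_bias(labels):
    """Return (% formal - % informal) among cohesive generations.

    Positive values indicate a bias toward formal text, negative values a
    bias toward informal text, and 0 a perfectly balanced distribution.
    """
    cohesive = [label for label in labels if label in ("formal", "informal")]
    if not cohesive:
        raise ValueError("no cohesive generations to evaluate")
    n_formal = sum(label == "formal" for label in cohesive)
    n_informal = len(cohesive) - n_formal
    return 100.0 * (n_formal - n_informal) / len(cohesive)


# Hypothetical annotations for 100 neutral-prompt generations.
toy_labels = ["formal"] * 60 + ["informal"] * 30 + ["incohesive"] * 10
bias = formality_bias(toy_labels)
```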
For French, BLOOM again shows less bias, with a bias of only 1% toward informal generations for BLOOM(3B) and 14% toward formal generations for BLOOM(7.1B). On the other hand, XGLM(2.9B) shows significantly more bias than BLOOM(3B) toward formal generations, with a difference of 41%. For English, XGLM and BLOOM both show a small bias (in terms of percentages), but in different directions. XGLM(2.9B) and XGLM(7.5B) show bias toward formal generations by 14% and 8%, respectively, whereas BLOOM(3B) and BLOOM(7.1B) display bias toward informal generations by 6% and 13%, respectively. After a careful review of the predictions, we find that the French and English informal predictions of BLOOM are due to a large proportion of informal generated dialog.
For Spanish, BLOOM shows an extreme bias toward formal generations, with a difference of 79% for BLOOM(3B) and 67% for BLOOM(7.1B). XGLM exhibits less bias toward formal generations, with a difference of 58% for XGLM(2.9B) and 45% for XGLM(7.5B). These values indicate that both models are influenced by formal sources. In fact, most of the generated sequences with short prompts have the style of news titles/contents and Wikipedia articles.
A biased distribution of outputs could be explained by the data the model was trained on. As stated in BLOOM (Scao et al., 2022), the biggest part of the corpus for Arabic was the Arabic-focused Masader repository (Alyafeai et al., 2021; Altaher et al., 2022), which is dominated by Modern Standard Arabic (MSA), which is considered formal according to our definition of formality in section 2. This explains the extreme bias that BLOOM(3B) and BLOOM(7.1B) show toward formal generations, with a bias of 100%. XGLM(7.5B) similarly shows an extreme bias toward formal generations, but significantly less than BLOOM(7.1B), with a difference of 83%.
In terms of model size, we notice that XGLM(2.9B) shows more bias towards formal or informal generations than XGLM(7.5B) for all the languages except Bengali, which could indicate that the bigger the XGLM model’s size, the less biased it is. On the other hand, this isn’t the case for BLOOM, as BLOOM(3B) only expresses more bias for Bengali and Spanish, while BLOOM(7.1B) shows more bias for English and French.
In summary, the models show moderate bias for some languages such as English and Bengali, except for XGLM(7.5B) in the case of Bengali, while also showing extreme bias for other languages such as Arabic, French, and Spanish. This difference might be caused by the fact that every language is present in the data with a different percentage and is coming from different sources as shown in Table 7. Overall, it is noticeable that the bias is mostly toward formal generations for all the models and for all the languages.
5.3 Formality-Level Preservation
In this experiment, we measure how often the formality level of a generation matches the formality level of the prompt (i.e., how well the model preserves the formality level of the prompt). We find that the formality style of the prompts is preserved well for some languages by some models while being almost ignored in some other cases.
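The preservation measure can be sketched as the share of generations whose annotated label matches the formality of the prompt. As with the bias analysis, this sketch assumes that incohesive generations are excluded from the denominator; the function name and the example labels are hypothetical.

```python
def preservation_rate(prompt_formality, labels):
    """Return the % of cohesive generations whose label matches the prompt's formality."""
    if prompt_formality not in ("formal", "informal"):
        raise ValueError("prompt_formality must be 'formal' or 'informal'")
    cohesive = [label for label in labels if label in ("formal", "informal")]
    if not cohesive:
        raise ValueError("no cohesive generations to evaluate")
    matches = sum(label == prompt_formality for label in cohesive)
    return 100.0 * matches / len(cohesive)


# Hypothetical annotations for generations obtained with formal prompts.
toy_labels = ["formal"] * 47 + ["informal"] * 3 + ["incohesive"] * 50
rate = preservation_rate("formal", toy_labels)
```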
For Arabic, as we show in Table 5, BLOOM(3B) and BLOOM(7.1B) preserve the formality style of 94.2% and 93.5% of the samples, respectively, when the given prompt is formal. However, BLOOM pays less attention to the style of the informal prompts and preserves the style of only 55.1% of the samples with BLOOM(3B) and 51.1% of the samples with BLOOM(7.1B). This confirms our finding from section 5.2 that BLOOM is biased toward formality in Arabic. XGLM(7.5B), on the other hand, preserves the informal style of the prompts significantly better than BLOOM(7.1B), with a percentage of 76.7%.
For Bengali, XGLM(2.9B) preserves the informal style of the prompts in significantly more samples than BLOOM(3B), with a percentage of 100%. BLOOM pays attention to the informal style of the prompts as well, unlike the case for Arabic, and preserves the style of 87.1% of the samples generated with BLOOM(3B) and 91.9% of the samples generated with BLOOM(7.1B).
For English, neither BLOOM nor XGLM preserves the formal style of the prompts for more than 34.4% of the samples for any model. However, they both preserve the informal style in at least 84.7% of the generated samples, with BLOOM(7.1B) preserving significantly more samples than XGLM(7.5B). A similar trend follows for French, with both BLOOM and XGLM unable to preserve the formal style for more than 32.0% of the samples in the case of XGLM(2.9B), BLOOM(3B), and BLOOM(7.1B). On the other hand, XGLM(7.5B) preserves the formal style significantly better than BLOOM(7.1B), with a percentage of 54.0%. Again, the informal style is preserved better, specifically by BLOOM(3B), which preserves the style better than XGLM(2.9B), with a percentage of 82%.
The formal and informal styles in Spanish are preserved consistently across the models to at least 77.8% of the samples with formal prompts and at least 75.8% with informal prompts with BLOOM(7.1B) preserving the style in significantly more samples than XGLM(7.5B).
In terms of model size, we notice that the size of the model is not an indicator of how well the model can preserve the formality style. For example, BLOOM(3B) preserves the formal style better than BLOOM(7.1B) for all languages except Spanish. In summary, we see that the informal style is mostly preserved well for most languages except with BLOOM for Arabic. The formal style, on the other hand, is mostly preserved well for all languages except English and French.
5.4 General Statistics about Generations
We report in Table 8 general statistics about the generated texts of each model and language by formality level. Results show that BLOOM generates texts that are about twice as long as those of XGLM. In terms of the average number of sentences per generation, BLOOM generates more and shorter sentences when the generation is informal than when it is formal. Also, informal generations tend to have emojis, as expected, especially in the case of Bengali. Besides, informal generations tend to have more punctuation marks than formal ones. Finally, the results of the average number of new lines and the average number of “-”, which are used to signal dialogues, support what we mentioned earlier about BLOOM’s tendency to generate conversational text.
6 Discussion
Formality bias, when present in multilingual models, which are increasingly popular nowadays, can lead to undesirable outcomes. For example, using “please” is common among North American English native speakers in requests, even among close friends, while in Arabic, it could be considered awkward, if not rude, in conversations among close friends (Hovy and Yang, 2021). A usage example of language models is solving downstream tasks using prompting techniques for zero-shot learning, such as the work of Zhong et al. (2021) on question-answering. Prompting has also been used to utilize large language models for conversational chatbots such as ChatGPT (Ouyang et al., 2022). As prompting is becoming popular, we must understand that prompting a model that exhibits formality bias could be a barrier to getting the expected output. Furthermore, depending on the application, formality bias could even lead to misunderstandings (Hershcovich et al., 2022) and conflicts if the models, for example, are not able to generate text in the formality style that users expect.
Controlling the generations of LLMs has been taken into consideration in recent work, such as Ouyang et al. (2022), which fine-tuned a language model (Brown et al., 2020a) to align it with the intent of the users using reinforcement learning from human feedback (RLHF) (Christiano et al., 2017; Stiennon et al., 2020). Future work could analyze the impact of RLHF on the formality distributions present in language models. Furthermore, our work focused only on two pre-trained models with up to 7B parameters. The same analysis could be conducted for larger models such as GPT-3 and BLOOM(175B). Finally, the increase in the number of multilingual language models calls for more work on the bias analysis of multilingual language models.
7 Conclusion
In conclusion, we analyzed the formality level of the generations of two large-scale generative language models, XGLM and BLOOM, ranging from 2B parameters to 7B parameters. We first observed the cohesiveness of the predictions. We found that BLOOM(7.1B) predicts significantly more cohesive text than XGLM(7.5B) for English, French, and Spanish. Second, we showed that, across all five languages, both models tend to generate formal text when prompted neutrally. Finally, we found that the formality of the prompt highly impacts both models. In most cases, they generate the same style as the prompt, with slight differences between the models depending on the language. Our analysis is based on the annotations of 1,200 generations in each of Arabic, Bengali, English, French, and Spanish. We release them with this paper, opening future avenues for modeling the formality of generative multilingual language models.
8 Acknowledgment
We thank the Fatima Fellowship (cf. https://www.fatimafellowship.com/) and Hugging Face for organizing and sponsoring the Fatima Research Fellowship program.
Appendix A Generation parameters
Table 6 shows details of the language-specific generation parameters we used for both BLOOM and XGLM.
Appendix B Descriptive statistics of the generations
General statistics of the generations are in Table 8 reported per language for each model and generation label pair. The table contains the following statistics: the average length of the generation, the average number of sentences in a generation, the average length of the sentences, the average number of emojis per generation, the average number of punctuation marks per generation, the average number of new lines per generation, and finally, the average number of the dialogue mark/dash (-) per generation.
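Most of the statistics listed above can be approximated with simple string operations. The sketch below is only an illustration: the unit of length (whitespace-separated words here), the sentence-splitting rule, and the punctuation set are assumptions, emoji counting is omitted, and the function name is hypothetical.

```python
import re
import string

# ASCII punctuation plus a few marks used in Bengali and Arabic (assumed set).
PUNCTUATION = set(string.punctuation) | {"।", "؟", "،"}

# Naive sentence boundary: runs of terminal punctuation marks (assumption).
SENTENCE_BOUNDARY = re.compile(r"[.!?।؟]+")


def generation_stats(text):
    """Compute simple descriptive statistics for one generated text."""
    words = text.split()
    sentences = [s for s in SENTENCE_BOUNDARY.split(text) if s.strip()]
    return {
        "n_words": len(words),
        "n_sentences": len(sentences),
        "avg_sentence_words": len(words) / len(sentences) if sentences else 0.0,
        "n_punctuation": sum(ch in PUNCTUATION for ch in text),
        "n_newlines": text.count("\n"),
        "n_dashes": text.count("-"),
    }


# Hypothetical dialogue-like generation.
stats = generation_stats("- Hi!\n- How are you? Fine.")
```

Averaging these per-generation values over each model and label pair would give a table in the spirit of Table 8.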
Appendix C XGLM and BLOOM training corpora
We show in Table 7 details of the languages used in our analysis in the training corpus of BLOOM and XGLM.
Appendix D Formality Distribution
We visualize the annotated data for each language to give an overview of all the results. Each language is represented by a plot (see Figures 1, 2, 3, 4, and 5) with 12 bars: 3 bars for each of the 4 models, one per prompt type (formal, informal, and neutral). Each bar in the plot represents 100 texts generated with the corresponding model when prompted with the corresponding prompt type. The colors in each bar represent the 3 possible annotations: formal, informal, and incohesive.
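The data behind such a stacked bar plot is a count of annotation labels per model and prompt type. A minimal sketch of that tally, assuming annotations are available as `(model, prompt_type, label)` tuples (a hypothetical layout, as is the function name), could look like this:

```python
from collections import Counter, defaultdict


def tally_annotations(rows):
    """Count labels per (model, prompt_type) pair; each pair is one stacked bar."""
    bars = defaultdict(Counter)
    for model, prompt_type, label in rows:
        bars[(model, prompt_type)][label] += 1
    return bars


# Hypothetical annotations for one model and one prompt type.
toy_rows = (
    [("BLOOM(3B)", "neutral", "formal")] * 70
    + [("BLOOM(3B)", "neutral", "informal")] * 25
    + [("BLOOM(3B)", "neutral", "incohesive")] * 5
)
bars = tally_annotations(toy_rows)
```

Each counter then supplies the three segment heights of one bar, which can be passed to any plotting library.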
References
- Abid et al. (2021): Abubakar Abid, Maheen Farooqi, and James Zou. 2021. Persistent anti-muslim bias in large language models. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, AIES ’21, pages 298–306, New York, NY, USA. Association for Computing Machinery.
- Abu Sheikha and Inkpen (2010): Fadi Abu Sheikha and Diana Inkpen. 2010. Automatic classification of documents by formality. In Proceedings of the 6th International Conference on Natural Language Processing and Knowledge Engineering (NLPKE-2010), pages 1–5.
- Ackley et al. (1985): David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. 1985. A learning algorithm for boltzmann machines. Cognitive science, 9(1):147–169.
- Altaher et al. (2022): Yousef Altaher, Ali Fadel, Mazen Alotaibi, Mazen Alyazidi, Mishari Al-Mutairi, Mutlaq Aldhbuiub, Abdulrahman Mosaibah, Abdelrahman Rezk, Abdulrazzaq Alhendi, Mazen Shal, Emad Alghamdi, Maged Alshaibani, Jezia Zakraoui, Wafaa Mohammed, Kamel Gaanoun, Khalid Elmadani, Mustafa Ghaleb, Nouamane Tazi, Raed Alharbi, and Zaid Alyafeai. 2022. Masader plus: A new interface for exploring +500 arabic nlp datasets.
- Alyafeai et al. (2021): Zaid Alyafeai, Maraim Masoud, Mustafa Ghaleb, and Maged S. Al-shaibani. 2021. Masader: Metadata sourcing for arabic text and speech data resources.
- Badawi (1973): As-Said Muhámmad Badawi. 1973. Mustawayat al-arabiyya al-muasira fi Misr. Dar al-maarif.
- Barbieri et al. (2022): Francesco Barbieri, Luis Espinosa Anke, and Jose Camacho-Collados. 2022. Xlm-t: Multilingual language models in twitter for sentiment analysis and beyond. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 258–266.
- Barikeri et al. (2021): Soumya Barikeri, Anne Lauscher, Ivan Vulić, and Goran Glavaš. 2021. Reddit Bias: A real-world resource for bias evaluation and debiasing of conversational language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1941–1955, Online. Association for Computational Linguistics.
