Bias Considerations in Bilingual Natural Language Processing

By: Marie-Pier Schinck, Eunbee (Andrea) Jang and Julien-Charles Lévesque, Employment and Social Development Canada

1 – Introduction and study objective

Employment and Social Development Canada (ESDC) has leveraged natural language processing (NLP) in multiple projects in recent years and has identified challenges around working with data that is skewed in the proportion of each official language. Recent advances in NLP are predominantly focused on the English language, and there are limited resources for non-English languages. As such, when working on applied NLP solutions for ESDC, data scientists must make decisions when processing the French language while also dealing with limited resources and competing priorities.

Concerns around treatment of the French language were initially raised by the authors of this study and are based primarily on their experience as data scientists working with bilingual datasets at ESDC (see Official Languages in Natural Language Processing). In response, the authors consulted various federal data scientists and NLP researchers and found that these challenges were not limited to ESDC and were, in fact, common across different departments and agencies.

The main objective of this project is to explore this issue and gain transferable knowledge that data scientists can use to increase the equity of solutions provided by ESDC.

As a starting point, we measure the extent of language bias in four ESDC projects where multilingual classification systems were implemented. We also experiment with rebalancing strategies to gain insight into an ideal representation of the minority language. We compare model performance across several scenarios: multilingual models, separate unilingual models, and a translation-based cross-lingual approach in which French is translated to English so that a single English-only model can be trained. Through this, we can observe how much model performance improves or worsens for each language.

2 – Study setup

The four datasets used in this study come from past and ongoing ESDC projects involving supervised classification problems. The scope was limited to supervised classification to reflect the time and resources available for this project, the ease of access to the data, and the fact that classification is the most common NLP task solved by our team.

Table 1: Dataset characteristics.
Dataset | Description | Number of documents | Proportion of French data
T4 | Call summary notes written by Service Canada (SC) call center agents. These notes are generally short, incomplete sentences with administrative jargon. The goal of the project is to reduce costly human labour by automatically identifying the cases where a T4 form has been returned to individuals by SC. | 6 k | 35 %
HR | Responses of applicants to a pre-screening question in a hiring process. This research project was undertaken to assess the feasibility of using NLP to filter down the candidate pool of large-scale hiring processes. | 5 k | 6 %
ROE | Comments written by employers on the Record of Employment (ROE) forms received by SC. ROE comments are generally short, incomplete sentences with frequent use of employment insurance jargon. The project is designed to reduce the manual labour of SC employees by classifying ROE comments into different objectives. | 280 k | 28 %
PASRB | News articles from Canadian media sources, obtained through the NewsDesk platform. The task is to indicate whether an article should be flagged as a relevant source to include in a brief for deputy ministers. | 69 k | 25 %

2.1 – Vectorization and model architecture

For our experiment, we trained models for each dataset across several vectorization methods, model architectures and hyperparameter configurations (see Table 2). The chosen vectorization methods and models cover some of the most common tools used for NLP classification problems. The vectorization methods include a feature selection method applied to bag-of-words based on the chi-square distribution (Chi2BOW), FastText word embeddings (FT) and contextual embeddings from multilingual BERT (Devlin et al., 2018b).
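
As an illustration of the Chi2BOW approach, the sketch below assembles a bag-of-words count vectorizer, chi-square feature selection and a logistic regression classifier with scikit-learn. This is a minimal sketch rather than the study's actual code; the toy documents, labels and the value of k are purely illustrative.

```python
# Minimal sketch of a Chi2BOW + LR pipeline (illustrative only; the actual
# ESDC implementation details are not described in this article).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical training data: raw documents and their class labels.
texts = [
    "Le formulaire T4 a été retourné",
    "T4 form returned to sender",
    "Client called about employment insurance",
]
labels = [1, 1, 0]

chi2bow_lr = Pipeline([
    ("bow", CountVectorizer()),                  # bag-of-words counts
    ("chi2", SelectKBest(chi2, k=2)),            # keep the k features most associated with the labels
    ("clf", LogisticRegression(max_iter=1000)),  # linear classifier on the selected features
])

chi2bow_lr.fit(texts, labels)
print(chi2bow_lr.predict(["Another T4 form was returned"]))
```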

For the model setup, we have two main bodies of classification architectures, which we call contextual and non-contextual learning methods. By non-contextual, we refer to learning systems that process aggregate representations of sentences, discarding information about the order of words. These include logistic regression (LR), multi-layer perceptron (MLP) and XGBoost (XGB; Chen & Guestrin, 2016). Contextual approaches, on the other hand, take word-order information into account when learning and predicting. We implemented two contextual model architectures: a long short-term memory recurrent neural network (LSTM; Hochreiter & Schmidhuber, 1997) and a popular attention-based model called BERT (Devlin et al., 2018a). The details of the hyperparameter search are out of scope for this article; the results reported here correspond to the best configurations found through an exhaustive search. We also evaluated the LR, MLP and XGB methods with a simple bag-of-words embedding (without chi-square feature selection), though those results are omitted here to streamline the presentation. Feel free to contact the authors for the complete report.
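
To make the search setup concrete, here is a minimal sketch of a grid search over one of the non-contextual configurations (Chi2BOW + MLP) using scikit-learn's GridSearchCV. The parameter grid, the cross-validation settings and the train_texts / train_labels names are assumptions for illustration, not the grids actually searched in the study.

```python
# Illustrative sketch of an exhaustive hyperparameter search for the
# Chi2BOW + MLP configuration; grids and variable names are hypothetical.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("bow", CountVectorizer()),
    ("chi2", SelectKBest(chi2)),
    ("clf", MLPClassifier(max_iter=500)),
])

param_grid = {
    "chi2__k": [500, 1000, 2000],                     # number of features kept
    "clf__hidden_layer_sizes": [(64,), (128,), (256,)],
    "clf__alpha": [1e-4, 1e-3],                       # L2 regularisation strength
}

search = GridSearchCV(pipeline, param_grid, scoring="accuracy", cv=3, n_jobs=-1)
# search.fit(train_texts, train_labels)   # train_texts / train_labels: hypothetical training split
# print(search.best_params_, search.best_score_)
```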

Table 2: Vectorization and Model Setup
Vectorization | Model
Chi-square with bag-of-words (Chi2BOW) | Logistic Regression (LR)
Chi2BOW | Multi-layer Perceptron (MLP)
Chi2BOW | XGBoost (XGB)
FastText (FT) | LSTM
BERT (WordPiece) | BERT

3 – Presence of bias

In this study, we investigate language bias by looking at the disparity in test accuracy between the two official languages. The presence of bias is assessed in several different settings, discussed in more detail below.

Performance disparity in multilingual models

Our first experiment consisted of training multilingual models (i.e., training on both languages simultaneously with a single model) across the methods discussed in the previous section, using the language representation found in the original datasets (no rebalancing). To assess the presence of language bias, we then compared the performance achieved on the French portion of the data with that achieved on the English portion (Footnote 1). Figure 1 shows the test accuracy by language for the best hyperparameter configuration of each method tested.
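
In practice, this language-split evaluation amounts to scoring the same test set twice, once on each language subset. The sketch below illustrates one way to do this; the variable names, the 'fr'/'en' language tags and the assumption of a scikit-learn-style predict method are illustrative, not the study's actual code.

```python
# Minimal sketch of the language-split evaluation used throughout this study.
from sklearn.metrics import accuracy_score

def accuracy_by_language(model, test_texts, test_labels, test_langs):
    """Return overall accuracy plus accuracy on the French and English subsets."""
    preds = model.predict(test_texts)
    scores = {"overall": accuracy_score(test_labels, preds)}
    for lang in ("fr", "en"):
        idx = [i for i, l in enumerate(test_langs) if l == lang]
        scores[lang] = accuracy_score(
            [test_labels[i] for i in idx], [preds[i] for i in idx]
        )
    return scores

# Example usage (hypothetical objects):
# scores = accuracy_by_language(chi2bow_lr, test_texts, test_labels, test_langs)
# print(scores)  # e.g. {'overall': 0.91, 'fr': 0.90, 'en': 0.92}
```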

Figure 1: Test accuracy with regards to text language

Performance of all the methods listed in Table 2, split by dataset and by language. Detailed numbers below.
Dataset T4 HR ROE PASRB
Method / Language En Fr En Fr En Fr En Fr
BERT 97.6 97.2 78.4 73.2 91.7 91.2 86.3 87.0
C2BOW + LR 96.6 96.7 68.1 67.9 90.6 90.0 82.3 83.8
C2BOW + MLP 95.4 97.9 69.1 66.1 87.5 87.3 63.9 68.3
C2BOW + XGB 95.0 97.2 75.3 75.0 88.7 86.4 84.6 85.0
FT + LSTM 94.8 94.7 72.0 69.6 91.7 90.9 83.7 83.2

The first conclusion from Figure 1 is that no single trend holds across all four datasets. For instance, on the ROE dataset, results on the English portion of the data systematically outperform those on the French portion. The T4 dataset shows the opposite trend, with the French portion outperforming the English portion across most methods. This can be explained by the fact that T4 contains the highest proportion of French data, and by the content of the data itself: the business context supports the hypothesis of a different underlying distribution for each language, making the classification problem easier to solve in French than in English. The PASRB and HR datasets display less clear trends, with French slightly outperforming English on PASRB and the opposite on HR.

To get a more detailed picture, we compiled the differences in performance across those experiments and normalised them by calculating their z-scores (higher scores indicating better relative performance on English). This revealed that, on average, the multilingual models trained for this study performed slightly better on English than on French, by 0.13 standard deviations on the performance metric. Trends on individual datasets are slightly stronger, with average differences of 0.56 and 0.41 standard deviations on ROE and HR respectively, and –0.33 on T4. Despite this slight overall bias in favour of English, the main takeaway is the importance of a careful, language-specific performance analysis when using multilingual models, because the presence of bias will vary with dataset properties and the business context behind data collection.
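
The article does not spell out the exact normalisation procedure. The sketch below shows one plausible way to express English-minus-French accuracy gaps in standard-deviation units; the accuracy values are made up, so it is not intended to reproduce the figures reported above.

```python
# One plausible normalisation of per-method accuracy gaps (an assumption, not
# necessarily the authors' exact procedure): divide each English-minus-French
# gap by the spread of accuracies observed on that dataset.
import numpy as np

en_acc = np.array([91.5, 90.2, 87.8, 88.9, 91.0])  # hypothetical English accuracies, one per method
fr_acc = np.array([91.0, 89.8, 87.5, 86.7, 90.4])  # hypothetical French accuracies, one per method

spread = np.concatenate([en_acc, fr_acc]).std(ddof=1)  # spread of the metric on this dataset
normalised_gaps = (en_acc - fr_acc) / spread           # > 0 means English is favoured

print(normalised_gaps.round(2))
print("dataset average (in standard deviations):", normalised_gaps.mean().round(2))
```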

Influence of the language distribution in multilingual systems

In this section, we explore the impact of language proportion (i.e., the ratio of French to English data) in multilingual systems. We evaluate two methods that are commonly used for NLP classification tasks and that performed well on our benchmarks: BOW + XGBoost and BERT. We evaluate on the ROE dataset due to its larger size.

In the experiment, undersampling is applied to one of the languages to obtain a target ratio of French to English data ranging from 10:90 to 90:10. The testing data is kept intact with a 28:72 French to English ratio, in order to evaluate on the same samples every time.
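
A minimal sketch of this undersampling step is shown below, assuming the training data sits in a pandas DataFrame with a lang column identifying each document's language; the function shrinks whichever language is over-represented until French makes up the requested share of the training set. The names and column conventions are illustrative, not the study's code.

```python
# Illustrative undersampling to a target French-to-English training ratio.
import pandas as pd

def undersample_to_ratio(train_df: pd.DataFrame, fr_share: float, seed: int = 0) -> pd.DataFrame:
    """Undersample one language so that French makes up `fr_share` of the training set."""
    fr = train_df[train_df["lang"] == "fr"]
    en = train_df[train_df["lang"] == "en"]
    # Maximum size each language can keep while respecting the target ratio.
    max_fr = int(len(en) * fr_share / (1 - fr_share))
    max_en = int(len(fr) * (1 - fr_share) / fr_share)
    if len(fr) > max_fr:
        fr = fr.sample(n=max_fr, random_state=seed)      # French is over-represented
    else:
        en = en.sample(n=min(len(en), max_en), random_state=seed)  # English is over-represented
    return pd.concat([fr, en]).sample(frac=1, random_state=seed)   # shuffle the rebalanced set

# e.g. a 40:60 French-to-English training split; the test set is left untouched:
# train_40_60 = undersample_to_ratio(train_df, fr_share=0.40)
```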

Figure 2: Language ratio experiment on ROE data. Left: Bag-of-Words with an XGBoost classifier. Right: BERT model averaged across 3 repetitions per ratio.

The two graphs show the results of the language ratio experiments on the ROE dataset: the XGBoost model on the left and the BERT model (averaged across 3 repetitions per ratio) on the right. The x-axis shows the proportion of data as a series of French-to-English ratios, from 10:90 to 90:10. The y-axis is the accuracy score in percentage. The gray dashed line represents the overall accuracy of each model, and the solid coloured lines show the performance on each language separately; red represents the French portion of the data and blue the English portion. According to these figures, increasing the proportion of one language in training improves performance on that language, with the opposite trend also observed. Furthermore, the ratio at which French and English performance is closest differs from the original ratio (28:72) of the dataset: a 50:50 ratio for the XGBoost model (on the left) and a 40:60 ratio for the BERT model.

Figure 2 illustrates the performance of these two models with the data ratio splits described above. As expected, decreasing the proportion of data in a given language tends to reduce performance on that language, and increasing it tends to improve performance, although performance sometimes remains stable across several ratios. The overall accuracy curve systematically lies closer to the English accuracy curve because it is calculated on a test set that keeps the fixed French-to-English ratio of the original dataset (28:72).

The experiments also show that the optimal language ratio varies between learning methods. More specifically, with BOW + XGBoost, the French and English accuracy scores have the lowest discrepancy at a 50:50 (fr:en) ratio. With BERT, the two languages have the lowest discrepancy in accuracy at the 30:70 and 40:60 ratios. This is especially interesting given that these optimal ratios differ from the original ratio in the dataset, in this case 28:72.

This experiment indicates that artificially manipulating the language proportion may intensify or mitigate bias. It is advisable to have a somewhat balanced language proportion at training time to reduce the disparity between the performance on the two languages. However, there is a trade-off between the overall accuracy (accuracy on all samples) and the accuracy on French versus English texts: the point of optimal performance might not be the same for both criteria.

Trade-off between multilingual and unilingual modelling

In this section, we present an analysis of the performance disparity for each language when training one model on both languages, known as the multilingual setting (multi), and when training two models, one per language, known as the unilingual setting (uni). The results are presented separately for each language, with the French section also including the performance of an English-based unilingual system trained on French data translated to English (trans_uni). It should be noted that this model was trained only on the translated French data, rather than on the whole dataset (data originally in English plus translated French data), mostly due to constraints on computational resources. This experiment aims to understand to what extent the signal needed for classification remains intact after documents have been translated. For translation, we use the Marian neural machine translation model (Footnote 2).
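
As an illustration of the trans_uni preprocessing step, the sketch below translates French documents to English with a Marian checkpoint available through the Hugging Face transformers library. The specific checkpoint (Helsinki-NLP/opus-mt-fr-en) and the example text are our assumptions for illustration; the article only states that a Marian neural machine translation model was used.

```python
# Hedged sketch of the trans_uni preprocessing step: translate French documents
# to English before training/evaluating an English-only classifier.
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-fr-en"  # assumed checkpoint, not named in the article
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

french_docs = ["Le relevé d'emploi a été émis en raison d'un manque de travail."]  # hypothetical ROE-style comment
batch = tokenizer(french_docs, return_tensors="pt", padding=True, truncation=True)
translated = model.generate(**batch)
english_docs = tokenizer.batch_decode(translated, skip_special_tokens=True)
print(english_docs)  # the translated texts would then feed the English unilingual model
```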

English

The bar plot shown in Figure 3 displays the best performance for each method on the English portion of the data, with two bars for each, the left bar showing the multilingual setting and the right bar showing the unilingual setting.

In terms of model architecture, the BERT model performs best across all datasets. In contrast, the multi-layer perceptron with the chi-square BOW vectorization (C2BOW + MLP) is one of the worst performing configurations across all datasets.

Figure 3: Comparison of English performance (test accuracy) in two different settings – multilingual and unilingual.

Performance on English texts for all the methods listed in Table 2 trained in a unilingual versus multilingual setting, split by dataset. Detailed numbers below.
Dataset T4 HR ROE PASRB
Method / Mode Multi Uni Multi Uni Multi Uni Multi Uni
BERT 97.57 97.35 78.40 77.84 91.66 91.83 86.29 84.47
C2BOW + LR 96.63 96.56 68.14 72.87 90.56 90.58 82.34 82.12
C2BOW + MLP 95.41 95.90 69.09 69.79 87.52 87.95 63.92 68.07
C2BOW + XGB 95.01 96.69 75.30 77.61 88.70 89.30 84.65 84.83
FT + LSTM 94.82 93.65 71.97 77.37 91.69 91.37 83.70 83.30

For the T4, HR and PASRB datasets, BERT's performance is higher in the multilingual setting, while the unilingual setting slightly outperforms the multilingual one for the ROE dataset, which, interestingly, is the largest dataset. While multilingual training appears to perform better for BERT-based models, multilingual methods overall have slightly lower performance than their unilingual counterparts, on average 0.56% lower. We take this as evidence that the choice of training setting (multilingual versus unilingual) does not significantly impact performance on the dominant language, English.

French

Figure 4 presents the overall comparison of the three approaches to modelling French. More specifically, we compare performance on the French portion of the data in the multilingual setting (multi) against two unilingual approaches: one with models trained on the original French data (uni) and another where the French data is translated into English and fed into an English unilingual system (trans_uni). For trans_uni, we use only the French portion of the data for training, leaving out the original English data, in order to directly observe the impact of the translation approach on the minority language.

Figure 4: Comparison of French performance on three approaches – multilingual, unilingual, translated unilingual.

Performance on French texts for all the methods listed in Table 2 trained in the multilingual, unilingual and translated unilingual settings, split by dataset. Detailed numbers below.
Dataset T4 HR ROE PASRB
Method / Mode multi trans_uni uni multi trans_uni uni multi trans_uni uni multi trans_uni uni
BERT 97.18 96.34 96.34 73.21 88.00 90.00 91.23 90.99 92.45 87.05 81.63 86.10
C2BOW + LR 96.71 96.59 96.83 67.86 86.00 86.00 90.02 89.39 90.36 83.79 80.89 83.85
C2BOW + MLP 97.89 94.39 96.10 66.07 58.00 64.00 87.30 86.69 88.20 68.32 68.62 81.08
C2BOW + XGB 97.18 95.61 96.59 75.00 76.00 84.00 86.42 88.71 89.38 84.97 82.63 86.47
FT + LSTM 94.68 95.37 94.63 69.57 60.00 68.00 90.88 90.16 90.92 83.21 78.30 83.36

With regards to model architecture, and similar to what was observed for English, the best method tends to vary by dataset, with BERT performing best overall. BERT is the best method for HR, ROE and PASRB, with the unilingual setting outperforming both other settings in the first two cases (HR, ROE) and the multilingual setting performing best for PASRB. Interestingly, the Chi2BOW + MLP method outperforms all other methods on the T4 dataset while showing the worst results on the other three datasets.

As for the training schemes, we first notice that trans_uni is generally the worst performing setting on T4, ROE and PASRB. On HR, trans_uni is not always the worst approach, but it is never the best. It appears that errors from the two cascaded models (the neural machine translation model and the main classifier) propagate when they are used one after the other. This provides evidence that translation may not be an ideal option for mitigating the data imbalance issue. However, it should be noted that the scope of our experiment on the translation-based approach is limited to the French portion of the data, and the results may vary if the full data (English plus French translated into English) were used with an English unilingual system.

French unilingual models seem to outperform their multilingual counterparts for three of the four datasets: HR, ROE and PASRB. For the T4 dataset, we observe that the multilingual models outperform their unilingual versions for the majority of methods. Finally, when assessing the differences in accuracy on the French portion of the data between the unilingual and multilingual models, we see that unilingual models outperform multilingual models by 2.22 percentage points of accuracy on average. This difference is much more pronounced than what was observed for English, indicating that choosing a multilingual model over two unilingual ones will, on average, lead to a greater decline in performance on the French portion of the data than on the English portion.

4 – Conclusion

Making decisions regarding the handling of bilingual text data is commonplace for many data scientists working as federal public servants. While the status of official languages prescribes that there should be no difference in treatment between them, this can be particularly difficult to achieve when a greater quantity and quality of NLP tools are available for English than for French. This initiative aimed to gain applied and transferable knowledge to help the Government of Canada's (GoC) data scientists make more informed decisions when developing NLP solutions for bilingual datasets.

Our results first indicated that there is no trend that remains true across datasets when looking at bias in multilingual models. For instance, the ROE dataset showed a slight bias where performance on English comments is systematically higher than on French comments, whereas analysis on the T4 data revealed an opposite trend with bias favouring French. In short, although there is no definitive rule for bias emerging in multilingual models across all datasets, some models do have a tendency to underperform on one of the official languages, highlighting the need for proper language-specific assessment to avoid risks of biased treatment or disparate impact. The experiments on language proportions in the multilingual setting showed that aiming for a 30-50% representation of French through undersampling of the majority language leads to the best results. More specifically, it allows for a decreased disparity in performance between both official languages, without harming overall performance.

The exploration of the multilingual setting compared to the unilingual setting revealed that the impact on the performance of the English portion of the data was negligible, with both settings leading to similar results, although performance was slightly improved in the unilingual setting. On the other hand, the French portion of the data sees a more significant decrease in performance in the multilingual setting, compared to the unilingual one. This means that, when good quality language identification is available, data science practitioners across the GoC should seriously consider the use of two unilingual models as it tends to result in better performance on average, when compared to a single multilingual model. Finally, translating French to use a unilingual English model showed the least promise of the three settings across all datasets. Since it carries greater risk of bias on the minority language in our experiment, we recommend conducting a complete analysis of its impact when attempting to deploy a single unilingual model with the minority language translated.

Register for the Data Science Network's Meet the Data Scientist Presentation

If you have any questions about this article or would like to discuss this further, we invite you to our new Meet the Data Scientist presentation series where the author(s) will be presenting this topic to DSN readers and members.

Tuesday, June 21
2:00 to 3:00 p.m. EDT
MS Teams – link will be provided to the registrants by email

Register for the Data Science Network's Meet the Data Scientist Presentation. We hope to see you there!

References

Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785–794).

Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018a). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018b). Multilingual BERT. GitHub: google-research / bert

Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
