Comparing Pre-Training Schemes for Luxembourgish BERT Models
Lothritz, Cedric; Ezzini, Saad; Purschke, Christoph et al.
In Proceedings of the 19th Conference on Natural Language Processing (KONVENS 2023), September 2023.
Despite the widespread use of pre-trained models in NLP, well-performing pre-trained models for low-resource languages are scarce. To address this issue, we propose two novel BERT models for the Luxembourgish language that improve on the state of the art. We also present an empirical study on both the performance and robustness of the investigated BERT models. We compare the models on a set of downstream NLP tasks and evaluate their robustness against different types of data perturbations. Additionally, we provide novel datasets to evaluate the performance of Luxembourgish language models. Our findings reveal that pre-training a pre-loaded model has a positive effect on both the performance and robustness of fine-tuned models and that using the German GottBERT model yields a higher performance while the multilingual mBERT results in a more robust model. This study provides valuable insights for researchers and practitioners working with low-resource languages and highlights the importance of considering pre-training strategies when building language models.

Towards Refined Classifications Driven by SHAP Explanations
Arslan, Yusuf; Lebichot, Bertrand; Allix, Kevin et al.
In Holzinger, Andreas; Kieseberg, Peter; Tjoa, A. Min et al. (Eds.): Machine Learning and Knowledge Extraction, 11 August 2022.
Machine Learning (ML) models are inherently approximate; as a result, the predictions of an ML model can be wrong. In applications where errors can jeopardize a company's reputation, human experts often have to manually check the alarms raised by the ML models, as wrong or delayed decisions can have a significant business impact. These experts often use interpretable ML tools for the verification of predictions. However, post-prediction verification is also costly. In this paper, we hypothesize that the outputs of interpretable ML tools, such as SHAP explanations, can be exploited by machine learning techniques to improve classifier performance. By doing so, the cost of the post-prediction analysis can be reduced. To confirm our intuition, we conduct several experiments where we use SHAP explanations directly as new features. In particular, by considering nine datasets, we first compare the performance of these "SHAP features" against traditional "base features" on binary classification tasks. Then, we add a second-step classifier relying on SHAP features, with the goal of reducing false-positive and false-negative results of typical classifiers. We show that SHAP explanations used as SHAP features can help to improve classification performance, especially for false-negative reduction.

LuxemBERT: Simple and Practical Data Augmentation in Language Model Pre-Training for Luxembourgish
Lothritz, Cedric; Lebichot, Bertrand; Allix, Kevin et al.
In Proceedings of the Language Resources and Evaluation Conference 2022, June 2022.
Pre-trained Language Models such as BERT have become ubiquitous in NLP, where they have achieved state-of-the-art performance in most NLP tasks.
While these models are readily available for English and other widely spoken languages, they remain scarce for low-resource languages such as Luxembourgish. In this paper, we present LuxemBERT, a BERT model for the Luxembourgish language that we create using the following approach: we augment the pre-training dataset by considering text data from a closely related language that we partially translate using a simple and straightforward method. We are then able to produce the LuxemBERT model, which we show to be effective for various NLP tasks: it outperforms a simple baseline built with the available Luxembourgish text data as well as the multilingual mBERT model, which is currently the only option for transformer-based language models in Luxembourgish. Furthermore, we present datasets for various downstream NLP tasks that we created for this study and will make available to researchers on request.

On the Suitability of SHAP Explanations for Refining Classifications
Arslan, Yusuf; Lebichot, Bertrand; Allix, Kevin et al.
In Proceedings of the 14th International Conference on Agents and Artificial Intelligence (ICAART 2022), February 2022.
In industrial contexts, when an ML model classifies a sample as positive, it raises an alarm, which is subsequently sent to human analysts for verification. Reducing the number of false alarms upstream in an ML pipeline is paramount to reducing the workload of experts while increasing customers' trust. Increasingly, SHAP Explanations are leveraged to facilitate manual analysis. Because they have been shown to be useful to human analysts in the detection of false positives, we postulate that SHAP Explanations may provide a means to automate false-positive reduction.
To confirm our intuition, we evaluate clustering and rule-detection metrics against ground-truth labels to understand the utility of SHAP Explanations in discriminating false positives from true positives. We show that SHAP Explanations are indeed relevant in discriminating samples and are a relevant candidate for automating ML tasks and helping to detect and reduce false-positive results.

A Comparison of Pre-Trained Language Models for Multi-Class Text Classification in the Financial Domain
Arslan, Yusuf; Allix, Kevin; Veiber, Lisa et al.
In Companion Proceedings of the Web Conference 2021 (WWW '21 Companion), April 19-23, 2021, Ljubljana, Slovenia.

Search-based adversarial testing and improvement of constrained credit scoring systems
Ghamizi, Salah; Cordy, Maxime; Gubri, Martin et al.
In ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE '20), November 8-13, 2020.
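The "SHAP features" idea running through the two SHAP-related abstracts above can be sketched in a few lines. Everything below is an illustrative assumption rather than the papers' actual setup: the papers use the SHAP library over nine datasets, whereas this sketch uses a synthetic dataset and exploits the known closed form of SHAP values for a linear model (contribution of feature j = w_j * (x_j - E[x_j])) to avoid an external dependency.

```python
# Sketch: use per-feature explanation values as inputs to a second-step
# classifier. Dataset, models, and the linear-SHAP shortcut are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# First-step ("base") classifier trained on the raw features.
base = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

def shap_features(model, X_ref, X):
    # For a linear model with independent features, the SHAP value of
    # feature j is w_j * (x_j - mean(x_j)); no SHAP library needed here.
    return model.coef_.ravel() * (X - X_ref.mean(axis=0))

# Second-step classifier trained on the explanation values themselves.
second = LogisticRegression(max_iter=1000).fit(
    shap_features(base, X_tr, X_tr), y_tr
)

acc_base = base.score(X_te, y_te)
acc_second = second.score(shap_features(base, X_tr, X_te), y_te)
print(f"base={acc_base:.3f}  shap-features={acc_second:.3f}")
```

With a linear base model the explanation features are just an affine rescaling of the inputs, so the two accuracies come out close; the gains reported in the abstracts come from pairing SHAP values of more complex models with a separate second-step classifier.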
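The partial-translation augmentation described in the LuxemBERT abstract can be illustrated with a toy word-level substitution: text in a closely related language (here German) is only partially translated into the target language (Luxembourgish), and the result is added to the pre-training corpus. The lexicon and sentence below are illustrative inventions, not the paper's actual method or data.

```python
# Toy sketch of partial translation for pre-training data augmentation.
# Only words covered by the lexicon are translated; the remainder stays
# in the source language, which is tolerable for closely related languages.
LEXICON = {"ich": "ech", "nicht": "net", "und": "an", "ist": "ass"}

def partially_translate(sentence: str, lexicon: dict[str, str]) -> str:
    # Word-level substitution; real systems would handle casing,
    # punctuation, and multi-word expressions.
    return " ".join(lexicon.get(w, w) for w in sentence.lower().split())

corpus = ["Ich bin müde und das ist nicht gut"]
augmented = [partially_translate(s, LEXICON) for s in corpus]
print(augmented[0])  # → "ech bin müde an das ass net gut"
```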