On Evaluating Adversarial Robustness of Chest X-ray Classification: Pitfalls and Best Practices
Ghamizi, Salah ; Cordy, Maxime ; Papadakis, Mike et al
in The Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI-23) - SafeAI Workshop, Washington, D.C., Feb 13-14, 2023 (2022)

Evasion Attack STeganography: Turning Vulnerability Of Machine Learning To Adversarial Attacks Into A Real-world Application
Ghamizi, Salah ; Cordy, Maxime ; Papadakis, Mike et al
in Proceedings of International Conference on Computer Vision 2021 (2021)
Evasion Attacks have been commonly seen as a weakness of Deep Neural Networks. In this paper, we flip the paradigm and envision this vulnerability as a useful application. We propose EAST, a new steganography and watermarking technique based on multi-label targeted evasion attacks. Our results confirm that our embedding is elusive: it passes unnoticed by humans, steganalysis methods, and machine-learning detectors. In addition, our embedding is resilient to soft and aggressive image tampering (87% recovery rate under JPEG compression). EAST outperforms existing deep-learning-based steganography approaches with images that are 70% denser and 73% more robust, and supports multiple datasets and architectures.

Requirements And Threat Models of Adversarial Attacks and Robustness of Chest X-ray classification
Ghamizi, Salah ; Cordy, Maxime ; Papadakis, Mike et al
E-print/Working paper (2021)
Vulnerability to adversarial attacks is a well-known weakness of Deep Neural Networks. While most studies focus on natural images with standardized benchmarks like ImageNet and CIFAR, little research has considered real-world applications, in particular in the medical domain. Our research shows that, contrary to previous claims, robustness of chest x-ray classification is much harder to evaluate and leads to very different assessments depending on the dataset, the architecture, and the robustness metric. We argue that previous studies did not take into account the peculiarities of medical diagnosis, such as the co-occurrence of diseases, the disagreement of labellers (domain experts), the threat model of the attacks, and the risk implications of each successful attack. In this paper, we discuss the methodological foundations, review the pitfalls and best practices, and suggest new methodological considerations for evaluating the robustness of chest x-ray classification models. Our evaluation on 3 datasets, 7 models, and 18 diseases is the largest evaluation of robustness of chest x-ray classification models. We believe our findings will provide reliable guidelines for realistic evaluation and improvement of the robustness of machine learning models for medical diagnosis.

A Replication Study on the Usability of Code Vocabulary in Predicting Flaky Tests
Haben, Guillaume ; Habchi, Sarra ; Papadakis, Mike et al
in 18th International Conference on Mining Software Repositories (2021, May)
Industrial reports indicate that flaky tests are one of the primary concerns of software testing, mainly due to the false signals they provide.
To deal with this issue, researchers have developed tools and techniques aiming at (automatically) identifying flaky tests, with encouraging results. However, to reach industrial adoption and practice, these techniques need to be replicated and evaluated extensively on multiple datasets, occasions, and settings. In view of this, we perform a replication study of a recently proposed method that predicts flaky tests based on their vocabulary. We replicate the original study along three different dimensions. First, we replicate the approach on the same subjects as in the original study but using a different evaluation methodology, i.e., we adopt a time-sensitive selection of training and test sets to better reflect the envisioned use case. Second, we consolidate the findings of the initial study by building a new dataset of 837 flaky tests from 9 projects in a different programming language (Python, whereas the original study used Java), which supports the generalisability of the results. Third, we propose an extension to the original approach by experimenting with different features extracted from the Code Under Test. Our results demonstrate that a more robust validation has a consistent negative impact on the reported results of the original study but, fortunately, this does not invalidate the key conclusions of the study.
We also find reassuring results: the vocabulary-based models can also be used to predict test flakiness in Python, and the information lying in the Code Under Test has a limited impact on the performance of the vocabulary-based models.

Test Selection for Deep Learning Systems
Ma, Wei ; Papadakis, Mike ; et al
in ACM Transactions on Software Engineering and Methodology (2021), 30(2), 131--1322

Statistical model checking for variability-intensive systems: applications to bug detection and minimization
Cordy, Maxime ; Lazreg, Sami ; Papadakis, Mike et al
in Formal Aspects of Computing (2021), 33(6), 1147--1172

Towards Exploring the Limitations of Active Learning: An Empirical Study
Hu, Qiang ; Guo, Yuejun ; Cordy, Maxime et al
in The 36th IEEE/ACM International Conference on Automated Software Engineering (2021)
Deep neural networks (DNNs) are being increasingly deployed as integral parts of software systems. However, due to the complex interconnections among hidden layers and the massive number of hyperparameters, DNNs must be trained using a large number of labeled inputs, which calls for extensive human effort for collecting and labeling data. To alleviate this growing demand, a surge of state-of-the-art studies has proposed different metrics to select a small yet informative dataset for model training. These research works have demonstrated that DNN models can achieve competitive performance using a carefully selected small set of data. However, the literature lacks a proper investigation of the limitations of data selection metrics, which is crucial for applying them in practice.
In this paper, we fill this gap and conduct an extensive empirical study to explore the limits of selection metrics. Our study involves 15 selection metrics evaluated over 5 datasets (2 image classification tasks and 3 text classification tasks), 10 DNN architectures, and 20 labeling budgets (ratio of training data being labeled). Our findings reveal that, while selection metrics are usually effective in producing accurate models, they may induce a loss of model robustness (against adversarial examples) and resilience to compression. Overall, we demonstrate the existence of a trade-off between labeling effort and different model qualities. This paves the way for future research in devising selection metrics that consider multiple quality criteria.

CONFUZZION: A Java Virtual Machine Fuzzer for Type Confusion Vulnerabilities
; Khanfir, Ahmed ; et al
in IEEE International Conference on Software Quality, Reliability, and Security (QRS), 2021 (2021)

MuDelta: Delta-Oriented Mutation Testing at Commit Time
Ma, Wei ; ; Papadakis, Mike et al
in International Conference on Software Engineering (ICSE) (2021)

Killing Stubborn Mutants with Symbolic Execution
Titcheu Chekam, Thierry ; Papadakis, Mike ; Cordy, Maxime et al
in ACM Transactions on Software Engineering and Methodology (2021), 30(2), 191--1923

Data-driven simulation and optimization for covid-19 exit strategies
Ghamizi, Salah ; Rwemalika, Renaud ; Cordy, Maxime et al
in Ghamizi, Salah; Rwemalika, Renaud; Cordy, Maxime (Eds.) et al Data-driven simulation and optimization for covid-19 exit strategies (2020, August)
The rapid spread of the Coronavirus SARS-2 is a major challenge that led almost all governments worldwide to take drastic measures to respond to the tragedy. Chief among those measures is the massive lockdown of entire countries and cities, which beyond its global economic impact has created some deep social and psychological tensions within populations. While the adopted mitigation measures (including the lockdown) have generally proven useful, policymakers are now facing a critical question: how and when to lift the mitigation measures? A carefully-planned exit strategy is indeed necessary to recover from the pandemic without risking a new outbreak. Classically, exit strategies rely on mathematical modeling to predict the effect of public health interventions. Such models are unfortunately known to be sensitive to some key parameters, which are usually set based on rules-of-thumb. In this paper, we propose to augment epidemiological forecasting with actual data-driven models that will learn to fine-tune predictions for different contexts (e.g., per country). We have therefore built a pandemic simulation and forecasting toolkit that combines a deep learning estimation of the epidemiological parameters of the disease in order to predict the cases and deaths, and a genetic algorithm component searching for optimal trade-offs/policies between constraints and objectives set by decision-makers. Replaying pandemic evolution in various countries, we experimentally show that our approach yields predictions with much lower error rates than pure epidemiological models in 75% of the cases and achieves a 95% R² score when the learning is transferred and tested on unseen countries. When used for forecasting, this approach provides actionable insights into the impact of individual measures and strategies.

Statistical Model Checking for Variability-Intensive Systems
Cordy, Maxime ; Papadakis, Mike ;
in FUNDAMENTAL APPROACHES TO SOFTWARE ENGINEERING, Dublin 22-25 April 2020 (2020, April)

Search-based adversarial testing and improvement of constrained credit scoring systems
Ghamizi, Salah ; Cordy, Maxime ; Gubri, Martin et al
in ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE '20), November 8-13, 2020 (2020)

Pandemic Simulation and Forecasting of exit strategies: Convergence of Machine Learning and Epidemiological Models
Ghamizi, Salah ; Rwemalika, Renaud ; Cordy, Maxime et al
Report (2020)
The COVID-19 pandemic has created a public health emergency unprecedented in this century. The lack of accurate knowledge regarding the outcomes of the virus has made it challenging for policymakers to decide on appropriate countermeasures to mitigate its impact on society, in particular the public health and the very healthcare system. While the mitigation strategies (including the lockdown) are getting lifted, understanding the current impacts of the outbreak remains challenging. This impedes any analysis and scheduling of measures required for the different countries to recover from the pandemic without risking a new outbreak. Therefore, we propose a novel approach to build realistic data-driven pandemic simulation and forecasting models to support policymakers. Our models allow the investigation of mitigation/recovery measures and their impact.
Thereby, they enable appropriate planning of those measures, with the aim of optimizing their societal benefits. Our approach relies on a combination of machine learning and classical epidemiological models, circumventing the respective limitations of these techniques to allow policy-making based on established knowledge, yet driven by factual data, and tailored to each country’s specific context.

Automatic Testing and Improvement of Machine Translation
; ; et al
in International Conference on Software Engineering (ICSE) (2020)

Commit-Aware Mutation Testing
Ma, Wei ; ; Ojdanic, Milos et al
in IEEE International Conference on Software Maintenance and Evolution (ICSME) (2020)

FeatureNET: Diversity-driven Generation of Deep Learning Models
Ghamizi, Salah ; Cordy, Maxime ; Papadakis, Mike et al
in International Conference on Software Engineering (ICSE) (2020)

Muteria: An Extensible and Flexible Multi-Criteria Software Testing Framework
Titcheu Chekam, Thierry ; Papadakis, Mike ; Le Traon, Yves
in ACM/IEEE International Conference on Automation of Software Test (AST) 2020 (2020)
Program-based test adequacy criteria (TAC), such as statement coverage, branch coverage, and mutation, give objectives for software testing. Many techniques and tools have been developed to improve each phase of the TAC-based software testing process. Nonetheless, the engineering effort required to integrate these tools and techniques into the software testing process limits their use and creates an overhead for users.
This is especially the case for system testing with languages like C, where test cases are not always well structured in a framework. In response to these challenges, this paper presents Muteria, a TAC-based software testing framework. Muteria enables the integration of multiple software testing tools. Muteria abstracts each phase of the TAC-based software testing process to provide tool-driver interfaces for implementing tool drivers. Tool drivers enable Muteria to call the corresponding tools during the testing process. An initial set of drivers for the KLEE, Shadow, and SEMu test-generation tools, the Gcov and coverage.py code coverage tools, and the Mart mutant generation tool, for the C and Python programming languages, was implemented with an average of 345 lines of Python code. Moreover, the user configuration file required to measure code coverage and mutation score on a sample C program, using the Muteria framework, consists of fewer than 15 configuration variables. Users of the Muteria framework select, in a configuration file, the tools and TACs to measure. The Muteria framework uses the user configuration to run the testing process and report the outcome. Users interact with Muteria through its Application Programming Interface and Command Line Interface. Muteria can benefit researchers, as a laboratory for executing experiments, as well as software practitioners.

Selecting fault revealing mutants
Titcheu Chekam, Thierry ; Papadakis, Mike ; Bissyande, Tegawendé François D Assise et al
in Empirical Software Engineering (2020)

Adversarial Embedding: A robust and elusive Steganography and Watermarking technique
Ghamizi, Salah ; Cordy, Maxime ; Papadakis, Mike et al
Scientific Conference (2020)
We propose adversarial embedding, a new steganography and watermarking technique that embeds secret information within images. The key idea of our method is to use deep neural networks for image classification and adversarial attacks to embed secret information within images. Thus, we use the attacks to embed an encoding of the message within images and the related deep neural network outputs to extract it. The key properties of adversarial attacks (invisible perturbations, nontransferability, resilience to tampering) offer guarantees regarding the confidentiality and the integrity of the hidden messages. We empirically evaluate adversarial embedding using more than 100 models and 1,000 messages. Our results confirm that our embedding passes unnoticed by both humans and steganalysis methods, while at the same time impeding illicit retrieval of the message (less than 13% recovery rate when the interceptor has some knowledge about our model), and is resilient to soft and (to some extent) aggressive image tampering (up to 100% recovery rate under JPEG compression). We further develop our method by proposing a new type of adversarial attack which improves the embedding density (amount of hidden information) of our method to up to 10 bits per pixel.