Titcheu Chekam, Thierry, in Empirical Software Engineering (in press).

Hu, Qiang, in preprint (2023).

Akli, Amal, in FlakyCat: Predicting Flaky Tests Categories using Few-Shot Learning (2023, May).
Flaky tests are tests that yield different outcomes when run on the same version of a program. This non-deterministic behaviour plagues continuous integration with false signals, wasting developers' time and reducing their trust in test suites. Studies have highlighted the importance of keeping tests flakiness-free. Recently, the research community has been pushing towards the detection of flaky tests by suggesting many static and dynamic approaches. While promising, these approaches mainly focus on classifying tests as flaky or not and, even when high performance is reported, it remains challenging to understand the cause of flakiness, which is crucial for researchers and developers who aim to fix it. To help with the comprehension of a given flaky test, we propose FlakyCat, the first approach to classify flaky tests based on their root cause category. FlakyCat relies on CodeBERT for code representation and leverages Siamese networks to train a multi-class classifier. We train and evaluate FlakyCat on a set of 451 flaky tests collected from open-source Java projects. Our evaluation shows that FlakyCat categorises flaky tests accurately, with an F1 score of 73%. Furthermore, we investigate the performance of our approach for each category, revealing that Async waits, Unordered collections and Time-related flaky tests are accurately classified, while Concurrency-related flaky tests are more challenging to predict. Finally, to facilitate the comprehension of FlakyCat's predictions, we present a new technique for CodeBERT-based model interpretability that highlights code statements influencing the categorization.
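FlakyCat's implementation is not part of this record; as a rough illustration of the core idea described above (embedding test code with CodeBERT and assigning a flakiness category by similarity to a few labelled examples), the sketch below uses a nearest-centroid rule in the embedding space. The model name, the category labels and the decision rule are assumptions for illustration, not FlakyCat's actual few-shot Siamese architecture.

```python
# Hypothetical sketch: CodeBERT embeddings + nearest-centroid few-shot categorisation.
# Not FlakyCat's implementation, only an illustration of the idea.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
encoder = AutoModel.from_pretrained("microsoft/codebert-base")

def embed(test_code: str) -> torch.Tensor:
    """Mean-pooled CodeBERT embedding of a (truncated) test method body."""
    inputs = tokenizer(test_code, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state   # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)               # (768,)

def build_centroids(few_shot_examples: dict[str, list[str]]) -> dict[str, torch.Tensor]:
    """Maps a category (e.g. 'async wait') to the mean embedding of a few labelled tests."""
    return {cat: torch.stack([embed(t) for t in tests]).mean(dim=0)
            for cat, tests in few_shot_examples.items()}

def categorise(test_code: str, centroids: dict[str, torch.Tensor]) -> str:
    """Predict the flakiness category whose centroid is most cosine-similar."""
    e = embed(test_code)
    sims = {cat: torch.nn.functional.cosine_similarity(e, c, dim=0).item()
            for cat, c in centroids.items()}
    return max(sims, key=sims.get)
```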
Gubri, Martin, E-print/Working paper (2023).
Transferability is the property of adversarial examples to be misclassified by models other than the surrogate model for which they were crafted. Previous research has shown that transferability is substantially increased when the training of the surrogate model has been early stopped. A common hypothesis to explain this is that the later training epochs are when models learn the non-robust features that adversarial attacks exploit. Hence, an early-stopped model is more robust (hence, a better surrogate) than fully trained models. We demonstrate that the reasons why early stopping improves transferability lie in the side effects it has on the learning dynamics of the model. We first show that early stopping benefits transferability even on models learning from data with non-robust features. We then establish links between transferability and the exploration of the loss landscape in the parameter space, on which early stopping has an inherent effect. More precisely, we observe that transferability peaks when the learning rate decays, which is also the time at which the sharpness of the loss significantly drops. This leads us to propose RFN, a new approach that minimizes loss sharpness during training in order to maximize transferability. We show that by searching for large flat neighborhoods, RFN always improves over early stopping (by up to 47 points of transferability rate) and is competitive with (if not better than) strong state-of-the-art baselines.

Kubler, Sylvain, in Expert Systems with Applications (2023), 211.
Blockchain technologies, also known as Distributed Ledger Technologies (DLT), are increasingly being explored in many applications, especially in the presence of (potential) dis-/mis-/un-trust among organizations and individuals. Today, there exists a plethora of DLT platforms on the market, which makes it challenging for system designers to decide which platform they should adopt and implement. Although a few DLT comparison frameworks have been proposed in the literature, they often fail to cover all performance and functional aspects, and they rarely build upon standardized criteria and recommendations. Given this state of affairs, the present paper considers a recent and exhaustive set of assessment criteria recommended by the ITU (International Telecommunication Union). Those criteria (about fifty) are nonetheless mostly defined in a textual form, which may pose interpretation problems during the implementation process. To avoid this, a systematic literature review regarding each ITU criterion is conducted with a twofold objective: (i) to understand to what extent a given criterion is considered/evaluated by the literature; (ii) to come up with 'formal' metric definitions (i.e., on a mathematical or experimental ground) based, whenever possible, on the current literature. Following this formalization stage, a decision support tool called CREDO-DLT, which stands for "multiCRiteria-basEd ranking Of Distributed Ledger Technology platforms", is developed using AHP and TOPSIS and made publicly available to help decision-makers select the most suitable DLT platform alternative (i.e., the one that best suits their needs and requirements). A use case scenario in the context of energy communities is proposed to show the practicality of CREDO-DLT.
Highlights:
- Blockchain (DLT) standardization initiatives are reviewed.
- The extent to which ITU's DLT assessment criteria are covered in the literature is studied.
- Mathematical formalizations of the ITU recommendations are proposed.
- A decision support tool (CREDO-DLT) is designed for DLT platform selection.
- An energy community use case is developed to show the practicality of CREDO-DLT.
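The CREDO-DLT tool itself is not included in this record; as a sketch of the TOPSIS step it builds on, the snippet below ranks a handful of hypothetical DLT platforms against weighted criteria. The platform names, criteria, weights and the benefit/cost split are made up for illustration, and AHP-derived weights are assumed to be given.

```python
# Illustrative TOPSIS ranking of hypothetical DLT platforms (not CREDO-DLT itself).
import numpy as np

platforms = ["Platform A", "Platform B", "Platform C"]          # hypothetical alternatives
# Rows = platforms, columns = criteria, e.g. [throughput, latency, energy cost, decentralisation]
scores = np.array([[2000.0, 2.0, 30.0, 0.6],
                   [ 300.0, 0.5, 10.0, 0.9],
                   [5000.0, 5.0, 80.0, 0.4]])
weights = np.array([0.3, 0.2, 0.2, 0.3])                        # e.g. obtained from AHP
benefit = np.array([True, False, False, True])                  # True = higher is better

norm = scores / np.linalg.norm(scores, axis=0)                  # vector normalisation
weighted = norm * weights
ideal = np.where(benefit, weighted.max(axis=0), weighted.min(axis=0))
anti_ideal = np.where(benefit, weighted.min(axis=0), weighted.max(axis=0))
d_plus = np.linalg.norm(weighted - ideal, axis=1)               # distance to ideal solution
d_minus = np.linalg.norm(weighted - anti_ideal, axis=1)         # distance to anti-ideal
closeness = d_minus / (d_plus + d_minus)                        # TOPSIS closeness coefficient

for name, c in sorted(zip(platforms, closeness), key=lambda x: -x[1]):
    print(f"{name}: {c:.3f}")
```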
Dyrmishi, Salijona, in Conference Proceedings 2023 IEEE Symposium on Security and Privacy (SP) (2023).
While the literature on security attacks and defense of Machine Learning (ML) systems mostly focuses on unrealistic adversarial examples, recent research has raised concern about the under-explored field of realistic adversarial attacks and their implications on the robustness of real-world systems. Our paper paves the way for a better understanding of adversarial robustness against realistic attacks and makes two major contributions. First, we conduct a study on three real-world use cases (text classification, botnet detection, malware detection) and five datasets in order to evaluate whether unrealistic adversarial examples can be used to protect models against realistic examples. Our results reveal discrepancies across the use cases, where unrealistic examples can either be as effective as the realistic ones or may offer only limited improvement. Second, to explain these results, we analyze the latent representation of the adversarial examples generated with realistic and unrealistic attacks. We shed light on the patterns that discriminate which unrealistic examples can be used for effective hardening. We release our code, datasets and models to support future research in exploring how to reduce the gap between unrealistic and realistic adversarial attacks.

Ojdanic, Milos, E-print/Working paper (2023).
Fault seeding is typically used in empirical studies to evaluate and compare test techniques. Central to these techniques lies the hypothesis that artificially seeded faults involve some form of realistic properties and thus provide realistic experimental results. In an attempt to strengthen realism, a recent line of research uses machine learning techniques, such as deep learning and Natural Language Processing, to seed faults that look like (syntactically) real ones, implying that fault realism is related to syntactic similarity. This raises the question of whether seeding syntactically similar faults indeed results in semantically similar faults and, more generally, whether syntactically dissimilar faults are far away (semantically) from the real ones. We answer this question by employing four state-of-the-art fault-seeding techniques (PiTest, a popular mutation testing tool; IBIR, a tool with manually crafted fault patterns; DeepMutation, a learning-based fault-seeding framework; and μBERT, a mutation testing tool based on the pre-trained language model CodeBERT) that operate in fundamentally different ways, and demonstrate that syntactic similarity does not reflect semantic similarity. We also show that 65.11%, 76.44%, 61.39% and 9.76% of the real faults of Defects4J V2 are semantically resembled by PiTest, IBIR, μBERT and DeepMutation faults, respectively.
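The semantic-similarity measure used in the study is not described in this record; a common proxy in mutation testing work is to compare the sets of tests that detect a seeded fault and a real fault, for instance with the Ochiai coefficient. The sketch below illustrates that proxy only; the toy data and the threshold are hypothetical, not the study's actual protocol.

```python
# Hypothetical sketch: judging whether a seeded fault "semantically resembles" a real fault
# by comparing the sets of tests that detect each of them (Ochiai coefficient).
from math import sqrt

def ochiai(failing_tests_a: set[str], failing_tests_b: set[str]) -> float:
    """Ochiai similarity between two faults, based on the tests that detect them."""
    if not failing_tests_a or not failing_tests_b:
        return 0.0
    shared = len(failing_tests_a & failing_tests_b)
    return shared / sqrt(len(failing_tests_a) * len(failing_tests_b))

# Toy data: which tests fail on the real fault vs. on a seeded mutant.
real_fault_detected_by = {"testParseDate", "testParseDateTime", "testLocale"}
seeded_fault_detected_by = {"testParseDate", "testParseDateTime"}

similarity = ochiai(real_fault_detected_by, seeded_fault_detected_by)
print(f"semantic similarity proxy: {similarity:.2f}")
# A study might count the real fault as "semantically resembled" above some threshold,
# e.g. similarity >= 0.8 (threshold chosen here for illustration only).
```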
Ojdanic, Milos, in On Comparing Mutation Testing Tools through Learning-based Mutant Selection (2023).
Recently, many mutation testing tools have been proposed that rely on bug-fix patterns and natural language models trained on large code corpora. As these tools operate fundamentally differently from the grammar-based traditional approaches, a question arises of how they compare in terms of (1) fault detection and (2) cost-effectiveness. Simultaneously, mutation testing research proposes mutant selection approaches based on machine learning to mitigate its application cost. This raises another question: how do the existing mutation testing tools compare when guided by mutant selection approaches? To answer these questions, we compare four existing tools, μBERT (uses a pre-trained language model for fault seeding), IBIR (relies on inverted fix-patterns), DeepMutation (generates mutants by employing Neural Machine Translation) and PIT (applies standard grammar-based rules), in terms of fault detection capability and cost-effectiveness, in conjunction with standard and deep-learning-based mutant selection strategies. Our results show that IBIR has the highest fault detection capability among the four tools; however, it is not the most cost-effective when considering different selection strategies. On the other hand, μBERT, having a relatively lower fault detection capability, is the most cost-effective among the four tools. Our results also indicate that comparing mutation testing tools when using deep-learning-based mutant selection strategies can lead to different conclusions than the standard mutant selection. For instance, our results demonstrate that combining μBERT with deep-learning-based mutant selection yields 12% higher fault detection than the considered tools.
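The study's measurement protocol is not detailed in this record; as a rough illustration of the kind of comparison involved, the toy harness below reports, for a given mutant budget, how many known faults are revealed by tests that kill the selected mutants, under a random or a score-guided selection. The data layout and both strategies are assumptions for illustration only.

```python
# Toy harness: fault detection achieved by a mutation testing tool under a mutant budget.
# Purely illustrative; not the study's actual measurement protocol.
import random

def faults_detected(selected_mutants, mutant_to_faults):
    """Union of real faults revealed by tests that kill the selected mutants."""
    detected = set()
    for m in selected_mutants:
        detected |= mutant_to_faults.get(m, set())
    return detected

def detection_rate(mutant_to_faults, all_faults, budget, scores=None, seed=0):
    """Select `budget` mutants (by descending score, or at random) and report fault detection."""
    mutants = list(mutant_to_faults)
    if scores is None:                                  # baseline: random selection
        random.Random(seed).shuffle(mutants)
    else:                                               # e.g. a learned utility score per mutant
        mutants.sort(key=lambda m: scores.get(m, 0.0), reverse=True)
    selected = mutants[:budget]
    return len(faults_detected(selected, mutant_to_faults)) / len(all_faults)

# Toy data: three mutants, two known faults.
mutant_to_faults = {"m1": {"f1"}, "m2": set(), "m3": {"f2"}}
print(detection_rate(mutant_to_faults, {"f1", "f2"}, budget=2))
print(detection_rate(mutant_to_faults, {"f1", "f2"}, budget=2,
                     scores={"m1": 0.9, "m3": 0.8, "m2": 0.1}))
```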
Khanfir, Ahmed, in 22nd IEEE International Conference on Software Quality, Reliability and Security (QRS'22) (2022, December 05).
Much of recent software-engineering research has investigated the naturalness of code, the fact that code, in small code snippets, is repetitive and can be predicted using statistical language models such as n-gram models. Although powerful, training such models on a large code corpus can be tedious, time-consuming and sensitive to the code patterns (and practices) encountered during training. Consequently, these models are often trained on a small corpus and thus only estimate the language naturalness relative to a specific style of programming or type of project. To overcome these issues, we investigate the use of pre-trained generative language models to infer code naturalness. Pre-trained models are often built on big data, are easy to use in an out-of-the-box way and include powerful learning association mechanisms. Our key idea is to quantify code naturalness through its predictability, by using state-of-the-art generative pre-trained language models. Thus, we suggest inferring naturalness by masking (omitting) code tokens, one at a time, of code sequences, and checking the models' ability to predict them. We explore three different predictability metrics: (a) measuring the number of exact matches of the predictions, (b) computing the embedding similarity between the original and predicted code, i.e., similarity at the vector space, and (c) computing the confidence of the model when doing the token completion task, regardless of the outcome. We implement this workflow, named CODEBERT-NT, and evaluate its capability to prioritize buggy lines over non-buggy ones when ranking code based on its naturalness. Our results, on 2,510 buggy versions of 40 projects from the SmartShark dataset, show that CODEBERT-NT outperforms both random-uniform and complexity-based ranking techniques, and yields comparable results to the n-gram models.
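CODEBERT-NT's implementation is not reproduced here; the sketch below illustrates predictability metric (a) described above, masking one token at a time with a masked language model and counting exact-match predictions. It assumes the publicly available CodeBERT masked-language-model checkpoint on the Hugging Face hub, and the scoring loop is a simplification of the actual workflow.

```python
# Illustrative sketch of metric (a): exact-match predictability of masked code tokens.
# Simplified; not the actual CODEBERT-NT implementation.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

MODEL = "microsoft/codebert-base-mlm"    # assumed MLM checkpoint of CodeBERT
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForMaskedLM.from_pretrained(MODEL)
model.eval()

def naturalness(code_line: str) -> float:
    """Fraction of tokens the model predicts exactly when each is masked in turn."""
    ids = tokenizer(code_line, return_tensors="pt", truncation=True, max_length=128)["input_ids"][0]
    total = len(ids) - 2                  # skip the special begin/end tokens
    if total <= 0:
        return 0.0
    matches = 0
    for pos in range(1, len(ids) - 1):
        masked = ids.clone()
        original = masked[pos].item()
        masked[pos] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, pos]
        if logits.argmax().item() == original:
            matches += 1
    return matches / total

# Lines with lower naturalness would be ranked as more likely to be buggy.
print(naturalness("for (int i = 0; i < n; i++) {"))
```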
Bernier, Fabien, in Has it Trained Yet? Workshop at the Conference on Neural Information Processing Systems (2022, December 02).
Energy demand forecasting is one of the most challenging tasks for grid operators. Many approaches have been suggested over the years to tackle it. Yet, these still remain too expensive to train in terms of both time and computational resources, hindering their adoption as customer behaviors are continuously evolving. We introduce Transplit, a new lightweight transformer-based model, which significantly decreases this cost by exploiting the seasonality property and learning typical days of power demand. We show that Transplit can be run efficiently on CPU and is several hundred times faster than state-of-the-art predictive models, while performing as well.

Lorentz, Joe, in Software and Systems Modeling (2022).
Models based on differential programming, like deep neural networks, are well established in research and able to outperform manually coded counterparts in many applications. Today, there is a rising interest in introducing this flexible modeling to solve real-world problems. A major challenge when moving from research to application is the strict constraints on computational resources (memory and time). It is difficult to determine and contain the resource requirements of differential models, especially during the early training and hyperparameter exploration stages. In this article, we address this challenge by introducing CalcGraph, a model abstraction of differentiable programming layers. CalcGraph allows modeling the computational resources that should be used, and CalcGraph's model interpreter can then automatically schedule the execution respecting the specifications made. We propose a novel way to efficiently switch models from storage to preallocated memory zones and vice versa to maximize the number of model executions given the available resources. We demonstrate the efficiency of our approach by showing that it consumes fewer resources than state-of-the-art frameworks like TensorFlow and PyTorch for single-model and multi-model execution.

Haben, Guillaume, in What Made This Test Flake? Pinpointing Classes Responsible for Test Flakiness (2022, October).
Flaky tests are defined as tests that manifest non-deterministic behaviour by passing and failing intermittently for the same version of the code. These tests cripple continuous integration with false alerts that waste developers' time and break their trust in regression testing. To mitigate the effects of flakiness, both researchers and industrial experts proposed strategies and tools to detect and isolate flaky tests. However, flaky tests are rarely fixed, as developers struggle to localise and understand their causes. Additionally, developers working with large codebases often need to know the sources of non-determinism to preserve code quality, i.e., to avoid introducing technical debt linked with non-deterministic behaviour, and to avoid introducing new flaky tests. To aid with these tasks, we propose re-targeting Fault Localisation techniques to the flaky component localisation problem, i.e., pinpointing program classes that cause the non-deterministic behaviour of flaky tests. In particular, we employ Spectrum-Based Fault Localisation (SBFL), a coverage-based fault localisation technique commonly adopted for its simplicity and effectiveness. We also utilise other data sources, such as change history and static code metrics, to further improve the localisation. Our results show that augmenting SBFL with change and code metrics ranks flaky classes in the top-1 and top-5 suggestions in 26% and 47% of the cases. Overall, we successfully reduced the average number of classes inspected to locate the first flaky class to 19% of the total number of classes covered by flaky tests. Our results also show that localisation methods are effective in major flakiness categories, such as concurrency and asynchronous waits, indicating their general ability to identify flaky components.
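The paper's full pipeline is not included in this record; the sketch below shows the SBFL core it builds on, computing a class-level suspiciousness score from coverage of passing and flaky-failing test runs, using Ochiai as one commonly used formula. The toy spectra are invented, and the augmentation with change history and code metrics is omitted.

```python
# Sketch of class-level Spectrum-Based Fault Localisation with the Ochiai formula.
# Illustration only; the paper additionally combines SBFL with change and code metrics.
from math import sqrt

def ochiai_suspiciousness(covered_in_failing, covered_in_passing, total_failing):
    """covered_in_failing/passing: number of failing/passing runs covering the class."""
    ef, ep = covered_in_failing, covered_in_passing
    nf = total_failing - ef                  # failing runs that do not cover the class
    denom = sqrt((ef + nf) * (ef + ep))
    return ef / denom if denom else 0.0

# Toy spectra: per class, how many flaky-failing and passing runs covered it.
spectra = {
    "com.app.AsyncCache": {"fail": 9, "pass": 3},
    "com.app.DateUtils":  {"fail": 2, "pass": 40},
    "com.app.HttpClient": {"fail": 9, "pass": 25},
}
total_failing_runs = 10

ranking = sorted(
    ((ochiai_suspiciousness(s["fail"], s["pass"], total_failing_runs), cls)
     for cls, s in spectra.items()),
    reverse=True)
for score, cls in ranking:
    print(f"{score:.3f}  {cls}")
```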
Souani, Badr, in Jianying, Zhou (Ed.), Applied Cryptography and Network Security Workshops (2022, September 24).
In this paper, we propose two empirical studies to (1) detect Android malware and (2) classify Android malware into families. We first (1) reproduce the results of MalBERT using BERT models learning from Android applications' manifests obtained from 265k applications (vs. 22k for MalBERT) from the AndroZoo dataset in order to detect malware. The results of the MalBERT paper are excellent and hard to believe, as a manifest only roughly represents an application; we therefore try to answer the following questions in this paper. Are the experiments from MalBERT reproducible? How important are permissions for malware detection? Is it possible to keep or improve the results by reducing the size of the manifests? We then (2) investigate if BERT can be used to classify Android malware into families. The results show that BERT can successfully differentiate malware/goodware with 97% accuracy. Furthermore, BERT can classify malware families with 93% accuracy. We also demonstrate that Android permissions are not what allows BERT to classify successfully, and that it does not even need them.
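Neither MalBERT nor the reproduction's code is part of this record; the sketch below shows the general setup of feeding manifest text to a BERT sequence classifier. The manifest flattening, the checkpoint name, the permission-stripping switch and the binary labels are illustrative assumptions, and the classifier head would of course need fine-tuning on labelled manifests before its output is meaningful.

```python
# Illustrative setup: classifying an app as malware/goodware from its manifest text with BERT.
# Hypothetical preprocessing and model choice; not MalBERT's actual pipeline.
import xml.etree.ElementTree as ET
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

def manifest_to_text(path: str, keep_permissions: bool = True) -> str:
    """Flatten a decoded AndroidManifest.xml into a sequence of tag names and attribute values."""
    root = ET.parse(path).getroot()
    parts = []
    for elem in root.iter():
        tag = elem.tag.split("}")[-1]                  # drop XML namespaces
        if not keep_permissions and tag == "uses-permission":
            continue                                   # variant used to study the role of permissions
        parts.append(tag)
        parts.extend(str(v) for v in elem.attrib.values())
    return " ".join(parts)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

text = manifest_to_text("AndroidManifest.xml")
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits
print("malware" if logits.argmax(dim=-1).item() == 1 else "goodware")
```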
Garg, Aayush, in Empirical Software Engineering (2022).

Bissyande, Tegawendé François D Assise, in Empirical Software Engineering (2022), 27.

Khanfir, Ahmed, in ACM Transactions on Software Engineering and Methodology (2022).

Ojdanic, Milos, in ACM Transactions on Software Engineering and Methodology (2022).
Context: When software evolves, opportunities for introducing faults appear. Therefore, it is important to test the evolved program behaviors during each evolution cycle. However, while software evolves, its complexity is also evolving, introducing challenges to the testing process. To deal with this issue, testing techniques should be adapted to target the effect of the program changes instead of the entire program functionality. To this end, commit-aware mutation testing, a powerful testing technique, has been proposed. Unfortunately, commit-aware mutation testing is challenging due to the complex program semantics involved. Hence, it is pertinent to understand the characteristics, predictability, and potential of the technique.
Objective: We conduct an exploratory study to investigate the properties of commit-relevant mutants, i.e., the test elements of commit-aware mutation testing, by proposing a general definition and an experimental approach to identify them. We thus aim at investigating the prevalence, location, and comparative advantages of commit-aware mutation testing over time (i.e., across the program evolution). We also investigate the predictive power of several commit-related features in identifying and selecting commit-relevant mutants, to understand the essential properties for its best-effort application.
Method: Our definition of commit-relevant mutants relies on the notion of observational slicing, approximated by higher-order mutation. Specifically, our approach utilizes the impact of mutants, i.e., the effects of one mutant on another, in capturing and analyzing the implicit interactions between the changed and unchanged code parts. The study analyses millions of mutants (over 10 million), 288 commits and five different open-source software projects, involving over 68,213 CPU days of computation, and sets a ground truth on which we perform our analysis.
Results: Our analysis shows that commit-relevant mutants are located mainly outside of the program commit change (81%), suggesting a limitation in previous work. We also note that effective selection of commit-relevant mutants has the potential of reducing the number of mutants by up to 93%. In addition, we demonstrate that commit-relevant mutation testing is significantly more effective and efficient than state-of-the-art baselines, i.e., random mutant selection and analysis of only mutants within the program change. In our analysis of the predictive power of mutant and commit-related features (e.g., number of mutants within a change, mutant type, and commit size) in predicting commit-relevant mutants, we found that most proxy features do not reliably predict commit-relevant mutants.
Conclusion: This empirical study highlights the properties of commit-relevant mutants and demonstrates the importance of identifying and selecting commit-relevant mutants when testing evolving software systems.

Habchi, Sarra, in A Qualitative Study on the Sources, Impacts, and Mitigation Strategies of Flaky Tests (2022, April).
Test flakiness forms a major testing concern. Flaky tests manifest non-deterministic outcomes that cripple continuous integration and lead developers to investigate false alerts. Industrial reports indicate that, on a large scale, the accrual of flaky tests breaks the trust in test suites and entails significant computational cost. To alleviate this, practitioners are constrained to identify flaky tests and investigate their impact. To shed light on such mitigation mechanisms, we interview 14 practitioners with the aim to identify (i) the sources of flakiness within the testing ecosystem, (ii) the impacts of flakiness, (iii) the measures adopted by practitioners when addressing flakiness, and (iv) the automation opportunities for these measures. Our analysis shows that, besides the tests and code, flakiness stems from interactions between the system components, the testing infrastructure, and external factors. We also highlight the impact of flakiness on testing practices and product quality, and show that the adoption of guidelines together with a stable infrastructure are key measures in mitigating the problem.

Ghamizi, Salah, E-print/Working paper (2022).
Clinicians use chest radiography (CXR) to diagnose common pathologies. Automated classification of these diseases can expedite analysis workflows, scale to growing numbers of patients and reduce healthcare costs. While research has produced classification models that perform well on a given dataset, the same models lack generalization to different datasets. This reduces confidence that these models can be reliably deployed across various clinical settings. We propose an approach based on multitask learning to improve model generalization. We demonstrate that learning a (main) pathology together with an auxiliary pathology can significantly impact generalization performance (between -10% and +15% AUC-ROC). A careful choice of auxiliary pathology even yields competitive performance with state-of-the-art models that rely on fine-tuning or ensemble learning, while using between 6% and 34% of the training data that these models required. We further provide a method to determine the best auxiliary task to choose without access to the target dataset. Ultimately, our work makes a big step towards the creation of CXR diagnosis models applicable in the real world, through the evidence that multitask learning can drastically improve generalization.
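The paper's models are not included in this record; the sketch below shows the general multitask setup it describes, a shared image encoder with one head for the main pathology and one for an auxiliary pathology, trained with a combined loss. The backbone, loss weighting and head sizes are illustrative assumptions, not the paper's exact configuration.

```python
# Illustrative multitask setup for CXR classification: shared encoder, main + auxiliary heads.
# Architecture and loss weighting are assumptions for illustration.
import torch
import torch.nn as nn
from torchvision.models import densenet121

class MultiTaskCXR(nn.Module):
    def __init__(self, aux_weight: float = 0.5):
        super().__init__()
        backbone = densenet121(weights=None)           # shared feature extractor
        feat_dim = backbone.classifier.in_features
        backbone.classifier = nn.Identity()
        self.backbone = backbone
        self.main_head = nn.Linear(feat_dim, 1)        # main pathology (binary)
        self.aux_head = nn.Linear(feat_dim, 1)         # auxiliary pathology (binary)
        self.aux_weight = aux_weight
        self.criterion = nn.BCEWithLogitsLoss()

    def forward(self, images):
        feats = self.backbone(images)
        return self.main_head(feats), self.aux_head(feats)

    def loss(self, images, y_main, y_aux):
        main_logits, aux_logits = self(images)
        return (self.criterion(main_logits.squeeze(1), y_main)
                + self.aux_weight * self.criterion(aux_logits.squeeze(1), y_aux))

# Toy forward/backward pass on random data.
model = MultiTaskCXR()
images = torch.randn(4, 3, 224, 224)
loss = model.loss(images,
                  torch.randint(0, 2, (4,)).float(),
                  torch.randint(0, 2, (4,)).float())
loss.backward()
```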
Garg, Aayush, in IEEE Transactions on Software Engineering (2022).