Automatic Detecting Unethical Behavior in Open-source Software Projects
Hsu Myat Win, Haibo Wang, Shin Hwei Tan

TL;DR
This paper presents a study of unethical behavior in open-source software projects, introduces a taxonomy of 15 types of unethical behavior, and proposes an automated detection approach called Etor, which detects six of those types with a 74.8% average true positive rate.
Contribution
It is the first study to classify unethical behavior in OSS from the stakeholders' perspective and to develop an automated detection approach based on ontological engineering and Semantic Web Rule Language (SWRL) rules.
Findings
Identified 15 types of unethical behavior in OSS.
Etor detects 6 types of unethical behavior with a 74.8% average true positive rate.
Analyzed 195,621 GitHub issues across 1,765 repositories.
Abstract
Given the rapid growth of Open-Source Software (OSS) projects, ethical considerations are becoming more important. Past studies focused on specific ethical issues (e.g., gender bias and fairness in OSS). There is little to no study on the different types of unethical behavior in OSS projects. We present the first study of unethical behavior in OSS projects from the stakeholders' perspective. Our study of 316 GitHub issues provides a taxonomy of 15 types of unethical behavior guided by six ethical principles (e.g., autonomy). Examples of new unethical behavior include soft forking (copying a repository without forking) and self-promotion (promoting a repository without self-identifying as a contributor to the repository). We also identify 18 types of software artifacts affected by the unethical behavior. The diverse types of unethical behavior identified in our study (1) call for attentions…
| Type | Issues (#) | Affected Software Artifacts |
|---|---|---|
| S1 | 49 | 41 Source code, 5 Configuration files, 1 API, 1 project, 1 script |
| S2 | 19 | 19 Projects |
| S3 | 6 | 2 Source code, 2 Data, 1 UI, 1 Project |
| S4 | 26 | 9 Legalese, 7 Source code, 4 README/ CONTRIBUTING.md, 3 Configuration files, 1 Image, 1 OS, 1 Website |
| S5 | 31 | 31 Legalese |
| S6 | 9 | 9 CHANGELOGs |
| S7 | 16 | 16 APIs |
| S8 | 8 | 8 PR/Issue comments |
| S9 | 10 | 10 Release histories |
| S10 | 27 | 23 Source code, 4 APIs |
| S11 | 21 | 10 Product names, 8 Source code, 1 UI, 1 Data, 1 Script |
| S12 | 15 | 10 PR/Issue comments, 5 PR/Issue code reviews |
| S13 | 7 | 2 UIs, 2 Product names, 1 Source code, 1 README/ CONTRIBUTING.md, 1 Website |
| S14 | 36 | 15 UIs, 11 Software features, 6 Source code, 4 Configuration files |
| S15 | 36 | 12 Source code, 10 APIs, 5 UIs, 5 Software features, 3 Configuration files, 1 Website |
| Total | 316 | 316 |

| Attribute | Type | Description |
|---|---|---|
| GHRepository: main class | ||
| licenseFile | GHContent | repo’s license file |
| readmeFile | GHContent | readme file |
| fileCount | int | # of files in repo |
| fileContent | GHContent | file’s content |
| commitCountByPath | int | # of commits for specific file path |
| commitByPath | GHCommit | commit for file path |
| fork | GHRepository | fork of a repo |
| forkCount | int | # of forks of repo |
| contributor | GHUser | stakeholder taking part in GitHub repo |
| pullRequestCountByCommit | int | # of PRs which contain specific commit |
| latestRelease | GHRelease | the last release in GitHub history |
| GHUser: A GitHub user identified by username | ||
| user | String | GitHub username |
| GHIssue: A GitHub issue that describes a bug or a feature. | ||
| issueMessageBody | String | description of issue |
| issueOwner | GHUser | stakeholder who reports an issue |
| GHCommit: The code changes in a commit. | ||
| commitCodeChange | String | code change in commit |
| GHContent: The content (including source code) of a file and its location (file path). | ||
| contentCount | int | # of contents stored in file’s content |
| content | String | content |
| path | String | file path |
| pathCount | int | # of file paths |
| GHRelease: A latest release is represented by the published date of the release | ||
| publishedDate | Date | date of release in GitHub history |
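
The attribute table above can be read as a small object model. The following is a minimal Python sketch, assuming plain dataclasses stand in for the ontology classes. The class and field names follow the table; the optionality, the default values, the choice to make `contributor` a list, and the omission of the count and by-path attributes are assumptions made for illustration and are not taken from Etor.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class GHUser:
    user: str  # GitHub username


@dataclass
class GHContent:
    content: str  # content of the file
    path: str     # file path


@dataclass
class GHCommit:
    commitCodeChange: str  # code change in the commit


@dataclass
class GHRelease:
    publishedDate: date  # date of the release in the GitHub history


@dataclass
class GHIssue:
    issueMessageBody: str  # description of the issue
    issueOwner: GHUser     # stakeholder who reported the issue


@dataclass
class GHRepository:
    # Assumption: absent artifacts are modelled as None.
    licenseFile: Optional[GHContent] = None
    readmeFile: Optional[GHContent] = None
    fileCount: int = 0
    forkCount: int = 0
    # Assumption: a repository can have several contributors.
    contributor: list[GHUser] = field(default_factory=list)
    latestRelease: Optional[GHRelease] = None


repo = GHRepository(fileCount=3, contributor=[GHUser("alice")])
```

In this sketch, a repository without a license file is simply one whose `licenseFile` is `None`, which is the kind of attribute a detection rule could inspect.
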

| Type | # Unethical Issues (# repos or issues / Total) | True Positive (# repos or issues / Total) | True Positive (%) | False Positive (# repos or issues / Total) | False Positive (%) | Time (s) |
|---|---|---|---|---|---|---|
| (S1) No attribution to the author in code | 80 / 195,621 issues | 59 / 80 issues | 74 | 21 / 80 issues | 26 | 5.4 |
| (S2) Soft forking | 10 / 100 repos | 10 / 10 repos | 100 | 0 / 10 repos | 0 | 343.1 |
| (S5) No license provided in public repository | 476 / 1,765 repos | 424 / 476 repos | 89 | 52 / 476 repos | 11 | 3.1 |
| (S6) Uninformed license change | 18 / 1,765 repos | 16 / 18 repos | 88 | 2 / 18 repos | 11 | 9.2 |
| (S8) Self-promotion | 116 / 195,621 issues | 37 / 116 issues | 32 | 79 / 116 issues | 68 | 4.3 |
| (S9) Unmaintained Android Project with Paid Service | 3 / 1,765 repos | 2 / 3 repos | 66 | 1 / 3 repos | 33 | 5.3 |
| Average | - | - | 74.8 | - | 24.8 | - |
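
The figures in the "Average" row can be checked with a few lines of arithmetic. The sketch below copies the counts and percentages from the table as printed and assumes that the reported averages are unweighted means over the six types. The pooled rate at the end is a different aggregation, shown only for comparison; it is not a figure reported in the table.

```python
# (type, true positives, flagged total, reported TP %, reported FP %)
rows = [
    ("S1", 59, 80, 74, 26),
    ("S2", 10, 10, 100, 0),
    ("S5", 424, 476, 89, 11),
    ("S6", 16, 18, 88, 11),
    ("S8", 37, 116, 32, 68),
    ("S9", 2, 3, 66, 33),
]

# Total number of confirmed detections across the six types.
total_true_positives = sum(tp for _, tp, _, _, _ in rows)

# Unweighted mean of the percentage columns, one value per type.
avg_tp = sum(row[3] for row in rows) / len(rows)
avg_fp = sum(row[4] for row in rows) / len(rows)

# Pooled rate over all flagged items (a different aggregation, for comparison).
pooled_tp = 100 * total_true_positives / sum(row[2] for row in rows)

print(total_true_positives, round(avg_tp, 1), round(avg_fp, 1))
# prints: 548 74.8 24.8
```

The total of 548 matches the number of detected instances quoted in the abstract, and the two means match the "Average" row.
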
Automatic Detecting Unethical Behavior in Open-source Software Projects
Hsu Myat Win, Haibo Wang, and Shin Hwei Tan
Southern University of Science and Technology, China
Abstract.
Given the rapid growth of Open-Source Software (OSS) projects, ethical considerations are becoming more important. Past studies focused on specific ethical issues (e.g., gender bias and fairness in OSS). There is little to no study on the different types of unethical behavior in OSS projects. We present the first study of unethical behavior in OSS projects from the stakeholders’ perspective. Our study of 316 GitHub issues provides a taxonomy of 15 types of unethical behavior guided by six ethical principles (e.g., autonomy). Examples of new unethical behavior include soft forking (copying a repository without forking) and self-promotion (promoting a repository without self-identifying as a contributor to the repository). We also identify 18 types of software artifacts affected by the unethical behavior. The diverse types of unethical behavior identified in our study (1) call for the attention of developers and researchers when making contributions in GitHub, and (2) point to future research on automated detection of unethical behavior in OSS projects. Based on our study, we propose Etor, an approach that can automatically detect six types of unethical behavior by using ontological engineering and Semantic Web Rule Language (SWRL) rules to model GitHub attributes and software artifacts. Our evaluation on 195,621 GitHub issues (1,765 GitHub repositories) shows that Etor can automatically detect 548 instances of unethical behavior with a 74.8% average true positive rate. This shows the feasibility of automated detection of unethical behavior in OSS projects.
1. Introduction
With the increasing popularity of Open-Source Software (OSS) development, ethical considerations have become an important yet often neglected topic within the research community. For example, the incident in which researchers investigated the feasibility of stealthily introducing vulnerabilities in OSS by making hypocrite commits (commits that deliberately introduce critical bugs into code) has provoked active discussion among the Linux community, researchers, and other OSS developers (Wu and Lu, 2021). The Linux developers argued that making “hypocrite commits” is “not ethical” and wastes developers’ time in reviewing invalid patches (lin, 2021). More importantly, this incident has revealed an attack on the basic premise of OSS itself (i.e., the fact that anyone can contribute to the code and any OSS project is susceptible to a similar incident). Indeed, unethical behavior committed by OSS contributors might lead to broken trust between the OSS community and the contributor, whereas unethical software development might lead to loss of funding, reputation, or other resources of the OSS organization involved. Despite the importance of understanding what stakeholders (individuals who participate or are interested in the OSS project, and can either affect or be affected by it) consider unethical, most studies on unethical behavior in OSS projects focus on common types of unethical behavior, such as gender bias (Terrell et al., 2016; Imtiaz et al., [n.d.]), fairness in the code review process (German et al., 2018), and software licensing (Lerner and Tirole, 2005; Vendome et al., 2017, [n.d.]). There is little to no study that investigates the important question: “What kind of behavior is considered unethical by stakeholders in OSS projects?” Without understanding the definition of unethical behavior from the perspective of the stakeholders of OSS projects, incidents similar to the “hypocrite commits” experiments are bound to recur.
Prior studies stress the importance of considering ethical issues in OSS projects by using various examples and referring to ethical principles (Oezbek et al., 2008; Grodzinsky et al., 2003; Gold and Krinke, [n.d.]). Unfortunately, a study revealed that instructing participants to consider the ACM code of ethics does not affect their ethical decision-making in software engineering tasks (McNamara et al., [n.d.]). A similar argument has been made in AI ethics, which calls for practical methods to translate principles into practice (Mittelstadt, 2019). Hence, we argue that it is not enough to merely observe the occurrence of unethical behavior via several examples in OSS projects. It is much more important to systematically study the characteristics of such behavior, and to design practical tools that automatically detect it and present stakeholders with evidence drawn from data in OSS projects.
To bridge the gaps between general ethical principles and OSS practices, we present the first study of the types of unethical behavior in OSS projects from stakeholders’ perspectives. Specifically, our study aims to answer the research questions below:
(RQ1) How do stakeholders in OSS projects define unethical behavior, and what are the types of unethical behavior?
By referring to ethical principles, we study the diverse types of unethical behavior, their characteristics, and the corresponding ethical principles that underlie these types of unethical behavior in OSS projects.
(RQ2) Given the type of unethical behavior, what are the corresponding types of software artifacts that are deemed unethical by stakeholders of OSS projects?
For each type of unethical behavior, we study the affected software artifacts (i.e., artifacts in which the stakeholders claimed to violate ethical principles) to guide our design of a tool that can automatically identify unethical behavior in OSS projects.
Our study leads to a taxonomy of 15 types of unethical behavior in OSS projects, including S1: No attribution to the author in code, S2: Soft forking, S3: Plagiarism, S4: License incompatibility, S5: No license provided in public repository, S6: Uninformed license change, S7: Depending on proprietary software, S8: Self-promotion, S9: Unmaintained project with paid service, S10: Vulnerable code/API, S11: Naming confusion, S12: Closing issue/PR without explanation, S13: Offensive language, S14: No opt-in or no option allowed, and S15: Privacy Violation. Six of them have not been studied (i.e., S2, S6, S8, S9, S11, S12). For example, our study discovered the unethical behavior of “S8: Self-promotion”, where a contributor deliberately opened many new pull requests (PRs) in several popular OSS projects. The code of these PRs depends on a newly released library to which he is a contributor, yet he did not mention the conflict of interest (the fact that he is promoting his own library) (S8, [n.d.]). Another example is “S11: Naming confusion”, where the developer selects a name for an artifact that conflicts with an existing name, even though stakeholders are responsible for selecting unique names.
Inspired by our study, we propose Ethic detector (Etor), an automatic detection tool based on ontological engineering (a description of entities and their properties, relationships, and behaviors) and Semantic Web Rule Language (SWRL) rules to model software artifacts in GitHub. In summary, we made the following contributions:
- Study. To the best of our knowledge, we conducted the first study of unethical behavior in OSS projects from the stakeholders’ perspective. Our study of 316 GitHub issues/PRs from 301 projects revealed 15 types of unethical behavior with six new ones. Our study also revealed the diversity of the affected software artifacts. Our benchmark containing 316 issues with various types of unethical behavior lays the foundation for future automated approaches for detecting unethical behavior.
- Technique. We propose Etor, a novel ontology-based tool that automatically detects unethical behavior in OSS projects. We model GitHub attributes using ontologies, and design SWRL rules to check for unethical behavior in various artifacts.
- Evaluation. Our evaluation on 195,621 GitHub issues/PRs from 1,765 repos shows that Etor can automatically detect 548 issues with 74.8% true positive rate on average.
2. Background and Related Work
Ethical Principles. Prior work on ethical principles in OSS projects mainly studied six aspects: (1) accountability, (2) attribution, (3) autonomy, (4) informed consent, (5) privacy, and (6) trust (Turilli and Floridi, 2009; Friedman et al., 2013; Cenite et al., 2009; Kocsis and de Vreede, 2016). Accountability means that an individual is accountable for his/her actions. Attribution (e.g., copyright) means giving credit to authors when the credit is due. Autonomy allows an individual to decide, plan, and act to achieve their goals. In OSS projects, individuals inherently have autonomy because they can choose which tasks to perform but may gain or lose autonomy once they agree to participate. Informed Consent is an agreement between the individual and the institution maintaining ethical values, such as autonomy. Privacy is a right of a stakeholder on what information another stakeholder can obtain and communicate to others. Trust refers to expectations between people through goodwill.
Web Ontology Language (OWL) is a standard ontology language endorsed by the W3C to construct an OWL knowledge model (RN2, [n.d.]a; McGuinness et al., 2004; Antoniou and Harmelen, 2004). It is a semantic web language designed to model rich and complex knowledge about things, groups of things, and relations between things. Knowledge expressed in OWL can be exploited by computer programs, e.g., to verify the consistency of that knowledge or to make implicit knowledge explicit. Thus, we design our tool based on ontology engineering.
Semantic Web Rule Language (SWRL) is a language that combines OWL and Rule Markup Language (RuleML), which can be used to express Horn-like rules and logic (RN2, [n.d.]b). SWRL rules are used to infer new knowledge regarding the individual (instance) by chains of properties. We choose to model the unethical behavior in OSS projects using SWRL because (1) its expressiveness (Vrandečić, 2009) is well-suited for modeling unethical behavior that involves different GitHub attributes and diverse types of software artifacts, and (2) it has been widely used to model concepts such as privacy for medical data (Boussi Rahmouni et al., 2009) and access control policy (Kayes et al., 2018; Beimel and Peleg, 2011).
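To make the idea of a Horn-like rule concrete, the sketch below mimics the shape of such a rule in plain Python: a conjunction of atoms in the body implies a label in the head. It is only an illustration and does not reproduce Etor's actual SWRL rules. The `Repo` fields, the `Rule` class, and the simplified condition for "S5: No license provided in public repository" (a public repository with no license file) are all assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Repo:
    name: str
    is_public: bool
    license_file: Optional[str]  # None when the repository has no license file


@dataclass(frozen=True)
class Rule:
    head: str                                 # conclusion inferred when the rule fires
    body: tuple[Callable[[Repo], bool], ...]  # conjunction of atoms

    def fires(self, repo: Repo) -> bool:
        # A Horn-like rule fires only if every atom in its body holds.
        return all(atom(repo) for atom in self.body)


# Hypothetical rule: isPublic(r) AND hasNoLicenseFile(r) -> S5(r)
NO_LICENSE = Rule(
    head="S5: No license provided in public repository",
    body=(
        lambda r: r.is_public,
        lambda r: r.license_file is None,
    ),
)


def infer(repo: Repo, rules: list[Rule]) -> list[str]:
    """Return the heads of all rules whose bodies hold for this repository."""
    return [rule.head for rule in rules if rule.fires(repo)]


flagged = infer(Repo("demo", is_public=True, license_file=None), [NO_LICENSE])
```

Keeping each atom as a separate predicate mirrors how a rule language chains properties of an individual, and adding a further condition only means appending another atom to the body.
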
Related Work. Prior studies focus on multiple aspects of ethical concerns in several domains.
Ethical concerns in Software Engineering research. Several studies focus on ethical concerns for empirical studies in software engineering. Badampudi studied how ethical considerations are reported in Software Engineering publications (Badampudi, [n.d.]). Andrews et al. illustrated some of the common approaches to encourage ethical behavior and their limits for demanding ethical behavior between researchers’ duty and their publishing as well as the companies’ and individuals’ integrity (Andrews and Pradhan, 2001). Singer et al. introduced their work as a practical guide to ethical research involving humans in software engineering (Singer and Vinson, 2002). Our study is complementary to these studies, as the types of unethical behavior discovered in our study point to potential violations of ethical principles that software engineering researchers should consider when they evaluate automated tools on OSS projects.
Studies on ethical concerns in OSS. Existing studies of OSS projects focus on issues related to gender bias (Terrell et al., 2016; Imtiaz et al., [n.d.]), fairness of the code review process (German et al., 2018), similar code in Stack Overflow and GitHub (Yang et al., 2017; Baltes and Diehl, 2019), and software licensing (Lerner and Tirole, 2005; Vendome et al., 2017, [n.d.]). Studies relating to gender bias in GitHub (Terrell et al., 2016; Imtiaz et al., [n.d.]) aim to address the obstacles in improving gender diversity. Meanwhile, a study of a large industrial open source ecosystem (OpenStack) shows that unfairness is “starting to be perceived as an issue” in OSS (German et al., 2018). Several studies investigated code clones between code snippets from Stack Overflow and projects on GitHub and found a considerable number of non-trivial clones (Yang et al., 2017; Baltes and Diehl, 2019). Although these studies also explored how code in GitHub projects was copied or adapted from Stack Overflow answers without giving proper credit to the authors (who wrote the code), they did not consider the scenario where the stakeholder who uses the code snippet in GitHub is also the owner of the code in Stack Overflow (in this case, credit is not needed). Several techniques have been proposed for the automated detection of license incompatibility (German et al., 2010; Kapitsaki et al., 2017; Xu et al., 2021). While our study identifies license incompatibility as an unethical behavior, it includes more diverse types of issues related to licensing (e.g., missing license, and uninformed license change). Nevertheless, all existing studies on ethical concerns in OSS projects focus on only a few aspects of ethical principles, and they did not analyze the diverse types of ethical violations in OSS projects in GitHub.
3. Study of unethical behavior in OSS
To address the two research questions introduced in Section 1, we conducted a study of unethical behavior in OSS projects. Although using a mixed-method research methodology (e.g., adding a survey that asks developers for their opinions on each unethical behavior) would provide stronger empirical evidence, we chose to observe unethical behavior passively by reading developers’ discussions to avoid spamming developers (Baltes and Diehl, 2016).
Study methodology. Figure 1 gives an overview of our study. We built a crawler that searches GitHub issues via the GitHub API using the keyword “ethic”, concepts related to unethical behavior, and synonyms for “un/ethical” (i.e., “unprofessional”, “unfair”, “right”, “proper”, and “principle”). We then manually checked the results to exclude issues that do not have a clear description or are unrelated to ethical behavior. After getting the relevant issues, we manually analyzed the stakeholders’ discussions using thematic analysis (Cruzes and Dyba, 2011), an approach for identifying patterns (or “themes”) within data. Specifically, the first two authors of the paper followed five steps: (1) We carefully read and analyzed all discussions in the issue to understand what stakeholders discussed and how they described unethical behavior, identifying the key sentences and phrases that represent unethical behavior. (2) We coded the key sentences and phrases in each issue by highlighting sections of text, and coming up with shorthand labels or “codes” to describe their content. We reread the related key sentences, phrases, and their surrounding context to generate initial codes. New codes could be added as we went through the discussions. After going through the discussions, we collated all the key sentences and phrases into groups identified by codes. These codes allowed us to gain a condensed overview of the main points and common meanings that recur throughout the discussions. (3) After generating initial codes, we looked over the created codes, aggregated codes with similar meanings into groups, and started coming up with themes for those groups. Themes are generally broader than codes. (4) Starting from the initial set of themes from the previous step, we reviewed all themes to look for opportunities to merge similar themes or sub-themes. (5) We finalized the themes by providing clear definitions.
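The crawling step can be sketched as follows. This is a hedged illustration, not the authors' crawler: it assumes the public GitHub REST search endpoint for issues (`/search/issues`), restricts results with the `is:issue` qualifier, and uses only the keywords quoted in the text. The function names, the page size, and the decision to query one keyword at a time are assumptions. Unauthenticated requests to this endpoint are rate limited, so a real crawl would also need authentication and pagination handling.

```python
import json
import urllib.parse
import urllib.request

SEARCH_URL = "https://api.github.com/search/issues"

# Keywords quoted in the study methodology.
KEYWORDS = ["ethic", "unprofessional", "unfair", "right", "proper", "principle"]


def build_search_url(keyword: str, page: int = 1, per_page: int = 100) -> str:
    """Build the search URL for one keyword (issues only, by assumption)."""
    query = urllib.parse.urlencode(
        {"q": f"{keyword} is:issue", "page": page, "per_page": per_page}
    )
    return f"{SEARCH_URL}?{query}"


def fetch_issues(keyword: str, page: int = 1) -> list:
    """Fetch one page of matching issues; performs a network request."""
    request = urllib.request.Request(
        build_search_url(keyword, page),
        headers={"Accept": "application/vnd.github+json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["items"]


urls = [build_search_url(keyword) for keyword in KEYWORDS]
```

Each returned item would then go through the manual relevance check described above; the sketch only covers collecting candidates.
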
To reduce research bias, steps (1) to (4) were conducted independently by the first two authors of the paper. Then, a sequence of meetings was held to resolve conflicts and define the final themes in step (5). Both authors are PhD students with more than two years of research experience. The first author had taken a computer ethics course, while the second author had experience in OSS development. For RQ1, we developed a taxonomy of the types of unethical behavior in OSS projects and their underlying principles. Before following the steps of thematic analysis, we reviewed ethical principles from prior studies (Turilli and Floridi, 2009; Friedman et al., 2013; Cenite et al., 2009; Kocsis and de Vreede, 2016), and identified six ethical principles guiding the actions of stakeholders in OSS projects: (1) accountability, (2) attribution, (3) autonomy, (4) informed consent, (5) privacy, and (6) trust (we excluded “welfare” because it relates to fair wages, which are generally not discussed in our studied issues). We used these six underlying ethical principles and their corresponding ethical guidelines as guidance for merging relevant themes. For RQ2, we first obtained the initial “themes” (i.e., software artifacts) based on prior work (Pfeiffer, 2020; Huq et al., 2019). Then, via an iterative process of (1) reading the 316 issues with their corresponding types of unethical behavior, and (2) refining the themes via thematic analysis, we derived 18 types of affected software artifacts.
3.1. RQ1: Types of unethical behavior
We crawled issues in GitHub, and obtained 1,235 issues/PRs from 842 projects submitted by stakeholders. After reading the stakeholders’ discussions in GitHub issues/PRs and manually filtering out the invalid issues (e.g., issues that mentioned “ethic” but only involved updating terms and conditions in a document (Inv, [n.d.])), we obtained 316 issues with 23 keywords (e.g., “copy”, “plagiarism”) shown in the supplementary material. We then identified themes in these keywords by referring to the six principles and their corresponding guidelines. For example, keywords such as “copy” and “plagiarism” belong to the same ethical guideline (“To respect copyright”) for “(S1) No attribution to the author in code”, “(S2) Soft forking”, and “(S3) Plagiarism”, as they are all related to giving proper credit to the authors. We separate them into different types because they involve different degrees of copying (copying an entire repository in “Soft forking” versus copying text in “Plagiarism”). Subsequently, we obtained 15 types of unethical behavior with 11 ethical guidelines. After the generation of initial themes, both authors met to discuss the 39 cases (12%) with divergent themes to reach a consensus.
Figure 2 shows the 15 types of unethical behavior in our study. Boxes on the left (e.g., “Attribution”) describe the ethical principle behind each type, whereas the grey headings of the boxes on the right (e.g., “To respect copyright”) give the 11 ethical guidelines, and the contents present the related types of unethical behavior. Six of the 15 types have not been previously studied (i.e., S2, S6, S8, S9, S11, S12).
Finding 1: The types of unethical behavior in OSS projects are diverse (15 types identified in our study, with six new ones)
We explain the 11 ethical guidelines and the corresponding types of unethical behavior below:
(1) To respect copyright. There are three types of unethical behavior related to copyright, described below:
S1: No attribution to the author in code. This issue occurs when stakeholders fail to give proper credit after copying a piece of code (Baltes et al., 2017). An example for S1 is:
(S1) “it is unethical not to credit or at the least, point out that these features are inspired by…” (S1, [n.d.])
S2: Soft forking. This issue occurs if the copied item is a repository and the copied repository has not been forked. Although GitHub encourages forking for social coding, a copied repository should acknowledge the original repository by creating an official fork (Nyman and Mikkonen, 2011). An example discussion for S2 is:
(S2) “Unauthorised copy of… unethical… You must delete this repo and fork from the original…” (S2, [n.d.])
S3: Plagiarism. Plagiarism occurs if the stakeholders copied texts (non-source code) or the entire product regardless of giving credit or not (RN2, [n.d.]c). An example discussion for S3 is:
(S3) “Interactive book should be free of plagiarism. By replicating the content used by…unethical.” (S3, [n.d.])
In this example, the repository of an interactive book is considered unethical because the book uses text copied from several websites.
(2) To help individuals make informed consent decisions easier via licensing. There are three types of unethical behavior related to licensing, described below:
S4: License incompatibility. This issue occurs if the repository includes source code or text files carrying license types different from the project’s license; stakeholders must ensure license compatibility within the repository. An example for S4 is:
(S4) “To continue distributing when we know they have incompatible licenses is unethical.” (S4, [n.d.]).
S5: No license provided in public repository. This issue occurs if a public repository does not have any license and stakeholders request one. Licenses state the official permissions to use a repository, so project owners should provide them for public OSS for greater transparency. An example comment for S5 is:
(S5) “The repository is public which implies an intent of being open-source but no license is specified making review of the code an issue…People get…at the end of the day, but they are funding this stuff instead of the… developers. That’s unethical but legal.” (S5, [n.d.]).
S6: Uninformed license change. Due to transparency concerns, OSS developers should inform the stakeholders about the license change (via CHANGELOG or PR) prior to changing the license. S6 occurs if the contributors fail to do so. An example for S6 is:
(S6) “Normally license change are announced in some form of PR or announcement or discussion and none of that has taken place…I find this silent change unethical.” (S6, [n.d.])
(3) To avoid license violation. Stakeholders must obey the OSS license agreement and avoid integrating prohibited licenses that cause violations in license dependency chains (Kapitsaki et al., 2015).
S7: Depending on proprietary software. This issue occurs if the OSS project relies on closed-source software because OSS projects should be fully open-sourced. An example comment for S7 is:
(S7) “Since … is fully open source software, I believe depending on closed source software is unethical” (S7, [n.d.]).
(4) To respect expectations between people through goodwill. Trust is an ethical principle that refers to respecting expectations between people through goodwill. The following type of unethical behavior may lead to broken trust among stakeholders in OSS projects:
S8: Self-promotion. This issue occurs when a stakeholder advertises his or her repository by suggesting that it be incorporated into another repository, without mentioning that he or she is a contributor or owner of the artifact. The goal of the stakeholder is to attract attention to his or her less well-known repository to increase its popularity. An example comment for S8 is:
(S8) “I strongly advise against migrating to nanocolors…Seeing him leverage his notability and following to promote and increase the adoption of nanocolors …, which he just released a few days ago, is unethical…failing to disclose that you are promoting your own package here is a bad” (S8, [n.d.])
In this example, an external user (not affiliated with the ESLint repository) who is the owner of the nanocolors library opens a PR in the ESLint repository to suggest replacing the chalk library with nanocolors to promote his own library. To mitigate this issue, a contributor of ESLint later suggested that the user disclose the fact that he is promoting his own library.
(5) To be responsible for the project maintenance. Project owners, especially those who offer paid services, should actively maintain their projects. If project owners intend to discontinue technical support, they should inform the users before asking them to pay for the service.
S9: Unmaintained project with paid service. This issue occurs if the project repository is not actively maintained even though it offers a paid service. It is unethical because the project owner is responsible for providing support to paid users who report bugs, and for fixing those bugs within a reasonable time. An example for S9 is:
(S9) “I just bought the pro version, and now I’m having this same problem…definitely unethical.” (S9, [n.d.])
In this example, the user who has paid for the open-source app reported the failure in using themes (a functionality that is only available for paid users) but the app is no longer maintained.
(6) To avoid fraudulent activities. As a code of conduct in OSS, stakeholders should be aware of malicious activities.
S10: Vulnerable code/API. The issue occurs when stakeholders or a project is involved in malicious activities (e.g., contributing malicious code/API or leaving an unfixed vulnerability in the code). An example comment for S10 is:
(S10) “Given that iText 2.1.7 has…unfixed security vulnerability, …continuing to release it is unethical. In my opinion, iText 2.1.7 should be replaced by OpenPDF.” (S10, [n.d.]).
In this example, the user suggested replacing iText, which has an unfixed vulnerability, with another library (OpenPDF) where the vulnerability has been fixed. Another example for S10 is the “hypocrite commits” incident mentioned in Section 1.
(7) To be responsible for naming. Stakeholders are responsible for all software artifacts that they own, including the selected names.
S11: Naming confusion. This issue occurs when stakeholders neglect their duty to give unique names to their artifacts (e.g., packages, variables, and libraries). Project owners should check that a name is unique before using it. An example for S11 is:
(S11) “There is already a package ‘click’ for creating command-line interfaces. I am using coreapi package which … import click package:… your library does not have a style component and python throws an error…this kind of behavior for a company… unethical” (S11, [n.d.]).
In the above example, a user complained that the developers of the click-integration-django library selected the same package name as the Click package, causing an error when using the package due to naming conflicts.
(8) To be responsible for explaining public actions. Owners of OSS projects should explain each decision made for supporting users.
S12: Closing issue/PR without explanation. This occurs when an issue/PR has been closed without any explanation. It is considered unethical because all stakeholders are expected to receive reasonable explanations for informational fairness (Colquitt, 2001). An example for S12 is:
(S12) “It’s a bit unfair to just close something without explaining why?…I don’t understand why this (despite several closed issues all saying the same thing) isn’t being implemented” (S12, [n.d.]).
(9) To avoid offensive language. Stakeholders should encourage a respectful environment in OSS projects by avoiding offensive language because words with offensive language might represent unethical behavior (da Silva et al., 2021). A prior study stated that hate speech (offensive words) might not be a criminal offense but can still be harmful (Mondal et al., 2017).
S13: Offensive language. This occurs if the stakeholders or part of the project uses offensive language. An example for S13 is:
(S13) “Rename the Scroll of Genocide to something else…It was never a good or ethical name…It is not “merely” systemic and deliberate mass-murder…but state-enacted systemic destruction, neglect and suppression of entire schools of culture, science, literature, truth, of everything that makes us human” (S13, [n.d.]).
In this example, the stakeholder thinks that using the word “Genocide” to name a scroll in the open-source game is unethical because the word promotes the intentional destruction of human beings.
(10) To allow individuals to choose which tasks to perform. Based on the “Autonomy” ethical principle, stakeholders of OSS should have the freedom to choose the tasks to perform.
S14: No opt-in or no option allowed. This occurs if the system does not provide users options such as withdrawing from using the product. For example, no option is available for uninstalling the third-party library. We focus on issues with “no option” or “no opt-in” because they provide stronger protections than opt-out (Bergerson, 2000). An example comment for S14 is:
(S14) “There should be an option if someone wants to completely remove … from the system…I think it’s unethical to not provide an easy way for a program to be uninstalled” (S14, [n.d.]).
(11) To protect an individual’s right to personal information. The privacy of stakeholders of OSS should be protected.
S15: Privacy Violation. This occurs in OSS projects under two common scenarios: (1) if the software still collects data even though the user has opted out, and (2) if there exist personal data leaks regardless of the options (opt-in/out). An example for S15 is:
(S15) “Form submitted even if opt-in checkbox is unchecked…Signing people up when they haven’t opted in is a major enough bug that it renders the plugin useless (or at least unethical)” (S15, [n.d.]).
Table 1 presents the number of issues we found for each type of unethical behavior. The “Type” and “Issues (#)” columns represent the types of unethical behavior and the number of issues we found in GitHub, respectively. Overall, our study identifies 15 types of unethical behavior, of which the most common are related to copyright (S1, S2, and S3) and licensing (S4, S5, S6, and S7). As our study shows that illegal copying of code (S1), copying the entire repository (S2), and copying texts (S3) are common in OSS projects, we hope to raise awareness among stakeholders of OSS projects that such behavior is considered unethical.
Finding 2: The most common types of unethical behavior in OSS are issues related to copyright (23%) and licensing (26%).
3.2. RQ2: Affected software artifacts
We define affected software artifacts as objects in software repositories that violate ethical principles. To derive the set of affected software artifacts, we started with the 19 categories from the taxonomy of a prior study (Pfeiffer, 2020). Then, we categorized the artifacts we found in our study based on the 19 categories. After removing categories with no artifact found, we obtained eight categories: (1) source code, (2) script, (3) configuration, (4) database (data), (5) image, (6) prose, (7) legalese, and (8) other. For the prose category (i.e., plain text files), we only found two concrete types (i.e., README/CONTRIBUTING.md, and CHANGELOG), so we separated them into two categories. As the category “other” in the prior study (Pfeiffer, 2020) is too broad, we split it into 10 new categories based on the aforementioned steps in thematic analysis: (1) external application programming interface (API), (2) user interface (UI), (3) project, (4) release history, (5) software feature, (6) product name, (7) operating system (OS), (8) website, (9) PR/Issue code review, and (10) PR/Issue comment. We derive “PR/Issue code review” and “PR/Issue comment” based on prior work (Huq et al., 2019). Our newly introduced categories aim to preserve the hierarchy of artifacts (Project → Software feature (Eisenbarth et al., 2003) → Source code). For 28 cases (8.9%), both authors met to discuss the issues labeled with different categories to resolve any disagreement. Finally, we obtained 18 types of affected software artifacts: (1) project, (2) software feature, (3) source code, (4) external API, (5) legalese, (6) product name, (7) release history, (8) UI, (9) configuration file, (10) PR/Issue code review, (11) PR/Issue comment, (12) README / CONTRIBUTING.md, (13) CHANGELOG, (14) data, (15) image, (16) OS, (17) website, and (18) script (i.e., source code in languages executed by an interpreter). As several artifacts are more difficult to understand, we explain them below:
**Project:** The affected artifacts involve more than one type of artifact within the entire repository.
**Software feature:** Functional or non-functional requirements of a system (Eisenbarth et al., 2003; Hsi and Potts, 2000). An example is the ability to unsubscribe from a service.
**Source code:** Source files (excluding scripts, binary code, build code) that belong to the current repository (internal).
**External API:** API from a third-party (external) library or service.
**Legalese:** Licenses, copyright notes, or patents.
**Product name:** The product, project, or app name.
Finding 3: The unethical behavior in OSS projects affects many different types of software artifacts (our study found 18 types).
The third column in Table 1 presents the affected artifacts for each unethical behavior. Each number in the column denotes the number of GitHub issues with a certain type of artifact (e.g., “19 Projects” means that there are 19 issues where S2 affects projects). Theoretically, one issue might discuss multiple artifacts, but we found that each issue only discusses one artifact because (1) developers prefer discussing ethical concerns for one type of artifact in one issue, and (2) some categories are hierarchical (e.g., “project” includes multiple types of artifacts). Overall, Table 1 shows that source code is still the most common type of affected artifact (i.e., it is affected by eight types of unethical behavior).
Finding 4: Source code is the most common type of affected artifact (it is affected by eight types of unethical behavior).
4. Methodology
Our study shows that diverse types of unethical behavior exist in OSS projects, and they usually involve diverse types of software artifacts. The diversity and the complexity of the rules governing the ethics-related activities in GitHub motivate the need for a modeling approach that can abstract this complexity and facilitate its automatic detection. In Section 4.1, we describe how we model unethical behavior using SWRL rules. Then, we explain the architecture of Etor that uses SWRL rules for automatic detection in Section 4.2.
4.1. Modeling via SWRL rules
We propose using SWRL rules to represent unethical behavior in an OSS project together with the publicly available data in GitHub. SWRL rules allow us to model affected software artifacts as hierarchies of classes and properties, capturing the relationships between affected software artifacts and stakeholders. Table 2 shows the GitHub attributes used in our modeling. The “Attribute” and “Type” columns explain each attribute and its type. We model each OSS project as GHRepository. By referring to the GitHub Repositories API (Rep, [n.d.]), we selected 11 data properties (e.g., latestRelease and licenseFile) that belong to a GHRepository by excluding properties that are irrelevant for unethical behavior (e.g., avatar_url that points to the icon for a repository). Apart from GHRepository, we introduce six classes to model data properties of a repository: (1) GHUser, (2) GHCommit, (3) GHContent, (4) GHIssue, (5) GHPullRequest, and (6) GHRelease. While GitHub users (GHUser) usually play different roles in OSS projects, we only model: (1) contributors (users who are official contributors of a repository) and (2) issue owners (users who report an issue). For modeling GHIssue, we reuse the same convention in GitHub by modeling a PR (GHPullRequest) as a subclass of GHIssue (i.e., the GitHub Issue Search API will search for issues and PRs, essentially treating a PR as a type of GitHub issue). Figure 4 shows the OWL ontology for our model where GHRepository is the main class, and the arrows denote the relationships between the classes. Specifically, one arrow style represents the subclass relations, whereas the other arrows denote hasA relations (e.g., the arrow from GHIssue to GHUser means that each issue has a user who reports the issue).
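To make the class hierarchy concrete, the following is a minimal Python sketch of a few of the modeled classes. It is not the OWL ontology itself: the field names (`login`, `number`, `body`, `merged`, `full_name`) and the helper `is_contributor` are illustrative assumptions, and only `licenseFile` and `latestRelease` correspond to data properties named in the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GHUser:
    login: str  # assumed identifier for a GitHub user


@dataclass
class GHIssue:
    number: int
    owner: GHUser  # hasA relation: the user who reports the issue
    body: str = ""


@dataclass
class GHPullRequest(GHIssue):
    """A PR is modeled as a subclass of GHIssue, mirroring GitHub's convention."""
    merged: bool = False


@dataclass
class GHRepository:
    full_name: str
    contributors: List[GHUser] = field(default_factory=list)
    issues: List[GHIssue] = field(default_factory=list)
    license_file: Optional[str] = None    # stands in for licenseFile
    latest_release: Optional[str] = None  # stands in for latestRelease


def is_contributor(repo: GHRepository, user: GHUser) -> bool:
    """Hypothetical helper: True if the user is an official contributor of repo."""
    return any(c.login == user.login for c in repo.contributors)
```

Because `GHPullRequest` inherits from `GHIssue`, any rule written against issues applies to PRs as well, which is the property the subclass relation is meant to capture.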
4.2. Automatic detection of unethical behavior
We designed Etor to auto-detect six types. We excluded nine types because (1) they involve artifacts (e.g., product names, software features) that are difficult to automatically isolate from other artifacts (i.e., “No opt-in or no option allowed”, “Privacy Violation”, “Naming confusion”, and “Offensive language”), (2) they require sophisticated analysis of configuration files, API or source code (i.e., “Plagiarism”, “Depending on proprietary software”, and “Vulnerable code/API”), (3) their detection requires advanced natural language processing (i.e., “Closing issue/PR without explanation” as it requires automatically checking if the explanation for closing the PR/issue exists), and (4) approaches for “License incompatibility” (German et al., 2010; Kapitsaki et al., 2017; Xu et al., 2021) exist so we exclude it to avoid reinventing the wheel.
Overview of Etor. Figure 4 presents the overall architecture of our automatic detection tool, Etor. Etor supports detection of unethical behavior at two levels: (1) repository (denoted as repo), and (2) GitHub issue/pull request (we denote an issue as issue and a pull request as PR). Given a repo or an issue/PR, and the type of unethical behavior eType to be checked, Etor relies on its set of SWRL rules for its detection, and produces as output whether there is a violation of eType in the given input. Apart from the GitHub attributes in Table 2 that can be obtained using the GitHub API, our SWRL rule reasoner uses two additional components for its detection: (1) a license detector that checks for licenses at the repository level, and (2) a code similarity checker that identifies similar code.
Supported types. Etor supports six types of unethical behavior. We include the SWRL rules for all supported types in the supplementary material. We next describe how Etor checks each supported type.
(S1) No attribution to the author in code. Etor checks if an issue or a PR has a Stack Overflow link representing a reference code, and whether the code snippet copied from Stack Overflow cites the reference link. Although there can be many resources from which stakeholders copy the reference code, Etor only checks for Stack Overflow links because (1) we learned from our study and from existing work (Baltes et al., 2017) that contributors are required to give credit to copied code snippets from Stack Overflow as they are protected by the CC-BY-SA Creative Commons license, and (2) to support other online resources (e.g., GitHub links), we need to automatically extract the original reference code (requires parsing Web pages of different formats), and identify the appropriate license for the code snippet (requires detecting the license for partial code, which is beyond the scope of this paper). Given an issue/PR, Etor checks if a comment b in the issue/PR posted by a stakeholder u1 contains the Stack Overflow link (w) (we use a regular expression to extract w). Etor reports a potential violation if: (1) u1 is not the owner of the Stack Overflow comment, (2) the code snippet from Stack Overflow is found in one of the files in the repository (F) with at least 10% similarity (copyright law permits the use of up to 10% of work without permission (Cop, [n.d.])), and (3) w is not found in F.
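The three conditions above can be sketched in Python as follows. This is an illustration under stated assumptions, not Etor's implementation: the regular expression, the line-overlap similarity measure, and the function names are all hypothetical, since the text does not specify how the link is matched or how similarity is computed.

```python
import re

# Assumed pattern for Stack Overflow question/answer links (w).
SO_LINK = re.compile(r"https?://stackoverflow\.com/(?:questions|q|a)/\d+[^\s)\]]*")


def extract_so_links(comment: str) -> list:
    """Return every Stack Overflow link found in a comment b."""
    return SO_LINK.findall(comment)


def snippet_similarity(snippet: str, file_text: str) -> float:
    """Assumed measure: fraction of non-empty snippet lines that occur in the file."""
    snippet_lines = [ln.strip() for ln in snippet.splitlines() if ln.strip()]
    if not snippet_lines:
        return 0.0
    file_lines = {ln.strip() for ln in file_text.splitlines()}
    matched = sum(1 for ln in snippet_lines if ln in file_lines)
    return matched / len(snippet_lines)


def violates_s1(comment, commenter, so_author, snippet, repo_files, threshold=0.10):
    """Report a potential S1 violation for one comment.

    repo_files maps a file path to its text content (the set F).
    """
    links = extract_so_links(comment)
    if not links or commenter == so_author:  # condition (1)
        return False
    for text in repo_files.values():
        if snippet_similarity(snippet, text) >= threshold:  # condition (2)
            if not any(link in text for link in links):     # condition (3)
                return True
    return False
```

Comparing user names by exact equality, as done here, would miss authors whose GitHub and Stack Overflow names differ, so a practical checker would likely need a looser identity match.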
(S2) Soft forking. Given two repositories r1 and r2, Etor compares the contents of all source files in the two repositories to check if one repository is a soft-fork (the repository has the same content but it is not listed as an official fork of another repository) of another repository. Specifically, we use AC2 (AC2, [n.d.]) to detect the similarities between files. AC2 is a source code plagiarism detection tool that has been widely used by graders to detect plagiarism within a group of assignments. We select AC2 because (1) it supports many programming languages (e.g., C, C++, Java, and PHP), (2) it can be run in a local environment without connection to remote servers, and (3) it is quite robust as it incorporates multiple algorithms found in the literature. Etor reports a violation if it detects: (1) 100% similarity between r1 and r2, and (2) r2 is not in the fork list of r1.
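The soft-fork decision can be illustrated with a small Python sketch. Note that it replaces AC2 with exact content hashing, which only captures the 100% similarity case; the function names and the dictionary-based repository representation are assumptions made for illustration.

```python
import hashlib


def fingerprint(files: dict) -> set:
    """Map a repository's {path: content} dict to a set of (path, digest) pairs."""
    return {
        (path, hashlib.sha256(content.encode("utf-8")).hexdigest())
        for path, content in files.items()
    }


def is_soft_fork(r1_files: dict, r2_files: dict, r2_name: str, r1_forks: set) -> bool:
    """True if r2 has exactly the same source files as r1 but is not an official fork.

    r1_forks is the set of repository names listed as forks of r1.
    """
    if not r1_files:  # nothing to compare
        return False
    identical = fingerprint(r1_files) == fingerprint(r2_files)
    return identical and r2_name not in r1_forks
```

A similarity tool such as AC2 would tolerate renamed identifiers or reordered code, whereas this hashing stand-in treats any single-character difference as a mismatch.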
(S5) No license provided in public repository. Given a repository r, Etor detects the repo-level license by checking if it exists in (1) the LICENSE file (Add, [n.d.]) in the main directory of r (we check only the main directory to avoid mistakenly finding an API license or package license), or (2) the README.md file with license information (we use the list of licenses provided by GitHub (Lic, [n.d.]) for repo-level license detection). Etor reports a potential violation if no license is found after searching the two files.
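A minimal sketch of this repo-level check is shown below. The accepted file names and the short list of license names are illustrative assumptions (the text uses GitHub's full license list, which is not reproduced here), and the repository is represented as a hypothetical `{path: content}` dictionary.

```python
# Illustrative subset only; the full list would come from GitHub's license list.
KNOWN_LICENSE_NAMES = (
    "MIT License",
    "Apache License",
    "GNU General Public License",
    "BSD 3-Clause",
)

# Assumed spellings of the license file in the main directory.
LICENSE_FILENAMES = {"license", "license.md", "license.txt"}


def has_repo_license(files: dict) -> bool:
    """Return True if a repo-level license is found in the main directory.

    files maps a repository-relative path to its text content.
    """
    for path, text in files.items():
        if "/" in path:  # skip nested directories (API or package licenses)
            continue
        name = path.lower()
        if name in LICENSE_FILENAMES and text.strip():
            return True
        if name == "readme.md":
            lowered = text.lower()
            if any(lic.lower() in lowered for lic in KNOWN_LICENSE_NAMES):
                return True
    return False


def violates_s5(files: dict) -> bool:
    """Potential violation: no license found in either file."""
    return not has_repo_license(files)
```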
(S6) Uninformed license change. We consider a change to be uninformed if (1) it is not announced in the CHANGELOG.md or (2) the license change is not done via PR. Given a repository r, Etor checks if the repo-level license has been changed by: (1) extracting commit lists of the license file, and (2) checking if commit changes include license updates. If the license changes occur in more than one commit (we ignore the first commit as it is the initial license creation), Etor checks whether the changes have been announced in the CHANGELOG.md by checking whether the CHANGELOG.md mentions license information. If license information is not found, Etor checks the PR count for the commit (pullRequestCountByCommit). If the count is less than one, Etor marks it as a potential violation.
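The decision flow for this check can be sketched as follows. The function name and input shapes are assumptions: `license_commits` is taken to be the chronological list of commit identifiers that changed the license, and `pr_count_by_commit` is a hypothetical mapping that stands in for the pullRequestCountByCommit attribute.

```python
def uninformed_license_change(license_commits, changelog_text, pr_count_by_commit):
    """Return True if a license change looks uninformed.

    license_commits: commit ids touching the license file, oldest first.
    changelog_text: content of CHANGELOG.md, or None if the file is absent.
    pr_count_by_commit: {commit id: number of PRs associated with the commit}.
    """
    changes = license_commits[1:]  # the first commit is the initial license creation
    if not changes:
        return False
    # Announced in the changelog: treated as informed.
    if changelog_text and "license" in changelog_text.lower():
        return False
    # Otherwise, every change must have gone through at least one PR.
    return any(pr_count_by_commit.get(sha, 0) < 1 for sha in changes)
```

Searching the changelog for the word "license" is a deliberately coarse stand-in for "mentions license information"; it would also accept unrelated mentions of the word.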
(S8) Self-promotion. We consider self-promotion to be the scenario where a contributor u opens a GitHub issue/PR whose content includes links to another repository in GitHub to promote his or her own repository. Given an issue/PR for r1 as input, Etor first (1) checks that the issue/PR includes a link L to another repository r2, and (2) identifies the stakeholder u who opens the issue/PR. Then, it reports a violation if: (1) r1 is not r2, (2) u is not a contributor of r1 (i.e., u is an outsider for r1), and (3) u is a contributor of r2. To reduce false positives, Etor also checks if L includes specific keywords which usually indicate that the contributor is sharing the link L for demonstration purposes (e.g., [DEMO]) instead of promoting a repository/library (“\issues\”, “\pull\”, “\commit\”, “\tree\”, “\releases\”, “\blob\”, and “\runs\”).
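The rule can be illustrated with the Python sketch below. The link pattern, the function name, and the `contributors` mapping are hypothetical, and the excluded keywords are written with forward slashes on the assumption that they are matched against GitHub URL paths.

```python
import re

# Assumed pattern for a link L to a GitHub repository, with an optional sub-path.
REPO_LINK = re.compile(r"https?://github\.com/([\w.-]+)/([\w.-]+)(/[^\s)]*)?")

# Sub-paths that usually indicate a demonstration link rather than promotion.
EXCLUDED_SEGMENTS = (
    "/issues/", "/pull/", "/commit/", "/tree/", "/releases/", "/blob/", "/runs/",
)


def is_self_promotion(body: str, r1: str, opener: str, contributors: dict) -> bool:
    """Check one issue/PR body opened by `opener` in repository r1.

    contributors maps "owner/name" to the set of contributor logins.
    """
    for match in REPO_LINK.finditer(body):
        link = match.group(0)
        if any(segment in link for segment in EXCLUDED_SEGMENTS):
            continue  # link points at an issue, commit, file, etc.
        r2 = f"{match.group(1)}/{match.group(2)}".rstrip(".")
        if (
            r2 != r1                                      # (1) r1 is not r2
            and opener not in contributors.get(r1, set())  # (2) outsider for r1
            and opener in contributors.get(r2, set())      # (3) contributor of r2
        ):
            return True
    return False
```

This sketch cannot tell whether the opener disclosed being a contributor of r2 in the text, which still has to be judged by reading the comment.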
(S9) Unmaintained Android Project with Paid Service. This type checks whether an Android project offers a paid service on Google Play but has stopped actively maintaining the GitHub repository. On average, 115 APIs are updated per month (McDonnell et al., 2013), and 49% of app updates have at least one update within 47 days (McIlroy et al., 2016). Based on this frequency of app updates, we define an unmaintained Android project to be an Android project whose latest update was released more than 0.5 year ago. Given a repository r as input, Etor first checks for unmaintained Android projects by examining whether (1) the latest release date (D) of r is more than 0.5 year ago, and (2) r is an original repository (not forked from other repositories). Then, it checks whether the app offers a paid service by (1) identifying the Google Play link l from r, and (2) searching for the text “in-app purchase”.
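A possible sketch of the S9 check is given below. It assumes that "unmaintained" means the latest release is older than half a year (approximated as 183 days), and that the text of the Google Play page has already been fetched; the function names are hypothetical and no Google Play API is used.

```python
from datetime import datetime, timedelta

HALF_YEAR = timedelta(days=183)  # assumed approximation of 0.5 year


def is_unmaintained(latest_release: datetime, now: datetime, is_fork: bool) -> bool:
    """Original (non-fork) repository whose latest release is older than half a year."""
    return (not is_fork) and (now - latest_release) > HALF_YEAR


def offers_paid_service(play_page_text: str) -> bool:
    """Look for the 'in-app purchase' marker in the Google Play page text."""
    return "in-app purchase" in play_page_text.lower()


def violates_s9(latest_release, now, is_fork, play_page_text) -> bool:
    """Potential violation: unmaintained project that still offers a paid service."""
    return is_unmaintained(latest_release, now, is_fork) and offers_paid_service(
        play_page_text
    )
```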
5. Evaluation
We applied Etor on 195,621 GitHub issues and PRs of 1,765 GitHub repositories to address the following research questions:
**RQ3:** How many unethical issues can Etor detect in OSS projects?
**RQ4:** What are the accuracy and efficiency of Etor in its detection?
By counting the number of unethical issues in OSS projects, RQ3 provides a rough estimation of the prevalence of each type of unethical behavior in OSS projects. For RQ4, we measure the accuracy and efficiency of Etor using the following metrics:
**True Positive (TP):** Etor classifies a behavior as a potential violation, and it is a true violation.
**False Positive (FP):** Etor classifies a behavior as a potential violation, but it is not a true violation.
**Time:** The average time taken (in seconds) to detect a type of unethical behavior across all the evaluated repositories/issues.
Selection of projects/issues. As there is no prior benchmark for evaluating the detection of unethical behavior, we construct a dataset by crawling GitHub. Our goal is to select the most recent popular (most stars and most forks) OSS projects and the GitHub issues/PRs from these projects for evaluation. We first obtain the list of the top 2,000 OSS projects created in 2021 (the top 1,000 projects with the greatest number of stars, followed by the top 1,000 projects with the greatest number of forks). After eliminating duplicated projects, there are 195,621 GitHub issues/PRs of 1,765 projects in our evaluation set. As soft forking requires two repositories as input, we obtain the pair of repositories (r1, r2) by taking r1 from the top 200 projects (first 100 from most stars, subsequent 100 from most forks) of the initial list of 2,000 projects. From these 200 projects, our crawler automatically identifies r2 by searching GitHub for projects with similar names using the name of r1 as the query. At this step, our crawler found only 10 out of the 200 projects that have repositories with similar names. For each of these 10 projects, our crawler retrieves the first 10 repositories from the search results as r2, leading to a total of 10*10=100 projects for evaluating soft forking.
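The two selection steps can be sketched in Python as follows. The `search` argument is a hypothetical callable standing in for a GitHub repository search; no real API call is made, and the function names are illustrative.

```python
def merge_unique(by_stars, by_forks):
    """Concatenate the two ranked lists and drop duplicates, keeping first occurrences."""
    seen = set()
    merged = []
    for name in list(by_stars) + list(by_forks):
        if name not in seen:
            seen.add(name)
            merged.append(name)
    return merged


def soft_fork_pairs(candidates, search, per_query=10):
    """Build (r1, r2) pairs by searching for repositories with a name similar to r1.

    candidates: list of "owner/name" strings used as r1.
    search: hypothetical callable taking a repository name and returning
            a ranked list of "owner/name" results.
    """
    pairs = []
    for r1 in candidates:
        name = r1.split("/")[-1]
        results = [r2 for r2 in search(name) if r2 != r1][:per_query]
        pairs.extend((r1, r2) for r2 in results)
    return pairs
```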
Ethical considerations. Before getting feedback from stakeholders, we obtained IRB approval from our institution. As calling out stakeholders for unethical behavior could potentially lead to ethical concerns similar to those in prior work (Wu and Lu, 2021), we choose to evaluate Etor by (1) manually inspecting the reported issues, and (2) reporting only the types of unethical behavior with high accuracy (based on our manual analysis). To avoid violating ethical principles as in the “hypocrite commits” incident, we explicitly mentioned in each reported issue that we are researchers conducting research on mining software repositories. To reduce author bias in the manual classification of TP/FPs, we asked a non-author to independently label each issue.
All experiments are conducted on a machine with Intel(R) Core (TM) i7-7500 CPU @2.7 GHz and 16 GB RAM.
Implementation. We use Protégé 5.5.0 (Musen, 2015) to define the ontology model. Our crawler uses PyGithub (Git, [n.d.]) for querying GitHub.
5.1. RQ3: Number of detected issues
Table 3 summarizes the results of our evaluation. The “Type” column denotes the types of unethical behavior detected by Etor, whereas the second column reports the number of repositories/issues in which the unethical behavior was detected over the total number of repositories/issues in our evaluation dataset. Overall, Etor has successfully detected at least one violation for all types of unethical behavior that we studied. As our evaluation dataset is different from the study dataset, and we have observed the occurrences of unethical behavior in both datasets, this indicates that different types of unethical behavior are prevalent in OSS projects. Table 3 also shows that “No license provided in public repository” is the most common type among the six types of detected issues. This means that a relatively high percentage of the evaluated repositories are missing license files (around 24% of the evaluated repositories if we exclude the false positives). For the issue-level detection, we observe that “No attribution to the author in code” and “Self-promotion” are the most common ones among all evaluated issues/PRs. This indicates that contributors of OSS projects tend to (1) forget to give credit to the author in their copied code snippets, or (2) promote their own repositories without mentioning that they are contributors to the repositories.
5.2. RQ4: Accuracy and Efficiency of Etor
Accuracy. To evaluate the effectiveness of Etor, two raters (one author, and one non-author who is an undergraduate CS student working as a part-time student assistant) independently inspected its output. Specifically, for each violation reported by Etor, each rater determined if the violation is a true positive (TP) or a false positive (FP). The initial Cohen’s Kappa was 0.82, which indicates a high level of agreement. The two raters then met to resolve any disagreement, reaching a Cohen’s Kappa of 1.0. The “True positive” and “False positive” columns in Table 3 show the results of the inspection. On average, the TP rate is 74.8%, and the FP rate is 24.8%. For repository-level detection, although Etor can only detect a small number of violations for “Soft forking”, it can detect these unethical issues with high accuracy (0% FP rate). As we consider a repository a soft-fork only if all the contents of the two repositories are the same (100% similarity), this design decision may lead to fewer violations being found but increases the accuracy of the detection. In the future, it is worthwhile to study the effect of the similarity threshold on the accuracy of the detection. For issue-level detection, Etor can detect S1 with reasonable accuracy (26% FP rate).
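For readers unfamiliar with the agreement measure, the following is a small Python sketch of Cohen's Kappa for two raters assigning TP/FP labels. The labels in the usage note are made up for illustration and are not the study's data.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's Kappa for two equally long sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement by chance, from each rater's label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = count_a.keys() | count_b.keys()
    expected = sum(count_a[k] * count_b[k] for k in labels) / (n * n)
    if expected == 1.0:  # both raters used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)
```

With the hypothetical labels `["TP", "TP", "FP", "FP"]` and `["TP", "TP", "FP", "TP"]`, the observed agreement is 0.75 and the chance agreement is 0.5, giving a Kappa of 0.5; identical label lists give 1.0.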
Efficiency. The “Time” column in Table 3 shows the average time taken to detect an unethical behavior. Overall, the average time to analyze a repository is 3.1–343.1 seconds and the average time taken to analyze an issue is 4.3–5.4 seconds. This indicates that Etor can detect a type of unethical behavior relatively fast. We also observe that “Soft forking” is the most time-consuming type to detect because Etor needs to check for code similarities for all source files within the repository.
Reasons behind inaccurate detection. We manually inspect the reasons behind the FPs reported by Etor. Etor reports the highest FP rate for “Self-promotion”. Recall that Etor checks that a stakeholder St opens an issue/PR I at repository R1, and includes a link (L) to another repository (R2). A true “Self-promotion” only occurs if St did not mention being a contributor of R2. We need to manually verify this condition by reading the comments written in natural language. Hence, FPs may occur if (1) St mentioned that he or she is a contributor of R2 (e.g., “I am working on a project called the …” in comment (Sel, [n.d.])) or (2) St wanted to ask for suggestions on using R1 for R2 (e.g., “I’d like to try your … module in a non-mmdetection repo (…)” (url, [n.d.]c)).
There are three main reasons for FPs in “No attribution to the author in code”: (1) no actual copying occurs but a link exists (e.g., the Stack Overflow link was mentioned as a reference (url, [n.d.]e)), (2) Etor checks the exact link and fails to detect the citation if it uses the short link of Stack Overflow, and (3) Etor matches the exact GitHub user name with the Stack Overflow user name, and fails to detect the author if the user names differ (e.g., GitHub user name is devinrhode2 and Stack Overflow user name is Devin Rhode (url, [n.d.]b)). For “No license provided in public repository”, FPs occur because the repository (1) has a license file that is not in the main directory (e.g., LICENSE file in the inner folder (url, [n.d.]a)), (2) has a disclaimer in README.md (e.g., “This repository is for personal study and research purposes only. Please DO NOT USE IT FOR COMMERCIAL PURPOSES.” (url, [n.d.]f)), (3) is used for education purposes (we need to manually exclude repositories for the public schools where the license is not required), (4) has no source code or data, or (5) is under an organization license and no separate license is defined for the repository (url, [n.d.]d). For “Uninformed license change”, FPs occur because the scenario where the repository has restored the old license should not be considered a violation (e.g., the stakeholder changed the license from “Apache License Version 2.0” to “GNU GENERAL PUBLIC LICENSE Version 3” on Feb 17, and restored it to “Apache License Version 2.0” on Feb 18). For “Unmaintained Android Project with Paid Service”, we found one FP because the unmaintained project is a library that an app uses instead of the app itself, whereas the app is actively maintained (a new version was recently released).
Stakeholders’ feedback. Apart from manually labeling the unethical issues, we also obtained qualitative feedback by reporting them to stakeholders of OSS projects. To avoid spamming OSS developers with inaccurate results, we only reported the types of unethical behavior with ≥80% TP rate in our manual analysis (i.e., S2, S5, S6). For each of these reported types, we opened a GitHub issue to developers when both raters labeled it as TP. We excluded 39 issues because the project owners have disabled GitHub issues (this usually indicates that they do not accept contributions or bug reports (dis, [n.d.])). For example, the repository (noi, [n.d.]) violates the “No license provided in public repository” rule but we cannot report this to the project owner as GitHub issues have been disabled. We also found 19 issues that had already been reported and fixed before we filed a bug report. In total, we have reported 392 issues, and received 83 replies (a response rate of 21.17%) from stakeholders. We carefully looked through all those responses and identified 68 (81.93%) replies as valid and 15 (18.07%) responses as invalid. Among these 68 valid replies, 39 (57.35%) have been fixed, and 29 (42.65%) have been accepted by the stakeholders of the OSS. An example of valid feedback that we received is “Thank you very much for the warning. I have already added the license to the repos that didn’t have it.”. For the 15 responses that we considered as invalid, developers (1) directly deleted or closed our submitted issues without any explanations (7/15), (2) thought that the issue reporter is a software bot although we have created the issue manually and explicitly mentioned in the issue that we are researchers (5/15), (3) are not interested in getting any GitHub issues (e.g., claiming that the repository is personal) (2/15), and (4) explained that “Software is not open source but everyone or you can use my soft without license. thank you for support my soft” (1/15).
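The reported rates follow directly from the counts in the paragraph above, as this short Python check illustrates (the helper name `pct` is ours):

```python
def pct(part: int, whole: int) -> float:
    """Percentage of `part` in `whole`, rounded to two decimals."""
    return round(100 * part / whole, 2)


reported, replies = 392, 83
valid, invalid = 68, 15
fixed, accepted = 39, 29

response_rate = pct(replies, reported)  # share of reported issues that got a reply
valid_rate = pct(valid, replies)        # share of replies judged valid
invalid_rate = pct(invalid, replies)    # share of replies judged invalid
fixed_rate = pct(fixed, valid)          # share of valid replies already fixed
accepted_rate = pct(accepted, valid)    # share of valid replies accepted
```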
6. Discussion and Implications
We discuss practical takeaways and suggestions below:
Implications for stakeholders of OSS projects. By reading issues that stakeholders in OSS considered as “unethical behavior”, our study revealed that the types of unethical behavior in OSS projects are diverse (Finding 1), suggesting that stakeholders of OSS projects should be better educated to create awareness of the different types of unethical behavior when contributing to OSS projects to avoid violating ethical principles. Apart from general types of unethical behavior, our study also pinpoints six new types of unethical behavior in OSS projects (i.e., (S2) Soft forking, (S6) Uninformed license change, (S8) Self-promotion, (S9) Unmaintained project with paid service, (S11) Naming confusion, and (S12) Closing issue/PR without explanation). Some of them are related to the unique features of GitHub (e.g., “Soft forking” represents ethical concerns when forking, “Closing issue/PR without explanation” is related to closing GitHub issues/PRs, “Self-promotion” occurs due to the need to promote the popularity of one’s new repository, whereas “Unmaintained project with paid service” denotes the responsibility of an OSS project owner to actively maintain the project to support paid users). The identified new types call for considerations of the unique context of OSS projects when studying unethical behavior. Meanwhile, although most software development efforts focus on source code maintenance, our study urges OSS project owners to be responsible for the selection of product names to avoid “Naming confusion”. As issues related to copyright and licensing are the most common ones (Finding 2), contributors of OSS projects should pay more attention to giving appropriate credit and selecting suitable software licenses when copying software artifacts or using libraries.
Meanwhile, although source code is still the most common affected software artifact (Finding 4), our study urges OSS stakeholders to be responsible for various types of software artifacts (Finding 3) to avoid violating ethical principles when uploading them to GitHub.
Implications for researchers and tool designers. As many of the identified types of unethical behavior (Finding 1) are ethical issues that frequently occur in daily life, our study provides empirical evidence that there exist some overlaps between the types that occur in general settings (e.g., “Plagiarism” and “Offensive language”) and those that are deemed as unethical behavior by stakeholders in OSS projects. Indeed, the prevalence of plagiarism is in line with a prior study which reported the prevalence of code borrowing practices in GitHub (Golubev et al., 2020). Due to the diverse types of unethical behavior, future empirical research should advance beyond the general types of unethical behavior.
While existing work mostly focuses on license incompatibility (German et al., 2010; Kapitsaki et al., 2017; Xu et al., 2021), our study found new issues related to licensing (e.g., “Uninformed license change”). As these issues still occur frequently (Finding 2) and our study identified new types of issues, our study provides empirical motivations for improving techniques related to copyright and software licenses. For the newly identified types of unethical behavior, we foresee a huge potential for future research in: (1) conducting more in-depth studies of the motivations and the common solutions behind each type of unethical behavior, and (2) introducing automated techniques that can detect and possibly resolve these issues. We believe that our taxonomy of 316 GitHub issues and our tool that uses software artifacts and data available in the GitHub API lay the foundation for future approaches on automated detection of unethical behavior. Although source code is still the most common type of affected software artifact (Finding 4), other artifacts in natural language (e.g., PR/Issue comments, product names, and websites) are also common in our study (Finding 3). A promising research direction is to apply natural language processing techniques to accurately detect affected software artifacts in natural language. For example, techniques can be designed to automatically extract and recommend descriptive yet non-conflicting names (e.g., package names) to avoid “Naming confusion”. Another future direction is to design techniques that can automatically identify disclaimer-like statements to accurately detect “Closing issue/PR without explanation” (to detect the explanation for the PR/issue), and “Self-promotion” (to extract statements where the stakeholder has mentioned being a contributor).
Challenges in automated detection of unethical behavior. To provide guidelines for future research on the automated detection of unethical behavior, we discuss several challenges identified in our study and evaluation:
- As shown in our study in Section 3.2, the types of artifacts affected by the unethical behavior are highly diverse. An accurate detection technique needs to support analysis of various types of artifacts, including source code, data, and websites.
- Within GitHub, we notice that discussions and announcements are spread across multiple web pages (issues, PRs, wikis, discussions, and commit logs). The rapid growth of different types of web pages in GitHub poses additional challenges for automated approaches that aim to exhaustively analyze all relevant web pages.
- Some discussions of unethical behavior occur outside of GitHub (e.g., personal emails, Slack channels). For example, for “Self-promotion”, we cannot check whether the stakeholder has communicated with the developers in advance through emails. Without complete information about the discussion, the detection is bound to be inaccurate.
- The scope of detection can be too broad for some types of unethical behavior (e.g., “Naming confusion”). Without a predefined scope of detection (package name collision versus app name collision), we cannot accurately detect the behavior.
- Certain unethical behavior is ambiguous, which makes it difficult even for human beings to reach consensus (e.g., whether the language used is offensive). In this case, an automated tool can present all relevant information to help stakeholders make more grounded ethical decisions.
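To make the scoping challenge for “Naming confusion” concrete, the sketch below checks a candidate name only against names within an explicitly chosen scope. Everything in it is an assumption for illustration: the `KNOWN_NAMES` registry, the `find_confusable` function, the normalization rule, and the 0.85 similarity threshold are hypothetical and are not taken from Etor.

```python
import re
from difflib import SequenceMatcher

# Hypothetical registry of existing names, grouped by detection scope.
KNOWN_NAMES = {
    "package": ["requests", "py-requests", "flask"],
    "app": ["Requests Manager", "Flask Notes"],
}


def normalize(name: str) -> str:
    """Lowercase the name and drop separators so 'py-requests' equals 'pyrequests'."""
    return re.sub(r"[-_.\s]+", "", name.lower())


def find_confusable(candidate, scope, registry=KNOWN_NAMES, threshold=0.85):
    """Return (name, similarity) pairs in the given scope that resemble the candidate."""
    if scope not in registry:
        raise ValueError(f"unknown detection scope: {scope!r}")
    cand = normalize(candidate)
    hits = []
    for name in registry[scope]:
        score = SequenceMatcher(None, cand, normalize(name)).ratio()
        if score >= threshold:
            hits.append((name, round(score, 2)))
    # Most similar names first.
    return sorted(hits, key=lambda hit: -hit[1])
```

With this toy registry, the candidate `pyrequests` collides with existing names in the `package` scope but with none in the `app` scope, so the outcome depends entirely on which scope the caller selects.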
7. Threats to validity
External. Our findings of unethical behavior may not generalize beyond the studied OSS projects and issues/PRs. There could be unethical behavior that is not reported to the issue tracker. Unfortunately, there is no conceivable way to study these unreported issues. As some issues may not contain the ethics-related keywords that we used for searching, we could have also missed some unethical behavior. Nevertheless, our selected keywords already helped us discover many types of unethical behavior. Hence, we believe the issues in our study provide a representative sample of the reported and resolved unethical issues in our studied repositories. While the other types of unethical behavior discovered in our study are important, Etor can only detect six of them, and our evaluation is limited to these six types. Nevertheless, our experiments show that Etor can detect unethical behavior with relatively high accuracy (a 74.8% TP rate on average).
Internal. Our code and scripts may have bugs that can affect our results. To mitigate this threat, we make our tool and data publicly available for inspection.
8. Conclusion
To better understand unethical behavior in OSS projects, we conduct a study of the types of unethical behavior in OSS projects. By reading and analyzing the discussions of stakeholders in OSS projects, our study of 316 GitHub issues identifies 15 types of unethical behavior. These unethical behaviors affect various types of software artifacts. Inspired by our study, we propose Etor, an ontology-based approach that can automatically detect unethical behavior. Our evaluation of Etor on 195,621 issues (1,765 repositories) shows that Etor can automatically detect 548 issues with a 74.8% TP rate on average. As the first study that investigates the types of unethical behavior in OSS projects, we hope to raise awareness among OSS stakeholders regarding the importance of understanding ethical issues in OSS projects. While Etor shows promising results in automated detection of unethical behavior in OSS projects, we plan to enhance Etor in the future to detect more types and reduce false positives using machine learning techniques.
References
- S8. [n.d.]. https://github.com/eslint/eslint/pull/15102
- RN2. [n.d.]a. https://www.w3.org/2001/sw/#owl
- RN2. [n.d.]b. http://www.w3.org/Submission/SWRL/
- Inv. [n.d.]. https://github.com/Pryaxis/handbook/issues/3
- S1. [n.d.]. https://github.com/novus-package-manager/novus/issues/3
- S2. [n.d.]. https://github.com/biddyweb/yes-cart/issues/33
- S3. [n.d.]. https://github.com/CircuitVerse/Interactive-Book/issues/80
