Papers We Love (FOSS Edition)

Papers We Love is a repository of academic computer science papers and a community that loves reading them, but I’m co-opting the term to also refer to FOSS-related papers. I just read the Harvard Business School Strategy Unit Working Paper on The Value of Open Source Software and wanted to share a few personal highlights from it. A number of references within the paper also look interesting, and I want to add highlights from those papers to this topic in the future.

I would love it if the community also shared papers they love in this topic.

Harvard Business School Strategy Unit Working Paper on The Value of Open Source Software by Manuel Hoffmann, Frank Nagle, Yanuo Zhou

The key highlights from the paper are

We estimate the supply-side value of widely-used OSS is $4.15 billion, but that the demand-side value is much larger at $8.8 trillion

Further, 96% of the demand-side value is created by only 5% of OSS developers

Why studying the value of FOSS is important

The parallels between shared grazing lands and shared digital infrastructure are palpable – the availability of communal grass to feed cattle, and in turn feed people, was critical to the agrarian economy, and the ability to not have to recreate code that someone else has already written is critical to the modern economy

Ammunition for FOSS advocates

Other recent studies have come to similar conclusions showing that open source software (OSS) appears in 96% of codebases (Synopsys 2023), and that some commercial software consists of up to 99.9% freely available OSS (Musseau et al., 2022)

With data from the United States the resulting estimates show a value of $2 billion for the OSS Apache Web Server in 2012 (Greenstein and Nagle, 2014) and a combined value of $4.5 billion for Apache and the increasingly popular OSS web server nginx in 2018 (Murciano-Goroff, et al., 2021)

We find a value ranging from $1.22 billion to $6.22 billion if we were to decide as a society to recreate all widely used OSS on the supply side. However, considering the actual usage of OSS leads to a demand-side value that is orders of magnitude larger and ranges from $2.59 trillion to $13.18 trillion, if each firm who used an OSS package had to recreate it from scratch (e.g., the concept of OSS did not exist). … However, as for any project, the evidence is not complete and we argue that we underestimate the value since our data, e.g., does not include operating systems, which are a substantial omitted category of OSS
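To make the gap between the two figures concrete, here is a minimal sketch of the counting logic as I read it from the quote above: the supply side counts the cost of recreating each package once, while the demand side counts that cost once for every firm that uses the package. The package names, costs and firm counts are made up for illustration, and the paper's actual cost model is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Package:
    name: str
    rebuild_cost_usd: float  # hypothetical cost of rewriting the package once
    firms_using: int         # hypothetical number of firms that use it


def supply_side_value(packages):
    """Cost of recreating every package exactly once."""
    return sum(p.rebuild_cost_usd for p in packages)


def demand_side_value(packages):
    """Cost if every firm using a package had to recreate it on its own."""
    return sum(p.rebuild_cost_usd * p.firms_using for p in packages)


# Illustrative numbers only, not taken from the paper.
packages = [
    Package("web-framework", 2_000_000.0, 500),
    Package("json-parser", 150_000.0, 4_000),
    Package("niche-tool", 800_000.0, 3),
]

supply = supply_side_value(packages)
demand = demand_side_value(packages)
print(f"supply side: ${supply:,.0f}")
print(f"demand side: ${demand:,.0f}")
print(f"ratio: {demand / supply:,.0f}x")
```

Even with three toy packages the demand-side total is hundreds of times the supply-side total, because widely used packages are multiplied by their number of users.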

Not so great assumptions

Here, we do not incorporate consumption externalities, i.e., we do not allow a benefit to arise for the general public when a package has been created and we further make sure that each firm is only replacing a package they use once, since a replaced package can be used within a firm as a club good (e.g., see Cornes and Sandler, 1996).

For large firms, there will be overhead coordination costs associated with building and maintaining a club good (an internal package). This potentially means that the demand-side figure of $8.8 trillion is a lower bound.

In this calculation, we implicitly do not incorporate any production externalities since we assume that there is no spillover knowledge from one package to the next that would lower the cost of programming.

This too we know to be false. Packaging and project management add considerable overhead to a software project, and spillover knowledge definitely reduces the cost of programming as the developer becomes comfortable with those aspects of a project over time. This potentially means that the supply-side figure of $4.15 billion is an upper bound.

At the repository level, we quantified each developer’s proportional work contribution by calculating their share of commits to the total number of commits for a repository

Commits aren’t the best indicator of a developer’s work contribution to a FOSS project: what if the project squash-merges a large feature branch that contained tens or hundreds of commits? Lines of code aren’t a great indicator either: a complicated bug that takes tens of hours to debug might be fixed by a one-line change. There are no clean or easy indicators of a developer’s work contribution, so such assumptions are inevitable.
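Here is a small sketch of the commit-share measure from the quote, together with the squash-merge distortion I mentioned. The author names and commit counts are hypothetical, and this is only my reading of "share of commits", not the paper's actual pipeline.

```python
from collections import Counter


def commit_shares(commit_authors):
    """Return each author's share of the total number of commits."""
    counts = Counter(commit_authors)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {author: n / total for author, n in counts.items()}


# Hypothetical history: bob's feature branch has 18 commits, alice made 2 fixes.
merge_commit_history = ["alice"] * 2 + ["bob"] * 18
# The same work, but the feature branch was squashed into a single commit.
squash_history = ["alice"] * 2 + ["bob"] * 1

before = commit_shares(merge_commit_history)
after = commit_shares(squash_history)
print(f"bob without squash: {before['bob']:.0%}")
print(f"bob with squash:    {after['bob']:.0%}")
```

Bob did the same work in both histories, yet his measured share drops from 90% to about a third once the branch is squashed.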

Unexpected (to me) findings

We find that OSS packages created in Go have the highest value with $803 million in value that would have to be created from scratch if the OSS packages did not exist. Go is closely followed by JavaScript and Java with $758 million and $658 million, respectively. The value of C and Typescript is $406 million and $317 million, respectively, while Python has the lowest value of the top languages with around $55 million

Potential growth areas for FOSS

The industry with the highest usage value of around $43 billion is “Professional, Scientific, and Technical Services.” “Retail Trade” as well as “Administrative and Support and Waste Management and Remediation Services” make up another large part of the demand-side externally facing value of OSS with $36 billion and $35 billion, respectively. In contrast, industries that constitute just a small portion of the value are “Mining, Quarrying, and Oil and Gas Extraction”, “Utilities”, “Agriculture, Forestry, Fishing, and Hunting.” The latter industries are classical non-service sector industries and as such software is expected to play less of a role there.

I’ve been singing this song for a while but now I have the evidence to back it up - we should be advocating for people to apply computing to their domains instead of expecting them to abandon their domains to become generic software developers.

Giants in the FOSS ecosystem

Indeed, the last five percent of programmers, or 3,000 programmers, generate over 93% of the supply side value. Similarly, Panel B shows – when accounting for usage – that those last five percent generate over 96% of the demand side value.
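For anyone who wants to check this kind of concentration on their own project data, here is a rough sketch of a "top 5% share" calculation. The per-developer values are invented so that the result lands at 96%; they are not the paper's data, and the function name is my own.

```python
def top_share(values, top_percent=5):
    """Share of the total contributed by the top `top_percent` percent of entries."""
    if not values:
        return 0.0
    ranked = sorted(values, reverse=True)
    # Integer ceiling of len * top_percent / 100, keeping at least one entry.
    k = max(1, (len(ranked) * top_percent + 99) // 100)
    total = sum(ranked)
    return sum(ranked[:k]) / total if total else 0.0


# Hypothetical value attributed to 100 developers: 5 giants and 95 others.
developer_values = [1824] * 5 + [4] * 95
share = top_share(developer_values)
print(f"top 5% of developers account for {share:.0%} of the value")
```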

“Do Software Developers Understand Open Source Licenses?” published in 2017 IEEE/ACM 25th International Conference on Program Comprehension (ICPC). See IEEE link and PDF link shared by one of the authors.

The key highlight from the paper is

The 375 respondents to the survey, who were largely developers, gave answers consistent with those of a legal expert’s opinion in 62% of 42 cases. Although developers clearly understood cases involving one license, they struggled when multiple licenses were involved

Survey

seven hypothetical software development scenarios, some of which included multiple license combination

I highly recommend checking out the survey questions first, before looking at the legal answers to them in Table III of the paper - https://www.cs.ubc.ca/labs/spl/projects/softwarelicensing/resources/UBC_SPL_software_licensing_survey.pdf

Key observations

  1. Developers cope well with single licenses even in complex scenarios
  2. Developers have difficulty interpreting which actions are allowed in scenarios where more than one open source license is in use.
  3. Developers understand technical decisions will impact open source license use
  4. Developers recognize that there are interactions between open source licenses, but those interactions were not always correctly interpreted
  5. Questions that arise about the use of multiple open source licenses are situationally dependent.
  6. A number of developers lack knowledge of the details of open source licenses.

Survey underestimates the problem

In particular, we observed that the large majority of the participants (85.3%) had chosen a software project’s license before, which might be an instance of self-selection bias.

I came across this paper when I was looking at the website of Prof. Philip Guo. Prof. Guo is renowned in Computer Human Interaction (CHI) circles for his work on pythontutor.com. His other papers are definitely worth checking out, e.g. Ten Million Users and Ten Years Later: Python Tutor’s Design Guidelines for Building Scalable and Sustainable Research Software in Academia

In this lovely little paper, the authors look at the use of ASCII diagrams in source code, specifically in 4 projects - Linux (OS), LLVM (compiler), Chromium (browser), and TensorFlow (ML).

TL;DR - The diagrams that they dug up from the codebases and the design space they synthesized from them can be found at https://asciidiagrams.github.io/. Also note that there are 428 ASCII diagrams in Chromium, 1386 in Linux, 220 in LLVM, and 122 in TensorFlow.

Abstract : Documentation in codebases facilitates knowledge transfer. But tools for programming are largely text-based, and so developers resort to creating ASCII diagrams—graphical artifacts approximated with text—to show visual ideas within their code. Despite real-world use, little is known about these diagrams. We interviewed nine authors of ASCII diagrams, learning why they use ASCII and what roles the diagrams play. We also compile and analyze a corpus of 507 ASCII diagrams from four open source projects, deriving a design space with seven dimensions that classify what these diagrams show, how they show it, and ways they connect to code. These investigations reveal that ASCII diagrams are professional artifacts used across many steps in the development lifecycle, diverse in role and content, and used because they visualize ideas within the variety of programming tools in use. Our findings highlight the importance of visualization within code and lay a foundation for future programming tools that tightly couple text and graphics.

The researchers asked three research questions

RQ1 Characteristics. What are the key characteristics of the media of ASCII diagrams?
RQ2 Roles. How are ASCII diagrams used in the software development workflow?
RQ3 Content. What do ASCII diagrams show and how do they show it?

The first author, leveraging their initial enthusiasm for this project, scrolled through the entirety of the four files and manually extracted the comments that resembled ASCII drawings, resulting in a selection of 2,162 ASCII diagrams.

🤣

P2 gladly paid $10 for Monodraw, a diagram editor specifically for ASCII. It allowed them to edit the diagram after-the-fact, provided they had saved the Monodraw file, which proved helpful in a codebase where many contributors were authoring these diagrams: “the best $10 I could ever spend was having Monodraw…if I want to like enlarge a box, I can enlarge a box, I can move things around…I usually keep the Monodraw files sitting around so that if someone’s gonna come along and ‘Oh I’m gonna have to update this diagram.’”

Ref https://monodraw.helftone.com/

As an alternative to ASCII, both P1 and P2 mentioned Mermaid [61], a simple textual language for describing and rendering diagrams. The popular code collaboration site GitHub renders Mermaid automatically, so P1 now uses it instead of ASCII art when writing README files. P2’s team also uses Mermaid regularly on GitHub, particularly for charts that are updated often, as updating ASCII is tedious. Nevertheless, P2 remarked that code diffs of Mermaid were hard to understand and P1 still used ASCII in tools that didn’t support Mermaid (e.g. terminals and text editors).

I’ve used Mermaid a bunch of times, especially on GitHub since it now renders Mermaid diagrams natively, but as the quote above notes, the source still needs to be rendered, and unrendered it isn’t as visual as an ASCII diagram.

a birdseye overview of how they used diagrams
(1) to reify offline work,
(2) to illustrate test cases,
(3) for code review and verification from colleagues, and to document both (4) for others and (5) themselves

Finally, here are a few diagrams from the https://asciidiagrams.github.io/ dataset that I find interesting

// Note that speech recognition is activated on VR UI thread. This means it
// usually involves 3 threads. In the simplest case, the thread communication
// looks like the following:
//     VR UI thread        Browser thread         IO thread
//          |                    |                    |
//          |----ActivateVS----->|                    |
//          |                    |------Start------>  |
//          |                    |                    |
//          |                    |<-NotifyStateChange-|
//          |<--OnSRStateChanged-|                    |
//          |                    |                    |
//          |                    |<--OnSpeechResult---|
//          |<--OnSRStateChanged-|                    |
//          |                 navigate                |
//          |                    |                    |
// VS = voice search, SR = speech recognition 

from chromium

 /*
 * +------------+---------------------------------------------------+
 * |   PHASE    |           FIRMWARE STATUS TRANSITIONS             |
 * +============+===================================================+
 * |            |               UNINITIALIZED                       |
 * +------------+-               /   |   \                         -+
 * |            |   DISABLED <--/    |    \--> NOT_SUPPORTED        |
 * | init_early |                    V                              |
 * |            |                 SELECTED                          |
 * +------------+-               /   |   \                         -+
 * |            |    MISSING <--/    |    \--> ERROR                |
 * |   fetch    |                    V                              |
 * |            |                 AVAILABLE                         |
 * +------------+-                   |   \                         -+
 * |            |                    |    \--> INIT FAIL            |
 * |   init     |                    V                              |
 * |            |        /------> LOADABLE <----<-----------\       |
 * +------------+-       \         /    \        \           \     -+
 * |            |    LOAD FAIL <--<      \--> TRANSFERRED     \     |
 * |   upload   |                  \           /   \          /     |
 * |            |                   \---------/     \--> RUNNING    |
 * +------------+---------------------------------------------------+
 */

from linux

// An irreducible SCC is one which has multiple "header" blocks, i.e., blocks
// with control-flow edges incident from outside the SCC.  This pass converts a
// irreducible SCC into a natural loop by applying the following transformation:
//
// 1. Collect the set of headers H of the SCC.
// 2. Collect the set of predecessors P of these headers. These may be inside as
//    well as outside the SCC.
// 3. Create block N and redirect every edge from set P to set H through N.
//
// This converts the SCC into a natural loop with N as the header: N is the only
// block with edges incident from outside the SCC, and all backedges in the SCC
// are incident on N, i.e., for every backedge, the head now dominates the tail.
//
// INPUT CFG: The blocks A and B form an irreducible loop with two headers.
//
//                        Entry
//                       /     \
//                      v       v
//                      A ----> B
//                      ^      /|
//                       `----' |
//                              v
//                             Exit
//
// OUTPUT CFG: Edges incident on A and B are now redirected through a
// new block N, forming a natural loop consisting of N, A and B.
//
//                        Entry
//                          |
//                          v
//                    .---> N <---.
//                   /     / \     \
//                  |     /   \     |
//                  \    v     v    /
//                   `-- A     B --'
//                             |
//                             v
//                            Exit
//
// The transformation is applied to every maximal SCC that is not already
// recognized as a loop. The pass operates on all maximal SCCs found in the
// function body outside of any loop, as well as those found inside each loop,
// including inside any newly created loops. This ensures that any SCC hidden
// inside a maximal SCC is also transformed. 

from llvm

// Computes a dot product between "[M,K]{0,1} lhs" with a [K,1] vector (the
// layout of the vector does not matter).  This implementation uses a tiling
// scheme to improve performance.
//
// We logically separate the LHS matrix into four segments:
//
//   +----------------------+---+
//   |                      |   |
//   |                      |   |
//   |         A            | B |
//   |                      |   |
//   |                      |   |
//   |                      |   |
//   +----------------------+---+
//   |         C            | D |
//   +----------------------+---+
//
// where A is the largest submatrix of the LHS that can be evenly divided into
// tiles.  For each tile in A, assuming tile_rows_ == tile_cols_ == 4, we have:
//
//   +---+---+---+---+       +--+--+--+--+
//   |M00|M10|M20|M30|       |V0|V1|V2|V3|
//   +---+---+---+---+       +--+--+--+--+
//   |M01|M11|M21|M31| and   |V0|V1|V2|V3|
//   +---+---+---+---+       +--+--+--+--+
//   |M02|M12|M22|M32|       |V0|V1|V2|V3|
//   +---+---+---+---+       +--+--+--+--+
//   |M03|M13|M23|M33|       |V0|V1|V2|V3|
//   +---+---+---+---+       +--+--+--+--+
//
// (Legend: rows are horizontal and columns are vertical; and each column is one
// llvm::Value of a vector type)
//
// where:
//
//   a. The left tile is from the column major left matrix.
//   b. The right tile is an elementwise broadcast of a [V0, V1, V2, V3]
//      vector loaded from the RHS vector.
//
// As we iterate through the column dimension, we compute the change to the
// result vector by an elementwise multiplication between the two tiles above
// followed by a reduction along the major dimension:
//
//                     +-----------------------------------+
//                     | M00*V0 + M10*V1 + M20*V2 + M30*V3 |
//                     +-----------------------------------+
//                     | M01*V0 + M11*V1 + M21*V2 + M31*V3 |
// Result[R:R+4] +=    +-----------------------------------+
//                     | M02*V0 + M12*V1 + M22*V2 + M32*V3 |
//                     +-----------------------------------+
//                     | M03*V0 + M13*V1 + M23*V2 + M33*V3 |
//                     +-----------------------------------+
//
// Where R is the starting row for the tile.
//
// We have an inner epilogue loop to deal with the "C" submatrix and an outer
// epilogue loop to deal with the B,D submatrix.
//
// TODO(sanjoy): We should investigate if using gather loads and scatter stores
// can be used here have the same inner loop for both column-major and row-major
// matrix-vector products.

from tensorflow

“Did You Miss My Comment or What?”
Understanding Toxicity in Open Source Discussions
by Courtney Miller, Sophie Cohen, Daniel Klug, Bogdan Vasilescu, Christian Kästner


I came across this paper a little while ago after having a frustrating discussion on the FOSS United Telegram. I am also part of the discussion channels of a few open source projects, where I regularly notice a lot of impolite back-and-forth among maintainers, users and contributors, which in my experience is unique to OSS forums.

In this paper the authors try to understand online toxicity in open source communities. The kinds, reasons and effects of toxicity across the internet are well documented, but toxicity specifically within OSS communities is not very well understood.

To this end, we curate a sample of 100 toxic GitHub issue discussions combining multiple search and sampling strategies. We then qualitatively analyze the sample to gain an understanding of the characteristics of open-source toxicity. We find that the pervasive forms of toxicity in open source differ from those observed on other platforms like Reddit or Wikipedia.

From an interview -

They used a toxicity and politeness detector developed for another platform to scan nearly 28 million posts on GitHub made between March and May 2020. The team also searched these posts for “code of conduct” — a phrase often invoked when reacting to toxic content — and looked for locked or deleted issues, which can also be a sign of toxicity.
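The sampling signals described in the excerpt above could be combined roughly like this. It is only a sketch under my own assumptions: the `Issue` fields, the `max_toxicity_score` value (standing in for whatever detector was used) and the 0.8 threshold are all hypothetical and are not taken from the paper or from the GitHub API.

```python
from dataclasses import dataclass, field


@dataclass
class Issue:
    title: str
    comments: list = field(default_factory=list)
    locked: bool = False
    deleted: bool = False
    max_toxicity_score: float = 0.0  # hypothetical detector output in [0, 1]


def candidate_reasons(issue, score_threshold=0.8):
    """List the signals that make an issue worth a manual look."""
    reasons = []
    if issue.max_toxicity_score >= score_threshold:
        reasons.append("detector")
    if any("code of conduct" in c.lower() for c in issue.comments):
        reasons.append("code of conduct")
    if issue.locked:
        reasons.append("locked")
    if issue.deleted:
        reasons.append("deleted")
    return reasons


# Made-up issues for illustration.
issues = [
    Issue("Crash on start", ["It crashes", "Please follow our Code of Conduct."], locked=True),
    Issue("Feature request", ["Would be nice to have dark mode"]),
    Issue("Rant", ["..."], deleted=True, max_toxicity_score=0.93),
]

for issue in issues:
    print(issue.title, "->", candidate_reasons(issue) or "not a candidate")
```

None of these signals proves toxicity on its own, which is presumably why the authors still analysed their sample qualitatively.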

In our sample, some of the most prevalent forms of toxicity are entitled, demanding, and arrogant comments from project users as well as insults arising from technical disagreements. In addition, not all toxicity was written by people external to the projects; project members were also common authors of toxicity.

Open Source and Toxicity

What is toxicity in the context of online discussions?

Toxicity, defined here as “rude, disrespectful, or unreasonable language that is likely to make someone leave a discussion” is a huge problem online

Open source communities are not immune to toxicity. While the term “toxicity” as defined above has only recently started being used in the open-source literature, the presence of behaviors “likely to make someone leave” have long been documented by researchers and practitioners in this space. For example, the Linux Kernel Mailing List is notorious for having discussions with a tone that “tends to discourage people from joining the community”

Linus Torvalds acknowledges that he has at times been “overly impolite” and that is “a personal failing.”

Toxicity is also a major threat to diversity and inclusion: prior work has found that it can especially impact members of certain identity groups, particularly women, who are already severely underrepresented.

I remember wanting to contribute to open source a little more than a year back, and the tone of the day-to-day discussions was a major factor for me in choosing which project I would like to be involved with.

Identified characteristics of open source toxicity

Unlike some other platforms where the most frequent types of toxicity are hate speech or harassment, we find entitlement, insults, and arrogance are among the most common types of toxicity in open source. We also learned that many of the ways projects address toxicity are closely connected to the GitHub interface itself and open source culture more broadly, such as locking issues as too heated or invoking a project’s code of conduct.

I tried to find any blog posts from GitHub on the measures they take to handle toxicity but couldn’t find anything.

Edit: GitHub Community Guidelines - GitHub Docs

Another ex-open source community leader, when explaining why they quit, described how “I had been told that I needed a ‘tough skin’ to work in the community, and I needed to ‘not take it personally’ when developers were abrasive during code review”.

Toxicity in open source is often written off as a naturally occurring if not necessary facet of open source culture. The aforementioned community leader describes how “When I complained about the toxic environment, I was told it was ‘tradition’ to be blunt and rude in order to have a truly open dialogue.”

Four high ranking Perl community members stepped down due to community-related issues. One of them, when elected as community leader in April 2016, set the goal to make the mailing list “a place on which we can have technical conversations without worrying about abusive language or behavior”; however, in April 2021, he stepped down explaining how the “chain of continuous bullying and hostility I’ve been receiving” has caused him “significant emotional distress”.

(Types of) Toxicity observed on GitHub

Insults

Over half of our sample contained insults (55 cases), i.e., disrespectful or scornful expressions, often using curse words or intentionally offensive language. Toxic insulting comments tend to be targeted at people rather than at the code itself.

This is interesting because I’ve heard a lot of seasoned open-sourcerers tell new contributors that they should never take criticism personally and that it’s always about the code, not the person. The Contributor Covenant, which is one of the most widely adopted CoCs, explicitly states that criticism should be constructive.

For example, a user of a GUI crypto wallet with a built-in crypto miner noticed the presence of the miner and interpreted it as malware (a misunderstanding, the presence, deactivated by default, was mentioned as an intentional feature in the readme). The user threw explicit curse words at the maintainers of the project and accused them of being “criminal crooks” for trying to “infect other computers with malware”

A project member was unhappy with the colors of a project, reporting “colors are horrible for […], just look at this s**t”. Even after a contributor provided a link to the documentation, the user remained unsatisfied and unapologetic.

Entitled

Entitled comments make demands of people or projects as if the author had an expectation due to a contractual relationship or payment.

A user, upon being told that their suggestion was based on a misunderstanding of the project, began aggressively criticizing the contributor for how they addressed the issue, saying “Like just add the flavor text or show me how to or something. Don’t just fu**ing close people’s tickets they would like some help on”

Arrogant

We consider comments as arrogant when the author imposes their view on others from a position of perceived authority or superiority (earned or not) and demands that others act as recommended

One of the users in the discussion was unfamiliar with some of the legislation being discussed and asked for more information, a second user responded saying “Never hear about [standard]? A baseline for developers. Use Google.”

Trolling

For example, a user was generally unhappy with a project and wrote “Worst. App. Ever. Please make it not the worst app ever. Thanks” (I2), followed by a pull request that deleted all the code in the repo; after the maintainer closed the issue, the user responded “Merge my PR damnit” and nothing else happened.

Unprofessional

Comments that are not overly toxic but nonetheless create an unwelcoming environment.

Examples include self directed pejorative terms (e.g., “It seems like I have been acting like a re**rd. Sorry. […]”), self-deprecating humor, and jokes and puns with explicit vocabulary or terms broadly perceived as politically incorrect or unacceptable in a professional setting.

Triggers of toxicity

Failed Use of Tool/Code or Error Message

Some comments actually report the problem in some detail to help the project or receive help with their immediate problem, but still include toxicity, typically expressing frustration.

For example, as one such user of a popular library puts it, “I just tried reinstalling your buggy, sh**ty software for the third time. Maybe you guys can get one that works right and stick to it without changing it all the time”.

Yet, in other cases, users simply vent about problems without seeking help or any attempt to provide constructive feedback to the project.

In some cases, the users respond with toxic messages when asked for more information or asked to follow the issue template,

for example “Yeah, not really sorry i’m lazy, and it’s more to help you then me. It’s simple to understand: […]. don’t need a ret**ded format to understand that! thanks”

Toxicity triggered by failed tool use is often entitled, insulting, unprofessional, or just trolling e.g., “It doesn’t work. F*** this”

Politics/Ideology

We fairly frequently observed toxicity arising over politics or ideology differences, e.g., referring to specific beliefs about open source culture, processes, or the involvement of specific companies (especially Microsoft was a frequent target in our sample)

For example, a user wrote a hostile issue in a Microsoft project titled “WHY ⁉⁉⁉⁉” which simply said “Revenue. F**k you guys”

Past Interactions

Finally, we observed several cases where toxic comments were posted that referred to past interactions of the author with the project, without continuing to discuss the previous technical issue, but shifting to personal attacks, complaints, or meta discussions about process.

For example, a user was unsatisfied with the response time on an existing issue so they created a new one asking “did you miss my comment or what?” These comments were often posted in a new issue after the old one was not answered or closed, and they often occur in the opening comment of the new issue

Authors of toxicity

  • New Account

    • No or minimal prior activity
    • Created just to use a particular software
    • Usually engage in anonymous trolling

    a new user was trying to download an application but was having issues and wrote an issue titled “Cant even install the fu**ing app” in which they complained that they could not find the download, upon which another user pointed them to the project’s release page

  • Repeat Issue Reporter

    • Have posted multiple issues but have no contributions.

    A six-year old account with a clear name, profile picture, and contact email has created hundreds of issues over the years, before posting an issue “sh**ty package” (with no context or further content) to a mid-sized repository of a web UI component.

  • Experienced contributors

    • Tend to use less severe language
    • But participate in all kinds of toxicity

    An experienced contributor was upset that a new update did not include Python 2, a project member responded with a workaround, to which the author then responded “and I recommend you quit! There are many more where python2 is used […] and you deleted it from the repository. Do you think at all with your head or do you have a hamburger head place?”

  • Project member

    • Toxicity occurs in smaller projects in reaction to a demand, complaint,
      or perceived affront from another user (which are not toxic)
    • Tend to be less severe. Mostly unprofessional or insults targeted at code.
    • Tend to not engage in unprovoked attacks.

    “you can be mad all you want, but let’s be realistic here… this project you’re fighting for so passionately, doesn’t have as many stars as I have thumbs down for telling you that you’re being ridiculous”

Project Characteristics

Project size

  • Vast majority of toxic comments were written in popular repositories with high levels of activity.
  • In less popular projects, the nature of toxicity is often insulting, trolling, or unprofessional, mostly directly in the opening comment, but we found no entitled comments, possibly because users have lower expectations in the first place

Several toxic comments in small projects appeared to be trolling or jokes among friends, e.g., “Dear Mr. [project owner name], Could you perhaps please get your s**t together and reincorporate the brilliant switch statement once again, bitch. XoXo, [author]”

Project domain

  • Toxicity occurred often in projects we consider as libraries or end-user-focused applications, which likely also are the most common kinds of projects on GitHub.
  • Toxicity in projects related to gaming and to mobile apps tends to use more severe language, e.g., more cursing.

After toxicity: harms and reactions

Tools to curb discussions

  • Closing/Locking/Deleting issues
  • Deleting/Editing/Hiding comments
  • Blocking users
  • Invoking the code of conduct

Reactions

When maintainers invoked the code of conduct, the author of the toxic comment usually did not engage any further. However, there were also a few cases where the author pushed back on being policed in their speech

“Again. No discussion allowed. No critique allowed. Just pushing fingers into the ears and singing. To avoid hearing about the impending doom, to avoid hearing the truth about the quality of this project”

In one case, a user called out for violating the code of conduct responded insisting “I will neither change my language, nor my tone or style. Both, language and tone, are perfectly valid, given the circumstances. I will remain myself, and will repel this attack to my individuality”

referring to invoking the code of conduct as “CoC-Fascism,” upon which project members banned the user.

Discussion and Implications

  • Toxicity presents differently on GitHub (compared to other platforms)
  • Open-source experience does not prevent toxicity.
  • Research into harms of toxicity is needed. We can’t reliably measure harms that toxic comments cause, especially indirect harms on bystanders and potential future contributors who decide not to engage with the repository or open source in general.
    • In almost all cases, a maintainer reacts to the toxic issue or comment, even if just to close or lock the issue. That is, maintainers need to use some of their time for extra work.
    • Maintainers often engage to explore whether there is truly an issue behind strongly worded complaints. Even when maintainers invoke the code of conduct, they usually do so in a custom comment tailored to the specific case. All this requires substantial effort which can be emotionally taxing to developers over time and cause fatigue.
  • There are opportunities to build open-source specific toxicity detectors. Early interventions are promising.
3 Likes

SQLite: Past, Present, and Future is an interesting paper by university researchers and SQLite project maintainers. It covers the architecture of the project and the history that led to that architecture, profiles SQLite against DuckDB, and improves SQLite's performance on OLAP (Online Analytical Processing) benchmarks, where it performs worse than DuckDB.

Unless they’re fresh out of college, I’d be surprised if a professional software developer hasn’t heard of SQLite.

“SQLite is embedded in major web browsers, personal computers, smart televisions, automotive media systems, and the PHP and Python programming languages. Furthermore, SQLite is found in every iOS and Android device, which currently number in the billions. There are likely over one trillion SQLite databases in active use. It is estimated that SQLite is one of the most widely deployed software libraries of any type.”

After reading Sections 2 and 3 of the paper, I was surprised to find that SQLite doesn’t get a mention in The Architecture of Open Source Applications books.

There are over 600 lines of test code for every line of code in SQLite

A statistic that I will hopefully use the next time someone tells me that tests aren’t necessary or useful. Or that they are difficult to write. Or that they don’t provide sufficient value.

The instruction logic is implemented as a large switch statement in the VDBE (Virtual Database Engine), where each instruction is processed as a unique case

sqlite/src/vdbe.c at bcdb28b8f9e525429557c08ed0a03450d0fd8c57 · sqlite/sqlite · GitHub - an 8000-ish line switch statement.
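
To make the "one case per instruction" idea concrete, here is a minimal sketch of a bytecode interpreter loop. It is a toy, not SQLite's VDBE: the opcode names (`LOAD_INT`, `ADD`, `EMIT`, `JUMP`, `HALT`) and the four-field instruction layout are assumptions made up for illustration, and Python's `if`/`elif` chain stands in for the C `switch`.

```python
from typing import List, Tuple

# Hypothetical instruction format: (opcode, p1, p2, p3).
# These opcodes are invented for this sketch; they are not SQLite's.
Instruction = Tuple[str, int, int, int]


def run(program: List[Instruction], num_registers: int = 8) -> List[int]:
    """Execute a toy register-based program and return the emitted values."""
    registers = [0] * num_registers
    output: List[int] = []
    pc = 0  # program counter: index of the next instruction
    while pc < len(program):
        opcode, p1, p2, p3 = program[pc]
        pc += 1
        # One branch per opcode, mirroring a C `switch` with one `case` each.
        if opcode == "LOAD_INT":    # registers[p2] = p1
            registers[p2] = p1
        elif opcode == "ADD":       # registers[p3] = registers[p1] + registers[p2]
            registers[p3] = registers[p1] + registers[p2]
        elif opcode == "EMIT":      # append registers[p1] to the output
            output.append(registers[p1])
        elif opcode == "JUMP":      # continue at instruction p1
            pc = p1
        elif opcode == "HALT":      # stop execution
            break
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    return output


program = [
    ("LOAD_INT", 2, 0, 0),   # r0 = 2
    ("LOAD_INT", 40, 1, 0),  # r1 = 40
    ("ADD", 0, 1, 2),        # r2 = r0 + r1
    ("EMIT", 2, 0, 0),       # output r2
    ("HALT", 0, 0, 0),
]
result = run(program)
```

Every new instruction costs one more branch in the loop, which is roughly how a single dispatch statement can grow to thousands of lines.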

Notably, SQLite generally does not use multiple threads, which limits its ability to take advantage of the available hardware parallelism. For sorting large amounts of data, SQLite uses an optional multithreaded external merge sort algorithm. For all other operations, SQLite performs all work in the calling thread

Bloom filters are memory-efficient and require minimal modification to the query planner

It was pleasantly surprising to find that Bloom filters made their way into SQLite. I vaguely remember coming across Bloom filters in the past. Bloom Filters by Example and Understanding Bloom Filters by building one - rand[om] look like meaningful resources to understand Bloom filters better.
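
For anyone else who only vaguely remembers them, here is a minimal Bloom filter sketch. It is not how SQLite implements its filters; the class name, the default sizes, and the choice of SHA-256 with double hashing are all assumptions picked to keep the example short.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Toy Bloom filter: a set-like structure with no false negatives
    but a tunable chance of false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # all bits start at 0

    def _positions(self, item: str) -> Iterator[int]:
        # Derive several bit positions from one digest via double hashing:
        # position_i = (h1 + i * h2) mod num_bits.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force h2 to be odd
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


bloom = BloomFilter()
for word in ("sqlite", "duckdb"):
    bloom.add(word)
```

The memory efficiency comes from storing only bits, never the items themselves: this filter uses 128 bytes no matter how many strings are added, at the cost of a false-positive rate that rises as it fills up.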

At first glance, this paper might not look relevant to the FOSS ecosystem but in my understanding, it conveys some of the fundamentals of the FOSS ecosystem.

The paper talks about a knowledge-creating company (or community) and discusses how knowledge creation happens. It distinguishes tacit from explicit knowledge. Explicit knowledge is formal and systematic, and can easily be communicated for this reason. Tacit knowledge is highly personal; it is hard to formalize and therefore difficult to communicate to others. The paper then discusses the four kinds of knowledge creation: tacit to tacit, explicit to explicit, tacit to explicit, and explicit to tacit (the last two are the most important).

The paper then goes on to discuss how teams should be organized in a knowledge-creating company but I’ll stop here for now to go into the FOSS aspects.

Software is knowledge, and creating new FOSS is the process of converting the tacit (unwritten, personal) knowledge that a software developer or team has into explicit knowledge (systematic, easily communicated). The new FOSS can then be disseminated within the company, leading to new tacit knowledge among the software devs. Successful FOSS is created when people who have a fundamental understanding of the problem domain work with people who are adept at converting that tacit knowledge of the problem/process into explicit FOSS (usually the domain expert and the software developer are the same person). For example, the Twisted Python project came about because of the pains its developer (Glyph) experienced when building a text-based game. The pandas Python library came about because its developer wanted a better way to analyze financial data than writing scripts in Excel files.

2 Likes

In this week’s edition of Papers We Love comes a paper that talks about the potential pointlessness of academic papers/research in the context of software engineering.

I closely follow Greg Wilson’s work, including It Will Never Work in Theory, a project to bridge the gap between academic researchers in software engineering and practicing software engineers. I recently saw the “short retrospective” on the page and learned that they have decided to wind down the project.

Looking at the paper (IEEE page or PDF), there are a few interesting points:

Most software developers in industry have never heard of any findings more recent than [Fred Brooks’ Mythical Man-Month] (which few of them ever actually read), and routinely dismiss studies as “not statistically significant”, even when those studies are carefully done and directly relevant to their work.

Likewise, those researchers whose papers we reviewed and who presented at our lightning talks have been no more likely to attend non-academic conferences than they were before.

Twelve years after It Will Never Work in Theory launched, the real challenge in software engineering research is not what to do about ChatGPT or whatever else Silicon Valley is gushing about at the moment. Rather, it is how to get researchers to focus on problems that practitioners care about and practitioners to pay attention to what researchers discover.

We believe the best time and place to bridge this divide is when we have the attention of future researchers and practitioners, i.e., in undergraduate programs. After all, if students leave academia without having been exposed to both research methods and useful discoveries, why would those who leave look to researchers later for help or answers?

Given the declining interest/trust in traditional academia/structured learning as a whole, especially from what I can see in the younger Indian generation, I sadly don’t see this problem ever getting solved (or even getting better, for that matter).

2 Likes