Papers We Love (FOSS Edition)

Papers We Love is a repository of academic computer science papers and a community of people who love reading them, but I’m co-opting the term to also refer to FOSS-related papers. I just read the Harvard Business School Strategy Unit working paper on The Value of Open Source Software and wanted to share a few personal highlights from it. A number of references within the paper also look interesting, and I plan to add highlights from those papers to this topic in the future.

I would love it if the community also shared papers they love in this topic.


Harvard Business School Strategy Unit Working Paper on The Value of Open Source Software by Manuel Hoffmann, Frank Nagle, Yanuo Zhou

The key highlights from the paper are

We estimate the supply-side value of widely-used OSS is $4.15 billion, but that the demand-side value is much larger at $8.8 trillion

Further, 96% of the demand-side value is created by only 5% of OSS developers

Why studying the value of FOSS is important

The parallels between shared grazing lands and shared digital infrastructure are palpable – the availability of communal grass to feed cattle, and in turn feed people, was critical to the agrarian economy, and the ability to not have to recreate code that someone else has already written is critical to the modern economy

Ammunition for FOSS advocates

Other recent studies have come to similar conclusions showing that open source software (OSS) appears in 96% of codebases (Synopsys 2023), and that some commercial software consists of up to 99.9% freely available OSS (Musseau et al., 2022)

With data from the United States the resulting estimates show a value of $2 billion for the OSS Apache Web Server in 2012 (Greenstein and Nagle, 2014) and a combined value of $4.5 billion for Apache and the increasingly popular OSS web server nginx in 2018 (Murciano-Goroff, et al., 2021)

We find a value ranging from $1.22 billion to $6.22 billion if we were to decide as a society to recreate all widely used OSS on the supply side. However, considering the actual usage of OSS leads to a demand-side value that is orders of magnitude larger and ranges from $2.59 trillion to $13.18 trillion, if each firm who used an OSS package had to recreate it from scratch (e.g., the concept of OSS did not exist). … However, as for any project, the evidence is not complete and we argue that we underestimate the value since our data, e.g., does not include operating systems, which are a substantial omitted category of OSS

Not so great assumptions

Here, we do not incorporate consumption externalities, i.e., we do not allow a benefit to arise for the general public when a package has been created and we further make sure that each firm is only replacing a package they use once, since a replaced package can be used within a firm as a club good (e.g., see Cornes and Sandler, 1996).

For large firms, there will be overhead coordination costs associated with building and maintaining a club good (an internal package). This potentially means that the demand-side $8.8 trillion number is a lower bound.

In this calculation, we implicitly do not incorporate any production externalities since we assume that there is no spillover knowledge from one package to the next that would lower the cost of programming.

This too we know to be false. Packaging and project management add considerable overhead to a software project, and spillover knowledge definitely reduces the cost of programming as a developer becomes comfortable with those aspects of a project over time. This potentially means that the supply-side $4.15 billion is an upper bound.

At the repository level, we quantified each developer’s proportional work contribution by calculating their share of commits to the total number of commits for a repository

Commits aren’t the best indicator of a developer’s contribution to a FOSS project – e.g., what if the project uses squash merges, so a large feature branch containing tens or hundreds of commits lands as a single commit? Lines of code aren’t a great indicator either – e.g., a complicated bug that takes tens of hours of debugging might be fixed by a single-line change. There is no clean, easy way to quantify a developer’s contribution, so such assumptions are inevitable.
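As a back-of-the-envelope illustration of the paper’s repository-level metric (the function and names here are my own, not the paper’s), commit share can be computed like this:

```python
from collections import Counter

def commit_shares(commit_authors):
    """Approximate each developer's contribution to a repository
    as their share of total commits (the paper's metric)."""
    counts = Counter(commit_authors)
    total = sum(counts.values())
    return {author: n / total for author, n in counts.items()}

# A squash-merged feature branch lands as a single commit, which is
# one way this metric understates a developer's actual work.
shares = commit_shares(["alice", "alice", "bob", "alice", "carol"])
# shares == {"alice": 0.6, "bob": 0.2, "carol": 0.2}
```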

Unexpected (to me) findings

We find that OSS packages created in Go have the highest value with $803 million in value that would have to be created from scratch if the OSS packages did not exist. Go is closely followed by JavaScript and Java with $758 million and $658 million, respectively. The value of C and Typescript is $406 million and $317 million, respectively, while Python has the lowest value of the top languages with around $55 million

Potential growth areas for FOSS

The industry with the highest usage value of around $43 billion is “Professional, Scientific, and Technical Services.” “Retail Trade” as well as “Administrative and Support and Waste Management and Remediation Services” make up another large part of the demand-side externally facing value of OSS with $36 billion and $35 billion, respectively. In contrast, industries that constitute just a small portion of the value are “Mining, Quarrying, and Oil and Gas Extraction”, “Utilities”, “Agriculture, Forestry, Fishing, and Hunting.” The latter industries are classical non-service sector industries and as such software is expected to play less of a role there.

I’ve been singing this song for a while, but now I have the evidence to back it up – we should be advocating for people to apply computing to their own domains instead of expecting them to abandon their domains to become generic software developers.

Giants in the FOSS ecosystem

Indeed, the last five percent of programmers, or 3,000 programmers, generate over 93% of the supply side value. Similarly, Panel B shows – when accounting for usage – that those last five percent generate over 96% of the demand side value.
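To make that kind of concentration concrete, here is a quick sketch (toy numbers, not the paper’s data) of computing the value share of the top fraction of contributors:

```python
def top_share(values, top_fraction=0.05):
    """Fraction of total value attributable to the top `top_fraction`
    of contributors, ranked by individual value."""
    ranked = sorted(values, reverse=True)
    k = max(1, int(len(ranked) * top_fraction))
    return sum(ranked[:k]) / sum(ranked)

# A toy skewed distribution: one prolific developer among twenty.
# The single top contributor (the "top 5%") accounts for ~84% of value.
print(top_share([100] + [1] * 19))
```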


“Do Software Developers Understand Open Source Licenses?” published in 2017 IEEE/ACM 25th International Conference on Program Comprehension (ICPC). See IEEE link and PDF link shared by one of the authors.

The key highlight from the paper is

The 375 respondents to the survey, who were largely developers, gave answers consistent with those of a legal expert’s opinion in 62% of 42 cases. Although developers clearly understood cases involving one license, they struggled when multiple licenses were involved

Survey

seven hypothetical software development scenarios, some of which included multiple license combinations

I highly recommend checking out the survey questions first, before looking at the legal answers to them in Table III of the paper - https://www.cs.ubc.ca/labs/spl/projects/softwarelicensing/resources/UBC_SPL_software_licensing_survey.pdf

Key observations

  1. Developers cope well with single licenses even in complex scenarios
  2. Developers have difficulty interpreting which actions are allowed in scenarios where more than one open source license is in use.
  3. Developers understand technical decisions will impact open source license use
  4. Developers recognize that there are interactions between open source licenses, but those interactions were not always correctly interpreted
  5. Questions that arise about the use of multiple open source licenses are situationally dependent.
  6. A number of developers lack knowledge of the details of open source licenses.

Survey underestimates the problem

In particular, we observed that the large majority of the participants (85.3%) had chosen a software project’s license before, which might be an instance of self-selection bias.


I came across this paper while looking at the website of Prof. Philip Guo. Prof. Guo is renowned in Computer-Human Interaction (CHI) circles for his work on pythontutor.com. His other papers are definitely worth checking out, e.g., Ten Million Users and Ten Years Later: Python Tutor’s Design Guidelines for Building Scalable and Sustainable Research Software in Academia

In this lovely little paper, the authors look at the use of ASCII diagrams in source code, specifically in 4 projects - Linux (OS), LLVM (compiler), Chromium (browser), and Tensorflow (ML).

TL;DR - The diagrams they dug up from the codebases, and the design space they synthesized from them, can be found at https://asciidiagrams.github.io/. Also, note that there are 428 ASCII diagrams in Chromium, 1386 in Linux, 220 in LLVM, and 122 in Tensorflow.

Abstract : Documentation in codebases facilitates knowledge transfer. But tools for programming are largely text-based, and so developers resort to creating ASCII diagrams—graphical artifacts approximated with text—to show visual ideas within their code. Despite real-world use, little is known about these diagrams. We interviewed nine authors of ASCII diagrams, learning why they use ASCII and what roles the diagrams play. We also compile and analyze a corpus of 507 ASCII diagrams from four open source projects, deriving a design space with seven dimensions that classify what these diagrams show, how they show it, and ways they connect to code. These investigations reveal that ASCII diagrams are professional artifacts used across many steps in the development lifecycle, diverse in role and content, and used because they visualize ideas within the variety of programming tools in use. Our findings highlight the importance of visualization within code and lay a foundation for future programming tools that tightly couple text and graphics.

The researchers asked three research questions

RQ1 Characteristics. What are the key characteristics of the media of ASCII diagrams?
RQ2 Roles. How are ASCII diagrams used in the software development workflow?
RQ3 Content. What do ASCII diagrams show and how do they show it?

The first author, leveraging their initial enthusiasm for this project, scrolled through the entirety of the four files and manually extracted the comments that resembled ASCII drawings, resulting in a selection of 2,162 ASCII diagrams.

:rofl:
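The heroic manual trawl described above could plausibly be semi-automated. A crude heuristic (entirely my own sketch, not the paper’s method) for flagging diagram-like comment blocks:

```python
import re

# Runs of box-drawing punctuation or ASCII arrows suggest a diagram.
DIAGRAM_CHARS = re.compile(r"[+\-|/\\]{3,}|-->|<--|\+--")

def looks_like_ascii_diagram(comment, min_hits=3):
    """Heuristic: a comment block is diagram-like if several of its
    lines contain diagram-style punctuation. Prone to both false
    positives (comment rulers) and negatives (sparse diagrams)."""
    hits = sum(1 for line in comment.splitlines()
               if DIAGRAM_CHARS.search(line))
    return hits >= min_hits
```

A pass like this would still need exactly the kind of manual review the first author did, but on a much smaller candidate set.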

P2 gladly paid $10 for Monodraw, a diagram editor specifically for ASCII. It allowed them to edit the diagram after-the-fact, provided they had saved the Monodraw file, which proved helpful in a codebase where many contributors were authoring these diagrams: “the best $10 I could ever spend was having Monodraw…if I want to like enlarge a box, I can enlarge a box, I can move things around…I usually keep the Monodraw files sitting around so that if someone’s gonna come along and ‘Oh I’m gonna have to update this diagram.”’

Ref https://monodraw.helftone.com/

As an alternative to ASCII, both P1 and P2 mentioned Mermaid [61], a simple textual language for describing and rendering diagrams. The popular code collaboration site GitHub renders Mermaid automatically, so P1 now uses it instead of ASCII art when writing README files. P2’s team also uses Mermaid regularly on GitHub, particularly for charts that are updated often, as updating ASCII is tedious. Nevertheless, P2 remarked that code diffs of Mermaid were hard to understand and P1 still used ASCII in tools that didn’t support Mermaid (e.g. terminals and text editors).

I’ve used Mermaid a bunch of times, especially on GitHub since it now natively renders Mermaid diagrams, but as the quote above notes, it still needs to be rendered, and the information isn’t as immediately visual as an ASCII diagram.

A bird’s-eye overview of how they used diagrams:
(1) to reify offline work,
(2) to illustrate test cases,
(3) for code review and verification from colleagues, and to document both (4) for others and (5) themselves

Finally, here are a few diagrams from the https://asciidiagrams.github.io/ dataset that I find interesting

// Note that speech recognition is activated on VR UI thread. This means it
// usually involves 3 threads. In the simplest case, the thread communication
// looks like the following:
//     VR UI thread        Browser thread         IO thread
//          |                    |                    |
//          |----ActivateVS----->|                    |
//          |                    |------Start------>  |
//          |                    |                    |
//          |                    |<-NotifyStateChange-|
//          |<--OnSRStateChanged-|                    |
//          |                    |                    |
//          |                    |<--OnSpeechResult---|
//          |<--OnSRStateChanged-|                    |
//          |                 navigate                |
//          |                    |                    |
// VS = voice search, SR = speech recognition 

from chromium

 /*
 * +------------+---------------------------------------------------+
 * |   PHASE    |           FIRMWARE STATUS TRANSITIONS             |
 * +============+===================================================+
 * |            |               UNINITIALIZED                       |
 * +------------+-               /   |   \                         -+
 * |            |   DISABLED <--/    |    \--> NOT_SUPPORTED        |
 * | init_early |                    V                              |
 * |            |                 SELECTED                          |
 * +------------+-               /   |   \                         -+
 * |            |    MISSING <--/    |    \--> ERROR                |
 * |   fetch    |                    V                              |
 * |            |                 AVAILABLE                         |
 * +------------+-                   |   \                         -+
 * |            |                    |    \--> INIT FAIL            |
 * |   init     |                    V                              |
 * |            |        /------> LOADABLE <----<-----------\       |
 * +------------+-       \         /    \        \           \     -+
 * |            |    LOAD FAIL <--<      \--> TRANSFERRED     \     |
 * |   upload   |                  \           /   \          /     |
 * |            |                   \---------/     \--> RUNNING    |
 * +------------+---------------------------------------------------+
 */

from linux

 // An irreducible SCC is one which has multiple "header" blocks, i.e., blocks
// with control-flow edges incident from outside the SCC.  This pass converts a
// irreducible SCC into a natural loop by applying the following transformation:
//
// 1. Collect the set of headers H of the SCC.
// 2. Collect the set of predecessors P of these headers. These may be inside as
//    well as outside the SCC.
// 3. Create block N and redirect every edge from set P to set H through N.
//
// This converts the SCC into a natural loop with N as the header: N is the only
// block with edges incident from outside the SCC, and all backedges in the SCC
// are incident on N, i.e., for every backedge, the head now dominates the tail.
//
// INPUT CFG: The blocks A and B form an irreducible loop with two headers.
//
//                        Entry
//                       /     \
//                      v       v
//                      A ----> B
//                      ^      /|
//                       `----' |
//                              v
//                             Exit
//
// OUTPUT CFG: Edges incident on A and B are now redirected through a
// new block N, forming a natural loop consisting of N, A and B.
//
//                        Entry
//                          |
//                          v
//                    .---> N <---.
//                   /     / \     \
//                  |     /   \     |
//                  \    v     v    /
//                   `-- A     B --'
//                             |
//                             v
//                            Exit
//
// The transformation is applied to every maximal SCC that is not already
// recognized as a loop. The pass operates on all maximal SCCs found in the
// function body outside of any loop, as well as those found inside each loop,
// including inside any newly created loops. This ensures that any SCC hidden
// inside a maximal SCC is also transformed. 

from llvm
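The three-step transformation described in that LLVM comment can be sketched on an edge-list CFG (a simplified toy of my own, not LLVM’s actual implementation):

```python
def fix_irreducible(edges, headers, new_block="N"):
    """Redirect every edge into an SCC header through a new block N,
    making N the single header of a natural loop.
    `edges` is a set of (src, dst) pairs; `headers` is the set H."""
    new_edges = set()
    for src, dst in edges:
        if dst in headers:
            # Step 3: route each P -> H edge through N instead.
            new_edges.add((src, new_block))
            new_edges.add((new_block, dst))
        else:
            new_edges.add((src, dst))
    return new_edges

# The INPUT CFG from the comment: A and B form an irreducible loop.
cfg = {("Entry", "A"), ("Entry", "B"), ("A", "B"), ("B", "A"), ("B", "Exit")}
# After the pass, N is the only block entered from outside the SCC,
# matching the OUTPUT CFG in the comment.
```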

 // Computes a dot product between "[M,K]{0,1} lhs" with a [K,1] vector (the
// layout of the vector does not matter).  This implementation uses a tiling
// scheme to improve performance.
//
// We logically separate the LHS matrix into four segments:
//
//   +----------------------+---+
//   |                      |   |
//   |                      |   |
//   |         A            | B |
//   |                      |   |
//   |                      |   |
//   |                      |   |
//   +----------------------+---+
//   |         C            | D |
//   +----------------------+---+
//
// where A is the largest submatrix of the LHS that can be evenly divided into
// tiles.  For each tile in A, assuming tile_rows_ == tile_cols_ == 4, we have:
//
//   +---+---+---+---+       +--+--+--+--+
//   |M00|M10|M20|M30|       |V0|V1|V2|V3|
//   +---+---+---+---+       +--+--+--+--+
//   |M01|M11|M21|M31| and   |V0|V1|V2|V3|
//   +---+---+---+---+       +--+--+--+--+
//   |M02|M12|M22|M32|       |V0|V1|V2|V3|
//   +---+---+---+---+       +--+--+--+--+
//   |M03|M13|M23|M33|       |V0|V1|V2|V3|
//   +---+---+---+---+       +--+--+--+--+
//
// (Legend: rows are horizontal and columns are vertical; and each column is one
// llvm::Value of a vector type)
//
// where:
//
//   a. The left tile is from the column major left matrix.
//   b. The right tile is an elementwise broadcast of a [V0, V1, V2, V3]
//      vector loaded from the RHS vector.
//
// As we iterate through the column dimension, we compute the change to the
// result vector by an elementwise multiplication between the two tiles above
// followed by a reduction along the major dimension:
//
//                     +-----------------------------------+
//                     | M00*V0 + M10*V1 + M20*V2 + M30*V3 |
//                     +-----------------------------------+
//                     | M01*V0 + M11*V1 + M21*V2 + M31*V3 |
// Result[R:R+4] +=    +-----------------------------------+
//                     | M02*V0 + M12*V1 + M22*V2 + M32*V3 |
//                     +-----------------------------------+
//                     | M03*V0 + M13*V1 + M23*V2 + M33*V3 |
//                     +-----------------------------------+
//
// Where R is the starting row for the tile.
//
// We have an inner epilogue loop to deal with the "C" submatrix and an outer
// epilogue loop to deal with the B,D submatrix.
//
// TODO(sanjoy): We should investigate if using gather loads and scatter stores
// can be used here have the same inner loop for both column-major and row-major
// matrix-vector products.

from tensorflow
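For the curious, the tiling scheme that comment describes can be sketched in plain Python (a didactic toy; the real code emits vectorized LLVM IR rather than scalar loops):

```python
def tiled_matvec(lhs, vec, tile=4):
    """Sketch of the tiling scheme above: process the largest evenly
    divisible submatrix A in tile x tile blocks, then handle leftover
    columns (B) and rows (C, D) in epilogue loops.
    `lhs` is a list of rows; returns lhs @ vec."""
    m, k = len(lhs), len(vec)
    result = [0] * m
    mt, kt = m - m % tile, k - k % tile  # extent of submatrix A
    for r in range(0, mt, tile):
        for c in range(0, kt, tile):
            # Elementwise multiply a tile against the broadcast vector
            # chunk, then reduce along the column (major) dimension.
            for i in range(tile):
                acc = 0
                for j in range(tile):
                    acc += lhs[r + i][c + j] * vec[c + j]
                result[r + i] += acc
        for c in range(kt, k):            # B-region epilogue (columns)
            for i in range(tile):
                result[r + i] += lhs[r + i][c] * vec[c]
    for r in range(mt, m):                # C/D-region epilogue (rows)
        result[r] = sum(lhs[r][c] * vec[c] for c in range(k))
    return result
```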


“Did You Miss My Comment or What?”
Understanding Toxicity in Open Source Discussions
by Courtney Miller, Sophie Cohen, Daniel Klug, Bogdan Vasilescu, Christian Kästner


I came across this paper a little while ago after having a frustrating discussion on the FOSS United Telegram. I am also part of the discussion channels of a few open source projects, where I regularly notice a lot of impolite back and forth among maintainers, users, and contributors, which in my experience seems unique to OSS forums.

In this paper the authors try to understand online toxicity in open source communities. The kinds, reasons and effects of toxicity across the internet are well documented, but toxicity specifically within OSS communities is not very well understood.

To this end, we curate a sample of 100 toxic GitHub issue discussions combining multiple search and sampling strategies. We then qualitatively analyze the sample to gain an understanding of the characteristics of open-source toxicity. We find that the pervasive forms of toxicity in open source differ from those observed on other platforms like Reddit or Wikipedia.

From an interview -

They used a toxicity and politeness detector developed for another platform to scan nearly 28 million posts on GitHub made between March and May 2020. The team also searched these posts for “code of conduct” — a phrase often invoked when reacting to toxic content — and looked for locked or deleted issues, which can also be a sign of toxicity.
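The screening signals described in that interview could look roughly like this (a hypothetical sketch of my own; field names are invented, not the study’s):

```python
def toxicity_signals(post):
    """Flag posts for manual review using the signals described above:
    invoking a code of conduct, or belonging to a locked/deleted issue.
    `post` is a dict with hypothetical keys, not the study's schema."""
    flags = []
    if "code of conduct" in post.get("body", "").lower():
        flags.append("mentions-code-of-conduct")
    if post.get("issue_locked"):
        flags.append("locked-issue")
    if post.get("issue_deleted"):
        flags.append("deleted-issue")
    return flags
```

The study additionally ran a toxicity/politeness classifier over the posts; the point here is just that the cheap signals are simple lookups.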

In our sample, some of the most prevalent forms of toxicity are entitled, demanding, and arrogant comments from project users as well as insults arising from technical disagreements. In addition, not all toxicity was written by people external to the projects; project members were also common authors of toxicity.

[figure from the paper]


Open Source and Toxicity

What is toxicity in the context of online discussions?

Toxicity, defined here as “rude, disrespectful, or unreasonable language that is likely to make someone leave a discussion” is a huge problem online

Open source communities are not immune to toxicity. While the term “toxicity” as defined above has only recently started being used in the open-source literature, the presence of behaviors “likely to make someone leave” have long been documented by researchers and practitioners in this space. For example, the Linux Kernel Mailing List is notorious for having discussions with a tone that “tends to discourage people from joining the community”

Linus Torvalds acknowledges that he has at times been “overly impolite” and that is “a personal failing.”

Toxicity is also a major threat to diversity and inclusion: prior work has found that it can especially impact members of certain identity groups, particularly women, who are already severely underrepresented.

I remember wanting to contribute to open source a little more than a year back, and the tone of the day-to-day discussions would have been a major factor in choosing which project to get involved with.

Identified characteristics of open source toxicity

Unlike some other platforms where the most frequent types of toxicity are hate speech or harassment, we find entitlement, insults, and arrogance are among the most common types of toxicity in open source. We also learned that many of the ways projects address toxicity are closely connected to the GitHub interface itself and open source culture more broadly, such as locking issues as too heated or invoking a project’s code of conduct.

I tried to find blog posts from GitHub on the measures they take to handle toxicity but couldn’t find anything

Edit: GitHub Community Guidelines - GitHub Docs

Another ex-open source community leader, when explaining why they quit, described how “I had been told that I needed a ‘tough skin’ to work in the community, and I needed to ‘not take it personally’ when developers were abrasive during code review”.

Toxicity in open source is often written off as a naturally occurring if not necessary facet of open source culture. The aforementioned community leader describes how “When I complained about the toxic environment, I was told it was ‘tradition’ to be blunt and rude in order to have a truly open dialogue.”

Four high-ranking Perl community members stepped down due to community-related issues. One of them, when elected as community leader in April 2016, set the goal to make the mailing list “a place on which we can have technical conversations without worrying about abusive language or behavior”; however, in April 2021, he stepped down, explaining how the “chain of continuous bullying and hostility I’ve been receiving” had caused him “significant emotional distress”.

(Types of) Toxicity observed on Github

Insults

Over half of our sample contained insults (55 cases), i.e., disrespectful or scornful expressions, often using curse words or intentionally offensive language. Toxic insulting comments tend to be targeted at people rather than at the code itself.

This is interesting because I’ve heard a lot of seasoned open-sourcerers tell new contributors that they should never take criticism personally and that it’s always about the code, not the person. The Contributor Covenant, which is one of the most widely adopted CoCs, explicitly states that criticism should be constructive.

For example, a user of a GUI crypto wallet with a built-in crypto miner noticed the presence of the miner and interpreted it as malware (a misunderstanding; the miner, deactivated by default, was mentioned as an intentional feature in the readme). The user threw explicit curse words at the maintainers of the project and accused them of being “criminal crooks” for trying to “infect other computers with malware”

A user was unhappy with the colors of a project, reporting “colors are horrible for […], just look at this s**t”. Even after a contributor provided a link to the documentation, the user remained unsatisfied and unapologetic.

Entitled

Entitled comments make demands of people or projects as if the author had an expectation due to a contractual relationship or payment.

A user, upon being told that their suggestion was based on a misunderstanding of the project, began aggressively criticizing the contributor for how they addressed the issue, saying “Like just add the flavor text or show me how to or something. Don’t just fu**ing close people’s tickets they would like some help on”

Arrogant

We consider comments as arrogant when the author imposes their view on others from a position of perceived authority or superiority (earned or not) and demands that others act as recommended

One of the users in the discussion was unfamiliar with some of the legislation being discussed and asked for more information, a second user responded saying “Never hear about [standard]? A baseline for developers. Use Google.”

Trolling

For example, a user was generally unhappy with a project and wrote “Worst. App. Ever. Please make it not the worst app ever. Thanks” (I2), followed by a pull request that deleted all the code in the repo; after the maintainer closed the issue, the user responded “Merge my PR damnit” and nothing else happened.

Unprofessional

Comments that are not overtly toxic but nonetheless create an unwelcoming environment.

Examples include self-directed pejorative terms (e.g., “It seems like I have been acting like a re**rd. Sorry. […]”), self-deprecating humor, and jokes and puns with explicit vocabulary or terms broadly perceived as politically incorrect or unacceptable in a professional setting.

Triggers of toxicity

Failed Use of Tool/Code or Error Message

Some comments actually report the problem in some detail to help the project or receive help with their immediate problem, but still include toxicity, typically expressing frustration.

For example, as one such user of a popular library puts it, “I just tried reinstalling your buggy, sh**ty software for the third time. Maybe you guys can get one that works right and stick to it without changing it all the time”.

Yet, in other cases, users simply vent about problems without seeking help or any attempt to provide constructive feedback to the project.

In some cases, users respond with toxic messages when asked for more information or asked to follow the issue template, for example: “Yeah, not really sorry i’m lazy, and it’s more to help you then me. It’s simple to understand: […]. don’t need a ret**ded format to understand that! thanks”

Toxicity triggered by failed tool use is often entitled, insulting, unprofessional, or just trolling e.g., “It doesn’t work. F*** this”

Politics/Ideology

We fairly frequently observed toxicity arising over politics or ideology differences, e.g., referring to specific beliefs about open source culture, processes, or the involvement of specific companies (especially Microsoft was a frequent target in our sample)

For example, a user wrote a hostile issue in a Microsoft project titled “WHY :interrobang::interrobang::interrobang::interrobang:” which simply said “Revenue. F**k you guys”

Past Interactions

Finally, we observed several cases where toxic comments were posted that referred to past interactions of the author with the project, without continuing to discuss the previous technical issue, but shifting to personal attacks, complaints, or meta discussions about process.

For example, a user was unsatisfied with the response time on an existing issue so they created a new one asking “did you miss my comment or what?” These comments were often posted in a new issue after the old one was not answered or closed, and they often occur in the opening comment of the new issue

Authors of toxicity

  • New Account

    • No or minimal prior activity
    • Created just to use a particular software
    • Usually engage in anonymous trolling

    a new user was trying to download an application but was having issues and wrote an issue titled “Cant even install the fu**ing app” in which they complained that they could not find the download, upon which another user pointed them to the project’s release page

  • Repeat Issue Reporter

    • Have posted multiple issues but have no contributions.

    A six-year old account with a clear name, profile picture, and contact email has created hundreds of issues over the years, before posting an issue “sh**ty package” (with no context or further content) to a mid-sized repository of a web UI component.

  • Experienced contributors

    • Tend to use less severe language
    • But participate in all kinds of toxicity

    An experienced contributor was upset that a new update did not include Python 2, a project member responded with a workaround, to which the author then responded “and I recommend you quit! There are many more where python2 is used […] and you deleted it from the repository. Do you think at all with your head or do you have a hamburger head place?”

  • Project member

    • Toxicity occurs in smaller projects in reaction to a demand, complaint, or perceived affront from another user (which are not toxic)
    • Tend to be less severe. Mostly unprofessional or insults targeted at code.
    • Tend to not engage in unprovoked attacks.

    “you can be mad all you want, but let’s be realistic here… this project you’re fighting for so passionately, doesn’t have as many stars as I have thumbs down for telling you that you’re being ridiculous”

[figure from the paper]

Project Characteristics

Project size

  • Vast majority of toxic comments were written in popular repositories with high levels of activity.
  • In less popular projects, the nature of toxicity is often insulting, trolling, or unprofessional, mostly directly in the opening comment, but we found no entitled comments, possibly because users have lower expectations in the first place

Several toxic comments in small projects appeared to be trolling or jokes among friends, e.g., “Dear Mr. [project owner name], Could you perhaps please get your s**t together and reincorporate the brilliant switch statement once again, bitch. XoXo, [author]”

Project domain

  • Toxicity occurred often in projects we consider as libraries or end-user-focused applications, which likely also are the most common kinds of projects on GitHub.
  • Toxicity in projects related to gaming and to mobile apps tends to use more severe language, e.g., more cursing.

After toxicity: harms and reactions

Tools to curb discussions

  • Closing/Locking/Deleting issues
  • Deleting/Editing/Hiding comments
  • Blocking users
  • Invoking the code of conduct

Reactions

When maintainers invoked the code of conduct, the author of the toxic comment usually did not engage any further. However, there were also a few cases where the author pushed back on being policed in their speech

“Again. No discussion allowed. No critique allowed. Just pushing fingers into the ears and singing. To avoid hearing about the impending doom, to avoid hearing the truth about the quality of this project”

In one case, a user called out for violating the code of conduct responded by insisting “I will neither change my language, nor my tone or style. Both, language and tone, are perfectly valid, given the circumstances. I will remain myself, and will repel this attack to my individuality” and referring to invoking the code of conduct as “CoC-Fascism,” upon which project members banned the user.

Discussion and Implications

  • Toxicity presents differently on GitHub (compared to other platforms)
  • Open-source experience does not prevent toxicity.
  • Research into harms of toxicity is needed. We can’t reliably measure the harms that toxic comments cause, especially indirect harms to bystanders and potential future contributors who decide not to engage with the repository or open source in general.
  • In almost all cases, a maintainer reacts to the toxic issue or comment, even if just to close or lock the issue. That is, maintainers need to spend some of their time on extra work.
  • Maintainers often engage to explore whether there is truly an issue behind strongly worded complaints. Even when maintainers invoke the code of conduct, they usually do so in a custom comment tailored to the specific case. All this requires substantial effort, which can be emotionally taxing to developers over time and cause fatigue.
  • There are opportunities to build open-source specific toxicity detectors. Early interventions are promising.
3 Likes

SQLite: Past, Present, and Future is an interesting paper by university researchers and SQLite project maintainers. The paper describes the architecture of the project and the history that led to the architecture being what it is, profiles SQLite against DuckDB, and makes improvements to the SQLite database to enable better performance on OLAP (Online Analytical Processing) benchmarks, where it performs worse than DuckDB.

Unless they’re a fresh graduate from college, I’d be surprised if a professional software developer hasn’t heard of SQLite.

“SQLite is embedded in major web browsers, personal computers, smart televisions, automotive media systems, and the PHP and Python programming languages. Furthermore, SQLite is found in every iOS and Android device, which currently number in the billions. There are likely over one trillion SQLite databases in active use. It is estimated that SQLite is one of the most widely deployed software libraries of any type.”

After reading Sections 2 and 3 of the paper, I was surprised to find that SQLite doesn’t get a mention in The Architecture of Open Source Applications books.

There are over 600 lines of test code for every line of code in SQLite

A statistic that I will hopefully use the next time someone tells me that tests aren’t necessary or useful. Or that they are difficult to write. Or that they don’t provide sufficient value.

The instruction logic is implemented as a large switch statement in the VDBE (Virtual Database Engine), where each instruction is processed as a unique case

sqlite/src/vdbe.c at bcdb28b8f9e525429557c08ed0a03450d0fd8c57 · sqlite/sqlite · GitHub - an 8000-ish line switch statement.

Notably, SQLite generally does not use multiple threads, which limits its ability to take advantage of the available hardware parallelism. For sorting large amounts of data, SQLite uses an optional multithreaded external merge sort algorithm. For all other operations, SQLite performs all work in the calling thread

Bloom filters are memory-efficient and require minimal modification to the query planner

It was pleasantly surprising to find that Bloom filters made their way into SQLite. I vaguely remember coming across Bloom filters in the past. Bloom Filters by Example and Understanding Bloom Filters by building one - rand[om] look like meaningful resources to understand Bloom filters better.
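To get a concrete feel for the data structure, here is a minimal Bloom filter sketch in Python. This is my own toy implementation, not SQLite's: the point is that membership tests can yield false positives but never false negatives, which is what makes the structure safe as a cheap pre-filter in a query plan.

```python
import hashlib

class BloomFilter:
    """A minimal Bloom filter: k hash positions over a bit array of size m."""

    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _positions(self, item):
        # Derive k bit positions from slices of a single digest
        # (a common simplification over k independent hash functions).
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.k):
            chunk = digest[i * 4:(i + 1) * 4]
            yield int.from_bytes(chunk, "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("sqlite")
print(bf.might_contain("sqlite"))  # True: no false negatives
print(bf.might_contain("duckdb"))  # almost certainly False; false positives are possible
```

The memory cost is just the bit array, which is why the paper can describe Bloom filters as memory-efficient and easy to slot into the query planner.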

At first glance, this paper might not look relevant to the FOSS ecosystem but in my understanding, it conveys some of the fundamentals of the FOSS ecosystem.

The paper talks about a knowledge-creating company (or community) and discusses how knowledge creation happens. The article discusses the difference between tacit and explicit knowledge. Explicit knowledge is formal and systematic and can easily be communicated for this reason. Tacit knowledge is highly personal; it is hard to formalize and therefore difficult to communicate to others. The paper then discusses the various kinds of knowledge conversion, i.e. tacit to tacit, explicit to explicit, tacit to explicit, and explicit to tacit (the last two are the most important).

The paper then goes on to discuss how teams should be organized in a knowledge-creating company but I’ll stop here for now to go into the FOSS aspects.

Software is knowledge, and creating new FOSS is the process of converting the tacit (unwritten, personal) knowledge that a software developer/team has into explicit knowledge (systematic, easily communicated). The new FOSS can then be disseminated within the company, leading to new tacit knowledge among the software devs. Successful FOSS is created when people who have a fundamental understanding of the problem domain work with people who are adept at converting their tacit knowledge of the problem/process into explicit FOSS (usually the domain expert and the software developer are the same person). For example, the Twisted Python project came about because of the pains the developer (Glyph) experienced when building a text-based game. The Pandas Python project came about because the developer wanted a better way to analyze financial data than writing scripts in Excel files.

2 Likes

In this week’s edition of Papers We Love comes a paper that discusses the potential pointlessness of academic papers/research in the context of software engineering.

I closely follow Greg Wilson’s work, one piece of which is It will never work in theory - a project to bridge the gap between academic researchers in software engineering and practicing software engineers. I recently noticed the “short retrospective” on the page and saw that they have decided to wind down the project.

Looking at the paper (IEEE page or PDF), there are a few interesting points

Most software developers in industry have never heard of any findings more recent than [Fred Brooks’ Mythical Man-Month] (which few of them ever actually read), and routinely dismiss studies as “not statistically significant”, even when those studies are carefully done and directly relevant to their work.

Likewise, those researchers whose papers we reviewed and who presented at our lightning talks have been no more likely to attend non-academic conferences than they were before.

Twelve years after It Will Never Work in Theory launched, the real challenge in software engineering research is not what to do about ChatGPT or whatever else Silicon Valley is gushing about at the moment. Rather, it is how to get researchers to focus on problems that practitioners care about and practitioners to pay attention to what researchers discover.

We believe the best time and place to bridge this divide is when we have the attention of future researchers and practitioners, i.e., in undergraduate programs. After all, if students leave academia without having been exposed to both research methods and useful discoveries, why would those who leave look to researchers later for help or answers?

Given the declining interest/trust in traditional academia and structured learning as a whole, especially from what I can see in the younger Indian generation, I sadly don’t see this problem ever getting solved (or even getting better, for that matter).

2 Likes

After a long hiatus, let’s check out a recent policy paper by the folks at XKDR Forum on Assessing ‘Compute’ Capacities and Policy Frameworks for Artificial Intelligence in India.

Here are a few key excerpts from the policy paper to help understand the AI compute landscape in India and policy recommendations to improve the situation

“Indian policies regarding the promotion of “compute” can be broadly divided into two: policies relating to computing infrastructure, and policies relating to development of semiconductors.”

Policies on computing infrastructure - the National Super Computing Mission “70 high-performance computing facilities at various research institutes and universities with a budget outlay of INR 4500 crores spread across 7 years”

Policies on semiconductors - The India Semiconductor Mission (ISM), initiated in 2021 with a budget of INR 76,000 crores across 5 years, is intended to be “a groundbreaking effort aimed at transforming India into a worldwide center for semiconductor manufacturing and design”

The production of semiconductors in India is a sign of how the Indian economy is gradually climbing the value-addition scale. Producing large chips that are required for non-intensive uses like household appliances, etc. at scale will certainly help generate technical expertise provided India is able to do so at low costs. While this does not have a direct bearing on high-performance computing in general, it does enable the transfer and development of skills and know-how.

The impact of external factors on “compute”
Regarding the Framework for AI Diffusion and the recent Interim Final Rule issued by the Dept. of Commerce, USA - The intent of the rule is clear: it seeks to restrict AI technology diffusion by restricting imports of GPUs. The rule, in fact, specifically targets Chinese access to AI technologies. At the same time, it also has the effect of limiting the scope of deploying these technologies in India

There are also new reporting requirements for US companies to detect and report “unauthorized training” of models in Tier 2 and Tier 3 countries. Some of these restrictions (such as those on unauthorized training) would apply even if a company from a Tier 2 country is “renting” a set of servers in a Tier 1 country.

How should Indian policymakers analyze these developments? Firstly, we note that in India, most AI products are applications and not models. To put the Indian state of play in the global context, it would be useful to use Lehdonvirta et al. (2024)’s positioning of countries with significant prowess in AI into “Compute North” and “Compute South” countries. Compute North countries are “positioned to use their territorial jurisdiction to intervene in AI development at the point at which models are sent to their local public cloud regions for training. For instance, they could require algorithms and data sets to be audited and certified for compliance with their local rules before training is permitted to commence, shaping what kinds of AI systems can enter the global market”. Compute South countries are better suited for “AI system deployment than for development”. Using this classification, we could put India somewhere in the middle — a country that is ahead of its peers when it comes to the deployment of models and applications to be generated out of it, but is not quite ready to make meaningful changes to the direction of the field itself.

Two interesting responses have emerged to these developments from the government of India. The first and older response is that of the idea of “AI sovereignty” and the creation of a “National AI Stack”. In their concept paper, the Department of Telecommunications, Government of India (2020) identifies the strategic challenges regarding AI adoption in India (e.g., issues concerning access to technology, availability of trained researchers, etc.) It goes on to suggest that a National AI Stack could be a viable answer towards ensuring that all the functional requirements of AI research and services are met in India.

The newer response has been the Indian government’s recent RFP to build foundational models in India (Ministry of Electronics and IT, Government of India, 2025b). The developers selected to build an Indian LLM would use various products developed under the other pillars of the IndiaAI mission: they would have access to GPUs acquired under the “compute” pillar, and they would obtain research funding and training data for the same. This development marks a credible and serious response by the Indian government to strengthen its commitment as a serious AI power. At the same time, concerns have arisen over whether these products run the risk of underutilization and are therefore premature.

Policy options to support improvement in “compute”

  • “Buy” or “rent” should be a commercial decision
  • Reforms in procurement
  • Foreign policy that ensures the continued supply of “compute”

Given that the United States and its allies control the overwhelming majority of the supply chain for high-performance computing, it is best to find diplomatic avenues to improve India’s status to obtain preferential treatment for supplies of GPUs and other advanced processors. This has been done before — for example, in 1984, the United States relaxed certain supercomputer export restrictions on India

In an uncertain trade environment — when various countries are beginning to increase tariffs on important goods — it is crucial that the Government of India avoid any tariff ramifications on AI R&D. This could be done by negotiating reduced tariffs on essential AI hardware inputs through bilateral and multilateral trade agreements, which can lower production costs and encourage domestic manufacturing.

Strategic trade partnerships with the countries that are major players in AI and semiconductors can facilitate technology transfer, allowing Indian firms to access cutting-edge innovations and collaborate on R&D. Additionally, easing foreign direct investment (FDI) norms and including AI-specific provisions in trade deals can attract global tech giants to set up fabrication units and data centers in India. By integrating trade agreements with initiatives like the India Semiconductor Mission and Make in India, the country can create a robust ecosystem for AI-driven computational infrastructure, positioning itself as a competitive global hub.

Frank Nagle from Harvard University continues to put out amazing work on the FOSS ecosystem. See for example this earlier summary - Papers We Love (FOSS Edition) - #2 by rahulporuri

Under “Executive Summary”

“This report, Census III, is the third investigation into the widespread use of FOSS and aggregates data from over twelve million observations of FOSS libraries used in production applications at over ten thousand companies. It aims to shed additional light on the most commonly used FOSS packages at the application library level. This effort builds on the Census I report that focused on the lower level critical operating system libraries and utilities, and the Census II report that sought to improve our understanding of the language-level FOSS packages that software applications rely on.”

“The Census III effort utilizes data from the Software Composition Analysis (SCA) companies FOSSA, Snyk, Sonatype, and Black Duck, who partnered with Harvard to advance the state of open source research”

“Only through data sharing, coordination, and investment will the value of this critical component of the digital economy be preserved for generations to come.”

“In addition to the detailed results on FOSS usage provided in the report, we also identified eight high-level findings:

  1. the use of cloud service-specific packages is increasing,
  2. there is an ongoing transition from Python 2 to Python 3,
  3. Maven packages continue to be widely used and there is an increased prevalence of NuGet and Python packages,
  4. use of components from Rust package repositories have increased considerably since Census II,
  5. there continues to be a need for the use of standardized naming schema for software components,
  6. much of the most widely used FOSS is developed by only a handful of contributors,
  7. individual developer account security is increasingly important, and
  8. legacy software persists in the open source space.”

Under “Introduction”

“In Census II, we provided eight rank-ordered Top 500 lists of FOSS usage, four of which are at the package level and four of which are at the package/version level. These results were based on the analysis of over half a million observations of FOSS used in applications examined by the SCA data partners in 2020. The current effort, Census III, follows a similar methodology to Census II, but utilizes a richer dataset coming from four different SCA partners and composed of over 12 million data points on FOSS usage in 2023.”

“There are many indicators that could be used to suggest risk and different organizations may weigh factors differently. For example, a potential user might be more concerned if a project has only a single maintainer (or at most a few), is very large (e.g., in terms of lines of code), is written in a memory-unsafe language, has had no merges or other activity within the last few years even though it has nontrivial size, does not use tools to identify potential vulnerabilities in its code, has no OpenSSF best practices badge, does not meet many OpenSSF Scorecards measures, has publicly known vulnerabilities, has required dependencies with publicly known vulnerabilities, and when run is usually directly accessible to the Internet by anyone”
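To make the quoted list of indicators concrete, here is a hypothetical sketch of how an organization might turn them into a score. The flag names and weights below are entirely my own invention, not from the report, and as the quote notes, different organizations may weigh the factors differently.

```python
# Hypothetical per-indicator weights; each indicator that applies to a
# project adds its weight to the project's risk score.
RISK_INDICATORS = {
    "single_maintainer": 3,
    "very_large_codebase": 1,
    "memory_unsafe_language": 2,
    "inactive_despite_size": 2,
    "no_vulnerability_scanning": 2,
    "no_best_practices_badge": 1,
    "known_vulnerabilities": 3,
    "vulnerable_dependencies": 2,
    "internet_facing": 1,
}

def risk_score(project_flags):
    """Sum the weights of every indicator the project trips."""
    return sum(weight for name, weight in RISK_INDICATORS.items()
               if project_flags.get(name, False))

# A single-maintainer project with known vulnerabilities.
score = risk_score({"single_maintainer": True, "known_vulnerabilities": True})
print(score)  # 6
```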

Under “Goals of Census III > Spurring Action > Data Sharing”

“Although public data on package downloads, code changes, and known security vulnerabilities abound, the view on where and how FOSS packages are being used remains opaque. For example, download counts are often misleading because a package may be downloaded millions of times by test processes (once for each test), while a single download may be used to generate an application deployed to billions of devices.”

I might be misremembering, but for one of the Python/PyPI projects that I used to maintain, we had 10 CI (test) downloads for every user download. That ratio might be worse for more popular application libraries.

Under “Goals of Census III > Spurring Action > Investment”

“In the open source world, time and talent may indeed be the most important investments. Larger and more established packages tend to attract more contributors compared to smaller, less visible ones – even if the latter are more heavily depended upon in practice. Companies reliant upon FOSS packages could benefit from supporting these smaller and less-visible yet critical packages, either directly (e.g., paying employees to maintain those projects on the clock) or indirectly (hiring contributors to those projects as employees, and letting them work on the project). Similar to financial resources, time and talent need to be carefully considered to ensure that they are directed toward the most critical projects.”

^ I’m afraid that the spotlight on the FOSS ecosystem is leading to the large projects becoming larger, while the small projects don’t seem to be getting the love that they deserve. It looks like everyone wants to contribute to cloud-related FOSS projects (e.g. Kubernetes), but who is interested in contributing to desktop FOSS, e.g. the Qt and Wx desktop development frameworks? I don’t have a single answer here, but I think FOSS projects themselves need to do a better job of pushing extra capacity to sister projects in need of new contributors.

Under “Methods”

“While stars, ratings, and download statistics indicate a package’s popularity or reputation, these do not necessarily translate into real-world, day-to-day use. Private usage data from SCA companies’ automated scans and human audits from 2023 provides more insight into which FOSS packages are being used in production codebases.”

“Even developers are typically only aware of the components they directly use, and not the components they indirectly use. This “lower level” focus is important for research, because developers – not end users – tend to drive the widespread adoption and integration of FOSS projects.”

Do you know all of your dependencies? Do you know the dependencies of all of your dependencies?
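For Python packages, a rough answer to that question can be sketched with only the standard library. This naive walker is my own illustration, not a packaging-grade resolver: it parses requirement strings crudely, ignores extras and environment markers, and only sees dependencies that are actually installed.

```python
from importlib import metadata
import re

def transitive_deps(package, seen=None):
    """Return the set of installed distributions reachable from `package`."""
    seen = set() if seen is None else seen
    try:
        requirements = metadata.requires(package) or []
    except metadata.PackageNotFoundError:
        return seen  # dependency not installed; stop here
    for req in requirements:
        # Keep just the distribution name from e.g. "urllib3 (>=1.21.1)".
        name = re.split(r"[\s;<>=!~\[(]", req.strip(), maxsplit=1)[0]
        if name and name.lower() not in seen:
            seen.add(name.lower())
            transitive_deps(name, seen)
    return seen

print(sorted(transitive_deps("pip")))
```

Running this against a web framework or an SCA-scale codebase is usually a humbling exercise: the indirect dependency set is far larger than the handful of packages listed in the project's own requirements file.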

Under “Results”

“Reviewing 47 of the top 50 non-npm projects from our version-agnostic direct list for commits in the year 2023, it was found that 17% of projects had one developer accounting for more than 80% of commits authored. Further, 40% of projects had only one or two developers accounting for more than 80% of commits authored, 64% of projects had four or less developers accounting for more than 80% of commits authored, and 81% of projects had ten or less developers accounting for more than 80% of commits authored.”
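The statistic described above can be sketched in a few lines: given per-developer commit counts for a project, find the smallest number of developers who together account for more than 80% of commits. This helper is my own illustration, not the Census III methodology or code.

```python
def devs_covering(commit_counts, threshold=0.8):
    """Smallest k such that the top k committers exceed `threshold` of commits."""
    total = sum(commit_counts)
    covered, k = 0, 0
    for count in sorted(commit_counts, reverse=True):
        covered += count
        k += 1
        if covered > threshold * total:
            return k
    return k

# One developer wrote 90 of 100 commits: a single person covers > 80%.
print(devs_covering([90, 5, 3, 2]))  # 1
# Commits spread evenly across five developers: all five are needed.
print(devs_covering([20, 20, 20, 20, 20]))  # 5
```

Projects in the report's 17% bucket look like the first example, which is the "handful of contributors" concentration risk the high-level findings warn about.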

:star: Linked in the report was this useful compilation by Nadia Eghbal (now Nadia Asparouhova) - A handy guide to financial support for open source.