
Related Literature: Justification for Impact Metrics

This document presents a literature review of metrics proposed in prior studies to assess the impact of open-source software (OSS) projects. Each section outlines the source publication, metric, methodology, and key insights.


1. GitHub Activity Metrics (Stars, Forks, Watchers)

Paper: What Makes Open Source Software Projects Impactful: A Data-Driven Approach

Metric: Weighted sum of stars, forks, and watchers

Methodology:
- A user survey was conducted to define a quantitative measurement of project impact.
- Based on the survey, forks were deemed most impactful, followed by stars, then watchers.
- An impact threshold of 100 was used to classify projects as impactful.

Key Insight: - Only 1% of repositories were classified as highly impactful.
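
The weighted-sum classification described above can be sketched as follows. The weight values are hypothetical placeholders that only preserve the reported ordering (forks, then stars, then watchers); the paper's actual weights are not reproduced here, and treating the threshold as inclusive is also an assumption.

```python
from dataclasses import dataclass

# Hypothetical weights: they only respect the survey ordering
# forks > stars > watchers and are NOT the paper's published values.
WEIGHTS = {"forks": 3.0, "stars": 2.0, "watchers": 1.0}

# Impact threshold reported in the paper.
IMPACT_THRESHOLD = 100


@dataclass
class RepoStats:
    stars: int
    forks: int
    watchers: int


def impact_score(repo: RepoStats) -> float:
    """Weighted sum of the three GitHub activity counts."""
    return (
        WEIGHTS["forks"] * repo.forks
        + WEIGHTS["stars"] * repo.stars
        + WEIGHTS["watchers"] * repo.watchers
    )


def is_impactful(repo: RepoStats) -> bool:
    """Assumes the threshold is inclusive (score >= 100)."""
    return impact_score(repo) >= IMPACT_THRESHOLD
```

With these placeholder weights, a repository needs roughly 34 forks, or 50 stars, or some mix of the two, to cross the threshold.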


Paper: GitHub Statistics as a Measure of the Impact of Open-Source Bioinformatics Software

Metric: Stars, forks, and watchers

Methodology: - Compared GitHub statistics with other bioinformatics impact measures, including citation counts, to assess whether GitHub statistics are a valid way to measure impact and popularity.

Key Insight: - Found a correlation of 0.66 between GitHub activity and citations.


2. Criticality Score

Source: Quantifying Criticality (Rob Pike)

Metric: Weighted sum of 10 signals (Criticality Score)

Methodology:
- Includes signals like contributor counts, dependents, and closed issues.
- Normalized between 0 and 1.
- Recommended by industry experts.

Key Insight: - No published results; used as a framework for measuring software criticality. The methodology for collecting the signals is not entirely clear from the documentation alone.
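
The aggregation step can be sketched as below, following the formula as stated in the Criticality Score documentation: each signal value S_i is log-scaled against a threshold T_i, multiplied by a weight alpha_i, and the total is divided by the sum of the weights. The signal values, weights, and thresholds in the example are hypothetical and are not the project's defaults.

```python
import math


def criticality_score(signals):
    """Combine signals into a score in [0, 1].

    `signals` is an iterable of (value, weight, threshold) tuples.
    Thresholds must be positive so the denominator is never log(1) = 0.
    """
    signals = list(signals)
    total_weight = sum(weight for _, weight, _ in signals)
    weighted = sum(
        weight * math.log(1 + value) / math.log(1 + max(value, threshold))
        for value, weight, threshold in signals
    )
    return weighted / total_weight


# Hypothetical inputs for illustration only.
example_signals = [
    (120, 2.0, 5000),     # contributor count
    (3500, 2.0, 500000),  # dependents
    (800, 0.5, 5000),     # closed issues
]
score = criticality_score(example_signals)
```

Because each term is capped at 1 once a signal reaches its threshold, the weighted average cannot exceed 1, which is what gives the normalization described above.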


3. Popularity and Usage (PyPI Downloads & GitHub Stars)

Paper: Exploring Popularity and Usage: A Comparative Analysis of GitHub Stars and PyPI Downloads in Python Libraries

Metric: PyPI downloads and GitHub stars

Methodology:
- Used the Linehaul project to retrieve PyPI download counts for 3,182 popular GitHub repositories.
- Computed the correlation between those repositories' PyPI download statistics and their GitHub stars.

Key Insight: - Found a low correlation (0.235) between stars and PyPI downloads, suggesting that popularity and usage overlap only partially.
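
A correlation of this kind can be computed as in the sketch below. The summary above does not say which coefficient the paper used, so both Pearson and Spearman (rank-based) variants are shown as assumptions. The star and download figures are invented for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Invented sample data, not taken from the paper.
stars = [150, 2300, 480, 12000, 75]
downloads = [90000, 40000, 1200000, 800000, 3000]
print(pearson(stars, downloads), spearman(stars, downloads))
```

Download counts are typically heavy-tailed, so a rank-based coefficient is often less sensitive to a few very popular packages than Pearson's.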


4. Economic Value of OSS

Paper: The Value of Open Source Software

Metric: Supply- and demand-side monetary value of all OSS.

Methodology:
- Supply-side: Estimated the labor cost to rewrite OSS using the COCOMO II model.
- Demand-side: Measured value using proprietary data on OSS usage and the calculated replacement cost.

Key Insights:
- Demand-side value: $8.8 trillion
- Supply-side value: $4.15 billion
- Six languages (Go, JS, TS, C, Java, Python) drive 84% of OSS value
- 5% of OSS developers create 96% of the value
- Top industries benefiting:
  - Professional/Scientific/Technical Services: $43B
  - Retail Trade: $36B
  - Administrative Support: $35B
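
The supply-side estimate rests on converting lines of code into labor cost with COCOMO II. A minimal sketch of that conversion follows, using the COCOMO II.2000 calibration constants (A = 2.94, B = 0.91) with all scale factors and effort multipliers at their nominal values. Whether the paper used exactly these settings is an assumption here, and the monthly wage is a hypothetical input.

```python
# COCOMO II.2000 calibration constants.
A = 2.94
B = 0.91
# Sum of the five scale factors when every one is rated "nominal".
NOMINAL_SCALE_FACTOR_SUM = 18.97


def cocomo2_effort_person_months(
    sloc: int,
    effort_multiplier: float = 1.0,  # product of cost drivers; 1.0 = nominal
    scale_factor_sum: float = NOMINAL_SCALE_FACTOR_SUM,
) -> float:
    """Effort = A * KSLOC^E * EM, with E = B + 0.01 * sum(scale factors)."""
    ksloc = sloc / 1000
    exponent = B + 0.01 * scale_factor_sum
    return A * ksloc ** exponent * effort_multiplier


def replacement_cost(sloc: int, monthly_wage: float) -> float:
    """Labor cost to rewrite `sloc` lines at a given monthly wage."""
    return cocomo2_effort_person_months(sloc) * monthly_wage


# Hypothetical package size and wage, for illustration only.
cost = replacement_cost(sloc=50_000, monthly_wage=8_000.0)
```

With nominal settings the exponent is about 1.0997, so estimated effort grows slightly faster than linearly with code size.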


5. Centrality and Cost Models

Paper: Measuring the Impact of Open Source Software Innovation Using Network Analysis on GitHub Hosted Python Packages

Metric: Centrality (degree, eigenvector), COCOMO, and PyPI downloads

Methodology:
- Focused on packages listed on PyPI.
- Used Google BigQuery to gather dependents data.
- Applied network analysis to assess centrality.

Key Insight: - Degree and eigenvector centrality were strongly correlated with PyPI downloads.
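
The two centrality measures can be illustrated on a toy dependency graph, as sketched below. Treating the graph as undirected is an assumption made to keep the eigenvector computation well-behaved, and the paper may have handled direction differently. The package names are hypothetical.

```python
import math


def build_undirected(edges):
    """Adjacency sets for an undirected graph from (a, b) pairs."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    return neighbors


def degree_centrality(neighbors):
    """Degree divided by the maximum possible degree (n - 1)."""
    n = len(neighbors)
    return {v: len(adj) / (n - 1) for v, adj in neighbors.items()}


def eigenvector_centrality(neighbors, max_iter=1000, tol=1e-10):
    """Power iteration on (A + I), normalized to unit Euclidean length.

    Adding the identity avoids oscillation on bipartite graphs
    without changing the leading eigenvector.
    """
    x = {v: 1.0 for v in neighbors}
    for _ in range(max_iter):
        new = {v: x[v] + sum(x[u] for u in adj) for v, adj in neighbors.items()}
        norm = math.sqrt(sum(val * val for val in new.values()))
        new = {v: val / norm for v, val in new.items()}
        if sum(abs(new[v] - x[v]) for v in new) < tol:
            return new
        x = new
    return x


# Hypothetical (dependent, dependency) pairs.
edges = [
    ("app1", "core"), ("app2", "core"), ("app3", "core"),
    ("core", "util"), ("app3", "extra"),
]
graph = build_undirected(edges)
degree = degree_centrality(graph)
eigen = eigenvector_centrality(graph)
```

In this toy graph the package `core` ranks first on both measures, since most other packages connect to it directly.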


6. Weighted HITS Influence Score

Paper: Influence Analysis of GitHub Repositories

Metric: Weighted HITS on stars graph

Methodology:
- Constructed a bipartite graph between users and repositories using star relationships.
- Edges were weighted by fork counts.
- Applied the HITS algorithm to compute repository influence.

Key Insight: - Identified the 10 most influential repositories for each programming language.
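
A weighted HITS iteration on a user-repository star graph can be sketched as follows. How the paper derives edge weights from fork counts is not specified in this summary; using `1 + forks` of the starred repository is an assumption, chosen so that repositories with zero forks still receive some weight. All user and repository names are hypothetical.

```python
import math


def _normalize(scores):
    """Scale a score dict to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    return {k: v / norm for k, v in scores.items()}


def weighted_hits(edges, iterations=100):
    """HITS on a bipartite graph.

    `edges` maps (user, repo) -> weight. Users act as hubs and
    repositories as authorities.
    """
    hub = {user: 1.0 for user, _ in edges}
    authority = {repo: 1.0 for _, repo in edges}
    for _ in range(iterations):
        # Authority update: weighted sum of hub scores of starring users.
        new_authority = {repo: 0.0 for repo in authority}
        for (user, repo), weight in edges.items():
            new_authority[repo] += weight * hub[user]
        authority = _normalize(new_authority)
        # Hub update: weighted sum of authority scores of starred repos.
        new_hub = {user: 0.0 for user in hub}
        for (user, repo), weight in edges.items():
            new_hub[user] += weight * authority[repo]
        hub = _normalize(new_hub)
    return hub, authority


# Hypothetical star relationships and fork counts.
stars = [
    ("alice", "repo_x"), ("bob", "repo_x"), ("bob", "repo_y"),
    ("carol", "repo_y"), ("carol", "repo_z"),
]
forks = {"repo_x": 50, "repo_y": 5, "repo_z": 0}
edges = {(user, repo): 1 + forks[repo] for user, repo in stars}

hub, authority = weighted_hits(edges)
most_influential = max(authority, key=authority.get)
```

Repository influence corresponds to the authority score; the hub score indicates users whose stars point at influential repositories.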


7. Software Innovation via Dependency Growth

Paper: Measuring Software Innovation with Open Source Software Development Data

Metric: Log-difference in number of dependents

Methodology:
- Measured innovation as the change in number of dependents between GitHub releases.
- Analyzed 200,000 unique releases across 28,000 unique packages.
- Likely used GitHub's dependency graph to get information about dependencies.

Key Insight: - Found a slight positive correlation between software complexity and dependency growth.
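
The log-difference metric can be computed per release as in the sketch below. How the paper treats releases with zero dependents is not stated here; returning `None` for those pairs is an assumption, as are the sample counts.

```python
import math
from typing import List, Optional


def log_difference(previous: int, current: int) -> Optional[float]:
    """log(current) - log(previous), or None if either count is zero."""
    if previous <= 0 or current <= 0:
        return None
    return math.log(current) - math.log(previous)


def innovation_series(dependents_per_release: List[int]) -> List[Optional[float]]:
    """Log-difference between each pair of consecutive releases."""
    return [
        log_difference(prev, curr)
        for prev, curr in zip(dependents_per_release, dependents_per_release[1:])
    ]


# Hypothetical dependent counts at four consecutive releases.
series = innovation_series([10, 20, 20, 0])
```

A log-difference approximates the relative growth rate, so a package going from 10 to 20 dependents scores the same as one going from 1,000 to 2,000.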


8. PageRank and Disconnected Packages

Paper: Identifying Critical Projects via PageRank and Truck Factor / DaSEA – A Dataset for Software Ecosystem Analysis

Metric: PageRank on dependency networks

Methodology:
- Built ecosystem-specific dependency graphs.
- Analyzed disconnected packages (no dependents or requirements).
- Compared ecosystems by their share of disconnected packages:
  - NPM: 18%
  - Cargo: 22%
  - Maven: 40%
  - Packagist: 43%
  - PyPI: 79%

Key Insight: - Suggested PageRank as an alternative to criticality score and highlighted ecosystem fragmentation.
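
Both ideas, PageRank over a dependency graph and the share of disconnected packages, can be sketched on a toy ecosystem. The damping factor of 0.85 and the uniform redistribution of rank from packages without dependencies are common defaults assumed here, not settings confirmed by the papers. The package names are hypothetical.

```python
def pagerank(nodes, edges, damping=0.85, iterations=100):
    """PageRank by power iteration.

    `edges` holds (dependent, dependency) pairs, so rank flows
    toward packages that others depend on.
    """
    n = len(nodes)
    out_links = {v: [] for v in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            if out_links[v]:
                share = damping * rank[v] / len(out_links[v])
                for dst in out_links[v]:
                    new_rank[dst] += share
        rank = new_rank
    return rank


def disconnected_share(nodes, edges):
    """Fraction of packages with neither dependents nor dependencies."""
    connected = {v for edge in edges for v in edge}
    return sum(1 for v in nodes if v not in connected) / len(nodes)


# Hypothetical ecosystem with one isolated package.
nodes = ["core", "lib", "app1", "app2", "orphan"]
edges = [("app1", "lib"), ("app2", "lib"), ("lib", "core")]

rank = pagerank(nodes, edges)
share = disconnected_share(nodes, edges)
```

Here `core` receives the highest rank because it sits at the end of every dependency chain, while `orphan` only receives the baseline rank shared by all packages.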