Related Literature: Justification for Impact Metrics
This document presents a literature review of metrics proposed in prior studies to assess the impact of open-source software (OSS) projects. Each section outlines the source publication, metric, methodology, and key insights.
1. GitHub Activity Metrics (Stars, Forks, Watchers)
Paper: What Makes Open Source Software Projects Impactful: A Data-Driven Approach
Metric: Weighted sum of stars, forks, and watchers
Methodology:
- A user survey was conducted to define a quantitative measure of project impact.
- Based on the survey, forks were deemed the most impactful signal, followed by stars, then watchers.
- A score threshold of 100 was used to classify projects as impactful.
Key Insight:
- Only 1% of repositories were classified as highly impactful.
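The weighted-sum metric described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the numeric weights are hypothetical (the summary only gives the ordering forks > stars > watchers), and treating a score at or above the threshold as impactful is an assumption.

```python
from dataclasses import dataclass

# Hypothetical weights: only the ordering forks > stars > watchers comes from the source.
WEIGHTS = {"forks": 3.0, "stars": 2.0, "watchers": 1.0}
IMPACT_THRESHOLD = 100  # threshold reported in the summary above


@dataclass
class RepoStats:
    stars: int
    forks: int
    watchers: int


def impact_score(repo: RepoStats, weights: dict = WEIGHTS) -> float:
    """Weighted sum of the three GitHub activity counts."""
    return (
        weights["forks"] * repo.forks
        + weights["stars"] * repo.stars
        + weights["watchers"] * repo.watchers
    )


def is_impactful(repo: RepoStats, threshold: float = IMPACT_THRESHOLD) -> bool:
    # Assumption: a score at or above the threshold counts as impactful.
    return impact_score(repo) >= threshold
```

With these placeholder weights, a repository with 10 stars, 5 forks, and 2 watchers scores 37 and falls below the threshold.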
Paper: GitHub Statistics as a Measure of the Impact of Open-Source Bioinformatics Software
Metric: Stars, forks, and watchers
Methodology:
- Compared GitHub statistics with other measures of bioinformatics software impact, including citation counts, to assess whether GitHub statistics are a valid measure of impact and popularity.
Key Insight:
- Found a correlation of 0.66 between GitHub activity and citations.
2. Criticality Score
Source: Quantifying Criticality (Rob Pike)
Metric: Weighted sum of 10 signals (Criticality Score)
Methodology:
- Includes signals such as contributor count, dependents, and closed issues.
- The score is normalized to the range 0 to 1.
- Recommended by industry experts.
Key Insight:
- No published results; the score is used as a framework for measuring software criticality. The methodology for collecting the signals is not entirely clear from the documentation alone.
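For reference, the aggregation step can be sketched as a weighted, log-scaled average in which each signal value S_i is capped by a threshold T_i and weighted by alpha_i. This follows the formula as given in Quantifying Criticality, to the best of our reading; the example signal values, weights, and thresholds are hypothetical, and how the signals are collected is outside the scope of the sketch.

```python
import math


def criticality_score(signals: list) -> float:
    """Aggregate signals into a score in [0, 1].

    signals: list of (value S_i, weight alpha_i, threshold T_i) tuples.
    Thresholds must be positive so the denominator is never log(1) = 0.
    """
    total_weight = sum(alpha for _, alpha, _ in signals)
    weighted = sum(
        alpha * math.log(1 + value) / math.log(1 + max(value, threshold))
        for value, alpha, threshold in signals
    )
    return weighted / total_weight


# Hypothetical signals: (value, weight, threshold)
example = [
    (50, 2.0, 5000),      # e.g. contributor count
    (1200, 1.0, 500000),  # e.g. dependents
]
score = criticality_score(example)
```

A signal that meets or exceeds its threshold contributes its full weight, which is what keeps the result between 0 and 1.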
3. Popularity and Usage (PyPI Downloads & GitHub Stars)
Paper: Exploring Popularity and Usage: A Comparative Analysis of GitHub Stars and PyPI Downloads in Python Libraries
Metric: PyPI downloads and GitHub stars
Methodology:
- Used the Linehaul project to retrieve PyPI download counts for 3,182 popular GitHub repositories.
- Computed the correlation between those repositories' PyPI download counts and their GitHub stars.
Key Insight:
- Found a low correlation (0.235) between stars and PyPI downloads, suggesting that the two metrics only partially overlap.
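A comparison of this kind can be sketched with a rank correlation. The summary does not state which correlation coefficient the paper used, so Spearman's rank correlation is an assumption here, chosen because star and download counts are heavily skewed. The helper names are illustrative.

```python
import math


def ranks(values: list) -> list:
    """Average 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x: list, y: list) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x: list, y: list) -> float:
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

Calling `spearman(stars, downloads)` on two aligned lists of per-repository counts yields a value between -1 and 1.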
4. Economic Value of OSS
Paper: The Value of Open Source Software
Metric: Supply- and demand-side monetary value of all OSS.
Methodology:
- Supply-side: Estimated the labor cost of rewriting existing OSS using the COCOMO II model.
- Demand-side: Measured value using proprietary data on OSS usage and a calculated replacement cost.
Key Insights:
- Demand-side value: $8.8 trillion
- Supply-side value: $4.15 billion
- Six languages (Go, JS, TS, C, Java, Python) drive 84% of OSS value
- 5% of OSS developers create 96% of the value
- Top industries benefiting:
    - Professional/Scientific/Technical Services: $43B
    - Retail Trade: $36B
    - Administrative Support: $35B
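The supply-side idea (labor cost to rewrite the code) can be illustrated with the general COCOMO II effort form, effort = A × size^E × effort multipliers, with size in thousands of lines of code. The default constants below are the nominal COCOMO II.2000 calibration as commonly cited and are an assumption of this sketch, not values taken from the paper; the wage figure is likewise hypothetical.

```python
def cocomo2_effort_person_months(
    kloc: float,
    a: float = 2.94,          # assumed nominal COCOMO II.2000 constant
    exponent: float = 1.0997,  # assumed nominal scale exponent
    effort_multiplier: float = 1.0,
) -> float:
    """Estimated effort in person-months for `kloc` thousand lines of code."""
    return a * (kloc ** exponent) * effort_multiplier


def replacement_cost(kloc: float, monthly_wage: float, **params) -> float:
    """Labor cost to rewrite the code: effort times a monthly wage."""
    return cocomo2_effort_person_months(kloc, **params) * monthly_wage


# Hypothetical example: a 50 KLOC package and a 10,000 USD monthly wage.
cost = replacement_cost(50.0, 10_000.0)
```

Because the exponent is greater than 1, estimated effort grows faster than linearly with code size.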
5. Centrality and Cost Models
Paper: Measuring the Impact of Open Source Software Innovation Using Network Analysis on GitHub Hosted Python Packages
Metric: Centrality (degree, eigenvector), COCOMO, and PyPI downloads
Methodology:
- Focused on packages listed on PyPI.
- Used Google BigQuery to gather data on dependents.
- Applied network analysis to assess centrality.
Key Insight:
- Degree and eigenvector centrality correlated strongly with PyPI downloads.
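As an illustration of the simpler of the two centrality measures, the sketch below computes in-degree centrality (the share of other packages that depend on a package) from a dependency mapping. The input format and package names are hypothetical, and eigenvector centrality is not shown.

```python
def in_degree_centrality(dependencies: dict) -> dict:
    """dependencies maps each package to the set of packages it depends on.

    Returns, for every package, its number of direct dependents divided by
    the number of other packages in the graph.
    """
    nodes = set(dependencies)
    for deps in dependencies.values():
        nodes.update(deps)

    dependents = {node: 0 for node in nodes}
    for deps in dependencies.values():
        for dep in deps:
            dependents[dep] += 1

    scale = max(len(nodes) - 1, 1)  # avoid division by zero for a single node
    return {node: count / scale for node, count in dependents.items()}


# Hypothetical dependency data
graph = {
    "app": {"requests", "numpy"},
    "requests": {"urllib3"},
    "pandas": {"numpy"},
}
centrality = in_degree_centrality(graph)
```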
6. Weighted HITS Influence Score
Paper: Influence Analysis of GitHub Repositories
Metric: Weighted HITS on stars graph
Methodology:
- Constructed a bipartite graph between users and repositories using star relationships.
- Edges were weighted by fork counts.
- Applied the HITS algorithm to compute repository influence.
Key Insight:
- Identified the top 10 most influential repositories for each programming language.
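The approach above can be sketched as a weighted HITS iteration on a user-repository graph, where users act as hubs and repositories as authorities. This is a generic sketch of weighted HITS, not the paper's code: the edge weights in the example are placeholders, since the summary does not specify exactly how fork counts map onto edges.

```python
import math


def _normalize(scores: dict) -> dict:
    """Scale scores to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {key: v / norm for key, v in scores.items()}


def weighted_hits(edges: dict, iterations: int = 50):
    """edges maps (user, repo) pairs to a positive weight.

    Returns (hub scores per user, authority scores per repo).
    """
    users = {user for user, _ in edges}
    repos = {repo for _, repo in edges}
    hub = {user: 1.0 for user in users}
    authority = {repo: 1.0 for repo in repos}

    for _ in range(iterations):
        # A repository is influential if starred by strong hubs.
        new_authority = {repo: 0.0 for repo in repos}
        for (user, repo), weight in edges.items():
            new_authority[repo] += weight * hub[user]
        authority = _normalize(new_authority)

        # A user is a strong hub if they star influential repositories.
        new_hub = {user: 0.0 for user in users}
        for (user, repo), weight in edges.items():
            new_hub[user] += weight * authority[repo]
        hub = _normalize(new_hub)

    return hub, authority


# Hypothetical star edges with placeholder weights
stars = {("alice", "repo_a"): 1.0, ("bob", "repo_a"): 1.0, ("bob", "repo_b"): 1.0}
hub_scores, authority_scores = weighted_hits(stars)
```

Sorting `authority_scores` in descending order gives a repository ranking analogous to the per-language top-10 lists.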
7. Software Innovation via Dependency Growth
Paper: Measuring Software Innovation with Open Source Software Development Data
Metric: Log-difference in number of dependents
Methodology:
- Measured innovation as the change in the number of dependents between GitHub releases.
- Analyzed 200,000 unique releases across 28,000 unique packages.
- Likely used GitHub's dependency graph to obtain information about dependencies.
Key Insight:
- Found a slight positive correlation between software complexity and dependency growth.
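The log-difference metric can be sketched as below. Using log(1 + d) rather than log(d) is an assumption made here so that releases with zero dependents are well defined; the summary does not say how the paper handles that case.

```python
import math


def dependents_log_diff(previous: int, current: int) -> float:
    """Log-difference in dependents between two consecutive releases.

    Assumption: log(1 + d) is used so that zero dependents is well defined.
    """
    return math.log1p(current) - math.log1p(previous)


def innovation_series(dependents_by_release: list) -> list:
    """Log-differences for each pair of consecutive releases."""
    return [
        dependents_log_diff(previous, current)
        for previous, current in zip(dependents_by_release, dependents_by_release[1:])
    ]


# Hypothetical dependent counts at four successive releases
series = innovation_series([0, 9, 99, 99])
```

A positive value indicates that the number of dependents grew between releases, and zero indicates no change.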
8. PageRank and Disconnected Packages
Paper: Identifying Critical Projects via PageRank and Truck Factor / DaSEA – A Dataset for Software Ecosystem Analysis
Metric: PageRank on dependency networks
Methodology:
- Built ecosystem-specific dependency graphs.
- Analyzed disconnected packages (packages with neither dependents nor dependencies).
- Compared ecosystems by their share of disconnected packages:
    - NPM: 18%
    - Cargo: 22%
    - Maven: 40%
    - Packagist: 43%
    - PyPI: 79%
Key Insight:
- Suggested PageRank as an alternative to the Criticality Score and highlighted ecosystem fragmentation.
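Both ideas in this section, PageRank over a dependency graph and the share of disconnected packages, can be sketched in a few lines. This is a standard power-iteration PageRank, not the papers' implementation; the damping factor, iteration count, input format, and package names are assumptions.

```python
def _all_nodes(dependencies: dict) -> set:
    nodes = set(dependencies)
    for deps in dependencies.values():
        nodes.update(deps)
    return nodes


def pagerank(dependencies: dict, damping: float = 0.85, iterations: int = 100) -> dict:
    """dependencies maps each package to the set of packages it depends on.

    Rank flows from a package to its dependencies, so widely used packages
    accumulate a high score.
    """
    nodes = _all_nodes(dependencies)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Packages without dependencies spread their rank uniformly.
        dangling = sum(rank[node] for node in nodes if not dependencies.get(node))
        new_rank = {node: (1 - damping) / n + damping * dangling / n for node in nodes}
        for package, deps in dependencies.items():
            if deps:
                share = damping * rank[package] / len(deps)
                for dep in deps:
                    new_rank[dep] += share
        rank = new_rank
    return rank


def disconnected_fraction(dependencies: dict) -> float:
    """Share of packages with neither dependents nor dependencies."""
    nodes = _all_nodes(dependencies)
    has_dependents = set()
    for deps in dependencies.values():
        has_dependents.update(deps)
    isolated = [
        node for node in nodes
        if not dependencies.get(node) and node not in has_dependents
    ]
    return len(isolated) / len(nodes)


# Hypothetical ecosystem: "d" is disconnected
ecosystem = {"a": {"c"}, "b": {"c"}, "c": set(), "d": set()}
ranks_by_package = pagerank(ecosystem)
isolated_share = disconnected_fraction(ecosystem)
```

In an ecosystem with many disconnected packages, most nodes receive only the baseline rank, which is one way to see the fragmentation noted above.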