Methodology
This section outlines the data sources, tools, and research design used to assess the impact of CERN’s open-source software (OSS). The study combines Software Heritage graph data, external metadata, and usage metrics, alongside interactive visualizations in a dedicated dashboard to explore and communicate results. The methodology is informed by prior work on OSS contribution analysis (TODO: add link to paper once published).
Data Sources
Software Heritage Graph
A core component of this study is the Software Heritage (SWH) Graph, a comprehensive archive that collects, preserves, and indexes publicly available source code from platforms such as GitHub and GitLab. The graph represents software development history as a versioned directed acyclic graph (DAG), with nodes for commits (revisions), files (contents), directories, releases, and more. For a detailed explanation of the graph structure, see the Software Heritage data model documentation.
To identify CERN-related repositories, I traverse the revision nodes in the graph and extract author metadata, selecting commits whose pseudonymized author email has the domain @cern.ch. This attributes individual commits to CERN-affiliated developers. Any project with at least one @cern.ch-authored commit is included. For each included project, I then collect all commits and calculate the proportion of CERN-authored commits relative to the total.
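The selection and share calculation described above can be sketched as follows. This is a minimal illustration, not the study's actual pipeline: it assumes the commits have already been extracted from the graph as `(repository, author_email)` pairs with a readable email domain, and the names `is_cern_author` and `cern_commit_share` are hypothetical.

```python
from collections import defaultdict

CERN_DOMAIN = "@cern.ch"  # the domain named in the text


def is_cern_author(email: str) -> bool:
    """Return True if the author email ends with the CERN domain (case-insensitive)."""
    return email.strip().lower().endswith(CERN_DOMAIN)


def cern_commit_share(commits):
    """Compute the CERN commit share per repository.

    `commits` is an iterable of (repository, author_email) pairs; this input
    shape is an assumption made for the sketch. Repositories without any
    CERN-authored commit are excluded, mirroring the inclusion rule above.
    """
    total = defaultdict(int)
    cern = defaultdict(int)
    for repo, email in commits:
        total[repo] += 1
        if is_cern_author(email):
            cern[repo] += 1
    return {repo: cern[repo] / total[repo] for repo in total if cern[repo] > 0}


example = [
    ("repo-a", "alice@cern.ch"),
    ("repo-a", "bob@example.org"),
    ("repo-b", "carol@example.org"),
]
shares = cern_commit_share(example)
```

In this toy input, `repo-a` has one CERN-authored commit out of two, and `repo-b` is dropped because it has none.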
External Metadata & Usage Metrics
To complement the SWH data, I collect additional repository-level and usage metrics from:
- GitHub and GitLab: Repository stars, forks, watchers, and the programming language composition.
- Package registries (e.g., PyPI): Download counts and dependency relationships.
- Criticality Score: The OpenSSF Criticality Score provides an assessment of each project’s ecosystem importance.
These complementary data sources provide insight into real-world adoption and usage of CERN OSS projects.
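As an illustration of the repository-level metrics listed above, the sketch below normalizes two already-fetched GitHub responses into a single record. The source does not say which API or client is used, so this is an assumption: the field names `stargazers_count`, `forks_count`, and `subscribers_count` follow the GitHub REST API repository object as I understand it, the languages payload is assumed to map language names to byte counts, and `RepoMetrics` and `parse_github_metrics` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class RepoMetrics:
    """Hypothetical container for the repository-level metrics named in the text."""
    stars: int
    forks: int
    watchers: int
    language_share: dict = field(default_factory=dict)


def parse_github_metrics(repo_json: dict, languages_json: dict) -> RepoMetrics:
    """Build a RepoMetrics record from already-fetched JSON payloads.

    No network call is made here; both arguments are plain dictionaries.
    Missing fields default to 0 rather than raising.
    """
    total_bytes = sum(languages_json.values())
    if total_bytes:
        share = {lang: n / total_bytes for lang, n in languages_json.items()}
    else:
        share = {}
    return RepoMetrics(
        stars=repo_json.get("stargazers_count", 0),
        forks=repo_json.get("forks_count", 0),
        # Assumption: "subscribers_count" is the field that reflects watchers.
        watchers=repo_json.get("subscribers_count", 0),
        language_share=share,
    )


metrics = parse_github_metrics(
    {"stargazers_count": 10, "forks_count": 2, "subscribers_count": 3},
    {"Python": 750, "C++": 250},
)
```

Keeping the parsing separate from the fetching makes it straightforward to add equivalent parsers for GitLab or package registries without changing the record type.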
Research Design
The analysis is guided by two primary research goals:
1. Identification and Characterization of CERN Projects
Using the Software Heritage graph, I:
- Identify repositories with CERN involvement (at least one @cern.ch-authored commit).
- Quantify the proportion of CERN-authored commits.
- Construct a timeline of CERN vs. external commits for each repository.
This approach shows which OSS projects CERN affiliates contribute to and how development activity is split between CERN and external developers.
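A cumulative CERN vs. external timeline for one repository could be built along these lines. This is a sketch under stated assumptions: commits are given as `(commit_date, is_cern)` pairs, the aggregation is monthly (the source does not specify a granularity), and `cumulative_timeline` is a hypothetical name.

```python
from collections import Counter
from datetime import date
from itertools import accumulate


def cumulative_timeline(commits):
    """Return [(month, cumulative_cern, cumulative_external), ...] sorted by month.

    `commits` is an iterable of (commit_date, is_cern) pairs. Months without
    any commit are skipped rather than filled in.
    """
    cern = Counter()
    external = Counter()
    for day, is_cern in commits:
        month = (day.year, day.month)
        if is_cern:
            cern[month] += 1
        else:
            external[month] += 1
    months = sorted(set(cern) | set(external))
    cum_cern = accumulate(cern[m] for m in months)
    cum_external = accumulate(external[m] for m in months)
    return [
        (f"{year:04d}-{month:02d}", c, e)
        for (year, month), c, e in zip(months, cum_cern, cum_external)
    ]


timeline = cumulative_timeline([
    (date(2020, 1, 5), True),
    (date(2020, 1, 20), False),
    (date(2020, 3, 1), True),
    (date(2020, 3, 2), True),
])
```

For this input, the result has one row for January 2020 (one CERN and one external commit) and one for March 2020 (three CERN commits in total, still one external).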
2. Impact Metrics
For each project, I collect impact indicators including:
- GitHub stars, watchers, and forks
- Package download counts
- Number of reverse dependencies (dependents)
Combining contribution metrics (e.g., commits, diversity of contributors) with usage metrics (e.g., downloads, popularity, dependency data) makes it possible to quantify both CERN's engagement in OSS and the broader adoption of these projects.
Dashboard Development and Visualization
To facilitate exploration and communication of results, I developed an interactive dashboard. The dashboard supports both high-level insights and detailed analysis of individual projects.
High-Level Overview
Aggregated across repositories, the visualizations show:
- Total number of repositories and commits authored by CERN emails
- Commit activity and unique authors over time (time series)
- Distribution of repositories by CERN commit share (histogram)
- Variation in the number of CERN-affiliated authors across repositories grouped by total number of contributors (box plot)
- Top 20 repositories ranked by GitHub stars, watchers, forks, and PyPI download counts
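Two of the aggregations above, the histogram of CERN commit share and the top-20 ranking, reduce to simple computations. The sketch below shows one way they might be prepared before plotting; the ten equal-width bins, the dictionary layout, and the function names `share_histogram` and `top_repositories` are assumptions, not the dashboard's actual code.

```python
def share_histogram(shares, bins=10):
    """Count repositories per equal-width bin of CERN commit share in [0, 1].

    A share of exactly 1.0 is placed in the last bin.
    """
    counts = [0] * bins
    for share in shares:
        index = min(int(share * bins), bins - 1)
        counts[index] += 1
    return counts


def top_repositories(metrics, key, n=20):
    """Return the names of the n repositories with the highest value for `key`.

    `metrics` maps a repository name to a dict of metric values; repositories
    missing the metric are treated as 0.
    """
    ranked = sorted(metrics, key=lambda repo: metrics[repo].get(key, 0), reverse=True)
    return ranked[:n]


histogram = share_histogram([0.05, 0.5, 0.99, 1.0])
top_by_stars = top_repositories(
    {"repo-a": {"stars": 5}, "repo-b": {"stars": 50}, "repo-c": {}},
    key="stars",
    n=2,
)
```

The same `top_repositories` call can be reused for watchers, forks, or download counts by changing `key`.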
Detailed Repository Insights
The second pane focuses on individual repositories and includes:
- A searchable, sortable table with repository URLs and key metrics: proportion of CERN commits, GitHub popularity score, and criticality score
- A cumulative timeline of commits, separated by CERN and external developers
- Bar charts summarizing unique authors and commits by contributor type
- Bar chart of repository stars, watchers, and forks
- Pie chart of programming languages used in the project
The dashboard serves as a central exploratory tool for understanding both aggregate patterns and granular details of CERN’s OSS contributions.