GSoC'25 @ The R Project for Statistical Computing
This summer I got selected to work on a project with The R Project for Statistical Computing (as part of Google Summer of Code 2025). The project is titled Optimizing a performance testing workflow by reusing minified R package versions between CI runs, and my mentors for this project are Anirban Chetia and Toby Dylan Hocking.
The core aim of my project is to directly benefit R developers by streamlining a critical part of their workflow. Many R packages, with data.table being a prime example, use the atime package for performance benchmarking across different versions. This is crucial for identifying performance regressions or improvements, especially when reviewing new contributions.
Currently, these CI performance tests can be quite time-consuming, often rebuilding multiple package versions repeatedly, and we have a limited CI budget (500 MB of storage and 2,000 minutes per month at present). My project aims to address this by implementing a caching mechanism that reuses previously built package versions across different CI runs.
A key challenge in CI is securely handling pull requests from external forks, as these forks typically don't have access to repository secrets (needed to, say, comment back on the PR). It is also important to ensure that a malicious actor cannot exploit this to gain access to the repository secrets. Thus I will be implementing a two-step process, where the first step builds the package versions and runs the performance tests, and the second step comments the results back on the PR, as sketched below.
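To make the split concrete, here is a minimal sketch of the two-step idea using a `workflow_run` trigger, so that only the trusted half runs in the base repository's context with access to secrets. All workflow, artifact, and file names below are placeholders I chose for illustration, not the actual implementation:

```yaml
# .github/workflows/run-tests.yml
# Runs in the fork's context: no access to repository secrets.
name: Run performance tests
on: pull_request
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run the performance tests
        run: Rscript -e 'atime::atime_pkg(".")'  # assuming atime_pkg as the entry point
      - uses: actions/upload-artifact@v4
        with:
          name: atime-results
          path: results/

# .github/workflows/comment-results.yml
# Triggered after the untrusted workflow completes; runs in the base
# repository's context, so secrets are available here.
name: Comment results
on:
  workflow_run:
    workflows: ["Run performance tests"]
    types: [completed]
jobs:
  comment:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/download-artifact@v4
        with:
          name: atime-results
          run-id: ${{ github.event.workflow_run.id }}
          github-token: ${{ secrets.GITHUB_TOKEN }}
      - name: Post the results back on the PR
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        # The PR number has to be recovered somehow, e.g. from the artifact.
        run: gh pr comment "$PR_NUMBER" -R "${{ github.repository }}" --body-file report.md
```

The important property is that the fork's code never executes in a job that holds secrets; the trusted workflow only reads an artifact.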
Qualification Tests
As part of the application process, I completed a series of qualification tests in January 2025. The tasks tackled core components of the proposed project, allowing me to hit the ground running for GSoC. Through these tests, I effectively laid the groundwork for a significant portion of the project:
- Package Minification Script (Easy Task): I developed a script that takes any R package tarball, strips out unnecessary files (like vignettes, documentation, and tests), and then installs this “minified” version. This directly addresses the goal of reducing package size for faster CI installations (see the sketch after this list).
- GitHub Action for Minification and Artifact Upload (Medium Task): Building on the minification script, I created a GitHub Action that can read a package name and version (e.g., from an issue description), check whether a minified version already exists (a precursor to caching), minify it if not, and then upload the minified package as a build artifact. This was a crucial step towards implementing the artifact caching strategy.
- Supporting PRs from Forks in Autocomment-atime-results (Hard Task): This tackled the challenge of securely handling CI for pull requests from external forks. I modified the existing Autocomment-atime-results workflow to support this. The solution involved a two-part workflow: one part runs the performance tests and uploads the results as artifacts (which forked PRs can safely do), and a separate, trusted part then downloads these artifacts and comments the results back onto the pull request. I also cloned the data.table repository with its historical branches to thoroughly test this setup.
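As an illustration of the Easy and Medium tasks, a minify-and-upload job might look roughly like this. It is a sketch under the assumption that the tarball was already fetched in an earlier step; the package name, paths, and artifact name are placeholders, not the actual task scripts:

```yaml
jobs:
  minify:
    runs-on: ubuntu-latest
    steps:
      # Assumes the tarball (e.g. data.table_*.tar.gz) is already present.
      - name: Strip non-essential files and repackage
        run: |
          tar -xzf data.table_*.tar.gz
          rm -rf data.table/vignettes data.table/tests data.table/inst/doc
          tar -czf data.table_minified.tar.gz data.table
      - name: Install the minified version
        run: R CMD INSTALL data.table_minified.tar.gz
      - name: Upload the minified tarball as an artifact
        uses: actions/upload-artifact@v4
        with:
          name: data.table-minified
          path: data.table_minified.tar.gz
```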
Week 0: Community Bonding Period
In my initial plan, I thought of implementing the caching mechanism in the atime package itself, and as that was something I had not worked on in the qualification tasks, I planned on exploring the package and working on it. The idea was simple: in the local development environment, we would cache the historical data.table builds into a cache directory, then in CI we would upload this cache directory as an artifact and download it in the next run. I made this PR for some feedback, and it turned out I had missed a pretty easy way to do this. Toby suggested that I could just cache the library directory of the package, instead of saving the builds again in an atime cache directory, and that I implement the caching mechanism in the CI workflow itself, which would be much more efficient.
Week 1: Implementing Caching in CI
In the first week of GSoC, I focused on implementing the caching mechanism directly in the CI workflow. The goal was to reuse previously built package versions across different CI runs, significantly speeding up the performance testing process. I started by modifying the existing GitHub Actions workflow I had made during the Hard Task of the qualification tests. I cached both the library directory and the built libgit2 files, which was another place where time was being wasted in the CI.
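For reference, a caching step of this kind can be written with actions/cache roughly as follows; the paths and the cache key scheme here are illustrative rather than the exact ones from my workflow:

```yaml
- name: Cache built package versions and libgit2
  uses: actions/cache@v4
  with:
    path: |
      library       # R library containing the built data.table versions
      libgit2       # compiled libgit2 files
    key: atime-${{ runner.os }}-${{ hashFiles('DESCRIPTION') }}
    restore-keys: |
      atime-${{ runner.os }}-
```

On a cache hit, the previously built versions are restored directly into place, so the workflow skips rebuilding them entirely.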
Week 2: Combining the Two Steps into One Action
Anirban had previously suggested having one action that would run the performance tests and comment the results back on the PR, instead of two separate actions. So I decided to make it two jobs in the same action: one that runs the performance tests and uploads the results as artifacts, and another that downloads these artifacts and comments the results back on the PR. This simplifies the workflow and makes it easier to manage, as well as to adopt in any package that uses autocomment-atime-results.
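A skeleton of the combined workflow, with the second job gated on the first via `needs`; the step details are elided and the names are placeholders:

```yaml
name: autocomment-atime-results
on: pull_request
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # ... run the performance tests, producing results/ ...
      - uses: actions/upload-artifact@v4
        with:
          name: atime-results
          path: results/
  comment:
    needs: test  # only starts once the test job has succeeded
    runs-on: ubuntu-latest
    steps:
      - uses: actions/download-artifact@v4
        with:
          name: atime-results
      # ... post the downloaded results as a comment on the PR ...
```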