GSoC'25 @ The R Project for Statistical Computing

Weekly updates on my Google Summer of Code 2025 project with The R Project for Statistical Computing.

This summer I got selected to work on a project with The R Project for Statistical Computing as part of Google Summer of Code 2025. The project is titled "Optimizing a performance testing workflow by reusing minified R package versions between CI runs", and my mentors are Anirban Chetia and Toby Dylan Hocking.

The core aim of my project is to directly benefit R developers by streamlining a critical part of their workflow. Many R packages, with data.table being a prime example, utilize the atime package for performance benchmarking across different versions. This is crucial for identifying performance regressions or improvements, especially when reviewing new contributions.

Currently, these CI performance tests can be quite time-consuming, often rebuilding the same package versions over and over. We also have limited CI resources available (currently 500 MB of storage and 2,000 minutes per month), so every unnecessary rebuild eats into that budget. My project aims to address this by implementing a caching mechanism that reuses previously built package versions across different CI runs.

A key challenge in CI is securely handling pull requests from external forks, as these forks typically don't have access to repository secrets (needed to, say, comment back on the PR). It also becomes important to ensure that a malicious actor cannot exploit this to gain access to the repository secrets. Thus, I will be implementing a two-step process, where the first step builds the package versions and runs the performance tests, and the second step comments the results back on the PR.
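To make the shape of this concrete, here is a minimal sketch of that two-step pattern as a pair of GitHub Actions workflows. The workflow names, artifact name, atime::atime_pkg() invocation, and report path are all illustrative, not the exact ones used in autocomment-atime-results:

    # Untrusted half (e.g. .github/workflows/run-atime.yml): runs the fork's code, needs no secrets.
    name: run-atime
    on: pull_request
    jobs:
      benchmark:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - name: Run the performance tests
            run: Rscript -e 'atime::atime_pkg(".")'   # assumed invocation and output location
          - uses: actions/upload-artifact@v4
            with:
              name: atime-results
              path: atime-results/

    # Trusted half (e.g. .github/workflows/comment-atime.yml): runs in the base repository with secrets.
    name: comment-atime
    on:
      workflow_run:
        workflows: [run-atime]
        types: [completed]
    jobs:
      comment:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/download-artifact@v4
            with:
              name: atime-results
              path: atime-results
              run-id: ${{ github.event.workflow_run.id }}
              github-token: ${{ secrets.GITHUB_TOKEN }}
          - name: Post the results back on the PR
            env:
              GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
            run: |
              # Resolve the PR number from the workflow_run payload (fork PRs may need an extra API lookup),
              # then comment with the (hypothetical) report file.
              gh pr comment "${{ github.event.workflow_run.pull_requests[0].number }}" \
                --repo "${{ github.repository }}" --body-file atime-results/report.md

The point of the split is that the untrusted half never sees a token, while the trusted half never executes code from the fork.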

Qualification Tests

As part of the application process, I completed a series of qualification tests in January 2025. The tasks tackled core components of the proposed project, allowing me to hit the ground running for GSoC. Through these tests, I effectively laid the groundwork for a significant portion of the project:

  • Package Minification Script (Easy Task): I developed a script capable of taking any R package tarball, stripping out unnecessary files (like vignettes, documentation, and tests), and then installing this "minified" version. This directly addresses the goal of reducing package size for faster CI installations.

  • GitHub Action for Minification and Artifact Upload (Medium Task): Building on the minification script, I created a GitHub Action. This action can read a package name and version (e.g., from an issue description), check if a minified version already exists (a precursor to caching), minify it if not, and then upload the minified package as a build artifact (a rough sketch of these steps follows this list). This was a crucial step towards implementing the artifact caching strategy.

  • Supporting PRs from Forks in Autocomment-atime-results (Hard Task): This tackled the challenge of securely handling CI for pull requests from external forks. I modified the existing Autocomment-atime-results workflow to support this. The solution involved a two-part workflow: one part runs the performance tests and uploads the results as artifacts (which can be safely done by forked PRs), and a separate, trusted part then downloads these artifacts and comments the results back onto the pull request. I also cloned the data.table repository with its historical branches to thoroughly test this setup.
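Here is the rough sketch referenced above for the first two tasks: a minify-and-upload job. The package, version, download URLs, and the set of stripped directories are illustrative; the real action reads its inputs from the issue description:

    name: minify-package
    on: workflow_dispatch   # the real action is driven by an issue; this is only a sketch
    jobs:
      minify:
        runs-on: ubuntu-latest
        steps:
          - uses: r-lib/actions/setup-r@v2
          - name: Fetch, strip, rebuild, and install the package
            run: |
              PKG=data.table; VERSION=1.15.4   # in the real action these come from the issue description
              curl -fLO "https://cloud.r-project.org/src/contrib/${PKG}_${VERSION}.tar.gz" ||
                curl -fLO "https://cloud.r-project.org/src/contrib/Archive/${PKG}/${PKG}_${VERSION}.tar.gz"
              tar -xzf "${PKG}_${VERSION}.tar.gz"
              rm -rf "${PKG}/vignettes" "${PKG}/tests" "${PKG}/inst/doc"   # strip what CI does not need
              R CMD build --no-build-vignettes --no-manual "${PKG}"        # repack the minified source
              R CMD INSTALL "${PKG}_${VERSION}.tar.gz"                     # confirm it still installs
          - uses: actions/upload-artifact@v4
            with:
              name: minified-package
              path: "*.tar.gz"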

Week 0: Community Bonding Period

My initial plan was to implement the caching mechanism in the atime package itself. Since I had not touched atime in the qualification tasks, I planned to spend this period exploring the package and working on it. The idea was simple: in the local development environment, we would cache the historical data.table builds in a cache directory, and then in the CI, we would upload this cache directory as an artifact and download it in the next run. I made this PR for some feedback, and it turned out I had missed a much easier way to do this. Toby pointed out that I could simply cache the package's library directory instead of saving the built files again in a separate atime cache directory, and suggested implementing the caching mechanism in the CI workflow itself, which would be much more efficient.

Week 1: Implementing Caching in CI

In the first week of GSoC, I focused on implementing the caching mechanism directly in the CI workflow. The goal was to reuse previously built package versions across different CI runs, significantly speeding up the performance testing process.

I started by modifying the existing GitHub Actions workflow I had created during the Hard Task of the qualification tests. I cached both the library directory and the built libgit2 files, since rebuilding libgit2 was another place where CI time was being wasted.
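In rough terms, the expensive outputs get saved at the end of a run so that a later run can fetch them instead of rebuilding from scratch. The step names, paths, and artifact names below are assumptions rather than the exact ones in my workflow:

    # Relevant steps from the job, not a complete workflow.
      - name: Save the R library (previously built package versions)
        uses: actions/upload-artifact@v4
        with:
          name: r-library
          path: ${{ env.R_LIBS_USER }}   # assumes R_LIBS_USER was set earlier in the workflow
      - name: Save the built libgit2 files
        uses: actions/upload-artifact@v4
        with:
          name: libgit2-build
          path: libgit2/build/           # assumed build directory
      # A later run downloads these artifacts (via the GitHub API or the gh CLI) before
      # installing anything, so only the package versions missing from the library get rebuilt.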

Week 2: Combining the Two Steps into One Action

Anirban had previously suggested having one action that would both run the performance tests and comment the results back on the PR, instead of two separate actions. So I decided to structure it as two jobs in the same workflow: one that runs the performance tests and uploads the results as artifacts, and another that downloads those artifacts and comments the results back on the PR. This would simplify the workflow and make it easier to manage, as well as easier to adopt in any package that uses autocomment-atime-results.
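A skeleton of that combined workflow, with illustrative job, artifact, and file names, looks something like this; note that the commenting job still needs a token, which is exactly what makes fork support tricky and led to the change described next:

    name: atime-performance-tests
    on: pull_request
    jobs:
      run-benchmarks:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - run: Rscript -e 'atime::atime_pkg(".")'   # assumed invocation
          - uses: actions/upload-artifact@v4
            with:
              name: atime-results
              path: atime-results/
      comment-results:
        needs: run-benchmarks   # starts only once the benchmarks job has finished
        runs-on: ubuntu-latest
        steps:
          - uses: actions/download-artifact@v4
            with:
              name: atime-results
              path: atime-results
          - name: Comment the results on the PR
            env:
              GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
            run: gh pr comment "${{ github.event.pull_request.number }}" --repo "${{ github.repository }}" --body-file atime-results/report.md   # hypothetical report file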

Week 3 & 4: Delays, Implementing Feedback & Changing Approach (Again)

In the third week, I was partially occupied with travel for the Warpspeed: Agentic AI Hackathon 2025. However, I did finally open a PR on the autocomment-atime-results repository. I also changed my approach again: to support forks, instead of having two jobs, I used the pull_request_target event, which runs the workflow in the context of the base repository, allowing it to access secrets and comment on the PR, while ensuring the token can be accessed only in the commenting step.
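A minimal sketch of that pull_request_target shape, with an assumed atime invocation and a hypothetical report file; the important part is that the token only appears in the environment of the final commenting step:

    name: autocomment-atime-results
    on: pull_request_target   # runs in the context of the base repository, so secrets are available
    jobs:
      atime:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
            with:
              ref: ${{ github.event.pull_request.head.sha }}   # pull_request_target checks out the base branch by default
          - name: Run the performance tests (no token exposed here)
            run: Rscript -e 'atime::atime_pkg(".")'   # assumed invocation
          - name: Comment the results on the PR
            env:
              GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}   # the token only exists in this step's environment
            run: gh pr comment "${{ github.event.pull_request.number }}" --repo "${{ github.repository }}" --body-file atime-report.md   # hypothetical report file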

The other change was using the actions/cache action to cache the library directory, instead of uploading it as an artifact. This had its own challenges: for pull requests, the cache is scoped to the PR's merge ref rather than the base branch, so a cache created in one PR cannot be accessed by other PRs. The solution was to also run the workflow on other events, like push and workflow_dispatch, so that the cache gets created on the base branch and can then be restored by PRs.
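Put together, the relevant pieces look roughly like this, where the cache key, library path, and branch name are assumptions rather than the exact values from my PR:

    name: atime-with-cache
    on:
      pull_request_target:
      push:
        branches: [master]    # caches created on the base branch are visible to every PR
      workflow_dispatch:      # allows seeding or refreshing the cache manually
    jobs:
      atime:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - name: Use a fixed library path so the cache stays predictable
            run: echo "R_LIBS_USER=${{ github.workspace }}/.r-library" >> "$GITHUB_ENV"
          - name: Restore/save previously built package versions
            uses: actions/cache@v4
            with:
              path: ${{ github.workspace }}/.r-library
              key: atime-library-${{ runner.os }}-${{ hashFiles('DESCRIPTION') }}
              restore-keys: atime-library-${{ runner.os }}-
          # ...build any missing versions and run the benchmarks as before...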