Bachelor Thesis

Assessing the Performance Regression Detection Capabilities of Microbenchmark Suites in Open Source Software

B.Sc. Business Informatics

2024 · TU Berlin
Methods/Tools: Microbenchmarking, Performance Regression Injection, Open Source Software Analysis, Go, Cloud Benchmarking, Statistical Evaluation
Full thesis can be found here

Abstract

Microbenchmark suites are a fine-grained tool for assessing the performance of individual code segments, in contrast to application benchmarks, which simulate realistic user interactions to stress the entire system under test. This thesis examines how effectively microbenchmark suites detect performance regressions in open-source software (OSS) projects written in Go. Performance regressions, that is, decreases in software performance caused by code changes, pose significant challenges in software development, particularly in the open-source community, where diverse contributions compound complexity. Using a methodology that injects performance regressions of varying severity into two projects, Common and Gin, this study investigates the microbenchmarks’ ability to identify these injected regressions reliably. By comparing the detection capabilities of the mean and the median when analyzing benchmark results, we aim to discern the impact of outliers on the accuracy of regression detection and to determine which measure provides more reliable results. We find that the median returns fewer false positives but tends to be more conservative than the mean in reporting regressions. Our findings also reveal that while microbenchmark suites can identify performance regressions, they miss some regressions and also report false positives, which can lead to misleading conclusions and suggests room for improvement in microbenchmark design and implementation. Relying solely on microbenchmark suites is therefore insufficient. Notably, we discover that microbenchmarks targeting low-level operations, due to their isolated nature, reduce false positives. Furthermore, microbenchmarks that engage with complex data or operate in multi-threaded contexts show increased sensitivity to regressions.

Methodology

The thesis follows an experimental research approach. First, two open-source Go projects were selected as study objects: Common, a shared library used across Prometheus components, and Gin, a popular HTTP web framework. Both projects include existing microbenchmark suites, which made them suitable for testing regression detection in realistic open-source contexts.
Performance regressions were then artificially injected into specific parts of the projects. These regressions were designed with adjustable severity levels, so that the thesis could examine not only whether the benchmarks detected a regression, but also at which level of severity the change became visible.
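As a rough illustration of severity-controlled injection, the following Go sketch adds a configurable amount of extra work to a function and measures it with `testing.Benchmark`. The names `joinLabels`, `injectRegression`, and `severity` are hypothetical and do not come from Common, Gin, or the thesis; the injection mechanism actually used may differ.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// severity is a hypothetical knob: the number of extra loop iterations
// executed on every call of the function under test.
var severity = 0

// sink keeps the result of the injected work alive so the compiler
// cannot optimise the loop away.
var sink uint64

// injectRegression burns CPU time roughly proportional to n without
// touching the caller's data.
func injectRegression(n int) {
	var acc uint64
	for i := 0; i < n; i++ {
		acc += uint64(i) * 2654435761
	}
	sink = acc
}

// joinLabels is a stand-in for a function covered by a microbenchmark.
// Its output is identical for every severity; only its runtime changes.
func joinLabels(labels []string) string {
	injectRegression(severity)
	return strings.Join(labels, ",")
}

// benchmarkJoinLabels follows the usual Go benchmark shape.
func benchmarkJoinLabels(b *testing.B) {
	labels := []string{"job", "instance", "env"}
	for i := 0; i < b.N; i++ {
		_ = joinLabels(labels)
	}
}

func main() {
	// Run the same benchmark at increasing severity levels.
	for _, s := range []int{0, 100, 10000} {
		severity = s
		res := testing.Benchmark(benchmarkJoinLabels)
		fmt.Printf("severity=%d: %d ns/op\n", s, res.NsPerOp())
	}
}
```

Keeping the functional behaviour unchanged and varying only the amount of extra work makes it possible to ask at which severity a benchmark first reports a difference.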
The benchmark suites were executed in a cloud environment, and the results were analyzed statistically. Confidence intervals were used to determine whether performance differences between the original and modified versions were significant. The analysis also compared mean and median values to understand how different statistical measures affect the reliability of regression detection.
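One way to sketch such an analysis in Go is a percentile bootstrap interval for the relative change between two sets of measurements, computed once with the mean and once with the median. The bootstrap, the 95% level, and all function names here are assumptions made for illustration; the thesis may construct its confidence intervals differently.

```go
package main

import (
	"fmt"
	"math/rand"
	"sort"
)

// statistic summarises a sample, e.g. by its mean or median.
type statistic func([]float64) float64

func mean(xs []float64) float64 {
	sum := 0.0
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

func median(xs []float64) float64 {
	s := append([]float64(nil), xs...) // copy so the input is not reordered
	sort.Float64s(s)
	n := len(s)
	if n%2 == 1 {
		return s[n/2]
	}
	return (s[n/2-1] + s[n/2]) / 2
}

// newRNG returns a seeded generator so runs are reproducible.
func newRNG(seed int64) *rand.Rand {
	return rand.New(rand.NewSource(seed))
}

// resample draws len(xs) values from xs with replacement.
func resample(rng *rand.Rand, xs []float64) []float64 {
	out := make([]float64, len(xs))
	for i := range out {
		out[i] = xs[rng.Intn(len(xs))]
	}
	return out
}

// bootstrapCI returns an approximate 95% percentile interval for the
// relative change stat(mod)/stat(base) - 1.
func bootstrapCI(rng *rand.Rand, base, mod []float64, stat statistic, iters int) (lo, hi float64) {
	changes := make([]float64, iters)
	for i := range changes {
		b := stat(resample(rng, base))
		m := stat(resample(rng, mod))
		changes[i] = m/b - 1
	}
	sort.Float64s(changes)
	return changes[int(0.025*float64(iters))], changes[int(0.975*float64(iters))]
}

// detectsRegression reports a slowdown only if the whole interval lies
// above zero, i.e. the lower bound is positive.
func detectsRegression(lo float64) bool {
	return lo > 0
}

func main() {
	rng := newRNG(42)
	// Made-up ns/op values; the last modified run is an outlier.
	base := []float64{100, 101, 99, 102, 100, 98, 101, 100, 99, 100}
	mod := []float64{105, 106, 104, 107, 105, 103, 106, 105, 104, 400}
	stats := []struct {
		name string
		fn   statistic
	}{{"mean", mean}, {"median", median}}
	for _, s := range stats {
		lo, hi := bootstrapCI(rng, base, mod, s.fn, 1000)
		fmt.Printf("%s: [%.3f, %.3f] regression=%v\n", s.name, lo, hi, detectsRegression(lo))
	}
}
```

Because a single outlier shifts the mean but barely moves the median, running both statistics on the same data shows how the choice of measure can change the width and position of the interval.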
The evaluation focused on false positives, false negatives, miss rates, and the sensitivity of individual microbenchmarks. This made it possible to assess not only whether a benchmark suite detected regressions, but also how stable and precise its results were and where they could be misleading.
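The evaluation metrics can be sketched in Go as a small confusion-matrix tally. This assumes that each benchmark is labelled by whether it exercises the injected code (`affected`) and whether the analysis reported a regression (`flagged`); the type names and the standard rate definitions below are illustrative and may not match the thesis exactly.

```go
package main

import "fmt"

// outcome records the ground truth and the verdict for one benchmark.
type outcome struct {
	name     string
	affected bool // benchmark exercises code with an injected regression
	flagged  bool // statistical analysis reported a regression
}

// summary counts true/false positives and negatives.
type summary struct {
	TP, FP, FN, TN int
}

// summarize tallies the four possible combinations.
func summarize(results []outcome) summary {
	var s summary
	for _, r := range results {
		switch {
		case r.affected && r.flagged:
			s.TP++
		case r.affected && !r.flagged:
			s.FN++
		case !r.affected && r.flagged:
			s.FP++
		default:
			s.TN++
		}
	}
	return s
}

// missRate is the share of real regressions that went undetected.
func (s summary) missRate() float64 {
	if s.TP+s.FN == 0 {
		return 0
	}
	return float64(s.FN) / float64(s.TP+s.FN)
}

// falsePositiveRate is the share of unaffected benchmarks that were
// nevertheless flagged.
func (s summary) falsePositiveRate() float64 {
	if s.FP+s.TN == 0 {
		return 0
	}
	return float64(s.FP) / float64(s.FP+s.TN)
}

func main() {
	// Hypothetical benchmark names and verdicts, for illustration only.
	results := []outcome{
		{"BenchmarkA", true, true},
		{"BenchmarkB", true, false},
		{"BenchmarkC", false, true},
		{"BenchmarkD", false, false},
	}
	s := summarize(results)
	fmt.Printf("%+v miss rate=%.2f false positive rate=%.2f\n",
		s, s.missRate(), s.falsePositiveRate())
}
```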

Reflection

This thesis feels quite far away from the other projects here and the fields I focus on now, but I still think it is important to include. It shows where I come from technically. At that time, I was working deep inside software engineering: benchmarks, performance regressions, false positives, false negatives, mean, median, cloud environments. It was precise and very technical. Looking back, I find it a bit funny that I spent my bachelor thesis measuring tiny differences in performance, while one of my current projects questions the laptop as an object built around productivity and performance in the first place. Back then, the question was: how can we detect when software becomes slower? Now, I am more interested in asking why speed, efficiency, and productivity are treated as goals at all. I do not see this as a contradiction, but as a shift. The technical knowledge is still part of me, but I no longer want to stay only inside technical optimization. I want to use this background to ask broader questions about technology, design, systems, and the values they carry.