Performance testing an IntelliJ plugin with JMH
Performance testing Fuzzier
The time has finally come. I want to create some performance metrics for Fuzzier.
Fuzzier is a file finder plugin for JetBrains editors inspired by Telescope for Nvim that I've been developing for a little over a year now.
First, I need a plan. Fuzzier is written in Kotlin and I'm a Java developer, so I'll be leaning into the tools that I know.
In this article I set up JMH and focus on a single function, which I measure and improve. I faced some challenges while setting up JMH with Kotlin and IntelliJ plugin dependencies and have documented them here.
The Plan
While the initial idea was to benchmark the actual search, and I had some thoughts on this, I ended up choosing a more contained problem to focus on. I think it was a good choice: I now have a better idea of how to approach performance testing Fuzzier, and I have a working setup for the future.
Here are some questions that need to be answered first.
- What do I want to measure?
- What can I use the results for?
- Will the same setup work in the future?
- What tools should I use?
Let's go over each of these one by one.
What do I want to measure?
Primarily, I want to measure how long Fuzzier takes to process IntelliJ's file index and produce a result. The secondary goal is to ensure that the memory footprint is kept under control, though I don't expect this to be a problem.
The measurements should account for the settings used in the processing, whether they come through the IDE or through a separate integration.
What can I use the results for?
The goal is to create a baseline performance metric, which I can compare future results to.
I cannot account for everything, so these initial results are not perfect and will likely need to be tweaked and measured again later.
Will the same setup work in the future?
I'd like to be able to use the same setup/framework in the future to produce comparable benchmarks.
One large issue is that I do not have static hardware to benchmark on. My personal PC is the primary machine I will run these tests on, but that is likely to change in the future, which would instantly invalidate the previous measurements.
What tools should I use?
I'm not experienced in using benchmarking tools or creating performance metrics, so I'll have to do some exploring. Here are my initial thoughts on the two candidates.
JMeter
I could create a testing endpoint, which could be used to launch full tests using a running IDE. This would allow me to use JMeter to start and measure the actual time taken.
JMH
With JMH I could test smaller parts of the code with more precision, but it would require a lot of work to cover a large part of the program. I would write the tests in code and run them with JMH.
The hard part (setting up JMH)
The hardest part was trying to put JMH, Kotlin and IntelliJ together.
Here I also realized that I have a good function to use, and the initial test can be copy-pasted from my existing unit tests. Here's a link to the pull request, which includes the changes to the build.gradle file.
- How do I create the necessary tests and make sure that they account for everything that I want them to?
- How do I measure the things that I want to?
- What is the output, how is it read, and how will I store it?
Always start simple
Those were some open-ended questions. Instead of trying to consider everything up front, I'll start with whatever I run into.
JMH with Gradle
No more planning, time to go and break some things. I guess I'll start with JMH and see what happens.
Requirements
The `jmh` task doesn't have the same dependencies available as the main project, which caused me some headaches. While trying to run the task I got an error where `com.intellij.ui.JBColor` could not be found.
This took me some time to figure out, because the missing libraries were not available as a public dependency. I had to manually add the jars to the `jmh` task dependencies, because I did not find a solution that would just reuse all of the dependencies and plugins from the main Gradle project.
My solution was to copy the dependencies into the project and use a file dependency to add them to the `jmh` task. This way I could avoid adding an absolute path to the build file (see the sketch after the list below).
- jmh-gradle-plugin - Plugin
  - `id("me.champeau.jmh") version "0.7.2"`
- JMH core - Dependency
  - `implementation("org.openjdk.jmh:jmh-core:1.37")`
- JMH annotation processor - Dependency
  - `annotationProcessor("org.openjdk.jmh:jmh-generator-annprocess:1.37")`
- Kotlin standard library for JMH - Dependency
  - `jmh("org.jetbrains.kotlin:kotlin-stdlib:2.1.0")`
- File dependency for `idea:ideaIC:2024.3` - Copied each `.jar` file from my `.gradle` directory
  - From `ideaIC-2024.3/lib`: `cp *.jar ~/IdeaProjects/fuzzier/libs/`
  - `jmh(files("libs/*"))`
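For the file dependency, the relevant piece of the build file ends up looking roughly like the sketch below. This is only a sketch: it uses Gradle's `fileTree` to pick up every jar in `libs/`, as an alternative to the `files("libs/*")` call listed above.

```kotlin
// Sketch: add the copied IntelliJ jars to the jmh configuration only.
// fileTree("libs") collects every .jar in the directory, an alternative
// to the files("libs/*") call from the list above.
dependencies {
    jmh(fileTree("libs") { include("*.jar") })
}
```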
The default source file location is `src/jmh`, so my tests live in `src/jmh/kotlin/com/mituuz/fuzzier/performance`.
Configuration can be done through the build.gradle file. Here's an example:
```kotlin
jmh {
    fork = 1
    warmupIterations = 1
    iterations = 1
    timeOnIteration = "1s"
    resultFormat = "JSON"
    jvmArgs = listOf("-Xms2G", "-Xmx2G")
    jmhTimeout = "30s"
}
```
Running JMH
Do not use prints when testing with JMH. The terminal can't keep up, and it froze my system twice before I realized what was going on.
Friendly reminder for us Linux folk: Raising Elephants Is So Utterly Boring
After solving the print issue, I was able to run the JMH task both from the terminal and through IntelliJ. The `jmh` task wasn't compatible with Gradle's configuration cache, so I ended up disabling the cache when running the task:
`./gradlew jmh --no-configuration-cache`
Crafting tests to use with JMH
While setting this up, I realized that I have a perfect function for making some actual improvements.
I created an ugly highlighting function that appends several strings together to form an HTML string where certain letters get a different background color (`<font style='background-color: $colorAsHex'>`). I knew that it wasn't the best way to do this and always had the intention to return to it.
Creating the test
To keep this simple, I just want to measure the highlight function for a single string with a varying number of characters to highlight. I can copy my existing test and change some parameters to create variance.
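For reference, here is a rough sketch of the shape of a benchmark class under `src/jmh/kotlin`. This is not the actual test: the `StubHighlighter` below is a stand-in so that the example compiles on its own, while the real benchmarks reuse the setup from my existing unit tests.

```kotlin
package com.mituuz.fuzzier.performance

import org.openjdk.jmh.annotations.Benchmark
import org.openjdk.jmh.annotations.Scope
import org.openjdk.jmh.annotations.Setup
import org.openjdk.jmh.annotations.State

// Sketch only: StubHighlighter stands in for the object that owns highlight()
// in Fuzzier, which the real benchmarks build using the existing test setup.
@State(Scope.Benchmark)
open class HighlightBenchmarkSketch {
    private class StubHighlighter(private val indexes: List<Int>) {
        fun highlight(source: String): String {
            val sb = StringBuilder(source)
            var offset = 0
            for (i in indexes.sorted()) {
                if (i < source.length) {
                    // Wrap the character at index i in simple tags
                    sb.insert(i + offset, "<b>")
                    offset += 3
                    sb.insert(i + offset + 1, "</b>")
                    offset += 4
                }
            }
            return sb.toString()
        }
    }

    private lateinit var threeMatches: StubHighlighter

    @Setup
    fun setup() {
        // Three characters of the file name will be highlighted
        threeMatches = StubHighlighter(listOf(0, 4, 8))
    }

    @Benchmark
    fun threeMatchesInFilename(): String = threeMatches.highlight("KotlinIsFun.kt")
}
```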
It's hard to say anything about benchmarking results when you don't have anything to compare to, but here is the initial run.
| Benchmark | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|
| PerformanceTests.noMatchesInFilename | thrpt | | 36214160.146 | | ops/s |
| PerformanceTests.oneMatchInFilename | thrpt | | 17464086.561 | | ops/s |
| PerformanceTests.threeMatchesInFilename | thrpt | | 6670398.128 | | ops/s |
| PerformanceTests.sevenMatchesInFilename | thrpt | | 2880108.760 | | ops/s |
But before doing any improvements it's time to think about how I can get consistent performance results.
Setting up test environment
I did not end up with a consistent test environment and do not need one at this point, but here is my failed attempt to benchmark on two devices using Docker with resource constraints.
Okay, so I dug around a bit and decided to go with Docker constraints: first building the image from a Dockerfile and then defining the constraints on the docker run command.
Not sure if this is the best way, but you gotta start somewhere. (Using docker-compose and defining the resource limits in the file would be an alternative way to do this.)
Here's my initial run configuration and the results (I decided to increase the measurement time a bit).
```kotlin
jmh {
    fork = 1
    warmupIterations = 2
    iterations = 2
    timeOnIteration = "2s"
    resultFormat = "JSON"
    jvmArgs = listOf("-Xms2G", "-Xmx2G")
    jmhTimeout = "30s"
}
```
```shell
docker build -t fuzzier-perf .
docker run --memory="3G" --cpus="2.0" -d fuzzier-perf
```
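The Dockerfile itself isn't shown here; as a rough sketch, something like the following would work, assuming a JDK base image and the Gradle wrapper checked into the repository.

```dockerfile
# Hypothetical sketch; the actual Dockerfile used for these runs is not shown.
FROM eclipse-temurin:21-jdk
WORKDIR /fuzzier
# Copy the project, including the Gradle wrapper
COPY . .
# Run the benchmarks when the container starts
CMD ["./gradlew", "jmh", "--no-configuration-cache"]
```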
Initial results
Laptop
| Benchmark | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|
| PerformanceTests.noMatchesInFilename | thrpt | | 20833517.400 | | ops/s |
| PerformanceTests.oneMatchInFilename | thrpt | | 7015488.826 | | ops/s |
| PerformanceTests.threeMatchesInFilename | thrpt | | 3328829.228 | | ops/s |
| PerformanceTests.sevenMatchesInFilename | thrpt | | 1342837.360 | | ops/s |
Desktop
| Benchmark | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|
| PerformanceTests.noMatchesInFilename | thrpt | | 44547130.894 | | ops/s |
| PerformanceTests.oneMatchInFilename | thrpt | | 13409151.110 | | ops/s |
| PerformanceTests.threeMatchesInFilename | thrpt | | 6207958.712 | | ops/s |
| PerformanceTests.sevenMatchesInFilename | thrpt | | 2635322.828 | | ops/s |
No luck
So, because the cores aren't equal between the devices the results aren't either. In hindsight this makes perfect sense.
I think I'll use a smaller measure -> change -> measure cycle instead of trying to create permanent, comparable results.
This means that my performance testing will be limited to the scope of a single change and will only be valid for a short time. This is fine, because my goal shifted from creating general measurements to improving a single function.
Pivot
With the new cycle defined I can start working on the improvements.
measure -> change -> measure
I'll do all of the work on a single device to ensure that the measurements are valid.
Baseline
First, I'll create the baseline measurements that I can compare against after doing some changes.
Because I don't need docker anymore, I'll be running the JMH task without it for convenience and speed.
Now that I am ready to start measuring, I should take a closer look at which mode I will be using and what I am actually measuring.
First, I need to select the JMH mode that I want to use.
JMH modes
JMH offers several modes, which measure different things.
- Throughput (default)
  - How many times the function completes per second
- AverageTime
  - How long the function takes to run on average
- SampleTime
  - Samples the execution time of individual calls over the measurement period
- SingleShotTime
  - Measures the execution time of a single invocation, without continuous iterations
Because the function in question is run for every file with each search, I could use either average time or throughput. The function is so fast that throughput feels easiest to read: the higher the score, the better.
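Should I later want average time instead, the mode can be selected per benchmark (or per class) with annotations. A minimal, self-contained example, not one of the actual Fuzzier tests:

```kotlin
import java.util.concurrent.TimeUnit
import org.openjdk.jmh.annotations.Benchmark
import org.openjdk.jmh.annotations.BenchmarkMode
import org.openjdk.jmh.annotations.Mode
import org.openjdk.jmh.annotations.OutputTimeUnit
import org.openjdk.jmh.annotations.Scope
import org.openjdk.jmh.annotations.State

// Example only: report the average time per call in nanoseconds
// instead of the default throughput.
@State(Scope.Benchmark)
open class ModeExample {
    private val fileName = "KotlinIsFun.kt"

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    fun uppercaseFileName(): String = fileName.uppercase()
}
```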
Results
So, here are the initial results of how many operations complete in a second. There's quite a noticeable decrease in the score as the match count increases, which is to be expected because more matches require more operations.
| Benchmark | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|
| PerformanceTests.noMatchesInFilename | thrpt | 2 | 44580990.574 | | ops/s |
| PerformanceTests.oneMatchInFilename | thrpt | 2 | 15314862.488 | | ops/s |
| PerformanceTests.threeMatchesInFilename | thrpt | 2 | 6661081.448 | | ops/s |
| PerformanceTests.sevenMatchesInFilename | thrpt | 2 | 2652866.574 | | ops/s |
Optimization time
Here is the function that I would like to optimize. It works by inserting tags into an HTML string, which then highlight the correct characters.
The tag used is a font style which sets the background color. Swing components do not support every HTML tag and I remember trying a lot of different things before ending up with this.
```kotlin
// JBColor comes from the IntelliJ platform; colorAsHex and score are
// defined elsewhere in the plugin.
import com.intellij.ui.JBColor

var startStyleTag = createStartStyleTag()
const val END_STYLE_TAG: String = "</font>"

private fun createStartStyleTag(): String {
    val color = JBColor.YELLOW
    return "<font style='background-color: ${colorAsHex(color)};'>"
}

fun highlight(source: String): String {
    val stringBuilder: StringBuilder = StringBuilder(source)
    var offset = 0
    val hlIndexes = score.highlightCharacters.sorted()
    for (i in hlIndexes) {
        if (i < source.length) {
            // Insert a start tag before the character and an end tag after it,
            // tracking the offset added by previously inserted tags
            stringBuilder.insert(i + offset, startStyleTag)
            offset += startStyleTag.length
            stringBuilder.insert(i + offset + 1, END_STYLE_TAG)
            offset += END_STYLE_TAG.length
        }
    }
    return stringBuilder.toString()
}
```
Going for Copilot
Before even taking a look at what I could do, let's see what Copilot thinks.
I simply asked it to optimize the function, copy-pasted the result, and measured again.
```kotlin
fun highlight(source: String): String {
    val hlIndexes = score.highlightCharacters.sorted()
    val parts = mutableListOf<String>()
    var lastIndex = 0
    for (i in hlIndexes) {
        if (i < source.length) {
            parts.add(source.substring(lastIndex, i))
            parts.add(startStyleTag)
            parts.add(source[i].toString())
            parts.add(END_STYLE_TAG)
            lastIndex = i + 1
        }
    }
    parts.add(source.substring(lastIndex))
    return parts.joinToString("")
}
```
| Benchmark | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|
| PerformanceTests.noMatchesInFilename | thrpt | 2 | 21140629.430 | | ops/s |
| PerformanceTests.oneMatchInFilename | thrpt | 2 | 12419843.488 | | ops/s |
| PerformanceTests.threeMatchesInFilename | thrpt | 2 | 3177763.124 | | ops/s |
| PerformanceTests.sevenMatchesInFilename | thrpt | 2 | 1366734.188 | | ops/s |
Here's the explanation that CoPilot gave me.
To optimize the highlight function, we can use a StringBuilder more efficiently by collecting all the parts of the string and then joining them at the end. This avoids multiple insertions which can be costly.
This made everything slower. Instead of using a StringBuilder and appending the parts, Copilot wanted to store all of the individual strings in a list. That does not sound like it would be faster or more memory efficient: instead of using a StringBuilder to create a single String object, it creates and stores multiple distinct String objects.
I'll do it myself
Okay, so I know what I can look for. Instead of adding tags for each character, I could group consecutive characters and add the start and end tags once for each group. This of course means that I want to add another test with the same number of characters to tag, but with a different grouping.
I decided to create separate `threeMatchesInFilenameInARow` and `threeMatchesInFilename` tests. Here is a new set of initial benchmarks, which makes it clear that no matter the grouping, three matches take the same time.
I'll also drop the no matches case, because every time this function is called the string must have at least one match.
| Benchmark | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|
| PerformanceTests.oneMatchInFilename | thrpt | 2 | 15147366.761 | | ops/s |
| PerformanceTests.threeMatchesInFilenameInARow | thrpt | 2 | 6757701.170 | | ops/s |
| PerformanceTests.threeMatchesInFilename | thrpt | 2 | 6751191.472 | | ops/s |
| PerformanceTests.sevenMatchesInFilename | thrpt | 2 | 2703818.535 | | ops/s |
Grouping characters
Here's a comparison of what the output looks like with and without grouping. I think it is pretty clear why this would be an improvement.
Without grouping:
```html
<font style='background-color: #ffff00;'>H</font><font style='background-color: #ffff00;'>e</font>llo
```
With grouping:
```html
<font style='background-color: #ffff00;'>He</font>llo
```
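The actual change lives in the plugin repository, but as a sketch the grouped version can look something like the following: consecutive highlight indexes are merged into one range, and the tags are inserted once per range. The surrounding class (with `score`, `startStyleTag`, and `END_STYLE_TAG`) is the same as in the original function above.

```kotlin
// Sketch of a grouped implementation: adjacent highlight indexes share a
// single start/end tag pair. The actual change in Fuzzier may differ in detail.
fun highlight(source: String): String {
    val stringBuilder = StringBuilder(source)
    var offset = 0
    val hlIndexes = score.highlightCharacters.sorted().filter { it < source.length }
    var i = 0
    while (i < hlIndexes.size) {
        val start = hlIndexes[i]
        var end = start
        // Extend the group while the next index is directly adjacent
        while (i + 1 < hlIndexes.size && hlIndexes[i + 1] == end + 1) {
            end = hlIndexes[i + 1]
            i++
        }
        // Insert one start tag before the group and one end tag after it
        stringBuilder.insert(start + offset, startStyleTag)
        offset += startStyleTag.length
        stringBuilder.insert(end + offset + 1, END_STYLE_TAG)
        offset += END_STYLE_TAG.length
        i++
    }
    return stringBuilder.toString()
}
```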
This had the exact effect that I was looking for: by reducing the number of tags inserted, sequential character processing gets faster. It does slow down when there are no sequential matches, but I believe some sequences are to be expected in the most common cases.
| Benchmark | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|
| PerformanceTests.oneMatchInFilename | thrpt | 2 | 15262031.008 | | ops/s |
| PerformanceTests.threeMatchesInFilenameInARow | thrpt | 2 | 10133267.856 | | ops/s |
| PerformanceTests.threeMatchesInFilename | thrpt | 2 | 5662795.456 | | ops/s |
| PerformanceTests.sevenMatchesInFilename | thrpt | 2 | 3672124.419 | | ops/s |
Here is a better look into the changes between the two implementations. For such a small change, I think it is a good one.
| Test Name | Initial ops/s | Optimized ops/s | Change in ops/s | % Change |
|---|---|---|---|---|
| PerformanceTests.oneMatchInFilename | 15147366.76 | 15262031.01 | 114664.25 | 0.76% |
| PerformanceTests.threeMatchesInFilenameInARow | 6757701.17 | 10133267.86 | 3375566.69 | 49.95% |
| PerformanceTests.threeMatchesInFilename | 6751191.47 | 5662795.46 | -1088396.02 | -16.12% |
| PerformanceTests.sevenMatchesInFilename | 2703818.54 | 3672124.42 | 968305.88 | 35.81% |
I think this has gone on for long enough and this article is getting a bit too long. This is not the only issue that I had with the highlighting and I think it can still be improved, but it is a start.
Final thoughts
Setting up JMH was a considerable amount of work, but honestly I feel like it was a good investment. Fuzzier is a file finder plugin, and it needs to be fast to feel good. Having the framework ready to keep iterating, and being able to see the results in a readable form, will help me in the future.
I didn't even get to JMeter here, but maybe at some point in the future I could look at how to integrate it to measure the scoring and matching.
If you have any tips, questions or anything else feel free to contact me at
mitja@mituuz.com.