Evergreen Performance Bulletin 2021-01-11
I’m starting this bulletin in order to solicit feedback on future directions and to preserve and disseminate knowledge of Evergreen performance characteristics. I’m going to shoot for releasing one per sprint. Please let me know if you have any suggestions whatsoever.
Link to the last and next bulletins.
Recent Findings
Feedback Rule Does Not Resolve rhel76-small Performance Issue
Over this past weekend, I tested out the DependenciesMetTime PR on an ongoing issue with rhel76-small version performance.
The poor performance continued as before, despite my setting the free host fraction to 0.01, turning on round-up of fractional required hosts, and turning on wait-time feedback. For example, this task took almost an hour to start.
To conclusively rule out insufficient host allocation, on the 13th I set the distro’s minimum hosts to 300, shortly before the nightly creation of tasks.
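For concreteness, here is a minimal Go sketch of the distro-level knobs involved in these two experiments. The struct and field names are illustrative stand-ins of my own, not the actual Evergreen distro settings schema.

```go
package main

import "fmt"

// Illustrative only: these names approximate the distro-level host allocator
// knobs described above; they are not the actual Evergreen settings schema.
type hostAllocatorSettings struct {
	MinimumHosts         int     // floor on running hosts for the distro
	FreeHostFraction     float64 // fraction of free hosts counted as available capacity
	RoundUpRequiredHosts bool    // round fractional required hosts up instead of down
	WaitTimeFeedback     bool    // allocate extra hosts when tasks exceed the wait threshold
}

func main() {
	// Settings used in the rhel76-small experiments described above.
	rhel76Small := hostAllocatorSettings{
		MinimumHosts:         300, // set shortly before the nightly task creation
		FreeHostFraction:     0.01,
		RoundUpRequiredHosts: true,
		WaitTimeFeedback:     true,
	}
	fmt.Printf("%+v\n", rhel76Small)
}
```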
Despite this, wait times for the E2E tasks were as bad as they have ever been. For example, this task took upwards of seven hours to start. This was despite there being upwards of 50, and often as many as 100, idle hosts during the first hour of the day, after the jobs had been submitted.
I observed this task sitting at the front of the queue, with no dependencies, but not running for about an hour. Occasionally, the task page would disagree, saying that it was 50th in the queue. It can therefore be concluded that this is not a problem of host underallocation, but rather an issue with hosts’ ability to pick up these tasks from the queue.
An important lesson about debugging test design: a month ago, I suspected that host allocation could solve this problem for this distro, and I should have run this experiment then. Instead, I designed fairly complex solutions intended to be permanent. The round-up rule and especially the feedback rule will have a positive effect in the future, but they were not effective in solving the rhel76-small issue. The takeaway lesson for me is to attempt a simple configuration- or ops-based proof-of-concept trial before spending lots of time on a potential solution whose concept has not yet been proven.
Improvements in Task Wait Times from Wait-based Feedback Mechanism
I also tested out this feedback mechanism on one of our workhorse distros.
rhel62-small has fewer tasks that wait over an hour when long-wait feedback is used. This holds true even when compared to days with much lower volume.
Jan 1st: Low task volume control
- 27,848 total tasks in the time period across all distros
- 3 tasks with over 1 hour of corrected wait time
- 0.11 hours average wait
Jan 8th: Matched task volume control
- 124,898 total tasks in the time period across all distros
- 353 tasks with over 1 hour of corrected wait time
- 0.13 hours average wait
Jan 11th: Feedback rule enabled
- 126,737 total tasks in the time period across all distros
- 0 tasks with over 1 hour of corrected wait time
- 0.06 hours average wait
(Provenance for this data: Jan 1st, Jan 8th, Jan 11th.)
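As a point of clarification on the metric: I take “corrected” wait time to discount time a task spent blocked on dependencies, i.e., the wait clock starts at the later of the scheduled time and the dependencies-met time. The sketch below illustrates that reading; the field names are illustrative, not the Evergreen task schema.

```go
package main

import (
	"fmt"
	"time"
)

// Sketch of the "corrected" wait time used in the numbers above, assuming it
// discounts time a task spent blocked on dependencies: the clock starts at
// whichever is later, the scheduled time or the dependencies-met time.
type taskTimes struct {
	Scheduled       time.Time
	DependenciesMet time.Time
	Started         time.Time
}

func correctedWait(t taskTimes) time.Duration {
	eligible := t.Scheduled
	if t.DependenciesMet.After(eligible) {
		eligible = t.DependenciesMet
	}
	return t.Started.Sub(eligible)
}

func main() {
	base := time.Date(2021, 1, 11, 6, 0, 0, 0, time.UTC)
	t := taskTimes{
		Scheduled:       base,
		DependenciesMet: base.Add(30 * time.Minute), // blocked on an upstream task
		Started:         base.Add(40 * time.Minute),
	}
	// Raw wait would be 40m; corrected wait is 10m.
	fmt.Println(correctedWait(t))
}
```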
By eye, the histogram looks slightly better, but there is no qualitative difference, only a quantitative one.
These preliminary results are encouraging. While there is a definite improvement in wait times, this still does not get us to a strictly enforceable wait-time SLA. I think this is because we spin up only one additional host for each task that goes over the wait-time threshold, and that host will likely end up running some other task.
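Here is a minimal sketch of the feedback rule as I understand it; the function and names are illustrative, not the actual host allocator code.

```go
package main

import (
	"fmt"
	"time"
)

// Minimal sketch: count queued tasks whose (corrected) wait exceeds a
// threshold and request one extra host per such task, on top of whatever the
// base allocation rule asked for.
func feedbackHosts(waits []time.Duration, threshold time.Duration) int {
	extra := 0
	for _, w := range waits {
		if w > threshold {
			extra++ // one additional host per long-wait task
		}
	}
	return extra
}

func main() {
	waits := []time.Duration{5 * time.Minute, 45 * time.Minute, 90 * time.Minute}
	base := 10 // hosts requested by the base allocation rule
	extra := feedbackHosts(waits, 30*time.Minute)
	// The extra hosts are not reserved for the long-wait tasks, so another
	// queued task may grab them first -- hence no hard wait-time SLA.
	fmt.Println("total hosts to request:", base+extra)
}
```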
Discussion
SLA Guarantees Not Possible Under Bottlenecks
Currently, our system has to run in the presence of performance bottlenecks of one kind or another. We generally hit global limits on the number of dynamic hosts at least once per week, even though our distro-level max hosts limits are infrequently reached. These global limits reflect limitations of logkeeper and of our AWS storage allotment. While our AWS storage allotment was increased this week, it would be inadvisable to raise the global limit until logkeeper is retired. As long as these limits exist, SLA guarantees for task wait time are not possible without overriding established rules such as prioritizing patch builds over waterfall tasks. The closer we get to these bottlenecks, the more performance degrades, unavoidably and in more or less the same general pattern.
When we can no longer spin up additional hosts, queue size increases, and time since queue clearing also increases. As Brian pointed out, our queue is not a FIFO queue, though it does have some rules that increase a task’s priority as its wait time grows. The queue is continually reshuffled to accommodate a variety of prioritization rules, and it is constantly taking in new tasks. Consequently, the longer a queue runs without emptying out, the greater the chance that some task or tasks will be continually bumped to the back of the queue and accumulate a large wait time. I agree with Brian that this is the likely cause of the “long tail” of excessive task wait times. We also came to the conclusion that introducing scheduling rules to, e.g., bump a long-wait task to the head of the queue would be a bad idea: it would interfere with existing rules, add complexity, and generally promote tasks that the existing rules had previously marked as low-priority. However, host allocation and queue length management are promising potential solutions.
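To make the starvation mechanism concrete, here is a toy Go model of a reshuffled, non-FIFO queue with an aging bonus. The scoring formula and numbers are invented for illustration and do not reflect Evergreen’s actual scheduling rules.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Toy model: tasks are re-sorted on every scheduling pass by a score that
// combines configured priority with an aging bonus. If higher-priority work
// keeps arriving faster than the aging bonus grows, a low-priority task can
// sit near the back indefinitely -- the "long tail" of wait times.
type task struct {
	name     string
	priority float64
	enqueued time.Time
}

func score(t task, now time.Time) float64 {
	agingBonus := now.Sub(t.enqueued).Hours() * 0.5 // small bump per hour waited
	return t.priority + agingBonus
}

func main() {
	now := time.Now()
	queue := []task{{"old-low-priority", 1, now.Add(-3 * time.Hour)}}
	// Each pass, a fresh batch of higher-priority patch-build tasks arrives.
	for pass := 1; pass <= 3; pass++ {
		queue = append(queue, task{fmt.Sprintf("new-patch-%d", pass), 10, now})
		sort.Slice(queue, func(i, j int) bool {
			return score(queue[i], now) > score(queue[j], now)
		})
		fmt.Printf("pass %d, back of queue: %s\n", pass, queue[len(queue)-1].name)
	}
}
```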
As this sprint’s experiments nevertheless demonstrate, more generous host allocation rules improve performance when we can spin up as many hosts as we want, which is possible most of the time. While we should always keep an eye out for future bottlenecks and work to remove them, we still have plenty of room to improve evergreen task-running performance under normal operating conditions.
Future Directions
Understanding Idle Time Tradeoffs
This is the most important new area of inquiry, IMO; that is, it has the greatest expected return on information.
There are three main reasons we need to better understand the tradeoffs associated with host idleness:
- Savings: We spend hundreds of thousands of dollars on idle hosts each year. However, we also spend hundreds of thousands of dollars on hosts that are spinning up and not yet ready to run tasks. We could potentially save hundreds of thousands of dollars a year by understanding how to minimize the total time hosts spend idle or spinning up (a back-of-the-envelope sketch follows this list). It may also be the case that some configurations cost about the same but perform noticeably better or worse.
- Windows Performance: We currently regulate Windows host costs by setting low distro-level max hosts limits. This results in very bad performance. It might be possible to get equivalent cost savings without as heavy a toll on performance.
- Making sure performance increases aren’t too costly: In general, I think it is a good idea to have some kind of feedback rule that increases host allocation as a response to poor performance (e.g. time since queue was last cleared, number of long-wait tasks). However, this will likely result in greater idle time. Before we take any action that significantly increases the host idle time, we should understand the tradeoffs.
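As mentioned in the Savings item above, here is a back-of-the-envelope Go sketch of the cost estimate. Every number in it is a made-up placeholder rather than a real Evergreen figure; the point is only the shape of the tradeoff between idle time and spin-up time.

```go
package main

import "fmt"

// Back-of-the-envelope sketch: estimate annual spend on hosts that are either
// idle or still spinning up. All numbers below are placeholders.
func annualCost(avgHosts, nonProductiveFraction, hourlyRate float64) float64 {
	hoursPerYear := 24.0 * 365.0
	return avgHosts * nonProductiveFraction * hourlyRate * hoursPerYear
}

func main() {
	idle := annualCost(1000, 0.10, 0.40)       // avg hosts, fraction of time idle, $/hour
	spinningUp := annualCost(1000, 0.05, 0.40) // fraction of time spent booting
	fmt.Printf("idle: $%.0f/yr, spin-up: $%.0f/yr, total: $%.0f/yr\n",
		idle, spinningUp, idle+spinningUp)
}
```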
Roll Out Feedback Rules to More Distros
Given the “Improvements in Task Wait Times from Wait-based Feedback Mechanism” results shown above, it might be promising to turn this option on globally. However, I want to understand the impact on host idle time before a large-scale rollout, which is why this is ranked below that objective.
Develop New Data Source
The DependenciesMetTime PR populates the DependenciesMetTime field, now using an already-existing cache. This metric is not currently included in the task-end-stats report: out of concern that additional database writes would severely degrade performance, the field is only ever set on the in-memory object, and because it is never written to the database it never makes it into the report. This would be easy to fix, but we would need to monitor the increased writes to make sure they were not impacting Evergreen webapp reliability too much.
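The fix could look something like the following sketch, which persists the field at the same point the in-memory object is updated. The collection name, field name, and function are assumptions for illustration, not the actual Evergreen code, and any rollout would need the write-volume monitoring mentioned above.

```go
package main

import (
	"context"
	"time"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

// Sketch of the "easy fix": when DependenciesMetTime is set on the in-memory
// task, also persist it so the task-end-stats report can pick it up. The
// collection and field names here are assumptions, not the Evergreen schema.
func persistDependenciesMetTime(ctx context.Context, tasks *mongo.Collection, taskID string, metAt time.Time) error {
	_, err := tasks.UpdateOne(ctx,
		bson.M{"_id": taskID},
		bson.M{"$set": bson.M{"dependencies_met_time": metAt}},
	)
	return err
}

func main() {
	ctx := context.Background()
	client, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://localhost:27017"))
	if err != nil {
		panic(err)
	}
	defer client.Disconnect(ctx)

	tasks := client.Database("evergreen").Collection("tasks") // assumed names
	_ = persistDependenciesMetTime(ctx, tasks, "example_task_id", time.Now())
}
```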
Future Directions Not Considered
Rethinking Host Task Assignment
The current paradigm is low-complexity and high-speed. Complicating it would likely introduce new edge cases and bugs. Host allocation tweaks and bottleneck targeting have a greater expected benefit at much lower expected effort. However, once the containers initiative becomes more mature, we may need to rethink this anyway, so I’m going to revisit it at the end of the quarter.