
Evergreen Performance Bulletin 2021-02-22

As always, please let me know if you have any suggestions whatsoever.

Link to the last and next bulletins.

Recent Findings

New Data Source:  

I replaced the “makespan” metric in the distro-scheduler-report with time_to_empty and host_queue_ratio.

Discussion

host_queue_ratio is interpretable as the number of time chunks of length MaxDurationThreshold it would take for the hosts currently free plus the newly spawned hosts to work through the backlog of tasks in the queue. If this is 1, everything is working as expected. If this is 0.5, the queue will be empty in distroQueueInfo.MaxDurationThreshold * 0.5 minutes, and so on.

The immediate motivations for me are:

  1. Figuring out when this number is significantly below one, so that we can scale back hosts before they become idle, reducing the number of idle hosts.
  2. Having a better metric of queue health. Alerting on task waits is after-the-fact, as task waits come from the task-end-stats report. As a metric that uses only the queue state and host information, this has much greater predictive potential.

The unitless ratio should be actionable on its own, without having to know MaxDurationThreshold. However, we also have time_to_empty, which should give roughly the time it would take the estimated free capacity in the next MaxDurationThreshold time chunk to work through the tasks currently in the queue.

There is a major caveat here, which is that the number does not factor in the tasks in the queue that have individual runtime longer than the MaxDurationThreshold. If a queue were full of only these tasks, then time_to_empty would be zero.
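To make the definitions concrete, here is a minimal sketch of how the two metrics could be computed from queue and host state. The field names and helper are hypothetical, not the actual distro-scheduler-report code, and the units (seconds) are an assumption.

~~~python
from dataclasses import dataclass
from typing import List


@dataclass
class QueueInfo:
    # Hypothetical shapes; the real distroQueueInfo / host documents differ.
    expected_task_runtimes_secs: List[float]  # one entry per queued task
    max_duration_threshold_secs: float        # distroQueueInfo.MaxDurationThreshold


def scheduler_metrics(queue: QueueInfo, free_hosts: int, newly_spawned_hosts: int):
    """Sketch of host_queue_ratio and time_to_empty.

    host_queue_ratio: number of MaxDurationThreshold-length time chunks it would
    take (free + newly spawned) hosts to work through the backlog.
    time_to_empty: that ratio expressed as wall-clock time.
    """
    capacity = free_hosts + newly_spawned_hosts
    if capacity == 0:
        return float("inf"), float("inf")

    # Caveat noted above: tasks whose individual runtime exceeds
    # MaxDurationThreshold are not counted, so a queue containing only such
    # tasks would report a time_to_empty of zero.
    countable = [
        r for r in queue.expected_task_runtimes_secs
        if r <= queue.max_duration_threshold_secs
    ]
    backlog_secs = sum(countable)

    # Work the (free + spawned) hosts can absorb in one time chunk.
    chunk_capacity_secs = capacity * queue.max_duration_threshold_secs

    host_queue_ratio = backlog_secs / chunk_capacity_secs
    time_to_empty_secs = host_queue_ratio * queue.max_duration_threshold_secs
    return host_queue_ratio, time_to_empty_secs
~~~

With a ratio of 0.5 and a 20-minute MaxDurationThreshold, for example, this would report a time_to_empty of 10 minutes, matching the interpretation above.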

Future Directions

Understanding Idle Time Tradeoffs

(carried forward without modification until this can be addressed)

This is the most important new area of inquiry, IMO. That is to say, it has the greatest expected value for return on information.

There are three main reasons we need to better understand the tradeoffs associated with host idleness:

  1. Savings: We spend hundreds of thousands of dollars on idle hosts each year. However, we also spend hundreds of thousands of dollars on hosts that are spinning up and are not yet ready to run tasks. We could potentially save hundreds of thousands of dollars a year by understanding how to minimize the total time hosts spend idle or spinning up. It may also be the case that there are configurations with better or worse performance that cost about the same.
  2. Windows Performance: We currently regulate Windows host costs by setting low distro-level max hosts limits. This results in very bad performance. It might be possible to get equivalent cost savings without as heavy a toll on performance.
  3. Making sure performance increases aren’t too costly: In general, I think it is a good idea to have some kind of feedback rule that increases host allocation as a response to poor performance (e.g. time since queue was last cleared, number of long-wait tasks). However, this will likely result in greater idle time. Before we take any action that significantly increases the host idle time, we should understand the tradeoffs.

Task Packing to decrease idle host time for Windows

(new)

I’ve been looking at windows-64-vs2017-large more closely, on account of its poor performance. There are idle hosts up to 2 hours after the queue is exhausted. Currently, hosts request tasks as soon as they are free, which guarantees that the task at the head of the queue is taken off the queue as fast as possible. However, in the case of Windows distros, performance is degraded to the point of multiple-hour waits due to the distro max hosts limit. We should modify our strategy, because it makes no sense to mind seconds and let hours go by. The hardest part would be detecting when the queue is in terminal decline. One possibility would be to mark hosts nearing the end of their allotted hour as unable to accept new tasks once the queue is in terminal decline. As such, if we were able to use some coordination step that added on the order of a minute to each task’s wait, without worrying about modes, it would still be worth it if we could decrease idle time enough to justify lifting the distro max host limits.
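One way to picture that coordination step is an acceptance check that declines new work on a host whose billed hour is nearly up, once the queue looks like it is winding down. This is a sketch of the idea only; all names and the cutoff value are assumptions, not existing Evergreen scheduler code.

~~~python
# Sketch only: the proposed "stop feeding hosts near the end of their billed
# hour once the queue is in terminal decline" rule. Names are hypothetical.

BILLING_HOUR_MINUTES = 60
END_OF_HOUR_CUTOFF_MINUTES = 50  # assumed: no new work in the last ~10 minutes of the hour


def should_accept_task(minutes_since_host_start: float,
                       queue_in_terminal_decline: bool) -> bool:
    """Return False for hosts that should idle out instead of taking a task.

    While the queue is healthy we keep today's behavior (always accept, so the
    head of the queue is picked up as fast as possible). Once the queue is in
    terminal decline, hosts nearing the end of their allotted hour stop
    accepting tasks so they can be reaped rather than sit idle.
    """
    if not queue_in_terminal_decline:
        return True
    minutes_into_hour = minutes_since_host_start % BILLING_HOUR_MINUTES
    return minutes_into_hour < END_OF_HOUR_CUTOFF_MINUTES
~~~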

Several potential methods for detecting terminal queue decline were unsuccessful, in that they did not provide warning in advance of idle hosts. The dataset was the 5-minute average queue length for windows-64-vs2017-large. I tried deltas, 3-wide average deltas, and deltas of deltas. The clearest signal was deltas divided by current queue size; however, this only became unambiguous at 00:25, at which point there was already a clear signal from idle hosts (1 at 00:20 to 25 at 00:25).
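For reference, the candidate signals were simple transforms of the 5-minute average queue length, roughly along these lines (column name and framing are assumptions, not the actual analysis code):

~~~python
import pandas as pd


def decline_signals(df: pd.DataFrame, col: str = "avg_queue_length") -> pd.DataFrame:
    """Candidate terminal-decline signals over a 5-minute-average queue length series."""
    out = df.copy()
    out["delta"] = out[col].diff()                       # deltas
    out["delta_3wide"] = out["delta"].rolling(3).mean()  # 3-wide average deltas
    out["delta_of_delta"] = out["delta"].diff()          # deltas of deltas
    # The signal that looked clearest: delta normalized by current queue size.
    out["delta_over_queue"] = out["delta"] / out[col].replace(0, float("nan"))
    return out
~~~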

Partly due to this, I have been cleaning up the “makespan” metric in the distro-scheduler-report (see the replacement described under Recent Findings).

Future Directions Not Considered

~~~

Blotter

(For one-off or temporary performance issues)

Verifying the new SLA dash

The test set was tasks that finished between 2/9/2021 12 AM UTC and 2/9/2021 20:20 UTC.

This covers rhel62-large, patch_requests tasks only, using the evergreen_task_analysis code to get 50th percentile corrected wait times.
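The numbers below come from the internal evergreen_task_analysis code, which isn’t reproduced here; the summary step is roughly equivalent to the following sketch (column names other than corrected_wait_time(mins) are assumptions):

~~~python
import pandas as pd


def corrected_wait_summary(tasks: pd.DataFrame,
                           distro: str = "rhel62-large",
                           requester: str = "patch_request") -> pd.Series:
    """Filter to the test window and distro, then summarize corrected waits.

    The real analysis also drops tasks with bad datetime values (the WARNING
    lines below) and can include or exclude task group tasks; only the basic
    filtering and percentile step is shown here.
    """
    in_window = tasks["finish_time"].between("2021-02-09 00:00", "2021-02-09 20:20")
    sel = in_window & (tasks["distro"] == distro) & (tasks["requester"] == requester)
    waits = tasks.loc[sel, "corrected_wait_time(mins)"].dropna()
    # count, mean, and 50th percentile correspond to the figures reported below.
    return waits.describe(percentiles=[0.5])
~~~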

with task group tasks vs without:

without:

WARNING:root:3/237592 tasks had bad datetime values
corrected_wait_time(mins)    2.883333
Name: 0.5, dtype: float64
INFO:root: AVG: 9.284366275823551
total tasks: 597

with:

WARNING:root:4/250442 tasks had bad datetime values
corrected_wait_time(mins)    19.583333
Name: 0.5, dtype: float64
INFO:root: AVG: 21.45218015799725
total tasks: 3249

with only task_group_max_hosts: null

WARNING:root:4/250441 tasks had bad datetime values
corrected_wait_time(mins)    19.383333
Name: 0.5, dtype: float64
INFO:root: AVG: 20.641348600508845
total tasks: 3144

(labels have been edited for concision and clarity)

This roughly matches the 3153 tasks and 18.4-minute 50th percentile from Splunk. If we exclude task group tasks, thereby also removing all tasks that depend on task group tasks from this analysis, we lose 2500 or so tasks, many of them tasks with longer waits. I can’t think of a reason that tasks which depend on task group tasks, but are not themselves task group tasks, should have longer waits, since dependencies_met_time only starts when the dependencies are met. It’s possible that these tasks are correlated with periods of high load.

amazon2-large and amazon2-small are also roughly equivalent. I’m going to say the wait times SLA dash is solid.

Provenance