
Evergreen Performance Bulletin 2021-03-08

As always, please let me know if you have any suggestions whatsoever.

Link to the last and next bulletins.

Updates

New Alerts for High Task Waits

The new alerts have been challenging to tune properly. Our system also maxes out its capacity on most weekdays, so most firings of this alert are due to the self-imposed global max dynamic hosts cap. A future direction could be some sort of alert-inhibition system that captures this relationship, which would be possible using https://github.com/pahoughton/splunk-alert and Prometheus Alertmanager. That work could be included in the upcoming Improve Alerting and Status Visibility epic, or could form part of a distinct epic.

The new alerts are a good example of how to alert on distro-specific values, or on any other by-X breakdown. We currently alert if any distro has a high level of overdue tasks in its queue, or a high median task wait time. There are some quirks: you apparently have to use Table, and, as far as I can tell, you can’t include a particular distro name in the alert or create separate alerts for multiple distros.

The combination of these two alerts is particularly powerful because it draws on two different data sources (distro-scheduler-report and task-end-stats): a pre-emptive alert based on the queue and a post-hoc alert based on newly finished tasks. Tuning both in concert may be a challenge, but in an ideal world, both alerts firing together would give us a lot of confidence that some kind of serious event has affected task waits.

Evergreen: long waits on enqueued tasks

Evergreen: long waits on finished tasks

Replacement for Makespan Metric is Also Unusable, Pending Fix

time_to_empty in distro-scheduler-report, introduced last sprint, maxes out too often to be usable. In several instances, I saw this behavior coincide with drops in host count. This is potentially because of the following line:

hostsAvail := nHostsFree + len(hostsSpawned) - distroQueueInfo.CountDurationOverThreshold

This line does not account for task group tasks, which, due to task group limits, may not add to the grand total of QOS hosts spawned specifically to run long-duration tasks. This causes hostsAvail to be lower than it should be, and therefore time_to_empty to max out whenever hostsAvail <= 0.
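To make the failure mode concrete, here is a minimal Go sketch of the arithmetic described above; the names, signature, and cap value are illustrative assumptions rather than the actual Evergreen implementation.

package main

import (
	"fmt"
	"time"
)

// maxTimeToEmpty stands in for whatever cap the report uses; the real value may differ.
const maxTimeToEmpty = 24 * time.Hour

// timeToEmpty sketches the calculation: if task group limits keep long-duration
// tasks from adding spawned hosts, hostsAvail is undercounted, can drop to zero
// or below, and the estimate pins at its maximum.
func timeToEmpty(queueDuration time.Duration, nHostsFree, nHostsSpawned, countDurationOverThreshold int) time.Duration {
	hostsAvail := nHostsFree + nHostsSpawned - countDurationOverThreshold
	if hostsAvail <= 0 {
		return maxTimeToEmpty // the "maxed out" behavior seen in distro-scheduler-report
	}
	return queueDuration / time.Duration(hostsAvail)
}

func main() {
	// Two free hosts plus three spawned, but five tasks over the duration
	// threshold: hostsAvail is 0, so time_to_empty maxes out.
	fmt.Println(timeToEmpty(3*time.Hour, 2, 3, 5))
}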

A fix is underway: https://github.com/hhoke/evergreen/pull/new/EVG-14274.

Future Directions

Understanding Idle Time Tradeoffs

(carried forward without modification until this can be addressed)

This is the most important new area of inquiry, IMO. That is to say, it has the greatest expected return on information.

There are three main reasons we need to better understand the tradeoffs associated with host idleness:

  1. Savings: We spend hundreds of thousands of dollars on idle hosts each year. However, we also spend hundreds of thousands of dollars on hosts that are spinning up and are not yet ready to run tasks. We could potentially save hundreds of thousands of dollars a year by understanding how to minimize the total time hosts spend idle or spinning up. It may also be the case that some configurations cost about the same but perform better or worse.
  2. Windows Performance: We currently regulate Windows host costs by setting low distro-level max hosts limits. This results in very bad performance. It might be possible to get equivalent cost savings without as heavy a toll on performance.
  3. Making sure performance increases aren’t too costly: In general, I think it is a good idea to have some kind of feedback rule that increases host allocation in response to poor performance (e.g. time since the queue was last cleared, or the number of long-wait tasks); a sketch of such a rule follows this list. However, this will likely result in greater idle time. Before we take any action that significantly increases host idle time, we should understand the tradeoffs.
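As a concrete illustration of what such a feedback rule could look like, here is a minimal Go sketch; the function name, inputs, and thresholds are all assumptions made for this example, not anything Evergreen currently implements.

package main

import "fmt"

// adjustedMaxHosts is a hypothetical feedback rule: it raises a distro's max
// hosts limit when the queue has not cleared recently or when many tasks have
// been waiting a long time. Inputs and thresholds are illustrative only.
func adjustedMaxHosts(baseMaxHosts int, minutesSinceQueueCleared float64, longWaitTasks int) int {
	extra := 0
	if minutesSinceQueueCleared > 60 {
		// Queue has not emptied in over an hour: allow 25% more hosts.
		extra += baseMaxHosts / 4
	}
	if longWaitTasks > 100 {
		// Many long-wait tasks: allow another 25%.
		extra += baseMaxHosts / 4
	}
	return baseMaxHosts + extra
}

func main() {
	// A distro capped at 40 hosts whose queue has not cleared in 90 minutes
	// and has 150 long-wait tasks would be allowed 60 hosts.
	fmt.Println(adjustedMaxHosts(40, 90, 150))
}

Any rule along these lines would itself need the idle-time tradeoff analysis above before being turned on.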

Task Packing to decrease idle host time for Windows

(carried forward without modification until this can be addressed)

I’ve been looking at windows-64-vs2017-large more closely, on account of its poor performance. There are idle hosts up to 2 hours after the queue is exhausted. Currently, hosts request tasks as soon as they are free, which guarantees that the task at the head of the queue is taken off the queue as quickly as possible. However, in the case of Windows distros, performance is degraded to the point of multiple-hour waits due to the distro max hosts limit. We should modify our strategy, because it makes no sense to mind seconds and let hours go by. The hardest part would be detecting when the queue is in terminal decline. One possibility would be to mark hosts nearing the end of their allotted hour as unable to accept new tasks. As such, if we were able to use some coordination step that added on the order of a minute to each task’s wait, without worrying about modes, it would still be worth it if we were able to decrease idle time enough to justify lifting the distro max host limits.

Several potential methods for detecting terminal queue decline were unsuccessful, in that they did not provide warning in advance of idle hosts. The dataset was the 5-minute average queue length for windows-64-vs2017-large. I tried deltas, 3-wide average deltas, and the delta of deltas. The clearest signal was deltas divided by current queue size; however, this only became unmistakable at 00:25PM, at which point there was already a clear signal from the idle hosts themselves (1 at 00:20 to 25 at 00:25).
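For reference, here is a rough Go sketch of these signals, computed over a slice of 5-minute average queue lengths; it is a reconstruction for illustration (with made-up sample values, and the delta-of-deltas variant omitted), not the exact analysis that was run.

package main

import "fmt"

// deltas returns the change in queue length between consecutive 5-minute samples.
func deltas(queue []float64) []float64 {
	out := make([]float64, 0, len(queue))
	for i := 1; i < len(queue); i++ {
		out = append(out, queue[i]-queue[i-1])
	}
	return out
}

// smoothed returns a 3-wide moving average of the deltas.
func smoothed(d []float64) []float64 {
	out := make([]float64, 0, len(d))
	for i := 2; i < len(d); i++ {
		out = append(out, (d[i]+d[i-1]+d[i-2])/3)
	}
	return out
}

// relativeDeltas divides each delta by the current queue size, the signal that
// gave the clearest (if late) indication of terminal decline.
func relativeDeltas(queue []float64) []float64 {
	out := make([]float64, 0, len(queue))
	for i := 1; i < len(queue); i++ {
		if queue[i] == 0 {
			out = append(out, 0)
			continue
		}
		out = append(out, (queue[i]-queue[i-1])/queue[i])
	}
	return out
}

func main() {
	// Illustrative 5-minute average queue lengths, not real data.
	queue := []float64{400, 390, 370, 330, 250, 120, 30, 5}
	d := deltas(queue)
	fmt.Println("deltas:           ", d)
	fmt.Println("3-wide avg deltas:", smoothed(d))
	fmt.Println("delta / queue:    ", relativeDeltas(queue))
}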

Partially due to these negative results, I am trying to clean up the “makespan” metric in the distro-scheduler-report.

Future Directions Not Considered

~~~

Blotter

(For one-off or temporary performance issues)

2021-02-24

Set ubuntu1804-large back to fleet from on-demand.

2021-03-02

A putative issue with commit queue merge tasks was determined to be an illusion/misconception caused by the UI. A ticket has been created, and Maria has solved the UI issue conceptually, though it has not been implemented yet. https://jira.mongodb.org/browse/EVG-14191