
Evergreen Performance Bulletin 2021-01-25

As always, please let me know if you have any suggestions whatsoever.

Link to the last and next bulletins.

Recent Findings

Following the last bulletin, Jonathan suggested that task group max hosts might be responsible for the long task waits seen in the mms nightly builds. Cloud Dev Prod declined to increase these limits, so we decided to drop it. EVG-13461

Discussion

Future Directions

Understanding Idle Time Tradeoffs

(carried over from the last bulletin)

This is the most important new area of inquiry, IMO. That is to say, it has the greatest expected value for return on information.

There are three main reasons we need to better understand the tradeoffs associated with host idleness:

  1. Savings: We spend hundreds of thousands of dollars on idle hosts each year. However, we also spend hundreds of thousands of dollars on hosts that are spinning up and not yet ready to run tasks. By understanding how to minimize the total time hosts spend idle or spinning up, we could potentially save hundreds of thousands of dollars a year. There may also be configurations that cost about the same but perform meaningfully better or worse.
  2. Windows Performance: We currently regulate Windows host costs by setting low distro-level max hosts limits. This results in very bad performance. It might be possible to get equivalent cost savings without as heavy a toll on performance.
  3. Making sure performance increases aren’t too costly: In general, I think it is a good idea to have some kind of feedback rule that increases host allocation in response to poor performance (e.g. time since the queue was last cleared, or the number of long-wait tasks; see the sketch after this list). However, this will likely result in greater idle time. Before we take any action that significantly increases host idle time, we should understand the tradeoffs.
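For illustration, here is a minimal sketch (in Go, since Evergreen is written in Go) of the kind of wait-based feedback rule described in item 3. All names, rates, and thresholds below are hypothetical and do not correspond to Evergreen’s actual host allocator code.

```go
// Hypothetical sketch of a wait-based feedback rule for host allocation.
// None of these names or constants come from Evergreen's real allocator.
package allocator

import "time"

// QueueStats summarizes the state of a distro's task queue.
type QueueStats struct {
	TimeSinceQueueCleared time.Duration // how long since the queue last drained to empty
	LongWaitTasks         int           // tasks waiting longer than some SLA threshold
	CurrentHosts          int
	DistroMaxHosts        int
}

// extraHosts returns how many additional hosts to request in response to
// poor performance, capped by the distro-level max hosts setting.
func extraHosts(s QueueStats) int {
	extra := 0
	// Feedback term 1: the longer the queue has gone without clearing,
	// the more hosts we add (one per 30 minutes, a made-up rate).
	extra += int(s.TimeSinceQueueCleared / (30 * time.Minute))
	// Feedback term 2: add a host for every 10 long-waiting tasks.
	extra += s.LongWaitTasks / 10
	// Respect the distro-level cap; this is where the idle-time tradeoff
	// bites, since every extra host may end up idle once the queue clears.
	if s.CurrentHosts+extra > s.DistroMaxHosts {
		extra = s.DistroMaxHosts - s.CurrentHosts
	}
	if extra < 0 {
		extra = 0
	}
	return extra
}
```

The idle-time tradeoff shows up in the cap: any rule of this shape trades faster queue clearing for hosts that may sit idle afterward, which is exactly what we need to measure before rolling such rules out more widely.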

Roll Out Feedback Rules to More Distros:

Given the “Improvements in Task Wait Times from Wait-based Feedback Mechanism” shown above, it might be promising to turn this option on globally. However, I want to understand the impact on host idle time before a large-scale rollout, which is why this is ranked below that objective.

Develop New Data Source: 

This work was reverted due to a flaky test; see EVG-13767.

Future Directions Not Considered

~~~

Blotter

(For performance issues that weren’t significant enough to make it into their own ticket)

AWS ARM stuff

On January 13th, I switched ubuntu1804-arm64-large from Fleet to on-demand due to 2+ hour waits. distro!=*arm* is no longer in the spot instance shortage dashboard. I forgot to switch this back to Fleet until I wrote this report. In the future, turning a distro to on-demand should include the following steps:

  1. Post in #evergreen-ops about the change and rationale, and tag Cris.
  2. Make the change.
  3. Schedule checkups for 24 hours and roughly 1 month from the current time.

Limits reached (again) for gp2

It was discovered that we cannot go up to 6000, though this worked for a while. EVG-13727

As a consequence of this, the gp2 limit was increased by build. This is likely still a bottleneck, but not at 6000.