Evergreen Performance Bulletin 2021-02-08

As always, please let me know if you have any suggestions whatsoever.

Link to the last and next bulletins.

Recent Findings

Develop New Data Source:

This data source has finally been merged and is running in prod without incident.

Cris used it to create the Evergreen Task start Wait Times SLA dash.

Discussion

Preliminary evidence suggests that 96% of the tasks with over one hour wait in the “cruisin” dataset were associated with a task group. Provenance. Similar exclusion cleared up extreme 99%ile values on the SLA dash. As of 2021-02-09 this is no longer considered an open question.

Future Directions

Understanding Idle Time Tradeoffs

(carried forward without modification until this can be addressed)

This is the most important new area of inquiry, IMO. That is to say, it has the greatest expected value for return on information.

There are three main reasons we need to better understand the tradeoffs associated with host idleness:

Savings: We spend hundreds of thousands of dollars on idle hosts each year. However, we also spend hundreds of thousands of dollars on hosts that are spinning up and are not yet ready to run tasks. We could potentially save hundreds of thousands of dollars a year by understanding how to minimize the total time host spend idle or spinning up. It may also be the case that there are better or worse configurations for performance that cost similarly.
Windows Performance: We currently regulate windows host costs by setting low distro-level max hosts limits. This results in very bad performance. It might be possible to get equivalent cost savings without as heavy as a toll on performance.
Making sure performance increases aren’t too costly: In general, I think it is a good idea to have some kind of feedback rule that increases host allocation as a response to poor performance (e.g. time since queue was last cleared, number of long-wait tasks). However, this will likely result in greater idle time. Before we take any action that significantly increases the host idle time, we should understand the tradeoffs.

Future Directions Not Considered

~~~

Blotter

For performance issues that weren’t significant enough to make it into their own ticket

Spinup of rhel80 and spindown of rhel62

rhel80 distros have been created and had their distro max hosts limits increased, after an initial period of confusion where there performance was very bad. See EVG-13933 for details for how this poor performance presented. rhel62 distros have had their distro max host limits decreased, and wait times are currently long on these distros. Across the system we have been hitting global max hosts limits almost every day.

Capacity Planning

Total dynamic host capacity has been a known scaling target for a long time. However, I am starting to think that the host creation rate threshold can be very important as well.

Recently, I modified this threshold from 500 to 2000. Consequently, rhel80-medium was able to scale up hosts faster, with less backlog of intent hosts and therefore less queue backlog. Linked, pay attention to the last graph (host creation rate threshold), the shape of “Hosts requested” peaks (the first graph), and the large buildup visible in “Expected queue duration”. Threshold 500 Threshold 2000