[wp-trac] [WordPress Trac] #58998: Prevent races and stampedes when flush_rewrite_rules() is called

Tue Aug 8 09:04:37 UTC 2023

#58998: Prevent races and stampedes when flush_rewrite_rules() is called
---------------------------+------------------------------
 Reporter:  iCaleb         |       Owner:  (none)
     Type:  enhancement    |      Status:  new
 Priority:  normal         |   Milestone:  Awaiting Review
Component:  Rewrite Rules  |     Version:  trunk
 Severity:  normal         |  Resolution:
 Keywords:  has-patch      |     Focuses:  performance
---------------------------+------------------------------
Description changed by SergeyBiryukov:

Old description:

> This ticket is more or less a follow-up to
> https://core.trac.wordpress.org/ticket/11384.
>
> Currently when `flush_rewrite_rules()` is called,
> `WP_Rewrite->flush_rules()` will set the `rewrite_rules` option to an
> empty string first. And then afterwards it will go and generate the rules
> and save the results to the option.
>
> On a high traffic site there could be potentially hundreds of requests
> all coming in during this time period where the option is empty
> (considerably more-so if it either takes a particularly long time to
> generate, or if the site is utilizing replica databases). Each request
> during the time period will find `rewrite_rules` empty and attempt to
> generate and save the rules. This is the first of the stampedes that
> result from this behavior. You now have all these frontend requests for a
> period of time consuming more processing time than usual, and each having
> to connect to the primary database to perform a database write (which can
> further delay replication lag).
>
> Eventually that stampede should come to an end, but not before
> potentially triggering other stampedes. Since `rewrite_rules` is stored
> in alloptions, you have that cache key being invalidated and set quite
> frequently because all the requests are touching it. That introduces both
> some fun alloptions race conditions in general, but also makes it
> possible that the object cache (Memcached specifically) can enter into a
> stampede condition.
>
> In short, each memcached add() has to find a correctly-sized slab and
> starts to write it's data into memory. A few things happen in between and
> then at the end the item is linked to the hash table so it becomes the
> official record for that key. However if there is a large amount of add()
> requests all coming in at the same time all wanting to be apart of the
> same memory slab - each will essentially be evicting each other so nobody
> is the winner. This can very quickly cascade the site into severe
> instability when alloptions is not being cached. This is of course a
> problem in it's own right and rewrite_rules is not directly to blame, but
> thought it was worth mentioning to add more context.
>
> Lastly, because a request could be at any point in it's logic when
> `rewrite_rules` is made into an empty string, you open up to the
> possibility of just some weird race conditions with execution order and
> whatnot. A plugin could be doing something admittedly funky for example
> (`WP_Rewrite` leaves all it's properties open to being directly changed),
> resulting in a small percentage of requests possibly coming up with a
> different/partial result for `rewrite_rules` than others. I've seen this
> first-hand a number of times. It's not always obvious bugs either,
> sometimes just very particular load order / hooks / current cached state.
> At best you end up with a few 404s during the initial stampede. At worst,
> the last request to save was the faulty path and now you're stuck with
> incorrect rules until the next flush. And it's very tricky to narrow down
> what happened. We of course can't prevent all incorrect code, but we can
> give it less chances to mess things up.
>
> Ideally, and with a seemingly easy fix with limited to no downside, we
> can eliminate these problems and make the rewrite_rules generation more
> consistent and predictable.

New description:

 This ticket is more or less a follow-up to #11384.

 Currently when `flush_rewrite_rules()` is called,
 `WP_Rewrite->flush_rules()` will set the `rewrite_rules` option to an
 empty string first. And then afterwards it will go and generate the rules
 and save the results to the option.

 On a high traffic site there could be potentially hundreds of requests all
 coming in during this time period where the option is empty (considerably
 more-so if it either takes a particularly long time to generate, or if the
 site is utilizing replica databases). Each request during the time period
 will find `rewrite_rules` empty and attempt to generate and save the
 rules. This is the first of the stampedes that result from this behavior.
 You now have all these frontend requests for a period of time consuming
 more processing time than usual, and each having to connect to the primary
 database to perform a database write (which can further delay replication
 lag).

 Eventually that stampede should come to an end, but not before potentially
 triggering other stampedes. Since `rewrite_rules` is stored in alloptions,
 you have that cache key being invalidated and set quite frequently because
 all the requests are touching it. That introduces both some fun alloptions
 race conditions in general, but also makes it possible that the object
 cache (Memcached specifically) can enter into a stampede condition.

 In short, each memcached add() has to find a correctly-sized slab and
 starts to write it's data into memory. A few things happen in between and
 then at the end the item is linked to the hash table so it becomes the
 official record for that key. However if there is a large amount of add()
 requests all coming in at the same time all wanting to be apart of the
 same memory slab - each will essentially be evicting each other so nobody
 is the winner. This can very quickly cascade the site into severe
 instability when alloptions is not being cached. This is of course a
 problem in it's own right and rewrite_rules is not directly to blame, but
 thought it was worth mentioning to add more context.

 Lastly, because a request could be at any point in it's logic when
 `rewrite_rules` is made into an empty string, you open up to the
 possibility of just some weird race conditions with execution order and
 whatnot. A plugin could be doing something admittedly funky for example
 (`WP_Rewrite` leaves all it's properties open to being directly changed),
 resulting in a small percentage of requests possibly coming up with a
 different/partial result for `rewrite_rules` than others. I've seen this
 first-hand a number of times. It's not always obvious bugs either,
 sometimes just very particular load order / hooks / current cached state.
 At best you end up with a few 404s during the initial stampede. At worst,
 the last request to save was the faulty path and now you're stuck with
 incorrect rules until the next flush. And it's very tricky to narrow down
 what happened. We of course can't prevent all incorrect code, but we can
 give it less chances to mess things up.

 Ideally, and with a seemingly easy fix with limited to no downside, we can
 eliminate these problems and make the rewrite_rules generation more
 consistent and predictable.

--

-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/58998#comment:2>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform