[wp-trac] [WordPress Trac] #57483: Improve optimizations for WP_Site::get_instance and WP_Network::get_instance

WordPress Trac noreply at wordpress.org
Tue Jan 17 13:49:43 UTC 2023


#57483: Improve optimizations for WP_Site::get_instance and
WP_Network::get_instance
--------------------------+-----------------------------
 Reporter:  kovshenin     |      Owner:  (none)
     Type:  defect (bug)  |     Status:  new
 Priority:  normal        |  Milestone:  Awaiting Review
Component:  General       |    Version:  5.3
 Severity:  normal        |   Keywords:
  Focuses:  performance   |
--------------------------+-----------------------------
 Following up on #42251 and r45910.

 I recently ran into a problem with this specific optimization. In one case
 it resulted in an existing site being redirected to the main site as if it
 did not exist. In a second case it went all the way to `dead_db()`. In
 both instances the existing site in question had a `-1` stored in the
 object cache. Both problems persisted until the object cache was flushed.

 I think the main problem is that in `WP_Site::get_instance` (and its
 `WP_Network` counterpart) we do not account for `wpdb::get_row` failing:
 we never check `wpdb::$last_error`. This means that a brief database
 outage can cause a permanent `-1` to be saved in the object cache for a
 site that actually does exist.
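
 For context, the relevant path looks roughly like this (paraphrased from
 core, not a verbatim copy); `wpdb::get_row()` returns `null` both when
 the row genuinely does not exist and when the query itself fails, so the
 two cases are indistinguishable at the point where the cache is primed:

 {{{
 // Paraphrased from WP_Site::get_instance(); the exact shape in core
 // may differ slightly.
 $_site = $wpdb->get_row(
     $wpdb->prepare( "SELECT * FROM {$wpdb->blogs} WHERE blog_id = %d LIMIT 1", $site_id )
 );

 if ( empty( $_site ) ) {
     // A failed query lands here too, and the site is then cached
     // as non-existent.
     wp_cache_add( $site_id, -1, 'sites' );
     return false;
 }
 }}}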

 It's not easy to reproduce, as existing/valid sites will be in the cache
 most of the time, so that portion of the codebase is rarely executed.
 However, there is one core function that helps with this tremendously:
 `wpmu_update_blogs_date()`, which invalidates the site cache every time
 it needs to update the timestamp (when a post is published or updated,
 for example).

 I wrote this little mu-plugin to help reproduce the problem:

 {{{
 add_action( 'muplugins_loaded', function() {
     switch_to_blog( 3 );
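     // Bumps last_updated for site 3, which also invalidates its
     // entry in the 'sites' cache group, so the next lookup hits
     // the database.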
     wpmu_update_blogs_date();
     restore_current_blog();
     die();
 } );
 }}}

 I fired some traffic at the site and observed the MySQL query log,
 confirming a significant number of `UPDATE` queries for the site
 timestamp, as well as `SELECT * FROM wp_blogs ...` queries coming from
 `WP_Site::get_instance()`.

 Next, I simulated some turbulence in the network by killing all
 established connections on MySQL port 3306 (using tcpkill), resulting in
 a few "MySQL server has gone away" errors in the PHP error log. I did
 this a couple of times until the site with the ID of 3 disappeared from
 My Sites in the Network Admin.

 I confirmed the object cache state in Redis:

 {{{
 127.0.0.1:6379> get sites:3
 "-1"
 }}}
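
 As an aside, once the bad entry is in place it can be cleared for just
 that one site instead of flushing the entire cache; a quick sketch:

 {{{
 // Remove the poisoned "-1" entry for site 3 from the object cache;
 // equivalent to `DEL sites:3` in Redis.
 wp_cache_delete( 3, 'sites' );
 }}}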

 I was able to reproduce this quite reliably with about 50 concurrent
 requests to the site and 2-3 rounds of tcpkill. Of course it doesn't
 happen all that often under normal conditions, though I was unlucky
 enough to see it happen twice in the last 6 months (on very high-traffic
 sites with plenty of updates).

 Personally, I feel we should revert this entire optimization. As much as
 I appreciate all the hard work, I don't see much benefit in caching IDs
 that don't exist, much like we don't cache posts that do not exist.

 At the very least, though, we should check `wpdb::$last_error` before
 caching a `-1` in both cases.
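
 A minimal sketch of that check (my suggestion only, not a tested patch;
 the same guard would apply in `WP_Network::get_instance()`):

 {{{
 if ( empty( $_site ) ) {
     // Only cache the "missing site" marker when the query actually
     // succeeded; a failed query also produces an empty result.
     if ( empty( $wpdb->last_error ) ) {
         wp_cache_add( $site_id, -1, 'sites' );
     }
     return false;
 }
 }}}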

-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/57483>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform

