[wp-trac] [WordPress Trac] #60375: Site Transfer Protocol

Wed Apr 10 20:29:52 UTC 2024

#60375: Site Transfer Protocol
-------------------------+------------------------------
 Reporter:  zieladam     |       Owner:  (none)
     Type:  enhancement  |      Status:  new
 Priority:  normal       |   Milestone:  Awaiting Review
Component:  Import       |     Version:
 Severity:  normal       |  Resolution:
 Keywords:               |     Focuses:
-------------------------+------------------------------

Comment (by dmsnell):

 **A long post follows; please bear with me.**

 In my proposal, I considered using a form of vector clock to track
 potentially-unsynchronized state between connected WordPresses. I've tried
 to convey an extremely rough sketch in the attachment above. This does not
 address the conflated ID problem, but I can hopefully speak to that at the
 end.

 I propose a best-effort system for ensuring that updated resources are
 detected and shared between connected sites, where connected sites are
 admin-level connections communicating via a "backdoor" secure connection,
 established by exchanging the public halves of private/public key pairs.

 For each connection, both sites will store a new record in their
 synchronization state table indicating the identity of the connected
 WordPress. This will be important for the UX of the system.

 When resources are updated, they have inherent dependencies. These could
 be files or related database records. By instrumenting `$wpdb` properly,
 we can build associations and dependency chains automatically (or choose
 to keep //all// resources in sync between sites and record everything).
 Every time a record is updated, we track in a state table a version number
 for that resource. This is a simple system: a write increments the version
 by one, even if the data is the same as before the update.

 A site will then have a new table tracking every uploaded file, every
 plugin, every database record, and every other resource it has, as well
 as a single version number for each of those. This table will be much
 smaller than the tables containing those resources. Deleting a resource
 can be represented through `NULL` or `0` or some other //tombstone//.
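
 To make the version table concrete, here is a minimal sketch in Python
 (chosen only for brevity). The class and method names are hypothetical,
 and the tombstone here is a "deleted" flag alongside the version, which
 is just one of the options mentioned above. A real implementation would
 hook into `$wpdb` and store these rows in a database table.

```python
from typing import Dict, Tuple


class VersionTable:
    """Hypothetical in-memory stand-in for the per-site state table."""

    def __init__(self) -> None:
        # resource id -> (version, deleted flag acting as the tombstone)
        self.rows: Dict[str, Tuple[int, bool]] = {}

    def record_write(self, resource_id: str) -> int:
        """Bump the version on every write, even if the data is unchanged."""
        version, _ = self.rows.get(resource_id, (0, False))
        self.rows[resource_id] = (version + 1, False)
        return version + 1

    def record_delete(self, resource_id: str) -> int:
        """Keep the row as a tombstone so the deletion can be synced later."""
        version, _ = self.rows.get(resource_id, (0, False))
        self.rows[resource_id] = (version + 1, True)
        return version + 1
```

 Only an identifier, a number, and a flag are kept per resource, which is
 why this table stays small relative to the content it describes.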

 When sites connect, a primary site can transfer all its records (the
 //Transfer//) to the secondary site. It will record in the sync state
 table which //version// of each resource it sent during the transfer
 (and it can wait for acknowledgement from the receiving site). From this
 point on it will have a sound guess at what content the secondary site
 has.

 When sites continue to communicate, the primary site can compare the
 version of each resource it has updated against the version it last sent
 to the secondary site. Any new, deleted, or updated resources are expected
 to be stale on the secondary site and thus need to be transferred over.
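
 The comparison described above could look roughly like the following
 Python sketch. The function names and the `type:id` resource identifiers
 are assumptions made for illustration. A tombstone is shown as `0`, but
 any value that differs from the last sent version behaves the same way.

```python
from typing import Dict, List


def stale_resources(local_versions: Dict[str, int],
                    last_sent: Dict[str, int]) -> List[str]:
    """Return ids whose local version differs from the version last sent.

    New resources are missing from last_sent, updated ones carry a higher
    version, and deleted ones carry a tombstone, so all three show up here.
    """
    return sorted(
        rid for rid, version in local_versions.items()
        if last_sent.get(rid) != version
    )


def record_acknowledged(last_sent: Dict[str, int],
                        sent_versions: Dict[str, int]) -> None:
    """After the secondary site acknowledges, remember what it now has."""
    last_sent.update(sent_versions)
```

 For example, with local versions `{"post:1": 3, "post:2": 1}` and a
 last-sent record of `{"post:1": 2}`, both `post:1` (updated) and
 `post:2` (new) are reported as stale.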

 **User flows**

 It's at this point we can see some high-level designs in this approach.
 For minimal additional work and storage we can track what content needs to
 be transferred. This can be presented to a user in a dashboard, and we can
 even create "recognizers" to further classify the resources. For example,
 a plugin can give a name and description to an otherwise unknown database
 row. The primary site can perform a quick computation to estimate the
 total number of resources needing a transfer, as well as their approximate
 byte size.
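
 As a rough illustration of the dashboard summary and the "recognizer"
 idea, consider this Python sketch. Everything in it is hypothetical:
 the `PendingResource` shape, the `summarize()` function, the example
 `wp_shop_orders` table name, and the way a plugin would register a
 recognizer are assumptions, not an existing API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class PendingResource:
    resource_id: str   # hypothetical "table:id" or "file:path" scheme
    approx_bytes: int  # estimated size of the resource on the wire


# A recognizer returns a human-readable label, or None if it does not
# know the resource.
Recognizer = Callable[[str], Optional[str]]


def summarize(pending: List[PendingResource],
              recognizers: List[Recognizer]) -> Dict[str, object]:
    """Count pending resources, total their size, and label each one."""
    labels: Dict[str, str] = {}
    for item in pending:
        label = "Unrecognized resource"
        for recognize in recognizers:
            found = recognize(item.resource_id)
            if found is not None:
                label = found
                break
        labels[item.resource_id] = label
    return {
        "count": len(pending),
        "approx_bytes": sum(item.approx_bytes for item in pending),
        "labels": labels,
    }


def shop_recognizer(resource_id: str) -> Optional[str]:
    """Example recognizer a plugin might supply for its own table."""
    if resource_id.startswith("wp_shop_orders:"):
        return "Shop order"
    return None
```

 The summary only touches the small sync-state rows, not the resources
 themselves, which is what keeps the estimate quick.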

 This method also depends on establishing two-way communication via the
 "backdoor" channel. This can be achieved on standard WordPress hosts using
 a combination of long-polling and `stream_select()` and some other
 communication on the server, but does not require long-running PHP
 processes or threads or forking processes. See the next attached image for
 a preview of the dashboard.

 This is a direct synchronization protocol, whereby two connected sites
 trust each other, and the receiving site will import received content into
 its database. One thing it currently lacks is a sense of provenance. It would
 be favorable to store the source and timestamp of all imported resources
 in order to be able to show what has been sync'd vs. what was created
 locally.
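
 One possible shape for such provenance records, sketched in Python. The
 `Provenance` and `ProvenanceLog` names are hypothetical, as is the
 convention of treating any resource without a record as locally created.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Optional


@dataclass(frozen=True)
class Provenance:
    source_site: str       # identity of the site the resource came from
    imported_at: datetime  # when the receiving site imported it


class ProvenanceLog:
    """Hypothetical store mapping imported resources to their origin."""

    def __init__(self) -> None:
        self._by_resource: Dict[str, Provenance] = {}

    def record_import(self, resource_id: str, source_site: str,
                      imported_at: Optional[datetime] = None) -> None:
        when = imported_at or datetime.now(timezone.utc)
        self._by_resource[resource_id] = Provenance(source_site, when)

    def origin(self, resource_id: str) -> str:
        """Return the source site, or "local" if it was never imported."""
        entry = self._by_resource.get(resource_id)
        return entry.source_site if entry else "local"
```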

 Because of the sync-state table, all transfers are interruptible and
 trackable. They can fail and be retried. Also, through the use of the HTML
 API and dependency inference, it's possible to prioritize resource
 transfer, such that dependent resources exist on the receiving end before
 the resource itself. This leads to zero-downtime transfers where an
 imported post is immediately complete upon import, since any linked
 content exists first and the post can be rewritten upon arrival with the
 HTML API to update those links.
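
 The "dependencies first" ordering amounts to a topological sort of the
 dependency graph. A minimal sketch using Python's standard `graphlib`
 module (Python 3.9+) follows. The `transfer_order()` name and the
 resource identifiers are assumptions, and the dependency map is taken
 as given rather than inferred with the HTML API.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def transfer_order(depends_on: Dict[str, Set[str]]) -> List[str]:
    """Order resources so each one comes after everything it depends on.

    depends_on maps a resource id to the ids it needs on the receiving
    site before it can be imported. Circular dependencies raise
    graphlib.CycleError and would need separate handling.
    """
    return list(TopologicalSorter(depends_on).static_order())


# Hypothetical example: a post embeds an image, and both reference a user.
order = transfer_order({
    "post:1": {"image:7", "user:2"},
    "image:7": {"user:2"},
})
```

 Here `user:2` is placed before `image:7`, which is placed before
 `post:1`, so the post's links can be rewritten as soon as it arrives.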

 **Discussion**

 I apologize for how lengthy and simultaneously rough and prescribed this
 is. I'm trying to dump some ideas "onto paper" since @zieladam and I have
 spoken about this many times. It's a big-picture idea for a technical
 design that powers a specific user flow, which is all about visibility
 into a reliable and interruptible synchronization process.

-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/60375#comment:24>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform