[wp-trac] Re: [WordPress Trac] #5183: General Meta-Data Table

Wed Jun 24 23:25:15 GMT 2009

#5183: General Meta-Data Table
-----------------------------------------+----------------------------------
 Reporter:  filosofo                     |        Owner:  anonymous     
     Type:  enhancement                  |       Status:  reopened      
 Priority:  normal                       |    Milestone:  Future Release
Component:  General                      |      Version:  2.9           
 Severity:  normal                       |   Resolution:                
 Keywords:  needs-patch meta-table meta  |  
-----------------------------------------+----------------------------------

Comment(by mikeschinkel):

 Replying to [comment:32 Denis-de-Bernardy]:
 > I think the real underlying issue is we need abstraction in the php.
 It's like, if the data cannot go into wp_posts (which could really be
 called wp_nodes),

 How very Drupalish of you. (I came to WordPress by way of Drupal. :)

 > and ends up requiring a separate table, I'd personally suggest it should
 get its own meta table as well. So long as the needed php functions are
 around to manage this user-defined meta table, things should be fine.

 I think that is grand in concept and certainly follows the relational
 model but I think in practice following said rules results in increases in
 rigidity and complexity without equivalent increases in benefit.

 Certainly for the most prominent data elements it makes sense to have them
 each in their own tables but I urge you to reconsider the edge cases as
 they are IMO handled much better by creating a repository to hold them all
 instead of requiring "yet another table" for what are often edge cases.

 > I know... I've been playing around with the idea of a nodes table for
 over 10 years.

 Not to one-up you (well, okay maybe I am) I've been playing with them for
 20 years... :) And for the most I've decided the complexity is not worth
 the pain.

 > We could actually do this by switching to UUID types instead of auto-
 incremented integers. It adds an extra benefit along the way, too: you can
 then sync sites that use shared databases.

 UUID have a certain elegance but if you are concerned about performance as
 you were regarding "vtypes" then UUIDs have a whole series of issues:
 http://www.mysqlperformanceblog.com/2007/03/13/to-uuid-or-not-to-uuid/

 > Might it be that you only ever needed to write queries that involved one
 or two joins?

 LOL!!!!!  Sorry, I've at times written queries that exceeded the number of
 tables that the then SQL Server parser could handle (max 32 tables.)

 I've built some very complex database systems. I built a full ecommerce
 platform starting in 1995 for internal use that processed almost $100
 million in revenue for my company over a 10 year period. I can promise you
 that almost everything I did had more than two joins.

 And I've had to do a lot of work with tables in Drupal too. For example,
 here's a query I pulled from a project of mine (not exactly trivial):

 {{{
 SELECT
         vehicle.nid,
         vehicle.title,
         vehicle_content.teaser,
         vehicle_path.dst AS path,
         IFNULL(caption.field_photo_caption_value,image_node.title) AS
 photo_caption,
         files.filepath AS photo_file,
         IFNULL(photo_info.field_image_orientation_value,'landscape') AS
 photo_orientation
 FROM
         node showcase
         INNER JOIN content_field_vehicle cfv ON showcase.nid=cfv.nid
         INNER JOIN node vehicle ON cfv.field_vehicle_nid=vehicle.nid
         INNER JOIN content_type_vehicle ctv ON vehicle.nid=ctv.nid
         INNER JOIN node_revisions vehicle_content ON
 vehicle.nid=vehicle_content.nid AND vehicle.vid=vehicle_content.vid
         INNER JOIN url_alias vehicle_path ON
 vehicle_path.src=CONCAT('node/',vehicle.nid)
         LEFT OUTER JOIN content_field_photo_caption caption ON
 vehicle.nid=caption.nid
         LEFT OUTER JOIN image_attach ia ON vehicle.nid=ia.nid
         LEFT OUTER JOIN node image_node ON ia.iid=image_node.nid
         LEFT OUTER JOIN content_type_image photo_info ON
 photo_info.nid=image_node.nid
         LEFT OUTER JOIN files ON image_node.nid=files.nid
 WHERE
         showcase.type='showcase' AND
         vehicle.type='vehicle' AND
         ctv.field_vehicle_subtype_value in ('abstract','standalone') AND
         (files.fid IS NULL OR COALESCE(files.filename,'small')='%s') AND
         showcase.nid=%d AND
         vehicle.status=1
 }}}

 > I ask, because my own 10 years of experience tell me a rather different
 story. I was a varchar index junkie until 2-3 years ago -- until I ended
 up needing to look into the performance of an app that was heavy on joins
 and that had more than a minute amount of data.

 Funny how the pendulum swings one way and then the next. It is likely our
 cycles are just not in sync.

 > PGSQL ended up using merge join plans (= the slowest you can get) that
 involved zillions of rows. No amount of vacuum analyze, adding partial
 indexes, or clustering the data helped in any meaningful way. Re-
 engineering the DB in such a way that PG had means to fetch the needed
 rows using less joins, on integer/uuid fields only, bought the app was
 brought back to life.

 No debate there.

 > My own bottom line was this: primary key (object_id, object_type) =
 avoid like the Plague if you've a large amount of data.

 What's large?  1000?  Or 100 million?  I'll give you the latter, not the
 former.

 For performance on large data sets, integer keys are the best. String are
 much slower, and the string the larger the indexes get and the more page
 loads from disk are required to seek into the index. But that's doesn't
 mean all indexes MUST be integers as I'll explain below.

 Also I'm not a fan of composite keys in general, but I've also come to
 believe that every rule has reasonable exceptions. You can get around the
 composite key issue by concatonating (i.e. "273links_meta") but that's not
 elegant unless it can be done on a trigger.

 OTOH, one of the biggest evils I have found in technology is premature
 optimization. To enforce a prohibition on composite indexes of
 (int,string) "at all costs" is a premature optimization for tables that
 may 99.9% of the time have less than 100,000 records. That's why I am
 suggesting them for use in edge cases. IF we find that one of those
 objects starts getting hundreds of thousands of records and many
 installations THEN it is time to optimize. But don't force all the
 complexity on the edge cases simple because there is a tiny chance on of
 them might become a non-edge case.

 > In case you're thinking WP will never get so much data that this will
 make a big difference, consider the zillions of blogs, users, posts,
 terms, links, and comments on wordpress.com.

 I'm not at all assuming WP won't get that much data. I am instead assuming
 that not all data in WP will get so big that performance optimization
 concerns actually matter.

 Look at it this way; if you don't give plugin and theme developers an
 appropriate place to store data they will instead store their data in far
 less appropriate places (i.e. an overloaded wp_options table, for example)
 and they will see far worse performance than the purity of your approach
 wants to ensure against. Remember that most people who use and program for
 WordPress do know have any experience with SQL and are even afraid of it
 so they won't be creating custom tables on their own, they will be
 overloading wp_options.

 As a case in point: Nathan Rice. Nathan is a brillant theme developer but
 has said he doesn't understand database and won't go near SQL even though
 I know if he did he's easily smart enough to learn it in no time. There
 are *many* Nathans out that and you need to give them a tool they are
 comfortable using instead of giving them prohibition for fear that one day
 by some small chance they might misuse that tool.

-- 
Ticket URL: <http://core.trac.wordpress.org/ticket/5183#comment:33>
WordPress Trac <http://core.trac.wordpress.org/>
WordPress blogging software