[wp-hackers] A "terms" table
Matt Mullenweg
m at mullenweg.com
Sun Apr 15 19:41:40 GMT 2007
WordPress is like a sandwich.
Assuming we've scared off all the vegetarians with all the talk of BBQ,
the core is the meat. Our meat is the wp_posts table, which stores what
I would refer to as the primary points of content. Currently for us this
is posts, pages, and attachments, though in the future I could see it
expanding to support new post types such as externals, galleries, and
hopefully things we can't even imagine yet.
On the side you have chips (good comments), vegetables (idiot comments),
and that funny stuff your cousin brought that you're going to move
around on the plate but never eat (spam comments). I think comments are
okay right now, maybe they could use a meta table but we can talk about
that later.
Meat alone is only a real meal at rboren's house, so most people put
things on the sandwich to add flavor and spice it up. Some add other
types of meat, in the WP world this is postmeta, which we call custom
fields in polite company.
We also have condiments, which are currently handled by two tables:
wp_categories and wp_post2cat. On the taxonomy/condiment side, right now
we really only allow ketchup aka categories, and users for at least a
year have been asking for more. In 2.2 we decided to satiate their
appetites.
Everyone agrees that ketchup and mayonnaise are totally different, even
though they're both condiments and you put them both on sandwiches. No
one is trying to create some horrible pink mixture of the two tastes.
However there are currently two schools of thought on how we should
store the data for categories and tags at a very low level in our DB.
Let me do my best to make the case for putting category data and tag
data in separate tables, and feel free to chime in if you think I've
missed any points.
* We shouldn't ship anything with a data schema people disagree on,
because plugins and themes will be written against it.
* They're different things, so we should have them in different tables.
* Tags can have things like synonyms, and don't need things like hierarchy.
* There are ugly legacy field names in the category table like
category_nicename, cat_name, cat_ID (wtf capitals) and we can clean
those up in new tables
* With separate tables our queries on the admin side become WAY easier
and cleaner to do, with no bitwise or _count nonsense
* Plugins for tagging have implemented it this way.
The code currently in SVN does something different. It uses the
categories table for names of the tags and then adds fields to hint how
those names are being used for the admin section. If I wanted to make
everyone happy and be popular I would just go with the above since there
seems to be good consensus there, but I think this is an important
long-term decision for WP so let me spell out some reasons why I think
the current design has legs not just for 2.2 but beyond.
1. It performs faster.
On front-end display, we have added ZERO QUERIES to support tags. The
query that grabs categories is also grabbing tags and we're sorting them
out in the code.
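As a rough sketch of "one query, sort them out in the code": the example below uses Python and an in-memory SQLite database rather than WP's PHP and MySQL. The table and column names (terms, post2term, kind) are hypothetical stand-ins for illustration, not the schema currently in SVN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE terms (term_id INTEGER PRIMARY KEY, name TEXT, slug TEXT);
    -- 'kind' is a hypothetical hint column: how the term is used on this post
    CREATE TABLE post2term (post_id INTEGER, term_id INTEGER, kind TEXT);
    INSERT INTO terms VALUES
        (1, 'Dogs', 'dogs'), (2, 'Travel', 'travel'), (3, 'BBQ', 'bbq');
    INSERT INTO post2term VALUES
        (10, 1, 'category'), (10, 2, 'tag'), (10, 3, 'tag');
""")

def terms_for_post(conn, post_id):
    """Fetch every term on a post in ONE query, then split by kind in code."""
    rows = conn.execute(
        "SELECT t.name, p.kind FROM post2term p "
        "JOIN terms t ON t.term_id = p.term_id "
        "WHERE p.post_id = ? ORDER BY t.term_id",
        (post_id,),
    ).fetchall()
    result = {"category": [], "tag": []}
    for name, kind in rows:
        result[kind].append(name)
    return result

post_terms = terms_for_post(conn, 10)
print(post_terms)
```

The front end pays for a single SELECT whether or not a post has tags; the split into categories and tags is a cheap loop over rows it already has.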
In the dashboard some of the queries are more complicated (though not
really any different than what we deal with for link categories) and a
few milliseconds slower than the old ones. However, that really doesn't
matter because 1) we only need to write them once and more importantly
2) they're run several orders of magnitude fewer times than the ones
that display the blog on the front-end. A mantra has always been that
user time is more important than developer time.
A separate tag naming table and post2tag table would require at least 2
additional queries and/or joins on the front page, which I already think
does too many queries and is too heavy.
2. It's a better long-term foundation.
I think there are a lot of benefits to having a single ID that maps to a
term and a slug. Let's pretend we had perfect foresight 5 years ago and
instead of wp_categories we had wp_terms.
Regardless of the UI and philosophy behind categories, tags, and ooga
booga, on a data level they're still mapping a set of terms to an item
in post_content.
In WP a term has three important things: an ID, a human-entered name,
and a URL-friendly slug. We use the ID in our relations instead of the
slug because it's more efficient and slugs are not necessarily unique
(because of hierarchy).
Having "dogs" in a category table have one ID and "dogs" in a tag table
have a different ID is a long-term deck of cards that we will seriously
regret later. It's MUCH harder to reconcile items with internally
different IDs than it is to split out unique IDs into different tables.
As for some of the bit and count fields currently causing grief, I would
argue the solution for that isn't a separate tags table, but a separate
table specifically for that type of data. In Drupal for this
infrastructure they have term_data, term_hierarchy, term_node,
term_relation, term_synonym, vocabulary, and vocabulary_node_types
tables. I think that might be a little more than we need, but there are
some concepts there we could pretty cleanly combine into a single extra
table that isn't called categories or tags, and will provide a good and
scalable foundation for years to come.
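To make the "single ID per term, plus one extra table" idea concrete, here is a minimal sketch in Python with SQLite. Everything named here (terms, term_usage, used_as, parent, post_count) is an assumption made up for illustration, not a proposed or existing WP schema. It only shows that "dogs" keeps one ID however it is used, and that splitting out by usage later is a simple filter.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- One row per term: the ID, the human-entered name, the URL-friendly slug.
    CREATE TABLE terms (term_id INTEGER PRIMARY KEY, name TEXT, slug TEXT);
    -- Hypothetical single extra table: how each term is used, plus the
    -- hierarchy and count data that would otherwise clutter the terms table.
    CREATE TABLE term_usage (
        term_id    INTEGER REFERENCES terms(term_id),
        used_as    TEXT,     -- 'category' or 'tag'
        parent     INTEGER,  -- parent term_id, 0 for none
        post_count INTEGER,  -- posts using the term this way
        PRIMARY KEY (term_id, used_as)
    );
    INSERT INTO terms VALUES (1, 'Dogs', 'dogs'), (2, 'Puppies', 'puppies');
    INSERT INTO term_usage VALUES
        (1, 'category', 0, 4), (1, 'tag', 0, 9), (2, 'category', 1, 2);
""")

def usages_for_slug(db, slug):
    """Every way a slug is used; the term_id is the same in each row."""
    return db.execute(
        "SELECT u.term_id, u.used_as FROM term_usage u "
        "JOIN terms t ON t.term_id = u.term_id "
        "WHERE t.slug = ? ORDER BY u.used_as",
        (slug,),
    ).fetchall()

def split_out(db, used_as):
    """Splitting shared IDs out by usage is a plain filter."""
    return db.execute(
        "SELECT t.term_id, t.name FROM terms t "
        "JOIN term_usage u ON u.term_id = t.term_id "
        "WHERE u.used_as = ? ORDER BY t.term_id",
        (used_as,),
    ).fetchall()

dogs = usages_for_slug(db, "dogs")
tags_only = split_out(db, "tag")
categories_only = split_out(db, "category")
print(dogs, tags_only, categories_only)
```

Going the other way, from two tables where "dogs" has two unrelated IDs, would mean matching rows on name or slug and rewriting every relation that points at the losing ID.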
3. There should be no user- or plugin-facing problems with how it's
currently implemented, or if we decide to change it.
Now this isn't to suggest for a second there aren't bugs; many have been
fixed already and I'm sure there are many still left, but that is going
to be true of ANY code we put in WP and anyone who suggests otherwise is
not very familiar with software development. From a point of view of
plugin authors, they shouldn't have to think or care if we're storing it
in a categories table or a turkey, the function they use should remain
consistent no matter what we change or gymnastics we do behind the
curtain. No matter what we do in 2.2 or 2.3, that's not going to change.
I do think there is something intrinsically better about shipping and
iterating than noodling without release in search of the "perfect"
implementation.
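The point about plugin authors not needing to care where the data lives can be sketched like this. It is Python, not WP code, and every name in it (get_post_tags, SharedTableStore, SeparateTableStore) is hypothetical. The only claim is that the function a plugin calls returns the same thing before and after the storage is swapped.

```python
class SharedTableStore:
    """Hypothetical backend: categories and tags together, with a kind hint."""

    def __init__(self):
        self.rows = [
            (10, "Dogs", "category"),
            (10, "BBQ", "tag"),
            (10, "Travel", "tag"),
        ]

    def tags(self, post_id):
        return [name for pid, name, kind in self.rows
                if pid == post_id and kind == "tag"]


class SeparateTableStore:
    """Hypothetical backend: tags kept in a structure of their own."""

    def __init__(self):
        self.post2tag = {10: ["Travel", "BBQ"]}

    def tags(self, post_id):
        return list(self.post2tag.get(post_id, []))


_store = SharedTableStore()

def get_post_tags(post_id):
    """The only thing a plugin calls; it never sees which store is in use."""
    return sorted(_store.tags(post_id))

before = get_post_tags(10)
_store = SeparateTableStore()  # the gymnastics behind the curtain
after = get_post_tags(10)
print(before == after)
```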
More importantly from a user's point of view, all that really matters is
that they have a box they can type tags in and that their host doesn't
tell them not to upgrade to 2.2 because it does more queries.
4. I'm open
I'm not personally tied to any code written thus far and if I think the
best thing is.
There is a separate but related decision around what to do about the
release date. Based on the discussion here I'm going to make a go/no-go
decision on Tuesday.
If we do delay I think we should laser-focus on tags and not allow other
pet-issues to creep in, and I will fully expect people to put in as much
time writing code and fixing bugs as they have arguing points on mailing
lists, IRC, and trac. At the very least I hope we've learned a bit more
about getting these things out of the way early rather than a week or
two before a release. Also if something is sitting in trac, take it to
the hackers list early.
I think if we stick with the current implementation we can hit it with a
very stable release next Monday, but if we decide to replace it we need
to push it back at least into mid-May.
--
Matt Mullenweg
http://photomatt.net | http://wordpress.org
http://automattic.com | http://akismet.com