[wp-hackers] A "terms" table

Matt Mullenweg m at mullenweg.com
Sun Apr 15 19:41:40 GMT 2007


WordPress is like a sandwich.

Assuming we've scared off all the vegetarians with all the talk of BBQ, 
the core is the meat. Our meat is the wp_posts table, which stores what 
I would refer to as the primary points of content. Currently for us this 
is posts, pages, and attachments, though in the future I could see it 
expanding to support new post types such as externals, galleries, and 
hopefully things we can't even imagine yet.

On the side you have chips (good comments), vegetables (idiot comments), 
and that funny stuff your cousin brought that you're going to move 
around on the plate but never eat (spam comments). I think comments are 
okay right now, maybe they could use a meta table but we can talk about 
that later.

Meat alone is only a real meal at rboren's house, so most people put 
things on the sandwich to add flavor and spice it up. Some add other 
types of meat; in the WP world this is postmeta, which we call custom 
fields in polite company.

We also have condiments, which are currently handled by two tables: 
wp_categories and wp_post2cat. On the taxonomy/condiment side, right now 
we really only allow ketchup aka categories, and users for at least a 
year have been asking for more. In 2.2 we decided to satiate their 
appetites.

Everyone agrees that ketchup and mayonnaise are totally different, even 
though they're both condiments and you put them both on sandwiches. No 
one is trying to create some horrible pink mixture of the two tastes.

However there are currently two schools of thought on how we should 
store the data for categories and tags at a very low level in our DB.

Let me do my best to make the case for putting category data and tag 
data in separate tables, and feel free to chime in if you think I've 
missed any points.

* We shouldn't ship anything with a data schema people disagree on, 
because plugins and themes will be written against it.
* They're different things, so we should have them in different tables.
* Tags can have things like synonyms, and don't need things like hierarchy.
* There are ugly legacy field names in the category table like 
category_nicename, cat_name, cat_ID (wtf capitals) and we can clean 
those up in new tables.
* With separate tables our queries on the admin side become WAY easier 
and cleaner to do, with no bitwise or _count nonsense.
* Plugins for tagging have implemented it this way.

The code currently in SVN does something different. It uses the 
categories table for names of the tags and then adds fields to hint how 
those names are being used for the admin section. If I wanted to make 
everyone happy and be popular I would just go with the above since there 
seems to be good consensus there, but I think this is an important 
long-term decision for WP so let me spell out some reasons why I think 
the current design has legs not just for 2.2 but beyond.

1. It performs faster.

On front-end display, we have added ZERO QUERIES to support tags. The 
query that grabs categories is also grabbing tags and we're sorting them 
out in the code.
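
A minimal sketch of the "one query, sort it out in code" idea follows. 
It is not the code in SVN: the table layout, the rel_type column, and 
the terms_for_post helper are all hypothetical, and SQLite stands in 
for MySQL so the example runs on its own.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical shared schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE categories (cat_id INTEGER PRIMARY KEY, name TEXT, slug TEXT);
        -- rel_type is an assumed hint column saying how the name is used.
        CREATE TABLE post2cat (post_id INTEGER, cat_id INTEGER, rel_type TEXT);
        INSERT INTO categories VALUES
            (1, 'News', 'news'), (2, 'dogs', 'dogs'), (3, 'bbq', 'bbq');
        INSERT INTO post2cat VALUES
            (10, 1, 'category'), (10, 2, 'tag'), (10, 3, 'tag'),
            (11, 2, 'category');
    """)
    return conn


def terms_for_post(conn, post_id):
    """Fetch categories and tags in ONE query, then split them in code."""
    rows = conn.execute(
        "SELECT c.name, p.rel_type "
        "FROM post2cat p JOIN categories c ON c.cat_id = p.cat_id "
        "WHERE p.post_id = ? ORDER BY c.cat_id",
        (post_id,),
    ).fetchall()
    result = {"category": [], "tag": []}
    for name, rel_type in rows:
        result[rel_type].append(name)
    return result


conn = build_demo_db()
print(terms_for_post(conn, 10))
# {'category': ['News'], 'tag': ['dogs', 'bbq']}
```

The front-end cost is the same single join whether a post has tags or 
not; only the loop that partitions the rows is new.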

In the dashboard some of the queries are more complicated (though not 
really any different than what we deal with for link categories) and a 
few milliseconds slower than the old ones. However, that really doesn't 
matter because 1) we only need to write them once and more importantly 
2) they're run several orders of magnitude fewer times than the ones 
that display the blog on the front-end. A mantra has always been that 
user time is more important than developer time.

A separate tag naming table and post2tag table would add at least 2 
queries and/or joins to the front page, which I already think does too 
many queries and is too heavy.

2. It's a better long-term foundation.

I think there are a lot of benefits to having a single ID that maps to a 
term and a slug. Let's pretend we had perfect foresight 5 years ago and 
instead of wp_categories we had wp_terms.

Regardless of the UI and philosophy behind categories, tags, and ooga 
booga, on a data level they're still mapping a set of terms to an item 
in post_content.

In WP a term has three important things: an ID, a human-entered name, 
and a URL-friendly slug. We use the ID in our relations instead of the 
slug because it's more efficient and slugs are not necessarily unique 
(because of hierarchy).

Having "dogs" in a category table have one ID and "dogs" in a tag table 
have a different ID is a long-term house of cards that we will seriously 
regret later. It's MUCH harder to reconcile items with internally 
different IDs than it is to split out unique IDs into different tables.

As for some of the bit and count fields currently causing grief, I would 
argue the solution for that isn't a separate tags table, but a separate 
table specifically for that type of data. For this infrastructure Drupal 
has term_data, term_hierarchy, term_node, term_relation, term_synonym, 
vocabulary, and vocabulary_node_types tables. I think that might be a 
little more than we need, but there are 
some concepts there we could pretty cleanly combine into a single extra 
table that isn't called categories or tags, and will provide a good and 
scalable foundation for years to come.
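
To make the shape of that idea concrete, here is a sketch under stated 
assumptions. The table names (terms, term_usage), the column names, and 
the terms_of_kind helper are hypothetical and are not a proposed WP 
schema; SQLite is used only so the example is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One row per term: the ID, the human-entered name, the URL-friendly slug.
    CREATE TABLE terms (
        term_id INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        slug    TEXT NOT NULL
    );
    -- Hypothetical single extra table: how a term is used, plus the
    -- hierarchy and count data that would otherwise live in bit fields.
    CREATE TABLE term_usage (
        term_id    INTEGER NOT NULL REFERENCES terms(term_id),
        kind       TEXT    NOT NULL,          -- 'category' or 'tag'
        parent_id  INTEGER,                   -- only meaningful for hierarchy
        post_count INTEGER NOT NULL DEFAULT 0,
        PRIMARY KEY (term_id, kind)
    );
    INSERT INTO terms VALUES (1, 'Dogs', 'dogs'), (2, 'Cats', 'cats');
    INSERT INTO term_usage VALUES
        (1, 'category', NULL, 4), (1, 'tag', NULL, 9), (2, 'tag', NULL, 2);
""")


def terms_of_kind(kind):
    """Return (term_id, slug) pairs for every term used as the given kind."""
    return conn.execute(
        "SELECT t.term_id, t.slug "
        "FROM terms t JOIN term_usage u ON u.term_id = t.term_id "
        "WHERE u.kind = ? ORDER BY t.term_id",
        (kind,),
    ).fetchall()


print(terms_of_kind("category"))  # [(1, 'dogs')]
print(terms_of_kind("tag"))       # [(1, 'dogs'), (2, 'cats')]
```

In this layout "dogs" keeps ID 1 whether it is used as a category or as 
a tag, so splitting the two later is a filter on kind rather than a 
reconciliation of two ID spaces.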

3. There should be no user- or plugin-facing problems with how it's 
currently implemented, or if we decide to change it.

Now this isn't to suggest for a second there aren't bugs. Many have been 
fixed already and I'm sure there are many still left, but that is going 
to be true of ANY code we put in WP and anyone who suggests otherwise is 
not very familiar with software development. From a point of view of 
plugin authors, they shouldn't have to think or care if we're storing it 
in a categories table or a turkey, the function they use should remain 
consistent no matter what we change or gymnastics we do behind the 
curtain. No matter what we do in 2.2 or 2.3, that's not going to change.
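
The point about a consistent function can be sketched as follows. None 
of these names are WP functions: TagStore, SharedTableStore, 
SeparateTableStore, and get_post_tags are hypothetical, and the two 
stores are in-memory stand-ins for the two storage designs under 
discussion.

```python
from typing import Protocol


class TagStore(Protocol):
    """Anything that can list the tag names attached to a post."""

    def tags_for(self, post_id: int) -> list[str]: ...


class SharedTableStore:
    """Stand-in for tags stored alongside categories with a usage hint."""

    def __init__(self, rows):
        self._rows = rows  # list of (post_id, name, rel_type) tuples

    def tags_for(self, post_id):
        return [name for pid, name, rel in self._rows
                if pid == post_id and rel == "tag"]


class SeparateTableStore:
    """Stand-in for tags stored in their own post-to-tag mapping."""

    def __init__(self, post2tag):
        self._post2tag = post2tag  # dict of post_id -> list of tag names

    def tags_for(self, post_id):
        return list(self._post2tag.get(post_id, []))


def get_post_tags(store: TagStore, post_id: int) -> list[str]:
    """The one function a plugin author calls, whatever sits behind it."""
    return sorted(store.tags_for(post_id))


shared = SharedTableStore([(10, "News", "category"), (10, "dogs", "tag"),
                           (10, "bbq", "tag")])
separate = SeparateTableStore({10: ["dogs", "bbq"]})
print(get_post_tags(shared, 10) == get_post_tags(separate, 10))  # True
```

A plugin written against get_post_tags sees the same result from either 
store, which is the property that lets the storage change between 
releases.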

I do think there is something intrinsically better about shipping and 
iterating than noodling without release in search of the "perfect" 
implementation.

More importantly from a user's point of view, all that really matters is 
that they have a box they can type tags in and that their host doesn't 
tell them not to upgrade to 2.2 because it does more queries.

4. I'm open

I'm not personally tied to any code written thus far and if I think the 
best thing is.

There is a separate but related decision around what to do about the 
release date. Based on the discussion here I'm going to make a go/no-go 
decision on Tuesday.

If we do delay I think we should laser-focus on tags and not allow other 
pet issues to creep in, and I will fully expect people to put in as much 
time writing code and fixing bugs as they have arguing points on mailing 
lists, IRC, and trac. At the very least I hope we've learned a bit more 
about getting these things out of the way early rather than a week or 
two before a release. Also if something is sitting in trac, take it to 
the hackers list early.

I think if we stick with the current implementation we can hit it with a 
very stable release next Monday, but if we decide to replace it we need 
to push it back at least into mid-May.

-- 
Matt Mullenweg
  http://photomatt.net | http://wordpress.org
http://automattic.com | http://akismet.com

