HTML5, Data Attributes, and Incompetence

HTML5 is evolving to be a concise, semantic framework for marking up textual data. There has always been the need, however, to store custom information within a page, information for which there may not exist an HTML tag or attribute. HTML5 has now provided us with data attributes: one can now store custom data in a page by simply prefixing an attribute with data- (for example data-foo-bar). Isn't this wonderful? No, it's stupid, and only illustrates the level of incompetence in the world of web implementations.

I'm not saying that the W3C, the standards body responsible for HTML5, is incompetent. Indeed, they were probably shaking their heads at the idiocy of even needing data- attributes. You see, well over a decade ago (practically forever in Internet time) the W3C had already created a much better mechanism for incorporating custom data into web pages: it's called namespaces.

The theory of namespaces is simple, taking only two steps: First, at the beginning of your XML/HTML document, you associate some prefix (let's call it data just for example) with some arbitrary URI that you own (e.g. http://www.example.com/ns/data#). Second, whenever you want to insert custom data, simply prefix an attribute (or element) with the prefix you defined earlier. So for example, a custom attribute in the data namespace you defined might be data:foo-bar.
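
To make those two steps concrete, here is a minimal sketch in JavaScript of what a namespace-aware parser does with a prefixed attribute name. The helper expandName and the bindings object are names I made up for illustration and are not part of any DOM API; the prefix and URI are the ones from the example above.

```javascript
// Step 1: the document binds a prefix to a URI you own. In XML this is
// written on an element as xmlns:data="http://www.example.com/ns/data#".
const bindings = { data: "http://www.example.com/ns/data#" };

// Step 2: a prefixed name such as "data:foo-bar" expands to a
// (namespace URI, local name) pair. expandName is a hypothetical helper.
function expandName(qualifiedName, prefixBindings) {
  const colon = qualifiedName.indexOf(":");
  if (colon < 0) {
    // Unprefixed attributes are in no namespace.
    return { namespaceURI: null, localName: qualifiedName };
  }
  const prefix = qualifiedName.slice(0, colon);
  const localName = qualifiedName.slice(colon + 1);
  if (!Object.prototype.hasOwnProperty.call(prefixBindings, prefix)) {
    throw new Error(`Unbound namespace prefix: ${prefix}`);
  }
  return { namespaceURI: prefixBindings[prefix], localName };
}

const expanded = expandName("data:foo-bar", bindings);
console.log(expanded.namespaceURI); // http://www.example.com/ns/data#
console.log(expanded.localName);    // foo-bar
```

The prefix itself carries no meaning; only the URI it is bound to and the local name matter.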

Wait a second! Is data-foo-bar really that different from data:foo-bar? Aren't we just reinventing the wheel? Well, yes, in a way, but we're really inventing a way to permanently mount a smaller, less safe spare tire because no one could be bothered to fix a flat. Namespaces have the nice feature that they prevent clashes. That is, one person could have myData:foo-bar and another person could have yourData:foo-bar, and the two never collide as long as each person binds the prefix to a distinct namespace URI. As long as people stick to namespace URIs in domains they own, this happens naturally. The data- attribute approach has no way to prevent clashes other than guesswork and luck.
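
The difference shows up as soon as two independent scripts annotate the same element. The sketch below models an element's attributes as a plain Map, which is an assumption for illustration (real code would go through the DOM), and both namespace URIs and the expandedKey helper are made up.

```javascript
// With data- attributes the key is just a string, so two scripts that pick
// the same name silently overwrite each other.
const flatAttributes = new Map();
flatAttributes.set("data-foo-bar", "mine");  // written by my script
flatAttributes.set("data-foo-bar", "yours"); // written by your script

// With namespaces the key is the (namespace URI, local name) pair. The
// prefixes myData and yourData are only shorthand for these hypothetical URIs.
const MY_NS = "http://www.example.com/ns/data#";
const YOUR_NS = "http://www.example.org/ns/data#";
const expandedKey = (namespaceURI, localName) => `{${namespaceURI}}${localName}`;

const namespacedAttributes = new Map();
namespacedAttributes.set(expandedKey(MY_NS, "foo-bar"), "mine");
namespacedAttributes.set(expandedKey(YOUR_NS, "foo-bar"), "yours");

console.log(flatAttributes.size);       // 1: one value was clobbered
console.log(namespacedAttributes.size); // 2: both values survive
```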

So why didn't namespaces take off? I'll tell you the story and let you form your own opinions. First of all, the namespace specification is technically a specification for XML, which the W3C created even further back (XML 1.0 became a Recommendation in 1998). Many HTML pages already conform to XML rules by (gasp!) having matching ending tags (e.g. <p>…</p>). If you want the world to know that your HTML page conforms to XML (making it XHTML), your server reports the content type of your document as application/xhtml+xml instead of plain text/html. Is that so hard?

Yes, it is hard for Microsoft. For years Firefox and other browsers have worked with application/xhtml+xml type documents just fine. On Microsoft's Internet Explorer, a page served as application/xhtml+xml will just die. I'm not saying it will look bad. I'm saying it crashes and burns. So people like me went to extra trouble to detect IE and send back text/html instead.
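
That kind of workaround can be sketched roughly as follows. Note that this sketch swaps in a different technique from the one I described: instead of sniffing the user agent for IE, it checks whether the request's Accept header lists application/xhtml+xml, on the assumption that a browser able to handle XHTML says so there. The function name chooseContentType is hypothetical, and quality values (q=...) are ignored for brevity.

```javascript
// Pick the media type to serve for a page that is well-formed XHTML.
// acceptHeader is the raw value of the request's Accept header (may be undefined).
function chooseContentType(acceptHeader) {
  const accepted = (acceptHeader || "")
    .split(",")
    // Drop parameters such as ";q=0.9" and normalize case and whitespace.
    .map((entry) => entry.split(";")[0].trim().toLowerCase());
  return accepted.includes("application/xhtml+xml")
    ? "application/xhtml+xml"
    : "text/html";
}

// A header that lists XHTML gets XHTML; anything else falls back to text/html.
console.log(chooseContentType("text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"));
console.log(chooseContentType("image/gif, image/jpeg, */*"));
```

A server would call this once per request and use the result as the Content-Type response header.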

Nevertheless, even though IE still doesn't support application/xhtml+xml, in IE9 Microsoft started supporting namespaces in text/html documents! However, the document IE uses for Ajax communication via XMLHttpRequest does not support XML namespaces, even though that document is served as pure text/xml! I have yet to see anyone try to explain that (except for the obvious, which you can find in the title of this article).

But it (always) gets worse. CKEditor, the most popular and supposedly most cross-browser compatible JavaScript-based HTML editor, simply dies in a page served as application/xhtml+xml, and the developers don't care to fix it. So maybe web developers should just give in and serve their pages as plain text/html, right? Well, browsers can't get namespaces right for text/html pages. Take Chrome, for example, which is usually pretty standards-compliant. In non-XML HTML pages, Chrome converts attribute names to lowercase! One could work around this by simply using lowercase attribute names, but Chrome 17 at least doesn't properly separate the attribute's "local name" from its namespace prefix. That is, in data:foo-bar, the data part indicates the namespace and foo-bar is the local name within the namespace; Chrome thinks the local name is data:foo-bar, which means you can't even do a namespace-aware attribute lookup.
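
If you are stuck with this, one possible workaround is to try the namespace-aware lookup first and fall back to the literal prefixed name. The sketch below assumes only the standard DOM methods getAttributeNS and getAttribute; the helper getNamespacedAttribute and the stand-in element object are hypothetical, the latter existing only so the example runs outside a browser.

```javascript
// Look up a namespaced attribute, falling back to the literal prefixed name
// for parsers that never split "data:foo-bar" into prefix and local name.
// element is anything with the DOM's getAttributeNS and getAttribute methods.
function getNamespacedAttribute(element, namespaceURI, prefix, localName) {
  const value = element.getAttributeNS(namespaceURI, localName);
  if (value !== null) {
    return value;
  }
  // text/html parsing lowercases attribute names, so lowercase the fallback too.
  return element.getAttribute(`${prefix}:${localName}`.toLowerCase());
}

// Stand-in for an element parsed as text/html: the attribute is stored under
// its full lowercase name and belongs to no namespace.
const htmlParsedElement = {
  attributes: { "data:foo-bar": "42" },
  getAttributeNS(namespaceURI, localName) {
    return null;
  },
  getAttribute(name) {
    return name in this.attributes ? this.attributes[name] : null;
  },
};

console.log(
  getNamespacedAttribute(htmlParsedElement, "http://www.example.com/ns/data#", "data", "foo-bar")
); // 42
```

The fallback only works if every page uses the same prefix for the namespace, which throws away exactly the flexibility namespaces were supposed to provide.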

So it's no wonder that with all this incompetence in implementing a decade-old namespace specification, the W3C must have thought, "well, if they are so stupid, we'll just give them a stupid attribute prefix; how can they screw that up?" But screw that up they will, just watch. The data- attribute specification comes with something called a dataset, allowing you to quickly look up just the data- attributes on an element. The rest of the name (what in namespace-speak was called the local name) is converted from hyphenated form to camelCase, so that data-foo-bar becomes element.dataset.fooBar. Some browser (probably IE) will forget to convert to camelCase, or will forget to remove a hyphen, or will have a dispute about which letters should be converted, or will treat certain names differently from every other browser, or something. Don't underestimate their level of incompetence.
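
For reference, here is the conversion as I understand the HTML specification's rule, written as a pair of plain functions so it runs without a browser. The function names are my own; in a real page the browser does this for you behind element.dataset.

```javascript
// Convert an attribute name such as "data-foo-bar" to its dataset property
// name: drop the "data-" prefix, then replace each hyphen that is followed by
// an ASCII lowercase letter with that letter in uppercase.
function attributeToDatasetName(attributeName) {
  if (!attributeName.startsWith("data-")) {
    throw new Error(`Not a data- attribute: ${attributeName}`);
  }
  return attributeName
    .slice("data-".length)
    .replace(/-([a-z])/g, (match, letter) => letter.toUpperCase());
}

// The reverse direction: put a hyphen before each ASCII uppercase letter,
// lowercase that letter, and prepend "data-".
function datasetNameToAttribute(propertyName) {
  return "data-" + propertyName.replace(/[A-Z]/g, (letter) => "-" + letter.toLowerCase());
}

console.log(attributeToDatasetName("data-foo-bar")); // fooBar
console.log(attributeToDatasetName("data-foo-1"));   // foo-1 (hyphen kept: "1" is not a letter)
console.log(datasetNameToAttribute("fooBar"));       // data-foo-bar
```

Note the second case: a hyphen followed by anything other than a lowercase letter stays put, which is precisely the sort of detail an implementation can get wrong.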

And name clashes? Oh yes. That is still very much a problem, leading one site to say (as if this were a new area of research):

As data attributes become more widely used, the potential for clashes in naming conventions becomes much greater. If you use an unimaginative attribute name such as data-height, then it is likely you will eventually come across a library or plugin that uses the same attribute name. Multiple scripts getting and setting a common data- attribute will probably cause chaos. In order to avoid this, I encourage people to choose a standard string (perhaps the site/plugin name) to prefix all their data- attributes — e.g. data-html5doctor-height or data-my-plugin-height.

(sigh) Here we go again.