The Java URI That Isn't

The fundamental identifier used to communicate on the Internet is the Uniform Resource Identifier (URI), specified most recently by RFC 3986. Because Java 6 conveniently includes a URI class (and because Java is ostensibly the Internet programming language), one would think that identifying Internet resources in Java is as simple as using the Java URI class. If you make this assumption, your software will probably break. First and foremost, the information stored by the Java URI class is not guaranteed to be a URI at all.

The API documentation for java.net.URI claims that the class implements RFC 2396, the precursor to RFC 3986. But if you read the small print, you'll find there's a caveat: "Deviation from RFC 2396, which is limited to US-ASCII." In other words, RFC 2396 and RFC 3986 specify precisely and exclusively which characters are allowed in a URI, and non-ASCII characters are not among them. The Java URI class accepts them anyway. The information in a Java URI class instance therefore might not be a URI.

But the Internet doesn't run on maybe-URIs or sometime-URIs. Machines expect real URIs, whether you're communicating using HTTP, processing XML, sending email, or storing files in WebDAV. If you have information in a Java URI class instance and want to pass it to, for instance, a web server, you have no idea whether the information you have is a valid URI. You have no idea if the information you have will therefore be rejected by the server.

"But I control everything that gets put into a URI class instance, so there shouldn't be any non-URI information in there to begin with." Really? Do you ever access files on your local hard drive using Java? Do you ever want to represent those using URIs? Let's say that you have a file named java—problems.txt (using the em-dash character U+2014 in the filename) on your hard drive. If you ask Java to turn that into a URI using File.toURI(), you'll get something that isn't a URI. Try it yourself:

System.out.println(new File("java\u2014problems.txt").toURI().toString());

What you get is not a URI, and could break communication with things on the Internet that expect URIs. Java purports to offer a solution to this: the URI.toASCIIString() method returns a string form of the URI guaranteed to contain only ASCII characters. In other words, all the non-ASCII characters have been escaped like they should have been to begin with.
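To see the difference between the two string forms side by side, here is a minimal sketch. The class name FileUriDemo and the isAscii helper are made up for illustration; the file does not need to exist, because File.toURI() only resolves the name against the current directory.

```java
import java.io.File;
import java.net.URI;

class FileUriDemo {
    /** Illustrative helper: true if every character of s is in the US-ASCII range. */
    static boolean isAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0x7F) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        URI uri = new File("java\u2014problems.txt").toURI();

        // toString() keeps the literal em dash, so the result is not pure ASCII.
        String plain = uri.toString();
        System.out.println(plain + "  ascii=" + isAscii(plain));

        // toASCIIString() escapes the em dash as the UTF-8 octets %E2%80%94.
        String escaped = uri.toASCIIString();
        System.out.println(escaped + "  ascii=" + isAscii(escaped));
    }
}
```

The directory part of the output depends on where you run it, but the first line should report ascii=false and the second ascii=true.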

But will using URI.toASCIIString() solve all my problems? No. Many times I want to process individual components of the URI, using URI.getRawPath() as only one example. You might have noticed that there is no URI.getRawASCIIPath(). There shouldn't need to be. The Java URI class should only hold a valid URI. It doesn't. Imagine if you had to double-check to make sure instances of Integer didn't contain the letter "Q", for example. That's how inconvenient and broken the URI situation is.

To try to manage this problem, never ever use the result of File.toURI() directly. Instead, create a utility method that does something like this:

return URI.create(file.toURI().toASCIIString());

There are more efficient ways to address this, but this will get the job done.
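Wrapped up as a complete class, the one-liner above might look like the following sketch. The names FileUris and toUri are my own invention, not anything from the Java API; the main method only shows the effect on getRawPath().

```java
import java.io.File;
import java.net.URI;

final class FileUris {
    private FileUris() {
        // static utility class, not meant to be instantiated
    }

    /**
     * Hypothetical helper: like File.toURI(), but the result is re-parsed
     * from its ASCII form so that every component is already escaped.
     */
    static URI toUri(File file) {
        return URI.create(file.toURI().toASCIIString());
    }

    public static void main(String[] args) {
        File file = new File("java\u2014problems.txt");

        // Straight from File.toURI(): the raw path still holds the literal em dash.
        System.out.println(file.toURI().getRawPath());

        // Through the helper: the raw path ends in java%E2%80%94problems.txt.
        System.out.println(toUri(file).getRawPath());
    }
}
```

Note that getPath() on either instance still decodes the escapes back to the em dash; it is only the raw accessors that differ.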

There is one more way in which URI is broken, this time in terms of even the Java API. Java URIs encode characters using escape sequences with uppercase hexadecimal digits. If you're communicating with a server (Apache, for example), you might get back a URI whose escape sequences use lowercase hexadecimal digits. According to RFC 3986, URI equivalence is determined without regard to the case of the hexadecimal digits in escape sequences, and indeed Java URI.equals() correctly ignores the case of escape sequences. But URI.hashCode() returns different hash codes for URIs that use different case in their hexadecimal escape sequences. This completely breaches the contract set forth by Object.hashCode(): "If two objects are equal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce the same integer result." The astounding result is that you cannot reliably use URIs in sets or as map keys.
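You can check this on your own runtime with the sketch below, which also shows one possible workaround: forcing every escape sequence to uppercase before a URI goes into a set or map. The class EscapeCaseDemo and the upperCaseEscapes helper are hypothetical names of mine, and example.com is just a placeholder host. The hash code comparison is printed rather than assumed, since the answer may depend on which Java version you run.

```java
import java.net.URI;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapeCaseDemo {
    // Matches one percent-encoded octet such as %2f or %2F.
    private static final Pattern ESCAPE = Pattern.compile("%[0-9a-fA-F]{2}");

    /** Hypothetical helper: rewrites every %xx escape using uppercase hex digits. */
    static URI upperCaseEscapes(URI uri) {
        Matcher matcher = ESCAPE.matcher(uri.toString());
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            matcher.appendReplacement(result, matcher.group().toUpperCase(Locale.ROOT));
        }
        matcher.appendTail(result);
        return URI.create(result.toString());
    }

    public static void main(String[] args) {
        URI upper = URI.create("http://example.com/a%2Fb");
        URI lower = URI.create("http://example.com/a%2fb");

        // equals() ignores the case of the hex digits in escapes.
        System.out.println("equals:          " + upper.equals(lower));

        // Whether the hash codes agree is the behavior in question.
        System.out.println("same hashCode:   " + (upper.hashCode() == lower.hashCode()));

        // After normalizing, both URIs have the same string form and hash code.
        URI normalized = upperCaseEscapes(lower);
        System.out.println("after normalize: " + (upper.hashCode() == normalized.hashCode()));
    }
}
```

If you go this route, normalize at the boundary where URIs enter your code (for example, when parsing a server response) so that every key in a collection has been through the same treatment.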

Java makes itself out to be the Internet programming language, but it isn't compliant with the fundamental resource identifiers on the web. It also makes working with Internet content types very inconvenient, as I explained a few days ago. Java at its core is a great language, if you know what you're doing. Unfortunately, Sun has proved more interested in bloating the language with huge libraries filled with redundancies than in addressing some of the issues in working with fundamental pieces of the language.

Don't even ask about IRI support.