Generating a single, globally unique ID

One that's XML-compliant and not too long.

When something has a unique ID, it has identity, and you can do more with it. For example, you can link to it, and you can add metadata to it from anywhere. I wanted to be able to assign a unique ID to something with one or two keystrokes in Emacs. I came up with something that works, although I’m sure there are ways to make it work better.

By “unique”, I mean reasonably unique in the whole universe. “Reasonably unique” is one of those phrases that English teachers hate, because by definition a unique string can’t come up a second time. The Wikipedia entries for GUID and UUID give a good overview of the world of globally unique ID generation, and the related entry for Universally Unique Identifier gets into some probability figures for generating the same ID twice. When these figures are compared with the chance of being hit by a meteorite, that makes the IDs unique enough for me.

Routines for creating these IDs typically generate a 16-byte number and represent it as a hexadecimal number of 32 digits. I wanted something that would definitely begin with a letter, so that it would work as an XML ID value, and I also wanted something shorter than 32 characters if possible. I came up with something 21 characters long that I can insert into a document with two keystrokes in Emacs. Some samples: W4QZtu5rTsWnfyiixxIuFQ, Y_JxAODuQ5uZiiBljYoj0w, DJ6AQkYgTnGal8aR9CM_Ig.

The key to getting it shorter was to use a base 64 representation of the 16 byte value instead of a hexadecimal one. Most programming languages offer a simple function to do this; for the 64 different digits used to represent possible values, they use the upper-case and lower-case latin alphabet, the ten numerical digits, and two more characters. The default two extras are usually the plus sign and the slash, but these functions typically let you override that, often citing the needs of us XML folk as a reason for doing so. I used the hyphen and underscore.

To make sure that the value begins with a letter, I first prepended an “i” onto it, but now I just repeat the value generation until I get one that begins with a letter. Here’s the whole thing in Python:

# Generate a 22-character UUID beginning with a letter.
# (base64 encode the raw bytes from a 16 octet UUID.)


import uuid # from http://zesty.ca/python/uuid.html
import sys
import base64 


b64uid = '00000000'


# Keep generating until we have one that starts with a letter. 
while (b64uid[0:1] < 'A') or \
        (b64uid[0:1] > 'z') or \
        ((b64uid[0:1] > 'Z') and (b64uid[0:1] < 'a')):
    uid = uuid.uuid4()
    b64uid = base64.b64encode(uid.bytes,'-_')


b64uid = b64uid[0:22] # lose the "==" that finishes a base64 value


sys.stdout.write(b64uid)

I used sys.stdout.write instead of print because I don’t want that extra carriage return, which seems to show up even when I add a comma after the string to print.

In Emacs, I wanted Ctrl+C i to insert an ID value at the cursor location, and when editing an XML file, I wanted the same keystroke to insert " id='{id-value-here}'". The following in my .emacs file did it:

(defun b64uuid () ; Insert base64 UUID
  (interactive)
  (shell-command "python c:/util/b64uuid.py" t)
)
(define-key global-map "^Ci" 'b64uuid)


(defun xml-id ()
  (interactive)
  (insert " id=''")
  (backward-char 1)
  (shell-command "python c:/util/b64uuid.py" t)
)


(defun nxml-mode-additional-keys ()
  "Key bindings to add to `nxml-mode'."
  (define-key nxml-mode-map "^Ci" 'xml-id)
  (define-key nxml-mode-map "^Co" 'sgml-comment)
  ; more xml-specific keybindings...
)

The elisp code, like the python code, could probably be more efficient, but it works. Ideally, the whole thing would be done in elisp, but writing all that would have taken me a lot more time than it took to patch together the existing python code. I’d love to hear any suggestions about improving this.