All the personal data you want

Except that it's all fake.

October 4, 2006

I needed some sample address book data for a project that I’m working on. Because of the number of people who may see it, I didn’t want to use real address book entries, so I wrote some Python scripts to generate some.

I spread it across a few scripts because I wanted to generate data for different schemas. I put the main data generation functions in one file and then call those functions and format the data in scripts that are specialized for their particular output format. You can use these to generate data for a relational database, XML, your favorite RDF flavor, or whatever you like. The basic library has functions such as firstName() and zipCode() to generate random values, with some, like middleName() and note(), sometimes returning nothing. I have two scripts that use the library: one generates a CSV file that emulates one exported from Microsoft Outlook 2003, and the other emulates the CSV address file exported by Eudora 7. (Did you know that Eudora can’t import the CSV files that it exports?)
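As a rough sketch of how such a library and one of its formatting scripts might fit together, here is a minimal Python example. The function names firstName(), middleName(), zipCode(), and note() come from the description above; everything else (lastName(), makeCsv(), the tiny name lists, the column headers, and the odds of an empty value) is an assumption made for illustration and is not the real Outlook or Eudora export layout.

```python
import csv
import io
import random

# Tiny placeholder lists (assumption); the real scripts drew on US census lists.
FIRST_NAMES = ["James", "Mary", "Albert", "Victoria"]
LAST_NAMES = ["Smith", "Johnson", "Freeman"]
NOTES = ["Met at a conference", "Prefers email"]


def firstName():
    return random.choice(FIRST_NAMES)


def middleName():
    # Sometimes returns nothing; the 50% ratio is an assumption.
    return random.choice(FIRST_NAMES) if random.random() < 0.5 else ""


def lastName():
    # Hypothetical helper, not named in the post.
    return random.choice(LAST_NAMES)


def zipCode():
    # Five random digits; not checked against real ZIP codes.
    return "%05d" % random.randint(0, 99999)


def note():
    # Sometimes returns nothing; the 30% ratio is an assumption.
    return random.choice(NOTES) if random.random() < 0.3 else ""


def makeCsv(rowCount):
    """Format rowCount random entries as CSV text (illustrative columns)."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["First Name", "Middle Name", "Last Name", "Postal Code", "Notes"])
    for _ in range(rowCount):
        writer.writerow([firstName(), middleName(), lastName(), zipCode(), note()])
    return out.getvalue()


if __name__ == "__main__":
    print(makeCsv(5))
```

Keeping the generators separate from the formatting function is what lets a second script reuse them for a different column layout, XML, or RDF.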

The data is pretty US-oriented, but a few tweaks should adapt it for other countries. It randomly picks first and middle names from the US census list of most popular male and female names, and surnames from the census list of most popular last names. It took very little web searching to find the most popular street names and US cities, and for employer names I went with the last 100 of the Fortune 500.

The generated data has plenty of incongruities. Middle names are randomly picked separately from first names, so male and female names are often mixed. The same happens with city and state names, so that Albert Victoria Freeman Jr. may live in Baltimore, California. To convert an employer name to a domain name for a work email address, I just took out spaces and punctuation, converted to lower-case, and put “.com” at the end, which can result in some long domain names.
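The employer-to-domain conversion could look something like the following sketch. The function names employerDomain() and workEmail(), the sample company name, and the first.last format of the address are all hypothetical; only the strip, lower-case, and append ".com" steps come from the description above.

```python
import re


def employerDomain(employerName):
    """Drop spaces and punctuation, lower-case the rest, and append .com."""
    # Keeping only ASCII letters and digits is an assumption about what
    # counts as "punctuation" here.
    stripped = re.sub(r"[^A-Za-z0-9]", "", employerName)
    return stripped.lower() + ".com"


def workEmail(first, last, employerName):
    # The first.last local part is an assumption, not taken from the post.
    return "%s.%s@%s" % (first.lower(), last.lower(), employerDomain(employerName))


if __name__ == "__main__":
    # "Acme Widgets & Sons, Inc." is a made-up employer name.
    print(employerDomain("Acme Widgets & Sons, Inc."))
    print(workEmail("Albert", "Freeman", "Acme Widgets & Sons, Inc."))
```

A long company name goes straight through unchanged in length, which is where the long domain names come from.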

I’ve always enjoyed generating random content that faked the appearance of semantic value. One event in particular inspired me about twenty-three years ago, when the only programming languages I knew were Microsoft Basic and dBase II. I was in the early stages of a “poetry” generation program that only had seven or eight possible verbs, and all the nouns were pronouns, and it came out with this:

It thinks. 
It scares her.

(Try to picture it on green and white paper in a dot matrix font.) The heart of all of these is the random function; when coding for fun, seeing different output each time is often more entertaining than consistent output. I’ve recently figured out how I can generate multi-part music from an XSLT script, which I’ll make public somewhere once I have the time to actually implement it and write it up.

2 Comments

By Dan Brickley on October 4, 2006 10:18 AM

Fun stuff :)

I’ve recently been wondering about making a hosted version of the Dada Engine (the system behind the Postmodernism generator you link to). I was actually thinking of it for language learning apps, after noticing that many language courses seem based around exercising one’s ability to translate variations on a theme (“I want to…”, “You need to…”, “We used to…” / “eat spaghetti” / “drink red|white|green wine” / “quickly”, “slowly” / “tonight”, “tomorrow”, “every day”, etc.).

After playing a little (http://spypixel.com/2006/spanglish/testme.cgi http://spypixel.com/2006/spanglish/ …) I realised my grammar skills (machine and human language!) weren’t up to it, … and maybe some sort of wiki or collaborative effort would let people write better dada engine scripts communally.

I’ve a hunch that a hosted dada engine system could catch on, … but what it really needs is some changes to give a bit of a UI for grammar creation, and to have some mechanisms for modularisation so that sentence-fragments can more easily be shared across a group of users.

File under: ProcrastinationOpportunities :)

By deltabob on October 5, 2006 7:56 AM

Ah…green bar paper. I miss it so. I still have a box somewhere of old email printouts from college on green bar paper.

I like the idea of the randomly generated name/address info. Sometimes when looking into the people finder databases, it feels like that’s how they were populated.
