Specifications

This is a spec for a new script, mgenerate, in the mtools family. It generates structured, semi-random data according to a template document. The template is written in JSON and can be passed directly as a command line argument or as the name of a file that contains it. Additional arguments to mgenerate specify how many documents should be inserted. The generated documents are inserted directly into a mongod or mongos instance, as specified with --host and --port. The default host is localhost and the default port is 27017.

Example
mgenerate <JSON or file> --num 10000 --port 27017

Parsing the JSON document

All values are taken literally, except for special $-prefixed values.

A $-prefixed value is a command. Commands can be written as simple strings or as documents. The simple string form suffices when none of the additional options need to be specified. To customize the behavior of a command, use the document style.

{ "_id": "$command" }
{ "_id": { "$command": { <additional options> } } }

Note that because the template is JSON, field names have to be strings; the quotes cannot be left out.

Some commands have a shortcut syntax that takes an array as the only value of the command key. The array syntax is always syntactic sugar for the command's most common use case, and there is always a (more verbose) document-style syntax that achieves the same result. Each command section specifies exactly what the array syntax means.

{ "_id": { "$command": [ <additional options> ] } }
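
The three forms above could be told apart roughly as follows. This is only a minimal sketch in Python; the function name parse_command and its return convention are hypothetical and are not part of mgenerate.

```python
def parse_command(value):
    """Classify a template value (hypothetical helper, not mgenerate's API).

    Returns (name, options) for a $-command, or None for a literal value.
    options is None for the simple string form, a dict for the document
    form, and a list for the array shortcut.
    """
    # Simple string form: "$command"
    if isinstance(value, str) and value.startswith("$"):
        return value, None
    # Document or array form: a document with exactly one $-prefixed key
    if isinstance(value, dict) and len(value) == 1:
        key, options = next(iter(value.items()))
        if key.startswith("$") and isinstance(options, (dict, list)):
            return key, options
    # Everything else is taken literally
    return None
```

For example, parse_command("$objectid") yields ("$objectid", None), while a plain string such as "hello" yields None and is kept as a literal.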

$objectid

Creates an ObjectId(). Alias is $oid.

Example
{ "_id" : "$objectid" }

This command replaces "$objectid" with a proper newly generated ObjectId.

Additional Parameters

None

$number

Creates a random number.

Example
{ "age" : "$number" }

This command replaces "$number" with a uniformly random number between 0 and 100.

Additional Parameters

lower and upper bounds

{ "$number" : {"min" : 500, "max" : 1000 } }

Generate a uniformly random number between the min and max values (both ends inclusive). Either parameter can be omitted, in which case its default is used (0 for the lower bound, 100 for the upper bound). If min > max, the tool will throw an error.

Array Syntax

{ "$number" : [ MIN, MAX ] }

Short form for { "$number" : {"min" : MIN, "max" : MAX } }.
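
The $number rules above can be illustrated with a short Python sketch. The function name generate_number is hypothetical, and the sketch assumes integer results (the spec lists a separate $float command), which the text above does not state explicitly.

```python
import random


def generate_number(options=None):
    """Sketch of $number; options is None, a dict, or a [MIN, MAX] list."""
    if isinstance(options, list):
        # Array syntax: [MIN, MAX]
        low, high = options
    else:
        # Document syntax; omitted bounds fall back to the defaults
        options = options or {}
        low = options.get("min", 0)
        high = options.get("max", 100)
    if low > high:
        raise ValueError("min must not be greater than max")
    # randint includes both end points (assumption: integer output)
    return random.randint(low, high)
```

With this sketch, generate_number() corresponds to the plain "$number" string, and generate_number([500, 1000]) to the array syntax.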

$datetime

Creates a random date and time. Alias is $date.

Example
{ "_id" : "$datetime" }

This command replaces "$datetime" with a randomly generated date between Jan 1, 1970 0:00:00.000 (Epoch 0) and now.

Additional Parameters

lower and upper bounds

{ "$datetime" : {"min" : 1358831035, "max" : 1390367035 } }

Generate a random date and time between the min and max values (both ends inclusive).

min and max values can be epoch numbers (see example above). They can also be strings that can be parsed as a date (and optionally time), e.g. "2013-05-12 13:30".

Array Syntax

{ "$datetime" : [ MIN, MAX ] }

Short form for { "$datetime" : {"min" : MIN, "max" : MAX } }.
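
One way the bounds could be normalized and a date drawn is sketched below in Python. The names to_epoch and generate_datetime, the list of accepted string formats, the use of UTC, and the error on min > max (mirroring $number) are all assumptions made for illustration.

```python
import random
from datetime import datetime, timezone

# Assumed set of accepted string formats; the spec only gives one example
FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d %H:%M", "%Y-%m-%d")


def to_epoch(bound):
    """Convert an epoch number or a date string to epoch seconds (UTC)."""
    if isinstance(bound, (int, float)):
        return float(bound)
    for fmt in FORMATS:
        try:
            parsed = datetime.strptime(bound, fmt)
        except ValueError:
            continue
        return parsed.replace(tzinfo=timezone.utc).timestamp()
    raise ValueError("cannot parse date: %r" % (bound,))


def generate_datetime(options=None):
    """Sketch of $datetime; options is None, a dict, or a [MIN, MAX] list."""
    now = datetime.now(timezone.utc).timestamp()
    if isinstance(options, list):
        low, high = options
    else:
        options = options or {}
        low = options.get("min", 0)      # default lower bound: Epoch 0
        high = options.get("max", now)   # default upper bound: now
    # Work in whole milliseconds so that both ends are inclusive
    low_ms = int(to_epoch(low) * 1000)
    high_ms = int(to_epoch(high) * 1000)
    if low_ms > high_ms:
        raise ValueError("min must not be later than max")  # assumption
    millis = random.randint(low_ms, high_ms)
    return datetime.fromtimestamp(millis / 1000.0, tz=timezone.utc)
```

Converting both kinds of bound to epoch milliseconds first keeps the random draw itself identical to the $number case.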

$missing

Omits the key/value pair from the generated document. A percentage of missing values can be specified.

Example
{ "name" : "$missing" }

This will cause the entire key/value pair with key "name" to be missing.

Additional Parameters

Missing Percentage

{ "$missing" : { "percent" : 30, "ifnot" : VALUE } }

Causes the key/value pair to be missing 30% of the time; otherwise the key is set to VALUE.
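
The behavior of $missing can be sketched in Python as below. The sentinel MISSING and the helper names are hypothetical; the sketch also assumes that "percent" defaults to 100 (matching the simple string form) and, for brevity, does not evaluate commands nested inside "ifnot".

```python
import random

# Sentinel meaning "leave this key out of the generated document"
MISSING = object()


def resolve_missing(options=None):
    """Sketch of $missing; options is None or a dict with percent/ifnot."""
    if options is None:
        return MISSING                      # simple form: always missing
    percent = options.get("percent", 100)   # assumed default
    if random.random() * 100 < percent:
        return MISSING
    return options.get("ifnot")


def build_document(template):
    """Copy a flat template, dropping keys that resolve to MISSING."""
    doc = {}
    for key, value in template.items():
        if value == "$missing":
            result = resolve_missing()
        elif isinstance(value, dict) and "$missing" in value:
            result = resolve_missing(value["$missing"])
        else:
            result = value
        if result is not MISSING:
            doc[key] = result
    return doc
```

A dedicated sentinel is used rather than None so that a template can still produce an explicit null value.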

$choose

Chooses one of the specified values.

Example
{ "status" : { "$choose" : { "from" : [ "read", "unread", "deleted" ] } } }

Will pick one of the values from the array with equal probability.

Additional Parameters

Weights

{ "$choose" : { "from" : [ VAL1, VAL2, ... ], "weights": [ W1, W2, ... ] } }

Will pick the values proportionally to the given weights. The weights array must be the same length as the from array.

Example
{ "status" : { "$choose" : { "from" : [ "read", "unread", "deleted" ], "weights" : [ 1, 1, 10 ] } } }

Will pick one of the values from the array, choosing "deleted" 10 times as often as "read" or "unread".

Array Syntax

{ "$choose" : [ VAL1, VAL2, ... ] }

Short form for { "$choose" : { "from" : [ VAL1, VAL2, ... ] } }.
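
A weighted pick of this kind can be sketched with Python's standard library. The function name choose is hypothetical; random.choices (Python 3.6+) does the weighted sampling and falls back to equal probability when no weights are given.

```python
import random


def choose(options):
    """Sketch of $choose; options is a list or a dict with from/weights."""
    if isinstance(options, list):
        # Array syntax is short for {"from": [...]}
        options = {"from": options}
    values = options["from"]
    weights = options.get("weights")
    if weights is not None and len(weights) != len(values):
        raise ValueError("weights must have the same length as from")
    # weights=None means every value is equally likely
    return random.choices(values, weights=weights, k=1)[0]
```

With weights [1, 1, 10], the total weight is 12, so "deleted" would be expected in about 10 out of 12 documents.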

$array

Builds an array of elements of a given length. Can be combined with $number to create random-length arrays.

Example
{ "friends" : { "$array" : { "of": "$oid", "number": 20 } } }

This will create an array for friends containing 20 unique ObjectIds.

Array Syntax

{ "$array" : [ VALUE, NUMBER ] }

Short form for { "$array" : { "of" : VALUE, "number" : NUMBER } }.
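
The following Python sketch shows how $array might combine with $number to produce random-length arrays. The evaluator below is hypothetical and deliberately tiny: it understands only $number and $array, assumes integer numbers, and treats every other value (including lists) as a literal.

```python
import random


def evaluate(value):
    """Tiny illustrative evaluator covering only $number and $array."""
    if value == "$number":
        return random.randint(0, 100)
    if isinstance(value, dict) and len(value) == 1:
        key, options = next(iter(value.items()))
        if key == "$number":
            if isinstance(options, list):
                low, high = options
            else:
                low, high = options.get("min", 0), options.get("max", 100)
            return random.randint(low, high)
        if key == "$array":
            if isinstance(options, list):
                of, number = options            # array syntax
            else:
                of, number = options["of"], options["number"]
            count = evaluate(number)            # may itself be a $number
            # Evaluate "of" once per element so each element is drawn anew
            return [evaluate(of) for _ in range(count)]
    if isinstance(value, dict):
        # Ordinary sub-document: evaluate each field
        return {k: evaluate(v) for k, v in value.items()}
    return value
```

Evaluating "number" before building the list is what allows a template such as { "$array" : [ "$number", { "$number" : [2, 5] } ] } to yield between 2 and 5 elements.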

An Example to populate a "Users" collection

{
    "user": {
        "name": {
            "first": {"$choose": ["Liam", "Noah", "Ethan", "Mason", "Logan", "Jacob", "Lucas", "Jackson", "Aiden", "Jack", "James", "Elijah", "Luke", "William", "Michael", "Alexander", "Oliver", "Owen", "Daniel", "Gabriel", "Henry", "Matthew", "Carter", "Ryan", "Wyatt", "Andrew", "Connor", "Caleb", "Jayden", "Nathan", "Dylan", "Isaac", "Hunter", "Joshua", "Landon", "Samuel", "David", "Sebastian", "Olivia", "Emma", "Sophia", "Ava", "Isabella", "Mia", "Charlotte", "Emily", "Abigail", "Avery", "Harper", "Ella", "Madison", "Amelie", "Lily", "Chloe", "Sofia", "Evelyn", "Hannah", "Addison", "Grace", "Aubrey", "Zoey", "Aria", "Ellie", "Natalie", "Zoe", "Audrey", "Elizabeth", "Scarlett", "Layla", "Victoria", "Brooklyn", "Lucy", "Lillian", "Claire", "Nora", "Riley", "Leah"] },
            "last": {"$choose": ["Smith", "Jones", "Williams", "Brown", "Taylor", "Davies", "Wilson", "Evans", "Thomas", "Johnson", "Roberts", "Walker", "Wright", "Robinson", "Thompson", "White", "Hughes", "Edwards", "Green", "Hall", "Wood", "Harris", "Lewis", "Martin", "Jackson", "Clarke", "Clark", "Turner", "Hill", "Scott", "Cooper", "Morris", "Ward", "Moore", "King", "Watson", "Baker" , "Harrison", "Morgan", "Patel", "Young", "Allen", "Mitchell", "James", "Anderson", "Phillips", "Lee", "Bell", "Parker", "Davis"] }
        }, 
        "gender": {"$choose": ["female", "male"]},
        "age": "$number", 
        "address": {
            "street": {"$string": {"length": 10}}, 
            "house_no": "$number",
            "zip_code": {"$number": [10000, 99999]},
            "city": {"$choose": ["Manhattan", "Brooklyn", "New Jersey", "Queens", "Bronx"]}
        },
        "phone_no": { "$missing" : { "percent" : 30, "ifnot" : {"$number": [1000000000, 9999999999]} } },
        "created_at": {"$date": ["2010-01-01", "2014-07-24"] },
        "is_active": {"$choose": [true, false]}
    },
    "tags": {"$array": {
        "of": {
            "label": "$string",
            "id": "$oid",
            "subtags": {"$missing": {"percent": 80, "ifnot": {"$array": ["$string", {"$number": [2, 5]}]}}}
        },
        "number": {"$number": [0, 10]}
    }}
}

Some comments:

  • The gender is independent of the name. Currently the template language has no way of referring to results of already computed fields.
  • The tags are just random strings, so repetitions are unlikely.

Already Implemented but Need Documentation

$string

$geo

$point

$float

$concat

$age

$normal

$zipf

TODO

$email

$uuid

$ref

{"$ref": "field"} refers to the value of that field. This will allow fields to depend on other fields, e.g. if the gender is female, pick female names (see the "Users" example above).