flyx.org

The Making of NimYAML - Why Nim is a Cool Language

This blog post is an exploration of the advanced features of the programming language Nim, using the example of implementing YAML serialization. I will show why Nim's advanced metaprogramming facilities are useful and how some of them are employed. Prior knowledge about Nim and YAML definitely makes reading this post easier, but I will go over the very basics in case you have none.

YAML

You may have already heard of it. It is a data serialization language that builds upon JSON but is (mostly) nicer for human beings to read and write. So, if you have some JSON data like this:

{
  "apples": 2,
  "bananas": 42,
  "other things": [
    { "lasers": true },
    { "spaceships": "shiny" }
  ]
}

you can write it in YAML like this:

apples: 2
bananas: 42
other things:
  - lasers: true
  - spaceships: shiny


Now before you think too much about what this data might represent, let's move on to the programming language we will use.

Nim

Nim is a statically typed, garbage collected language that compiles to C (or various other backends). It uses indentation for structure (like Python) and tries not to hinder your hacking habits by requiring you to verbosely specify type names everywhere.

Most of the Nim code shown here is easy to understand if you have seen code before. I will explain the tricky things.

NimYAML

So, what we want to do now is implement automatic serialization and deserialization of Nim values with YAML. That is to say, we want to be able to dump a value of any Nim type into a YAML representation, and we also want to be able to load that YAML representation back into Nim. This should work without modifying the type, so we can serialize types of a third-party library we happen to use.

As a prerequisite, we will use a YAML parser and presenter which already exists by magic. Actually, it is a project called NimYAML, which you can git-clone from the repository. At the time of writing, there was a little API break after the most recent release, so if you actually want to test the code, use the devel branch from the repo.

So, what do the values we want to serialize look like? Let's have a look:

var str = "some string"
var i = 42

type Person = object
  firstname, surname: string
  age: int
  isTheChosenOne: bool

var p = Person(firstname: "Luke", surname: "Skywalker", age: 19,
               isTheChosenOne: false)


Here, we have some simple things to start with: A string named str, an integer named i and a complex object named p.

Dumping

What we want to do is this:

echo dump(str) # dumps a string as YAML
echo dump(i)   # dumps an integer as YAML
echo dump(p)   # dumps a person as YAML


So, how do we start? Let's just write the dump() proc:

import yaml.stream, yaml.presenter, yaml.taglib

proc dump[T](o: T): string =
  # create a buffered stream to hold our YAML events
  var events = newBufferYamlStream()
  events.put(startDocEvent()) # start the YAML document
  represent(o, events)        # write events for the value
  events.put(endDocEvent())   # finish the YAML document
  # render the event stream as YAML document
  result = present(events, serializationTagLibrary)


The first thing to notice is that our proc (called function in some other languages) has a generic parameter T. That is to say, the parameter o has a generic type. We can call this function with a value of any type as long as everything we do with this value is supported by the type.

When serializing, we transform our value into a stream of events. These events will then be transformed into YAML by the presenter. The stream always starts with a document start event and ends with a document end event, so we put the former into the stream before representing the value and the latter afterwards.

Finally, we call present, which takes our event stream and transforms it into YAML. The serializationTagLibrary is a value supplied by NimYAML and does not need to be of concern for us right now.

So, with this, almost everything we need to do has been done, right? Well, yes, except for the represent part. This is the "everything we do with this value" I talked about above. We need to implement represent for all types we want to dump. Let's start with three simple types: string, int and bool:

proc represent(s: string, events: BufferYamlStream) =
  events.put(scalarEvent(s))

proc represent(i: int, events: BufferYamlStream) =
  events.put(scalarEvent($i))

proc represent(b: bool, events: BufferYamlStream) =
  events.put(scalarEvent($b))


Well, that was easy! The only thing we need to do for each simple value is to create a scalar event which has as content a textual representation of our value. We use the handy $ operator here, which transforms our int and bool values into strings.

Now comes the interesting part: How can we serialize an arbitrary object? Here, Nim's metaprogramming capabilities are a big help:

proc represent[T: object](o: T, events: BufferYamlStream) =
  events.put(startMapEvent())
  for name, value in fieldPairs(o):
    events.put(scalarEvent(name))
    represent(value, events)
  events.put(endMapEvent())


The first thing to notice is that we once again have a generic proc. But this time, the generic parameter T is constrained: It only accepts object types.

Since an object is a complex value, we use a YAML mapping (also known as hash map or dictionary in other languages) to hold all its values. We start and end it with a start map event and an end map event respectively. The key ingredient is fieldPairs, a special iterator that iterates over the fields of an object. While this may be common in scripting languages like JavaScript, it is far less so in statically typed languages.

Java, for example, has its reflection API to do this at runtime, but this has severe consequences: You suddenly lose all type safety and operate on raw Object values, which you cast around. It also means that the runtime must provide type information, so the code can get a list of fields for a certain class type. Nim goes a different way:

The for loop is expanded at compile time. So we do not need to carry over any type information into the runtime environment. This means that we can write generic code that works on any object type, no matter how many fields it contains and what types they have. It also means that we can have complete type safety, because the loop body is instantiated at compile time once per field and type-checked separately against the type of that field.

So, what do we do in this loop? First, we create a scalar event which holds the name of the field. This will be a key in our YAML mapping. The value for this key is the represented value of the field. This is just a recursive call. With this code, we have written everything necessary to serialize all objects that contain strings, ints, bools or other objects with the same restrictions.
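To see the compile-time expansion in isolation, here is a minimal sketch that uses only the standard library and no YAML at all. The helper describe is a hypothetical name made up for this illustration; it is not part of NimYAML. The call to $ resolves to a different overload for each field, which only works because the loop body is compiled once per field.

```nim
type Person = object
  firstname, surname: string
  age: int
  isTheChosenOne: bool

# Hypothetical helper (not part of NimYAML): returns one "name: value"
# line for every field of any object type.
proc describe[T: object](o: T): seq[string] =
  result = @[]
  # fieldPairs is unrolled at compile time: the body below exists once per
  # field, and `value` has that field's static type in each copy.
  for name, value in fieldPairs(o):
    result.add(name & ": " & $value)

let luke = Person(firstname: "Luke", surname: "Skywalker", age: 19,
                  isTheChosenOne: false)
for line in describe(luke):
  echo line
```

This prints the four fields in declaration order, one per line. If Person had a field whose type has no $ operator, the compiler would reject the call to describe instead of failing at runtime.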

If we execute our three calls from the beginning now:

echo dump(str)
echo dump(i)
echo dump(p)


We get:

%YAML 1.2
---
some string

%YAML 1.2
---
42

%YAML 1.2
---
firstname: Luke
surname: Skywalker
age: 19
isTheChosenOne: false


That wasn't too hard, was it? But, all this dumping would be of little help if we couldn't also load the data back in, right? So…

Let's load it! We start as before:

import yaml.parser

proc load[T](input: string, o: var T) =
  var
    parser = newYamlParser(serializationTagLibrary)
    events = parser.parse(input)
  doAssert events.next().kind == yamlStartDoc
  construct(events, o)
  doAssert events.next().kind == yamlEndDoc


Once again, we define a generic proc to handle any possible type. Note that the variable we want to load is a var parameter rather than a return value (a var parameter is passed by reference). This enables us to later call the load proc without explicitly giving the generic type parameter.
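The difference between the two calling styles can be sketched with the standard library alone. Both procs below, parseAs and parseInto, are hypothetical names for this illustration and handle only int and bool; they stand in for a load that returns its result and a load that fills a var parameter.

```nim
import strutils

# Variant with a return value: T occurs only in the return type, so the
# caller has to spell it out.
proc parseAs[T](input: string): T =
  when T is int:
    result = parseInt(input)
  elif T is bool:
    result = parseBool(input)
  else:
    {.error: "parseAs supports only int and bool in this sketch".}

# Variant with a var parameter: T is inferred from the argument.
proc parseInto[T](input: string, o: var T) =
  o = parseAs[T](input)

var n: int
parseInto("42", n)          # T = int is inferred from the type of n
let m = parseAs[int]("42")  # T must be given explicitly
echo n, " ", m
```

Both calls yield the same value; the var parameter merely moves the type information into the argument list, where the compiler can find it.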

We create a parser object with the same serializationTagLibrary we used before (remember?) which is still not of any concern. Then, we tell the parser to generate an event stream by parsing our input.

Remember how we generated two events that start and end our YAML document? Now we call next() on the event stream, which will each time yield the next event, and check whether the first and last event are of kind document start and document end respectively. So far, so good. As before, we then call a construct proc which needs to be implemented for every type we want to load.

Let us implement it for the simple types:

import strutils

proc construct(events: YamlStream, s: var string) =
  let e = events.next()
  doAssert e.kind == yamlScalar
  s = e.scalarContent

proc construct(events: YamlStream, i: var int) =
  let e = events.next()
  doAssert e.kind == yamlScalar
  i = int(parseBiggestInt(e.scalarContent))

proc construct(events: YamlStream, b: var bool) =
  let e = events.next()
  doAssert e.kind == yamlScalar
  b = e.scalarContent == "true"


For each of the simple types, we first retrieve the event that hopefully contains our value and check whether it actually is a scalar. Then, we parse its content into the target type. This is trivial for a string, and easy enough for an int (using a stdlib feature) and a boolean (which could use stricter checking, since anything other than "true" silently becomes false).
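A stricter boolean check could look like the following sketch. The proc name parseStrictBool is hypothetical, and accepting exactly the two spellings that $ produces for a bool is an assumption made for this example, not what NimYAML does.

```nim
# Hypothetical helper: accepts exactly the two strings that `$` produces
# for a bool and rejects everything else instead of defaulting to false.
proc parseStrictBool(content: string): bool =
  case content
  of "true":
    result = true
  of "false":
    result = false
  else:
    raise newException(ValueError, "not a boolean: " & content)

echo parseStrictBool("true")
echo parseStrictBool("false")
```

In the construct proc for bool, the comparison with "true" could then be replaced by a call to this helper, so that a typo in the input raises an error rather than loading as false.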

Now that we have these, let us write our loader for object types:

proc construct[T: object](events: YamlStream, o: var T) =
  var e = events.next()
  doAssert e.kind == yamlStartMap
  e = events.next()
  while e.kind != yamlEndMap:
    doAssert e.kind == yamlScalar
    # find the field whose name equals the key and load its value
    for name, value in fieldPairs(o):
      if name == e.scalarContent:
        construct(events, value)
        break
    e = events.next()


Our object must start with a start map event and end with an end map event, which is checked by the doAssert and the while loop condition respectively. For every key-value pair in the map, we check that the key is a scalar and search for a field of the object with the same name. If we find one, we then construct the value of this field from the event stream.

Of course, we should check whether the name actually matched any of the field names, but this is left as an exercise to the reader. We could also ensure that every field is matched, which is an excellent second exercise for the reader. And finally, we could check for duplicates, which is, you guessed it, the third exercise.
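For readers who want a head start on the exercises, here is one possible shape of the three checks, sketched without NimYAML: the event stream is replaced by a plain list of key/value strings. All names in it (setFromText, fillFields) are hypothetical and exist only for this illustration.

```nim
import strutils

type Person = object
  firstname, surname: string
  age: int
  isTheChosenOne: bool

# Stand-ins for construct: turn raw text into a value of each simple type.
proc setFromText(target: var string, raw: string) = target = raw
proc setFromText(target: var int, raw: string) = target = parseInt(raw)
proc setFromText(target: var bool, raw: string) = target = raw == "true"

proc fillFields[T: object](o: var T, entries: openArray[(string, string)]) =
  var seen: seq[string] = @[]
  for entry in entries:
    let key = entry[0]
    # third exercise: reject keys that occur twice
    if key in seen:
      raise newException(ValueError, "duplicate key: " & key)
    seen.add(key)
    # first exercise: remember whether any field matched the key
    var matched = false
    for name, value in fieldPairs(o):
      if name == key:
        setFromText(value, entry[1])
        matched = true
    if not matched:
      raise newException(KeyError, "unknown field: " & key)
  # second exercise: every field must have been given a value
  for name, value in fieldPairs(o):
    if name notin seen:
      raise newException(KeyError, "missing field: " & name)

var q: Person
fillFields(q, [("firstname", "Luke"), ("surname", "Skywalker"),
               ("age", "19"), ("isTheChosenOne", "false")])
echo q.firstname, " ", q.surname, ", aged ", q.age
```

The same three checks carry over to the event-based construct proc; only the source of the keys and values changes.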

That's it! Let's test it:

var p2: Person
load("""%YAML 1.2
---
firstname: Luke
surname: Skywalker
age: 19
isTheChosenOne: false""", p2)
echo p2.firstname, ' ', p2.surname, ", aged ", p2.age, ", is ",
     (if p2.isTheChosenOne: "" else: "not "), "the chosen one."


Output:

Luke Skywalker, aged 19, is not the chosen one.


Conclusion

We have just implemented a generic serializer for a statically typed programming language in fewer than 100 lines of code. Isn't that awesome? Compare that to the amount of Java code you need to serialize objects to XML given the standard Java XML API.

Well, the actual serialization API implementation of NimYAML is, of course, a bit bigger (about 1000 lines of code) because it handles far more cases, including reference types (garbage-collected pointers), sequences (lists) and various other types which need special treatment. But you get my point: I do not want to write this in a language that makes it difficult to handle arbitrarily defined object types. And I also do not want to give up on static typing. So Nim seems to be the right language to use for me.

What I have shown here is also just the tip of the iceberg. You can do far more complex things with macros in Nim than what I did with fieldPairs. The actual serialization implementation uses macros instead of fieldPairs for handling objects because they are more flexible.

Well, that's it! Thanks for reading.

Tags: programming

On Having Problems

Variations on

You have a problem and decide to use regular expressions.
Now you have two problems.

Inspired and partly copied from [1], [2] and [3].

You have a problem and decide to use Java.
Now you have a ProblemFactory.

You have a problem and decide to use Python.
Now you have something that looks, swims and quacks like a problem.

You have a problem and decide to use binary.
Now you have 10 problems.

You have a problem and decide to use floating points.
Now you have 1.00000000000001 problems.

You have a problem and decide to use Apple.
Now you have a shiny problem.

You have a problem and decide to use threads.
Now have a you problem.

You have a problem and decide to use mutexes.
Now you

You have a problem and decide to use LISP.
Now you have a list of problems.

You have a problem and decide to use asynchronous calls.
Now you wait for having a problem.

You have a problem and decide to use Smalltalk.
Now you have a metaproblem.

You have a problem and decide to use an unchecked cast.
Now you have a solution and it is Segmentation Fault.

You have a problem and decide to use JavaScript.
Now you have 3 problems in Firefox, 5 problems in Safari, 2 problems in Chrome, and 11 problems in Internet Explorer.

You have a problem and decide to use a sandbox.
Now you still have a problem but you don't care.

You have a problem and decide to write a Makefile.
Now you know how to make problems.

You have a problem and decide to use anagrams.
Now you have lamb rope.

You have a problem and decide to use Unicode.
��� ��� ���� � �������.

You have a problem and decide to use STL.
Now you have a _Hashtable_iterator<std::pair<const basic_string<char, std::char_traits<char>, std::allocator<char> >, int>, basic_string<char, std::char_traits<char>, std::allocator<char > >...

You have a problem and decide to use pair programming.
Now you have someone else's problem.

You have a problem and decide to google it.
Now you know 31,457 ways to describe your problem and around 15 completely wrong solutions for it.

You have a problem and decide to upgrade to paid version.
Now you have an Ultimate Pro blem.

You have a problem and decide to use Agile.
Now you have an epic problem.

You have a problem and decide to use Haskell.
Now you have a lazy problem.

You have a problem and decide to use Maven.
Now you have a problem snapshot.

You have a problem and decide to use JPA.
Now you have a persistent problem.

You have a problem and decide to use crowd sourcing.
Now it's their problem.

You have a problem and decide to use Scala.
Now you have problem traits.

You have a problem and decide to use Prolog.
Now there exists a person that has at least one problem.

You have a problem and decide to use static code verification.
Now you can prove that you have a problem.

You have a problem and decide to use SQL.
Now you have two problems but you can join them.

You have a problem and decide to use -Wall.
Now you have a problem and 21 warnings.

You have a problem and decide to use Visual Basic.
Now you resume next.

You have a problem and decide to use Visual C++.
Now you have a ?Problem@@YAXHD@Z.

You have a problem and decide to use Splash.
But nothing happened.

You have a problem and decide to use ZSH.
Now you have a pro<TAB>

You have a problem and decide to use LaTeX.
Now you have a problem, perhaps a missing \item.

You have a problem and decide to use Debian.
Now you have a problem, a problem-common and a problem-dev.

You have a problem and decide to use XML.
Now you have a <problem xmlns="http://problem-working-group.org/schemas/2015/problem">

You have a problem and decide to use Ada.
Now you have a rendezvous with your problem.

You have a problem and decide to use PHP.
Now you have a problemsolve(). Or a problem_solve(). Or a problem_real_solve(). You're not sure.

You have a problem and decide to use CoffeeScript.
Now you've had a problem before it was cool.

You have a problem and decide to use C++ metaprogramming.
Now you have a partial problem specialization.

You have a problem and decide to post it on StackOverflow.
Now you have a favorite problem.

You have a problem and decide to use Perl.
Now you have a comprehensive problem archive network.

You have a problem and decide to use tips at startup.
Did you know that you have a problem?

Tags: programming fun