
Serialisation format change

Posted: Fri Dec 22, 2017 7:05 pm
by jpab
[Sorry, this post got really long as I thought about more stuff.]

Pioneer started with a custom serialisation format, and then switched to JSON (with the custom format still in use within some sub-objects). I propose we make another switch. Yes, I know. I'm sorry!

This is not a proposal to change the structure of the save data. I believe we will need to change the structure too for other reasons, but that would be a separate change and is considerably harder.

EDIT: This used to say msgpack, then I thought about it more and decided CBOR is a better option. They're pretty similar, but CBOR has a few advantages over msgpack. They both have pretty wide library support. CBOR has the advantage of supporting streaming output because it supports emitting an array or map without knowing at the start how many items it contains. It also has an arguably more flexible type tagging mechanism. It's perhaps slightly more complicated though.
I propose we switch to CBOR format; probably still packed in a gzip compressed file because there will still be a lot of redundancy (repeated field names etc).
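
To make the streaming point concrete, here is a minimal hand-rolled sketch (not Pioneer code, and not a real CBOR library) of an indefinite-length array. The encoder writes the start marker 0x9f, emits items as they are produced, then writes the 0xff "break" byte, so it never needs the item count up front. The names encode_head, stream_array and body_ids are made up for this illustration, and the sketch only handles unsigned integers.

```python
import struct

def encode_head(major, value):
    """Encode a CBOR head: 3-bit major type plus an unsigned argument."""
    if value < 24:
        return bytes([(major << 5) | value])
    for extra, fmt in ((24, ">B"), (25, ">H"), (26, ">I"), (27, ">Q")):
        if value < (1 << (8 * struct.calcsize(fmt))):
            return bytes([(major << 5) | extra]) + struct.pack(fmt, value)
    raise ValueError("value too large for a CBOR head")

def stream_array(items):
    """Yield CBOR chunks for an array whose length is not known up front."""
    yield b"\x9f"                    # major type 4, additional info 31: indefinite length
    for item in items:               # items may be a generator; it is consumed lazily
        yield encode_head(0, item)   # this sketch only supports unsigned integers
    yield b"\xff"                    # "break" stop code closes the array

def body_ids():
    # Hypothetical generator standing in for walking the game state.
    yield from (1, 2, 500)

encoded = b"".join(stream_array(body_ids()))
print(encoded.hex())  # 9f01021901f4ff
```

A definite-length array would instead start with a head that carries the count (0x83 for three items), which is what forces the writer to know the size before emitting anything.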

Advantages of JSON:
- Text based format; can be read (and edited) with just a text editor. (Plus a tool to decompress the file, e.g., gzip).
- Embedded field names/structural hierarchy, so it's possible to add new fields and still load older games, etc. In practice doing that requires a lot of care in the serialise/deserialise code and I'm not sure how often we make the effort in that respect. But at least it is possible.
- JSON libraries are available for approximately every programming language anyone uses, so people can write scripts to hack saves/extract saved data/do special version upgrades/whatever. Not that anyone has done any of those things to my knowledge, but it does seem like a nice thing to support.

Disadvantages of JSON:
- Floats and doubles need custom serialisation on top of JSON to be sure of bit-identical round-trip.
- Vectors, matrices, quaternions (do we save quaternions anywhere?) need custom serialisation on top of JSON.
- Seems pretty slow. However probably not that big a deal and I'm sure we could make the code a lot faster if it seemed important.
- Seems pretty bloated. But not really a big deal because we compress it afterwards.
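
A small sketch of the float round-trip problem, in Python for brevity. It assumes the "custom serialisation" is a hexadecimal float string stored inside the JSON, which is one common way to do it and not necessarily what Pioneer does. Python's own json module happens to write enough decimal digits to round-trip a double, so the lossy case is simulated here with a short format string.

```python
import json

x = 0.1 + 0.2  # 0.30000000000000004; the low bits matter for a bit-identical save

# Lossy: a fixed number of significant digits drops the low bits.
lossy_doc = json.dumps({"v": "%.6g" % x})
lossy = float(json.loads(lossy_doc)["v"])

# Exact: store the double as a hexadecimal float string inside the JSON.
exact_doc = json.dumps({"v": x.hex()})
exact = float.fromhex(json.loads(exact_doc)["v"])

print(lossy == x, exact == x)  # False True
```

The exact version works, but the value in the file is now a string that only our own code knows how to interpret, which is the extra layer on top of JSON that the list above complains about.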

Advantages of CBOR:
- As with JSON, libraries are available for probably any language anyone would want to use, so again people can write scripts to fiddle with saves if they want.
- CBOR is designed to be structurally compatible with JSON, i.e., "objects" that are key/value mappings, plus arrays, plus a few primitive types (numbers, strings, etc). So it has the same properties w.r.t. being able to add new fields without breaking old saves (again, with care). This also makes it almost straightforward to convert both ways between CBOR (efficient) and JSON (text-editor compatible). However, the key word there is "almost": CBOR supports some specific things that I'd like to use and that JSON doesn't support, and on the JSON side of the conversion you hit the same problems with floating point round-trip and vectors/matrices/etc that we have with JSON now.
- Floats and doubles are stored directly in binary IEEE format. No extra conversion step needed and no precision problems.
- Direct support for storing raw binary blobs (with basically no escaping needed) as well as strings. Compare to JSON which only supports strings of Unicode characters and where we've had problems dumping arbitrary binary data as strings. Hopefully we won't have much raw binary data to store since that implies having things that have been serialised some other way and then embedded, but it's nice to have the option I think.
- CBOR supports a fairly simple (albeit a little limited) tagging mechanism for adding metadata. I would propose to use this for various "primitive" types that aren't covered by the built in CBOR types. In particular: Vector, Matrix, Quaternion, Color, Reference to a Lua object that was serialised elsewhere (this is something we do already to cope with, e.g., circular references in Lua objects).
- Binary format; "should be" faster to serialise/deserialise than JSON. I don't think this is really a big deal although the amount of binary->decimal number conversions that we do irks me a little :-P. I suspect the biggest improvement would come from structural changes (reducing the quantity of stuff we include in the save) rather than improvements to serialisation speed.
- Somewhat more space efficient than JSON. I don't actually care because most of our ridiculous save file bloat is due to saving a bunch of stuff that shouldn't be in the save at all; the compression layer deals with most of the space-inefficiency of the serialisation format.
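
As a rough illustration of three of the points above (binary doubles, raw byte strings, and tags for "primitive" types), here is a hand-written sketch of the relevant CBOR encodings. It is not a full encoder and not Pioneer code. VECTOR3_TAG is an arbitrary placeholder number picked for this example; I have not checked it against the IANA tag registry, and a real design would need to choose tag numbers deliberately.

```python
import struct

VECTOR3_TAG = 40000  # hypothetical tag number, for illustration only

def head(major, value):
    """Encode a CBOR head: 3-bit major type plus an unsigned argument."""
    if value < 24:
        return bytes([(major << 5) | value])
    for extra, fmt in ((24, ">B"), (25, ">H"), (26, ">I"), (27, ">Q")):
        if value < (1 << (8 * struct.calcsize(fmt))):
            return bytes([(major << 5) | extra]) + struct.pack(fmt, value)
    raise ValueError("value too large for a CBOR head")

def encode_double(x):
    # Major type 7, additional info 27: the 8 bytes are the IEEE 754 double itself.
    return b"\xfb" + struct.pack(">d", x)

def encode_bytes(blob):
    # Major type 2: length-prefixed byte string, copied verbatim with no escaping.
    return head(2, len(blob)) + blob

def encode_vector3(x, y, z):
    # Major type 6 (tag) wrapping a 3-element array (major type 4) of doubles.
    return head(6, VECTOR3_TAG) + head(4, 3) + b"".join(
        encode_double(c) for c in (x, y, z))

print(encode_double(1.5).hex())         # fb3ff8000000000000
print(encode_bytes(b"\x00\xff\n").hex())  # 4300ff0a
print(len(encode_vector3(1.0, 2.0, 3.0)))  # 31 bytes: 3 tag + 1 array head + 3 * 9
```

Reading a double back is just struct.unpack(">d", ...) on the same 8 bytes, so there is no decimal conversion anywhere and the value is bit-identical.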

Disadvantages of changing to CBOR:
- Yet another change. (Is it worth it?)
- Can't view it or change it usefully in a text editor: It will need a tool. Most likely a custom tool. However a tool (for CBOR->JSON only; not bidirectional/round-trip) should be very easy to throw together in Python or whatever, since both JSON and CBOR have good library support.
- Actually manipulating the save files either needs a comprehensive custom tool that supports round-trip conversion (which is tricky for exactly the same reason that we have some annoying problems with JSON right now) or directly using CBOR. But if the manipulation is being done in code then directly manipulating CBOR should be relatively easy due to wide library support.
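
For a sense of how small a one-way CBOR->JSON viewer could be, here is a sketch with a hand-rolled decoder so that it has no dependencies. In practice you would use an existing CBOR library instead. It skips indefinite-length items and half-precision floats, and the "$bytes" / "$tag" wrappers are conventions invented for this example. Those wrappers are exactly why the conversion is not round-trip safe without extra work.

```python
import json
import struct

def decode(buf, pos=0):
    """Decode one CBOR item from buf at pos; return (value, next_pos)."""
    major, info = buf[pos] >> 5, buf[pos] & 0x1F
    pos += 1
    if major == 7:  # floats and simple values
        if info == 27:
            return struct.unpack_from(">d", buf, pos)[0], pos + 8
        if info == 26:
            return struct.unpack_from(">f", buf, pos)[0], pos + 4
        simple = {20: False, 21: True, 22: None}
        if info in simple:
            return simple[info], pos
        raise ValueError("unsupported simple/float encoding %d" % info)
    if info < 24:  # argument stored directly in the initial byte
        arg = info
    elif info <= 27:  # argument stored in the following 1, 2, 4 or 8 bytes
        size = 1 << (info - 24)
        arg = int.from_bytes(buf[pos:pos + size], "big")
        pos += size
    else:
        raise ValueError("indefinite lengths are not handled in this sketch")
    if major == 0:
        return arg, pos
    if major == 1:
        return -1 - arg, pos
    if major == 2:  # byte string: JSON has no equivalent, so show it as hex
        return {"$bytes": buf[pos:pos + arg].hex()}, pos + arg
    if major == 3:
        return buf[pos:pos + arg].decode("utf-8"), pos + arg
    if major == 4:
        items = []
        for _ in range(arg):
            item, pos = decode(buf, pos)
            items.append(item)
        return items, pos
    if major == 5:
        obj = {}
        for _ in range(arg):
            key, pos = decode(buf, pos)
            val, pos = decode(buf, pos)
            obj[str(key)] = val  # JSON keys must be strings
        return obj, pos
    inner, pos = decode(buf, pos)  # major == 6: tagged item
    return {"$tag": arg, "value": inner}, pos

def cbor_to_json(buf):
    value, _ = decode(buf)
    return json.dumps(value, indent=2, sort_keys=True)

# {"id": 7, "pos": tag 40000 wrapping [1.5, 2.0]}; the tag number is hypothetical.
sample = bytes.fromhex(
    "a2" "626964" "07" "63706f73" "d99c40" "82"
    "fb3ff8000000000000" "fb4000000000000000")
print(cbor_to_json(sample))
```

Going the other way (JSON->CBOR) would need to recognise those wrappers and rebuild exact doubles from decimal text, which is where the tricky part mentioned above comes in.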

Other options:
- Keep using JSON, with our special way of representing numbers, etc. This is very seriously an option; please say if you prefer it!
- Go back to full-custom serialisation. But that seems like a step backwards, or at least it seems like it has only negatives compared to CBOR.
- Pick a different existing schemaless serialisation format (e.g., BSON). I haven't tried to do a comprehensive survey. I think in the schemaless category of formats CBOR is probably a pretty good option though.
- Go down a different route: Switch to a schema based serialisation format. This would be, e.g., protobuf, cap'n'proto, thrift, or flatbuffers. I haven't done a survey of these either and there may be other sensible options beyond those four. Schema based formats still support backwards/forwards compatibility when doing things like adding new fields, but they require a schema to be defined ahead of time. I personally really like schema-based and think it's overall much better than schemaless. However it's a harder change for us to make and not totally obvious how best to deal with mods etc which will want to include data in the save.
  Advantages: Writing out the schema formally means making clear decisions about how the data is structured and what is included and gives a clear place to explicitly document the save format. Also you get a lot more validation for free when loading the file.
  Disadvantages: Doesn't seem like this fits very nicely with the total lack of schema/structured types in Lua and in general not clear how to deal with data saved by mods etc. You either end up building a schemaless format on top of the schema-based one (which seems very pointless) or you need to be able to handle a dynamic set of schemas so each module can define its own schema for its data (which should totally be possible with some of the schema-based formats at least, but maybe isn't trivial).

Re: Serialisation format change

Posted: Fri Dec 22, 2017 8:22 pm
by impaktor
My main priority is that you choose the method you're the most comfortable
with to make progress that will end us at a point that is better than the
current, and additionally, since you said you only have the holiday to do it
over, that you'll have time to finish it.

Please break saves and crack format as you wish, I'm sure it'll be worth it.

All systems go!

Re: Serialisation format change

Posted: Tue Dec 26, 2017 9:36 pm
by FluffyFreak
I am totally fine with this idea.

Re: Serialisation format change

Posted: Thu Oct 04, 2018 6:43 pm
by sturnclaw
With https://github.com/pioneerspacesim/pioneer/pull/4459, most of this design doc is resolved. Due to the way it's implemented, we lose support for custom tagged CBOR values and raw strings, but gain the immeasurable advantage of not having to write anything new to save games in both JSON and CBOR. Additionally, due to some nice features of the library, we don't need to bother saving numbers as strings anymore, and don't need to manually throw an exception for every missing field.