Encoding and evolution

Formats for encoding

Binary encoding

JSON and XML are neither compact nor fast to parse compared with binary formats. This has led to binary encodings of JSON and XML, but because these encodings do not use a prescribed schema, they still have to include every object field name in the encoded data.

For example:

{
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"]
}

[Figure: binary encoding of the JSON record]
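To make the size argument concrete, here is a minimal sketch that hand-encodes the record with a small subset of MessagePack. MessagePack is an assumption chosen for illustration (the text above does not name a specific binary JSON format), and the helper `pack` is hypothetical and only handles the types that appear in this record.

```python
import json

def pack(value):
    """Encode a value with a small subset of MessagePack (illustration only)."""
    if isinstance(value, str):
        data = value.encode("utf-8")
        assert len(data) < 32, "sketch only supports fixstr (under 32 bytes)"
        return bytes([0xA0 | len(data)]) + data           # fixstr
    if isinstance(value, int):
        if 0 <= value < 128:
            return bytes([value])                         # positive fixint
        if 128 <= value < 256:
            return b"\xcc" + bytes([value])               # uint8
        assert 0 <= value < 65536, "sketch only supports up to uint16"
        return b"\xcd" + value.to_bytes(2, "big")         # uint16
    if isinstance(value, list):
        assert len(value) < 16, "sketch only supports fixarray"
        return bytes([0x90 | len(value)]) + b"".join(pack(v) for v in value)
    if isinstance(value, dict):
        assert len(value) < 16, "sketch only supports fixmap"
        body = b"".join(pack(k) + pack(v) for k, v in value.items())
        return bytes([0x80 | len(value)]) + body          # fixmap
    raise TypeError(f"type not handled in this sketch: {type(value)!r}")

record = {
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"],
}

text = json.dumps(record, separators=(",", ":")).encode("utf-8")
binary = pack(record)

print(len(text), len(binary))      # 81 66
print(b"userName" in binary)       # True: field names are still in the data
```

The binary form saves only 15 bytes here, because every field name is still written out in full.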

Thrift and Protocol Buffers

struct Person {
    1: required string userName,
    2: optional i64 favoriteNumber,
    3: optional list<string> interests
}

[Figures: Thrift encoding; Thrift compact encoding]

message Person {
    required string user_name = 1;
    optional int64 favorite_number = 2;
    repeated string interests = 3;
}

[Figure: Protobuf encoding]
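The following is a rough sketch of how the Person record above ends up on the wire in Protobuf: each field is written as a key, `(field_number << 3) | wire_type`, followed by its value, and field names never appear. The helper names are hypothetical, and the sketch only covers non-negative integers and strings; real code would use the classes generated by `protoc`.

```python
WIRE_VARINT = 0   # wire type for int64 and other varint-encoded scalars
WIRE_LEN = 2      # wire type for length-delimited values such as strings

def encode_varint(n):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def key(field_number, wire_type):
    """The key that precedes every field value: tag number plus wire type."""
    return encode_varint((field_number << 3) | wire_type)

def encode_string_field(field_number, text):
    data = text.encode("utf-8")
    return key(field_number, WIRE_LEN) + encode_varint(len(data)) + data

def encode_int_field(field_number, value):
    return key(field_number, WIRE_VARINT) + encode_varint(value)

def encode_person(user_name, favorite_number, interests):
    out = encode_string_field(1, user_name)
    out += encode_int_field(2, favorite_number)
    for interest in interests:        # a repeated field is simply written once per element
        out += encode_string_field(3, interest)
    return out

encoded = encode_person("Martin", 1337, ["daydreaming", "hacking"])
print(len(encoded))   # 33
print(encoded.hex())
```

Only the tag numbers 1, 2 and 3 identify the fields, which is why the field names in the schema can be changed freely while the tags cannot.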

New code has a new field added

message Person {
    required string user_name = 1;
    optional int64 favorite_number = 2;
    repeated string interests = 3;
    optional int64 user_age = 4;
}

Old code can read a record written by new code by simply ignoring tag 4.

New code can read a record written by old code by setting field 4 to its default value. Important: a new field cannot be required, and the tags of existing fields cannot be changed.
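Both directions can be sketched with a tiny decoder for the Protobuf wire format. This is an illustration under assumptions, not the real library: `decode` and the schema dictionaries are hypothetical, the repeated `interests` field is left out for brevity, and only varint and length-delimited string fields are handled.

```python
def read_varint(buf, pos):
    """Read a base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

def decode(buf, known_fields):
    """Decode a record; known_fields maps tag number -> (name, default)."""
    # Start from the defaults, so fields missing from the data keep them.
    record = {name: default for name, default in known_fields.values()}
    pos = 0
    while pos < len(buf):
        field_key, pos = read_varint(buf, pos)
        tag, wire_type = field_key >> 3, field_key & 0x07
        if wire_type == 0:                      # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:                    # length-delimited (string here)
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise ValueError("wire type not handled in this sketch")
        if tag in known_fields:
            record[known_fields[tag][0]] = value
        # Unknown tag: the wire type told us how many bytes to skip, so ignore it.
    return record

OLD_SCHEMA = {1: ("user_name", ""), 2: ("favorite_number", 0)}
NEW_SCHEMA = {**OLD_SCHEMA, 4: ("user_age", 0)}

old_bytes = b"\x0a\x06Martin" + b"\x10\xb9\x0a"   # written by old code
new_bytes = old_bytes + b"\x20\x1e"               # new code adds user_age = 30 (tag 4)

print(decode(new_bytes, OLD_SCHEMA))   # old code skips tag 4
print(decode(old_bytes, NEW_SCHEMA))   # new code falls back to the default for user_age
```

The wire type stored in each key is what lets old code skip a field it has never heard of: it says how many bytes the value occupies even when the tag is unknown.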

New code has removed a field

message Person {
    required string user_name = 1;
    repeated string interests = 3;
}

Old code can read a record written by new code by setting the removed field to its default value.

New code can read a record written by old code by simply ignoring the tag of the removed field. Important: only an optional field can be removed, and its tag number can never be reused.

Changing datatype

Changing the datatype of a field may be possible, but the value might lose precision or get truncated.
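As a small illustration, assume a field is widened from a 32-bit to a 64-bit integer (this specific change is an assumption made for the example). New code reads old data without trouble, but old code that still stores the value in a 32-bit variable keeps only the low 32 bits. The helper `to_int32` is hypothetical and mimics that 32-bit variable; the exact behaviour in practice depends on the language and library.

```python
def to_int32(value):
    """Keep only the low 32 bits, interpreted as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value

small = 1337               # fits in 32 bits, survives unchanged
large = 2**32 + 1337       # written by new code as a 64-bit value

print(to_int32(small))     # 1337
print(to_int32(large))     # 1337: silently wrong, the high bits are gone
print(to_int32(2**31))     # -2147483648: the sign flips as well
```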

Avro

Avro is another binary encoding format, different from Protobuf and Thrift. It also uses a schema to specify the structure of the data being encoded.

record Person {
    string userName;
    union { null, long } favoriteNumber = null;
    array<string> interests;
}

[Figure: Avro encoding]

Because there are no tag numbers to identify the fields, the binary data can only be decoded correctly if the code reading it knows the exact schema it was written with and reads the fields in the same order. However, Avro does not require the writer's schema and the reader's schema to be the same, only that they are compatible.

[Figure: Avro schema evolution]
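The idea can be sketched in a few lines: values are written back to back in the writer's field order with no tags, and the reader first decodes them using the writer's schema, then maps them onto its own schema by field name. This is a simplified illustration, not the Avro library. The functions and the tuple-based schema format are hypothetical, only `string` and `long` are supported, and every reader field is given a default (real Avro reports an error when a field missing from the writer's schema has no default).

```python
def write_long(n):
    """Avro long: zigzag-encode, then write as a base-128 varint."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)

def read_long(buf, pos):
    raw, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        raw |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            break
    return (raw >> 1) ^ -(raw & 1), pos      # undo the zigzag encoding

def encode(record, schema):
    """Write the values in schema order; no names or tags are written."""
    out = b""
    for name, ftype, _default in schema:
        if ftype == "long":
            out += write_long(record[name])
        else:                                # "string": length prefix, then UTF-8 bytes
            data = record[name].encode("utf-8")
            out += write_long(len(data)) + data
    return out

def decode(buf, writer_schema, reader_schema):
    # Step 1: walk the bytes in the writer's field order.
    pos, written = 0, {}
    for name, ftype, _default in writer_schema:
        if ftype == "long":
            written[name], pos = read_long(buf, pos)
        else:
            length, pos = read_long(buf, pos)
            written[name] = buf[pos:pos + length].decode("utf-8")
            pos += length
    # Step 2: resolve by field name; extra writer fields are dropped,
    # missing ones take the reader's default.
    return {name: written.get(name, default) for name, _t, default in reader_schema}

writer_schema = [("userName", "string", ""), ("favoriteNumber", "long", 0)]
reader_schema = [("favoriteNumber", "long", 0), ("userName", "string", ""),
                 ("userAge", "long", 0)]     # reordered, plus one new field

data = encode({"userName": "Martin", "favoriteNumber": 1337}, writer_schema)
print(data.hex())
print(decode(data, writer_schema, reader_schema))
```

Note that `decode` needs both schemas: without the writer's schema there is no way to tell where one value ends and the next begins.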

How does reader know the writer’s schema

How does reader know the writer’s schema - Thrift and Protobuf

Writer: Using Protobuf as the example, the writer defines the schema in a .proto file that looks like this:

syntax = "proto3";
package tutorial;

import "google/protobuf/timestamp.proto";

option go_package = "github.com/danniel1205/explore-protobuf/tutorialpb";

// [START messages]
message Person {
  string user_name = 1;
  int64 favorite_number = 2;
  repeated string interests = 3;
  google.protobuf.Timestamp last_updated = 4;
}
// [END messages]

Reader: The reader has to use the same schema, or a compatible one, in order to unmarshal the binary data properly.

How does reader know the writer’s schema - Avro

Above might apply to thrift and protobuf as well.

Dynamic generated schemas

Suppose we have a relational database and want to dump its contents to a file or transfer them over the network; a binary format is a good fit for this. Generating an Avro schema from the database schema is easier than generating a Thrift or Protobuf schema, because Avro matches fields by name, whereas Thrift and Protobuf schemas need a tag number assigned to every field.
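A minimal sketch of such a generator is shown below. The output follows Avro's JSON schema format for records, but the function name, the `(name, sql_type, nullable)` column description and the SQL-to-Avro type mapping are all assumptions made for illustration.

```python
import json

# Hypothetical mapping from SQL column types to Avro primitive types.
SQL_TO_AVRO = {
    "INTEGER": "int",
    "BIGINT": "long",
    "TEXT": "string",
    "BOOLEAN": "boolean",
    "DOUBLE": "double",
}

def avro_schema_for_table(table_name, columns):
    """Build an Avro record schema; columns is a list of (name, sql_type, nullable)."""
    fields = []
    for name, sql_type, nullable in columns:
        avro_type = SQL_TO_AVRO[sql_type]
        if nullable:
            # A nullable column becomes a union with null, defaulting to null.
            fields.append({"name": name, "type": ["null", avro_type], "default": None})
        else:
            fields.append({"name": name, "type": avro_type})
    return {"type": "record", "name": table_name, "fields": fields}

columns = [
    ("userName", "TEXT", False),
    ("favoriteNumber", "BIGINT", True),
]
schema = avro_schema_for_table("Person", columns)
print(json.dumps(schema, indent=2))
```

If a column is added or dropped, the schema is simply regenerated on the next export. No tag numbers have to be tracked between runs, because each column maps to a field by name.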

Merits of Schemas

Protobuf, Thrift and Avro all use a schema to describe a binary encoding format.

Modes of dataflow

Dataflow through databases

The process that writes to the database encodes the data. The process that reads from the database decodes the data.

Usually multiple processes access the database; some of them may have the new version of the schema while others still have the old one. The most important rule is to keep intact any field that does not belong to the current process's version of the schema!

See the following diagram for more details. We SHOULD NOT remove photoURL when writing the data back.

[Figure: dataflow-database-compatibility]
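The difference between losing and preserving the unknown field can be sketched with plain dictionaries. The function names and the `OLD_FIELDS` set are hypothetical; they stand in for old application code that knows nothing about photoURL.

```python
OLD_FIELDS = {"userName", "favoriteNumber"}   # fields the old code knows about

def update_lossy(stored, new_number):
    """Old code that rebuilds the record from its own model: photoURL is lost."""
    model = {k: v for k, v in stored.items() if k in OLD_FIELDS}
    model["favoriteNumber"] = new_number
    return model

def update_preserving(stored, new_number):
    """Old code that carries unknown fields through untouched."""
    updated = dict(stored)                    # copy every field, known or not
    updated["favoriteNumber"] = new_number
    return updated

# Record written by new code, which added photoURL.
stored = {"userName": "Martin", "favoriteNumber": 1337, "photoURL": "martin.png"}

print(update_lossy(stored, 42))        # photoURL has disappeared
print(update_preserving(stored, 42))   # photoURL is still there
```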

Dataflow through service calls

Dataflow through asynchronous message passing

An asynchronous message-passing system uses a message broker to receive and deliver messages. Messages are usually one-way: the sender does not expect a reply to the message it sent. (RPC, by contrast, is a request/response dataflow.)
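The one-way flow can be sketched with Python's standard library, using an in-process `queue.Queue` as a stand-in for a real message broker. That substitution is an assumption for illustration only; a real broker runs as a separate service, and the message contents below are made up.

```python
import queue
import threading

broker = queue.Queue()    # stands in for a message broker (in-process only)
received = []

def consumer():
    """Take messages off the queue until the sentinel arrives."""
    while True:
        message = broker.get()
        if message is None:          # sentinel: no more messages
            break
        received.append(message)

worker = threading.Thread(target=consumer)
worker.start()

# The sender publishes and moves on; it never waits for a reply.
for i in range(3):
    broker.put({"event": "user_updated", "id": i})
broker.put(None)

worker.join()
print(received)
```

The sender only talks to the queue and never to the consumer directly, so neither side needs to know who the other is or whether it is currently running.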

There are several advantages to using a message broker:
