Introduction to Serialization


Introduction

Serialization is the process of taking an object from memory and converting it into a series of bytes so that it can be stored or transmitted over a network and then reconstructed later on, perhaps by a different program or in a different machine environment.

Deserialization is the reverse action: taking serialized data and reconstructing the original object in memory.

Many object-oriented programming languages support serialization natively, including, but not limited to:

  • Java
  • Ruby
  • Python
  • PHP
  • C#

For the duration of this module, we will only focus on Python and PHP; however, please note that the same concepts taught may be reapplied to most, if not all, languages that support serialization.


PHP Serialization

As an example, this is how we would serialize an array in PHP:

[!bash!]$ php -a

Interactive shell

php > $original_data = array("HTB", 123, 7.77);
php > $serialized_data = serialize($original_data);
php > echo $serialized_data;
a:3:{i:0;s:3:"HTB";i:1;i:123;i:2;d:7.77;}
php > $reconstructed_data = unserialize($serialized_data);
php > var_dump($reconstructed_data);
array(3) {
  [0]=>
  string(3) "HTB"
  [1]=>
  int(123)
  [2]=>
  float(7.77)
}

As you can see, $original_data is an array containing one string ("HTB"), one integer (123), and one double (7.77). Using the serialize function, the array is turned into bytes that represent the array. We carry on to unserialize this serialized string and restore the original array as verified by the var_dump of $reconstructed_data.

Serialized objects in PHP are easy to read, unlike serialized objects in many other languages, which may look like complete gibberish to the human eye, as you will see in the Python example, but before that, let's understand what the letters and numbers in the serialized data mean:

a:3:{ // (A)rray with (3) items
    i:0;s:3: "HTB"; // (I)ndex (0); (S)tring with length (3) and value: "HTB"
    i:1;i:123; // (I)ndex (1); (I)nteger with value (123)
    i:2;d:7.77; // (I)ndex (2); (D)ouble with value (7.77)
}

Python Serialization

Similar to the PHP example above, we will serialize an array in Python. There are multiple libraries for Python which implement serialization, such as PyYAML and JSONpickle. However, Pickle is the native implementation, and it is what will be used in this example.

[!bash!]$ python3

Python 3.10.7 (main, Sep  8 2022, 14:34:29) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pickle
>>> original_data = ["HTB", 123, 7.77]
>>> serialized_data = pickle.dumps(original_data)
>>> print(serialized_data)
b'\x80\x04\x95\x16\x00\x00\x00\x00\x00\x00\x00]\x94(\x8c\x03HTB\x94K{[email protected]\x1f\x14z\xe1G\xae\x14e.'
>>> reconstructed_data = pickle.loads(serialized_data)
>>> print(reconstructed_data)
['HTB', 123, 7.77]

Reading the serialized data pickle outputs is much harder than reading the output PHP provides. However, it is still possible. According to comments in the pickle library, a pickle is a program for a virtual pickle machine (PM). The PM contains a stack and a memo (long-term memory), and a pickled object is just a sequence of opcodes for the PM to execute, which will recreate an arbitrary object on the stack.

The PM's stack is a Last-In-First-Out (LIFO) data structure. You may push items onto the top of the stack, and you may pop the top object off of the stack.

Quoting from comments in the pickle library, the PM's memo is a "data structure which remembers which objects the pickler has already seen, so that shared or recursive objects are pickled by reference and not by value."

In Lib/pickle.py (Python 3.10), we can see all of the pickle opcodes defined, and by referring to them, as well as the source code for the various pickling functions, we can piece together what our serialized_data does exactly when it is passed to pickle.loads():

'\x80\x04'
# PROTO 4
# Tell the PM that we are using protocol version 4. This is the default since Python 3.8.
# Protocol versions 3-5 can not be unpickled by Python 2.x.

'\x95\x16\x00\x00\x00\x00\x00\x00\x00' 
# FRAME 16
# Essentially we are telling the PM that the serialized data is 16 bytes long.
# The argument is calculated like this: 
# `struct.pack("<Q", len(b']\x94(\x8c\x03HTB\x94K{[email protected]\x1f\x14z\xe1G\xae\x14e.')) = b'\x16\x00\x00\x00\x00\x00\x00\x00'`.

']' 
# EMPTY_LIST
# Pushes an empty list onto the stack. 
# Eventually, we will append the items to this list after we have defined them.

'\x94' 
# MEMOIZE
# This stores the object on the top of the stack in the 'memo' which is akin to long-term memory. 
# The memo is used to keep transient objects alive during pickling. 
# In this case we are 'memozing' the empty list we just pushed onto the stack. 
# This opcode is called when pickling any of the following types:
# - __reduce__
# - bytes
# - bytearray
# - string
# - tuple
# - list
# - dict
# - set 
# - frozenset
# - global

'(' 
# MARK
# Pushes the special 'markobject' on the stack.
# This will be referred to later as the starting point for our array items.

'\x8c\x03HTB' 
# SHORT_BINUNICODE 3 HTB
# Pushes the unicode string with length 3 'HTB' onto the stack.

'\x94' 
# MEMOIZE
# We tell the PM to 'memoize' the string that we just pushed onto the stack.

'K{' 
# BININT1 {
# Pushes a 1-byte unsigned int with value 123 onto the stack. 
# '{' is the byte representation of 123 calculated as so: 
# `chr(123) = b'{'`

'[email protected]\x1f\x14z\xe1G\xae\x14' 
# BINFLOAT @\x1f\x14z\xe1G\xae\x14
# Pushes a float with the value 7.77 onto the stack. 
# '@\x1f\x14z\xe1G\xae\x14' is the 8-byte float encoding of 7.77 which is calculated like this: 
# `struct.pack(">d", 7.77) = b'@\x1f\x14z\xe1G\xae\x14'`

'e'
# APPENDS
# We are telling the PM to extend the empty list on the stack with all items we just defined back up until the 'markobject' we defined earlier.

'.' 
# STOP
# This is how we tell the PM we are at the end of the pickle. 
# The original array `['HTB', 123, 7.77]` was recreated and now sits at the top of the stack.