Encoding Non-Printable Bytes in Python 3


bytes -> str:

In []: b'\x90\x90\x90\x90'.decode('latin-1')
Out[]: '\x90\x90\x90\x90'

str -> bytes:

In []: '\x90\x90\x90\x90'.encode('latin-1')
Out[]: b'\x90\x90\x90\x90'

But Why Tho

Sometimes when you're programming or playing CTFs, you'll encounter odd binary file formats. In Python 2, it's no problem to open those files with the rb or wb file modes and process them using standard string methods, because a Python 2 str is just a sequence of bytes.

Then, everything changed when Python 3 came out.

In Python 3 you will almost certainly be dealing with two separate types: strings (str) and bytes. Bytes are effectively raw computer "bytes" to Python: 8-bit values with no context. Strings are sequences of Unicode characters, and an encoding is the rule that maps those characters to and from bytes. Converting between the two therefore always means picking an encoding: bytes.decode() gives you a str, and str.encode() gives you bytes.

However, strings have some capabilities which bytes do not share. A notable example is the .format() method, which bytes does not have.

In []: b'{}'.format('test')
AttributeError                            Traceback (most recent call last)
<ipython-input-72a53680a88c> in <module>
----> 1 b'{}'.format('test')

AttributeError: 'bytes' object has no attribute 'format'
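
If you do need formatting on bytes, there are two routes that should work: printf-style % formatting, which bytes has supported since Python 3.5, or formatting a str and encoding the result. A minimal sketch (the variable names are just for illustration):

```python
# Option 1: printf-style formatting works directly on bytes (Python 3.5+).
name = b'test'
via_percent = b'hello %s' % name

# Option 2: format a str first, then encode the result to bytes.
via_format = 'hello {}'.format('test').encode('latin-1')

print(via_percent)  # b'hello test'
print(via_format)   # b'hello test'
```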

I generally just use Python 2 because I like it and it's not 2020 yet.

Don't follow my bad habits. Python 3 is the future.

Fast-forward to today where I need to use Python 3 to manipulate some binary data from a file and generate a newly processed file.

Things that won't work

Don't use bytes.decode('ascii', 'replace') (or its counterpart str.encode('ascii', 'replace')), because the result will look plausible while the original byte values are silently thrown away.

For example:

In []: b'\x90\x90\x90\x90'.decode('ascii', 'replace')
Out[]: '����'  # wow non-printable looks cool

In []: test = b'\x90\x90\x90\x90'.decode('ascii', 'replace')

In []: test
Out[]: '����'

In []: test[0]
Out[]: '�'

In []: ord(test[0])
Out[]: 65533  # o no, that's U+FFFD (the replacement character), not 0x90
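
To see that the damage is permanent, here is a small sketch that tries to convert the replaced string back to bytes. The variable names are made up for the example:

```python
original = b'\x90\x90\x90\x90'

# Decoding with 'replace' swaps every non-ASCII byte for U+FFFD.
text = original.decode('ascii', 'replace')

# Encoding back with 'replace' turns each U+FFFD into a literal '?'.
round_tripped = text.encode('ascii', 'replace')

print(round_tripped)              # b'????'
print(round_tripped == original)  # False
```

Every 0x90 has become 0x3f, and there is no way to tell what the original bytes were.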

Things that will work

If you are trying to round-trip raw binary data (any byte value from 0x00 to 0xff) that you ripped right out of a random file or read from a socket, use the latin-1 encoding to convert between bytes and strings.

bytes -> str:

In []: b'\x90\x90\x90\x90'.decode('latin-1')
Out[]: '\x90\x90\x90\x90'

str -> bytes:

In []: '\x90\x90\x90\x90'.encode('latin-1')
Out[]: b'\x90\x90\x90\x90'
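
Putting the two conversions together, processing a binary file with string methods might look roughly like this. It is only a sketch: process_file is a hypothetical helper, the file names are throwaway, and the replace() call stands in for whatever processing you actually need.

```python
import os
import tempfile

def process_file(src_path, dst_path):
    """Hypothetical helper: read raw bytes, edit them as a str, write raw bytes back."""
    with open(src_path, 'rb') as f:
        raw = f.read()
    text = raw.decode('latin-1')          # bytes -> str, one character per byte
    text = text.replace('\x90', '\xcc')   # any str method works here
    with open(dst_path, 'wb') as f:
        f.write(text.encode('latin-1'))   # str -> bytes, one byte per character

# Demo on a throwaway file in a temporary directory.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, 'in.bin')
dst = os.path.join(workdir, 'out.bin')
with open(src, 'wb') as f:
    f.write(b'\x00\x90\x90\xff')

process_file(src, dst)

with open(dst, 'rb') as f:
    result = f.read()
print(result)  # b'\x00\xcc\xcc\xff'
```

One thing to watch: if the processing step introduces a character above U+00FF, the final encode('latin-1') will raise UnicodeEncodeError, so keep everything you insert within the 0x00 - 0xff range.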

This works and also took me way too long to figure out while Googling. It works because latin-1 (a.k.a. ISO 8859-1) uses all 8 bits: every byte value from 0 to 255 maps to the character with the same code point, so nothing is lost in either direction. ASCII, however, is only 7 bits, which means sad times once you go beyond 127.
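
A quick way to convince yourself is to push all 256 possible byte values through both encodings. A minimal sketch, with illustrative variable names:

```python
every_byte = bytes(range(256))

# latin-1 maps byte value N to code point N, so nothing is lost.
as_text = every_byte.decode('latin-1')
print([ord(c) for c in as_text] == list(range(256)))  # True
print(as_text.encode('latin-1') == every_byte)        # True

# ascii only covers 0-127, so a strict decode fails at the first byte past that.
try:
    every_byte.decode('ascii')
except UnicodeDecodeError as exc:
    first_bad_offset = exc.start
print(first_bad_offset)  # 128
```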

Hopefully this saves you some time since it was quite a gotcha for me. 🐍3💩