Wednesday, March 15, 2023

Base Conversions in Python

As a cybersecurity professional, I have needed on innumerable occasions to convert obfuscated data into something readable so that I could understand what some piece of malware, phishing email, or GET/POST request was doing.

Over the years, I've written many Python scripts to do various decoding of data and thought I'd share what I've learned.

You can find the code for this tutorial on my GitHub page.

There are many ways to do base conversions, but I've always found it simplest to convert the input to a plain integer first using int() (Python displays integers in decimal) and then do a whole slew of conversions from there.

So let's say we have a piece of data in hex. We'd use the int() function to convert it to decimal first, like so:

decimal = int(str_to_conv, 16)

or something in octal:

decimal = int(str_to_conv, 8)

The second argument to int() is the base we're converting from: 2 for binary, 8 for octal, 10 for decimal, and 16 for hex. That makes conversions to decimal simple. So let's look at code for various bases.

# binary
decimal = int(str_to_conv, 2)
# character
decimal = ord(str_to_conv)
# hexadecimal
decimal = int(str_to_conv, 16)
# octal
decimal = int(str_to_conv, 8)
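The lines above can be tried directly in an interpreter. Here is a small sketch with made-up sample values (the names samples, prefixed_hex and inferred are mine, not part of the tutorial program). It also shows two details of int() that can help with real-world data: it tolerates a prefix such as 0x when the prefix matches the base, and a base of 0 asks it to infer the base from the prefix.

```python
# Sample inputs are invented for illustration only.
samples = {
    "binary": int("1011", 2),      # 11
    "hexadecimal": int("FE", 16),  # 254
    "octal": int("376", 8),        # 254
    "character": ord("A"),         # 65
}

# int() accepts a prefix when it matches the base being parsed.
prefixed_hex = int("0xFE", 16)

# Base 0 means "work out the base from the 0b / 0o / 0x prefix".
inferred = int("0b1011", 0)

for name, value in samples.items():
    print(f"{name}: {value}")
print(prefixed_hex, inferred)
```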

There are a few other "base" conversions you may be aware of, such as base16, base32, base64 and base85. These are a bit different because they encode a string of bytes as text rather than change the base of a single number, but we'll also cover them in this article.

Let's set up a little program that takes two inputs. Input one is what we're converting from and input two is the data we want to convert. So when we call the program we'll do it like: base_converter.py hex FE or base_converter.py bin 1011

Let's also make a function called convert() to do all our conversions that returns a dictionary containing all our conversions.

import sys

def convert(conv_type, str_to_conv):
    # Convert input(s) into decimal as a starting place for all encodings
    if conv_type == 'bin':
        decimal = int(str_to_conv, 2)
    elif conv_type == 'bcd':
        decimal = int(str_to_conv)
    elif conv_type == 'chr':
        decimal = ord(str_to_conv)
    elif conv_type == 'dec':
        decimal = int(str_to_conv)
    elif conv_type == 'hex':
        decimal = int(str_to_conv, 16)
    elif conv_type == 'oct':
        decimal = int(str_to_conv, 8)
    # Set up dict to track all conversions
    encodings = {"decimal": decimal}
    return encodings

encodings = convert(sys.argv[1], sys.argv[2])
print(encodings)
  • We're importing sys so that we can pull in the command line arguments. 
  • The convert function takes two arguments, which are our command line arguments.
  • The next section converts our str_to_conv to decimal based on conv_type.
  • Next we're creating our encodings dictionary, right now with just the decimal.

If you run the above code with base_converter.py hex FE, you should see a dict with a single element called decimal, like: {'decimal': 254}
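One thing the if/elif chain does not handle is a type it doesn't recognise: decimal is never assigned, so Python raises an UnboundLocalError. A possible alternative, sketched here under my own naming (PARSERS and to_decimal are hypothetical and not part of the tutorial program), is to look the parser up in a dictionary and fail with a clearer message.

```python
# Hypothetical alternative to the if/elif chain: map each type to a parser.
PARSERS = {
    "bin": lambda s: int(s, 2),
    "bcd": int,  # mirrors the listing: input is read as plain decimal digits
    "chr": ord,
    "dec": int,
    "hex": lambda s: int(s, 16),
    "oct": lambda s: int(s, 8),
}


def to_decimal(conv_type, str_to_conv):
    """Return the integer value of str_to_conv, or raise ValueError."""
    try:
        parser = PARSERS[conv_type]
    except KeyError:
        raise ValueError(f"unknown conversion type: {conv_type!r}") from None
    return parser(str_to_conv)


print(to_decimal("hex", "FE"))
```

The rest of convert() could stay as it is; only the first few lines would change.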

Now, let's add code to handle conversions to various bases.

import sys

def convert(conv_type, str_to_conv):
    # Convert input(s) into decimal as a starting place for all encodings
    if conv_type == 'bin':
        decimal = int(str_to_conv, 2)
    elif conv_type == 'bcd':
        decimal = int(str_to_conv)
    elif conv_type == 'chr':
        decimal = ord(str_to_conv)
    elif conv_type == 'dec':
        decimal = int(str_to_conv)
    elif conv_type == 'hex':
        decimal = int(str_to_conv, 16)
    elif conv_type == 'oct':
        decimal = int(str_to_conv, 8)
    # Set up dict to track all conversions
    encodings = {"decimal": decimal}

    # Convert to binary and inverse
    encodings["binary"] = format(encodings["decimal"], '08b')
    encodings["binary_inv"] = ''.join('1' if x == '0' else '0' for x in encodings["binary"])

    # Convert to decimal inverse
    encodings["decimal_inv"] = int(encodings["binary_inv"], 2)

    # Convert to hexadecimal and inverse
    encodings["hex"] = format(encodings["decimal"],'02x').upper()
    encodings["hex_inv"] = format(encodings["decimal_inv"],'02x').upper()

    # Convert to octal and inverse
    encodings["octal"] = format(encodings["decimal"],'03o').upper()
    encodings["octal_inv"] = format(encodings["decimal_inv"],'03o').upper()
    return encodings

encodings = convert(sys.argv[1], sys.argv[2])
print(encodings)

We first want to do the binary conversion because inverting a binary string is the easiest way to then convert that back to the decimal inverse which we can use for all other inverse conversions. Let's break down what's happening.

The format() function takes a value, like encodings["decimal"], and a format spec string that describes the base and the padding. Let's use the encodings["binary"] line as an example. We're passing it encodings["decimal"]; the lowercase 'b' converts to binary and the '08' part zero-pads the result to at least eight digits. For example, let's say we called our program with the following arguments:

base_converter.py bin 1111

Binary 1111 is 15 in decimal, but we want to pad it with enough zeros to make it eight bits, so it becomes 00001111. Likewise, base_converter.py bin 1 would output 00000001 and base_converter.py bin 101 would output 00000101.
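To see the padding in isolation, here is a minimal sketch (the helper name to_bits is mine, not part of the tutorial program). Note that the width is a minimum: a value that needs more than eight bits is not truncated.

```python
def to_bits(value, width=8):
    """Return value as a binary string, zero-padded to at least width digits."""
    return format(value, f"0{width}b")


for n in (1, 5, 15, 300):
    print(n, "->", to_bits(n))
```

The last value, 300, needs nine bits, so it comes back as a nine-character string.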

Let's look at how we invert the binary conversion next.

''.join('1' if x == '0' else '0' for x in encodings["binary"])

Here we're using a generator expression (a close cousin of the list comprehension) to iterate over each bit in the encodings["binary"] string: if it's a 0, change it to a 1, and vice versa. If you're not familiar with list comprehensions, check out this tutorial. The join() method is a neat way to take a sequence of strings and put it back together as a single string. In this case "".join() is joining the bits back together with no separator. But let's say you wanted to separate each bit with a dash: you'd use "-".join(). Or if you wanted to separate them with a space, a dash and a space, you could do " - ".join().

Let's look at the decimal_inv line now. We're using the int() function to take the encodings["binary_inv"] string and telling it to convert from base 2 (binary) into an int. This is identical to what we're doing at the top of the function to convert whatever input we received into a decimal as our starting point.
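For single-byte values, the string flip gives the same answer as XOR-ing the number with 0xFF. The following sketch compares the two; invert_bits and invert_byte are names I made up for illustration, and the XOR shortcut assumes the value fits in eight bits.

```python
def invert_bits(bits):
    """Flip every character of a string of 0s and 1s."""
    return "".join("1" if b == "0" else "0" for b in bits)


def invert_byte(value):
    """Flip the low eight bits of an integer (assumes 0 <= value <= 255)."""
    return value ^ 0xFF


value = 15
flipped = invert_bits(format(value, "08b"))
print(flipped, int(flipped, 2), invert_byte(value))
```

The string approach has the advantage of also working for wider values, since it flips however many bits format() produced.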

On the encodings["hex"] line we're converting to x (hexadecimal) with a format of 02. So if we call our program with something like base_converter.py dec 2, we get 02, and with base_converter.py dec 255, format() gives us ff. Notice the lowercase, which is why we're using the string method .upper() to convert ff to FF. We'll skip over the octal line as it's pretty much the same as the hex line, except that the .upper() call has no effect there because octal digits can only be 0-7 (no letters).

Alright, so if you run the program now, you'll see the output contains decimal, binary, binary_inv, decimal_inv, hex, hex_inv, octal and octal_inv. Now let's tackle some more difficult conversions with characters. The conversion itself is simple enough, not much different from what we've already done. The trickiness comes from the fact that there are a number of unprintable characters that will either display nothing or, worse, a line feed or backspace, which will mess up your printed output.
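As an aside, format() can produce uppercase hex directly if you use a capital X in the format spec, which would make the .upper() call unnecessary. A small sketch, with helper names (to_hex, to_octal) that are my own and not part of the tutorial program:

```python
def to_hex(value):
    """Uppercase hex, zero-padded to at least two digits."""
    return format(value, "02X")


def to_octal(value):
    """Octal, zero-padded to at least three digits."""
    return format(value, "03o")


for n in (2, 15, 255):
    print(n, to_hex(n), to_octal(n))
```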

Let's just look at the basic code of doing a char conversion and then add code to fix issues.

encodings["char"] = chr(encodings["decimal"])

Simple, right? We're using the chr() function to return the character for the ordinal value in encodings["decimal"]. Now the messy part. Let's look at an ASCII chart. This one is nice because it shows you all the conversions we're doing here in an easy-to-read table. Let's assume you call the program with base_converter.py dec 10. Looking at the chart, we see that is a line feed (like hitting Enter, at least on *nix).

Let's change our code a bit on how we're printing out our encodings and add in the char encoding.

import sys

def convert(conv_type, str_to_conv):
    # Convert input(s) into decimal as a starting place for all encodings
    if conv_type == 'bin':
        decimal = int(str_to_conv, 2)
    elif conv_type == 'bcd':
        decimal = int(str_to_conv)
    elif conv_type == 'chr':
        decimal = ord(str_to_conv)
    elif conv_type == 'dec':
        decimal = int(str_to_conv)
    elif conv_type == 'hex':
        decimal = int(str_to_conv, 16)
    elif conv_type == 'oct':
        decimal = int(str_to_conv, 8)
    # Set up dict to track all conversions
    encodings = {"decimal": decimal}

    # Convert to binary and inverse
    encodings["binary"] = format(encodings["decimal"], '08b')
    encodings["binary_inv"] = ''.join('1' if x == '0' else '0' for x in encodings["binary"])

    # Convert to decimal inverse
    encodings["decimal_inv"] = int(encodings["binary_inv"], 2)

    # Convert to hexadecimal and inverse
    encodings["hex"] = format(encodings["decimal"],'02x').upper()
    encodings["hex_inv"] = format(encodings["decimal_inv"],'02x').upper()

    # Convert to octal and inverse
    encodings["octal"] = format(encodings["decimal"],'03o').upper()
    encodings["octal_inv"] = format(encodings["decimal_inv"],'03o').upper()

    # Convert to ASCII char
    encodings["char"] = chr(encodings["decimal"])

    return encodings

encodings = convert(sys.argv[1], sys.argv[2])

for k, v in encodings.items():
    print(f'{k.upper()}: {v}')

We've added the char encoding at the end of our function, and changed our print to iterate through our dictionary and print each key / value pair with pairs separated on each line. 

Go ahead and run it now with base_converter.py dec 33. Notice on the last line of the output, you have CHAR: !. Now run it with base_converter.py dec 10 and you'll notice that not only is there nothing next to CHAR:, but there's an extra blank line below CHAR:. So how do we fix this? Glad you asked. Let's take a look at the below snippet for the char conversion.

if encodings["decimal"] in range(33, 127) or encodings["decimal"] in range(161,256):
    encodings["char"] = chr(encodings["decimal"])

We're going to use an if statement to determine whether encodings["decimal"] is between decimal 33 (inclusive) and 127 (not inclusive) or that it's between 161 (inclusive) and 256 (not inclusive). Let's look back at the ASCII chart.

We're going to skip converting anything to char that is less than decimal 33 or greater than decimal 126, and in the extended ASCII range (128-255) we skip everything from 128 to 160 and include everything from 161 to 255. So now that we've excluded a bunch of stuff, it might be nice to show that we couldn't print something on the CHAR: line. How about we use 'xxx' to show we couldn't print that character? We can do that with a simple else statement.

if encodings["decimal"] in range(33, 127) or encodings["decimal"] in range(161,256):
    encodings["char"] = chr(encodings["decimal"])
else:
    encodings["char"] = 'xxx'

Now when we run our program with base_converter.py dec 7, we'll get CHAR: xxx in the output.

Great, right? We've removed all the non-printable characters. Or have we? There's decimal 173, which produces an invisible character called a soft hyphen, and it sits in the middle of our range(161, 256). We have a couple of ways to handle this. We can put in another if statement to take care of this one special case, or we can add an and condition to our original if statement alongside the or. Let's do the latter so it looks like this:

# Convert to char and replace unprintable chars
if (encodings["decimal"] in range(33, 127) or encodings["decimal"] in range(161,256)) and encodings["decimal"] != 173:
    encodings["char"] = chr(encodings["decimal"])
else:
    encodings["char"] = 'xxx'
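If hard-coding the ranges feels brittle, Python's own str.isprintable() and str.isspace() can make the same decision. This is only a sketch of an alternative, not what the tutorial program does, and safe_char is a name I made up. For values 0 to 255 it should agree with the range checks above; for larger code points it behaves differently, because it will return the character where the range check falls back to 'xxx'.

```python
def safe_char(code_point, placeholder="xxx"):
    """Return the character for code_point, or placeholder if it would not
    show up as a visible glyph (control chars, spaces, soft hyphen, ...)."""
    ch = chr(code_point)
    if ch.isprintable() and not ch.isspace():
        return ch
    return placeholder


for n in (7, 10, 32, 33, 126, 127, 160, 161, 173, 255):
    print(n, safe_char(n))
```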

Now that we have the char conversions taken care of, let's tackle Binary Coded Decimal (BCD). To refresh your memory, and mine, BCD stores each decimal digit (0-9) as its own four-bit binary group. So, for the decimal 5, the BCD would be 0101, and 10 would be 0001 0000.

What has to happen here? Let's say we run our program with base_converter.py dec 255. Since BCD is a binary representation of each digit, we need to break the number into its digits (2, 5, 5), convert each digit to binary, and then put those binary groups back together separated by a space. We're going to revisit our friends from the binary inversion: the comprehension and the join() function. Let's take a look at the code.

encodings["bcd"] = " ".join(format(int(x), '04b') for x in str(encodings["decimal"]))
encodings["bcd_inv"] = " ".join(format(int(x), '04b') for x in str(encodings["decimal_inv"]))

Let's break down the encodings["bcd"] line.

We already know that format(int(x), '04b') is going to take x and convert to binary with four places.

We know that " ".join() is going to take whatever is inside and put it back together separated by a space.  

And we're left with the comprehension part: do something for each digit in encodings["decimal"], which we need to convert to a string first because you can't iterate over an int.
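The convert() function reads a 'bcd' input with a plain int(), so it treats the input as ordinary decimal digits rather than as four-bit groups. If you want to go the other way and turn real BCD back into a number, something like the following sketch could work. The names to_bcd and from_bcd are mine, and I'm assuming the groups are separated by whitespace, as in the output above.

```python
def to_bcd(number):
    """Encode each decimal digit of number as a four-bit group."""
    return " ".join(format(int(d), "04b") for d in str(number))


def from_bcd(bcd):
    """Decode whitespace-separated four-bit groups back into an int."""
    digits = []
    for group in bcd.split():
        value = int(group, 2)
        # A valid BCD group is exactly four bits and encodes 0-9.
        if len(group) != 4 or value > 9:
            raise ValueError(f"not a BCD digit: {group!r}")
        digits.append(str(value))
    return int("".join(digits))


encoded = to_bcd(255)
print(encoded, "->", from_bcd(encoded))
```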

Ok, let's put all the code back together and test it out.

import sys

def convert(conv_type, str_to_conv):
    # Convert input(s) into decimal as a starting place for all encodings
    if conv_type == 'bin':
        decimal = int(str_to_conv, 2)
    elif conv_type == 'bcd':
        decimal = int(str_to_conv)
    elif conv_type == 'chr':
        decimal = ord(str_to_conv)
    elif conv_type == 'dec':
        decimal = int(str_to_conv)
    elif conv_type == 'hex':
        decimal = int(str_to_conv, 16)
    elif conv_type == 'oct':
        decimal = int(str_to_conv, 8)
    # Set up dict to track all conversions
    encodings = {"decimal": decimal}

    # Convert to binary and inverse
    encodings["binary"] = format(encodings["decimal"], '08b')
    encodings["binary_inv"] = ''.join('1' if x == '0' else '0' for x in encodings["binary"])

    # Convert to decimal inverse
    encodings["decimal_inv"] = int(encodings["binary_inv"], 2)

    # Convert to hexadecimal and inverse
    encodings["hex"] = format(encodings["decimal"],'02x').upper()
    encodings["hex_inv"] = format(encodings["decimal_inv"],'02x').upper()

    # Convert to octal and inverse
    encodings["octal"] = format(encodings["decimal"],'03o').upper()
    encodings["octal_inv"] = format(encodings["decimal_inv"],'03o').upper()

    # Convert to char and replace unprintable chars
    if (encodings["decimal"] in range(33, 127) or encodings["decimal"] in range(161,256)) and encodings["decimal"] != 173:
        encodings["char"] = chr(encodings["decimal"])
    else:
        encodings["char"] = 'xxx'

    # Convert to BCD
    encodings["bcd"] = " ".join(format(int(x), '04b') for x in str(encodings["decimal"]))
    encodings["bcd_inv"] = " ".join(format(int(x), '04b') for x in str(encodings["decimal_inv"]))

    return encodings

encodings = convert(sys.argv[1], sys.argv[2])

for k, v in encodings.items():
    print(f'{k.upper()}: {v}')


So the only thing we have left now is to do a few other conversions, like base64. But these conversions we want to do on the original input, not on the values we converted it to. That is, if you run the program with base_converter.py bin 10110110, we want the base64 of the string 10110110, not of the decimal we converted it to. Or maybe you do, but I'll let you work that one out on your own.

To do these conversions, we'll need to import the base64 module. Let's take a look at the new code to do base16, base32, base64 and base85 encoding.

# Base16 conversion of input
encodings["b16"] = base64.b16encode(str_to_conv.encode('utf-8')).decode('utf-8')

# Base32 conversion of input
encodings["b32"] = base64.b32encode(str_to_conv.encode('utf-8')).decode('utf-8')

# Base64 conversion of input
encodings["b64"] = base64.b64encode(str_to_conv.encode('utf-8')).decode('utf-8')

# Base85 (Ascii85 variant) conversion of input
encodings["b85"] = base64.a85encode(str_to_conv.encode('utf-8')).decode('utf-8')

Nothing too crazy here: for each encoding we're calling base64.xxxencode depending upon which encoding we want. So what's up with the encode('utf-8') and decode('utf-8') stuff? Well, the base64 functions work on bytes-like objects rather than str, so we encode the input string to bytes for base64 to process, then decode the result back to a str so it's clean and human readable. You could choose to do encode('ascii') instead, but that fails on any non-ASCII character, whereas UTF-8 can represent every Unicode character, which covers pretty much any character in the world. Note that a85encode produces the Ascii85 flavor of base85; the module also has b85encode, which uses a different alphabet.
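A quick way to convince yourself the encode/decode dance is right is to round-trip a string through each encoder and its matching decoder. The sketch below does that; encode_all, DECODERS and round_trips are names I made up, and I've added b85encode next to a85encode so the two base85 flavors can be compared.

```python
import base64


def encode_all(text):
    """Encode text with each base64-module encoder, returning str values."""
    raw = text.encode("utf-8")  # the encoders need bytes, not str
    return {
        "b16": base64.b16encode(raw).decode("ascii"),
        "b32": base64.b32encode(raw).decode("ascii"),
        "b64": base64.b64encode(raw).decode("ascii"),
        "a85": base64.a85encode(raw).decode("ascii"),
        "b85": base64.b85encode(raw).decode("ascii"),
    }


# Matching decoder for each key used in encode_all().
DECODERS = {
    "b16": base64.b16decode,
    "b32": base64.b32decode,
    "b64": base64.b64decode,
    "a85": base64.a85decode,
    "b85": base64.b85decode,
}


def round_trips(text):
    """Return True if every encoding decodes back to the original text."""
    return all(
        DECODERS[name](value.encode("ascii")).decode("utf-8") == text
        for name, value in encode_all(text).items()
    )


for name, value in encode_all("FE").items():
    print(f"{name}: {value}")
print(round_trips("FE"), round_trips("é"))
```

The encoded output is always plain ASCII, which is why decoding it with 'ascii' or 'utf-8' gives the same result.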

Ok, so let's put the whole thing together.

import base64
import sys

def convert(conv_type, str_to_conv):
    # Convert input(s) into decimal as a starting place for all encodings
    if conv_type == 'bin':
        decimal = int(str_to_conv, 2)
    elif conv_type == 'bcd':
        decimal = int(str_to_conv)
    elif conv_type == 'chr':
        decimal = ord(str_to_conv)
    elif conv_type == 'dec':
        decimal = int(str_to_conv)
    elif conv_type == 'hex':
        decimal = int(str_to_conv, 16)
    elif conv_type == 'oct':
        decimal = int(str_to_conv, 8)
    # Set up dict to track all conversions
    encodings = {"decimal": decimal}

    # Convert to binary and inverse
    encodings["binary"] = format(encodings["decimal"], '08b')
    encodings["binary_inv"] = ''.join('1' if x == '0' else '0' for x in encodings["binary"])

    # Convert to decimal inverse
    encodings["decimal_inv"] = int(encodings["binary_inv"], 2)

    # Convert to hexadecimal and inverse
    encodings["hex"] = format(encodings["decimal"],'02x').upper()
    encodings["hex_inv"] = format(encodings["decimal_inv"],'02x').upper()

    # Convert to octal and inverse
    encodings["octal"] = format(encodings["decimal"],'03o').upper()
    encodings["octal_inv"] = format(encodings["decimal_inv"],'03o').upper()

    # Convert to char and replace unprintable chars
    if (encodings["decimal"] in range(33, 127) or encodings["decimal"] in range(161,256)) and encodings["decimal"] != 173:
        encodings["char"] = chr(encodings["decimal"])
    else:
        encodings["char"] = 'xxx'

    # Convert to BCD
    encodings["bcd"] = " ".join(format(int(x), '04b') for x in str(encodings["decimal"]))
    encodings["bcd_inv"] = " ".join(format(int(x), '04b') for x in str(encodings["decimal_inv"]))

    # Base16 conversion of input
    encodings["b16"] = base64.b16encode(str_to_conv.encode('utf-8')).decode('utf-8')

    # Base32 conversion of input
    encodings["b32"] = base64.b32encode(str_to_conv.encode('utf-8')).decode('utf-8')

    # Base64 conversion of input
    encodings["b64"] = base64.b64encode(str_to_conv.encode('utf-8')).decode('utf-8')

    # Base85 (Ascii85 variant) conversion of input
    encodings["b85"] = base64.a85encode(str_to_conv.encode('utf-8')).decode('utf-8')
    return encodings


encodings = convert(sys.argv[1], sys.argv[2])

for k, v in encodings.items():
    print(f'{k.upper()}: {v}')

And there you have it, your first secret encoder / decoder ring.

If you want to take this further, here are some ideas.

  • Accept multiple inputs, like hex 'FF 07 80' or dec '21 56' or a whole word.
  • Add ROT-13, or even better, let the user input how many characters they want to shift by. Like, rot cattle and it will ask how many to shift by.
  • Create hash values like MD5, SHA1, SHA256. Take a look at the hashlib module.
  • Roman numeral conversions.
  • If you really want to take this to the next level, convert braille, morse code or upside down text. Hint, you'll want to create dictionaries like {'A': '.-'} for morse, and {'G': '⠛'} for braille.
  • And finally, if you want a real challenge convert to and from pig-latin. You're gonna lose some hair, I promise. Dealing with punctuation, legal consonant pairs etc.

