The string:
49276d206b696c6c696e6720796f757220627261696e206c696b65206120706f69736f6e6f7573206d757368726f6f6d
Should produce:
SSdtIGtpbGxpbmcgeW91ciBicmFpbiBsaWtlIGEgcG9pc29ub3VzIG11c2hyb29t
So go ahead and make that happen. You'll need to use this code for the rest of the exercises.
Cryptopals Rule¶
Always operate on raw bytes, never on encoded strings. Only use hex and base64 for pretty-printing.
Base64 is a way of encoding any bytes into a printable string. Like hexadecimal encoding, but more efficient since one character in base64 encodes 6 bits ($ 2^6 = 64 $) when one character in hexadecimal encodes 4 bits ($ 2^4 = 16 $). For more information see Wikipedia:Base64.
What is nice with Python is that it has tools in the standard library to manage both hexadecimal and base64, so going from one to the other is easy.
hex_string = '49276d206b696c6c696e6720796f757220627261696e206c696b65206120706f69736f6e6f7573206d757368726f6f6d'
# I wrote a simple function that will display a nice "OK" or "ERROR"
# depending on the test given as input
from libmatasano import html_test
# https://docs.python.org/3/library/binascii.html#binascii.unhexlify
from binascii import hexlify, unhexlify
# https://docs.python.org/3/library/base64.html
from base64 import b64encode, b64decode
b64_string = b64encode(unhexlify(hex_string))
html_test(b64_string == b'SSdtIGtpbGxpbmcgeW91ciBicmFpbiBsaWtlIGEgcG9pc29ub3VzIG11c2hyb29t')
Easy ! Next challenge.
Write a function that takes two equal-length buffers and produces their XOR combination.
If your function works properly, then when you feed it the string:
1c0111001f010100061a024b53535009181c
... after hex decoding, and when XOR'd against:
686974207468652062756c6c277320657965
... should produce:
746865206b696420646f6e277420706c6179
"XOR" means the "Exclusive OR" operation: Wikipedia:Exclusive_or. Here we will "XOR" bitstrings together bit-by-bit ("bitwise XOR").
Python does have a builtin operator for bitwise XOR,
it's the ^
operator,
but it operates on integers.
# the builtin "bin" function give the binary representation of a number
# this makes the effect of bitwise XOR more seeable.
bin(123), bin(127), bin(123 ^ 127)
We will mostly be working on bytes so we need the same function but that takes (and returns) bytes.
def bxor(a, b):
"bitwise XOR of bytestrings"
return bytes([ x^y for (x,y) in zip(a, b)])
A = unhexlify('1c0111001f010100061a024b53535009181c')
B = unhexlify('686974207468652062756c6c277320657965')
expected_result = unhexlify('746865206b696420646f6e277420706c6179')
html_test(bxor(A,B) == expected_result)
The hex encoded string:
1b37373331363f78151b7f2b783431333d78397828372d363c78373e783a393b3736
... has been XOR'd against a single character. Find the key, decrypt the message.
You can do this by hand. But don't: write code to do it for you.
How? Devise some method for "scoring" a piece of English plaintext. Character frequency is a good metric. Evaluate each output and choose the one with the best score.
What they mean here is that they took one byte (one byte is 8 bits, like 01111011 for instance that represents the number 123), and XORed each byte of the message against this "key" byte.
An example:
# our message
msg = b'hi there!'
# let's pick a random number between 0 and 255 for our key; what about 77?
key = b'\x77'
# the byte string we will XOR the message against is usually called the "keystream"
keystream = key*len(msg)
# here is our "ciphertext"
bxor(msg, keystream)
Note that this method of encryption is equivalent to the famous "caesar cipher" (see Wikipedia:Caesar_cipher) that is presented in every introduction to cryptography. It's ust that this time we are working with bytes instead of letters and that the operation is XOR instead of rotation in the alphabet (both operations are easily computable permutations in their respective domain, which is why it works)
Usually you hear about breaking the caesar cipher by analysing the frequency of letters in the ciphertext and matching it with the frequency of letters in the alphabet.
For the caesar cipher, and the similar one that we are attacking in this challenge, there is a simpler attack: try every single key.
This is because the number of possible keys is very small (256 possibilities in our case, 26 in the "classical" caesar cipher with letters, compared to $ 2^{128} ≈ 3.4 × 10^{38} $ for AES-128 that we will use later in these challenges).
Here is, then, our (very simple) strategy to recover the message: we decrypt the message using every possible key, and we see what message we get each time. The proper key should be the only one giving a meaningful message as output (the rest should give a meaningless mix of bytes).
We then need to know how to decrypt with this cipher given the key that was used for encryption (or simply the key that we want to try).
With the XOR operation, the decryption operation is simply the same as the encryption one. This is because of the following property of the XOR operation: XORing twice by the same value gives back the original value. That is, for all bytes (or bits, or bytestrings) $ A $ and $ B $, we have:
$$ ( A \oplus B ) \oplus B = A $$Where $ \oplus $ denotes the XOR operation.
That is, to decrypt you just have to re-generate the keystream and XOR the ciphertext against it.
# this is the ciphertext we are supposed to work on
ciphertext = unhexlify('1b37373331363f78151b7f2b783431333d78397828372d363c78373e783a393b3736')
# we said we will try every byte as the key, so let's start with 1
# (zero as a key would be uninteresting)
candidate_key = bytes([1])
keystream = candidate_key*len(ciphertext)
# that's our candidate keystream:
keystream
bxor(ciphertext, keystream)
We can see that using 1
as the key does not give a meaningful message,
meaning that the key is not 1
.
But we're not going to check manually every possible key
(it goes up to 255),
we need an automated test that tells us if the key is the good one or not.
The message is supposed to be english text, so it should have almost only letters.
The isalpha
builtin in python would be interesting
but it only detects if a bytestring contains only alphabetical characters
(see https://docs.python.org/3/library/stdtypes.html#bytes.isalpha)
(b'lol').isalpha()
# the spaces and the question tag are not "alphabetical characters"
(b'how are you?').isalpha()
What we want is the ratio between letters and other symbols:
It will be high for text and low otherwise.
In ASCII encoding, letters a
to z
correspond to numbers 97 to 122.
We can see that with the ord
built-in function:
ord('a'), ord('b'), ord('z')
The "space" character is also very frequent in text, might as well count it too. This will increase accuracy and will prove useful later (Challenge 6).
ord(' ')
ascii_text_chars = list(range(97, 122)) + [32]
[ x in ascii_text_chars for x in b'how are you?']
sum([ x in ascii_text_chars for x in b'how are you?'])
def letter_ratio(input_bytes):
nb_letters = sum([ x in ascii_text_chars for x in input_bytes])
return nb_letters / len(input_bytes)
def is_probably_text(input_bytes):
r = letter_ratio(input_bytes)
return True if r>0.7 else False
is_probably_text(b'Hello, how are you?')
is_probably_text(b'\x1e22643:}\x10\x1ez.}1468}<}-2(39}2;}?<>23')
def attack_single_byte_xor(ciphertext):
# a variable to keep track of the best candidate so far
best = None
for i in range(2**8): # for every possible key
# converting the key from a number to a byte
candidate_key = i.to_bytes(1, byteorder='big')
keystream = candidate_key*len(ciphertext)
candidate_message = bxor(ciphertext, keystream)
nb_letters = sum([ x in ascii_text_chars for x in candidate_message])
# if the obtained message has more letters than any other candidate before
if best == None or nb_letters > best['nb_letters']:
# store the current key and message as our best candidate so far
best = {"message": candidate_message, 'nb_letters': nb_letters, 'key': candidate_key}
return best
result = attack_single_byte_xor(ciphertext)
print('key:', result['key'])
print('message:', result['message'])
print('nb of letters:', result['nb_letters'])
html_test(is_probably_text(result['message']))
What an interesting message we have here...
We modify our previous function so that it returns an error in case the ratio of letters and spaces is too low to be a properly decrypted message:
class InvalidMessageException(Exception):
pass
def attack_single_byte_xor(ciphertext):
best = {"nb_letters": 0}
for i in range(2**8):
candidate_key = i.to_bytes(1, byteorder='big')
candidate_message = bxor(ciphertext, candidate_key*len(ciphertext))
nb_letters = sum([ x in ascii_text_chars for x in candidate_message])
if nb_letters>best['nb_letters']:
best = {"message": candidate_message, 'nb_letters': nb_letters, 'key': candidate_key}
if best['nb_letters'] > 0.7*len(ciphertext):
return best
else:
raise InvalidMessageException('best candidate message is: %s' % best['message'])
Let's try if it works:
from os import urandom
try:
# Instead of giving a real encrypted message we give random bytes
attack_single_byte_xor(urandom(16))
except InvalidMessageException:
print('Got an InvalidMessageException as expected')
else:
print('No exception: something is wrong')
with open('data/04.txt') as data_file:
ciphertext_list = [
# the 'strip' is to remove the "newline" character
# which python keeps when reading a file line by line
unhexlify(line.strip())
for line in data_file
]
candidates = list()
# for the "enumerate" builtin function, see
# https://docs.python.org/3/library/functions.html#enumerate
for (line_nb, ciphertext) in enumerate(ciphertext_list):
try:
message = attack_single_byte_xor(ciphertext)['message']
except InvalidMessageException:
pass
else:
candidates.append({
'line_nb': line_nb,
'ciphertext': ciphertext,
'message': message
})
if len(candidates) > 1:
print("Error: more than one candidate")
html_test(false)
else:
for (key, value) in candidates[0].items():
print(f'{key}: {value}')
html_test(is_probably_text(candidates[0]['message']))
Here is the opening stanza of an important work of the English language:
Burning 'em, if you ain't quick and nimble I go crazy when I hear a cymbal
Encrypt it, under the key "ICE", using repeating-key XOR.
In repeating-key XOR, you'll sequentially apply each byte of the key; the first byte of plaintext will be XOR'd against I, the next C, the next E, then I again for the 4th byte, and so on.
It should come out to:
0b3637272a2b2e63622c2e69692a23693a2a3c6324202d623d63343c2a26226324272765272 a282b2f20430a652e2c652a3124333a653e2b2027630c692b20283165286326302e27282f
Encrypt a bunch of stuff using your repeating-key XOR function. Encrypt your mail. Encrypt your password file. Your .sig file. Get a feel for it. I promise, we aren't wasting your time with this.
message = b"Burning 'em, if you ain't quick and nimble\nI go crazy when I hear a cymbal"
key = b'ICE'
# We have to repeat the key until the keystream is at least as long as the message
# our "bxor" function gives an output as long as the shortest input
# so the output will be as long as the message here
keystream = key*(len(message)//len(key) + 1)
ciphertext = bxor(message, keystream)
expected_result = unhexlify(
b'0b3637272a2b2e63622c2e69692a23693a2a3c6324202d623d6'
b'3343c2a26226324272765272a282b2f20430a652e2c652a3124'
b'333a653e2b2027630c692b20283165286326302e27282f'
)
html_test(ciphertext == expected_result)
There's a file here. It's been base64'd after being encrypted with repeating-key XOR.
Decrypt it.
Here's how:
Let KEYSIZE be the guessed length of the key; try values from 2 to (say) 40.
Write a function to compute the edit distance/Hamming distance between two strings.
For each KEYSIZE, take the first KEYSIZE worth of bytes, and the second KEYSIZE worth of bytes, and find the edit distance between them. Normalize this result by dividing by KEYSIZE.
The KEYSIZE with the smallest normalized edit distance is probably the key. You could proceed perhaps with the smallest 2-3 KEYSIZE values. Or take 4 KEYSIZE blocks instead of 2 and average the distances.
Now that you probably know the KEYSIZE: break the ciphertext into blocks of KEYSIZE length.
Now transpose the blocks: make a block that is the first byte of every block, and a block that is the second byte of every block, and so on.
Solve each block as if it was single-character XOR. You already have code to do this. For each block, the single-byte XOR key that produces the best looking histogram is the repeating-key XOR key byte for that block. Put them together and you have the key.
This code is going to turn out to be surprisingly useful later on. Breaking repeating-key XOR ("Vigenere") statistically is obviously an academic exercise, a "Crypto 101" thing. But more people "know how" to break it than can actually break it, and a similar technique breaks something much more important.
Here is an illustration of what we are told to do:
Here is the idea:
First we have to find out the lenght of the key. This is the purpose of the steps using the "edit distance".
When we have the key length, we are able to group together ciphertext bytes that share the same key byte.
This result in "blocks" each being a "single-character XOR" which we know how to break since challenge 3.
The most complicated part is probably then the technique to find the key lenght. Why it works is not obvious (at least not to me?), but how it's done is given step by step so we can just follow these steps.
# the bitwise-xor of two strings will have ones
# only in places where the two strings have different bits
# hence, computing the hamming distance of the two strings
# is equivalent to counting the numbers of one (the "hamming weight") of their XOR sum
def hamming_distance(a, b):
return sum(bin(byte).count('1') for byte in bxor(a,b))
with open("data/06.txt") as file:
ciphertext = b64decode(file.read())
def score_vigenere_key_size(candidate_key_size, ciphertext):
# as suggested in the instructions,
# we take samples bigger than just one time the candidate key size
slice_size = 2*candidate_key_size
# the number of samples we can make
# given the ciphertext length
nb_measurements = len(ciphertext) // slice_size - 1
# the "score" will represent how likely it is
# that the current candidate key size is the good one
# (the lower the score the *more* likely)
score = 0
for i in range(nb_measurements):
s = slice_size
k = candidate_key_size
# in python, "slices" objects are what you put in square brackets
# to access elements in lists and other iterable objects.
# see https://docs.python.org/3/library/functions.html#slice
# here we build the slices separately
# just to have a cleaner, easier to read code
slice_1 = slice(i*s, i*s + k)
slice_2 = slice(i*s + k, i*s + 2*k)
score += hamming_distance(ciphertext[slice_1], ciphertext[slice_2])
# normalization: do not forget this
# or there will be a strong biais towards long key sizes
# and your code will not detect key size properly
score /= candidate_key_size
# some more normalization,
# to make sure each candidate is evaluated in the same way
score /= nb_measurements
return score
def find_vigenere_key_length(ciphertext, min_length=2, max_length=30):
# maybe this code is a bit over-sophisticated
# it just outputs the key size for wich
# the score at the "score_vigenere_key_size" function is the *lowest*
key = lambda x: score_vigenere_key_size(x,ciphertext)
return min(range(min_length, max_length), key=key)
def attack_repeating_key_xor(ciphertext):
keysize = find_vigenere_key_length(ciphertext)
# we break encryption for each character of the key
key = bytes()
message_parts = list()
for i in range(keysize):
# the "i::keysize" slice accesses elements in an array
# starting at index 'i' and using a step of 'keysize'
# this gives us a block of "single-character XOR" (see figure above)
part = attack_single_byte_xor(bytes(ciphertext[i::keysize]))
key += part["key"]
message_parts.append(part["message"])
# then we rebuild the original message
# by putting bytes back in the proper order
# TODO again code may be over-sophisticated and not very readable here
message = bytes()
for i in range(max(map(len, message_parts))):
message += bytes([part[i] for part in message_parts if len(part)>=i+1])
return {'message':message, 'key':key}
result = attack_repeating_key_xor(ciphertext)
print("key:",result["key"],'\n')
print('message:\n')
print(result["message"].decode())
The Base64-encoded content in this file has been encrypted via AES-128 in ECB mode under the key
"YELLOW SUBMARINE".
(case-sensitive, without the quotes; exactly 16 characters; I like "YELLOW SUBMARINE" because it's exactly 16 bytes long, and now you do too).
Decrypt it. You know the key, after all.
Easiest way: use OpenSSL::Cipher and give it AES-128-ECB as the cipher.
We will use the excellent cryptography
Python library (https://cryptography.io/).
The documentation for AES in ECB mode is there:
https://cryptography.io/en/latest/hazmat/primitives/symmetric-encryption/
ECB mode is the most straightforward way to encrypt a long message with a block cipher like AES. It simply consists in splitting the message in parts that have exactly the block lenght, and encrypting each part with the key.
However, and as we are going to see it soon in the next challenges, ECB is completely insecure!.
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes
from cryptography.hazmat.backends import default_backend
backend = default_backend()
def decrypt_aes_128_ecb(ctxt, key):
cipher = Cipher(algorithms.AES(key), modes.ECB(), backend=backend)
decryptor = cipher.decryptor()
decrypted_data = decryptor.update(ctxt) + decryptor.finalize()
# would need some padding stripping actually (we'll see padding later)
message = decrypted_data
return message
with open("data/07.txt") as file:
data = file.read()
print(decrypt_aes_128_ecb(
ctxt = b64decode(data),
key=b"YELLOW SUBMARINE"
).decode())
In this file are a bunch of hex-encoded ciphertexts.
One of them has been encrypted with ECB.
Detect it.
Remember that the problem with ECB is that it is stateless and deterministic; the same 16 byte plaintext block will always produce the same 16 byte ciphertext.
Remember: ECB mode splits the message in blocks of 16 bytes (the block size of the cipher we are using) and encrypts each block using the same key. And AES is a deterministic process: running AES with the same key and same message always gives the same output.
Thus, if there are two identical blocks in the message, there will be two identical blocks in the ciphertext.
In some cases, this suffices for an adversary to recover the message directly from the ciphertext. One famous example is this picture of the Linux logo (a penguin) encrypted with ECB mode. Here it is from Wikipedia:
The long series of white pixels in the background causes a somewhat uniform pattern on the encrypted picture that reveals the silhouette of the penguin.
This is quite an extreme case, and in many situations it's not because the message is encrypted using ECB that you will be able to learn its content instantly just by looking at it. It should be enough, however, to convince you that ECB is insecure as hell: this kind of weakness is not what we expect from a strong encryption mechanism.
But the true purpose of this challenge is to prepare us to mess with ECB encryption to be able to recover any message encrypted under ECB, provided that we have a way to obtain encryptions of messages of our choice.
with open('data/08.txt') as f:
ctxts = [unhexlify(line.strip()) for line in f]
def has_repeated_blocks(ctxt, blocksize=16):
'''blocksize is in bytes'''
if len(ctxt) % blocksize != 0:
raise Exception('ciphertext length is not a multiple of block size')
else:
num_blocks = len(ctxt) // blocksize
blocks = [ctxt[i*blocksize:(i+1)*blocksize] for i in range(num_blocks)]
if len(set(blocks)) != num_blocks:
return True
else:
return False
hits = [ctxt for ctxt in ctxts if has_repeated_blocks(ctxt)]
hits
html_test(len(hits)==1)