Digital evidence, like any type of evidence, requires a means of identification, a way to prove that what you are presenting as evidence was not modified in any way. The best way to prove that nothing was changed during the investigation is the use of a hash algorithm.
When working with digital evidence, it’s best practice to calculate a hash value before you do anything else. A hash function is a mathematical calculation that maps input of any size to a fixed-length value. Simply said, you take all the data, run the calculation, and the result is your hash value. Commonly used (forensic) hash algorithms are:
- MD5
- SHA1
- SHA2
- RipeMD
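As an illustration of "take all data, do a calculation", here is a minimal sketch using Python's standard `hashlib` module. The helper name `file_hash` and the 1 MiB chunk size are assumptions made for this example, not part of any forensic tool.

```python
import hashlib

def file_hash(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of the file at `path` (hypothetical helper)."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # Read in chunks so a large disk image does not have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Data that is already in memory can be hashed directly.
print(hashlib.md5(b"abc").hexdigest())  # 900150983cd24fb0d6963f7d28e17f72
```

Calling `file_hash("image.dd", "md5")` or `file_hash("image.dd", "sha1")` would give the corresponding digest of a file, where `image.dd` is only a placeholder name.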
Because two different files are expected to produce different hash values, a hash is a convenient way to identify files during an investigation. It’s common for investigators to use the hash of a file to prove that a certain piece of evidence was found on the suspect’s system. For example, imagine a case where you have to screen multiple systems to see if someone has stored any classified files on their hard drives. The fastest way to do this is to generate a list of hash values of the classified files and a list of hash values of the files stored on the employees’ systems. By comparing these lists you can quickly determine whether any of the classified files were stored on those systems.
Since every file has a unique value, the contents of files with the same hash value must be identical. Right?
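The list comparison described above could be sketched roughly as follows. The function names, the choice of SHA-256, and the directory layout are all assumptions for illustration. A real screening would more likely run against a forensic image than a live file system.

```python
import hashlib
from pathlib import Path

def sha256_of(path, chunk_size=1024 * 1024):
    """Hex SHA-256 digest of one file, read in chunks (hypothetical helper)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def find_known_files(known_hashes, search_root):
    """Return the files under `search_root` whose digest is in `known_hashes`."""
    known = set(known_hashes)  # set membership keeps each lookup cheap
    matches = []
    for path in Path(search_root).rglob("*"):
        if path.is_file() and sha256_of(path) in known:
            matches.append(path)
    return matches
```

The set of known hashes would be built once from the classified files, for example with `{sha256_of(p) for p in classified_paths}`, and then reused for every system that is screened. A match is a lead to follow up, not proof on its own.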
In 2004, Xiaoyun Wang and Hongbo Yu of Shandong University in China published an article describing an algorithm that can find two different sequences of 128 bytes with the same MD5 hash. The most famous example is the following pair of 128-byte sequences, shown in hexadecimal:
d1 31 dd 02 c5 e6 ee c4 69 3d 9a 06 98 af f9 5c 2f ca b5 87 12 46 7e ab 40 04 58 3e b8 fb 7f 89 55 ad 34 06 09 f4 b3 02 83 e4 88 83 25 71 41 5a 08 51 25 e8 f7 cd c9 9f d9 1d bd f2 80 37 3c 5b d8 82 3e 31 56 34 8f 5b ae 6d ac d4 36 c9 19 c6 dd 53 e2 b4 87 da 03 fd 02 39 63 06 d2 48 cd a0 e9 9f 33 42 0f 57 7e e8 ce 54 b6 70 80 a8 0d 1e c6 98 21 bc b6 a8 83 93 96 f9 65 2b 6f f7 2a 70
MD5 value: 79054025255fb1a26e4bc422aef54eb4
d1 31 dd 02 c5 e6 ee c4 69 3d 9a 06 98 af f9 5c 2f ca b5 07 12 46 7e ab 40 04 58 3e b8 fb 7f 89 55 ad 34 06 09 f4 b3 02 83 e4 88 83 25 f1 41 5a 08 51 25 e8 f7 cd c9 9f d9 1d bd 72 80 37 3c 5b d8 82 3e 31 56 34 8f 5b ae 6d ac d4 36 c9 19 c6 dd 53 e2 34 87 da 03 fd 02 39 63 06 d2 48 cd a0 e9 9f 33 42 0f 57 7e e8 ce 54 b6 70 80 28 0d 1e c6 98 21 bc b6 a8 83 93 96 f9 65 ab 6f f7 2a 70
MD5 value: 79054025255fb1a26e4bc422aef54eb4
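The two sequences above can be checked with a few lines of Python. This is a small sketch that uses only the standard `hashlib` module. The variable names are arbitrary, and the byte values are copied from the listing above.

```python
import hashlib

block_a = bytes.fromhex(
    "d1 31 dd 02 c5 e6 ee c4 69 3d 9a 06 98 af f9 5c "
    "2f ca b5 87 12 46 7e ab 40 04 58 3e b8 fb 7f 89 "
    "55 ad 34 06 09 f4 b3 02 83 e4 88 83 25 71 41 5a "
    "08 51 25 e8 f7 cd c9 9f d9 1d bd f2 80 37 3c 5b "
    "d8 82 3e 31 56 34 8f 5b ae 6d ac d4 36 c9 19 c6 "
    "dd 53 e2 b4 87 da 03 fd 02 39 63 06 d2 48 cd a0 "
    "e9 9f 33 42 0f 57 7e e8 ce 54 b6 70 80 a8 0d 1e "
    "c6 98 21 bc b6 a8 83 93 96 f9 65 2b 6f f7 2a 70"
)
block_b = bytes.fromhex(
    "d1 31 dd 02 c5 e6 ee c4 69 3d 9a 06 98 af f9 5c "
    "2f ca b5 07 12 46 7e ab 40 04 58 3e b8 fb 7f 89 "
    "55 ad 34 06 09 f4 b3 02 83 e4 88 83 25 f1 41 5a "
    "08 51 25 e8 f7 cd c9 9f d9 1d bd 72 80 37 3c 5b "
    "d8 82 3e 31 56 34 8f 5b ae 6d ac d4 36 c9 19 c6 "
    "dd 53 e2 34 87 da 03 fd 02 39 63 06 d2 48 cd a0 "
    "e9 9f 33 42 0f 57 7e e8 ce 54 b6 70 80 28 0d 1e "
    "c6 98 21 bc b6 a8 83 93 96 f9 65 ab 6f f7 2a 70"
)

# The inputs differ in a handful of bytes...
differing = [i for i, (x, y) in enumerate(zip(block_a, block_b)) if x != y]
print("differing byte offsets:", differing)

# ...yet MD5 cannot tell them apart, while SHA-1 can.
print("MD5 equal:  ", hashlib.md5(block_a).digest() == hashlib.md5(block_b).digest())
print("SHA-1 equal:", hashlib.sha1(block_a).digest() == hashlib.sha1(block_b).digest())
```

The MD5 comparison should report a match and the SHA-1 comparison a mismatch, which is the practical reason for recording a second hash.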
So here we have a potential problem. Two different files can produce the same MD5 hash. Knowing this, we could argue that a matching MD5 hash does not necessarily mean that the contents of two files are identical.
Should you still use MD5?
The reality is that these examples are generated on purpose by exploiting a known weakness in MD5. As far as I know, there has been no case in which two files had the same MD5 value without someone deliberately crafting them. Also, you should never state that two files are identical based purely on a hash value of any kind; you always need to confirm that two files are identical by examining the files themselves.
A hash value is a great way to filter and quickly identify certain files. It’s also a great way to prove a file has not been modified in any way. And this is where things get tricky and why you should always use two hash values.
If you only use MD5 to prove the integrity of your evidence, someone could claim that you modified the evidence and generated an MD5 collision afterward so that the modified evidence has the same MD5 value as the original. Without a secondary hash value such as SHA1, you will have a hard time proving them wrong.
When you are processing evidence, you should always calculate at least two hash functions, e.g. MD5 and SHA1. MD5 is still a valid way to identify files, and combining two functions makes a forged match far less plausible, because an attacker would have to produce one modified file that collides under both functions at the same time, which is far harder than finding a collision in MD5 alone.
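A minimal sketch of the two-hash practice, assuming the original digests were recorded at acquisition time. The function names `md5_and_sha1` and `evidence_unchanged` are hypothetical, and both digests are computed in a single pass over the file.

```python
import hashlib

def md5_and_sha1(path, chunk_size=1024 * 1024):
    """Return (md5_hex, sha1_hex) for the file at `path`, reading it once."""
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            # Feed the same chunk to both algorithms.
            md5.update(chunk)
            sha1.update(chunk)
    return md5.hexdigest(), sha1.hexdigest()

def evidence_unchanged(path, recorded_md5, recorded_sha1):
    """True only if BOTH digests still match the recorded values."""
    md5_now, sha1_now = md5_and_sha1(path)
    return md5_now == recorded_md5.lower() and sha1_now == recorded_sha1.lower()
```

Requiring both comparisons to succeed means that a file crafted to collide under MD5 alone would still be flagged as modified.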