
Building wordlists from Forensic Images


Encryption has become widespread, and it’s common to encounter at least a few encrypted files during an investigation. Brute-forcing a password is always an option. However, depending on the type of encryption used, this can take anywhere from a few minutes to centuries on commonly available computer hardware.

Your best bet when trying to gain access to an encrypted file, document or even an entire encrypted volume is a personalized wordlist. In this post, I am going to explain how to generate such a wordlist using the free utility bulk_extractor.


A wordlist is, as its name suggests, a list of words. Most decryption tools support wordlists or even include some standard ones. Many wordlists are freely available on the internet, ranging from common dictionary words in a certain language to specialized lists of video-game characters.

When attacking an encrypted file, your best bet will be a personalized wordlist. One way to get such a list is to extract all words from a drive image. It’s common for people to store their passwords in some way on their drives, or to base their passwords on something familiar (family members, pet names, cars they own). By generating a list of all words stored on a computer system, there is a good chance your list will contain (parts of) the password. The wordlist doesn’t have to contain the exact password used by the person: most decryption tools are able to perform some basic manipulation on the words. For example, if the tool finds the word tiger in the wordlist, it will automatically try Tiger, TIGER, T1ger, T1G3R, Tiger! and so on.
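
To make the idea of word manipulation concrete, here is a minimal Python sketch of the kind of variations a tool might generate from a single entry. The substitution table and suffix list are my own illustrative assumptions, not the rule set of any particular decryption tool, and the sketch only substitutes every matching character at once (so it produces T1g3r, not T1ger).

```python
# Hypothetical character substitutions; real tools ship far larger rule sets.
LEET = str.maketrans("ieaoIEAO", "13401340")


def mangle(word, suffixes=("", "!", "1", "123")):
    """Yield simple variations of one wordlist entry, without duplicates."""
    bases = [word.lower(), word.capitalize(), word.upper()]
    # Add a substituted copy of each case variant.
    bases += [base.translate(LEET) for base in list(bases)]
    seen = set()
    for base in bases:
        for suffix in suffixes:
            candidate = base + suffix
            if candidate not in seen:
                seen.add(candidate)
                yield candidate


variants = list(mangle("tiger"))
print(variants)
```

Even this tiny rule set turns one word into a couple of dozen candidates, which is why a wordlist only needs to contain something close to the real password.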


When it comes to passwords, length beats complexity.


Bulk_extractor is a tool that can scan a forensic disk image, directory or file and extract useful information. This information is stored in text files that can be analyzed further. In this post, I will only use its wordlist-generating capabilities.
You can get bulk_extractor for free from its website or GitHub. I will be using “bulk_extractor64.exe”, the 64-bit Windows build of bulk_extractor version 1.5.2. The main advantage of this version is that it requires no installation. You should be able to follow the guide using any version of bulk_extractor.

Please note: I have renamed bulk_extractor64.exe to bulk_extractor.exe.

Once you have acquired bulk_extractor you can use the following command to generate a wordlist:

bulk_extractor -E wordlist -o E:\Results\ E:\Images\Evidence.001

The command broken down:

bulk_extractor            Runs the tool.
-E wordlist               Only run the wordlist module. (The wordlist module is disabled by default.)
-o E:\Results\            Save the results in this directory.
E:\Images\Evidence.001    The image file we are going to scan.

By default, bulk_extractor only exports words between 6 and 14 characters long. This is a good setting to start with, since bulk_extractor will also generate a lot of “noise”. If you want to change the length of the words it exports, you can use the following -S options:

-S word_min    (default: 6) the minimum word length
-S word_max    (default: 14) the maximum word length
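
As an illustration of how these options fit into the full command, the following Python sketch assembles the argument list and prints it. It assumes the -S options are passed in the name=value form (for example -S word_min=8), which is how I understand bulk_extractor's help output; the build_command helper and the chosen lengths are hypothetical.

```python
def build_command(image, out_dir, word_min=None, word_max=None,
                  exe="bulk_extractor"):
    """Return the bulk_extractor argument list for a wordlist-only scan."""
    args = [exe, "-E", "wordlist"]
    # Assumption: -S takes its setting as name=value.
    if word_min is not None:
        args += ["-S", f"word_min={word_min}"]
    if word_max is not None:
        args += ["-S", f"word_max={word_max}"]
    args += ["-o", out_dir, image]
    return args


cmd = build_command(r"E:\Images\Evidence.001", r"E:\Results",
                    word_min=8, word_max=20)
print(" ".join(cmd))
# To actually start the scan: subprocess.run(cmd, check=True)
```

Raising the minimum length cuts down on noise, at the risk of missing short passwords or short password fragments.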

The time it takes to generate the list varies depending on the size of the image, the contents of the image and the system you are using. Bulk_extractor is I/O and CPU intensive, so using a fast multi-core CPU and an SSD will speed things up noticeably.
On an i7-6700K with an SSD, overall performance averages around 100 MB/sec. In my case, it took 41 seconds to process a 30GB Windows XP image (containing 4GB of data).

E:\Tools\bulk_extractor -E wordlist -o E:\Results/ E:\Images\Evidence.001

bulk_extractor version: 1.5.2
Input file: E:\Images\Evidence.001
Output directory: E:\Results\
Disk Size: 4127195136
Threads: 8
20:01:58 Offset 67MB (1.63%) Done in 0:00:18 at 20:02:16

20:02:22 Offset 30698MB (99.19%) Done in 0:00:00 at 20:02:22
All data are read; waiting for threads to finish…
Time elapsed waiting for 8 threads to finish:
(timeout in 60 min.)
All Threads Finished!
Producer time spent waiting: 10.0013 sec.
Average consumer time spent waiting: 0.894464 sec.
** bulk_extractor is probably CPU bound. **
** Run on a computer with more cores **
** to get better performance. **
MD5 of Disk Image: 3ed8e2a3c123f44842e3e5e7d7841c50
Phase 2. Shutting down scanners
Phase 3. Uniquifying and recombining wordlist
Phase 3. Creating Histograms
Elapsed time: 41.7673 sec.
Total MB processed: 30953
Overall performance: 98.8139 MBytes/sec (12.3517 MBytes/sec/thread)
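
The throughput figure at the end of the output can be sanity-checked by hand. A minimal sketch, assuming bulk_extractor counts 1 MB as 1,000,000 bytes and divides the reported disk size by the elapsed time (this is my reading of the numbers above, not something taken from the tool's documentation):

```python
# Values copied from the output above.
disk_size_bytes = 4127195136
elapsed_seconds = 41.7673
threads = 8

# Assumption: 1 MB = 1,000,000 bytes.
mb_per_sec = disk_size_bytes / 1_000_000 / elapsed_seconds
mb_per_sec_per_thread = mb_per_sec / threads

print(f"{mb_per_sec:.2f} MB/sec overall")
print(f"{mb_per_sec_per_thread:.2f} MB/sec per thread")
```

Under those assumptions the result lands within rounding distance of the reported 98.8139 MBytes/sec and 12.3517 MBytes/sec/thread.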


When bulk_extractor finishes, it will have generated four files in the result directory:

alerts.txt                Any errors/alerts generated during the scan will be stored here.
report.xml                A report stored in XML format containing scan details.
wordlist.txt              A list of all words extracted during the scan.
wordlist_split_xxx.txt    The wordlist without duplicates.

The wordlist_split_xxx.txt file is the list you will want to use when running an attack on an encrypted file or container. It contains all the “words” that were detected during the scan, without any additional data or duplicates.
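
If a scan produces more than one split file, it can be convenient to load them all into a single list. The sketch below assumes the files sit directly in the result directory, match the pattern wordlist_split_*.txt and contain one word per line; load_wordlist is a hypothetical helper, not part of bulk_extractor.

```python
import glob
import os


def load_wordlist(result_dir):
    """Collect unique words from every wordlist_split_*.txt in result_dir."""
    words = []
    seen = set()
    pattern = os.path.join(result_dir, "wordlist_split_*.txt")
    for path in sorted(glob.glob(pattern)):
        # errors="replace" keeps the loop going if a line is not valid UTF-8.
        with open(path, encoding="utf-8", errors="replace") as handle:
            for line in handle:
                word = line.strip()
                if word and word not in seen:
                    seen.add(word)
                    words.append(word)
    return words


if __name__ == "__main__":
    print(len(load_wordlist(r"E:\Results")), "unique words loaded")
```

The list keeps the order in which words first appear, so the result is stable from run to run.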

When you open the list, you will notice that it contains a lot of “noise”: many of the “words” don’t make any sense or aren’t even words to begin with. This is one of the limitations of these tools. There have been attempts to clean these lists up by comparing them to a dictionary and filtering out all the gibberish, but doing this also removes any passwords that the user might have made up. While generated wordlists can be long and contain a lot of noise, they remain a great way to break encryption.
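
The trade-off described above is easy to demonstrate. In this sketch, the tiny DICTIONARY set stands in for a real dictionary, and looks_plausible is a hypothetical, much gentler heuristic of my own (at least three distinct characters and at least one letter); neither comes from bulk_extractor or any existing cleanup tool.

```python
# Stand-in for a real dictionary file.
DICTIONARY = {"tiger", "summer", "password"}


def dictionary_filter(words):
    """Keep only entries that appear in the dictionary."""
    return [w for w in words if w.lower() in DICTIONARY]


def looks_plausible(word):
    """Hypothetical heuristic: some variety and at least one letter."""
    return len(set(word)) >= 3 and any(ch.isalpha() for ch in word)


extracted = ["tiger", "Tr0ub4dor&3", "aaaaaaaa", "=-=-=-=-"]

strict = dictionary_filter(extracted)
gentle = [w for w in extracted if looks_plausible(w)]

print(strict)
print(gentle)
```

The dictionary filter throws away the made-up password along with the gibberish, while the gentler heuristic keeps it at the cost of letting more noise through.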


“Password Strength” by Randall Munroe (XKCD) is licensed under CC BY-NC 2.5