CeWL - Custom Word List generator

Based on a discussion on PaulDotCom about creating custom word lists by spidering a target's website and collecting unique words, I decided to write CeWL, the Custom Word List generator. CeWL is a Ruby app which spiders a given URL to a specified depth, optionally following external links, and returns a list of words which can then be used with password crackers such as John the Ripper.

By default, CeWL sticks to just the site you have specified and will go to a depth of 2 links; this behaviour can be changed by passing arguments. Be careful if setting a large depth and allowing it to go offsite, as you could end up drifting on to a lot of other domains. All words of three characters and over are output to stdout. This minimum length can be increased, and the words can be written to a file rather than the screen so the app can be automated.
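As a rough illustration of the word filtering described above, here is a minimal sketch. The helper name extract_words and the rule that a word is a run of letters are assumptions made for this example; CeWL's own implementation may differ.

```ruby
# Hypothetical helper, not CeWL's actual code: collect the unique words of at
# least min_length characters from a chunk of plain text.
# Assumption: a "word" is a run of alphabetic characters, compared
# case-sensitively.
def extract_words(text, min_length = 3)
  words = text.scan(/[[:alpha:]]+/)
  words.select { |word| word.length >= min_length }.uniq
end

sample = "The cat sat on the mat by a big red barn"
puts extract_words(sample)
```

Passing a larger second argument, for example extract_words(sample, 4), mirrors raising the minimum word length.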

Version 3 of CeWL addresses a problem spotted by Josh Wright. The Spider gem doesn't handle JavaScript redirection URLs; for example, an index page containing just the following:

<script language="JavaScript">
self.location.href =
'http://www.FOO.com/FOO/connect/FOONet/Top+Navigator/Home';
</script>

wasn't spidered because the redirect wasn't picked up. I now scan through a page looking for any lines containing location.href= and then add the given URL to the list of pages to spider.
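The idea can be sketched roughly as follows. This is not the code used in CeWL; the helper name javascript_redirects is hypothetical, and the sketch scans the whole page rather than line by line so that a URL on the line after the equals sign, as in the example above, is still found.

```ruby
# Hypothetical helper: return every quoted URL assigned to location.href.
# \s* also matches newlines, so the URL may sit on the line after the "=".
def javascript_redirects(page)
  page.scan(/location\.href\s*=\s*['"](.*?)['"]/i).flatten
end

page = <<~HTML
  <script language="JavaScript">
  self.location.href =
  'http://www.FOO.com/FOO/connect/FOONet/Top+Navigator/Home';
  </script>
HTML

# Each URL found would then be added to the list of pages to spider.
puts javascript_redirects(page)
```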

Version 2 of CeWL can also create two new lists, a list of email addresses found in mailto links and a list of author/creator names collected from meta data found in documents on the site. It can currently process documents in Office pre 2007, Office 2007 and PDF formats. This user data can then be used to create the list of usernames to be used in association with the password list.
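To give a feel for the email collection, here is a minimal sketch that pulls addresses out of mailto links with a regular expression. The helper name mailto_addresses and the pattern itself are assumptions for illustration only, not the code CeWL uses.

```ruby
# Hypothetical helper: collect the unique addresses found in mailto: links.
# The address is taken to end at a quote, whitespace, ">" or the "?" that
# starts optional parameters such as ?subject=...
def mailto_addresses(page)
  page.scan(/mailto:([^'"?\s>]+)/i).flatten.uniq
end

html = '<a href="mailto:alice@example.com">Alice</a> ' \
       "<a href='mailto:bob@example.com?subject=Hi'>Bob</a>"
puts mailto_addresses(html)
```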

CeWL also has an associated command line app, FAB (Files Already Bagged), which uses the same meta data extraction techniques to create author/creator lists from already downloaded files.

Pronunciation

Seeing as I was asked, CeWL is pronounced "cool".

Download

download cewl version 3.0

Installation

CeWL needs the rubygems package to be installed along with the following gems:

  • hpricot
  • http_configuration
  • mime-types
  • mini_exiftool
  • rubyzip
  • spider

All these gems were available by running gem install xxx as root. The mini_exiftool gem also requires the exiftool application to be installed.

To use the gems you may also need to set the following environment variable:

RUBYOPT="rubygems"

Then just save CeWL to a directory and make it executable.

Usage

cewl [OPTION] ... URL

--help, -h
Show help
--depth x, -d x
The depth to spider to, default 2
--min_word_length, -m
The minimum word length; this strips out all words under the specified length, default 3
--offsite, -o
By default, the spider will only visit the site specified. With this option it will also visit external sites
--write, -w file
Write the output to the file rather than to stdout
--ua, -u user-agent
Change the user agent
-v
Verbose, show debug and extra output
--no-words, -n
Don't output the wordlist
--meta, -a file
Include meta data, optional output file
--email, -e file
Include email addresses, optional output file
--meta-temp-dir directory
The directory used by exiftool when parsing files, the default is /tmp
URL
The site to spider.

If you need to use a proxy server, you will need to uncomment three lines in CeWL, just look through for the comments.

Common Problems

Here are a couple of the common problems people have seen while trying to use CeWL and FAB.

Bug in the Spider gem

While trying to track down why some pages weren't getting spidered I found a bug in the Spider gem. I've reported this to the developer but haven't heard back from him yet so I assume it isn't likely to be fixed any time soon.

The problem is with the regex that spots anchor tags in pages. The regex assumes that the URL will be enclosed in double quotes; however, single quotes are also allowed. The fix for this is a simple change to the regex in the file spider/spider_instance.rb. Search for the following lines:

base_url = (web_page.scan(/base\s+href="(.*?)"/i).flatten +
    [a_url[0,a_url.rindex('/')]])[0]
base_url = remove_trailing_slash(base_url)
web_page.scan(/href="(.*?)"/i).flatten.map do |link|

And alter the two regexes so they now read:

base_url = (web_page.scan(/base\s+href=['"](.*?)['"]/i).flatten +
    [a_url[0,a_url.rindex('/')]])[0]
base_url = remove_trailing_slash(base_url)
web_page.scan(/href=['"](.*?)['"]/i).flatten.map do |link|

In case you can't spot it, the change is the ['"] rather than just " at the start and end of the regular expression. Some people miss the quotes out completely, so if you want to catch even more URLs you could extend the regex to catch those as well.
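One possible way to extend the pattern to unquoted URLs is sketched below. The constant HREF_PATTERN and the helper extract_links are hypothetical names for this example; this is not a patch for the Spider gem, just an illustration of a regex that accepts double-quoted, single-quoted and unquoted values.

```ruby
# Assumed pattern: three alternatives, one per quoting style.
#   "..."   double quoted
#   '...'   single quoted
#   bare    unquoted, ending at whitespace or ">"
HREF_PATTERN = /href\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>'"]+))/i

# Hypothetical helper: return the href values found in a page.
def extract_links(page)
  # Each match yields three capture groups, only one of which is non-nil.
  page.scan(HREF_PATTERN).map { |groups| groups.compact.first }
end

page = %q{<a href="a.html">A</a> <a href='b.html'>B</a> <a href=c.html>C</a>}
puts extract_links(page)
```

Using a separate alternative for each quoting style also avoids matching a value that opens with one kind of quote and closes with the other.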

Missing exiftool

If you see this error while trying to run either CeWL or FAB


/usr/lib/ruby/gems/1.8/gems/mini_exiftool-1.0.1/lib/mini_exiftool.rb:246:in `exiftool_version': Command 'exiftool' not found (MiniExiftool::Error)
from /usr/lib/ruby/gems/1.8/gems/mini_exiftool-1.0.1/lib/mini_exiftool.rb:265
from /usr/local/lib/site_ruby/1.8/rubygems/custom_require.rb:36:in `gem_original_require'
from /usr/local/lib/site_ruby/1.8/rubygems/custom_require.rb:36:in `require'
from ./cewl_lib.rb:1
from /usr/local/lib/site_ruby/1.8/rubygems/custom_require.rb:31:in `gem_original_require'
from /usr/local/lib/site_ruby/1.8/rubygems/custom_require.rb:31:in `require'
from ./cewl.rb:58

then the application can't access exiftool. Either install it or make sure it is in your path.
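If you want to check for exiftool before running either app, something along these lines should work. It is a minimal sketch; command_available? is a hypothetical helper written for this example and is not part of CeWL, FAB or mini_exiftool.

```ruby
# Hypothetical helper: true if an executable file called `name` exists in
# one of the directories listed in the PATH environment variable.
def command_available?(name)
  ENV.fetch("PATH", "").split(File::PATH_SEPARATOR).any? do |dir|
    candidate = File.join(dir, name)
    File.file?(candidate) && File.executable?(candidate)
  end
end

unless command_available?("exiftool")
  warn "exiftool not found in PATH; install it or adjust PATH first"
end
```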

HTTPS Problem

It has been reported that if you see this problem

/usr/local/lib/site_ruby/1.8/rubygems/custom_require.rb:31:in `gem_original_require': no such file to load -- net/https (LoadError)
from /usr/local/lib/site_ruby/1.8/rubygems/custom_require.rb:31:in `require'
from /usr/lib/ruby/gems/1.8/gems/spider-0.4.4/lib/spider/spider_instance.rb:30
from /usr/local/lib/site_ruby/1.8/rubygems/custom_require.rb:31:in `gem_original_require'
from /usr/local/lib/site_ruby/1.8/rubygems/custom_require.rb:31:in `require'
from /usr/lib/ruby/gems/1.8/gems/spider-0.4.4/lib/spider.rb:26
from /usr/local/lib/site_ruby/1.8/rubygems/custom_require.rb:36:in `gem_original_require'
from /usr/local/lib/site_ruby/1.8/rubygems/custom_require.rb:36:in `require'
from ./cewl.rb:56

then you need the Ruby libopenssl package. In Debian the package is called libopenssl-ruby.

Missing Gem

This error:

./cewl.rb:56:in `require': no such file to load -- spider (LoadError)
from ./cewl.rb:56

means you are missing a Ruby gem; in this example it is spider. You can get a list of installed gems by running

gem list

If you think you have the gems installed but are still getting the error, make sure you have set the RUBYOPT environment variable.
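As an alternative to reading through the gem list by eye, the check can be scripted. The sketch below uses the gem names from the Installation section and Gem::Specification.find_all_by_name, which is part of current RubyGems; it is an assumption that your RubyGems is recent enough to provide it, and the helper name missing_gems is hypothetical.

```ruby
require "rubygems"

# Gem names as listed in the Installation section.
REQUIRED_GEMS = %w[hpricot http_configuration mime-types mini_exiftool rubyzip spider]

# Hypothetical helper: return the names RubyGems cannot find installed.
def missing_gems(names)
  names.reject { |name| Gem::Specification.find_all_by_name(name).any? }
end

missing = missing_gems(REQUIRED_GEMS)
if missing.empty?
  puts "All required gems found"
else
  puts "Missing gems: #{missing.join(', ')}"
end
```

This only looks the gems up by name and does not load them, so it will not trigger the exiftool error described above.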

Spider Missing Pages

Someone has reported that the spider misses some pages which have query strings on them. I haven't been able to reproduce this in my tests. If anyone has this problem and can reproduce it, please let me know and I'll investigate it further.

Change Log

Keeping track of history.

  • Version 3 - Now spiders pages referenced in JavaScript location commands
  • Version 2.2 - Data from email addresses and meta data can be written to their own files
  • Version 2.1 - Fixed a bug some people were having while using the email option
  • Version 2 - Added meta data support
  • Version 1 - released

Ruby Doc

CeWL is commented up in Ruby Doc format.
