phishbench.feature_extraction

This module handles feature extraction. It contains the models for user-defined features, code for loading and extracting features, and a library of built-in features.

User-Defined Features

@phishbench.feature_extraction.reflection.register_feature(feature_type, config_name, default_value=-1)

Registers a feature for use in Phishbench

Parameters
  • feature_type (FeatureType) – The type of feature

  • config_name (str) – The name of the feature in the config file

  • default_value – The value to use if there is an error
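As a rough illustration of the registration pattern described above, the sketch below defines a stand-in decorator that records a feature type and config name on a function and falls back to default_value when extraction raises. It does not import PhishBench: register_feature_sketch, FeatureTypeSketch, and the example feature are hypothetical names, and the error-handling behaviour is an assumption based only on the parameter description.

```python
import enum
import functools


class FeatureTypeSketch(enum.Enum):
    """Hypothetical stand-in for FeatureType (two of the documented values)."""
    EMAIL_BODY = "EMAIL_BODY"
    URL_RAW = "URL_RAW"


def register_feature_sketch(feature_type, config_name, default_value=-1):
    """Stand-in decorator mirroring the documented parameters."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception:
                # Assumed behaviour: use the default when extraction fails.
                return default_value
        # Attach the registration metadata to the wrapped function.
        wrapper.feature_type = feature_type
        wrapper.config_name = config_name
        wrapper.default_value = default_value
        return wrapper
    return decorator


@register_feature_sketch(FeatureTypeSketch.URL_RAW, "url_length")
def url_length(url):
    # len() raises TypeError for None, which triggers the default value.
    return len(url)
```

Calling url_length on a string returns its length, while calling it on None returns the default of -1 instead of raising.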

enum phishbench.feature_extraction.reflection.FeatureType(value)

The types of features that can be extracted

Valid values are as follows:

EMAIL_BODY
EMAIL_HEADER
URL_RAW
URL_NETWORK
URL_WEBSITE
URL_WEBSITE_JAVASCRIPT

class phishbench.feature_extraction.reflection.FeatureMC(name, bases, attrs)

The metaclass for Feature Classes

Feature Loading & Extraction

phishbench.feature_extraction.reflection.load_features(internal_features=None, feature_filter=None)

Searches for python modules in the current directory and loads features from any modules it finds.

Parameters
  • internal_features (Union[ModuleType, List]) – The module or a list of modules to load internal features from

  • feature_filter (Union[str, None]) –

    The filter to use when loading features. Allowed values are:

    • "Email" - Only include email types that are enabled in the configuration.

    • "URL" - Only include URL types that are enabled in the configuration.

    • None - Don’t filter features.

Returns

A list of features.
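The three feature_filter values can be pictured with a small self-contained sketch. It does not call load_features; the function name, the (config_name, type_name) tuples, and the set of enabled types are all hypothetical, and matching on the type-name prefix is an assumption about how "Email" and "URL" map onto the FeatureType values.

```python
def filter_features_sketch(features, enabled_types, feature_filter=None):
    """Keep features whose type matches the filter and is enabled.

    `features` is a list of (config_name, type_name) pairs; `enabled_types`
    is a set of type names switched on in a hypothetical configuration.
    """
    if feature_filter is None:
        return list(features)  # None: don't filter features
    prefix = feature_filter.upper() + "_"  # "Email" -> "EMAIL_", "URL" -> "URL_"
    return [
        (name, type_name)
        for name, type_name in features
        if type_name.startswith(prefix) and type_name in enabled_types
    ]


features = [
    ("is_html", "EMAIL_BODY"),
    ("received_count", "EMAIL_HEADER"),
    ("url_length", "URL_RAW"),
    ("dns_ttl", "URL_NETWORK"),
]
enabled = {"EMAIL_BODY", "URL_RAW"}

email_features = filter_features_sketch(features, enabled, "Email")
url_features = filter_features_sketch(features, enabled, "URL")
all_features = filter_features_sketch(features, enabled, None)
```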

phishbench.feature_extraction.reflection.load_features_from_module(features_module, filter_features=None)

Loads features from a module

Parameters
  • features_module (ModuleType) – The module to import features from

  • filter_features (Union[str, None]) – Whether or not to filter out features according to the current configuration

Returns

A list of features in the module.

URL Feature Extraction

phishbench.feature_extraction.url.extract_features_from_single(features, url)

Extracts multiple features from a single URL

Parameters
  • features (List[phishbench.feature_extraction.reflection.models.FeatureClass]) – The features to extract

  • url (phishbench.input.url_input._url_data.URLData) – The URL to extract the features from

Returns

  • feature_values (Dict) – The extracted feature values

  • extraction_times (Dict) – The time it took to extract each feature

Return type

Tuple[Dict, Dict]
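The shape of the returned tuple can be sketched without PhishBench. In the example below, features are plain callables keyed by name and the input is a plain URL string; both are simplifying assumptions, since the real function takes FeatureClass instances and a URLData object. The function name extract_features_sketch is hypothetical.

```python
import time


def extract_features_sketch(features, item):
    """Return (feature_values, extraction_times), both keyed by feature name."""
    feature_values = {}
    extraction_times = {}
    for name, func in features.items():
        start = time.perf_counter()
        feature_values[name] = func(item)
        # Elapsed wall-clock time for this feature, in seconds.
        extraction_times[name] = time.perf_counter() - start
    return feature_values, extraction_times


# Hypothetical features operating on a plain URL string.
features = {
    "url_length": len,
    "number_of_dots": lambda url: url.count("."),
}
values, times = extract_features_sketch(features, "http://www.example.com/a.b")
```

Both dictionaries share the same keys, so a feature's value and its extraction time can be looked up by the same name.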

phishbench.feature_extraction.url.extract_features_list(urls, features)

Extracts features from a list of URLs

Parameters
  • urls (List[URLData]) – The urls to extract features from

  • features (List[FeatureClass]) – The features to extract

Returns

feature_list_dict – A list of dicts containing the extracted features

Return type

List[Dict[str]]

phishbench.feature_extraction.url.extract_labeled_dataset(legit_dataset_folder, phish_dataset_folder, features=None)

Extract features from a labeled dataset split by files

Parameters
  • legit_dataset_folder (str) – The path of the folder/file containing the legitimate urls

  • phish_dataset_folder (str) – The path of the folder/file containing the phishing urls

  • features (Optional[List[FeatureClass]]) – A list of feature objects or None. If None, then this function will load and instantiate new instances of the features

Returns

  • feature_values (List[Dict]) – A list of dicts containing the extracted features

  • labels (List[int]) – A list of labels. 0 is legitimate and 1 is phishing

  • features (List[FeatureClass]) – The feature instances used.

Email Feature Extraction

phishbench.feature_extraction.email.extract_features_from_single(features, email_msg)

Extracts multiple features from a single email

Parameters
  • features (List) – The features to extract

  • email_msg (EmailMessage) – The email to extract the features from

Returns

  • feature_values (Dict) – The extracted feature values

  • extraction_times (Dict) – The time it took to extract each feature

Return type

Tuple[Dict, Dict]

phishbench.feature_extraction.email.extract_features_list(emails, features)

Extracts features from a list of EmailMessage objects

Parameters
  • emails (List[EmailMessage]) – The emails to extract features from

  • features (List[FeatureClass]) – The features to extract

Returns

feature_list_dict – A list of dicts containing the extracted features

Return type

List[Dict[str]]

phishbench.feature_extraction.email.extract_labeled_dataset(legit_dataset_folder, phish_dataset_folder, features=None)

Extracts features from a dataset of emails split in two folders

Parameters
  • legit_dataset_folder (str) – The folder containing legitimate emails

  • phish_dataset_folder (str) – The folder containing phishing emails

  • features (Optional[List[FeatureClass]]) – A list of feature objects or None. If None, then this function will load and instantiate new instances of the features

Returns

  • feature_values (List[Dict]) – The feature values in a list of dictionaries. Features are mapped config_name to value.

  • labels (List[int]) – The class labels for the dataset

  • features (List[FeatureClass]) – The feature instances used.
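Because feature_values is a list of dictionaries keyed by config_name, a common next step is to flatten it into rows with a fixed column order. The sketch below shows one way to do that in plain Python; the sample values and labels are made up, to_matrix is a hypothetical helper, and filling missing entries with -1 is an assumption borrowed from the documented default_value of register_feature.

```python
def to_matrix(feature_values, missing=-1):
    """Convert a list of {config_name: value} dicts into (columns, rows)."""
    # Use a sorted union of all keys so every row has the same column order.
    columns = sorted({name for row in feature_values for name in row})
    rows = [[row.get(name, missing) for name in columns] for row in feature_values]
    return columns, rows


# Made-up stand-ins for the first two return values of extract_labeled_dataset.
feature_values = [
    {"number_to": 1, "is_html": True},
    {"number_to": 3, "is_html": False},
    {"number_to": 2},  # a row with a missing feature
]
labels = [0, 1, 1]

columns, rows = to_matrix(feature_values)
```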

Built-In Features

URL Features

URL Features

average_domain_token_length

Average length of tokens from the URL domain

average_path_token_length

Average length of tokens from the URL path

brand_in_url

Whether or not the names of popular phishing targets are in the URL

char_dist

The character distribution of the url

char_dist_euclidian_distance

The Euclidean distance (L2 norm of u-v) between the URL and the English character distribution
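A minimal sketch of the distance computation, assuming the character distribution is taken over the 26 lowercase ASCII letters. The reference vector here is a uniform placeholder, not the real English letter frequencies, and the function names are hypothetical.

```python
import math
import string
from collections import Counter


def char_distribution(text):
    """Relative frequency of each letter a-z in the text (case-insensitive)."""
    letters = [c for c in text.lower() if c in string.ascii_lowercase]
    counts = Counter(letters)
    total = len(letters)
    return [counts[c] / total if total else 0.0 for c in string.ascii_lowercase]


def euclidean_distance(u, v):
    """L2 norm of u - v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Placeholder reference: uniform over 26 letters, NOT real English frequencies.
reference = [1 / 26] * 26

dist = char_distribution("http://aaaa.bb")
distance = euclidean_distance(dist, reference)
```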

char_dist_kl_divergence

The Kullback-Leibler divergence between the URL and the English character distribution

char_dist_kolmogorov_shmirnov

The Kolmogorov-Smirnov statistic between the URL and the English character distribution

consecutive_numbers

The sum of squares of the length of substrings that are consecutive numbers
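One reading of this description, sketched below, treats each maximal run of digits as a substring and sums the squared run lengths. That interpretation and the function name are assumptions; the built-in feature may define runs differently.

```python
import re


def consecutive_numbers_sketch(url):
    """Sum of squared lengths of maximal digit runs in the URL."""
    return sum(len(run) ** 2 for run in re.findall(r"\d+", url))


# Runs are "12", "345" and "6": 2**2 + 3**2 + 1**2 = 14
score = consecutive_numbers_sketch("http://a12b345.com/6")
```

Squaring the lengths weights one long digit run more heavily than several short ones with the same total number of digits.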

digit_letter_ratio

Number of digits divided by number of letters

domain_length

The length of the domain of the url

domain_letter_occurrence

The number of times each letter occurs in the domain

double_slashes_in_path

Whether or not there are double slashes (//) in the URL path

has_anchor_tag

Whether the url has an anchor tag

has_at_symbol

Whether or not the character @ is in the url

has_hex_characters

Whether or not there are escaped hex characters in the URL

has_https

Whether or not the url is https

has_more_than_three_dots

Whether the url without www. has more than three dots

has_port

Whether or not the URL has a port number

has_www_in_middle

Whether or not the string “www” is in the middle of the domain.

http_in_middle

Whether or not the string ‘http’ occurs in the middle of the url.

is_common_tld

Whether or not the tld is one of: .com, .net, .org, .edu, .mil, .gov, .co, .biz, .info, .me

is_ip_addr

Whether or not the url is an IPv4 address
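A check of this kind can be sketched with the standard library alone. The example below assumes the test applies to the host part of the URL; the function name is hypothetical and the built-in feature may parse URLs differently.

```python
import ipaddress
from urllib.parse import urlparse


def is_ipv4_host(url):
    """Return True if the URL's host is a dotted-quad IPv4 address."""
    host = urlparse(url).hostname
    if host is None:
        # No network location could be parsed (e.g. the scheme is missing).
        return False
    try:
        ipaddress.IPv4Address(host)
        return True
    except ipaddress.AddressValueError:
        return False
```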

is_whitelisted

Whether or not the domain is one of the targeted brands

longest_domain_token_length

Length of the longest token from the URL domain

null_in_domain

Whether or not the string null is in the url (ignoring case)

num_punctuation

Number of punctuation characters. Punctuation is defined as string.punctuation

number_of_dashes

The number of dashes in the url

number_of_digits

The number of digits in the url

number_of_dots

The number of times the . character occurs in the url

number_of_slashes

The number of forward or backward slashes in the url

protocol_port_match

Whether or not the protocol and the port number of the URL match

special_char_count

The number of @ or - characters in the url

special_pattern

Whether or not the string ?gws_rd=ssl appears in the url

token_count

The number of tokens in the path

top_level_domain

The top level domain of the url

url_length

The length of the url.

hisc_whole

Number of characters from the set {@xQ+]M&=<}#[?|' in the URL

hisc_host

Number of characters from the set XznGR%rmNM=DIZc: in the host section of the URL

hisc_path

Number of characters from the set Y{x+]p!=}#[|:h in the path section of the URL

Reference

Tao Feng and Chuan Yue. 2020. Visualizing and Interpreting RNN Models in URL-based Phishing Detection. <https://doi.org/10.1145/3381991.3395602>

hisc_query

Number of characters from the set 5)-x+]M=}D#[?|'(h~} in the query section of the URL

Network Features

as_number

The autonomous system (AS) number of the url

creation_date

The whois info creation date

dns_ttl

The TTL for DNS requests

expiration_date

The whois info expiration date

number_name_server

The number of name servers returned by the NS query

updated_date

The whois info update date

HTML Features

website_tfidf

TF-IDF word-level vectors of the downloaded websites

content_length

The value of the Content-Length header

website_content_type

The value of the Content-Type header

has_password_input

Whether or not the website has an input element of type password

is_redirect

Whether or not the URL redirects to a different url

number_of_anchor

The number of anchor (a) tags

number_of_audio

The number of audio tags

number_of_body

The number of body tags

number_of_embed

The number of embed tags

number_of_external_content

The number of content tags hosted on external domains. A content tag is defined as any of the following tags: audio, embed, iframe, img, input, script, source, track, video

number_of_head

The number of head tags

number_of_hidden_div

The number of div elements with a height or width of 0

number_of_hidden_iframe

The number of iframes with a height or width of 0

number_of_hidden_input

The number of hidden input fields

number_of_hidden_object

The number of objects of height or width 0

number_of_hidden_svg

The number of svgs of height or width 0

number_of_html

The number of html tags

number_of_iframe

The number of iframe tags

number_of_img

The number of img tags

number_of_internal_content

The number of content tags hosted on the same domain. A content tag is defined as any of the following tags: audio, embed, iframe, img, input, script, source, track, video

number_of_scripts

The number of script tags

number_of_tags

The total number of tags

number_of_title

The number of title tags

number_of_video

The number of video tags

number_suspicious_content

The number of suspicious tags. A tag is considered suspicious if its length is greater than 128, and less than 5% of it is spaces.
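The two thresholds can be written as a small predicate. The sketch below assumes the length is the character length of the tag's text and that only the space character counts as a space; the function name is hypothetical.

```python
def is_suspicious_sketch(tag_text):
    """True if the text is longer than 128 chars and under 5% spaces."""
    if len(tag_text) <= 128:
        return False
    return tag_text.count(" ") / len(tag_text) < 0.05


long_blob = "x" * 200             # long, no spaces: suspicious
spaced_text = "word " * 40        # 200 chars, 20% spaces: not suspicious
short_blob = "x" * 128            # not longer than 128: not suspicious
```

Long strings with almost no spaces are typical of encoded or obfuscated payloads, which is presumably what the thresholds are meant to catch.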

x_powered_by

The value of the X-Powered-By header

Javascript Features

number_of_escape

Number of escape calls in the embedded javascript

number_of_eval

Number of eval calls in the embedded javascript

number_of_event_attachment

Number of addEventListener or attachEvent calls in the embedded javascript

number_of_event_dispatch

Number of dispatchEvent or fireEvent calls in the embedded javascript

number_of_exec

Number of exec calls in the embedded javascript

number_of_iframes_in_script

Number of times the token iframe shows up in embedded javascript

number_of_set_timeout

Number of setTimeout calls in the embedded javascript

number_of_unescape

Number of unescape calls in the embedded javascript

right_click_modified

Whether or not the right click event has been modified.

Email Features

Header Features

mime_version

The MIME version

header_file_size

The size of the header in bytes

return_path

The return path of the email

X-mailer

The x-mailer of the email

x_originating_hostname

The x-originating-hostname header of the email

x_originating_ip

The x-originating-ip header of the email

x_virus_scanned

Whether or not the x-virus-scanned header is in the email

x_spam_flag

Whether or not the x-spam-flag header is in the email

received_count

The number of Received headers

authentication_results_spf_pass

Whether or not spf=pass is in the authentication results

authentication_results_dkim_pass

Whether or not dkim= is in the authentication results

has_x_original_authentication_results

Whether or not the email has the X-Original-Authentication-Results header

has_received_spf

Whether or not the email has the Received-SPF header

has_dkim_signature

Whether or not the email has the DKIM-Signature header

compare_sender_domain_message_id_domain

Whether or not the domains of the sender address and the message id are the same

compare_sender_return

Whether or not the return path and the sender address are the same

blacklisted_words_subject

Number of times the following words/phrases appear in the subject:

  • urgent

  • account

  • closing

  • act now

  • click here

  • limited

  • suspension

  • your account

  • verify your account

  • agree

  • bank

  • dear

  • update

  • confirm

  • customer

  • client

  • suspend

  • restrict

  • verify

  • login

  • ssn

  • username

  • click

  • log

  • inconvenient

  • alert

  • paypal

number_cc

Number of addresses in the CC field

number_bcc

Number of addresses in the BCC field

number_to

Number of addresses in the To field

number_of_words_subject

The number of words in the subject

number_of_characters_subject

The number of characters in the subject

number_of_special_characters_subject

The number of special characters in the subject

is_forward

Whether or not “fw:” is in the subject

is_reply

Whether or not “re:” is in the subject

vocab_richness_subject

The vocabulary richness (yule) of the subject

Body Features

is_html

Whether or not the email has an HTML body

num_content_type

The number of parts that declare a Content-Type.

num_unique_content_type

The number of unique Content-Type values.

num_content_type_text_plain

The number of parts that declare a Content-Type value of text/plain.

num_content_type_text_html

The number of parts that declare a Content-Type value of text/html.

num_content_type_multipart_mixed

The number of parts that declare a Content-Type value of multipart/mixed.

num_content_type_multipart_encrypted

The number of parts that declare a Content-Type value of multipart/encrypted.

num_content_type_form_data

The number of parts that declare a Content-Type value of multipart/form-data.

num_content_type_multipart_byterange

The number of parts that declare a Content-Type value of multipart/byterange.

num_content_type_multipart_parallel

The number of parts that declare a Content-Type value of multipart/parallel.

num_content_type_multipart_report

The number of parts that declare a Content-Type value of multipart/report.

num_content_type_multipart_alternative

The number of parts that declare a Content-Type value of multipart/alternative.

num_content_type_multipart_signed

The number of parts that declare a Content-Type value of multipart/signed.

num_content_type_multipart_x_mix_replaced

The number of parts that declare a Content-Type value of multipart/x-mixed-replaced.

num_content_disposition

The number of parts that declare a Content-Disposition.

num_unique_content_disposition

The number of different Content-Disposition types used.

num_charset

The number of parts with a declared charset.

num_charset_utf7

The number of parts using the utf-7 charset.

num_charset_utf8

The number of parts using the utf-8 charset.

num_charset_gb2312

The number of parts using the gb2312 charset.

num_charset_shift_js

The number of parts using the shift-jis charset.

num_charset_koi

The number of parts using the koi charset.

num_unique_attachment

The number of attachments

num_unique_attachment_filetypes

The number of attachment filetypes

num_content_transfer_encoding

num_unique_content_transfer_encoding

num_content_transfer_encoding_7bit

num_content_transfer_encoding_8bit

num_content_transfer_encoding_binary

num_content_transfer_encoding_quoted_printable

num_words_body

The number of words in the body.

num_unique_words_in_body

The number of unique words in the body.

number_of_characters_body

The number of characters in the body.

number_unique_chars_body

The number of unique characters in the body.

number_of_special_characters_body

The number of characters matching the regex _|[^\w\s].
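The regex above counts underscores plus anything that is neither a word character nor whitespace. A minimal sketch of the count using Python's re module (the function name is hypothetical):

```python
import re

# Underscore, or any character that is not a word character or whitespace.
SPECIAL_CHARS = re.compile(r"_|[^\w\s]")


def count_special_characters(text):
    """Number of characters in the text matching the regex."""
    return len(SPECIAL_CHARS.findall(text))


# Matches: ",", "_", "!" and "%"
count = count_special_characters("Hello, user_name! 100%")
```

The underscore needs its own alternative because \w already includes it, so [^\w\s] alone would skip it.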

vocab_richness_body

The vocabulary richness of the body, as measured by the inverse of Yule’s K measure.
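For reference, Yule's K is commonly defined as K = 10^4 * (sum over m of m^2 * V_m - N) / N^2, where N is the number of word tokens and V_m is the number of distinct words that occur exactly m times. The sketch below computes the inverse of that quantity; the tokenization (a pre-split word list), the function name, and returning None when K is zero are assumptions, and the built-in feature may differ in these details.

```python
from collections import Counter


def inverse_yules_k(words):
    """Inverse of Yule's K for a list of word tokens (None if K is zero)."""
    n = len(words)
    if n == 0:
        return None
    word_counts = Counter(words)
    # spectrum[m] = number of distinct words that occur exactly m times
    spectrum = Counter(word_counts.values())
    s2 = sum(m * m * v_m for m, v_m in spectrum.items())
    k = 1e4 * (s2 - n) / (n * n)
    return 1 / k if k else None


# N = 3, V_1 = 1, V_2 = 1, so K = 1e4 * (5 - 3) / 9 and 1/K = 0.00045
richness = inverse_yules_k("a a b".split())
```

A higher value indicates a richer vocabulary; text in which every word is unique gives K = 0, which this sketch reports as None.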

greetings_body

Whether or not the string “Dear User” is in the text.

hidden_text

Whether or not the email contains hidden text.

num_href_tag

num_end_tag

num_open_tag

num_on_mouse_over

blacklisted_words_body

number_of_scripts

function_words_count

The number of function words in the body.

flesh_read_score

smog_index

flesh_kincaid_score

coleman_liau_index

automated_readability_index

dale_chall_readability_score

difficult_words

linsear_score

gunning_fog

text_standard

References

McGrath, D. Kevin, and Minaxi Gupta. (2008) “Behind Phishing: An Examination of Phisher Modi Operandi”

Rakesh Verma and Keith Dyer. 2015. On the Character of Phishing URLs: Accurate and Robust Statistical Learning Classifiers. In Proceedings of the 5th ACM Conference on Data and Application Security and Privacy (CODASPY ‘15). Association for Computing Machinery, New York, NY, USA, 111–122. DOI:https://doi.org/10.1145/2699026.2699115

A. Das, S. Baki, A. El Aassal, R. Verma and A. Dunbar, “SoK: A Comprehensive Reexamination of Phishing Research From the Security Perspective,” in IEEE Communications Surveys & Tutorials, vol. 22, no. 1, pp. 671-708, Firstquarter 2020, doi: 10.1109/COMST.2019.2957750.