`phishbench.feature_extraction`¶

This module handles feature extraction. It contains the models for user-defined features, code for loading and extracting features, and a library of built-in features.

User-Defined Features¶

@phishbench.feature_extraction.reflection.register_feature(feature_type, config_name, default_value=-1)¶

Registers a feature for use in Phishbench

Parameters

feature_type (FeatureType) – The type of feature
config_name (str) – The name of the feature in the config file
default_value – The value to use if there is an error

enum phishbench.feature_extraction.reflection.FeatureType(value)¶

The types of features that can be extracted

Valid values are as follows:

EMAIL_BODY¶

EMAIL_HEADER¶

URL_RAW¶

URL_NETWORK¶

URL_WEBSITE¶

URL_WEBSITE_JAVASCRIPT¶

class phishbench.feature_extraction.reflection.FeatureMC(name, bases, attrs)¶

The metaclass for Feature Classes

Feature Loading & Extraction¶

phishbench.feature_extraction.reflection.load_features(internal_features=None, feature_filter=None)¶

Searches for python modules in the current directory and loads features from any modules it finds.

Parameters

internal_features (Union[ModuleType, List]) – The module or a list of modules to load internal features from
feature_filter (Union[str, None]) –
The filter to use when loading features. Allowed values are:
- "Email" - Only include email types that are enabled in the configuration.
- "URL" - Only include URL types that are enabled in the configuration.
- None - Don’t filter features.

Returns

A list of features.

phishbench.feature_extraction.reflection.load_features_from_module(features_module, filter_features=None)¶

Loads features from a module

Parameters

features_module (ModuleType) – The module to import features from
filter_features (Union[str, None]) – Whether or not to filter out features according to the current configuration

Returns

A list of features in the module.

URL Feature Extraction¶

phishbench.feature_extraction.url.extract_features_from_single(features, url)¶

Extracts multiple features from a single URL

Parameters

features (List[phishbench.feature_extraction.reflection.models.FeatureClass]) – The features to extract
url (phishbench.input.url_input._url_data.URLData) – The URL to extract the features from

Returns

feature_values (Dict) – The extracted feature values
extraction_times (Dict) – The time it took to extract each feature

Return type

Tuple[Dict, Dict]
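The two-dictionary return shape can be pictured with the following self-contained sketch. It is not the library's code: `extract_single_sketch` is a hypothetical helper, and it takes a dict of plain callables and a string URL instead of `FeatureClass` instances and a `URLData` object.

```python
import time


def extract_single_sketch(features, item):
    """Run every feature callable on `item` and time each call."""
    feature_values = {}
    extraction_times = {}
    for name, func in features.items():
        start = time.perf_counter()
        feature_values[name] = func(item)
        extraction_times[name] = time.perf_counter() - start
    return feature_values, extraction_times


# Hypothetical features keyed by their config name.
features = {
    "url_length": len,
    "number_of_dots": lambda u: u.count("."),
}
values, times = extract_single_sketch(features, "http://a.b.example.com/")
print(values)  # {'url_length': 23, 'number_of_dots': 3}
```

Both dictionaries share the same keys, so a per-feature cost can be looked up by the same name as its value.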

phishbench.feature_extraction.url.extract_features_list(urls, features)¶

Extracts features from a list of URLs

Parameters

urls (List[URLData]) – The urls to extract features from
features (List[FeatureClass]) – The features to extract

Returns

feature_list_dict – A list of dicts containing the extracted features

Return type

List[Dict[str]]

phishbench.feature_extraction.url.extract_labeled_dataset(legit_dataset_folder, phish_dataset_folder, features=None)¶

Extract features from a labeled dataset split by files

Parameters

legit_dataset_folder (str) – The path of the folder/file containing the legitimate urls
phish_dataset_folder (str) – The path of the folder/file containing the phishing urls
features (Optional[List[FeatureClass]]) – A list of feature objects or None. If None, then this function will load and instantiate new instances of the features

Returns

features (List[Dict]) – A list of dicts containing the extracted features
labels (List[int]) – A list of labels. 0 is legitimate and 1 is phishing
features (List[FeatureClass]) – The feature instances used.
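The labeling convention described above (0 for legitimate, 1 for phishing) might be sketched as follows. `build_labeled_dataset` and the `extract` callable are hypothetical stand-ins; the real function reads the URLs from the given folders or files.

```python
def build_labeled_dataset(legit_items, phish_items, extract):
    """Extract features from both groups; label 0 = legitimate, 1 = phishing."""
    feature_values = [extract(item) for item in legit_items]
    feature_values += [extract(item) for item in phish_items]
    labels = [0] * len(legit_items) + [1] * len(phish_items)
    return feature_values, labels


def extract(url):
    # Hypothetical single-feature extractor used only for this example.
    return {"url_length": len(url)}


values, labels = build_labeled_dataset(
    ["http://a.com"], ["http://b.co/x", "http://c.io"], extract
)
print(labels)  # [0, 1, 1]
```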

Email Feature Extraction¶

phishbench.feature_extraction.email.extract_features_from_single(features, email_msg)¶

Extracts multiple features from a single email

Parameters

features (List) – The features to extract
email_msg (EmailMessage) – The email to extract the features from

Returns

feature_values (Dict) – The extracted feature values
extraction_times (Dict) – The time it took to extract each feature

Return type

Tuple[Dict, Dict]

phishbench.feature_extraction.email.extract_features_list(emails, features)¶

Extracts features from a list of EmailMessage objects

Parameters

emails (List[EmailMessage]) – The emails to extract features from
features (List[FeatureClass]) – The features to extract

Returns

feature_list_dict – A list of dicts containing the extracted features

Return type

List[Dict[str]]

phishbench.feature_extraction.email.extract_labeled_dataset(legit_dataset_folder, phish_dataset_folder, features=None)¶

Extracts features from a dataset of emails split in two folders

Parameters

legit_dataset_folder (str) – The folder containing legitimate emails
phish_dataset_folder (str) – The folder containing phishing emails
features (Optional[List[FeatureClass]]) – A list of feature objects or None. If None, then this function will load and instantiate new instances of the features

Returns

feature_values (List[Dict]) – The feature values in a list of dictionaries. Features are mapped config_name to value.
labels (List[int]) – The class labels for the dataset
features (List[FeatureClass]) – The feature instances used.

Built-In features¶

URL Features¶

`average_domain_token_length`¶

Average length of tokens from the URL domain

`average_path_token_length`¶

Average length of tokens from the URL path

`brand_in_url`¶

Whether or not the names of popular phishing targets are in the URL

`char_dist`¶

The character distribution of the url
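A minimal sketch of a character distribution, assuming it means the relative frequency of the letters a to z with case ignored. The alphabet and normalization used by the built-in feature are not stated here, so both are assumptions.

```python
import string
from collections import Counter


def char_dist(url):
    """Relative frequency of each letter a-z in `url` (case-insensitive)."""
    letters = [c for c in url.lower() if c in string.ascii_lowercase]
    counts = Counter(letters)
    total = len(letters)
    # Return all zeros when the URL contains no ASCII letters.
    return [counts[c] / total if total else 0.0 for c in string.ascii_lowercase]


dist = char_dist("http://aab.cc")  # 9 letters: h, t, t, p, a, a, b, c, c
print(dist[0])  # frequency of 'a' = 2/9
```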

`char_dist_euclidian_distance`¶

The Euclidean distance (L2 norm of u-v) between the character distribution of the URL and the English character distribution
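For reference, the L2 norm of u-v can be computed as below. The two vectors are toy three-letter distributions made up for the example, not real English letter frequencies.

```python
import math


def euclidean_distance(u, v):
    """L2 norm of u - v for two equal-length sequences of numbers."""
    if len(u) != len(v):
        raise ValueError("distributions must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


url_dist = [0.5, 0.5, 0.0]   # toy distribution for a URL
reference = [0.2, 0.3, 0.5]  # toy stand-in for the English distribution
d = euclidean_distance(url_dist, reference)  # sqrt(0.09 + 0.04 + 0.25)
```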

`char_dist_kl_divergence`¶

The Kullback-Leibler divergence between the character distribution of the URL and the English character distribution
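A small sketch of the Kullback-Leibler divergence for discrete distributions. The direction (URL relative to English), the natural logarithm, and the `eps` floor for zero reference probabilities are assumptions made for this example.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(
        pi * math.log(pi / max(qi, eps))  # eps avoids division by zero
        for pi, qi in zip(p, q)
        if pi > 0
    )


same = kl_divergence([0.5, 0.5], [0.5, 0.5])   # identical distributions
skewed = kl_divergence([1.0, 0.0], [0.5, 0.5]) # equals log(2)
```

The divergence is zero only when the two distributions agree and grows as the URL's letter usage departs from the reference.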

`char_dist_kolmogorov_shmirnov`¶

The Kolmogorov-Smirnov statistic between the character distribution of the URL and the English character distribution
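One textbook form of the Kolmogorov-Smirnov statistic for two discrete distributions over the same ordered alphabet is the largest gap between their cumulative sums. The sketch below uses that definition; the built-in feature may compute the statistic differently (for example with a two-sample test), so treat this as an assumption.

```python
from itertools import accumulate


def ks_statistic(p, q):
    """Largest absolute difference between the cumulative sums of p and q."""
    return max(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))


# Cumulative sums: [0.5, 1.0, 1.0] versus [0.0, 0.5, 1.0]
gap = ks_statistic([0.5, 0.5, 0.0], [0.0, 0.5, 0.5])
```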

`consecutive_numbers`¶

The sum of squares of the length of substrings that are consecutive numbers
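Read literally, the description can be sketched as below, assuming a "substring of consecutive numbers" is a maximal run of ASCII digits.

```python
import re


def consecutive_numbers(url):
    """Sum of squared lengths of every maximal run of ASCII digits."""
    return sum(len(run) ** 2 for run in re.findall(r"[0-9]+", url))


score = consecutive_numbers("http://a1b22c333.com")  # 1**2 + 2**2 + 3**2
print(score)  # 14
```

Squaring the run lengths weights one long digit string more heavily than the same number of digits scattered through the URL.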

`digit_letter_ratio`¶

Number of digits divided by number of letters

`domain_length`¶

The length of the domain of the url

`domain_letter_occurrence`¶

The number of times each letter occurs in the domain

`double_slashes_in_path`¶

Whether or not there are double slashes (//) in the URL path

`has_anchor_tag`¶

Whether the url has an anchor tag

`has_at_symbol`¶

Whether or not the character @ is in the url

`has_hex_characters`¶

Whether or not there are escaped hex characters in the URL

`has_https`¶

Whether or not the url is https

`has_more_than_three_dots`¶

Whether the url without www. has more than three dots
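A possible reading of this rule, assuming "without www." means deleting every occurrence of the literal substring `www.` before counting dots:

```python
def has_more_than_three_dots(url):
    """True if the URL, with 'www.' removed, contains more than three dots."""
    return url.replace("www.", "").count(".") > 3


print(has_more_than_three_dots("http://www.a.b.c.com"))    # False (3 dots)
print(has_more_than_three_dots("http://www.a.b.c.d.com"))  # True (4 dots)
```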

`has_port`¶

Whether or not the URL has a port number

`has_www_in_middle`¶

Whether or not the string “www” occurs in the middle of the domain.

`http_in_middle`¶

Whether or not the string ‘http’ occurs in the middle of the url.

`is_common_tld`¶

Whether or not the tld is one of: .com, .net, .org, .edu, .mil, .gov, .co, .biz, .info, .me

`is_ip_addr`¶

Whether or not the url is an IPv4 address
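One way to test for an IPv4 host with the standard library is shown below. It assumes the check applies to the host part of the URL; `is_ipv4_host` is a hypothetical name, not the library's function.

```python
import ipaddress
from urllib.parse import urlparse


def is_ipv4_host(url):
    """True if the host component of `url` parses as an IPv4 address."""
    host = urlparse(url).hostname
    if host is None:
        return False
    try:
        ipaddress.IPv4Address(host)
        return True
    except ValueError:  # AddressValueError is a subclass of ValueError
        return False


print(is_ipv4_host("http://192.168.0.1:8080/login"))  # True
print(is_ipv4_host("http://example.com/"))            # False
```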

`is_whitelisted`¶

Whether or not the domain is one of the targeted brands

`longest_domain_token_length`¶

Length of the longest token from the URL domain

`null_in_domain`¶

Whether or not the string null is in the url (ignoring case)

`num_punctuation`¶

Number of punctuation characters. Punctuation is defined as Python's string.punctuation
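Counting characters that belong to `string.punctuation` can be done in one line; the sample URL is made up for the example.

```python
import string


def num_punctuation(url):
    """Number of characters in `url` that belong to string.punctuation."""
    return sum(c in string.punctuation for c in url)


count = num_punctuation("http://a-b.com/?q=1")  # : / / - . / ? =
print(count)  # 8
```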

`number_of_dashes`¶

The number of dashes in the url

`number_of_digits`¶

The number of digits in the url

`number_of_dots`¶

The number of times the . character occurs in the url

`number_of_slashes`¶

The number of forward or backward slashes in the url

`protocol_port_match`¶

Whether or not the port in the URL matches its protocol

`special_char_count`¶

The number of @ or - characters in the url

`special_pattern`¶

Whether or not the string ?gws_rd=ssl appears in the url

`token_count`¶

The number of tokens in the path

`top_level_domain`¶

The top level domain of the url

`url_length`¶

The length of the url.

`hisc_whole`¶

Number of characters from the set {@xQ+]M&=<}#[?|' in the URL

Reference¶

Tao Feng and Chuan Yue. 2020. Visualizing and Interpreting RNN Models in URL-based Phishing Detection.
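The hisc features all share one shape: count how many characters of a URL section fall in a fixed set. A minimal sketch, taking the character set for `hisc_whole` literally as printed above; `count_from_set` is a hypothetical helper.

```python
# Character set copied verbatim from the hisc_whole description.
HISC_WHOLE = set("{@xQ+]M&=<}#[?|'")


def count_from_set(text, charset):
    """Number of characters in `text` that belong to `charset`."""
    return sum(c in charset for c in text)


hits = count_from_set("http://x.com/?a=1", HISC_WHOLE)  # matches x, ?, =
print(hits)  # 3
```

The same helper would cover the host, path, and query variants by passing the corresponding URL section and character set.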

`hisc_host`¶

Number of characters from the set XznGR%rmNM=DIZc: in the host section of the URL

Reference¶

Tao Feng and Chuan Yue. 2020. Visualizing and Interpreting RNN Models in URL-based Phishing Detection.

`hisc_path`¶

Number of characters from the set Y{x+]p!=}#[|:h in the path section of the URL

Reference¶

Tao Feng and Chuan Yue. 2020. Visualizing and Interpreting RNN Models in URL-based Phishing Detection. <https://doi.org/10.1145/3381991.3395602>

`hisc_query`¶

Number of characters from the set 5)-x+]M=}D#[?|'(h~} in the query section of the URL

Reference¶

Tao Feng and Chuan Yue. 2020. Visualizing and Interpreting RNN Models in URL-based Phishing Detection.

Network Features¶

`as_number`¶

The autonomous system (AS) number of the url

`creation_date`¶

The whois info creation date

`dns_ttl`¶

The TTL for DNS requests

`expiration_date`¶

The whois info expiration date

`number_name_server`¶

The number of name servers returned by the NS query

`updated_date`¶

The whois info update date

HTML Features¶

`website_tfidf`¶

TF-IDF word-level vectors of the downloaded websites

`content_length`¶

The value of the Content-Length header

`website_content_type`¶

The value of the Content-Type header

`has_password_input`¶

Whether or not the website has an input element of type password

`is_redirect`¶

Whether or not the URL redirects to a different url

`link_alexa_global_rank`¶

The mean and standard deviation of the global Alexa ranks of the links on the website, bucketed into the following ranges:

<1,000

<10,000

<100,000

<500,000

<1,000,000

unranked

Reference¶

Zhou, Xin, and Rakesh Verma. “Phishing sites detection from a web developer’s perspective using machine learning.” Proceedings of the 53rd Hawaii International Conference on System Sciences. 2020.

`link_tree_features`¶

Partitions the links on the page into 36 sets as follows:

Splits all links in the page by the following HTML tags:

<a>

<link>

<script>

<video>

<img>

<meta>

Then divides each group into the following categories:

social network links (Facebook, YouTube, Google, Twitter, Instagram, Pinterest)

other https links

other http links

contains current domain

internal links

any URL

Returns the size, mean length, and length standard deviation of each set

Reference¶

Zhou, Xin, and Rakesh Verma. “Phishing sites detection from a web developer’s perspective using machine learning.” Proceedings of the 53rd Hawaii International Conference on System Sciences. 2020.

`number_of_anchor`¶

The number of anchor (a) tags

`number_of_audio`¶

The number of audio tags

`number_of_body`¶

The number of body tags

`number_of_embed`¶

The number of embed tags

`number_of_external_links`¶

The number of links to a different domain

`number_of_external_content`¶

The number of content tags hosted on external domains. A content tag is defined as any of the following tags: audio, embed, iframe, img, input, script, source, track, video

`number_of_head`¶

The number of head tags

`number_of_hidden_div`¶

The number of div elements with a height or width of 0

`number_of_hidden_iframe`¶

The number of iframes with a height or width of 0

`number_of_hidden_input`¶

The number of hidden input fields

`number_of_hidden_object`¶

The number of object elements with a height or width of 0

`number_of_hidden_svg`¶

The number of svg elements with a height or width of 0

`number_of_html`¶

The number of html tags

`number_of_iframe`¶

The number of iframe tags

`number_of_img`¶

The number of img tags

`number_of_internal_content`¶

The number of content tags hosted on the same domain. A content tag is defined as any of the following tags: audio, embed, iframe, img, input, script, source, track, video

`number_of_internal_links`¶

The number of links to the same domain

`number_of_scripts`¶

The number of script tags

`number_of_tags`¶

The total number of tags

`number_of_title`¶

The number of title tags

`number_of_video`¶

The number of video tags

`number_suspicious_content`¶

The number of suspicious tags. A tag is considered suspicious if its length is greater than 128, and less than 5% of it is spaces.

Reference¶

Canali et al. (2011) Prophiler: A Fast Filter for the Large-Scale Detection of Malicious Web Pages
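The suspicious-tag rule for `number_suspicious_content` (length greater than 128 and less than 5% spaces) can be written as a predicate. The sketch assumes the length is measured on the tag's text as a plain string; how the built-in feature obtains that text is not specified here.

```python
def is_suspicious(tag_text, min_length=128, max_space_ratio=0.05):
    """True if the text is longer than min_length and under 5% spaces."""
    length = len(tag_text)
    if length <= min_length:
        return False
    return tag_text.count(" ") / length < max_space_ratio


print(is_suspicious("a" * 200))     # True: long and no spaces
print(is_suspicious("a" * 100))     # False: too short
print(is_suspicious("a b " * 50))   # False: half of the text is spaces
```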

`x_powered_by`¶

The value of the X-Powered-By header

Javascript Features¶

`number_of_escape`¶

Number of escape calls in the embedded javascript

`number_of_eval`¶

Number of eval calls in the embedded javascript

`number_of_event_attachment`¶

Number of addEventListener or attachEvent calls in the embedded javascript

`number_of_event_dispatch`¶

Number of dispatchEvent or fireEvent calls in the embedded javascript

`number_of_exec`¶

Number of exec calls in the embedded javascript

`number_of_iframes_in_script`¶

Number of times the token iframe shows up in embedded javascript

`number_of_link`¶

Number of link calls in the embedded javascript

`number_of_search`¶

Number of search calls in the embedded javascript

`number_of_set_timeout`¶

Number of setTimeout calls in the embedded javascript

`number_of_unescape`¶

Number of unescape calls in the embedded javascript

`right_click_modified`¶

Whether or not the right click event has been modified.

Email Features¶

Header Features¶

`mime_version`¶

The MIME version

`header_file_size`¶

The size of the header in bytes

`return_path`¶

The return path of the email

`X-mailer`¶

The x-mailer of the email

`x_originating_hostname`¶

The x-originating-hostname header of the email

`x_originating_ip`¶

The x-originating-ip header of the email

`x_virus_scanned`¶

Whether or not the x-virus-scanned header is in the email

`x_spam_flag`¶

Whether or not the x-spam-flag header is in the email

`received_count`¶

The number of Received headers

`authentication_results_spf_pass`¶

Whether or not spf=pass is in the authentication results

`authentication_results_dkim_pass`¶

Whether or not dkim=pass is in the authentication results

`has_x_original_authentication_results`¶

Whether or not the email has the X-Original-Authentication-Results header

`has_received_spf`¶

Whether or not the email has the Received-SPF header

`has_dkim_signature`¶

Whether or not the email has the DKIM-Signature header

`compare_sender_domain_message_id_domain`¶

Whether or not the domains of the sender address and the message ID are the same

`compare_sender_return`¶

Whether or not the return path and the sender address are the same

`blacklisted_words_subject`¶

Number of times the following words/phrases appear in the subject:

urgent

account

closing

act now

click here

limited

suspension

your account

verify your account

agree

bank

dear

update

confirm

customer

client

suspend

restrict

verify

login

ssn

username

click

log

inconvenient

alert

paypal
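A simple way to count such phrases is a case-insensitive substring count per phrase, sketched below with three entries from the list. Whether the built-in feature reports one total or a count per phrase, and how it handles overlapping phrases such as "account" inside "your account", is not stated here, so the per-phrase dict is an assumption.

```python
def count_phrases(subject, phrases):
    """Case-insensitive substring count of each phrase in the subject."""
    text = subject.lower()
    return {phrase: text.count(phrase) for phrase in phrases}


counts = count_phrases(
    "URGENT: verify your account",
    ["urgent", "verify your account", "paypal"],
)
print(counts)  # {'urgent': 1, 'verify your account': 1, 'paypal': 0}
```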

`number_cc`¶

Number of addresses in the CC field

`number_bcc`¶

Number of addresses in the BCC field

`number_to`¶

Number of addresses in the To field

`number_of_words_subject`¶

The number of words in the subject

`number_of_characters_subject`¶

The number of characters in the subject

`number_of_special_characters_subject`¶

The number of special characters in the subject

`is_forward`¶

Whether or not “fw:” is in the subject

`is_reply`¶

Whether or not “re:” is in the subject

`vocab_richness_subject`¶

The vocabulary richness (yule) of the subject

Body Features¶

`is_html`¶

Whether or not the email has an HTML body

`num_content_type`¶

The number of parts that declare a Content-Type.

`num_unique_content_type`¶

The number of unique Content-Type values.

`num_content_type_text_plain`¶

The number of parts that declare a Content-Type value of text/plain.

`num_content_type_text_html`¶

The number of parts that declare a Content-Type value of text/html.

`num_content_type_multipart_mixed`¶

The number of parts that declare a Content-Type value of multipart/mixed.

`num_content_type_multipart_encrypted`¶

The number of parts that declare a Content-Type value of multipart/encrypted.

`num_content_type_form_data`¶

The number of parts that declare a Content-Type value of multipart/form-data.

`num_content_type_multipart_byterange`¶

The number of parts that declare a Content-Type value of multipart/byterange.

`num_content_type_multipart_parallel`¶

The number of parts that declare a Content-Type value of multipart/parallel.

`num_content_type_multipart_report`¶

The number of parts that declare a Content-Type value of multipart/report.

`num_content_type_multipart_alternative`¶

The number of parts that declare a Content-Type value of multipart/alternative.

`num_content_type_multipart_signed`¶

The number of parts that declare a Content-Type value of multipart/signed.

`num_content_type_multipart_x_mix_replaced`¶

The number of parts that declare a Content-Type value of multipart/x-mixed-replaced.

`num_content_disposition`¶

The number of parts that declare a Content-Disposition.

`num_unique_content_disposition`¶

The number of different Content-Disposition types used.

`num_charset`¶

The number of parts with a declared charset.

`num_charset_utf7`¶

The number of parts using the utf-7 charset.

`num_charset_utf8`¶

The number of parts using the utf-8 charset.

`num_charset_gb2312`¶

The number of parts using the gb2312 charset.

`num_charset_shift_js`¶

The number of parts using the shift-jis charset.

`num_charset_koi`¶

The number of parts using the koi charset.

`num_unique_attachment`¶

The number of attachments

`num_unique_attachment_filetypes`¶

The number of attachment filetypes

`num_content_transfer_encoding`¶

`num_unique_content_transfer_encoding`¶

`num_content_transfer_encoding_7bit`¶

`num_content_transfer_encoding_8bit`¶

`num_content_transfer_encoding_binary`¶

`num_content_transfer_encoding_quoted_printable`¶

`num_words_body`¶

The number of words in the body.

`num_unique_words_in_body`¶

The number of unique words in the body.

`number_of_characters_body`¶

The number of characters in the body.

`number_unique_chars_body`¶

The number of unique characters in the body.

`number_of_special_characters_body`¶

The number of characters matching the regex _|[^\w\s].
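The regex counts underscores plus anything that is neither a word character nor whitespace. A short sketch of applying it with Python's `re` module:

```python
import re

# Same pattern as in the description above.
SPECIAL = re.compile(r"_|[^\w\s]")


def count_special_characters(text):
    """Number of characters in `text` that match the pattern."""
    return len(SPECIAL.findall(text))


n = count_special_characters("Hello, world_1!")  # matches , _ !
print(n)  # 3
```

The underscore needs its own alternative because `\w` treats it as a word character.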

`vocab_richness_body`¶

The vocabulary richness of the body, as measured by the inverse of Yule’s K measure.
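Yule's K is commonly defined as 10^4 (M2 - N) / N^2, where N is the number of word tokens and M2 is the sum of the squared word frequencies. The sketch below computes its inverse without the 10^4 factor; the scaling, the tokenization, and the value returned when every word is unique are assumptions.

```python
from collections import Counter


def inverse_yules_k(words, default=0.0):
    """Inverse of Yule's K without the 10^4 scaling: N^2 / (M2 - N)."""
    n = len(words)                             # N: total word tokens
    counts = Counter(words)
    m2 = sum(c * c for c in counts.values())   # M2: sum of squared frequencies
    if m2 == n:
        # Every word occurs once (or there are no words), so K is 0 and its
        # inverse is undefined. Returning `default` is an assumption.
        return default
    return n * n / (m2 - n)


richness = inverse_yules_k("a a b".split())  # N = 3, M2 = 5
print(richness)  # 4.5
```

Higher values indicate a richer vocabulary, since repeated words increase M2 and shrink the ratio.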

`greetings_body`¶

Whether or not the string “Dear User” is in the text.

`hidden_text`¶

Whether or not the email contains hidden text.

`num_href_tag`¶

`num_end_tag`¶

`num_open_tag`¶

`num_on_mouse_over`¶

`blacklisted_words_body`¶

`number_of_scripts`¶

`number_of_img_links`¶

`function_words_count`¶

The number of function words in the body.

`flesh_read_score`¶

`smog_index`¶

`flesh_kincaid_score`¶

`coleman_liau_index`¶

`automated_readability_index`¶

`dale_chall_readability_score`¶

`difficult_words`¶

`linsear_score`¶

`gunning_fog`¶

`text_standard`¶

References¶

McGrath, D. Kevin, and Minaxi Gupta. (2008) “Behind Phishing: An Examination of Phisher Modi Operandi”

Rakesh Verma and Keith Dyer. 2015. On the Character of Phishing URLs: Accurate and Robust Statistical Learning Classifiers. In Proceedings of the 5th ACM Conference on Data and Application Security and Privacy (CODASPY ‘15). Association for Computing Machinery, New York, NY, USA, 111–122. DOI:https://doi.org/10.1145/2699026.2699115

Das, S. Baki, A. El Aassal, R. Verma and A. Dunbar, “SoK: A Comprehensive Reexamination of Phishing Research From the Security Perspective,” in IEEE Communications Surveys & Tutorials, vol. 22, no. 1, pp. 671-708, Firstquarter 2020, doi: 10.1109/COMST.2019.2957750.
