phishbench.feature_extraction¶
This module handles feature extraction. It contains the models for user-defined features, code for loading and extracting features, and a library of built-in features.
User-Defined Features¶
-
@phishbench.feature_extraction.reflection.register_feature(feature_type, config_name, default_value=-1)¶ Registers a feature for use in PhishBench
- Parameters
feature_type (FeatureType) – The type of feature
config_name (str) – The name of the feature in the config file
default_value – The value to use if there is an error
-
enum
phishbench.feature_extraction.reflection.FeatureType(value)¶ The types of features that can be extracted
Valid values are as follows:
-
EMAIL_BODY¶
-
EMAIL_HEADER¶
-
URL_RAW¶
-
URL_NETWORK¶
-
URL_WEBSITE¶
-
URL_WEBSITE_JAVASCRIPT¶
-
-
class
phishbench.feature_extraction.reflection.FeatureMC(name, bases, attrs)¶ The metaclass for Feature Classes
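The decorator pattern behind register_feature can be illustrated with a small stand-alone sketch. Nothing below is PhishBench code: the FeatureType stand-in, the REGISTRY dict, and the url_length_demo feature are hypothetical, and the sketch assumes the decorated object is a plain function, which this page does not state. It only shows how a config name, a feature type, and a fallback default_value can be attached to a feature.

```python
from enum import Enum


class FeatureType(Enum):
    """Stand-in for the documented enum; the member values are placeholders."""
    EMAIL_BODY = "email_body"
    URL_RAW = "url_raw"


# Hypothetical registry mapping config_name -> wrapped feature function.
REGISTRY = {}


def register_feature(feature_type, config_name, default_value=-1):
    """Toy decorator that mirrors the documented signature only."""
    def decorator(func):
        def safe_extract(*args, **kwargs):
            # Fall back to default_value if the feature raises.
            try:
                return func(*args, **kwargs)
            except Exception:
                return default_value

        safe_extract.feature_type = feature_type
        safe_extract.config_name = config_name
        REGISTRY[config_name] = safe_extract
        return safe_extract
    return decorator


@register_feature(FeatureType.URL_RAW, "url_length_demo")
def url_length_demo(url):
    return len(url)
```

Calling url_length_demo(None) in this sketch returns the default value instead of raising, which is the behaviour the default_value parameter describes.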
Feature Loading & Extraction¶
-
phishbench.feature_extraction.reflection.load_features(internal_features=None, feature_filter=None)¶ Searches for python modules in the current directory and loads features from any modules it finds.
- Parameters
internal_features (Union[ModuleType, List]) – The module or a list of modules to load internal features from
feature_filter (Union[str, None]) – The filter to use when loading features. Allowed values are:
"Email" - Only include email types that are enabled in the configuration.
"URL" - Only include URL types that are enabled in the configuration.
None - Don’t filter features.
- Returns
A list of features.
-
phishbench.feature_extraction.reflection.load_features_from_module(features_module, filter_features=None)¶ Loads features from a module
- Parameters
features_module (ModuleType) – The module to import features from
filter_features (Union[str, None]) – Whether or not to filter out features according to the current configuration
- Returns
A list of features in the module.
URL Feature Extraction¶
-
phishbench.feature_extraction.url.extract_features_from_single(features, url)¶ Extracts multiple features from a single url
- Parameters
features (List[phishbench.feature_extraction.reflection.models.FeatureClass]) – The features to extract
url (phishbench.input.url_input._url_data.URLData) – The url to extract the features from
- Returns
feature_values (Dict) – The extracted feature values
extraction_times (Dict) – The time it took to extract each feature
- Return type
Tuple[Dict, Dict]
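The shape of the return value (one dict of values, one dict of timings) can be sketched without PhishBench installed. The helper below is hypothetical: it assumes features are given as a plain mapping from name to callable and that a failing feature yields a default value, whereas the real function takes feature objects.

```python
import time


def extract_from_single(features, item, default_value=-1):
    """Run every callable in `features` on `item`; return (values, times)."""
    values, times = {}, {}
    for name, func in features.items():
        start = time.perf_counter()
        try:
            values[name] = func(item)
        except Exception:
            # Assumed behaviour: a failing feature falls back to a default.
            values[name] = default_value
        times[name] = time.perf_counter() - start
    return values, times


demo_features = {"length": len, "broken": lambda url: 1 / 0}
feature_values, extraction_times = extract_from_single(demo_features, "http://x.y")
```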
-
phishbench.feature_extraction.url.extract_features_list(urls, features)¶ Extracts features from a list of URLs
- Parameters
urls (List[URLData]) – The urls to extract features from
features (List[FeatureClass]) – The features to extract
- Returns
feature_list_dict – A list of dicts containing the extracted features
- Return type
List[Dict[str]]
-
phishbench.feature_extraction.url.extract_labeled_dataset(legit_dataset_folder, phish_dataset_folder, features=None)¶ Extract features from a labeled dataset split by files
- Parameters
legit_dataset_folder (str) – The path of the folder/file containing the legitimate urls
phish_dataset_folder (str) – The path of the folder/file containing the phishing urls
features (Optional[List[FeatureClass]]) – A list of feature objects or None. If None, then this function will load and instantiate new instances of the features
- Returns
features (List[Dict]) – A list of dicts containing the extracted features
labels (List[int]) – A list of labels. 0 is legitimate and 1 is phishing
features (List[FeatureClass]) – The feature instances used.
Email Feature Extraction¶
-
phishbench.feature_extraction.email.extract_features_from_single(features, email_msg)¶ Extracts multiple features from a single email
- Parameters
features (List) – The features to extract
email_msg (EmailMessage) – The email to extract the features from
- Returns
feature_values (Dict) – The extracted feature values
extraction_times (Dict) – The time it took to extract each feature
- Return type
Tuple[Dict, Dict]
-
phishbench.feature_extraction.email.extract_features_list(emails, features)¶ Extracts features from a list of EmailMessage objects
- Parameters
emails (List[EmailMessage]) – The emails to extract features from
features (List[FeatureClass]) – The features to extract
- Returns
feature_list_dict – A list of dicts containing the extracted features
- Return type
List[Dict[str]]
-
phishbench.feature_extraction.email.extract_labeled_dataset(legit_dataset_folder, phish_dataset_folder, features=None)¶ Extracts features from a dataset of emails split in two folders
- Parameters
legit_dataset_folder (str) – The folder containing legitimate emails
phish_dataset_folder (str) – The folder containing phishing emails
features (Optional[List[FeatureClass]]) – A list of feature objects or None. If None, then this function will load and instantiate new instances of the features
- Returns
feature_values (List[Dict]) – The feature values in a list of dictionaries. Features are mapped config_name to value.
labels (List[int]) – The class labels for the dataset
features (List[FeatureClass]) – The feature instances used.
Built-In features¶
URL Features¶
URL Features¶
average_domain_token_length¶
Average length of tokens from the URL domain
average_path_token_length¶
Average length of tokens from the URL path
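A minimal sketch of the two average-token-length features, assuming tokens are the runs of letters and digits left after splitting on every other character. The tokenizer PhishBench actually uses is not described here, so tokens and average_token_length are illustrative names only.

```python
import re
from urllib.parse import urlsplit


def tokens(text):
    """Split on anything that is not an ASCII letter or digit (assumed rule)."""
    return [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]


def average_token_length(text):
    toks = tokens(text)
    return sum(map(len, toks)) / len(toks) if toks else 0.0


parts = urlsplit("http://login.example.com/account/verify.php")
domain_avg = average_token_length(parts.hostname or "")
path_avg = average_token_length(parts.path)
```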
brand_in_url¶
Whether or not the names of popular phishing targets are in the URL
char_dist¶
The character distribution of the url
char_dist_euclidian_distance¶
The Euclidean distance (L2 norm of u-v) between the URL and the English character distribution
char_dist_kl_divergence¶
The Kullback-Leibler divergence between the URL and the English character distribution
char_dist_kolmogorov_shmirnov¶
The Kolmogorov-Smirnov statistic between the URL and the English character distribution
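The three distance features can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: the English letter frequencies are commonly quoted approximations, only the letters a-z are counted, and the Kolmogorov-Smirnov statistic is taken here as the largest gap between the two cumulative distributions in alphabetical order.

```python
import math
import string
from collections import Counter

# Approximate English letter frequencies in percent (assumed reference values).
ENGLISH_PERCENT = {
    "a": 8.17, "b": 1.49, "c": 2.78, "d": 4.25, "e": 12.70, "f": 2.23,
    "g": 2.02, "h": 6.09, "i": 6.97, "j": 0.15, "k": 0.77, "l": 4.03,
    "m": 2.41, "n": 6.75, "o": 7.51, "p": 1.93, "q": 0.10, "r": 5.99,
    "s": 6.33, "t": 9.06, "u": 2.76, "v": 0.98, "w": 2.36, "x": 0.15,
    "y": 1.97, "z": 0.07,
}
_total = sum(ENGLISH_PERCENT.values())
ENGLISH = [ENGLISH_PERCENT[c] / _total for c in string.ascii_lowercase]


def letter_distribution(text):
    """Relative frequency of each letter a-z in `text` (case-insensitive)."""
    counts = Counter(ch for ch in text.lower() if ch in string.ascii_lowercase)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * 26
    return [counts[c] / total for c in string.ascii_lowercase]


def euclidean(u, v):
    """L2 norm of u - v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def kl_divergence(p, q, eps=1e-10):
    """KL(p || q); eps guards against division by zero."""
    return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q) if a > 0)


def ks_statistic(p, q):
    """Largest absolute difference between the two cumulative distributions."""
    cdf_p = cdf_q = worst = 0.0
    for a, b in zip(p, q):
        cdf_p += a
        cdf_q += b
        worst = max(worst, abs(cdf_p - cdf_q))
    return worst


url_dist = letter_distribution("http://example.com/login")
```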
consecutive_numbers¶
The sum of squares of the length of substrings that are consecutive numbers
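As a worked example of this definition, the sketch below finds each maximal run of digits and sums the squared run lengths. It assumes a single isolated digit counts as a run of length one, which the description does not settle.

```python
import re


def consecutive_numbers(url):
    """Sum of squared lengths of maximal digit runs in `url`."""
    return sum(len(run) ** 2 for run in re.findall(r"\d+", url))


# "12" contributes 2**2 and "345" contributes 3**2.
score = consecutive_numbers("http://a12b.com/345")
```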
digit_letter_ratio¶
Number of digits divided by number of letters
domain_length¶
The length of the domain of the url
domain_letter_occurrence¶
The number of times each letter occurs in the domain
double_slashes_in_path¶
Whether or not there are double slashes in the URL path
has_anchor_tag¶
Whether the url has an anchor tag
has_at_symbol¶
Whether or not the character @ is in the url
has_hex_characters¶
Whether or not there are escaped hex characters in the URL
has_https¶
Whether or not the url is https
has_more_than_three_dots¶
Whether the url without www. has more than three dots
has_port¶
Whether or not the URL has a port number
has_www_in_middle¶
Whether or not the string “www” is in the middle of the domain.
http_in_middle¶
Whether or not the string ‘http’ occurs in the middle of the url.
is_common_tld¶
Whether or not the tld is one of: .com, .net, .org, .edu, .mil, .gov, .co, .biz, .info, .me
is_ip_addr¶
Whether or not the url is an IPv4 address
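One way to perform such a check with the standard library is sketched below. It assumes the feature looks at the host part of the URL; is_ipv4_host is a hypothetical helper, not the PhishBench function.

```python
import ipaddress
from urllib.parse import urlsplit


def is_ipv4_host(url):
    """True if the host part of `url` is a dotted-quad IPv4 address."""
    host = urlsplit(url).hostname
    if host is None:
        # No scheme or no authority section, so there is no host to inspect.
        return False
    try:
        ipaddress.IPv4Address(host)
    except ipaddress.AddressValueError:
        return False
    return True
```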
is_whitelisted¶
Whether or not the domain is one of the targeted brands
longest_domain_token_length¶
Length of the longest token from the URL domain
null_in_domain¶
Whether or not the string null is in the url (ignoring case)
num_punctuation¶
Number of punctuation characters. Punctuation is defined as string.punctuation
number_of_dashes¶
The number of dashes in the url
number_of_digits¶
The number of digits in the url
number_of_dots¶
The number of times the . character occurs in the url
number_of_slashes¶
The number of forward or backward slashes in the url
protocol_port_match¶
Whether or not the port of the URL matches its protocol
special_char_count¶
The number of @ or - characters in the url
special_pattern¶
Whether or not the string ?gws_rd=ssl appears in the url
token_count¶
The number of tokens in the path
top_level_domain¶
The top level domain of the url
url_length¶
The length of the url.
hisc_whole¶
Number of characters from the set
{@xQ+]M&=<}#[?|'
in the URL
Reference¶
Tao Feng and Chuan Yue. 2020. Visualizing and Interpreting RNN Models in URL-based Phishing Detection.
hisc_host¶
Number of characters from the set
XznGR%rmNM=DIZc:
in the host section of the URL
Reference¶
Tao Feng and Chuan Yue. 2020. Visualizing and Interpreting RNN Models in URL-based Phishing Detection.
hisc_path¶
Number of characters from the set
Y{x+]p!=}#[|:h
in the path section of the URL
Reference¶
Tao Feng and Chuan Yue. 2020. Visualizing and Interpreting RNN Models in URL-based Phishing Detection. <https://doi.org/10.1145/3381991.3395602>
hisc_query¶
Number of characters from the set
5)-x+]M=}D#[?|'(h~}
in the query section of the URL
Reference¶
Tao Feng and Chuan Yue. 2020. Visualizing and Interpreting RNN Models in URL-based Phishing Detection.
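The four hisc features share one operation: counting how many characters of a URL section belong to a fixed set. The sketch below uses the character sets listed above and urllib.parse.urlsplit to separate the sections. Treating the "host section" as the netloc component is an assumption, and hisc_counts is a hypothetical helper.

```python
from urllib.parse import urlsplit

HISC_WHOLE = set("{@xQ+]M&=<}#[?|'")
HISC_HOST = set("XznGR%rmNM=DIZc:")
HISC_PATH = set("Y{x+]p!=}#[|:h")
HISC_QUERY = set("5)-x+]M=}D#[?|'(h~}")


def count_from_set(text, charset):
    """Number of characters in `text` that belong to `charset`."""
    return sum(1 for ch in text if ch in charset)


def hisc_counts(url):
    parts = urlsplit(url)
    return {
        "hisc_whole": count_from_set(url, HISC_WHOLE),
        "hisc_host": count_from_set(parts.netloc, HISC_HOST),  # assumed: netloc
        "hisc_path": count_from_set(parts.path, HISC_PATH),
        "hisc_query": count_from_set(parts.query, HISC_QUERY),
    }


counts = hisc_counts("http://Mx.com/pay?x=5")
```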
Network Features¶
as_number¶
The autonomous system (AS) number of the url
creation_date¶
The whois info creation date
dns_ttl¶
The TTL for DNS requests
expiration_date¶
The whois info expiration date
number_name_server¶
The number of name servers returned by the NS query
updated_date¶
The whois info update date
HTML Features¶
website_tfidf¶
TF-IDF word-level vectors of the downloaded websites
content_length¶
The value of the Content-Length header
website_content_type¶
The value of the Content-Type header
has_password_input¶
Whether or not the website has an input element of type password
is_redirect¶
Whether or not the URL redirects to a different url
link_alexa_global_rank¶
The mean and standard deviation of the global alexa ranks of the links on the website bucketed into the ranges
<1,000
<10,000
<100,000
<500,000
<1,000,000
unranked
Reference¶
Zhou, Xin, and Rakesh Verma. “Phishing sites detection from a web developer’s perspective using machine learning.” Proceedings of the 53rd Hawaii International Conference on System Sciences. 2020.
link_tree_features¶
Partitions the links on the page into 36 sets as follows:
- Splits all links in the page by the following HTML tags:
<a>
<link>
<script>
<video>
<img>
<meta>
- Then divide into the following categories:
social network links (Facebook, YouTube, Google, Twitter, Instagram, Pinterest)
other https links
other http links
links containing the current domain
internal links
any URL
Returns the size, mean length, and length standard deviation of each set
Reference¶
Zhou, Xin, and Rakesh Verma. “Phishing sites detection from a web developer’s perspective using machine learning.” Proceedings of the 53rd Hawaii International Conference on System Sciences. 2020.
number_of_anchor¶
The number of anchor (a) tags
number_of_audio¶
The number of audio tags
number_of_body¶
The number of body tags
number_of_embed¶
The number of embed tags
number_of_external_links¶
The number of links to a different domain
number_of_external_content¶
The number of content tags hosted on external domains. A content tag is defined as any of the following tags:
audio,embed,iframe,img,input,script,source,track,video
number_of_head¶
The number of head tags
number_of_html¶
The number of html tags
number_of_iframe¶
The number of iframe tags
number_of_img¶
The number of img tags
number_of_internal_content¶
The number of content tags hosted on the same domain. A content tag is defined as any of the following tags:
audio,embed,iframe,img,input,script,source,track,video
number_of_internal_links¶
The number of links to the same domain
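A minimal sketch of splitting links into internal and external ones with the standard library. It assumes a link is an a tag with an href, that relative links resolve against the page URL, and that "same domain" means an identical hostname. The real feature may compare registered domains instead.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def count_links(html, page_url):
    """Return (internal, external) link counts for a page."""
    collector = LinkCollector()
    collector.feed(html)
    page_host = urlsplit(page_url).hostname
    internal = external = 0
    for href in collector.hrefs:
        # Resolve relative links before comparing hostnames.
        host = urlsplit(urljoin(page_url, href)).hostname
        if host == page_host:
            internal += 1
        else:
            external += 1
    return internal, external


page = (
    '<a href="/about">About</a>'
    '<a href="http://example.com/x">Same host</a>'
    '<a href="https://other.org/">Elsewhere</a>'
)
link_counts = count_links(page, "http://example.com/index.html")
```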
number_of_scripts¶
The number of script tags
number_of_tags¶
The total number of tags
number_of_title¶
The number of title tags
number_of_video¶
The number of video tags
number_suspicious_content¶
The number of suspicious tags. A tag is considered suspicious if its length is greater than 128, and less than 5% of it is spaces.
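The rule above can be sketched as a short function. The sketch assumes "length" means the character length of the raw tag text and finds tags with a simple regular expression, which will miscount tags whose attribute values contain a > character. A real implementation would more likely use an HTML parser.

```python
import re


def number_suspicious_tags(html, min_length=128, max_space_ratio=0.05):
    """Count tags longer than min_length whose share of spaces is below the ratio."""
    count = 0
    for tag in re.findall(r"<[^>]+>", html):
        if len(tag) > min_length and tag.count(" ") / len(tag) < max_space_ratio:
            count += 1
    return count


long_dense_tag = '<img src="' + "A" * 200 + '">'   # 212 characters, one space
long_sparse_tag = "<div " + "a b " * 50 + ">"      # long, but about half spaces
```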
x_powered_by¶
The value of the X-Powered-By header
Javascript Features¶
number_of_escape¶
Number of escape calls in the embedded javascript
number_of_eval¶
Number of eval calls in the embedded javascript
number_of_event_attachment¶
Number of addEventListener or attachEvent calls in the embedded javascript
number_of_event_dispatch¶
Number of dispatchEvent or fireEvent calls in the embedded javascript
number_of_exec¶
Number of exec calls in the embedded javascript
number_of_iframes_in_script¶
Number of times the token iframe shows up in embedded javascript
number_of_link¶
Number of link calls in the embedded javascript
number_of_search¶
Number of search calls in the embedded javascript
number_of_set_timeout¶
Number of setTimeout calls in the embedded javascript
number_of_unescape¶
Number of unescape calls in the embedded javascript
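Counting calls of a given function in script text can be approximated with a regular expression, as sketched below. It assumes a "call" is the function name at a word boundary followed by an opening parenthesis. That is a textual heuristic rather than JavaScript parsing, and count_calls is a hypothetical helper.

```python
import re


def count_calls(script, *names):
    """Count occurrences of any of `names` immediately followed by '('."""
    pattern = r"\b(?:" + "|".join(map(re.escape, names)) + r")\s*\("
    return len(re.findall(pattern, script))


script = (
    'eval(unescape("%61")); '
    'setTimeout(function () { eval("x"); }, 10); '
    'el.addEventListener("click", f);'
)
```

The word boundary keeps escape from matching inside unescape, so the two features stay independent.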
right_click_modified¶
Whether or not the right click event has been modified.
Email Features¶
Header Features¶
mime_version¶
The MIME version
header_file_size¶
The size of the header in bytes
return_path¶
The return path of the email
X-mailer¶
The x-mailer of the email
x_originating_hostname¶
The x-originating-hostname header of the email
x_originating_ip¶
The x-originating-ip header of the email
x_virus_scanned¶
Whether or not the x-virus-scanned header is in the email
x_spam_flag¶
Whether or not the x-spam-flag header is in the email
received_count¶
The number of Received headers
authentication_results_spf_pass¶
Whether or not spf=pass is in the authentication results
authentication_results_dkim_pass¶
Whether or not dkim=pass is in the authentication results
has_x_original_authentication_results¶
Whether or not the email has the X-Original-Authentication-Results header
has_received_spf¶
Whether or not the email has the Received-SPF header
has_dkim_signature¶
Whether or not the email has the DKIM-Signature header
compare_sender_domain_message_id_domain¶
Whether or not the domains of the sender address and the message id are the same
compare_sender_return¶
Whether or not the return path and the sender address are the same
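The header comparisons above can be sketched with Python's email package. The sketch assumes the "sender address" is the From header and that addresses are compared case-insensitively. Neither detail is confirmed by this page, and header_comparisons is a hypothetical helper.

```python
from email import message_from_string
from email.utils import parseaddr


def address_domain(value):
    """Domain part of the address found in a header value ('' if none)."""
    address = parseaddr(value or "")[1]
    return address.rpartition("@")[2].lower()


def header_comparisons(msg):
    sender = parseaddr(msg.get("From", ""))[1].lower()
    return_path = parseaddr(msg.get("Return-Path", ""))[1].lower()
    return {
        "compare_sender_return": sender == return_path,
        "compare_sender_domain_message_id_domain":
            address_domain(msg.get("From")) == address_domain(msg.get("Message-ID")),
        "received_count": len(msg.get_all("Received", [])),
    }


raw = (
    "Return-Path: <bounce@mailer.example.net>\n"
    "Received: from a.example.net by b.example.com; Mon, 1 Jan 2024 00:00:00 +0000\n"
    "Received: from c.example.net by a.example.net; Mon, 1 Jan 2024 00:00:00 +0000\n"
    "From: Alice <alice@example.com>\n"
    "Message-ID: <123@example.com>\n"
    "Subject: hi\n"
    "\n"
    "body\n"
)
header_result = header_comparisons(message_from_string(raw))
```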
blacklisted_words_subject¶
Number of times the following words/phrases appear in the subject:
urgent
account
closing
act now
click here
limited
suspension
your account
verify your account
agree
bank
dear
update
confirm
customer
client
suspend
restrict
verify
login
ssn
username
click
log
inconvenient
alert
paypal
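A minimal sketch of counting these words and phrases in a subject line. It assumes case-insensitive substring matching, so overlapping entries such as "account", "your account" and "verify your account" are each counted. Whether PhishBench reports one total or one value per phrase is not stated here, so the sketch keeps the per-phrase counts.

```python
BLACKLIST = [
    "urgent", "account", "closing", "act now", "click here", "limited",
    "suspension", "your account", "verify your account", "agree", "bank",
    "dear", "update", "confirm", "customer", "client", "suspend", "restrict",
    "verify", "login", "ssn", "username", "click", "log", "inconvenient",
    "alert", "paypal",
]


def blacklisted_word_counts(subject):
    """Map each blacklisted phrase to its number of occurrences in `subject`."""
    text = subject.lower()
    return {phrase: text.count(phrase) for phrase in BLACKLIST}


subject_counts = blacklisted_word_counts("URGENT: verify your account now")
```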
number_cc¶
Number of addresses in the CC field
number_bcc¶
Number of addresses in the BCC field
number_to¶
Number of addresses in the to field
number_of_words_subject¶
The number of words in the subject
number_of_characters_subject¶
The number of characters in the subject
number_of_special_characters_subject¶
The number of special characters in the subject
is_forward¶
Whether or not “fw:” is in the subject
is_reply¶
Whether or not “re:” is in the subject
vocab_richness_subject¶
The vocabulary richness (Yule) of the subject
Body Features¶
is_html¶
Whether or not the email has an HTML body
num_content_type¶
The number of parts that declare a Content-Type.
num_unique_content_type¶
The number of unique Content-Type values.
num_content_type_text_plain¶
The number of parts that declare a Content-Type value of text/plain.
num_content_type_text_html¶
The number of parts that declare a Content-Type value of text/html.
num_content_type_multipart_mixed¶
The number of parts that declare a Content-Type value of multipart/mixed.
num_content_type_multipart_encrypted¶
The number of parts that declare a Content-Type value of multipart/encrypted.
num_content_type_form_data¶
The number of parts that declare a Content-Type value of multipart/form-data.
num_content_type_multipart_byterange¶
The number of parts that declare a Content-Type value of multipart/byterange.
num_content_type_multipart_parallel¶
The number of parts that declare a Content-Type value of multipart/parallel.
num_content_type_multipart_report¶
The number of parts that declare a Content-Type value of multipart/report.
num_content_type_multipart_alternative¶
The number of parts that declare a Content-Type value of multipart/alternative.
num_content_type_multipart_signed¶
The number of parts that declare a Content-Type value of multipart/signed.
num_content_type_multipart_x_mix_replaced¶
The number of parts that declare a Content-Type value of multipart/x-mixed-replaced.
num_content_disposition¶
The number of parts that declare a Content-Disposition.
num_unique_content_disposition¶
The number of different Content-Disposition types used.
num_charset¶
The number of parts with a declared charset.
num_charset_utf7¶
The number of parts using the utf-7 charset.
num_charset_utf8¶
The number of parts using the utf-8 charset.
num_charset_gb2312¶
The number of parts using the gb2312 charset.
num_charset_shift_js¶
The number of parts using the shift-jis charset.
num_charset_koi¶
The number of parts using the koi charset.
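The Content-Type and charset counts can be sketched by walking the MIME parts of a message with Python's email package. The sketch assumes the top-level message counts as a part and that charset names are compared in lower case. part_statistics is a hypothetical helper, not a PhishBench function.

```python
from collections import Counter
from email import message_from_string


def part_statistics(msg):
    """Return (content-type counts, charset counts) over all MIME parts."""
    types, charsets = Counter(), Counter()
    for part in msg.walk():
        # Only count parts that explicitly declare a Content-Type header.
        if part.get("Content-Type") is not None:
            types[part.get_content_type()] += 1
        charset = part.get_content_charset()  # lower-cased, or None
        if charset:
            charsets[charset] += 1
    return types, charsets


raw_email = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="XYZ"\n'
    "\n"
    "--XYZ\n"
    'Content-Type: text/plain; charset="utf-8"\n'
    "\n"
    "hello\n"
    "--XYZ\n"
    'Content-Type: text/html; charset="UTF-8"\n'
    "\n"
    "<p>hello</p>\n"
    "--XYZ--\n"
)
type_counts, charset_counts = part_statistics(message_from_string(raw_email))
```

In this sketch, sum(type_counts.values()) plays the role of num_content_type and len(type_counts) the role of num_unique_content_type.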
num_unique_attachment¶
The number of attachments
num_unique_attachment_filetypes¶
The number of attachment filetypes
num_content_transfer_encoding¶
num_unique_content_transfer_encoding¶
num_content_transfer_encoding_7bit¶
num_content_transfer_encoding_8bit¶
num_content_transfer_encoding_binary¶
num_content_transfer_encoding_quoted_printable¶
num_words_body¶
The number of words in the body.
num_unique_words_in_body¶
The number of unique words in the body.
number_of_characters_body¶
The number of characters in the body.
number_unique_chars_body¶
The number of unique characters in the body.
number_of_special_characters_body¶
The number of characters matching the regex _|[^\w\s].
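A short illustration of what this regex matches: an underscore, or any character that is neither a word character nor whitespace. The function name below is hypothetical.

```python
import re

SPECIAL = re.compile(r"_|[^\w\s]")


def count_special_characters(text):
    """Number of characters in `text` matched by the regex above."""
    return len(SPECIAL.findall(text))


# The comma, the underscore and the exclamation mark match.
special_count = count_special_characters("Hi, there_!")
```

Because \w is Unicode-aware for str patterns in Python 3, accented letters are not counted as special characters.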
vocab_richness_body¶
The vocabulary richness of the body, as measured by the inverse of Yule’s K measure.
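For orientation, one common formulation of Yule's K is K = 10^4 * (sum of f^2 over distinct words - N) / N^2, where f is a word's frequency and N the total number of words. The sketch below computes K and its inverse under that formulation. The tokenization and the value returned when K is zero are assumptions, and PhishBench's exact formula may differ.

```python
import re
from collections import Counter


def yule_k(text):
    """Yule's K under the formulation stated above."""
    words = re.findall(r"\w+", text.lower())
    n = len(words)
    if n == 0:
        return 0.0
    m2 = sum(freq ** 2 for freq in Counter(words).values())
    return 1e4 * (m2 - n) / (n * n)


def inverse_yule_k(text):
    """1 / K; returns 0.0 when K is zero (an arbitrary convention here)."""
    k = yule_k(text)
    return 1.0 / k if k > 0 else 0.0
```

For the text "a a b", N is 3 and the squared frequencies sum to 5, so K is 10^4 * 2 / 9.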
greetings_body¶
Whether or not the string “Dear User” is in the text.
num_href_tag¶
num_end_tag¶
num_open_tag¶
num_on_mouse_over¶
blacklisted_words_body¶
number_of_scripts¶
number_of_img_links¶
function_words_count¶
The number of function words in the body.
flesh_read_score¶
smog_index¶
flesh_kincaid_score¶
coleman_liau_index¶
automated_readability_index¶
dale_chall_readability_score¶
difficult_words¶
linsear_score¶
gunning_fog¶
text_standard¶
References¶
McGrath, D. Kevin, and Minaxi Gupta. (2008) “Behind Phishing: An Examination of Phisher Modi Operandi”
Rakesh Verma and Keith Dyer. 2015. On the Character of Phishing URLs: Accurate and Robust Statistical Learning Classifiers. In Proceedings of the 5th ACM Conference on Data and Application Security and Privacy (CODASPY ‘15). Association for Computing Machinery, New York, NY, USA, 111–122. DOI:https://doi.org/10.1145/2699026.2699115
A. Das, S. Baki, A. El Aassal, R. Verma and A. Dunbar, “SoK: A Comprehensive Reexamination of Phishing Research From the Security Perspective,” in IEEE Communications Surveys & Tutorials, vol. 22, no. 1, pp. 671-708, Firstquarter 2020, doi: 10.1109/COMST.2019.2957750.