phishbench.input¶
This module loads datasets from raw emails and URLs. It is split into two sub-modules. The email_input submodule handles email datasets, and the url_input submodule handles URL datasets.
In addition, this module contains the read_train_set and read_test_set functions, which use the relevant submodule to read in datasets according to the global configuration.
-
phishbench.input.read_train_set(download_url=False)¶ Reads in the training set according to the configuration file.
- Parameters
download_url (bool) – When loading a URL dataset, whether or not to download the websites pointed to by the URLs. Ignored when loading an email dataset.
- Returns
data (A list of URLData or EmailMessage objects) – The read-in data.
labels (List[int]) – A list of labels. 0 is legitimate and 1 is phish
- Return type
Tuple[Union[List[phishbench.input.url_input._url_data.URLData], List[phishbench.input.email_input.models._email.EmailMessage]], List[int]]
-
phishbench.input.read_test_set(download_url=False)¶ Reads in the test set according to the configuration file.
- Parameters
download_url (bool) – When loading a URL dataset, whether or not to download the websites pointed to by the URLs. Ignored when loading an email dataset.
- Returns
data (A list of URLData or EmailMessage objects) – The read-in data.
labels (List[int]) – A list of labels. 0 is legitimate and 1 is phish
- Return type
Tuple[Union[List[phishbench.input.url_input._url_data.URLData], List[phishbench.input.email_input.models._email.EmailMessage]], List[int]]
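Both read_train_set and read_test_set return a (data, labels) pair in which the two lists are index-aligned. The following is a minimal sketch of summarizing such labels. The helper name count_labels is hypothetical and not part of PhishBench, and placeholder strings stand in for real URLData or EmailMessage objects so the snippet runs without a configured dataset.

```python
from collections import Counter
from typing import Dict, List


def count_labels(labels: List[int]) -> Dict[str, int]:
    """Count legitimate (0) and phish (1) labels."""
    counts = Counter(labels)
    return {"legitimate": counts[0], "phish": counts[1]}


# In real use the pair would come from the library, for example:
#   data, labels = phishbench.input.read_train_set()
# Placeholder values are used here so the sketch runs on its own.
data = ["sample-a", "sample-b", "sample-c"]
labels = [0, 1, 1]

summary = count_labels(labels)
print(summary)
```

The same helper would apply unchanged to the labels returned by read_test_set, since both functions document the same return type.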
Email Input¶
This module handles email input.
-
phishbench.input.email_input.read_dataset_email(folder_path)¶ Reads a folder of emails
- Parameters
folder_path (str) – The path to the folder you want to read
- Returns
emails (List[EmailMessage]) – The parsed emails
files (List[str]) – The paths of the files loaded
- Return type
Tuple[List[phishbench.input.email_input.models._email.EmailMessage], List[str]]
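To illustrate the shape of the (emails, files) result, here is a rough standard-library sketch of reading a folder of raw emails. It is not PhishBench's implementation: the function name read_email_folder is hypothetical, and it returns plain email.message.Message objects rather than PhishBench EmailMessage instances.

```python
import email
import os
import tempfile
from email.message import Message
from typing import List, Tuple


def read_email_folder(folder_path: str) -> Tuple[List[Message], List[str]]:
    """Parse every file under folder_path; return (messages, file paths)."""
    messages: List[Message] = []
    files: List[str] = []
    for root, _dirs, names in os.walk(folder_path):
        for name in sorted(names):
            path = os.path.join(root, name)
            # Read as bytes so the parser handles any declared charset.
            with open(path, "rb") as handle:
                messages.append(email.message_from_binary_file(handle))
            files.append(path)
    return messages, files


# Demonstration on a temporary folder containing one tiny email.
with tempfile.TemporaryDirectory() as folder:
    with open(os.path.join(folder, "sample.eml"), "w") as handle:
        handle.write("Subject: Hello\n\nBody text\n")
    emails, files = read_email_folder(folder)

print(emails[0]["Subject"], files)
```

The two returned lists are kept in the same order, so files[i] is the path that emails[i] was parsed from.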
-
phishbench.input.email_input.read_email_from_file(file_path)¶ Reads an email from a file
- Parameters
file_path (str) – The path of the email to read
- Returns
msg – The parsed email
- Return type
-
class
phishbench.input.email_input.EmailMessage(msg)¶ A parsed email.
-
raw_message¶ The raw email message object
- Type
email.message.Message
-
header¶ The header of the email.
- Type
-
__init__(msg)¶ Constructs an EmailMessage with a raw email.
- Parameters
msg (email.message.Message) – The raw email to parse
-
-
class
phishbench.input.email_input.EmailHeader(msg)¶ Represents the header of an email
-
orig_date¶ The origination date of the email
- Type
datetime
-
x_priority¶ The X-Priority header value: an integer between 1 and 5 if present, otherwise None.
- Type
int
-
subject¶ The value of the subject header field if present. None otherwise.
- Type
str
-
return_path¶ The return path of the email without angle brackets
- Type
str
-
reply_to¶ The reply-to values
- Type
List[str]
-
sender_full¶ The sender of the email.
- Type
str
-
sender_name¶ The sender’s display name
- Type
str
-
sender_email_address¶ The email address of the sender
- Type
str
-
to¶ The raw mailboxes in the To: field
- Type
List[str]
-
recipient_full¶ The mailbox the email was sent to, or the first mailbox in the To field if we cannot figure out who received the email
- Type
str
-
recipient_name¶ The name of the recipient
- Type
str
-
recipient_email_address¶ The email address of the recipient
- Type
str
-
message_id¶ A unique message identifier for the email.
- Type
str
-
x_mailer¶ The desktop client which sent the email, as indicated by the X-Mailer header.
- Type
str
-
x_originating_hostname¶ The originating hostname if available
- Type
str
-
x_originating_ip¶ The originating IP if available
- Type
str
-
x_virus_scanned¶ Whether or not the email has been scanned for a virus
- Type
bool
-
dkim_signed¶ Whether or not the email has a DKIM signature
- Type
bool
-
received_spf¶ Whether or not the Received-SPF header is present in the email
- Type
bool
-
x_original_authentication_results¶ Whether or not the X-Original-Authentication-Results header is present in the email
- Type
bool
-
authentication_results¶ The contents of the Authentication-Results header
- Type
str
-
received¶ A list containing the Received headers of the email
- Type
List[str]
-
mime_version¶ The value of the MIME-Version header field
- Type
str
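Several of the attributes above correspond closely to raw header fields. The sketch below shows how comparable values could be derived with the standard library alone. It is an illustration under stated assumptions, not the EmailHeader implementation: the helper parse_x_priority and the sample message are hypothetical.

```python
from email import message_from_string
from email.utils import parseaddr
from typing import Optional

# Hypothetical sample message used only for this illustration.
RAW = (
    "From: Alice Example <alice@example.com>\n"
    "Return-Path: <bounce@example.com>\n"
    "X-Priority: 3 (Normal)\n"
    "Subject: Quarterly report\n"
    "\n"
    "Body\n"
)

msg = message_from_string(RAW)

# Split the From mailbox into display name and address.
sender_name, sender_email_address = parseaddr(msg["From"])

# Return path without the surrounding angle brackets.
return_path = msg["Return-Path"].strip().strip("<>")


def parse_x_priority(value: Optional[str]) -> Optional[int]:
    """Return the leading 1-5 digit of an X-Priority value, else None."""
    if not value:
        return None
    first = value.strip()[:1]
    return int(first) if first and first in "12345" else None


x_priority = parse_x_priority(msg["X-Priority"])

# Presence checks, analogous to the boolean attributes.
dkim_signed = "DKIM-Signature" in msg

print(sender_name, sender_email_address, return_path, x_priority, dkim_signed)
```

Header lookups on a Message are case-insensitive, so the presence check works regardless of how the header name is capitalized in the raw email.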
-
-
class
phishbench.input.email_input.EmailBody(msg)¶ A class representing the body of an email.
-
text¶ The raw text of the email.
- Type
str
-
raw_html¶ The uncleaned HTML of the email.
- Type
str
-
html¶ The cleaned HTML of the email. Cleaning removes scripts and styles from the HTML.
- Type
str
-
is_html¶ Whether or not the email is HTML
- Type
bool
-
num_attachment¶ The number of attachments the email contains
- Type
int
-
content_disposition_list¶ A list of the content dispositions of each part of the email
- Type
List[str]
-
content_type_list¶ A list of the content types of each part of the email
- Type
List[str]
-
content_transfer_encoding_list¶ A list of the content transfer encodings for each part
- Type
List[str]
-
file_extension_list¶ A list containing the file extensions for each attachment
- Type
List[str]
-
charset_list¶ A list containing the charsets for each part
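The per-part lists above can be pictured as the result of walking the MIME tree of a message. The following standard-library sketch builds a small multipart email and collects comparable values. It is an assumption about how such lists might be gathered, not the EmailBody implementation; in particular, whether PhishBench includes multipart container parts in its lists is not stated in this reference. The standard-library class is imported as StdEmailMessage to avoid confusion with the PhishBench EmailMessage class.

```python
import os
from email.message import EmailMessage as StdEmailMessage

# Build a hypothetical message: plain text, an HTML alternative, one attachment.
msg = StdEmailMessage()
msg["Subject"] = "Invoice"
msg.set_content("Plain text body")
msg.add_alternative("<html><body><p>HTML body</p></body></html>", subtype="html")
msg.add_attachment(
    b"%PDF-1.4 placeholder",
    maintype="application",
    subtype="pdf",
    filename="invoice.pdf",
)

# walk() visits the container parts as well as the leaf parts.
parts = list(msg.walk())

content_type_list = [part.get_content_type() for part in parts]
content_disposition_list = [
    part.get_content_disposition()
    for part in parts
    if part.get_content_disposition() is not None
]
charset_list = [
    part.get_content_charset()
    for part in parts
    if part.get_content_charset() is not None
]

attachments = [p for p in parts if p.get_content_disposition() == "attachment"]
num_attachment = len(attachments)
file_extension_list = [
    os.path.splitext(p.get_filename() or "")[1] for p in attachments
]

is_html = "text/html" in content_type_list

print(content_type_list, num_attachment, file_extension_list, is_html)
```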
-
URL Input¶
This module handles URL input.
-
phishbench.input.url_input.read_dataset_url(dataset_path, download_url, remove_dup=True)¶ Reads in a dataset of URLs from a file or a folder of files
- Parameters
dataset_path (str) – The location of the dataset to read from. This can either be a folder or a file.
download_url (bool) – Whether or not to download the websites pointed to by the URLs
remove_dup (bool) – Whether or not to remove duplicates.
- Returns
urls (List[URLData]) – A list of URLData objects representing the dataset
bad_url_list (List[str]) – The URLs that failed to extract.
- Return type
Tuple[List[phishbench.input.url_input._url_data.URLData], List[str]]
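The (urls, bad_url_list) result and the remove_dup flag can be illustrated with a small standard-library sketch. This is not PhishBench's loader: the function load_urls is hypothetical, it returns plain strings rather than URLData objects, and it treats a URL as bad when it fails a simple syntactic check, which stands in for whatever extraction failure the library actually detects.

```python
from typing import Iterable, List, Set, Tuple
from urllib.parse import urlparse


def load_urls(
    lines: Iterable[str], remove_dup: bool = True
) -> Tuple[List[str], List[str]]:
    """Split raw lines into (usable URLs, rejected entries)."""
    urls: List[str] = []
    bad: List[str] = []
    seen: Set[str] = set()
    for line in lines:
        candidate = line.strip()
        if not candidate:
            continue  # skip blank lines
        if remove_dup:
            if candidate in seen:
                continue  # drop exact duplicates, keeping the first occurrence
            seen.add(candidate)
        parsed = urlparse(candidate)
        # Stand-in validity check: require an http(s) scheme and a host.
        if parsed.scheme in ("http", "https") and parsed.netloc:
            urls.append(candidate)
        else:
            bad.append(candidate)
    return urls, bad


raw_lines = [
    "http://example.com/login",
    "http://example.com/login",
    "not a url",
    "https://example.org/",
]
urls, bad_url_list = load_urls(raw_lines)
print(urls, bad_url_list)
```

Calling load_urls(raw_lines, remove_dup=False) keeps the repeated entry, mirroring the documented purpose of the remove_dup parameter.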