phishbench.input

This module loads datasets from raw emails and URLs. It is split into two submodules: the email_input submodule handles email datasets, and the url_input submodule handles URL datasets.

In addition, this module contains the read_train_set and read_test_set functions, which use the relevant submodule to read in datasets according to the global configuration.

phishbench.input.read_train_set(download_url=False)

Reads in the training set according to the configuration file.

Parameters

download_url (bool) – When loading a URL dataset, whether or not to download the websites pointed to by the URLs. Ignored when loading an email dataset.

Returns

  • data (A list of URLData or EmailMessage objects) – The read-in data.

  • labels (List[int]) – A list of labels. 0 is legitimate and 1 is phishing.

Return type

Tuple[Union[List[phishbench.input.url_input._url_data.URLData], List[phishbench.input.email_input.models._email.EmailMessage]], List[int]]

phishbench.input.read_test_set(download_url=False)

Reads in the test set according to the configuration file.

Parameters

download_url (bool) – When loading a URL dataset, whether or not to download the websites pointed to by the URLs. Ignored when loading an email dataset.

Returns

  • data (A list of URLData or EmailMessage objects) – The read-in data.

  • labels (List[int]) – A list of labels. 0 is legitimate and 1 is phishing.

Return type

Tuple[Union[List[phishbench.input.url_input._url_data.URLData], List[phishbench.input.email_input.models._email.EmailMessage]], List[int]]
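Both functions return labels that follow the same 0/1 convention, so a quick class-balance check is often the first thing to run on the result. The following is a minimal sketch that does not import phishbench: summarize_labels is a hypothetical helper, and example_labels stands in for the second element of the tuple returned by read_train_set or read_test_set.

```python
from collections import Counter
from typing import Dict, List

# Label convention taken from the documentation above.
LABEL_NAMES = {0: "legitimate", 1: "phishing"}


def summarize_labels(labels: List[int]) -> Dict[str, int]:
    """Count how many samples fall into each class (hypothetical helper)."""
    counts = Counter(labels)
    return {name: counts.get(value, 0) for value, name in LABEL_NAMES.items()}


# Stand-in for the labels list returned by read_train_set()/read_test_set().
example_labels = [0, 1, 1, 0, 0]
summary = summarize_labels(example_labels)
print(summary)
```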

Email Input

This module handles email input.

phishbench.input.email_input.read_dataset_email(folder_path)

Reads a folder of emails

Parameters

folder_path (str) – The path to the folder you want to read

Returns

  • emails (List[EmailMessage]) – The parsed emails

  • files (List[str]) – The paths of the files loaded

Return type

Tuple[List[phishbench.input.email_input.models._email.EmailMessage], List[str]]
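To illustrate the shape of this return value, the sketch below reads a folder of emails using only the standard library. It is an assumption-laden stand-in, not the phishbench implementation: read_email_folder is a hypothetical name, and it returns raw email.message.Message objects rather than parsed EmailMessage objects.

```python
import email
import os
import tempfile
from email.message import Message
from typing import List, Tuple


def read_email_folder(folder_path: str) -> Tuple[List[Message], List[str]]:
    """Parse every file under folder_path; return the messages and their paths."""
    messages: List[Message] = []
    files: List[str] = []
    for root, _dirs, names in os.walk(folder_path):
        for name in sorted(names):
            path = os.path.join(root, name)
            # Binary mode avoids guessing the file's text encoding.
            with open(path, "rb") as handle:
                messages.append(email.message_from_binary_file(handle))
            files.append(path)
    return messages, files


# Demonstration on a temporary folder containing two tiny emails.
with tempfile.TemporaryDirectory() as folder:
    for index in (1, 2):
        with open(os.path.join(folder, f"mail{index}.eml"), "w") as out:
            out.write(f"Subject: Test {index}\n\nBody {index}\n")
    messages, files = read_email_folder(folder)

subjects = [message["Subject"] for message in messages]
file_names = [os.path.basename(path) for path in files]
print(subjects, file_names)
```

The two returned lists are index-aligned, so the path of the file behind any message can be looked up by position.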

phishbench.input.email_input.read_email_from_file(file_path)

Reads an email from a file

Parameters

file_path (str) – The path of the email to read

Returns

msg – The parsed email

Return type

EmailMessage

class phishbench.input.email_input.EmailMessage(msg)

A parsed email.

raw_message

The raw email message object

Type

email.message.Message

header

The header of the email.

Type

EmailHeader

body

The body of the email

Type

EmailBody

__init__(msg)

Constructs an EmailMessage with a raw email.

Parameters

msg (email.message.Message) – The raw email to parse

class phishbench.input.email_input.EmailHeader(msg)

Represents the header of an email

orig_date

The origination date of the email

Type

datetime

x_priority

The X-Priority header value: an integer between 1 and 5 if present, otherwise None.

Type

int

subject

The value of the subject header field if present. None otherwise.

Type

str

return_path

The return path of the email without angle brackets

Type

str

reply_to

The reply-to values

Type

List[str]

sender_full

The sender of the email.

Type

str

sender_name

The sender’s display name

Type

str

sender_email_address

The email address of the sender

Type

str

to

The raw mailboxes in the To: field

Type

List[str]

recipient_full

The mailbox the email was sent to, or the first mailbox in the To field if the actual recipient cannot be determined

Type

str

recipient_name

The name of the recipient

Type

str

recipient_email_address

The email address of the recipient

Type

str

message_id

A unique message identifier for the email.

Type

str

x_mailer

The desktop client which sent the email, as indicated by the X-Mailer header.

Type

str

x_originating_hostname

The originating hostname if available

Type

str

x_originating_ip

The originating IP address if available

Type

str

x_virus_scanned

Whether or not the email has been scanned for viruses

Type

bool

dkim_signed

Whether or not the email has a DKIM signature

Type

bool

received_spf

Whether or not the Received-SPF header is present in the email

Type

bool

x_original_authentication_results

Whether or not the X-Original-Authentication-Results header is present in the email

Type

bool

authentication_results

The contents of the Authentication-Results header

Type

str

received

A list containing the Received headers of the email

Type

List[str]

mime_version

The value of the MIME-Version header field

Type

str
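Several of these attributes map directly onto what the standard library's email package can extract from a raw message. The sketch below shows one plausible way to derive a few of them; it is not the EmailHeader implementation, the sample message is invented, and the variable names simply mirror the attribute names above.

```python
import email
from email.utils import getaddresses, parseaddr, parsedate_to_datetime

# Invented sample message used only for illustration.
RAW = (
    "Return-Path: <bounce@example.com>\n"
    "Received: from mail.example.com by mx.example.org; Mon, 1 Jan 2024 10:00:00 +0000\n"
    "Received: from localhost by mail.example.com; Mon, 1 Jan 2024 09:59:58 +0000\n"
    "From: Alice Example <alice@example.com>\n"
    "To: Bob <bob@example.org>, carol@example.org\n"
    "Subject: Quarterly report\n"
    "Date: Mon, 1 Jan 2024 10:00:00 +0000\n"
    "X-Priority: 3\n"
    "\n"
    "Hello\n"
)
msg = email.message_from_string(RAW)

# Split the From mailbox into display name and address.
sender_name, sender_email_address = parseaddr(msg["From"])

# Return path with the surrounding angle brackets removed.
return_path = msg["Return-Path"].strip("<>")

# Addresses of every mailbox listed in the To field.
recipients = [addr for _name, addr in getaddresses(msg.get_all("To", []))]

# Headers that may repeat are collected with get_all().
received = msg.get_all("Received", [])

# Presence checks: a missing header is returned as None.
dkim_signed = msg["DKIM-Signature"] is not None

# X-Priority may carry a trailing label such as "3 (Normal)"; keep the number.
raw_priority = msg["X-Priority"]
x_priority = int(raw_priority.split()[0]) if raw_priority else None

orig_date = parsedate_to_datetime(msg["Date"])
print(sender_name, sender_email_address, return_path, x_priority, orig_date)
```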

class phishbench.input.email_input.EmailBody(msg)

A class representing the body of an email.

text

The raw text of the email.

Type

str

raw_html

The uncleaned HTML of the email.

Type

str

html

The cleaned HTML of the email. Cleaning removes scripts and styles from the HTML.

Type

str

is_html

Whether or not the email is HTML

Type

bool

num_attachment

The number of attachments the email contains

Type

int

content_disposition_list

A list of the content dispositions of each part of the email

Type

List[str]

content_type_list

A list of the content types of each part of the email

Type

List[str]

content_transfer_encoding_list

A list of the content transfer encodings for each part

Type

List[str]

file_extension_list

A list containing the file extensions for each attachment

Type

List[str]

charset_list

A list containing the charsets for each part
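The per-part lists above can be pictured as the result of walking the MIME tree of a message. The sketch below builds a small multipart message with the standard library and collects comparable values. It is an illustration under assumptions, not the EmailBody implementation: the choice to skip multipart containers and to strip the dot from file extensions is a guess at reasonable behavior. Note that email.message.EmailMessage from the standard library is a different class from phishbench's EmailMessage, so it is imported under an alias.

```python
import os
from email.message import EmailMessage as StdlibEmailMessage

# Build an invented message: plain text, an HTML alternative, one attachment.
msg = StdlibEmailMessage()
msg["Subject"] = "Invoice"
msg.set_content("Plain text body")
msg.add_alternative("<html><body><p>HTML body</p></body></html>", subtype="html")
msg.add_attachment(
    b"%PDF-1.4", maintype="application", subtype="pdf", filename="invoice.pdf"
)

content_type_list = []
charset_list = []
file_extension_list = []
num_attachment = 0

for part in msg.walk():
    # Skip the multipart containers and look only at the leaf parts.
    if part.is_multipart():
        continue
    content_type_list.append(part.get_content_type())
    charset_list.append(part.get_content_charset())
    if part.get_content_disposition() == "attachment":
        num_attachment += 1
        extension = os.path.splitext(part.get_filename() or "")[1]
        file_extension_list.append(extension.lstrip("."))

is_html = "text/html" in content_type_list
print(content_type_list, num_attachment, file_extension_list, is_html)
```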

URL Input

This module handles URL input.

phishbench.input.url_input.read_dataset_url(dataset_path, download_url, remove_dup=True)

Reads in a dataset of URLs from a file or a folder of files

Parameters

  • dataset_path (str) – The location of the dataset to read from. This can either be a folder or a file.

  • download_url (bool) – Whether or not to download the websites pointed to by the URLs

  • remove_dup (bool) – Whether or not to remove duplicates.

Returns

  • urls (List[URLData]) – A list of URLData objects representing the dataset

  • bad_url_list (List[str]) – The URLs that failed to extract.

Return type

Tuple[List[phishbench.input.url_input._url_data.URLData], List[str]]
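The effect of remove_dup and the split between usable and rejected URLs can be sketched without phishbench. In the example below, load_urls is a hypothetical function that works on plain strings rather than URLData objects, and "failed to extract" is approximated by a simple syntactic check; the real function may reject URLs for other reasons, such as a failed download.

```python
from typing import Iterable, List, Tuple
from urllib.parse import urlparse


def load_urls(
    lines: Iterable[str], remove_dup: bool = True
) -> Tuple[List[str], List[str]]:
    """Return (urls, bad_url_list) from an iterable of raw lines."""
    urls: List[str] = []
    bad_url_list: List[str] = []
    seen = set()
    for line in lines:
        candidate = line.strip()
        if not candidate:
            continue  # ignore blank lines
        try:
            parsed = urlparse(candidate)
            valid = parsed.scheme in ("http", "https") and bool(parsed.netloc)
        except ValueError:
            valid = False  # urlparse rejects some malformed hosts
        if not valid:
            bad_url_list.append(candidate)
            continue
        if remove_dup and candidate in seen:
            continue
        seen.add(candidate)
        urls.append(candidate)
    return urls, bad_url_list


sample = [
    "http://example.com/login",
    "https://example.org/",
    "http://example.com/login",
    "not a url",
    "",
]
urls, bad_url_list = load_urls(sample)
all_urls, _ = load_urls(sample, remove_dup=False)
print(urls, bad_url_list, len(all_urls))
```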