Configuration Files

The PhishBench configuration file is an INI file that follows the Python [ConfigParser](https://docs.python.org/3/library/configparser.html) format. Most settings are boolean toggles that take the value True or False, like so:

Confusion_matrix = True
Cross_validation = False
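
As a minimal sketch of how such toggles can be read with the standard-library `configparser` module: the section name `Example` is an assumption made for illustration, and only the two option names come from the snippet above.

```python
import configparser

# Hypothetical fragment: the section name "Example" is an assumption;
# only the two option names appear in the documentation.
SAMPLE = """
[Example]
Confusion_matrix = True
Cross_validation = False
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE)

# getboolean() accepts true/false, yes/no, on/off and 1/0, case-insensitively.
confusion_matrix = config.getboolean("Example", "Confusion_matrix")
cross_validation = config.getboolean("Example", "Cross_validation")

print(confusion_matrix, cross_validation)
```

Note that `configparser` lowercases option names by default, so `Confusion_matrix` and `confusion_matrix` refer to the same setting, while section names remain case-sensitive.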

The `PhishBench` Section

This section contains the highest-level settings for the basic experiment and toggles for each part of the pipeline.

The mode setting specifies what type of data PhishBench will be operating with. The options are URL or Email.
The feature extraction setting toggles feature extraction from the dataset.
The preprocessing setting toggles preprocessing of the extracted features.
The classification setting toggles training and evaluation of classifiers.
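
The four settings above could be read and validated roughly as follows. This is a sketch under stated assumptions: the option names (`mode`, `feature extraction`, `preprocessing`, `classification`) are guesses based on the descriptions above, and `read_pipeline_settings` is a hypothetical helper, not part of PhishBench.

```python
import configparser

# Hypothetical option names, inferred from the setting descriptions.
SAMPLE = """
[PhishBench]
mode = URL
feature extraction = True
preprocessing = True
classification = False
"""

def read_pipeline_settings(text):
    """Parse the PhishBench section and check that mode is URL or Email."""
    config = configparser.ConfigParser()
    config.read_string(text)
    section = config["PhishBench"]

    mode = section.get("mode", "").strip()
    if mode.lower() not in ("url", "email"):
        raise ValueError(f"mode must be URL or Email, got {mode!r}")

    toggles = ("feature extraction", "preprocessing", "classification")
    settings = {name: section.getboolean(name, fallback=False) for name in toggles}
    settings["mode"] = mode
    return settings

settings = read_pipeline_settings(SAMPLE)
print(settings)
```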

The `Dataset Path` Sections

This section contains the paths of the dataset to be used. You can specify a path using either a relative path to the current directory or an absolute path.

In URL mode, each subset can be either a text file or a folder of text files, with one URL per line.
In Email mode, each subset should be a folder of emails, with one file per email.
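
To illustrate the URL-mode layout, here is a small sketch that reads URLs from either a single text file or a folder of text files. `load_urls` is a hypothetical helper written for this example and is not PhishBench's own loader. Skipping blank lines is also an assumption.

```python
import tempfile
from pathlib import Path

def load_urls(subset_path):
    """Read URLs (one per line) from a text file or a folder of text files."""
    path = Path(subset_path)
    if path.is_dir():
        files = sorted(p for p in path.iterdir() if p.is_file())
    else:
        files = [path]

    urls = []
    for file in files:
        with open(file, encoding="utf-8", errors="replace") as handle:
            # Assumption: blank lines are ignored.
            urls.extend(line.strip() for line in handle if line.strip())
    return urls

# Demonstration on a temporary folder with two small files.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "a.txt").write_text("http://example.com/login\n\n", encoding="utf-8")
    (folder / "b.txt").write_text("http://example.org/\n", encoding="utf-8")
    from_folder = load_urls(folder)
    from_file = load_urls(folder / "a.txt")

print(from_folder, from_file)
```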

The `Extraction` Section

This section controls the behavior of the input and feature extraction modules.

The training dataset setting controls the Basic Experiment Script’s training set. If True, then PhishBench extracts features from the raw dataset at path_legit_train and path_phish_train. Otherwise, it will attempt to load a pre-extracted dataset from OUTPUT_INPUT_DIR.
The testing dataset setting controls the Basic Experiment Script’s testing set. If True, then PhishBench extracts features from the raw dataset at path_legit_test and path_phish_test. Otherwise, its behavior will be determined by the split dataset setting.
If the split dataset setting is True, then PhishBench will randomly split the training set into a 75/25 train-test split.
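
The interaction of the three settings can be sketched as below. Both functions are hypothetical illustrations that use only the standard library. PhishBench's actual split may be implemented differently, and the text above does not say what happens when both the testing dataset and split dataset settings are False, so that case is reported as unspecified.

```python
import random

def plan_datasets(extract_training, extract_testing, split_dataset):
    """Return where the training and testing sets would come from."""
    train_source = "extract" if extract_training else "load pre-extracted"
    if extract_testing:
        test_source = "extract"
    elif split_dataset:
        test_source = "split from training set"
    else:
        # Not described in the documentation above.
        test_source = "unspecified"
    return train_source, test_source

def split_75_25(samples, seed=None):
    """Randomly split samples into 75% train and 25% test."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * 0.75)
    return shuffled[:cut], shuffled[cut:]

plan = plan_datasets(extract_training=True, extract_testing=False, split_dataset=True)
train, test = split_75_25(range(100), seed=0)
print(plan, len(train), len(test))
```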

The `Features Export` Section

This section specifies the formats in which PhishBench outputs the extracted features. Currently, only csv is supported.

The `Preprocessing` Section

This section contains toggles for the preprocessing pipeline steps.

The `Feature Selection` Section

This section contains settings for feature selection.

The number of best features setting is the number of features to select.
The with tfidf setting specifies whether to select TF-IDF features.

The `Feature Selection Methods` Section

This section contains toggles for the feature selection methods.

The `Dataset Balancing` Section

This section contains toggles for the dataset balancing methods.

The `Classification` Section

This section controls the behavior of the classification module. The settings are checked in the order shown below, so only the first enabled option takes effect:

if classification_settings.load_models():
    classifier.load_model()
elif classification_settings.weighted_training():
    classifier.fit_weighted(x_train, y_train)
elif classification_settings.param_search():
    classifier.param_search(x_train, y_train)
else:
    classifier.fit(x_train, y_train)

The `Classifiers` Section

This section contains toggles for the built-in and user-implemented classifiers.

The `Evaluation Metrics` Section

This section contains toggles for the built-in and user-implemented evaluation metrics.

The Feature Sections

The rest of the configuration file contains toggles for the built-in and user-implemented features. The Email_Feature_Types and URL_Feature_Types sections contain toggles for the respective feature types, followed by toggles for individual features, which are grouped into sections by type.

PhishBench will extract a feature if the following conditions are met:

The feature type matches the mode.
The feature’s type is enabled.
The feature is enabled.
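
The three conditions above amount to a simple conjunction, sketched below. `should_extract` and its parameter names are hypothetical and only illustrate the rule. In particular, `feature_mode` stands for the mode (URL or Email) that the feature's type belongs to, which is one reading of the first condition.

```python
def should_extract(mode, feature_mode, type_enabled, feature_enabled):
    """Return True only if all three extraction conditions hold."""
    matches_mode = feature_mode.lower() == mode.lower()
    return bool(matches_mode and type_enabled and feature_enabled)

# A URL feature with its type and its own toggle enabled is extracted in URL mode.
extracted = should_extract("URL", "URL", type_enabled=True, feature_enabled=True)

# The same feature is skipped if its type is disabled, or in Email mode.
skipped_type = should_extract("URL", "URL", type_enabled=False, feature_enabled=True)
skipped_mode = should_extract("Email", "URL", type_enabled=True, feature_enabled=True)

print(extracted, skipped_type, skipped_mode)
```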
