Configuration Files
The PhishBench configuration file is an INI file that follows the Python [ConfigParser](https://docs.python.org/3/library/configparser.html) format. Most settings are boolean toggles that are set to True or False, like so:
Confusion_matrix = True
Cross_validation = False
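Because the file follows the ConfigParser format, toggles like these can be read with the Python standard library. The sketch below is only an illustration and not PhishBench's own loading code; the section names `Evaluation Metrics` and `Classification` are assumptions about where these two settings live.

```python
import configparser

# Hypothetical fragment; the section names are assumed for illustration.
SAMPLE = """
[Evaluation Metrics]
Confusion_matrix = True

[Classification]
Cross_validation = False
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE)

# getboolean() accepts true/false, yes/no, on/off and 1/0, case-insensitively.
confusion_matrix = config.getboolean("Evaluation Metrics", "Confusion_matrix")
cross_validation = config.getboolean("Classification", "Cross_validation")

print(confusion_matrix, cross_validation)
```

Note that ConfigParser treats option names case-insensitively by default, while section names are case-sensitive.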
The PhishBench Section
This section contains the highest-level settings for the basic experiment and toggles for each part of the pipeline.
- The `mode` setting specifies what type of data PhishBench will operate on. The options are `URL` or `Email`.
- The `feature extraction` setting toggles feature extraction from the dataset.
- The `preprocessing` setting toggles preprocessing of the features.
- The `classification` setting toggles training and evaluation of classifiers.
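As a rough illustration of the settings above, the following sketch parses a `[PhishBench]` section and validates the mode. The section name is taken from this section's title, the helper `read_pipeline_settings` is hypothetical, and the exact spelling of the values in a real configuration file may differ.

```python
import configparser

# Hypothetical fragment built from the setting names described above.
SAMPLE = """
[PhishBench]
mode = URL
feature extraction = True
preprocessing = True
classification = False
"""

VALID_MODES = {"URL", "Email"}


def read_pipeline_settings(text):
    """Return the mode and the pipeline toggles from the [PhishBench] section."""
    config = configparser.ConfigParser()
    config.read_string(text)
    section = config["PhishBench"]

    mode = section["mode"].strip()
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}, got {mode!r}")

    # Option names may contain spaces in the ConfigParser format.
    toggles = {
        name: section.getboolean(name)
        for name in ("feature extraction", "preprocessing", "classification")
    }
    return mode, toggles


mode, toggles = read_pipeline_settings(SAMPLE)
print(mode, toggles)
```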
The Dataset Path Sections
This section contains the paths of the dataset to be used. You can specify a path using either a relative path to the current directory or an absolute path.
In URL mode, each subset can be either a text file or a folder of text files, with one URL per line.
In Email mode, each subset should be a folder of emails, with one file per email.
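The URL-mode layout can be read with a few lines of standard-library code. This is a sketch under assumptions, not PhishBench's input module: the helper `read_urls`, the folder name, and the choice to skip blank lines are all illustrative.

```python
import tempfile
from pathlib import Path


def read_urls(subset_path):
    """Collect URLs from a text file, or from every file in a folder."""
    path = Path(subset_path)
    if path.is_dir():
        files = sorted(p for p in path.iterdir() if p.is_file())
    else:
        files = [path]

    urls = []
    for file in files:
        with open(file, encoding="utf-8", errors="replace") as handle:
            # One URL per line; blank lines are skipped (an assumption).
            urls.extend(line.strip() for line in handle if line.strip())
    return urls


# Small demonstration using a temporary, hypothetical dataset folder.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "legit_train"
    folder.mkdir()
    (folder / "a.txt").write_text(
        "http://example.com/\n\nhttp://example.org/login\n", encoding="utf-8"
    )
    (folder / "b.txt").write_text("http://example.net/\n", encoding="utf-8")

    from_folder = read_urls(folder)          # folder of text files
    from_file = read_urls(folder / "a.txt")  # single text file
```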
The Extraction Section
This section controls the behavior of the input and feature extraction modules.
- The `training dataset` setting controls the Basic Experiment Script’s training set. If `True`, PhishBench extracts features from the raw dataset at `path_legit_train` and `path_phish_train`. Otherwise, it attempts to load a pre-extracted dataset from `OUTPUT_INPUT_DIR`.
- The `testing dataset` setting controls the Basic Experiment Script’s testing set. If `True`, PhishBench extracts features from the raw dataset at `path_legit_test` and `path_phish_test`. Otherwise, its behavior is determined by the `split dataset` setting.
- If the `split dataset` setting is `True`, PhishBench randomly splits the training set into a 75/25 train-test split.
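To make the 75/25 split concrete, here is a minimal standard-library sketch of a random train-test split. It is not PhishBench's actual splitting routine, which is not shown here; the function name `split_train_test` and its `seed` parameter are assumptions for illustration.

```python
import random


def split_train_test(samples, test_fraction=0.25, seed=None):
    """Shuffle the samples and split them into (train, test) lists."""
    shuffled = list(samples)
    # A dedicated Random instance keeps the global random state untouched.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


train, test = split_train_test(range(100), seed=0)
print(len(train), len(test))
```

Fixing the seed makes the split reproducible between runs, which helps when comparing classifiers on the same data.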
The Features Export Section
This section specifies the formats in which PhishBench outputs the extracted features. Currently, only CSV is supported.
The Preprocessing Section
This section contains toggles for the preprocessing pipeline steps.
The Feature Selection Section
This section contains settings for feature selection.
- The `number of best features` setting is the number of features to select.
- The `with tfidf` setting specifies whether to select TF-IDF features.
The Feature Selection Methods Section
This section contains toggles for the feature selection methods.
The Dataset Balancing Section
This section contains toggles for the dataset balancing methods.
The Classification Section
This section controls the behavior of the classification module. The internal logic is as follows:
if classification_settings.load_models():
    classifier.load_model()
elif classification_settings.weighted_training():
    classifier.fit_weighted(x_train, y_train)
elif classification_settings.param_search():
    classifier.param_search(x_train, y_train)
else:
    classifier.fit(x_train, y_train)
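One consequence of this if/elif chain is that exactly one branch runs per classifier, and earlier settings take precedence over later ones. The sketch below mirrors the chain with plain booleans to make that precedence explicit; the function `choose_training_action` is hypothetical and only returns the name of the branch that would run.

```python
def choose_training_action(load_models, weighted_training, param_search):
    """Return which branch of the if/elif chain runs for the given toggles."""
    if load_models:
        return "load_model"
    if weighted_training:
        return "fit_weighted"
    if param_search:
        return "param_search"
    return "fit"


# Loading saved models wins even when the other toggles are also enabled.
all_enabled = choose_training_action(True, True, True)
# Weighted training takes precedence over the parameter search.
weighted_and_search = choose_training_action(False, True, True)
# With every toggle off, a plain fit is performed.
none_enabled = choose_training_action(False, False, False)
```

Under this reading, enabling the parameter search has no effect unless both model loading and weighted training are turned off.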
The Classifiers Section
This section contains toggles for the built-in and user-implemented classifiers.
The Evaluation Metrics Section
This section contains toggles for the built-in and user-implemented evaluation metrics.
The Feature Sections
The rest of the configuration file contains toggles for the built-in and user-implemented features. The Email_Feature_Types and URL_Feature_Types sections contain toggles for the respective feature types. Toggles for individual features are grouped into sections by type.
PhishBench will extract a feature if the following conditions are met:
- The feature’s type matches the mode.
- The feature’s type is enabled.
- The feature is enabled.
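The three conditions can be sketched as a single check against a parsed configuration. Everything beyond the `URL_Feature_Types` section name is hypothetical here: the type names, the per-type section names, the feature names, and the `Feature` and `should_extract` helpers are illustrative and are not PhishBench's API.

```python
import configparser
from dataclasses import dataclass

# Hypothetical fragment; only the URL_Feature_Types section name is from the text.
SAMPLE = """
[URL_Feature_Types]
URL = True
Network = False

[URL_Features]
url_length = True
has_https = False

[Network_Features]
dns_ttl = True
"""


@dataclass
class Feature:
    mode: str       # "URL" or "Email"
    type_name: str  # key in the <mode>_Feature_Types section
    section: str    # section holding this type's individual toggles
    name: str       # key of the feature inside that section


def should_extract(config, mode, feature):
    """Apply the three conditions: mode match, type enabled, feature enabled."""
    if feature.mode != mode:
        return False
    types_section = f"{mode}_Feature_Types"
    # Missing sections or options are treated as disabled (an assumption).
    if not config.getboolean(types_section, feature.type_name, fallback=False):
        return False
    return config.getboolean(feature.section, feature.name, fallback=False)


config = configparser.ConfigParser()
config.read_string(SAMPLE)

url_length = Feature("URL", "URL", "URL_Features", "url_length")
has_https = Feature("URL", "URL", "URL_Features", "has_https")
dns_ttl = Feature("URL", "Network", "Network_Features", "dns_ttl")
email_feature = Feature("Email", "Header", "Email_Header_Features", "num_recipients")
```

In this sketch, `dns_ttl` is skipped even though its own toggle is `True`, because its type is disabled.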