This study was performed at Konkuk University Medical Center (KUMC), a 700-bed tertiary-care teaching hospital in Seoul, South Korea. The study was conducted according to the Declaration of Helsinki; the protocol was granted an exemption by the Institutional Review Board (IRB) of KUMC, and obtaining informed consent from the study patients was not necessary (IRB approval No. KUH1200110). Data collection was performed at the Department of Laboratory Medicine, Konkuk University Medical Center, from February 2019 to March 2019. The data were anonymized due to the sensitivity of patient information. CPD parameters and International Classification of Diseases, 10th Revision (ICD-10) codes were included. Demographic patient information, i.e., gender and age, was also included for better prediction outcomes.

We performed the hematologic analysis using the Mindray BC-6800 (Mindray, Shenzhen, China) automated hematology analyzer, which yielded CPD including CBC, leukocyte differential, and reticulocyte count, with information on volume, conductivity, and different scatter measures^{23}. After preprocessing (see the following section), a total of 882 cases were included for analysis. The detailed numbers of hematologic diseases, including malignancies and non-malignancies, are shown in Table 1.

### Preprocessing

The dataset contained several missing values. We handled this issue in two steps. First, cases with more than 90% of values missing were excluded; in total, 17 cases were excluded, leaving 882 cases for further analysis. Second, the remaining missing values were predicted with two machine learning algorithms: missing numerical variables were predicted with linear regression, and missing categorical variables were predicted with a decision tree classifier. In both cases, the training data consisted of a subset of numerical attributes, restricted to instances with no missing values.
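A minimal sketch of this two-model imputation is shown below. The column names, values, and the choice of predictor columns are hypothetical, for illustration only:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy dataset: "wbc" is numerical, "gender" is categorical.
df = pd.DataFrame({
    "age":    [34, 51, 47, 62, 29, 55],
    "hgb":    [13.1, 10.4, 11.8, 9.6, 14.2, 10.9],
    "wbc":    [6.2, np.nan, 7.5, 11.3, 5.8, np.nan],
    "gender": ["F", "M", None, "M", "F", "M"],
})

predictors = ["age", "hgb"]  # complete numerical attributes used as inputs

# Numerical imputation: fit linear regression on rows where "wbc" is known.
known = df["wbc"].notna()
reg = LinearRegression().fit(df.loc[known, predictors], df.loc[known, "wbc"])
df.loc[~known, "wbc"] = reg.predict(df.loc[~known, predictors])

# Categorical imputation: fit a decision tree on rows where "gender" is known.
known = df["gender"].notna()
clf = DecisionTreeClassifier().fit(df.loc[known, predictors], df.loc[known, "gender"])
df.loc[~known, "gender"] = clf.predict(df.loc[~known, predictors])

print(df.isna().sum().sum())  # 0 — no missing values remain
```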

After handling missing values, we retained only the laboratory data and demographic patient information (gender and age) for further analysis. As a result, the number of variables (before feature selection) was 61.

The laboratory data contain measurements with different ranges and units; therefore, we normalized the dataset by scaling. We used the Min-Max Scaler to transform the values so that they fall within the range of 0 to 1. To represent gender numerically, we used the value 0 for female and 1 for male.
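As an illustration, the scaling and gender encoding can be sketched as below; the variable names and values are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical subset of laboratory variables with different ranges/units.
df = pd.DataFrame({
    "age":    [34, 51, 62, 29],
    "gender": ["F", "M", "M", "F"],
    "plt":    [250_000, 90_000, 410_000, 180_000],  # platelets, /uL
    "hgb":    [13.1, 9.6, 14.2, 10.9],              # hemoglobin, g/dL
})

# Encode gender numerically: 0 for female, 1 for male.
df["gender"] = df["gender"].map({"F": 0, "M": 1})

# Min-max scaling maps every column into the range [0, 1].
scaled = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)
print(scaled["plt"].min(), scaled["plt"].max())  # 0.0 1.0
```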

### Bias variable

We applied point-biserial correlation to determine which variables have a significant influence on malignant versus non-malignant hematologic diseases. The point-biserial correlation coefficient ranges from −1 to 1. Values closer to −1 indicate a strong negative linear relationship between the two variables, and values closer to 1 indicate a strong positive linear relationship.
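A minimal sketch of this measure, using SciPy's `pointbiserialr` on a hypothetical binary outcome (1 = malignant, 0 = non-malignant) and one illustrative continuous CPD variable:

```python
import numpy as np
from scipy.stats import pointbiserialr

# Hypothetical data: binary disease label against one continuous variable.
malignant = np.array([0, 0, 0, 0, 1, 1, 1, 1])
variable  = np.array([4.1, 5.0, 4.6, 5.3, 8.9, 9.4, 8.2, 10.1])

r, p_value = pointbiserialr(malignant, variable)
print(round(r, 3))  # r > 0.9: strong positive linear relationship
```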

The presented approach uses filter-based variable/feature selection. However, there are two additional approaches for selecting the most appropriate features: wrapper and embedded approaches. The main differences among them are as follows: filter methods use a selected measure to choose the best subset of features prior to the machine learning phase; wrapper methods use a machine learning model to score feature subsets and select the best-performing one; and embedded methods perform feature selection as part of the model construction process.
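The three approaches can be contrasted with a short sketch; the synthetic data, the specific scoring function, and the choice of three features are illustrative assumptions, not part of the study:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectKBest, f_classif

X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           random_state=0)

# Filter: score features with a statistical measure before any model is fit.
filt = SelectKBest(f_classif, k=3).fit(X, y)

# Wrapper: let a model score candidate subsets (recursive feature elimination).
wrap = RFE(RandomForestClassifier(random_state=0), n_features_to_select=3).fit(X, y)

# Embedded: selection happens inside model construction (feature importances).
emb = RandomForestClassifier(random_state=0).fit(X, y)
top3 = np.argsort(emb.feature_importances_)[-3:]
```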

### Variable selection

To identify the variables with high significance, whether the point-biserial correlation was negative or positive, we took the absolute value of each correlation coefficient and ranked the variables from high to low. Table 2 shows the selected variables based on point-biserial correlation.
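The absolute-value ranking amounts to a one-liner; the variable names and coefficients below are hypothetical:

```python
import pandas as pd

# Hypothetical point-biserial correlations for a few variables.
corr = pd.Series({"NEU#": 0.41, "HGB": -0.38, "PLT": -0.52, "AGE": 0.12})

# Rank by absolute value so strong negative correlations rank as high
# as strong positive ones.
ranked = corr.abs().sort_values(ascending=False)
print(ranked.index.tolist())  # ['PLT', 'NEU#', 'HGB', 'AGE']
```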

### Model selection

In our study, we applied seven machine learning models: Stochastic Gradient Descent (SGD), Support Vector Machine (SVM), Random Forests (RF), Decision Tree (DT), an adapted Linear Regression (LINEAR), whose output was discretized into two classes using a threshold, Logistic Regression (LOGIT), and Artificial Neural Networks (ANN). The first six models were taken from the Scikit-learn library^{24} with the default parameter values, while the ANN was built with the Keras library^{25}.
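As a sketch, the six Scikit-learn models might be instantiated as below (the ANN, built in Keras, is described separately). The toy data and the 0.5 threshold for discretizing the LINEAR output are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression, SGDClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# The six Scikit-learn models, all with default parameter values.
models = {
    "SGD": SGDClassifier(),
    "SVM": SVC(),
    "RF": RandomForestClassifier(),
    "DT": DecisionTreeClassifier(),
    "LINEAR": LinearRegression(),  # adapted: output discretized below
    "LOGIT": LogisticRegression(),
}

# Illustrative data: the adapted linear regression thresholds its
# continuous output into two classes.
X = np.array([[0.1], [0.2], [0.8], [0.9]])
y = np.array([0, 0, 1, 1])
lin = models["LINEAR"].fit(X, y)
pred = (lin.predict(X) >= 0.5).astype(int)
print(pred.tolist())  # [0, 0, 1, 1]
```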

The ANN consisted of a 3-layer architecture and was trained for 300 epochs with a batch size of 48. The first hidden layer had 128 nodes with the Rectified Linear Unit (ReLU) activation function, and the second hidden layer had 64 nodes, also with ReLU activation. A single node with sigmoid/logistic activation was used for the output layer, whose output represents the malignancy predictive value, a continuous variable ranging from 0 (hematologic non-malignancies) to 1 (hematologic malignancies). This architecture was selected based on our past experience with processing similar medical datasets. A more appropriate approach to architecture selection would include evaluation of various parameter values (such as the number of layers); however, such an optimization is very complex and time-consuming, and will therefore be carried out in future work if deemed necessary.
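A minimal Keras sketch of this architecture follows; the number of input features, the optimizer, and the loss function are assumptions (only the layer sizes, activations, epochs, and batch size are stated in the text):

```python
from tensorflow import keras
from tensorflow.keras import layers

n_features = 14  # hypothetical number of selected input variables

model = keras.Sequential([
    layers.Input(shape=(n_features,)),
    layers.Dense(128, activation="relu"),   # first hidden layer
    layers.Dense(64, activation="relu"),    # second hidden layer
    layers.Dense(1, activation="sigmoid"),  # malignancy predictive value in [0, 1]
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, epochs=300, batch_size=48)
```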

### Performance evaluation

To evaluate the performance of the ML models, we used the stratified 10-fold cross-validation. In stratified cross-validation, the folds are selected in such a way that the percentage of samples is preserved for each class^{26}. That is, the procedure maintains the same distribution of the target variable when randomly selecting examples for each fold; in our case, the same proportion between malignant and non-malignant cases. More precisely, this procedure divides the set of cases into k groups (k = 10) or folds of approximately equal sizes. The first fold is treated as a testing set, and the remaining k-1 folds are used for training the model (90% training data vs. 10% testing data). This is repeated 10 times, each time selecting a different fold as the testing set and the remaining folds as the training set. The performance metrics are then averaged over all the 10 steps. To avoid double dipping, training and testing sets (folds) are always disjoint sets and thus they do not share any sample^{27}.
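The procedure corresponds to Scikit-learn's `StratifiedKFold`; the labels below are hypothetical (30% malignant), chosen to make the preserved class proportion visible:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels: 1 = malignant, 0 = non-malignant (30% positive).
y = np.array([1] * 30 + [0] * 70)
X = np.arange(100).reshape(-1, 1)

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    # Each fold preserves the 30/70 class proportion, and the training
    # and testing folds are disjoint (no double dipping).
    assert set(train_idx).isdisjoint(test_idx)
    assert y[test_idx].mean() == 0.3
```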

In our study, True Positives (TP) are real malignancies that are correctly predicted, False Negatives (FN) are real malignancies that are incorrectly classified as non-malignancies, True Negatives (TN) are real non-malignancies that are correctly predicted, and False Positives (FP) are real non-malignancies that are incorrectly classified as malignancies. Precision denotes the proportion of predicted positive cases that are truly positive. Recall refers to sensitivity, i.e., in medical terms, the ability to identify all positive cases (the TP rate). Accuracy is the ratio of correctly predicted samples and is one of the most intuitive and basic performance measures for any ML model. The Area Under the Curve (AUC) is used to determine the best cutoff point and to compare two or more tests or observers for each calculated fold^{28}. The AUC relates the TP rate (TPR) to the FP rate (FPR); the underlying curve is created by plotting the TPR against the FPR^{29}.
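These measures can be computed per fold as sketched below; the labels, predicted probabilities, and the 0.5 cutoff are illustrative assumptions:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, roc_auc_score)

# Hypothetical fold: true labels and predicted malignancy probabilities.
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.8, 0.4, 0.2, 0.6, 0.1, 0.7, 0.3])
y_pred = (y_prob >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn), sensitivity
accuracy = accuracy_score(y_true, y_pred)    # (tp + tn) / all samples
auc = roc_auc_score(y_true, y_prob)          # area under the TPR-vs-FPR curve
print(tp, fp, tn, fn)  # 3 1 3 1
```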