# Optimization for Machine Learning: Shallow and Deep Supervised Learning Models

In this post, I explain how supervised learning models in machine learning use single or multiple neurons in a single layer (shallow learning) or multiple neurons in multiple layers (deep learning) to generate a real, binary, or integer value beneficial for predictive analytics.

## Mathematical Definition of a Neuron

In a machine learning model, a neuron, or an artificial neuron, is a numerical processor unit (node) comprised of two functions:

• Pre-Activation Function (Pre-AF), which is usually linear in its inputs, but can also be non-linear, and generates a real number. This function has parameters, namely weights (one per input) and a bias, which are treated as optimization variables during training but as fixed parameters during validation and testing. It is common to use multivariate linear regression or multivariate polynomial regression as a Pre-AF.
• Post-Activation Function (Post-AF), which is mostly non-linear but can also be linear regarding its inputs and generates a specific number. This function may have some parameters, but they are not considered as variables for optimization. Therefore, parameters of this function remain static in each of the training, validation, and testing processes. It is common to use Identity, ReLU, Leaky ReLU, Tanh, or Sigmoid as Post-AF, but other functions can also be used (See [1], [2], [3] or [4]).

The figure below indicates how a neuron leverages multivariate linear regression as its Pre-AF and Sigmoid as its Post-AF to convert input(s) to output(s).
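As a rough illustration of the neuron just described, the sketch below chains a multivariate linear regression Pre-AF with a Sigmoid Post-AF. The function names and the example numbers are hypothetical, chosen only so the result is easy to verify by hand.

```python
import math


def pre_activation(features, weights, bias):
    """Pre-AF: multivariate linear regression, bias + sum_j a_j * x_j."""
    return bias + sum(a * x for a, x in zip(features, weights))


def sigmoid(y):
    """Post-AF: maps any real number into the interval [0, 1]."""
    if y >= 0:
        return 1.0 / (1.0 + math.exp(-y))
    e = math.exp(y)  # this branch avoids overflow for very negative y
    return e / (1.0 + e)


def neuron(features, weights, bias):
    """Apply the Pre-AF first, then feed its result to the Post-AF."""
    return sigmoid(pre_activation(features, weights, bias))


# Hypothetical values for illustration only.
features = [1.0, 2.0]
weights = [0.5, -0.25]
bias = 0.0
output = neuron(features, weights, bias)
# The Pre-AF gives 0.5 * 1.0 - 0.25 * 2.0 + 0.0 = 0.0, so the output is 0.5.
```

During training, only `weights` and `bias` would change; the Sigmoid itself has nothing to optimize.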

## Supervised Learning

Consider a dataset as follows:

In supervised learning, we want our model to generate $g_t$ which is close to $b_t$ by designing a machine learning system which is characterized as follows:

• Using a single neuron or multiple neurons.
• Using a single layer containing neurons or multiple layers containing them.
• Using a Pre-AF (it is common to use the same Pre-AF for all neurons).
• Using a Post-AF (it is common to use the same Post-AF for all neurons).
• Using all or a subset of the affecting features that influence the value of the target feature.
• Choosing a regularization parameter.
• … .

In the figure, the training dataset is used to develop your supervised learning model. The development (also called cross-validation) dataset is used for sensitivity analysis, to find a good setting for your model (e.g., the number of neurons or layers). Lastly, the test dataset is used to assess the accuracy of your model. Moreover:

• $\alpha$ is the fraction of all examples in the dataset ($M$) used for training your model.
• $\beta$ is the fraction of all examples in the dataset ($M$) used for debugging and finalizing the development of your model.
• Consequently, $1-\alpha-\beta$ is the remaining fraction of all examples in the dataset ($M$), used for accuracy testing and performance analysis of your supervised learning model.

For instance, if you have a dataset containing 135 examples, setting $\alpha=70\%$ and $\beta=10\%$ means that you use data points 1 to 94 for training, 95 to 108 for validation, and 109 to 135 for testing your supervised learning model. In what follows, two popular and widely applied supervised learning models are examined.
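The index ranges in this example can be reproduced with a short sketch. The rounding rule is an assumption: cumulative boundaries are rounded down, which happens to match the numbers above, although other conventions are possible. The function name `split_bounds` is hypothetical, and percentages are passed as integers to avoid floating-point surprises.

```python
def split_bounds(num_examples, alpha_pct, beta_pct):
    """Return 1-based inclusive (start, end) ranges for train, validation, test.

    Assumption: boundaries are cumulative and rounded down.
    """
    train_end = num_examples * alpha_pct // 100
    val_end = num_examples * (alpha_pct + beta_pct) // 100
    return (
        (1, train_end),
        (train_end + 1, val_end),
        (val_end + 1, num_examples),
    )


train, validation, test = split_bounds(135, 70, 10)
# train == (1, 94), validation == (95, 108), test == (109, 135)
```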

## Optimization Models in Shallow Supervised Learning

For shallow supervised learning, we use a single (for regression or binary classification) or even multiple neurons (for multiclass classification) in a single layer called the “output” layer to generate our values of interest. The optimization model of such a learning procedure is as follows:

$\begin{array}{ll}\min & \sum_{t=1}^{T}\left(g_{t}-b_{t}\right)^{2}+\lambda \sum_{j=1}^{N} x_{j}^{2}\\ \text{s.t.} & h\left(y_{t}\right)=g_{t},\quad \forall t\in \{1,\dots,T\}\\ & z+\sum_{j=1}^{N} a_{jt}x_{j}=y_{t},\quad \forall t\in \{1,\dots,T\}\\ & x,z,y,g\in \mathbb{R}\end{array}$

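Here, $x_j$ is the weight assigned to affecting feature $j$, $z$ is the bias, $a_{jt}$ is the value of affecting feature $j$ in training example $t$, $h(y_t)$ is our desired Post-AF, and $\lambda$ is a penalty (regularization controller). The objective function is comprised of two terms: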

• The first term minimizes the squared difference between the output value the neuron generates, i.e., $g_t$, and the existing target feature value $b_t$ in the training dataset.
• The second term, called a regularization term, prevents the neuron from overfitting (or overlearning) by penalizing $x$ values that are large in magnitude. Accordingly, the neuron is regularized and cannot select values for $x$ that work well only for the training dataset. By doing so, we help the neuron maintain an acceptable accuracy when it is applied for predictive analytics on the test dataset.
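To make the two terms concrete, here is a minimal sketch that evaluates the objective for a given choice of weights and bias. It only computes the value; it does not solve the optimization problem. The function name, the data layout (`A[t][j]` holding $a_{jt}$), and the numbers are assumptions for illustration.

```python
def identity(y):
    """Identity Post-AF, used here so the result is easy to check by hand."""
    return y


def shallow_objective(A, b, x, z, lam, h):
    """Squared error over all training examples plus the penalty on weights.

    A[t][j] holds a_jt, b[t] holds b_t, x holds the weights, z is the bias,
    lam is the regularization parameter and h is the Post-AF.
    """
    squared_error = 0.0
    for a_t, b_t in zip(A, b):
        y_t = z + sum(a * w for a, w in zip(a_t, x))  # Pre-AF
        g_t = h(y_t)                                  # Post-AF
        squared_error += (g_t - b_t) ** 2
    penalty = lam * sum(w ** 2 for w in x)
    return squared_error + penalty


# Hypothetical data: the targets are reproduced exactly by x = [1, 2], z = 0.
A = [[1.0, 2.0], [3.0, 4.0]]
b = [5.0, 11.0]
value = shallow_objective(A, b, x=[1.0, 2.0], z=0.0, lam=0.1, h=identity)
# The squared error is 0.0 here, so the value is the penalty 0.1 * (1 + 4) = 0.5.
```

Note that the bias `z` is not penalized, matching the formulation above, where the regularization sum runs over the weights only.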

By using constraint one, we generate a value $g_t$ using the activation function $h(y_t)$, which should be close to $b_t$ to minimize the objective function. The choice of activation function is case-dependent and depends on your target feature values $b_t$. For instance, if $b_t$ is a positive real number for each training example $t$, you might choose the identity or ReLU; if it is a binary or integer number, you might use Sigmoid, Tanh, or softmax.
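The output ranges of these functions are what drive the choice. The sketch below defines a few of the single-input activation functions named in this post and evaluates them at some sample points. The Leaky ReLU slope of 0.01 is an assumed, commonly used value, not something prescribed here, and softmax is omitted because it operates on several neurons at once.

```python
import math


def identity(y):
    return y  # any real number


def relu(y):
    return max(0.0, y)  # non-negative real numbers


def leaky_relu(y, slope=0.01):
    # slope=0.01 is an assumed default for illustration
    return y if y > 0 else slope * y


def tanh(y):
    return math.tanh(y)  # values in [-1, 1]


def sigmoid(y):
    # values in [0, 1]; two branches avoid overflow in math.exp
    if y >= 0:
        return 1.0 / (1.0 + math.exp(-y))
    e = math.exp(y)
    return e / (1.0 + e)


samples = [-2.0, 0.0, 2.0]
table = {
    f.__name__: [f(y) for y in samples]
    for f in (identity, relu, leaky_relu, tanh, sigmoid)
}
```

For a binary target, a Sigmoid output can be compared with a threshold such as 0.5, while for a positive real target the identity or ReLU leaves the scale of the Pre-AF output untouched.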

By using constraint two, we define the preprocessor function (the Pre-AF), a regression function that generates a real number and passes it to the activation function in constraint one, which in turn generates the desired values. This regression function is multivariate, as it takes multiple affecting features $j$ per training example $t$ as $a_{jt}$, and linear, as it does not raise any affecting feature to a power. If we consider powers for the input features, the model is converted to the following one, which uses a multivariate polynomial regression function as its Pre-AF:

$\begin{array}{ll}\min & \sum_{t=1}^{T}\left(g_{t}-b_{t}\right)^{2}+\lambda \sum_{d=1}^{D}\sum_{k=1}^{K} x_{kd}^{2}\\ \text{s.t.} & h\left(y_{t}\right)=g_{t},\quad \forall t\in \{1,\dots,T\}\\ & z+\sum_{d=1}^{D}\sum_{k=1}^{K}\left(\prod_{j=1}^{N}\left(a_{jt}\right)^{P_{jk}}\right)x_{kd}=y_{t},\quad \forall t\in \{1,\dots,T\}\\ & x,z,y,g\in \mathbb{R}\end{array}$

In this formulation, $D$ is the degree of the polynomial, $K=\binom{d+N-1}{d}$ is the number of terms of degree $d$, and $P_{jk}$ is the power of feature $j$ in the $k^{th}$ term of degree $d$, so that $\sum_{j=1}^{N} P_{jk}=d$. Finally, the number of terms in the polynomial is $\sum_{d=1}^{D}\binom{d+N-1}{d}$.
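The term count can be checked by enumerating the exponent vectors directly. The sketch below, with hypothetical function names, lists every combination of non-negative integer powers that sums to $d$ and builds the corresponding polynomial features for one training example. The ordering of terms within a degree is an arbitrary choice made here.

```python
from itertools import combinations_with_replacement
from math import comb


def num_terms(N, D):
    """Number of polynomial terms: sum over d = 1..D of C(d + N - 1, d)."""
    return sum(comb(d + N - 1, d) for d in range(1, D + 1))


def exponent_vectors(N, d):
    """All tuples (P_1, ..., P_N) of non-negative integers that sum to d."""
    vectors = []
    for combo in combinations_with_replacement(range(N), d):
        powers = [0] * N
        for j in combo:
            powers[j] += 1
        vectors.append(tuple(powers))
    return vectors


def polynomial_features(a_t, D):
    """Products prod_j a_j ** P_j for every term of degree 1 up to D."""
    features = []
    for d in range(1, D + 1):
        for powers in exponent_vectors(len(a_t), d):
            value = 1.0
            for a, p in zip(a_t, powers):
                value *= a ** p
            features.append(value)
    return features


# Two features and degree 2: terms a1, a2, a1^2, a1*a2, a2^2.
feats = polynomial_features([2.0, 3.0], D=2)
# feats == [2.0, 3.0, 4.0, 6.0, 9.0] and num_terms(2, 2) == 5
```

Each generated feature then receives its own weight $x_{kd}$, so the Pre-AF stays linear in the optimization variables even though it is non-linear in the inputs.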

## Optimization Models in Deep Supervised Learning

In contrast, deep supervised learning models use multiple neurons in multiple layers to develop a model that can process large amounts of data with higher accuracy. The optimization model of this approach is as follows:

$\begin{array}{ll}\min & \sum_{t=1}^{T}\sum_{j=1}^{U_{L}}\left(g_{jLt}-b_{jt}\right)^{2}+\lambda \sum_{l=2}^{L}\sum_{i=1}^{U_{l-1}}\sum_{j=1}^{U_{l}} x_{lij}^{2}\\ \text{s.t.} & g_{jlt}=a_{jt},\quad \forall l\in \{1\},\ \forall j\in \{1,\dots,U_{l}\},\ \forall t\in \{1,\dots,T\}\\ & g_{jlt}=h\left(y_{jlt}\right),\quad \forall l\in \{2,\dots,L\},\ \forall j\in \{1,\dots,U_{l}\},\ \forall t\in \{1,\dots,T\}\\ & y_{jlt}=\sum_{i=1}^{U_{l-1}} g_{i,l-1,t}\,x_{lij}+z_{lj},\quad \forall l\in \{2,\dots,L\},\ \forall j\in \{1,\dots,U_{l}\},\ \forall t\in \{1,\dots,T\}\\ & x,y,z,g\in \mathbb{R}\end{array}$

The notation is similar to the previous models, with added indices for neurons ($i$ and $j$) and layers ($l$). Moreover, $U_l$ is the number of neurons in layer $l$. The model takes each feature value $a_{jt}$ as the output of the first (input) layer and passes it to the Pre-AF of every neuron in the next layer. Each of these neurons generates $y_{jlt}$ with its Pre-AF and $g_{jlt}$ with its Post-AF, then passes $g_{jlt}$ to the neurons in the following layer. The goal is to minimize the deviation of the final-layer outputs $g_{jLt}$ from the targets $b_{jt}$ for each training example $t$.
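The layer-by-layer flow can be sketched as a forward pass followed by an evaluation of the objective. This is a sketch under assumptions: each neuron $j$ is taken to sum the outputs of all neurons $i$ in the previous layer weighted by $x_{lij}$, the nested-list layout `weights[k][i][j]` is a hypothetical storage choice, and the identity Post-AF is used only so the numbers can be checked by hand. It evaluates the objective for given weights rather than optimizing them.

```python
def identity(y):
    return y


def forward(a_t, weights, biases, h):
    """Propagate one training example through every layer after the input.

    weights[k][i][j] links neuron i of a layer to neuron j of the next one;
    biases[k][j] is the bias of neuron j in that next layer.
    """
    g = list(a_t)  # the first layer simply outputs the features
    for W, z in zip(weights, biases):
        y = [
            z[j] + sum(g[i] * W[i][j] for i in range(len(g)))  # Pre-AF
            for j in range(len(z))
        ]
        g = [h(v) for v in y]  # Post-AF
    return g


def deep_objective(A, B, weights, biases, lam, h):
    """Squared error of the final layer plus the penalty on all weights."""
    error = 0.0
    for a_t, b_t in zip(A, B):
        g_last = forward(a_t, weights, biases, h)
        error += sum((g - b) ** 2 for g, b in zip(g_last, b_t))
    penalty = lam * sum(w ** 2 for W in weights for row in W for w in row)
    return error + penalty


# Hypothetical network: 2 inputs, 2 hidden neurons, 1 output neuron.
weights = [[[1.0, 0.0], [0.0, 1.0]], [[1.0], [1.0]]]
biases = [[0.0, 0.0], [0.0]]
prediction = forward([2.0, 3.0], weights, biases, identity)
# prediction == [5.0]
value = deep_objective([[2.0, 3.0]], [[5.0]], weights, biases, 0.5, identity)
# The error is 0.0 and four weights equal 1.0, so value == 0.5 * 4 == 2.0
```

Swapping `identity` for a non-linear Post-AF changes only the argument `h`; the structure of the forward pass stays the same.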