Consider a dataset of $N$ observations $\{\mathbf{x}_n, y_n\},\; y_n \in C_k,\; n = 1,\dots,N$, where $C_k$ is the set of $K$ discrete classes.
The goal is to take an input vector $\mathbf{x}$ and assign it to one of these $K$ classes. For binary classification, $C_k = \{-1, +1\}$.
Suppose we want to predict the probability that an object belongs to the positive class, $p_+ = p\left(y = +1 \mid \mathbf{x}\right)$.
We cannot use linear regression directly to predict this value: a probability lies between 0 and 1, whereas the output $\mathbf{w}^T\mathbf{x}$ of a linear model can be any real number.
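A few observations: the odds $\frac{p_+}{1 - p_+}$ take values in $(0, +\infty)$, and the log-odds $\log\left(\frac{p_+}{1 - p_+}\right)$ take values on the whole real line.
This means we can predict the value of $\log\left(\frac{p_+}{1 - p_+}\right)$ with a linear regression model.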
Combining this expression with the linear regression model, we get \[ \begin{equation} \log\left(\frac{p_+}{1 - p_+}\right) = \mathbf{w}^T\mathbf{x} \end{equation} \] and, solving for $p_+$, \[ \begin{equation} p_{+} = \frac{1}{1 + \exp(-\mathbf{w}^T\mathbf{x})} = \sigma(\mathbf{w}^T\mathbf{x}) \end{equation} \]
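As a quick numerical check of the two equations above, here is a minimal sketch in Python with NumPy. The function names `sigmoid` and `logit`, as well as the weights and the input vector, are illustrative assumptions and do not come from the text.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Log-odds log(p / (1 - p)); the inverse of the sigmoid."""
    return np.log(p / (1.0 - p))

# Hypothetical weights and input, chosen only for illustration.
w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 2.0, 0.5])

score = w @ x            # linear score w^T x, can be any real number
p_plus = sigmoid(score)  # probability of the positive class, in (0, 1)
log_odds = logit(p_plus) # recovers the linear score
```

Applying `logit` to `p_plus` returns the linear score again, which is exactly the statement that the model is linear in the log-odds.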
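More generally, since $1 - \sigma(z) = \sigma(-z)$, the probability of the negative class is $\sigma(-\mathbf{w}^T\mathbf{x})$, so for labels $y_i \in \{-1, +1\}$ both cases can be written as a single expression: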
\[ \begin{equation} p\left(y = y_i \mid \mathbf{x}_i, \mathbf{w}\right) = \sigma(y_i\mathbf{w}^T\mathbf{x}_i) \end{equation} \]
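The compact form can be checked numerically: for a fixed input, the probabilities of the two labels should sum to one. This is a small sketch under assumed values; `class_probability`, the weights, and the input are hypothetical names and numbers used only for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def class_probability(y, w, x):
    """p(y | x, w) = sigmoid(y * w^T x) for a label y in {-1, +1}."""
    return sigmoid(y * (w @ x))

# Hypothetical weights and input, chosen only for illustration.
w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 2.0, 0.5])

p_pos = class_probability(+1, w, x)  # sigmoid(w^T x)
p_neg = class_probability(-1, w, x)  # sigmoid(-w^T x) = 1 - sigmoid(w^T x)
```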
Assuming that the objects in our data set are i.i.d., the likelihood of the data set can be written as \[ \begin{equation} P\left(\mathbf{y} \mid \mathbf{X}, \mathbf{w}\right) = \prod_{i=1}^{N} p\left(y = y_i \mid \mathbf{x}_{i}, \mathbf{w}\right), \end{equation} \]
and taking the logarithm gives the log-likelihood
\[ \begin{aligned}
\log P(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) &= \log \prod_{i=1}^{N} p(y = y_i \mid \mathbf{x}_{i}, \mathbf{w}) \\
&= \log \prod_{i=1}^{N} \sigma(y_i\mathbf{w}^{T}\mathbf{x}_i) \\
&= \sum_{i=1}^{N} \log \sigma(y_i\mathbf{w}^{T}\mathbf{x}_i) \\
&= \sum_{i=1}^{N} \log \frac{1}{1 + \exp(-y_i\mathbf{w}^{T}\mathbf{x}_i)} \\
&= -\sum_{i=1}^{N} \log \left(1 + \exp(-y_i\mathbf{w}^{T}\mathbf{x}_i)\right)
\end{aligned} \]
Maximizing the log-likelihood is equivalent to minimizing its negative, which gives us the logistic loss
\[ \begin{equation} E_{\text{logistic}} = \sum_{i=1}^{N} \log \left(1 + \exp(-y_i\mathbf{w}^{T}\mathbf{x}_i)\right) \end{equation} \] where $y_i \in \{-1, +1\}$.
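The loss can be sketched in a few lines of NumPy and compared against the negative log of the likelihood product. The toy data, the weights, and the name `logistic_loss` are assumptions made for illustration. The use of `np.logaddexp(0, -m)`, which computes $\log(1 + \exp(-m))$ without overflow for large negative margins, is an implementation choice and not something prescribed by the derivation.

```python
import numpy as np

def logistic_loss(w, X, y):
    """Sum over i of log(1 + exp(-y_i * w^T x_i)), labels y_i in {-1, +1}."""
    margins = y * (X @ w)
    # logaddexp(0, -m) = log(exp(0) + exp(-m)) = log(1 + exp(-m))
    return np.sum(np.logaddexp(0.0, -margins))

# Hypothetical toy data: four objects, two features each.
X = np.array([[1.0, 2.0],
              [1.0, -1.0],
              [1.0, 0.5],
              [1.0, -2.0]])
y = np.array([1, -1, 1, -1])
w = np.array([0.1, 0.8])

loss = logistic_loss(w, X, y)

# Direct evaluation of the likelihood as a product of sigmoids.
likelihood = np.prod(1.0 / (1.0 + np.exp(-y * (X @ w))))
neg_log_likelihood = -np.log(likelihood)

# With w = 0 every object gets probability 1/2, so the loss is N * log(2).
loss_at_zero = logistic_loss(np.zeros(2), X, y)
```

The two quantities `loss` and `neg_log_likelihood` agree up to floating-point error, while the product form would underflow for large $N$, which is one practical reason to work with the sum of logarithms.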
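Note: A logistic regression model tends to predict well-calibrated probabilities, since it is fit by maximizing the likelihood of the observed labels. Predicted probabilities are called calibrated when they match the observed class frequencies: among the objects that receive a predicted probability of about 0.8, roughly 80% should belong to the positive class.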