The Weston-Watkins SVM dual problem

2020/11/06

In this blog post, we derive the dual problem for the Weston-Watkins support vector machine (SVM). The dual optimization problem for the Weston-Watkins SVM can be found in the literature, e.g., Keerthi et al. (2008) and Weston and Watkins (1999); however, the tedious derivation is usually omitted.

Set-up and notations

Let $$k \ge 2$$ be an integer representing the number of classes and $$[k] = \{1,\dots, k\}$$.

Let $$n$$ be the size of the training data.

For each $$i \in [n]$$, $$x_i \in \mathbb{R}^d$$ is a column vector and $$y_i \in [k]$$.

Let $$W = [w_1,\dots, w_k] \in \mathbb{R}^{d\times k}$$ where $$w_j$$ is the $$j$$-th column of $$W$$.

Let $$e_i \in \mathbb{R}^k$$ be the $$i$$-th elementary basis (column) vector.

Let $$M \in \mathbb{R}_{>0}$$ be a positive number. We are only interested in the cases $$M \in \{1, 1/2\}$$.

Primal problem

The Weston-Watkins SVM minimizes over $$W$$ the following regularized empirical risk:

$\frac{1}{2}\|W\|_F^2 + C \sum_{i=1}^n \sum_{j \in [k]: j \ne y_i} \max \{0, 1 - M(w_{y_i}' x_i - w_j'x_i)\}.$

When $$M = 1$$, we get the formulation used by Keerthi et al. (2008). When $$M = 1/2$$, we get the formulation of Doǧan, Glasmachers, and Igel (2016).

If $$\widetilde W = [\widetilde w_1,\dots, \widetilde w_k]$$ is the optimizer, then the classifier is $x \mapsto \mathrm{argmax}_{j \in [k]} \widetilde w_j 'x.$

Now, for each $$j, l \in [k]$$, let $$\Delta_{j,l} = e_j - e_l$$. Then we can rewrite the regularized empirical risk as

$\frac{1}{2}\|W\|_F^2 + C \sum_{i=1}^n \sum_{j \in [k]: j \ne y_i} \max \{0, 1 - M \Delta_{y_i, j}' W' x_i\}.$
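To make the notation concrete, here is a small NumPy sketch of this objective and of the resulting classifier. The setup is hypothetical: the array names (X with rows $$x_i'$$, the label vector y, W) and the helper names are mine, and labels are 0-indexed for convenience.

```python
import numpy as np

def ww_primal_objective(W, X, y, C, M=1.0):
    """Weston-Watkins regularized empirical risk (a sketch, not library code).

    W : (d, k) weight matrix with columns w_1, ..., w_k
    X : (n, d) data matrix with rows x_i'
    y : (n,) labels in {0, ..., k-1}
    """
    n = X.shape[0]
    scores = X @ W                                        # scores[i, j] = w_j' x_i
    margins = scores[np.arange(n), y][:, None] - scores   # w_{y_i}' x_i - w_j' x_i
    hinge = np.maximum(0.0, 1.0 - M * margins)
    hinge[np.arange(n), y] = 0.0                          # the inner sum excludes j = y_i
    return 0.5 * np.sum(W ** 2) + C * hinge.sum()

def ww_classify(W, X):
    """The classifier x -> argmax_j w_j' x."""
    return np.argmax(X @ W, axis=1)
```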

Introducing slack variables $$\xi_{ij}$$, we can equivalently minimize

$\frac{1}{2}\|W\|_F^2 + C \sum_{i=1}^n \sum_{j \in [k]: j \ne y_i} \xi_{ij}$

subject to

$\begin{cases} \xi_{ij} \ge 0 \\ \xi_{ij} \ge 1 - M\Delta_{y_i, j}' W' x_i \end{cases}$

or, equivalently, subject to

$\begin{cases} 0 \ge 1 - M\mathrm{tr}( W' x_i\Delta_{y_i, j}') - \xi_{ij}\\ 0 \ge -\xi_{ij} \end{cases}$

where $$\mathrm{tr}$$ is the trace operator. We observe that $$W' x_i\Delta_{y_i, j}' \in \mathbb{R}^{k\times k}$$.

Dual problem

Let $$\alpha_{ij} \ge 0$$ and $$\beta_{ij} \ge 0$$ be the dual variables for the above constraints, respectively.

The Lagrangian is

$L(W, \xi, \alpha, \beta) = \frac{1}{2}\|W\|_F^2 + \sum_{i=1}^n \sum_{j \in [k]: j \ne y_i} C \xi_{ij} - \beta_{ij}\xi_{ij} + \alpha_{ij} (1 - M\mathrm{tr}( W' x_i\Delta_{y_i, j}') - \xi_{ij})$

Rearranging, we get

$L(W, \xi, \alpha, \beta) = \frac{1}{2}\|W\|_F^2 + \sum_{i=1}^n \sum_{j \in [k]: j \ne y_i} \xi_{ij} (C - \beta_{ij} -\alpha_{ij}) + \alpha_{ij} (1 - M\mathrm{tr}( W' x_i\Delta_{y_i, j}'))$

Setting to zero the gradient of $$L$$ with respect to $$W$$, we get

$0 = \nabla_W L = W - M \sum_{i=1}^n \sum_{j \in [k]: j \ne y_i} \alpha_{ij} x_i\Delta_{y_i, j}'$

Equivalently, $W = M \sum_{i=1}^n \sum_{j \in [k]: j \ne y_i} \alpha_{ij} x_i\Delta_{y_i, j}'$

Next, setting to zero the gradient of $$L$$ with respect to $$\xi_{ij}$$, we get

$0 = \nabla_{\xi_{ij}} L = C - \beta_{ij} - \alpha_{ij}.$

Thus, the Lagrangian no longer depends on $$\xi_{ij}$$ or $$\beta_{ij}$$, and so

$L(W, \alpha) = \frac{1}{2}\|W\|_F^2 + \sum_{i=1}^n \sum_{j \in [k]: j \ne y_i} \alpha_{ij} (1 - M\mathrm{tr}( W' x_i\Delta_{y_i, j}'))$

Now, $\sum_{i=1}^n \sum_{j \in [k]: j \ne y_i} \alpha_{ij} \mathrm{tr}( W' x_i\Delta_{y_i, j}') = \mathrm{tr}\left( W' \sum_{i=1}^n \sum_{j \in [k]: j \ne y_i} \alpha_{ij} x_i\Delta_{y_i, j}'\right) = \frac{1}{M} \mathrm{tr}(W'W) = \frac{1}{M} \|W\|_F^2.$

Hence, we have

$L(W, \alpha) = -\frac{1}{2}\|W\|_F^2 + \sum_{i=1}^n \sum_{j \in [k]: j \ne y_i} \alpha_{ij}$

Now, let $$\alpha_{iy_i} = -\sum_{j\in[k]: j \ne y_i} \alpha_{ij}$$. Using the definition of $$\Delta_{y_i, j} = e_{y_i} - e_j$$, we get $\frac{1}{M} W = \sum_{i=1}^n \sum_{j \in [k]: j \ne y_i} \alpha_{ij} x_i\Delta_{y_i, j}' = \sum_{i=1}^n \sum_{j \in [k]: j \ne y_i} - \alpha_{ij} x_i e_j' + \sum_{i=1}^n ( \sum_{j \in [k]: j \ne y_i} \alpha_{ij} ) x_i e_{y_i}' = - \sum_{i=1}^n \sum_{j \in [k]} \alpha_{ij} x_i e_j'$

Now, let us define the column vector $\alpha_i = \begin{bmatrix} \alpha_{i1}\\ \vdots \\ \alpha_{ik} \end{bmatrix} \in \mathbb{R}^k$ Then we have $W = - M \sum_{i = 1}^n x_i \alpha_i'$ where we observe that $$x_i \alpha_i' \in \mathbb{R}^{d \times k}$$.

Now, $\frac{1}{M^2} \|W\|_F^2 = \frac{1}{M^2} \mathrm{tr} (W'W) = \mathrm{tr} \left( \sum_{i, s\in [n]} \alpha_s x_s' x_i \alpha_i' \right) = \mathrm{tr} \left( \sum_{i, s\in [n]} x_s' x_i \alpha_i' \alpha_s \right) = \sum_{i, s\in [n]} x_s' x_i \alpha_i' \alpha_s.$ This eliminates the dependencies in the Lagrangian on $$W$$ and so $L(\alpha) = -\frac{M^2}{2} \sum_{i, s\in [n]} x_s' x_i \alpha_i' \alpha_s + \sum_{i \in [n]} \sum_{j \in [k]: j \ne y_i} \alpha_{ij}$

Expression for the dual problem

Thus, the dual problem is

$\mathrm{minimize}_{\alpha} \quad f(\alpha) := \frac{M^2}{2} \sum_{i, s\in [n]} x_s' x_i \alpha_i' \alpha_s - \sum_{i \in [n]} \sum_{j \in [k]: j \ne y_i} \alpha_{ij}$ subject to $\begin{cases} 0 \le \alpha_{ij} \le C &: j \ne y_i\\ \alpha_{iy_i} = -\sum_{j \in [k]: j \ne y_i} \alpha_{ij} &: j = y_i. \end{cases}$
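As a sanity check on the bookkeeping, the following sketch (helper names are again mine; the dual variables are stored as an $$n \times k$$ array whose $$i$$-th row is $$\alpha_i'$$) evaluates $$f$$, tests dual feasibility, and reconstructs $$W = -M\sum_i x_i\alpha_i'$$.

```python
import numpy as np

def ww_dual_objective(alpha, X, y, M=1.0):
    """f(alpha) = (M^2/2) sum_{i,s} x_s'x_i alpha_i'alpha_s - sum_i sum_{j != y_i} alpha_{ij}."""
    n = X.shape[0]
    K = X @ X.T                                            # Gram matrix, K[i, s] = x_i' x_s
    quad = 0.5 * M**2 * np.sum(K * (alpha @ alpha.T))
    lin = alpha.sum() - alpha[np.arange(n), y].sum()       # excludes the j = y_i entries
    return quad - lin

def ww_dual_feasible(alpha, y, C, tol=1e-9):
    """Check 0 <= alpha_{ij} <= C for j != y_i and alpha_{iy_i} = -sum_{j != y_i} alpha_{ij}."""
    n = alpha.shape[0]
    off = alpha.copy()
    off[np.arange(n), y] = 0.0
    in_box = np.all(off >= -tol) and np.all(off <= C + tol)
    rows_sum_to_zero = np.allclose(alpha.sum(axis=1), 0.0, atol=tol)
    return in_box and rows_sum_to_zero

def ww_weights_from_dual(alpha, X, M=1.0):
    """Primal weights W = -M sum_i x_i alpha_i' as a (d, k) array."""
    return -M * X.T @ alpha
```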

When $$M = 1$$, the above is the same as equation (16) of Keerthi et al. (2008). Actually, there is a typo in (16) where the $$\alpha_{ij}$$s should have a negative sign in front. Later, Keerthi et al. (2008) computes the derivative of $$f$$ with the correct sign.

Derivatives

Below, fix $$i \in [n]$$ and $$j \in [k]$$ such that $$j \ne y_i$$.

In this section, we compute $$\frac{\partial f}{\partial \alpha_{ij}}$$.

First, let $$i \ne s$$ and consider the term $x_s' x_i \alpha_i'\alpha_s = x_s' x_i (\alpha_{ij} \alpha_{sj} + \alpha_{iy_i} \alpha_{sy_i} + \mathrm{constant} )$ where $$\mathrm{constant}$$ collects terms that do not depend on $$\alpha_{ij}$$.

Taking derivative of $$x_s' x_i \alpha_i'\alpha_s$$ with respect to $$\alpha_{ij}$$, we get

$x_s' x_i (\alpha_{sj} - \alpha_{sy_i})$ where we recall that $$\alpha_{iy_i} = \sum_{l \in [k]: l \ne y_i} -\alpha_{il}$$.

Similarly, when $$i = s$$, we have that the derivative of $$x_i' x_i \alpha_i'\alpha_i$$ is

$2 x_i' x_i (\alpha_{ij} - \alpha_{i y_i}).$

From this, we get that $\frac{\partial f}{\partial \alpha_{ij}} (\alpha) = -1 + M^2 \sum_{s \in [n]} x_i' x_s (\alpha_{sj} - \alpha_{s y_i})$

Now, observe that

$\sum_{s \in [n]} x_i' x_s (\alpha_{sj} - \alpha_{s y_i}) = x_i ' \left(\sum_{s \in [n]} \alpha_{sj} x_s -\alpha_{s y_i} x_s\right)$

Recall from earlier that

$W = - M \sum_{i = 1}^n x_i \alpha_i'$

Thus, $w_j = We_j = - M \sum_{s = 1}^n \alpha_{sj}x_s \quad \mbox{and} \quad w_{y_i}= We_{y_i} = - M \sum_{s = 1}^n \alpha_{sy_i}x_{s}.$

Thus, we get $Mx_i ' \left(\sum_{s \in [n]} \alpha_{sj} x_s -\alpha_{s y_i} x_s\right) = w_{y_i}' x_i - w_{j} ' x_i.$

From this, we conclude that $\frac{\partial f}{\partial \alpha_{ij}} (\alpha) = M(w_{y_i}' x_i - w_{j}'x_i) -1.$ Since we are minimizing, it is more convenient to consider the negative gradient: $-\frac{\partial f}{\partial \alpha_{ij}} (\alpha) = 1- M (w_{y_i}' x_i - w_{j}'x_i).$ Again, when $$M = 1$$, note that this is equivalent to equation (17) from Keerthi et al. (2008).
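This formula is easy to check numerically. Below is a sketch that compares the closed-form partial derivative against a finite difference in which $$\alpha_{ij}$$ increases and $$\alpha_{iy_i}$$ decreases accordingly; it reuses ww_dual_objective and ww_weights_from_dual from above, and the data and the feasible $$\alpha$$ are randomly generated, so everything here is a made-up running example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k, C, M = 5, 3, 4, 1.0, 1.0
X = rng.standard_normal((n, d))
y = rng.integers(0, k, size=n)

# A random dual-feasible alpha: entries with j != y_i drawn from [0, C], rows summing to zero.
alpha = rng.uniform(0.0, C, size=(n, k))
alpha[np.arange(n), y] = 0.0
alpha[np.arange(n), y] = -alpha.sum(axis=1)

i = 0
j = (y[i] + 1) % k                          # some j != y_i
W = ww_weights_from_dual(alpha, X, M)

# Closed form: df/dalpha_{ij}(alpha) = M (w_{y_i}'x_i - w_j'x_i) - 1.
grad_closed = M * (W[:, y[i]] - W[:, j]) @ X[i] - 1.0

# Finite difference, moving alpha_{ij} up and compensating alpha_{iy_i} down.
eps = 1e-6
alpha_eps = alpha.copy()
alpha_eps[i, j] += eps
alpha_eps[i, y[i]] -= eps
grad_fd = (ww_dual_objective(alpha_eps, X, y, M) - ww_dual_objective(alpha, X, y, M)) / eps

print(grad_closed, grad_fd)                 # the two values should agree to about 1e-5
```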

When $$M = 1/2$$, this is equivalent to Shark’s calcGradient function, where the negative gradient is computed on line 415.

Coordinate descent in Shark

In this section, we fix $$i \in [n]$$. We will explain how Shark (Igel, Heidrich-Meisner, and Glasmachers (2008)) solves the subproblem:

$\mathrm{minimize}_{\alpha_{i1},\dots, \alpha_{ik}} \quad f(\alpha) := \frac{M^2}{2} \sum_{i, s\in [n]} x_s' x_i \alpha_i' \alpha_s - \sum_{i \in [n]} \sum_{j \in [k]: j \ne y_i} \alpha_{ij}$ subject to $\begin{cases} 0 \le \alpha_{ij} \le C &: j \ne y_i\\ \alpha_{iy_i} = -\sum_{j \in [k]: j \ne y_i} \alpha_{ij} &: j = y_i. \end{cases}$

Note that $$\alpha_{st}$$ is fixed for all $$s \ne i$$ and all $$t \in [k]$$. By cycling through the $$i\in [n]$$ and (approximately) solving the above subproblem, we get a form of coordinate descent.

Let $$\widehat{\alpha}$$ be the next iterate of $$\alpha$$ where

$\widehat{\alpha}_s = \alpha_s,\quad \forall s \ne i.$

Then we have

$f(\widehat{\alpha}) = \frac{M^2}{2}\|x_i\|^2 \|\widehat\alpha_i\|^2 + M^2 x_i' \left( \sum_{s \in [n]: s \ne i} x_s \alpha_s' \right) \widehat \alpha_i - \sum_{j \in [k]: j \ne y_i} \widehat{\alpha}_{ij} +C_i$ where $$C_i \in \mathbb{R}$$ does not depend on $$\widehat{\alpha}_{i}$$.

Now, observe that

$x_i' \left( \sum_{s \in [n]: s \ne i} x_s \alpha_s' \right) \widehat \alpha_i = x_i' \left( \sum_{s \in [n]} x_s \alpha_s' \right) \widehat \alpha_i - x'_ix_i \alpha_i' \widehat\alpha_i = - \frac{1}{M} x_i' W \widehat \alpha_i - \|x_i\|^2 \alpha_i' \widehat\alpha_i$ where we used $$\sum_{s \in [n]} x_s \alpha_s' = -\frac{1}{M} W$$.

Thus $f(\widehat{\alpha}) = \frac{M^2}{2}\|x_i\|^2 \|\widehat\alpha_i\|^2 -M x_i' W\widehat{\alpha}_i -M^2\|x_i\|^2\alpha_i' \widehat{\alpha}_i - \sum_{j \in [k]: j \ne y_i} \widehat{\alpha}_{ij} +C_i.$

Let $$\mathbb{1}_{\neg y_i} \in \mathbb{R}^k$$ be the vector whose entries are all ones except the $$y_i$$-th entry, which is zero. The above can be written succinctly as

$f(\widehat{\alpha}) = \frac{M^2}{2}\|x_i\|^2 \|\widehat\alpha_i\|^2 - \widehat{\alpha}_i' ( M W' x_i + M^2\|x_i\|^2 \alpha_i + \mathbb{1}_{\neg y_i} ) +C_i.$
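Since the constant $$C_i$$ drops out of differences, the coefficients in this compact expression can also be checked numerically. The sketch below continues the running example from the gradient check (same X, y, alpha, i, j, M) and verifies that the compact formula and the full dual objective change by the same amount when the $$i$$-th block moves.

```python
def ww_block_objective(alpha_hat_i, i, alpha, X, y, M=1.0):
    """f(alpha_hat) up to the constant C_i, for alpha_hat agreeing with alpha off block i."""
    k = alpha.shape[1]
    W = ww_weights_from_dual(alpha, X, M)        # W is built from the *current* alpha
    ones_not_yi = np.ones(k)
    ones_not_yi[y[i]] = 0.0
    xi_norm2 = X[i] @ X[i]
    lin = M * (W.T @ X[i]) + M**2 * xi_norm2 * alpha[i] + ones_not_yi
    return 0.5 * M**2 * xi_norm2 * (alpha_hat_i @ alpha_hat_i) - alpha_hat_i @ lin

def f_with_block(block):
    a = alpha.copy()
    a[i] = block
    return ww_dual_objective(a, X, y, M)

# Two candidate values for the i-th block, both keeping the row sum at zero.
a1 = alpha[i].copy()
a2 = alpha[i].copy()
a2[j] += 0.1
a2[y[i]] -= 0.1

# The two differences should agree, since C_i cancels.
print(ww_block_objective(a2, i, alpha, X, y, M) - ww_block_objective(a1, i, alpha, X, y, M))
print(f_with_block(a2) - f_with_block(a1))
```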

Update of a single block dual variable

Shark considers updates of the following form:

$\widehat{\alpha}_s = \begin{cases} \alpha_s &: s \ne i\\ \alpha_i + \mu &: s = i \end{cases}$ where $$\mu \in \mathbb{R}^k$$ is a step vector satisfying

$\mu_{y_i} = -\sum_{t \in [k]: t \ne y_i} \mu_t.$ The above constraint is to ensure that $$\widehat{\alpha}$$ remains (dual) feasible. The update step is implemented in the updateWeightVectors function assuming that $$\mu$$ is given. In the next section, we discuss how Shark computes the $$\mu$$ vector.

Computing the step vector

Let $$m$$ be arbitrary such that $$\alpha_{ij} + m \in [0,C]$$. We consider the updates $\begin{cases} \hat{\alpha}_{it} = \alpha_{it} &: t \not\in \{j,y_i\} \\ \hat{\alpha}_{it} = \alpha_{it} + m &: t =j \\ \hat{\alpha}_{it} = \alpha_{it} - m &: t =y_i \\ \end{cases}$

Similarly, define

$\widehat W = [\hat w_1, \cdots, \hat w_k] = - M \sum_{s = 1}^n x_s \hat \alpha_s'.$

Then by construction, we have for $$t \in [k]$$ that

$\hat w_t = \begin{cases} w_t &: t \not \in \{j, y_i\}\\ w_j - Mmx_i &: t = j \\ w_{y_i} + Mmx_i &: t = y_i. \end{cases}$

Thus, we have $-\frac{\partial f}{\partial \alpha_{it}}( \hat \alpha) = 1 - M(\hat w_{y_i} ' x_i - \hat w_t' x_i) = \begin{cases} 1 - M(w_{y_i} ' x_i - w_j' x_i) - 2 M^2m \|x_i\|^2 &: t = j \\ 1 - M(w_{y_i} ' x_i - w_t' x_i) - 1 M^2m \|x_i\|^2 &: t \ne j \\ \end{cases}$

Thus, $-\frac{\partial f}{\partial \alpha_{it}}( \hat \alpha) = \begin{cases} -\frac{\partial f}{\partial \alpha_{it}}( \alpha) - 2 M^2m \|x_i\|^2 &: t = j \\ -\frac{\partial f}{\partial \alpha_{it}}( \alpha) - 1 M^2m \|x_i\|^2 &: t \ne j \\ \end{cases}$
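Continuing the running example, here is a quick numerical check that increasing $$\alpha_{ij}$$ by $$m$$ and decreasing $$\alpha_{iy_i}$$ by $$m$$ shifts the negative gradient by exactly these amounts.

```python
def neg_gradient_block(alpha, X, y, i, M=1.0):
    """Negative gradient 1 - M (w_{y_i}'x_i - w_t'x_i) for all t in [k] at once."""
    W = ww_weights_from_dual(alpha, X, M)
    scores = W.T @ X[i]                         # scores[t] = w_t' x_i
    return 1.0 - M * (scores[y[i]] - scores)

m = 0.1
alpha_hat = alpha.copy()
alpha_hat[i, j] += m
alpha_hat[i, y[i]] -= m

delta = neg_gradient_block(alpha_hat, X, y, i, M) - neg_gradient_block(alpha, X, y, i, M)
xi_norm2 = X[i] @ X[i]

print(delta[j], -2 * M**2 * m * xi_norm2)       # t = j
for t in range(alpha.shape[1]):
    if t not in (j, y[i]):
        print(delta[t], -M**2 * m * xi_norm2)   # t not in {j, y_i}
```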

In Shark, we have $$M = 1/2$$ and so the gradient update is

$-\frac{\partial f}{\partial \alpha_{it}}( \hat \alpha) = \begin{cases} -\frac{\partial f}{\partial \alpha_{it}}( \alpha) - m \frac{\|x_i\|^2}{2} &: t = j, \\ -\frac{\partial f}{\partial \alpha_{it}}( \alpha) - 0.5 m \frac{\|x_i\|^2}{2} &: t \ne j. \end{cases}$

This is the mathematical justification behind this code snippet from Shark.

Greedy step size

Consider a function $\Phi(x) = ax^2 + bx+c$ with $$a > 0$$. The derivative at $$x$$ is $\Phi'(x) = 2ax + b.$ Setting this to zero, the minimizer is $x_* = -\frac{b}{2a} = x + \left(-\frac{b}{2a}-x\right) = x + m$

Next, observe that $m= -\frac{b}{2a}-x = -\frac{1}{2a} (2ax+b) = -\frac{1}{2a} \Phi'(x)$

We now apply this principle to the problem above, with $$x = \alpha_{ij}$$, $$\Phi'(x) = \frac{\partial f}{\partial \alpha_{ij}} (\alpha)$$, and $$a =M^2\|x_i\|^2$$. Plugging everything in, we have

$m = \frac{1}{2(M\|x_i\|)^2} \left(- \frac{\partial f}{\partial \alpha_{ij}} (\alpha)\right) = - \frac{\partial f}{\partial \alpha_{ij}} (\alpha) \left(\frac{1}{2(M\|x_i\|)^2}\right).$
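Continuing the running example, the greedy step for the coordinate $$\alpha_{ij}$$ then looks as follows; in practice the step also has to be clipped so that $$\alpha_{ij} + m$$ stays in $$[0, C]$$, as the box constraint requires.

```python
W = ww_weights_from_dual(alpha, X, M)
grad_ij = M * (W[:, y[i]] - W[:, j]) @ X[i] - 1.0   # df/dalpha_{ij}(alpha)
xi_norm2 = X[i] @ X[i]

# Greedy (Newton) step on the one-dimensional quadratic, with a = M^2 ||x_i||^2.
m = -grad_ij / (2 * M**2 * xi_norm2)

# Clip so that the updated alpha_{ij} stays in [0, C].
m = np.clip(alpha[i, j] + m, 0.0, C) - alpha[i, j]
```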

Since $$M = 1/2$$, we have

$m = - \frac{\partial f}{\partial \alpha_{ij}} (\alpha) \frac{2}{\|x_i\|^2}.$

This is the update at line 464.
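Putting the pieces together, here is a sketch of one full pass of this coordinate descent on the linear (non-kernelized) dual. It is my own simplified rendition rather than Shark's implementation: it simply cycles over all pairs $$(i, j)$$ with $$j \ne y_i$$ and takes the clipped greedy step derived above, keeping $$W$$ in sync with $$\alpha$$ along the way.

```python
import numpy as np

def ww_coordinate_descent_pass(alpha, X, y, C, M=0.5):
    """One pass of greedy coordinate descent on the Weston-Watkins dual.

    alpha : (n, k) feasible dual variables, modified in place.
    Returns the primal weights W = -M sum_i x_i alpha_i' after the pass.
    """
    n, k = alpha.shape
    W = -M * X.T @ alpha                                  # keep W in sync with alpha
    for i in range(n):
        xi = X[i]
        xi_norm2 = xi @ xi
        if xi_norm2 == 0.0:
            continue
        for j in range(k):
            if j == y[i]:
                continue
            # df/dalpha_{ij}(alpha) = M (w_{y_i}'x_i - w_j'x_i) - 1
            grad = M * (W[:, y[i]] - W[:, j]) @ xi - 1.0
            m = -grad / (2 * M**2 * xi_norm2)                     # greedy step
            m = np.clip(alpha[i, j] + m, 0.0, C) - alpha[i, j]    # box constraint
            if m == 0.0:
                continue
            # Update the dual block (preserving the row-sum-zero constraint) ...
            alpha[i, j] += m
            alpha[i, y[i]] -= m
            # ... and the two affected primal columns w_j and w_{y_i}.
            W[:, j] -= M * m * xi
            W[:, y[i]] += M * m * xi
    return W
```

Starting from the feasible point $$\alpha = 0$$ and repeating such passes can only decrease $$f$$, since each step exactly minimizes the one-dimensional quadratic restricted to the box constraint.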

Doǧan, Ürün, Tobias Glasmachers, and Christian Igel. 2016. “A Unified View on Multi-Class Support Vector Classification.” Journal of Machine Learning Research 17 (1): 1550–1831.

Igel, Christian, Verena Heidrich-Meisner, and Tobias Glasmachers. 2008. “Shark.” Journal of Machine Learning Research 9 (Jun): 993–96.

Keerthi, S. Sathiya, Sellamanickam Sundararajan, Kai-Wei Chang, Cho-Jui Hsieh, and Chih-Jen Lin. 2008. “A Sequential Dual Method for Large Scale Multi-Class Linear SVMs.” In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 408–16.

Weston, Jason, and Chris Watkins. 1999. “Support Vector Machines for Multi-Class Pattern Recognition.” In Proc. 7th European Symposium on Artificial Neural Networks, 1999.