This blog post decribes the recent NeurIPS 2018 paper and companion code on smooth training of max-margin structured prediction models. Training structured prediction models consists in optimizing a non-smooth objective using an inference combinatorial optimization algorithm.
We propose a framework called Casimir based on the Catalyst acceleration and infimal-convolution smoothing allowing us to break the non-smoothness barrier and obtain fast incremental algorithms for large-scale training of deep structured prediction models.
Structured prediction consists in predicting complex outputs such as sequences, trees or lattices. For instance, named entity recognition can be cast as the task of predicting a sequence of tags, one for each word which identifies the word as a named entity.
In this example, an output is a chain of tags, where each tag can take values from a dictionary. In general, the set of all outputs is finite but too large to enumerate. To overcome this difficulty, a score function , parameterized by , is defined to measure the compatibility of the input-output pair as . This score function decomposes over the structure of the outputs (e.g., a chain) so that predictions can be made by an inference procedure which finds
The inference problem can be solved in various settings of interest by efficient combinatorial algorithms such as the Viterbi algorithm for named entity recognition.
The goal of the learning problem is to find the best parameter so that inference produces the correct output. Given a loss such as the Hamming loss, max-margin structured prediction aims to minimize a surrogate called the structural hinge loss, which is defined for an input-output pair as
where is defined as a generalization of the margin used in classical support vector machine,
The optimization problem is defined for samples as the regularized empirical surrogate risk minimization
The subgradient of is computed by running the inference procedure as
Though the above formulation allows one to use tractable first-order information through combinatorial procedures, its non-smoothness prevents us from using fast incremental optimization algorithms. We overcome this challenge by blending an extrapolation scheme for acceleration and an adaptive smoothing scheme.
We now wish to smooth the objective function in order to apply incremental algorithms for smooth optimization. This is not straightforward because each is computed by a discrete inference algorithms.
To smooth , we first note that it can be written as the composition , where
The non-smooth max function is simply smoothed by adding a strongly convex function to its dual formulation as
|Smoothing type||Smoothing computation|
|projection on simplex|
The smooth structural hinge loss is obtained by replacing the non-smooth with its smooth counterpart as
In the structured prediction setting, entropy smoothing is equivalent to a conditional random field [LMP01], which is only tractable for tree structured outputs. On the other hand, the sparse outputs of smoothing can be well approximated by picking a small integer and considering the top- highest scoring outputs. This makes smoothing more feasible for tree structured outputs as well as select loopy output structures. See the illustration below for the example of named entity recognition.
Formally, we define inference oracles, namely the max, top- and exp oracles as first order oracles for the structural hinge loss and its smoothed variants with and entropy smoothing respectively. This allows us to measure the complexity of optimization algorithms, which we discuss next. The table below shows how the smooth inference oracles are implemented for a given max oracle. Their computational complexity is given in terms of , the cost of max oracle and , the size of each output .
|Max oracle||Top- oracle||Exp oracle|
|Max-product||Top- max-product, time||Sum-product, time|
|Graph cut||BMMF, time||Intractable|
|Graph matching||BMMF, time||Intractable|
|Branch and bound search||Top- search||Intractable|
Here, BMMF is the Best max marginal first algorithm of [YW03].
The optimization algorithms depend on the nature of the map . The structural hinge loss, and thus , are convex if this map is linear. Otherwise, could be nonconvex in general.
Convex structured prediction
Classical structured prediction [ATH03, TGK04] uses a linear score where is a hand-engineered feature map. The objective is convex here and we extend the generic Catalyst acceleration scheme [LMH18] to optimize nonsmooth objectives by using smooth surrogates. See also this blog post for a review. In particular, each iteration considers smoothed and regularized surrogates of the form
Starting at , at each step , do
- Approximately solve using a linearly convergent algorithm :
- Extrapolate to get
When using, for instance, SVRG [JZ15] as the linearly convergent algorithm , we are guaranteed to get approximate solution after iterations where
Deep structured prediction
More broadly, deep structured prediction attempts to learn the feature map . The score function is nonlinear in so that is nonconvex. In this case, the prox-linear algorithm [B85,DP18] is applicable. It was described previously here. This amounts to considering a linear approximation to as
to get a regularized convex model
The convex sub-problem above can then be optimized by the convex optimization algorithm described earlier.
Given below are results of numerical experiments for named entity recognition on CoNLL-2003 dataset and visual object localization on the PASCAL VOC dataset. We first show the performance of the proposed algorithm for the convex case, where the feature maps are predefined. Casimir-SVRG-const and Casimir-SVRG-adapt are the two variants with constant and adaptive smoothing respectively.
Next, we consider deep structured prediction where the feature map is learnt using a convolutional neural network.