Training Deep Structured Prediction Models at Scale


This blog post describes the recent NeurIPS 2018 paper and companion code on smooth training of max-margin structured prediction models. Training a structured prediction model amounts to optimizing a non-smooth objective whose first-order information is obtained by calling a combinatorial inference algorithm.

We propose a framework called Casimir, based on Catalyst acceleration and infimal-convolution smoothing, that allows us to break the non-smoothness barrier and obtain fast incremental algorithms for large-scale training of deep structured prediction models.

Setting

Structured prediction consists in predicting complex outputs such as sequences, trees or lattices. For instance, named entity recognition can be cast as the task of predicting a sequence of tags, one for each word, identifying whether that word is part of a named entity.

In this example, an output $y$ is a chain of tags, where each tag can take values from a finite dictionary. In general, the set $\mathcal{Y}$ of all outputs is finite but too large to enumerate. To overcome this difficulty, a score function $\phi(x, y; w)$, parameterized by $w$, is defined to measure the compatibility of the input-output pair $(x, y)$. This score function decomposes over the structure of the outputs (e.g., a chain) so that predictions can be made by an inference procedure which finds

$$ y^\star(x; w) \in \operatorname*{argmax}_{y \in \mathcal{Y}} \, \phi(x, y; w). $$

The inference problem can be solved in various settings of interest by efficient combinatorial algorithms such as the Viterbi algorithm for named entity recognition.
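
To make the chain-structured case concrete, here is a minimal NumPy sketch of Viterbi decoding; the unary and pairwise score arrays are hypothetical stand-ins for the decomposition of the score function over the chain.

```python
import numpy as np

def viterbi_decode(unary, pairwise):
    """Find the highest-scoring tag sequence for a chain-structured score.

    unary:    (T, m) array, unary[t, j] = score of tag j at position t.
    pairwise: (m, m) array, pairwise[i, j] = score of transition i -> j.
    Returns the argmax tag sequence and its total score.
    """
    T, m = unary.shape
    dp = np.zeros((T, m))                 # dp[t, j] = best score of a prefix ending in tag j
    back = np.zeros((T, m), dtype=int)    # backpointers for recovering the argmax
    dp[0] = unary[0]
    for t in range(1, T):
        cand = dp[t - 1][:, None] + pairwise + unary[t][None, :]   # (m, m) candidate scores
        back[t] = cand.argmax(axis=0)
        dp[t] = cand.max(axis=0)
    # Backtrack from the best final tag.
    tags = np.zeros(T, dtype=int)
    tags[-1] = dp[-1].argmax()
    for t in range(T - 1, 0, -1):
        tags[t - 1] = back[t, tags[t]]
    return tags, dp[-1].max()

# Toy example: 4 words, 3 tags.
rng = np.random.default_rng(0)
tags, score = viterbi_decode(rng.normal(size=(4, 3)), rng.normal(size=(3, 3)))
print(tags, score)
```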

The goal of the learning problem is to find the best parameter $w$ so that inference produces the correct output. Given a loss $\ell(y, y')$, such as the Hamming loss, max-margin structured prediction aims to minimize a surrogate called the structural hinge loss, which is defined for an input-output pair $(x_i, y_i)$ as

$$ f_i(w) = \max_{y' \in \mathcal{Y}} \, \psi_i(y'; w), $$

where the augmented score $\psi_i(y'; w)$ is defined as a generalization of the margin used in the classical support vector machine,

$$ \psi_i(y'; w) = \ell(y_i, y') + \phi(x_i, y'; w) - \phi(x_i, y_i; w). $$

The optimization problem is defined for $n$ samples $(x_1, y_1), \dots, (x_n, y_n)$ as the regularized empirical surrogate risk minimization

$$ \min_{w} \; F(w) := \frac{1}{n} \sum_{i=1}^{n} f_i(w) + \frac{\lambda}{2} \|w\|_2^2. $$

A subgradient of $f_i$ is computed by running the (loss-augmented) inference procedure as

$$ \nabla_w \phi(x_i, y_i^\star; w) - \nabla_w \phi(x_i, y_i; w) \in \partial f_i(w), \qquad \text{where} \quad y_i^\star \in \operatorname*{argmax}_{y' \in \mathcal{Y}} \, \psi_i(y'; w). $$
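
As a toy illustration of these two formulas, the sketch below evaluates the structural hinge loss and one subgradient for a linear score $\phi(x, y; w) = \langle w, \Phi(x, y)\rangle$ by brute-force enumeration of a tiny output set; the names `feature_map`, `task_loss` and the enumeration itself are hypothetical stand-ins for the combinatorial max oracle used in practice.

```python
import numpy as np

def structural_hinge(w, x, y, outputs, feature_map, task_loss):
    """Structural hinge loss and one subgradient for a linear score <w, Phi(x, y)>.

    `outputs` enumerates a small output set; in practice the argmax below is
    replaced by a combinatorial max oracle (e.g. loss-augmented Viterbi).
    """
    # Loss-augmented inference: y_star = argmax_y' [ <w, Phi(x, y')> + loss(y, y') ].
    scores = [w @ feature_map(x, y_prime) + task_loss(y, y_prime) for y_prime in outputs]
    y_star = outputs[int(np.argmax(scores))]
    value = max(scores) - w @ feature_map(x, y)
    subgrad = feature_map(x, y_star) - feature_map(x, y)   # an element of the subdifferential
    return value, subgrad

# Toy multiclass example (each "structure" is a single label).
labels = [0, 1, 2]
Phi = lambda x, y: np.concatenate([x if y == k else np.zeros_like(x) for k in labels])
loss = lambda y, yp: float(y != yp)
w = np.zeros(6); x = np.array([1.0, -1.0]); y = 1
print(structural_hinge(w, x, y, labels, Phi, loss))
```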

Though the above formulation allows one to use tractable first-order information through combinatorial procedures, its non-smoothness prevents us from using fast incremental optimization algorithms. We overcome this challenge by blending an extrapolation scheme for acceleration and an adaptive smoothing scheme.

Smoothing

We now wish to smooth the objective function in order to apply incremental algorithms designed for smooth optimization. This is not straightforward because each $f_i$ is computed by a discrete inference algorithm.

To smooth $f_i$, we first note that it can be written as the composition $f_i = h \circ g_i$, where

$$ h(z) = \max_{1 \le j \le m} z_j \qquad \text{and} \qquad g_i(w) = \big(\psi_i(y'; w)\big)_{y' \in \mathcal{Y}} \in \mathbb{R}^m, \quad m = |\mathcal{Y}|. $$

The non-smooth max function $h$ is smoothed by adding a strongly convex function $\omega$ to its dual formulation as

$$ h_{\mu\omega}(z) = \max_{u \in \Delta^{m-1}} \big\{ \langle u, z \rangle - \mu\, \omega(u) \big\}, $$

where $\Delta^{m-1}$ is the simplex in $\mathbb{R}^m$ and $\mu > 0$ is the smoothing parameter. It can be shown that $h_{\mu\omega}$ is a smooth approximation of $h$ up to an additive error proportional to $\mu$ [N05, BT12]. Common choices of $\omega$ are given below; a code sketch of both follows the table.

Smoothing type | Smoothing computation
Entropy | log-sum-exp
Squared $\ell_2$ norm | projection onto the simplex
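
Both choices admit simple computations: entropy smoothing is a temperature-scaled log-sum-exp whose gradient is a softmax, while $\ell_2$ smoothing reduces to a Euclidean projection onto the simplex. Below is a minimal NumPy sketch of the two smoothed max functions (not the reference implementation from the Casimir package).

```python
import numpy as np

def entropy_smoothed_max(z, mu):
    """Entropy smoothing: mu * log sum exp(z / mu); the gradient is the softmax of z / mu."""
    zmax = z.max()
    value = zmax + mu * np.log(np.exp((z - zmax) / mu).sum())
    grad = np.exp((z - value) / mu)
    return value, grad

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def l2_smoothed_max(z, mu):
    """l2 smoothing: max_{u in simplex} <u, z> - (mu / 2) * ||u||^2, via a projection."""
    u = project_simplex(z / mu)          # the maximizer, which is also the gradient
    return u @ z - 0.5 * mu * (u @ u), u

z = np.array([1.0, 0.5, -0.2])
print(np.max(z), entropy_smoothed_max(z, mu=0.1)[0], l2_smoothed_max(z, mu=0.1)[0])
```

Note that the $\ell_2$ maximizer `u` is typically sparse, which is what makes the top-$K$ approximation described next work well.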

The smoothed structural hinge loss is obtained by replacing the non-smooth $h$ with its smooth counterpart $h_{\mu\omega}$ as

$$ f_{i, \mu\omega}(w) = h_{\mu\omega}\big(g_i(w)\big). $$

In the structured prediction setting, entropy smoothing is equivalent to a conditional random field [LMP01], which is only tractable for tree-structured outputs. On the other hand, the sparse output of $\ell_2$ smoothing can be well approximated by picking a small integer $K$ and considering only the $K$ highest scoring outputs. This makes $\ell_2$ smoothing feasible for tree-structured outputs as well as select loopy output structures; a sketch of this top-$K$ approximation follows the illustration. See the illustration below for the example of named entity recognition.

[Figure: augmented scores over outputs for named entity recognition, without smoothing, with $\ell_2$ smoothing, and with entropy smoothing.]
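
If a top-$K$ oracle returns the $K$ largest augmented scores, the $\ell_2$-smoothed max can be approximated by projecting just those $K$ scores onto the simplex; the approximation is exact whenever the exact projection puts nonzero weight on at most $K$ outputs. A minimal sketch, reusing `project_simplex` (and the NumPy import) from the smoothing sketch above:

```python
def topk_l2_smoothed_max(topk_scores, mu):
    """Approximate the l2-smoothed max from the K largest augmented scores only.

    Exact whenever the simplex projection of the full score vector is
    supported on at most K outputs (the projection is sparse).
    """
    u = project_simplex(np.asarray(topk_scores) / mu)   # weights over the K retained outputs
    return u @ topk_scores - 0.5 * mu * (u @ u), u

# e.g. with the K = 3 highest augmented scores returned by a top-K oracle:
print(topk_l2_smoothed_max(np.array([2.1, 1.9, 0.3]), mu=0.5))
```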

Formally, we define inference oracles, namely the max, top-$K$ and exp oracles, as first-order oracles for the structural hinge loss and its smoothed variants with $\ell_2$ and entropy smoothing respectively. This allows us to measure the complexity of optimization algorithms, which we discuss next. The table below shows how the smooth inference oracles are implemented for a given max oracle. Their computational complexity is given in terms of $T_{\max}$, the cost of one call to the max oracle, and $p$, the size of each output $y$.

Max oracle | Top-$K$ oracle | Exp oracle
Max-product | Top-$K$ max-product, $O(K\,T_{\max})$ time | Sum-product, $O(T_{\max})$ time
Graph cut | BMMF, $O(Kp\,T_{\max})$ time | Intractable
Graph matching | BMMF, $O(Kp\,T_{\max})$ time | Intractable
Branch and bound search | Top-$K$ search | Intractable

Here, BMMF refers to the Best Max-Marginal First algorithm of [YW03].

Optimization algorithms

The optimization algorithms depend on the nature of the map $w \mapsto \phi(x, y; w)$. The structural hinge loss, and thus $F$, are convex if this map is linear in $w$. Otherwise, $F$ could be nonconvex in general.

Convex structured prediction

Classical structured prediction [ATH03, TGK04] uses a linear score $\phi(x, y; w) = \langle w, \Phi(x, y) \rangle$, where $\Phi$ is a hand-engineered feature map. The objective $F$ is convex here, and we extend the generic Catalyst acceleration scheme [LMH18] to optimize non-smooth objectives by using smooth surrogates. See also this blog post for a review. In particular, each iteration approximately minimizes a smoothed and regularized surrogate of the form

$$ F_{\mu_k}(w) + \frac{\kappa}{2} \|w - z_{k-1}\|_2^2, \qquad \text{where} \quad F_{\mu_k}(w) = \frac{1}{n} \sum_{i=1}^{n} f_{i, \mu_k \omega}(w) + \frac{\lambda}{2} \|w\|_2^2. $$

Algorithm

Starting at $w_0 = z_0$, at each step $k \geq 1$, do

  • Approximately solve, using a linearly convergent algorithm $\mathcal{M}$: $\; w_k \approx \operatorname*{argmin}_{w} \big\{ F_{\mu_k}(w) + \tfrac{\kappa}{2}\|w - z_{k-1}\|_2^2 \big\}$
  • Extrapolate to get $z_k = w_k + \beta_k (w_k - w_{k-1})$

When using, for instance, SVRG [JZ15] as the linearly convergent algorithm $\mathcal{M}$, we are guaranteed to reach an $\epsilon$-approximate solution after a number of iterations that scales as $1/\sqrt{\epsilon}$, up to problem-dependent constants and logarithmic factors, improving over the $1/\epsilon$ scaling of non-accelerated methods.
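
Schematically, the outer loop can be pictured as below; `solve_inner` (an approximate minimizer of the smoothed, prox-regularized surrogate, e.g. SVRG), the smoothing schedule `mus`, the prox weight `kappa` and the extrapolation weights `betas` are hypothetical placeholders for the choices analyzed in the paper, not the Casimir package's API.

```python
import numpy as np

def casimir_outer_loop(w0, solve_inner, mus, kappa, betas):
    """Schematic Catalyst-style outer loop with smoothing (not the reference implementation).

    solve_inner(mu, kappa, center, init) approximately minimizes
        F_mu(w) + (kappa / 2) * ||w - center||^2
    with a linearly convergent incremental method such as SVRG.
    """
    w_prev = w0
    z = w0.copy()
    for mu, beta in zip(mus, betas):
        w = solve_inner(mu, kappa, center=z, init=w_prev)   # inner smoothed subproblem
        z = w + beta * (w - w_prev)                          # extrapolation step
        w_prev = w
    return w_prev
```

The extrapolation step mirrors Nesterov-style acceleration, while the smoothing parameter is either held constant or decreased across outer iterations.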

Deep structured prediction

More broadly, deep structured prediction attempts to learn the feature map itself, e.g., with a neural network. The score function $\phi(x, y; w)$ is then nonlinear in $w$, so that $F$ is nonconvex. In this case, the prox-linear algorithm [B85, DP18] is applicable. It was described previously here. This amounts to considering a linear approximation of $g_i$ around the current iterate $w_k$,

$$ g_i(w) \approx g_i(w_k) + \nabla g_i(w_k)\,(w - w_k), $$

to get a regularized convex model

$$ \min_{w} \; \frac{1}{n} \sum_{i=1}^{n} h_{\mu\omega}\big( g_i(w_k) + \nabla g_i(w_k)\,(w - w_k) \big) + \frac{\lambda}{2}\|w\|_2^2 + \frac{1}{2\eta} \|w - w_k\|_2^2. $$

The convex sub-problem above can then be optimized by the convex optimization algorithm described earlier.
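
To picture a single prox-linear step, the sketch below linearizes each $g_i$ at the current iterate, keeps the convex smoothed max on top, and minimizes the resulting strongly convex model with plain gradient descent as a stand-in for the accelerated incremental solver; the explicit Jacobians and the function `smoothed_max` (e.g. `entropy_smoothed_max` from the earlier sketch) are simplifying assumptions, not the Casimir implementation.

```python
import numpy as np

def prox_linear_step(w_k, g_vals, jacobians, smoothed_max, lam, eta, mu=0.1,
                     n_inner=200, lr=0.1):
    """One prox-linear step on F(w) = (1/n) sum_i h(g_i(w)) + (lam/2) ||w||^2.

    g_vals[i]    = g_i(w_k)                 (vector of augmented scores at w_k)
    jacobians[i] = Jacobian of g_i at w_k   (explicit matrix, for simplicity)
    smoothed_max(z, mu) -> (value, gradient), e.g. entropy_smoothed_max above.
    The convex model is minimized by plain gradient descent as a stand-in
    for the accelerated incremental solver used in practice.
    """
    n = len(g_vals)
    w = w_k.copy()
    for _ in range(n_inner):
        grad = lam * w + (w - w_k) / eta                    # regularizer + prox term
        for z_k, J in zip(g_vals, jacobians):
            _, u = smoothed_max(z_k + J @ (w - w_k), mu)    # gradient of the smoothed max
            grad += J.T @ u / n                             # chain rule through the linearization
        w -= lr * grad
    return w
```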

Numerical experiments

Given below are results of numerical experiments for named entity recognition on the CoNLL-2003 dataset and visual object localization on the PASCAL VOC dataset. We first show the performance of the proposed algorithm in the convex case, where the feature maps are predefined. Casimir-SVRG-const and Casimir-SVRG-adapt are the two variants with constant and adaptive smoothing respectively.

[Figure: convergence plots for named entity recognition (left) and visual object localization (right) in the convex setting.]

Next, we consider deep structured prediction, where the feature map is learned using a convolutional neural network.

[Figure: convergence plot for visual object localization with a convolutional neural network feature map.]

A software package called Casimir implementing all these algorithms and more is available here.