# Learning N:M Fine-grained Structured Sparse Neural Networks From Scratch

Aojun Zhou, Yukun Ma, Junnan Zhu, Jianbo Liu, Zhijie Zhang, Kun Yuan, Wenxiu Sun, Hongsheng Li

Published as a conference paper at ICLR 2021


SenseTime, CUHK-SenseTime Joint Lab, CUHK, Northwestern University, NLPR, CASIA

∗ The first two authors contributed equally to this paper.

## Abstract

Sparsity in Deep Neural Networks (DNNs) has been widely studied to compress and accelerate models in resource-constrained environments. It can be generally categorized into unstructured fine-grained sparsity, which zeroes out multiple individual weights distributed across the neural network, and structured coarse-grained sparsity, which prunes blocks of sub-networks of a neural network. Fine-grained sparsity can achieve a high compression ratio but is not hardware friendly and hence yields limited speed gains. Coarse-grained sparsity, on the other hand, cannot concurrently achieve both apparent acceleration on modern GPUs and decent performance. In this paper, we are the first to study training from scratch an

N:M fine-grained structured sparse network, which can maintain the advantages of both unstructured fine-grained sparsity and structured coarse-grained sparsity simultaneously on specifically designed GPUs. Specifically, a 2:4 sparse network could achieve 2× speed-up without performance drop on Nvidia A100 GPUs. Furthermore, we propose a novel and effective ingredient, the sparse-refined straight-through estimator (SR-STE), to alleviate the negative influence of the approximated gradients computed by vanilla STE during optimization. We also define a metric, Sparse Architecture Divergence (SAD), to measure a sparse network's topology change during the training process. Finally, we justify SR-STE's advantages with SAD and demonstrate the effectiveness of SR-STE through comprehensive experiments on various tasks. Source code and models are available at https://github.com/NM-sparsity/NM-sparsity.

## 1 Introduction

Deep neural networks (DNNs) have shown promising performance on various tasks including computer vision, natural language processing, speech recognition, etc. However, a DNN usually comes with a large number of learnable parameters, ranging from millions to even billions (e.g., GPT-3 (Brown et al., 2020)), making the model burdensome and difficult to apply in real-world deployments. Therefore, researchers began to investigate how to speed up and compress DNNs via various methods such as knowledge distillation (Hinton et al., 2015), quantization (Jacob et al., 2018; Zhou et al., 2017), designing efficient model architectures (Howard et al., 2017), and structured sparsity (Wen et al., 2016; Li et al., 2016).

In this paper, we focus on the problem of sparsifying DNNs. Sparsity in DNNs can be categorized into unstructured sparsity and structured sparsity. Unstructured sparsity prunes individual weights at any location, which is fine-grained and can achieve extremely high compression ratios (Han et al., 2015; Guo et al., 2016). However, unstructured sparsity struggles to take advantage of vector-processing architectures such as SIMD, and it poorly utilizes memory buses, which increases latency due to dependent sequences of reads (Nvidia, 2020). Compared with unstructured sparsity, structured sparsity is more friendly to hardware, especially for block pruning (Wang et al., 2019), kernel shape sparsity (Tan et al., 2020), or channel and filter pruning (Li et al., 2016; Wen et al., 2016). Although structured sparsity can speed up DNNs on commodity hardware, it hurts model performance more significantly than unstructured fine-grained sparsity. For example, a ResNet-50 network generated by unstructured pruning can match the accuracy of the original network at a substantially higher compression ratio than is attainable with structured sparsity (Renda et al., 2020). Therefore, how to combine unstructured sparsity and structured sparsity to accelerate DNNs on modern hardware (e.g., GPUs) becomes a challenging yet valuable problem. Recently, the Nvidia Ampere A100 was equipped with Sparse Tensor Cores to accelerate 2:4 structured fine-grained sparsity.

N:M sparsity indicates the sparsity of DNNs in which only N weights are non-zero in every group of M continuous weights. To the best of our knowledge, the A100 is the first commodity sparse hardware, whose sparse tensor cores support several common operations including linear, convolutional, recurrent cells, transformer blocks, etc. Specifically, consider a typical matrix multiplication X × W in a DNN, where X and W denote the input tensor and the parameter tensor respectively. The Dense Tensor Cores implement the matrix multiplication in 2 cycles, while the Sparse Tensor Cores need only 1 cycle if the parameter tensor W satisfies the 2:4 structured sparse pattern.

Nvidia has proposed ASP (APEX's Automatic Sparsity) (Nvidia, 2020) to sparsify a dense neural network to satisfy the 2:4 fine-grained structured sparsity requirement. The recipe contains three steps: (1) training a dense network until convergence; (2) pruning for sparsity with magnitude-based single-shot pruning; (3) repeating the original training procedure. However, ASP is computationally expensive since it requires training the full dense model from scratch and then fine-tuning again. Therefore, we still lack a simple recipe to obtain a structured sparse DNN model that matches the dense network without extra fine-tuning. This paper addresses the question: Can we design a simple yet universal recipe to learn N:M sparse neural networks from scratch in an efficient way?

It is difficult to find the optimal sparse architecture (connections) and optimal parameters simultaneously when training sparse CNNs and Transformers (Evci et al., 2019b), although SET-MLP can easily outperform a dense MLP (Bourgin et al., 2019). There are two schemes to obtain such sparse models.
One is a two-stage scheme, which discovers a sparse neural architecture by pruning a well-trained dense network and then uses the same or even greater computational resources to retrain the sparse model (Nvidia, 2020; Evci et al., 2019b; Han et al., 2015; Frankle & Carbin, 2018). The other is a one-stage scheme, which adopts a dynamic method to alternately optimize parameters and prune network architectures based on different criteria (Bellec et al., 2017; Mocanu et al., 2018; Mostafa & Wang, 2019; Evci et al., 2019b; Kusupati et al., 2020; Dettmers & Zettlemoyer, 2019). Compared with the two-stage scheme, the one-stage scheme can save training time and cost, but it usually obtains lower performance.

To overcome the aforementioned trade-off between training cost and performance, we present a simple yet effective framework to train sparse neural networks from scratch. Specifically, we employ the magnitude-based pruning method (Renda et al., 2020; Gale et al., 2019) during the forward pass. Considering that the pruning operation is a non-differentiable operator (a similar dilemma arises in model quantization (Courbariaux et al., 2016)), we extend the widely used Straight-through Estimator (STE) (Bengio et al., 2013) from model quantization to aid the sparse network's back-propagation. However, perturbations are introduced during back-propagation (Yin et al., 2019; Bengio et al., 2013). Hence we define Sparse Architecture Divergence (SAD) to further analyze N:M sparse networks trained with STE, so that we can identify the impact of these perturbations on sparse network training. Based on the SAD analysis, we propose a sparse-refined term that mitigates the influence of the approximated gradients.

We also compare the performance of neural networks with different granularities of fine-grained structured sparsity (i.e., 1:4, 2:4, 2:8, 4:8) and conduct thorough experiments on several typical deep neural networks with different N:M sparsity levels, covering image classification, detection, segmentation, optical flow estimation, and machine translation. Experimental results show that models with the proposed structured sparsity suffer a negligible performance drop and can even sometimes outperform their dense counterparts.

The main contributions of this paper are three-fold. (1) To the best of our knowledge, this is the first systematic study of training N:M structured sparse neural networks from scratch without performance drop. N:M structured sparsity is a missing yet promising ingredient in model acceleration, which can be a valuable supplement to various compression methods. (2) We extend STE to tackle the problem of training N:M sparse neural networks and propose SR-STE to alleviate its limitations. (3) We conduct extensive experiments on various tasks with N:M fine-grained sparse networks, and provide benchmarks for N:M sparse network training to facilitate the co-development of related software and hardware design.

https://github.com/NVIDIA/apex/tree/master/apex/contrib/sparsity

## 2 Related Work

Unstructured and Structured Sparsity.

Sparsity of DNNs is a promising direction for compressing and accelerating deep learning models. Among all sparsity types, unstructured sparsity can achieve significantly high compression ratios, e.g., 108× in Guo et al. (2016), while ensuring decent accuracy by pruning (Han et al., 2015). Many different pruning criteria and methods have been proposed for unstructured sparsity, e.g., magnitude-based pruning (Han et al., 2015; Frankle & Carbin, 2018), Hessian-based heuristics (LeCun et al., 1990), and pruning with connection sensitivity (Lee et al., 2018). However, unstructured sparsity's ability to accelerate is highly limited since storing the irregular non-zero index matrix incurs substantial overhead. On the other hand, Wen et al. (2016) introduce structured sparsity to speed up deep models on CPUs/GPUs. Existing structured sparsity includes filter-wise sparsity (Li et al., 2016), channel-wise sparsity (Li et al., 2016), filter-shape-wise sparsity, and depth-wise sparsity. Different from these existing sparsity patterns (fine-grained unstructured sparsity and coarse-grained structured sparsity), this paper presents N:M fine-grained structured sparsity, a sparsity type that offers both high efficiency and lossless performance.

One-stage and two-stage methods.

There are mainly two types of techniques to obtain a sparse neural network: one-stage methods and two-stage ones. A two-stage method first prunes a trained dense neural network and then retrains the fixed sparse network to recover its performance. Typical two-stage methods include single-shot pruning (Lee et al., 2018) and iterative pruning (Han et al., 2015; Guo et al., 2016). Later, the lottery ticket hypothesis (Frankle & Carbin, 2018) showed that a sparse sub-network (a "winning ticket") can be trained from scratch with the same initialization, although the winning tickets are discovered by dense training. Deep Rewiring (Bellec et al., 2017), on the other hand, is a typical one-stage method, which takes a Bayesian perspective and samples sparse network connections from a posterior; however, it is computationally expensive and challenging to apply to large-scale tasks. Sparse Evolutionary Training (SET) (Mocanu et al., 2018) is a simpler scheme in which weights are pruned according to the standard magnitude criterion and connections are grown in random locations. Dettmers & Zettlemoyer (2019) use the momentum of each parameter as the criterion for growing weights and obtain improved test accuracy. GMP (Gale et al., 2019) trains unstructured sparse networks from scratch using variational dropout and ℓ0 regularization, and shows that unstructured sparse architectures learned through pruning cannot be trained from scratch to the same test performance as dense models. The recently proposed state-of-the-art method STR (Kusupati et al., 2020) introduces learnable pruning thresholds to obtain a non-uniform sparse network. RigL (Evci et al., 2019a) uses a magnitude-based method to prune and periodic dense gradients to regrow connections; however, compared with training dense networks from scratch, RigL needs 5× more training time to achieve the same performance. The work most closely related to ours may be DNW (Wortsman et al., 2019), which uses a fully dense gradient in the backward pass to discover optimal wiring on the fly.

## 3 Method

### 3.1 N:M Fine-grained Structured Sparsity

Here we define the problem of training a neural network with N:M fine-grained structured sparsity. A neural network with N:M sparsity satisfies that, in each group of M consecutive weights of the network, at most N weights have non-zero values. Fig. 1 illustrates a 2:4 sparse network. Generally, our objective is to train an N:M sparse neural network as

$$\min_{\mathcal{W}} \; \mathcal{L}(\mathcal{S}(\mathcal{W}, N, M); \mathcal{D}), \tag{1}$$

where $\mathcal{D}$ denotes the observed data, $\mathcal{L}$ represents the loss function, $\mathcal{W} = \{\mathcal{W}_l : 0 < l \le L\}$ indicates the parameters of an $L$-layer neural network, and $\mathcal{S}(\mathcal{W}, N, M)$ denotes the N:M sparse network parameters obtained by projecting $\mathcal{W}$.
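To make the projection concrete, here is a minimal NumPy sketch of a magnitude-based $\mathcal{S}(\cdot)$ (the helper name `nm_project` is ours, not from the released code); it zeroes the M − N smallest-magnitude entries in each group of M consecutive weights:

```python
import numpy as np

def nm_project(w, n=2, m=4):
    """Magnitude-based N:M projection: in each group of m consecutive
    weights, keep the n largest-magnitude entries and zero the rest."""
    flat = np.asarray(w, dtype=float).ravel()
    groups = flat.reshape(-1, m)                      # groups of m weights
    prune_idx = np.argsort(np.abs(groups), axis=1)[:, : m - n]
    mask = np.ones_like(groups)
    np.put_along_axis(mask, prune_idx, 0.0, axis=1)   # 0 marks pruned entries
    return (groups * mask).reshape(np.shape(w))

w = np.array([0.1, -0.9, 0.4, 0.05, 0.3, -0.2, 0.8, -0.6])
w_sparse = nm_project(w, n=2, m=4)
# every group of 4 now has at most 2 non-zeros (the 2:4 constraint)
assert all(np.count_nonzero(g) <= 2 for g in w_sparse.reshape(-1, 4))
assert np.allclose(w_sparse[:4], [0.0, -0.9, 0.4, 0.0])
```

Ties are broken arbitrarily by `argsort`; in real convolution kernels the grouping is applied along a fixed dimension (e.g., input channels) so the hardware layout is preserved.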


Figure 1: Illustration of achieving N:M structured sparsity. (Left) In the weight matrix of a 2:4 sparse neural network, whose shape is R × C (e.g., R output channels and C input channels in a linear layer), at least two entries are zero in each group of 4 consecutive weights. (Middle & Right) The original matrix is compressed into its non-zero values plus log₂M-bit indices, which enables processing of the matrix to be further accelerated by designated processing units (e.g., Nvidia A100).
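As a rough sanity check on the compressed layout sketched in the middle and right panels, the snippet below estimates storage for an R × C matrix under a simplified model (N/M of the values kept, each with a log₂M-bit group index); this is our own approximation, not Nvidia's exact on-chip format:

```python
import math

def compressed_bits(rows, cols, n=2, m=4, bits_per_value=16):
    """Storage estimate for a compressed N:M matrix: only n of every m
    weights are stored, each with a log2(m)-bit position index
    (simplified layout; real hardware formats may differ)."""
    kept = rows * cols * n // m
    value_bits = kept * bits_per_value
    index_bits = kept * int(math.log2(m))
    return value_bits + index_bits

dense_bits = 64 * 64 * 16
sparse_bits = compressed_bits(64, 64, n=2, m=4, bits_per_value=16)
assert sparse_bits < dense_bits                   # 2:4 roughly halves storage
assert sparse_bits == 64 * 64 // 2 * (16 + 2)     # values + 2-bit indices
```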

### 3.2 Straight-through Estimator (STE) on Training N:M Sparse Networks

A straightforward solution for training an N:M sparse network is to extend the Straight-through Estimator (STE) (Bengio et al., 2013) to perform online magnitude-based pruning and sparse parameter updating, as depicted in Fig. 2(a). STE is widely used in model quantization (Rastegari et al., 2016), since the quantization function is non-differentiable, and networks optimized with STE have decent performance under careful settings (Yin et al., 2019). In STE, a dense network is maintained during the training process. During the forward pass, we project the dense weights $\mathcal{W}$ into sparse weights $\widetilde{\mathcal{W}} = \mathcal{S}(\mathcal{W}, N, M)$ satisfying N:M sparsity. Let $w \subset \mathcal{W}$ be a group of $M$ consecutive parameters in $\mathcal{W}$ and $\widetilde{w} \subset \widetilde{\mathcal{W}}$ be the corresponding group in $\widetilde{\mathcal{W}}$. The projection of $w$ can be formulated as:

$$\widetilde{w}_i = \begin{cases} w_i & \text{if } |w_i| \ge \xi, \\ 0 & \text{if } |w_i| < \xi, \end{cases} \quad \text{for } i = 1, 2, \ldots, M, \tag{2}$$

where $\xi$ is the $N$-th largest value in $\{|w_1|, |w_2|, \ldots, |w_M|\}$. Intuitively, this projection function $\mathcal{S}(\cdot)$ produces sparse parameters $\widetilde{\mathcal{W}}$ by setting, in each consecutive $M$-parameter group, the $M - N$ parameters with the least significant absolute values to zero, while keeping the other parameters unchanged. The computation of an N:M sparse sub-network on the fly in the forward pass is illustrated in Fig. 1.

The projection function $\mathcal{S}(\cdot)$, which generates the N:M sparse sub-network on the fly, is non-differentiable during back-propagation. To get gradients, STE computes the gradients of the sub-network $g(\widetilde{\mathcal{W}}) = \nabla_{\widetilde{\mathcal{W}}} \mathcal{L}(\widetilde{\mathcal{W}}; \mathcal{D})$ based on the sparse sub-network $\widetilde{\mathcal{W}}$, and these are directly back-projected to the dense network as approximated gradients of the dense parameters. The approximated parameter update rule for the dense network (see Fig. 2(a)) can be formulated as

$$\mathcal{W}_{t+1} \leftarrow \mathcal{W}_t - \gamma_t\, g(\widetilde{\mathcal{W}}_t), \tag{3}$$

where $\mathcal{W}_t$ represents the dense parameters at iteration $t$ and $\gamma_t$ indicates the learning rate.

#### 3.2.1 Analysis of Dynamic Pruning Using STE

To validate the performance of STE on N:M sparse networks, we perform a pilot experiment, with results shown in Fig. 3(a). The N:M network trained with the above-mentioned STE shows a significant performance drop compared with the dense network. We conjecture that this drop results from unstable neural architecture updates caused by the approximated gradients of the dense parameters under the STE-modified chain rule. As Eq. 3 shows, $g(\widetilde{\mathcal{W}}_t)$ is a rough estimate of the gradients for $\mathcal{W}_t$ due to the mismatch between the forward and backward passes. When conducting gradient descent on $\mathcal{W}$ with the rough gradients estimated by STE,


Figure 2: In this figure, ⊙ represents element-wise multiplication and ⊗ indicates matrix multiplication. (a) The forward and backward pass when training an N:M sparse network with STE: in the forward pass, $\widetilde{\mathcal{W}}$ is obtained by pruning $\mathcal{W}$, and in the backward pass, the gradient w.r.t. $\widetilde{\mathcal{W}}$ is applied to $\mathcal{W}$ directly. (b) The training process with SR-STE. The forward pass is the same as in (a); however, in the backward pass, the weights of $\mathcal{W}$ are updated not only by $\partial\mathcal{L}/\partial\widetilde{\mathcal{W}}$, but also by $\bar{\mathcal{E}} \odot \mathcal{W}$, where $\bar{\mathcal{E}}$ is the mask matrix for the pruned weights in $\widetilde{\mathcal{W}}$.
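The STE scheme of Fig. 2(a) and Eq. 3 can be sketched on a toy linear regression (NumPy, with our own helper names, not the authors' implementation): prune on the fly in the forward pass, then apply the sparse weights' gradient directly to the dense copy:

```python
import numpy as np

def nm_project(w, n=2, m=4):
    """Zero the (m - n) smallest-magnitude weights in each group of m."""
    g = w.reshape(-1, m)
    idx = np.argsort(np.abs(g), axis=1)[:, : m - n]
    mask = np.ones_like(g)
    np.put_along_axis(mask, idx, 0.0, axis=1)
    return (g * mask).reshape(w.shape)

def ste_step(w_dense, x, y, lr=0.1, n=2, m=4):
    """One STE iteration (Eq. 3): forward with the pruned weights, then
    apply the sparse weights' gradient directly to the dense copy."""
    w_sparse = nm_project(w_dense, n, m)   # on-the-fly pruning (forward)
    err = x @ w_sparse - y                 # toy squared-error regression
    grad = x.T @ err                       # gradient w.r.t. sparse weights
    return w_dense - lr * grad             # straight-through: update dense W

w = np.array([[0.1], [-0.9], [0.4], [0.05]])   # one group of 4 weights
x, y = np.ones((2, 4)), np.zeros((2, 1))
w_new = ste_step(w, x, y)
assert w_new.shape == w.shape
assert w_new[0, 0] != w[0, 0]   # even pruned positions receive updates
```

The last assertion highlights the source of the instability discussed above: pruned positions are updated with gradients computed from a network in which they did not participate.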

Figure 3: We compare two networks trained with the regular SGD method and with STE-modified gradient descent, respectively. (a) Sparse networks trained with STE show a significant drop in top-1 accuracy compared with dense networks. (b) Layer-wise SAD between the weights after a certain number of iterations and the initial weights, for networks trained with STE (sparse forward) and regular SGD (dense forward). Compared with the network trained with the sparse-forward gradient, the one with the dense-forward gradient displays smaller SAD, indicating fewer updates to its sparse network architecture.

discrepancies between the accurate gradients and the approximated ones may lead to erroneous parameter updates. These imprecise value updates on the pruned parameters may further produce unstable alternations of the architecture of the on-the-fly pruned sparse neural network in the forward pass, which causes the notable performance drop. To demonstrate the possible relationship between sparse-architecture updates and performance drops, we define SAD (Sparse Architecture Divergence) and measure the network's architecture change with this metric.

Before formally defining SAD, we first define the binary parameter mask produced in the magnitude-based pruning process as $\mathcal{E} = \{\mathcal{E}_l \in \{0, 1\}^{N_l} : 0 < l \le L\}$, where $N_l$ represents the length of $\mathcal{W}_l$. Specifically, if the $i$-th parameter of $\mathcal{W}_l$ survives (is not pruned) in the pruned sub-network $\widetilde{\mathcal{W}}$, we set $\mathcal{E}_{li} = 1$, and $\mathcal{E}_{li} = 0$ otherwise. Thus, the sparse sub-network can be represented as $\widetilde{\mathcal{W}} = \{\mathcal{W}_l \odot \mathcal{E}_l : 0 < l \le L\}$, where $\odot$ represents element-wise multiplication. For convenience, we define $\bar{\mathcal{E}} = 1 - \mathcal{E}$.

For a single training run, we propose Sparse Architecture Divergence (SAD) to measure the change of the binary mask $\mathcal{E}$ from $\mathcal{W}_i$ (the weights after the $i$-th iteration) to $\mathcal{W}_j$ (the weights after the $j$-th iteration). We define $\mathrm{SAD}_{i:j} = \|\mathcal{E}_j - \mathcal{E}_i\|_1$, where $\mathcal{E}_i$ and $\mathcal{E}_j$ are the binary masks for $\mathcal{W}_i$ and $\mathcal{W}_j$ respectively. This formula counts the connections (weights) that are pruned in $\widetilde{\mathcal{W}}_i = \mathcal{W}_i \odot \mathcal{E}_i$ but not pruned in $\widetilde{\mathcal{W}}_j = \mathcal{W}_j \odot \mathcal{E}_j$, or pruned in $\widetilde{\mathcal{W}}_j$ while not pruned in $\widetilde{\mathcal{W}}_i$. A smaller $\mathrm{SAD}_{i:j}$ indicates less discrepancy between the architectures of $\widetilde{\mathcal{W}}_i$ and $\widetilde{\mathcal{W}}_j$.

To reveal the impact of STE on the obtained sparse architecture, we analyze SAD between the weights after different numbers of training iterations under two training schemes; quantitative results are shown in Fig. 3(b). The first scheme performs the forward pass with dense weights and updates the weights by $\mathcal{W}_{t+1} \leftarrow \mathcal{W}_t - \gamma_t g(\mathcal{W}_t)$; the corresponding sparse sub-network is then obtained by pruning the trained dense network. The other scheme performs the forward pass with sparse layers and uses the STE-modified chain rule, $\mathcal{W}_{t+1} \leftarrow \mathcal{W}_t - \gamma_t g(\widetilde{\mathcal{W}}_t)$, in the backward pass. We define ${}^{D}\widetilde{\mathcal{W}}_i$ as the sparse model pruned from the model trained for $i$ iterations of regular gradient descent, and ${}^{S}\widetilde{\mathcal{W}}_i$ as the sparse model pruned from the network trained with $i$ iterations of STE-modified gradient descent. Let ${}^{S}\mathrm{SAD}^{l}_{t}$ denote the SAD between the $l$-th layers of ${}^{S}\widetilde{\mathcal{W}}_0$ and ${}^{S}\widetilde{\mathcal{W}}_t$ trained with the sparse forward pass; similarly, ${}^{D}\mathrm{SAD}^{l}_{t}$ denotes the SAD between the $l$-th layers of ${}^{D}\widetilde{\mathcal{W}}_0$ and ${}^{D}\widetilde{\mathcal{W}}_t$ trained with the dense forward pass. As depicted in Fig. 3(b), for each layer $l$, ${}^{D}\mathrm{SAD}^{l}_{t} < {}^{S}\mathrm{SAD}^{l}_{t}$ holds both for $t = 1$ and $t = 10$. Meanwhile, $|{}^{S}\mathrm{SAD}^{l}_{t} - {}^{D}\mathrm{SAD}^{l}_{t}|$ grows as $t$ increases from 1 to 10. This phenomenon indicates a positive correlation between the poor performance of a sparse neural network and high SAD.
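Under the definitions above, SAD is straightforward to compute; a NumPy sketch (helper names are ours):

```python
import numpy as np

def nm_mask(w, n=2, m=4):
    """Binary mask E: 1 where a weight survives N:M magnitude pruning."""
    g = np.asarray(w, dtype=float).reshape(-1, m)
    idx = np.argsort(np.abs(g), axis=1)[:, : m - n]
    mask = np.ones_like(g)
    np.put_along_axis(mask, idx, 0.0, axis=1)
    return mask.reshape(np.shape(w))

def sad(w_i, w_j, n=2, m=4):
    """SAD_{i:j}: number of connections whose pruned/kept state differs
    between the masks of w_i and w_j."""
    return int(np.abs(nm_mask(w_j, n, m) - nm_mask(w_i, n, m)).sum())

w_i = np.array([0.9, 0.8, 0.1, 0.0])   # keeps positions 0 and 1
w_j = np.array([0.9, 0.0, 0.1, 0.8])   # keeps positions 0 and 3
assert sad(w_i, w_j) == 2              # one weight pruned, one revived
assert sad(w_i, w_i) == 0
```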
Prior to our SAD, Liu et al. (2020) proposed Neural Network Sparse Topology Distance (NNSTD) to measure topological distances between two sparse neural networks. It is worth noting that NNSTD reorders the neurons in each layer, based on their connections to the previous layer, to maximize the similarity between the two compared networks' topologies. Hence, NNSTD can only measure general topological differences between two networks but fails to reflect the transitions of individual connections' states (pruned or not pruned). When calculating SAD, however, the mask for each connection is computed directly without reordering neurons, so SAD provides a more precise estimate of the actual state changes (from pruned to not pruned, and vice versa) of network connections, which is what we are most concerned with in this paper. It is also worth noting that SAD has been implicitly adopted in existing research. In RigL (Evci et al., 2019a), each topology update corresponds to $\mathrm{SAD}_{t-1:t} = 2k$, where $k$ is dynamically calculated for each layer during training; RigL achieves state-of-the-art results on training unstructured sparse networks from scratch. Another example is pruning the network at initialization without updating the connections afterwards: according to recent work (Frankle et al., 2020), the performance then drops significantly, which can be regarded as keeping $\mathrm{SAD}_{t-1:t} = 0$ during the whole training phase.

### 3.3 Sparse-refined Straight-through Estimator (SR-STE) on Training N:M Sparse Networks

Inspired by the above observations from the SAD analysis, we aim to reduce SAD to improve the sparse network's performance. Since the magnitude of a parameter $|w|$ is used as the criterion to prune network weights, we need to adjust the weight-updating process to prevent high SAD. Two choices are available: (1) restricting the values of the weights pruned in $\widetilde{\mathcal{W}}_t$, or (2) promoting the non-pruned weights in $\widetilde{\mathcal{W}}_t$. It is worth noting that although all parameter gradients calculated by STE are approximated, the pruned parameters' gradients are coarser than the non-pruned ones', because for pruned parameters the values used to compute gradients in $\widetilde{\mathcal{W}}_t$ differ from the values being updated in $\mathcal{W}_t$, whereas for non-pruned parameters the two stages use the same values. This makes penalizing the weights pruned in $\widetilde{\mathcal{W}}_t$ a natural choice for restraining SAD.

Hence, we propose a sparse-refined term in the STE update formulation and denote the new scheme SR-STE. Compared to STE, when training an N:M sparse model with SR-STE, the backward pass uses refined gradients for the pruned weights, as illustrated in Fig. 2(b). The purpose of the regularization term is to decrease the magnitudes of the pruned weights, which are determined in the forward pass. Intuitively, we encourage the weights pruned at the current iteration to also be pruned in the following iterations, so that the sparse architecture is stabilized, enhancing training efficiency and effectiveness.

Formally, the network parameter update rule changes from Eq. 3 to the following formulation with a sparse-refined regularization term,

$$\mathcal{W}_{t+1} = \mathcal{W}_t - \gamma_t \big( g(\widetilde{\mathcal{W}}_t) + \lambda_W (\bar{\mathcal{E}}_t \odot \mathcal{W}_t) \big), \tag{4}$$

where $\bar{\mathcal{E}}_t$ denotes the mask for the pruned weights, $\odot$ denotes the Hadamard product, $\lambda_W$ denotes the relative weight of the sparse-refined term, and $\gamma_t$ denotes the learning rate.

When $\lambda_W = 0$, Eq. 4 is equivalent to Eq. 3, the STE update rule. In general, we set $\lambda_W > 0$. SR-STE with a positive $\lambda_W$ sets a constraint on, and only on, the pruned parameters, to prevent them from 1) being unpruned due to the different levels of mismatch between pruned-parameter and non-pruned-parameter gradients, and 2) ineffectively altering the pruned network architecture. When fewer sparse connections in the network are altered, a more stable training process and a higher validation accuracy can be expected, as demonstrated in the analysis above and manifested in the following experiments.

We perform extensive experiments with SR-STE; the results can be found in Fig. 4. The experiments here are conducted with ResNet-18 (He et al., 2016) on ImageNet (Deng et al., 2009). In Fig. 4(a), four different settings of $\lambda_W$, namely 0, 0.0002, 0.00045, and −0.00002, are inspected. With $\lambda_W < 0$, the potential negative impact of the pruned weights' coarse gradients is enlarged, which leads to the poorest top-1 accuracy (68.5%) and the most significant SAD. For the positive settings of $\lambda_W$ corresponding to the standard version of SR-STE, SAD shrinks while the top-1 accuracy clearly increases. Furthermore, we examine the performance of three networks: 1) a dense network trained with regular SGD; 2) an N:M sparse network optimized with STE; 3) an N:M sparse network trained with SR-STE. Their top-1 accuracy curves over all training epochs are illustrated in Fig. 4(b). The accuracy curve of STE is consistently below the other two and fluctuates more between training epochs. Note that the SAD value is associated with the learning rate $\gamma$. For instance, SAD grows rapidly during the first 5 epochs in Fig. 4(a) since the learning rate increases from 0 to 0.1 in the so-called "warm-up" process. Besides, we also present other formulations of the sparse-refined term in Appendix A.5.

Figure 4: (a) SAD as a function of training epoch for four settings of $\lambda_W$ in the SR-STE term. When $\lambda_W < 0$, the perturbations brought by the coarse gradients of the sparse weights are amplified: SAD gets higher and the top-1 accuracy becomes lower. When $\lambda_W$ is set to a reasonable positive value, the sparse networks achieve high performance and low SAD. (b) Top-1 accuracy curves of a sparse network trained with STE, a sparse network trained with SR-STE, and a dense network. Sparse networks naively trained with STE suffer a significant performance drop compared with dense ones; after introducing the SR-STE term into the optimization process, the sparse network's performance rises to a level comparable with the dense network.

## 4 Experiments

In this section, we demonstrate the effectiveness of the proposed N:M fine-grained structured sparsity on computer vision tasks (e.g., image classification, object detection, instance segmentation, and optical flow prediction) and machine translation tasks. The implementation details of these experiments, including dataset settings, training schedules, and evaluation metrics, are listed in Appendix A.4. Meanwhile, we set $\lambda_W$ to 0.0002 in all experiments because this value gives good performance in our experiments.

Table 1: ImageNet validation accuracy on ResNet with different N:M sparse patterns.

| Model | Method | Sparse Pattern | Top-1 Acc (%) | Params (M) | Flops (G) |
|---|---|---|---|---|---|
| ResNet50 | – | Dense | 77.3 | 25.6 | 4.09 |
| ResNet50 | SR-STE | 2:4 | 77.0 | 12.8 | 2.05 |
| ResNet50 | SR-STE | 4:8 | 77.4 | 12.8 | 2.05 |
| ResNet50 | SR-STE | 1:4 | 75.9 | 6.4 | 1.02 |
| ResNet50 | SR-STE | 2:8 | 76.4 | 6.4 | 1.02 |
| ResNet50 ×1.25 | SR-STE | 2:8 | 77.5 | – | – |

Table 2: Experimental results of different training methods for training the N:M sparse network.

| Model | Method | Sparse Pattern | Top-1 Acc | Epochs |
|---|---|---|---|---|
| ResNet18 | ASP (Nvidia, 2020) | 2:4 | 70.7 | 200 |
| ResNet18 | STE | 2:4 | 69.9 | 120 |
| ResNet18 | SR-STE | 2:4 | 71.2 | 120 |

### 4.1 Image Classification

In this section, we first conduct several experiments to evaluate the effects of different N:M sparse patterns and different training methods on the image classification benchmark ImageNet-1K (Deng et al., 2009) with different backbones. Then, we compare our proposed N:M fine-grained structured sparsity networks with state-of-the-art sparsity methods.

Different N:M Sparse Patterns.

To investigate the performance of different N:M sparse patterns, we exploit the popular ResNet50 (He et al., 2016) model with four N:M structured sparse patterns: 2:4, 4:8, 1:4, and 2:8; the baseline is the traditional dense model. Among the designed patterns, 2:4 and 4:8 have the same sparsity of 50%, while 1:4 and 2:8 both have 75% sparsity. In Table 1, we observe that the 4:8 structured sparsity outperforms 2:4 at the same computational cost, and 2:8 likewise performs better than 1:4 (see also the training curves in Fig. 6(a)). This shows that at the same sparsity, a larger M leads to better performance since it provides more abundant convolution kernel shapes (we visualize and analyze the convolution kernel shapes in Appendix A.2). For a fixed M, we can adjust N to obtain different sparsity levels; with the same M, it is expected that a larger N performs better due to more parameters and computational cost. Meanwhile, ResNet50 with 1.25× width can achieve 77.5% top-1 accuracy at about 71% sparsity relative to the original dense ResNet50. We also conduct experiments with RegNetXs (Radosavovic et al., 2020) in Appendix A.3 to evaluate the effectiveness of the proposed N:M fine-grained structured sparse patterns on compact models.

Different Training Methods.

We also verify the effectiveness of the proposed SR-STE for training N:M sparse networks. In Table 2, we find that SR-STE outperforms Nvidia's ASP method and STE with fewer training epochs and better accuracy.

Comparison with State-of-the-arts.

Before the advent of N:M fine-grained structured sparsity, many state-of-the-art methods existed for generating sparse models, including DSR (Mostafa & Wang, 2019), RigL (Evci et al., 2019a), GMP (Gale et al., 2019), and STR (Kusupati et al., 2020). SR-STE is compared to those methods on ResNet50 at mid-level (80%) and ultra-level (95%) sparsity. Table 3 shows that SR-STE can outperform all the state-of-the-art methods, even though the other methods use unstructured sparsity. STR (Kusupati et al., 2020) shows that training with non-uniform sparsity consistently improves performance, so SR-STE could be extended with non-uniform structured sparsity settings (e.g., mixed N:M fine-grained structured sparsity). We believe that mixed N:M sparsity could further improve the results, and we leave this exploration for future work.

Table 3: Experimental results of the proposed N:M sparse pattern with SR-STE and state-of-the-art sparsity methods. ∗ implies that the first and last layers are kept dense.

| Method | Top-1 Acc (%) | Sparsity (%) | Params (M) | Flops (G) | Structured | Uniform |
|---|---|---|---|---|---|---|
| ResNet50 | 77.3 | 0.0 | 25.6 | 4.09 | – | – |
| DSR∗ | – | 80 | – | – | ✗ | ✗ |
| RigL | 74.6 | 80 | 5.12 | 0.92 | ✗ | ✓ |
| GMP | 75.6 | 80 | 5.12 | 0.82 | ✗ | ✓ |
| STR | 76.1 | 81 | 5.22 | 0.71 | ✗ | ✗ |
| STE | 76.2 | 80 | 5.12 | 0.82 | ✗ | ✓ |
| SR-STE | 77.0 | 80 | 5.12 | 0.82 | ✗ | ✓ |
| SR-STE | 76.4 | 75 (2:8) | 6.40 | 1.02 | ✓ | ✓ |
| RigL | 67.5 | 95 | 1.28 | 0.32 | ✗ | ✓ |
| GMP | 70.6 | 95 | 1.28 | 0.20 | ✗ | ✓ |
| STR | 70.2 | 95 | 1.24 | 0.16 | ✗ | ✗ |
| STE | 68.4 | 95 | 1.28 | 0.20 | ✗ | ✓ |
| SR-STE | 72.4 | 95 | 1.28 | 0.20 | ✗ | ✓ |
| SR-STE | 72.2 | 94 (1:16) | 1.60 | 0.25 | ✓ | ✓ |

Object Detection and Instance Segmentation

We further conduct experiments on the challenging COCO dataset (Lin et al., 2014) to evaluate the efficiency of the proposed approach on two important computer vision tasks, i.e., object detection and instance segmentation. We use the classical Faster R-CNN (Ren et al., 2015) model for object detection and Mask R-CNN (He et al., 2017) for instance segmentation. All experiments are conducted with MMDetection (Chen et al., 2019). Tables 4 and 5 show that 2:8 (75% sparsity) structured sparsity achieves results comparable to the dense baselines, and that 4:8 (50% sparsity) can even outperform the dense models. These results also indicate that N:M sparse pre-trained models transfer features similarly well or better.

Table 4: Object detection results on COCO.

[Per-row values not recoverable; columns: Model, Method, Sparse Pattern, LR Schd, mAP, comparing the dense F-RCNN-R50 baseline with SR-STE variants under 1x and 2x schedules.]

Table 5: Instance segmentation results on COCO.

[Per-row values not recoverable; columns: Model, Method, Sparse Pattern, LR Schd, Box mAP, Mask mAP, comparing the dense M-RCNN-R50 baseline with SR-STE variants under 1x and 2x schedules.]

Optical Flow

Optical flow prediction is a representative dense pixel-level prediction task in computer vision. We verify the proposed method with RAFT (Teed & Deng, 2020), a recent state-of-the-art model, on FlyingChairs (Dosovitskiy et al., 2015). A smaller value of the end-point-error (EPE) metric indicates better performance. Compared with the dense optical flow model, Table 6 shows that our method achieves comparable accuracy with half the parameters.

Table 6: RAFT results on FlyingChairs.

Model  Method  Sparse Pattern  EPE   Params(M)  Flops(G)
RAFT   -       Dense           0.86  5.3        134
RAFT   SR-STE  2:4             0.88  2.65       67
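The EPE metric reported in Table 6 can be sketched in a few lines. This is an illustrative re-implementation of the definition (Euclidean distance between predicted and ground-truth flow vectors, averaged over all pixels), not RAFT's own evaluation code.

```python
# End-point error (EPE): per-pixel Euclidean distance between the
# predicted flow vector and the ground truth, averaged over all pixels.
import math

def epe(pred, gt):
    """pred, gt: lists of (u, v) flow vectors, one entry per pixel."""
    dists = [math.hypot(pu - gu, pv - gv)
             for (pu, pv), (gu, gv) in zip(pred, gt)]
    return sum(dists) / len(dists)

pred = [(1.0, 0.0), (0.0, 2.0)]
gt = [(0.0, 0.0), (0.0, 0.0)]
print(epe(pred, gt))  # → 1.5
```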

Table 7: MT results on EN-DE WMT'14.

Model        Method  Sparse Pattern  BLEU   Params(M)  Flops(G)
Transformer  -       Dense           27.31  63         10.2
Transformer  SR-STE  2:4             27.23  31.5       5.1

Machine Translation (MT)

Besides computer vision tasks, we investigate the effectiveness of our method on machine translation, one of the most common tasks in natural language processing. We conduct our experiments with the Transformer, which employs a number of linear layers. We train the transformer-base model of Vaswani et al. (2017), which contains a 6-layer encoder and a 6-layer decoder with 512-dimensional hidden representations. A larger BLEU score indicates better performance. Compared with the dense model, Table 7 shows that our method incurs only a negligible accuracy loss.

Discussion and Conclusion

In this work, we present SR-STE, the first method to train N:M fine-grained structured sparse networks from scratch. SR-STE extends the straight-through estimator with a regularization term to alleviate ineffective sparse-architecture updates caused by the coarse gradients computed through STE-modified chain rules. We define a metric, Sparse Architecture Divergence (SAD), to analyze these architecture updates; the experimental results show that SAD correlates strongly with the pruned network's performance. We hope this work sheds light on machine learning acceleration, and that SAD inspires further theoretical and empirical studies of sparse network training and other fields such as neural architecture search.

References

Guillaume Bellec, David Kappel, Wolfgang Maass, and Robert Legenstein. Deep rewiring: Training very sparse deep networks. arXiv preprint arXiv:1711.05136, 2017.

Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.

David D Bourgin, Joshua C Peterson, Daniel Reichman, Stuart J Russell, and Thomas L Griffiths. Cognitive model priors for predicting human decisions. In International Conference on Machine Learning, pp. 5133–5141, 2019.

Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.

Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, et al. MMDetection: Open MMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155, 2019.

Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE, 2009.

Tim Dettmers and Luke Zettlemoyer. Sparse networks from scratch: Faster training without losing performance. arXiv preprint arXiv:1907.04840, 2019.

Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Hausser, Caner Hazirbas, Vladimir Golkov, Patrick Van Der Smagt, Daniel Cremers, and Thomas Brox. FlowNet: Learning optical flow with convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2758–2766, 2015.

Utku Evci, Trevor Gale, Jacob Menick, Pablo Samuel Castro, and Erich Elsen. Rigging the lottery: Making all tickets winners. arXiv preprint arXiv:1911.11134, 2019a.

Utku Evci, Fabian Pedregosa, Aidan Gomez, and Erich Elsen. The difficulty of training sparse neural networks. arXiv preprint arXiv:1906.10732, 2019b.

Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635, 2018.

Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M Roy, and Michael Carbin. Pruning neural networks at initialization: Why are we missing the mark? arXiv preprint arXiv:2009.08576, 2020.

Trevor Gale, Erich Elsen, and Sara Hooker. The state of sparsity in deep neural networks. arXiv preprint arXiv:1902.09574, 2019.

Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient DNNs. In Advances in Neural Information Processing Systems, pp. 1379–1387, 2016.

Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems (NIPS), pp. 1135–1143, 2015.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969, 2017.

Tong He, Zhi Zhang, Hang Zhang, Zhongyue Zhang, Junyuan Xie, and Mu Li. Bag of tricks for image classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 558–567, 2019.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2704–2713, 2018.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.

Aditya Kusupati, Vivek Ramanujan, Raghav Somani, Mitchell Wortsman, Prateek Jain, Sham Kakade, and Ali Farhadi. Soft threshold weight reparameterization for learnable sparsity. arXiv preprint arXiv:2002.03231, 2020.

Yann LeCun, John S Denker, and Sara A Solla. Optimal brain damage. In Advances in Neural Information Processing Systems, pp. 598–605, 1990.

Namhoon Lee, Thalaiyasingam Ajanthan, and Philip HS Torr. SNIP: Single-shot network pruning based on connection sensitivity. arXiv preprint arXiv:1810.02340, 2018.

Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pp. 740–755. Springer, 2014.

Shiwei Liu, Tim Van der Lee, Anil Yaman, Zahra Atashgahi, Davide Ferraro, Ghada Sokar, Mykola Pechenizkiy, and Decebal Constantin Mocanu. Topological insights in sparse neural networks. arXiv preprint arXiv:2006.14085, 2020.

Decebal Constantin Mocanu, Elena Mocanu, Peter Stone, Phuong H Nguyen, Madeleine Gibescu, and Antonio Liotta. Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature Communications, 9(1):1–12, 2018.

Hesham Mostafa and Xin Wang. Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization. arXiv preprint arXiv:1902.05967, 2019.

Nvidia. Nvidia A100 tensor core GPU architecture. 2020.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318, 2002.

Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, and Piotr Dollár. Designing network design spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10428–10436, 2020.

Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pp. 525–542. Springer, 2016.

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99, 2015.

Alex Renda, Jonathan Frankle, and Michael Carbin. Comparing rewinding and fine-tuning in neural network pruning. arXiv preprint arXiv:2003.02389, 2020.

Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 1715–1725, 2016.

Zhanhong Tan, Jiebo Song, Xiaolong Ma, Sia-Huat Tan, Hongyang Chen, Yuanqing Miao, Yifu Wu, Shaokai Ye, Yanzhi Wang, Dehui Li, et al. PCNN: Pattern-based fine-grained regular pruning towards optimizing CNN accelerators. arXiv preprint arXiv:2002.04997, 2020.

Zachary Teed and Jia Deng. RAFT: Recurrent all-pairs field transforms for optical flow. arXiv preprint arXiv:2003.12039, 2020.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.

Yanzhi Wang, Shaokai Ye, Zhezhi He, Xiaolong Ma, Linfeng Zhang, Sheng Lin, Geng Yuan, Sia Huat Tan, Zhengang Li, Deliang Fan, et al. Non-structured DNN weight pruning considered harmful. arXiv preprint arXiv:1907.02124, 2019.

Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pp. 2074–2082, 2016.

Mitchell Wortsman, Ali Farhadi, and Mohammad Rastegari. Discovering neural wirings. In Advances in Neural Information Processing Systems, pp. 2684–2694, 2019.

Penghang Yin, Jiancheng Lyu, Shuai Zhang, Stanley Osher, Yingyong Qi, and Jack Xin. Understanding straight-through estimator in training activation quantized neural nets. arXiv preprint arXiv:1903.05662, 2019.

Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu, and Yurong Chen. Incremental network quantization: Towards lossless CNNs with low-precision weights. arXiv preprint arXiv:1702.03044, 2017.

A Appendix

A.1 Algorithm

A.2 Kernel Shape

Fig. 5 illustrates six learned convolution kernels picked from the trained ResNet50 2:8 sparse model. Note that, for these six kernels, the shapes of non-zero elements allowed under the 2:8 sparsity constraint cannot be acquired or learned under 1:4 sparsity.

Algorithm 1 Training N:M sparse neural networks from scratch with SR-STE

Require: N, M, λ_W, dataset D, randomly initialized model W, learning rate γ_t
for each training iteration t do
    Sample a mini-batch of data (X, Y) ~ D
    W̃ ← S(W, N, M), and obtain the corresponding mask E    ▷ Eq. 2
    ℓ ← L(W̃, (X, Y))                                       ▷ forward pass
    g(W̃) ← ∂ℓ/∂W̃                                           ▷ backward pass
    W ← W − γ_t g(W̃) − λ_W (Ē ⊙ W)                         ▷ Eq. 4
end for
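A pure-Python sketch of one SR-STE update following Algorithm 1 above, for a single weight vector with a 2:4 pattern. The `grad` callable stands in for the real backward pass, and all names are illustrative rather than the released implementation. Note the straight-through behavior: the gradient computed on the sparse weights updates all weights, while the λ_W term decays only the currently pruned positions (Ē = 1 − E).

```python
# One SR-STE step (Eq. 4): W <- W - lr * g(W_tilde) - lam * (E_bar ⊙ W)

def S(w, n, m):
    """N:M magnitude projection: return (pruned weights, mask)."""
    w_tilde, mask = [], []
    for i in range(0, len(w), m):
        g = w[i:i + m]
        keep = set(sorted(range(m), key=lambda j: -abs(g[j]))[:n])
        w_tilde += [g[j] if j in keep else 0.0 for j in range(m)]
        mask += [1.0 if j in keep else 0.0 for j in range(m)]
    return w_tilde, mask

def sr_ste_step(w, grad, n, m, lr, lam):
    w_tilde, e = S(w, n, m)
    g = grad(w_tilde)               # gradient w.r.t. the sparse weights
    e_bar = [1.0 - ei for ei in e]  # complement mask: 1 at pruned positions
    # dense (straight-through) update plus decay of pruned weights
    return [wi - lr * gi - lam * ebi * wi
            for wi, gi, ebi in zip(w, g, e_bar)]

w = [0.4, -0.1, 0.05, 0.8]
new_w = sr_ste_step(w, lambda wt: [1.0] * len(wt), 2, 4, lr=0.1, lam=0.5)
print(new_w)
```

With a constant unit gradient, the two surviving weights (0.4 and 0.8) take a plain gradient step, while the two pruned weights are additionally shrunk toward zero by the λ term, discouraging them from re-entering the mask.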

Figure 5: Illustration of kernel shapes in the ResNet50 model trained with 2:8 structured sparsity; "layer1.1.conv2: (0,32)" denotes layer name: (index of input channel, index of output channel).

A.3 RegNetXs on ImageNet-1K

We further verify whether N:M sparsity can boost compact models. The recent RegNetXs (Radosavovic et al., 2020) are state-of-the-art, hardware-friendly models, selected as the best models out of a large search space of candidates. Table 8 shows that SR-STE improves RegNetX002 performance significantly over STE, and that 2:4 structured sparsity can outperform the dense RegNetX models at the same Flops. Therefore, N:M fine-grained structured sparse models can be obtained easily with SR-STE, and we believe the proposed N:M fine-grained structured sparsity method could become a standard technique for model deployment.

Table 8: ImageNet validation accuracy on RegNet with different N:M sparse patterns. [Per-row values not recoverable; columns: Model, Method, Sparse Pattern, Top-1 acc(%), Flops(G), comparing dense RegNetX baselines with STE and SR-STE under different N:M patterns.]

A.4 Implementation Details

A.4.1 Classification

Dataset.

ImageNet-1K (Deng et al., 2009) is a large-scale classification task, known as the most challenging image classification benchmark. The ImageNet-1K dataset has about 1.2 million training images and 50 thousand validation images, each annotated with one of 1000 object classes.

Training scheduler.

All ImageNet-1K experiments are trained on input images of 224 × 224 resolution, with the dense baseline models following the hyperparameter settings of He et al. (2019). Specifically, all models are trained with a batch size of 256 for 120 epochs; the learning rate is annealed from 0.1 to 0 with a cosine scheduler, and during the first 5 epochs it increases linearly from 0 to 0.1.

Evaluation Metric.

All reported results use standard top-1 accuracy.

A.4.2 Object Detection and Instance Segmentation

Dataset.

All experiments are performed on the challenging MS COCO 2017 dataset (Lin et al., 2014) with 80 categories. It consists of 115K images for training (train-2017) and 5K images for validation (val-2017). We train models on train-2017 and evaluate them on val-2017.

Training scheduler.

For object detection and instance segmentation, we conduct the experiments with MMDetection (Chen et al., 2019); the 1x and 2x training schedules follow MMDetection's settings (Chen et al., 2019).

Evaluation Metric.

All reported results follow the standard COCO-style Average Precision (AP) metric, i.e., mAP averaged over IoU thresholds from 0.5 to 0.95.

A.4.3 Optical Flow

Dataset.

Optical flow prediction is conducted on the FlyingChairs (Dosovitskiy et al., 2015) dataset, a synthetic dataset with optical flow ground truth. It consists of 22,872 image pairs with corresponding flow fields; the training split contains 22,232 samples and the validation split contains 640 samples. We train the RAFT (Teed & Deng, 2020) model on the training split and report final results on the validation split.

Training scheduler.

We use the original open-source framework to run the RAFT model (Teed & Deng, 2020); the training settings for FlyingChairs (Dosovitskiy et al., 2015) follow those listed in Teed & Deng (2020).

Evaluation Metric.

We use the end-point error (EPE) to evaluate the predicted results. EPE is the Euclidean distance between the predicted flow vector and the ground truth, averaged over all pixels.

A.4.4 Machine Translation

Dataset.

For English-German translation, the training set consists of about 4.5 million bilingual sentence pairs from WMT 2014. We use newstest2013 as the validation set and newstest2014 as the test set. Sentences are encoded with BPE, using a shared vocabulary of about 37,000 tokens.
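To make the BPE step concrete, here is a toy sketch of one merge of the byte-pair-encoding scheme (Sennrich et al., 2016): the most frequent adjacent symbol pair in the tokenized vocabulary is merged into a new symbol. This shows only the mechanics on a tiny word list with names of our choosing; real BPE repeats this for tens of thousands of merges learned from the corpus.

```python
# One BPE merge step: count adjacent symbol pairs weighted by word
# frequency, then fuse the most frequent pair everywhere it occurs.
from collections import Counter

def most_frequent_pair(words):
    """words: dict mapping a tokenized word (tuple of symbols) to its count."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])  # fuse the pair
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

words = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2}
pair = most_frequent_pair(words)  # ('l', 'o'), occurring 7 times
print(merge_pair(words, pair))
```

Repeating the merge step grows multi-character subword units, which is how a shared vocabulary of roughly 37,000 tokens such as the one above is built.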

Training scheduler.

We use the subword method (Sennrich et al., 2016) to encode the source-side sentences and the combined target-side sentences. The vocabulary size is 37,000 for both sides. Each mini-batch on one GPU contains a set of sentence pairs with roughly 4,096 source and 4,096 target tokens. We use the Adam optimizer (Kingma & Ba, 2015) and train our model for 300,000 steps, employing four Titan XP GPUs to train both the baseline and our model.

Evaluation Metric.

All reported results use the standard BLEU score (Papineni et al., 2002) on tokenized, true-cased output. (The open-source RAFT framework used in A.4.3 is available at https://github.com/princeton-vl/RAFT/.)

Figure 6: (a) Top-1 accuracy versus training epoch for different N:M patterns trained with SR-STE. (b) Sparse Architecture Divergence (SAD) versus training epoch for three different sparse-refined formulations (λ_W = 0.0002, λ_g = 0.1, λ_c = 0.000001) and for dense training.

A.5 Other Refined Formulations

We present another sparse-refined regularization term, i.e., a sign-constant term, as follows:

    W_{t+1} = W_t − γ_t ( g(W̃_t) − λ_c ( Ē_t ⊙ sign(W_t) ) ),    (5)

Apart from regularizing the model parameters themselves, we can also modify Eq. 4 to apply the regularization to the approximated gradient directly:

    W_{t+1} = W_t − γ_t ( g(W̃_t) − λ_g ( Ē_t ⊙ ( γ_t g(W̃_t) ) ) ).
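The three sparse-refined formulations differ only in the per-weight regularization term applied at pruned positions (where Ē = 1). A minimal scalar sketch of the differing term (names and scalar form are ours, not the paper's code):

```python
# Per-weight regularization term for the three sparse-refined variants:
#   "weight": lam_w * w        (Eq. 4, default SR-STE)
#   "sign":   lam_c * sign(w)  (Eq. 5, sign-constant)
#   "grad":   lam_g * lr * g   (gradient-based variant)
# Each is masked by e_bar, the complement mask (1 at pruned positions).
def refine_term(w, g, e_bar, lr, variant, lam):
    if variant == "weight":
        return lam * e_bar * w
    if variant == "sign":
        sign = 1.0 if w > 0 else -1.0 if w < 0 else 0.0
        return lam * e_bar * sign
    if variant == "grad":
        return lam * e_bar * lr * g
    raise ValueError(variant)

print(refine_term(-0.2, 1.0, 1.0, 0.1, "weight", 0.5))  # → -0.1
print(refine_term(-0.2, 1.0, 1.0, 0.1, "sign", 0.5))    # → -0.5
```

The default term scales with the weight's own magnitude, the sign variant applies a constant-magnitude pull in the weight's direction, and the gradient variant scales with the current step size.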