Xinlei Yi, Shengjun Zhang, Tao Yang, Tianyou Chai, and Karl Henrik Johansson
Abstract—The distributed nonconvex optimization problem of minimizing a global cost function formed by a sum of n local cost functions by using local information exchange is considered. This problem is an important component of many machine learning techniques with data parallelism, such as deep learning and federated learning. We propose a distributed primal-dual stochastic gradient descent (SGD) algorithm, suitable for arbitrarily connected communication networks and any smooth (possibly nonconvex) cost functions. We show that the proposed algorithm achieves the linear speedup convergence rate O(1/√(nT)) for general nonconvex cost functions and the linear speedup convergence rate O(1/(nT)) when the global cost function satisfies the Polyak-Łojasiewicz (P-Ł) condition, where T is the total number of iterations. We also show that the output of the proposed algorithm with constant parameters linearly converges to a neighborhood of a global optimum. We demonstrate through numerical experiments the efficiency of our algorithm in comparison with the baseline centralized SGD and recently proposed distributed SGD algorithms.
Note that the algorithms proposed in the aforementioned references use at least gradient information of the cost functions, and sometimes even second- or higher-order information. However, in many applications explicit expressions of the gradients are unavailable or difficult to obtain. In this paper, we consider the case where each agent can only collect stochastic gradients of its local cost function and propose a distributed stochastic gradient descent (SGD) algorithm to solve (1). In general, SGD algorithms are suitable for scenarios where explicit expressions of the gradients are unavailable or difficult to obtain. For example, in some big data applications, such as empirical risk minimization, the actual gradient is calculated from the entire data set, which results in a heavy computational burden. A stochastic gradient can be calculated from a randomly selected subset of the data and is often an efficient replacement for the actual gradient. Other examples for which SGD algorithms are suitable include scenarios where data arrive sequentially, such as in online learning [10].
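The key property underlying this replacement is that a mini-batch gradient drawn uniformly at random is an unbiased estimate of the full gradient. The following sketch illustrates this for a least-squares empirical risk (the cost function and batch size here are illustrative choices, not the paper's setting):

```python
import numpy as np

def full_gradient(w, X, y):
    """Exact gradient of the empirical risk f(w) = 1/(2m) * ||Xw - y||^2."""
    m = X.shape[0]
    return X.T @ (X @ w - y) / m

def stochastic_gradient(w, X, y, batch_size, rng):
    """Gradient on a uniformly sampled mini-batch: an unbiased
    estimate of full_gradient at a fraction of the cost."""
    m = X.shape[0]
    idx = rng.choice(m, size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / batch_size

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))
y = rng.standard_normal(1000)
w = rng.standard_normal(5)

# Averaging many independent mini-batch gradients approaches the exact gradient.
est = np.mean([stochastic_gradient(w, X, y, 50, rng) for _ in range(2000)], axis=0)
print(np.max(np.abs(est - full_gradient(w, X, y))))
```

Each mini-batch gradient touches 50 samples instead of all 1000, which is the computational saving SGD exploits.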
This paper provides positive answers for the above two questions. More specifically, the contributions of this paper are summarized as follows.
i) We propose a distributed primal-dual SGD algorithm to solve the optimization problem (1). In the proposed algorithm, each agent maintains primal and dual variable sequences and only communicates the primal variable with its neighbors. This algorithm is suitable for arbitrarily connected communication networks and any smooth (possibly nonconvex) cost functions.
iv) We show in Theorems 4 and 5 that the output of our algorithm with constant parameters linearly converges to a neighborhood of a global optimum when the global cost function satisfies the P-Ł condition. Compared with [26], [46]–[49], which used the strong convexity assumption, we achieve a similar convergence result under weaker assumptions on the cost function.
The detailed comparison of this paper with other related studies in the literature is summarized in Table I.
The rest of this paper is organized as follows. Section II presents the novel distributed primal-dual SGD algorithm.
TABLE I COMPARISON OF THIS PAPER TO SOME RELATED WORKS
This corresponds to our proposed distributed primal-dual SGD algorithm, which is presented in pseudo-code as Algorithm 1.
Algorithm 1 Distributed Primal-Dual SGD Algorithm
1: Input: parameters {αk}, {βk}, {ηk} ⊆ (0, +∞).
2: Initialize: xi,0 ∈ R^p and vi,0 = 0p, ∀i ∈ [n].
3: for k = 0, 1, ... do
4:   for i = 1, ..., n in parallel do
5:     Broadcast xi,k to Ni and receive xj,k from j ∈ Ni;
6:     Sample stochastic gradient gi(xi,k, ξi,k);
7:     Update xi,k+1 by (6a);
8:     Update vi,k+1 by (6b).
9:   end for
10: end for
11: Output: {xk}.
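Since the updates (6a) and (6b) are not reproduced in this excerpt, the simulation below assumes a common primal-dual form: a Laplacian consensus penalty plus the dual variable in the primal step, and Laplacian-driven dual ascent. The ring graph, quadratic local costs, and constant parameter values are all illustrative assumptions, not the paper's setting:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, T = 4, 3, 3000

# Ring-graph Laplacian L (undirected): each agent has two neighbors.
L = 2 * np.eye(n)
for i in range(n):
    L[i, (i - 1) % n] -= 1
    L[i, (i + 1) % n] -= 1

# Local costs f_i(x) = 0.5*||x - c_i||^2; the global optimum is the mean of the c_i.
c = rng.standard_normal((n, p))
x_star = c.mean(axis=0)

x = rng.standard_normal((n, p))     # primal variables x_{i,k}
v = np.zeros((n, p))                # dual variables, v_{i,0} = 0_p
alpha, beta, eta = 5.0, 1.0, 0.05   # constant parameters (hypothetical values)

for k in range(T):
    g = (x - c) + 0.01 * rng.standard_normal((n, p))  # stochastic gradients
    Lx = L @ x                     # only the primal variable is exchanged
    x_new = x - eta * (g + alpha * Lx + beta * v)     # assumed form of (6a)
    v = v + eta * beta * Lx                           # assumed form of (6b)
    x = x_new

print(np.max(np.abs(x - x_star)))  # all agents end near the global optimum
```

Note that the dual update only uses L @ x, so agents indeed communicate only their primal variables, matching step 5 of Algorithm 1.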
Remark 4: It should be highlighted that the P-Ł constant ν is not used to design the algorithm parameters. Therefore, the constant ν does not need to be known in advance. A similar convergence result to (19) was achieved by the distributed SGD algorithms proposed in [26], [46]–[49] when each local cost function is strongly convex, which is obviously stronger than the P-Ł condition assumed in Theorem 4. In addition to the strong convexity condition, in [26] it was also assumed that each local cost function is Lipschitz-continuous. Some information related to the Lyapunov function and global parameters, which may be difficult to obtain, was furthermore needed to design the stepsize. Moreover, in [46]–[49] the strong convexity constant was needed to design the stepsize, and in [48], [49] a p-dimensional auxiliary variable, which is used to track the global gradient, was communicated between agents. The potential drawbacks of the results stated in Theorem 4 are that i) we consider undirected graphs rather than directed graphs as considered in [49]; and ii) we do not analyze the robustness to gradient noise as [46] did. We leave the extension to (time-varying) directed graphs and the robustness analysis as future research directions.
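To see why the P-Ł condition is strictly weaker than strong convexity, consider the standard example f(x) = x² + 3 sin²(x): it is nonconvex yet satisfies the P-Ł inequality ||∇f(x)||² ≥ 2ν(f(x) − f*) with ν = 1/32 (the constant is taken from the P-Ł literature, not from this paper). The check below verifies both claims numerically on a grid:

```python
import numpy as np

# Nonconvex function satisfying the P-L condition, with f* = f(0) = 0.
f = lambda x: x**2 + 3 * np.sin(x)**2
df = lambda x: 2 * x + 3 * np.sin(2 * x)   # derivative of f

nu = 1.0 / 32.0
xs = np.linspace(-10, 10, 100001)

# P-L inequality ||f'(x)||^2 >= 2*nu*(f(x) - f*) holds on the whole grid.
print(bool(np.all(df(xs)**2 >= 2 * nu * f(xs))))

# Nonconvexity: f''(x) = 2 + 6*cos(2x) is negative on part of the domain.
print(bool(np.any(2 + 6 * np.cos(2 * xs) < 0)))
```

Any strongly convex function satisfies the P-Ł inequality, but as the example shows, the converse fails, which is why assuming P-Ł instead of strong convexity is a genuine weakening.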
Note that the unbiased assumption, i.e., Assumption 5, can be removed, as shown in the following.
In this section, we evaluate the performance of the proposed distributed primal-dual SGD algorithm through numerical experiments. All algorithms and agents are implemented and simulated in MATLAB R2018b, run on a desktop with Intel Core i5-9600K processor, Nvidia RTX 2070 super, 32 GB RAM, Ubuntu 16.04.
We consider the training of neural networks (NN) for image classification tasks on the MNIST database [59]. The same NN as in [28] is adopted for each agent, and the communication graph is generated randomly. The graph is shown in Fig. 1 and the corresponding Laplacian matrix L is given in (22). The corresponding mixing matrix W is constructed by Metropolis weights and is given in (23).
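The Metropolis-weight construction assigns each edge the weight 1/(1 + max of the two endpoint degrees) and puts the remaining mass on the diagonal, yielding a symmetric, doubly stochastic mixing matrix. A sketch on a small stand-in graph (the paper's graph of Fig. 1 is not reproduced here):

```python
import numpy as np

def metropolis_weights(A):
    """Mixing matrix W from a 0/1 undirected adjacency matrix A (zero diagonal):
    W_ij = 1/(1 + max(d_i, d_j)) for neighbors, W_ii = 1 - sum of W_ij."""
    n = A.shape[0]
    d = A.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if A[i, j]:
                W[i, j] = 1.0 / (1.0 + max(d[i], d[j]))
        W[i, i] = 1.0 - W[i].sum()
    return W

# Illustrative 4-node graph, not the topology of Fig. 1.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 0],
              [0, 1, 0, 0]])
W = metropolis_weights(A)
print(np.allclose(W, W.T), np.allclose(W.sum(axis=1), 1.0))
```

Because only local degrees are needed, each agent can compute its own row of W from one round of degree exchange with its neighbors, which is why Metropolis weights are popular in distributed settings.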
Fig. 1. Connection topology.
Each local neural network consists of a single hidden layer of 50 neurons, followed by a sigmoid activation layer, followed by an output layer of 10 neurons and another sigmoid activation layer. In this experiment, we use a subset of the MNIST data set. Each agent is assigned 2500 data points randomly, and at each iteration only one data point is picked by the agent following a uniform distribution.
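The forward pass of the described network can be sketched as follows, assuming flattened 28×28 MNIST inputs (784 features); the random weight scales are placeholders, not trained values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
# 784 inputs -> 50 hidden units (sigmoid) -> 10 outputs (sigmoid).
W1 = 0.1 * rng.standard_normal((784, 50))
b1 = np.zeros(50)
W2 = 0.1 * rng.standard_normal((50, 10))
b2 = np.zeros(10)

def forward(x):
    """linear -> sigmoid -> linear -> sigmoid, as described above."""
    h = sigmoid(x @ W1 + b1)
    return sigmoid(h @ W2 + b2)

x = rng.random(784)   # stand-in for one flattened MNIST image
y = forward(x)
print(y.shape)        # (10,) -- one score per digit class
```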
We compare our proposed distributed primal-dual SGD algorithm with time-varying and fixed parameters (DPD-SGD-T and DPD-SGD-F) with state-of-the-art algorithms: the distributed momentum SGD algorithm (DM-SGD) [23], the distributed SGD algorithm (D-SGD-1) [26], [27], the distributed SGD algorithm (D-SGD-2) [28], D2 [36], the distributed stochastic gradient tracking algorithm (D-SGT-1) [37], [49], the distributed stochastic gradient tracking algorithm (D-SGT-2) [38], [48], and the baseline centralized SGD algorithm (C-SGD). We list all the parameters we choose in the NN experiment for each algorithm in Table II. (Note: the parameter names differ in each paper.)
TABLE II PARAMETERS IN EACH ALGORITHM IN NN EXPERIMENT
We demonstrate the result in terms of the empirical risk function [60], which is given as
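The displayed expression is not reproduced in this extraction; generically, the empirical risk is the data-set average of a per-sample loss, as in this minimal sketch (squared loss chosen purely for illustration):

```python
import numpy as np

def empirical_risk(loss, w, data):
    """R(w) = (1/m) * sum over samples z of loss(w, z)."""
    return np.mean([loss(w, z) for z in data])

# Toy usage: squared loss over three scalar samples.
loss = lambda w, z: 0.5 * (w - z) ** 2
data = np.array([1.0, 2.0, 3.0])
print(empirical_risk(loss, 2.0, data))  # (0.5 + 0.0 + 0.5) / 3
```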
Fig. 2 shows that the proposed distributed primal-dual SGD algorithm with time-varying parameters converges almost as fast as the distributed SGD algorithm in [26], [27] and faster than the distributed SGD algorithms in [28], [36]–[38], [48], [49] and the centralized SGD algorithm. Note that our algorithm converges slower than the distributed momentum SGD algorithm [23]. This is reasonable since that algorithm is an accelerated algorithm with an extra requirement on the cost functions, i.e., the deviations between the gradients of the local cost functions are bounded, and it requires each agent to communicate three p-dimensional variables with its neighbors at each iteration. The slopes of the curves are, however, almost the same. The accuracy of each algorithm is given in Table III.
Fig. 2. Empirical risk.
TABLE III ACCURACY ON EACH ALGORITHM IN NN EXPERIMENT
Let us consider the training of a convolutional neural network (CNN) model. We build a CNN model for each agent with five 3×3 convolutional layers using ReLU as the activation function, one average pooling layer with filters of size 2×2, one sigmoid layer with dimension 360, another sigmoid layer with dimension 60, and one softmax layer with dimension 10. In this experiment, we use the whole MNIST data set. We use the same communication graph as in the NN experiment above. Each agent is assigned 6000 data points randomly. We set the batch size to 20, which means that at each iteration 20 data points, also drawn uniformly at random, are chosen by the agent to update the gradient. For each algorithm, we run 10 epochs to train the CNN model.
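The iteration budget implied by this setup can be sketched as follows (assuming one epoch means one full pass over each agent's 6000 local points in disjoint batches):

```python
import numpy as np

rng = np.random.default_rng(0)
num_points, batch_size, epochs = 6000, 20, 10

iters_per_epoch = num_points // batch_size   # 300 batches per pass
total_iters = epochs * iters_per_epoch       # 3000 updates per agent overall

# At each iteration, a batch of 20 distinct indices is drawn uniformly at random.
batch = rng.choice(num_points, size=batch_size, replace=False)
print(iters_per_epoch, total_iters, batch.shape)
```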
We compare our algorithms DPD-SGD-T and DPD-SGD-F with the fastest ones from the neural network case, i.e., DM-SGD [23], D-SGD-1 [26], [27], and C-SGD. We list all the parameters we choose in the CNN experiment for each algorithm in Table IV.
We demonstrate the training loss and the test accuracy of each algorithm in Figs. 3 and 4, respectively. Here we use the categorical cross-entropy loss, which is a softmax activation followed by a cross-entropy loss. We can see that our algorithms perform almost the same as DM-SGD and better than D-SGD-1 and the centralized C-SGD. The accuracy of each algorithm is given in Table V.
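The "softmax plus cross-entropy" composition mentioned above can be sketched as follows (the logit values are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def categorical_cross_entropy(logits, label):
    """Softmax activation followed by cross-entropy on the true class."""
    p = softmax(logits)
    return -np.log(p[label])

logits = np.array([2.0, 1.0, 0.1])
# The loss is small when the true class holds the largest logit,
# and larger for a poorly ranked class.
print(categorical_cross_entropy(logits, 0))
print(categorical_cross_entropy(logits, 2))
```

Fusing the two steps (rather than exponentiating and then taking a log of a stored probability) is the standard numerically stable way to implement this loss.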
TABLE IV PARAMETERS IN EACH ALGORITHM IN CNN EXPERIMENT
Fig. 3. CNN training loss.
Fig. 4. CNN accuracy.
TABLE V ACCURACY ON EACH ALGORITHM IN CNN EXPERIMENT
In this paper, we studied distributed nonconvex optimization. We proposed a distributed primal-dual SGD algorithm and derived its convergence rate. More specifically, the linear speedup convergence rate O(1/√(nT)) was established for smooth nonconvex cost functions, and the linear speedup convergence rate O(1/(nT)) was established when the global cost function satisfies the P-Ł condition.
IEEE/CAA Journal of Automatica Sinica, 2022, Issue 5