
    Distributed Stochastic Optimization with Compression for Non-Strongly Convex Objectives


    Xuanjie Li and Yuedong Xu

    School of Information Science and Technology, Fudan University, Shanghai, China

    ABSTRACT We investigate the distributed optimization problem, where a network of nodes works together to minimize a global objective that is a finite sum of their locally stored functions. Since nodes exchange optimization parameters through the wireless network, large-scale training models can create communication bottlenecks, resulting in slower training times. To address this issue, CHOCO-SGD was proposed, which allows compressing information with arbitrary precision without reducing the convergence rate for strongly convex objective functions. Nevertheless, most convex functions are not strongly convex (such as logistic regression or Lasso), which raises the question of whether this algorithm can be applied to non-strongly convex functions. In this paper, we provide the first theoretical analysis of the convergence rate of CHOCO-SGD on non-strongly convex objectives. We derive a sufficient condition, which limits the fidelity of compression, to guarantee convergence. Moreover, our analysis demonstrates that within the fidelity threshold, this algorithm can significantly reduce the transmission burden while maintaining the same convergence rate order as its no-compression equivalent. Numerical experiments further validate the theoretical findings by demonstrating that CHOCO-SGD improves communication efficiency while keeping the same convergence rate order. The experiments also show that the algorithm fails to converge with low compression fidelity and in time-varying topologies. Overall, our study offers valuable insights into the potential applicability of CHOCO-SGD to non-strongly convex objectives. Additionally, we provide practical guidelines for researchers seeking to utilize this algorithm in real-world scenarios.

    KEYWORDS Distributed stochastic optimization; arbitrary compression fidelity; non-strongly convex objective function

    1 Introduction

    In modern machine learning problems, datasets are frequently too large to be processed by a single machine or may be distributed across multiple computing nodes due to legal restrictions or cost considerations. In such scenarios, distributed learning has emerged as a viable solution to train models in a decentralized fashion. Each computing node trains on its local data samples and communicates with other nodes to exchange information and update model parameters. This approach can significantly reduce processing time and has gained increasing attention in recent years.

    Formally, we consider an optimization problem of the form

    where $f_i: \mathbb{R}^d \to \mathbb{R}$ is the local objective function of computing node (alternatively referred to as a worker) $i$ for $i \in \{1,\cdots,n\}$, $x$ usually refers to the neural network model parameter, and each $f_i$ is determined by the local dataset of node $i$, i.e.,

    where $m_i$ denotes the number of data samples of node $i$, $\zeta_j$ represents one data sample, and $F_i: \mathbb{R}^d \times \Omega \to \mathbb{R}$ is the loss function. For example, in a regression problem, a square loss is commonly used as $F_i(\cdot)$.
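    Since the displayed equations were lost in extraction, the following is a plausible reconstruction of the global objective (1) and the local objective (2), consistent with the definitions above; the exact normalization (averaging over the $n$ nodes rather than summing) is an assumption.

        \min_{x \in \mathbb{R}^d} f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x),      % global finite-sum objective, cf. Eq. (1)
        \qquad f_i(x) = \frac{1}{m_i} \sum_{j=1}^{m_i} F_i(x, \zeta_j).          % local empirical risk of node i, cf. Eq. (2)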

    Consensus-based gradient descent algorithms are extensively researched and employed in distributed learning owing to their simplicity and efficiency. These algorithms enable nodes to exchange parameters with neighboring nodes and incorporate new information from received messages to drive the parameters toward the global average parameter value. Meanwhile, each node computes gradients using its local data samples and updates its parameters using gradient descent methods.

    One limitation of these algorithms is the requirement for exact parameter transmission among nodes, which can be challenging in scenarios with restrictions on the data transmission amount or high communication latency (e.g., with large neural networks [1] or limited network bandwidth). In such cases, compressing the parameters becomes necessary. However, parameter compression usually results in a decrease in the convergence rate. To address this challenge, Koloskova et al. proposed CHOCO-SGD [2], a compressed distributed algorithm designed for strongly convex objective functions. CHOCO-SGD allows for both biased and unbiased compression operators and can achieve arbitrary compression precision, meaning that any compression fidelity $0 < \omega \le 1$ ($\omega = 1$ indicating no compression) can guarantee the same convergence rate order as the no-compression equivalent.

    However, strong convexity is a restrictive assumption that may not hold in various practical applications, such as in operations research and machine learning. For instance, consider the Lasso problem [3] with an objective function represented as $\sum_{i=1}^{m}(\langle d_i, x\rangle - l_i)^2 + \beta\|x\|_1$, where $i \in \{1,\cdots,m\}$ denotes the data index, $\beta$ denotes the regularization coefficient, and $d_i$ and $l_i$ represent data features and labels, respectively. In this case, each component function is non-strongly convex, as is the global objective. Another well-known example is logistic regression [4], where the objective is $\sum_{i=1}^{m}\log(1+\exp(-l_i\langle d_i, x\rangle))$. Therefore, it is crucial to investigate the behavior and performance of CHOCO-SGD for objectives that are non-strongly convex in nature. The main contributions are summarized as follows:

    · We establish a compression fidelity bound for CHOCO-SGD, ensuring its convergence over non-strongly convex functions. This criterion serves as a valuable guideline for researchers and engineers when applying CHOCO-SGD in practical scenarios.

    · We rigorously prove that within the compression fidelity threshold, CHOCO-SGD achieves the same convergence rate order as DSGD over non-strongly convex objectives, and compression fidelity only affects the higher-order terms in the rate. This means that CHOCO-SGD effectively reduces the transmitted data without compromising training efficiency.

    · We conduct comprehensive experiments to validate our theoretical results. The numerical outcomes show that when the compression fidelity is kept above the lower bound, CHOCO-SGD effectively reduces communication costs while maintaining a convergence rate comparable to that of DSGD. Additionally, our experiments reveal that CHOCO-SGD fails to converge when the fidelity falls below acceptable limits, implying that this algorithm cannot achieve arbitrary compression fidelity on non-strongly convex objectives. Moreover, we also observe that CHOCO-SGD cannot converge in time-varying network topologies.

    Notations. In this paper, we use uppercase letters like $X, A$ to denote various matrices, including the parameter matrix, gradient matrix, weight matrix, etc. Lowercase letters with subscripts are used to denote vectors; for example, $x_i$ means the parameter vector of node $i$. In terms of special matrices and vectors, we use $\mathbf{1}$ for a column vector whose elements are all 1 and $I$ for the identity matrix, whose size can be inferred from the context. We uniformly use $\|\cdot\|$ without subscripts for norm notations. When the norm is on a vector, it represents the $\ell_2$ norm; when it is on a matrix, it is the Frobenius norm.

    2 Related Work

    Consensus algorithm. For the distributed optimization problem, we want every node to reach an agreement on the global parameter value. The agreement means finding the average of $n$ initial values stored on $n$ nodes [5,6]. So, the consensus problem can be regarded as a subproblem of the distributed optimization problem. There is a plethora of research works exploring the consensus problem. It has been proved that the consensus problem can be solved with linear convergence on undirected graphs [6].

    Distributed optimization. On the basis of consensus algorithms, researchers have proposed many distributed gradient-based algorithms [7-12]. For undirected topologies, Nedic et al. [7] and Yuan et al. [8] analyzed the convergence of the Decentralized Stochastic Gradient Descent algorithm (DSGD) on non-strongly convex objective functions. Lian et al. proved that for nonconvex and Lipschitz-smooth objective functions, DSGD converges to a stationary point at a rate of $\mathcal{O}(1/\sqrt{nT})$ [13], where $T$ denotes the number of episodes. Furthermore, Shi et al. proposed the EXTRA algorithm, which has a linear convergence rate for strongly convex objectives [14]. Assran et al. further addressed the challenges introduced by directed graphs by designing the SGP algorithm based on DSGD [15]. More related algorithms have been reviewed in [16,17].

    Parameter/gradient compression. With the rapid growth of learning models, traditional distributed learning faces significant communication pressure. As a result, researchers have increasingly focused on designing communication-efficient algorithms [2,18-24]. The proposed methods can generally be classified into three main categories. The first category is quantization: instead of transmitting precise vectors (such as model parameters or gradients), nodes exchange a limited number of bits that approximate the original vectors. The second category is sparsification, where only a few elements of a vector are transmitted accurately, while the remaining elements are forced to be zero. The third category involves transmitting parameters only in some iterations rather than in every iteration, which reduces the amount of transmitted information. All these methods aim to find the optimal parameter value by transmitting less information; therefore, in this paper, we use the term compression to encompass them. Koloskova et al. proposed the CHOCO-SGD algorithm [2], a distributed stochastic algorithm over undirected graphs. This algorithm allows for arbitrary compression precision and has been proven to converge for strongly convex objectives. Taheri et al. [25] extended this algorithm to directed graphs for both convex and nonconvex objectives based on the SGP algorithm [15]. Different from that work, our study is based on undirected graphs. Table 1 further highlights this work and its differences from the related work.

    Table 1: Comparison of different algorithms for the decentralized optimization problem

    3 CHOCO-SGD Algorithm

    In this section, we will introduce the basic knowledge of distributed optimization and present the classic Distributed Stochastic Gradient Descent (DSGD) algorithm. Additionally, we will discuss the fundamental ideas underlying the CHOCO-SGD algorithm.

    We assume that the computing nodes are distributed across an undirected graph $G=(V,E)$, where $V$ and $E$ represent the set of nodes and edges, respectively. The DSGD algorithm is an efficient method to solve the distributed optimization problem in Eq. (1). It consists of two key components: consensus, also referred to as gossip, and parameter updates. The consensus step aims to achieve parameter agreement among all nodes. Specifically, we denote the local parameter of node $i$ as $x_i$, and perform the following parameter aggregation process:

    $x_i(t+1) = \sum_{j \in N_i} a_{ij} x_j(t),$

    where $t$ denotes the iteration number and $a_{ij}$ denotes the weight assigned by node $i$ to node $j$ (i.e., the entry in the $i$-th row and $j$-th column of the weight matrix $A=(a_{ij})_{n\times n}$, which incorporates all the weights). For compatibility with the underlying graph structure, $a_{ij}$ is set to be positive if and only if edge $(i,j) \in E$ or $i=j$. This ensures that only neighboring nodes, as determined by the graph, can communicate and exchange their parameters. It has been proven that when $A$ is row stochastic (i.e., each row of $A$ sums to one and all the entries are nonnegative), the parameters of all nodes can reach a consensus, meaning that each parameter $x_i(t)$ converges to the average of the initial values [26]. This consensus algorithm enables the nodes to synchronize their parameter values.

    The second component of the DSGD algorithm involves parameter updates using stochastic gradient descent. Combining these components, the update rule of DSGD is given as follows:

    $x_i(t+1) = \sum_{j \in N_i} a_{ij} x_j(t) - \alpha \nabla F_i(x_i(t), \zeta_{i,t}),$

    where $\nabla F_i(x_i(t), \zeta_{i,t})$ represents the gradient of the local objective function evaluated at $x_i(t)$ using the data batch $\zeta_{i,t}$, and $\alpha$ denotes the learning rate. This update rule combines the consensus step, which ensures parameter agreement, with the gradient descent step, which drives the parameters towards optimal values. We formally present the DSGD algorithm in Algorithm 1. Under some constraints, extensive studies [16] have demonstrated that the DSGD algorithm converges sublinearly in the number of episodes $T$ on non-strongly convex objective functions.
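    For concreteness, the following is a minimal NumPy sketch of one DSGD iteration in the combine-then-adapt form written above. It is an illustration only, not the authors' implementation; the function name `dsgd_step` and the stochastic-gradient oracle `stoch_grad` are assumptions.

        import numpy as np

        def dsgd_step(X, A, stoch_grad, alpha):
            """One DSGD iteration (sketch).

            X          : (n, d) array; row i holds node i's current parameter x_i(t).
            A          : (n, n) mixing matrix, a_ij > 0 only for neighbors (plus self-loops).
            stoch_grad : callable(i, x) returning a stochastic gradient of f_i at x.
            alpha      : learning rate.
            """
            X_mixed = A @ X                    # consensus (gossip) step
            G = np.stack([stoch_grad(i, X[i]) for i in range(X.shape[0])])
            return X_mixed - alpha * G         # local stochastic gradient step

    In a real deployment each node would compute only its own row of this update, exchanging parameters with its neighbors over the network; the centralized matrix form above is used purely for clarity.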

    However, when dealing with massive training models, the DSGD algorithm encounters a communication bottleneck. To address this limitation, a straightforward yet effective solution is to compress the parameters before transmitting them over the network. For this purpose, we introduce a compression operator denoted as $Q$, along with the definition of compression fidelity $\omega$, which indicates the ratio of preserved information. A higher value of $\omega$ signifies greater fidelity, implying that less information is discarded during the compression process. The value of $\omega$ satisfies the following assumption.

    Assumption 3.1. The compression operator $Q$ satisfies

    $\mathbb{E}_Q\|Q(x)-x\|^2 \le (1-\omega)\|x\|^2.$

    We take the expectation in this context because there might be inherent randomness in the compression operators, such as in the case of randomized gossip discussed later. It is worth noting that when $\omega=1$, the compressed parameter $Q(x)$ should ideally be the same as $x$ (almost surely, although we may disregard negligible differences). We note that the operator $Q$ can be either biased or unbiased, encompassing a wide range of compression techniques.

    Example 3.1. Examples of compression operators:

    · Randomized gossip:

    $Q(x) = \begin{cases} x & \text{with probability } p, \\ 0 & \text{with probability } 1-p, \end{cases}$

    where $p \in (0,1]$. We can find $\omega = p$.

    · Top-k sparsification [20]: for a parameter vector $x \in \mathbb{R}^d$, we select the $k$ dimensions with the highest magnitude, where $k \in \{1,2,\cdots,d\}$. Subsequently, we transmit a compressed version of the parameter vector, retaining only the selected $k$ entries while setting all other positions to 0. It is worth noting that the compression fidelity in this case can be calculated as $\omega = k/d$.
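    As an illustration (not the authors' code), both operators can be sketched in a few lines of NumPy; the function names are our own.

        import numpy as np

        rng = np.random.default_rng(0)

        def randomized_gossip(x, p):
            """Return x with probability p, otherwise the zero vector (so omega = p)."""
            return x.copy() if rng.random() < p else np.zeros_like(x)

        def top_k(x, k):
            """Keep the k largest-magnitude entries of x and zero the rest (so omega = k/d)."""
            q = np.zeros_like(x)
            idx = np.argpartition(np.abs(x), -k)[-k:]   # indices of the k largest |x_i|
            q[idx] = x[idx]
            return q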

    Through simple experiments, one can observe that when integrating compression directly into the transmitted parameter in Algorithm 1 (i.e., replacing the transmission of $x_i(t)$ with $Q(x_i(t))$), the algorithm typically fails to converge. The primary reason behind this behavior stems from the magnitude of $x_i(t)$. In general, it lacks a bound, meaning its magnitude is not necessarily close to zero. As a result, the error introduced by compression (such as forcing certain positions to be zero) becomes difficult to control. Hence, it is important to employ the compression operator appropriately and judiciously.

    Based on Assumption 3.1, CHOCO-SGD was proposed and is presented in Algorithm 2. As the transmitted parameters are compressed and inexact, each node $i$ needs to maintain $n$ additional auxiliary variables $\{\hat{x}_j(t), j \in N\}$ to approximate the true parameters of all $n$ nodes. Specifically, for each node $i \in N$, the newly added auxiliary variable $\hat{x}_j(t)$ represents an estimate of the actual local parameter of node $j$. In Algorithm 2, instead of sending the compressed parameters, the nodes transmit the compressed value of the difference between the true local parameter and its estimator. Subsequently, upon receiving messages from neighboring nodes, every node updates its estimators of the parameters of the other nodes, denoted as $\{\hat{x}_1(t+1),\cdots,\hat{x}_n(t+1)\}$. If the estimates maintained by each node are accurate, the error introduced by compression is minimal and will eventually vanish. It has been proven that $Q(\hat{x}_i(t)-x_i(t))$ asymptotically equals $\hat{x}_i(t)-x_i(t)$, which does not hold for $Q(x_i(t))$ and $x_i(t)$.
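    The following NumPy sketch mirrors this description of Algorithm 2 in a single-process simulation, following the structure of CHOCO-SGD in [2]; the consensus step size `gamma` and the exact update ordering are assumptions on our part, and `Q` is any operator satisfying Assumption 3.1 (e.g., `Q = lambda v: top_k(v, k)` from the previous sketch).

        import numpy as np

        def choco_sgd_step(X, X_hat, A, Q, stoch_grad, alpha, gamma):
            """One (simulated) CHOCO-SGD iteration over all n nodes.

            X      : (n, d) local parameters x_i(t).
            X_hat  : (n, d) public estimates x_hat_i(t), replicated at every node.
            A      : (n, n) doubly stochastic mixing matrix.
            Q      : compression operator applied to each transmitted difference.
            """
            n = X.shape[0]
            G = np.stack([stoch_grad(i, X[i]) for i in range(n)])
            X_half = X - alpha * G                                     # local SGD step
            D = np.stack([Q(X_half[i] - X_hat[i]) for i in range(n)])  # compressed differences (the only transmitted data)
            X_hat_new = X_hat + D                                      # all nodes refresh their estimates
            X_new = X_half + gamma * (A @ X_hat_new - X_hat_new)       # gossip on the estimates
            return X_new, X_hat_new

    Because only the compressed differences `D` are sent over the network, the per-iteration traffic shrinks with the fidelity $\omega$, while the estimates $\hat{x}_i$ absorb the compression error over time.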

    Koloskova et al. [2] proved that Algorithm 2 converges on strongly convex objective functions when specific parameters, such as the step size and regularization parameter, are appropriately chosen. Furthermore, it has been established that the algorithm converges with arbitrary compression fidelity, indicating minimal communication pressure. To evaluate its potential in practical scenarios, we investigate its convergence properties on non-strongly convex objective functions.

    4 Convergence Analysis for Non-Strongly Convex Objectives

    In this section, we begin by introducing assumptions regarding the underlying graph structure. As mentioned previously, we use $G=(V,E)$ to denote the undirected graph and use $N_i$ to denote all the neighbors of node $i$ together with itself, i.e., $N_i=\{j: (i,j) \in E\}\cup\{i\}$. We have a couple of well-accepted assumptions in distributed optimization, which are related to the graph structure and the corresponding weight matrix.

    Assumption 4.1. The underlying graph structure $G$ is connected.

    Assumption 4.2. Given the aggregation weights $a_{ij}$, the weight matrix $A=(a_{ij})_{n\times n}$ is doubly stochastic and non-negative, i.e.,

    $A\mathbf{1}=\mathbf{1}, \quad A^T\mathbf{1}=\mathbf{1}, \quad a_{ij}\ge 0,$

    and $A$ satisfies $\sigma_2<1$, where $\sigma_2$ denotes the second largest singular value of $A$.

    Remark 4.1. Under the undirected graph setting, a doubly stochastic matrix can easily be found. Given an undirected graph $G$ with adjacency matrix $A_0$, degree matrix $D$, and maximum degree $\Delta$, it can be directly shown that a valid doubly stochastic weight matrix can be constructed from these quantities [27].
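    Since the explicit construction did not survive extraction, the sketch below uses the standard max-degree (Laplacian-based) weights $A = I - (D - A_0)/(\Delta + 1)$ as a stand-in; whether the paper uses this exact normalization is an assumption. The resulting matrix is symmetric, hence doubly stochastic, and non-negative.

        import numpy as np

        def max_degree_weights(A0):
            """Doubly stochastic weights A = I - (D - A0) / (Delta + 1) (one standard choice)."""
            degrees = A0.sum(axis=1)
            D = np.diag(degrees)
            return np.eye(A0.shape[0]) - (D - A0) / (degrees.max() + 1)

        # Example: the 8-node ring used in the experiments of Section 5.
        n = 8
        A0 = np.zeros((n, n))
        for i in range(n):
            A0[i, (i + 1) % n] = A0[(i + 1) % n, i] = 1.0
        A = max_degree_weights(A0)
        assert np.allclose(A.sum(axis=0), 1.0) and np.allclose(A.sum(axis=1), 1.0)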

    We next introduce several basic assumptions on the objective function and its gradients. Note that these assumptions are common in first-order continuous optimization.

    Assumption 4.3. Each local objective function $f_i(\cdot)$ is convex, i.e.,

    $f_i(y) \ge f_i(x) + \langle \nabla f_i(x), y-x \rangle \quad \forall x, y \in \mathbb{R}^d,$

    and $L$-smooth, i.e., there exists a constant $L>0$ (for all $i \in N$), such that

    $\|\nabla f_i(x) - \nabla f_i(y)\| \le L\|x-y\| \quad \forall x, y \in \mathbb{R}^d.$

    The definition of $L$-smoothness is equivalent to the following inequality:

    $f_i(y) \le f_i(x) + \langle \nabla f_i(x), y-x \rangle + \frac{L}{2}\|y-x\|^2 \quad \forall x, y \in \mathbb{R}^d.$

    Assumption 4.4. There exists a constant $M$ such that the local (stochastic) gradients are uniformly bounded by $M$.

    Assumption 4.5. There exists a constant $\sigma$ such that the variance of the local stochastic gradients with respect to the data samples is bounded by $\sigma^2$.

    Assumptions 4.4 and 4.5 bound the local gradients and the variance with respect to data samples, respectively. The latter assumption is easily satisfied when the data across nodes are independent and identically distributed. Besides, the following asymptotic property of $A$ implies that all the elements of $A^t$ will be distributed in the vicinity of $1/n$ as $t\to\infty$.

    Lemma 4.1. If $A \in \mathbb{R}^{n\times n}$ is a doubly stochastic matrix with $\sigma_2<1$, then there exist constants $\lambda \in (0,1)$ and $C>0$, such that

    $\left\| A^t - \tfrac{1}{n}\mathbf{1}\mathbf{1}^T \right\| \le C\lambda^t,$

    where $A^t$ refers to the product of $t$ such matrices.

    This lemma states that as the matrix multiplication progresses, all elements of $A^t$ converge to $1/n$. The detailed proof is presented in the appendix. With these assumptions and the lemma, we can analyze the convergence properties of CHOCO-SGD for non-strongly convex objectives.
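    As a quick numerical illustration (not part of the paper), the script below checks this geometric decay on the 8-node ring with the max-degree weights constructed earlier.

        import numpy as np

        n = 8
        A0 = np.zeros((n, n))
        for i in range(n):
            A0[i, (i + 1) % n] = A0[(i + 1) % n, i] = 1.0
        D = np.diag(A0.sum(axis=1))
        A = np.eye(n) - (D - A0) / (A0.sum(axis=1).max() + 1)    # doubly stochastic ring weights

        At = np.eye(n)
        for t in range(1, 61):
            At = At @ A
            if t % 20 == 0:
                dev = np.linalg.norm(At - np.ones((n, n)) / n)   # Frobenius distance to (1/n) 11^T
                print(f"t = {t:2d}, deviation = {dev:.3e}")      # shrinks geometrically in t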

    Theorem 4.1. Under Assumptions 3.1 and 4.1-4.5, and assuming the step size $\alpha$ is chosen below a problem-dependent threshold, the CHOCO-SGD algorithm converges on non-strongly convex objectives.

    Furthermore, a compression fidelity $\omega$ above a threshold determined by the number of nodes and the communication topology is a sufficient condition for achieving the same convergence rate order as DSGD. We will demonstrate with experiments that CHOCO-SGD cannot converge on non-strongly convex objectives with low compression fidelity. In other words, the CHOCO-SGD algorithm does not allow arbitrary compression fidelity.

    Proof. For simplicity, we collect the per-node parameters, estimators, and stochastic gradients into matrices, so that the updating rule of Algorithm 2 can be expressed in compact matrix form. Note that when we apply the compression operator $Q$ to a matrix, we mean compressing every column of that matrix. Since the underlying topology is fixed, node $i$ and node $j$ ($i,j \in N_k$) have the same estimator $\hat{x}_k(t)$ for their common neighbor node $k$. Then we can use one matrix $\hat{X}(t)$ to denote the original $n \times n$ auxiliary variables. Now, we start the proof by analyzing the distance between the average parameter in the $(t+1)$-th iteration, denoted as $\bar{x}(t+1)$, and the optimum, denoted as $x^*$. By the updating rule and the properties of $A$, we have

    where the inequality comes from the Cauchy-Schwarz inequality. The two terms on the right-hand side (RHS) of the last equation can be bounded as follows:

    Therefore, after $t$ iterations, the difference between the current parameter and the global average can be expressed as

    where the last inequality applies Assumption 4.5. Then Eq. (25) can be written as

    By substituting this result into Eq. (19), we obtain Theorem 4.1.

    5 Experiments

    5.1 Evaluation Setup

    Testbed setting and topology. Our experiments are conducted on a testbed of 8 servers, each with an 8-core Intel Xeon W2140b CPU at 3.20 GHz, 32 GB of DDR4 RAM, and a Mellanox Connect-X3 NIC supporting 10 GbE links. We use PyTorch as the experiment platform and employ its Distributed Data-Parallel (DDP) training paradigm to realize parameter communication among the 8 servers. The main topology in the experiments is a ring, as shown in Fig. 1a. Ring topologies can be used in Local Area Networks (LANs) or Wide Area Networks (WANs), and are also commonly used in distributed learning [28].

    Figure 1: Undirected topologies with 8 nodes

    We consider a task where all nodes aim to classify images in the MNIST or CIFAR-10 dataset into 10 classes using a linear neural network. The key features of both datasets are listed in Table 2. The objective function for this task is the cross-entropy loss. The cross-entropy loss of a data batch is defined as

    $\mathcal{L} = -\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{l} p(x_{ij}) \log q(x_{ij}),$

    where $m$ and $l$ represent the batch size and the number of categories, respectively. In this equation, $q(x_{ij})$ denotes the predicted probability of sample $i$ belonging to class $j$, while $p(x_{ij})$ represents the true label, which is equal to 1 when sample $i$ belongs to class $j$ and 0 otherwise. The entire training dataset is divided into 8 equal partitions, each stored on an independent node. Unless otherwise specified, the learning rate and batch size are set to 0.01 and 64, respectively.
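    A minimal sketch of this loss computation (matching the definition above; variable names are ours):

        import numpy as np

        def batch_cross_entropy(q, p):
            """Cross-entropy of a batch.

            q : (m, l) array of predicted class probabilities q(x_ij).
            p : (m, l) array of one-hot true labels p(x_ij).
            """
            return -np.sum(p * np.log(q + 1e-12)) / q.shape[0]

        # Example: a batch of 2 samples with 3 classes.
        q = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.8, 0.1]])
        p = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0]])
        print(batch_cross_entropy(q, p))   # ~0.290 = (-log 0.7 - log 0.8) / 2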

    Table 2: Key features of datasets

    5.2 Basic Results

    We evaluate the performance of the CHOCO-SGD algorithm in comparison to the DSGD baseline. In CHOCO-SGD, we examine the efficacy of the two compression operators outlined in Example 3.1: randomized gossip with $p=0.7$ and Top-k sparsification with $\omega=0.7$. To align with our theoretical convergence criterion, the performance is evaluated based on the global loss computed at the global average parameter, denoted as $f(\bar{x}(t))$. The model is trained for 25 episodes, where each episode corresponds to one pass through the entire dataset.

    The results in Fig. 2a demonstrate that CHOCO-SGD and DSGD exhibit comparable convergence rates, with similar reductions in loss after the same number of episodes. To distinguish the curves, we scale the loss values to the range $[0,1]$ with the following formula:

    $v_{\text{scale}} = \frac{v - v_{\min}}{v_{\max} - v_{\min}},$

    where $v$ is the real loss value in the training process, $v_{\min}$ and $v_{\max}$ denote the minimal and maximal loss values achieved by the three algorithms, respectively, and $v_{\text{scale}}$ represents the scaled loss value. The scaled results are depicted in Fig. 2b on a logarithmic scale; in the following, the scaled training loss is processed likewise. In Fig. 2b, CHOCO-SGD and DSGD show only marginal differences in convergence rate. Furthermore, Figs. 2c and 2d (scaled) compare the loss with respect to the number of transmitted bits. Remarkably, CHOCO-SGD, employing either randomized gossip or Top-k sparsification, achieves lower loss values while transmitting the same volume of data. This outcome can be attributed to the compression of certain values, which enables nodes to transmit less data in each communication round without adversely affecting the convergence rate. Similar results also appear on the CIFAR-10 dataset, as shown in Fig. 3: CHOCO-SGD transmits about $1.2\times 10^{10}$ bits, while DSGD needs to transmit $1.7\times 10^{10}$ bits to achieve the same training error. These results align with our theoretical analysis, indicating that compression can effectively reduce the amount of transmitted data while exerting minimal influence on the convergence rate for non-strongly convex objectives, provided that the fidelity threshold is appropriately maintained.

    Figure 2: Training loss vs. episodes/transmitted data amount of CHOCO-SGD and DSGD on the MNIST dataset

    Figure 3: Training loss vs. episodes/transmitted data amount of CHOCO-SGD and DSGD on the CIFAR-10 dataset

    5.3 Sensitivity to Compression Fidelity

    To further investigate the impact of compression fidelity, we conduct several experiments by controlling the values of $p$ and $\omega$ within the range of 0.9 to 0.5 for the two compression operators. In randomized gossip, a smaller value of $p$ increases the probability of transmitting all-zero parameters, which could lead to inaccuracies in $\hat{x}_j$ as described in Algorithm 2. The change in (scaled) training loss with respect to iterations and transmitted bits is depicted in Figs. 4a, 4b (MNIST) and Figs. 5a, 5b (CIFAR-10), respectively. We observe that lowering the value of $p$ results in fewer transmitted bits without slowing down the convergence rate. Similarly, the results obtained using Top-k sparsification, shown in Figs. 4c, 4d and 5c, 5d, indicate that reducing the number of bits, while maintaining $\omega$ above a certain threshold, ensures satisfactory convergence rates and reduces the communication burden.

    Figure 4: Training loss of CHOCO-SGD with randomized gossip (a), (b) and with Top-k sparsification (c), (d) on the MNIST dataset. Both p and ω vary from 0.9 to 0.5

    Figure 5: Training loss of CHOCO-SGD with randomized gossip (a), (b) and with Top-k sparsification (c), (d) on the CIFAR-10 dataset. Both p and ω vary from 0.9 to 0.5

    Furthermore, we conduct experiments to determine whether CHOCO-SGD fails to converge when the compression fidelity is small for non-strongly convex objectives, as stated in Eq. (6). Specifically, for randomized gossip and Top-k sparsification, we set both $p$ and $\omega$ to 0.4. Fig. 6 demonstrates that in this over-compression scenario, the training loss of CHOCO-SGD continues to increase, implying that the algorithm fails to converge. With these results, we empirically verify that for non-strongly convex functions, CHOCO-SGD requires the compression fidelity to be kept above a threshold to ensure convergence, which differs from the strongly convex case. In other words, CHOCO-SGD cannot achieve arbitrary compression in a non-strongly convex scenario.

    Figure 6: Convergence of CHOCO-SGD with low compression fidelity on the MNIST dataset

    5.4 Robustness towards Learning Hyperparameters

    In this section, we study the influence of the hyperparameters (learning rate $\alpha$ and batch size $b$) on the convergence performance of the CHOCO-SGD algorithm for non-strongly convex functions. The compression fidelity is set to $p=\omega=0.7$. We test three learning rates $\alpha=0.01, 0.05, 0.1$ and three batch sizes $b=32, 64, 128$. The results are displayed in Tables 3 and 4. As the batch size increases, the final loss values of the three algorithms after 25 training episodes become larger, because each episode is one pass through the whole dataset, and a larger batch size leads to fewer batches and thus fewer gradient descent steps and data transmissions. Furthermore, the loss becomes smaller as the learning rate increases, since the global parameter takes larger steps towards the global minimum. In every hyperparameter setting, CHOCO-SGD and DSGD achieve similar convergence error within the same number of training episodes, but CHOCO-SGD transmits less data. Furthermore, according to Table 4, the change of the learning rate has a negligible influence on the communication load, while a larger batch size leads to less transmitted data.

    Table 3: Summary of training loss with different hyperparameter settings. The loss values are recorded after 25 training episodes and are rounded to three decimal places

    Table 4: Summary of transmitted bits with different hyperparameter settings. The bit counts are recorded after 25 training episodes and are rounded to three decimal places

    5.5 Applicability to Various Topologies

    5.5.1 Investigating Different Fixed Topologies

    To explore the algorithm's applicability to non-strongly convex objectives, we test the convergence performance of CHOCO-SGD on two different topologies, as displayed in Fig. 7. Fig. 7a shows a bipartite topology, where nodes 1-4 and 5-8 are divided into two groups and each edge can only connect nodes in different groups. Fig. 7b displays a sparser topology, where parameters can only be transmitted along one path in two directions. The compression fidelity is set to $p=\omega=0.7$, the same as in the basic experiments. Figs. 8 and 9 show the scaled training loss of MNIST training on the two topologies, respectively. In Fig. 8, CHOCO-SGD and DSGD have the same convergence speed, since they have similar loss values after the same number of training episodes, but the transmission amount of CHOCO-SGD, about $5\times 10^9$ bits, is much less than the $7\times 10^9$ bits of DSGD. Besides, in the sparser topology, Fig. 9b demonstrates that both algorithms can transmit less data to achieve convergence, but the transmitted amount of DSGD is 40% more than that of CHOCO-SGD. Therefore, these results further illustrate the robustness and applicability of CHOCO-SGD for non-strongly convex functions on various topologies.

    Figure 7: Different undirected and connected topologies

    Figure 8: Training loss vs. episodes/transmitted data amount of CHOCO-SGD and DSGD on the bipartite topology of Fig. 7a

    Figure 9: Training loss vs. episodes/transmitted data amount of CHOCO-SGD and DSGD on the sparse topology of Fig. 7b

    5.5.2 Time-Varying Topologies

    We further explore the impact of time-varying topologies on the convergence of CHOCO-SGD. The training topology alternates between the two graphs in Fig. 1. For instance, if the nodes are initially distributed over the left topology, in the next iteration they will be connected according to the right topology. It is important to note that this setup still ensures connectivity (as stated in Assumption 4.1). We set the compression coefficients to $p=\omega=0.9$ to reduce their influence on convergence. Fig. 10 illustrates that CHOCO-SGD is not able to converge in a non-strongly convex and time-varying environment, because some nodes receive parameters from specific neighbors inconsistently, resulting in an uncorrectable bias from the true parameters, and the algorithm eventually diverges.

    Figure 10: Convergence of CHOCO-SGD on time-varying topologies

    6 Conclusion and Future Work

    In this work, we study the CHOCO-SGD algorithm in the case where the objective function is non-strongly convex. We provide the first theoretical analysis of this setting and prove that when the compression fidelity is controlled within a certain threshold, CHOCO-SGD has the same convergence rate order as DSGD. Experimental results validate the theoretical analysis by demonstrating that CHOCO-SGD converges more quickly than DSGD when transmitting the same amount of data. Besides, low compression fidelity or a time-varying topology can prevent CHOCO-SGD from converging.

    There are several open topics worth investigating in our future work. The first is to explore CHOCO-SGD's performance in more complex non-convex scenarios, which can provide a deeper understanding of its capabilities. Besides, it will be meaningful to improve this algorithm to achieve arbitrary compression fidelity for non-strongly convex functions, since most real-world problems are non-strongly convex and such an improvement would enhance CHOCO-SGD's durability and applicability.

    Acknowledgement: The authors are grateful to Simiao Jiao for theoretical support. The authors also wish to thank the anonymous reviewers and journal editors, whose assistance greatly enhanced the paper's logical rigor, integrity, and content quality.

    Funding Statement: This work was supported in part by the Shanghai Natural Science Foundation under Grant 22ZR1407000.

    Author Contributions: The authors' contributions to this manuscript are as follows: study conception: X.L., Y.X.; experiment design and data collection: X.L.; theoretical analysis, interpretation of results, and draft manuscript preparation: X.L., Y.X. All authors reviewed the results and approved the final version of the manuscript.

    Availability of Data and Materials: The data used in this manuscript, including the MNIST and CIFAR-10 datasets, is accessible via the following link: https://www.kaggle.com/datasets.

    Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.

    Appendix A. Proof of Lemma 4.1

    We first prove that for any matrix $M \in \mathbb{R}^{d\times n}$, if $M\mathbf{1}=0$, then $\|MA\| \le \sigma_2\|M\|$. Consider the singular value decomposition $A=U\Sigma V^T$, where $U$ is a matrix consisting of eigenvectors of $AA^T$. Writing $U=(u_1,\cdots,u_n)$, because $AA^T$ has eigenvalue 1, we can choose $u_1 = \frac{1}{\sqrt{n}}\mathbf{1}$, the normalized eigenvector of eigenvalue 1.
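    A sketch of how the claimed bound follows from this decomposition (our reconstruction under the stated assumptions, not necessarily the authors' exact steps):

        % Since M u_1 = (1/\sqrt{n}) M \mathbf{1} = 0, the first column of MU vanishes, hence
        \|MA\| = \|MU\Sigma V^T\| = \|MU\Sigma\|
               = \big\| (0,\ \sigma_2 M u_2,\ \cdots,\ \sigma_n M u_n) \big\|
               \le \sigma_2 \|MU\| = \sigma_2 \|M\|,
        % where the equalities use the orthogonal invariance of the Frobenius norm
        % and the inequality uses \sigma_2 \ge \cdots \ge \sigma_n.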
