Hongzhi Tong and Michael Ng
1 School of Statistics, University of International Business and Economics, Beijing 100029, China
2 Institute of Data Science and Department of Mathematics, The University of Hong Kong, Pokfulam, Hong Kong, China
Abstract. We consider a gradient iteration algorithm for prediction in functional linear regression under the framework of reproducing kernel Hilbert spaces. In the algorithm, we use an early stopping technique, instead of the classical Tikhonov regularization, to prevent the iteration from overfitting. Under mild conditions, we obtain upper bounds for the excess prediction risk that essentially match the known minimax lower bounds. Almost sure convergence is also established for the proposed algorithm.
Key words: Gradient iteration algorithm, functional linear regression, reproducing kernel Hilbert space, early stopping, convergence rates.
Due to advances in technology, data are increasingly collected in the form of random functions or curves, as opposed to scalars or vectors. Functional data analysis (FDA), which was developed to handle such data, has drawn considerable attention in recent decades. Various approaches for the analysis of functional data have been developed and proposed in the literature; see [6,8,12,13,17] for comprehensive overviews.
Among the many problems involving functional data, functional linear regression is widely used for predicting a scalar response from a functional predictor. Consider the following functional linear regression model
    Y = α₀ + ∫_I β₀(t) X(t) dt + ε,    (1.1)
where Y is a scalar response, X(·) is a square integrable random function (with respect to Lebesgue measure) on a bounded interval I, α₀ is the intercept, β₀(·) is an unknown slope function and ε is a centered noise random variable. Without much loss of generality, throughout the paper we assume E(X) = 0 and the intercept α₀ = 0, since the intercept can be easily estimated.
The goal of the prediction problem is to estimate the functional
    η₀(X) = ∫_I β₀(t) X(t) dt
based on a set of training data {(X_i, Y_i): i = 1, ···, n} consisting of n independent copies of (X, Y). Define the risk for a prediction η as
    ℰ(η) = E*(Y* − η(X*))²,
where (X*, Y*) is a copy of (X, Y) independent of the training data, and E* represents the expectation taken over X* and Y* only. Let η̂ be a prediction constructed from the training data. Then its accuracy can be naturally measured by the excess risk
    ℰ(η̂) − ℰ(η₀).
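For instance, when the noise ε is centered and independent of X, expanding the squares gives
    ℰ(η) − ℰ(η₀) = E*(Y* − η(X*))² − E*(Y* − η₀(X*))² = E*(η₀(X*) − η(X*))²,
since the cross term involving the noise vanishes; the excess risk of η̂ thus measures the mean squared discrepancy between the estimated functional and η₀ at a new observation.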
In the context of functional linear regression, most existing methods are based upon functional principal component analysis (FPCA), see, e.g., [2,7,13,19]. The FPCA approach expands the unknown function β₀ in the eigenfunctions of the covariance operator of the predictor. For such a strategy to work well, it is usually necessary to assume that this basis provides a good representation of the slope function, which need not be the case, since the slope function may have nothing to do with the predictor in terms of basis representation accuracy. A more general assumption on the slope function concerns its smoothness, which makes a reproducing kernel Hilbert space (RKHS) [9] an interesting alternative, see [3,11,20]. In particular, it has been shown in [3] that the RKHS-based approach performs better when the slope function does not align well with the eigenfunctions of the covariance kernel.
Motivated by these observations, in this paper we develop an iterative estimation procedure for the functional linear model (1.1) within the framework of RKHS, under which the unknown slope function β₀ is assumed to reside in an RKHS H_K.
More specifically, we are concerned with an iteration path started from the origin. To prevent the iteration from converging to an overfitting function, it is necessary to choose the moment at which the iteration stops. Through a detailed theoretical analysis of a form of gradient descent applied to the least squares loss, we derive an early stopping rule, which plays the role of regularization. Under the early stopping strategy and some realistic conditions, we show that our algorithm preserves the same minimax convergence rate as that of [3]. Different from the result in [3], which is established in expectation, our convergence rate is stated in probability. As a consequence, we deduce almost sure convergence of the gradient iteration algorithm by applying the Borel-Cantelli lemma.
Although gradient iteration methods with early stopping have been well studied in the context of vector data, see, e.g., [14,18], it is still challenging to deal with the more complicated functional data. To our knowledge, this is the first theoretical guarantee for the use of such algorithms in functional settings, and we expect it to bring some new insights into broadening the methods of FDA. The outline of this paper is as follows. In Section 2, we review the mathematical framework and formulate the algorithm. In Section 3, we present our main results. A brief discussion is given in Section 4.
We begin by introducing some notation on RKHSs and kernel-induced integral operators, before turning to a precise formulation of the algorithm studied in this paper.
Recall that a reproducing kernel K: I × I → R is a real, symmetric, continuous, and nonnegative definite function. The RKHS H_K associated with the kernel K is the completion of the linear span of the functions {K_s := K(s, ·): s ∈ I} with respect to the inner product determined by ⟨K_s, K_u⟩_K = K(s, u). The so-called reproducing property of H_K is critical in both the theoretical analysis and the computation; it states that
    f(s) = ⟨f, K_s⟩_K,   for all f ∈ H_K and s ∈ I.
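As a standard consequence of the reproducing property, combined with the Cauchy–Schwarz inequality,
    |f(s)| = |⟨f, K_s⟩_K| ≤ ‖f‖_K √(K(s, s)),   f ∈ H_K, s ∈ I,
so convergence in the H_K norm implies pointwise (indeed uniform) convergence whenever K is bounded on the diagonal.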
Denote by L² the Hilbert space of square integrable functions on I, endowed with the inner product ⟨·,·⟩ and the corresponding norm ‖·‖ (i.e., ‖f‖² = ⟨f, f⟩). Throughout, ‖·‖_op stands for the usual operator norm, that is, ‖U‖_op = sup_{‖f‖≤1} ‖U f‖ for an operator U: L² → L². Define the integral operator L_K: L² → L² associated with the kernel K by
    (L_K f)(s) = ∫_I K(s, u) f(u) du,   s ∈ I,
and define the operator T: L² → L² by
    T = E[(L_K^{1/2} X) ⊗ (L_K^{1/2} X)],
where for f, g ∈ L², f ⊗ g: L² → L² is defined by (f ⊗ g)(h) = ⟨g, h⟩ f for any h ∈ L². The empirical version of T is given by
    T_n = (1/n) Σ_{i=1}^n (L_K^{1/2} X_i) ⊗ (L_K^{1/2} X_i).
Both operators will play a key role in the formulation and analysis of our gradient iteration algorithm.
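Written out, the tensor notation means that for any f ∈ L²,
    T f = E[⟨L_K^{1/2} X, f⟩ L_K^{1/2} X]   and   T_n f = (1/n) Σ_{i=1}^n ⟨L_K^{1/2} X_i, f⟩ L_K^{1/2} X_i,
so T_n is simply the sample average of rank-one operators built from the transformed predictors L_K^{1/2} X_i.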
Consider the empirical risk minimization, based on the observations {(X_i, Y_i): i = 1, ···, n}, of the least squares scheme
In the following, we assume that H_K is dense in L², which ensures that the functions appearing in (2.1) are uniquely defined. This assumption is satisfied when K is a universal kernel, such as the Gaussian kernel, see [15]. Now (2.1) can be rewritten as
Our algorithm is then taken as a gradient descent method for (2.2), which can be stated iteratively as a sequence in L² by
where γ > 0 is the step size. Note that
it follows by induction, with initial value 0, that
where I is the identity operator and
It is not hard to see that
From the iteration relation (2.4), we can obtain
In the next section, we shall estimate each of the terms in (2.5), so that a bound on the excess risk ℰ(η̂) − ℰ(η₀) can be obtained.
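To make the procedure concrete, the following minimal numerical sketch discretizes I = [0, 1] on a uniform grid and runs gradient steps of the form f ← f − γ(T_n f − g_n), where g_n = (1/n) Σ_{i=1}^n Y_i L_K^{1/2} X_i and a candidate slope is recovered through L_K^{1/2}. The Gaussian kernel, the simulated data, the hold-out stopping rule, and these identifications of the gradient step and of g_n are illustrative assumptions for this sketch, not the paper's exact prescriptions.

```python
# Minimal sketch of the gradient iteration with early stopping, assuming a
# uniform-grid discretization of I = [0, 1]; the kernel, the data-generating
# slope and the hold-out stopping rule are illustrative choices only.
import numpy as np

rng = np.random.default_rng(0)
m = 100                                   # grid size used to discretize I
s = np.linspace(0.0, 1.0, m)
w = 1.0 / m                               # quadrature weight (rectangle rule)

# Reproducing kernel K on I x I (a Gaussian kernel, assumed universal).
K = np.exp(-((s[:, None] - s[None, :]) ** 2) / 0.1)

# Discretized integral operator L_K and its square root L_K^{1/2}.
LK = K * w
evals, evecs = np.linalg.eigh((LK + LK.T) / 2)
LK_half = evecs @ np.diag(np.sqrt(np.clip(evals, 0.0, None))) @ evecs.T

def simulate(n):
    """Simulate (X_i, Y_i) from model (1.1) with alpha_0 = 0 (illustrative)."""
    X = np.cumsum(rng.standard_normal((n, m)), axis=1) / np.sqrt(m)  # rough Brownian paths
    beta0 = np.sin(2 * np.pi * s)                                    # illustrative slope
    Y = (X * beta0).sum(axis=1) * w + 0.1 * rng.standard_normal(n)
    return X, Y

n = 200
X, Y = simulate(n)
Xval, Yval = simulate(100)                # hold-out set, used only to pick the stopping time

# Z_i = L_K^{1/2} X_i, so that T_n f = (1/n) sum_i <Z_i, f> Z_i and
# g_n = (1/n) sum_i Y_i Z_i (L^2 inner products, quadrature weight w).
Z = X @ LK_half
Tn = (Z.T @ Z) * w / n                    # matrix acting on grid values of f
gn = (Y @ Z) / n

gamma = 0.5 / np.max(np.linalg.eigvalsh((Tn + Tn.T) / 2))   # step size below 1/||T_n||_op
f = np.zeros(m)
best_err, best_beta, best_t = np.inf, np.zeros(m), 0
for t in range(1, 501):
    f = f - gamma * (Tn @ f - gn)         # gradient step: f_t = f_{t-1} - gamma (T_n f_{t-1} - g_n)
    beta = LK_half @ f                    # candidate slope recovered through L_K^{1/2}
    pred = (Xval * beta).sum(axis=1) * w
    err = np.mean((Yval - pred) ** 2)     # hold-out prediction risk
    if err < best_err:                    # early stopping: keep the best iterate so far
        best_err, best_beta, best_t = err, beta.copy(), t

print(f"early-stopped at t = {best_t}, hold-out prediction risk = {best_err:.4f}")
```

The hold-out rule above is only a simple stand-in for the theoretical stopping time analyzed in Section 3.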
This section presents our main results. Theorems 3.1 and 3.2 contain separate estimates of the first and second terms of (2.5), and they lead to Theorem 3.3, which gives an upper bound on ℰ(η̂) − ℰ(η₀).
To derive nontrivial rates of convergence, we need to impose some basic assumptions. The following boundedness condition was introduced in [16].
(A1) There exists a constant κ > 0 such that ‖L_K^{1/2} X‖² ≤ κ² almost surely.
It follows from (A1) that both T and T_n are compact trace class operators, since tr(T) = E‖L_K^{1/2} X‖² ≤ κ² and tr(T_n) = (1/n) Σ_{i=1}^n ‖L_K^{1/2} X_i‖² ≤ κ².
We assume the noise ε in model (1.1) satisfies the following Bernstein condition.
(A2) E(|ε|^l | X) ≤ (1/2) l! M^{l−2} v holds almost surely, for every integer l ≥ 2 and some constants M, v > 0.
Assumption (A2) covers a broad class of noise models; it is satisfied when the noise is uniformly bounded, Gaussian, or sub-Gaussian.
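For example, in the uniformly bounded case the condition can be checked directly: if |ε| ≤ B almost surely for some constant B > 0, then for every integer l ≥ 2,
    E(|ε|^l | X) ≤ B^{l−2} E(ε² | X) ≤ (1/2) l! B^{l−2} B²,
since l!/2 ≥ 1, so (A2) holds with M = B and v = B².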
We also introduce a regularity condition on the operator T, measured in terms of the so-called effective dimension (see [4,22]). For λ > 0, the effective dimension of T is defined to be the trace of the operator (T + λI)^{−1} T, that is,
    N(λ) = tr((T + λI)^{−1} T).
(A3) There exist constants 0 < θ ≤ 1 and c_θ > 0 such that
    N(λ) ≤ c_θ λ^{−θ},   for all λ > 0.
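To illustrate (A3), suppose the eigenvalues s₁ ≥ s₂ ≥ ··· ≥ 0 of T satisfy s_k ≤ C k^{−1/θ} for some constant C > 0 and 0 < θ < 1. Bounding each summand by min{1, s_k/λ} and splitting the sum at k* = (C/λ)^θ gives, for every λ > 0,
    N(λ) = Σ_{k≥1} s_k/(s_k + λ) ≤ Σ_{k ≤ k*} 1 + λ^{−1} Σ_{k > k*} C k^{−1/θ} ≤ c_θ λ^{−θ},
with a constant c_θ depending only on C and θ. In particular, the polynomial eigenvalue decay (4.1) assumed in [3] and recalled in Section 4 yields (A3) with θ = 1/(2r).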
In order to derive the error bounds of the terms in (2.5), we still need some preliminary lemmas. The following lemma was proved in [16], which will play a key role in our analysis.
Lemma 3.1. Under the assumption in (A1), for any 0 < δ < 1, with confidence at least 1 − δ/2, there holds
The next lemma was established in [4], based on a Bernstein-type inequality for random variables taking values in a Hilbert space, see [10,21].
Lemma 3.2. Let H be a Hilbert space and ξ be a random variable with values in H. Let {ξ₁, ξ₂, ···, ξ_n} be a sample of n independent observations of ξ. If, for some constants u, v > 0, the bound
holds for every integer l ≥ 2, then for any 0 < δ < 1,
with confidence at least 1 − δ.
We first bound the first term on the right-hand side of (2.5) by using Lemma 3.1.
Theorem 3.1. Under the assumption in (A1) and 0 < γ < κ^{−2}, for any 0 < δ < 1, with confidence at least 1 − δ/2, there holds
Proof. Since
Using an operator norm inequality for self-adjoint positive operators A and B (see [1, Theorem X.1.1], where the result is stated for positive matrices, but the proof applies as well to positive operators on a Hilbert space), we can get from Lemma 3.1 that, with confidence at least 1 − δ/2,
Hence,
This together with (3.2) proves the theorem.
Next we turn to bound the second term of (2.5) by using Lemmas 3.1 and 3.2.
Theorem 3.2. Under the assumptions of (A1), (A2) and 0 < γ < κ^{−2}, for any 0 < δ < 1, with confidence at least 1 − δ, there holds
Proof. Since
Here, in the last inequality, we have used the fact (3.1) again. By Lemma 3.1, we have, on the same confidence set as in (3.2),
To estimate
and for l ≥ 2,
Note that
Hence we obtain from (A2) that
By applying Lemma 3.2 to ξ, we have, with confidence at least 1 − δ/2,
This, together with (3.3) and (3.4), proves the theorem.
Now we can derive explicit learning rates for the gradient iteration algorithm (2.3) applied to the functional linear regression model (1.1).
where C is a constant independent of n and δ, and ⌈x⌉ denotes the smallest integer greater than or equal to x ∈ R.
Since Theorems 3.1 and 3.2 hold simultaneously with probability at least 1 − δ, we get from (2.5) that
This proves the theorem with
Based on the confidence-based error estimate in Theorem 3.3, we can deduce almost sure convergence of the gradient iteration algorithm (2.3) by applying the following Borel-Cantelli lemma [5, p. 262].
Lemma 3.3. Let {μ_i} be a sequence of random variables in some probability space and {ν_i} be a sequence of positive numbers satisfying lim_{i→∞} ν_i = 0. If
then μ_i converges to μ almost surely.
almost surely.
Proof. Take δ = δ_n = n^{−2} in Theorem 3.3. Then for any n and ε > 0, we get from Theorem 3.3,
We can write
it is easy to see
and lim_{n→∞} ν_n = 0. Then our conclusion follows by Lemma 3.3.
In this paper we have introduced a gradient iteration approach to the functional linear regression model. To prevent the iteration from converging to an overfitting function, we develop an early stopping rule. Rigorous theoretical analysis guarantees that the proposed algorithm converges to the target almost surely.
The main difference is that, for functional linear regression, the unknown slope function is assumed to lie in an RKHS, while the predictors X_i are not in the RKHS. The minimization problem is written directly in L², and gradients are taken with respect to the L² inner product. In contrast, in most papers on classical nonparametric regression in an RKHS (see, e.g., [14,18]), the gradients are taken with respect to the inner product of the RKHS, and the feature maps belong to the RKHS.
It would be interesting to discuss what happens if the slope function lies outside the RKHS. A promising way is to quantify the regularity of the slope function by means of the operator T, which involves the alignment between the reproducing kernel K and the covariance function of X; this will be a topic of future work.
We finally present a detailed comparison of our result with related results for model (1.1). In [3], the authors derive minimax optimal rates for the Tikhonov regularized least squares scheme
under two crucial assumptions: the eigenvalues s₁ ≥ s₂ ≥ ··· of T decay polynomially,
    s_k ≍ k^{−2r},    (4.1)
with r > 1/2, and
    E⟨X, f⟩⁴ ≤ c (E⟨X, f⟩²)²,    (4.2)
for some constant c > 0 and any f ∈ L². The authors choose a regularization parameter λ ≍ n^{−2r/(2r+1)}, leading to the excess prediction risk
being minimax optimal.
Inequality (4.2) may be very difficult to verify except for Gaussian X. The boundedness assumption (A1) is used to replace it. Although condition (A1) is not satisfied by any non-degenerate Gaussian process, processes arising from real data are usually bounded. Hence, assumption (A1) is more realistic and can at least be regarded as complementary to (4.2).
On the other hand, assumption (A3) is more general than (4.1). In fact, if (4.1) is satisfied, it is easy to check that (A3) holds with θ = 1/(2r).
The convergence rate thus coincides with the minimax rate (4.3) of [3], and is therefore optimal.
Acknowledgements
H. Tong's research is supported in part by the National Natural Science Foundation of China (Grant No. 11871438). M. Ng's research is supported in part by HKRGC GRF Nos. 12300218, 12300519, 17201020, 17300021, C1013-21GF, C7004-21GF, and Joint NSFC-RGC N-HKU76921.