Yahui Liu, Bin Tian, Yisheng Lv, Lingxi Li, and Fei-Yue Wang
Abstract—Recently, there have been several attempts to apply Transformers to 3D point cloud classification. In order to reduce computations, most existing methods focus on local spatial attention, but they ignore point content and fail to establish relationships between distant but relevant points. To overcome the limitation of local spatial attention, we propose a point content-based Transformer architecture, called PointConT for short. It exploits the locality of points in the feature space (content-based), clustering the sampled points with similar features into the same class and computing self-attention within each class, thus enabling an effective trade-off between capturing long-range dependencies and computational complexity. We further introduce an inception feature aggregator for point cloud classification, which uses parallel structures to aggregate high-frequency and low-frequency information in each branch separately. Extensive experiments show that our PointConT model achieves remarkable performance on point cloud shape classification. In particular, our method achieves 90.3% Top-1 accuracy on the hardest setting of ScanObjectNN. Source code of this paper is available at https://github.com/yahuiliu99/PointConT.
3D point cloud analysis has gained tremendous popularity in many fields, including scene understanding [1]–[3], robotics, and self-driving vehicles [4]–[6]. Compared with 2D images, 3D point clouds provide rich spatial and geometric information, but they are not arranged in any particular order. Due to this irregular structure, convolutional neural networks cannot be directly applied to point cloud processing, whereas the Transformer [7] architecture is inherently permutation-invariant and thus naturally suited for point cloud learning.
Recently, several explorations of the Transformer architecture have been made in point cloud analysis [8]–[14]. However, a common downside of these models, their high computational cost, has caught the attention of researchers and motivated them to consider the trade-off between accuracy and inference speed. The two main approaches to reduce the computational complexity are point downsampling and local self-attention [8], [11]–[13]. Point downsampling algorithms, such as farthest point sampling (FPS) [15], provide uniform coverage of the entire point cloud. Local self-attention computes relationships within a subset of points (a patch or cubic window) partitioned in 3D space. Although local spatial attention significantly improves efficiency, it still has difficulty capturing interactions among distant but similar points.
Therefore, we propose a simple yet powerful architecture for point cloud classification, named point content-based Transformer (PointConT), which exploits local self-attention in the feature space (content-based) instead of the 3D space, as visualized in Fig. 1. Starting from the content of the point cloud, we cluster the sampled points into classes based on their similarity and compute self-attention within each class, which preserves the ability of the global self-attention mechanism to capture long-range feature dependencies while significantly reducing computational complexity. Specifically, it dynamically divides all queries into multiple clusters according to their content (i.e., features) in each block, and selects the corresponding keys and values to compute local self-attention. The clustering varies at each stage and each head of the Transformer, adequately reflecting the content dynamics. Note that, unlike K-nearest neighbors (KNN), the clusters are non-overlapping, which further reduces the computational complexity.
Moreover, we complement point cloud feature aggregation from a frequency standpoint. Recent studies [16], [17] found that max-pooling amplifies high-frequency features, while average pooling and Transformers attenuate high-frequency components, which also accords with the observations in our ablation experiments. In order to aggregate high-frequency and low-frequency features, we design an Inception feature aggregator composed of two branches, where the name "inception" is derived from the Inception module [18], [19]. The high-frequency aggregation branch consists of a max-pooling operation and a residual multi-layer perceptron (MLP) module, while the low-frequency aggregation branch is implemented by an average pooling operation and the content-based Transformer block.
Fig. 1. Comparison between 3D space locality and content-based locality. The red point denotes the sampled center point, and the blue points denote a neighborhood or cluster. In content-based attention, points are clustered into multiple clusters based on their feature similarity.
The main contributions of this paper are summarized as follows.
1) We propose the point content-based Transformer (PointConT) to cluster points according to their content and compute self-attention within each cluster, establishing long-range feature dependencies while significantly reducing computations.
2) We design an Inception feature aggregator for point cloud classification, using parallel structures to aggregate high-frequency and low-frequency information in each branch separately.
3) Experiments show the competitiveness of our model on the ModelNet40 [20] and ScanObjectNN [21] datasets. Extensive ablation studies verify the benefits of each component in the PointConT design.
There are mainly two branches of methods for processing point clouds. One is to convert point clouds into a regular grid structure that can be directly consumed by convolutional neural networks, such as volumetric representations [22]–[24] (through voxelization) or images [25], [26] (through projection or rendering). The other is point-based modeling, where the raw point clouds are directly fed into deep networks without any conversion. This paper focuses on point-based methods.
PointNet is a pioneering work that successfully applies a deep architecture to raw point sets [27]. It is constructed as a symmetric function using shared MLPs and max-pooling, which guarantees its permutation invariance. However, PointNet only learns either single-point or global features, and is thus limited in capturing interactions among points. PointNet++ is built on top of PointNet; it learns hierarchical point cloud features and is able to aggregate features from local geometric neighborhoods using set abstraction [15].
Following them, some works have extended point-based methods with various local aggregation operators. These explorations can be categorized into three groups: convolution-based [28]–[35], graph-based [36]–[39], and attention-based [8], [9], [40]–[42] methods.
1) Convolution-Based Methods: References [31] and [32] learn the kernel within a local region through predefined geometric priors. Another type of point convolution, KPConv [34], relates the weight matrices to predefined kernel points in 3D space. However, the fixed kernel points may not be optimal for modeling complicated 3D position variations. PAConv constructs a position-adaptive convolution operator with a dynamic kernel [35], which assembles basic weight matrices from a Weight Bank; the assembling coefficients are learned from relative point positions by MLPs.
2) Graph-Based Methods: The rise of graph-based methods began with the dynamic graph convolutional neural network (DGCNN) [37], which learns on graphs dynamically updated at each layer. It proposes a local feature aggregation operator, named EdgeConv, which generates edge features describing the semantic relationships between key points and their neighbors in the feature space. In addition, CurveNet explores geometric information by taking guided walks that group contiguous segments of points as curves [39].
3) Attention-Based Methods: Point Cloud Transformer designs offset attention for extracting global features and uses a neighbor embedding strategy to augment local feature representation [9]. Point Transformer proposes a modified Transformer architecture that aggregates local features with vector attention and relative position encoding [8]. Stratified Transformer [12], inspired by Swin Transformer [43], partitions the 3D space into non-overlapping cubic windows and proposes a stratified strategy for sampling keys.
In addition, PointASNL [44] leverages a non-local network [45] and an adaptive sampling module to enhance long-range dependency learning. Recently, PointNeXt [46] explored more advanced training and data augmentation strategies with the PointNet++ backbone to further improve accuracy and efficiency.
Fig. 2. Overall architecture of the point content-based Transformer (PointConT). The network is composed of a stack of Inception feature aggregator blocks.
In recent years, Transformer architectures have shown great success in 2D image understanding compared to the more familiar convolutional networks. Vision Transformer (ViT) [47] is the first work that successfully applies a Transformer encoder to images. It divides an image into non-overlapping patches (tokens), which are then linearly embedded. Further, Pyramid ViT (PVT) [48], [49] introduces a hierarchical structure into the Transformer framework. Transformer in Transformer (TNT) [50] extends the ViT baseline with sub-patch-wise attention within patches. More recently, the methods of [43], [51]–[53] compute attention within local windows. Swin [43] is a representative approach that employs two key concepts to improve the original ViT: hierarchical feature maps and shifted window attention. Beyond image-space hand-crafted windows, the dynamic group Transformer (DGT) [54] and the bilateral local attention vision Transformer (BOAT) [55] exploit feature-space locality. DGT [54] dynamically divides queries into multiple groups and selects the most relevant keys/values for each group to compute attention. BOAT [55] supplements existing window-based local attention with a feature-space local attention module, which significantly improves the ability to model long-range feature dependencies.
Although Transformers are highly capable of establishing long-range dependencies, recent studies present intuitive visual evidence that they lack the ability to capture high frequencies, which predominantly convey local information [16], [17]. In other words, the Transformer acts as a low-pass filter. To address this issue, the Inception Transformer (iFormer) [17] designs a channel splitting mechanism that adopts parallel convolution and self-attention paths as high-frequency and low-frequency mixers.
Inspired by the concepts of feature-space local attention and features in different frequencies, we adopt a content-based Transformer and an Inception feature aggregator for 3D point cloud classification.
An overview of the proposed PointConT architecture is shown in Fig. 2. The backbone consists of five hierarchical stages of Inception feature aggregator blocks.
Given an input point cloud $p \in \mathbb{R}^{N \times 3}$ containing $N$ points in 3-dimensional space, the "Stage 1" Inception feature aggregator block partitions the point cloud into overlapping patches and then embeds the input coordinates into a new feature space (with dimension denoted as $C$). The network halves the number of points and doubles the number of feature dimensions stage by stage. Consequently, the output contains $N/2^{m}$ points with $2^{m-1}C$ feature dimensions at the $m$-th stage. For classification, the final classifier head is a global max-pooling followed by two linear layers.
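As a quick illustration of this schedule, the following minimal sketch prints the number of points and feature channels produced by each stage; the values N = 1024 input points and C = 64 base channels are assumptions for illustration, not the paper's configuration.

```python
# Minimal sketch of the stage schedule described above: stage m keeps
# N / 2**m points with 2**(m - 1) * C channels. N = 1024, C = 64 assumed.
def stage_shapes(num_points=1024, base_dim=64, num_stages=5):
    return [(num_points // 2 ** m, base_dim * 2 ** (m - 1))
            for m in range(1, num_stages + 1)]

for m, (pts, ch) in enumerate(stage_shapes(), start=1):
    print(f"stage {m}: {pts} points x {ch} channels")
    # stage 1: 512 points x 64 channels ... stage 5: 32 points x 1024 channels
```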
Fig. 3. The details of the Inception feature aggregator.
where $f_j - f_i$ denotes subtracting the centroid feature $f_i$ from each neighbor feature $f_j$ to obtain the neighboring features relative to the centroid $i$, $\Vert\cdot\Vert$ is the concatenation operation, and MLP is a simple network that includes a point-wise convolutional layer, a batch normalization layer, and an activation function. Note that, unlike DGCNN, which defines its KNN in the feature space, we perform the neighbor search in the 3D space.
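A minimal PyTorch sketch of this neighbor embedding is given below. The helper names (knn_indices, PatchEmbedding) and the choice of k are our assumptions, and the sampling of patch centers (e.g., by FPS) is omitted for brevity so that every point acts as a centroid.

```python
import torch
import torch.nn as nn

def knn_indices(xyz: torch.Tensor, k: int) -> torch.Tensor:
    """Indices of the k nearest neighbors in 3D space. xyz: (B, N, 3)."""
    dist = torch.cdist(xyz, xyz)                         # (B, N, N) pairwise distances
    return dist.topk(k, dim=-1, largest=False).indices   # (B, N, k)

class PatchEmbedding(nn.Module):
    """Sketch of the neighbor embedding: MLP over [f_j - f_i || f_i] per patch."""

    def __init__(self, in_dim: int, out_dim: int, k: int = 16):
        super().__init__()
        self.k = k
        # point-wise convolution + batch norm + activation, as described above
        self.mlp = nn.Sequential(
            nn.Conv2d(2 * in_dim, out_dim, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_dim),
            nn.ReLU(inplace=True),
        )

    def forward(self, xyz: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
        # xyz: (B, N, 3), feats: (B, N, C) -> grouped patch features (B, N, k, out_dim)
        B, N, C = feats.shape
        idx = knn_indices(xyz, self.k)                                # (B, N, k)
        neighbors = torch.gather(
            feats.unsqueeze(1).expand(B, N, N, C), 2,
            idx.unsqueeze(-1).expand(B, N, self.k, C))                # (B, N, k, C)
        center = feats.unsqueeze(2).expand_as(neighbors)              # (B, N, k, C)
        grouped = torch.cat([neighbors - center, center], dim=-1)     # (B, N, k, 2C)
        grouped = grouped.permute(0, 3, 1, 2)                         # (B, 2C, N, k)
        return self.mlp(grouped).permute(0, 2, 3, 1)                  # (B, N, k, out)
```

The resulting per-patch features are what the two pooling branches described next operate on.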
Next, we propose a mix pooling strategy to aggregate the features of local patches.In most previous works, max-pooling has been verified as effective in aggregating the local features, for the reason that it can capture high frequencies that predominantly convey local information.Instead of directly combining max-pooling and Transformer block in a serial manner, in our PointConT, we use a parallel structure composed of a high-frequency aggregation branch and a low-frequency aggregation branch.The max-pooling operation aggregates high-frequency signals, while the average pooling operation filters low-frequency representations.
1) High-Frequency Aggregation Branch: This branch can be defined as
$$f_{\text{high}} = \text{ResMLP}(\text{MaxPool}(f))$$
where MaxPool and ResMLP denote the max-pooling operation and the residual MLP block, respectively.
2) Low-Frequency Aggregation Branch: We simply apply an average pooling layer (AvgPool) before the content-based Transformer (ConT); this design allows the content-based Transformer to focus on embedding low-frequency information. This branch can be defined as
$$f_{\text{low}} = \text{ConT}(\text{AvgPool}(f)).$$
Finally, we concatenate the features from the high-frequency aggregation branch and the low-frequency aggregation branch, and feed them to an MLP block to obtain the Inception aggregator output features $f'$.
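Putting the two branches together, a hedged PyTorch sketch of the aggregator could look as follows; the ResMLP width, the fusion MLP, and the cont_block interface are assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class InceptionAggregator(nn.Module):
    """Sketch of the two-branch aggregation: max-pool -> residual MLP (high frequency)
    in parallel with average pool -> content-based Transformer (low frequency)."""

    def __init__(self, dim: int, cont_block: nn.Module):
        super().__init__()
        self.res_mlp = nn.Sequential(            # residual MLP on the max-pooled features
            nn.Linear(dim, dim), nn.ReLU(inplace=True), nn.Linear(dim, dim))
        self.cont = cont_block                   # content-based Transformer block
        self.out_mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(inplace=True))

    def forward(self, grouped: torch.Tensor) -> torch.Tensor:
        # grouped: (B, N, k, C) features of each local patch
        f_max = grouped.max(dim=2).values        # (B, N, C)  high-frequency signal
        f_high = f_max + self.res_mlp(f_max)     # residual MLP branch
        f_avg = grouped.mean(dim=2)              # (B, N, C)  low-frequency signal
        f_low = self.cont(f_avg)                 # content-based Transformer branch
        return self.out_mlp(torch.cat([f_high, f_low], dim=-1))  # fuse both branches
```

For a quick shape check, `InceptionAggregator(64, nn.Identity())` applied to a `(B, N, k, 64)` tensor returns `(B, N, 64)` features.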
Unlike Point Transformer [8], which computes self-attention among local spatial neighbors, we propose content-based attention, as visualized in Fig. 4. It dynamically divides all queries into multiple clusters according to their content (i.e., features) at each block, and selects the corresponding keys and values to compute local self-attention.
Fig. 4. Illustration of content-based attention. It dynamically clusters all queries into multiple groups and computes self-attention within each group.
Let $X \in \mathbb{R}^{S \times d}$ be a set of feature vectors, where $d$ is the feature space dimension and $S$ denotes the number of feature vectors. We then compute the embeddings $Q = XW_Q$, $K = XW_K$, and $V = XW_V$ to represent the queries, keys, and values, respectively.
We then apply a clustering algorithm so that the queries are scattered into different clusters. K-means is a classic algorithm for clustering problems. However, K-means generally produces clusters containing varying numbers of queries, and therefore cannot be implemented in a parallel way on GPUs. To address this issue, we adopt the balanced binary clustering algorithm proposed in BOAT [55], which hierarchically and equally divides a set of queries into two clusters.
Self-attention is then computed independently within each subset, $Y_i = \mathrm{SA}(Q_i, K_i, V_i)$, where $Y_i$ is the output of the $i$-th subset. Lastly, all subsets are merged into the output $Y \in \mathbb{R}^{S \times d}$ in keeping with their original order.
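The gather/attend/scatter pattern implied here can be sketched as follows in PyTorch; the function name and the assumption that cluster assignments arrive as an index vector over equal-size clusters are ours (scalar attention is used for brevity).

```python
import torch

def cluster_attention(q, k, v, cluster_idx, num_clusters):
    """Sketch: self-attention computed independently inside each cluster, with the
    outputs scattered back to the original point order.
    q, k, v: (S, d); cluster_idx: (S,) cluster id per point (equal-size clusters)."""
    S, d = q.shape
    order = cluster_idx.argsort()                       # group points of the same cluster
    csize = S // num_clusters
    qg = q[order].view(num_clusters, csize, d)          # (L, S/L, d)
    kg = k[order].view(num_clusters, csize, d)
    vg = v[order].view(num_clusters, csize, d)
    attn = torch.softmax(qg @ kg.transpose(1, 2) / d ** 0.5, dim=-1)  # (L, S/L, S/L)
    out = (attn @ vg).reshape(S, d)                     # per-cluster outputs
    y = torch.empty_like(out)
    y[order] = out                                      # restore the original order
    return y
```

With the balanced binary clustering described next, the equal-cluster-size assumption holds by construction.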
The multi-head configuration is standard practice in Transformers; we accordingly use multiple heads, and each head performs its query/key/value embeddings and clustering independently. This setting brings considerable clustering diversity, as visualized in Fig. 5.
1) Hierarchical Binary Clustering: Similar to K-means clustering, where the cluster assignment relies on the distances between each sample and all cluster centroids, our binary clustering starts with a random division of the queries into two clusters and then calculates the two cluster centroids, denoted as $c_1$ and $c_2$, respectively. After that, we compute a distance ratio to perform the hard assignment: each query $q_i$ is scored by $r_i = \mathrm{dist}(q_i, c_1)/\mathrm{dist}(q_i, c_2)$, and the half of the queries with the smallest ratios forms $\mathcal{C}_1$ while the remaining half forms $\mathcal{C}_2$,
Fig. 5. Visualization of point clustering in different heads.
where $\mathrm{dist}$ denotes the Euclidean distance in the feature space, and $\mathcal{C}_1$ and $\mathcal{C}_2$ represent the two equal-size clusters produced by balanced binary clustering. After performing $n$ iterations ($n = \log_2 L$) of binary clustering, we obtain $L$ subsets of the same size.
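A simplified sketch of this procedure is shown below; the random initial split and the absence of centroid-refinement iterations are simplifications of the BOAT-style algorithm, and num_clusters is assumed to be a power of two that divides the number of points.

```python
import torch

def balanced_binary_split(x, idx):
    """Split the points indexed by idx into two equal-size clusters using the
    distance-ratio rule described above (a sketch; initialization details assumed).
    x: (S_total, d) features; idx: (S,) long tensor of point indices."""
    S = idx.shape[0]
    perm = idx[torch.randperm(S)]                    # random initial division
    c1 = x[perm[: S // 2]].mean(dim=0)               # centroid of the first half
    c2 = x[perm[S // 2:]].mean(dim=0)                # centroid of the second half
    ratio = torch.norm(x[idx] - c1, dim=-1) / (torch.norm(x[idx] - c2, dim=-1) + 1e-8)
    order = idx[ratio.argsort()]                     # smallest ratio -> closest to c1
    return order[: S // 2], order[S // 2:]           # hard, equal-size assignment

def hierarchical_clusters(x, num_clusters):
    """n = log2(num_clusters) rounds of balanced splits -> equal-size subsets."""
    subsets = [torch.arange(x.shape[0])]
    while len(subsets) < num_clusters:
        subsets = [half for s in subsets for half in balanced_binary_split(x, s)]
    cluster_idx = torch.empty(x.shape[0], dtype=torch.long)
    for cid, s in enumerate(subsets):
        cluster_idx[s] = cid                         # cluster id per point
    return cluster_idx
```

The resulting cluster_idx can be passed directly to the cluster-wise attention sketch given earlier.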
2) Choice of SA: In Point Transformer, the choice of self-attention (SA) has a crucial influence on the properties of the Transformer block. One choice of SA is the standard scalar (dot-product) attention
$$\mathrm{SA}(Q_i, K_i, V_i) = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d}}\right) V_i.$$
Another choice of SA is the vector attention [8], [56] that we adopt in this paper, in which the attention weights are vector-valued, computed channel-wise from the relation between queries and keys, and modulate each channel of the values individually.
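The two attention forms can be contrasted with the following per-cluster, single-head sketch; the particular MLP used to produce the vector weights is an assumed minimal choice in the spirit of [8], [56], not the authors' exact design.

```python
import torch
import torch.nn as nn

def scalar_attention(q, k, v):
    """Standard scalar (dot-product) attention within one cluster. q, k, v: (m, d)."""
    attn = torch.softmax(q @ k.transpose(0, 1) / q.shape[-1] ** 0.5, dim=-1)  # (m, m)
    return attn @ v

class VectorAttention(nn.Module):
    """Vector attention sketch: a per-channel weight is produced from q_i - k_j,
    so each feature channel is modulated individually (cf. [8], [56])."""

    def __init__(self, dim: int):
        super().__init__()
        self.weight_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(inplace=True),
                                        nn.Linear(dim, dim))

    def forward(self, q, k, v):                        # q, k, v: (m, d)
        diff = q.unsqueeze(1) - k.unsqueeze(0)         # (m, m, d) pairwise differences
        w = torch.softmax(self.weight_mlp(diff), dim=1)  # channel-wise weights over keys
        return (w * v.unsqueeze(0)).sum(dim=1)         # (m, d)
```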
3) Complexity Analysis: Given a set of feature vectors of size $S \times d$, the computational complexity of a local multi-head self-attention (MSA) module with standard scalar attention is $\mathcal{O}(S \times (4kd^2 + 2k^2d)) = \mathcal{O}(4Skd^2 + 2Sk^2d)$, where $k$ is the number of points in a local neighborhood. Unlike scalar attention, vector attention further reduces the complexity to $\mathcal{O}(4Skd^2 + 2Skd)$. In our PointConT, the hierarchical binary clustering algorithm divides the feature vectors in a non-overlapping manner, dramatically reducing the complexity to $\mathcal{O}(4Sd^2 + 2Sd)$.
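As an illustrative numerical example with assumed sizes $S = 1024$ feature vectors, $d = 64$ channels, and a neighborhood/cluster size $k = 16$, counting multiplications according to the expressions above gives

```latex
% multiply counts with the assumed sizes S = 1024, d = 64, k = 16
\begin{align*}
\text{scalar local attention:}\quad & 4Skd^2 + 2Sk^2d \approx 2.7\times10^{8} + 3.4\times10^{7} \approx 3.0\times10^{8},\\
\text{vector local attention:}\quad & 4Skd^2 + 2Skd  \approx 2.7\times10^{8} + 2.1\times10^{6} \approx 2.7\times10^{8},\\
\text{clustered attention (ours):}\quad & 4Sd^2 + 2Sd \approx 1.7\times10^{7} + 1.3\times10^{5} \approx 1.7\times10^{7},
\end{align*}
```

roughly a 16x reduction relative to vector local attention under these assumed sizes, mainly because each point is embedded only once when the clusters do not overlap.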
In this section, we present experimental results of the proposed model on the shape classification task. All experiments are performed on a single Tesla V100 GPU.
Implementation Details: We implement PointConT in the PyTorch framework and train the network using the SGD (stochastic gradient descent) optimizer (momentum and weight decay set to 0.9 and 0.0001, respectively), a cosine learning rate schedule starting at 0.001 (with 10 warm-up steps), and cross-entropy loss with label smoothing. We fix the random seed in all experiments to eliminate the influence of randomness.
For shape classification training, we use only 1024 uniformly sampled points as network input. Moreover, we use RSMix [57] in addition to random scaling and translation for data augmentation. We train PointConT on ModelNet40 and ScanObjectNN with batch sizes of 32 and 64 for 300 and 400 epochs, respectively. For testing, the batch size is set to 16 and 32 on ModelNet40 and ScanObjectNN, respectively.
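A minimal PyTorch sketch of this training setup is given below; the warm-up start factor, the label-smoothing value, and the scheduler composition are assumptions, since the text only specifies the optimizer settings, the initial learning rate, the warm-up length, and the use of label smoothing.

```python
import torch
import torch.nn as nn

def build_training_setup(model: nn.Module, epochs: int = 300, warmup_epochs: int = 10):
    """Sketch of the described training configuration (hyper-parameter values from the text)."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
                                momentum=0.9, weight_decay=1e-4)
    # cosine annealing after a linear warm-up of `warmup_epochs` epochs
    warmup = torch.optim.lr_scheduler.LinearLR(
        optimizer, start_factor=0.1, total_iters=warmup_epochs)   # start factor assumed
    cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=epochs - warmup_epochs)
    scheduler = torch.optim.lr_scheduler.SequentialLR(
        optimizer, schedulers=[warmup, cosine], milestones=[warmup_epochs])
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1)          # smoothing value assumed
    return optimizer, scheduler, criterion
```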
We evaluate the model on the ModelNet40 [20] shape classification benchmark, which contains 12 308 computer-aided design (CAD) models of point clouds from 40 common categories. The dataset is divided into 9840 models for training and 2468 models for testing.
The results are presented in Table I. The overall accuracy of PointConT on ModelNet40 is 93.5%, which is a competitive result among attention-based models. Besides, PointConT achieves a high inference speed (166 samples/second in training and 279 samples/second in testing), which is 3.5x faster than the original PointMLP [58] and 1.4x faster than the lightweight version PointMLP-elite. We visualize the clustering results at each stage in Fig. 6; the clusters are able to cover long-range dependencies.
TABLE I SHAPE CLASSIFICATION RESULTS ON THE MODELNET40 DATASET (OA: OVERALL ACCURACY)
We further perform experiments on a recent real-world point cloud classification dataset, ScanObjectNN [21], which consists of about 15 000 objects from 15 categories. We only report results on the hardest variant, PB_T50_RS, which includes perturbations from rigid transformations. Unlike the sampled virtual point clouds in ModelNet40, objects in ScanObjectNN are obtained from real-world scans. Therefore, the point clouds in ScanObjectNN are noisy (background, occlusions) and not axis-aligned, which poses a significant challenge to existing point cloud analysis methods.
Table II shows the classification results on ScanObjectNN. PointConT outperforms prior models with 88.0% overall accuracy without voting [31] and reaches 90.3% Top-1 accuracy when averaging 10 prediction votes. This suggests that PointConT is effective on real-world point clouds.
We perform ablation studies for the key designs of our methods on the shape classification task.All experiments are conducted under the same training settings.
Fig. 6. Visualization of clustering results at each stage (points of the same cluster are plotted with the same color; different clusters are distinguished by random colors).
TABLE II SHAPE CLASSIFICATION RESULTS ON THE SCANOBJECTNN DATASET (* DENOTES METHOD EVALUATED WITH VOTING STRATEGY [31]. MACC: MEAN CLASS ACCURACY; OA: OVERALL ACCURACY)
1) Component Ablation Study: Table III reports the classification results of removing each component from PointConT. Comparing Exp. III and Exp. IV, we notice that with the content-based Transformer, the model improves by 0.6% on ModelNet40 and 1.7% on ScanObjectNN. This demonstrates that the content-based Transformer can enhance the representation power of point clouds. Remarkably, the result of Exp. V drops considerably. In the absence of average pooling, Exp. V means that the content-based Transformer follows the max-pooling and residual MLP in a serial manner, which indicates that the mix pooling strategy plays an important role in PointConT. By combining all these components, we obtain the best results on ModelNet40 and ScanObjectNN, which implies the effectiveness of the content-based Transformer and the Inception feature aggregator in point cloud classification.
2) The Number of Stages: In Table IV, we ablate the number of stages in PointConT. We gradually increase the depth of PointConT on the ModelNet40 and ScanObjectNN datasets to test the effectiveness of greater depth. We find that five stages are sufficient for full exploitation; a deeper model introduces redundant information and degrades performance.
TABLE III ABLATION STUDY (MP: MAX-POOLING; RES: RESIDUAL MLP; AP: AVERAGE POOLING; CONT: CONTENT-BASED TRANSFORMER. METRIC: OA)
TABLE IV ABLATION STUDY: THE NUMBER OF STAGES
3) Local Cluster Size: We investigate the setting of the local cluster size and show the results in Table V. The best performance is achieved when the local cluster size is set to 16.
4) Similarity Metric: We compare two important similarity measures for clustering: cosine similarity and Euclidean distance. The cosine similarity is proportional to the dot product of two vectors; hence, vectors with a high cosine similarity lie in similar directions from the origin. The Euclidean distance corresponds to the L2-norm of the difference between vectors, so vectors with a small Euclidean distance are located in a close region of the vector space. The results in Table VI show that clustering according to Euclidean distance is better than cosine similarity on the classification task.
5) Attention Type: Finally, we compare the scalar attention and the vector attention introduced in Section III-C. The results are shown in Table VII. The attention module is clearly more effective than no attention, and vector attention slightly outperforms scalar attention. As described in Point Transformer, vector attention supports adaptive modulation of individual feature channels, rather than only the entire feature vector, which can be beneficial in 3D point cloud analysis.
TABLE V ABLATION STUDY: LOCAL CLUSTER SIZE
TABLE VI ABLATION STUDY: SIMILARITY METRIC
TABLE VII ABLATION STUDY: ATTENTION TYPE
In this paper, we have proposed the point content-based Transformer (PointConT), a simple yet powerful architecture that clusters the sampled points with similar features into the same class and computes self-attention within each class. Compared to local spatial attention, the content-based Transformer can establish long-range feature dependencies. Moreover, we designed an Inception feature aggregator to combine high-frequency and low-frequency information in a parallel manner: the max-pooling operation aggregates high-frequency signals, while the average pooling operation and the Transformer filter low-frequency representations. We hope that this study will provide valuable insights into point cloud Transformer designs.
We note that the balanced clustering algorithm generates clusters of the same size, which limits the generality and flexibility of the proposed PointConT. In future work, more advanced clustering approaches (e.g., [63], [64]) and CUDA kernels could be used to implement cluster-wise matrix multiplication.