Last week I gave a talk in the Omek-3D forum. The title of the talk was (the same as the title of this post) “**3D Point Cloud Classification using Deep Learning**”.

Here is a short summary (that came out a little longer than expected) of what I presented there.

**UPDATE 1** (February 2018): We recently uploaded to arXiv a new paper on 3D point cloud classification. Check it out, we introduced a new representation which enables using CNNs. Our code will be available after publication. I am also working on additional posts on this subject.

**UPDATE 2** (April 2018): I am starting to actively search for a post-doc position (to start summer 2019). If you know a professor that might be looking for someone that knows his way around deep learning applied to point clouds (preferably with a robotics / CAD/engineering orientation), please contact me.

**UPDATE 3** (April 2018): A fellow scholar recently referred me to a few papers I wasn’t familiar with which are a great contribution to this post. They are added below.

##### Introduction

Deep learning on 2D images has been extensively researched in the past few years. It achieves excellent results on classification tasks thanks to two main things:

- Convolutional neural networks.
- Data – tons of image data is available.

For 3D, data is now growing rapidly. Whether it originates from a human-designed CAD model or from a point cloud scanned by a LiDAR sensor or an RGBD camera – point clouds are everywhere. In addition, most systems acquire 3D directly (rather than taking images and processing them).

Therefore, one of the expected next steps in research is how can we apply these amazing deep learning tools, that work so well for images, on 3D point clouds?

##### Challenges

It turns out that it is not that simple. I subdivide the challenges into two main categories: the ‘new’ challenges related to the learning process and ‘old’ challenges related to data corruptions.

Neural Network Challenges:

- Unstructured data (no grid): point clouds are XYZ points distributed in space. There is no structured grid to slide CNN filters over.
- Invariance to permutations: a point cloud is essentially a long list of points (an n×3 matrix, where n is the number of points). Geometrically, the order of the points doesn’t matter, but in the underlying matrix structure it does, i.e. the same point cloud can be represented by two very different matrices (the image below illustrates this point).
- The number of points changes: In images, the number of pixels is a given constant and depends on the camera. The number of points, however, may vary dramatically, depending on the sensor.
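
To make the permutation issue concrete, here is a minimal NumPy sketch. The toy cloud and the variable names are my own illustration, not taken from any of the papers below. It reorders the rows of a small point cloud and checks that the matrix changes while the set of points, and a symmetric function of it, does not.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy point cloud: n points, each row is one (x, y, z) point.
points = rng.random((5, 3))

# Reversing the row order gives a different matrix for the same geometry.
reordered = points[::-1]

# Compared as matrices, row order matters.
same_matrix = np.array_equal(points, reordered)

# Compared as sets of points, row order is ignored.
same_set = {tuple(p) for p in points} == {tuple(p) for p in reordered}

# A symmetric function, such as a per-coordinate max, ignores row order too.
same_max = np.array_equal(points.max(axis=0), reordered.max(axis=0))

print(same_matrix, same_set, same_max)
```

Any network that consumes the raw n×3 matrix has to behave like the last two comparisons, not like the first one.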

Data Challenges:

- Missing data: The scanned models are usually occluded and parts of the data are missing.
- Noise: All sensors are noisy. There are a few types of noise which include point perturbations and outliers. It means that a point has some probability to be within a sphere of a certain radius around the place it was sampled (perturbations) or it may appear in a random position in space (outliers).
- Rotation: a car turning left and the same car turning right will have different point clouds that represent the same car, right?
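
These corruptions are easy to simulate, which is handy for stress-testing a classifier. The sketch below is only an illustration: the noise radius, the outlier ratio, the occlusion rule and the function name `rotate_z` are all assumptions of mine, not values from any dataset or paper.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.uniform(-1.0, 1.0, size=(1000, 3))  # stand-in for a scanned model

# Missing data: crudely mimic occlusion by dropping everything on one side of a plane.
visible = points[points[:, 0] < 0.5]

# Perturbation: move every point to a random position inside a small sphere around it.
radius = 0.02  # assumed noise radius
directions = rng.normal(size=points.shape)
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
offsets = directions * radius * rng.random((len(points), 1)) ** (1.0 / 3.0)
jittered = points + offsets

# Outliers: replace a small fraction of the points with random positions in space.
outlier_ratio = 0.05  # assumed fraction of outliers
n_out = int(outlier_ratio * len(points))
outlier_idx = rng.choice(len(points), size=n_out, replace=False)
corrupted = jittered.copy()
corrupted[outlier_idx] = rng.uniform(-1.0, 1.0, size=(n_out, 3))

def rotate_z(pts, angle):
    """Rotate an (n, 3) point cloud about the vertical (z) axis."""
    c, s = np.cos(angle), np.sin(angle)
    rotation = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return pts @ rotation.T

# Rotation: same object, different coordinates.
rotated = rotate_z(points, np.pi / 2)
```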

The straightforward approach to applying deep learning to point clouds is to convert the data into a volumetric representation, e.g. a voxel grid. This way we can train a CNN with 3D filters without the NN issues (the grid provides structure, the conversion to the grid solves the permutation problem, and the number of voxels is constant). However, there are some downsides to this. Volumetric data can become very big, very fast. Let’s think of a typical image size of 256×256 = 65,536 pixels; now let’s add a dimension: 256×256×256 = 16,777,216 voxels. This is a lot of data (even though GPU memory is growing every day). This also means very slow processing time. Therefore, we usually need to compromise and take a lower resolution (some methods use 64×64×64), but it comes at the cost of quantization errors.
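
As a rough illustration of the volumetric route, here is a minimal occupancy-grid sketch in NumPy. The function name `voxelize` and the choice of a binary grid fitted to the bounding cube are my assumptions; the volumetric methods in the literature differ in the details.

```python
import numpy as np

def voxelize(points, resolution):
    """Binary occupancy grid of shape (resolution,) * 3 for an (n, 3) point cloud."""
    mins = points.min(axis=0)
    # Use one cube for all three axes so the aspect ratio of the object is kept.
    extent = (points.max(axis=0) - mins).max()
    if extent == 0:
        extent = 1.0
    idx = np.floor((points - mins) / extent * resolution).astype(int)
    idx = np.clip(idx, 0, resolution - 1)  # points on the far face fall into the last cell
    grid = np.zeros((resolution,) * 3, dtype=bool)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return grid

rng = np.random.default_rng(0)
cloud = rng.random((2048, 3))
grid = voxelize(cloud, 64)

# Memory per sample if the grid is stored as float32 (4 bytes per voxel).
for r in (64, 256):
    print(r, r ** 3, "voxels,", r ** 3 * 4 / 2 ** 20, "MiB")
```

The grid has the same shape no matter how many points go in or in which order, which is exactly why it removes the three NN issues. The price is the cubic growth in memory and the fact that all points falling into one cell collapse into a single bit.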

Sooooooooooo, the desired solution is a deep learning method that will work **directly** on the points.

##### The Dataset

Like every vision task, if you want to prove that your method works you have to use a benchmark.

I focused on Princeton’s ModelNet40 dataset. It contains 12,311 CAD models from 40 object categories (such as airplanes, tables, plants etc.) represented as triangle meshes. The data is split into 9,843 models for training and 2,468 for testing. I did some visualization of the dataset using the GitHub code for PointNet (thanks Charles!).

##### Related Work

In the talk, I surveyed three recent papers that did just that (apply deep learning to point clouds). But before I get into that I wanted to show a nice bar-plot that summarizes the latest accuracy results on the dataset. It shows the type of data each method works on. You can see that in 2015 most methods worked on multi-view data (which is a short way of saying – let’s take a few pictures of the 3D model and process them using 2D methods), in 2016 more methods used volumetric representations, alongside the pioneers of point cloud learning, and 2017 saw a large increase in point-based methods.

*All images below were taken from the original papers linked in the title (credit to the authors)

**PointNet (CVPR2017)**

~~The pioneers! (from Stanford, no surprise there) They were the first to take on this challenge (The deletion is because a similar paper was published at around the same time. I don’t want to step on anyone’s toes).~~ The paper was posted on arXiv in 2016 and immediately got a lot of attention. They did something surprisingly simple and proved why it works well: they trained an MLP on each point separately (with shared weights across points). Each point was ‘projected’ to a 1024-dimensional space. Then, they solved the order problem using a symmetric function (max-pool) over the points. This yielded a 1 x 1024 global feature for every point cloud, which they fed into a nonlinear classifier. They also solved the rotation problem using a ‘mini-network’ they called T-net. It learns a transformation matrix over the points (3 x 3) and over mid-level features (64 x 64). Calling it ‘mini’ is a bit misleading since it is actually about the size of the main network. In addition, because of the large increase in the number of parameters, they introduced a loss term to constrain the 64 x 64 matrix to be close to orthogonal.
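
To show how little machinery the core idea needs, here is a forward-pass-only sketch in NumPy. It is not the authors’ implementation: the weights are random and untrained, the layer sizes (3 → 64 → 128 → 1024) are picked for illustration, the T-nets are left out, and the function names are mine. The penalty function follows the form of the orthogonality regularizer described in the paper, the squared Frobenius norm of I − AAᵀ.

```python
import numpy as np

rng = np.random.default_rng(0)

def shared_mlp(points, weights):
    """Apply the same small ReLU MLP to every point (row) independently."""
    h = points
    for W, b in weights:
        h = np.maximum(h @ W + b, 0.0)
    return h

def global_feature(points, weights):
    """Per-point features followed by a max over the point axis (symmetric function)."""
    return shared_mlp(points, weights).max(axis=0)

def orthogonality_penalty(A):
    """Squared Frobenius norm of (I - A A^T); zero when A is orthogonal."""
    k = A.shape[0]
    return float(np.sum((np.eye(k) - A @ A.T) ** 2))

# Random, untrained weights with illustrative layer sizes.
dims = [3, 64, 128, 1024]
weights = [(rng.normal(0.0, 0.1, (i, o)), np.zeros(o))
           for i, o in zip(dims[:-1], dims[1:])]

cloud = rng.random((500, 3))
feat = global_feature(cloud, weights)                 # shape (1024,)
feat_reordered = global_feature(cloud[::-1], weights)

print(feat.shape, np.allclose(feat, feat_reordered))
```

Because the MLP never mixes information between rows and the max runs over the point axis, the global feature does not depend on the order of the points, and the same weights work for any number of input points.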

They also used a similar network for part segmentation.

Oh, and they also did scene semantic segmentation.

Oh, and they also did normal vector estimation.

Great work! I highly recommend the read (or you can watch the presentation video too).

This paper had a great 89.2% accuracy on the ModelNet40 dataset.

Cite: Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

The code is available on GitHub: PointNet code

**Deep sets (NIPS2017 / ICLR2017)**

These guys from CMU posted this work a few weeks before PointNet. The general idea is quite similar (not taking sides on who was first 🙂). While PointNet’s focus is mainly on the architecture, the focus of Deep sets is primarily on producing a layer similar to convolution for sets – the “permutation equivariance” (PE) layer, which has linear time complexity in the size of the set. They claim that the PE layer is the most expressive form of parameter sharing for its purpose. Experiments show better results for point-cloud classification (90%).
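
Here is a minimal sketch of what a permutation-equivariant layer can look like. It is one simple form (a per-element linear term plus a term computed from a pooled summary of the whole set), not necessarily the exact variant used in the papers’ experiments. The max pooling, the tanh nonlinearity, the sizes and the names `pe_layer`, `lam` and `gam` are my assumptions.

```python
import numpy as np

def pe_layer(x, lam, gam, bias):
    """Permutation-equivariant layer for a set x of shape (n, d_in).

    Every element gets its own linear term (x @ lam) plus a shared term
    computed from a symmetric summary of the whole set (here: max-pool).
    The cost is linear in the number of elements n.
    """
    pooled = x.max(axis=0, keepdims=True)  # (1, d_in), independent of element order
    return np.tanh(x @ lam + pooled @ gam + bias)

rng = np.random.default_rng(0)
d_in, d_out = 3, 16
lam = rng.normal(0.0, 0.5, (d_in, d_out))
gam = rng.normal(0.0, 0.5, (d_in, d_out))
bias = np.zeros(d_out)

x = rng.random((100, d_in))
perm = rng.permutation(len(x))

out = pe_layer(x, lam, gam, bias)
out_perm = pe_layer(x[perm], lam, gam, bias)

# Equivariance: permuting the input permutes the output rows in the same way.
print(np.allclose(out[perm], out_perm))
```

Note the difference from the PointNet max-pool: this layer is equivariant (the output is still a set, reordered along with the input), so several of them can be stacked before a final pooling step makes the result invariant.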

They also predicted the sum of MNIST digits.

They also did set anomaly detection.

Very interesting work, I highly recommend it for machine/deep learning-oriented readers.

Cite: Ravanbakhsh, Siamak, Jeff Schneider, and Barnabas Poczos. “Deep learning with sets and point clouds.” *arXiv preprint arXiv:1611.04500* (2016).

Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R. R., & Smola, A. J. (2017). Deep sets. In *Advances in Neural Information Processing Systems* (pp. 3394-3404).

The code is available on GitHub: Deep sets code

**PointNet++**

When I first read the PointNet paper, there was one thing that bothered me the most – why are they not using local neighborhoods of points? I guess it bothered them too, because not long after PointNet they introduced PointNet++. It is essentially a hierarchical version of PointNet. Each layer has three sub-stages: sampling, grouping, and PointNeting. In the first stage, they select centroids, and in the second stage they take the neighboring points around each centroid (within a given radius) to create multiple sub-point clouds. Then they feed them to a PointNet and get a higher dimensional representation of these sub-point clouds. Then they repeat the process (sample centroids, find their neighbors, and apply PointNet to their higher order representation to get an even higher one). They reported using 3 of these layers. They also tested different aggregation methods for the different hierarchy levels in order to overcome differences in sampling density (a big issue for most sensors, which sample densely when objects are near and sparsely when they are far away).
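
The sampling and grouping stages can be sketched in a few lines of NumPy. The paper selects centroids with farthest point sampling and groups with a ball query; the code below is my own simplified version of those two steps (no cap on the group size, brute-force distances), and the radius and centroid count are arbitrary.

```python
import numpy as np

def farthest_point_sampling(points, k, start=0):
    """Pick k indices so that each new centroid is as far as possible from those chosen so far."""
    chosen = [start]
    dist = np.linalg.norm(points - points[start], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))
        chosen.append(nxt)
        # Distance of every point to its nearest chosen centroid.
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return np.array(chosen)

def ball_query(points, centroids, radius):
    """For each centroid, return the indices of all points within `radius` of it."""
    d = np.linalg.norm(points[None, :, :] - centroids[:, None, :], axis=2)  # (k, n)
    return [np.flatnonzero(row <= radius) for row in d]

rng = np.random.default_rng(0)
cloud = rng.random((1024, 3))

centroid_idx = farthest_point_sampling(cloud, 32)            # sampling
groups = ball_query(cloud, cloud[centroid_idx], radius=0.2)  # grouping

# Each group, re-centered on its centroid, is a small point cloud
# that would be fed to a shared PointNet (the "PointNeting" stage).
local_clouds = [cloud[g] - cloud[c] for g, c in zip(groups, centroid_idx)]
print(len(local_clouds), min(len(g) for g in groups), max(len(g) for g in groups))
```

The groups have different sizes, which is fine here because a PointNet accepts any number of points.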

They got an improvement on the original PointNet with a 90.7% accuracy on ModelNet40. (It got even better scores when they incorporated additional features).

Cite: Charles R Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. arXiv preprint arXiv:1706.02413, 2017.

**Kd-Network**

This paper uses the well-known Kd-tree to create some order in the point cloud. Once the point cloud is structured, they learn weights for each node in the tree (which represents a subdivision along a specific axis). The weights are shared for each axis across a single tree level (so all the greens in the image below have shared weights because they subdivide the data along the x dimension). They tested random and deterministic subdivisions of space and reported that the random version works best. They report some drawbacks: it is sensitive to rotations (since they change the tree structure) and to noise (if it changes the tree structure). I found another drawback – for every number of input points you need to either upsample, downsample or train a new model.
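
To see where the structure comes from, here is a small sketch of a deterministic kd-tree construction in NumPy. The splitting rule (median split along the axis with the largest extent) and the nested-dict layout are my assumptions for illustration. The learned per-node weights of the actual network are not shown.

```python
import numpy as np

def build_kdtree(points, indices=None, depth=0):
    """Split recursively at the median along the axis with the largest extent."""
    if indices is None:
        indices = np.arange(len(points))
    if len(indices) == 1:
        return {"leaf": int(indices[0]), "depth": depth}
    pts = points[indices]
    axis = int(np.argmax(pts.max(axis=0) - pts.min(axis=0)))
    order = indices[np.argsort(pts[:, axis], kind="stable")]
    half = len(order) // 2
    return {
        "axis": axis,  # the split axis decides which shared weights a node would use
        "depth": depth,
        "left": build_kdtree(points, order[:half], depth + 1),
        "right": build_kdtree(points, order[half:], depth + 1),
    }

def leaves(node):
    """List (point index, depth) for every leaf, left to right."""
    if "leaf" in node:
        return [(node["leaf"], node["depth"])]
    return leaves(node["left"]) + leaves(node["right"])

rng = np.random.default_rng(0)
cloud = rng.random((1024, 3))
tree = build_kdtree(cloud)
leaf_list = leaves(tree)

print(len(leaf_list), {d for _, d in leaf_list})
```

With 1024 = 2^10 points, every leaf ends up at depth 10, which is why the number of input points is tied to the tree depth and a different point count means resampling or a new model.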

On ModelNet40, they report 90.6% accuracy for 1024 points (depth 10 tree) and 91.8% for ~32K points (depth 15 tree).

Despite its drawbacks, I find the approach very interesting. It also seems that there is some additional work to be done (like try other tree types as they suggested).

Oh, and they also did part segmentation.

Oh, and they also did shape retrieval.

Cite: Roman Klokov and Victor Lempitsky. Escape from cells: Deep kd-networks for the recognition of 3d point cloud models. arXiv preprint arXiv:1704.01222, 2017.

###### Summary

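In the vision community, scores are really important, so let’s line up the ModelNet40 accuracies reported above:

- PointNet: 89.2%
- Deep sets: 90%
- PointNet++: 90.7%
- Kd-Network: 90.6% (1024 points), 91.8% (~32K points)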

Since the scores are so close to each other and could be affected by many parameters, I am more interested in the differences between the approaches. PointNet and PointNet++ use a symmetric function to solve the order problem, while Kd-Network uses the Kd-tree. The Kd-tree also solves the structure problem, while in the PointNets an MLP is trained on each point separately.

I have a method of my own to solve the point cloud classification task cooking up. I’ll post it after it’s published and link to it. (Here is the link again if you missed the update at the top)

I just hope no one will publish it before me 🙂