Cameras as Rays: Pose Estimation via Ray Diffusion

In the latest episode of Talking Papers Podcast, I had the pleasure of hosting Jason Zhang, a PhD student at the Robotics Institute at Carnegie Mellon University. Our conversation revolved around his paper, “Cameras as Rays: Pose Estimation via Ray Diffusion,” which was recently accepted to ICLR 2024 as an oral presentation. The work takes a fresh approach to camera pose estimation: instead of following the standard practice of predicting a global parametrization of the camera extrinsics, it treats a camera as a bundle of rays. This representation couples tightly with spatial image features, which improves pose precision.

Zhang’s work reflects out-of-the-box thinking, an attribute I always advocate for in research. Representing a camera pose as a bundle of rays opens up a new way to approach the problem. Building on this representation, the authors develop both a regression model and a diffusion model. The outcome? Even with sparse views (fewer than 10), the results are impressive. The paper demonstrates state-of-the-art camera pose estimation on CO3D while generalizing to unseen object categories and in-the-wild captures.

It’s interesting how these professional encounters can lead to personal connections, and this was certainly the case with Jason. Before this podcast episode, our paths had never crossed. What started as a conversation about his paper turned into learning about his personal journey from the Bay Area to Pittsburgh. The exchange was so natural and comfortable that wrapping up the conversation felt almost preposterous. It’s encounters like this and research like Jason’s that keep the intellectual flame burning, and I eagerly look forward to his future work. Stay tuned to Talking Papers Podcast and ICLR 2024 for more such enriching experiences.


Jason Y. Zhang, Amy Lin, Moneish Kumar, Tzu-Hsuan Yang, Deva Ramanan, Shubham Tulsiani


Estimating camera poses is a fundamental task for 3D reconstruction and remains challenging given sparse views (<10). In contrast to existing approaches that pursue top-down prediction of global parametrizations of camera extrinsics, we propose a distributed representation of camera pose that treats a camera as a bundle of rays. This representation allows for a tight coupling with spatial image features, improving pose precision. We observe that this representation is naturally suited for set-level transformers and develop a regression-based approach that maps image patches to corresponding rays. To capture the inherent uncertainties in sparse-view pose inference, we adapt this approach to learn a denoising diffusion model, which allows us to sample plausible modes while improving performance. Our proposed methods, both regression- and diffusion-based, demonstrate state-of-the-art performance on camera pose estimation on CO3D while generalizing to unseen object categories and in-the-wild captures.
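
To make the "camera as a bundle of rays" idea concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes a pinhole camera with the convention x_cam = R · x_world + t, one ray per patch center on a small grid, and a Plücker-style (direction, moment) encoding of each ray; the encoding is an assumption on my part, since the abstract above does not name one. The function names `camera_to_rays` and `rays_to_center` are hypothetical. The second function shows that the global pose information is still recoverable from the distributed representation: a least-squares fit over all rays gives back the camera center.

```python
import numpy as np


def camera_to_rays(K, R, t, image_size, grid=4):
    """Turn a pinhole camera (x_cam = R @ x_world + t) into grid*grid rays.

    Each ray is a 6-vector: unit direction d and moment m = c x d, where c
    is the camera center in world coordinates (assumed encoding).
    """
    width, height = image_size
    # Pixel coordinates of the patch centers, in homogeneous form.
    us = (np.arange(grid) + 0.5) * width / grid
    vs = (np.arange(grid) + 0.5) * height / grid
    uu, vv = np.meshgrid(us, vs)
    pixels = np.stack([uu.ravel(), vv.ravel(), np.ones(grid * grid)], axis=1)
    # Unproject to world-space directions: d = R^T K^{-1} u (row-vector form).
    dirs = pixels @ np.linalg.inv(K).T @ R
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    center = -R.T @ t
    moments = np.cross(center, dirs)
    return np.concatenate([dirs, moments], axis=1)


def rays_to_center(rays):
    """Least-squares camera center c satisfying c x d_i = m_i for all rays."""
    dirs, moments = rays[:, :3], rays[:, 3:]
    dx, dy, dz = dirs.T
    zeros = np.zeros(len(dirs))
    # skew[i] @ c == cross(dirs[i], c), so cross(c, dirs[i]) == -skew[i] @ c.
    skew = np.stack([
        np.stack([zeros, -dz, dy], axis=1),
        np.stack([dz, zeros, -dx], axis=1),
        np.stack([-dy, dx, zeros], axis=1),
    ], axis=1)
    A = -skew.reshape(-1, 3)
    b = moments.reshape(-1)
    center, *_ = np.linalg.lstsq(A, b, rcond=None)
    return center


# Toy camera: made-up intrinsics, a 30-degree rotation about the y-axis.
angle = np.deg2rad(30.0)
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
              [0.0, 1.0, 0.0],
              [-np.sin(angle), 0.0, np.cos(angle)]])
t = np.array([0.1, -0.2, 2.0])

rays = camera_to_rays(K, R, t, image_size=(640, 480))
recovered_center = rays_to_center(rays)
true_center = -R.T @ t
```

In this toy setup `recovered_center` matches `true_center` up to floating-point error. In the paper's setting the rays would be predicted per image patch by a network rather than computed from a known camera, so the fit would be approximate.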


📚Why Having 10000 Parameters in Your Camera Model Is Better Than Twelve

📚A general imaging model and a method for finding its parameters




💻Project page


To stay up to date with his latest research, follow him on:

👨🏻‍🎓Personal website

👨🏻‍🎓Google scholar


This episode was recorded on March 13th, 2024.


If you would like to be a guest, sponsor or share your thoughts, feel free to reach out via email:


🎧Subscribe on your favourite podcast app

📧Subscribe to our mailing list

🐦Follow us on Twitter

🎥Subscribe to our YouTube channel