In the latest episode of the Talking Papers Podcast, I had the pleasure of hosting Jiahao Li, a talented PhD student at Toyota Technological Institute at Chicago (TTIC). Our conversation centered around his paper titled “Instant3D: Fast Text-to-3D with Sparse-View Generation and Large Reconstruction Model,” which was recently published in ICLR 2024.
The paper addresses the challenges of text-to-3D generation, specifically the slow inference, low diversity, and quality issues faced by existing methods. Jiahao introduced Instant3D, a novel approach that generates high-quality and diverse 3D assets from text prompts in a feed-forward manner. It follows a two-stage paradigm: a fine-tuned 2D text-to-image diffusion model first generates a sparse set of four structured, consistent views from the text prompt, and a transformer-based sparse-view reconstructor then directly regresses a NeRF (Neural Radiance Field) from those images.
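To make the two-stage design concrete, here is a minimal Python sketch of the pipeline's control flow. Everything in it (the function names, the stubbed outputs, and the triplane-style placeholder) is my own illustrative assumption rather than the authors' code, which is not public; only the overall structure, one feed-forward diffusion pass producing four views followed by one reconstructor pass regressing a NeRF, mirrors what the paper describes.

```python
# Conceptual sketch of Instant3D's two-stage, feed-forward pipeline.
# All function bodies are hypothetical stand-ins, not the released model.
from dataclasses import dataclass
import numpy as np


@dataclass
class NeRFParams:
    """Placeholder for the 3D representation regressed in stage 2."""
    features: np.ndarray  # e.g. triplane-style feature grids (illustrative)


def generate_four_views(prompt: str) -> np.ndarray:
    """Stage 1 (stub): a fine-tuned 2D text-to-image diffusion model produces
    four structured, consistent views of the object in a single shot."""
    rng = np.random.default_rng(0)
    return rng.random((4, 512, 512, 3))  # four RGB views (dummy data here)


def reconstruct_nerf(views: np.ndarray) -> NeRFParams:
    """Stage 2 (stub): a transformer-based sparse-view reconstructor directly
    regresses NeRF parameters from the four generated views."""
    return NeRFParams(features=np.zeros((3, 64, 64, 32)))  # dummy features


def instant3d(prompt: str) -> NeRFParams:
    views = generate_four_views(prompt)  # one diffusion pass, no per-asset optimization
    return reconstruct_nerf(views)       # one reconstructor pass


if __name__ == "__main__":
    nerf = instant3d("a ceramic mug shaped like a pumpkin")
    print(nerf.features.shape)
```

The point of the sketch is the absence of any optimization loop: both stages are single forward passes, which is what makes the reported ~20-second generation time possible compared with the hours taken by score-distillation methods.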
I must say that the results are truly impressive, especially given the remarkable speed at which they are generated. As someone with a strong inclination towards 3D, the notion of going through a 2D projection initially felt peculiar to me. However, I can’t argue with the visually striking outputs that Instant3D produces. This research underscores the importance of obtaining more and better 3D data, further pushing the boundaries of text-to-3D conversion.
It’s worth mentioning that I was introduced to Jiahao through Yicong Hong, another guest on the podcast who coincidentally happens to be a co-author on this paper as well. Yicong was a PhD student at the Australian National University (ANU) while I was doing my postdoc there, and he also interned at Adobe with Jiahao. It’s always interesting to see connections and collaborations come full circle in the research community.
While I regret that the Instant3D model has not been made public, I understand Adobe’s decision, given the substantial computational resources required to train such models and the copyright issues involved. Nevertheless, I am excited to see what future research Jiahao and his collaborators will bring to the field of text-to-3D generation. Stay tuned for more exciting episodes of the Talking Papers Podcast, where we continue to delve into the latest research and discoveries in academia.
AUTHORS
Jiahao Li, Hao Tan, Kai Zhang, Zexiang Xu, Fujun Luan, Yinghao Xu, Yicong Hong, Kalyan Sunkavalli, Greg Shakhnarovich, Sai Bi
ABSTRACT
Text-to-3D with diffusion models has achieved remarkable progress in recent years. However, existing methods either rely on score distillation-based optimization which suffer from slow inference, low diversity and Janus problems, or are feed-forward methods that generate low-quality results due to the scarcity of 3D training data. In this paper, we propose Instant3D, a novel method that generates high-quality and diverse 3D assets from text prompts in a feed-forward manner. We adopt a two-stage paradigm, which first generates a sparse set of four structured and consistent views from text in one shot with a fine-tuned 2D text-to-image diffusion model, and then directly regresses the NeRF from the generated images with a novel transformer-based sparse-view reconstructor. Through extensive experiments, we demonstrate that our method can generate diverse 3D assets of high visual quality within 20 seconds, which is two orders of magnitude faster than previous optimization-based methods that can take 1 to 10 hours. Our project webpage: this https URL.
RELATED WORKS
DreamFusion
Shap-E
ProlificDreamer
LINKS AND RESOURCES
Preprint
Project page
To stay up to date with his latest research, follow him on:
Personal website
Google Scholar
Twitter
LinkedIn
This episode was recorded on January 24th, 2024
CONTACT
If you would like to be a guest, sponsor or share your thoughts, feel free to reach out via email: talking.papers.podcast@gmail.com
SUBSCRIBE AND FOLLOW
Subscribe on your favourite podcast app
Subscribe to our mailing list
Follow us on Twitter
Subscribe to our YouTube channel