AI That Turns Photos into 3D Worlds: Tencent Voyager

Tencent has launched Voyager, a new AI model that can transform a single photo into a three-dimensional scene. The model generates an RGB video and depth information simultaneously, offering a robust approach to 3D reconstruction without the need for traditional modeling techniques. However, it requires substantial hardware to run effectively.

How Voyager Works

The HunyuanWorld-Voyager model takes a single image and a user-defined camera path, such as a pan, tilt, or dolly-in motion, and generates a short video. It produces the video together with a matching depth map, so the spatial relationships between objects in the scene stay consistent. The system maintains geometric coherence by comparing each new frame with previously generated content using 3D point clouds. However, distortions can still occur with long or complex camera movements, particularly 360-degree rotations.
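To make the link between a depth map and a 3D point cloud concrete, here is a minimal sketch of standard pinhole back-projection. It is not Voyager's code: the function name `backproject_depth` and the intrinsics `fx`, `fy`, `cx`, `cy` are assumptions chosen for illustration, and the depth map is a plain nested list rather than a tensor.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]


def backproject_depth(
    depth: List[List[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point3D]:
    """Lift each pixel with a valid depth into camera-space 3D coordinates.

    Assumes an ideal pinhole camera (hypothetical intrinsics, no lens
    distortion). Pixels with depth <= 0 are treated as missing and skipped.
    """
    points: List[Point3D] = []
    for v, row in enumerate(depth):        # v is the pixel row
        for u, z in enumerate(row):        # u is the pixel column
            if z <= 0:
                continue
            x = (u - cx) * z / fx          # horizontal offset scaled by depth
            y = (v - cy) * z / fy          # vertical offset scaled by depth
            points.append((x, y, z))
    return points


# Usage: a tiny 2x2 depth map with one missing pixel.
cloud = backproject_depth([[2.0, 2.0], [0.0, 4.0]], fx=1.0, fy=1.0, cx=0.0, cy=0.0)
```

Because every frame comes with its own depth map, each one can be lifted into the same 3D space this way and compared against earlier geometry.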

Tencent's technical report highlights an additional component called the “world cache,” which stores data from each new frame. This allows the data to be reused in subsequent frames, helping to preserve geometric consistency over videos that are several minutes long.
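The caching idea can be illustrated with a toy example. The sketch below is an assumption-laden simplification, not Tencent's implementation: the `WorldCache` class, its method names, and the translation-only camera (no rotation) are all hypothetical. It only shows how points stored from earlier frames could be re-projected into a new viewpoint so that later frames can be checked against them.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]


class WorldCache:
    """Toy accumulator of 3D points from previously generated frames."""

    def __init__(self) -> None:
        self.points: List[Point3D] = []

    def add_frame(self, points: List[Point3D]) -> None:
        """Store the points recovered from one new frame."""
        self.points.extend(points)

    def project(
        self,
        cam_pos: Point3D,
        fx: float,
        fy: float,
        cx: float,
        cy: float,
        width: int,
        height: int,
    ) -> List[Tuple[float, float, float]]:
        """Return (u, v, depth) for cached points visible from cam_pos.

        Simplification: the camera only translates and always looks down
        the +z axis, so no rotation matrix is applied.
        """
        hits = []
        for x, y, z in self.points:
            xc, yc, zc = x - cam_pos[0], y - cam_pos[1], z - cam_pos[2]
            if zc <= 0:
                continue                   # point is behind the camera
            u = fx * xc / zc + cx
            v = fy * yc / zc + cy
            if 0 <= u < width and 0 <= v < height:
                hits.append((u, v, zc))
        return hits


# Usage: cache two frames, then view the stored points from a camera moved forward.
cache = WorldCache()
cache.add_frame([(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)])
cache.add_frame([(0.0, 0.0, -1.0)])
visible = cache.project((0.0, 0.0, 1.0), fx=1.0, fy=1.0, cx=1.0, cy=1.0, width=3, height=3)
```

In this toy version the cache grows without bound; a real system would need some way to prune or merge redundant points, which the article does not describe.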

Training and Requirements

Voyager was trained on a dataset of over 100,000 real and synthetic video clips, including scenes from Unreal Engine environments. This extensive training helped the model learn a variety of camera movements. The training process used an automated depth estimation method, eliminating the need for manual labeling.

While technologically powerful, Voyager has high hardware requirements. Running the model at 540p resolution requires 60 GB of GPU memory, and optimal results need 80 GB. The system supports multi-GPU scaling, with an 8-GPU setup running roughly 6.7 times faster than a single GPU. The model weights have been made available to researchers on Hugging Face.
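The multi-GPU figure is easier to judge as a parallel efficiency, meaning the measured speedup divided by the number of GPUs. The short calculation below uses only the 6.7x and 8-GPU numbers quoted above; the function name `parallel_efficiency` is an illustrative assumption.

```python
def parallel_efficiency(speedup: float, num_gpus: int) -> float:
    """Fraction of ideal linear scaling achieved (1.0 means perfect scaling)."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    return speedup / num_gpus


# Figures quoted in the article: roughly 6.7x faster on 8 GPUs.
efficiency = parallel_efficiency(6.7, 8)   # about 0.84 of ideal scaling
relative_time = 1 / 6.7                    # 8-GPU run time as a fraction of 1-GPU time
```

An efficiency of about 84% means the 8-GPU setup falls only modestly short of the ideal 8x speedup.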

Voyager vs. Other AI Models

Voyager’s approach sets it apart from existing video generation models. Unlike OpenAI’s Sora, which focuses on visual realism, Voyager prioritizes geometric consistency between frames. This focus helped it achieve a top score of 77.62 on Stanford’s WorldScore benchmark, outperforming competitors like WonderWorld and CogVideoX-I2V. However, it still has some limitations in precise camera control.

Additionally, Voyager comes with licensing restrictions. Its use is prohibited in the European Union, the United Kingdom, and South Korea. Commercial applications serving over 100 million active users require an additional agreement.